This project involves scraping the online shop of Mayesh, a wholesale flower company. The shop lists thousands of flowers, and our goal is to collect comprehensive product data, including images, prices, and descriptions.
1. Project Overview
The objective of this project is to build an automated web scraping tool that can extract all relevant product information from Mayesh's online shop. The data collected will be stored in a structured format for further analysis and utilization.
2. Tools & Technologies
- Programming Language: Python
- Web Scraping Libraries:
- Beautiful Soup - For parsing HTML content
- Scrapy - For large-scale scraping with built-in scheduling and data storage
- Selenium - For dynamic content handling and web automation
- Data Storage: MongoDB (via PyMongo), with pandas for tabular export (see the sketch after this list)
- Development Environment: Visual Studio Code
- Version Control: Git, with the repository hosted on GitHub
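As a brief illustration of the storage layer, a scraped record could be written to MongoDB roughly as follows. This is a minimal sketch assuming a local MongoDB instance; the database and collection names ("mayesh", "products") and the record itself are placeholders, not project data.

```python
from pymongo import MongoClient

# Assumes a MongoDB server running locally on the default port.
client = MongoClient("mongodb://localhost:27017")
products = client["mayesh"]["products"]

# Illustrative record shaped like the fields collected in Section 4.
products.insert_one({
    "name": "Example Rose",
    "price": "$1.99",
    "image_url": "https://example.com/rose.jpg",
    "description": "Placeholder description",
})
print(products.count_documents({}))  # confirm the write
```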
3. Installation and Setup
Before starting data collection, ensure you have Python installed and set up your environment with the necessary libraries. Here are the quick steps to get started:
3.1 Install Python
Make sure Python is installed on your system. You can download it from python.org. After installation, you can check the version by running:

```bash
python --version
```

(On some systems the interpreter is invoked as `python3`.)
3.2 Setting Up a Virtual Environment
It's a good practice to use a virtual environment for your Python projects. Here's how you can set it up:
```bash
python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`
```
3.3 Install Required Libraries
Install the required Python libraries with pip:

```bash
pip install scrapy beautifulsoup4 selenium pandas pymongo
```
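Equivalently, the dependencies can be listed in a requirements.txt file (these are the PyPI package names; no version pins are implied here):

```text
scrapy
beautifulsoup4
selenium
pandas
pymongo
```

and installed in one step with `pip install -r requirements.txt`.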
4. Data Collection
For data collection, we used a combination of Scrapy and Selenium: Scrapy for its speed and efficiency on static content, and Selenium for pages whose content is loaded dynamically. The collected data includes the following fields (minimal sketches of both approaches follow this list):
- Product Name
- Price
- Image URL
- Description
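To make the division of labor concrete, here is a minimal Scrapy spider sketch for the static-content path. The start URL and the CSS selectors (`.product-card`, `.product-name`, and so on) are hypothetical placeholders, not Mayesh's actual markup; inspect the live pages to find the real ones.

```python
# mayesh_spider.py -- a minimal sketch; run with:
#   scrapy runspider mayesh_spider.py -o products.json
import scrapy

class MayeshSpider(scrapy.Spider):
    name = "mayesh"
    start_urls = ["https://www.mayesh.com/shop"]  # assumed listing URL

    def parse(self, response):
        # Placeholder selectors -- adapt to the site's actual markup.
        for card in response.css(".product-card"):
            yield {
                "name": card.css(".product-name::text").get(),
                "price": card.css(".product-price::text").get(),
                "image_url": card.css("img::attr(src)").get(),
                "description": card.css(".product-description::text").get(),
            }
        # Follow pagination if the listing has a "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

For listings that are rendered by JavaScript, the Selenium path can be sketched as below, again with a placeholder URL and selectors. Selenium loads and renders the page; BeautifulSoup then parses the rendered HTML.

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # no visible browser window
driver = webdriver.Chrome(options=options)

products = []
try:
    driver.get("https://www.mayesh.com/shop")  # assumed listing URL
    # In practice, add an explicit wait here so dynamic content finishes loading.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for card in soup.select(".product-card"):  # placeholder selector
        products.append({
            "name": card.select_one(".product-name").get_text(strip=True),
            "price": card.select_one(".product-price").get_text(strip=True),
            "image_url": card.select_one("img")["src"],
            "description": card.select_one(".product-description").get_text(strip=True),
        })
finally:
    driver.quit()

print(f"Collected {len(products)} products")
```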
We will provide more details on the data collection process on the next page.