This project involves scraping the online shop of Mayesh, a wholesale flower company. The shop lists thousands of flowers, and our goal is to collect comprehensive product data, including images, prices, and descriptions.
1. Project Overview
The objective of this project is to build an automated web scraping tool that can extract all relevant product information from Mayesh's online shop. The data collected will be stored in a structured format for further analysis and utilization.
2. Tools & Technologies
- Programming Language: Python
- Web Scraping Libraries:
- Beautiful Soup - For parsing HTML content
- Scrapy - For large-scale scraping with built-in scheduling and data storage
- Selenium - For dynamic content handling and web automation
- Data Storage: MongoDB (via PyMongo), with pandas for tabular export (see the sketch after this list)
- Development Environment: Visual Studio Code
- Version Control: Git, with the repository hosted on GitHub
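As a brief illustration of the storage layer, a scraped record could be written to MongoDB roughly as follows. This is a minimal sketch assuming a local MongoDB instance; the database and collection names ("mayesh", "products") and the record itself are placeholders, not project data.

```python
from pymongo import MongoClient

# Assumes a MongoDB server running locally on the default port.
client = MongoClient("mongodb://localhost:27017")
products = client["mayesh"]["products"]

# Illustrative record shaped like the fields collected in Section 4.
products.insert_one({
    "name": "Example Rose",
    "price": "$1.99",
    "image_url": "https://example.com/rose.jpg",
    "description": "Placeholder description",
})
print(products.count_documents({}))  # confirm the write
```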
3. Installation and Setup
Before starting data collection, ensure you have Python installed and set up your environment with the necessary libraries. Here are the quick steps to get started:
3.1 Install Python
Make sure Python is installed on your system. You can download it from python.org. After installation, you can check the version by running:

```bash
python --version
```

(On some systems the interpreter is invoked as `python3`.)
3.2 Setting Up a Virtual Environment
It's a good practice to use a virtual environment for your Python projects. Here's how you can set it up:
```bash
python -m venv myenv
source myenv/bin/activate  # On Windows use `myenv\Scripts\activate`
```
3.3 Install Required Libraries
Install the required Python libraries with pip:

```bash
pip install scrapy beautifulsoup4 selenium pandas pymongo
```
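Equivalently, the dependencies can be listed in a requirements.txt file (these are the PyPI package names; no version pins are implied here):

```text
scrapy
beautifulsoup4
selenium
pandas
pymongo
```

and installed in one step with `pip install -r requirements.txt`.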
4. Data Collection
For data collection, we used a combination of Scrapy and Selenium: Scrapy for its speed and efficiency on static content, and Selenium for pages whose content is loaded dynamically. The collected data includes the following fields (minimal sketches of both approaches follow this list):
- Product Name
- Price
- Image URL
- Description
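To make the division of labor concrete, here is a minimal Scrapy spider sketch for the static-content path. The start URL and the CSS selectors (`.product-card`, `.product-name`, and so on) are hypothetical placeholders, not Mayesh's actual markup; inspect the live pages to find the real ones.

```python
# mayesh_spider.py -- a minimal sketch; run with:
#   scrapy runspider mayesh_spider.py -o products.json
import scrapy

class MayeshSpider(scrapy.Spider):
    name = "mayesh"
    start_urls = ["https://www.mayesh.com/shop"]  # assumed listing URL

    def parse(self, response):
        # Placeholder selectors -- adapt to the site's actual markup.
        for card in response.css(".product-card"):
            yield {
                "name": card.css(".product-name::text").get(),
                "price": card.css(".product-price::text").get(),
                "image_url": card.css("img::attr(src)").get(),
                "description": card.css(".product-description::text").get(),
            }
        # Follow pagination if the listing has a "next" link.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

For listings that are rendered by JavaScript, the Selenium path can be sketched as below, again with a placeholder URL and selectors. Selenium loads and renders the page; BeautifulSoup then parses the rendered HTML.

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # no visible browser window
driver = webdriver.Chrome(options=options)

products = []
try:
    driver.get("https://www.mayesh.com/shop")  # assumed listing URL
    # In practice, add an explicit wait here so dynamic content finishes loading.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for card in soup.select(".product-card"):  # placeholder selector
        products.append({
            "name": card.select_one(".product-name").get_text(strip=True),
            "price": card.select_one(".product-price").get_text(strip=True),
            "image_url": card.select_one("img")["src"],
            "description": card.select_one(".product-description").get_text(strip=True),
        })
finally:
    driver.quit()

print(f"Collected {len(products)} products")
```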
We will provide more details on the data collection process on the next page.