12.3 Web Scraping Project
Web Scraping Project: Automating Data Extraction from Websites
Web scraping is a powerful technique for extracting data from websites, and Python makes it easy to automate the process with libraries such as requests, BeautifulSoup, and Selenium. In this project, we'll write a Python script that scrapes data from a website, parses the content, and saves it in a structured format such as a CSV file.
Project Overview
In this project, we will scrape product data from an e-commerce website and save it into a CSV file. The data we’ll scrape includes:
- Product Name
- Product Price
- Product Rating
- Product URL
We'll be using the following tools:
- requests: To send HTTP requests to retrieve the web pages.
- BeautifulSoup: To parse the HTML content and extract the necessary information.
- csv: To save the extracted data into a CSV file.
Steps for Web Scraping Project
1. Setting Up the Environment
First, ensure you have the necessary libraries installed. You can install the required libraries using pip:
pip install requests beautifulsoup4
2. Inspecting the Website
Before scraping data, inspect the website to understand its structure. You'll need to find the elements that contain the data you want to extract (e.g., product names and prices). You can do this by right-clicking on the page and selecting Inspect in your browser.
For this example, let's assume the website we're scraping is an e-commerce platform where product details are contained within <div> tags with specific classes for each piece of information.
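To make that assumption concrete, here is a minimal, self-contained sketch: it parses a hard-coded HTML fragment using the hypothetical class names (product-container, product-title, price, rating) that the scraping script below relies on. A real site will almost certainly use different markup, so adjust the selectors to match what you see in the inspector.

from bs4 import BeautifulSoup

# A hypothetical fragment mimicking the structure assumed in this project
html = '''
<div class="product-container">
    <h2 class="product-title">Example Widget</h2>
    <span class="price">$19.99</span>
    <div class="rating">4.5 stars</div>
    <a href="/products/example-widget">View product</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
product = soup.find('div', class_='product-container')
print(product.find('h2', class_='product-title').text.strip())  # Example Widget
print(product.find('span', class_='price').text.strip())        # $19.99
print(product.find('a', href=True)['href'])                      # /products/example-widget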
3. Scraping the Website
Now, we’ll write the Python script that will:
- Make an HTTP request to the website.
- Parse the HTML content.
- Extract product data.
- Save the data into a CSV file.
import requests
from bs4 import BeautifulSoup
import csv

# URL of the website to scrape
url = 'https://www.example.com/products'

# Send a GET request to the website
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Open a CSV file to write the data
    with open('products.csv', mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Product Name', 'Price', 'Rating', 'Product URL'])

        # Find all product containers (adjust the class or tag based on the website structure)
        products = soup.find_all('div', class_='product-container')

        for product in products:
            # Extract product name, price, rating, and URL
            name = product.find('h2', class_='product-title').text.strip()
            price = product.find('span', class_='price').text.strip()
            rating_tag = product.find('div', class_='rating')
            rating = rating_tag.text.strip() if rating_tag else 'N/A'
            product_url = product.find('a', href=True)['href']

            # Write data to CSV
            writer.writerow([name, price, rating, product_url])

    print("Scraping completed and data saved to 'products.csv'.")
else:
    print("Failed to retrieve the website.")
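One detail the script glosses over: the href attribute pulled from a product card is often a relative path, which won't work outside the site. A minimal sketch with hypothetical values shows how urllib.parse.urljoin (from the standard library) turns it into an absolute URL before you write it to the CSV:

from urllib.parse import urljoin

page_url = 'https://www.example.com/products'  # the page we requested
href = '/products/example-widget'              # hypothetical relative link from a product card

print(urljoin(page_url, href))  # https://www.example.com/products/example-widget

urljoin leaves URLs that are already absolute untouched, so it is safe to apply unconditionally.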
4. Running the Script
Once the script is ready, run it using Python:
python scrape_products.py
If the script runs successfully, it will create a CSV file called products.csv with the following columns:
- Product Name
- Price
- Rating
- Product URL
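As a quick sanity check (assuming the script above has already written products.csv), you can read the file back with csv.DictReader and print the first few rows:

import csv

with open('products.csv', newline='', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for i, row in enumerate(reader):
        print(row['Product Name'], '-', row['Price'])
        if i >= 4:  # show only the first five rows
            break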
5. Handling Pagination
Most e-commerce websites paginate their product listings. To scrape all products across multiple pages, you’ll need to handle pagination in your script.
Here’s how you can modify the script to navigate through multiple pages:
import requests
from bs4 import BeautifulSoup
import csv

# Base URL of the website (without page number)
base_url = 'https://www.example.com/products?page='

# Open a CSV file to write the data
with open('products.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Product Name', 'Price', 'Rating', 'Product URL'])

    page_number = 1
    while True:
        # Send a GET request to the page
        url = base_url + str(page_number)
        response = requests.get(url)

        # If the page exists, continue scraping
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            products = soup.find_all('div', class_='product-container')

            if not products:
                # Break the loop if no products are found (end of pages)
                break

            for product in products:
                name = product.find('h2', class_='product-title').text.strip()
                price = product.find('span', class_='price').text.strip()
                rating_tag = product.find('div', class_='rating')
                rating = rating_tag.text.strip() if rating_tag else 'N/A'
                product_url = product.find('a', href=True)['href']

                # Write data to CSV
                writer.writerow([name, price, rating, product_url])

            print(f"Scraping page {page_number} completed.")
            page_number += 1  # Move to the next page
        else:
            print("Failed to retrieve the page.")
            break

print("Scraping completed and data saved to 'products.csv'.")
This version of the script will scrape data from multiple pages until there are no more products to scrape.
6. Handling Errors and Challenges
- Timeouts: Websites can be slow to respond, and without a timeout a request may hang indefinitely. Pass a timeout and add error handling so your script doesn't fail unexpectedly.
response = requests.get(url, timeout=10) # 10-second timeout
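For the error handling itself, a minimal sketch (assuming url is defined as in the scripts above): requests raises requests.exceptions.Timeout when the limit is exceeded, so you can catch it and move on rather than letting the script crash:

import requests

url = 'https://www.example.com/products'

try:
    response = requests.get(url, timeout=10)
except requests.exceptions.Timeout:
    # The server didn't respond within 10 seconds; skip this page
    print(f"Request to {url} timed out; skipping.")
    response = None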
- User-Agent: Some websites block non-browser requests. You can mimic a browser request by including a User-Agent in the headers.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
- IP Blocking: Websites may block your IP if you send too many requests in a short amount of time. To avoid this, you can:
- Introduce delays between requests using time.sleep() (see the sketch after this list).
- Use rotating proxies or a service like ScraperAPI.
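Here is a minimal sketch of the delay idea from the first point above, pausing a short, randomized interval between page requests (the 1-3 second range and the five-page cap are arbitrary choices for illustration, not rules):

import random
import time

import requests

base_url = 'https://www.example.com/products?page='  # as in the pagination script
headers = {'User-Agent': 'Mozilla/5.0'}

for page_number in range(1, 6):  # hypothetical cap: first five pages only
    response = requests.get(base_url + str(page_number), headers=headers, timeout=10)
    # ... parse response.text and write rows to the CSV here ...
    time.sleep(random.uniform(1, 3))  # wait 1-3 seconds before the next request

Randomizing the delay makes the traffic look less mechanical than a fixed interval, which some sites treat as a blocking signal.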
7. Conclusion
By completing this project, you've learned how to scrape data from an e-commerce website, parse the HTML content, handle pagination, and save the results in a CSV file. This skill applies to a wide range of web scraping tasks across different domains, such as extracting job listings, real estate data, or product prices. Always respect a website's robots.txt and terms of service when scraping data.
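As a closing sketch of that last point: the standard library's urllib.robotparser can check whether a site's robots.txt permits fetching a given URL (the domain below is the placeholder used throughout this project):

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url('https://www.example.com/robots.txt')
robots.read()  # download and parse the site's robots.txt

# can_fetch returns True if the rules allow this user agent to fetch the URL
if robots.can_fetch('Mozilla/5.0', 'https://www.example.com/products'):
    print('robots.txt allows scraping the products page.')
else:
    print('robots.txt disallows scraping this page.')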