12.3 Web Scraping Project

Web Scraping Project: Automating Data Extraction from Websites

Web scraping is a powerful technique for extracting data from websites, and Python makes it easy to automate this process using libraries like requests, BeautifulSoup, and Selenium. In this project, we’ll create a Python script that will automate the task of scraping data from a website, parsing the content, and saving it into a structured format like a CSV file.

Project Overview

In this project, we will scrape product data from an e-commerce website and save it into a CSV file. The data we’ll scrape includes:

  • Product Name
  • Product Price
  • Product Rating
  • Product URL

We'll be using the following tools:

  • requests: To send HTTP requests to retrieve the web pages.
  • BeautifulSoup: To parse the HTML content and extract the necessary information.
  • csv: To save the extracted data into a CSV file.

Steps for Web Scraping Project

1. Setting Up the Environment

First, ensure you have the necessary libraries installed. You can install the required libraries using pip:

pip install requests beautifulsoup4

2. Inspecting the Website

Before scraping data, inspect the website to understand its structure. You’ll need to find the elements that contain the data you want to extract (e.g., product names, prices, and ratings). You can do this by right-clicking on the page and selecting Inspect in your browser.

For this example, let's assume the website we're scraping is an e-commerce platform where product details are contained within <div> tags with specific classes for each piece of information.
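
For instance, the selectors used in the script below assume markup roughly like the following. The class names (product-container, product-title, price, rating) are placeholders for this example; substitute whatever you actually see in the Inspect panel. A quick way to confirm your selectors is to parse a small sample of the page’s HTML:

from bs4 import BeautifulSoup

# Hypothetical markup of the kind this project assumes.
# The class names are placeholders; replace them with the ones
# you see in your browser's Inspect panel.
sample_html = '''
<div class="product-container">
    <h2 class="product-title">Example Widget</h2>
    <span class="price">$19.99</span>
    <div class="rating">4.5 out of 5</div>
    <a href="/products/example-widget">View product</a>
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
product = soup.find('div', class_='product-container')
print(product.find('h2', class_='product-title').text.strip())  # Example Widget
print(product.find('span', class_='price').text.strip())        # $19.99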

3. Scraping the Website

Now, we’ll write the Python script that will:

  • Make an HTTP request to the website.
  • Parse the HTML content.
  • Extract product data.
  • Save the data into a CSV file.

import requests
from bs4 import BeautifulSoup
import csv

# URL of the website to scrape
url = 'https://www.example.com/products'

# Send a GET request to the website
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Open a CSV file to write the data
    with open('products.csv', mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Product Name', 'Price', 'Rating', 'Product URL'])

        # Find all product containers (adjust the class or tag based on the website structure)
        products = soup.find_all('div', class_='product-container')

        for product in products:
            # Extract product name, price, rating, and URL
            name = product.find('h2', class_='product-title').text.strip()
            price = product.find('span', class_='price').text.strip()
            rating = product.find('div', class_='rating').text.strip() if product.find('div', class_='rating') else 'N/A'
            product_url = product.find('a', href=True)['href']

            # Write data to CSV
            writer.writerow([name, price, rating, product_url])

    print("Scraping completed and data saved to 'products.csv'.")
else:
    print("Failed to retrieve the website.")

4. Running the Script

Once the script is ready, run it using Python:

python scrape_products.py

If the script runs successfully, it will create a CSV file called products.csv with the following columns:

  • Product Name
  • Price
  • Rating
  • Product URL
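
If you want a quick sanity check of the output, you can read the file back with the csv module and print the header plus the first few rows (this sketch assumes the products.csv produced by the script above):

import csv

# Print the header row and the first five product rows as a sanity check
with open('products.csv', newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    for i, row in enumerate(reader):
        print(row)
        if i >= 5:
            break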

5. Handling Pagination

Most e-commerce websites paginate their product listings. To scrape all products across multiple pages, you’ll need to handle pagination in your script.

Here’s how you can modify the script to navigate through multiple pages:

import requests
from bs4 import BeautifulSoup
import csv

# Base URL of the website (without page number)
base_url = 'https://www.example.com/products?page='

# Open a CSV file to write the data
with open('products.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Product Name', 'Price', 'Rating', 'Product URL'])

    page_number = 1
    while True:
        # Send a GET request to the page
        url = base_url + str(page_number)
        response = requests.get(url)

        # If the page exists, continue scraping
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            products = soup.find_all('div', class_='product-container')

            if not products:  # Break the loop if no products are found (end of pages)
                break

            for product in products:
                name = product.find('h2', class_='product-title').text.strip()
                price = product.find('span', class_='price').text.strip()
                rating = product.find('div', class_='rating').text.strip() if product.find('div', class_='rating') else 'N/A'
                product_url = product.find('a', href=True)['href']
                
                # Write data to CSV
                writer.writerow([name, price, rating, product_url])

            print(f"Scraping page {page_number} completed.")
            page_number += 1  # Move to the next page
        else:
            print("Failed to retrieve the page.")
            break

print("Scraping completed and data saved to 'products.csv'.")

This version of the script will scrape data from multiple pages until there are no more products to scrape.

6. Handling Errors and Challenges

  • Timeouts: Websites might take a long time to respond, and requests might time out. You can add error handling for timeouts to ensure your script doesn’t fail unexpectedly.

response = requests.get(url, timeout=10)  # 10-second timeout

  • User-Agent: Some websites block non-browser requests. You can mimic a browser request by including a User-Agent header.

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)

  • IP Blocking: Websites may block your IP if you send too many requests in a short amount of time. To avoid this, you can:
    • Introduce delays between requests using time.sleep(), as shown in the sketch after this list.
    • Use rotating proxies or a service like ScraperAPI.
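
Here is a minimal sketch that combines the first three safeguards: a request timeout, a browser-like User-Agent header, and a short delay between requests. The URL pattern is the same placeholder used above, and the one-second delay is an arbitrary example; adjust it to the site’s rate limits.

import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # mimic a browser request

def fetch_page(url):
    try:
        # Fail fast if the server does not respond within 10 seconds
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise an exception for 4xx/5xx responses
        return response
    except requests.exceptions.RequestException as error:
        print(f"Request failed for {url}: {error}")
        return None

# Example: fetch the first three pages politely
for page_number in range(1, 4):
    response = fetch_page(f'https://www.example.com/products?page={page_number}')
    if response is not None:
        print(f"Fetched page {page_number}")
    time.sleep(1)  # pause between requests to avoid overloading the server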

7. Conclusion

By following this project, you’ve learned how to scrape data from an e-commerce website, parse the HTML content, handle pagination, and save the results in a CSV file. This skill can be applied to a variety of web scraping tasks across different domains, such as extracting job listings, real estate data, product prices, and more. Always ensure that you respect a website’s robots.txt and terms of service when scraping data.
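
On the robots.txt point, Python’s standard library includes urllib.robotparser, which can tell you whether a given path is allowed before you scrape it. A minimal sketch, using the placeholder domain from the examples above:

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url('https://www.example.com/robots.txt')
robots.read()  # download and parse the robots.txt file

# Check whether a generic crawler is allowed to fetch the products page
if robots.can_fetch('*', 'https://www.example.com/products'):
    print("Allowed to scrape this path.")
else:
    print("Disallowed by robots.txt; do not scrape this path.")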
