πŸ•· Advanced Web Scraper


A robust, professional web scraping tool built with Python to extract product data from paginated websites. It uses requests and BeautifulSoup, with logging, retry logic, and CSV export.


✨ Features

  • 🌐 Fetch HTML pages with custom browser headers.
  • πŸ›’ Extract product details:
    • Product name
    • Product price
    • Product link
  • πŸ”„ Retry logic for failed requests.
  • πŸ“„ Scrape multiple pages automatically.
  • πŸ’Ύ Save results to CSV file.
  • πŸ“Š Logging for progress and error tracking.
  • πŸ›  Easily customizable CSS selectors for any website structure.
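As a rough illustration of the retry feature, the logic might look like the sketch below. The names fetch_with_retry, DEFAULT_HEADERS, and the parameters are hypothetical, not identifiers from the actual script; the fetch callable is injected so the policy can be exercised without a network.

```python
import time

# Hypothetical browser-like headers (values illustrative only).
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying up to `retries` times on any exception.

    In the real script, `fetch` would wrap something like
    requests.get(url, headers=DEFAULT_HEADERS, timeout=10).
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(delay)  # back off before the next attempt
    raise last_error
```

Injecting the fetch callable also makes it easy to swap in a different HTTP client later without touching the retry policy.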

πŸ›  Requirements

  • Python 3.x
  • Python libraries:
    • requests
    • beautifulsoup4
    • pandas (optional for CSV formatting)

Install dependencies:

pip install requests beautifulsoup4 pandas

πŸš€ Usage

  1. Open Web Scraper Code.py.

  2. Modify the BASE_URL to target the website you want to scrape.

  3. Adjust pagination in scrape_all_pages(start, end).

  4. Run the script:

python "Web Scraper Code.py"

The script will log progress and save all scraped products to:

products.csv
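For reference, writing the collected rows needs only the standard library. The function name save_to_csv and the column names below are assumptions for illustration, not necessarily what the script uses:

```python
import csv

def save_to_csv(products, path="products.csv"):
    """Write a list of product dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price", "link"])
        writer.writeheader()
        writer.writerows(products)

save_to_csv([
    {"name": "Product 1", "price": "$99.99", "link": "/products/product1"},
])
```

Using csv.DictWriter keeps the column order fixed even if the scraped dicts are built in a different order.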


πŸ“Š Example Output

Name       Price    Link
Product 1  $99.99   /products/product1
Product 2  $49.99   /products/product2
Product 3  $149.99  /products/product3

πŸ’‘ Tips & Best Practices

βœ… Always check the website's robots.txt before scraping.

βœ… Use time.sleep() between requests to avoid overwhelming servers.

βœ… Use headers to mimic a real browser.

⚑ For dynamic content (JS-loaded pages), consider Selenium.

πŸ”§ Customize CSS selectors in parse_products() for each website.

πŸ—‚ For large datasets, you can save output to JSON or a database.
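To illustrate the selector-customization tip, here is a minimal parsing sketch. The sample HTML and the .product, .name, and .price class names are invented for this example; a real site will need its own selectors, and parse_products here is a stand-in, not the script's actual function:

```python
from bs4 import BeautifulSoup

# Invented sample markup; real sites need their own selectors.
SAMPLE_HTML = """
<div class="product">
  <h2 class="name">Widget</h2>
  <span class="price">$9.99</span>
  <a href="/products/widget">View</a>
</div>
"""

def parse_products(html):
    """Pull name, price, and link out of each product card."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select("div.product"):  # adjust per target site
        products.append({
            "name": card.select_one(".name").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
            "link": card.select_one("a")["href"],
        })
    return products

print(parse_products(SAMPLE_HTML))
```

Keeping all selectors in one function means adapting the scraper to a new site only requires editing these few lines.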


πŸ“œ Logging

  • The script logs:

  • URL fetch attempts

  • Status codes and errors

  • Number of products found per page

  • CSV save confirmation


🌟 Bonus

You can extend this project to:

  • Scrape multiple websites simultaneously
  • Schedule scraping tasks with cron or a task scheduler
  • Visualize product trends with Matplotlib or Seaborn
  • Integrate with APIs or dashboards for real-time updates