A robust NestJS-based web crawler that extracts product links from e-commerce sites using Puppeteer and Cheerio. Handles dynamic pages, infinite scroll, and domain-specific patterns.

Supported sites:

- Virgio
- TataCliq
- Nykaa Fashion
- Westside

Features:

- Domain-Specific Configurations: Per-domain base URLs, product URL patterns, and optional custom Puppeteer functions.
- Dynamic Content Handling: Uses Puppeteer to render and scroll dynamic product listings.
- Fallback Mechanism: If Puppeteer fails, uses Cheerio to parse static HTML.
- Modular Architecture: Built with NestJS for maintainability and scalability.
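
The fallback mechanism listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `LinkExtractor` and `crawlWithFallback` are hypothetical names, and the two extractors are passed in as plain functions so the sketch does not depend on Puppeteer or Cheerio directly.

```typescript
// Hypothetical type: any strategy that turns a page URL into a list of links.
type LinkExtractor = (url: string) => Promise<string[]>;

// Try the primary (e.g. Puppeteer-based) extractor first; if it throws,
// fall back to the secondary (e.g. Cheerio-based) extractor.
async function crawlWithFallback(
  url: string,
  primary: LinkExtractor,
  fallback: LinkExtractor,
): Promise<{ links: string[]; usedFallback: boolean }> {
  try {
    const links = await primary(url);
    return { links, usedFallback: false };
  } catch (err) {
    // Record why the primary strategy failed before switching strategies.
    console.warn(`Primary extractor failed for ${url}:`, err);
    const links = await fallback(url);
    return { links, usedFallback: true };
  }
}
```

Passing the extractors in as arguments keeps the fallback decision separate from the scraping details, which may also make it easier to exercise with stubbed extractors in tests.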

Tech stack:

| Tech | Purpose |
|---|---|
| NestJS | Backend architecture |
| Puppeteer | Headless browser for JavaScript rendering |
| Cheerio | Fast static HTML scraping |
| TypeScript | Modern type-safe code |

Project structure:

    src/
    ├── crawler/
    │   ├── crawler.service.ts   # Core crawling logic
    │   ├── crawler.domain.ts    # Domain configurations
    │   └── utils/
    │       └── autoScroll.ts    # Infinite scroll helper
    └── main.ts                  # App entrypoint
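
To illustrate what a domain configuration in `crawler.domain.ts` might look like, here is a hedged sketch. The `DomainConfig` interface, its field names, the `extractProductLinks` helper, and the `shop.example.com` pattern are all assumptions made for illustration; the real file may be organized differently.

```typescript
// Hypothetical shape of one domain entry; field names are assumptions.
interface DomainConfig {
  baseUrl: string;
  productUrlPattern: RegExp;
}

// Illustrative values only; real product URL patterns differ per site.
const exampleConfig: DomainConfig = {
  baseUrl: 'https://shop.example.com',
  productUrlPattern: /\/product\/[\w-]+/,
};

// Resolve raw hrefs against the base URL, keep same-host links whose path
// matches the product pattern, and drop duplicates.
function extractProductLinks(hrefs: string[], config: DomainConfig): string[] {
  const baseHost = new URL(config.baseUrl).host;
  const seen = new Set<string>();
  for (const href of hrefs) {
    let absolute: URL;
    try {
      absolute = new URL(href, config.baseUrl);
    } catch {
      continue; // Skip hrefs that cannot be parsed as URLs.
    }
    if (absolute.host !== baseHost) continue;
    if (!config.productUrlPattern.test(absolute.pathname)) continue;
    absolute.hash = ''; // Treat "#reviews"-style fragments as the same page.
    seen.add(absolute.toString());
  }
  return Array.from(seen);
}
```

Under these assumptions, calling `extractProductLinks` with the hrefs collected from a listing page would return absolute, de-duplicated product URLs for that domain, regardless of whether the hrefs came from Puppeteer or from Cheerio.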
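
Installation:

    git clone https://github.com/SumitGupta016/Product-Crawler.git
    cd Product-Crawler
    npm install

Running:

    npm run start:dev    # development
    npm run start:prod   # production

Usage:

    const links = await crawlerService.crawl('https://www.virgio.com/men/all');
    console.log(links); // Array of product URLs

Auto-scroll helper (`src/crawler/utils/autoScroll.ts`):

    import * as puppeteer from 'puppeteer';

    // Scrolls the page in fixed steps until the bottom is reached, so that
    // lazily loaded products are rendered before links are extracted.
    export async function autoScroll(page: puppeteer.Page): Promise<void> {
      await page.evaluate(async () => {
        await new Promise<void>((resolve) => {
          let totalHeight = 0;
          const distance = 500; // pixels scrolled per step
          const timer = setInterval(() => {
            window.scrollBy(0, distance);
            totalHeight += distance;
            // scrollHeight is re-read on every tick, so newly loaded
            // content extends the scroll before the loop stops.
            if (totalHeight >= document.body.scrollHeight) {
              clearInterval(timer);
              resolve();
            }
          }, 300); // milliseconds between steps
        });
      });
    }

MIT License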