diff --git a/docs/01_introduction/quick-start.mdx b/docs/01_introduction/quick-start.mdx index da166da9..c74bd848 100644 --- a/docs/01_introduction/quick-start.mdx +++ b/docs/01_introduction/quick-start.mdx @@ -67,7 +67,7 @@ The Actor's source code is in the `src` folder. This folder contains two importa {MainExample} - + {UnderscoreMainExample} @@ -97,12 +97,20 @@ To learn more about the features of the Apify SDK and how to use them, check out ### Guides -To see how you can integrate the Apify SDK with popular web scraping libraries, check out our guides: +To see how you can integrate the Apify SDK with popular scraping libraries and frameworks, check out these guides: -- [BeautifulSoup with HTTPX](../guides/beautifulsoup-httpx) -- [Parsel with Impit](../guides/parsel-impit) -- [Playwright](../guides/playwright) -- [Selenium](../guides/selenium) -- [Crawlee](../guides/crawlee) -- [Scrapy](../guides/scrapy) -- [Running webserver](../guides/running-webserver) +- [Scraping with BeautifulSoup and HTTPX](../guides/beautifulsoup-httpx) +- [Scraping with Parsel and Impit](../guides/parsel-impit) +- [Browser automation with Playwright](../guides/playwright) +- [Browser automation with Selenium](../guides/selenium) +- [Building crawlers with Crawlee](../guides/crawlee) +- [Building crawlers with Scrapy](../guides/scrapy) +- [Adaptive scraping with Scrapling](../guides/scrapling) +- [LLM-ready scraping with Crawl4AI](../guides/crawl4ai) +- [Browser AI agents with Browser Use](../guides/browser-use) + +For other aspects of Actor development, explore these guides: + +- [Project management with uv](../guides/uv) +- [Input validation with Pydantic](../guides/input-validation) +- [Running a web server](../guides/running-webserver) diff --git a/docs/02_concepts/02_actor_input.mdx b/docs/02_concepts/02_actor_input.mdx index 15807c05..f975e6ae 100644 --- a/docs/02_concepts/02_actor_input.mdx +++ b/docs/02_concepts/02_actor_input.mdx @@ -20,6 +20,10 @@ For example, if an Actor 
received a JSON input with two fields, `{ "firstNumber" {InputExample} +## Validating input + +Reading values straight out of the raw input dictionary works for simple cases, but it gives you no type guarantees, no constraint checks, and no clear error when the input is malformed. For anything beyond a couple of fields, validate the input with [Pydantic](https://docs.pydantic.dev/) so your code works with a typed, guaranteed-valid object instead. See the [Validate Actor input with Pydantic](../guides/input-validation) guide for the recommended approach. + ## Loading URLs from Actor input Actors commonly receive a list of URLs to process via their input. The `ApifyRequestList` class (from `apify.request_loaders`) can parse the standard Apify input format for URL sources. It supports both direct URL objects (`{"url": "https://example.com"}`) and remote URL lists (`{"requestsFromUrl": "https://example.com/urls.txt"}`), where the remote file contains one URL per line. diff --git a/docs/03_guides/01_beautifulsoup_httpx.mdx b/docs/03_guides/01_beautifulsoup_httpx.mdx index ba15df03..2ae47ded 100644 --- a/docs/03_guides/01_beautifulsoup_httpx.mdx +++ b/docs/03_guides/01_beautifulsoup_httpx.mdx @@ -1,6 +1,6 @@ --- id: beautifulsoup-httpx -title: Use BeautifulSoup with HTTPX +title: Scraping with BeautifulSoup and HTTPX description: Build an Apify Actor that scrapes web pages using BeautifulSoup and HTTPX. --- @@ -8,7 +8,7 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; import BeautifulSoupHttpxExample from '!!raw-loader!roa-loader!./code/01_beautifulsoup_httpx.py'; -In this guide, you'll learn how to use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library with the [HTTPX](https://www.python-httpx.org/) library in your Apify Actors. 
+In this guide, you'll learn how to scrape web pages with the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) and [HTTPX](https://www.python-httpx.org/) libraries in your Apify Actors. ## Introduction @@ -20,12 +20,16 @@ To create an Actor which uses those libraries, start from the [BeautifulSoup & P ## Example Actor -Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract titles and links to other pages. +Below is a simple Actor that recursively scrapes data from linked pages on the same site, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses [HTTPX](https://www.python-httpx.org/) for fetching pages through [Apify Proxy](https://docs.apify.com/platform/proxy) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for parsing their content to extract the title, headings, and links to other pages. {BeautifulSoupHttpxExample} +## Using Apify Proxy + +Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and fetches a fresh proxy URL for every request, so each page goes through a different IP. A new HTTPX client is created per request to apply that URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide. 
+ ## Conclusion In this guide, you learned how to use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) with the [HTTPX](https://www.python-httpx.org/) in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! diff --git a/docs/03_guides/02_parsel_impit.mdx b/docs/03_guides/02_parsel_impit.mdx index da5a2866..d91ebea2 100644 --- a/docs/03_guides/02_parsel_impit.mdx +++ b/docs/03_guides/02_parsel_impit.mdx @@ -1,6 +1,6 @@ --- id: parsel-impit -title: Use Parsel with Impit +title: Scraping with Parsel and Impit description: Build an Apify Actor that scrapes web pages using Parsel selectors and the Impit HTTP client. --- @@ -8,7 +8,7 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; import ParselImpitExample from '!!raw-loader!roa-loader!./code/02_parsel_impit.py'; -In this guide, you'll learn how to combine the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries when building Apify Actors. +In this guide, you'll learn how to scrape web pages with the [Parsel](https://github.com/scrapy/parsel) and [Impit](https://github.com/apify/impit) libraries in your Apify Actors. ## Introduction @@ -18,12 +18,16 @@ In this guide, you'll learn how to combine the [Parsel](https://github.com/scrap ## Example Actor -The following example shows a simple Actor that recursively scrapes titles from linked pages, up to a user-defined maximum depth. 
It uses [Impit](https://github.com/apify/impit) to fetch pages and [Parsel](https://github.com/scrapy/parsel) to extract titles and discover new links. +The following example shows a simple Actor that recursively scrapes data from linked pages on the same site, up to a user-defined maximum depth. It uses [Impit](https://github.com/apify/impit) to fetch pages through [Apify Proxy](https://docs.apify.com/platform/proxy) and [Parsel](https://github.com/scrapy/parsel) to extract the title, headings, and links. {ParselImpitExample} +## Using Apify Proxy + +Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and fetches a fresh proxy URL for every request, so each page goes through a different IP. A new Impit client is created per request to apply that URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide. + ## Conclusion In this guide, you learned how to use [Parsel](https://github.com/scrapy/parsel) with [Impit](https://github.com/apify/impit) in your Apify Actors. By combining these libraries, you get a powerful and efficient solution for web scraping: [Parsel](https://github.com/scrapy/parsel) provides excellent CSS selector and XPath support for data extraction, while [Impit](https://github.com/apify/impit) offers a fast and simple HTTP client built by Apify. This combination makes it easy to build scalable web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. 
If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! diff --git a/docs/03_guides/03_playwright.mdx b/docs/03_guides/03_playwright.mdx index 0e20b9e4..11b57b7e 100644 --- a/docs/03_guides/03_playwright.mdx +++ b/docs/03_guides/03_playwright.mdx @@ -1,6 +1,6 @@ --- id: playwright -title: Use Playwright +title: Browser automation with Playwright description: Build an Apify Actor that scrapes dynamic web pages using Playwright browser automation. --- @@ -11,7 +11,7 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; import PlaywrightExample from '!!raw-loader!roa-loader!./code/03_playwright.py'; -In this guide, you'll learn how to use [Playwright](https://playwright.dev) for web scraping in your Apify Actors. +In this guide, you'll learn how to use [Playwright](https://playwright.dev) for browser automation and web scraping in your Apify Actors. ## Introduction @@ -48,14 +48,18 @@ playwright install --with-deps` ## Example Actor -This is a simple Actor that recursively scrapes titles from all linked websites, up to a maximum depth, starting from URLs in the Actor input. +This is a simple Actor that recursively scrapes data from linked pages on the same site, up to a maximum depth, starting from URLs in the Actor input. -It uses Playwright to open the pages in an automated Chrome browser, and to extract the title and anchor elements after the pages load. +It uses Playwright to open the pages in an automated Chrome browser, and to extract the title, headings, and links after the pages load. {PlaywrightExample} +## Using Apify Proxy + +Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. 
The example creates a proxy configuration with `Actor.create_proxy_configuration` and launches the browser through it. Playwright applies the proxy at the browser level, so the whole run shares a single proxy URL rather than rotating per request; the `to_playwright_proxy` helper splits that URL into the `server`, `username`, and `password` fields Playwright expects. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide. + ## Conclusion In this guide you learned how to create Actors that use Playwright to scrape websites. Playwright is a powerful tool that can be used to manage browser instances and scrape websites that require JavaScript execution. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! diff --git a/docs/03_guides/04_selenium.mdx b/docs/03_guides/04_selenium.mdx index e878c3a6..ae4ccbef 100644 --- a/docs/03_guides/04_selenium.mdx +++ b/docs/03_guides/04_selenium.mdx @@ -1,6 +1,6 @@ --- id: selenium -title: Use Selenium +title: Browser automation with Selenium description: Build an Apify Actor that scrapes dynamic web pages using Selenium WebDriver. --- @@ -8,7 +8,7 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; import SeleniumExample from '!!raw-loader!roa-loader!./code/04_selenium.py'; -In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for web scraping in your Apify Actors. +In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for browser automation and web scraping in your Apify Actors. 
## Introduction @@ -32,14 +32,20 @@ Refer to the [Selenium documentation](https://www.selenium.dev/documentation/web ## Example Actor -This is a simple Actor that recursively scrapes titles from all linked websites, up to a maximum depth, starting from URLs in the Actor input. +This is a simple Actor that recursively scrapes data from linked pages on the same site, up to a maximum depth, starting from URLs in the Actor input. -It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title and anchor elements after the pages load. +It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title, headings, and links after the pages load. {SeleniumExample} +## Using Apify Proxy + +Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and routes the browser through it for the whole run. + +Chrome ignores the credentials passed in the `--proxy-server` flag, so an authenticated proxy such as Apify Proxy has to be configured from inside a small extension. The `proxy_auth_extension` helper builds one at runtime: its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. Note that the new headless mode (`--headless=new`) is required for Chrome to load the extension. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide. + ## Conclusion In this guide you learned how to use Selenium for web scraping in Apify Actors. You can now create your own Actors that use Selenium to scrape dynamic websites and interact with web pages just like a human would. 
See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! diff --git a/docs/03_guides/05_crawlee.mdx b/docs/03_guides/05_crawlee.mdx index 34bb0f46..f0aa67f6 100644 --- a/docs/03_guides/05_crawlee.mdx +++ b/docs/03_guides/05_crawlee.mdx @@ -1,6 +1,6 @@ --- id: crawlee -title: Use Crawlee +title: Building crawlers with Crawlee description: Build Apify Actors using Crawlee's BeautifulSoupCrawler, ParselCrawler, or PlaywrightCrawler. --- @@ -10,7 +10,7 @@ import CrawleeBeautifulSoupExample from '!!raw-loader!roa-loader!./code/05_crawl import CrawleeParselExample from '!!raw-loader!roa-loader!./code/05_crawlee_parsel.py'; import CrawleePlaywrightExample from '!!raw-loader!roa-loader!./code/05_crawlee_playwright.py'; -In this guide, you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. +In this guide, you'll learn how to build web crawlers with the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. ## Introduction @@ -42,6 +42,10 @@ The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler {CrawleePlaywrightExample} +## Using Apify Proxy + +All three crawlers above route their requests through [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. `Actor.create_proxy_configuration` returns a Crawlee-compatible proxy configuration, which is passed to the crawler as `proxy_configuration`; Crawlee then rotates the proxy IP for every request on its own. 
Because the configuration is only available inside the running Actor, the crawler is created in `main` and the request handler is registered on a standalone [`Router`](https://crawlee.dev/python/api/class/Router) up front. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide. + ## Conclusion In this guide, you learned how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. By using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! diff --git a/docs/03_guides/06_scrapy.mdx b/docs/03_guides/06_scrapy.mdx index 12525609..4af24354 100644 --- a/docs/03_guides/06_scrapy.mdx +++ b/docs/03_guides/06_scrapy.mdx @@ -1,6 +1,6 @@ --- id: scrapy -title: Use Scrapy +title: Building crawlers with Scrapy description: Convert Scrapy spiders into Apify Actors with platform storage and proxy integration. --- @@ -15,7 +15,7 @@ import ItemsExample from '!!raw-loader!./code/scrapy_project/src/items.py'; import SpidersExample from '!!raw-loader!./code/scrapy_project/src/spiders/title.py'; import SettingsExample from '!!raw-loader!./code/scrapy_project/src/settings.py'; -In this guide, you'll learn how to use the [Scrapy](https://scrapy.org/) framework in your Apify Actors. 
+In this guide, you'll learn how to build web crawlers with the [Scrapy](https://scrapy.org/) framework in your Apify Actors. ## Introduction @@ -23,9 +23,9 @@ In this guide, you'll learn how to use the [Scrapy](https://scrapy.org/) framewo ## Integrating Scrapy with the Apify platform -The Apify SDK provides an Apify-Scrapy integration. The main challenge of this is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install the Twisted's `asyncioreactor` to run Twisted's asyncio compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications. +The Apify SDK provides an Apify-Scrapy integration. The main challenge of this is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install Twisted's `asyncioreactor` to run Twisted's asyncio-compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications. - + {UnderscoreMainExample} @@ -74,7 +74,7 @@ For further details, see the [Scrapy migration guide](https://docs.apify.com/cli The following example shows a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates. 
- + {UnderscoreMainExample} diff --git a/docs/03_guides/07_running_webserver.mdx b/docs/03_guides/07_running_webserver.mdx deleted file mode 100644 index c17c313b..00000000 --- a/docs/03_guides/07_running_webserver.mdx +++ /dev/null @@ -1,39 +0,0 @@ ---- -id: running-webserver -title: Run a web server -description: Run an HTTP server inside your Actor for monitoring or serving content during execution. ---- - -import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; - -import WebserverExample from '!!raw-loader!roa-loader!./code/07_webserver.py'; - -In this guide, you'll learn how to run a web server inside your Apify Actor. This is useful for monitoring Actor progress, creating custom APIs, or serving content during the Actor run. - -## Introduction - -Each Actor run on the Apify platform is assigned a unique hard-to-guess URL (for example `https://8segt5i81sokzm.runs.apify.net`), which enables HTTP access to an optional web server running inside the Actor run's container. - -The URL is available in the following places: - -- In Apify Console, on the Actor run details page as the **Container URL** field. -- In the API as the `container_url` property of the [Run object](https://docs.apify.com/api/v2#/reference/actors/run-object/get-run). -- In the Actor as the `Actor.configuration.container_url` property. - -The web server running inside the container must listen at the port defined by the `Actor.configuration.container_port` property. When running Actors locally, the port defaults to `4321`, so the web server will be accessible at `http://localhost:4321`. - -## Example Actor - -The following example shows how to start a simple web server in your Actor, which will respond to every GET request with the number of items that the Actor has processed so far: - - - {WebserverExample} - - -## Conclusion - -In this guide, you learned how to run a web server inside your Apify Actor. 
By leveraging the container URL and port provided by the platform, you can expose HTTP endpoints for monitoring, reporting, or serving content during Actor execution. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). - -## Additional resources - -- [Apify templates: Standby Python project](https://apify.com/templates/python-standby) diff --git a/docs/03_guides/07_scrapling.mdx b/docs/03_guides/07_scrapling.mdx new file mode 100644 index 00000000..63e948e5 --- /dev/null +++ b/docs/03_guides/07_scrapling.mdx @@ -0,0 +1,96 @@ +--- +id: scrapling +title: Adaptive scraping with Scrapling +description: Build an Apify Actor that scrapes web pages using the Scrapling adaptive web scraping library. +--- + +import CodeBlock from '@theme/CodeBlock'; +import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; + +import ScraplingExample from '!!raw-loader!roa-loader!./code/07_scrapling.py'; +import ScraplingBrowserScraper from '!!raw-loader!./code/07_scrapling_browser.py'; + +In this guide, you'll learn how to use the [Scrapling](https://scrapling.readthedocs.io/) library for adaptive web scraping in your Apify Actors. + +## Introduction + +[Scrapling](https://scrapling.readthedocs.io/) is an adaptive web scraping library for Python that combines fetching and parsing behind a single, high-level API. It can fetch a page with fast HTTP requests or with a real browser, parse the result with familiar CSS selectors and XPath, and even relocate your selectors automatically when a website's structure changes. + +Some of the features that make Scrapling a good fit for Apify Actors: + +- **Multiple fetchers** - A single API exposes a fast HTTP client with browser TLS-fingerprint impersonation, as well as full browser automation for JavaScript-heavy or protected pages. 
+- **Adaptive selectors** - Scrapling can remember the elements you scraped and find them again after a website redesign, so your scrapers keep working with fewer manual fixes. +- **Anti-bot evasion** - Built-in stealth features (browser impersonation, realistic headers, and automatic Cloudflare Turnstile solving with the browser fetchers) help you avoid being blocked. +- **Familiar parsing API** - Elements are selected with CSS selectors (including the `::text` and `::attr()` pseudo-elements) or XPath, with a Scrapy/Parsel-like `.get()` and `.getall()` interface. +- **First-class async support** - Every fetcher has an asynchronous variant, which integrates naturally with the asyncio-based Apify SDK. + +Scrapling's parser works on its own, while the fetchers are an optional extra. Install Scrapling with the `fetchers` extra to get the HTTP and browser fetchers: + +```bash +pip install "scrapling[fetchers]" +``` + +## Choosing a fetcher + +All of Scrapling's fetchers are importable from `scrapling.fetchers`. Pick the one that matches the website you're scraping: + +- **`Fetcher` / `AsyncFetcher`** - Plain HTTP requests via `.get()`, `.post()`, `.put()`, and `.delete()`. Fast and lightweight, with optional browser TLS-fingerprint impersonation (`impersonate`) and realistic headers (`stealthy_headers`). This is the best choice for static pages and APIs, and it needs no browser binaries. +- **`DynamicFetcher` / `DynamicSession`** - Full browser automation based on [Playwright](https://playwright.dev/), for pages that require JavaScript rendering or interaction. Fetch a page with `.fetch()` or its async variant `.async_fetch()`. +- **`StealthyFetcher` / `StealthySession`** - A stealth-hardened browser fetcher that can automatically solve Cloudflare Turnstile challenges (`solve_cloudflare=True`). Use it for the most heavily protected websites. 
+ +The returned `Response` object is also a Scrapling selector, so you can call `.css()`, `.xpath()`, `.find_all()`, and the other parsing methods on it directly. + +The HTTP fetchers work with just the `scrapling[fetchers]` extra. The browser-based fetchers (`DynamicFetcher` and `StealthyFetcher`) additionally need browser binaries, which you download with the `scrapling install` command - see [Running browser-based fetchers](#running-browser-based-fetchers) below. + +The example Actor in this guide uses the HTTP `AsyncFetcher`, which is the simplest to deploy and pairs well with Apify Proxy. + +## Example Actor + +The following Actor recursively scrapes data from linked pages on the same site, up to a user-defined maximum depth, starting from the URLs in the Actor input. It uses Scrapling's `AsyncFetcher` to fetch each page through [Apify Proxy](https://docs.apify.com/platform/proxy), and CSS selectors to extract the title, headings, and links. + +The whole Actor fits in a single file. A `scrape_page` helper holds the Scrapling-specific fetching and parsing, while the `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy) and the [request queue](https://docs.apify.com/platform/storage/request-queue), and drives the crawl: + + + {ScraplingExample} + + +A few things worth pointing out: + +- Keeping the fetching and parsing in `scrape_page` separates the Scrapling-specific code from the Actor's orchestration logic. The function returns the extracted data together with the discovered links, so `main` decides what to store and what to enqueue. +- The response of `AsyncFetcher.get` is a Scrapling selector, so `response.css('title::text').get()` reads the page title and `response.css('a::attr(href)').getall()` returns every link's `href` in one call. +- `response.urljoin(link_href)` resolves relative links against the page URL, so you can enqueue them directly. 
+- The `impersonate='chrome'` and `stealthy_headers=True` options make the request look like it comes from a real Chrome browser, which - combined with Apify Proxy - reduces the chance of being blocked. + +## Using Apify Proxy + +Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. In the example above, `main` creates a proxy configuration with `Actor.create_proxy_configuration` and passes a fresh proxy URL to `scrape_page` for every request, which forwards it to Scrapling's `proxy` argument. + +Scrapling accepts the proxy as a URL string (for example `http://user:pass@proxy.apify.com:8000`), which is exactly what `ProxyConfiguration.new_url` returns. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide. The browser-based fetchers accept the same `proxy` argument. + +## Running browser-based fetchers + +`DynamicFetcher` and `StealthyFetcher` drive a real browser, so they need the browser binaries installed with the `scrapling install` command. Locally, run it once after installing the `scrapling[fetchers]` extra: + +```bash +scrapling install +``` + +Switching the example Actor from HTTP to a real browser takes only one code change - swap the `AsyncFetcher.get` call in `scrape_page` for `DynamicFetcher.async_fetch`. The parsing API is identical, so the rest of the Actor stays exactly the same: + + + {ScraplingBrowserScraper} + + +To run this on the Apify platform, build on top of the [Apify Playwright base image](https://hub.docker.com/r/apify/actor-python-playwright), which already ships a browser together with all of its system-level dependencies, and run `scrapling install` during the Docker build to download the browser binaries that Scrapling expects. 
+ +## Conclusion + +In this guide, you learned how to use Scrapling in your Apify Actors. You can now fetch pages with Scrapling's HTTP or browser-based fetchers, extract data with its CSS and XPath selectors, route requests through Apify Proxy, and run the whole thing on the Apify platform. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! + +## Additional resources + +- [Scrapling: Official documentation](https://scrapling.readthedocs.io/) +- [Scrapling: Fetchers](https://scrapling.readthedocs.io/en/latest/fetching/choosing/) +- [Scrapling: Parsing and selecting elements](https://scrapling.readthedocs.io/en/latest/parsing/selection/) +- [Scrapling: GitHub repository](https://github.com/D4Vinci/Scrapling) +- [Apify: Proxy management](https://docs.apify.com/platform/proxy) diff --git a/docs/03_guides/08_crawl4ai.mdx b/docs/03_guides/08_crawl4ai.mdx new file mode 100644 index 00000000..0802c002 --- /dev/null +++ b/docs/03_guides/08_crawl4ai.mdx @@ -0,0 +1,80 @@ +--- +id: crawl4ai +title: LLM-ready scraping with Crawl4AI +description: Build an Apify Actor that scrapes web pages into LLM-ready markdown using the Crawl4AI library. +--- + +import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; + +import Crawl4aiExample from '!!raw-loader!roa-loader!./code/08_crawl4ai.py'; + +In this guide, you'll learn how to use the [Crawl4AI](https://crawl4ai.com/) library for LLM-ready web scraping in your Apify Actors. + +## Introduction + +[Crawl4AI](https://crawl4ai.com/) is an open-source, asynchronous web crawler built for LLM and AI workflows. 
It renders a page in a real browser and turns the result into clean, structured markdown that's ready to feed into a language model or a retrieval-augmented generation (RAG) pipeline, while still giving you the raw HTML, extracted links, and media when you need them. + +Some of the features that make Crawl4AI a good fit for Apify Actors: + +- **LLM-ready markdown** - Crawl4AI converts each page into clean markdown, stripping boilerplate and optionally filtering content, so the output can be fed straight into a language model. +- **Real browser rendering** - Pages are loaded in a [Playwright](https://playwright.dev/)-driven browser, so JavaScript-heavy and dynamically rendered websites work out of the box. +- **Built-in link and media extraction** - Every crawl returns the page's links already split into `internal` and `external` groups, together with the media it found, which makes recursive crawling straightforward. +- **Flexible extraction strategies** - Beyond markdown, Crawl4AI can extract structured data with CSS/XPath schemas or with an LLM, all configured per request. +- **First-class async support** - The `AsyncWebCrawler` is built on `asyncio`, which integrates naturally with the asyncio-based Apify SDK. +- **Per-request proxy** - Each request can be routed through its own proxy, which pairs well with Apify Proxy and its rotating IP addresses. + +Crawl4AI drives a real browser through Playwright, so after installing the library you need to download the browser binaries once with the `crawl4ai-setup` command: + +```bash +pip install crawl4ai +crawl4ai-setup +``` + +## Example Actor + +The following Actor recursively crawls pages, starting from the URLs in the Actor input and following links up to a user-defined maximum depth. It uses Crawl4AI's `AsyncWebCrawler` to render each page through [Apify Proxy](https://docs.apify.com/platform/proxy), stores the page's markdown in the dataset, and follows the internal links that Crawl4AI discovers. 
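
Before looking at the full Actor, the depth-limited traversal on its own can be sketched with only the standard library. This is an illustration rather than the Actor's actual code: `FAKE_SITE` and `crawl` are hypothetical names, the in-memory dictionary stands in for real page fetches, and the real example relies on the Apify request queue and on the link lists that Crawl4AI returns instead of a local queue and `urllib.parse`.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit

# Hypothetical in-memory "website" standing in for real page fetches:
# each URL maps to the hrefs found on that page.
FAKE_SITE = {
    'https://example.com/': ['/a', '/b', 'https://other.example.org/x'],
    'https://example.com/a': ['/c', '/'],
    'https://example.com/b': [],
    'https://example.com/c': ['/d'],
    'https://example.com/d': [],
}


def crawl(start_url: str, max_depth: int) -> list[tuple[str, int]]:
    """Visit pages breadth-first, following only same-host links up to max_depth."""
    start_host = urlsplit(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    visited = []

    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))

        # Links found on the deepest allowed level are not followed.
        if depth >= max_depth:
            continue

        for href in FAKE_SITE.get(url, []):
            link = urljoin(url, href)
            # Keep the crawl on the starting website and skip duplicates.
            if urlsplit(link).netloc != start_host or link in seen:
                continue
            seen.add(link)
            queue.append((link, depth + 1))

    return visited
```

Carrying the depth alongside each URL is what lets the loop stop following links without any global counter; in an Actor, the same value would typically travel with each queued request.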
+ +The whole Actor fits in a single file. A `scrape_page` helper holds the Crawl4AI-specific crawling and parsing, while the `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy) and the [request queue](https://docs.apify.com/platform/storage/request-queue), opens a single browser-backed crawler, and drives the crawl: + + + {Crawl4aiExample} + + +A few things worth pointing out: + +- A single `AsyncWebCrawler` is opened once and reused for every request. The crawler manages one browser instance, so reusing it across the whole crawl is far cheaper than launching a new browser per page. +- Keeping the crawling and parsing in `scrape_page` separates the Crawl4AI-specific code from the Actor's orchestration logic. The function returns the extracted data together with the discovered links, so `main` decides what to store and what to enqueue. +- `result.markdown` is the rendered page as clean markdown, and `result.metadata` carries page-level fields such as the title - exactly the kind of output you want when preparing data for an LLM. +- `result.links` already separates `internal` (same-site) links from `external` ones, so the example follows only the internal links to keep the crawl on the same website. +- `CacheMode.BYPASS` tells Crawl4AI to always fetch a fresh copy of the page instead of serving it from its local cache. + +## Using Apify Proxy + +Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. In the example above, `main` creates a proxy configuration with `Actor.create_proxy_configuration` and passes a fresh proxy URL to `scrape_page` for every request, which forwards it to Crawl4AI's per-request `CrawlerRunConfig`. 
+ +`ProxyConfig.from_string` parses the proxy URL returned by `ProxyConfiguration.new_url` (for example `http://groups-RESIDENTIAL:<password>@proxy.apify.com:8000`) into the server, username, and password that the browser needs - the browser cannot take the credentials embedded directly in the URL. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide. + +## Running on the Apify platform + +Because Crawl4AI renders pages in a real browser, the Actor image needs a browser and its system-level dependencies. Build on top of the [Apify Playwright base image](https://hub.docker.com/r/apify/actor-python-playwright), which already ships a browser - Crawl4AI reuses those binaries, so no separate browser-install step is required in the Dockerfile. + +Pin the Python 3.13 variant of that image (for example `apify/actor-python-playwright:3.13-1.60.0`), because some of Crawl4AI's dependencies do not yet publish wheels for the newest Python versions, which would otherwise force a slow source build during the image build. + +Add `apify` and `crawl4ai` to your `requirements.txt`: + +```text +apify +crawl4ai +``` + +## Conclusion + +In this guide, you learned how to use Crawl4AI in your Apify Actors. You can now render pages in a real browser, turn them into LLM-ready markdown, follow the links Crawl4AI discovers, route requests through Apify Proxy, and run the whole thing on the Apify platform. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
+ +## Additional resources + +- [Crawl4AI: Official documentation](https://docs.crawl4ai.com/) +- [Crawl4AI: AsyncWebCrawler and configuration](https://docs.crawl4ai.com/api/async-webcrawler/) +- [Crawl4AI: Proxy and security](https://docs.crawl4ai.com/advanced/proxy-security/) +- [Crawl4AI: GitHub repository](https://github.com/unclecode/crawl4ai) +- [Apify: Proxy management](https://docs.apify.com/platform/proxy) diff --git a/docs/03_guides/09_browser_use.mdx b/docs/03_guides/09_browser_use.mdx new file mode 100644 index 00000000..77529963 --- /dev/null +++ b/docs/03_guides/09_browser_use.mdx @@ -0,0 +1,90 @@ +--- +id: browser-use +title: Browser AI agents with Browser Use +description: Build an Apify Actor that automates a browser with an LLM agent using the Browser Use library. +--- + +import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; + +import BrowserUseExample from '!!raw-loader!roa-loader!./code/09_browser_use.py'; + +In this guide, you'll learn how to use the [Browser Use](https://browser-use.com/) library to drive a browser with an LLM agent in your Apify Actors. + +## Introduction + +[Browser Use](https://browser-use.com/) is a Python library that lets an LLM control a real web browser. Instead of writing selectors and navigation steps by hand, you give an agent a natural-language task - such as "find the top post on Hacker News and return its title and URL" - and the agent decides which pages to open, what to click, and what to read until the task is done. + +Some of the features that make Browser Use a good fit for Apify Actors: + +- **Natural-language tasks** - Describe what you want in plain English; the agent figures out the steps. This is well suited to pages whose structure changes often or is hard to target with fixed selectors. +- **Model-agnostic** - Browser Use ships wrappers for many providers (`ChatOpenAI`, `ChatAnthropic`, `ChatGoogle`, and more), so you can pick the model that fits your task and budget. 
+- **Structured output** - Pass a [Pydantic](https://docs.pydantic.dev/) model as the output schema and the agent returns a validated object instead of free-form text, which maps cleanly onto an Apify dataset. +- **Real browser via CDP** - The agent drives a real Chromium over the Chrome DevTools Protocol, so JavaScript-heavy pages render just like they would for a human. +- **First-class async support** - The agent's `run` method is asynchronous, which integrates naturally with the asyncio-based Apify SDK. + +Browser Use needs only the `browser-use` package - install it with: + +```bash +pip install browser-use +``` + +## Configuring the LLM + +Browser Use needs an LLM to drive the agent. You choose a provider wrapper, give it a model name, and supply the provider's API key: + +- **`ChatOpenAI`** - OpenAI models such as `gpt-4.1-mini` or `gpt-5-mini`. Reads the key from `OPENAI_API_KEY`, or accepts it via the `api_key` argument. +- **`ChatAnthropic`** - Anthropic Claude models such as `claude-sonnet-4-5` or `claude-haiku-4-5`. Reads the key from `ANTHROPIC_API_KEY`. +- **`ChatGoogle`** - Google Gemini models such as `gemini-2.5-flash`. Reads the key from `GOOGLE_API_KEY`. + +The example Actor in this guide uses `ChatOpenAI`, but switching providers is a one-line change in `run_agent_task`. More capable models generally complete tasks in fewer steps and more reliably, while smaller models are cheaper per step. + +Keep the API key out of the Actor input and source code. The example reads it from an environment variable, which on the Apify platform you set as a [secret environment variable](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables) (for example `OPENAI_API_KEY`), and locally you export in your shell. + +## Example Actor + +The following Actor runs a Browser Use agent for a single task and stores its structured result in the default dataset. 
By default it opens [Hacker News](https://news.ycombinator.com) and returns the title and URL of the top five posts, but the task, model, and step limit are all configurable through the Actor input. + +The whole Actor fits in a single file. A `run_agent_task` helper holds the Browser Use-specific logic - it defines the output schema and builds the LLM, browser, and agent - while the `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy), runs the agent, and stores the result: + + + {BrowserUseExample} + + +A few things worth pointing out: + +- Keeping the agent setup in `run_agent_task` separates the Browser Use-specific code from the Actor's orchestration logic. `main` only decides what to read from the input and what to store. +- Passing `output_model_schema=Posts` makes the agent return a validated `Posts` instance via `history.structured_output`, so `main` can push each item straight to the dataset. Adapt the task and the `Post`/`Posts` models together to fit your own use case. +- `enable_signal_handler=False` leaves signal handling to the Actor, which manages the run's lifecycle. Without it, Browser Use would install its own handlers and interfere with a clean shutdown. +- `headless=Actor.configuration.headless` runs the browser without a visible window, which is what you want on the platform. + +## Using Apify Proxy + +Running on the Apify platform gives your agent access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. In the example above, `main` creates a proxy configuration with `Actor.create_proxy_configuration` and passes a fresh proxy URL to `run_agent_task`. 
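
A proxy URL of this kind packs the scheme, credentials, host, and port into one string, while browser-level proxy settings usually take the server and the credentials as separate values. The following is a minimal stdlib sketch of such a split. The `split_proxy_url` name and the plain dictionary it returns are assumptions made for illustration; they are neither the helper used in the example Actor nor part of the Browser Use API.

```python
from __future__ import annotations

from urllib.parse import unquote, urlsplit


def split_proxy_url(proxy_url: str) -> dict[str, str | None]:
    """Split a proxy URL into a credential-free server address and credentials."""
    parts = urlsplit(proxy_url)
    if parts.hostname is None:
        raise ValueError(f'Not a valid proxy URL: {proxy_url!r}')

    # Rebuild the server address without the embedded credentials.
    server = f'{parts.scheme}://{parts.hostname}'
    if parts.port is not None:
        server += f':{parts.port}'

    return {
        'server': server,
        # Credentials may be percent-encoded inside a URL, so decode them.
        'username': unquote(parts.username) if parts.username else None,
        'password': unquote(parts.password) if parts.password else None,
    }
```

Returning `None` for missing credentials keeps the sketch usable with proxies that need no authentication; the caller decides how to map the three values onto whatever settings object the browser library expects.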
+ +Browser Use expects the proxy as a `ProxySettings` object with separate `server`, `username`, and `password` fields, whereas `ProxyConfiguration.new_url` returns a single URL string (for example `http://user:pass@proxy.apify.com:8000`). The `_proxy_settings` helper splits that URL into the fields Browser Use expects. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For more details, see the [Proxy management](../concepts/proxy-management) guide. + +## Running on the Apify platform + +Browser Use drives a real Chromium over CDP, so the Actor needs a browser binary available at runtime. The simplest way to provide one is to build on top of the [Apify Playwright base image](https://hub.docker.com/r/apify/actor-python-playwright), which already ships a browser together with all of its system-level dependencies. Browser Use discovers that browser automatically, so no extra install step is needed in the image. + +Disable Browser Use's telemetry and cloud sync inside the Actor by setting the `ANONYMIZED_TELEMETRY=false` and `BROWSER_USE_CLOUD_SYNC=false` environment variables in your Dockerfile. + +When running the Actor locally, install the browser once with the `browser-use install` command, which downloads a Chromium build together with its dependencies: + +```bash +browser-use install +``` + +Remember to provide the LLM API key in both environments - as a secret environment variable on the platform, and exported in your shell when running locally. + +## Conclusion + +In this guide, you learned how to use Browser Use in your Apify Actors. You can now drive a real browser with an LLM agent, return its results as a validated Pydantic model, route the browser through Apify Proxy, and run the whole thing on the Apify platform. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own automation tasks. 
If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy automating! + +## Additional resources + +- [Browser Use: Official documentation](https://docs.browser-use.com/) +- [Browser Use: Supported models](https://docs.browser-use.com/customize/supported-models) +- [Browser Use: Structured output](https://docs.browser-use.com/customize/agent/output-format) +- [Browser Use: GitHub repository](https://github.com/browser-use/browser-use) +- [Apify: Proxy management](https://docs.apify.com/platform/proxy) diff --git a/docs/03_guides/10_uv.mdx b/docs/03_guides/10_uv.mdx new file mode 100644 index 00000000..3e037d5b --- /dev/null +++ b/docs/03_guides/10_uv.mdx @@ -0,0 +1,188 @@ +--- +id: uv +title: Project management with uv +description: Manage your Actor's Python version, dependencies, and virtual environment with the uv package and project manager. +--- + +import CodeBlock from '@theme/CodeBlock'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +import PyprojectExample from '!!raw-loader!./code/uv_project/pyproject.toml'; +import MainExample from '!!raw-loader!./code/uv_project/my_actor/main.py'; +import UnderscoreMainExample from '!!raw-loader!./code/uv_project/my_actor/__main__.py'; +import DockerfileExample from '!!raw-loader!./code/uv_project/Dockerfile'; + +In this guide, you'll learn how to use [uv](https://docs.astral.sh/uv/) to manage your Apify Actor projects - from creating a new project, through running it locally, to building and deploying it on the Apify platform. + +## Introduction + +[uv](https://docs.astral.sh/uv/) is an extremely fast Python package and project manager. It replaces the combination of pip, virtualenv, and similar tools with a single binary that manages your project's Python version, virtual environment, and dependencies. 
It records the project metadata in the standard [`pyproject.toml`](https://packaging.python.org/en/latest/guides/writing-pyproject-toml/) file and the exact resolved versions of all dependencies in a [`uv.lock`](https://docs.astral.sh/uv/concepts/projects/sync/) lockfile. + +The [Python Actor templates](https://apify.com/templates/categories/python) declare their dependencies in a `requirements.txt` file, which is the default approach for Actors. Using uv instead brings a few advantages: + +- The lockfile guarantees that the dependencies installed in the Actor's Docker image are exactly the ones you developed and tested against locally. +- Dependency installation during the Docker build is significantly faster than with pip, especially with a warm cache. +- During local development, a single tool manages your Python version, virtual environment, and dependencies, so the project behaves the same on every developer's machine. + +:::info Actor templates don't support uv yet + +The [Apify Actor templates](https://apify.com/templates) currently support only pip with `requirements.txt`. Adding uv-based templates is planned - follow [apify/actor-templates#350](https://github.com/apify/actor-templates/issues/350) for updates. + +::: + +To follow along, install [uv](https://docs.astral.sh/uv/getting-started/installation/) and the [Apify CLI](https://docs.apify.com/cli/docs/installation) first. + +## Create a new project + +Create a new uv project and add the Apify SDK to its dependencies: + +```bash +uv init my-actor --bare +cd my-actor +uv python pin 3.14 +uv add apify +``` + +The [`uv init`](https://docs.astral.sh/uv/reference/cli/#uv-init) command with the `--bare` option creates just the `pyproject.toml` project manifest. The `uv python pin` command writes the project's Python version to the `.python-version` file - uv automatically downloads that Python version if it's not installed on your machine. 
Finally, [`uv add`](https://docs.astral.sh/uv/reference/cli/#uv-add) records the dependency in `pyproject.toml`, resolves the exact versions of the whole dependency tree into `uv.lock`, and installs everything into the project's virtual environment in `.venv`. + +The `uv add` command constrains the dependency to the latest version it resolved. You can edit the constraint as you see fit - this guide's example Actor allows any version of the SDK within the current major one: + + + {PyprojectExample} + + +Note that the example has no `[build-system]` section. Without one, uv treats the project as a non-package ("virtual") project: it doesn't try to build and install the project itself, it only manages its dependencies. That's exactly what we want here - the Actor runs as a module straight from the source tree. + +## Add the Actor scaffolding + +For the project to be runnable as an Actor, it needs two more pieces: the source code as a runnable Python package, and the `.actor/` directory with the [Actor configuration](https://docs.apify.com/platform/actors/development/actor-definition/actor-json). + +Create a `my_actor` package with the Actor's source code: + + + + + {MainExample} + + + + + {UnderscoreMainExample} + + + + +Don't forget to add an empty `my_actor/__init__.py` file, so that the directory is a regular Python package executable with `python -m my_actor`. + +Then add the Actor definition to `.actor/actor.json`: + +```json title=".actor/actor.json" +{ + "$schema": "https://apify.com/schemas/v1/actor.ide.json", + "actorSpecification": 1, + "name": "my-actor", + "title": "My uv Actor", + "description": "An Apify Actor with dependencies managed by uv.", + "version": "0.1", + "buildTag": "latest", + "dockerfile": "../Dockerfile" +} +``` + +The `dockerfile` field points to the project's `Dockerfile`, which doesn't exist yet - you'll create it in the [Use uv in the Dockerfile](#use-uv-in-the-dockerfile) section below. 
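
Because the scaffolding spans several files, a quick sanity check can catch a missing piece before the first build. The sketch below uses only the standard library and is not part of the Apify CLI or SDK: `check_actor_scaffolding` is a hypothetical helper, and it assumes that the `dockerfile` path is resolved relative to the `.actor/` directory, as the `../Dockerfile` value above suggests.

```python
import json
from pathlib import Path


def check_actor_scaffolding(project_dir: Path, package: str) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    actor_dir = project_dir / '.actor'
    actor_json = actor_dir / 'actor.json'
    if not actor_json.is_file():
        return [f'missing {actor_json}']

    problems = []
    definition = json.loads(actor_json.read_text(encoding='utf-8'))

    dockerfile_field = definition.get('dockerfile')
    if dockerfile_field is None:
        problems.append('actor.json has no "dockerfile" field')
    else:
        # Assumption: the path is relative to the `.actor/` directory.
        dockerfile = (actor_dir / dockerfile_field).resolve()
        if not dockerfile.is_file():
            problems.append(f'Dockerfile not found at {dockerfile}')

    # The package must be runnable with `python -m <package>`.
    for name in ('__init__.py', '__main__.py', 'main.py'):
        if not (project_dir / package / name).is_file():
            problems.append(f'missing {package}/{name}')

    return problems
```

Running `check_actor_scaffolding(Path('.'), 'my_actor')` from the project root at this point would be expected to report only the Dockerfile, which is created later in the guide.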
+ +The final project structure looks like this: + +```text +my-actor/ +├── .actor/ +│ └── actor.json +├── my_actor/ +│ ├── __init__.py +│ ├── __main__.py +│ └── main.py +├── .python-version +├── Dockerfile +├── pyproject.toml +└── uv.lock +``` + +Make sure to commit `uv.lock` and `.python-version` to version control, so that every developer's machine works with identical dependencies and the same Python version. The Actor's Docker build gets its Python interpreter from the base image instead, so keep the base image tag (`apify/actor-python:3.14`) in sync with `.python-version`. + +## Run the Actor locally + +If you've just cloned the project (or skipped `uv add` above), install the dependencies first: + +```bash +uv sync +``` + +The [`uv sync`](https://docs.astral.sh/uv/reference/cli/#uv-sync) command creates the `.venv` virtual environment (if it doesn't exist yet) and installs the locked dependencies into it. Then run the Actor with the Apify CLI: + +```bash +apify run +``` + +The [`apify run`](https://docs.apify.com/cli/docs/reference#apify-run) command automatically detects the virtual environment in `.venv` and uses it to run the Actor as a module (`python -m my_actor`), with the environment set up to emulate the Apify platform locally - for example, the Actor input is read from `storage/key_value_stores/default/INPUT.json`. + +## Use uv in the Dockerfile + +On the Apify platform, the Actor runs as a Docker container built from the Dockerfile referenced in `.actor/actor.json`. The following Dockerfile installs the locked dependencies with uv on top of the [Apify Python base image](https://hub.docker.com/r/apify/actor-python): + + + {DockerfileExample} + + +A few details worth understanding: + +- The uv binary is copied from its [official Docker image](https://docs.astral.sh/uv/guides/integration/docker/), pinned to a minor version line, so builds are reproducible and there is no need to install uv with pip. 
+- `uv sync --locked --no-dev` installs the dependencies exactly as recorded in `uv.lock` and skips development dependencies. If the lockfile is missing or out of sync with `pyproject.toml`, the build fails instead of silently resolving different versions. +- The dependencies are installed in a separate layer before the source code is copied, so editing your code doesn't invalidate the dependency layer, and rebuilds are fast. +- Putting `.venv/bin` first on `PATH` makes `python` resolve to the project's virtual environment, both during the build and when the Actor runs. + +Also create a `.dockerignore` file and exclude at least `.venv`, `.git`, and `storage` from the Docker build context - the local virtual environment must never be copied into the image, since it's recreated by `uv sync` during the build. + +## Deploy to the Apify platform + +Once the Actor works locally, log in and push it to the Apify platform: + +```bash +apify login +apify push +``` + +The [`apify push`](https://docs.apify.com/cli/docs/reference#apify-push) command uploads the project to the platform and builds the Docker image from the Dockerfile above. Thanks to the committed lockfile, the platform build installs exactly the dependency versions you ran locally. + +## Manage dependencies + +Day-to-day dependency management goes through uv as well: + +```bash +# Add a dependency (records it in pyproject.toml and updates uv.lock). +uv add httpx + +# Add a development-only dependency (skipped in the Docker build by --no-dev). +uv add --dev ruff + +# Remove a dependency. +uv remove httpx + +# Upgrade all dependencies to the latest versions allowed by pyproject.toml. +uv lock --upgrade +uv sync +``` + +Whenever the dependencies change, commit the updated `uv.lock` together with `pyproject.toml`. + +## Conclusion + +In this guide, you learned how to use uv to manage Apify Actor projects. 
You can now create a uv project with the Apify SDK, run it locally with the Apify CLI, install the locked dependencies with uv in the Actor's Docker image, and deploy the whole project to the Apify platform with reproducible builds. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy coding! + +## Additional resources + +- [uv: Official documentation](https://docs.astral.sh/uv/) +- [uv: Working on projects](https://docs.astral.sh/uv/guides/projects/) +- [uv: Using uv in Docker](https://docs.astral.sh/uv/guides/integration/docker/) +- [Apify: Actor Dockerfile documentation](https://docs.apify.com/platform/actors/development/actor-definition/dockerfile) +- [Apify templates: Python](https://apify.com/templates/categories/python) diff --git a/docs/03_guides/11_pydantic.mdx b/docs/03_guides/11_pydantic.mdx new file mode 100644 index 00000000..dd8bbde6 --- /dev/null +++ b/docs/03_guides/11_pydantic.mdx @@ -0,0 +1,107 @@ +--- +id: input-validation +title: Input validation with Pydantic +description: Parse, validate, and type your Actor's input with Pydantic models instead of reaching into a raw dictionary. +--- + +import CodeBlock from '@theme/CodeBlock'; +import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; +import ApiLink from '@theme/ApiLink'; + +import RawInputExample from '!!raw-loader!roa-loader!./code/11_raw_input.py'; +import PydanticExample from '!!raw-loader!roa-loader!./code/11_pydantic.py'; +import HttpUrlExample from '!!raw-loader!./code/11_http_url.py'; +import ModelValidatorExample from '!!raw-loader!./code/11_model_validator.py'; + +In this guide, you'll learn how to validate your Apify Actor's input with [Pydantic](https://docs.pydantic.dev/), so that your code works with a typed, guaranteed-valid object instead of a raw dictionary. 
+ +## Introduction + +An Actor reads its input with `Actor.get_input`, which returns the input record as a plain `dict` (or `None` when there's no input). Working with that dictionary directly is fragile: + + + {RawInputExample} + + +- There are no type guarantees - `max_results` could just as easily arrive as the string `"10"` or `None`, and you'd only find out when something blows up later. +- There's no validation - nothing stops `max_results` from being `0` or `-5`, or `search_terms` from being empty. +- A typo in a key (`maxResult` instead of `maxResults`) silently falls back to the default instead of failing. +- Defaults are scattered across the codebase, and your editor can't autocomplete the fields or catch mistakes. + +[Pydantic](https://docs.pydantic.dev/) solves all of this. You declare the shape of your input once as a model, and Pydantic parses the raw dictionary into a typed object, applying defaults, enforcing constraints, and producing clear error messages when the input doesn't match. Pydantic is already a dependency of the Apify SDK, so there's nothing extra to install. + +## Example Actor + +The following Actor declares its input as a Pydantic `BaseModel`, validates the raw input against it, and then works with a fully typed object. On invalid input it fails fast with a readable error; on valid input it logs the normalized values and stores them as the Actor's output. + + + {PydanticExample} + + +A few things worth pointing out about the **model**: + +- **Aliases bridge the naming conventions.** Apify input fields are conventionally `camelCase` (`maxResults`), while Python attributes are `snake_case` (`max_results`). `Field(alias='maxResults')` maps one to the other, and `populate_by_name=True` lets the model accept either spelling - handy in tests. +- **Defaults and `required` fields are explicit.** A field without a default (`search_terms`) is required; one with a default (`max_results`) is optional. 
There's a single, obvious place where every default lives. +- **Constraints are declarative.** `ge=1, le=100` enforces a numeric range, `min_length=1` rejects an empty list, and `Literal['json', 'csv']` restricts a field to a fixed set of choices - mirroring an `enum` in the input schema. +- **Custom validators handle the rest.** The `field_validator` normalizes the search terms (trimming whitespace, dropping empties) and rejects input that has nothing left, so the rest of your code never has to repeat those checks. +- **Unknown fields are ignored.** `extra='ignore'` means adding a new field to your input schema won't break an older Actor build that doesn't know about it yet. Use `extra='forbid'` instead if you'd rather reject anything unexpected. + +And about the **validation** itself: + +- `model_validate` parses the raw dictionary into a typed `ActorInput` instance, filling in defaults and guaranteeing every field is valid - or raising a `ValidationError` describing every problem at once. +- Catching that error, logging a readable summary, and re-raising makes the Actor **fail fast** with a clear explanation right at the start, rather than crashing with an obscure error somewhere deep in the run. Because the body runs inside `async with Actor:`, the re-raised exception automatically marks the run as `FAILED`. +- The error messages refer to the fields by their input-schema aliases. For invalid input like `{"searchTerms": [], "maxResults": 999, "outputFormat": "xml"}`, the log shows exactly what's wrong: + + ```text + The Actor input is invalid: + 3 validation errors for ActorInput + searchTerms + List should have at least 1 item after validation, not 0 ... + maxResults + Input should be less than or equal to 100 ... + outputFormat + Input should be 'json' or 'csv' ... 
+ ``` + +Once validation passes, the rest of `main` works with `actor_input.search_terms`, `actor_input.max_results`, and `actor_input.output_format` - all correctly typed, with editor autocompletion and static type checking. + +## Relationship to the input schema + +Pydantic validation **complements** the Actor's [input schema](https://docs.apify.com/platform/actors/development/input-schema) (`.actor/input_schema.json`) - it doesn't replace it. The two serve different layers: + +- The **input schema** drives the Apify Console form, documents the fields for your users, and lets the platform validate input before the run even starts. Keep declaring your fields there. +- The **Pydantic model** validates the input again *inside your Python code*, where it gives you a typed object, IDE support, and richer rules (normalization, cross-field checks, custom formats) that the input schema can't express. It's also your safety net for runs started programmatically by [another Actor](../concepts/interacting-with-other-actors) or executed [locally](https://docs.apify.com/cli/docs/reference#apify-run), and for keeping the two definitions honest with each other. + +Keep the model's aliases in sync with the field keys in `input_schema.json`, and the two definitions describe the same input from both sides. + +## Useful validation features + +Pydantic offers much more than the example uses. A few features that come up often when validating Actor input: + +**Format-validated types** for common string formats, for example `HttpUrl` for URLs or `EmailStr` for e-mail addresses (the latter needs the `pydantic[email]` extra): + + + {HttpUrlExample} + + +**Cross-field validation** with `model_validator`, when one field's validity depends on another: + + + {ModelValidatorExample} + + +**Secret input fields.** The platform decrypts [secret input fields](https://docs.apify.com/platform/actors/development/secret-input) for you before `Actor.get_input` returns, so you receive plaintext. 
Wrap such fields in Pydantic's `SecretStr` to keep them from leaking into logs or `model_dump()` output. + +For the full set of types, constraints, and validators, see the [Pydantic documentation](https://docs.pydantic.dev/latest/concepts/models/). + +## Conclusion + +In this guide, you learned how to validate Actor input with Pydantic: declaring the input as a model with aliases, defaults, and constraints; parsing the raw input with `model_validate`; failing fast with a readable error when the input is invalid; and working with a typed object for the rest of the run. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own Actors. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy validating! + +## Additional resources + +- [Pydantic: Official documentation](https://docs.pydantic.dev/) +- [Pydantic: Models](https://docs.pydantic.dev/latest/concepts/models/) +- [Pydantic: Validators](https://docs.pydantic.dev/latest/concepts/validators/) +- [Apify: Actor input](https://docs.apify.com/platform/actors/running/input) +- [Apify: Input schema specification](https://docs.apify.com/platform/actors/development/input-schema) diff --git a/docs/03_guides/12_running_webserver.mdx b/docs/03_guides/12_running_webserver.mdx new file mode 100644 index 00000000..7b946e86 --- /dev/null +++ b/docs/03_guides/12_running_webserver.mdx @@ -0,0 +1,71 @@ +--- +id: running-webserver +title: Running a web server +description: Run an HTTP server inside your Actor for monitoring or serving content during execution. 
+--- + +import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock'; + +import WebserverExample from '!!raw-loader!roa-loader!./code/12_webserver.py'; +import WebserverFastApiExample from '!!raw-loader!roa-loader!./code/12_webserver_fastapi.py'; + +In this guide, you'll learn how to run a web server inside your Apify Actor. This is useful for monitoring Actor progress, creating custom APIs, or serving content during the Actor run. + +## Introduction + +Each Actor run on the Apify platform is assigned a unique hard-to-guess URL (for example `https://8segt5i81sokzm.runs.apify.net`), which enables HTTP access to an optional web server running inside the Actor run's container. + +The URL is available in the following places: + +- In Apify Console, on the Actor run details page as the **Container URL** field. +- In the API as the `container_url` property of the [Run object](https://docs.apify.com/api/v2#/reference/actors/run-object/get-run). +- In the Actor as the `Actor.configuration.web_server_url` property. + +The web server running inside the container must listen at the port defined by the `Actor.configuration.web_server_port` property. When running Actors locally, the port defaults to `4321`, so the web server will be accessible at `http://localhost:4321`. + +## Example Actor + +The following example shows how to start a simple web server in your Actor, which will respond to every GET request with the number of items that the Actor has processed so far: + + + {WebserverExample} + + +## Using FastAPI + +The example above relies only on Python's standard library, which keeps it dependency-free but leaves you handling requests by hand. For anything beyond a single endpoint, a web framework such as [FastAPI](https://fastapi.tiangolo.com/) is a better fit - it gives you routing, request parsing, and automatic JSON responses, and is served by an ASGI server like [uvicorn](https://www.uvicorn.org/). 
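To make "handling requests by hand" concrete, here is a minimal standard-library sketch. It is not the guide's own example (that one is loaded from `code/12_webserver.py`), and the names `state`, `StatusHandler`, and `start_status_server` are hypothetical. The port is a plain function argument here; inside an Actor you would presumably pass `Actor.configuration.web_server_port`.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical shared state; a real Actor would update it while it works.
state = {'processed_items': 0}


class StatusHandler(BaseHTTPRequestHandler):
    """Answer GET / with the current state as JSON."""

    def do_GET(self) -> None:
        # Done by hand: routing, serialization, status line, and headers.
        if self.path != '/':
            self.send_error(404, 'Not found')
            return
        body = json.dumps(state).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        # Silence the default per-request logging to stderr.
        pass


def start_status_server(port: int) -> ThreadingHTTPServer:
    """Serve StatusHandler on all interfaces from a daemon thread."""
    server = ThreadingHTTPServer(('0.0.0.0', port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Every extra endpoint in this style means another branch in `do_GET` and more manual header handling. FastAPI and uvicorn take over exactly that part.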
+ +Install both, for example by adding them to your `requirements.txt`: + +```text +fastapi +uvicorn[standard] +``` + +The following Actor serves the same processed-items counter as before, but through a FastAPI endpoint. The key difference is that uvicorn runs inside the Actor's event loop as a background task, bound to `Actor.configuration.web_server_port` so the platform routes the container URL to it: + + + {WebserverFastApiExample} + + +A few things worth pointing out: + +- `uvicorn.Server(...).serve()` is a coroutine, so it runs as an `asyncio` task alongside the Actor's own work instead of blocking it. Setting `server.should_exit = True` triggers a graceful shutdown once the work is done. +- The server binds to `0.0.0.0` (all interfaces) rather than `localhost`, so it's reachable through the container URL, not only from inside the container. +- The same pattern powers an [Actor Standby](#actor-standby) service - swap the one-off work loop for an Actor that just keeps serving requests. + +## Actor Standby + +The example above runs a web server for the duration of a single Actor run. With [Actor Standby](https://docs.apify.com/platform/actors/development/programming-interface/standby), you can instead expose your Actor as an always-ready HTTP API: the platform keeps the Actor running in the background and routes incoming HTTP requests to the web server inside it, spinning up additional instances as the load grows. + +From the SDK's perspective, a Standby Actor is built the same way as the web server above — start an HTTP server listening on the port from `Actor.configuration.web_server_port`. The difference is operational: instead of doing its work once and exiting, a Standby Actor stays up and serves requests. This makes it a good fit for low-latency, on-demand use cases, such as serving scraped data or acting as a microservice. + +To get started quickly, use the [Standby Python template](https://apify.com/templates/python-standby). 
For details on enabling Standby, request routing, and readiness probes, see the [Actor Standby documentation](https://docs.apify.com/platform/actors/development/programming-interface/standby). + +## Conclusion + +In this guide, you learned how to run a web server inside your Apify Actor. By leveraging the container URL and port provided by the platform, you can expose HTTP endpoints for monitoring, reporting, or serving content during Actor execution. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). + +## Additional resources + +- [Apify templates: Standby Python project](https://apify.com/templates/python-standby) diff --git a/docs/03_guides/code/01_beautifulsoup_httpx.py b/docs/03_guides/code/01_beautifulsoup_httpx.py index 5dbfab2a..adc03361 100644 --- a/docs/03_guides/code/01_beautifulsoup_httpx.py +++ b/docs/03_guides/code/01_beautifulsoup_httpx.py @@ -1,87 +1,119 @@ import asyncio -from urllib.parse import urljoin +from typing import Any +from urllib.parse import urljoin, urlsplit import httpx from bs4 import BeautifulSoup from apify import Actor, Request +from apify.storages import RequestQueue + + +async def scrape_page( + url: str, + *, + proxy_url: str | None = None, +) -> tuple[dict[str, Any], list[str]]: + """Fetch a page with HTTPX and return its data and same-site links.""" + # A fresh client per call lets each request use a new proxy URL. + async with httpx.AsyncClient(proxy=proxy_url) as client: + response = await client.get(url, follow_redirects=True) + + soup = BeautifulSoup(response.content, 'html.parser') + + data = { + 'url': url, + 'title': soup.title.string if soup.title else None, + 'h1s': [h1.text for h1 in soup.find_all('h1')], + 'h2s': [h2.text for h2 in soup.find_all('h2')], + 'h3s': [h3.text for h3 in soup.find_all('h3')], + } + + # Keep only absolute links on the same host. 
+ links: list[str] = [] + host = urlsplit(url).netloc + for link in soup.find_all('a'): + link_url = urljoin(url, link.get('href')) + if not link_url.startswith(('http://', 'https://')): + continue + if urlsplit(link_url).netloc == host: + links.append(link_url) + + return data, links + + +async def enqueue_links( + request_queue: RequestQueue, + links: list[str], + *, + depth: int, + max_depth: int, +) -> None: + """Enqueue the links one level deeper, unless max_depth was reached.""" + if depth >= max_depth: + return + + for link_url in links: + Actor.log.info(f'Enqueuing {link_url} ...') + request = Request.from_url(link_url) + request.crawl_depth = depth + 1 + await request_queue.add_request(request) async def main() -> None: - # Enter the context of the Actor. async with Actor: - # Retrieve the Actor input, and use default values if not provided. + # Read the Actor input. actor_input = await Actor.get_input() or {} - start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}]) - max_depth = actor_input.get('max_depth', 1) + start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}]) + max_depth = actor_input.get('maxDepth', 1) - # Exit if no start URLs are provided. if not start_urls: Actor.log.info('No start URLs specified in Actor input, exiting...') await Actor.exit() - # Open the default request queue for handling URLs to be processed. + # Set up Apify Proxy and the request queue. + proxy_configuration = await Actor.create_proxy_configuration() request_queue = await Actor.open_request_queue() - # Enqueue the start URLs with an initial crawl depth of 0. + # Enqueue the start URLs (crawl depth defaults to 0). for start_url in start_urls: url = start_url.get('url') - Actor.log.info(f'Enqueuing {url} ...') - new_request = Request.from_url(url, user_data={'depth': 0}) - await request_queue.add_request(new_request) - - # Create an HTTPX client to fetch the HTML content of the URLs. 
- async with httpx.AsyncClient() as client: - # Process the URLs from the request queue. - while request := await request_queue.fetch_next_request(): - url = request.url - - if not isinstance(request.user_data['depth'], (str, int)): - raise TypeError('Request.depth is an unexpected type.') - - depth = int(request.user_data['depth']) - Actor.log.info(f'Scraping {url} (depth={depth}) ...') - - try: - # Fetch the HTTP response from the specified URL using HTTPX. - response = await client.get(url, follow_redirects=True) - - # Parse the HTML content using Beautiful Soup. - soup = BeautifulSoup(response.content, 'html.parser') - - # If the current depth is less than max_depth, find nested links - # and enqueue them. - if depth < max_depth: - for link in soup.find_all('a'): - link_href = link.get('href') - link_url = urljoin(url, link_href) - - if link_url.startswith(('http://', 'https://')): - Actor.log.info(f'Enqueuing {link_url} ...') - new_request = Request.from_url( - link_url, - user_data={'depth': depth + 1}, - ) - await request_queue.add_request(new_request) - - # Extract the desired data. - data = { - 'url': url, - 'title': soup.title.string if soup.title else None, - 'h1s': [h1.text for h1 in soup.find_all('h1')], - 'h2s': [h2.text for h2 in soup.find_all('h2')], - 'h3s': [h3.text for h3 in soup.find_all('h3')], - } - - # Store the extracted data to the default dataset. - await Actor.push_data(data) - - except Exception: - Actor.log.exception(f'Cannot extract data from {url}.') - - finally: - # Mark the request as handled to ensure it is not processed again. - await request_queue.mark_request_as_handled(new_request) + Actor.log.info(f'Enqueuing start URL: {url}') + await request_queue.add_request(Request.from_url(url)) + + # Cap the crawl; raise or remove to follow more pages. 
+ max_requests = 50 + handled_requests = 0 + + while handled_requests < max_requests and ( + request := await request_queue.fetch_next_request() + ): + handled_requests += 1 + url = request.url + depth = request.crawl_depth + Actor.log.info(f'Scraping {url} (depth={depth}) ...') + + try: + # Fresh proxy URL per request (None if no proxy). + proxy_url = None + if proxy_configuration: + proxy_url = await proxy_configuration.new_url() + + data, links = await scrape_page(url, proxy_url=proxy_url) + await Actor.push_data(data) + Actor.log.info( + f'Stored data from {url} ' + f'(title={data["title"]!r}, {len(links)} links found).' + ) + await enqueue_links( + request_queue, links, depth=depth, max_depth=max_depth + ) + + except Exception: + Actor.log.exception(f'Cannot extract data from {url}.') + + finally: + await request_queue.mark_request_as_handled(request) if __name__ == '__main__': diff --git a/docs/03_guides/code/02_parsel_impit.py b/docs/03_guides/code/02_parsel_impit.py index 21b5e74f..c937f48e 100644 --- a/docs/03_guides/code/02_parsel_impit.py +++ b/docs/03_guides/code/02_parsel_impit.py @@ -1,93 +1,119 @@ import asyncio -from urllib.parse import urljoin +from typing import Any +from urllib.parse import urljoin, urlsplit import impit import parsel from apify import Actor, Request +from apify.storages import RequestQueue + + +async def scrape_page( + url: str, + *, + proxy_url: str | None = None, +) -> tuple[dict[str, Any], list[str]]: + """Fetch a page with Impit and return its data and same-site links.""" + # A fresh client per call lets each request use a new proxy URL. 
+ async with impit.AsyncClient(proxy=proxy_url) as client: + response = await client.get(url) + + selector = parsel.Selector(text=response.text) + + data = { + 'url': url, + 'title': selector.css('title::text').get(), + 'h1s': selector.css('h1::text').getall(), + 'h2s': selector.css('h2::text').getall(), + 'h3s': selector.css('h3::text').getall(), + } + + # Keep only absolute links on the same host. + links: list[str] = [] + host = urlsplit(url).netloc + for link_href in selector.css('a::attr(href)').getall(): + link_url = urljoin(url, link_href) + if not link_url.startswith(('http://', 'https://')): + continue + if urlsplit(link_url).netloc == host: + links.append(link_url) + + return data, links + + +async def enqueue_links( + request_queue: RequestQueue, + links: list[str], + *, + depth: int, + max_depth: int, +) -> None: + """Enqueue the links one level deeper, unless max_depth was reached.""" + if depth >= max_depth: + return + + for link_url in links: + Actor.log.info(f'Enqueuing {link_url} ...') + request = Request.from_url(link_url) + request.crawl_depth = depth + 1 + await request_queue.add_request(request) async def main() -> None: - # Enter the context of the Actor. async with Actor: - # Retrieve the Actor input, and use default values if not provided. + # Read the Actor input. actor_input = await Actor.get_input() or {} - start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}]) - max_depth = actor_input.get('max_depth', 1) + start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}]) + max_depth = actor_input.get('maxDepth', 1) - # Exit if no start URLs are provided. if not start_urls: Actor.log.info('No start URLs specified in Actor input, exiting...') await Actor.exit() - # Open the default request queue for handling URLs to be processed. + # Set up Apify Proxy and the request queue. 
+ proxy_configuration = await Actor.create_proxy_configuration() request_queue = await Actor.open_request_queue() - # Enqueue the start URLs with an initial crawl depth of 0. + # Enqueue the start URLs (crawl depth defaults to 0). for start_url in start_urls: url = start_url.get('url') - Actor.log.info(f'Enqueuing {url} ...') - new_request = Request.from_url(url, user_data={'depth': 0}) - await request_queue.add_request(new_request) - - # Create an Impit client to fetch the HTML content of the URLs. - async with impit.AsyncClient() as client: - # Process the URLs from the request queue. - while request := await request_queue.fetch_next_request(): - url = request.url - - if not isinstance(request.user_data['depth'], (str, int)): - raise TypeError('Request.depth is an unexpected type.') - - depth = int(request.user_data['depth']) - Actor.log.info(f'Scraping {url} (depth={depth}) ...') - - try: - # Fetch the HTTP response from the specified URL using Impit. - response = await client.get(url) - - # Parse the HTML content using Parsel Selector. - selector = parsel.Selector(text=response.text) - - # If the current depth is less than max_depth, find nested links - # and enqueue them. - if depth < max_depth: - # Extract all links using CSS selector - links = selector.css('a::attr(href)').getall() - for link_href in links: - link_url = urljoin(url, link_href) - - if link_url.startswith(('http://', 'https://')): - Actor.log.info(f'Enqueuing {link_url} ...') - new_request = Request.from_url( - link_url, - user_data={'depth': depth + 1}, - ) - await request_queue.add_request(new_request) - - # Extract the desired data using Parsel selectors. - title = selector.css('title::text').get() - h1s = selector.css('h1::text').getall() - h2s = selector.css('h2::text').getall() - h3s = selector.css('h3::text').getall() - - data = { - 'url': url, - 'title': title, - 'h1s': h1s, - 'h2s': h2s, - 'h3s': h3s, - } - - # Store the extracted data to the default dataset. 
- await Actor.push_data(data) - - except Exception: - Actor.log.exception(f'Cannot extract data from {url}.') - - finally: - # Mark the request as handled to ensure it is not processed again. - await request_queue.mark_request_as_handled(request) + Actor.log.info(f'Enqueuing start URL: {url}') + await request_queue.add_request(Request.from_url(url)) + + # Cap the crawl; raise or remove to follow more pages. + max_requests = 50 + handled_requests = 0 + + while handled_requests < max_requests and ( + request := await request_queue.fetch_next_request() + ): + handled_requests += 1 + url = request.url + depth = request.crawl_depth + Actor.log.info(f'Scraping {url} (depth={depth}) ...') + + try: + # Fresh proxy URL per request (None if no proxy). + proxy_url = None + if proxy_configuration: + proxy_url = await proxy_configuration.new_url() + + data, links = await scrape_page(url, proxy_url=proxy_url) + await Actor.push_data(data) + Actor.log.info( + f'Stored data from {url} ' + f'(title={data["title"]!r}, {len(links)} links found).' + ) + await enqueue_links( + request_queue, links, depth=depth, max_depth=max_depth + ) + + except Exception: + Actor.log.exception(f'Cannot extract data from {url}.') + + finally: + await request_queue.mark_request_as_handled(request) if __name__ == '__main__': diff --git a/docs/03_guides/code/03_playwright.py b/docs/03_guides/code/03_playwright.py index 3eecb4ac..46c89867 100644 --- a/docs/03_guides/code/03_playwright.py +++ b/docs/03_guides/code/03_playwright.py @@ -1,95 +1,136 @@ import asyncio -from urllib.parse import urljoin +from typing import Any +from urllib.parse import urljoin, urlsplit -from playwright.async_api import async_playwright +from playwright.async_api import BrowserContext, async_playwright from apify import Actor, Request - -# Note: To run this Actor locally, ensure that Playwright browsers are installed. -# Run `playwright install --with-deps` in the Actor's virtual environment to install them. 
-# When running on the Apify platform, these dependencies are already included -# in the Actor's Docker image. +from apify.storages import RequestQueue + +# To run locally, install the browsers first: `playwright install --with-deps`. +# On the Apify platform they are already in the Actor's Docker image. + + +def to_playwright_proxy(proxy_url: str) -> dict[str, str]: + """Split an Apify Proxy URL into Playwright's server/username/password.""" + parts = urlsplit(proxy_url) + return { + 'server': f'{parts.scheme}://{parts.hostname}:{parts.port}', + 'username': parts.username or '', + 'password': parts.password or '', + } + + +async def scrape_page( + context: BrowserContext, url: str +) -> tuple[dict[str, Any], list[str]]: + """Open the URL in a new page and return its data and same-site links.""" + page = await context.new_page() + try: + await page.goto(url) + + data = { + 'url': url, + 'title': await page.title(), + 'h1s': [await h1.text_content() for h1 in await page.locator('h1').all()], + 'h2s': [await h2.text_content() for h2 in await page.locator('h2').all()], + 'h3s': [await h3.text_content() for h3 in await page.locator('h3').all()], + } + + # Keep only absolute links on the same host. 
+ links: list[str] = [] + host = urlsplit(url).netloc + for link in await page.locator('a').all(): + link_href = await link.get_attribute('href') + link_url = urljoin(url, link_href) + if not link_url.startswith(('http://', 'https://')): + continue + if urlsplit(link_url).netloc == host: + links.append(link_url) + + return data, links + + finally: + await page.close() + + +async def enqueue_links( + request_queue: RequestQueue, + links: list[str], + *, + depth: int, + max_depth: int, +) -> None: + """Enqueue the links one level deeper, unless max_depth was reached.""" + if depth >= max_depth: + return + + for link_url in links: + Actor.log.info(f'Enqueuing {link_url} ...') + request = Request.from_url(link_url) + request.crawl_depth = depth + 1 + await request_queue.add_request(request) async def main() -> None: - # Enter the context of the Actor. async with Actor: - # Retrieve the Actor input, and use default values if not provided. + # Read the Actor input. actor_input = await Actor.get_input() or {} - start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}]) - max_depth = actor_input.get('max_depth', 1) + start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}]) + max_depth = actor_input.get('maxDepth', 1) - # Exit if no start URLs are provided. if not start_urls: - Actor.log.info('No start URLs specified in actor input, exiting...') + Actor.log.info('No start URLs specified in Actor input, exiting...') await Actor.exit() - # Open the default request queue for handling URLs to be processed. - request_queue = await Actor.open_request_queue() + # Playwright proxies at the browser level, so one URL is shared per run. + proxy_configuration = await Actor.create_proxy_configuration() + proxy_url = await proxy_configuration.new_url() if proxy_configuration else None - # Enqueue the start URLs with an initial crawl depth of 0. + # Open the request queue and enqueue the start URLs (crawl depth 0). 
+ request_queue = await Actor.open_request_queue() for start_url in start_urls: url = start_url.get('url') - Actor.log.info(f'Enqueuing {url} ...') - new_request = Request.from_url(url, user_data={'depth': 0}) - await request_queue.add_request(new_request) + Actor.log.info(f'Enqueuing start URL: {url}') + await request_queue.add_request(Request.from_url(url)) + + # Cap the crawl; raise or remove to follow more pages. + max_requests = 50 + handled_requests = 0 Actor.log.info('Launching Playwright...') - # Launch Playwright and open a new browser context. async with async_playwright() as playwright: - # Configure the browser to launch in headless mode as per Actor configuration. browser = await playwright.chromium.launch( headless=Actor.configuration.headless, - args=['--disable-gpu'], + proxy=to_playwright_proxy(proxy_url) if proxy_url else None, + args=['--no-sandbox', '--disable-dev-shm-usage', '--disable-gpu'], ) context = await browser.new_context() - # Process the URLs from the request queue. - while request := await request_queue.fetch_next_request(): + while handled_requests < max_requests and ( + request := await request_queue.fetch_next_request() + ): + handled_requests += 1 url = request.url - - if not isinstance(request.user_data['depth'], (str, int)): - raise TypeError('Request.depth is an unexpected type.') - - depth = int(request.user_data['depth']) + depth = request.crawl_depth Actor.log.info(f'Scraping {url} (depth={depth}) ...') try: - # Open a new page in the browser context and navigate to the URL. - page = await context.new_page() - await page.goto(url) - - # If the current depth is less than max_depth, find nested links - # and enqueue them. 
- if depth < max_depth: - for link in await page.locator('a').all(): - link_href = await link.get_attribute('href') - link_url = urljoin(url, link_href) - - if link_url.startswith(('http://', 'https://')): - Actor.log.info(f'Enqueuing {link_url} ...') - new_request = Request.from_url( - link_url, - user_data={'depth': depth + 1}, - ) - await request_queue.add_request(new_request) - - # Extract the desired data. - data = { - 'url': url, - 'title': await page.title(), - } - - # Store the extracted data to the default dataset. + data, links = await scrape_page(context, url) await Actor.push_data(data) + Actor.log.info( + f'Stored data from {url} ' + f'(title={data["title"]!r}, {len(links)} links found).' + ) + await enqueue_links( + request_queue, links, depth=depth, max_depth=max_depth + ) except Exception: Actor.log.exception(f'Cannot extract data from {url}.') finally: - await page.close() - # Mark the request as handled to ensure it is not processed again. await request_queue.mark_request_as_handled(request) diff --git a/docs/03_guides/code/04_selenium.py b/docs/03_guides/code/04_selenium.py index 4b427a7a..8bf08817 100644 --- a/docs/03_guides/code/04_selenium.py +++ b/docs/03_guides/code/04_selenium.py @@ -1,102 +1,191 @@ import asyncio -from urllib.parse import urljoin +import json +from pathlib import Path +from tempfile import mkdtemp +from typing import Any +from urllib.parse import urljoin, urlsplit +from zipfile import ZipFile from selenium import webdriver from selenium.webdriver.chrome.options import Options as ChromeOptions from selenium.webdriver.common.by import By from apify import Actor, Request +from apify.storages import RequestQueue -# To run this Actor locally, you need to have the Selenium Chromedriver installed. 
-# Follow the installation guide at: +# To run locally, install the Selenium Chromedriver: # https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/ -# When running on the Apify platform, the Chromedriver is already included -# in the Actor's Docker image. +# On the Apify platform it is already in the Actor's Docker image. + + +def proxy_auth_extension(proxy_url: str) -> str: + """Build a Chrome extension that routes Chrome through an authenticated proxy.""" + parts = urlsplit(proxy_url) + + manifest = { + 'name': 'Apify Proxy', + 'version': '1.0.0', + 'manifest_version': 3, + 'permissions': ['proxy', 'webRequest', 'webRequestAuthProvider'], + 'host_permissions': ['<all_urls>'], + 'background': {'service_worker': 'background.js'}, + 'minimum_chrome_version': '108', + } + + # The service worker sets the proxy and answers the auth challenge. + proxy_config = json.dumps( + { + 'mode': 'fixed_servers', + 'rules': { + 'singleProxy': { + 'scheme': parts.scheme, + 'host': parts.hostname, + 'port': parts.port, + }, + }, + } + ) + credentials = json.dumps( + {'username': parts.username or '', 'password': parts.password or ''} + ) + background = ( + 'chrome.proxy.settings.set(' + '{value: ' + proxy_config + ', scope: "regular"});\n' + 'chrome.webRequest.onAuthRequired.addListener(\n' + ' () => ({authCredentials: ' + credentials + '}),\n' + ' {urls: ["<all_urls>"]},\n' + ' ["blocking"],\n' + ');\n' + ) + + extension_path = Path(mkdtemp()) / 'apify_proxy.zip' + with ZipFile(extension_path, 'w') as archive: + archive.writestr('manifest.json', json.dumps(manifest)) + archive.writestr('background.js', background) + return str(extension_path) + + +def build_chrome_driver(proxy_url: str | None = None) -> webdriver.Chrome: + """Create a headless Chrome WebDriver, optionally routed through a proxy.""" + chrome_options = ChromeOptions() + + if Actor.configuration.headless: + # The new headless mode is required to load the proxy extension.
+ chrome_options.add_argument('--headless=new') + + chrome_options.add_argument('--no-sandbox') + chrome_options.add_argument('--disable-dev-shm-usage') + chrome_options.add_argument('--disable-gpu') + + if proxy_url: + chrome_options.add_extension(proxy_auth_extension(proxy_url)) + chrome_options.add_argument( + '--disable-features=DisableLoadExtensionCommandLineSwitch' + ) + + return webdriver.Chrome(options=chrome_options) + + +def scrape_page(driver: webdriver.Chrome, url: str) -> tuple[dict[str, Any], list[str]]: + """Navigate to the URL with Selenium and return its data and same-site links.""" + driver.get(url) + + data = { + 'url': url, + 'title': driver.title, + 'h1s': [el.text for el in driver.find_elements(By.TAG_NAME, 'h1')], + 'h2s': [el.text for el in driver.find_elements(By.TAG_NAME, 'h2')], + 'h3s': [el.text for el in driver.find_elements(By.TAG_NAME, 'h3')], + } + + # Keep only absolute links on the same host. + links: list[str] = [] + host = urlsplit(url).netloc + for link in driver.find_elements(By.TAG_NAME, 'a'): + link_url = urljoin(url, link.get_attribute('href')) + if not link_url.startswith(('http://', 'https://')): + continue + if urlsplit(link_url).netloc == host: + links.append(link_url) + + return data, links + + +async def enqueue_links( + request_queue: RequestQueue, + links: list[str], + *, + depth: int, + max_depth: int, +) -> None: + """Enqueue the links one level deeper, unless max_depth was reached.""" + if depth >= max_depth: + return + + for link_url in links: + Actor.log.info(f'Enqueuing {link_url} ...') + request = Request.from_url(link_url) + request.crawl_depth = depth + 1 + await request_queue.add_request(request) async def main() -> None: - # Enter the context of the Actor. async with Actor: - # Retrieve the Actor input, and use default values if not provided. + # Read the Actor input. 
actor_input = await Actor.get_input() or {} - start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}]) - max_depth = actor_input.get('max_depth', 1) + start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}]) + max_depth = actor_input.get('maxDepth', 1) - # Exit if no start URLs are provided. if not start_urls: - Actor.log.info('No start URLs specified in actor input, exiting...') + Actor.log.info('No start URLs specified in Actor input, exiting...') await Actor.exit() - # Open the default request queue for handling URLs to be processed. - request_queue = await Actor.open_request_queue() + # Selenium proxies at the browser level, so one URL is shared per run. + proxy_configuration = await Actor.create_proxy_configuration() - # Enqueue the start URLs with an initial crawl depth of 0. + # Open the request queue and enqueue the start URLs (crawl depth 0). + request_queue = await Actor.open_request_queue() for start_url in start_urls: url = start_url.get('url') - Actor.log.info(f'Enqueuing {url} ...') - new_request = Request.from_url(url, user_data={'depth': 0}) - await request_queue.add_request(new_request) - - # Launch a new Selenium Chrome WebDriver and configure it. - Actor.log.info('Launching Chrome WebDriver...') - chrome_options = ChromeOptions() + Actor.log.info(f'Enqueuing start URL: {url}') + await request_queue.add_request(Request.from_url(url)) - if Actor.configuration.headless: - chrome_options.add_argument('--headless') + # Cap the crawl; raise or remove to follow more pages. + max_requests = 50 + handled_requests = 0 - chrome_options.add_argument('--no-sandbox') - chrome_options.add_argument('--disable-dev-shm-usage') - driver = webdriver.Chrome(options=chrome_options) + # Fresh proxy URL for the run (None if no proxy). + proxy_url = None + if proxy_configuration: + proxy_url = await proxy_configuration.new_url() - # Test WebDriver setup by navigating to an example page. 
- driver.get('http://www.example.com') - if driver.title != 'Example Domain': - raise ValueError('Failed to open example page.') + Actor.log.info('Launching Chrome WebDriver...') + driver = build_chrome_driver(proxy_url) - # Process the URLs from the request queue. - while request := await request_queue.fetch_next_request(): + while handled_requests < max_requests and ( + request := await request_queue.fetch_next_request() + ): + handled_requests += 1 url = request.url - - if not isinstance(request.user_data['depth'], (str, int)): - raise TypeError('Request.depth is an unexpected type.') - - depth = int(request.user_data['depth']) + depth = request.crawl_depth Actor.log.info(f'Scraping {url} (depth={depth}) ...') try: - # Navigate to the URL using Selenium WebDriver. Use asyncio.to_thread - # for non-blocking execution. - await asyncio.to_thread(driver.get, url) - - # If the current depth is less than max_depth, find nested links - # and enqueue them. - if depth < max_depth: - for link in driver.find_elements(By.TAG_NAME, 'a'): - link_href = link.get_attribute('href') - link_url = urljoin(url, link_href) - - if link_url.startswith(('http://', 'https://')): - Actor.log.info(f'Enqueuing {link_url} ...') - new_request = Request.from_url( - link_url, - user_data={'depth': depth + 1}, - ) - await request_queue.add_request(new_request) - - # Extract the desired data. - data = { - 'url': url, - 'title': driver.title, - } - - # Store the extracted data to the default dataset. + # Blocking WebDriver calls run in a worker thread. + data, links = await asyncio.to_thread(scrape_page, driver, url) await Actor.push_data(data) + Actor.log.info( + f'Stored data from {url} ' + f'(title={data["title"]!r}, {len(links)} links found).' + ) + await enqueue_links( + request_queue, links, depth=depth, max_depth=max_depth + ) except Exception: Actor.log.exception(f'Cannot extract data from {url}.') finally: - # Mark the request as handled to ensure it is not processed again. 
await request_queue.mark_request_as_handled(request) driver.quit() diff --git a/docs/03_guides/code/05_crawlee_beautifulsoup.py b/docs/03_guides/code/05_crawlee_beautifulsoup.py index 4d3a81d7..d3767109 100644 --- a/docs/03_guides/code/05_crawlee_beautifulsoup.py +++ b/docs/03_guides/code/05_crawlee_beautifulsoup.py @@ -1,22 +1,19 @@ import asyncio from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext +from crawlee.router import Router from apify import Actor -# Create a crawler. -crawler = BeautifulSoupCrawler( - # Limit the crawl to max requests. Remove or increase it for crawling all links. - max_requests_per_crawl=50, -) +# Define the router up front; the crawler is created later in `main`. +router = Router[BeautifulSoupCrawlingContext]() -# Define a request handler, which will be called for every request. -@crawler.router.default_handler +# Handler called for every request. +@router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: - Actor.log.info(f'Scraping {context.request.url}...') + Actor.log.info(f'Scraping {context.request.url} ...') - # Extract the desired data. data = { 'url': context.request.url, 'title': context.soup.title.string if context.soup.title else None, @@ -25,29 +22,38 @@ async def request_handler(context: BeautifulSoupCrawlingContext) -> None: 'h3s': [h3.text for h3 in context.soup.find_all('h3')], } - # Store the extracted data to the default dataset. await context.push_data(data) + Actor.log.info(f'Stored data from {context.request.url} (title={data["title"]!r}).') - # Enqueue additional links found on the current page. + # Enqueue links found on the page. await context.enqueue_links(strategy='same-domain') async def main() -> None: - # Enter the context of the Actor. async with Actor: - # Retrieve the Actor input, and use default values if not provided. + # Read the Actor input. 
actor_input = await Actor.get_input() or {} start_urls = [ url.get('url') - for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}]) + for url in actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}]) ] - # Exit if no start URLs are provided. if not start_urls: Actor.log.info('No start URLs specified in Actor input, exiting...') await Actor.exit() - # Run the crawler with the starting requests. + # Crawlee rotates the proxy URL per request on its own. + proxy_configuration = await Actor.create_proxy_configuration() + if proxy_configuration is None: + raise RuntimeError('Failed to create the proxy configuration.') + + crawler = BeautifulSoupCrawler( + proxy_configuration=proxy_configuration, + request_handler=router, + # Cap the crawl; remove or increase to follow all links. + max_requests_per_crawl=50, + ) + await crawler.run(start_urls) diff --git a/docs/03_guides/code/05_crawlee_parsel.py b/docs/03_guides/code/05_crawlee_parsel.py index 31f39d8b..32723b00 100644 --- a/docs/03_guides/code/05_crawlee_parsel.py +++ b/docs/03_guides/code/05_crawlee_parsel.py @@ -1,22 +1,19 @@ import asyncio from crawlee.crawlers import ParselCrawler, ParselCrawlingContext +from crawlee.router import Router from apify import Actor -# Create a crawler. -crawler = ParselCrawler( - # Limit the crawl to max requests. Remove or increase it for crawling all links. - max_requests_per_crawl=50, -) +# Define the router up front; the crawler is created later in `main`. +router = Router[ParselCrawlingContext]() -# Define a request handler, which will be called for every request. -@crawler.router.default_handler +# Handler called for every request. +@router.default_handler async def request_handler(context: ParselCrawlingContext) -> None: - Actor.log.info(f'Scraping {context.request.url}...') + Actor.log.info(f'Scraping {context.request.url} ...') - # Extract the desired data. 
data = { 'url': context.request.url, 'title': context.selector.xpath('//title/text()').get(), @@ -25,29 +22,38 @@ async def request_handler(context: ParselCrawlingContext) -> None: 'h3s': context.selector.xpath('//h3/text()').getall(), } - # Store the extracted data to the default dataset. await context.push_data(data) + Actor.log.info(f'Stored data from {context.request.url} (title={data["title"]!r}).') - # Enqueue additional links found on the current page. + # Enqueue links found on the page. await context.enqueue_links(strategy='same-domain') async def main() -> None: - # Enter the context of the Actor. async with Actor: - # Retrieve the Actor input, and use default values if not provided. + # Read the Actor input. actor_input = await Actor.get_input() or {} start_urls = [ url.get('url') - for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}]) + for url in actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}]) ] - # Exit if no start URLs are provided. if not start_urls: Actor.log.info('No start URLs specified in Actor input, exiting...') await Actor.exit() - # Run the crawler with the starting requests. + # Crawlee rotates the proxy URL per request on its own. + proxy_configuration = await Actor.create_proxy_configuration() + if proxy_configuration is None: + raise RuntimeError('Failed to create the proxy configuration.') + + crawler = ParselCrawler( + proxy_configuration=proxy_configuration, + request_handler=router, + # Cap the crawl; remove or increase to follow all links. 
+ max_requests_per_crawl=50, + ) + await crawler.run(start_urls) diff --git a/docs/03_guides/code/05_crawlee_playwright.py b/docs/03_guides/code/05_crawlee_playwright.py index be4ea29e..56337a31 100644 --- a/docs/03_guides/code/05_crawlee_playwright.py +++ b/docs/03_guides/code/05_crawlee_playwright.py @@ -1,25 +1,19 @@ import asyncio from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext +from crawlee.router import Router from apify import Actor -# Create a crawler. -crawler = PlaywrightCrawler( - # Limit the crawl to max requests. Remove or increase it for crawling all links. - max_requests_per_crawl=50, - # Run the browser in a headless mode. - headless=True, - browser_launch_options={'args': ['--disable-gpu']}, -) +# Define the router up front; the crawler is created later in `main`. +router = Router[PlaywrightCrawlingContext]() -# Define a request handler, which will be called for every request. -@crawler.router.default_handler +# Handler called for every request. +@router.default_handler async def request_handler(context: PlaywrightCrawlingContext) -> None: - Actor.log.info(f'Scraping {context.request.url}...') + Actor.log.info(f'Scraping {context.request.url} ...') - # Extract the desired data. data = { 'url': context.request.url, 'title': await context.page.title(), @@ -28,29 +22,43 @@ async def request_handler(context: PlaywrightCrawlingContext) -> None: 'h3s': [await h3.text_content() for h3 in await context.page.locator('h3').all()], } - # Store the extracted data to the default dataset. await context.push_data(data) + Actor.log.info(f'Stored data from {context.request.url} (title={data["title"]!r}).') - # Enqueue additional links found on the current page. + # Enqueue links found on the page. await context.enqueue_links(strategy='same-domain') async def main() -> None: - # Enter the context of the Actor. async with Actor: - # Retrieve the Actor input, and use default values if not provided. + # Read the Actor input. 
actor_input = await Actor.get_input() or {} start_urls = [ url.get('url') - for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}]) + for url in actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}]) ] - # Exit if no start URLs are provided. if not start_urls: Actor.log.info('No start URLs specified in Actor input, exiting...') await Actor.exit() - # Run the crawler with the starting requests. + # Crawlee rotates the proxy URL per request on its own. + proxy_configuration = await Actor.create_proxy_configuration() + if proxy_configuration is None: + raise RuntimeError('Failed to create the proxy configuration.') + + # Common Chrome flags for running the browser in a container. + browser_args = ['--no-sandbox', '--disable-dev-shm-usage', '--disable-gpu'] + + crawler = PlaywrightCrawler( + proxy_configuration=proxy_configuration, + request_handler=router, + # Cap the crawl; remove or increase to follow all links. + max_requests_per_crawl=50, + headless=True, + browser_launch_options={'args': browser_args}, + ) + await crawler.run(start_urls) diff --git a/docs/03_guides/code/07_scrapling.py b/docs/03_guides/code/07_scrapling.py new file mode 100644 index 00000000..49aab31b --- /dev/null +++ b/docs/03_guides/code/07_scrapling.py @@ -0,0 +1,122 @@ +import asyncio +from typing import Any +from urllib.parse import urlsplit + +from scrapling.fetchers import AsyncFetcher + +from apify import Actor, Request +from apify.storages import RequestQueue + + +async def scrape_page( + url: str, + *, + proxy_url: str | None = None, +) -> tuple[dict[str, Any], list[str]]: + """Fetch a page with Scrapling's HTTP fetcher and return data and links.""" + # `impersonate` and `stealthy_headers` make the request look like Chrome. 
+ response = await AsyncFetcher.get( + url, + proxy=proxy_url, + impersonate='chrome', + stealthy_headers=True, + timeout=60, + ) + + data = { + 'url': url, + 'title': response.css('title::text').get(), + 'h1s': response.css('h1::text').getall(), + 'h2s': response.css('h2::text').getall(), + 'h3s': response.css('h3::text').getall(), + } + + # Keep only absolute links on the same host. + links: list[str] = [] + host = urlsplit(url).netloc + for href in response.css('a::attr(href)').getall(): + link_url = response.urljoin(href) + if not link_url.startswith(('http://', 'https://')): + continue + if urlsplit(link_url).netloc == host: + links.append(link_url) + + return data, links + + +async def enqueue_links( + request_queue: RequestQueue, + links: list[str], + *, + depth: int, + max_depth: int, +) -> None: + """Enqueue the links one level deeper, unless max_depth was reached.""" + if depth >= max_depth: + return + + for link_url in links: + Actor.log.info(f'Enqueuing {link_url} ...') + request = Request.from_url(link_url) + request.crawl_depth = depth + 1 + await request_queue.add_request(request) + + +async def main() -> None: + async with Actor: + # Read the Actor input. + actor_input = await Actor.get_input() or {} + start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}]) + max_depth = actor_input.get('maxDepth', 1) + + if not start_urls: + Actor.log.info('No start URLs specified in Actor input, exiting...') + await Actor.exit() + + # Set up Apify Proxy and the request queue. + proxy_configuration = await Actor.create_proxy_configuration() + request_queue = await Actor.open_request_queue() + + # Enqueue the start URLs (crawl depth defaults to 0). + for start_url in start_urls: + url = start_url.get('url') + Actor.log.info(f'Enqueuing start URL: {url}') + await request_queue.add_request(Request.from_url(url)) + + # Cap the crawl; raise or remove to follow more pages. 
+ max_requests = 50 + handled_requests = 0 + + while handled_requests < max_requests and ( + request := await request_queue.fetch_next_request() + ): + handled_requests += 1 + url = request.url + depth = request.crawl_depth + Actor.log.info(f'Scraping {url} (depth={depth}) ...') + + try: + # Fresh proxy URL per request (None if no proxy). + proxy_url = None + if proxy_configuration: + proxy_url = await proxy_configuration.new_url() + + data, links = await scrape_page(url, proxy_url=proxy_url) + await Actor.push_data(data) + Actor.log.info( + f'Stored data from {url} ' + f'(title={data["title"]!r}, {len(links)} links found).' + ) + await enqueue_links( + request_queue, links, depth=depth, max_depth=max_depth + ) + + except Exception: + Actor.log.exception(f'Cannot extract data from {url}.') + + finally: + await request_queue.mark_request_as_handled(request) + + +if __name__ == '__main__': + asyncio.run(main()) diff --git a/docs/03_guides/code/07_scrapling_browser.py b/docs/03_guides/code/07_scrapling_browser.py new file mode 100644 index 00000000..3eb50e24 --- /dev/null +++ b/docs/03_guides/code/07_scrapling_browser.py @@ -0,0 +1,35 @@ +from typing import Any + +from scrapling.fetchers import DynamicFetcher + + +async def scrape_page( + url: str, + *, + proxy_url: str | None = None, +) -> tuple[dict[str, Any], list[str]]: + """Fetch a page in a real browser with Scrapling and return data and links.""" + # `network_idle` waits until the page stops making network requests. + response = await DynamicFetcher.async_fetch( + url, + proxy=proxy_url, + headless=True, + network_idle=True, + ) + + data = { + 'url': url, + 'title': response.css('title::text').get(), + 'h1s': response.css('h1::text').getall(), + 'h2s': response.css('h2::text').getall(), + 'h3s': response.css('h3::text').getall(), + } + + # Collect absolute links from the page. 
+ links: list[str] = [] + for href in response.css('a::attr(href)').getall(): + link_url = response.urljoin(href) + if link_url.startswith(('http://', 'https://')): + links.append(link_url) + + return data, links diff --git a/docs/03_guides/code/08_crawl4ai.py b/docs/03_guides/code/08_crawl4ai.py new file mode 100644 index 00000000..1c7884c1 --- /dev/null +++ b/docs/03_guides/code/08_crawl4ai.py @@ -0,0 +1,124 @@ +import asyncio +from typing import Any + +from crawl4ai import ( + AsyncWebCrawler, + BrowserConfig, + CacheMode, + CrawlerRunConfig, + ProxyConfig, +) + +from apify import Actor, Request +from apify.storages import RequestQueue + + +async def scrape_page( + crawler: AsyncWebCrawler, + url: str, + *, + proxy_url: str | None = None, +) -> tuple[dict[str, Any], list[str]]: + """Crawl a page with Crawl4AI and return its markdown and same-site links.""" + run_config = CrawlerRunConfig( + cache_mode=CacheMode.BYPASS, + proxy_config=ProxyConfig.from_string(proxy_url) if proxy_url else None, + ) + + result = await crawler.arun(url, config=run_config) + if not result.success: + raise RuntimeError(result.error_message or f'Failed to crawl {url}') + + data = { + 'url': result.url, + 'title': (result.metadata or {}).get('title'), + 'markdown': str(result.markdown), + } + + # Crawl4AI already classifies links; follow only the internal ones. 
+ internal_links = result.links.get('internal', []) + links = [link['href'] for link in internal_links if link.get('href')] + + return data, links + + +async def enqueue_links( + request_queue: RequestQueue, + links: list[str], + *, + depth: int, + max_depth: int, +) -> None: + """Enqueue the links one level deeper, unless max_depth was reached.""" + if depth >= max_depth: + return + + for link_url in links: + Actor.log.info(f'Enqueuing {link_url} ...') + request = Request.from_url(link_url) + request.crawl_depth = depth + 1 + await request_queue.add_request(request) + + +async def main() -> None: + async with Actor: + # Read the Actor input. + actor_input = await Actor.get_input() or {} + start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}]) + max_depth = actor_input.get('maxDepth', 1) + + if not start_urls: + Actor.log.info('No start URLs specified in Actor input, exiting...') + await Actor.exit() + + # Set up Apify Proxy and the request queue. + proxy_configuration = await Actor.create_proxy_configuration() + request_queue = await Actor.open_request_queue() + + # Enqueue the start URLs (crawl depth defaults to 0). + for start_url in start_urls: + url = start_url.get('url') + Actor.log.info(f'Enqueuing start URL: {url}') + await request_queue.add_request(Request.from_url(url)) + + # Cap the crawl; raise or remove to follow more pages. + max_requests = 50 + handled_requests = 0 + + # Reuse one headless browser-backed crawler for every request. + browser_config = BrowserConfig(headless=True) + + async with AsyncWebCrawler(config=browser_config) as crawler: + while handled_requests < max_requests and ( + request := await request_queue.fetch_next_request() + ): + handled_requests += 1 + url = request.url + depth = request.crawl_depth + Actor.log.info(f'Scraping {url} (depth={depth}) ...') + + try: + # Fresh proxy URL per request (None if no proxy). 
+ proxy_url = None + if proxy_configuration: + proxy_url = await proxy_configuration.new_url() + + data, links = await scrape_page(crawler, url, proxy_url=proxy_url) + await Actor.push_data(data) + Actor.log.info( + f'Stored data from {url} ' + f'(title={data["title"]!r}, {len(links)} links found).' + ) + await enqueue_links( + request_queue, links, depth=depth, max_depth=max_depth + ) + + except Exception: + Actor.log.exception(f'Cannot extract data from {url}.') + + finally: + await request_queue.mark_request_as_handled(request) + + +if __name__ == '__main__': + asyncio.run(main()) diff --git a/docs/03_guides/code/09_browser_use.py b/docs/03_guides/code/09_browser_use.py new file mode 100644 index 00000000..cd16773f --- /dev/null +++ b/docs/03_guides/code/09_browser_use.py @@ -0,0 +1,113 @@ +import asyncio +import os +from urllib.parse import urlsplit + +from browser_use import Agent, Browser, ChatOpenAI +from browser_use.browser import ProxySettings +from pydantic import BaseModel + +from apify import Actor + +# Default task, aligned with the `Posts` schema below. +DEFAULT_TASK = ( + 'Open https://news.ycombinator.com and return the title and URL ' + 'of the top 5 posts on the front page.' 
+) + + +class Post(BaseModel): + """A single item the agent is asked to extract.""" + + title: str + url: str + + +class Posts(BaseModel): + """The structured result returned by the agent.""" + + posts: list[Post] + + +def to_browser_use_proxy(proxy_url: str) -> ProxySettings: + """Convert an Apify Proxy URL into Browser Use `ProxySettings`.""" + parts = urlsplit(proxy_url) + return ProxySettings( + server=f'{parts.scheme}://{parts.hostname}:{parts.port}', + username=parts.username, + password=parts.password, + ) + + +async def run_agent_task( + task: str, + *, + model: str, + llm_api_key: str, + max_steps: int, + headless: bool = True, + proxy_url: str | None = None, +) -> Posts | None: + """Run a Browser Use agent for one task and return its structured output.""" + # Configure the LLM. Swap `ChatOpenAI` for another provider if needed. + llm = ChatOpenAI(model=model, api_key=llm_api_key) + + # Configure the browser, optionally routed through a proxy. + browser = Browser( + headless=headless, + proxy=to_browser_use_proxy(proxy_url) if proxy_url else None, + ) + + # `output_model_schema` returns a validated `Posts`; signals stay with the Actor. + agent = Agent( + task=task, + llm=llm, + browser=browser, + output_model_schema=Posts, + enable_signal_handler=False, + ) + + history = await agent.run(max_steps=max_steps) + return history.structured_output + + +async def main() -> None: + async with Actor: + # Read the Actor input. + actor_input = await Actor.get_input() or {} + task = actor_input.get('task', DEFAULT_TASK) + model = actor_input.get('model', 'gpt-4.1-mini') + max_steps = actor_input.get('maxSteps', 25) + + # Read the LLM API key from the environment (set it as a secret on Apify). + llm_api_key = os.environ.get('OPENAI_API_KEY') + if not llm_api_key: + raise RuntimeError('The OPENAI_API_KEY environment variable is not set.') + + # Route the browser through Apify Proxy. 
+ proxy_configuration = await Actor.create_proxy_configuration() + proxy_url = await proxy_configuration.new_url() if proxy_configuration else None + + Actor.log.info(f'Running the agent (model={model}) for task: {task}') + + result = await run_agent_task( + task, + model=model, + llm_api_key=llm_api_key, + max_steps=max_steps, + headless=Actor.configuration.headless, + proxy_url=proxy_url, + ) + + if result is None: + Actor.log.warning('The agent did not return any structured output.') + return + + # Store each extracted item as a dataset row. + Actor.log.info(f'The agent returned {len(result.posts)} post(s); storing them.') + for post in result.posts: + Actor.log.info(f'Storing post: {post.title!r} ({post.url})') + await Actor.push_data(post.model_dump()) + + +if __name__ == '__main__': + asyncio.run(main()) diff --git a/docs/03_guides/code/11_http_url.py b/docs/03_guides/code/11_http_url.py new file mode 100644 index 00000000..80bf1f19 --- /dev/null +++ b/docs/03_guides/code/11_http_url.py @@ -0,0 +1,5 @@ +from pydantic import BaseModel, HttpUrl + + +class ActorInput(BaseModel): + target_url: HttpUrl diff --git a/docs/03_guides/code/11_model_validator.py b/docs/03_guides/code/11_model_validator.py new file mode 100644 index 00000000..29c4c98e --- /dev/null +++ b/docs/03_guides/code/11_model_validator.py @@ -0,0 +1,14 @@ +from typing import Self + +from pydantic import BaseModel, model_validator + + +class ActorInput(BaseModel): + min_price: int = 0 + max_price: int = 100 + + @model_validator(mode='after') + def _check_range(self) -> Self: + if self.min_price > self.max_price: + raise ValueError('min_price must not exceed max_price') + return self diff --git a/docs/03_guides/code/11_pydantic.py b/docs/03_guides/code/11_pydantic.py new file mode 100644 index 00000000..7ce35f88 --- /dev/null +++ b/docs/03_guides/code/11_pydantic.py @@ -0,0 +1,59 @@ +import asyncio +from typing import Literal + +from pydantic import BaseModel, ConfigDict, Field, ValidationError, 
field_validator + +from apify import Actor + + +class ActorInput(BaseModel): + """Typed and validated representation of the Actor input.""" + + # Accept both snake_case and the input schema's camelCase; ignore extras. + model_config = ConfigDict(populate_by_name=True, extra='ignore') + + # Required: non-empty list of search terms (normalized below). + search_terms: list[str] = Field(alias='searchTerms', min_length=1) + + # Optional: 1-100, defaults to 10. + max_results: int = Field(alias='maxResults', default=10, ge=1, le=100) + + # Optional: restricted to a fixed set of choices. + output_format: Literal['json', 'csv'] = Field(alias='outputFormat', default='json') + + @field_validator('search_terms') + @classmethod + def _normalize_terms(cls, value: list[str]) -> list[str]: + # Trim whitespace and drop empty terms. + cleaned = [term.strip() for term in value if term.strip()] + if not cleaned: + raise ValueError('searchTerms must contain at least one non-empty term') + return cleaned + + +async def main() -> None: + async with Actor: + # Read the raw input (a plain dict, not yet validated). + raw_input = await Actor.get_input() or {} + + # Validate the raw input against the model. + try: + actor_input = ActorInput.model_validate(raw_input) + except ValidationError as exc: + # Log a per-field summary, then re-raise to fail the run. + Actor.log.error('The Actor input is invalid:\n%s', exc) + raise + + # Work with typed attributes from here on. + Actor.log.info('Input passed validation: %s', actor_input.model_dump()) + + max_results = actor_input.max_results + for term in actor_input.search_terms: + Actor.log.info('Processing %r (max %d results)', term, max_results) + + # Store the normalized input as output. 
+ await Actor.set_value('OUTPUT', actor_input.model_dump()) + + +if __name__ == '__main__': + asyncio.run(main()) diff --git a/docs/03_guides/code/11_raw_input.py b/docs/03_guides/code/11_raw_input.py new file mode 100644 index 00000000..29c313e5 --- /dev/null +++ b/docs/03_guides/code/11_raw_input.py @@ -0,0 +1,18 @@ +import asyncio + +from apify import Actor + + +async def main() -> None: + # Enter the context of the Actor. + async with Actor: + # Read the input and reach into the raw dict. + actor_input = await Actor.get_input() or {} + search_terms = actor_input.get('searchTerms', []) + max_results = actor_input.get('maxResults', 10) + + Actor.log.info('search_terms=%s, max_results=%s', search_terms, max_results) + + +if __name__ == '__main__': + asyncio.run(main()) diff --git a/docs/03_guides/code/07_webserver.py b/docs/03_guides/code/12_webserver.py similarity index 87% rename from docs/03_guides/code/07_webserver.py rename to docs/03_guides/code/12_webserver.py index 66ecfe3c..1cb23c1f 100644 --- a/docs/03_guides/code/07_webserver.py +++ b/docs/03_guides/code/12_webserver.py @@ -10,7 +10,7 @@ class RequestHandler(BaseHTTPRequestHandler): """A handler that prints the number of processed items on every GET request.""" - def do_get(self) -> None: + def do_GET(self) -> None: self.log_request() self.send_response(200) self.end_headers() @@ -18,7 +18,7 @@ def do_get(self) -> None: def run_server() -> None: - """Start the HTTP server on the provided port, and save a reference to the server.""" + """Start the HTTP server and keep a reference to it.""" global http_server with ThreadingHTTPServer( ('', Actor.configuration.web_server_port), RequestHandler @@ -43,7 +43,7 @@ async def main() -> None: if http_server is None: raise RuntimeError('HTTP server not started') - # Signal the HTTP server to shut down, and wait for it to finish. + # Signal the server to shut down and wait. 
http_server.shutdown() await run_server_task diff --git a/docs/03_guides/code/12_webserver_fastapi.py b/docs/03_guides/code/12_webserver_fastapi.py new file mode 100644 index 00000000..08768eb0 --- /dev/null +++ b/docs/03_guides/code/12_webserver_fastapi.py @@ -0,0 +1,48 @@ +import asyncio + +import uvicorn +from fastapi import FastAPI + +from apify import Actor + +# Counter the server reports and the Actor updates. +processed_items = 0 + +# FastAPI app with a single endpoint. +app = FastAPI() + + +@app.get('/') +async def index() -> dict[str, int]: + """Respond to every GET request with the number of processed items.""" + return {'processed_items': processed_items} + + +async def main() -> None: + global processed_items + async with Actor: + # Serve the app on the platform's web server port; 0.0.0.0 exposes it. + config = uvicorn.Config( + app, + host='0.0.0.0', # noqa: S104 + port=Actor.configuration.web_server_port, + ) + server = uvicorn.Server(config) + + # Run the server in the background. + server_task = asyncio.create_task(server.serve()) + Actor.log.info(f'Server running at {Actor.configuration.web_server_url}') + + # Simulate work, updating the reported counter. + for _ in range(100): + await asyncio.sleep(1) + processed_items += 1 + Actor.log.info(f'Processed items: {processed_items}') + + # Signal the server to shut down and wait. + server.should_exit = True + await server_task + + +if __name__ == '__main__': + asyncio.run(main()) diff --git a/docs/03_guides/code/scrapy_project/src/__main__.py b/docs/03_guides/code/scrapy_project/src/__main__.py index 807447c9..f9b27ed5 100644 --- a/docs/03_guides/code/scrapy_project/src/__main__.py +++ b/docs/03_guides/code/scrapy_project/src/__main__.py @@ -7,7 +7,7 @@ # Import your main Actor coroutine here. from .main import main -# Ensure the location to the Scrapy settings module is defined. +# Point Scrapy at the settings module. 
os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings' diff --git a/docs/03_guides/code/scrapy_project/src/main.py b/docs/03_guides/code/scrapy_project/src/main.py index d8b67984..b234b171 100644 --- a/docs/03_guides/code/scrapy_project/src/main.py +++ b/docs/03_guides/code/scrapy_project/src/main.py @@ -14,16 +14,16 @@ async def main() -> None: """Apify Actor main coroutine for executing the Scrapy spider.""" async with Actor: - # Retrieve and process Actor input. + # Read the Actor input. actor_input = await Actor.get_input() or {} start_urls = [url['url'] for url in actor_input.get('startUrls', [])] allowed_domains = actor_input.get('allowedDomains') proxy_config = actor_input.get('proxyConfiguration') - # Apply Apify settings, which will override the Scrapy project settings. + # Apply Apify settings (override the Scrapy project settings). settings = apply_apify_settings(proxy_config=proxy_config) - # Create AsyncCrawlerRunner and execute the Scrapy spider. + # Run the Scrapy spider. crawler_runner = AsyncCrawlerRunner(settings) await crawler_runner.crawl( Spider, diff --git a/docs/03_guides/code/scrapy_project/src/settings.py b/docs/03_guides/code/scrapy_project/src/settings.py index 5c0e56e3..67ae1a03 100644 --- a/docs/03_guides/code/scrapy_project/src/settings.py +++ b/docs/03_guides/code/scrapy_project/src/settings.py @@ -5,7 +5,7 @@ ROBOTSTXT_OBEY = True SPIDER_MODULES = ['src.spiders'] TELNETCONSOLE_ENABLED = False -# Do not change the Twisted reactor unless you really know what you are doing. +# Don't change the Twisted reactor unless you know what you're doing. 
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor' HTTPCACHE_ENABLED = True HTTPCACHE_EXPIRATION_SECS = 7200 diff --git a/docs/03_guides/code/scrapy_project/src/spiders/title.py b/docs/03_guides/code/scrapy_project/src/spiders/title.py index 7223a53d..8111ee31 100644 --- a/docs/03_guides/code/scrapy_project/src/spiders/title.py +++ b/docs/03_guides/code/scrapy_project/src/spiders/title.py @@ -14,11 +14,7 @@ class TitleSpider(Spider): - """A spider that scrapes web pages to extract titles and discover new links. - - This spider retrieves the content of the <title> element from each page and queues - any valid hyperlinks for further crawling. - """ + """A spider that extracts page titles and queues links for further crawling.""" name = 'title_spider' @@ -32,36 +28,21 @@ def __init__( *args: Any, **kwargs: Any, ) -> None: - """A default constructor. - - Args: - start_urls: URLs to start the scraping from. - allowed_domains: Domains that the scraper is allowed to crawl. - *args: Additional positional arguments. - **kwargs: Additional keyword arguments. - """ + """Store the start URLs and allowed domains.""" super().__init__(*args, **kwargs) self.start_urls = start_urls self.allowed_domains = allowed_domains def parse(self, response: Response) -> Generator[TitleItem | Request, None, None]: - """Parse the web page response. - - Args: - response: The web page response. - - Yields: - Yields scraped `TitleItem` and new `Request` objects for links. - """ + """Yield a `TitleItem` and a `Request` for each link on the page.""" self.logger.info('TitleSpider is parsing %s...', response) - # Extract and yield the TitleItem + # Yield the title item. url = response.url title = response.css('title::text').extract_first() yield TitleItem(url=url, title=title) - # Extract all links from the page, create `Request` objects out of them, - # and yield them. + # Yield a request for each link.
for link_href in response.css('a::attr("href")'): link_url = urljoin(response.url, link_href.get()) if link_url.startswith(('http://', 'https://')): diff --git a/docs/03_guides/code/uv_project/Dockerfile b/docs/03_guides/code/uv_project/Dockerfile new file mode 100644 index 00000000..24e7a44b --- /dev/null +++ b/docs/03_guides/code/uv_project/Dockerfile @@ -0,0 +1,38 @@ +# syntax=docker/dockerfile:1 +# First, specify the base Docker image. +# You can see the Docker images from Apify at https://hub.docker.com/r/apify/. +# You can also use any other image from Docker Hub. +FROM apify/actor-python:3.14 + +# Add the uv binary from its official distroless image (pinned to the 0.11.x line). +COPY --from=ghcr.io/astral-sh/uv:0.11 /uv /uvx /bin/ + +# Configure uv for container builds: +# - compile installed packages to bytecode, so the Actor starts faster, +# - copy packages instead of hardlinking, which avoids warnings with the cache mount, +# - never download a managed Python, always reuse the base image's interpreter, +# - put the project virtual environment first on PATH, so `python` resolves to it. +ENV UV_COMPILE_BYTECODE=1 \ + UV_LINK_MODE=copy \ + UV_PYTHON_DOWNLOADS=0 \ + PATH="/usr/src/app/.venv/bin:$PATH" + +# Install dependencies into the project virtual environment (.venv) as a separate +# layer. The cache mount speeds up repeated builds, and the bind mounts make the +# project metadata available without copying it into the image. This layer is +# rebuilt only when uv.lock or pyproject.toml change - not on source code edits. +RUN --mount=type=cache,target=/root/.cache/uv \ + --mount=type=bind,source=uv.lock,target=uv.lock \ + --mount=type=bind,source=pyproject.toml,target=pyproject.toml \ + uv sync --locked --no-dev + +# Next, copy the remaining files and directories with the source code. +# Since we do this after installing the dependencies, quick rebuilds will be +# really fast for most source file changes. +COPY . 
./ + +# Use compileall to ensure the runnability of the Actor Python code. +RUN python -m compileall -q my_actor/ + +# Specify how to launch the source code of your Actor. +CMD ["python", "-m", "my_actor"] diff --git a/docs/03_guides/code/uv_project/my_actor/__init__.py b/docs/03_guides/code/uv_project/my_actor/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/docs/03_guides/code/uv_project/my_actor/__main__.py b/docs/03_guides/code/uv_project/my_actor/__main__.py new file mode 100644 index 00000000..8c4ab0b8 --- /dev/null +++ b/docs/03_guides/code/uv_project/my_actor/__main__.py @@ -0,0 +1,6 @@ +import asyncio + +from .main import main + +if __name__ == '__main__': + asyncio.run(main()) diff --git a/docs/03_guides/code/uv_project/my_actor/main.py b/docs/03_guides/code/uv_project/my_actor/main.py new file mode 100644 index 00000000..10e88e19 --- /dev/null +++ b/docs/03_guides/code/uv_project/my_actor/main.py @@ -0,0 +1,8 @@ +from apify import Actor + + +async def main() -> None: + async with Actor: + actor_input = await Actor.get_input() or {} + Actor.log.info('Actor input: %s', actor_input) + await Actor.set_value('OUTPUT', 'Hello from a uv-managed Actor!') diff --git a/docs/03_guides/code/uv_project/pyproject.toml b/docs/03_guides/code/uv_project/pyproject.toml new file mode 100644 index 00000000..7500a440 --- /dev/null +++ b/docs/03_guides/code/uv_project/pyproject.toml @@ -0,0 +1,8 @@ +[project] +name = "my-actor" +version = "0.1.0" +description = "An Apify Actor with dependencies managed by uv." +requires-python = ">=3.14" +dependencies = [ + "apify>=3.0.0,<4.0.0", +]