Commit 4348e3b

Added python pdf form extractor example
1 parent 1f980f6 commit 4348e3b

12 files changed

Lines changed: 2286 additions & 2 deletions

python-crawl4ai/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@
 .pnp.*
 .yarn/*
 .venv/
+venv/
 !.yarn/patches
 !.yarn/plugins
 !.yarn/releases

python-crawl4ai/README.md

Lines changed: 1 addition & 1 deletion
@@ -42,5 +42,5 @@ Once you have a proxy service, set the following environment variables in your T
 ## Relevant code
 
 - [pythonTasks.ts](./src/trigger/pythonTasks.ts) triggers the Python script and returns the result
-- [trigger.config.ts](./src/trigger/trigger.config.ts) uses the Trigger.dev Python extension to install the dependencies and run the script, as well as `installPlaywrightChromium()` to create a headless chromium browser
+- [trigger.config.ts](./trigger.config.ts) uses the Trigger.dev Python extension to install the dependencies and run the script, as well as `installPlaywrightChromium()` to create a headless chromium browser
 - [crawl-url.py](./src/python/crawl-url.py) is the main Python script that takes a URL and returns the markdown content of the page

python-crawl4ai/trigger.config.ts

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ export default defineConfig({
     extensions: [
       pythonExtension({
         requirementsFile: "./requirements.txt",
-        devPythonBinaryPath: `.venv/bin/python`,
+        devPythonBinaryPath: `venv/bin/python`,
         scripts: ["src/python/**/*.py"],
       }),
       installPlaywrightChromium(),
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+# See https://help.github.com/articles/ignoring-files/ for more about ignoring files.
+
+# dependencies
+/node_modules
+/.pnp
+.pnp.*
+.yarn/*
+.venv/
+venv/
+!.yarn/patches
+!.yarn/plugins
+!.yarn/releases
+!.yarn/versions
+
+# testing
+/coverage
+
+# next.js
+/.next/
+/out/
+
+# production
+/build
+
+# misc
+.DS_Store
+*.pem
+
+# debug
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+.pnpm-debug.log*
+
+# env files (can opt-in for committing if needed)
+.env*
+
+# vercel
+.vercel
+
+# typescript
+*.tsbuildinfo
+next-env.d.ts
+
+.trigger
+!.env.example
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# Trigger.dev + Python PDF Form Extractor Example
+
+This demo shows how to use Trigger.dev with Python to extract structured form data from a PDF file available at a URL.
+
+## Features
+
+- [Trigger.dev](https://trigger.dev) to orchestrate background tasks
+- [Trigger.dev Python build extension](https://trigger.dev/docs/config/extensions/pythonExtension) to install the dependencies and run the Python script
+- [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) to extract form data from PDF files
+- [Requests](https://docs.python-requests.org/en/master/) to download PDF files from URLs
+
+## Getting Started
+
+1. After cloning the repo, run `npm install` to install the dependencies.
+2. Create a virtual environment: `python -m venv venv`
+3. Activate the virtual environment. On Mac/Linux: `source venv/bin/activate`; on Windows: `venv\Scripts\activate`
+4. Install the Python dependencies: `pip install -r requirements.txt`
+5. Copy the project ref from your [Trigger.dev dashboard](https://cloud.trigger.dev) and add it to the `trigger.config.ts` file.
+6. Run the Trigger.dev dev CLI command with `npx trigger.dev@latest dev` (it may ask you to authorize the CLI if you haven't already).
+7. Test the task in the dashboard by providing a valid PDF URL.
+8. Deploy the task to production using the CLI command `npx trigger.dev@latest deploy`
+
+## Relevant code
+
+- [pythonPdfTask.ts](./src/trigger/pythonPdfTask.ts) triggers the Python script and returns the structured form data as JSON
+- [trigger.config.ts](./trigger.config.ts) uses the Trigger.dev Python extension to install the dependencies and run the script
+- [extract-pdf-form.py](./src/python/extract-pdf-form.py) is the main Python script that takes a URL and returns the form data from the PDF in JSON format
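
The body of `extract-pdf-form.py` does not appear in this view of the commit, so the following is only a minimal sketch of the flow the README describes: download a PDF with Requests, read its form widgets with PyMuPDF, and return JSON. The function names (`download_pdf`, `read_widgets`, `to_form_json`, `extract_form`) and the JSON shape are assumptions made for illustration, not the committed implementation.

```python
import json


def download_pdf(url: str, timeout: float = 30.0) -> bytes:
    """Fetch the PDF at `url` and return its raw bytes."""
    import requests  # third-party; imported lazily so to_form_json works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.content


def read_widgets(pdf_bytes: bytes) -> list[dict]:
    """Collect one record per form widget on every page, using PyMuPDF."""
    import fitz  # PyMuPDF; also imported lazily

    records = []
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page_number, page in enumerate(doc, start=1):
            for widget in page.widgets():
                records.append({
                    "page": page_number,
                    "name": widget.field_name,
                    "type": widget.field_type_string,
                    "value": widget.field_value,
                })
    return records


def to_form_json(records: list[dict]) -> str:
    """Key widget records by field name and serialize them (hypothetical output shape)."""
    fields = {}
    for record in records:
        # Unnamed widgets get a placeholder key so they are not silently dropped.
        name = record["name"] or f"unnamed_page_{record['page']}"
        fields[name] = {
            "type": record["type"],
            "value": record["value"],
            "page": record["page"],
        }
    return json.dumps({"field_count": len(fields), "fields": fields}, indent=2)


def extract_form(url: str) -> str:
    """End-to-end helper: URL in, JSON string out."""
    return to_form_json(read_widgets(download_pdf(url)))
```

Calling `print(extract_form(url))` with a PDF URL writes the JSON to stdout. That is one plausible way for a calling task to read the result, but how `pythonPdfTask.ts` actually consumes the script's output is not shown here.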
