Updated crawl4ai docs #1787

Merged 1 commit on Mar 11, 2025
docs/guides/python/python-crawl4ai.mdx (52 changes: 50 additions & 2 deletions)

@@ -22,9 +22,26 @@ This demo showcases how to use Trigger.dev with Python to build a web crawler th
- Our [Python build extension](/config/extensions/pythonExtension) to install the dependencies and run the Python script
- [Crawl4AI](https://github.com/unclecode/crawl4ai), an open source LLM friendly web crawler
- A custom [Playwright extension](https://playwright.dev/) to create a headless chromium browser
+- Proxy support

+## Using Proxies
+
+<ScrapingWarning />
+
+Some popular proxy services are:
+
+- [Smartproxy](https://smartproxy.com/)
+- [Bright Data](https://brightdata.com/)
+- [Browserbase](https://browserbase.com/)
+- [Oxylabs](https://oxylabs.io/)
+- [ScrapingBee](https://scrapingbee.com/)
+
+Once you have chosen a proxy service, set the following environment variables in your local `.env` file and add them to your project in the Trigger.dev dashboard:
+
+- `PROXY_URL`: The URL of your proxy server (e.g., `http://proxy.example.com:8080`)
+- `PROXY_USERNAME`: Username for authenticated proxies (optional)
+- `PROXY_PASSWORD`: Password for authenticated proxies (optional)
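
As an illustration, a hypothetical `.env` entry using the variables above (placeholder values only):

```bash .env
PROXY_URL=http://proxy.example.com:8080
PROXY_USERNAME=your-proxy-username
PROXY_PASSWORD=your-proxy-password
```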

## GitHub repo

<Card
@@ -113,7 +130,14 @@ export const convertUrlToMarkdown = schemaTask({
    url: z.string().url(),
  }),
  run: async (payload) => {
-    const result = await python.runScript("./src/python/crawl-url.py", [payload.url]);
+    // Pass through any proxy environment variables
+    const env = {
+      PROXY_URL: process.env.PROXY_URL,
+      PROXY_USERNAME: process.env.PROXY_USERNAME,
+      PROXY_PASSWORD: process.env.PROXY_PASSWORD,
+    };
+
+    const result = await python.runScript("./src/python/crawl-url.py", [payload.url], { env });

    logger.debug("convert-url-to-markdown", {
      url: payload.url,
@@ -142,10 +166,34 @@ The Python script is a simple script using Crawl4AI that takes a URL and returns
```python src/python/crawl-url.py
import asyncio
import sys
+import os
from crawl4ai import *
+from crawl4ai.async_configs import BrowserConfig

async def main(url: str):
-    async with AsyncWebCrawler() as crawler:
+    # Get proxy configuration from environment variables
+    proxy_url = os.environ.get("PROXY_URL")
+    proxy_username = os.environ.get("PROXY_USERNAME")
+    proxy_password = os.environ.get("PROXY_PASSWORD")
+
+    # Configure the proxy
+    browser_config = None
+    if proxy_url:
+        if proxy_username and proxy_password:
+            # Use authenticated proxy
+            proxy_config = {
+                "server": proxy_url,
+                "username": proxy_username,
+                "password": proxy_password
+            }
+            browser_config = BrowserConfig(proxy_config=proxy_config)
+        else:
+            # Use simple proxy
+            browser_config = BrowserConfig(proxy=proxy_url)
+    else:
+        browser_config = BrowserConfig()
+
+    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=url,
        )
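        # The rest of the script is collapsed in the diff view above. What
        # follows is an editor's sketch of a plausible ending, not part of the
        # PR: print the markdown so the Trigger.dev task can capture it from
        # the script's stdout, and add a CLI entrypoint for the URL argument.
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main(sys.argv[1]))
```

As a usage sketch (placeholder proxy URL, assuming `crawl4ai` and its Playwright browsers are installed locally): `PROXY_URL=http://proxy.example.com:8080 python src/python/crawl-url.py https://example.com`.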