Added Python crawling example #1786

Merged
merged 2 commits into from
Mar 11, 2025

Conversation

D-K-P
Member

@D-K-P D-K-P commented Mar 11, 2025

Summary by CodeRabbit

  • Documentation
    • Refined navigation formatting for improved clarity in our docs.
    • Updated the scraping guidelines to emphasize obtaining permission before content scraping.
    • Updated the URL path for the Python web crawler project.
    • Introduced a comprehensive guide for building a headless browser web crawler with Python tools.
    • Launched a new section that highlights the Python build extension for managing dependencies and executing Python code.

changeset-bot bot commented Mar 11, 2025

⚠️ No Changeset found

Latest commit: c3ef5be

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Contributor

coderabbitai bot commented Mar 11, 2025

Walkthrough

This PR updates documentation by reformatting array declarations in a JSON file to a single-line style and by adding new entries to the “Example projects” sections. It revises a warning message in the Puppeteer guide to explicitly require permission for web scraping, and introduces a comprehensive guide for a Python-based headless web crawler. Additionally, a new “Learn more about using Python with Trigger.dev” section has been added, enhancing the overall clarity and consistency of the documentation.

Changes

| File(s) | Change Summary |
|---|---|
| docs/docs.json, docs/guides/introduction.mdx | Consolidated multi-line arrays to single-line arrays in various sections and added a new example project entry (Python web crawler) in the "Example projects" group. |
| docs/guides/examples/puppeteer.mdx | Updated the warning message to emphasize that users must have permission from website owners when scraping content. |
| docs/guides/python/python-crawl4ai.mdx | New guide added detailing how to implement a headless Python web crawler using Crawl4AI, Playwright, and Trigger.dev, including build configuration and task setup. |
| docs/snippets/python-learn-more.mdx | Introduced a new section with a card component to help users learn about and use the Python build extension with Trigger.dev. |

Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant TD as Trigger.dev Task Orchestrator
    participant PS as Python Script (crawl-url.py)
    
    U->>TD: Initiate web crawling task
    TD->>PS: Execute task via python.runScript
    PS->>PS: Crawl target URL & process content
    PS-->>TD: Return crawled data (markdown format)
    TD-->>U: Deliver formatted output
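
The flow in the diagram can be approximated in plain Python. This is only a sketch of the shape of the interaction, not Trigger.dev's API: `run_crawl_task` stands in for the orchestrator step, `subprocess.run` stands in for `python.runScript`, and `CRAWL_SCRIPT` is a hypothetical placeholder for `crawl-url.py` that echoes the URL as markdown instead of actually crawling it.

```python
import subprocess
import sys

# Hypothetical stand-in for crawl-url.py. It does no real crawling:
# it only prints a small markdown document naming the URL it was given.
CRAWL_SCRIPT = """
import sys
url = sys.argv[1]
print(f"# Crawled page\\n\\nSource: {url}")
"""


def run_crawl_task(url: str) -> str:
    """Play the orchestrator's role: run the script and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", CRAWL_SCRIPT, url],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits non-zero
    )
    return completed.stdout.strip()


# The "user" initiates the task and receives markdown back.
markdown = run_crawl_task("https://example.com")
print(markdown)
```

The point of the sketch is the boundary: the task passes a URL in as an argument and treats whatever the script writes to stdout as the crawled result.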

Possibly related PRs

  • fal AI example docs #1439: Both PRs reformat array declarations in the docs navigation file (docs/docs.json here, docs/mint.json there), specifically the pages attributes, and both add an entry for a new example task.
  • Added LLM evaluator to example projects #1589: Both PRs modify the "Example projects" section. That PR added the "Batch LLM Evaluator" entry and also updated the same section in docs/introduction.mdx.

Suggested reviewers

  • matt-aitken

Poem

I'm a hopping rabbit with joyful glee,
Skimming through docs so light and free.
Arrays now dance on a single line,
New guides and warnings, all in time.
With playful hops through Python and play,
I celebrate these changes in my bunny way!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a1ea09d and c3ef5be.

📒 Files selected for processing (2)
  • docs/guides/introduction.mdx (1 hunks)
  • docs/guides/introduction.mdx (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/guides/introduction.mdx
  • docs/guides/introduction.mdx
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (javascript-typescript)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
docs/guides/examples/puppeteer.mdx (1)

208-218: Improve Punctuation in the Warning Message

The warning text now explicitly emphasizes the need for permission when scraping; however, the sentence “If you don't you'll risk getting our IP address blocked and we will ban you from our service.” would be clearer and more grammatically correct with proper punctuation. Consider inserting a comma after “don’t” and another before “and” to separate the independent clauses.

- If you don't you'll risk getting our IP address blocked and we will ban you from our service.
+ If you don't, you'll risk getting our IP address blocked, and we will ban you from our service.
🧰 Tools
🪛 LanguageTool

[uncategorized] ~208-~208: Use a comma before ‘and’ if it connects two independent clauses (unless they are closely connected and short).
Context: ...u'll risk getting our IP address blocked and we will ban you from our service. You m...

(COMMA_COMPOUND_SENTENCE)


[grammar] ~210-~210: There may be a verb agreement error, if referring to a singular entity (a list).
Context: ... owner to scrape their content.** Here are a list of proxy services we recommend: ...

(THERE_IS_ARE)

docs/guides/python/python-crawl4ai.mdx (2)

21-25: Enhance Descriptive Text in the Features Section

In the features list:

  • On the Crawl4AI item, consider using a hyphen in “LLM friendly” to form “LLM‐friendly,” which improves clarity when used as a compound adjective.
  • On the Playwright extension item, “chromium browser” should capitalize “Chromium” (since it is a proper noun referring to the browser engine).
- [Crawl4AI](https://github.com/unclecode/crawl4ai), an open source LLM friendly web crawler
+ [Crawl4AI](https://github.com/unclecode/crawl4ai), an open source LLM‐friendly web crawler
- A custom [Playwright extension](https://playwright.dev/) to create a headless chromium browser
+ A custom [Playwright extension](https://playwright.dev/) to create a headless Chromium browser
🧰 Tools
🪛 LanguageTool

[uncategorized] ~23-~23: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ps://github.com/unclecode/crawl4ai), an open source LLM friendly web crawler - A custom [Pl...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[grammar] ~24-~24: The proper noun “Chromium” (= software from Google) needs to be capitalized.
Context: ...//playwright.dev/) to create a headless chromium browser ## GitHub...

(GOOGLE_PRODUCTS)


162-168: Correct Duplicate Words in the Testing Instructions

There are duplicate words in the testing steps:

  • The phrase “and and add it” should be corrected to “and add it.”
  • Similarly, “with with” should be corrected to “with.”
- 4. If you haven't already, copy your project ref from your [Trigger.dev dashboard](https://cloud.trigger.dev) and and add it to the `trigger.config.ts` file.
+ 4. If you haven't already, copy your project ref from your [Trigger.dev dashboard](https://cloud.trigger.dev) and add it to the `trigger.config.ts` file.
- 5. Run the Trigger.dev dev CLI command with with `npx trigger dev@latest dev` (it may ask you to authorize the CLI if you haven't already).
+ 5. Run the Trigger.dev dev CLI command with `npx trigger dev@latest dev` (it may ask you to authorize the CLI if you haven't already).
🧰 Tools
🪛 LanguageTool

[duplication] ~167-~167: Possible typo: you repeated a word.
Context: ...v dashboard](https://cloud.trigger.dev) and and add it to the trigger.config.ts file....

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~168-~168: Possible typo: you repeated a word.
Context: ...ger.config.tsfile. 5. Run the Trigger.dev dev CLI command with withnpx trigger dev@...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~168-~168: Possible typo: you repeated a word.
Context: ... 5. Run the Trigger.dev dev CLI command with with npx trigger dev@latest dev (it may as...

(ENGLISH_WORD_REPEAT_RULE)
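
For anyone who wants to catch this class of typo locally, a repeated-word check can be sketched with one regular expression. This is an illustrative approximation, not LanguageTool's actual rule implementation; `find_repeated_words` is a hypothetical helper name.

```python
import re

# A word, some whitespace, then the same word again (case-insensitive).
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)


def find_repeated_words(text: str) -> list[str]:
    """Return each word that is immediately followed by itself."""
    return [match.group(1) for match in REPEATED_WORD.finditer(text)]


step_4 = "copy your project ref from your dashboard and and add it to the file"
step_5 = "Run the Trigger.dev dev CLI command with with npx"

print(find_repeated_words(step_4))  # ['and']
print(find_repeated_words(step_5))  # ['dev', 'with']
```

Note the `dev` hit in the second sentence: "Trigger.dev dev" is intentional wording, so a naive check like this reports a false positive there. That may be why line 168 appears twice in the report above even though only "with with" needs fixing.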

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67115ca and a1ea09d.

📒 Files selected for processing (5)
  • docs/docs.json (11 hunks)
  • docs/guides/examples/puppeteer.mdx (1 hunks)
  • docs/guides/introduction.mdx (1 hunks)
  • docs/guides/python/python-crawl4ai.mdx (1 hunks)
  • docs/snippets/python-learn-more.mdx (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: 📚 Docs Checks
docs/guides/introduction.mdx

[error] 1-1: 1 broken links found: /guides/example-projects/python-web-crawler
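
A failure like this can be reproduced before pushing with a small check of internal links against the set of known pages. The sketch below is an assumption-laden illustration, not the actual Docs Checks workflow: `broken_internal_links` is a hypothetical helper, the page set is hand-written rather than read from docs/docs.json, and the sample MDX line is invented for the example.

```python
import re

# Markdown-style links whose target starts with "/" (internal links).
# Anchors ("#...") are excluded from the captured target.
INTERNAL_LINK = re.compile(r"\]\((/[^)\s#]*)")


def broken_internal_links(mdx_text: str, known_pages: set[str]) -> list[str]:
    """Return internal link targets that do not match any known page path."""
    targets = INTERNAL_LINK.findall(mdx_text)
    return [t for t in targets if t.lstrip("/") not in known_pages]


# Assumed page paths; a real check would collect these from the navigation file.
known_pages = {
    "guides/python/python-crawl4ai",
    "guides/examples/puppeteer",
}

# Invented sample content that links to one missing page and one existing page.
sample_mdx = (
    "[Python web crawler](/guides/example-projects/python-web-crawler) "
    "and [Puppeteer](/guides/examples/puppeteer)"
)

broken = broken_internal_links(sample_mdx, known_pages)
print(broken)  # ['/guides/example-projects/python-web-crawler']
```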

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (5)
docs/guides/introduction.mdx (1)

41-49: Approve New "Python web crawler" Entry

The new entry for the Python web crawler in the “Example projects” section is a valuable addition. It clearly specifies the project details and provides a direct link to the GitHub repository, aligning with the overall documentation updates.

docs/snippets/python-learn-more.mdx (1)

1-7: Approve "Learn More about Using Python" Section

The new snippet section effectively highlights additional learning resources related to using Python with Trigger.dev. The card component is clear and provides a neat call-to-action for further exploration.

docs/docs.json (1)

305-311: Approve JSON Reformatting and New Entry Addition

The reformatting of the array declarations into a single-line style enhances readability and consistency. The new entry "guides/python/python-crawl4ai" is properly added under the "Example projects" group. The overall JSON structure remains intact and clear.
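
One way to confirm that a reformat like this is purely cosmetic is to parse both layouts and compare the results. The fragment below is a minimal sketch: the object shape (a `group` name plus a `pages` array) is an assumption based on the description above, not a copy of the real docs/docs.json.

```python
import json

# Assumed shape of a navigation group, written in the old multi-line style.
multi_line = """
{
  "group": "Example projects",
  "pages": [
    "guides/python/python-crawl4ai"
  ]
}
"""

# The same group with the array collapsed onto one line.
single_line = (
    '{"group": "Example projects", '
    '"pages": ["guides/python/python-crawl4ai"]}'
)

# Whitespace is insignificant in JSON, so both parse to equal objects.
same_structure = json.loads(multi_line) == json.loads(single_line)
print(same_structure)
```

Running the same comparison on the file before and after the change would separate formatting-only hunks from the one hunk that actually adds an entry.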

docs/guides/python/python-crawl4ai.mdx (2)

1-7: Header and Metadata Look Good

The metadata block clearly sets the title, sidebar title, and description. This information is concise and effectively prepares the reader for the guide.


169-177: Overall Guide Structure and Code Examples

The guide comprehensively covers the prerequisites, build configuration, task code, dependency management, and testing/deployment steps for the Python web crawler. The code blocks (both TypeScript and Python) are clearly presented with proper context linking back to the examples repository. This thorough detail will greatly benefit users implementing the guide.

@D-K-P D-K-P merged commit 282cc06 into main Mar 11, 2025
7 checks passed
@D-K-P D-K-P deleted the docs/python-crawl-example branch March 11, 2025 14:17
@coderabbitai coderabbitai bot mentioned this pull request Apr 2, 2025
2 participants