Added Python crawling example #1786
Walkthrough
This PR updates documentation by reformatting array declarations in a JSON file to a single-line style and by adding new entries to the “Example projects” sections. It revises a warning message in the Puppeteer guide to explicitly require permission for web scraping, and introduces a comprehensive guide for a Python-based headless web crawler. Additionally, a new “Learn more about using Python with Trigger.dev” section has been added, enhancing the overall clarity and consistency of the documentation.
Changes
Sequence Diagram(s)
sequenceDiagram
participant U as User
participant TD as Trigger.dev Task Orchestrator
participant PS as Python Script (crawl-url.py)
U->>TD: Initiate web crawling task
TD->>PS: Execute task via python.runScript
PS->>PS: Crawl target URL & process content
PS-->>TD: Return crawled data (markdown format)
TD-->>U: Deliver formatted output
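The Python side of the flow above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: it assumes the script receives the target URL as its first command-line argument and hands its result back on standard output, and it uses a tiny standard-library HTML-to-markdown converter as a stand-in for Crawl4AI. The names `MarkdownExtractor`, `html_to_markdown`, and `main` are hypothetical.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Collect headings and paragraphs as rough markdown (stand-in for a real crawler)."""

    # Block-level tags we keep, mapped to their markdown prefix.
    PREFIXES = {"h1": "# ", "h2": "## ", "h3": "### ", "p": ""}

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished markdown blocks, in document order
        self._tag = None   # block-level tag currently open, if any
        self._text = []    # text fragments collected for that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.PREFIXES:
            self._tag, self._text = tag, []

    def handle_data(self, data):
        # Text inside inline tags (e.g. <a>) is still collected for the open block.
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            if text:
                self.blocks.append(self.PREFIXES[tag] + text)
            self._tag = None


def html_to_markdown(html: str) -> str:
    """Convert an HTML string to a simplified markdown string."""
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


def main(argv, fetch):
    """argv[1] is the target URL (assumed); fetch is any callable returning HTML for a URL."""
    url = argv[1]
    # Printing to stdout is the assumed way of returning data to the orchestrating task.
    print(html_to_markdown(fetch(url)))
```

Passing `fetch` in as a parameter keeps the sketch free of network access; a real script would replace both `fetch` and `html_to_markdown` with the crawler library's own calls.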
Actionable comments posted: 0
🧹 Nitpick comments (3)
docs/guides/examples/puppeteer.mdx (1)
208-218: Improve Punctuation in the Warning Message
The warning text now explicitly emphasizes the need for permission when scraping; however, the sentence “If you don't you'll risk getting our IP address blocked and we will ban you from our service.” would be clearer and more grammatically correct with proper punctuation. Consider inserting a comma after “don’t” and another before “and” to separate the independent clauses.

- If you don't you'll risk getting our IP address blocked and we will ban you from our service.
+ If you don't, you'll risk getting our IP address blocked, and we will ban you from our service.

🧰 Tools
🪛 LanguageTool
[uncategorized] ~208-~208: Use a comma before ‘and’ if it connects two independent clauses (unless they are closely connected and short).
Context: ...u'll risk getting our IP address blocked and we will ban you from our service. You m...(COMMA_COMPOUND_SENTENCE)
[grammar] ~210-~210: There may be a verb agreement error, if referring to a singular entity (a list).
Context: ... owner to scrape their content.** Here are a list of proxy services we recommend: ...(THERE_IS_ARE)
docs/guides/python/python-crawl4ai.mdx (2)
21-25: Enhance Descriptive Text in the Features Section
In the features list:
- On the Crawl4AI item, consider using a hyphen in “LLM friendly” to form “LLM‐friendly,” which improves clarity when used as a compound adjective.
- On the Playwright extension item, “chromium browser” should capitalize “Chromium” (since it is a proper noun referring to the browser engine).

- [Crawl4AI](https://github.com/unclecode/crawl4ai), an open source LLM friendly web crawler
+ [Crawl4AI](https://github.com/unclecode/crawl4ai), an open source LLM‐friendly web crawler

- A custom [Playwright extension](https://playwright.dev/) to create a headless chromium browser
+ A custom [Playwright extension](https://playwright.dev/) to create a headless Chromium browser

🧰 Tools
🪛 LanguageTool
[uncategorized] ~23-~23: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...ps://github.com/unclecode/crawl4ai), an open source LLM friendly web crawler - A custom [Pl...(EN_COMPOUND_ADJECTIVE_INTERNAL)
[grammar] ~24-~24: The proper noun “Chromium” (= software from Google) needs to be capitalized.
Context: ...//playwright.dev/) to create a headless chromium browser ## GitHub...(GOOGLE_PRODUCTS)
162-168: Correct Duplicate Words in the Testing Instructions
There are duplicate words in the testing steps:
- The phrase “and and add it” should be corrected to “and add it.”
- Similarly, “with with” should be corrected to “with.”

- 4. If you haven't already, copy your project ref from your [Trigger.dev dashboard](https://cloud.trigger.dev) and and add it to the `trigger.config.ts` file.
+ 4. If you haven't already, copy your project ref from your [Trigger.dev dashboard](https://cloud.trigger.dev) and add it to the `trigger.config.ts` file.

- 5. Run the Trigger.dev dev CLI command with with `npx trigger dev@latest dev` (it may ask you to authorize the CLI if you haven't already).
+ 5. Run the Trigger.dev dev CLI command with `npx trigger dev@latest dev` (it may ask you to authorize the CLI if you haven't already).

🧰 Tools
🪛 LanguageTool
[duplication] ~167-~167: Possible typo: you repeated a word.
Context: ...v dashboard](https://cloud.trigger.dev) and and add it to the trigger.config.ts file....(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~168-~168: Possible typo: you repeated a word.
Context: ...ger.config.ts file. 5. Run the Trigger.dev dev CLI command with with npx trigger dev@...(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~168-~168: Possible typo: you repeated a word.
Context: ... 5. Run the Trigger.dev dev CLI command with with npx trigger dev@latest dev (it may as...(ENGLISH_WORD_REPEAT_RULE)
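For anyone who wants to catch this class of typo locally before pushing docs changes, a repeated-word check can be approximated with a short regular expression. The sketch below is an assumption-laden illustration, not LanguageTool's actual rule; `find_repeated_words` is a hypothetical helper name.

```python
import re

# A word, some whitespace, then the same word again (case-insensitive).
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)


def find_repeated_words(text: str) -> list[str]:
    """Return the first word of every immediately repeated word pair in text."""
    return [match.group(1) for match in REPEATED_WORD.finditer(text)]
```

Like the flagged context above, this naive check also reports "Trigger.dev dev" as a repeat of "dev", so its hits still need a human look before editing.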
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
docs/docs.json (11 hunks)
docs/guides/examples/puppeteer.mdx (1 hunks)
docs/guides/introduction.mdx (1 hunks)
docs/guides/python/python-crawl4ai.mdx (1 hunks)
docs/snippets/python-learn-more.mdx (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: 📚 Docs Checks
docs/guides/introduction.mdx
[error] 1-1: 1 broken links found: /guides/example-projects/python-web-crawler
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (5)
docs/guides/introduction.mdx (1)
41-49: Approve New "Python web crawler" Entry
The new entry for the Python web crawler in the “Example projects” section is a valuable addition. It clearly specifies the project details and provides a direct link to the GitHub repository, aligning with the overall documentation updates.
docs/snippets/python-learn-more.mdx (1)
1-7: Approve "Learn More about Using Python" Section
The new snippet section effectively highlights additional learning resources related to using Python with Trigger.dev. The card component is clear and provides a neat call-to-action for further exploration.
docs/docs.json (1)
305-311: Approve JSON Reformatting and New Entry Addition
The reformatting of the array declarations into a single-line style enhances readability and consistency. The new entry "guides/python/python-crawl4ai" is properly added under the "Example projects" group. The overall JSON structure remains intact and clear.
docs/guides/python/python-crawl4ai.mdx (2)
1-7: Header and Metadata Look Good
The metadata block clearly sets the title, sidebar title, and description. This information is concise and effectively prepares the reader for the guide.
169-177: Overall Guide Structure and Code Examples
The guide comprehensively covers the prerequisites, build configuration, task code, dependency management, and testing/deployment steps for the Python web crawler. The code blocks (both TypeScript and Python) are clearly presented with proper context linking back to the examples repository. This thorough detail will greatly benefit users implementing the guide.
Summary by CodeRabbit