Deep scraper integration #727

Merged
merged 46 commits into from
Oct 6, 2024
Commits
e2fe39c
Reasoning node created
vedovati-matteo Sep 26, 2024
3228f7d
Update reasoning_node.py
vedovati-matteo Sep 26, 2024
0b12589
reasoning node prompt refinement
vedovati-matteo Sep 26, 2024
b7b3e96
reasoning node refactoring
vedovati-matteo Sep 26, 2024
afa9aa3
import refactoring
vedovati-matteo Sep 26, 2024
9fa1094
Update reasoning_node_prompts.py
VinciGit00 Sep 27, 2024
b1ce563
Merge pull request #10 from VinciGit00/patch-2
vedovati-matteo Sep 27, 2024
857f28d
Merge pull request #702 from vedovati-matteo/pre/beta
VinciGit00 Sep 27, 2024
bdcffd6
feat: add html_mode to smart_scraper
VinciGit00 Sep 27, 2024
1e4ee3a
Update html_mode.py
VinciGit00 Sep 27, 2024
02ec4c1
Merge pull request #704 from ScrapeGraphAI/refactoring-smart_scraper
VinciGit00 Sep 27, 2024
4330179
ci(release): 1.22.0-beta.4 [skip ci]
semantic-release-bot Sep 27, 2024
b2822f6
feat: add reasoning integration
VinciGit00 Sep 27, 2024
faf25ee
Merge branch 'reasoning-branch' into pre/beta
VinciGit00 Sep 27, 2024
6d8f543
ci(release): 1.22.0-beta.5 [skip ci]
semantic-release-bot Sep 27, 2024
7783bfb
Merge pull request #706 from ScrapeGraphAI/pre/beta
VinciGit00 Sep 27, 2024
f87ffa1
fix: integration with html_mode
VinciGit00 Sep 27, 2024
ac552bc
Merge pull request #707 from ScrapeGraphAI/reasoning-branch
VinciGit00 Sep 28, 2024
39f7815
ci(release): 1.22.0-beta.6 [skip ci]
semantic-release-bot Sep 28, 2024
d9a8208
Merge branch 'pre/beta' into branch_temp
VinciGit00 Sep 29, 2024
d14fb54
Merge pull request #709 from ScrapeGraphAI/branch_temp
VinciGit00 Sep 29, 2024
ea27b24
add empyt nodes
VinciGit00 Sep 30, 2024
89de5b6
Stating anew
vedovati-matteo Sep 30, 2024
336bf70
initial creation of FetchNodeLevelK and DescriptionNode
vedovati-matteo Sep 30, 2024
7411ff0
Revert "initial creation of FetchNodeLevelK and DescriptionNode"
vedovati-matteo Sep 30, 2024
462b27b
Revert "Stating anew"
vedovati-matteo Sep 30, 2024
6915f3e
start form scratch
vedovati-matteo Sep 30, 2024
57bf572
initial code for fetch nodel level K
vedovati-matteo Sep 30, 2024
d80b792
fetching first level
vedovati-matteo Sep 30, 2024
55199e8
add first iterations of the nodes
VinciGit00 Sep 30, 2024
e88fee9
Update generate_answer_node_k_level.py
VinciGit00 Sep 30, 2024
45f02cd
refactoring of the format
VinciGit00 Oct 1, 2024
4cb621f
fetch node level k implementation
vedovati-matteo Oct 2, 2024
ea3ae1f
fetch multiple links fix
vedovati-matteo Oct 2, 2024
2bdb01b
Create parse_node_depth_k.py
vedovati-matteo Oct 2, 2024
f755d56
updated parse node
vedovati-matteo Oct 2, 2024
015c6fd
remove link from markdown
vedovati-matteo Oct 2, 2024
6124fbd
add embeddings with openai
VinciGit00 Oct 2, 2024
17c5145
Merge pull request #717 from vedovati-matteo/deep_scraper_integration
VinciGit00 Oct 2, 2024
4b371f4
feat: add deep scraper implementation
VinciGit00 Oct 3, 2024
85cb957
feat: finished basic version of deep scraper
VinciGit00 Oct 3, 2024
cb46efb
changed depedencies
VinciGit00 Oct 3, 2024
c91975e
update examples
VinciGit00 Oct 3, 2024
db54d69
refactoring of code for pylint integration
VinciGit00 Oct 4, 2024
d056c43
Create code_generator_graph_togehter.py
VinciGit00 Oct 4, 2024
0cfb7ec
Merge branch 'main' into deep_scraper_integration
VinciGit00 Oct 5, 2024
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -22,11 +22,15 @@
## [1.24.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.24.0...v1.24.1) (2024-09-26)



### Bug Fixes

* script creator multi ([9905be8](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/9905be8a37dc1ff4b90fe9b8be987887253be8bd))

## [1.24.0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.23.1...v1.24.0) (2024-09-26)
* integration with html_mode ([f87ffa1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/f87ffa1d8db32b38c47d9f5aa2ae88f1d7978a04))

## [1.22.0-beta.5](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.22.0-beta.4...v1.22.0-beta.5) (2024-09-27)


### Features
@@ -51,6 +55,14 @@
* **release:** 1.22.0-beta.1 [skip ci] ([f42a95f](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/f42a95faa05de39bd9cfc05e377d4b3da372e482))
* **release:** 1.22.0-beta.2 [skip ci] ([431c09f](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/431c09f551ac28581674c6061f055fde0350ed4c))
* **release:** 1.22.0-beta.3 [skip ci] ([e5ac020](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/e5ac0205d1e04a8b31e86166c3673915b70fd1e3))
* add reasoning integration ([b2822f6](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/b2822f620a610e61d295cbf4b670aa08fde9de24))

## [1.22.0-beta.4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.22.0-beta.3...v1.22.0-beta.4) (2024-09-27)


### Features

* add html_mode to smart_scraper ([bdcffd6](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/bdcffd6360237b27797546a198ceece55ce4bc81))

## [1.22.0-beta.3](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.22.0-beta.2...v1.22.0-beta.3) (2024-09-25)

21 changes: 3 additions & 18 deletions README.md
@@ -38,10 +38,9 @@ Additional dependecies can be added while installing the library:

- <b>More Language Models</b>: additional language models are installed, such as Fireworks, Groq, Anthropic, Hugging Face, and Nvidia AI Endpoints.


This group allows you to use additional language models like Fireworks, Groq, Anthropic, Together AI, Hugging Face, and Nvidia AI Endpoints.
```bash
pip install scrapegraphai[other-language-models]
```
- <b>Semantic Options</b>: this group includes tools for advanced semantic processing, such as Graphviz.

@@ -55,23 +54,9 @@ pip install scrapegraphai[other-language-models]
pip install scrapegraphai[more-browser-options]
```

- <b>faiss Options</b>: this group includes faiss integration

```bash
pip install scrapegraphai[faiss-cpu]
```

</details>



### Installing "More Browser Options"

This group includes an ocr scraper for websites
```bash
pip install scrapegraphai[screenshot_scraper]
```

## 💻 Usage
There are multiple standard scraping pipelines that can be used to extract information from a website (or local file).

28 changes: 28 additions & 0 deletions examples/anthropic/depth_search_graph_anthropic.py
@@ -0,0 +1,28 @@
"""
depth_search_graph_anthropic example
"""
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import DepthSearchGraph

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("ANTHROPIC_API_KEY"),
        "model": "anthropic/claude-3-haiku-20240307",
    },
    "verbose": True,
    "headless": False,
    "depth": 2,
    "only_inside_links": False,
}

search_graph = DepthSearchGraph(
    prompt="List me all the projects with their description",
    source="https://perinim.github.io",
    config=graph_config
)

result = search_graph.run()
print(result)
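
Reviewer note: the diff does not show how `depth` and `only_inside_links` are interpreted by `DepthSearchGraph`. The sketch below is one plausible reading, not the library's implementation: a breadth-first traversal that follows links at most `depth` hops from the source and, when `only_inside_links` is true, skips links on other hosts. The `crawl` function and the in-memory `LINKS` table are hypothetical and exist only for illustration.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "site": page URL -> links found on that page.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
    "https://other.org/x": [],
}


def crawl(source, depth, only_inside_links, links=LINKS):
    """Visit pages breadth-first, at most `depth` hops away from `source`."""
    root_host = urlparse(source).netloc
    seen = {source}
    queue = deque([(source, 0)])
    order = []
    while queue:
        url, level = queue.popleft()
        order.append(url)
        if level == depth:
            continue  # pages at the maximum depth are visited but not expanded
        for link in links.get(url, []):
            # Assumed meaning of only_inside_links: stay on the source host.
            if only_inside_links and urlparse(link).netloc != root_host:
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return order
```

Under this reading, `depth=0` would visit only the source page, and each increment adds one more ring of linked pages.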
2 changes: 1 addition & 1 deletion examples/azure/code_generator_graph_azure.py
@@ -28,7 +28,7 @@ class Projects(BaseModel):
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False,
2 changes: 1 addition & 1 deletion examples/azure/csv_scraper_azure.py
@@ -25,7 +25,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/csv_scraper_graph_multi_azure.py
@@ -25,7 +25,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
30 changes: 30 additions & 0 deletions examples/azure/depth_search_graph_azure.py
@@ -0,0 +1,30 @@
"""
depth_search_graph_azure example
"""
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import DepthSearchGraph

load_dotenv()

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": os.environ["AZURE_OPENAI_KEY"],
        "model": "azure_openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
    "depth": 2,
    "only_inside_links": False,
}

search_graph = DepthSearchGraph(
    prompt="List me all the projects with their description",
    source="https://perinim.github.io",
    config=graph_config
)

result = search_graph.run()
print(result)
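
Reviewer note: the examples in this PR read credentials in two ways. `os.getenv(...)` returns `None` when the variable is unset, so a missing key only surfaces later, when the LLM call fails; `os.environ[...]` raises `KeyError` immediately. A minimal sketch of a fail-fast helper that could make the examples consistent follows. `require_env` is a hypothetical name and is not part of scrapegraphai.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Treat both "unset" and "empty string" as missing.
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

With such a helper, an example could write `"api_key": require_env("AZURE_OPENAI_KEY")` and fail with a readable message before any graph is built.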
2 changes: 1 addition & 1 deletion examples/azure/json_scraper_azure.py
@@ -23,7 +23,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/json_scraper_multi_azure.py
@@ -12,7 +12,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/pdf_scraper_azure.py
@@ -10,7 +10,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/rate_limit_azure.py
@@ -26,7 +26,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o",
"rate_limit": {
"requests_per_second": 1
},
2 changes: 1 addition & 1 deletion examples/azure/scrape_plain_text_azure.py
@@ -28,7 +28,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/script_generator_azure.py
@@ -15,7 +15,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/script_multi_generator_azure.py
@@ -16,7 +16,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/search_graph_azure.py
@@ -22,7 +22,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/search_graph_schema_azure.py
@@ -30,7 +30,7 @@ class Dishes(BaseModel):
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/search_link_graph_azure.py
@@ -15,7 +15,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/smart_scraper_azure.py
@@ -26,7 +26,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/smart_scraper_multi_azure.py
@@ -14,7 +14,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/smart_scraper_multi_concat_azure.py
@@ -15,7 +15,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/smart_scraper_schema_azure.py
@@ -29,7 +29,7 @@ class Projects(BaseModel):
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/xml_scraper_azure.py
@@ -24,7 +24,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o"
},
"verbose": True,
"headless": False
2 changes: 1 addition & 1 deletion examples/azure/xml_scraper_graph_multi_azure.py
@@ -25,7 +25,7 @@
graph_config = {
"llm": {
"api_key": os.environ["AZURE_OPENAI_KEY"],
"model": "azure_openai/gpt-3.5-turbo",
"model": "azure_openai/gpt-4o",
},
"verbose": True,
"headless": False
31 changes: 31 additions & 0 deletions examples/bedrock/depth_search_graph_bedrock.py
@@ -0,0 +1,31 @@
"""
depth_search_graph_bedrock example
"""
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import DepthSearchGraph

load_dotenv()

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "client": "client_name",
        "model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
        "temperature": 0.0
    },
    "verbose": True,
    "headless": False,
    "depth": 2,
    "only_inside_links": False,
}

search_graph = DepthSearchGraph(
    prompt="List me all the projects with their description",
    source="https://perinim.github.io",
    config=graph_config
)

result = search_graph.run()
print(result)
30 changes: 30 additions & 0 deletions examples/deepseek/depth_search_graph_deepseek.py
@@ -0,0 +1,30 @@
"""
depth_search_graph_deepseek example
"""
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import DepthSearchGraph

load_dotenv()

deepseek_key = os.getenv("DEEPSEEK_APIKEY")

graph_config = {
    "llm": {
        "model": "deepseek/deepseek-chat",
        "api_key": deepseek_key,
    },
    "verbose": True,
    "headless": False,
    "depth": 2,
    "only_inside_links": False,
}

search_graph = DepthSearchGraph(
    prompt="List me all the projects with their description",
    source="https://perinim.github.io",
    config=graph_config
)

result = search_graph.run()
print(result)
2 changes: 1 addition & 1 deletion examples/ernie/custom_graph_ernie.py
@@ -14,7 +14,7 @@
# Define the configuration for the graph
# ************************************************

graph_config = {
graph_config = {
"llm": {
"model": "ernie/ernie-bot-turbo",
"ernie_client_id": "<ernie_client_id>",
26 changes: 26 additions & 0 deletions examples/ernie/depth_search_graph_ernie.py
@@ -0,0 +1,26 @@
"""
depth_search_graph_ernie example
"""
from scrapegraphai.graphs import DepthSearchGraph

graph_config = {
    "llm": {
        "model": "ernie/ernie-bot-turbo",
        "ernie_client_id": "<ernie_client_id>",
        "ernie_client_secret": "<ernie_client_secret>",
        "temperature": 0.1
    },
    "verbose": True,
    "headless": False,
    "depth": 2,
    "only_inside_links": False,
}

search_graph = DepthSearchGraph(
    prompt="List me all the projects with their description",
    source="https://perinim.github.io",
    config=graph_config
)

result = search_graph.run()
print(result)
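
Reviewer note: the five new `depth_search_graph_*` examples differ only in their `"llm"` block; the `verbose`, `headless`, `depth`, and `only_inside_links` keys are identical across providers. The sketch below shows how that shared part could be factored out. `make_depth_config` is a hypothetical helper written for this note, not something the PR or the library provides; the key names and default values are copied from the examples above.

```python
def make_depth_config(llm, depth=2, only_inside_links=False,
                      verbose=True, headless=False):
    """Build a graph_config dict with the keys used by the examples in this PR."""
    return {
        "llm": dict(llm),  # copy so the caller's dict is not shared or mutated
        "verbose": verbose,
        "headless": headless,
        "depth": depth,
        "only_inside_links": only_inside_links,
    }


# Usage with the provider block from the DeepSeek example (placeholder key).
deepseek_config = make_depth_config(
    {"model": "deepseek/deepseek-chat", "api_key": "<deepseek_api_key>"}
)
```

The resulting dictionary has the same shape as the `graph_config` literals in the examples and could be passed as `config=` in the same way.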