Commit 0683e78

Merge branch 'pre/beta' into fix-GenerateScraperGraph
2 parents 24c3b05 + 7ae50c0 commit 0683e78

File tree

76 files changed

+4278
-1088
lines changed

.gitignore

Lines changed: 1 addition & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -31,8 +31,4 @@ examples/graph_examples/ScrapeGraphAI_generated_graph
3131
examples/**/result.csv
3232
examples/**/result.json
3333
main.py
34-
poetry.lock
35-
36-
# lock files
37-
*.lock
38-
poetry.lock
34+

CHANGELOG.md

Lines changed: 149 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,27 +1,167 @@
1-
## [0.9.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.8.0...v0.9.0) (2024-05-04)
1+
## [0.11.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.11.0-beta.1...v0.11.0-beta.2) (2024-05-10)
22

33

44
### Features
55

6-
* Enable end users to pass model instances of HuggingFaceHub ([7599234](https://github.com/VinciGit00/Scrapegraph-ai/commit/7599234ab9563ca4ee9b7f5b2d0267daac621ecf))
6+
* revert fetch_node ([864aa91](https://github.com/VinciGit00/Scrapegraph-ai/commit/864aa91326c360992326e04811d272e55eac8355))
7+
8+
## [0.11.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.10.0...v0.11.0-beta.1) (2024-05-10)
9+
10+
11+
### Features
12+
13+
* Add support for passing pdf path as source ([f10f3b1](https://github.com/VinciGit00/Scrapegraph-ai/commit/f10f3b1438e0c625b7f2fa52faeb5a6c12116113))
14+
* update info ([4ed0fb8](https://github.com/VinciGit00/Scrapegraph-ai/commit/4ed0fb89c3e6068190a7775bedcb6ae65ba59d18))
715

816

917
### Bug Fixes
1018

11-
* trailing whitespace ([2878695](https://github.com/VinciGit00/Scrapegraph-ai/commit/2878695d5f35cc9d81f24e4844fdc1988d10cb26))
19+
* add json integration ([0ab31c3](https://github.com/VinciGit00/Scrapegraph-ai/commit/0ab31c3fdbd56652ed306e60109301f60e8042d3))
20+
* Augment the information getting fetched from a webpage ([f8ce3d5](https://github.com/VinciGit00/Scrapegraph-ai/commit/f8ce3d5916eab926275d59d4d48b0d89ec9cd43f))
21+
* fixed bugs for csv and xml ([324e977](https://github.com/VinciGit00/Scrapegraph-ai/commit/324e977b853ecaa55bac4bf86e7cd927f7f43d0d))
22+
* limit python version to < 3.12 ([a37fbbc](https://github.com/VinciGit00/Scrapegraph-ai/commit/a37fbbcbcfc3ddd0cc66f586f279676b52c4abfe))
1223

1324

14-
### Build
25+
### CI
1526

16-
* **deps:** bump tqdm from 4.66.1 to 4.66.3 ([0a17c74](https://github.com/VinciGit00/Scrapegraph-ai/commit/0a17c74e50d0457aec289e81183e9c779c735842))
17-
* **deps:** bump tqdm from 4.66.1 to 4.66.3 ([aff6f98](https://github.com/VinciGit00/Scrapegraph-ai/commit/aff6f983b02a37ced21826847a6ace5fb15ecf3d))
27+
* **release:** 0.10.0-beta.3 [skip ci] ([ad32298](https://github.com/VinciGit00/Scrapegraph-ai/commit/ad32298e70fc626fd62c897e153b806f79dba9b9))
28+
* **release:** 0.10.0-beta.4 [skip ci] ([548bff9](https://github.com/VinciGit00/Scrapegraph-ai/commit/548bff9d77c8b4d2aadee40e966a06cc9d7fd4ab))
29+
* **release:** 0.10.0-beta.5 [skip ci] ([28c9dce](https://github.com/VinciGit00/Scrapegraph-ai/commit/28c9dce7cbda49750172bafd7767fa48a0c33859))
30+
* **release:** 0.10.0-beta.6 [skip ci] ([460d292](https://github.com/VinciGit00/Scrapegraph-ai/commit/460d292af21fabad3fdd2b66110913ccee22ba91))
31+
32+
### Bug Fixes
33+
34+
* add json integration ([0ab31c3](https://github.com/VinciGit00/Scrapegraph-ai/commit/0ab31c3fdbd56652ed306e60109301f60e8042d3))
35+
36+
## [0.10.0-beta.5](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.10.0-beta.4...v0.10.0-beta.5) (2024-05-09)
37+
38+
39+
40+
### Bug Fixes
41+
42+
43+
* fixed bugs for csv and xml ([324e977](https://github.com/VinciGit00/Scrapegraph-ai/commit/324e977b853ecaa55bac4bf86e7cd927f7f43d0d))
44+
45+
## [0.10.0-beta.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.10.0-beta.3...v0.10.0-beta.4) (2024-05-09)
46+
47+
48+
### Features
49+
50+
* Add support for passing pdf path as source ([f10f3b1](https://github.com/VinciGit00/Scrapegraph-ai/commit/f10f3b1438e0c625b7f2fa52faeb5a6c12116113))
51+
52+
53+
### Bug Fixes
54+
55+
* limit python version to < 3.12 ([a37fbbc](https://github.com/VinciGit00/Scrapegraph-ai/commit/a37fbbcbcfc3ddd0cc66f586f279676b52c4abfe))
56+
57+
## [0.10.0-beta.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.10.0-beta.2...v0.10.0-beta.3) (2024-05-09)
58+
59+
60+
### Features
61+
62+
* update info ([4ed0fb8](https://github.com/VinciGit00/Scrapegraph-ai/commit/4ed0fb89c3e6068190a7775bedcb6ae65ba59d18))
63+
64+
## [0.10.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.10.0-beta.1...v0.10.0-beta.2) (2024-05-08)
65+
66+
67+
### Bug Fixes
68+
69+
* **examples:** local, mixed models and fixed SearchGraph embeddings problem ([6b71ec1](https://github.com/VinciGit00/Scrapegraph-ai/commit/6b71ec1d2be953220b6767bc429f4cf6529803fd))
70+
* **examples:** openai std examples ([186c0d0](https://github.com/VinciGit00/Scrapegraph-ai/commit/186c0d035d1d211aff33c38c449f2263d9716a07))
71+
* removed .lock file for deployment ([d4c7d4e](https://github.com/VinciGit00/Scrapegraph-ai/commit/d4c7d4e7fcc2110beadcb2fc91efc657ec6a485c))
72+
73+
74+
### Docs
75+
76+
* update README.md ([17ec992](https://github.com/VinciGit00/Scrapegraph-ai/commit/17ec992b498839e001277e7bc3f0ebea49fbd00d))
77+
78+
## [0.10.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.9.0...v0.10.0-beta.1) (2024-05-06)
79+
80+
81+
### Features
82+
83+
* add claude documentation ([5bdee55](https://github.com/VinciGit00/Scrapegraph-ai/commit/5bdee558760521bab818efc6725739e2a0f55d20))
84+
* add gemini embeddings ([79daa4c](https://github.com/VinciGit00/Scrapegraph-ai/commit/79daa4c112e076e9c5f7cd70bbbc6f5e4930832c))
85+
* add llava integration ([019b722](https://github.com/VinciGit00/Scrapegraph-ai/commit/019b7223dc969c87c3c36b6a42a19b4423b5d2af))
86+
* add new hugging_face models ([d5547a4](https://github.com/VinciGit00/Scrapegraph-ai/commit/d5547a450ccd8908f1cf73707142b3481fbc6baa))
87+
* Fix bug for gemini case when embeddings config not passed ([726de28](https://github.com/VinciGit00/Scrapegraph-ai/commit/726de288982700dab8ab9f22af8e26f01c6198a7))
88+
* fixed custom_graphs example and robots_node ([84fcb44](https://github.com/VinciGit00/Scrapegraph-ai/commit/84fcb44aaa36e84f775884138d04f4a60bb389be))
89+
* multiple graph instances ([dbb614a](https://github.com/VinciGit00/Scrapegraph-ai/commit/dbb614a8dd88d7667fe3daaf0263f5d6e9be1683))
90+
* **node:** multiple url search in SearchGraph + fixes ([930adb3](https://github.com/VinciGit00/Scrapegraph-ai/commit/930adb38f2154ba225342466bfd1846c47df72a0))
91+
* refactoring search function ([aeb1acb](https://github.com/VinciGit00/Scrapegraph-ai/commit/aeb1acbf05e63316c91672c99d88f8a6f338147f))
92+
93+
94+
### Bug Fixes
95+
96+
* bug on .toml ([f7d66f5](https://github.com/VinciGit00/Scrapegraph-ai/commit/f7d66f51818dbdfddd0fa326f26265a3ab686b20))
97+
* **llm:** fixed gemini api_key ([fd01b73](https://github.com/VinciGit00/Scrapegraph-ai/commit/fd01b73b71b515206cfdf51c1d52136293494389))
1898

1999

20100
### CI
21101

22-
* **release:** 0.8.0-beta.1 [skip ci] ([d277b34](https://github.com/VinciGit00/Scrapegraph-ai/commit/d277b349a98848749a7e38ea3c511271bced3b71))
23-
* **release:** 0.8.0-beta.2 [skip ci] ([892500a](https://github.com/VinciGit00/Scrapegraph-ai/commit/892500afe93c4d96dcffe897b382977a22079b83))
24-
* **release:** 0.9.0-beta.1 [skip ci] ([14615a7](https://github.com/VinciGit00/Scrapegraph-ai/commit/14615a73c71bb5250772a75c415c57cb153660f8))
102+
* **release:** 0.9.0-beta.2 [skip ci] ([5aa600c](https://github.com/VinciGit00/Scrapegraph-ai/commit/5aa600cb0a85d320ad8dc786af26ffa46dd4d097))
103+
* **release:** 0.9.0-beta.3 [skip ci] ([da8c72c](https://github.com/VinciGit00/Scrapegraph-ai/commit/da8c72ce138bcfe2627924d25a67afcd22cfafd5))
104+
* **release:** 0.9.0-beta.4 [skip ci] ([8c5397f](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c5397f67a9f05e0c00f631dd297b5527263a888))
105+
* **release:** 0.9.0-beta.5 [skip ci] ([532adb6](https://github.com/VinciGit00/Scrapegraph-ai/commit/532adb639d58640bc89e8b162903b2ed97be9853))
106+
* **release:** 0.9.0-beta.6 [skip ci] ([8c0b46e](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c0b46eb40b446b270c665c11b2c6508f4d5f4be))
107+
* **release:** 0.9.0-beta.7 [skip ci] ([6911e21](https://github.com/VinciGit00/Scrapegraph-ai/commit/6911e21584767460c59c5a563c3fd010857cbb67))
108+
* **release:** 0.9.0-beta.8 [skip ci] ([739aaa3](https://github.com/VinciGit00/Scrapegraph-ai/commit/739aaa33c39c12e7ab7df8a0656cad140b35c9db))
109+
110+
## [0.9.0-beta.8](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.9.0-beta.7...v0.9.0-beta.8) (2024-05-06)
111+
112+
113+
### Features
114+
115+
* add llava integration ([019b722](https://github.com/VinciGit00/Scrapegraph-ai/commit/019b7223dc969c87c3c36b6a42a19b4423b5d2af))
116+
117+
## [0.9.0-beta.7](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.9.0-beta.6...v0.9.0-beta.7) (2024-05-06)
118+
119+
120+
### Bug Fixes
121+
122+
* **llm:** fixed gemini api_key ([fd01b73](https://github.com/VinciGit00/Scrapegraph-ai/commit/fd01b73b71b515206cfdf51c1d52136293494389))
123+
124+
## [0.9.0-beta.6](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.9.0-beta.5...v0.9.0-beta.6) (2024-05-06)
125+
126+
127+
### Features
128+
129+
* Fix bug for gemini case when embeddings config not passed ([726de28](https://github.com/VinciGit00/Scrapegraph-ai/commit/726de288982700dab8ab9f22af8e26f01c6198a7))
130+
131+
## [0.9.0-beta.5](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.9.0-beta.4...v0.9.0-beta.5) (2024-05-06)
132+
133+
134+
### Features
135+
136+
* fixed custom_graphs example and robots_node ([84fcb44](https://github.com/VinciGit00/Scrapegraph-ai/commit/84fcb44aaa36e84f775884138d04f4a60bb389be))
137+
* multiple graph instances ([dbb614a](https://github.com/VinciGit00/Scrapegraph-ai/commit/dbb614a8dd88d7667fe3daaf0263f5d6e9be1683))
138+
* **node:** multiple url search in SearchGraph + fixes ([930adb3](https://github.com/VinciGit00/Scrapegraph-ai/commit/930adb38f2154ba225342466bfd1846c47df72a0))
139+
140+
## [0.9.0-beta.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.9.0-beta.3...v0.9.0-beta.4) (2024-05-05)
141+
142+
143+
### Features
144+
145+
* add gemini embeddings ([79daa4c](https://github.com/VinciGit00/Scrapegraph-ai/commit/79daa4c112e076e9c5f7cd70bbbc6f5e4930832c))
146+
147+
## [0.9.0-beta.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.9.0-beta.2...v0.9.0-beta.3) (2024-05-05)
148+
149+
150+
### Features
151+
152+
* add claude documentation ([5bdee55](https://github.com/VinciGit00/Scrapegraph-ai/commit/5bdee558760521bab818efc6725739e2a0f55d20))
153+
154+
## [0.9.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.9.0-beta.1...v0.9.0-beta.2) (2024-05-05)
155+
156+
157+
### Features
158+
159+
* refactoring search function ([aeb1acb](https://github.com/VinciGit00/Scrapegraph-ai/commit/aeb1acbf05e63316c91672c99d88f8a6f338147f))
160+
161+
162+
### Bug Fixes
163+
164+
* bug on .toml ([f7d66f5](https://github.com/VinciGit00/Scrapegraph-ai/commit/f7d66f51818dbdfddd0fa326f26265a3ab686b20))
25165

26166
## [0.9.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.8.0...v0.9.0-beta.1) (2024-05-04)
27167

SECURITY.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,3 +3,4 @@
33
## Reporting a Vulnerability
44

55
For reporting a vulnerability contact directly [email protected]
6+

docs/source/getting_started/examples.rst

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -44,9 +44,12 @@ Local models
4444

4545
Remember to install `ollama <https://ollama.com/>`_ on your machine.
4646
Remember to pull the right models for the LLM and for the embeddings, for example:
47+
4748
.. code-block:: bash
4849
4950
ollama pull llama3
51+
ollama pull nomic-embed-text
52+
ollama pull mistral
5053
5154
After that, you can run the following code, using only your machine resources brum brum brum:
5255
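The models pulled above map onto the `ollama/...` identifiers used in the graph configurations elsewhere in this commit. A minimal sketch, assuming the config shape of the `ollama/mistral` example in this diff; `build_ollama_config` is a hypothetical helper for illustration and is not part of scrapegraphai:

```python
def build_ollama_config(llm_model: str = "mistral",
                        embedding_model: str = "nomic-embed-text") -> dict:
    """Return a graph config dict for locally pulled Ollama models.

    Hypothetical helper: the key names mirror the example configs in
    this commit, but the function itself is not a library API.
    """
    return {
        "llm": {
            "model": f"ollama/{llm_model}",
            "temperature": 0,
            "format": "json",  # Ollama needs the format to be specified explicitly
        },
        "embeddings": {
            "model": f"ollama/{embedding_model}",
            "temperature": 0,
        },
    }


config = build_ollama_config()
print(config["llm"]["model"])
print(config["embeddings"]["model"])
```

The resulting dict would be passed as `config=` to a graph class such as `SmartScraperGraph`; swapping `llm_model` to `"llama3"` should work the same way once that model has been pulled.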

Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,59 @@
1+
"""
2+
Basic example of a scraping pipeline using SmartScraperGraph with Anthropic Claude and Hugging Face embeddings
3+
"""
4+
5+
import os
6+
from dotenv import load_dotenv
7+
from scrapegraphai.graphs import SmartScraperGraph
8+
from scrapegraphai.utils import prettify_exec_info
9+
from langchain_community.llms import HuggingFaceEndpoint
10+
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
11+
12+
13+
# required environment variables in .env
14+
# HUGGINGFACEHUB_API_TOKEN
15+
# ANTHROPIC_API_KEY
16+
load_dotenv()
17+
18+
HUGGINGFACEHUB_API_TOKEN = os.getenv('HUGGINGFACEHUB_API_TOKEN')
19+
# ************************************************
20+
# Initialize the model instances
21+
# ************************************************
22+
23+
24+
embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
25+
api_key=HUGGINGFACEHUB_API_TOKEN, model_name="sentence-transformers/all-MiniLM-l6-v2"
26+
)
27+
28+
# ************************************************
29+
# Create the SmartScraperGraph instance and run it
30+
# ************************************************
31+
32+
graph_config = {
33+
"llm": {
34+
"api_key": os.getenv("ANTHROPIC_API_KEY"),
35+
"model": "claude-3-haiku-20240307",
36+
"max_tokens": 4000},
37+
"embeddings": {"model_instance": embedder_model_instance}
38+
}
39+
40+
smart_scraper_graph = SmartScraperGraph(
41+
prompt="""Don't say anything else. Output JSON only. List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
42+
event_end_date, event_end_time, location, event_mode, event_category,
43+
third_party_redirect, no_of_days,
44+
time_in_hours, hosted_or_attending, refreshments_type,
45+
registration_available, registration_link""",
46+
# also accepts a string with the already downloaded HTML code
47+
source="https://www.hmhco.com/event",
48+
config=graph_config
49+
)
50+
51+
result = smart_scraper_graph.run()
52+
print(result)
53+
54+
# ************************************************
55+
# Get graph execution info
56+
# ************************************************
57+
58+
graph_exec_info = smart_scraper_graph.get_execution_info()
59+
print(prettify_exec_info(graph_exec_info))
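Because the example reads its keys with `os.getenv`, a missing variable silently becomes `None` and the failure only surfaces later, inside the model call. A small pre-flight check can make that explicit. This is a sketch using only the standard library; `missing_env_vars` is a hypothetical helper, and the variable names are the two listed in the example's comments:

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variable names taken from the example's "required environment variables" comment.
REQUIRED = ["HUGGINGFACEHUB_API_TOKEN", "ANTHROPIC_API_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Running the check right after `load_dotenv()` and before building `graph_config` keeps the error close to its cause.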
Lines changed: 27 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,15 @@
11
"""
2-
Basic example of scraping pipeline using JSONScraperGraph from JSON documents
2+
Example of Search Graph
33
"""
44

55
import os
66
from dotenv import load_dotenv
7-
from scrapegraphai.graphs import JSONScraperGraph
7+
from langchain_openai import AzureChatOpenAI
8+
from langchain_openai import AzureOpenAIEmbeddings
9+
from scrapegraphai.graphs import SearchGraph
810
from scrapegraphai.utils import convert_to_csv, convert_to_json, prettify_exec_info
911
load_dotenv()
1012

11-
# ************************************************
12-
# Read the JSON file
13-
# ************************************************
14-
1513
FILE_NAME = "inputs/example.json"
1614
curr_dir = os.path.dirname(os.path.realpath(__file__))
1715
file_path = os.path.join(curr_dir, FILE_NAME)
@@ -20,42 +18,47 @@
2018
text = file.read()
2119

2220
# ************************************************
23-
# Define the configuration for the graph
21+
# Initialize the model instances
22+
# ************************************************
23+
24+
llm_model_instance = AzureChatOpenAI(
25+
openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
26+
azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
27+
)
28+
29+
embedder_model_instance = AzureOpenAIEmbeddings(
30+
azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
31+
openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
32+
)
33+
34+
# ************************************************
35+
# Define the configuration for the graph
2436
# ************************************************
2537

2638
graph_config = {
27-
"llm": {
28-
"model": "ollama/mistral",
29-
"temperature": 0,
30-
"format": "json", # Ollama needs the format to be specified explicitly
31-
# "model_tokens": 2000, # set context length arbitrarily
32-
},
33-
"embeddings": {
34-
"model": "ollama/nomic-embed-text",
35-
"temperature": 0,
36-
}
39+
"llm": {"model_instance": llm_model_instance},
40+
"embeddings": {"model_instance": embedder_model_instance}
3741
}
3842

3943
# ************************************************
40-
# Create the JSONScraperGraph instance and run it
44+
# Create the SearchGraph instance and run it
4145
# ************************************************
4246

43-
json_scraper_graph = JSONScraperGraph(
44-
prompt="List me all the authors, title and genres of the books",
45-
source=text, # Pass the content of the file, not the file object
47+
search_graph = SearchGraph(
48+
prompt="List me the best excursions near Trento",
4649
config=graph_config
4750
)
4851

49-
result = json_scraper_graph.run()
52+
result = search_graph.run()
5053
print(result)
5154

5255
# ************************************************
5356
# Get graph execution info
5457
# ************************************************
5558

56-
graph_exec_info = json_scraper_graph.get_execution_info()
59+
graph_exec_info = search_graph.get_execution_info()
5760
print(prettify_exec_info(graph_exec_info))
5861

59-
# Save to json or csv
62+
# Save to json and csv
6063
convert_to_csv(result, "result")
6164
convert_to_json(result, "result")

examples/gemini/search_graph_gemini.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,8 @@
2121
"temperature": 0,
2222
"streaming": True
2323
},
24+
"max_results": 5,
25+
"verbose": True,
2426
}
2527

2628
# ************************************************
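The hunk above adds `max_results` and `verbose` as top-level keys, next to `llm` rather than inside it. A minimal sketch of that layout follows; `with_search_options` is a hypothetical helper, the key names come from this diff, and their exact semantics (presumably a cap on search results and extra logging) are an assumption:

```python
def with_search_options(graph_config: dict, max_results: int = 5,
                        verbose: bool = True) -> dict:
    """Return a copy of graph_config with top-level search options added.

    Hypothetical helper: the key names come from the diff, their
    meaning is assumed rather than confirmed by the source.
    """
    if max_results < 1:
        raise ValueError("max_results must be at least 1")
    # Build a new dict so the caller's config is left untouched.
    return {**graph_config, "max_results": max_results, "verbose": verbose}


base = {"llm": {"temperature": 0, "streaming": True}}
config = with_search_options(base)
print(sorted(config))
```

Returning a copy instead of mutating the input makes it easier to reuse one base config across several graph instances.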
