Commit d6f5ca8

Merge branch 'main' into pre/beta
2 parents 4fd8a39 + df918fa commit d6f5ca8

22 files changed, with 147 additions and 161 deletions

README.md

Lines changed: 62 additions & 125 deletions
@@ -8,13 +8,18 @@
 [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)
 
 
-ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites, documents and XML files.
+ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
+
 Just say which information you want to extract and the library will do it for you!
 
 <p align="center">
 <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
 </p>
 
+[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/gkxQDAjfeX)
+[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
+[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)
+
 
 ## 🚀 Quick install
 
@@ -48,11 +53,16 @@ The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.r
 Check out also the docusaurus [documentation](https://scrapegraph-doc.onrender.com/).
 
 ## 💻 Usage
-You can use the `SmartScraper` class to extract information from a website using a prompt.
+There are three main scraping pipelines that can be used to extract information from a website (or local file):
+- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
+- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
+- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
+
+It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
 
-The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
-### Case 1: Extracting information using Ollama
-Remember to download the model on Ollama separately!
+### Case 1: SmartScraper using Local Models
+
+Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.
 
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
@@ -67,11 +77,12 @@ graph_config = {
     "embeddings": {
         "model": "ollama/nomic-embed-text",
         "base_url": "http://localhost:11434", # set Ollama URL
-    }
+    },
+    "verbose": True,
 }
 
 smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
+    prompt="List me all the projects with their descriptions",
     # also accepts a string with the already downloaded HTML code
     source="https://perinim.github.io/projects",
     config=graph_config
@@ -82,177 +93,103 @@ print(result)
 
 ```
 
-### Case 2: Extracting information using Docker
+The output will be a list of projects with their descriptions like the following:
 
-Note: before using the local model remember to create the docker container!
-```text
-docker-compose up -d
-docker exec -it ollama ollama pull stablelm-zephyr
-```
-You can use which models available on Ollama or your own model instead of stablelm-zephyr
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-
-graph_config = {
-    "llm": {
-        "model": "ollama/mistral",
-        "temperature": 0,
-        "format": "json", # Ollama needs the format to be specified explicitly
-        # "model_tokens": 2000, # set context length arbitrarily
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
-
-result = smart_scraper_graph.run()
-print(result)
+{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
 ```
 
+### Case 2: SearchGraph using Mixed Models
 
-### Case 3: Extracting information using Openai model
-```python
-from scrapegraphai.graphs import SmartScraperGraph
-OPENAI_API_KEY = "YOUR_API_KEY"
+We use **Groq** for the LLM and **Ollama** for the embeddings.
 
-graph_config = {
-    "llm": {
-        "api_key": OPENAI_API_KEY,
-        "model": "gpt-3.5-turbo",
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
-
-result = smart_scraper_graph.run()
-print(result)
-```
-
-### Case 4: Extracting information using Groq
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-from scrapegraphai.utils import prettify_exec_info
-
-groq_key = os.getenv("GROQ_APIKEY")
+from scrapegraphai.graphs import SearchGraph
 
+# Define the configuration for the graph
 graph_config = {
     "llm": {
         "model": "groq/gemma-7b-it",
-        "api_key": groq_key,
+        "api_key": "GROQ_API_KEY",
         "temperature": 0
     },
     "embeddings": {
         "model": "ollama/nomic-embed-text",
-        "temperature": 0,
-        "base_url": "http://localhost:11434",
+        "base_url": "http://localhost:11434", # set ollama URL arbitrarily
     },
-    "headless": False
+    "max_results": 5,
 }
 
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the projects with their description and the author.",
-    source="https://perinim.github.io/projects",
+# Create the SearchGraph instance
+search_graph = SearchGraph(
+    prompt="List me all the traditional recipes from Chioggia",
     config=graph_config
 )
 
-result = smart_scraper_graph.run()
+# Run the graph
+result = search_graph.run()
 print(result)
 ```
 
+The output will be a list of recipes like the following:
 
-### Case 5: Extracting information using Azure
 ```python
-from langchain_openai import AzureChatOpenAI
-from langchain_openai import AzureOpenAIEmbeddings
-
-lm_model_instance = AzureChatOpenAI(
-    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
-    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
-)
-
-embedder_model_instance = AzureOpenAIEmbeddings(
-    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
-    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
-)
-graph_config = {
-    "llm": {"model_instance": llm_model_instance},
-    "embeddings": {"model_instance": embedder_model_instance}
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
-event_end_date, event_end_time, location, event_mode, event_category,
-third_party_redirect, no_of_days,
-time_in_hours, hosted_or_attending, refreshments_type,
-registration_available, registration_link""",
-    source="https://www.hmhco.com/event",
-    config=graph_config
-)
+{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
 ```
+### Case 3: SpeechGraph using OpenAI
+
+You just need to pass the OpenAI API key and the model name.
 
-### Case 6: Extracting information using Gemini
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-GOOGLE_APIKEY = "YOUR_API_KEY"
+from scrapegraphai.graphs import SpeechGraph
 
-# Define the configuration for the graph
 graph_config = {
     "llm": {
-        "api_key": GOOGLE_APIKEY,
-        "model": "gemini-pro",
+        "api_key": "OPENAI_API_KEY",
+        "model": "gpt-3.5-turbo",
     },
+    "tts_model": {
+        "api_key": "OPENAI_API_KEY",
+        "model": "tts-1",
+        "voice": "alloy"
+    },
+    "output_path": "audio_summary.mp3",
 }
 
-# Create the SmartScraperGraph instance
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    source="https://perinim.github.io/projects",
-    config=graph_config
+# ************************************************
+# Create the SpeechGraph instance and run it
+# ************************************************
+
+speech_graph = SpeechGraph(
+    prompt="Make a detailed audio summary of the projects.",
+    source="https://perinim.github.io/projects/",
+    config=graph_config,
 )
 
-result = smart_scraper_graph.run()
+result = speech_graph.run()
 print(result)
-```
 
-The output for all 3 the cases will be a dictionary with the extracted information, for example:
-
-```bash
-{
-    'titles': [
-        'Rotary Pendulum RL'
-    ],
-    'descriptions': [
-        'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
-    ]
-}
 ```
 
+The output will be an audio file with the summary of the projects on the page.
+
 ## 🤝 Contributing
 
 Feel free to contribute and join our Discord server to discuss with us improvements and give us suggestions!
 
 Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).
 
-[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/gkxQDAjfeX)
-[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
-[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)
-
 ## 📈 Roadmap
 Check out the project roadmap [here](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/README.md)! 🚀
 
 Wanna visualize the roadmap in a more interactive way? Check out the [markmap](https://markmap.js.org/repl) visualization by copy pasting the markdown content in the editor!
 
 ## ❤️ Contributors
 [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
+## Sponsors
+<p align="center">
+<a href="https://serpapi.com?utm_source=scrapegraphai"><img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;"></a>
+</p>
 
 ## 🎓 Citations
 If you have used our library for research purposes please quote us with the following reference:
@@ -269,7 +206,7 @@ If you have used our library for research purposes please quote us with the foll
 ## Authors
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors Logos">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors_logos">
 </p>
 
 | | Contact Info |

docs/assets/serp_api_logo.png

15.1 KB

docs/source/conf.py

Lines changed: 1 addition & 1 deletion
@@ -46,4 +46,4 @@
 ["Landing Page|https://scrapegraphai.com/",
 "Docusaurus|https://scrapegraph-doc.onrender.com/docs/intro"]
 ),
-)
+)
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+"""
+Basic example of scraping pipeline using SmartScraper
+"""
+
+import os
+from dotenv import load_dotenv
+from scrapegraphai.graphs import SearchGraph
+from scrapegraphai.utils import prettify_exec_info
+
+load_dotenv()
+
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+groq_key = os.getenv("GROQ_APIKEY")
+openai_key = os.getenv("OPENAI_APIKEY")
+
+graph_config = {
+    "llm": {
+        "model": "groq/gemma-7b-it",
+        "api_key": groq_key,
+        "temperature": 0
+    },
+    "embeddings": {
+        "api_key": openai_key,
+        "model": "openai",
+    },
+    "headless": False
+}
+
+search_graph = SearchGraph(
+    prompt="List me the best escursions near Trento",
+    config=graph_config
+)
+
+result = search_graph.run()
+print(result)
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = search_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))
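
Review note: the example above reads `GROQ_APIKEY` and `OPENAI_APIKEY` with `os.getenv`, which returns `None` when a variable is unset, so a missing key would only surface later, inside the graph. A minimal sketch of a fail-fast lookup follows, assuming a hypothetical helper named `require_env`; it is not part of scrapegraphai or of this commit.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Hypothetical helper for illustration only (not a scrapegraphai API).
    Raises RuntimeError when the variable is unset or empty, so a missing
    key is reported before any graph is built.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Under this assumption, `groq_key = require_env("GROQ_APIKEY")` could stand in for `os.getenv("GROQ_APIKEY")` after `load_dotenv()` has run.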

examples/local_models/README.md

Whitespace-only changes.

pyproject.toml

Lines changed: 18 additions & 18 deletions
@@ -43,24 +43,24 @@ classifiers = [
 [tool.poetry.dependencies]
 python = ">=3.9, <3.12"
 langchain = "0.1.15"
-langchain-openai = "^0.1.6"
-langchain-google-genai = "^1.0.3"
-langchain-groq = "^0.1.3"
-langchain-aws = "^0.1.3"
-langchain-anthropic = "^0.1.11"
-html2text = "^2024.2.26"
-faiss-cpu = "^1.8.0"
-beautifulsoup4 = "^4.12.3"
-pandas = "^2.2.2"
-python-dotenv = "^1.0.1"
-tiktoken = "^0.6.0"
-tqdm = "^4.66.4"
-graphviz = "^0.20.3"
-minify-html = "^0.15.0"
-free-proxy = "^1.1.1"
-playwright = "^1.43.0"
-google = "^3.0.0"
-yahoo-search-py = "^0.3"
+langchain-openai = "0.1.6"
+langchain-google-genai = "1.0.3"
+langchain-groq = "0.1.3"
+langchain-aws = "0.1.3"
+langchain-anthropic = "0.1.11"
+html2text = "2024.2.26"
+faiss-cpu = "1.8.0"
+beautifulsoup4 = "4.12.3"
+pandas = "2.2.2"
+python-dotenv = "1.0.1"
+tiktoken = "0.6.0"
+tqdm = "4.66.4"
+graphviz = "0.20.3"
+minify-html = "0.15.0"
+free-proxy = "1.1.1"
+playwright = "1.43.0"
+google = "3.0.0"
+yahoo-search-py = "0.3"
 
 [tool.poetry.dev-dependencies]
 pytest = "8.0.0"
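
Review note: this hunk replaces Poetry caret constraints (for example `^0.1.6`) with exact pins (`0.1.6`). A caret constraint accepts any release up to, but not including, the next bump of the leftmost non-zero component, while an exact pin accepts only that one version. The sketch below illustrates the difference, assuming plain numeric versions with at most three components; `caret_bounds` and `allowed` are hypothetical helper names, not Poetry APIs.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse 'X', 'X.Y' or 'X.Y.Z' into a 3-tuple, padding with zeros."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def caret_bounds(spec: str) -> Tuple[Version, Version]:
    """Return (inclusive lower, exclusive upper) bounds for a caret spec."""
    raw = spec.lstrip("^")
    parts = [int(p) for p in raw.split(".")]
    # Bump the leftmost non-zero component; if every given component
    # is zero, bump the last one that was written.
    for i, p in enumerate(parts):
        if p != 0:
            upper = parts[:i] + [p + 1]
            break
    else:
        upper = parts[:-1] + [parts[-1] + 1]
    upper += [0] * (3 - len(upper))
    return parse(raw), (upper[0], upper[1], upper[2])


def allowed(spec: str, version: str) -> bool:
    """True if `version` satisfies `spec` (a caret range or an exact pin)."""
    v = parse(version)
    if spec.startswith("^"):
        lower, upper = caret_bounds(spec)
        return lower <= v < upper
    return v == parse(spec)


print(allowed("^0.1.6", "0.1.9"))  # True: still below 0.2.0
print(allowed("^0.1.6", "0.2.0"))  # False: reaches the exclusive upper bound
print(allowed("0.1.6", "0.1.9"))   # False: an exact pin accepts only 0.1.6
```

With the exact pins in this commit, a later patch release such as `langchain-openai` 0.1.7 would no longer be selected until `pyproject.toml` is edited again.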

scrapegraphai/graphs/abstract_graph.py

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,8 @@
 from langchain_community.embeddings import HuggingFaceHubEmbeddings, OllamaEmbeddings
 from langchain_google_genai import GoogleGenerativeAIEmbeddings
 from ..helpers import models_tokens
-from ..models import AzureOpenAI, Bedrock, Gemini, Groq, HuggingFace, Ollama, OpenAI, Anthropic, DeepSeek
+from ..models import AzureOpenAI, Bedrock, Gemini, Groq, HuggingFace, Ollama, OpenAI, Anthropic
+from langchain_google_genai.embeddings import GoogleGenerativeAIEmbeddings
 
 
 class AbstractGraph(ABC):
