Commit 78f2174

2 parents: 04a4d84 + b8a8ebb

File tree

3 files changed (+61, -128 lines)

README.md

Lines changed: 59 additions & 125 deletions
@@ -8,7 +8,8 @@
 [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)
 
 
-ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites, documents and XML files.
+ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
+
 Just say which information you want to extract and the library will do it for you!
 
 <p align="center">
@@ -52,11 +53,16 @@ The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.r
 Check out also the docusaurus [documentation](https://scrapegraph-doc.onrender.com/).
 
 ## 💻 Usage
-You can use the `SmartScraper` class to extract information from a website using a prompt.
+There are three main scraping pipelines that can be used to extract information from a website (or local file):
+- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
+- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
+- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
+
+It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
+
+### Case 1: SmartScraper using Local Models
 
-The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
-### Case 1: Extracting information using Ollama
-Remember to download the model on Ollama separately!
+Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.
 
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
@@ -71,11 +77,12 @@ graph_config = {
     "embeddings": {
         "model": "ollama/nomic-embed-text",
         "base_url": "http://localhost:11434",  # set Ollama URL
-    }
+    },
+    "verbose": True,
 }
 
 smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
+    prompt="List me all the projects with their descriptions",
     # also accepts a string with the already downloaded HTML code
     source="https://perinim.github.io/projects",
     config=graph_config
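
Read together with the hunks above, the local-model example now configures both the LLM and the embedder through Ollama, switches the prompt, and turns on verbose logging. Assembled, the snippet would read roughly like the sketch below; the `llm` block is not visible in this hunk, so its keys here are an assumption borrowed from the Ollama configuration in the removed Docker example further down in this diff.

```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        # Assumed: this block is not shown in the hunk above; the keys are
        # taken from the removed Docker example elsewhere in this commit.
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
```

Both models have to be pulled into Ollama first, which is the **ollama pull** step the new Case 1 text mentions.
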
@@ -86,159 +93,86 @@ print(result)
 
 ```
 
-### Case 2: Extracting information using Docker
+The output will be a list of projects with their descriptions like the following:
 
-Note: before using the local model remember to create the docker container!
-```text
-docker-compose up -d
-docker exec -it ollama ollama pull stablelm-zephyr
-```
-You can use which models available on Ollama or your own model instead of stablelm-zephyr
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-
-graph_config = {
-    "llm": {
-        "model": "ollama/mistral",
-        "temperature": 0,
-        "format": "json",  # Ollama needs the format to be specified explicitly
-        # "model_tokens": 2000,  # set context length arbitrarily
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
-
-result = smart_scraper_graph.run()
-print(result)
+{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
 ```
 
+### Case 2: SearchGraph using Mixed Models
 
-### Case 3: Extracting information using Openai model
-```python
-from scrapegraphai.graphs import SmartScraperGraph
-OPENAI_API_KEY = "YOUR_API_KEY"
-
-graph_config = {
-    "llm": {
-        "api_key": OPENAI_API_KEY,
-        "model": "gpt-3.5-turbo",
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
-
-result = smart_scraper_graph.run()
-print(result)
-```
+We use **Groq** for the LLM and **Ollama** for the embeddings.
 
-### Case 4: Extracting information using Groq
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-from scrapegraphai.utils import prettify_exec_info
-
-groq_key = os.getenv("GROQ_APIKEY")
+from scrapegraphai.graphs import SearchGraph
 
+# Define the configuration for the graph
 graph_config = {
     "llm": {
         "model": "groq/gemma-7b-it",
-        "api_key": groq_key,
+        "api_key": "GROQ_API_KEY",
         "temperature": 0
     },
     "embeddings": {
         "model": "ollama/nomic-embed-text",
-        "temperature": 0,
-        "base_url": "http://localhost:11434",
+        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
     },
-    "headless": False
+    "max_results": 5,
 }
 
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the projects with their description and the author.",
-    source="https://perinim.github.io/projects",
+# Create the SearchGraph instance
+search_graph = SearchGraph(
+    prompt="List me all the traditional recipes from Chioggia",
     config=graph_config
 )
 
-result = smart_scraper_graph.run()
+# Run the graph
+result = search_graph.run()
 print(result)
 ```
 
-### Case 5: Extracting information using Azure
-```python
-from langchain_openai import AzureChatOpenAI
-from langchain_openai import AzureOpenAIEmbeddings
-
-lm_model_instance = AzureChatOpenAI(
-    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
-    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
-)
-
-embedder_model_instance = AzureOpenAIEmbeddings(
-    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
-    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
-)
-graph_config = {
-    "llm": {"model_instance": llm_model_instance},
-    "embeddings": {"model_instance": embedder_model_instance}
-}
+The output will be a list of recipes like the following:
 
-smart_scraper_graph = SmartScraperGraph(
-    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
-    event_end_date, event_end_time, location, event_mode, event_category,
-    third_party_redirect, no_of_days,
-    time_in_hours, hosted_or_attending, refreshments_type,
-    registration_available, registration_link""",
-    source="https://www.hmhco.com/event",
-    config=graph_config
-)
+```python
+{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
 ```
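
The new SearchGraph example prints a plain Python dict in the shape shown just above. A short post-processing sketch follows (standard library only; the dict literal here simply mirrors the README's sample output and would normally come from `search_graph.run()`):

```python
import json

# Sample result in the shape the new README shows for SearchGraph;
# in practice this dict comes from search_graph.run().
result = {
    "recipes": [
        {"name": "Sarde in Saòre"},
        {"name": "Bigoli in salsa"},
        {"name": "Risi e bisi"},
    ]
}

# Print the recipes as a numbered list.
for i, recipe in enumerate(result.get("recipes", []), start=1):
    print(f"{i}. {recipe['name']}")

# Persist the raw result for later inspection.
with open("chioggia_recipes.json", "w", encoding="utf-8") as fh:
    json.dump(result, fh, ensure_ascii=False, indent=2)
```
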
+### Case 3: SpeechGraph using OpenAI
+
+You just need to pass the OpenAI API key and the model name.
 
-### Case 6: Extracting information using Gemini
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-GOOGLE_APIKEY = "YOUR_API_KEY"
+from scrapegraphai.graphs import SpeechGraph
 
-# Define the configuration for the graph
 graph_config = {
     "llm": {
-        "api_key": GOOGLE_APIKEY,
-        "model": "gemini-pro",
+        "api_key": "OPENAI_API_KEY",
+        "model": "gpt-3.5-turbo",
     },
+    "tts_model": {
+        "api_key": "OPENAI_API_KEY",
+        "model": "tts-1",
+        "voice": "alloy"
+    },
+    "output_path": "audio_summary.mp3",
 }
 
-# Create the SmartScraperGraph instance
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    source="https://perinim.github.io/projects",
-    config=graph_config
+# ************************************************
+# Create the SpeechGraph instance and run it
+# ************************************************
+
+speech_graph = SpeechGraph(
+    prompt="Make a detailed audio summary of the projects.",
+    source="https://perinim.github.io/projects/",
+    config=graph_config,
 )
 
-result = smart_scraper_graph.run()
+result = speech_graph.run()
 print(result)
-```
 
-The output for all 3 the cases will be a dictionary with the extracted information, for example:
-
-```bash
-{
-    'titles': [
-        'Rotary Pendulum RL'
-    ],
-    'descriptions': [
-        'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
-    ]
-}
 ```
 
+The output will be an audio file with the summary of the projects on the page.
+
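
Since the added `output_path` key writes the synthesized speech to disk, a minimal follow-up check (plain standard library; `audio_summary.mp3` is just the value configured in the example above) can confirm the file was produced:

```python
from pathlib import Path

# Matches the "output_path" configured in the SpeechGraph example above.
audio_file = Path("audio_summary.mp3")

if audio_file.exists():
    size_kib = audio_file.stat().st_size / 1024
    print(f"Audio summary written: {audio_file} ({size_kib:.1f} KiB)")
else:
    print("No audio file found; check the SpeechGraph configuration and API key.")
```
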
 ## 🤝 Contributing
 
 Feel free to contribute and join our Discord server to discuss with us improvements and give us suggestions!
@@ -252,6 +186,10 @@ Wanna visualize the roadmap in a more interactive way? Check out the [markmap](h
 
 ## ❤️ Contributors
 [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
+## Sponsors
+<p align="center">
+  <a href="https://serpapi.com/"><img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;"></a>
+</p>
 
 ## 🎓 Citations
 If you have used our library for research purposes please quote us with the following reference:
@@ -264,15 +202,11 @@ If you have used our library for research purposes please quote us with the foll
   note = {A Python library for scraping leveraging large language models}
 }
 ```
-## Sponsors
-<p align="center">
-  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="Scrapegraph-ai Logo" style="width: 100px;">
-</p>
 
 ## Authors
 
 <p align="center">
-  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors Logos">
+  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors_logos">
 </p>
 
 | | Contact Info |

docs/source/conf.py

Lines changed: 0 additions & 1 deletion
@@ -30,4 +30,3 @@
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
 
 html_theme = 'sphinx_rtd_theme'
-html_static_path = ['_static']

docs/source/getting_started/installation.rst

Lines changed: 2 additions & 2 deletions
@@ -8,8 +8,8 @@ Prerequisites
 ^^^^^^^^^^^^^
 
 - `Python 3.8+ <https://www.python.org/downloads/>`_
-- `pip <https://pip.pypa.io/en/stable/getting-started/>`
-- `ollama <https://ollama.com/>` *optional for local models
+- `pip <https://pip.pypa.io/en/stable/getting-started/>`_
+- `ollama <https://ollama.com/>`_ *optional for local models
 
 
 Install the library
