Commit 0b5cdd4

Merge pull request #246 from VinciGit00/main ("reallignment")

2 parents: 20604bd + 51200d6

110 files changed: +4276 / -4111 lines


.github/workflows/release.yml

Lines changed: 4 additions & 7 deletions
```diff
@@ -14,11 +14,8 @@ jobs:
         run: |
           sudo apt update
           sudo apt install -y git
-      - name: Install Python Env and Poetry
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.9'
-      - run: pip install poetry
+      - name: Install the latest version of rye
+        uses: eifinger/setup-rye@v3
      - name: Install Node Env
        uses: actions/setup-node@v4
        with:
@@ -30,8 +27,8 @@ jobs:
          persist-credentials: false
      - name: Build app
        run: |
-          poetry install
-          poetry build
+          rye sync --no-lock
+          rye build
        id: build_cache
        if: success()
      - name: Cache build
```

.gitignore

Lines changed: 2 additions & 0 deletions
```diff
@@ -31,4 +31,6 @@ examples/graph_examples/ScrapeGraphAI_generated_graph
 examples/**/result.csv
 examples/**/result.json
 main.py
+*.python-version
+*.lock
```
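The two new `.gitignore` entries keep per-machine tool state out of version control (Rye pins an interpreter in `.python-version` and writes lock files). Git's ignore rules are not exactly `fnmatch` semantics, but for simple patterns like these the behavior coincides; a small sketch with hypothetical file names (`requirements.lock` and `requirements-dev.lock` are the names Rye typically writes, used here only for illustration):

```python
from fnmatch import fnmatch

# The two patterns added to .gitignore in this commit.
patterns = ["*.python-version", "*.lock"]

# Hypothetical working-tree files, for illustration only.
files = [".python-version", "requirements.lock", "requirements-dev.lock", "main.py"]

# A file is ignored if any pattern matches it.
ignored = [f for f in files if any(fnmatch(f, p) for p in patterns)]
print(ignored)  # ['.python-version', 'requirements.lock', 'requirements-dev.lock']
```

Note that `*.python-version` also matches the bare `.python-version` (the `*` matches the empty prefix), which is why the file added in this same commit stays untracked on other machines.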

.python-version

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+3.9.19
```

CHANGELOG.md

Lines changed: 237 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 59 additions & 125 deletions
````diff
@@ -8,14 +8,14 @@
 [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)
 
 
-ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites, documents and XML files.
+ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
+
 Just say which information you want to extract and the library will do it for you!
 
 <p align="center">
   <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
 </p>
 
-
 ## 🚀 Quick install
 
 The reference page for Scrapegraph-ai is available on the official page of pypy: [pypi](https://pypi.org/project/scrapegraphai/).
@@ -39,20 +39,23 @@ Try it directly on the web using Google Colab:
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)
 
-Follow the procedure on the following link to setup your OpenAI API key: [link](https://scrapegraph-ai.readthedocs.io/en/latest/index.html).
-
 ## 📖 Documentation
 
 The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).
 
-Check out also the docusaurus [documentation](https://scrapegraph-doc.onrender.com/).
+Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).
 
 ## 💻 Usage
-You can use the `SmartScraper` class to extract information from a website using a prompt.
+There are three main scraping pipelines that can be used to extract information from a website (or local file):
+- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
+- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
+- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
+
+It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
+
+### Case 1: SmartScraper using Local Models
 
-The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
-### Case 1: Extracting information using Ollama
-Remember to download the model on Ollama separately!
+Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.
 
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
@@ -67,11 +70,12 @@ graph_config = {
     "embeddings": {
         "model": "ollama/nomic-embed-text",
         "base_url": "http://localhost:11434",  # set Ollama URL
-    }
+    },
+    "verbose": True,
 }
 
 smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
+    prompt="List me all the projects with their descriptions",
     # also accepts a string with the already downloaded HTML code
     source="https://perinim.github.io/projects",
     config=graph_config
@@ -82,160 +86,86 @@ print(result)
 
 ```
 
-### Case 2: Extracting information using Docker
+The output will be a list of projects with their descriptions like the following:
 
-Note: before using the local model remember to create the docker container!
-```text
-docker-compose up -d
-docker exec -it ollama ollama pull stablelm-zephyr
-```
-You can use which models available on Ollama or your own model instead of stablelm-zephyr
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-
-graph_config = {
-    "llm": {
-        "model": "ollama/mistral",
-        "temperature": 0,
-        "format": "json",  # Ollama needs the format to be specified explicitly
-        # "model_tokens": 2000, # set context length arbitrarily
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
-
-result = smart_scraper_graph.run()
-print(result)
+{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
 ```
 
+### Case 2: SearchGraph using Mixed Models
 
-### Case 3: Extracting information using Openai model
-```python
-from scrapegraphai.graphs import SmartScraperGraph
-OPENAI_API_KEY = "YOUR_API_KEY"
-
-graph_config = {
-    "llm": {
-        "api_key": OPENAI_API_KEY,
-        "model": "gpt-3.5-turbo",
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
+We use **Groq** for the LLM and **Ollama** for the embeddings.
 
-result = smart_scraper_graph.run()
-print(result)
-```
-
-### Case 4: Extracting information using Groq
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-from scrapegraphai.utils import prettify_exec_info
-
-groq_key = os.getenv("GROQ_APIKEY")
+from scrapegraphai.graphs import SearchGraph
 
+# Define the configuration for the graph
 graph_config = {
     "llm": {
         "model": "groq/gemma-7b-it",
-        "api_key": groq_key,
+        "api_key": "GROQ_API_KEY",
         "temperature": 0
     },
     "embeddings": {
         "model": "ollama/nomic-embed-text",
-        "temperature": 0,
-        "base_url": "http://localhost:11434",
+        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
     },
-    "headless": False
+    "max_results": 5,
 }
 
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the projects with their description and the author.",
-    source="https://perinim.github.io/projects",
+# Create the SearchGraph instance
+search_graph = SearchGraph(
+    prompt="List me all the traditional recipes from Chioggia",
     config=graph_config
 )
 
-result = smart_scraper_graph.run()
+# Run the graph
+result = search_graph.run()
 print(result)
 ```
 
+The output will be a list of recipes like the following:
 
-### Case 5: Extracting information using Azure
 ```python
-from langchain_openai import AzureChatOpenAI
-from langchain_openai import AzureOpenAIEmbeddings
-
-lm_model_instance = AzureChatOpenAI(
-    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
-    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
-)
-
-embedder_model_instance = AzureOpenAIEmbeddings(
-    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
-    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
-)
-graph_config = {
-    "llm": {"model_instance": llm_model_instance},
-    "embeddings": {"model_instance": embedder_model_instance}
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
-    event_end_date, event_end_time, location, event_mode, event_category,
-    third_party_redirect, no_of_days,
-    time_in_hours, hosted_or_attending, refreshments_type,
-    registration_available, registration_link""",
-    source="https://www.hmhco.com/event",
-    config=graph_config
-)
+{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
 ```
+### Case 3: SpeechGraph using OpenAI
+
+You just need to pass the OpenAI API key and the model name.
 
-### Case 6: Extracting information using Gemini
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-GOOGLE_APIKEY = "YOUR_API_KEY"
+from scrapegraphai.graphs import SpeechGraph
 
-# Define the configuration for the graph
 graph_config = {
     "llm": {
-        "api_key": GOOGLE_APIKEY,
-        "model": "gemini-pro",
+        "api_key": "OPENAI_API_KEY",
+        "model": "gpt-3.5-turbo",
+    },
+    "tts_model": {
+        "api_key": "OPENAI_API_KEY",
+        "model": "tts-1",
+        "voice": "alloy"
     },
+    "output_path": "audio_summary.mp3",
 }
 
-# Create the SmartScraperGraph instance
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    source="https://perinim.github.io/projects",
-    config=graph_config
+# ************************************************
+# Create the SpeechGraph instance and run it
+# ************************************************
+
+speech_graph = SpeechGraph(
+    prompt="Make a detailed audio summary of the projects.",
+    source="https://perinim.github.io/projects/",
+    config=graph_config,
 )
 
-result = smart_scraper_graph.run()
+result = speech_graph.run()
 print(result)
-```
 
-The output for all 3 the cases will be a dictionary with the extracted information, for example:
-
-```bash
-{
-    'titles': [
-        'Rotary Pendulum RL'
-    ],
-    'descriptions': [
-        'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
-    ]
-}
 ```
 
+The output will be an audio file with the summary of the projects on the page.
+
 ## 🤝 Contributing
 
 Feel free to contribute and join our Discord server to discuss with us improvements and give us suggestions!
@@ -253,6 +183,10 @@ Wanna visualize the roadmap in a more interactive way? Check out the [markmap](h
 
 ## ❤️ Contributors
 [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
+## Sponsors
+<p align="center">
+  <a href="https://serpapi.com?utm_source=scrapegraphai"><img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;"></a>
+</p>
 
 ## 🎓 Citations
 If you have used our library for research purposes please quote us with the following reference:
@@ -269,7 +203,7 @@ If you have used our library for research purposes please quote us with the foll
 ## Authors
 
 <p align="center">
-  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors Logos">
+  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors_logos">
 </p>
 
 | | Contact Info |
````
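The three pipelines the README now documents (`SmartScraperGraph`, `SearchGraph`, `SpeechGraph`) share one call shape: build a `graph_config` dict, construct the graph with a prompt (plus a `source` for the single-page scrapers), and call `run()`. A minimal stand-in illustrating that shape (the `MockGraph` class is hypothetical and performs no scraping or LLM calls, it just echoes its inputs):

```python
# Hypothetical stand-in mirroring the call shape of the ScrapeGraphAI
# graphs shown in the README diff; no network or LLM access involved.
class MockGraph:
    def __init__(self, prompt, config, source=None):
        self.prompt = prompt
        self.source = source  # SearchGraph takes no source; hence the default
        self.config = config

    def run(self):
        # The real graphs return a dict of extracted data; we echo the inputs.
        return {"prompt": self.prompt, "source": self.source}

graph_config = {
    "llm": {"model": "ollama/mistral", "temperature": 0},
    "verbose": True,
}

graph = MockGraph(
    prompt="List me all the projects with their descriptions",
    source="https://perinim.github.io/projects",
    config=graph_config,
)
result = graph.run()
print(result["source"])  # https://perinim.github.io/projects
```

The point of the shared shape is that swapping pipelines (or LLM backends) only changes the class and the `graph_config` contents, not the surrounding code.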

Binary assets changed (images, not rendered in the diff view):

- docs/assets/omniscrapergraph.png (72.2 KB)
- docs/assets/omnisearchgraph.png (56.7 KB)
- two further binary files (51.8 KB and 82 KB; filenames not shown in the capture)
- docs/assets/searchgraph.png (53.3 KB)
- docs/assets/serp_api_logo.png (15.1 KB)
- docs/assets/smartscrapergraph.png (59.7 KB)
- docs/assets/speechgraph.png (48.2 KB)

docs/source/conf.py

Lines changed: 22 additions & 6 deletions
```diff
@@ -14,20 +14,36 @@
 # import all the modules
 sys.path.insert(0, os.path.abspath('../../'))
 
-project = 'scrapegraphai'
-copyright = '2024, Marco Vinciguerra'
-author = 'Marco Vinciguerra'
+project = 'ScrapeGraphAI'
+copyright = '2024, ScrapeGraphAI'
+author = 'Marco Vinciguerra, Marco Perini, Lorenzo Padoan'
+
+html_last_updated_fmt = "%b %d, %Y"
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
 
-extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon']
+extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon','sphinx_wagtail_theme']
 
 templates_path = ['_templates']
 exclude_patterns = []
 
 # -- Options for HTML output -------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
 
-html_theme = 'sphinx_rtd_theme'
-html_static_path = ['_static']
+# html_theme = 'sphinx_rtd_theme'
+html_theme = 'sphinx_wagtail_theme'
+
+html_theme_options = dict(
+    project_name = "ScrapeGraphAI",
+    logo = "scrapegraphai_logo.png",
+    logo_alt = "ScrapeGraphAI",
+    logo_height = 59,
+    logo_url = "https://scrapegraph-ai.readthedocs.io/en/latest/",
+    logo_width = 45,
+    github_url = "https://github.com/VinciGit00/Scrapegraph-ai/tree/main/docs/source/",
+    footer_links = ",".join(
+        ["Landing Page|https://scrapegraphai.com/",
+         "Docusaurus|https://scrapegraph-doc.onrender.com/docs/intro"]
+    ),
+)
```
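The `footer_links` value built in the new `html_theme_options` is, as I read the sphinx-wagtail-theme convention, a single comma-separated string of `Label|URL` pairs. A quick check of what the `join` in the diff actually produces, and how such a string splits back into pairs (the splitting code is illustrative, not part of the theme):

```python
# Reproduces the footer_links expression from the conf.py diff.
footer_links = ",".join(
    ["Landing Page|https://scrapegraphai.com/",
     "Docusaurus|https://scrapegraph-doc.onrender.com/docs/intro"]
)
print(footer_links)
# Landing Page|https://scrapegraphai.com/,Docusaurus|https://scrapegraph-doc.onrender.com/docs/intro

# Illustrative: split back into (label, url) pairs on "," then the first "|".
pairs = [tuple(item.split("|", 1)) for item in footer_links.split(",")]
```

Because the separator is a bare comma, none of the labels or URLs may themselves contain one.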

docs/source/getting_started/examples.rst

Lines changed: 5 additions & 2 deletions
```diff
@@ -1,7 +1,9 @@
 Examples
 ========
 
-Here some example of the different ways to scrape with ScrapegraphAI
+Let's suppose you want to scrape a website to get a list of projects with their descriptions.
+You can use the `SmartScraperGraph` class to do that.
+The following examples show how to use the `SmartScraperGraph` class with OpenAI models and local models.
 
 OpenAI models
 ^^^^^^^^^^^^^
@@ -78,7 +80,7 @@ After that, you can run the following code, using only your machine resources br
    # ************************************************
 
    smart_scraper_graph = SmartScraperGraph(
-      prompt="List me all the news with their description.",
+      prompt="List me all the projects with their description.",
       # also accepts a string with the already downloaded HTML code
       source="https://perinim.github.io/projects",
       config=graph_config
@@ -87,3 +89,4 @@ After that, you can run the following code, using only your machine resources br
    result = smart_scraper_graph.run()
    print(result)
 
+To find out how you can customize the `graph_config` dictionary, by using different LLM and adding new parameters, check the `Scrapers` section!
```
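The closing note added to examples.rst points readers at customizing `graph_config` with different LLMs and new parameters. One plain-Python way to do that (a sketch of ordinary dict merging, not a ScrapeGraphAI API; the parameter names are taken from the examples in this commit) is to keep a base config and layer overrides on top:

```python
# Base configuration as used in the local-model examples (values illustrative).
base_config = {
    "llm": {"model": "ollama/mistral", "temperature": 0},
}

# Add new top-level parameters without mutating the base dict:
# dict unpacking copies base_config, then the extra keys are applied.
custom_config = {**base_config, "verbose": True, "headless": False}

print(sorted(custom_config))  # ['headless', 'llm', 'verbose']
```

Note this is a shallow merge: the nested `"llm"` dict is shared, so overriding a single nested key (e.g. `temperature`) needs its own inner merge.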
