Commit ae5655f

docs(readme): improve main readme
1 parent: cc28d5a

1 file changed

README.md

Lines changed: 36 additions & 112 deletions

````diff
@@ -48,11 +48,16 @@ The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.r
 Check out also the docusaurus [documentation](https://scrapegraph-doc.onrender.com/).
 
 ## 💻 Usage
-You can use the `SmartScraper` class to extract information from a website using a prompt.
+There are three main scraping pipelines that can be used to extract information from a website (or local file):
+- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
+- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
+- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
 
-The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
-### Case 1: Extracting information using Ollama
-Remember to download the model on Ollama separately!
+It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
+
+### Case 1: SmartScraper using Local Models
+
+Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.
 
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
@@ -67,11 +72,12 @@ graph_config = {
     "embeddings": {
         "model": "ollama/nomic-embed-text",
         "base_url": "http://localhost:11434", # set Ollama URL
-    }
+    },
+    "verbose": True,
 }
 
 smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
+    prompt="List me all the projects with their descriptions",
     # also accepts a string with the already downloaded HTML code
     source="https://perinim.github.io/projects",
     config=graph_config
@@ -82,159 +88,77 @@ print(result)
 
 ```
 
-### Case 2: Extracting information using Docker
+The output will be a list of projects with their descriptions like the following:
 
-Note: before using the local model remember to create the docker container!
-```text
-docker-compose up -d
-docker exec -it ollama ollama pull stablelm-zephyr
-```
-You can use which models avaiable on Ollama or your own model instead of stablelm-zephyr
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-
-graph_config = {
-    "llm": {
-        "model": "ollama/mistral",
-        "temperature": 0,
-        "format": "json", # Ollama needs the format to be specified explicitly
-        # "model_tokens": 2000, # set context length arbitrarily
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
-
-result = smart_scraper_graph.run()
-print(result)
+{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
 ```
 
+### Case 2: SearchGraph using Mixed Models
 
-### Case 3: Extracting information using Openai model
-```python
-from scrapegraphai.graphs import SmartScraperGraph
-OPENAI_API_KEY = "YOUR_API_KEY"
-
-graph_config = {
-    "llm": {
-        "api_key": OPENAI_API_KEY,
-        "model": "gpt-3.5-turbo",
-    },
-}
+We use **Groq** for the LLM and **Ollama** for the embeddings.
 
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
-
-result = smart_scraper_graph.run()
-print(result)
-```
-
-### Case 4: Extracting information using Groq
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-from scrapegraphai.utils import prettify_exec_info
-
-groq_key = os.getenv("GROQ_APIKEY")
+from scrapegraphai.graphs import SearchGraph
 
+# Define the configuration for the graph
 graph_config = {
     "llm": {
         "model": "groq/gemma-7b-it",
-        "api_key": groq_key,
+        "api_key": "GROQ_API_KEY",
         "temperature": 0
     },
     "embeddings": {
         "model": "ollama/nomic-embed-text",
-        "temperature": 0,
-        "base_url": "http://localhost:11434",
+        "base_url": "http://localhost:11434", # set ollama URL arbitrarily
     },
-    "headless": False
+    "max_results": 5,
 }
 
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the projects with their description and the author.",
-    source="https://perinim.github.io/projects",
+# Create the SearchGraph instance
+search_graph = SearchGraph(
+    prompt="List me all the traditional recipes from Chioggia",
     config=graph_config
 )
 
-result = smart_scraper_graph.run()
+# Run the graph
+result = search_graph.run()
 print(result)
 ```
 
+The output will be a list of recipes like the following:
 
-### Case 5: Extracting information using Azure
 ```python
-from langchain_openai import AzureChatOpenAI
-from langchain_openai import AzureOpenAIEmbeddings
-
-lm_model_instance = AzureChatOpenAI(
-    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
-    azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
-)
-
-embedder_model_instance = AzureOpenAIEmbeddings(
-    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
-    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
-)
-graph_config = {
-    "llm": {"model_instance": llm_model_instance},
-    "embeddings": {"model_instance": embedder_model_instance}
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
-    event_end_date, event_end_time, location, event_mode, event_category,
-    third_party_redirect, no_of_days,
-    time_in_hours, hosted_or_attending, refreshments_type,
-    registration_available, registration_link""",
-    source="https://www.hmhco.com/event",
-    config=graph_config
-)
+{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
 ```
+### Case 3: SpeechGraph using OpenAI
+
+You just need to pass the OpenAI API key and the model name.
 
-### Case 6: Extracting information using Gemini
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
-GOOGLE_APIKEY = "YOUR_API_KEY"
 
 # Define the configuration for the graph
 graph_config = {
     "llm": {
-        "api_key": GOOGLE_APIKEY,
-        "model": "gemini-pro",
+        "api_key": "OPENAI_API_KEY",
+        "model": "gpt-3.5-turbo",
     },
 }
 
 # Create the SmartScraperGraph instance
 smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
+    prompt="Make a detailed audio summary of the projects on this page",
    source="https://perinim.github.io/projects",
    config=graph_config
 )
 
+# Run the graph
 result = smart_scraper_graph.run()
 print(result)
 ```
 
-The output for all 3 the cases will be a dictionary with the extracted information, for example:
-
-```bash
-{
-    'titles': [
-        'Rotary Pendulum RL'
-    ],
-    'descriptions': [
-        'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
-    ]
-}
-```
+The output will be an audio file with the summary of the projects on the page.
 
 ## 🤝 Contributing
 
````
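The new Case 3 heading refers to `SpeechGraph`, but the unchanged context lines around it still build a `SmartScraperGraph`. For readers wondering what the dedicated pipeline looks like, here is a minimal sketch of a `SpeechGraph` call, assuming the same prompt/source/config pattern as the other graphs; the `tts_model` and `output_path` keys and their values are assumptions not confirmed by this commit, so check the project documentation for the exact names.

```python
from scrapegraphai.graphs import SpeechGraph

OPENAI_API_KEY = "YOUR_API_KEY"  # placeholder, not a real key

# Assumed configuration: an LLM for the summary plus a text-to-speech model
# for the audio output; the key names below are assumptions, not taken from this commit.
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": OPENAI_API_KEY,
        "model": "tts-1",
        "voice": "alloy",
    },
    "output_path": "projects_summary.mp3",  # where the generated audio would be written
}

speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects on this page",
    source="https://perinim.github.io/projects",
    config=graph_config,
)

result = speech_graph.run()
print(result)
```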
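The new intro also states that the same pipelines can run against hosted APIs (OpenAI, Groq, Azure, Gemini) or local models via Ollama. As a quick illustration of that design, here is a minimal sketch, pieced together from the configurations that appear elsewhere in this diff, showing that only `graph_config` changes between a local and a hosted run while the `SmartScraperGraph` call stays the same; it assumes `scrapegraphai` is installed, an Ollama server is running locally with the `mistral` model pulled, and that `YOUR_API_KEY` is replaced with a real OpenAI key.

```python
from scrapegraphai.graphs import SmartScraperGraph

# Local run via Ollama (Ollama needs the output format specified explicitly)
ollama_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",
    },
}

# Hosted run via the OpenAI API; only the configuration changes
openai_config = {
    "llm": {
        "api_key": "YOUR_API_KEY",  # placeholder, not a real key
        "model": "gpt-3.5-turbo",
    },
}

for config in (ollama_config, openai_config):
    graph = SmartScraperGraph(
        prompt="List me all the projects with their descriptions",
        source="https://perinim.github.io/projects",
        config=config,
    )
    print(graph.run())
```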