ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).

Just say which information you want to extract and the library will do it for you!
The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).
Check out also the docusaurus [documentation](https://scrapegraph-doc.onrender.com/).
## 💻 Usage
There are three main scraping pipelines that can be used to extract information from a website (or local file):
- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
It is possible to use different LLMs through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
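For example, switching the same pipeline to a hosted model only requires changing the `llm` entry of the graph configuration. The sketch below follows the OpenAI setup from an earlier revision of this README; the API key is a placeholder and the model name is only an example:

```python
from scrapegraphai.graphs import SmartScraperGraph

OPENAI_API_KEY = "YOUR_API_KEY"  # placeholder, set your real key

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",  # any supported OpenAI chat model
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    source="https://perinim.github.io/projects",
    config=graph_config
)

print(smart_scraper_graph.run())
```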
### Case 1: SmartScraper using Local Models
Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.
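If you prefer to script that step, the small sketch below (not part of the library) shells out to the Ollama CLI to pull the two models used in the configuration that follows:

```python
import subprocess

# Pull the LLM and the embedding model used below; equivalent to running
# `ollama pull mistral` and `ollama pull nomic-embed-text` in a terminal.
for model in ("mistral", "nomic-embed-text"):
    subprocess.run(["ollama", "pull", model], check=True)
```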
```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
```
The output will be a list of projects with their descriptions like the following:

```python
{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
```
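Since `run()` returns a plain Python dictionary, the output can be post-processed directly. Continuing the example above and assuming the `projects` key shown in the sample output:

```python
# Print one summary line per extracted project.
for project in result["projects"]:
    print(f"{project['title']}: {project['description']}")
```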
### Case 2: SearchGraph using Mixed Models
We use **Groq** for the LLM and **Ollama** for the embeddings.
```python
from scrapegraphai.graphs import SearchGraph

# Define the configuration for the graph
graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "GROQ_API_KEY",
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "max_results": 5,
}

# Create the SearchGraph instance
search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config
)

# Run the graph
result = search_graph.run()
print(result)
```
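The third pipeline, `SpeechGraph`, follows the same prompt/source/config pattern as the other pipelines and additionally writes an audio file. The sketch below only illustrates that pattern: the `tts_model` and `output_path` keys and the OpenAI model names are assumptions and may differ between versions, so check the documentation for the exact schema.

```python
from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",  # placeholder
        "model": "gpt-3.5-turbo",
    },
    # Assumed keys for the text-to-speech step; verify against the official docs.
    "tts_model": {
        "api_key": "YOUR_OPENAI_API_KEY",
        "model": "tts-1",
        "voice": "alloy",
    },
    "output_path": "projects_summary.mp3",  # where the generated audio is saved
}

speech_graph = SpeechGraph(
    prompt="Create a short audio summary of the projects.",
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = speech_graph.run()
```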