-ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites, documents and XML files.
+ScrapeGraphAI is a *web scraping* Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
+
 Just say which information you want to extract and the library will do it for you!
@@ -48,11 +53,16 @@ The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.r
 Also check out the Docusaurus [documentation](https://scrapegraph-doc.onrender.com/).
 
 ## 💻 Usage
-You can use the `SmartScraper` class to extract information from a website using a prompt.
+There are three main scraping pipelines that can be used to extract information from a website (or local file):
+- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
+- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
+- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
+
+It is possible to use different LLMs through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
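In the examples of this README, the choice between a hosted API and a local Ollama model shows up only in the graph configuration dictionary. The sketch below illustrates that idea with a hypothetical helper, `build_graph_config`, which is not part of ScrapeGraphAI. It only assembles a plain dict from keys that appear in the README examples (`model`, `api_key`, `temperature`, `format`, `base_url`), and it assumes the `ollama/` model prefix marks a local model.

```python
from typing import Any, Dict, Optional

# Local Ollama endpoint, as used in the README examples.
OLLAMA_URL = "http://localhost:11434"


def build_graph_config(model: str, api_key: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical helper (not a ScrapeGraphAI API): build a graph config dict."""
    llm: Dict[str, Any] = {"model": model, "temperature": 0}
    if model.startswith("ollama/"):
        # Assumption: the "ollama/" prefix marks a local model, which needs no API key.
        llm["format"] = "json"  # the README examples set the format explicitly for Ollama
    else:
        if not api_key:
            raise ValueError(f"{model} is a hosted model and needs an API key")
        llm["api_key"] = api_key
    return {
        "llm": llm,
        # Embeddings stay on the local Ollama server in both cases.
        "embeddings": {"model": "ollama/nomic-embed-text", "base_url": OLLAMA_URL},
    }


local_config = build_graph_config("ollama/mistral")
hosted_config = build_graph_config("groq/gemma-7b-it", api_key="GROQ_API_KEY")
```

Either dictionary could then be passed as `config=` to one of the graph classes, provided the chosen provider is actually reachable.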
 
-The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
-### Case 1: Extracting information using Ollama
-Remember to download the model on Ollama separately!
+### Case 1: SmartScraper using Local Models
+
+Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.
 
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
@@ -67,11 +77,12 @@ graph_config = {
     "embeddings": {
         "model": "ollama/nomic-embed-text",
         "base_url": "http://localhost:11434", # set Ollama URL
-    }
+    },
+    "verbose": True,
 }
 
 smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
+    prompt="List me all the projects with their descriptions",
     # also accepts a string with the already downloaded HTML code
     source="https://perinim.github.io/projects",
     config=graph_config
@@ -82,177 +93,103 @@ print(result)
 
 ```
 
-### Case 2: Extracting information using Docker
+The output will be a list of projects with their descriptions like the following:
 
-Note: before using the local model remember to create the docker container!
-You can use which models available on Ollama or your own model instead of stablelm-zephyr
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-
-graph_config = {
-    "llm": {
-        "model": "ollama/mistral",
-        "temperature": 0,
-        "format": "json", # Ollama needs the format to be specified explicitly
-        # "model_tokens": 2000, # set context length arbitrarily
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
-
-result = smart_scraper_graph.run()
-print(result)
+{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
 ```
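Since the sample output above is a plain Python dictionary, it can be post-processed with ordinary dict and list operations. The following is a minimal sketch under that assumption. `format_projects` is a hypothetical helper, the sample data is copied from the two entries visible in the output, and the exact keys returned in practice depend on the prompt and the model.

```python
# Sample result copied from the output shown in the README
# (only the two visible entries; the elided "..." entries are left out).
result = {
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling a real life rotary pendulum using RL algorithms",
        },
        {
            "title": "DQN Implementation from scratch",
            "description": "Developed a Deep Q-Network algorithm to train a simple and double pendulum",
        },
    ]
}


def format_projects(result: dict) -> list:
    """Hypothetical helper: build one 'title: description' line per project."""
    lines = []
    # .get() guards against missing keys, since the output shape is not guaranteed.
    for project in result.get("projects", []):
        title = project.get("title", "(untitled)")
        description = project.get("description", "")
        lines.append(f"{title}: {description}")
    return lines


for line in format_projects(result):
    print(line)
```

Using `.get()` with defaults keeps the loop from failing when the model omits a field or names the top-level key differently.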
 
+### Case 2: SearchGraph using Mixed Models
 
-### Case 3: Extracting information using Openai model
-```python
-from scrapegraphai.graphs import SmartScraperGraph
-OPENAI_API_KEY="YOUR_API_KEY"
+We use **Groq** for the LLM and **Ollama** for the embeddings.
 
-graph_config = {
-    "llm": {
-        "api_key": OPENAI_API_KEY,
-        "model": "gpt-3.5-turbo",
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
-
-result = smart_scraper_graph.run()
-print(result)
-```
-
-### Case 4: Extracting information using Groq
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-from scrapegraphai.utils import prettify_exec_info
-
-groq_key = os.getenv("GROQ_APIKEY")
+from scrapegraphai.graphs import SearchGraph
 
+# Define the configuration for the graph
 graph_config = {
     "llm": {
         "model": "groq/gemma-7b-it",
-        "api_key": groq_key,
+        "api_key": "GROQ_API_KEY",
         "temperature": 0
     },
     "embeddings": {
         "model": "ollama/nomic-embed-text",
-        "temperature": 0,
-        "base_url": "http://localhost:11434",
+        "base_url": "http://localhost:11434", # set ollama URL arbitrarily
     },
-    "headless": False
+    "max_results": 5,
 }
 
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the projects with their description and the author.",
-    source="https://perinim.github.io/projects",
+# Create the SearchGraph instance
+search_graph = SearchGraph(
+    prompt="List me all the traditional recipes from Chioggia",
     config=graph_config
 )
 
-result = smart_scraper_graph.run()
+# Run the graph
+result = search_graph.run()
 print(result)
 ```
 
+The output will be a list of recipes like the following:
 
-### Case 5: Extracting information using Azure
 ```python
-from langchain_openai import AzureChatOpenAI
-from langchain_openai import AzureOpenAIEmbeddings
 Check out the project roadmap [here](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/README.md)! 🚀
 
 Wanna visualize the roadmap in a more interactive way? Check out the [markmap](https://markmap.js.org/repl) visualization by copy-pasting the markdown content into the editor!