-ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites, documents and XML files.
+ScrapeGraphAI is a *web scraping* Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
+
 Just say which information you want to extract and the library will do it for you!
 The reference page for Scrapegraph-ai is available on PyPI: [pypi](https://pypi.org/project/scrapegraphai/).
@@ -39,20 +39,23 @@ Try it directly on the web using Google Colab:

 [](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)

-Follow the procedure on the following link to setup your OpenAI API key: [link](https://scrapegraph-ai.readthedocs.io/en/latest/index.html).
-
 ## 📖 Documentation

 The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).

-Check out also the docusaurus [documentation](https://scrapegraph-doc.onrender.com/).
+Also check out the Docusaurus documentation [here](https://scrapegraph-doc.onrender.com/).

 ## 💻 Usage
-You can use the `SmartScraper` class to extract information from a website using a prompt.
+There are three main scraping pipelines that can be used to extract information from a website (or local file):
+- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
+- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
+- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
+
+It is possible to use different LLMs through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
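In the examples of this README the provider is chosen through the prefix of the model string (`ollama/...`, `groq/...`) inside the graph configuration. The following is a minimal sketch of how such a configuration could be assembled. The helper `make_graph_config` is hypothetical and not part of ScrapeGraphAI, and it only uses configuration keys that appear in this README's examples (`model`, `temperature`, `format`, `api_key`, `base_url`, `verbose`).

```python
from typing import Any, Dict, Optional

# Local Ollama endpoint used in this README's examples.
OLLAMA_URL = "http://localhost:11434"


def make_graph_config(
    llm_model: str,
    api_key: Optional[str] = None,
    embeddings_model: Optional[str] = None,
    verbose: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper (not a ScrapeGraphAI API): build a graph_config dict."""
    llm: Dict[str, Any] = {"model": llm_model, "temperature": 0}
    if llm_model.startswith("ollama/"):
        # The README notes that Ollama needs the format to be set explicitly.
        llm["format"] = "json"
    elif api_key is not None:
        # Hosted providers (e.g. Groq, OpenAI) take an API key.
        llm["api_key"] = api_key

    config: Dict[str, Any] = {"llm": llm, "verbose": verbose}
    if embeddings_model is not None:
        embeddings: Dict[str, Any] = {"model": embeddings_model}
        if embeddings_model.startswith("ollama/"):
            embeddings["base_url"] = OLLAMA_URL
        config["embeddings"] = embeddings
    return config


# Fully local setup: Ollama for both the LLM and the embeddings.
local_config = make_graph_config(
    "ollama/mistral", embeddings_model="ollama/nomic-embed-text"
)

# Mixed setup: Groq for the LLM, Ollama for the embeddings.
mixed_config = make_graph_config(
    "groq/gemma-7b-it",
    api_key="GROQ_API_KEY",  # placeholder, replace with a real key
    embeddings_model="ollama/nomic-embed-text",
)
print(local_config)
print(mixed_config)
```

Either dictionary would then be passed as `config=` to one of the graph classes listed above.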
+
+### Case 1: SmartScraper using Local Models

-The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
-### Case 1: Extracting information using Ollama
-Remember to download the model on Ollama separately!
+Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.
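As a rough illustration of the download step, the sketch below builds the `ollama pull` commands for the model strings used in this README's examples. It assumes that the `ollama/` prefix in the configuration is dropped to obtain the name Ollama itself expects; the helper names `pull_command` and `pull_models` are hypothetical.

```python
import shutil
import subprocess
from typing import List

OLLAMA_PREFIX = "ollama/"


def pull_command(model: str) -> List[str]:
    """Build the `ollama pull` command for a model string such as 'ollama/mistral'."""
    # Assumption: the 'ollama/' prefix is only a provider tag in the graph config.
    name = model[len(OLLAMA_PREFIX):] if model.startswith(OLLAMA_PREFIX) else model
    return ["ollama", "pull", name]


def pull_models(models: List[str]) -> None:
    """Run `ollama pull` for each model; requires the Ollama CLI on PATH."""
    if shutil.which("ollama") is None:
        raise RuntimeError("ollama executable not found on PATH; install Ollama first")
    for model in models:
        subprocess.run(pull_command(model), check=True)


# Show the commands without executing them.
for model in ["ollama/mistral", "ollama/nomic-embed-text"]:
    print(" ".join(pull_command(model)))
```

Once Ollama is installed, calling `pull_models(["ollama/mistral", "ollama/nomic-embed-text"])` would run the same commands instead of printing them.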

 ```python
 from scrapegraphai.graphs import SmartScraperGraph
@@ -67,11 +70,12 @@ graph_config = {
     "embeddings": {
         "model": "ollama/nomic-embed-text",
         "base_url": "http://localhost:11434",  # set Ollama URL
-    }
+    },
+    "verbose": True,
 }

 smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
+    prompt="List me all the projects with their descriptions",
     # also accepts a string with the already downloaded HTML code
     source="https://perinim.github.io/projects",
     config=graph_config
@@ -82,160 +86,86 @@ print(result)

 ```

-### Case 2: Extracting information using Docker
+The output will be a list of projects with their descriptions like the following:

-Note: before using the local model remember to create the docker container!
-You can use which models available on Ollama or your own model instead of stablelm-zephyr
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-
-graph_config = {
-    "llm": {
-        "model": "ollama/mistral",
-        "temperature": 0,
-        "format": "json",  # Ollama needs the format to be specified explicitly
-        # "model_tokens": 2000,  # set context length arbitrarily
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
-
-result = smart_scraper_graph.run()
-print(result)
+{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
 ```
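A result shaped like the example output above is a plain Python dictionary, so it can be post-processed directly. The sketch below uses the first two entries of that output as sample data. Since the key names (`projects`, `title`, `description`) are produced by the LLM and may differ between runs, the hypothetical helper `project_titles` reads them defensively with `.get`.

```python
from typing import Any, Dict, List

# Sample data copied from the example output (first two entries only).
result: Dict[str, Any] = {
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling a real life rotary pendulum using RL algorithms",
        },
        {
            "title": "DQN Implementation from scratch",
            "description": "Developed a Deep Q-Network algorithm to train a simple and double pendulum",
        },
    ]
}


def project_titles(data: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: collect titles, tolerating missing keys."""
    return [project.get("title", "") for project in data.get("projects", [])]


# Print one line per project.
for project in result.get("projects", []):
    title = project.get("title", "(untitled)")
    description = project.get("description", "")
    print(f"- {title}: {description}")
```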

+### Case 2: SearchGraph using Mixed Models

-### Case 3: Extracting information using Openai model
-```python
-from scrapegraphai.graphs import SmartScraperGraph
-OPENAI_API_KEY="YOUR_API_KEY"
-
-graph_config = {
-    "llm": {
-        "api_key": OPENAI_API_KEY,
-        "model": "gpt-3.5-turbo",
-    },
-}
-
-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the articles",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
-    config=graph_config
-)
+We use **Groq** for the LLM and **Ollama** for the embeddings.

-result = smart_scraper_graph.run()
-print(result)
-```
-
-### Case 4: Extracting information using Groq
 ```python
-from scrapegraphai.graphs import SmartScraperGraph
-from scrapegraphai.utils import prettify_exec_info
-
-groq_key = os.getenv("GROQ_APIKEY")
+from scrapegraphai.graphs import SearchGraph

+# Define the configuration for the graph
 graph_config = {
     "llm": {
         "model": "groq/gemma-7b-it",
-        "api_key": groq_key,
+        "api_key": "GROQ_API_KEY",  # placeholder: replace with your Groq API key
         "temperature": 0
     },
     "embeddings": {
         "model": "ollama/nomic-embed-text",
-        "temperature": 0,
-        "base_url": "http://localhost:11434",
+        "base_url": "http://localhost:11434",  # set Ollama URL
     },
-    "headless": False
+    "max_results": 5,
 }

-smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the projects with their description and the author.",
-    source="https://perinim.github.io/projects",
+# Create the SearchGraph instance
+search_graph = SearchGraph(
+    prompt="List me all the traditional recipes from Chioggia",
     config=graph_config
 )

-result = smart_scraper_graph.run()
+# Run the graph
+result = search_graph.run()
 print(result)
 ```

+The output will be a list of recipes like the following:

-### Case 5: Extracting information using Azure
 ```python
-from langchain_openai import AzureChatOpenAI
-from langchain_openai import AzureOpenAIEmbeddings