[](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)
-
-## 📖 Documentation
-
-The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).
-
-Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).
-
 ## 💻 Usage
-There are multiple standard scraping pipelines that can be used to extract information from a website (or local file):
-- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
-- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
-- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
-- `ScriptCreatorGraph`: single-page scraper that extracts information from a website and generates a Python script.
+There are multiple standard scraping pipelines that can be used to extract information from a website (or local file).
 
-- `SmartScraperMultiGraph`: multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources;
-- `ScriptCreatorMultiGraph`: multi-page scraper that generates a Python script for extracting information from multiple pages given a single prompt and a list of sources.
-
-It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
+The most common one is the `SmartScraperGraph`, which extracts information from a single page given a user prompt and a source URL.
 
-### Case 1: SmartScraper using Local Models
-
-Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.
 
 ```python
+import json
 from scrapegraphai.graphs import SmartScraperGraph
 
+# Define the configuration for the scraping pipeline
 graph_config = {
     "llm": {
-        "model": "ollama/mistral",
-        "temperature": 0,
-        "format": "json",  # Ollama needs the format to be specified explicitly
-        "base_url": "http://localhost:11434",  # set Ollama URL
-    },
-    "embeddings": {
-        "model": "ollama/nomic-embed-text",
-        "base_url": "http://localhost:11434",  # set Ollama URL
+        "api_key": "YOUR_OPENAI_APIKEY",
+        "model": "gpt-4o-mini",
     },
     "verbose": True,
+    "headless": False,
 }
 
+# Create the SmartScraperGraph instance
 smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the projects with their descriptions",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
+    prompt="Find some information about what does the company do, the name and a contact email.",
+    source="https://scrapegraphai.com/",
     config=graph_config
 )
 
+# Run the pipeline
 result = smart_scraper_graph.run()
-print(result)
-
+print(json.dumps(result, indent=4))
 ```
 
-The output will be a list of projects with their descriptions like the following:
+The output will be a dictionary like the following:
 
 ```python
-{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
-```
-
-### Case 2: SearchGraph using Mixed Models
-
-We use **Groq** for the LLM and **Ollama** for the embeddings.
-
-```python
-from scrapegraphai.graphs import SearchGraph
-
-# Define the configuration for the graph
-graph_config = {
-    "llm": {
-        "model": "groq/gemma-7b-it",
-        "api_key": "GROQ_API_KEY",
-        "temperature": 0
-    },
-    "embeddings": {
-        "model": "ollama/nomic-embed-text",
-        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
-    },
-    "max_results": 5,
+{
+    "company": "ScrapeGraphAI",
+    "name": "ScrapeGraphAI Extracting content from websites and local documents using LLM",
[](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)
 
-speech_graph = SpeechGraph(
-    prompt="Make a detailed audio summary of the projects.",
-    source="https://perinim.github.io/projects/",
-    config=graph_config,
-)
+## 📖 Documentation
 
-result = speech_graph.run()
-print(result)
+The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).
 
-```
+Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).
 
-The output will be an audio file with the summary of the projects on the page.
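
The two `graph_config` variants in this diff differ mainly in the `llm` block: the removed example targets a local Ollama server (with a separate `embeddings` block), while the added one targets an OpenAI model and sets `headless`. As a rough sketch of how the two shapes compare, the hypothetical helper below returns either dictionary using only the keys and values that appear in the diff. `build_graph_config` is not part of ScrapeGraphAI, and the provider labels `"ollama"` and `"openai"` are assumptions made for illustration. Whether a given library release still accepts the removed `embeddings` block is not confirmed by this diff.

```python
from typing import Optional

# Local Ollama endpoint, copied from the removed README example.
OLLAMA_URL = "http://localhost:11434"


def build_graph_config(provider: str, api_key: Optional[str] = None) -> dict:
    """Return a graph_config dict shaped like the README examples.

    Hypothetical helper for illustration only; not a ScrapeGraphAI API.
    """
    if provider == "ollama":
        # Shape of the removed "Case 1" example (local models).
        return {
            "llm": {
                "model": "ollama/mistral",
                "temperature": 0,
                "format": "json",  # the old README notes Ollama needs this set explicitly
                "base_url": OLLAMA_URL,
            },
            "embeddings": {
                "model": "ollama/nomic-embed-text",
                "base_url": OLLAMA_URL,
            },
            "verbose": True,
        }
    if provider == "openai":
        # Shape of the added example (hosted model, visible browser).
        if not api_key:
            raise ValueError("an API key is required for the OpenAI configuration")
        return {
            "llm": {"api_key": api_key, "model": "gpt-4o-mini"},
            "verbose": True,
            "headless": False,
        }
    raise ValueError(f"unsupported provider: {provider!r}")
```

The returned dictionary would then be passed as `config=` to `SmartScraperGraph`, exactly as `graph_config` is in the README example.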