Docs main readme #207

Merged
merged 4 commits into from
May 10, 2024
174 changes: 54 additions & 120 deletions README.md
Expand Up @@ -8,7 +8,8 @@
[![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)


ScrapeGraphAI is a *web scraping* Python library that uses LLMs and direct graph logic to create scraping pipelines for websites, documents and XML files.
ScrapeGraphAI is a *web scraping* Python library that uses LLMs and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).

Just say which information you want to extract and the library will do it for you!

<p align="center">
Expand Down Expand Up @@ -52,11 +53,16 @@ The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.r
Also check out the Docusaurus [documentation](https://scrapegraph-doc.onrender.com/).

## 💻 Usage
You can use the `SmartScraper` class to extract information from a website using a prompt.
There are three main scraping pipelines that can be used to extract information from a website (or local file):
- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.

You can use different LLMs through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
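
As a minimal sketch of how the provider choice shows up in practice, the snippet below builds a `graph_config` dictionary for either a local Ollama model or an API-backed OpenAI model. The helper `make_graph_config` and the `OPENAI_API_KEY` environment variable name are hypothetical and not part of ScrapeGraphAI; the dictionary keys mirror the examples in this README.

```python
import os


def make_graph_config(provider: str) -> dict:
    """Build a graph_config dict for the given provider (hypothetical helper)."""
    if provider == "ollama":
        # Local models: both the LLM and the embeddings are served by Ollama.
        return {
            "llm": {
                "model": "ollama/mistral",
                "temperature": 0,
                "format": "json",  # Ollama needs the format to be specified explicitly
            },
            "embeddings": {
                "model": "ollama/nomic-embed-text",
                "base_url": "http://localhost:11434",  # default local Ollama URL
            },
        }
    if provider == "openai":
        # API-backed model: the key is read from the environment (assumed name).
        return {
            "llm": {
                "api_key": os.getenv("OPENAI_API_KEY", ""),
                "model": "gpt-3.5-turbo",
            },
        }
    raise ValueError(f"unsupported provider: {provider!r}")


graph_config = make_graph_config("ollama")
print(graph_config["llm"]["model"])
```

The resulting dictionary can then be passed as the `config` argument of a graph such as `SmartScraperGraph`, as in the cases below.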

### Case 1: SmartScraper using Local Models

The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
### Case 1: Extracting information using Ollama
Remember to download the model on Ollama separately!
Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.
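
To see which models need to be pulled for a given configuration, a small helper can list the matching commands. This is only a sketch: `ollama_pull_commands` is a hypothetical function, and it assumes that the part after the `ollama/` prefix in the config is the model name that `ollama pull` expects.

```python
from typing import List


def ollama_pull_commands(graph_config: dict) -> List[str]:
    """List the `ollama pull` commands for every Ollama model in the config."""
    prefix = "ollama/"
    commands = []
    for section in ("llm", "embeddings"):
        model = graph_config.get(section, {}).get("model", "")
        if model.startswith(prefix):
            # Assumption: the text after "ollama/" is the Ollama model name.
            commands.append("ollama pull " + model[len(prefix):])
    return commands


example_config = {
    "llm": {"model": "ollama/mistral"},
    "embeddings": {"model": "ollama/nomic-embed-text"},
}
for command in ollama_pull_commands(example_config):
    print(command)
```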

```python
from scrapegraphai.graphs import SmartScraperGraph
Expand All @@ -71,11 +77,12 @@ graph_config = {
"embeddings": {
"model": "ollama/nomic-embed-text",
"base_url": "http://localhost:11434", # set Ollama URL
}
},
"verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
prompt="List me all the articles",
prompt="List me all the projects with their descriptions",
# also accepts a string with the already downloaded HTML code
source="https://perinim.github.io/projects",
config=graph_config
Expand All @@ -86,159 +93,86 @@ print(result)

```

### Case 2: Extracting information using Docker
The output will be a list of projects with their descriptions like the following:

Note: before using the local model remember to create the docker container!
```text
docker-compose up -d
docker exec -it ollama ollama pull stablelm-zephyr
```
You can use any model available on Ollama, or your own model, instead of stablelm-zephyr.
```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
"llm": {
"model": "ollama/mistral",
"temperature": 0,
"format": "json", # Ollama needs the format to be specified explicitly
# "model_tokens": 2000, # set context length arbitrarily
},
}

smart_scraper_graph = SmartScraperGraph(
prompt="List me all the articles",
# also accepts a string with the already downloaded HTML code
source="https://perinim.github.io/projects",
config=graph_config
)

result = smart_scraper_graph.run()
print(result)
{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
```
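
Since the result is a plain Python dictionary, it can be post-processed with ordinary code. The sketch below assumes the result has the shape of the sample output above (a `projects` list whose entries have `title` and `description`); the actual keys depend on the prompt and the model, so the lookups fall back to defaults. The function name `summarize_projects` is illustrative only.

```python
# Sample result copied from the output above, truncated to two entries.
result = {
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling a real life rotary pendulum using RL algorithms",
        },
        {
            "title": "DQN Implementation from scratch",
            "description": "Developed a Deep Q-Network algorithm to train a simple and double pendulum",
        },
    ]
}


def summarize_projects(result: dict) -> list:
    """Turn the result dict into one 'title: description' line per project."""
    lines = []
    for project in result.get("projects", []):
        title = project.get("title", "<untitled>")
        description = project.get("description", "")
        lines.append(f"{title}: {description}")
    return lines


for line in summarize_projects(result):
    print(line)
```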

### Case 2: SearchGraph using Mixed Models

### Case 3: Extracting information using Openai model
```python
from scrapegraphai.graphs import SmartScraperGraph
OPENAI_API_KEY = "YOUR_API_KEY"

graph_config = {
"llm": {
"api_key": OPENAI_API_KEY,
"model": "gpt-3.5-turbo",
},
}

smart_scraper_graph = SmartScraperGraph(
prompt="List me all the articles",
# also accepts a string with the already downloaded HTML code
source="https://perinim.github.io/projects",
config=graph_config
)

result = smart_scraper_graph.run()
print(result)
```
We use **Groq** for the LLM and **Ollama** for the embeddings.

### Case 4: Extracting information using Groq
```python
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

groq_key = os.getenv("GROQ_APIKEY")
from scrapegraphai.graphs import SearchGraph

# Define the configuration for the graph
graph_config = {
"llm": {
"model": "groq/gemma-7b-it",
"api_key": groq_key,
"api_key": "GROQ_API_KEY",
"temperature": 0
},
"embeddings": {
"model": "ollama/nomic-embed-text",
"temperature": 0,
"base_url": "http://localhost:11434",
"base_url": "http://localhost:11434", # set ollama URL arbitrarily
},
"headless": False
"max_results": 5,
}

smart_scraper_graph = SmartScraperGraph(
prompt="List me all the projects with their description and the author.",
source="https://perinim.github.io/projects",
# Create the SearchGraph instance
search_graph = SearchGraph(
prompt="List me all the traditional recipes from Chioggia",
config=graph_config
)

result = smart_scraper_graph.run()
# Run the graph
result = search_graph.run()
print(result)
```

### Case 5: Extracting information using Azure
```python
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
The output will be a list of recipes like the following:

llm_model_instance = AzureChatOpenAI(
openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
azure_deployment=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT_NAME"]
)

embedder_model_instance = AzureOpenAIEmbeddings(
azure_deployment=os.environ["AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"],
openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
)
graph_config = {
"llm": {"model_instance": llm_model_instance},
"embeddings": {"model_instance": embedder_model_instance}
}

smart_scraper_graph = SmartScraperGraph(
prompt="""List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time,
event_end_date, event_end_time, location, event_mode, event_category,
third_party_redirect, no_of_days,
time_in_hours, hosted_or_attending, refreshments_type,
registration_available, registration_link""",
source="https://www.hmhco.com/event",
config=graph_config
)
```python
{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
```
### Case 3: SpeechGraph using OpenAI

You just need to pass the OpenAI API key and the model name.

### Case 6: Extracting information using Gemini
```python
from scrapegraphai.graphs import SmartScraperGraph
GOOGLE_APIKEY = "YOUR_API_KEY"
from scrapegraphai.graphs import SpeechGraph

# Define the configuration for the graph
graph_config = {
"llm": {
"api_key": GOOGLE_APIKEY,
"model": "gemini-pro",
"api_key": "OPENAI_API_KEY",
"model": "gpt-3.5-turbo",
},
"tts_model": {
"api_key": "OPENAI_API_KEY",
"model": "tts-1",
"voice": "alloy"
},
"output_path": "audio_summary.mp3",
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
prompt="List me all the articles",
source="https://perinim.github.io/projects",
config=graph_config
# ************************************************
# Create the SpeechGraph instance and run it
# ************************************************

speech_graph = SpeechGraph(
prompt="Make a detailed audio summary of the projects.",
source="https://perinim.github.io/projects/",
config=graph_config,
)

result = smart_scraper_graph.run()
result = speech_graph.run()
print(result)
```

The output for all three cases will be a dictionary with the extracted information, for example:

```bash
{
'titles': [
'Rotary Pendulum RL'
],
'descriptions': [
'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'
]
}
```

The output will be an audio file with the summary of the projects on the page.

## 🤝 Contributing

Feel free to contribute and join our Discord server to discuss improvements and share suggestions with us!
Expand Down
1 change: 0 additions & 1 deletion docs/source/conf.py
Expand Up @@ -30,4 +30,3 @@
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']
4 changes: 2 additions & 2 deletions docs/source/getting_started/installation.rst
Expand Up @@ -8,8 +8,8 @@ Prerequisites
^^^^^^^^^^^^^

- `Python 3.8+ <https://www.python.org/downloads/>`_
- `pip <https://pip.pypa.io/en/stable/getting-started/>`
- `ollama <https://ollama.com/>` *optional for local models
- `pip <https://pip.pypa.io/en/stable/getting-started/>`_
- `ollama <https://ollama.com/>`_ *optional for local models


Install the library
Expand Down