
Commit 4238210

Merge pull request #371 from VinciGit00/pre/beta
Pre/beta
2 parents: 44fbd71 + a8251bd


187 files changed: +6571 / -1044 lines changed


.gitignore

Lines changed: 4 additions & 0 deletions
```diff
@@ -23,6 +23,7 @@ docs/source/_static/
 venv/
 .venv/
 .vscode/
+.conda/
 
 # exclude pdf, mp3
 *.pdf
@@ -38,3 +39,6 @@ lib/
 *.html
 .idea
 
+# extras
+cache/
+run_smart_scraper.py
```

CHANGELOG.md

Lines changed: 288 additions & 7 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 10 additions & 7 deletions
```diff
@@ -5,11 +5,11 @@
 | [русский](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/russian.md)
 
 
-[![Downloads](https://static.pepy.tech/badge/scrapegraphai)](https://pepy.tech/project/scrapegraphai)
-[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen)](https://github.com/pylint-dev/pylint)
-[![Pylint](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml/badge.svg)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml)
-[![CodeQL](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml/badge.svg)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml)
-[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Downloads](https://img.shields.io/pepy/dt/scrapegraphai?style=for-the-badge)](https://pepy.tech/project/scrapegraphai)
+[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen?style=for-the-badge)](https://github.com/pylint-dev/pylint)
+[![Pylint](https://img.shields.io/github/actions/workflow/status/VinciGit00/Scrapegraph-ai/pylint.yml?label=Pylint&logo=github&style=for-the-badge)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml)
+[![CodeQL](https://img.shields.io/github/actions/workflow/status/VinciGit00/Scrapegraph-ai/codeql.yml?label=CodeQL&logo=github&style=for-the-badge)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT)
 [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)
 
 ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
@@ -46,11 +46,14 @@ The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.r
 Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).
 
 ## 💻 Usage
-There are three main scraping pipelines that can be used to extract information from a website (or local file):
+There are multiple standard scraping pipelines that can be used to extract information from a website (or local file):
 - `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
 - `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
 - `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
-- `SmartScraperMultiGraph`: multiple page scraper given a single prompt
+- `ScriptCreatorGraph`: single-page scraper that extracts information from a website and generates a Python script.
+
+- `SmartScraperMultiGraph`: multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources;
+- `ScriptCreatorMultiGraph`: multi-page scraper that generates a Python script for extracting information from multiple pages given a single prompt and a list of sources.
 
 It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
```
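For orientation, here is a minimal sketch of how one of the pipelines listed in this README section is invoked. It mirrors the SmartScraperGraph example that this commit removes from docs/source/scrapers/graphs.rst (see that diff further down); the `api_key` and `model` values in the `llm` block are illustrative placeholders, not part of this diff.

```python
from scrapegraphai.graphs import SmartScraperGraph

# Illustrative configuration: the exact llm keys/values are assumptions,
# included only so the snippet is self-contained.
graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",  # placeholder, not a real key
        "model": "gpt-3.5-turbo",
    },
}

# Prompt and source taken from the SmartScraperGraph example in the docs diff below.
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    source="https://perinim.github.io/projects",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
```

The other pipelines follow the same pattern: build a config dict, construct the graph with a prompt (and, where relevant, a source or list of sources), then call `run()`.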

docs/assets/scriptcreatorgraph.png

53.7 KB

docs/source/conf.py

Lines changed: 4 additions & 1 deletion
```diff
@@ -36,4 +36,7 @@
     "source_repository": "https://github.com/VinciGit00/Scrapegraph-ai/",
     "source_branch": "main",
     "source_directory": "docs/source/",
-}
+    'navigation_with_keys': True,
+    'sidebar_hide_name': False,
+}
+
```

docs/source/index.rst

Lines changed: 0 additions & 3 deletions
```diff
@@ -22,9 +22,6 @@
    :caption: Scrapers
 
    scrapers/graphs
-   scrapers/llm
-   scrapers/graph_config
-   scrapers/benchmarks
 
 .. toctree::
    :maxdepth: 2
```

docs/source/scrapers/graph_config.rst

Lines changed: 1 addition & 0 deletions
```diff
@@ -13,6 +13,7 @@ Some interesting ones are:
 - `loader_kwargs`: A dictionary with additional parameters to be passed to the `Loader` class, such as `proxy`.
 - `burr_kwargs`: A dictionary with additional parameters to enable `Burr` graphical user interface.
 - `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`.
+- `cache_path`: The path where the cache files will be saved. If already exists, the cache will be loaded from this path.
 
 .. _Burr:
 
```
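To make the new option concrete, here is a short sketch of a configuration dictionary; the `llm` entry uses placeholder values, and only `max_images` and `cache_path` come from the option list documented in this file.

```python
# Sketch of a graph configuration using the documented options.
# The "llm" entry is an assumed placeholder; "max_images" and "cache_path"
# are the options described in graph_config.rst above.
graph_config = {
    "llm": {"api_key": "YOUR_API_KEY", "model": "gpt-3.5-turbo"},  # placeholder values
    "max_images": 5,            # cap the number of images analyzed
    "cache_path": "./cache",    # cache files are written here and reloaded if the path exists
}
```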

docs/source/scrapers/graphs.rst

Lines changed: 8 additions & 184 deletions
```diff
@@ -3,187 +3,11 @@ Graphs
 
 Graphs are scraping pipelines aimed at solving specific tasks. They are composed by nodes which can be configured individually to address different aspects of the task (fetching data, extracting information, etc.).
 
-There are several types of graphs available in the library, each with its own purpose and functionality. The most common ones are:
-
-- **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file) to extract information using LLM.
-- **SmartScraperMultiGraph**: multi-page scraper that requires a user-defined prompt and a list of URLs (or local files) to extract information using LLM. It is built on top of SmartScraperGraph.
-- **SearchGraph**: multi-page scraper that only requires a user-defined prompt to extract information from a search engine using LLM. It is built on top of SmartScraperGraph.
-- **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).
-- **ScriptCreatorGraph**: script generator that creates a Python script to scrape a website using the specified library (e.g. BeautifulSoup). It requires a user-defined prompt and a URL (or local file).
-
-With the introduction of `GPT-4o`, two new powerful graphs have been created:
-
-- **OmniScraperGraph**: similar to `SmartScraperGraph`, but with the ability to scrape images and describe them.
-- **OmniSearchGraph**: similar to `SearchGraph`, but with the ability to scrape images and describe them.
-
-
-.. note::
-
-   They all use a graph configuration to set up LLM models and other parameters. To find out more about the configurations, check the :ref:`LLM` and :ref:`Configuration` sections.
-
-
-.. note::
-
-   We can pass an optional `schema` parameter to the graph constructor to specify the output schema. If not provided or set to `None`, the schema will be generated by the LLM itself.
-
-OmniScraperGraph
-^^^^^^^^^^^^^^^^
-
-.. image:: ../../assets/omniscrapergraph.png
-   :align: center
-   :width: 90%
-   :alt: OmniScraperGraph
-|
-
-First we define the graph configuration, which includes the LLM model and other parameters. Then we create an instance of the OmniScraperGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.
-It will fetch the data from the source and extract the information based on the prompt in JSON format.
-
-.. code-block:: python
-
-   from scrapegraphai.graphs import OmniScraperGraph
-
-   graph_config = {
-       "llm": {...},
-   }
-
-   omni_scraper_graph = OmniScraperGraph(
-       prompt="List me all the projects with their titles and image links and descriptions.",
-       source="https://perinim.github.io/projects",
-       config=graph_config,
-       schema=schema
-   )
-
-   result = omni_scraper_graph.run()
-   print(result)
-
-OmniSearchGraph
-^^^^^^^^^^^^^^^
-
-.. image:: ../../assets/omnisearchgraph.png
-   :align: center
-   :width: 80%
-   :alt: OmniSearchGraph
-|
-
-Similar to OmniScraperGraph, we define the graph configuration, create multiple of the OmniSearchGraph class, and run the graph.
-It will create a search query, fetch the first n results from the search engine, run n OmniScraperGraph instances, and return the results in JSON format.
-
-.. code-block:: python
-
-   from scrapegraphai.graphs import OmniSearchGraph
-
-   graph_config = {
-       "llm": {...},
-   }
-
-   # Create the OmniSearchGraph instance
-   omni_search_graph = OmniSearchGraph(
-       prompt="List me all Chioggia's famous dishes and describe their pictures.",
-       config=graph_config,
-       schema=schema
-   )
-
-   # Run the graph
-   result = omni_search_graph.run()
-   print(result)
-
-SmartScraperGraph & SmartScraperMultiGraph
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. image:: ../../assets/smartscrapergraph.png
-   :align: center
-   :width: 90%
-   :alt: SmartScraperGraph
-|
-
-First we define the graph configuration, which includes the LLM model and other parameters. Then we create an instance of the SmartScraperGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.
-It will fetch the data from the source and extract the information based on the prompt in JSON format.
-
-.. code-block:: python
-
-   from scrapegraphai.graphs import SmartScraperGraph
-
-   graph_config = {
-       "llm": {...},
-   }
-
-   smart_scraper_graph = SmartScraperGraph(
-       prompt="List me all the projects with their descriptions",
-       source="https://perinim.github.io/projects",
-       config=graph_config,
-       schema=schema
-   )
-
-   result = smart_scraper_graph.run()
-   print(result)
-
-**SmartScraperMultiGraph** is similar to SmartScraperGraph, but it can handle multiple sources. We define the graph configuration, create an instance of the SmartScraperMultiGraph class, and run the graph.
-
-SearchGraph
-^^^^^^^^^^^
-
-.. image:: ../../assets/searchgraph.png
-   :align: center
-   :width: 80%
-   :alt: SearchGraph
-|
-
-Similar to SmartScraperGraph, we define the graph configuration, create an instance of the SearchGraph class, and run the graph.
-It will create a search query, fetch the first n results from the search engine, run n SmartScraperGraph instances, and return the results in JSON format.
-
-
-.. code-block:: python
-
-   from scrapegraphai.graphs import SearchGraph
-
-   graph_config = {
-       "llm": {...},
-       "embeddings": {...},
-   }
-
-   # Create the SearchGraph instance
-   search_graph = SearchGraph(
-       prompt="List me all the traditional recipes from Chioggia",
-       config=graph_config,
-       schema=schema
-   )
-
-   # Run the graph
-   result = search_graph.run()
-   print(result)
-
-
-SpeechGraph
-^^^^^^^^^^^
-
-.. image:: ../../assets/speechgraph.png
-   :align: center
-   :width: 90%
-   :alt: SpeechGraph
-|
-
-Similar to SmartScraperGraph, we define the graph configuration, create an instance of the SpeechGraph class, and run the graph.
-It will fetch the data from the source, extract the information based on the prompt, and generate an audio file with the answer, as well as the answer itself, in JSON format.
-
-.. code-block:: python
-
-   from scrapegraphai.graphs import SpeechGraph
-
-   graph_config = {
-       "llm": {...},
-       "tts_model": {...},
-   }
-
-   # ************************************************
-   # Create the SpeechGraph instance and run it
-   # ************************************************
-
-   speech_graph = SpeechGraph(
-       prompt="Make a detailed audio summary of the projects.",
-       source="https://perinim.github.io/projects/",
-       config=graph_config,
-       schema=schema
-   )
-
-   result = speech_graph.run()
-   print(result)
+.. toctree::
+   :maxdepth: 4
+
+   types
+   llm
+   graph_config
+   benchmarks
+   telemetry
```

docs/source/scrapers/telemetry.rst

Lines changed: 72 additions & 0 deletions
```diff
@@ -0,0 +1,72 @@
+===============
+Usage Analytics
+===============
+
+ScrapeGraphAI collects **anonymous** usage data by default to improve the library and guide development efforts.
+
+**Events Captured**
+
+We capture events in the following scenarios:
+
+1. When a ``Graph`` finishes running.
+2. When an exception is raised in one of the nodes.
+
+**Data Collected**
+
+The data captured is limited to:
+
+- Operating System and Python version
+- A persistent UUID to identify the session, stored in ``~/.scrapegraphai.conf``
+
+Additionally, the following properties are collected:
+
+.. code-block:: python
+
+   properties = {
+       "graph_name": graph_name,
+       "llm_model": llm_model_name,
+       "embedder_model": embedder_model_name,
+       "source_type": source_type,
+       "execution_time": execution_time,
+       "error_node": error_node_name,
+   }
+
+For more details, refer to the `telemetry.py <https://github.com/VinciGit00/Scrapegraph-ai/blob/main/scrapegraphai/telemetry/telemetry.py>`_ module.
+
+**Opting Out**
+
+If you prefer not to participate in telemetry, you can opt out using any of the following methods:
+
+1. **Programmatically Disable Telemetry**:
+
+   Add the following code at the beginning of your script:
+
+   .. code-block:: python
+
+      from scrapegraphai import telemetry
+      telemetry.disable_telemetry()
+
+2. **Configuration File**:
+
+   Set the ``telemetry_enabled`` key to ``false`` in ``~/.scrapegraphai.conf`` under the ``[DEFAULT]`` section:
+
+   .. code-block:: ini
+
+      [DEFAULT]
+      telemetry_enabled = False
+
+3. **Environment Variable**:
+
+   - **For a Shell Session**:
+
+     .. code-block:: bash
+
+        export SCRAPEGRAPHAI_TELEMETRY_ENABLED=false
+
+   - **For a Single Command**:
+
+     .. code-block:: bash
+
+        SCRAPEGRAPHAI_TELEMETRY_ENABLED=false python my_script.py
+
+By following any of these methods, you can easily opt out of telemetry and ensure your usage data is not collected.
```
