Commit f9f6b08

Merge branch 'pre/beta' into burr_integration
2 parents 19b27bb + e65faca commit f9f6b08

90 files changed: +3129 -707 lines changed

.gitignore

Lines changed: 5 additions & 3 deletions
@@ -14,6 +14,7 @@ dist/
 *.egg-info/
 *.egg
 MANIFEST
+*.python-version
 
 docs/build/
 docs/source/_templates/
@@ -31,6 +32,7 @@ examples/graph_examples/ScrapeGraphAI_generated_graph
 examples/**/result.csv
 examples/**/result.json
 main.py
-*.python-version
-*.lock
-
+lib/
+*.html
+.idea
+

.python-version

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+3.10.14
+

CHANGELOG.md

Lines changed: 116 additions & 0 deletions
@@ -1,3 +1,119 @@
+## [1.5.0-beta.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.0-beta.2...v1.5.0-beta.3) (2024-05-24)
+
+
+### Bug Fixes
+
+* **kg:** removed unused nodes and utils ([5684578](https://github.com/VinciGit00/Scrapegraph-ai/commit/5684578fab635e862de58f7847ad736c6a57f766))
+
+## [1.5.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.0-beta.1...v1.5.0-beta.2) (2024-05-24)
+
+
+### Bug Fixes
+
+* **pdf_scraper:** fix the pdf scraper gaph ([d00cde6](https://github.com/VinciGit00/Scrapegraph-ai/commit/d00cde60309935e283ba9116cf0b114e53cb9640))
+* **local_file:** fixed textual input pdf, csv, json and xml graph ([8d5eb0b](https://github.com/VinciGit00/Scrapegraph-ai/commit/8d5eb0bb0d5d008a63a96df94ce3842320376b8e))
+
+## [1.5.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.4.0...v1.5.0-beta.1) (2024-05-24)
+
+
+### Features
+
+* **knowledgegraph:** add knowledge graph node ([0196423](https://github.com/VinciGit00/Scrapegraph-ai/commit/0196423bdeea6568086aae6db8fc0f5652fc4e87))
+* add logger integration ([e53766b](https://github.com/VinciGit00/Scrapegraph-ai/commit/e53766b16e89254f945f9b54b38445a24f8b81f2))
+* **smart-scraper-multi:** add schema to graphs and created SmartScraperMultiGraph ([fc58e2d](https://github.com/VinciGit00/Scrapegraph-ai/commit/fc58e2d3a6f05efa72b45c9e68c6bb41a1eee755))
+* **base_graph:** alligned with main ([73fa31d](https://github.com/VinciGit00/Scrapegraph-ai/commit/73fa31db0f791d1fd63b489ac88cc6e595aa07f9))
+* **verbose:** centralized graph logging on debug or warning depending on verbose ([c807695](https://github.com/VinciGit00/Scrapegraph-ai/commit/c807695720a85c74a0b4365afb397bbbcd7e2889))
+* **node:** knowledge graph node ([8c33ea3](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c33ea3fbce18f74484fe7bd9469ab95c985ad0b))
+* **multiple:** quick fix working ([58cc903](https://github.com/VinciGit00/Scrapegraph-ai/commit/58cc903d556d0b8db10284493b05bed20992c339))
+* **kg:** removed import ([a338383](https://github.com/VinciGit00/Scrapegraph-ai/commit/a338383399b669ae2dd7bfcec168b791e8206816))
+* **docloaders:** undetected-playwright ([7b3ee4e](https://github.com/VinciGit00/Scrapegraph-ai/commit/7b3ee4e71e4af04edeb47999d70d398b67c93ac4))
+* **multiple_search:** working multiple example ([bed3eed](https://github.com/VinciGit00/Scrapegraph-ai/commit/bed3eed50c1678cfb07cba7b451ac28d38c87d7c))
+* **kg:** working rag kg ([c75e6a0](https://github.com/VinciGit00/Scrapegraph-ai/commit/c75e6a06b1a647f03e6ac6eeacdc578a85baa25b))
+
+
+### Bug Fixes
+
+* error in jsons ([ca436ab](https://github.com/VinciGit00/Scrapegraph-ai/commit/ca436abf3cbff21d752a71969e787e8f8c98c6a8))
+* **logger:** set up centralized root logger in base node ([4348d4f](https://github.com/VinciGit00/Scrapegraph-ai/commit/4348d4f4db6f30213acc1bbccebc2b143b4d2636))
+* **logging:** source code citation ([d139480](https://github.com/VinciGit00/Scrapegraph-ai/commit/d1394809d704bee4085d494ddebab772306b3b17))
+* template names ([b82f33a](https://github.com/VinciGit00/Scrapegraph-ai/commit/b82f33aee72515e4258e6f508fce15028eba5cbe))
+* **node-logging:** use centralized logger in each node for logging ([c251cc4](https://github.com/VinciGit00/Scrapegraph-ai/commit/c251cc45d3694f8e81503e38a6d2b362452b740e))
+* **web-loader:** use sublogger ([0790ecd](https://github.com/VinciGit00/Scrapegraph-ai/commit/0790ecd2083642af9f0a84583216ababe351cd76))
+
+
+### CI
+
+* **release:** 1.2.0-beta.1 [skip ci] ([fd3e0aa](https://github.com/VinciGit00/Scrapegraph-ai/commit/fd3e0aa5823509dfb46b4f597521c24d4eb345f1))
+* **release:** 1.3.0-beta.1 [skip ci] ([191db0b](https://github.com/VinciGit00/Scrapegraph-ai/commit/191db0bc779e4913713b47b68ec4162a347da3ea))
+* **release:** 1.4.0-beta.1 [skip ci] ([2caddf9](https://github.com/VinciGit00/Scrapegraph-ai/commit/2caddf9a99b5f3aedc1783216f21d23cd35b3a8c))
+* **release:** 1.4.0-beta.2 [skip ci] ([f1a2523](https://github.com/VinciGit00/Scrapegraph-ai/commit/f1a25233d650010e1932e0ab80938079a22a296d))
+
+## [1.4.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.4.0-beta.1...v1.4.0-beta.2) (2024-05-19)
+
+
+### Features
+
+* Add new models and update existing ones ([58289ec](https://github.com/VinciGit00/Scrapegraph-ai/commit/58289eccc523814a2898650c41410f9a35b4e4c2))
+
+## [1.3.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.3.1...v1.3.2) (2024-05-22)
+
+
+### Bug Fixes
+
+* pdf scraper bug ([f2dffe5](https://github.com/VinciGit00/Scrapegraph-ai/commit/f2dffe534f51aa83aed5ac491243604a443f4373))
+
+## [1.3.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.3.0...v1.3.1) (2024-05-21)
+
+
+### Bug Fixes
+
+* add deepseek embeddings ([659fad7](https://github.com/VinciGit00/Scrapegraph-ai/commit/659fad770a5b6ace87511513e5233a3bc1269009))
+
+
+## [1.3.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.4...v1.3.0) (2024-05-19)
+
+
+
+### Features
+
+* add new model ([8c7afa7](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c7afa7570f0a104578deb35658168435cfe5ae1))
+
+
+## [1.2.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.3...v1.2.4) (2024-05-17)
+
+
+### Bug Fixes
+
+* **deepcopy:** switch whether we have obj in the config ([d4d913c](https://github.com/VinciGit00/Scrapegraph-ai/commit/d4d913c8a360b907ebe1fbf3764e00b69783afe8))
+
+## [1.2.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.2...v1.2.3) (2024-05-15)
+
+
+### Bug Fixes
+
+* **deepcopy:** reaplced to shallow copy ([999c930](https://github.com/VinciGit00/Scrapegraph-ai/commit/999c930f424430a3d3d7ff604afbd2bf6d27c7ad))
+
+## [1.2.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.1...v1.2.2) (2024-05-15)
+
+
+### Bug Fixes
+
+* come back to the old version ([cc5adef](https://github.com/VinciGit00/Scrapegraph-ai/commit/cc5adefd29eb2d0d7127515c4a4a72eabbc7eaa8))
+
+## [1.2.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.0...v1.2.1) (2024-05-15)
+
+
+### Bug Fixes
+
+* removed unused ([5587a64](https://github.com/VinciGit00/Scrapegraph-ai/commit/5587a64d23451a6a216000fe83b2ce1cc8f7141b))
+
+## [1.2.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.1.0...v1.2.0) (2024-05-15)
+
+
+### Features
+
+* add finalize_node() ([6e7283e](https://github.com/VinciGit00/Scrapegraph-ai/commit/6e7283ed8fc42408d718e8776f9fd3856960ffdb))
+
 ## [1.1.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.0.1...v1.1.0) (2024-05-15)
 
 

README.md

Lines changed: 9 additions & 8 deletions
@@ -7,7 +7,6 @@
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)
 
-
 ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
 
 Just say which information you want to extract and the library will do it for you!
@@ -23,10 +22,6 @@ The reference page for Scrapegraph-ai is available on the official page of pypy:
 ```bash
 pip install scrapegraphai
 ```
-you will also need to install Playwright for javascript-based scraping:
-```bash
-playwright install
-```
 
 **Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱
 
@@ -50,6 +45,7 @@ There are three main scraping pipelines that can be used to extract information
 - `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
 - `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
 - `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
+- `SmartScraperMultiGraph`: multiple page scraper given a single prompt
 
 It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
 
@@ -184,9 +180,14 @@ Wanna visualize the roadmap in a more interactive way? Check out the [markmap](h
 ## ❤️ Contributors
 [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
 ## Sponsors
-<p align="center">
-<a href="https://serpapi.com?utm_source=scrapegraphai"><img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;"></a>
-</p>
+<div style="text-align: center;">
+<a href="https://serpapi.com?utm_source=scrapegraphai">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
+</a>
+<a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 10%;">
+</a>
+</div>
 
 ## 🎓 Citations
 If you have used our library for research purposes please quote us with the following reference:

docs/assets/transparent_stat.png

217 KB

docs/source/getting_started/installation.rst

Lines changed: 2 additions & 4 deletions
@@ -25,13 +25,11 @@ The library is available on PyPI, so it can be installed using the following com
 
 It is higly recommended to install the library in a virtual environment (conda, venv, etc.)
 
-If you clone the repository, you can install the library using `rye <https://rye-up.com/>`_. Follow the installation instruction from the website and then run:
+If your clone the repository, you can install the library using `poetry <https://python-poetry.org/docs/>`_:
 
 .. code-block:: bash
 
-    rye pin 3.10
-    rye sync
-    rye build
+    poetry install
 
 Additionally on Windows when using WSL
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

docs/source/scrapers/graphs.rst

Lines changed: 22 additions & 8 deletions
@@ -3,21 +3,29 @@ Graphs
 
 Graphs are scraping pipelines aimed at solving specific tasks. They are composed by nodes which can be configured individually to address different aspects of the task (fetching data, extracting information, etc.).
 
-There are three types of graphs available in the library:
+There are several types of graphs available in the library, each with its own purpose and functionality. The most common ones are:
 
-- **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file) to extract information from using LLM.
+- **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file) to extract information using LLM.
+- **SmartScraperMultiGraph**: multi-page scraper that requires a user-defined prompt and a list of URLs (or local files) to extract information using LLM. It is built on top of SmartScraperGraph.
 - **SearchGraph**: multi-page scraper that only requires a user-defined prompt to extract information from a search engine using LLM. It is built on top of SmartScraperGraph.
 - **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).
+- **ScriptCreatorGraph**: script generator that creates a Python script to scrape a website using the specified library (e.g. BeautifulSoup). It requires a user-defined prompt and a URL (or local file).
 
 With the introduction of `GPT-4o`, two new powerful graphs have been created:
 
 - **OmniScraperGraph**: similar to `SmartScraperGraph`, but with the ability to scrape images and describe them.
 - **OmniSearchGraph**: similar to `SearchGraph`, but with the ability to scrape images and describe them.
 
+
 .. note::
 
     They all use a graph configuration to set up LLM models and other parameters. To find out more about the configurations, check the :ref:`LLM` and :ref:`Configuration` sections.
 
+
+.. note::
+
+    We can pass an optional `schema` parameter to the graph constructor to specify the output schema. If not provided or set to `None`, the schema will be generated by the LLM itself.
+
 OmniScraperGraph
 ^^^^^^^^^^^^^^^^
 
@@ -41,7 +49,8 @@ It will fetch the data from the source and extract the information based on the
     omni_scraper_graph = OmniScraperGraph(
         prompt="List me all the projects with their titles and image links and descriptions.",
         source="https://perinim.github.io/projects",
-        config=graph_config
+        config=graph_config,
+        schema=schema
     )
 
     result = omni_scraper_graph.run()
@@ -70,15 +79,16 @@ It will create a search query, fetch the first n results from the search engine,
     # Create the OmniSearchGraph instance
     omni_search_graph = OmniSearchGraph(
         prompt="List me all Chioggia's famous dishes and describe their pictures.",
-        config=graph_config
+        config=graph_config,
+        schema=schema
     )
 
     # Run the graph
     result = omni_search_graph.run()
     print(result)
 
-SmartScraperGraph
-^^^^^^^^^^^^^^^^^
+SmartScraperGraph & SmartScraperMultiGraph
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. image:: ../../assets/smartscrapergraph.png
    :align: center
@@ -100,12 +110,14 @@ It will fetch the data from the source and extract the information based on the
     smart_scraper_graph = SmartScraperGraph(
         prompt="List me all the projects with their descriptions",
         source="https://perinim.github.io/projects",
-        config=graph_config
+        config=graph_config,
+        schema=schema
     )
 
     result = smart_scraper_graph.run()
     print(result)
 
+**SmartScraperMultiGraph** is similar to SmartScraperGraph, but it can handle multiple sources. We define the graph configuration, create an instance of the SmartScraperMultiGraph class, and run the graph.
 
 SearchGraph
 ^^^^^^^^^^^
@@ -132,7 +144,8 @@ It will create a search query, fetch the first n results from the search engine,
     # Create the SearchGraph instance
     search_graph = SearchGraph(
         prompt="List me all the traditional recipes from Chioggia",
-        config=graph_config
+        config=graph_config,
+        schema=schema
     )
 
     # Run the graph
@@ -169,6 +182,7 @@ It will fetch the data from the source, extract the information based on the pro
         prompt="Make a detailed audio summary of the projects.",
         source="https://perinim.github.io/projects/",
         config=graph_config,
+        schema=schema
     )
 
     result = speech_graph.run()
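
The added documentation describes `SmartScraperMultiGraph` in prose but the diff shows no example for it. The following is a minimal sketch under stated assumptions, not the library's documented usage: it assumes the constructor accepts the same keyword arguments as `SmartScraperGraph`, with `source` being a list of URLs as the description says. The helper names `build_graph_config` and `scrape_many`, the model name, and the shape of the `llm` config are hypothetical placeholders.

```python
from typing import List, Optional


def build_graph_config(api_key: str, model: str = "gpt-3.5-turbo") -> dict:
    """Build an assumed minimal graph configuration (hypothetical shape)."""
    return {"llm": {"model": model, "api_key": api_key}}


def scrape_many(prompt: str, sources: List[str], config: dict,
                schema: Optional[str] = None):
    """Run one prompt over several sources (hypothetical helper)."""
    # Imported lazily so build_graph_config works without the library installed.
    from scrapegraphai.graphs import SmartScraperMultiGraph

    graph = SmartScraperMultiGraph(
        prompt=prompt,
        source=sources,  # assumed: a list of URLs, per the docs text above
        config=config,
        schema=schema,   # None leaves schema generation to the LLM
    )
    return graph.run()
```

Calling `scrape_many("List me all the projects with their descriptions", ["https://perinim.github.io/projects"], build_graph_config("YOUR_API_KEY"))` would then mirror the single-source example, with more URLs appended to the list as needed.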

examples/bedrock/.env.example

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+AWS_ACCESS_KEY_ID="..."
+AWS_SECRET_ACCESS_KEY="..."
+AWS_SESSION_TOKEN="..."
+AWS_DEFAULT_REGION="..."
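
The Bedrock example in this commit calls `load_dotenv()`, so these four variables are expected to end up in the process environment. As a hedged sketch (the helper name `missing_aws_vars` is hypothetical and not part of the commit), a script could check that the placeholders were actually replaced before building a graph:

```python
import os
from typing import List, Mapping

# Variable names copied from examples/bedrock/.env.example.
REQUIRED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_DEFAULT_REGION",
)


def missing_aws_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return names that are unset, empty, or still the "..." placeholder."""
    return [
        name for name in REQUIRED_AWS_VARS
        if env.get(name, "").strip() in ("", "...")
    ]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Missing AWS settings:", ", ".join(missing))
```

Whether all four variables are required depends on the kind of credentials in use; a session token, for instance, normally accompanies temporary credentials only, so treating it as mandatory is an assumption of this sketch.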

examples/bedrock/README.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+This folder contains examples of how to use ScrapeGraphAI with [Amazon Bedrock](https://aws.amazon.com/bedrock/) ⛰️. The examples show how to extract information from websites and files using a natural language prompt.
+
+![](scrapegraphai_bedrock.png)
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+"""
+Basic example of scraping pipeline using CSVScraperGraph from CSV documents
+"""
+
+import os
+import json
+
+from dotenv import load_dotenv
+
+import pandas as pd
+
+from scrapegraphai.graphs import CSVScraperGraph
+from scrapegraphai.utils import convert_to_csv, convert_to_json, prettify_exec_info
+
+load_dotenv()
+
+# ************************************************
+# Read the CSV file
+# ************************************************
+
+FILE_NAME = "inputs/username.csv"
+curr_dir = os.path.dirname(os.path.realpath(__file__))
+file_path = os.path.join(curr_dir, FILE_NAME)
+
+text = pd.read_csv(file_path)
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+graph_config = {
+    "llm": {
+        "model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
+        "temperature": 0.0
+    },
+    "embeddings": {
+        "model": "bedrock/cohere.embed-multilingual-v3"
+    }
+}
+
+# ************************************************
+# Create the CSVScraperGraph instance and run it
+# ************************************************
+
+csv_scraper_graph = CSVScraperGraph(
+    prompt="List me all the last names",
+    source=str(text),  # Pass the content of the file, not the file object
+    config=graph_config
+)
+
+result = csv_scraper_graph.run()
+print(json.dumps(result, indent=4))
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = csv_scraper_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))
+
+# Save to json or csv
+convert_to_csv(result, "result")
+convert_to_json(result, "result")
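
One point worth noting about `source=str(text)` in the example above: `str()` on a pandas DataFrame returns its display representation, which elides rows once the frame exceeds pandas' display limits, so a large CSV may reach the graph only in part. A minimal alternative sketch reads the raw file text instead. The helper name `read_csv_text` is hypothetical, and whether `CSVScraperGraph` treats raw CSV text the same way as the DataFrame string is an assumption not confirmed by this commit.

```python
from pathlib import Path


def read_csv_text(path: str, encoding: str = "utf-8") -> str:
    """Return the complete, untruncated contents of a CSV file as a string."""
    return Path(path).read_text(encoding=encoding)
```

With this helper, `source=read_csv_text(file_path)` could replace `source=str(text)` in the constructor call.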
