Commit da93162

Merge branch 'pre/beta' into main
2 parents d8d5cd2 + 4c8becc commit da93162

179 files changed, +5940 -790 lines changed


.gitignore

Lines changed: 4 additions & 0 deletions
@@ -23,6 +23,7 @@ docs/source/_static/
 venv/
 .venv/
 .vscode/
+.conda/

 # exclude pdf, mp3
 *.pdf
@@ -38,3 +39,6 @@ lib/
 *.html
 .idea

+# extras
+cache/
+run_smart_scraper.py

CHANGELOG.md

Lines changed: 257 additions & 7 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 10 additions & 7 deletions
@@ -5,11 +5,11 @@
 | [русский](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/russian.md)


-[![Downloads](https://static.pepy.tech/badge/scrapegraphai)](https://pepy.tech/project/scrapegraphai)
-[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen)](https://github.com/pylint-dev/pylint)
-[![Pylint](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml/badge.svg)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml)
-[![CodeQL](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml/badge.svg)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml)
-[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Downloads](https://img.shields.io/pepy/dt/scrapegraphai?style=for-the-badge)](https://pepy.tech/project/scrapegraphai)
+[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen?style=for-the-badge)](https://github.com/pylint-dev/pylint)
+[![Pylint](https://img.shields.io/github/actions/workflow/status/VinciGit00/Scrapegraph-ai/pylint.yml?label=Pylint&logo=github&style=for-the-badge)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/pylint.yml)
+[![CodeQL](https://img.shields.io/github/actions/workflow/status/VinciGit00/Scrapegraph-ai/codeql.yml?label=CodeQL&logo=github&style=for-the-badge)](https://github.com/VinciGit00/Scrapegraph-ai/actions/workflows/codeql.yml)
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT)
 [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)

 ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
@@ -46,11 +46,14 @@ The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.r
 Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).

 ## 💻 Usage
-There are three main scraping pipelines that can be used to extract information from a website (or local file):
+There are multiple standard scraping pipelines that can be used to extract information from a website (or local file):
 - `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
 - `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
 - `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
-- `SmartScraperMultiGraph`: multiple page scraper given a single prompt
+- `ScriptCreatorGraph`: single-page scraper that extracts information from a website and generates a Python script.
+
+- `SmartScraperMultiGraph`: multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources;
+- `ScriptCreatorMultiGraph`: multi-page scraper that generates a Python script for extracting information from multiple pages given a single prompt and a list of sources.

 It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.

docs/assets/scriptcreatorgraph.png

53.7 KB

docs/source/scrapers/graph_config.rst

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ Some interesting ones are:
 - `loader_kwargs`: A dictionary with additional parameters to be passed to the `Loader` class, such as `proxy`.
 - `burr_kwargs`: A dictionary with additional parameters to enable `Burr` graphical user interface.
 - `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`.
+- `cache_path`: The path where the cache files will be saved. If the path already exists, the cache will be loaded from it.

 .. _Burr:
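As a rough illustration of the `cache_path` entry added above, the following sketch attaches it to a graph configuration and reports whether the path already exists, which per the description decides whether a cache would be loaded or created. The helper `with_cache_path`, the `cache/` directory and the `llm` values are assumptions made for this example, not part of the library.

```python
import os


def with_cache_path(graph_config, cache_path):
    """Return a copy of `graph_config` with `cache_path` set.

    The second return value says whether the path already exists, i.e.
    whether (per the docs text above) a cache would be loaded from it.
    `with_cache_path` is a hypothetical helper, not a library function.
    """
    config = dict(graph_config)  # shallow copy; the caller's dict is not mutated
    config["cache_path"] = cache_path
    return config, os.path.exists(cache_path)


# Hypothetical values for illustration only.
base_config = {"llm": {"model": "claude-3-haiku-20240307", "max_tokens": 4000}}
config, cache_exists = with_cache_path(base_config, "cache/")
```

The resulting `config` dictionary would then be passed as the `config` argument of a graph, in the same way as the `graph_config` dictionaries in the example scripts of this commit.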

docs/source/scrapers/graphs.rst

Lines changed: 39 additions & 2 deletions
@@ -6,11 +6,15 @@ Graphs are scraping pipelines aimed at solving specific tasks. They are composed
 There are several types of graphs available in the library, each with its own purpose and functionality. The most common ones are:

 - **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file) to extract information using LLM.
-- **SmartScraperMultiGraph**: multi-page scraper that requires a user-defined prompt and a list of URLs (or local files) to extract information using LLM. It is built on top of SmartScraperGraph.
 - **SearchGraph**: multi-page scraper that only requires a user-defined prompt to extract information from a search engine using LLM. It is built on top of SmartScraperGraph.
 - **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).
 - **ScriptCreatorGraph**: script generator that creates a Python script to scrape a website using the specified library (e.g. BeautifulSoup). It requires a user-defined prompt and a URL (or local file).

+There are also two additional graphs that can handle multiple sources:
+
+- **SmartScraperMultiGraph**: similar to `SmartScraperGraph`, but with the ability to handle multiple sources.
+- **ScriptCreatorMultiGraph**: similar to `ScriptCreatorGraph`, but with the ability to handle multiple sources.
+
 With the introduction of `GPT-4o`, two new powerful graphs have been created:

 - **OmniScraperGraph**: similar to `SmartScraperGraph`, but with the ability to scrape images and describe them.
@@ -186,4 +190,37 @@ It will fetch the data from the source, extract the information based on the pro
    )

    result = speech_graph.run()
-   print(result)
+   print(result)
+
+
+ScriptCreatorGraph & ScriptCreatorMultiGraph
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. image:: ../../assets/scriptcreatorgraph.png
+   :align: center
+   :width: 90%
+   :alt: ScriptCreatorGraph
+
+First we define the graph configuration, which includes the LLM model and other parameters.
+Then we create an instance of the ScriptCreatorGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.
+
+.. code-block:: python
+
+   from scrapegraphai.graphs import ScriptCreatorGraph
+
+   graph_config = {
+       "llm": {...},
+       "library": "beautifulsoup4"
+   }
+
+   script_creator_graph = ScriptCreatorGraph(
+       prompt="Create a Python script to scrape the projects.",
+       source="https://perinim.github.io/projects/",
+       config=graph_config,
+       schema=schema
+   )
+
+   result = script_creator_graph.run()
+   print(result)
+
+**ScriptCreatorMultiGraph** is similar to ScriptCreatorGraph, but it can handle multiple sources. We define the graph configuration, create an instance of the ScriptCreatorMultiGraph class, and run the graph.
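The added docs text describes `ScriptCreatorMultiGraph` without showing it in use. A minimal sketch follows, mirroring the `ScriptCreatorMultiGraph` example script added elsewhere in this commit (same `prompt`, `source` and `config` arguments). The helper names `build_graph_config` and `run_script_creator_multi` are hypothetical and exist only for this illustration. The library import is deferred so that the configuration helper can be used without `scrapegraphai` installed.

```python
def build_graph_config(api_key, model="claude-3-haiku-20240307", library="beautifulsoup"):
    """Build a config dict shaped like the ones in this commit's example scripts.

    Hypothetical helper: the keys are copied from the examples, the
    function itself is not part of the library.
    """
    return {
        "llm": {"api_key": api_key, "model": model, "max_tokens": 4000},
        "library": library,
    }


def run_script_creator_multi(prompt, sources, config):
    """Run ScriptCreatorMultiGraph over a list of sources and return its result."""
    # Imported lazily: this call needs scrapegraphai installed and a valid API key.
    from scrapegraphai.graphs import ScriptCreatorMultiGraph

    graph = ScriptCreatorMultiGraph(prompt=prompt, source=sources, config=config)
    return graph.run()
```

Assuming an `ANTHROPIC_API_KEY` environment variable is set, a call would look like `run_script_creator_multi("Find information about actors", urls, build_graph_config(os.getenv("ANTHROPIC_API_KEY")))`, where `urls` is a list of page URLs as in the example script.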
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+"""
+Basic example of scraping pipeline using CSVScraperMultiGraph from CSV documents
+"""
+
+import os
+from dotenv import load_dotenv
+import pandas as pd
+from scrapegraphai.graphs import CSVScraperMultiGraph
+from scrapegraphai.utils import convert_to_csv, convert_to_json, prettify_exec_info
+
+load_dotenv()
+# ************************************************
+# Read the CSV file
+# ************************************************
+
+FILE_NAME = "inputs/username.csv"
+curr_dir = os.path.dirname(os.path.realpath(__file__))
+file_path = os.path.join(curr_dir, FILE_NAME)
+
+text = pd.read_csv(file_path)
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+graph_config = {
+    "llm": {
+        "api_key": os.getenv("ANTHROPIC_API_KEY"),
+        "model": "claude-3-haiku-20240307",
+        "max_tokens": 4000},
+}
+
+# ************************************************
+# Create the CSVScraperMultiGraph instance and run it
+# ************************************************
+
+csv_scraper_graph = CSVScraperMultiGraph(
+    prompt="List me all the last names",
+    source=[str(text), str(text)],
+    config=graph_config
+)
+
+result = csv_scraper_graph.run()
+print(result)
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = csv_scraper_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))
+
+# Save to json or csv
+convert_to_csv(result, "result")
+convert_to_json(result, "result")
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+"""
+Module showing how JSONScraperMultiGraph works
+"""
+import os
+import json
+from dotenv import load_dotenv
+from scrapegraphai.graphs import JSONScraperMultiGraph
+
+load_dotenv()
+
+graph_config = {
+    "llm": {
+        "api_key": os.getenv("ANTHROPIC_API_KEY"),
+        "model": "claude-3-haiku-20240307",
+        "max_tokens": 4000
+    },
+}
+
+FILE_NAME = "inputs/example.json"
+curr_dir = os.path.dirname(os.path.realpath(__file__))
+file_path = os.path.join(curr_dir, FILE_NAME)
+
+with open(file_path, 'r', encoding="utf-8") as file:
+    text = file.read()
+
+sources = [text, text]
+
+multiple_search_graph = JSONScraperMultiGraph(
+    prompt= "List me all the authors, title and genres of the books",
+    source= sources,
+    schema=None,
+    config=graph_config
+)
+
+result = multiple_search_graph.run()
+print(json.dumps(result, indent=4))

examples/anthropic/pdf_scraper_graph_haiku.py

Lines changed: 3 additions & 1 deletion
@@ -1,10 +1,12 @@
+"""
+Module showing how PDFScraperGraph works
+"""
 import os, json
 from dotenv import load_dotenv
 from scrapegraphai.graphs import PDFScraperGraph

 load_dotenv()

-
 # ************************************************
 # Define the configuration for the graph
 # ************************************************
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+"""
+Module showing how PdfScraperMultiGraph works
+"""
+import os
+import json
+from dotenv import load_dotenv
+from scrapegraphai.graphs import PdfScraperMultiGraph
+
+load_dotenv()
+
+graph_config = {
+    "llm": {
+        "api_key": os.getenv("ANTHROPIC_API_KEY"),
+        "model": "claude-3-haiku-20240307",
+        "max_tokens": 4000
+    },
+}
+
+# ***************
+# Convert to list
+# ***************
+
+sources = [
+    "This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather—the interaction between call center architecture and outdoor weather conditions—in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity – largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.",
+    "This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather—the interaction between call center architecture and outdoor weather conditions—in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity – largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.",
+    "This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather—the interaction between call center architecture and outdoor weather conditions—in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity – largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.",
+    "This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather—the interaction between call center architecture and outdoor weather conditions—in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity – largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.",
+]
+
+prompt = """
+You are an expert in reviewing academic manuscripts. Please analyze the abstracts provided from an academic journal article to extract and clearly identify the following elements:
+
+Independent Variable (IV): The variable that is manipulated or considered as the primary cause affecting other variables.
+Dependent Variable (DV): The variable that is measured or observed, which is expected to change as a result of variations in the Independent Variable.
+Exogenous Shock: Identify any external or unexpected events used in the study that serve as a natural experiment or provide a unique setting for observing the effects on the IV and DV.
+Response Format: For each abstract, present your response in the following structured format:
+
+Independent Variable (IV):
+Dependent Variable (DV):
+Exogenous Shock:
+
+Example Queries and Responses:
+
+Query: This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather the interaction between call center architecture and outdoor weather conditions in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.
+
+Response:
+
+Independent Variable (IV): Employee happiness.
+Dependent Variable (DV): Overall firm productivity.
+Exogenous Shock: Sudden company-wide increase in bonus payments.
+
+Query: The diffusion of social media coincided with a worsening of mental health conditions among adolescents and young adults in the United States, giving rise to speculation that social media might be detrimental to mental health. In this paper, we provide quasi-experimental estimates of the impact of social media on mental health by leveraging a unique natural experiment: the staggered introduction of Facebook across U.S. colleges. Our analysis couples data on student mental health around the years of Facebook's expansion with a generalized difference-in-differences empirical strategy. We find that the roll-out of Facebook at a college increased symptoms of poor mental health, especially depression. We also find that, among students predicted to be most susceptible to mental illness, the introduction of Facebook led to increased utilization of mental healthcare services. Lastly, we find that, after the introduction of Facebook, students were more likely to report experiencing impairments to academic performance resulting from poor mental health. Additional evidence on mechanisms suggests that the results are due to Facebook fostering unfavorable social comparisons.
+
+Response:
+
+Independent Variable (IV): Exposure to social media.
+Dependent Variable (DV): Mental health outcomes.
+Exogenous Shock: staggered introduction of Facebook across U.S. colleges.
+"""
+# *******************************************************
+# Create the PdfScraperMultiGraph instance and run it
+# *******************************************************
+
+multiple_search_graph = PdfScraperMultiGraph(
+    prompt=prompt,
+    source= sources,
+    schema=None,
+    config=graph_config
+)
+
+result = multiple_search_graph.run()
+print(json.dumps(result, indent=4))
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+"""
+Basic example of scraping pipeline using ScriptCreatorMultiGraph
+"""
+
+import os
+from dotenv import load_dotenv
+from scrapegraphai.graphs import ScriptCreatorMultiGraph
+from scrapegraphai.utils import prettify_exec_info
+
+load_dotenv()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+graph_config = {
+    "llm": {
+        "api_key": os.getenv("ANTHROPIC_API_KEY"),
+        "model": "claude-3-haiku-20240307",
+        "max_tokens": 4000
+    },
+    "library": "beautifulsoup"
+}
+
+# ************************************************
+# Define the list of URLs to scrape
+# ************************************************
+
+urls=[
+    "https://schultzbergagency.com/emil-raste-karlsen/",
+    "https://schultzbergagency.com/johanna-hedberg/",
+]
+
+# ************************************************
+# Create the ScriptCreatorMultiGraph instance and run it
+# ************************************************
+
+script_creator_graph = ScriptCreatorMultiGraph(
+    prompt="Find information about actors",
+    # also accepts a string with the already downloaded HTML code
+    source=urls,
+    config=graph_config
+)
+
+result = script_creator_graph.run()
+print(result)
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = script_creator_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))
