Commit f5cbd80

feat: add pdf scraper multi graph
1 parent 930f673 commit f5cbd80

8 files changed

+190
-6
lines changed
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+"""
+Module for showing how PDFScraper multi works
+"""
+from scrapegraphai.graphs import PdfScraperMultiGraph
+
+graph_config = {
+    "llm": {
+        "model": "ollama/llama3",
+        "temperature": 0,
+        "format": "json",  # Ollama needs the format to be specified explicitly
+        "model_tokens": 4000,
+    },
+    "embeddings": {
+        "model": "ollama/nomic-embed-text",
+        "temperature": 0,
+    },
+    "verbose": True,
+    "headless": False,
+}
+
+# Convert to list
+sources = [
+    "This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather—the interaction between call center architecture and outdoor weather conditions—in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity – largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.",
+    "This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather—the interaction between call center architecture and outdoor weather conditions—in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity – largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.",
+    "This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather—the interaction between call center architecture and outdoor weather conditions—in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity – largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.",
+    "This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather—the interaction between call center architecture and outdoor weather conditions—in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity – largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.",
+]
+
+prompt = """
+You are an expert in reviewing academic manuscripts. Please analyze the abstracts provided from an academic journal article to extract and clearly identify the following elements:
+
+Independent Variable (IV): The variable that is manipulated or considered as the primary cause affecting other variables.
+Dependent Variable (DV): The variable that is measured or observed, which is expected to change as a result of variations in the Independent Variable.
+Exogenous Shock: Identify any external or unexpected events used in the study that serve as a natural experiment or provide a unique setting for observing the effects on the IV and DV.
+Response Format: For each abstract, present your response in the following structured format:
+
+Independent Variable (IV):
+Dependent Variable (DV):
+Exogenous Shock:
+
+Example Queries and Responses:
+
+Query: This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather the interaction between call center architecture and outdoor weather conditions in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.
+
+Response:
+
+Independent Variable (IV): Employee happiness.
+Dependent Variable (DV): Overall firm productivity.
+Exogenous Shock: Sudden company-wide increase in bonus payments.
+
+Query: The diffusion of social media coincided with a worsening of mental health conditions among adolescents and young adults in the United States, giving rise to speculation that social media might be detrimental to mental health. In this paper, we provide quasi-experimental estimates of the impact of social media on mental health by leveraging a unique natural experiment: the staggered introduction of Facebook across U.S. colleges. Our analysis couples data on student mental health around the years of Facebook's expansion with a generalized difference-in-differences empirical strategy. We find that the roll-out of Facebook at a college increased symptoms of poor mental health, especially depression. We also find that, among students predicted to be most susceptible to mental illness, the introduction of Facebook led to increased utilization of mental healthcare services. Lastly, we find that, after the introduction of Facebook, students were more likely to report experiencing impairments to academic performance resulting from poor mental health. Additional evidence on mechanisms suggests that the results are due to Facebook fostering unfavorable social comparisons.
+
+Response:
+
+Independent Variable (IV): Exposure to social media.
+Dependent Variable (DV): Mental health outcomes.
+Exogenous Shock: staggered introduction of Facebook across U.S. colleges.
+"""
+results = []
+for source in sources:
+    pdf_scraper_graph = PdfScraperMultiGraph(
+        prompt=prompt,
+        source=source,
+        config=graph_config
+    )
+    result = pdf_scraper_graph.run()
+    results.append(result)
+
+print(results)
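
One detail worth checking in the example above: the loop passes each abstract (a single `str`) as `source`, while the constructor added in this commit annotates `source` as `List[str]`. The following is a minimal sketch, not the library's code, of why that distinction can matter. `iterate_sources` is a hypothetical stand-in that assumes the iterator stage simply loops over whatever it receives.

```python
from typing import Iterable, List


def iterate_sources(sources: Iterable[str]) -> List[str]:
    """Hypothetical stand-in for an iterator stage: one work item per element."""
    return [f"item<{s}>" for s in sources]


abstracts = ["first abstract", "second abstract"]

# Passing the whole list yields one work item per abstract.
per_document = iterate_sources(abstracts)

# Passing a single string yields one work item per character,
# because iterating over a str produces its characters.
per_character = iterate_sources(abstracts[0])

print(len(per_document), len(per_character))
```

Under that assumption, passing `sources` once, or wrapping each item as `[source]`, would keep one work item per document.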

scrapegraphai/graphs/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -16,3 +16,4 @@
 from .omni_scraper_graph import OmniScraperGraph
 from .omni_search_graph import OmniSearchGraph
 from .smart_scraper_multi_graph import SmartScraperMultiGraph
+from .pdf_scraper_multi import PdfScraperMultiGraph
Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
+"""
+PdfScraperMultiGraph Module
+"""
+
+from copy import copy, deepcopy
+from typing import List, Optional
+
+from .base_graph import BaseGraph
+from .abstract_graph import AbstractGraph
+from .pdf_scraper_graph import PDFScraperGraph
+
+from ..nodes import (
+    GraphIteratorNode,
+    MergeAnswersNode
+)
+
+
+class PdfScraperMultiGraph(AbstractGraph):
+    """
+    PdfScraperMultiGraph is a scraping pipeline that scrapes a
+    list of PDF sources and generates answers to a given prompt.
+    It only requires a user prompt and a list of PDF sources.
+
+    Attributes:
+        prompt (str): The user prompt to answer from the PDF sources.
+        llm_model (dict): The configuration for the language model.
+        embedder_model (dict): The configuration for the embedder model.
+        headless (bool): A flag to run the browser in headless mode.
+        verbose (bool): A flag to display the execution information.
+        model_token (int): The token limit for the language model.
+
+    Args:
+        prompt (str): The user prompt to answer from the PDF sources.
+        source (List[str]): The list of PDF sources for the graph.
+        config (dict): Configuration parameters for the graph.
+        schema (Optional[str]): The schema for the graph output.
+
+    Example:
+        >>> search_graph = MultipleSearchGraph(
+        ...     "What is Chioggia famous for?",
+        ...     {"llm": {"model": "gpt-3.5-turbo"}}
+        ... )
+        >>> result = search_graph.run()
+    """
+
+    def __init__(self, prompt: str, source: List[str], config: dict, schema: Optional[str] = None):
+
+        self.max_results = config.get("max_results", 3)
+
+        if all(isinstance(value, str) for value in config.values()):
+            self.copy_config = copy(config)
+        else:
+            self.copy_config = deepcopy(config)
+
+        super().__init__(prompt, config, source, schema)
+
+    def _create_graph(self) -> BaseGraph:
+        """
+        Creates the graph of nodes representing the workflow for scraping multiple PDFs and merging the answers.
+
+        Returns:
+            BaseGraph: A graph instance representing the PDF scraping and answer-merging workflow.
+        """
+
+        # ************************************************
+        # Create a PDFScraperGraph instance
+        # ************************************************
+
+        pdf_scraper_instance = PDFScraperGraph(
+            prompt="",
+            source="",
+            config=self.copy_config,
+        )
+
+        # ************************************************
+        # Define the graph nodes
+        # ************************************************
+
+        graph_iterator_node = GraphIteratorNode(
+            input="user_prompt & pdfs",
+            output=["results"],
+            node_config={
+                "graph_instance": pdf_scraper_instance,
+            }
+        )
+
+        merge_answers_node = MergeAnswersNode(
+            input="user_prompt & results",
+            output=["answer"],
+            node_config={
+                "llm_model": self.llm_model,
+                "schema": self.schema
+            }
+        )
+
+        return BaseGraph(
+            nodes=[
+                graph_iterator_node,
+                merge_answers_node,
+            ],
+            edges=[
+                (graph_iterator_node, merge_answers_node),
+            ],
+            entry_point=graph_iterator_node
+        )
+
+    def run(self) -> str:
+        """
+        Executes the PDF scraping and answer-merging process.
+
+        Returns:
+            str: The answer to the prompt.
+        """
+        inputs = {"user_prompt": self.prompt, "pdfs": self.source}
+        self.final_state, self.execution_info = self.graph.execute(inputs)
+
+        return self.final_state.get("answer", "No answer found.")
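
The module above wires two stages: an iterator that runs one scraper per PDF, and a merge step that combines the partial answers. The sketch below mirrors that flow with plain functions so it runs without the library. `snapshot_config`, `answer_one`, and `merge_answers` are hypothetical stand-ins, not scrapegraphai APIs, and the string joining only imitates what an LLM-backed merge would produce.

```python
from copy import copy, deepcopy
from typing import Dict, List


def snapshot_config(config: dict) -> dict:
    """Same branch as the constructor: shallow copy if every value is a str."""
    if all(isinstance(value, str) for value in config.values()):
        return copy(config)
    return deepcopy(config)


def answer_one(prompt: str, pdf: str, config: dict) -> str:
    """Hypothetical per-document stage (stands in for one scraper run)."""
    return f"{prompt} -> {pdf[:12]}"


def merge_answers(results: List[str]) -> str:
    """Hypothetical merge stage (stands in for the LLM-backed merge)."""
    return " | ".join(results)


def run(prompt: str, pdfs: List[str], config: dict) -> str:
    cfg = snapshot_config(config)
    state: Dict[str, object] = {"user_prompt": prompt, "pdfs": pdfs}
    # Stage 1: one result per PDF, stored under "results".
    state["results"] = [answer_one(prompt, pdf, cfg) for pdf in pdfs]
    # Stage 2: merge only when there is something to merge.
    if state["results"]:
        state["answer"] = merge_answers(state["results"])
    # Same fallback string as the module's run().
    return str(state.get("answer", "No answer found."))


config = {"llm": {"model": "ollama/llama3"}, "verbose": True}
snap = snapshot_config(config)
snap["llm"]["model"] = "changed"  # only the deep copy is modified
print(config["llm"]["model"])
print(run("Summarize", ["doc one text", "doc two text"], config))
print(run("Summarize", [], config))
```

Because the example config holds nested dictionaries, the deep-copy branch is taken, so changes made to the snapshot do not leak back into the caller's `config`.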

scrapegraphai/nodes/generate_answer_csv_node.py

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ def __init__(
         input: str,
         output: List[str],
         node_config: Optional[dict] = None,
-        node_name: str = "GenerateAnswer",
+        node_name: str = "GenerateAnswerCSV",
     ):
         """
         Initializes the GenerateAnswerNodeCsv with a language model client and a node name.

scrapegraphai/nodes/generate_answer_pdf_node.py

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ def __init__(
         input: str,
         output: List[str],
         node_config: Optional[dict] = None,
-        node_name: str = "GenerateAnswer",
+        node_name: str = "GenerateAnswerPDF",
     ):
         """
         Initializes the GenerateAnswerNodePDF with a language model client and a node name.

scrapegraphai/nodes/generate_scraper_node.py

Lines changed: 0 additions & 1 deletion
@@ -10,7 +10,6 @@
 from langchain_core.output_parsers import StrOutputParser
 from langchain_core.runnables import RunnableParallel
 from tqdm import tqdm
-
 from ..utils.logging import get_logger
 
 # Imports from the library

scrapegraphai/nodes/get_probable_tags_node.py

Lines changed: 0 additions & 2 deletions
@@ -3,10 +3,8 @@
 """
 
 from typing import List, Optional
-
 from langchain.output_parsers import CommaSeparatedListOutputParser
 from langchain.prompts import PromptTemplate
-
 from ..utils.logging import get_logger
 from .base_node import BaseNode
 

scrapegraphai/nodes/robots_node.py

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ def __init__(
         input: str,
         output: List[str],
         node_config: Optional[dict] = None,
-        node_name: str = "Robots",
+        node_name: str = "RobotNode",
 
     ):
         super().__init__(node_name, "node", input, output, 1)
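
This commit gives three nodes distinct default names (`GenerateAnswerCSV`, `GenerateAnswerPDF`, `RobotNode`). The commit message does not say why. One plausible benefit is that per-node reports stay distinguishable, and the sketch below illustrates that under the assumption that execution info is labelled by node name. `Node` and `summarize` are hypothetical and not part of the library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical node: only carries a name and a made-up token count."""
    node_name: str
    tokens: int


def summarize(nodes: List[Node]) -> Dict[str, int]:
    """Aggregate token usage per node name."""
    report: Dict[str, int] = {}
    for node in nodes:
        report[node.node_name] = report.get(node.node_name, 0) + node.tokens
    return report


# Before: both answer nodes share a default name, so their rows merge.
before = summarize([Node("GenerateAnswer", 120), Node("GenerateAnswer", 80)])

# After: distinct defaults keep one row per node type.
after = summarize([Node("GenerateAnswerCSV", 120), Node("GenerateAnswerPDF", 80)])

print(before)
print(after)
```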
