
Commit 8b032a9

Merge pull request #293 from VinciGit00/pdf_scraper_refactoring
fix(pdf_scraper): fix the pdf scraper graph
2 parents e1006f3 + a4ee757 commit 8b032a9

7 files changed, +113 -70 lines changed


examples/openai/pdf_scraper_openai.py

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
"""
Basic example of scraping pipeline using PDFScraper
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import PDFScraperGraph

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
    "verbose": True,
    "headless": False,
}

# Convert to list
sources = [
    "This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather—the interaction between call center architecture and outdoor weather conditions—in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity – largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.",
    "The diffusion of social media coincided with a worsening of mental health conditions among adolescents and young adults in the United States, giving rise to speculation that social media might be detrimental to mental health. In this paper, we provide quasi-experimental estimates of the impact of social media on mental health by leveraging a unique natural experiment: the staggered introduction of Facebook across U.S. colleges. Our analysis couples data on student mental health around the years of Facebook's expansion with a generalized difference-in-differences empirical strategy. We find that the roll-out of Facebook at a college increased symptoms of poor mental health, especially depression. We also find that, among students predicted to be most susceptible to mental illness, the introduction of Facebook led to increased utilization of mental healthcare services. Lastly, we find that, after the introduction of Facebook, students were more likely to report experiencing impairments to academic performance resulting from poor mental health. Additional evidence on mechanisms suggests that the results are due to Facebook fostering unfavorable social comparisons.",
    "Hollywood films are generally released first in the United States and then later abroad, with some variation in lags across films and countries. With the growth in movie piracy since the appearance of BitTorrent in 2003, films have become available through illegal piracy immediately after release in the US, while they are not available for legal viewing abroad until their foreign premieres in each country. We make use of this variation in international release lags to ask whether longer lags – which facilitate more local pre-release piracy – depress theatrical box office receipts, particularly after the widespread adoption of BitTorrent. We find that longer release windows are associated with decreased box office returns, even after controlling for film and country fixed effects. This relationship is much stronger in contexts where piracy is more prevalent: after BitTorrent’s adoption and in heavily-pirated genres. Our findings indicate that, as a lower bound, international box office returns in our sample were at least 7% lower than they would have been in the absence of pre-release piracy. By contrast, we do not see evidence of elevated sales displacement in US box office revenue following the adoption of BitTorrent, and we suggest that delayed legal availability of the content abroad may drive the losses to piracy."
    # Add more sources here
]

prompt = """
You are an expert in reviewing academic manuscripts. Please analyze the abstracts provided from an academic journal article to extract and clearly identify the following elements:

Independent Variable (IV): The variable that is manipulated or considered as the primary cause affecting other variables.
Dependent Variable (DV): The variable that is measured or observed, which is expected to change as a result of variations in the Independent Variable.
Exogenous Shock: Identify any external or unexpected events used in the study that serve as a natural experiment or provide a unique setting for observing the effects on the IV and DV.
Response Format: For each abstract, present your response in the following structured format:

Independent Variable (IV):
Dependent Variable (DV):
Exogenous Shock:

Example Queries and Responses:

Query: This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather the interaction between call center architecture and outdoor weather conditions in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.

Response:

Independent Variable (IV): Employee happiness.
Dependent Variable (DV): Overall firm productivity.
Exogenous Shock: Sudden company-wide increase in bonus payments.

Query: The diffusion of social media coincided with a worsening of mental health conditions among adolescents and young adults in the United States, giving rise to speculation that social media might be detrimental to mental health. In this paper, we provide quasi-experimental estimates of the impact of social media on mental health by leveraging a unique natural experiment: the staggered introduction of Facebook across U.S. colleges. Our analysis couples data on student mental health around the years of Facebook's expansion with a generalized difference-in-differences empirical strategy. We find that the roll-out of Facebook at a college increased symptoms of poor mental health, especially depression. We also find that, among students predicted to be most susceptible to mental illness, the introduction of Facebook led to increased utilization of mental healthcare services. Lastly, we find that, after the introduction of Facebook, students were more likely to report experiencing impairments to academic performance resulting from poor mental health. Additional evidence on mechanisms suggests that the results are due to Facebook fostering unfavorable social comparisons.

Response:

Independent Variable (IV): Exposure to social media.
Dependent Variable (DV): Mental health outcomes.
Exogenous Shock: staggered introduction of Facebook across U.S. colleges.
"""

pdf_scraper_graph = PDFScraperGraph(
    prompt=prompt,
    source=sources[0],
    config=graph_config
)
result = pdf_scraper_graph.run()


print(result)
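
The example above only runs the first abstract. A minimal, hedged sketch of how one might loop over every entry in sources with the same configuration (the loop and variable names are illustrative, not part of the commit):

# Hedged sketch: run the graph over every abstract with the same config.
# Assumes the `sources`, `prompt`, and `graph_config` objects defined above.
for i, abstract in enumerate(sources):
    graph = PDFScraperGraph(
        prompt=prompt,
        source=abstract,
        config=graph_config,
    )
    print(f"--- abstract {i} ---")
    print(graph.run())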

scrapegraphai/graphs/abstract_graph.py

Lines changed: 18 additions & 9 deletions
@@ -196,6 +196,7 @@ def _create_llm(self, llm_config: dict, chat=False) -> object:
             try:
                 self.model_token = models_tokens["ollama"][llm_params["model"]]
             except KeyError as exc:
+                print("model not found, using default token size (8192)")
                 self.model_token = 8192
         else:
             self.model_token = 8192
@@ -206,25 +207,28 @@ def _create_llm(self, llm_config: dict, chat=False) -> object:
         elif "hugging_face" in llm_params["model"]:
             try:
                 self.model_token = models_tokens["hugging_face"][llm_params["model"]]
-            except KeyError as exc:
-                raise KeyError("Model not supported") from exc
+            except KeyError:
+                print("model not found, using default token size (8192)")
+                self.model_token = 8192
             return HuggingFace(llm_params)
         elif "groq" in llm_params["model"]:
             llm_params["model"] = llm_params["model"].split("/")[-1]

             try:
                 self.model_token = models_tokens["groq"][llm_params["model"]]
-            except KeyError as exc:
-                raise KeyError("Model not supported") from exc
+            except KeyError:
+                print("model not found, using default token size (8192)")
+                self.model_token = 8192
             return Groq(llm_params)
         elif "bedrock" in llm_params["model"]:
             llm_params["model"] = llm_params["model"].split("/")[-1]
             model_id = llm_params["model"]
             client = llm_params.get("client", None)
             try:
                 self.model_token = models_tokens["bedrock"][llm_params["model"]]
-            except KeyError as exc:
-                raise KeyError("Model not supported") from exc
+            except KeyError:
+                print("model not found, using default token size (8192)")
+                self.model_token = 8192
             return Bedrock(
                 {
                     "client": client,
@@ -235,13 +239,18 @@ def _create_llm(self, llm_config: dict, chat=False) -> object:
                 }
             )
         elif "claude-3-" in llm_params["model"]:
-            self.model_token = models_tokens["claude"]["claude3"]
+            try:
+                self.model_token = models_tokens["claude"]["claude3"]
+            except KeyError:
+                print("model not found, using default token size (8192)")
+                self.model_token = 8192
             return Anthropic(llm_params)
         elif "deepseek" in llm_params["model"]:
             try:
                 self.model_token = models_tokens["deepseek"][llm_params["model"]]
-            except KeyError as exc:
-                raise KeyError("Model not supported") from exc
+            except KeyError:
+                print("model not found, using default token size (8192)")
+                self.model_token = 8192
             return DeepSeek(llm_params)
         else:
             raise ValueError("Model provided by the configuration not supported")
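
These hunks replace hard failures (raise KeyError("Model not supported")) with a logged fallback to a default context size. A small, self-contained sketch of the pattern; the dictionary contents and model names below are made up for illustration, and only the 8192 default comes from the diff:

# Illustrative only: mimics the new lookup-with-fallback behaviour.
models_tokens = {"openai": {"gpt-3.5-turbo": 16385}}  # hypothetical subset

def lookup_token_limit(provider: str, model: str) -> int:
    try:
        return models_tokens[provider][model]
    except KeyError:
        # New behaviour: warn and fall back instead of raising.
        print("model not found, using default token size (8192)")
        return 8192

print(lookup_token_limit("openai", "gpt-3.5-turbo"))  # known model
print(lookup_token_limit("groq", "unknown-model"))    # falls back to 8192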

scrapegraphai/graphs/csv_scraper_graph.py

Lines changed: 4 additions & 14 deletions
@@ -9,7 +9,6 @@

 from ..nodes import (
     FetchNode,
-    ParseNode,
     RAGNode,
     GenerateAnswerCSVNode
 )
@@ -35,25 +34,18 @@ def _create_graph(self):
         """
         fetch_node = FetchNode(
             input="csv | csv_dir",
-            output=["doc", "link_urls", "img_urls"],
-        )
-        parse_node = ParseNode(
-            input="doc",
-            output=["parsed_doc"],
-            node_config={
-                "chunk_size": self.model_token,
-            }
+            output=["doc"],
         )
         rag_node = RAGNode(
-            input="user_prompt & (parsed_doc | doc)",
+            input="user_prompt & doc",
             output=["relevant_chunks"],
             node_config={
                 "llm_model": self.llm_model,
                 "embedder_model": self.embedder_model,
             }
         )
         generate_answer_node = GenerateAnswerCSVNode(
-            input="user_prompt & (relevant_chunks | parsed_doc | doc)",
+            input="user_prompt & (relevant_chunks | doc)",
             output=["answer"],
             node_config={
                 "llm_model": self.llm_model,
@@ -64,13 +56,11 @@ def _create_graph(self):
         return BaseGraph(
             nodes=[
                 fetch_node,
-                parse_node,
                 rag_node,
                 generate_answer_node,
             ],
             edges=[
-                (fetch_node, parse_node),
-                (parse_node, rag_node),
+                (fetch_node, rag_node),
                 (rag_node, generate_answer_node)
             ],
             entry_point=fetch_node
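
After this change the CSV pipeline runs fetch, retrieval, and answer generation directly, with no ParseNode in between. A hedged usage sketch patterned on the PDF example above; the file path and question are placeholders, and CSVScraperGraph is assumed to be the graph class this module exposes:

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import CSVScraperGraph

load_dotenv()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_APIKEY"),
        "model": "gpt-3.5-turbo",
    },
    "verbose": True,
}

# Placeholder path and prompt; adjust to your own data.
csv_scraper_graph = CSVScraperGraph(
    prompt="List all the rows where the price is above 100.",
    source="data/products.csv",
    config=graph_config,
)
print(csv_scraper_graph.run())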

scrapegraphai/graphs/json_scraper_graph.py

Lines changed: 1 addition & 11 deletions
@@ -9,7 +9,6 @@

 from ..nodes import (
     FetchNode,
-    ParseNode,
     RAGNode,
     GenerateAnswerNode
 )
@@ -62,13 +61,6 @@ def _create_graph(self) -> BaseGraph:
             input="json | json_dir",
             output=["doc", "link_urls", "img_urls"],
         )
-        parse_node = ParseNode(
-            input="doc",
-            output=["parsed_doc"],
-            node_config={
-                "chunk_size": self.model_token
-            }
-        )
         rag_node = RAGNode(
             input="user_prompt & (parsed_doc | doc)",
             output=["relevant_chunks"],
@@ -89,13 +81,11 @@ def _create_graph(self) -> BaseGraph:
         return BaseGraph(
             nodes=[
                 fetch_node,
-                parse_node,
                 rag_node,
                 generate_answer_node,
             ],
             edges=[
-                (fetch_node, parse_node),
-                (parse_node, rag_node),
+                (fetch_node, rag_node),
                 (rag_node, generate_answer_node)
             ],
             entry_point=fetch_node

scrapegraphai/graphs/pdf_scraper_graph.py

Lines changed: 11 additions & 22 deletions
@@ -9,9 +9,8 @@

 from ..nodes import (
     FetchNode,
-    ParseNode,
     RAGNode,
-    GenerateAnswerNode
+    GenerateAnswerPDFNode
 )


@@ -48,7 +47,7 @@ class PDFScraperGraph(AbstractGraph):
     """

     def __init__(self, prompt: str, source: str, config: dict, schema: Optional[str] = None):
-        super().__init__(prompt, config, source, schema)
+        super().__init__(prompt, config, source)

         self.input_key = "pdf" if source.endswith("pdf") else "pdf_dir"

@@ -62,43 +61,33 @@ def _create_graph(self) -> BaseGraph:

         fetch_node = FetchNode(
             input='pdf | pdf_dir',
-            output=["doc", "link_urls", "img_urls"],
-        )
-        parse_node = ParseNode(
-            input="doc",
-            output=["parsed_doc"],
-            node_config={
-                "chunk_size": self.model_token,
-            }
+            output=["doc"],
         )
         rag_node = RAGNode(
-            input="user_prompt & (parsed_doc | doc)",
+            input="user_prompt & doc",
             output=["relevant_chunks"],
             node_config={
                 "llm_model": self.llm_model,
-                "embedder_model": self.embedder_model,
+                "embedder_model": self.embedder_model
             }
         )
-        generate_answer_node = GenerateAnswerNode(
-            input="user_prompt & (relevant_chunks | parsed_doc | doc)",
+        generate_answer_node_pdf = GenerateAnswerPDFNode(
+            input="user_prompt & (relevant_chunks | doc)",
             output=["answer"],
             node_config={
                 "llm_model": self.llm_model,
-                "schema": self.schema,
             }
         )

         return BaseGraph(
             nodes=[
                 fetch_node,
-                parse_node,
                 rag_node,
-                generate_answer_node,
+                generate_answer_node_pdf,
             ],
             edges=[
-                (fetch_node, parse_node),
-                (parse_node, rag_node),
-                (rag_node, generate_answer_node)
+                (fetch_node, rag_node),
+                (rag_node, generate_answer_node_pdf)
             ],
             entry_point=fetch_node
         )
@@ -114,4 +103,4 @@ def run(self) -> str:
         inputs = {"user_prompt": self.prompt, self.input_key: self.source}
         self.final_state, self.execution_info = self.graph.execute(inputs)

-        return self.final_state.get("answer", "No answer found.")
+        return self.final_state.get("answer", "No answer found.")
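
After the refactor the PDF pipeline has three nodes (FetchNode, RAGNode, GenerateAnswerPDFNode) and no intermediate parsed_doc key. A rough, hedged sketch of the state keys as they flow through graph.execute(inputs); the values are placeholders, not the library's internals:

# Illustrative state flow only; the real work happens inside the nodes.
state = {
    "user_prompt": "Extract the IV, DV and exogenous shock.",
    "pdf": "text or path of the source document",
}
state["doc"] = [state["pdf"]]               # FetchNode output (no ParseNode step anymore)
state["relevant_chunks"] = state["doc"]     # RAGNode output (retrieval placeholder)
state["answer"] = "..."                     # GenerateAnswerPDFNode output
print(state["answer"])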

scrapegraphai/graphs/xml_scraper_graph.py

Lines changed: 3 additions & 13 deletions
@@ -9,7 +9,6 @@

 from ..nodes import (
     FetchNode,
-    ParseNode,
     RAGNode,
     GenerateAnswerNode
 )
@@ -64,23 +63,16 @@ def _create_graph(self) -> BaseGraph:
             input="xml | xml_dir",
             output=["doc", "link_urls", "img_urls"]
         )
-        parse_node = ParseNode(
-            input="doc",
-            output=["parsed_doc"],
-            node_config={
-                "chunk_size": self.model_token
-            }
-        )
         rag_node = RAGNode(
-            input="user_prompt & (parsed_doc | doc)",
+            input="user_prompt & doc",
             output=["relevant_chunks"],
             node_config={
                 "llm_model": self.llm_model,
                 "embedder_model": self.embedder_model
             }
         )
         generate_answer_node = GenerateAnswerNode(
-            input="user_prompt & (relevant_chunks | parsed_doc | doc)",
+            input="user_prompt & (relevant_chunks | doc)",
             output=["answer"],
             node_config={
                 "llm_model": self.llm_model,
@@ -91,13 +83,11 @@ def _create_graph(self) -> BaseGraph:
         return BaseGraph(
             nodes=[
                 fetch_node,
-                parse_node,
                 rag_node,
                 generate_answer_node,
             ],
             edges=[
-                (fetch_node, parse_node),
-                (parse_node, rag_node),
+                (fetch_node, rag_node),
                 (rag_node, generate_answer_node)
             ],
             entry_point=fetch_node

scrapegraphai/nodes/fetch_node.py

Lines changed: 2 additions & 1 deletion
@@ -90,8 +90,9 @@ def execute(self, state):
             or input_keys[0] == "pdf_dir"
         ):
             compressed_document = [
-                Document(page_content=source, metadata={"source": "local_dir"})
+                source
             ]
+
             state.update({self.output[0]: compressed_document})
             return state
         # handling for pdf
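
With this change, the pdf/pdf_dir branch of FetchNode stores the raw source string in the state instead of wrapping it in a langchain Document. A small hedged illustration of the resulting state shape; it is standalone and does not import the library:

# Illustration only: shape of the state produced by the pdf/pdf_dir branch.
source = "This paper provides evidence from a natural experiment ..."

# Before this commit (conceptually): a Document wrapper around the text.
# compressed_document = [Document(page_content=source, metadata={"source": "local_dir"})]

# After this commit: the raw string is stored directly under "doc".
compressed_document = [source]
state = {"doc": compressed_document}
print(state["doc"][0][:60])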
