
Commit 0ab7272

Merge branch 'pre/beta' into 133-support-claude3-haiku-and-others-using-litellm

2 parents: 5bdee55 + 5aa600c

File tree: 14 files changed (+473 −20 lines)

CHANGELOG.md

Lines changed: 76 additions & 0 deletions

## [0.9.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.9.0-beta.1...v0.9.0-beta.2) (2024-05-05)

### Features

* refactoring search function ([aeb1acb](https://github.com/VinciGit00/Scrapegraph-ai/commit/aeb1acbf05e63316c91672c99d88f8a6f338147f))

### Bug Fixes

* bug on .toml ([f7d66f5](https://github.com/VinciGit00/Scrapegraph-ai/commit/f7d66f51818dbdfddd0fa326f26265a3ab686b20))

## [0.9.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.8.0...v0.9.0-beta.1) (2024-05-04)

### Features

* Enable end users to pass model instances of HuggingFaceHub ([7599234](https://github.com/VinciGit00/Scrapegraph-ai/commit/7599234ab9563ca4ee9b7f5b2d0267daac621ecf))

### Build

* **deps:** bump tqdm from 4.66.1 to 4.66.3 ([0a17c74](https://github.com/VinciGit00/Scrapegraph-ai/commit/0a17c74e50d0457aec289e81183e9c779c735842))
* **deps:** bump tqdm from 4.66.1 to 4.66.3 ([aff6f98](https://github.com/VinciGit00/Scrapegraph-ai/commit/aff6f983b02a37ced21826847a6ace5fb15ecf3d))

### CI

* **release:** 0.8.0-beta.1 [skip ci] ([d277b34](https://github.com/VinciGit00/Scrapegraph-ai/commit/d277b349a98848749a7e38ea3c511271bced3b71))
* **release:** 0.8.0-beta.2 [skip ci] ([892500a](https://github.com/VinciGit00/Scrapegraph-ai/commit/892500afe93c4d96dcffe897b382977a22079b83))

## [0.8.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.7.0...v0.8.0) (2024-05-03)

### Features

* add pdf scraper ([10a9453](https://github.com/VinciGit00/Scrapegraph-ai/commit/10a94530e3fd4dfde933ecfa96cb3e21df72e606))

### CI

* **release:** 0.7.0-beta.3 [skip ci] ([fbb06ab](https://github.com/VinciGit00/Scrapegraph-ai/commit/fbb06ab551fac9cc9824ad567f042e55450277bd))

## [0.7.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.6.2...v0.7.0) (2024-05-03)

### Features

* add base_node to __init__.py ([cb1cb61](https://github.com/VinciGit00/Scrapegraph-ai/commit/cb1cb616b7998d3624bf57b19b5f1b1945fea4ef))
* Azure implementation + embeddings refactoring ([aa9271e](https://github.com/VinciGit00/Scrapegraph-ai/commit/aa9271e7bc4daa54860499d0615580b17550ff58))

### Refactor

* Changed the way embedding model is created in AbstractGraph class and removed handling of embedding model creation from RAGNode. Now AbstractGraph will call a dedicated method for embedding models instead of _create_llm. This makes it easy to use any LLM with any supported embedding model. ([819cbcd](https://github.com/VinciGit00/Scrapegraph-ai/commit/819cbcd3be1a8cb195de0b44c6b6d4d824e2a42a))

### CI

* **release:** 0.7.0-beta.1 [skip ci] ([98dec36](https://github.com/VinciGit00/Scrapegraph-ai/commit/98dec36c60d1dc8b072482e8d514c3869a45a3f8))
* **release:** 0.7.0-beta.2 [skip ci] ([42fa02e](https://github.com/VinciGit00/Scrapegraph-ai/commit/42fa02e65a3a81796bd66e55cf9dd1d1b692cb89))

## [0.7.0-beta.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.7.0-beta.2...v0.7.0-beta.3) (2024-05-03)

## [0.7.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.7.0-beta.1...v0.7.0-beta.2) (2024-05-03)

### Features

* Azure implementation + embeddings refactoring ([aa9271e](https://github.com/VinciGit00/Scrapegraph-ai/commit/aa9271e7bc4daa54860499d0615580b17550ff58))
* add pdf scraper ([10a9453](https://github.com/VinciGit00/Scrapegraph-ai/commit/10a94530e3fd4dfde933ecfa96cb3e21df72e606))

### Refactor

* Changed the way embedding model is created in AbstractGraph class and removed handling of embedding model creation from RAGNode. Now AbstractGraph will call a dedicated method for embedding models instead of _create_llm. This makes it easy to use any LLM with any supported embedding model. ([819cbcd](https://github.com/VinciGit00/Scrapegraph-ai/commit/819cbcd3be1a8cb195de0b44c6b6d4d824e2a42a))

## [0.7.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.6.2...v0.7.0-beta.1) (2024-05-03)

docs/source/getting_started/installation.rst

Lines changed: 6 additions & 0 deletions

@@ -19,6 +19,12 @@ Install the library

    pip install scrapegraphai

+Additionally on Windows when using WSL
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: bash
+
+   sudo apt-get -y install libnss3 libnspr4 libgbm1 libasound2

 As simple as that! You are now ready to scrape gnamgnamgnam 👿👿👿

Lines changed: 63 additions & 0 deletions (new file)

"""
Basic example of a scraping pipeline using SmartScraper with model instances
from the Hugging Face Hub
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info
from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

# required environment variable in .env:
# HUGGINGFACEHUB_API_TOKEN
load_dotenv()

HUGGINGFACEHUB_API_TOKEN = os.getenv('HUGGINGFACEHUB_API_TOKEN')

# ************************************************
# Initialize the model instances
# ************************************************

repo_id = "mistralai/Mistral-7B-Instruct-v0.2"

llm_model_instance = HuggingFaceEndpoint(
    repo_id=repo_id, max_length=128, temperature=0.5, token=HUGGINGFACEHUB_API_TOKEN
)

embedder_model_instance = HuggingFaceInferenceAPIEmbeddings(
    api_key=HUGGINGFACEHUB_API_TOKEN, model_name="sentence-transformers/all-MiniLM-l6-v2"
)

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

graph_config = {
    "llm": {"model_instance": llm_model_instance},
    "embeddings": {"model_instance": embedder_model_instance}
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the events, with the following fields: company_name, event_name, event_start_date, event_start_time, event_end_date, event_end_time, location, event_mode, event_category, third_party_redirect, no_of_days, time_in_hours, hosted_or_attending, refreshments_type, registration_available, registration_link",
    # also accepts a string with the already downloaded HTML code
    source="https://www.hmhco.com/event",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

examples/openai/smart_scraper_openai.py

Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@
         "api_key": openai_key,
         "model": "gpt-3.5-turbo",
     },
-    "verbose":False,
+    "verbose": True,
 }

 # ************************************************

pyproject.toml

Lines changed: 3 additions & 2 deletions

@@ -1,7 +1,7 @@
 [tool.poetry]
 name = "scrapegraphai"
-version = "0.7.0b1"
+version = "0.9.0b2"
 description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines."
 authors = [

@@ -33,7 +33,7 @@ beautifulsoup4 = "4.12.3"
 pandas = "2.0.3"
 python-dotenv = "1.0.1"
 tiktoken = {version = ">=0.5.2,<0.6.0"}
-tqdm = "4.66.1"
+tqdm = "4.66.3"
 graphviz = "0.20.1"
 google = "3.0.0"
 minify-html = "0.15.0"

@@ -42,6 +42,7 @@ langchain-groq = "0.1.3"
 playwright = "^1.43.0"
 langchain-aws = "^0.1.2"
 langchain-anthropic = "^0.1.11"
+yahoo-search-py = "^0.3"

 [tool.poetry.dev-dependencies]
 pytest = "8.0.0"

requirements.txt

Lines changed: 3 additions & 2 deletions

@@ -7,12 +7,13 @@ beautifulsoup4==4.12.3
 pandas==2.0.3
 python-dotenv==1.0.1
 tiktoken>=0.5.2,<0.6.0
-tqdm==4.66.1
+tqdm==4.66.3
 graphviz==0.20.1
 google==3.0.0
 minify-html==0.15.0
 free-proxy==1.1.1
 langchain-groq==0.1.3
 playwright==1.43.0
 langchain-aws==0.1.2
-langchain-anthropic==0.1.11
+langchain-anthropic==0.1.11
+yahoo-search-py==0.3

scrapegraphai/graphs/__init__.py

Lines changed: 1 addition & 0 deletions

@@ -10,3 +10,4 @@
 from .xml_scraper_graph import XMLScraperGraph
 from .json_scraper_graph import JSONScraperGraph
 from .csv_scraper_graph import CSVScraperGraph
+from .pdf_scraper_graph import PDFScraperGraph

scrapegraphai/graphs/abstract_graph.py

Lines changed: 12 additions & 3 deletions

@@ -67,8 +67,15 @@ def _set_model_token(self, llm):
         if 'Azure' in str(type(llm)):
             try:
                 self.model_token = models_tokens["azure"][llm.model_name]
-            except KeyError as exc:
-                raise KeyError("Model not supported") from exc
+            except KeyError:
+                raise KeyError("Model not supported")
+
+        elif 'HuggingFaceEndpoint' in str(type(llm)):
+            if 'mistral' in llm.repo_id:
+                try:
+                    self.model_token = models_tokens['mistral'][llm.repo_id]
+                except KeyError:
+                    raise KeyError("Model not supported")

     def _create_llm(self, llm_config: dict, chat=False) -> object:
         """

@@ -185,7 +192,6 @@ def _create_default_embedder(self) -> object:
         Raises:
             ValueError: If the model is not supported.
         """
-
         if isinstance(self.llm_model, OpenAI):
             return OpenAIEmbeddings(api_key=self.llm_model.openai_api_key)
         elif isinstance(self.llm_model, AzureOpenAIEmbeddings):

@@ -221,6 +227,9 @@ def _create_embedder(self, embedder_config: dict) -> object:
             KeyError: If the model is not supported.
         """

+        if 'model_instance' in embedder_config:
+            return embedder_config['model_instance']
+
         # Instantiate the embedding model based on the model name
         if "openai" in embedder_config["model"]:
             return OpenAIEmbeddings(api_key=embedder_config["api_key"])
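The `model_instance` shortcut above means a pre-built embedder object in the config is returned untouched, before any name-based dispatch runs. A minimal standalone sketch of that branch (the free function and `DummyEmbedder` class are illustrative, not part of the library):

```python
def create_embedder(embedder_config: dict):
    # Mirrors the new branch in AbstractGraph._create_embedder: a ready-made
    # embedder object in the config is returned as-is, skipping name lookup.
    if "model_instance" in embedder_config:
        return embedder_config["model_instance"]
    # The real method continues with name-based dispatch here; unknown
    # models ultimately raise a KeyError.
    raise KeyError("Model not supported")

class DummyEmbedder:
    """Stand-in for e.g. HuggingFaceInferenceAPIEmbeddings."""

instance = DummyEmbedder()
# A pre-built instance short-circuits the dispatch entirely.
print(create_embedder({"model_instance": instance}) is instance)
```

This is the mechanism the new Hugging Face example relies on: both `"llm"` and `"embeddings"` entries of `graph_config` carry a `"model_instance"` key instead of a model name.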
scrapegraphai/graphs/pdf_scraper_graph.py

Lines changed: 118 additions & 0 deletions (new file)

"""
PDFScraperGraph Module
"""

from .base_graph import BaseGraph
from ..nodes import (
    FetchNode,
    ParseNode,
    RAGNode,
    GenerateAnswerNode
)
from .abstract_graph import AbstractGraph


class PDFScraperGraph(AbstractGraph):
    """
    PDFScraperGraph is a scraping pipeline that extracts information from pdf files using a
    natural language model to interpret and answer prompts.

    Attributes:
        prompt (str): The prompt for the graph.
        source (str): The source of the graph.
        config (dict): Configuration parameters for the graph.
        llm_model: An instance of a language model client, configured for generating answers.
        embedder_model: An instance of an embedding model client,
            configured for generating embeddings.
        verbose (bool): A flag indicating whether to show print statements during execution.
        headless (bool): A flag indicating whether to run the graph in headless mode.
        model_token (int): The token limit for the language model.

    Args:
        prompt (str): The prompt for the graph.
        source (str): The source of the graph.
        config (dict): Configuration parameters for the graph.

    Example:
        >>> pdf_scraper = PDFScraperGraph(
        ...     "List me all the attractions in Chioggia.",
        ...     "data/chioggia.pdf",
        ...     {"llm": {"model": "gpt-3.5-turbo"}}
        ... )
        >>> result = pdf_scraper.run()
    """

    def __init__(self, prompt: str, source: str, config: dict):
        super().__init__(prompt, config, source)

        self.input_key = "pdf" if source.endswith("pdf") else "pdf_dir"

    def _create_graph(self) -> BaseGraph:
        """
        Creates the graph of nodes representing the workflow for PDF scraping.

        Returns:
            BaseGraph: A graph instance representing the PDF scraping workflow.
        """

        fetch_node = FetchNode(
            input="pdf_dir",
            output=["doc"],
            node_config={
                "headless": self.headless,
                "verbose": self.verbose
            }
        )
        parse_node = ParseNode(
            input="doc",
            output=["parsed_doc"],
            node_config={
                "chunk_size": self.model_token,
                "verbose": self.verbose
            }
        )
        rag_node = RAGNode(
            input="user_prompt & (parsed_doc | doc)",
            output=["relevant_chunks"],
            node_config={
                "llm": self.llm_model,
                "embedder_model": self.embedder_model,
                "verbose": self.verbose
            }
        )
        generate_answer_node = GenerateAnswerNode(
            input="user_prompt & (relevant_chunks | parsed_doc | doc)",
            output=["answer"],
            node_config={
                "llm": self.llm_model,
                "verbose": self.verbose
            }
        )

        return BaseGraph(
            nodes=[
                fetch_node,
                parse_node,
                rag_node,
                generate_answer_node,
            ],
            edges=[
                (fetch_node, parse_node),
                (parse_node, rag_node),
                (rag_node, generate_answer_node)
            ],
            entry_point=fetch_node
        )

    def run(self) -> str:
        """
        Executes the scraping process and returns the answer to the prompt.

        Returns:
            str: The answer to the prompt.
        """

        inputs = {"user_prompt": self.prompt, self.input_key: self.source}
        self.final_state, self.execution_info = self.graph.execute(inputs)

        return self.final_state.get("answer", "No answer found.")
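The constructor's `input_key` line decides how the source is fed into the graph: a path ending in `pdf` is treated as a single document, anything else as a directory of PDFs. A tiny sketch of that selection rule (the free function is illustrative, not a library API):

```python
def choose_input_key(source: str) -> str:
    # Mirrors PDFScraperGraph.__init__: sources ending in "pdf" are treated
    # as a single document, anything else as a directory of PDFs.
    return "pdf" if source.endswith("pdf") else "pdf_dir"

print(choose_input_key("data/chioggia.pdf"))  # pdf
print(choose_input_key("data/reports"))       # pdf_dir
```

Note the check is on the raw string suffix, so a directory whose name happens to end in `pdf` would also be routed as a single file.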

scrapegraphai/helpers/models_tokens.py

Lines changed: 5 additions & 1 deletion

@@ -35,7 +35,8 @@
         "codellama": 16000,
         "dolphin-mixtral": 32000,
         "mistral-openorca": 32000,
-        "stablelm-zephyr": 8192
+        "stablelm-zephyr": 8192,
+        "nomic-embed-text": 8192
     },
     "groq": {
         "llama3-8b-8192": 8192,

@@ -65,5 +66,8 @@
         "mistral.mistral-large-2402-v1:0": 32768,
         "cohere.embed-english-v3": 512,
         "cohere.embed-multilingual-v3": 512
+    },
+    "mistral": {
+        "mistralai/Mistral-7B-Instruct-v0.2": 32000
     }
 }
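The new `"mistral"` entry is what the `HuggingFaceEndpoint` branch in `_set_model_token` looks up. A self-contained sketch of that lookup, using an illustrative subset of the table (the helper function name is an assumption, not a library API):

```python
# Illustrative subset of the updated models_tokens table.
models_tokens = {
    "ollama": {"nomic-embed-text": 8192},
    "mistral": {"mistralai/Mistral-7B-Instruct-v0.2": 32000},
}

def lookup_token_limit(provider: str, model_name: str) -> int:
    # Mirrors AbstractGraph._set_model_token: unknown models raise KeyError.
    try:
        return models_tokens[provider][model_name]
    except KeyError:
        raise KeyError("Model not supported")

print(lookup_token_limit("mistral", "mistralai/Mistral-7B-Instruct-v0.2"))  # 32000
```

The token limit then drives `ParseNode`'s `chunk_size`, so documents are split to fit the model's context window.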

scrapegraphai/nodes/__init__.py

Lines changed: 1 addition & 0 deletions

@@ -16,3 +16,4 @@
 from .search_link_node import SearchLinkNode
 from .robots_node import RobotsNode
 from .generate_answer_csv_node import GenerateAnswerCSVNode
+from .generate_answer_pdf_node import GenerateAnswerPDFNode
