Commit bc9e700

Merge branch 'main' into pre/beta
2 parents 51051e9 + 4c72385 commit bc9e700

12 files changed

+97
-28
lines changed

CHANGELOG.md

Lines changed: 57 additions & 0 deletions
@@ -1,3 +1,60 @@
+## [1.12.0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.11.3...v1.12.0) (2024-08-06)
+
+
+### Features
+
+* add generate_answer node parallelization ([0c4b290](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/0c4b2908d98efbb2b0a6faf68618a801d726bb5f))
+* add integration in the abstract graph ([5ecdbe7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/5ecdbe715f4bb223fa1be834fda07ccea2a51cb9))
+* fix tests ([1db164e](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/1db164e9e682eefbc1414989a043fefa2e9009c2))
+* integration of firebase ([4caed54](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/4caed545e5030460b2d5e46f9ad90546ce36f0ee))
+* update models_tokens.py ([377d679](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/377d679eecd62611c0c9a3cba8202c6f0719ed31))
+* refactoring of the code ([9355507](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/9355507a2dc73342f325b6649e871df48ae13567))
+
+
+### Bug Fixes
+
+* abstract_graph and removed unused embeddings ([0b4cfd6](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/0b4cfd6522dcad0eb418f0badd0f7824a1efd534))
+* add llama 3.1 ([f336c95](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/f336c95c2d1833d1f829d61ae7fa415ac2caf250))
+* fixed bug on fetch_node ([968c69e](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/968c69e217d9c180b9b8c2aa52ca59b9a1733525))
+* **AbstractGraph:** instantiation of Azure GPT models ([ade28fc](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/ade28fca2c3fdf40f28a80854e3b8435a52a6930)), closes [#498](https://github.com/ScrapeGraphAI/Scrapegraph-ai/issues/498)
+* pyproject.toml ([e90fad4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/e90fad44ce53e34a73270619255cc392eed81a06))
+* rebuild pyproject, requirements and lockfiles ([1193984](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/1193984434dea0ad70ff6b975ac778d56d2e1688))
+
+
+### chore
+
+* rebuild requirements ([2edad66](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/2edad66788cbd92f197e3b37db13c44bfa39e36a))
+* remove unused import ([88710f1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/88710f1a7c7d50f57108456112da30d1a12a1ba1))
+* set dependency version for vertexai ([971cc2d](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/971cc2da04e331ebca1f93048c78bc58b452d30a))
+* update pyproject, rebuild lockfiles ([d6312bf](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/d6312bfc9b2d68370727645b1ce5010ff7a626c0))
+
+
+### Refactor
+
+* **Ollama:** integrate new LangChain chat init ([d177afb](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/d177afb68be036465ede1f567d2562b145d77d36))
+* **OpenAI:** integrate new LangChain chat init ([5e3eb6e](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/5e3eb6e43df4bd4c452d34b49f254235e9ff1b22))
+* move embeddings code from AbstractGraph to RAGNode ([a94ebcd](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/a94ebcde0078d66d33e67f7e0a87850efb92d408))
+* remove LangChain wrappers ([2c5f934](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/2c5f934f101e319ec4e61009d4c464ca4626c1ff))
+* remove LangChain wrappers for Ollama ([25066b2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/25066b2bc51517e50058231664230b8edef365b9))
+* remove redundant LangChain wrappers ([9275486](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/927548624034b3c30eca60011d216720102d1815))
+* remove redundant wrappers for Ernie and Nvidia ([bc2c996](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/bc2c9967d2f13ade6eeb7b23e9b423f6e79aa890))
+* reuse code for common interface models ([bb73d91](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/bb73d916a1a7b378438038ec928eeda6d8f6ac9d))
+
+
+### CI
+
+* **release:** 1.11.0-beta.1 [skip ci] ([7080a0a](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/7080a0afd527a34ada33ee2d3ace8e724d879df7))
+* **release:** 1.11.0-beta.10 [skip ci] ([ee30a83](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/ee30a83f8a77958be6881ca0a94b02d278f37a61)), closes [#498](https://github.com/ScrapeGraphAI/Scrapegraph-ai/issues/498)
+* **release:** 1.11.0-beta.2 [skip ci] ([bf6d487](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/bf6d487bbb26187b32f5985433b54025f6437af5))
+* **release:** 1.11.0-beta.3 [skip ci] ([66f9421](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/66f9421fc216f0984d5a393101d1c109b08eaa33))
+* **release:** 1.11.0-beta.4 [skip ci] ([51db43a](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/51db43a129ef05c050b6de017598a664119594ba))
+* **release:** 1.11.0-beta.5 [skip ci] ([b15fd9f](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/b15fd9f4dc3643c9904a2cbaa5f392a6805c9762))
+* **release:** 1.11.0-beta.6 [skip ci] ([74ed8d0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/74ed8d06c5db4f9734521c2f84f4379b18b7308f))
+* **release:** 1.11.0-beta.7 [skip ci] ([55f706f](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/55f706f3d5f4a8afe9dd8fc9ce9bd527f8a11894))
+* **release:** 1.11.0-beta.8 [skip ci] ([3e07f62](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3e07f6273fae667b2f663be1cdd5e9c068f4c59f))
+* **release:** 1.11.0-beta.9 [skip ci] ([4440790](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/4440790f00c1ddd416add7af895756ab42c30bf3))
+
+
 ## [1.11.0-beta.12](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.11.0-beta.11...v1.11.0-beta.12) (2024-08-06)
 

docs/chinese.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ ScrapeGraphAI 是一个*网络爬虫* Python 库,使用大型语言模型和
 只需告诉库您想提取哪些信息,它将为您完成!
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/sgai-hero.png" alt="ScrapeGraphAI Hero" style="width: 100%;">
 </p>
 
 ## 🚀 快速安装

docs/japanese.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ ScrapeGraphAIは、大規模言語モデルと直接グラフロジックを使
 クロールしたい情報をライブラリに伝えるだけで、残りはすべてライブラリが行います!
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/sgai-hero.png" alt="ScrapeGraphAI Hero" style="width: 100%;">
 </p>
 
 ## 🚀 インストール方法

docs/korean.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ ScrapeGraphAI는 웹 사이트와 로컬 문서(XML, HTML, JSON 등)에 대한
 추출하려는 정보를 말하기만 하면 라이브러리가 알아서 처리해 줍니다!
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/sgai-hero.png" alt="ScrapeGraphAI Hero" style="width: 100%;">
 </p>
 
 ## 🚀 빠른 설치

examples/local_models/smart_scraper_ollama.py

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
 
 graph_config = {
     "llm": {
-        "model": "ollama/llama3",
+        "model": "ollama/llama3.1",
         "temperature": 0,
         "format": "json",  # Ollama needs the format to be specified explicitly
         # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily

examples/openai/md_scraper_openai.py

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 """
-Basic example of scraping pipeline using MDScraperGraph from XML documents
+Basic example of scraping pipeline using MDScraperGraph from MD documents
 """
 
 import os
@@ -9,7 +9,7 @@
 load_dotenv()
 
 # ************************************************
-# Read the XML file
+# Read the MD file
 # ************************************************
 
 FILE_NAME = "inputs/markdown_example.md"

pyproject.toml

Lines changed: 7 additions & 3 deletions
@@ -1,9 +1,11 @@
 [project]
 name = "scrapegraphai"
 
-version = "1.11.0b12"
+version = "1.12.0"
+
 
 description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines."
+
 authors = [
     { name = "Marco Vinciguerra", email = "[email protected]" },
     { name = "Marco Perini", email = "[email protected]" },
@@ -12,8 +14,10 @@ authors = [
 
 dependencies = [
     "langchain>=0.2.10",
+    "langchain-fireworks>=0.1.3",
+    "langchain_community>=0.2.9",
     "langchain-google-genai>=1.0.7",
-    "langchain-google-vertexai",
+    "langchain-google-vertexai>=1.0.7",
     "langchain-openai>=0.1.17",
     "langchain-groq>=0.1.3",
     "langchain-aws>=0.1.3",
@@ -36,7 +40,7 @@ dependencies = [
     "langchain-fireworks>=0.1.3",
     "langchain-community>=0.2.9",
     "langchain-huggingface>=0.0.3",
-    "browserbase==0.3.0"
+    "browserbase>=0.3.0"
 ]
 
 license = "MIT"
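
The second hunk adds `langchain-fireworks` and `langchain_community` near the top of the dependency list, while the third hunk still shows `langchain-fireworks>=0.1.3` and `langchain-community>=0.2.9` as unchanged context further down. Under PEP 503 name normalization the underscore and hyphen spellings refer to the same project, so both entries appear twice. A minimal sketch for spotting such repeats, assuming the requirement strings are copied by hand from the hunks above; `requirement_name` and `find_duplicate_requirements` are hypothetical helpers, not part of the repository:

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    """PEP 503 normalization: runs of '-', '_' and '.' collapse to one '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement: str) -> str:
    """Return the project name that precedes the first version operator."""
    return re.split(r"[<>=!~;\[\s]", requirement.strip(), maxsplit=1)[0]


def find_duplicate_requirements(requirements):
    """Map each normalized name that occurs more than once to its entries."""
    seen = defaultdict(list)
    for req in requirements:
        seen[normalize_name(requirement_name(req))].append(req)
    return {name: reqs for name, reqs in seen.items() if len(reqs) > 1}


# Entries taken from the hunks above (the list is abbreviated).
deps = [
    "langchain>=0.2.10",
    "langchain-fireworks>=0.1.3",
    "langchain_community>=0.2.9",
    "langchain-google-vertexai>=1.0.7",
    "langchain-fireworks>=0.1.3",
    "langchain-community>=0.2.9",
    "browserbase>=0.3.0",
]

duplicates = find_duplicate_requirements(deps)
for name, reqs in sorted(duplicates.items()):
    print(name, reqs)
```

The duplicated entries carry identical constraints here, so they should be harmless to a resolver, but they are easy to let drift apart in later edits.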

requirements-dev.lock

Lines changed: 7 additions & 7 deletions
@@ -218,7 +218,7 @@ httpx==0.27.0
     # via openai
 httpx-sse==0.4.0
     # via fireworks-ai
-huggingface-hub==0.24.0
+huggingface-hub==0.24.1
     # via langchain-huggingface
     # via sentence-transformers
     # via tokenizers
@@ -231,7 +231,7 @@ idna==3.7
     # via yarl
 imagesize==1.4.1
     # via sphinx
-importlib-metadata==8.0.0
+importlib-metadata==8.1.0
     # via sphinx
 importlib-resources==6.4.0
     # via matplotlib
@@ -263,16 +263,16 @@ jsonschema-specifications==2023.12.1
     # via jsonschema
 kiwisolver==1.4.5
     # via matplotlib
-langchain==0.2.10
+langchain==0.2.11
     # via langchain-community
     # via scrapegraphai
 langchain-anthropic==0.1.20
     # via scrapegraphai
 langchain-aws==0.1.12
     # via scrapegraphai
-langchain-community==0.2.9
+langchain-community==0.2.10
     # via scrapegraphai
-langchain-core==0.2.22
+langchain-core==0.2.23
     # via langchain
     # via langchain-anthropic
     # via langchain-aws
@@ -295,7 +295,7 @@ langchain-groq==0.1.6
     # via scrapegraphai
 langchain-huggingface==0.0.3
     # via scrapegraphai
-langchain-nvidia-ai-endpoints==0.1.6
+langchain-nvidia-ai-endpoints==0.1.7
     # via scrapegraphai
 langchain-openai==0.1.17
     # via scrapegraphai
@@ -386,7 +386,7 @@ pillow==10.4.0
     # via streamlit
 platformdirs==4.2.2
     # via pylint
-playwright==1.45.0
+playwright==1.45.1
     # via browserbase
     # via scrapegraphai
     # via undetected-playwright

requirements.lock

Lines changed: 6 additions & 6 deletions
@@ -162,7 +162,7 @@ httpx==0.27.0
     # via openai
 httpx-sse==0.4.0
     # via fireworks-ai
-huggingface-hub==0.24.0
+huggingface-hub==0.24.1
     # via langchain-huggingface
     # via sentence-transformers
     # via tokenizers
@@ -185,16 +185,16 @@ jsonpatch==1.33
     # via langchain-core
 jsonpointer==3.0.0
     # via jsonpatch
-langchain==0.2.10
+langchain==0.2.11
     # via langchain-community
     # via scrapegraphai
 langchain-anthropic==0.1.20
     # via scrapegraphai
 langchain-aws==0.1.12
     # via scrapegraphai
-langchain-community==0.2.9
+langchain-community==0.2.10
     # via scrapegraphai
-langchain-core==0.2.22
+langchain-core==0.2.23
     # via langchain
     # via langchain-anthropic
     # via langchain-aws
@@ -217,7 +217,7 @@ langchain-groq==0.1.6
     # via scrapegraphai
 langchain-huggingface==0.0.3
     # via scrapegraphai
-langchain-nvidia-ai-endpoints==0.1.6
+langchain-nvidia-ai-endpoints==0.1.7
     # via scrapegraphai
 langchain-openai==0.1.17
     # via scrapegraphai
@@ -278,7 +278,7 @@ pillow==10.4.0
     # via fireworks-ai
     # via langchain-nvidia-ai-endpoints
     # via sentence-transformers
-playwright==1.45.0
+playwright==1.45.1
     # via browserbase
     # via scrapegraphai
     # via undetected-playwright

requirements.txt

Lines changed: 4 additions & 2 deletions
@@ -1,6 +1,8 @@
 langchain>=0.2.10
+langchain-fireworks>=0.1.3
+langchain_community>=0.2.9
 langchain-google-genai>=1.0.7
-langchain-google-vertexai
+langchain-google-vertexai>=1.0.7
 langchain-openai>=0.1.17
 langchain-groq>=0.1.3
 langchain-aws>=0.1.3
@@ -23,4 +25,4 @@ semchunk>=1.0.1
 langchain-fireworks>=0.1.3
 langchain-community>=0.2.9
 langchain-huggingface>=0.0.3
-browserbase==0.3.0
+browserbase>=0.3.0

scrapegraphai/nodes/fetch_node.py

Lines changed: 5 additions & 0 deletions
@@ -168,7 +168,10 @@ def execute(self, state):
             parsed_content = source
 
             if isinstance(self.llm_model, ChatOpenAI) and not self.script_creator or self.force and not self.script_creator:
+
                 parsed_content = convert_to_md(source)
+            else:
+                parsed_content = source
 
             compressed_document = [
                 Document(page_content=parsed_content, metadata={"source": "local_dir"})
@@ -187,6 +190,7 @@ def execute(self, state):
                 if (isinstance(self.llm_model, ChatOpenAI)
                     and not self.script_creator) or (self.force and not self.script_creator):
                     parsed_content = convert_to_md(source, input_data[0])
+
                 compressed_document = [Document(page_content=parsed_content)]
             else:
                 self.logger.warning(
@@ -217,6 +221,7 @@ def execute(self, state):
             if isinstance(self.llm_model, ChatOpenAI) and not self.script_creator or self.force and not self.script_creator and not self.openai_md_enabled:
                 parsed_content = convert_to_md(document[0].page_content, input_data[0])
 
+
             compressed_document = [
                 Document(page_content=parsed_content, metadata={"source": "html file"})
             ]
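
The three conditions in these hunks are spelled differently, so it helps to check how Python groups them. `not` binds tighter than `and`, and `and` binds tighter than `or`; the unparenthesized condition in the first hunk is therefore equivalent to the parenthesized one in the second, while the extra `and not self.openai_md_enabled` in the third hunk attaches only to the `self.force` branch. A small sketch that verifies this over all inputs, assuming plain booleans stand in for the instance attributes (`is_openai` stands for the `isinstance(self.llm_model, ChatOpenAI)` check; the function names are hypothetical):

```python
from itertools import product


def cond_hunk1(is_openai, script_creator, force):
    # Unparenthesized form from the first hunk.
    return is_openai and not script_creator or force and not script_creator


def cond_hunk2(is_openai, script_creator, force):
    # Explicitly parenthesized form from the second hunk.
    return (is_openai and not script_creator) or (force and not script_creator)


def cond_hunk3(is_openai, script_creator, force, openai_md_enabled):
    # Form from the third hunk, with the trailing openai_md_enabled check.
    return (is_openai and not script_creator
            or force and not script_creator and not openai_md_enabled)


all_inputs = list(product([False, True], repeat=3))

# The first two forms agree on every combination of inputs.
forms_agree = all(cond_hunk1(*v) == cond_hunk2(*v) for v in all_inputs)

# Inputs (is_openai, script_creator, force) where openai_md_enabled
# changes the outcome of the third form.
affected = [v for v in all_inputs
            if cond_hunk3(*v, False) != cond_hunk3(*v, True)]

print(forms_agree)
print(affected)
```

In this model `openai_md_enabled` only matters when the model is not a `ChatOpenAI` instance, `script_creator` is false, and `force` is true. Whether that grouping is the intended one is not something the diff itself states.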

scrapegraphai/utils/convert_to_md.py

Lines changed: 5 additions & 4 deletions
@@ -20,11 +20,12 @@ def convert_to_md(html: str, url: str = None) -> str:
 
     Note: All the styles and links are ignored during the conversion. """
 
-    if url:
-        parsed_url = urlparse(url)
-        domain = f"{parsed_url.scheme}://{parsed_url.netloc}"
     h = html2text.HTML2Text()
     h.ignore_links = False
-    h.baseurl = domain
     h.body_width = 0
+    if url is not None:
+        parsed_url = urlparse(url)
+        domain = f"{parsed_url.scheme}://{parsed_url.netloc}"
+        h.baseurl = domain
+
     return h.handle(html)
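
In the removed lines, `domain` was assigned only inside `if url:` but read unconditionally by `h.baseurl = domain`, so a call without a URL (as in `convert_to_md(source)` in `fetch_node.py`) would fail before any conversion happened. The sketch below isolates just that control flow using the standard library; `base_url_before` and `base_url_after` are hypothetical stand-ins that omit `html2text` entirely and return the value that would have been assigned to `h.baseurl`:

```python
from typing import Optional
from urllib.parse import urlparse


def base_url_before(url: Optional[str] = None) -> str:
    """Control flow of the removed lines: domain is read even when unset."""
    if url:
        parsed_url = urlparse(url)
        domain = f"{parsed_url.scheme}://{parsed_url.netloc}"
    return domain  # unbound local when url is None or empty


def base_url_after(url: Optional[str] = None) -> Optional[str]:
    """Control flow of the added lines: the base URL is set only with a URL."""
    baseurl = None  # stand-in for leaving h.baseurl at its default
    if url is not None:
        parsed_url = urlparse(url)
        baseurl = f"{parsed_url.scheme}://{parsed_url.netloc}"
    return baseurl


try:
    base_url_before()
    old_flow_failed = False
except UnboundLocalError:
    old_flow_failed = True

print(old_flow_failed)
print(base_url_after())
print(base_url_after("https://example.com/docs/page.html?x=1"))
```

One behavioral difference worth noting: the old guard `if url:` skipped empty strings, whereas `if url is not None:` lets an empty string through, which in this sketch yields `"://"` as the base URL.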
