Commit 886bab7

Merge branch 'main' into temp-1
2 parents 66a29bc + 385084d commit 886bab7

12 files changed

+98
-31
lines changed

CHANGELOG.md

Lines changed: 49 additions & 1 deletion
@@ -25,10 +25,15 @@
 * fixed bug on fetch_node ([968c69e](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/968c69e217d9c180b9b8c2aa52ca59b9a1733525))
 
 ## [1.11.0-beta.7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.11.0-beta.6...v1.11.0-beta.7) (2024-08-01)
+## [1.10.0-beta.7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.0-beta.6...v1.10.0-beta.7) (2024-07-23)
+
+## [1.11.3](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.11.2...v1.11.3) (2024-07-25)
+
 
 
 ### Bug Fixes
 
+
 * abstract_graph and removed unused embeddings ([0b4cfd6](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/0b4cfd6522dcad0eb418f0badd0f7824a1efd534))
 
 
@@ -79,6 +84,16 @@
 * rebuild requirements ([2edad66](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/2edad66788cbd92f197e3b37db13c44bfa39e36a))
 
 ## [1.11.0-beta.3](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.11.0-beta.2...v1.11.0-beta.3) (2024-07-25)
+=======
+* add llama 3.1 ([f872bdd](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/f872bdd24f9874660eea04f9ade570c96b6e7e93))
+
+
+### Docs
+
+* prev version ([5c08eea](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/5c08eea189d7ede6f29399a67d897aa3b3f6a7b0))
+
+
+## [1.11.2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.11.1...v1.11.2) (2024-07-23)
 
 
 ### Bug Fixes
@@ -93,6 +108,16 @@
 * pdate models_tokens.py ([377d679](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/377d679eecd62611c0c9a3cba8202c6f0719ed31))
 
 ## [1.11.0-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.4...v1.11.0-beta.1) (2024-07-23)
+* md conversion ([1d41f6e](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/1d41f6eafe8ed0e191bb6a258d54c6388ff283c6))
+
+## [1.11.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.11.0...v1.11.1) (2024-07-23)
+
+
+### Bug Fixes
+
+* md conversion ([5a45e9f](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/5a45e9f2d86a1c58b8ea321e3df9718bc00f9c28))
+
+## [1.11.0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.4...v1.11.0) (2024-07-23)
 
 
 ### Features
@@ -181,8 +206,11 @@
 
 
 
+
 ### Features
 
+* add nvidia connection ([fc0dadb](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/fc0dadb8f812dfd636dec856921a971b58695ce3))
+
 
 * add new toml ([fcb3220](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/fcb3220868e7ef1127a7a47f40d0379be282e6eb))
 
@@ -201,6 +229,24 @@
 
 ### chore
 
+* **dependecies:** add script to auto-update requirements ([3289c7b](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3289c7bf5ec58ac3d04e9e5e8e654af9abcee228))
+* **ci:** set up workflow for requirements auto-update ([295fc28](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/295fc28ceb02c78198f7fbe678352503b3259b6b))
+* update requirements.txt ([c7bac98](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/c7bac98d2e79e5ab98fa65d7efa858a2cdda1622))
+
+## [1.10.0-beta.6](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.0-beta.5...v1.10.0-beta.6) (2024-07-22)
+
+
+### Features
+
+* add new toml ([fcb3220](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/fcb3220868e7ef1127a7a47f40d0379be282e6eb))
+* add gpt4o omni ([431edb7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/431edb7bb2504f4c1335c3ae3ce2f91867fa7222))
+* add searchngx integration ([5c92186](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/5c9218608140bf694fbfd96aa90276bc438bb475))
+* refactoring_to_md function ([602dd00](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/602dd00209ee1d72a1223fc4793759450921fcf9))
+
+
+
+
+### chore
 
 * **pyproject:** upgrade dependencies ([0425124](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/0425124c570f765b98fcf67ba6649f4f9fe76b15))
 * correct search engine name ([7ba2f6a](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/7ba2f6ae0b9d2e9336e973e1f57ab8355c739e27))
@@ -209,11 +255,13 @@
 * upgrade tiktoken ([7314bc3](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/7314bc383068db590662bf7e512f799529308991))
 
 
+
 ### Docs
 
 * **gpt-4o-mini:** added new gpt, fixed chromium lazy loading, ([99dc849](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/99dc8497d85289759286a973e4aecc3f924d3ada))
 
 
+
 ### CI
 
 * **release:** 1.10.0-beta.1 [skip ci] ([8f619de](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/8f619de23540216934b53bcf3426702e56c48f31))
@@ -554,7 +602,7 @@
 * **release:** 1.6.1 [skip ci] ([44fbd71](https://github.com/VinciGit00/Scrapegraph-ai/commit/44fbd71742a57a4b10f22ed33781bb67aa77e58d))
 
 ## [1.6.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.6.0...v1.6.1) (2024-06-15)
-=======
+
 
 
 ### Bug Fixes

docs/chinese.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ ScrapeGraphAI 是一个*网络爬虫* Python 库,使用大型语言模型和
 只需告诉库您想提取哪些信息,它将为您完成!
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/sgai-hero.png" alt="ScrapeGraphAI Hero" style="width: 100%;">
 </p>
 
 ## 🚀 快速安装

docs/japanese.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ ScrapeGraphAIは、大規模言語モデルと直接グラフロジックを使
 クロールしたい情報をライブラリに伝えるだけで、残りはすべてライブラリが行います!
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/sgai-hero.png" alt="ScrapeGraphAI Hero" style="width: 100%;">
 </p>
 
 ## 🚀 インストール方法

docs/korean.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ ScrapeGraphAI는 웹 사이트와 로컬 문서(XML, HTML, JSON 등)에 대한
 추출하려는 정보를 말하기만 하면 라이브러리가 알아서 처리해 줍니다!
 
 <p align="center">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/sgai-hero.png" alt="ScrapeGraphAI Hero" style="width: 100%;">
 </p>
 
 ## 🚀 빠른 설치

examples/local_models/smart_scraper_ollama.py

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
 
 graph_config = {
     "llm": {
-        "model": "ollama/llama3",
+        "model": "ollama/llama3.1",
         "temperature": 0,
         "format": "json",  # Ollama needs the format to be specified explicitly
         # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily

examples/openai/md_scraper_openai.py

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 """
-Basic example of scraping pipeline using MDScraperGraph from XML documents
+Basic example of scraping pipeline using MDScraperGraph from MD documents
 """
 
 import os
@@ -9,7 +9,7 @@
 load_dotenv()
 
 # ************************************************
-# Read the XML file
+# Read the MD file
 # ************************************************
 
 FILE_NAME = "inputs/markdown_example.md"

pyproject.toml

Lines changed: 6 additions & 2 deletions
@@ -4,16 +4,20 @@ name = "scrapegraphai"
 version = "1.11.0b10"
 
 description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines."
+
 authors = [
     { name = "Marco Vinciguerra", email = "[email protected]" },
     { name = "Marco Perini", email = "[email protected]" },
     { name = "Lorenzo Padoan", email = "[email protected]" }
 ]
 
-dependencies = [
     "langchain>=0.2.10",
+
+    "langchain-fireworks>=0.1.3",
+    "langchain_community>=0.2.9",
+
     "langchain-google-genai>=1.0.7",
-    "langchain-google-vertexai",
+    "langchain-google-vertexai>=1.0.7",
     "langchain-openai>=0.1.17",
     "langchain-groq>=0.1.3",
     "langchain-aws>=0.1.3",

requirements-dev.lock

Lines changed: 10 additions & 6 deletions
@@ -221,6 +221,8 @@ httpx-sse==0.4.0
 huggingface-hub==0.24.0
     # via langchain-huggingface
     # via sentence-transformers
+huggingface-hub==0.24.1
+
     # via tokenizers
     # via transformers
 idna==3.7
@@ -231,7 +233,7 @@ idna==3.7
     # via yarl
 imagesize==1.4.1
     # via sphinx
-importlib-metadata==8.0.0
+importlib-metadata==8.1.0
     # via sphinx
 importlib-resources==6.4.0
     # via matplotlib
@@ -263,16 +265,16 @@ jsonschema-specifications==2023.12.1
     # via jsonschema
 kiwisolver==1.4.5
     # via matplotlib
-langchain==0.2.10
+langchain==0.2.11
     # via langchain-community
     # via scrapegraphai
 langchain-anthropic==0.1.20
     # via scrapegraphai
 langchain-aws==0.1.12
     # via scrapegraphai
-langchain-community==0.2.9
+langchain-community==0.2.10
     # via scrapegraphai
-langchain-core==0.2.22
+langchain-core==0.2.23
     # via langchain
     # via langchain-anthropic
     # via langchain-aws
@@ -296,6 +298,8 @@ langchain-groq==0.1.6
 langchain-huggingface==0.0.3
     # via scrapegraphai
 langchain-nvidia-ai-endpoints==0.1.6
+
+langchain-nvidia-ai-endpoints==0.1.7
     # via scrapegraphai
 langchain-openai==0.1.17
     # via scrapegraphai
@@ -386,8 +390,8 @@ pillow==10.4.0
     # via streamlit
 platformdirs==4.2.2
     # via pylint
-playwright==1.45.0
-    # via browserbase
+
+playwright==1.45.1
     # via scrapegraphai
     # via undetected-playwright
 pluggy==1.5.0

requirements.lock

Lines changed: 10 additions & 12 deletions
@@ -162,9 +162,7 @@ httpx==0.27.0
     # via openai
 httpx-sse==0.4.0
     # via fireworks-ai
-huggingface-hub==0.24.0
-    # via langchain-huggingface
-    # via sentence-transformers
+huggingface-hub==0.24.1
     # via tokenizers
     # via transformers
 idna==3.7
@@ -185,16 +183,16 @@ jsonpatch==1.33
     # via langchain-core
 jsonpointer==3.0.0
     # via jsonpatch
-langchain==0.2.10
+langchain==0.2.11
     # via langchain-community
     # via scrapegraphai
 langchain-anthropic==0.1.20
     # via scrapegraphai
 langchain-aws==0.1.12
     # via scrapegraphai
-langchain-community==0.2.9
+langchain-community==0.2.10
     # via scrapegraphai
-langchain-core==0.2.22
+langchain-core==0.2.23
     # via langchain
     # via langchain-anthropic
     # via langchain-aws
@@ -217,7 +215,8 @@ langchain-groq==0.1.6
     # via scrapegraphai
 langchain-huggingface==0.0.3
     # via scrapegraphai
-langchain-nvidia-ai-endpoints==0.1.6
+langchain-nvidia-ai-endpoints==0.1.7
+
     # via scrapegraphai
 langchain-openai==0.1.17
     # via scrapegraphai
@@ -271,15 +270,14 @@ packaging==24.1
     # via huggingface-hub
     # via langchain-core
     # via marshmallow
-    # via transformers
+
 pandas==2.2.2
     # via scrapegraphai
 pillow==10.4.0
     # via fireworks-ai
     # via langchain-nvidia-ai-endpoints
-    # via sentence-transformers
-playwright==1.45.0
-    # via browserbase
+
+playwright==1.45.1
     # via scrapegraphai
     # via undetected-playwright
 proto-plus==1.24.0
@@ -420,7 +418,7 @@ typing-extensions==4.12.2
     # via pydantic-core
     # via pyee
     # via sqlalchemy
-    # via torch
+
     # via typing-inspect
 typing-inspect==0.9.0
     # via dataclasses-json

requirements.txt

Lines changed: 7 additions & 0 deletions
@@ -1,6 +1,12 @@
 langchain>=0.2.10
 langchain-google-genai>=1.0.7
 langchain-google-vertexai
+
+langchain-fireworks>=0.1.3
+langchain_community>=0.2.9
+langchain-google-genai>=1.0.7
+langchain-google-vertexai>=1.0.7
+
 langchain-openai>=0.1.17
 langchain-groq>=0.1.3
 langchain-aws>=0.1.3
@@ -24,3 +30,4 @@ langchain-fireworks>=0.1.3
 langchain-community>=0.2.9
 langchain-huggingface>=0.0.3
 browserbase==0.3.0
+
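
Review note: after this merge, several packages appear more than once in requirements.txt, including `langchain_community` and `langchain-community`, which refer to the same project once names are normalized. The following is a minimal sketch of how such repeats could be detected. The sample text contains only the lines visible in the two hunks above (the file's other lines are not shown here), and `normalize` and `find_duplicates` are hypothetical helper names, not part of the repository.

```python
import re
from collections import defaultdict

# Only the requirement lines visible in the two hunks of this diff.
REQUIREMENTS = """\
langchain>=0.2.10
langchain-google-genai>=1.0.7
langchain-google-vertexai

langchain-fireworks>=0.1.3
langchain_community>=0.2.9
langchain-google-genai>=1.0.7
langchain-google-vertexai>=1.0.7

langchain-openai>=0.1.17
langchain-groq>=0.1.3
langchain-aws>=0.1.3
langchain-fireworks>=0.1.3
langchain-community>=0.2.9
langchain-huggingface>=0.0.3
browserbase==0.3.0
"""

# Leading project name, up to the first version specifier or other delimiter.
NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_duplicates(text: str) -> dict[str, list[str]]:
    """Map each normalized name listed more than once to its requirement lines."""
    seen = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = NAME_RE.match(line)
        if match:
            seen[normalize(match.group(0))].append(line)
    return {name: lines for name, lines in seen.items() if len(lines) > 1}


duplicates = find_duplicates(REQUIREMENTS)
for name, lines in sorted(duplicates.items()):
    print(name, "->", lines)
```

On the visible lines this reports four repeated names, two of which carry different specifiers for the same package (`langchain-google-vertexai` with and without `>=1.0.7`).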

scrapegraphai/nodes/fetch_node.py

Lines changed: 5 additions & 0 deletions
@@ -168,7 +168,10 @@ def execute(self, state):
             parsed_content = source
 
             if isinstance(self.llm_model, ChatOpenAI) and not self.script_creator or self.force and not self.script_creator:
+
                 parsed_content = convert_to_md(source)
+            else:
+                parsed_content = source
 
             compressed_document = [
                 Document(page_content=parsed_content, metadata={"source": "local_dir"})
@@ -187,6 +190,7 @@ def execute(self, state):
                 if (isinstance(self.llm_model, ChatOpenAI)
                     and not self.script_creator) or (self.force and not self.script_creator):
                     parsed_content = convert_to_md(source, input_data[0])
+
                 compressed_document = [Document(page_content=parsed_content)]
             else:
                 self.logger.warning(
@@ -217,6 +221,7 @@ def execute(self, state):
             if isinstance(self.llm_model, ChatOpenAI) and not self.script_creator or self.force and not self.script_creator and not self.openai_md_enabled:
                 parsed_content = convert_to_md(document[0].page_content, input_data[0])
 
+
             compressed_document = [
                 Document(page_content=parsed_content, metadata={"source": "html file"})
             ]
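
Review note: two of the three conditions in this file mix `and` and `or` without parentheses. Since `and` binds tighter than `or` in Python, the last condition groups as `(A and not S) or (F and not S and not O)`, so `openai_md_enabled` only gates the `force` branch. The sketch below checks that grouping with plain booleans. The parameter names are stand-ins for the attributes in the diff (`is_openai` replaces the `isinstance(self.llm_model, ChatOpenAI)` test), and both function names are hypothetical.

```python
from itertools import product


def as_written(is_openai, script_creator, force, openai_md_enabled):
    # Same shape as the unparenthesized condition in the last hunk.
    return (
        is_openai and not script_creator
        or force and not script_creator and not openai_md_enabled
    )


def explicit(is_openai, script_creator, force, openai_md_enabled):
    # The grouping Python applies, written out with parentheses.
    return (is_openai and not script_creator) or (
        force and not script_creator and not openai_md_enabled
    )


# Compare both forms on all 16 combinations of the four flags.
for combo in product([False, True], repeat=4):
    assert as_written(*combo) == explicit(*combo)

# With a ChatOpenAI model the flag has no effect on the result.
print(as_written(True, False, False, True))  # True
```

If the intent is for `openai_md_enabled` to suppress conversion in both cases, the condition would need explicit parentheses; the diff alone does not say which behavior is intended.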

scrapegraphai/utils/convert_to_md.py

Lines changed: 5 additions & 4 deletions
@@ -20,11 +20,12 @@ def convert_to_md(html: str, url: str = None) -> str:
 
     Note: All the styles and links are ignored during the conversion. """
 
-    if url:
-        parsed_url = urlparse(url)
-        domain = f"{parsed_url.scheme}://{parsed_url.netloc}"
     h = html2text.HTML2Text()
     h.ignore_links = False
-    h.baseurl = domain
     h.body_width = 0
+    if url is not None:
+        parsed_url = urlparse(url)
+        domain = f"{parsed_url.scheme}://{parsed_url.netloc}"
+        h.baseurl = domain
+
     return h.handle(html)
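
Review note: in the removed lines, `domain` is assigned only inside `if url:`, while `h.baseurl = domain` runs regardless, so a call without a URL (as in `convert_to_md(source)` in fetch_node.py) would fail on the unbound local. The sketch below reproduces that control flow using only the standard library. It assumes the new `h.baseurl = domain` line sits inside the `if url is not None:` block (indentation is not preserved in this view), and `base_url_before` and `base_url_after` are hypothetical names that return the value instead of setting it on an `html2text` object.

```python
from typing import Optional
from urllib.parse import urlparse


def base_url_before(url: Optional[str] = None) -> str:
    """Mirror of the removed flow: `domain` is bound only when url is truthy."""
    if url:
        parsed_url = urlparse(url)
        domain = f"{parsed_url.scheme}://{parsed_url.netloc}"
    return domain  # raises UnboundLocalError when url is None or ""


def base_url_after(url: Optional[str] = None) -> Optional[str]:
    """Mirror of the added flow: the base URL is derived only when url is given."""
    if url is not None:
        parsed_url = urlparse(url)
        return f"{parsed_url.scheme}://{parsed_url.netloc}"
    return None


print(base_url_after("https://example.com/docs/page?x=1"))  # https://example.com
print(base_url_after())  # None

try:
    base_url_before()
except UnboundLocalError as exc:
    print("old flow fails without a URL:", type(exc).__name__)
```

One difference worth noting: the old guard `if url:` also skipped empty strings, whereas `if url is not None:` lets an empty string through and yields the base `"://"`.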
