Commit 2a602a1

Merge branch 'pre/beta' into jamie-beck-patch-1
2 parents 35b994a + 869bbd7 commit 2a602a1

32 files changed, with 316 additions and 206 deletions.

CHANGELOG.md

Lines changed: 12 additions & 1 deletion
@@ -1,15 +1,26 @@
 ## [1.14.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.14.0...v1.14.1) (2024-08-24)


+
+### Bug Fixes
+
+
+* update abstract graph ([86fe5fc](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/86fe5fcaf1a6ba28786678874378f07fba1db40f))
+
+## [1.15.0-beta.2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.15.0-beta.1...v1.15.0-beta.2) (2024-08-23)
+
+
 ### Bug Fixes

-* add claude3.5 sonnet ([ee8f8b3](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/ee8f8b31ecfe4ffd311528d2f48cb055e4609d99))
+* abstract graph ([cf1fada](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/cf1fada36a6716cb0e24bbc5da7509446a964145))


 ### Docs

 * added sponsors ([b3a2d0d](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/b3a2d0d65a41f6e645fac3fc84f702fdf64b951c))

+
+#
 ## [1.14.0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.13.3...v1.14.0) (2024-08-20)


README.md

Lines changed: 22 additions & 0 deletions
@@ -32,6 +32,28 @@ playwright install

 **Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱

+To use the optional modules, install them yourself with the following commands:
+
+### Installing "Other Language Models"
+
+This group allows you to use additional language models like Fireworks, Groq, Anthropic, Hugging Face, and Nvidia AI Endpoints.
+```bash
+pip install scrapegraphai[other-language-models]
+
+```
+### Installing "More Semantic Options"
+
+This group includes tools for advanced semantic processing, such as Graphviz.
+```bash
+pip install scrapegraphai[more-semantic-options]
+```
+### Installing "More Browser Options"
+
+This group includes additional browser management options, such as BrowserBase.
+```bash
+pip install scrapegraphai[more-browser-options]
+```
+
 ## 💻 Usage
 There are multiple standard scraping pipelines that can be used to extract information from a website (or local file).
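
The optional groups added above only help if the corresponding packages are importable at run time. A minimal sketch of a pre-flight check follows. The import names in `EXTRA_MODULES` and the `missing_modules` helper are assumptions for illustration and are not part of ScrapeGraphAI.

```python
from importlib.util import find_spec

# Assumed import names for the packages in each extra; verify them
# against the packages you actually installed.
EXTRA_MODULES = {
    "other-language-models": [
        "langchain_fireworks",
        "langchain_groq",
        "langchain_anthropic",
        "langchain_huggingface",
        "langchain_nvidia_ai_endpoints",
    ],
    "more-semantic-options": ["graphviz"],
    "more-browser-options": ["browserbase"],
}


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    for extra, modules in EXTRA_MODULES.items():
        missing = missing_modules(modules)
        if missing:
            print(f"{extra}: missing {', '.join(missing)}")
            # Quoting keeps shells such as zsh from globbing the brackets.
            print(f'  try: pip install "scrapegraphai[{extra}]"')
        else:
            print(f"{extra}: ok")
```

Checking with `find_spec` avoids importing the packages, so the check stays cheap even when the optional dependencies are heavy.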

docs/README.md

Lines changed: 0 additions & 11 deletions
@@ -9,12 +9,6 @@ markmap:

 ## **Short-Term Goals**

-- Integration with more llm APIs
-
-- Test proxy rotation implementation
-
-- Add more search engines inside the SearchInternetNode
-
 - Improve the documentation (ReadTheDocs)
 - [Issue #102](https://github.com/VinciGit00/Scrapegraph-ai/issues/102)

@@ -23,9 +17,6 @@ markmap:
 ## **Medium-Term Goals**

 - Node for handling API requests
-
-- Improve SearchGraph to look into the first 5 results of the search engine
-
 - Make scraping more deterministic
 - Create DOM tree of the website
 - HTML tag text embeddings with tags metadata
@@ -70,5 +61,3 @@ markmap:
 - Automatic generation of scraping pipelines from a given prompt

 - Create API for the library
-
-- Finetune a LLM for html content

docs/source/scrapers/llm.rst

Lines changed: 32 additions & 0 deletions
@@ -194,3 +194,35 @@ We can also pass a model instance for the chat model and the embedding model. Fo
             "model_instance": embedder_model_instance
         }
     }
+
+Other LLM models
+^^^^^^^^^^^^^^^^
+
+We can also pass a model instance for the chat model and the embedding model through the **model_instance** parameter.
+This feature lets you use any LangChain model instance.
+You can find the model you need in the following lists:
+
+- `chat model list <https://python.langchain.com/v0.2/docs/integrations/chat/#all-chat-models>`_
+- `embedding model list <https://python.langchain.com/v0.2/docs/integrations/text_embedding/#all-embedding-models>`_.
+
+For instance, consider the **chat model** Moonshot. We can integrate it as follows:
+
+.. code-block:: python
+
+    from langchain_community.chat_models.moonshot import MoonshotChat
+
+    # The configuration parameters depend on the specific model you select
+    llm_instance_config = {
+        "model": "moonshot-v1-8k",
+        "base_url": "https://api.moonshot.cn/v1",
+        "moonshot_api_key": "MOONSHOT_API_KEY",
+    }
+
+    llm_model_instance = MoonshotChat(**llm_instance_config)
+    graph_config = {
+        "llm": {
+            "model_instance": llm_model_instance,
+            "model_tokens": 5000
+        },
+    }
+
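
The documentation snippet above passes the literal string `"MOONSHOT_API_KEY"` as the key. A minimal sketch of reading the key from the environment and failing early when it is absent follows. `require_env` and `build_graph_config` are hypothetical helpers introduced here for illustration, and the model instance is whatever object you have already constructed.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


def build_graph_config(model_instance, model_tokens):
    """Build a graph configuration dict in the shape shown in the docs above."""
    if model_tokens <= 0:
        raise ValueError("model_tokens must be positive")
    return {
        "llm": {
            "model_instance": model_instance,
            "model_tokens": model_tokens,
        },
    }
```

The value returned by `require_env` would replace the string literal in `llm_instance_config`, and the resulting `llm_model_instance` would be passed to `build_graph_config` together with a token limit suited to the chosen model.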

examples/model_instance/.env.example

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+MOONLIGHT_API_KEY="YOUR MOONLIGHT API KEY"
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+"""
+Basic example of scraping pipeline using SmartScraper and model_instance
+"""
+
+import os, json
+from scrapegraphai.graphs import SmartScraperGraph
+from scrapegraphai.utils import prettify_exec_info
+from langchain_community.chat_models.moonshot import MoonshotChat
+from dotenv import load_dotenv
+load_dotenv()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+
+llm_instance_config = {
+    "model": "moonshot-v1-8k",
+    "base_url": "https://api.moonshot.cn/v1",
+    "moonshot_api_key": os.getenv("MOONLIGHT_API_KEY"),
+}
+
+
+llm_model_instance = MoonshotChat(**llm_instance_config)
+
+graph_config = {
+    "llm": {
+        "model_instance": llm_model_instance,
+        "model_tokens": 10000
+    },
+    "verbose": True,
+    "headless": True,
+}
+
+# ************************************************
+# Create the SmartScraperGraph instance and run it
+# ************************************************
+
+smart_scraper_graph = SmartScraperGraph(
+    prompt="List me what does the company do, the name and a contact email.",
+    source="https://scrapegraphai.com/",
+    config=graph_config
+)
+
+result = smart_scraper_graph.run()
+print(json.dumps(result, indent=4))
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = smart_scraper_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))

examples/moonshot/.env.example

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+MOONLIGHT_API_KEY="YOUR MOONLIGHT API KEY"

examples/moonshot/readme.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+This folder offers an example of how to use ScrapeGraph-AI with Moonshot and SmartScraperGraph. For more usage examples, refer to the OpenAI examples.
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+"""
+Basic example of scraping pipeline using SmartScraper and model_instance
+"""
+
+import os, json
+from scrapegraphai.graphs import SmartScraperGraph
+from scrapegraphai.utils import prettify_exec_info
+from langchain_community.chat_models.moonshot import MoonshotChat
+from dotenv import load_dotenv
+load_dotenv()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+
+llm_instance_config = {
+    "model": "moonshot-v1-8k",
+    "base_url": "https://api.moonshot.cn/v1",
+    "moonshot_api_key": os.getenv("MOONLIGHT_API_KEY"),
+}
+
+
+llm_model_instance = MoonshotChat(**llm_instance_config)
+
+graph_config = {
+    "llm": {
+        "model_instance": llm_model_instance,
+        "model_tokens": 10000
+    },
+    "verbose": True,
+    "headless": True,
+}
+
+# ************************************************
+# Create the SmartScraperGraph instance and run it
+# ************************************************
+
+smart_scraper_graph = SmartScraperGraph(
+    prompt="List me what does the company do, the name and a contact email.",
+    source="https://scrapegraphai.com/",
+    config=graph_config
+)
+
+result = smart_scraper_graph.run()
+print(json.dumps(result, indent=4))
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = smart_scraper_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))

pyproject.toml

Lines changed: 22 additions & 13 deletions
@@ -4,9 +4,7 @@ name = "scrapegraphai"

 version = "1.14.1"

-
 description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines."
-
 authors = [
     { name = "Marco Vinciguerra", email = "[email protected]" },
     { name = "Marco Perini", email = "[email protected]" },
@@ -15,32 +13,24 @@ authors = [

 dependencies = [
     "langchain>=0.2.14",
-    "langchain-fireworks>=0.1.3",
-    "langchain_community>=0.2.9",
     "langchain-google-genai>=1.0.7",
-    "langchain-google-vertexai>=1.0.7",
     "langchain-openai>=0.1.22",
-    "langchain-groq>=0.1.3",
-    "langchain-aws>=0.1.3",
-    "langchain-anthropic>=0.1.11",
     "langchain-mistralai>=0.1.12",
-    "langchain-huggingface>=0.0.3",
-    "langchain-nvidia-ai-endpoints>=0.1.6",
+    "langchain_community>=0.2.9",
+    "langchain-aws>=0.1.3",
     "html2text>=2024.2.26",
     "faiss-cpu>=1.8.0",
     "beautifulsoup4>=4.12.3",
     "pandas>=2.2.2",
     "python-dotenv>=1.0.1",
     "tiktoken>=0.7",
     "tqdm>=4.66.4",
-    "graphviz>=0.20.3",
     "minify-html>=0.15.0",
     "free-proxy>=1.1.1",
     "playwright>=1.43.0",
-    "google>=3.0.0",
     "undetected-playwright>=0.3.0",
+    "google>=3.0.0",
     "semchunk>=1.0.1",
-    "browserbase>=0.3.0",
 ]

 license = "MIT"
@@ -79,6 +69,25 @@ requires-python = ">=3.9,<4.0"
 burr = ["burr[start]==0.22.1"]
 docs = ["sphinx==6.0", "furo==2024.5.6"]

+# Group 1: Other Language Models
+other-language-models = [
+    "langchain-fireworks>=0.1.3",
+    "langchain-groq>=0.1.3",
+    "langchain-anthropic>=0.1.11",
+    "langchain-huggingface>=0.0.3",
+    "langchain-nvidia-ai-endpoints>=0.1.6",
+]
+
+# Group 2: More Semantic Options
+more-semantic-options = [
+    "graphviz>=0.20.3",
+]
+
+# Group 3: More Browser Options
+more-browser-options = [
+    "browserbase>=0.3.0",
+]
+
 [build-system]
 requires = ["hatchling"]
 build-backend = "hatchling.build"
