Commit 74fd530

Merge branch 'pre/beta' into 332-pydantic-schema-validation
2 parents f8b08e0 + ac8e7c1 commit 74fd530

87 files changed, with 2905 additions and 117 deletions

CHANGELOG.md

Lines changed: 38 additions & 0 deletions
@@ -1,3 +1,41 @@
+## [1.6.0-beta.6](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.6.0-beta.5...v1.6.0-beta.6) (2024-06-04)
+
+
+### Features
+
+* refactoring of abstract graph ([fff89f4](https://github.com/VinciGit00/Scrapegraph-ai/commit/fff89f431f60b5caa4dd87643a1bb8895bf96d48))
+
+## [1.6.0-beta.5](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.6.0-beta.4...v1.6.0-beta.5) (2024-06-04)
+
+
+### Features
+
+* refactoring of an in if ([244aada](https://github.com/VinciGit00/Scrapegraph-ai/commit/244aada2de1f3bc88782fa90e604e8b936b79aa4))
+
+## [1.6.0-beta.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.6.0-beta.3...v1.6.0-beta.4) (2024-06-03)
+
+
+### Features
+
+* fix an if ([c8d556d](https://github.com/VinciGit00/Scrapegraph-ai/commit/c8d556da4e4b8730c6c35f1d448270b8e26923f2))
+
+## [1.6.0-beta.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.6.0-beta.2...v1.6.0-beta.3) (2024-06-03)
+
+
+### Features
+
+* removed a bug ([8de720d](https://github.com/VinciGit00/Scrapegraph-ai/commit/8de720d37958e31b73c5c89bc21f474f3303b42b))
+
+## [1.6.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.6.0-beta.1...v1.6.0-beta.2) (2024-06-03)
+
+
+### Features
+
+* add csv scraper and xml scraper multi ([b408655](https://github.com/VinciGit00/Scrapegraph-ai/commit/b4086550cc9dc42b2fd91ee7ef60c6a2c2ac3fd2))
+* add json multiscraper ([5bda918](https://github.com/VinciGit00/Scrapegraph-ai/commit/5bda918a39e4b50d86d784b4c592cc2ea1a68986))
+* add pdf scraper multi graph ([f5cbd80](https://github.com/VinciGit00/Scrapegraph-ai/commit/f5cbd80c977f51233ac1978d8450fcf0ec2ff461))
+* removed rag node ([930f673](https://github.com/VinciGit00/Scrapegraph-ai/commit/930f67374752561903462a25728c739946f9449b))
+
 ## [1.6.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.5-beta.1...v1.6.0-beta.1) (2024-06-02)
 
 

README.md

Lines changed: 11 additions & 11 deletions
@@ -1,6 +1,6 @@
 
 # 🕷️ ScrapeGraphAI: You Only Scrape Once
-[English](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/README.md) | [中国人](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/chinese.md)
+[English](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/README.md) | [中文](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/chinese.md)
 
 [![Downloads](https://static.pepy.tech/badge/scrapegraphai)](https://pepy.tech/project/scrapegraphai)
 [![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen)](https://github.com/pylint-dev/pylint)
@@ -164,6 +164,16 @@ print(result)
 
 The output will be an audio file with the summary of the projects on the page.
 
+## Sponsors
+<div style="text-align: center;">
+<a href="https://serpapi.com?utm_source=scrapegraphai">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
+</a>
+<a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
+</a>
+</div>
+
 ## 🤝 Contributing
 
 Feel free to contribute and join our Discord server to discuss with us improvements and give us suggestions!
@@ -182,16 +192,6 @@ Wanna visualize the roadmap in a more interactive way? Check out the [markmap](h
 ## ❤️ Contributors
 [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
 
-## Sponsors
-<div style="text-align: center;">
-<a href="https://serpapi.com?utm_source=scrapegraphai">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
-</a>
-<a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
-</a>
-</div>
-
 ## 🎓 Citations
 If you have used our library for research purposes please quote us with the following reference:
 ```text

docs/chinese.md

Lines changed: 58 additions & 47 deletions
@@ -21,34 +21,36 @@ Scrapegraph-ai 的参考页面可以在 PyPI 的官方网站上找到: [pypi](ht
 ```bash
 pip install scrapegraphai
 ```
-注意: 建议在虚拟环境中安装该库,以避免与其他库发生冲突 🐱
+**注意**: 建议在虚拟环境中安装该库,以避免与其他库发生冲突 🐱
 
-🔍 演示
+## 🔍 演示
 
 官方 Streamlit 演示:
 
-
+[![My Skills](https://skillicons.dev/icons?i=react)](https://scrapegraph-ai-web-dashboard.streamlit.app)
 
 在 Google Colab 上直接尝试:
 
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)
+
 ## 📖 文档
 
-ScrapeGraphAI 的文档可以在这里找到
+ScrapeGraphAI 的文档可以在[这里](https://scrapegraph-ai.readthedocs.io/en/latest/)找到
 
-还可以查看 Docusaurus 这里
+还可以查看 Docusaurus [版本](https://scrapegraph-doc.onrender.com/)
 
 ## 💻 用法
 
 有三种主要的爬取管道可用于从网站(或本地文件)提取信息:
 
-SmartScraperGraph: 单页爬虫,只需用户提示和输入源;
-SearchGraph: 多页爬虫,从搜索引擎的前 n 个搜索结果中提取信息;
-SpeechGraph: 单页爬虫,从网站提取信息并生成音频文件。
-SmartScraperMultiGraph: 多页爬虫,给定一个提示
-可以通过 API 使用不同的 LLM,如 OpenAIGroqAzure 和 Gemini,或者使用 Ollama 的本地模型。
+- `SmartScraperGraph`: 单页爬虫,只需用户提示和输入源;
+- `SearchGraph`: 多页爬虫,从搜索引擎的前 n 个搜索结果中提取信息;
+- `SpeechGraph`: 单页爬虫,从网站提取信息并生成音频文件。
+- `SmartScraperMultiGraph`: 多页爬虫,给定一个提示
+可以通过 API 使用不同的 LLM,如 **OpenAI****Groq****Azure****Gemini**,或者使用 **Ollama** 的本地模型。
 
-案例 1: 使用本地模型的 SmartScraper
-请确保已安装 Ollama 并使用 ollama pull 命令下载模型。
+### 案例 1: 使用本地模型的 SmartScraper
+请确保已安装 [Ollama](https://ollama.com/) 并使用 `ollama pull` 命令下载模型。
 
 ``` python
 from scrapegraphai.graphs import SmartScraperGraph
@@ -68,23 +70,24 @@ graph_config = {
 }
 
 smart_scraper_graph = SmartScraperGraph(
-    prompt="列出所有项目及其描述",
+    prompt="List me all the projects with their descriptions",
     # 也接受已下载的 HTML 代码的字符串
     source="https://perinim.github.io/projects",
     config=graph_config
 )
 
 result = smart_scraper_graph.run()
 print(result)
-```
+```
 
 输出将是一个包含项目及其描述的列表,如下所示:
 
-python
-Copia codice
-{'projects': [{'title': 'Rotary Pendulum RL', 'description': '开源项目,旨在使用 RL 算法控制现实中的旋转摆'}, {'title': 'DQN Implementation from scratch', 'description': '开发了一个深度 Q 网络算法来训练简单和双摆'}, ...]}
-案例 2: 使用混合模型的 SearchGraph
-我们使用 Groq 作为 LLM,使用 Ollama 作为嵌入模型。
+```python
+{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
+```
+
+### 案例 2: 使用混合模型的 SearchGraph
+我们使用 **Groq** 作为 LLM,使用 **Ollama** 作为嵌入模型。
 
 ```python
 from scrapegraphai.graphs import SearchGraph
@@ -105,7 +108,7 @@ graph_config = {
 
 # 创建 SearchGraph 实例
 search_graph = SearchGraph(
-    prompt="列出所有来自基奥贾的传统食谱",
+    prompt="List me all the traditional recipes from Chioggia",
     config=graph_config
 )
 
@@ -118,9 +121,12 @@ print(result)
 
 ```python
 {'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
-案例 3: 使用 OpenAI 的 SpeechGraph
-您只需传递 OpenAI API 密钥和模型名称。
 ```
+
+### 案例 3: 使用 OpenAI 的 SpeechGraph
+
+您只需传递 OpenAI API 密钥和模型名称。
+
 ```python
 from scrapegraphai.graphs import SpeechGraph
 
@@ -142,7 +148,7 @@ graph_config = {
 # ************************************************
 
 speech_graph = SpeechGraph(
-    prompt="详细总结这些项目并生成音频。",
+    prompt="Make a detailed audio summary of the projects.",
     source="https://perinim.github.io/projects/",
     config=graph_config,
 )
@@ -152,36 +158,38 @@ print(result)
 ```
 输出将是一个包含页面上项目摘要的音频文件。
 
-## 🤝 贡献
+## 赞助商
 
-欢迎贡献并加入我们的 Discord 服务器与我们讨论改进和提出建议!
+<div style="text-align: center;">
+<a href="https://serpapi.com?utm_source=scrapegraphai">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
+</a>
+<a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
+</a>
+</div>
 
-请参阅贡献指南。
+## 🤝 贡献
 
+欢迎贡献并加入我们的 Discord 服务器与我们讨论改进和提出建议!
 
+请参阅[贡献指南](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md)
 
+[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/uJN7TYcpNa)
+[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
+[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)
 
 
-📈 路线图
+## 📈 路线图
 
-查看项目路线图这里! 🚀
+[这里](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/docs/README.md)查看项目路线图! 🚀
 
-想要以更互动的方式可视化路线图?请查看 markmap 通过将 markdown 内容复制粘贴到编辑器中进行可视化!
+想要以更互动的方式可视化路线图?请查看 [markmap](https://markmap.js.org/repl) 通过将 markdown 内容复制粘贴到编辑器中进行可视化!
 
 ## ❤️ 贡献者
+[![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
 
 
-赞助商
-
-<div style="text-align: center;">
-<a href="https://serpapi.com?utm_source=scrapegraphai">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
-</a>
-<a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
-</a>
-</div>
-
 ## 🎓 引用
 
 如果您将我们的库用于研究目的,请引用以下参考文献:
@@ -199,16 +207,19 @@ print(result)
 <p align="center">
 <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/logo_authors.png" alt="Authors_logos">
 </p>
+
 ## 联系方式
+| | Contact Info |
+|--------------------|----------------------|
+| Marco Vinciguerra | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/marco-vinciguerra-7ba365242/) |
+| Marco Perini | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/perinim/) |
+| Lorenzo Padoan | [![Linkedin Badge](https://img.shields.io/badge/-Linkedin-blue?style=flat&logo=Linkedin&logoColor=white)](https://www.linkedin.com/in/lorenzo-padoan-4521a2154/) |
 
-Marco Vinciguerra
-Marco Perini
-Lorenzo Padoan
 ## 📜 许可证
 
-ScrapeGraphAI 采用 MIT 许可证。更多信息请查看 LICENSE 文件。
+ScrapeGraphAI 采用 MIT 许可证。更多信息请查看 [LICENSE](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/LICENSE) 文件。
 
-鸣谢
+## 鸣谢
 
-我们要感谢所有项目贡献者和开源社区的支持。
-ScrapeGraphAI 仅用于数据探索和研究目的。我们不对任何滥用该库的行为负责。
+- 我们要感谢所有项目贡献者和开源社区的支持。
+- ScrapeGraphAI 仅用于数据探索和研究目的。我们不对任何滥用该库的行为负责。
Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+"""
+Basic example of scraping pipeline using CSVScraperMultiGraph from CSV documents
+"""
+
+import os
+from dotenv import load_dotenv
+import pandas as pd
+from scrapegraphai.graphs import CSVScraperMultiGraph
+from scrapegraphai.utils import convert_to_csv, convert_to_json, prettify_exec_info
+
+load_dotenv()
+# ************************************************
+# Read the CSV file
+# ************************************************
+
+FILE_NAME = "inputs/username.csv"
+curr_dir = os.path.dirname(os.path.realpath(__file__))
+file_path = os.path.join(curr_dir, FILE_NAME)
+
+text = pd.read_csv(file_path)
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+graph_config = {
+    "llm": {
+        "api_key": os.getenv("ANTHROPIC_API_KEY"),
+        "model": "claude-3-haiku-20240307",
+        "max_tokens": 4000},
+}
+
+# ************************************************
+# Create the CSVScraperMultiGraph instance and run it
+# ************************************************
+
+csv_scraper_graph = CSVScraperMultiGraph(
+    prompt="List me all the last names",
+    source=[str(text), str(text)],
+    config=graph_config
+)
+
+result = csv_scraper_graph.run()
+print(result)
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = csv_scraper_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))
+
+# Save to json or csv
+convert_to_csv(result, "result")
+convert_to_json(result, "result")
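
Note on the `source` argument in the file above: `str(text)` turns the DataFrame into its display representation, which pandas abbreviates for long frames, so some rows may never reach the model. A minimal sketch of an alternative, assuming the graph accepts raw CSV content as plain strings (the example's use of `str(text)` suggests it takes strings, but that is not confirmed here): keep the file's text as-is and sanity-check it with the standard `csv` module. `check_csv_text` and the sample data are hypothetical and not part of scrapegraphai.

```python
import csv
import io


def check_csv_text(text):
    """Parse CSV text and return (header, data_rows).

    Hypothetical helper for illustration; raises ValueError on empty input.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("CSV text is empty")
    return rows[0], rows[1:]


# Hypothetical sample standing in for the contents of inputs/username.csv.
sample = "first_name,last_name\nAda,Lovelace\nAlan,Turing\n"
header, data = check_csv_text(sample)
print(header, len(data))
```

In the example itself this would amount to reading `file_path` with `open(..., encoding="utf-8")` and passing the resulting strings in `source`, untruncated.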
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+"""
+Module for showing how JSONScraperMultiGraph multi works
+"""
+import os
+import json
+from dotenv import load_dotenv
+from scrapegraphai.graphs import JSONScraperMultiGraph
+
+load_dotenv()
+
+graph_config = {
+    "llm": {
+        "api_key": os.getenv("ANTHROPIC_API_KEY"),
+        "model": "claude-3-haiku-20240307",
+        "max_tokens": 4000
+    },
+}
+
+FILE_NAME = "inputs/example.json"
+curr_dir = os.path.dirname(os.path.realpath(__file__))
+file_path = os.path.join(curr_dir, FILE_NAME)
+
+with open(file_path, 'r', encoding="utf-8") as file:
+    text = file.read()
+
+sources = [text, text]
+
+multiple_search_graph = JSONScraperMultiGraph(
+    prompt= "List me all the authors, title and genres of the books",
+    source= sources,
+    schema=None,
+    config=graph_config
+)
+
+result = multiple_search_graph.run()
+print(json.dumps(result, indent=4))
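
Note on `sources` in the file above: the example passes the same document twice (`sources = [text, text]`), which is fine for a demo. With distinct inputs it may help to confirm that every entry parses before the graph runs. A minimal sketch using only the standard library; `as_json_sources` is a hypothetical helper, not a scrapegraphai function.

```python
import json


def as_json_sources(documents):
    """Return the documents unchanged after checking that each is valid JSON.

    Hypothetical helper for illustration; raises ValueError naming the
    position of the first document that fails to parse.
    """
    sources = []
    for index, text in enumerate(documents):
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            raise ValueError(f"document {index} is not valid JSON: {exc}") from exc
        sources.append(text)
    return sources


# Hypothetical inline documents standing in for files under inputs/.
checked = as_json_sources(['{"title": "Dune"}', '{"title": "Emma"}'])
print(len(checked))
```

The returned list can be passed as `source` in place of `sources`; the strings themselves are not modified.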

examples/anthropic/pdf_scraper_graph_haiku.py

Lines changed: 3 additions & 1 deletion
@@ -1,10 +1,12 @@
+"""
+Module for showing how PDFScraper multi works
+"""
 import os, json
 from dotenv import load_dotenv
 from scrapegraphai.graphs import PDFScraperGraph
 
 load_dotenv()
 
-
 # ************************************************
 # Define the configuration for the graph
 # ************************************************
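
Note on the configuration blocks in this commit's examples: they read the key with `os.getenv("ANTHROPIC_API_KEY")`, which returns `None` when the variable is missing, so a misconfigured `.env` may only surface later, when the client is built or the first request is made. A small sketch of failing early instead; `require_env` and the variable name in the demonstration are hypothetical, not part of scrapegraphai or python-dotenv.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; treats an empty value as missing.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# Demonstration with a throwaway variable name and value.
os.environ["EXAMPLE_API_KEY"] = "sk-demo"
print(require_env("EXAMPLE_API_KEY"))
```

In an example script the call would sit after `load_dotenv()`, with the result used as the `"api_key"` value in `graph_config`.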
