Commit f3f7dfe

Merge pull request #438 from ScrapeGraphAI/pre/beta
Pre/beta
2 parents: b1fcfd4 + 5cb5fbf

83 files changed: +3093 −205 lines changed


.github/workflows/pylint.yml

Lines changed: 11 additions & 15 deletions
@@ -1,30 +1,26 @@
-on: [push]
+on:
+  push:
+    paths:
+      - 'scrapegraphai/**'
+      - '.github/workflows/pylint.yml'
 
 jobs:
   build:
     runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        python-version: ["3.10"]
     steps:
       - uses: actions/checkout@v3
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v3
-        with:
-          python-version: ${{ matrix.python-version }}
+      - name: Install the latest version of rye
+        uses: eifinger/setup-rye@v3
       - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install pylint
-          pip install -r requirements.txt
+        run: rye sync --no-lock
       - name: Analysing the code with pylint
-        run: pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraphai/**/*.py scrapegraphai/*.py
+        run: rye run pylint-ci
       - name: Check Pylint score
         run: |
-          pylint_score=$(pylint --disable=all --enable=metrics --output-format=text scrapegraphai/**/*.py scrapegraphai/*.py | grep 'Raw metrics' | awk '{print $4}')
+          pylint_score=$(rye run pylint-score-ci | grep 'Raw metrics' | awk '{print $4}')
           if (( $(echo "$pylint_score < 8" | bc -l) )); then
             echo "Pylint score is below 8. Blocking commit."
             exit 1
           else
             echo "Pylint score is acceptable."
-          fi
+          fi

CHANGELOG.md

Lines changed: 38 additions & 0 deletions
@@ -1,3 +1,41 @@
+## [1.9.0-beta.2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.9.0-beta.1...v1.9.0-beta.2) (2024-07-05)
+
+
+### Bug Fixes
+
+* fix pyproject.toml ([7570bf8](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/7570bf8294e49bc54ec9e296aaadb763873390ca))
+
+## [1.9.0-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.8.1-beta.1...v1.9.0-beta.1) (2024-07-04)
+
+
+### Features
+
+* add fireworks integration ([df0e310](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/df0e3108299071b849d7e055bd11d72764d24f08))
+* add integration for infos ([3bf5f57](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3bf5f570a8f8e1b037a7ad3c9f583261a1536421))
+* add integrations for markdown files ([2804434](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/2804434a9ee12c52ae8956a88b1778a4dd3ec32f))
+* add vertexai integration ([119514b](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/119514bdfc2a16dfb8918b0c34ae7cc43a01384c))
+* improve md prompt recognition ([5fe694b](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/5fe694b6b4545a5091d16110318b992acfca4f58))
+
+
+### chore
+
+* **Docker:** fix port number ([afeb81f](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/afeb81f77a884799192d79dcac85666190fb1c9d))
+* **CI:** fix pylint workflow ([583c321](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/583c32106e827f50235d8fc69511652fd4b07a35))
+* **rye:** rebuild lockfiles ([27c2dd2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/27c2dd23517a7e4b14fafd00320a8b81f73145dc))
+
+## [1.8.1-beta.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.8.0...v1.8.1-beta.1) (2024-07-04)
+
+
+### Bug Fixes
+
+* add test ([3a537ee](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3a537eec6fef1743924a9aa5cef0ba2f8d44bf11))
+
+
+### Docs
+
+* **roadmap:** fix urls ([14faba4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/14faba4f00dd9f947f8dc5e0b51be49ea684179f))
+* **roadmap:** next steps ([3e644f4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3e644f498f05eb505fbd4e94b144c81567569aaa))
+
 ## [1.8.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.7.5...v1.8.0) (2024-06-30)
 
 
README.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](https://opensource.org/licenses/MIT)
 [![](https://dcbadge.vercel.app/api/server/gkxQDAjfeX)](https://discord.gg/gkxQDAjfeX)
 
-ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, etc.).
+ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.).
 
 Just say which information you want to extract and the library will do it for you!
 
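The added Markdown mention lines up with the "add integrations for markdown files" changelog entry above. As a rough illustration (not part of this commit), a local Markdown file can be handled the same way the text-based example later in this commit handles .txt files: read the file and pass its contents as the source. The file path, prompt, and model below are placeholders.

# Illustrative sketch only (not part of this commit): scraping a local Markdown
# document by passing its raw text as the source. File path and prompt are
# placeholders; OPENAI_APIKEY is assumed to be set in the environment.
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

load_dotenv()

with open("inputs/notes.md", "r", encoding="utf-8") as f:
    markdown_text = f.read()

graph_config = {
    "llm": {
        "api_key": os.getenv("OPENAI_APIKEY"),
        "model": "gpt-4o",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the section titles with a one-line summary.",
    source=markdown_text,
    config=graph_config,
)

print(smart_scraper_graph.run())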

docker-compose.yml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ services:
     image: ollama/ollama
     container_name: ollama
     ports:
-      - "5000:5000"
+      - "11434:11434"
     volumes:
       - ollama_volume:/root/.ollama
     restart: unless-stopped
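11434 is Ollama's default API port, so the published mapping now matches what the library's Ollama-backed configurations expect. A minimal sketch (not part of this commit; assumes the compose stack above is running on the same machine) of a graph_config pointing at it, modelled on the Ollama examples later in this commit:

# Illustrative sketch only: a graph_config that talks to the Ollama container
# published above. Assumes the compose stack is running locally; values follow
# the examples/extras configurations in this commit.
graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "format": "json",                      # Ollama needs the format specified explicitly
        "base_url": "http://localhost:11434",  # matches the 11434:11434 port mapping above
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
}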
Lines changed: 19 additions & 18 deletions
@@ -1,16 +1,17 @@
 # Local models
+# Local models
 The two websites benchmark are:
 - Example 1: https://perinim.github.io/projects
 - Example 2: https://www.wired.com (at 17/4/2024)
 
 Both are strored locally as txt file in .txt format because in this way we do not have to think about the internet connection
 
-| Hardware           | Model                                   | Example 1 | Example 2 |
-| ------------------ | --------------------------------------- | --------- | --------- |
-| Macbook 14' m1 pro | Mistral on Ollama with nomic-embed-text | 11.60s    | 26.61s    |
-| Macbook m2 max     | Mistral on Ollama with nomic-embed-text | 8.05s     | 12.17s    |
-| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text  | 29.87s    | 35.32s    |
-| Macbook m2 max     | Llama3 on Ollama with nomic-embed-text  | 18.36s    | 78.32s    |
+| Hardware               | Model                                   | Example 1 | Example 2 |
+| ---------------------- | --------------------------------------- | --------- | --------- |
+| Macbook 14' m1 pro     | Mistral on Ollama with nomic-embed-text | 16.291s   | 38.74s    |
+| Macbook m2 max         | Mistral on Ollama with nomic-embed-text |           |           |
+| Macbook 14' m1 pro<br> | Llama3 on Ollama with nomic-embed-text  | 12.88s    | 13.84s    |
+| Macbook m2 max<br>     | Llama3 on Ollama with nomic-embed-text  |           |           |
 
 **Note**: the examples on Docker are not runned on other devices than the Macbook because the performance are to slow (10 times slower than Ollama). Indeed the results are the following:
 
@@ -22,20 +23,20 @@ Both are strored locally as txt file in .txt format because in this way we do n
 **URL**: https://perinim.github.io/projects
 **Task**: List me all the projects with their description.
 
-| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo               | 25.22                    | 445          | 272           | 173               | 1                   | 0.000754       |
-| gpt-4-turbo-preview         | 9.53                     | 449          | 272           | 177               | 1                   | 0.00803        |
-| Grooq with nomic-embed-text | 1.99                     | 474          | 284           | 190               | 1                   | 0              |
+| Name                            | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo                   | 4.132s                   | 438          | 303           | 135               | 1                   | 0.000724       |
+| gpt-4-turbo-preview             | 6.965s                   | 442          | 303           | 139               | 1                   | 0.0072         |
+| gpt-4-o                         | 4.446s                   | 444          | 305           | 139               | 1                   | 0              |
+| Grooq with nomic-embed-text<br> | 1.335s                   | 648          | 482           | 166               | 1                   | 0              |
 
 ### Example 2: Wired
 **URL**: https://www.wired.com
 **Task**: List me all the articles with their description.
 
-| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo               | 25.89                    | 445          | 272           | 173               | 1                   | 0.000754       |
-| gpt-4-turbo-preview         | 64.70                    | 3573         | 2199          | 1374              | 1                   | 0.06321        |
-| Grooq with nomic-embed-text | 3.82                     | 2459         | 2192          | 267               | 1                   | 0              |
-
-
+| Name                            | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo                   | 8.836s                   | 1167         | 726           | 441               | 1                   | 0.001971       |
+| gpt-4-turbo-preview             | 21.53s                   | 1205         | 726           | 479               | 1                   | 0.02163        |
+| gpt-4-o                         | 15.27s                   | 1400         | 715           | 685               | 1                   | 0              |
+| Grooq with nomic-embed-text<br> | 3.82s                    | 2459         | 2192          | 267               | 1                   | 0              |
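The local-model rows above were produced with Ollama-served Mistral and Llama3, but no Mistral configuration appears elsewhere in this commit. The following is an illustrative sketch only: the "ollama/mistral" model string, base_url, and format are assumptions modelled on the Llama3 examples below, and the table's timing, token, and cost columns come from get_execution_info(), as in the benchmark script that follows.

# Illustrative sketch only (not part of this commit): an Ollama-served Mistral
# configuration in the style of the Llama3 examples below. Model string,
# base_url, and format are assumptions.
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",                      # Ollama needs the format specified explicitly
        "base_url": "http://localhost:11434",  # default Ollama port, see docker-compose.yml above
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",
    },
}

with open("inputs/example_1.txt", "r", encoding="utf-8") as f:
    text = f.read()

graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    source=text,
    config=graph_config,
)
print(graph.run())
# Execution time, token counts, and cost (the table columns) are reported here
print(prettify_exec_info(graph.get_execution_info()))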
Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
"""
Basic example of scraping pipeline using SmartScraper from text
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info
load_dotenv()

# ************************************************
# Read the text file
# ************************************************
files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
         "List me all the articles with their description."]


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
    with open(files[i], 'r', encoding="utf-8") as file:
        text = file.read()

    smart_scraper_graph = SmartScraperGraph(
        prompt=tasks[i],
        source=text,
        config=graph_config
    )

    result = smart_scraper_graph.run()
    print(result)
    # ************************************************
    # Get graph execution info
    # ************************************************

    graph_exec_info = smart_scraper_graph.get_execution_info()
    print(prettify_exec_info(graph_exec_info))

examples/extras/custom_prompt.py

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
"""
Basic example of scraping pipeline using SmartScraper
"""
import os
import json
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

prompt = "Some more info"

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
    "additional_info": prompt,
    "verbose": True,
    "headless": False,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

examples/extras/example.yml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
{
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "format": "json",
        # "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        # "base_url": "http://localhost:11434",
    },
    "verbose": true,
    "headless": false
}

examples/extras/force_mode.py

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
"""
Basic example of scraping pipeline using SmartScraper
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        # "format": "json",  # Ollama needs the format to be specified explicitly
        # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "force": True,
    "caching": True
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

examples/extras/load_yml.py

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
"""
Basic example of scraping pipeline using SmartScraper
"""
import yaml
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# ************************************************
# Define the configuration for the graph
# ************************************************
with open("example.yml", 'r') as file:
    graph_config = yaml.safe_load(file)

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the titles",
    source="https://sport.sky.it/nba?gr=www",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
