
Commit fd6142e: Merge pull request #436 from ScrapeGraphAI/support ("Support")

2 parents 8f9f96f + 104d869

File tree: 65 files changed, +2795 / −154 lines


.github/workflows/pylint.yml (11 additions, 15 deletions)

```diff
@@ -1,30 +1,26 @@
-on: [push]
+on:
+  push:
+    paths:
+      - 'scrapegraphai/**'
+      - '.github/workflows/pylint.yml'
 
 jobs:
   build:
     runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        python-version: ["3.10"]
     steps:
     - uses: actions/checkout@v3
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v3
-      with:
-        python-version: ${{ matrix.python-version }}
+    - name: Install the latest version of rye
+      uses: eifinger/setup-rye@v3
     - name: Install dependencies
-      run: |
-        python -m pip install --upgrade pip
-        pip install pylint
-        pip install -r requirements.txt
+      run: rye sync --no-lock
     - name: Analysing the code with pylint
-      run: pylint --disable=C0114,C0115,C0116 --exit-zero scrapegraphai/**/*.py scrapegraphai/*.py
+      run: rye run pylint-ci
     - name: Check Pylint score
       run: |
-        pylint_score=$(pylint --disable=all --enable=metrics --output-format=text scrapegraphai/**/*.py scrapegraphai/*.py | grep 'Raw metrics' | awk '{print $4}')
+        pylint_score=$(rye run pylint-score-ci | grep 'Raw metrics' | awk '{print $4}')
         if (( $(echo "$pylint_score < 8" | bc -l) )); then
           echo "Pylint score is below 8. Blocking commit."
           exit 1
         else
           echo "Pylint score is acceptable."
-        fi
+        fi
```

docker-compose.yml (1 addition, 1 deletion)

```diff
@@ -4,7 +4,7 @@ services:
     image: ollama/ollama
     container_name: ollama
     ports:
-      - "5000:5000"
+      - "11434:11434"
     volumes:
      - ollama_volume:/root/.ollama
     restart: unless-stopped
```
Lines changed: 19 additions & 18 deletions

```diff
@@ -1,16 +1,17 @@
 # Local models
+# Local models
 The two websites benchmark are:
 - Example 1: https://perinim.github.io/projects
 - Example 2: https://www.wired.com (at 17/4/2024)
 
 Both are strored locally as txt file in .txt format because in this way we do not have to think about the internet connection
 
-| Hardware           | Model                                   | Example 1 | Example 2 |
-| ------------------ | --------------------------------------- | --------- | --------- |
-| Macbook 14' m1 pro | Mistral on Ollama with nomic-embed-text | 11.60s    | 26.61s    |
-| Macbook m2 max     | Mistral on Ollama with nomic-embed-text | 8.05s     | 12.17s    |
-| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text  | 29.87s    | 35.32s    |
-| Macbook m2 max     | Llama3 on Ollama with nomic-embed-text  | 18.36s    | 78.32s    |
+| Hardware               | Model                                   | Example 1 | Example 2 |
+| ---------------------- | --------------------------------------- | --------- | --------- |
+| Macbook 14' m1 pro     | Mistral on Ollama with nomic-embed-text | 16.291s   | 38.74s    |
+| Macbook m2 max         | Mistral on Ollama with nomic-embed-text |           |           |
+| Macbook 14' m1 pro<br> | Llama3 on Ollama with nomic-embed-text  | 12.88s    | 13.84s    |
+| Macbook m2 max<br>     | Llama3 on Ollama with nomic-embed-text  |           |           |
 
 **Note**: the examples on Docker are not runned on other devices than the Macbook because the performance are to slow (10 times slower than Ollama). Indeed the results are the following:
 
@@ -22,20 +23,20 @@ Both are strored locally as txt file in .txt format because in this way we do n
 **URL**: https://perinim.github.io/projects
 **Task**: List me all the projects with their description.
 
-| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo               | 25.22                    | 445          | 272           | 173               | 1                   | 0.000754       |
-| gpt-4-turbo-preview         | 9.53                     | 449          | 272           | 177               | 1                   | 0.00803        |
-| Grooq with nomic-embed-text | 1.99                     | 474          | 284           | 190               | 1                   | 0              |
+| Name                            | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo                   | 4.132s                   | 438          | 303           | 135               | 1                   | 0.000724       |
+| gpt-4-turbo-preview             | 6.965s                   | 442          | 303           | 139               | 1                   | 0.0072         |
+| gpt-4-o                         | 4.446s                   | 444          | 305           | 139               | 1                   | 0              |
+| Grooq with nomic-embed-text<br> | 1.335s                   | 648          | 482           | 166               | 1                   | 0              |
 
 ### Example 2: Wired
 **URL**: https://www.wired.com
 **Task**: List me all the articles with their description.
 
-| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo               | 25.89                    | 445          | 272           | 173               | 1                   | 0.000754       |
-| gpt-4-turbo-preview         | 64.70                    | 3573         | 2199          | 1374              | 1                   | 0.06321        |
-| Grooq with nomic-embed-text | 3.82                     | 2459         | 2192          | 267               | 1                   | 0              |
-
-
+| Name                            | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| ------------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo                   | 8.836s                   | 1167         | 726           | 441               | 1                   | 0.001971       |
+| gpt-4-turbo-preview             | 21.53s                   | 1205         | 726           | 479               | 1                   | 0.02163        |
+| gpt-4-o                         | 15.27s                   | 1400         | 715           | 685               | 1                   | 0              |
+| Grooq with nomic-embed-text<br> | 3.82s                    | 2459         | 2192          | 267               | 1                   | 0              |
```
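As a quick sanity check on the Example 1 numbers in the updated table, a throwaway snippet can rank the runs by execution time (values copied verbatim from the table above; this is purely illustrative arithmetic, not part of the commit):

```python
# Execution times (seconds) for Example 1, copied from the benchmark table.
results = {
    "gpt-3.5-turbo": 4.132,
    "gpt-4-turbo-preview": 6.965,
    "gpt-4-o": 4.446,
    "Grooq with nomic-embed-text": 1.335,
}

# Rank fastest-first.
for name, secs in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {secs}s")
```

On these numbers, the Groq run is the fastest by roughly a 3x margin over the closest OpenAI model.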
Lines changed: 53 additions & 0 deletions (new file)

```python
"""
Basic example of scraping pipeline using SmartScraper from text
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info
load_dotenv()

# ************************************************
# Read the text file
# ************************************************
files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
         "List me all the articles with their description."]


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
    with open(files[i], 'r', encoding="utf-8") as file:
        text = file.read()

    smart_scraper_graph = SmartScraperGraph(
        prompt=tasks[i],
        source=text,
        config=graph_config
    )

    result = smart_scraper_graph.run()
    print(result)

    # ************************************************
    # Get graph execution info
    # ************************************************

    graph_exec_info = smart_scraper_graph.get_execution_info()
    print(prettify_exec_info(graph_exec_info))
```

examples/extras/example.yml (15 additions, new file)

```yaml
{
  "llm": {
    "model": "ollama/llama3",
    "temperature": 0,
    "format": "json",
    # "base_url": "http://localhost:11434",
  },
  "embeddings": {
    "model": "ollama/nomic-embed-text",
    "temperature": 0,
    # "base_url": "http://localhost:11434",
  },
  "verbose": true,
  "headless": false
}
```
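Note that example.yml is written in JSON-style YAML flow syntax with `#` comments, which strict JSON parsers reject but `yaml.safe_load` (as used by load_yml.py in this commit) accepts. A minimal sketch of that behavior, using a trimmed-down copy of the same keys:

```python
import yaml

# JSON-style flow mapping with a '#' comment: valid YAML, not valid JSON.
config_text = """
{
  "llm": {
    "model": "ollama/llama3",
    "temperature": 0,
    "format": "json"
    # "base_url": "http://localhost:11434"
  },
  "verbose": true,
  "headless": false
}
"""

config = yaml.safe_load(config_text)
print(config["llm"]["model"], config["verbose"])  # ollama/llama3 True
```

YAML also maps the lowercase literals `true`/`false` to Python booleans, which is why the file can be fed straight into the graph config.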

examples/extras/force_mode.py (54 additions, new file)

```python
"""
Basic example of scraping pipeline using SmartScraper
"""

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()


# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        # "format": "json",  # Ollama needs the format to be specified explicitly
        # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        # "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "force": True,
    "caching": True
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description.",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```

examples/extras/load_yml.py (32 additions, new file)

```python
"""
Basic example of scraping pipeline using SmartScraper
"""
import yaml
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# ************************************************
# Define the configuration for the graph
# ************************************************
with open("example.yml", 'r') as file:
    graph_config = yaml.safe_load(file)

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the titles",
    source="https://sport.sky.it/nba?gr=www",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```

examples/extras/no_cut.py (43 additions, new file)

```python
"""
This example shows how to do not process the html code in the fetch phase
"""

import os, json
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info


# ************************************************
# Define the configuration for the graph
# ************************************************


graph_config = {
    "llm": {
        "api_key": "s",
        "model": "gpt-3.5-turbo",
    },
    "cut": False,
    "verbose": True,
    "headless": False,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me the python code inside the page",
    source="https://www.exploit-db.com/exploits/51447",
    config=graph_config
)

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```

examples/extras/proxy_rotation.py (48 additions, new file)

```python
"""
Basic example of scraping pipeline using SmartScraper
"""

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info


# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "api_key": "API_KEY",
        "model": "gpt-3.5-turbo",
    },
    "loader_kwargs": {
        "proxy" : {
            "server": "http:/**********",
            "username": "********",
            "password": "***",
        },
    },
    "verbose": True,
    "headless": False,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their description",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = smart_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))
```
