295 scrapegraph-ai: integrate OneAPI model qwen-turbo #298


Closed · wants to merge 72 commits

Commits (72)
e53766b
feat: add logger integration
VinciGit00 May 14, 2024
0589083
refactoring of loggers
VinciGit00 May 15, 2024
a4700bf
add robot node
VinciGit00 May 15, 2024
0b71b9a
Add a new graph traversal that allows more than one edges out of a graph
mayurdb May 15, 2024
42d2ab1
Merge pull request #250 from mayurdb/graphRevamp
VinciGit00 May 15, 2024
008f8d9
Merge pull request #248 from VinciGit00/main
VinciGit00 May 15, 2024
fd3e0aa
ci(release): 1.2.0-beta.1 [skip ci]
semantic-release-bot May 15, 2024
29d284e
Merge branch 'main' into logger-integration
VinciGit00 May 15, 2024
40260d8
remove asdt
VinciGit00 May 15, 2024
a8fb851
remove asdt
VinciGit00 May 15, 2024
4fe58d9
fix logger
VinciGit00 May 15, 2024
befa48c
update lock
VinciGit00 May 15, 2024
ba8a4f7
removed duplicates
VinciGit00 May 15, 2024
d60438c
Add a n-level deep search support
mayurdb May 15, 2024
1e0b2f7
Merge branch 'pre/beta' into nDeep
mayurdb May 15, 2024
f36b3e3
Merge pull request #254 from mayurdb/nDeep
VinciGit00 May 16, 2024
9483afd
revert
VinciGit00 May 16, 2024
9e9f8f0
removed max depth
VinciGit00 May 16, 2024
02745a4
Merge branch 'main' into pre/beta
VinciGit00 May 17, 2024
3453f72
add graph
VinciGit00 May 17, 2024
8c33ea3
feat(node): knowledge graph node
PeriniM May 17, 2024
73fa31d
feat(base_graph): alligned with main
PeriniM May 17, 2024
191db0b
ci(release): 1.3.0-beta.1 [skip ci]
semantic-release-bot May 17, 2024
0196423
feat(knowledgegraph): add knowledge graph node
PeriniM May 17, 2024
05e511e
add new prompts
VinciGit00 May 17, 2024
bed3eed
feat(multiple_search): working multiple example
PeriniM May 17, 2024
b82f33a
fix: template names
VinciGit00 May 18, 2024
6f62b05
Merge branch 'multi_scraper_graph' of https://github.com/VinciGit00/S…
VinciGit00 May 18, 2024
ff53771
add falcon model
VinciGit00 May 18, 2024
58cc903
feat(multiple): quick fix working
PeriniM May 18, 2024
c75e6a0
feat(kg): working rag kg
PeriniM May 18, 2024
a338383
feat(kg): removed import
PeriniM May 18, 2024
5701afe
add new import
VinciGit00 May 18, 2024
7b3ee4e
feat(docloaders): undetected-playwright
QIN2DIM May 19, 2024
ec8cbca
Merge branch 'pre/beta' into try
VinciGit00 May 19, 2024
82585aa
Merge pull request #270 from VinciGit00/try
VinciGit00 May 19, 2024
2caddf9
ci(release): 1.4.0-beta.1 [skip ci]
semantic-release-bot May 19, 2024
9096da7
Merge pull request #269 from QIN2DIM/feat-undetected-playwright
PeriniM May 19, 2024
f1a2523
ci(release): 1.4.0-beta.2 [skip ci]
semantic-release-bot May 19, 2024
7e5ff4e
Added bedrock examples
JGalego May 20, 2024
6058492
Updated JSON scraper example prompt
JGalego May 20, 2024
9e92b03
Added missing Titan text embedding models
JGalego May 20, 2024
05ecc3a
Added missing logic to extract model_name from model_id
JGalego May 20, 2024
d0a301d
Moved common params up (verbose, headless and loader_kwargs)
JGalego May 20, 2024
3ffa896
Fixed model ID -> model name conversion
JGalego May 20, 2024
0ad78ca
Merge pull request #272 from JGalego/docs/bedrock-examples
VinciGit00 May 20, 2024
c8c3201
Merge pull request #273 from JGalego/bugfix/bedrock-runs
VinciGit00 May 20, 2024
fc58e2d
feat(smart-scraper-multi): add schema to graphs and created SmartScra…
PeriniM May 21, 2024
be4237a
Merge branch 'pre/beta' into multi_scraper_graph
VinciGit00 May 21, 2024
7369a4d
Merge pull request #281 from VinciGit00/multi_scraper_graph
VinciGit00 May 21, 2024
ca436ab
fix: error in jsons
VinciGit00 May 21, 2024
aa14271
Update README.md
VinciGit00 May 21, 2024
ffd6015
Update abstract_graph.py
stoensin May 23, 2024
0ba3a59
Update models_tokens.py
VinciGit00 May 23, 2024
f00ed35
Merge branch 'pre/beta' into patch-1
VinciGit00 May 23, 2024
1cb71ed
Merge pull request #289 from stoensin/patch-1
VinciGit00 May 23, 2024
b6f7b64
Merge pull request #290 from VinciGit00/pre/beta
VinciGit00 May 23, 2024
1774b18
refactor of embeddings
VinciGit00 May 23, 2024
b377467
add info
VinciGit00 May 23, 2024
909af8d
refactor gen answ node
VinciGit00 May 23, 2024
6d33a8a
rollback
VinciGit00 May 23, 2024
c93dbe0
Update smart_scraper_graph.py
VinciGit00 May 23, 2024
00a392b
Merge pull request #292 from VinciGit00/refactoring
VinciGit00 May 23, 2024
d139480
fix(logging): source code citation
DiTo97 May 23, 2024
0790ecd
fix(web-loader): use sublogger
DiTo97 May 23, 2024
c807695
feat(verbose): centralized graph logging on debug or warning dependin…
DiTo97 May 23, 2024
4348d4f
fix(logger): set up centralized root logger in base node
DiTo97 May 23, 2024
c251cc4
fix(node-logging): use centralized logger in each node for logging
DiTo97 May 23, 2024
3d0f671
Merge pull request #294 from DiTo97/logger-integration
VinciGit00 May 24, 2024
b913b51
Merge branch 'logger-integration' into pre/beta
VinciGit00 May 24, 2024
e1006f3
ci(release): 1.5.0-beta.1 [skip ci]
semantic-release-bot May 24, 2024
b6f1766
add OneAPI integration
VinciGit00 May 24, 2024
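The final commit ("add OneAPI integration") is what this issue (#295) asks about: routing a qwen-turbo model through a OneAPI proxy. A minimal configuration sketch is below; the `oneapi/` model prefix, the `base_url` endpoint, and the key handling are assumptions modeled on how other providers are configured in this repo, not details confirmed by this diff:

```python
# Hypothetical scrapegraphai-style config for a OneAPI-proxied qwen-turbo model.
# The "oneapi/" prefix, base_url, and api_key placeholder are assumptions,
# not taken from this PR's code.
graph_config = {
    "llm": {
        "api_key": "YOUR_ONEAPI_KEY",           # placeholder, not a real key
        "model": "oneapi/qwen-turbo",           # assumed provider/model format
        "base_url": "http://localhost:3000/v1"  # assumed local OneAPI endpoint
    },
}

# The provider prefix is typically split off to select the backend.
provider, model_name = graph_config["llm"]["model"].split("/", 1)
print(provider, model_name)
```

If the integration follows the pattern of the other providers in this PR, only the `llm` block changes; the rest of the pipeline stays identical.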
4 changes: 3 additions & 1 deletion .gitignore
@@ -32,5 +32,7 @@ examples/graph_examples/ScrapeGraphAI_generated_graph
examples/**/result.csv
examples/**/result.json
main.py
lib/
*.html
.idea


1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.10.14
40 changes: 39 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,39 @@
## [1.4.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.3.2...v1.4.0) (2024-05-22)
## [1.5.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.4.0...v1.5.0-beta.1) (2024-05-24)


### Features

* **knowledgegraph:** add knowledge graph node ([0196423](https://github.com/VinciGit00/Scrapegraph-ai/commit/0196423bdeea6568086aae6db8fc0f5652fc4e87))
* add logger integration ([e53766b](https://github.com/VinciGit00/Scrapegraph-ai/commit/e53766b16e89254f945f9b54b38445a24f8b81f2))
* **smart-scraper-multi:** add schema to graphs and created SmartScraperMultiGraph ([fc58e2d](https://github.com/VinciGit00/Scrapegraph-ai/commit/fc58e2d3a6f05efa72b45c9e68c6bb41a1eee755))
* **base_graph:** alligned with main ([73fa31d](https://github.com/VinciGit00/Scrapegraph-ai/commit/73fa31db0f791d1fd63b489ac88cc6e595aa07f9))
* **verbose:** centralized graph logging on debug or warning depending on verbose ([c807695](https://github.com/VinciGit00/Scrapegraph-ai/commit/c807695720a85c74a0b4365afb397bbbcd7e2889))
* **node:** knowledge graph node ([8c33ea3](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c33ea3fbce18f74484fe7bd9469ab95c985ad0b))
* **multiple:** quick fix working ([58cc903](https://github.com/VinciGit00/Scrapegraph-ai/commit/58cc903d556d0b8db10284493b05bed20992c339))
* **kg:** removed import ([a338383](https://github.com/VinciGit00/Scrapegraph-ai/commit/a338383399b669ae2dd7bfcec168b791e8206816))
* **docloaders:** undetected-playwright ([7b3ee4e](https://github.com/VinciGit00/Scrapegraph-ai/commit/7b3ee4e71e4af04edeb47999d70d398b67c93ac4))
* **multiple_search:** working multiple example ([bed3eed](https://github.com/VinciGit00/Scrapegraph-ai/commit/bed3eed50c1678cfb07cba7b451ac28d38c87d7c))
* **kg:** working rag kg ([c75e6a0](https://github.com/VinciGit00/Scrapegraph-ai/commit/c75e6a06b1a647f03e6ac6eeacdc578a85baa25b))


### Bug Fixes

* error in jsons ([ca436ab](https://github.com/VinciGit00/Scrapegraph-ai/commit/ca436abf3cbff21d752a71969e787e8f8c98c6a8))
* **logger:** set up centralized root logger in base node ([4348d4f](https://github.com/VinciGit00/Scrapegraph-ai/commit/4348d4f4db6f30213acc1bbccebc2b143b4d2636))
* **logging:** source code citation ([d139480](https://github.com/VinciGit00/Scrapegraph-ai/commit/d1394809d704bee4085d494ddebab772306b3b17))
* template names ([b82f33a](https://github.com/VinciGit00/Scrapegraph-ai/commit/b82f33aee72515e4258e6f508fce15028eba5cbe))
* **node-logging:** use centralized logger in each node for logging ([c251cc4](https://github.com/VinciGit00/Scrapegraph-ai/commit/c251cc45d3694f8e81503e38a6d2b362452b740e))
* **web-loader:** use sublogger ([0790ecd](https://github.com/VinciGit00/Scrapegraph-ai/commit/0790ecd2083642af9f0a84583216ababe351cd76))


### CI

* **release:** 1.2.0-beta.1 [skip ci] ([fd3e0aa](https://github.com/VinciGit00/Scrapegraph-ai/commit/fd3e0aa5823509dfb46b4f597521c24d4eb345f1))
* **release:** 1.3.0-beta.1 [skip ci] ([191db0b](https://github.com/VinciGit00/Scrapegraph-ai/commit/191db0bc779e4913713b47b68ec4162a347da3ea))
* **release:** 1.4.0-beta.1 [skip ci] ([2caddf9](https://github.com/VinciGit00/Scrapegraph-ai/commit/2caddf9a99b5f3aedc1783216f21d23cd35b3a8c))
* **release:** 1.4.0-beta.2 [skip ci] ([f1a2523](https://github.com/VinciGit00/Scrapegraph-ai/commit/f1a25233d650010e1932e0ab80938079a22a296d))

## [1.4.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.4.0-beta.1...v1.4.0-beta.2) (2024-05-19)


### Features
@@ -19,13 +54,16 @@

* add deepseek embeddings ([659fad7](https://github.com/VinciGit00/Scrapegraph-ai/commit/659fad770a5b6ace87511513e5233a3bc1269009))


## [1.3.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.4...v1.3.0) (2024-05-19)



### Features

* add new model ([8c7afa7](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c7afa7570f0a104578deb35658168435cfe5ae1))


## [1.2.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.3...v1.2.4) (2024-05-17)


5 changes: 1 addition & 4 deletions README.md
@@ -22,10 +22,6 @@ The reference page for Scrapegraph-ai is available on the official page of PyPI:
```bash
pip install scrapegraphai
```
you will also need to install Playwright for javascript-based scraping:
```bash
playwright install
```

**Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱

@@ -49,6 +45,7 @@ There are three main scraping pipelines that can be used to extract information
- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
- `SmartScraperMultiGraph`: multi-page scraper that runs over multiple sources given a single prompt.

It is possible to use different LLMs through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
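As a sketch of how the same pipeline can target different backends, the following helper builds a `graph_config` per provider. The model identifiers and config keys are illustrative placeholders, not a verified list; the library's `models_tokens` registry is the authority on accepted names:

```python
def build_llm_config(provider: str) -> dict:
    """Return a scrapegraphai-style llm config for a given provider.

    Model names below are illustrative placeholders; check the library's
    models_tokens registry for the identifiers it actually accepts.
    """
    configs = {
        "openai": {"model": "gpt-3.5-turbo", "api_key": "OPENAI_KEY"},
        "ollama": {"model": "ollama/llama3", "base_url": "http://localhost:11434"},
        "bedrock": {"model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"},
    }
    if provider not in configs:
        raise ValueError(f"unsupported provider: {provider}")
    return {"llm": configs[provider]}

print(build_llm_config("ollama")["llm"]["model"])
```

Swapping providers then means swapping only the `llm` block, which is exactly how the Bedrock examples later in this PR are structured.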

4 changes: 4 additions & 0 deletions examples/bedrock/.env.example
@@ -0,0 +1,4 @@
AWS_ACCESS_KEY_ID="..."
AWS_SECRET_ACCESS_KEY="..."
AWS_SESSION_TOKEN="..."
AWS_DEFAULT_REGION="..."
3 changes: 3 additions & 0 deletions examples/bedrock/README.md
@@ -0,0 +1,3 @@
This folder contains examples of how to use ScrapeGraphAI with [Amazon Bedrock](https://aws.amazon.com/bedrock/) ⛰️. The examples show how to extract information from websites and files using a natural language prompt.

![](scrapegraphai_bedrock.png)
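The Bedrock examples load the four credentials from the `.env.example` above before running. A small pre-flight check like the following can catch a missing variable early (a sketch; the real examples rely on `python-dotenv` and boto3's own credential resolution):

```python
import os

# Variables from examples/bedrock/.env.example; AWS_SESSION_TOKEN is only
# needed for temporary credentials, so it is not treated as required here.
REQUIRED = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"]

def missing_aws_vars(env=os.environ):
    """Return the names of required AWS variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

# Example with a deliberately incomplete environment:
print(missing_aws_vars({"AWS_ACCESS_KEY_ID": "x"}))
```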
63 changes: 63 additions & 0 deletions examples/bedrock/csv_scraper_bedrock.py
@@ -0,0 +1,63 @@
"""
Basic example of scraping pipeline using CSVScraperGraph from CSV documents
"""

import os
import json

from dotenv import load_dotenv

import pandas as pd

from scrapegraphai.graphs import CSVScraperGraph
from scrapegraphai.utils import convert_to_csv, convert_to_json, prettify_exec_info

load_dotenv()

# ************************************************
# Read the CSV file
# ************************************************

FILE_NAME = "inputs/username.csv"
curr_dir = os.path.dirname(os.path.realpath(__file__))
file_path = os.path.join(curr_dir, FILE_NAME)

text = pd.read_csv(file_path)

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
        "temperature": 0.0
    },
    "embeddings": {
        "model": "bedrock/cohere.embed-multilingual-v3"
    }
}

# ************************************************
# Create the CSVScraperGraph instance and run it
# ************************************************

csv_scraper_graph = CSVScraperGraph(
    prompt="List me all the last names",
    source=str(text),  # Pass the content of the file, not the file object
    config=graph_config
)

result = csv_scraper_graph.run()
print(json.dumps(result, indent=4))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = csv_scraper_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

# Save to json or csv
convert_to_csv(result, "result")
convert_to_json(result, "result")
127 changes: 127 additions & 0 deletions examples/bedrock/custom_graph_bedrock.py
@@ -0,0 +1,127 @@
"""
Example of custom graph using existing nodes
"""

import json

from dotenv import load_dotenv

from langchain_aws import BedrockEmbeddings
from scrapegraphai.models import Bedrock
from scrapegraphai.graphs import BaseGraph
from scrapegraphai.nodes import (
FetchNode,
ParseNode,
RAGNode,
GenerateAnswerNode,
RobotsNode
)

load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************

graph_config = {
    "llm": {
        "model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
        "temperature": 0.0
    },
    "embeddings": {
        "model": "bedrock/cohere.embed-multilingual-v3"
    }
}

# ************************************************
# Define the graph nodes
# ************************************************

llm_model = Bedrock({
    'model_id': graph_config["llm"]["model"].split("/")[-1],
    'model_kwargs': {
        'temperature': 0.0
    }
})
embedder = BedrockEmbeddings(model_id=graph_config["embeddings"]["model"].split("/")[-1])

# Define the nodes for the graph
robot_node = RobotsNode(
    input="url",
    output=["is_scrapable"],
    node_config={
        "llm_model": llm_model,
        "force_scraping": True,
        "verbose": True,
    }
)

fetch_node = FetchNode(
    input="url | local_dir",
    output=["doc", "link_urls", "img_urls"],
    node_config={
        "verbose": True,
        "headless": True,
    }
)

parse_node = ParseNode(
    input="doc",
    output=["parsed_doc"],
    node_config={
        "chunk_size": 4096,
        "verbose": True,
    }
)

rag_node = RAGNode(
    input="user_prompt & (parsed_doc | doc)",
    output=["relevant_chunks"],
    node_config={
        "llm_model": llm_model,
        "embedder_model": embedder,
        "verbose": True,
    }
)

generate_answer_node = GenerateAnswerNode(
    input="user_prompt & (relevant_chunks | parsed_doc | doc)",
    output=["answer"],
    node_config={
        "llm_model": llm_model,
        "verbose": True,
    }
)

# ************************************************
# Create the graph by defining the connections
# ************************************************

graph = BaseGraph(
    nodes=[
        robot_node,
        fetch_node,
        parse_node,
        rag_node,
        generate_answer_node,
    ],
    edges=[
        (robot_node, fetch_node),
        (fetch_node, parse_node),
        (parse_node, rag_node),
        (rag_node, generate_answer_node)
    ],
    entry_point=robot_node
)

# ************************************************
# Execute the graph
# ************************************************

result, execution_info = graph.execute({
    "user_prompt": "List me all the articles",
    "url": "https://perinim.github.io/projects"
})

# Get the answer from the result
result = result.get("answer", "No answer found.")
print(json.dumps(result, indent=4))