
Commit 8bacd53

Merge pull request #724 from ScrapeGraphAI/tem ("allignment")
2 parents: 4f65be4 + 99aac5b

File tree: 8 files changed (+42 / -41 lines)

CHANGELOG.md

Lines changed: 8 additions & 0 deletions

```diff
@@ -1,3 +1,11 @@
+## [1.25.2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.25.1...v1.25.2) (2024-10-03)
+
+
+### Bug Fixes
+
+* update dependencies ([7579d0e](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/7579d0e2599d63c0003b1b7a0918132511a9c8f1))
+
+## [1.25.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.25.0...v1.25.1) (2024-09-29)
 ## [1.26.0-beta.3](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.26.0-beta.2...v1.26.0-beta.3) (2024-10-04)
 
 
```
README.md

Lines changed: 5 additions & 29 deletions

````diff
@@ -98,7 +98,6 @@ The output will be a dictionary like the following:
     "contact_email": "[email protected]"
 }
 ```
-
 There are other pipelines that can be used to extract information from multiple pages, generate Python scripts, or even generate audio files.
 
 | Pipeline Name | Description |
@@ -110,6 +109,8 @@ There are other pipelines that can be used to extract information from multiple
 | SmartScraperMultiGraph | Multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources. |
 | ScriptCreatorMultiGraph | Multi-page scraper that generates a Python script for extracting information from multiple pages and sources. |
 
+Each of these graphs also has a multi version, which allows LLM calls to be made in parallel.
+
 It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
 
 Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command, if you want to use local models.
@@ -140,6 +141,9 @@ Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).
   <a href="https://2ly.link/1zNj1">
     <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
   </a>
+  <a href="https://scrape.do">
+    <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapedo.png" alt="Stats" style="width: 11%;">
+  </a>
 </div>
 
 ## 🤝 Contributing
@@ -152,34 +156,6 @@ Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegra
 [![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
 [![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)
 
-## 🗺️ Roadmap
-
-We are working on the following features! If you are interested in collaborating right-click on the feature and open in a new tab to file a PR. If you have doubts and wanna discuss them with us, just contact us on [discord](https://discord.gg/uJN7TYcpNa) or open a [Discussion](https://github.com/VinciGit00/Scrapegraph-ai/discussions) here on Github!
-
-```mermaid
-%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#5C4B9B', 'edgeLabelBackground':'#ffffff', 'tertiaryColor': '#ffffff', 'primaryBorderColor': '#5C4B9B', 'fontFamily': 'Arial', 'fontSize': '16px', 'textColor': '#5C4B9B' }}}%%
-graph LR
-    A[DeepSearch Graph] --> F[Use Existing Chromium Instances]
-    F --> B[Page Caching]
-    B --> C[Screenshot Scraping]
-    C --> D[Handle Dynamic Content]
-    D --> E[New Webdrivers]
-
-    style A fill:#ffffff,stroke:#5C4B9B,stroke-width:2px,rx:10,ry:10
-    style F fill:#ffffff,stroke:#5C4B9B,stroke-width:2px,rx:10,ry:10
-    style B fill:#ffffff,stroke:#5C4B9B,stroke-width:2px,rx:10,ry:10
-    style C fill:#ffffff,stroke:#5C4B9B,stroke-width:2px,rx:10,ry:10
-    style D fill:#ffffff,stroke:#5C4B9B,stroke-width:2px,rx:10,ry:10
-    style E fill:#ffffff,stroke:#5C4B9B,stroke-width:2px,rx:10,ry:10
-
-    click A href "https://github.com/VinciGit00/Scrapegraph-ai/issues/260" "Open DeepSearch Graph Issue"
-    click F href "https://github.com/VinciGit00/Scrapegraph-ai/issues/329" "Open Chromium Instances Issue"
-    click B href "https://github.com/VinciGit00/Scrapegraph-ai/issues/197" "Open Page Caching Issue"
-    click C href "https://github.com/VinciGit00/Scrapegraph-ai/issues/197" "Open Screenshot Scraping Issue"
-    click D href "https://github.com/VinciGit00/Scrapegraph-ai/issues/279" "Open Handle Dynamic Content Issue"
-    click E href "https://github.com/VinciGit00/Scrapegraph-ai/issues/171" "Open New Webdrivers Issue"
-```
-
 ## 📈 Telemetry
 We collect anonymous usage metrics to enhance our package's quality and user experience. The data helps us prioritize improvements and ensure compatibility. If you wish to opt-out, set the environment variable SCRAPEGRAPHAI_TELEMETRY_ENABLED=false. For more information, please refer to the documentation [here](https://scrapegraph-ai.readthedocs.io/en/latest/scrapers/telemetry.html).
````
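As the telemetry paragraph above notes, opting out is a single environment variable. A minimal shell example (the variable name is taken from the README; the session-scoped `export` is just one way to set it):

```shell
# Disable ScrapeGraphAI's anonymous usage metrics for this shell session
export SCRAPEGRAPHAI_TELEMETRY_ENABLED=false
```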

docs/assets/scrapedo.png (binary, 19.2 KB)

pyproject.toml

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,7 +1,7 @@
 [project]
 name = "scrapegraphai"
 
-version = "1.26.0b3"
+version = "1.25.2"
 
 description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines."
 authors = [
@@ -30,6 +30,7 @@ dependencies = [
     "undetected-playwright>=0.3.0",
     "google>=3.0.0",
     "langchain-ollama>=0.1.3",
+
     "semchunk==2.2.0",
     "transformers==4.44.2",
     "qdrant-client>=1.11.3",
```

scrapegraphai/utils/cleanup_code.py

Lines changed: 3 additions & 0 deletions

````diff
@@ -4,6 +4,9 @@
 import re
 
 def extract_code(code: str) -> str:
+    """
+    Module for extracting code
+    """
     pattern = r'```(?:python)?\n(.*?)```'
 
     match = re.search(pattern, code, re.DOTALL)
````
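The regex added above pulls the body of the first fenced code block out of an LLM reply. A standalone sketch of that pattern (the fallback-to-raw-input behavior is an assumption of this sketch, not necessarily what the library does):

````python
import re

def extract_code(code: str) -> str:
    # Match a fenced block, optionally tagged "python"; re.DOTALL lets
    # the non-greedy "(.*?)" span newlines up to the first closing fence.
    pattern = r'```(?:python)?\n(.*?)```'
    match = re.search(pattern, code, re.DOTALL)
    # Assumption: fall back to the raw string when no fence is found.
    return match.group(1) if match else code

reply = "Sure:\n```python\nprint('hi')\n```\nAnything else?"
print(extract_code(reply))  # captured body is "print('hi')\n"
````

Because the quantifier is non-greedy, a reply containing several fenced blocks yields only the first one.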

scrapegraphai/utils/cleanup_html.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -101,7 +101,7 @@ def reduce_html(html, reduction):
         for attr in list(tag.attrs):
             if attr not in attrs_to_keep:
                 del tag[attr]
-
+
     if reduction == 1:
         return minify_html(str(soup))
 
```
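The loop in this hunk keeps only whitelisted attributes on each tag. The same idea can be sketched with just the standard library (`reduce_html` itself iterates a BeautifulSoup tree; the `ATTRS_TO_KEEP` whitelist and helper names here are illustrative, not the library's):

```python
from html.parser import HTMLParser

ATTRS_TO_KEEP = {"class", "id", "href", "src"}  # assumption: example whitelist

class AttrStripper(HTMLParser):
    """Re-emit HTML, dropping attributes outside the whitelist."""
    def __init__(self):
        super().__init__()
        self.out = []
    def _kept(self, attrs):
        return "".join(f' {k}="{v}"' for k, v in attrs if k in ATTRS_TO_KEEP)
    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._kept(attrs)}>")
    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._kept(attrs)}/>")
    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
    def handle_data(self, data):
        self.out.append(data)

def strip_attrs(html: str) -> str:
    parser = AttrStripper()
    parser.feed(html)
    return "".join(parser.out)

print(strip_attrs('<a href="/x" onclick="evil()" style="color:red">link</a>'))
# → <a href="/x">link</a>
```

Dropping inline handlers and styles this way is what makes the reduced HTML much cheaper to feed to an LLM.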

scrapegraphai/utils/code_error_analysis.py

Lines changed: 12 additions & 5 deletions

```diff
@@ -2,24 +2,27 @@
 This module contains the functions that are used to generate the prompts for the code error analysis.
 """
 from typing import Any, Dict
+import json
 from langchain.prompts import PromptTemplate
 from langchain_core.output_parsers import StrOutputParser
-import json
 from ..prompts import (
     TEMPLATE_SYNTAX_ANALYSIS, TEMPLATE_EXECUTION_ANALYSIS,
     TEMPLATE_VALIDATION_ANALYSIS, TEMPLATE_SEMANTIC_ANALYSIS
 )
 
 def syntax_focused_analysis(state: dict, llm_model) -> str:
-    prompt = PromptTemplate(template=TEMPLATE_SYNTAX_ANALYSIS, input_variables=["generated_code", "errors"])
+    prompt = PromptTemplate(template=TEMPLATE_SYNTAX_ANALYSIS,
+                            input_variables=["generated_code", "errors"])
     chain = prompt | llm_model | StrOutputParser()
     return chain.invoke({
         "generated_code": state["generated_code"],
         "errors": state["errors"]["syntax"]
     })
 
 def execution_focused_analysis(state: dict, llm_model) -> str:
-    prompt = PromptTemplate(template=TEMPLATE_EXECUTION_ANALYSIS, input_variables=["generated_code", "errors", "html_code", "html_analysis"])
+    prompt = PromptTemplate(template=TEMPLATE_EXECUTION_ANALYSIS,
+                            input_variables=["generated_code", "errors",
+                                             "html_code", "html_analysis"])
     chain = prompt | llm_model | StrOutputParser()
     return chain.invoke({
         "generated_code": state["generated_code"],
@@ -29,7 +32,9 @@ def execution_focused_analysis(state: dict, llm_model) -> str:
     })
 
 def validation_focused_analysis(state: dict, llm_model) -> str:
-    prompt = PromptTemplate(template=TEMPLATE_VALIDATION_ANALYSIS, input_variables=["generated_code", "errors", "json_schema", "execution_result"])
+    prompt = PromptTemplate(template=TEMPLATE_VALIDATION_ANALYSIS,
+                            input_variables=["generated_code", "errors",
+                                             "json_schema", "execution_result"])
     chain = prompt | llm_model | StrOutputParser()
     return chain.invoke({
         "generated_code": state["generated_code"],
@@ -39,7 +44,9 @@ def validation_focused_analysis(state: dict, llm_model) -> str:
     })
 
 def semantic_focused_analysis(state: dict, comparison_result: Dict[str, Any], llm_model) -> str:
-    prompt = PromptTemplate(template=TEMPLATE_SEMANTIC_ANALYSIS, input_variables=["generated_code", "differences", "explanation"])
+    prompt = PromptTemplate(template=TEMPLATE_SEMANTIC_ANALYSIS,
+                            input_variables=["generated_code",
+                                             "differences", "explanation"])
     chain = prompt | llm_model | StrOutputParser()
     return chain.invoke({
         "generated_code": state["generated_code"],
```

scrapegraphai/utils/code_error_correction.py

Lines changed: 11 additions & 5 deletions

```diff
@@ -10,32 +10,38 @@
 )
 
 def syntax_focused_code_generation(state: dict, analysis: str, llm_model) -> str:
-    prompt = PromptTemplate(template=TEMPLATE_SYNTAX_CODE_GENERATION, input_variables=["analysis", "generated_code"])
+    prompt = PromptTemplate(template=TEMPLATE_SYNTAX_CODE_GENERATION,
+                            input_variables=["analysis", "generated_code"])
     chain = prompt | llm_model | StrOutputParser()
     return chain.invoke({
         "analysis": analysis,
         "generated_code": state["generated_code"]
     })
 
 def execution_focused_code_generation(state: dict, analysis: str, llm_model) -> str:
-    prompt = PromptTemplate(template=TEMPLATE_EXECUTION_CODE_GENERATION, input_variables=["analysis", "generated_code"])
+    prompt = PromptTemplate(template=TEMPLATE_EXECUTION_CODE_GENERATION,
+                            input_variables=["analysis", "generated_code"])
     chain = prompt | llm_model | StrOutputParser()
     return chain.invoke({
         "analysis": analysis,
         "generated_code": state["generated_code"]
     })
 
 def validation_focused_code_generation(state: dict, analysis: str, llm_model) -> str:
-    prompt = PromptTemplate(template=TEMPLATE_VALIDATION_CODE_GENERATION, input_variables=["analysis", "generated_code", "json_schema"])
+    prompt = PromptTemplate(template=TEMPLATE_VALIDATION_CODE_GENERATION,
+                            input_variables=["analysis", "generated_code",
+                                             "json_schema"])
     chain = prompt | llm_model | StrOutputParser()
     return chain.invoke({
         "analysis": analysis,
         "generated_code": state["generated_code"],
         "json_schema": state["json_schema"]
     })
-
+
 def semantic_focused_code_generation(state: dict, analysis: str, llm_model) -> str:
-    prompt = PromptTemplate(template=TEMPLATE_SEMANTIC_CODE_GENERATION, input_variables=["analysis", "generated_code", "generated_result", "reference_result"])
+    prompt = PromptTemplate(template=TEMPLATE_SEMANTIC_CODE_GENERATION,
+                            input_variables=["analysis", "generated_code",
+                                             "generated_result", "reference_result"])
     chain = prompt | llm_model | StrOutputParser()
     return chain.invoke({
         "analysis": analysis,
```
