
Commit 2526831

Merge pull request #302 from VinciGit00/pre/beta
Pre/beta
2 parents eb41f0d + 1f51147 commit 2526831

105 files changed

+4246
-909
lines changed

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -21,16 +21,20 @@ docs/source/_templates/
 docs/source/_static/
 .env
 venv/
+.venv/
 .vscode/

 # exclude pdf, mp3
 *.pdf
 *.mp3
 *.sqlite
 *.google-cookie
+*.python-version
 examples/graph_examples/ScrapeGraphAI_generated_graph
 examples/**/result.csv
 examples/**/result.json
 main.py
+lib/
+*.html
+.idea

-

CHANGELOG.md

Lines changed: 83 additions & 1 deletion
@@ -1,4 +1,83 @@
-## [1.4.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.3.2...v1.4.0) (2024-05-22)
+## [1.5.0-beta.5](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.0-beta.4...v1.5.0-beta.5) (2024-05-26)
+
+
+### Features
+
+* **version:** python 3.12 is now supported 🚀 ([5fb9115](https://github.com/VinciGit00/Scrapegraph-ai/commit/5fb9115330141ac2c1dd97490284d4f1fa2c01c3))
+
+
+### Docs
+
+* **faq:** added faq section and refined installation ([545374c](https://github.com/VinciGit00/Scrapegraph-ai/commit/545374c17e9101a240fd1fbc380ce813c5aa6c2e))
+* updated requirements ([e43b801](https://github.com/VinciGit00/Scrapegraph-ai/commit/e43b8018f5f360b88c52e45ff4e1b4221386ea8e))
+
+## [1.5.0-beta.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.0-beta.3...v1.5.0-beta.4) (2024-05-25)
+
+
+### Features
+
+* **burr:** added burr integration in graphs and optional burr installation ([ac10128](https://github.com/VinciGit00/Scrapegraph-ai/commit/ac10128ff3af35c52b48c79d085e458524e8e48a))
+* **burr-bridge:** BurrBridge class to integrate inside BaseGraph ([6cbd84f](https://github.com/VinciGit00/Scrapegraph-ai/commit/6cbd84f254ebc1f1c68699273bdd8fcdb0fe26d4))
+* **burr:** first burr integration and docs ([19b27bb](https://github.com/VinciGit00/Scrapegraph-ai/commit/19b27bbe852f134cf239fc1945e7906bc24d7098))
+* **burr-node:** working burr bridge ([654a042](https://github.com/VinciGit00/Scrapegraph-ai/commit/654a04239640a89d9fa408ccb2e4485247ab84df))
+
+
+### Docs
+
+* **burr:** added dependecies and switched to furo ([819f071](https://github.com/VinciGit00/Scrapegraph-ai/commit/819f071f2dc64d090cb05c3571aff6c9cb9196d7))
+* **graph:** added new graphs and schema ([d27cad5](https://github.com/VinciGit00/Scrapegraph-ai/commit/d27cad591196b932c1bbcbaa936479a030ac67b5))
+
+## [1.5.0-beta.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.0-beta.2...v1.5.0-beta.3) (2024-05-24)
+
+
+### Bug Fixes
+
+* **kg:** removed unused nodes and utils ([5684578](https://github.com/VinciGit00/Scrapegraph-ai/commit/5684578fab635e862de58f7847ad736c6a57f766))
+
+## [1.5.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.0-beta.1...v1.5.0-beta.2) (2024-05-24)
+
+
+### Bug Fixes
+
+* **pdf_scraper:** fix the pdf scraper gaph ([d00cde6](https://github.com/VinciGit00/Scrapegraph-ai/commit/d00cde60309935e283ba9116cf0b114e53cb9640))
+* **local_file:** fixed textual input pdf, csv, json and xml graph ([8d5eb0b](https://github.com/VinciGit00/Scrapegraph-ai/commit/8d5eb0bb0d5d008a63a96df94ce3842320376b8e))
+
+## [1.5.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.4.0...v1.5.0-beta.1) (2024-05-24)
+
+
+### Features
+
+* **knowledgegraph:** add knowledge graph node ([0196423](https://github.com/VinciGit00/Scrapegraph-ai/commit/0196423bdeea6568086aae6db8fc0f5652fc4e87))
+* add logger integration ([e53766b](https://github.com/VinciGit00/Scrapegraph-ai/commit/e53766b16e89254f945f9b54b38445a24f8b81f2))
+* **smart-scraper-multi:** add schema to graphs and created SmartScraperMultiGraph ([fc58e2d](https://github.com/VinciGit00/Scrapegraph-ai/commit/fc58e2d3a6f05efa72b45c9e68c6bb41a1eee755))
+* **base_graph:** alligned with main ([73fa31d](https://github.com/VinciGit00/Scrapegraph-ai/commit/73fa31db0f791d1fd63b489ac88cc6e595aa07f9))
+* **verbose:** centralized graph logging on debug or warning depending on verbose ([c807695](https://github.com/VinciGit00/Scrapegraph-ai/commit/c807695720a85c74a0b4365afb397bbbcd7e2889))
+* **node:** knowledge graph node ([8c33ea3](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c33ea3fbce18f74484fe7bd9469ab95c985ad0b))
+* **multiple:** quick fix working ([58cc903](https://github.com/VinciGit00/Scrapegraph-ai/commit/58cc903d556d0b8db10284493b05bed20992c339))
+* **kg:** removed import ([a338383](https://github.com/VinciGit00/Scrapegraph-ai/commit/a338383399b669ae2dd7bfcec168b791e8206816))
+* **docloaders:** undetected-playwright ([7b3ee4e](https://github.com/VinciGit00/Scrapegraph-ai/commit/7b3ee4e71e4af04edeb47999d70d398b67c93ac4))
+* **multiple_search:** working multiple example ([bed3eed](https://github.com/VinciGit00/Scrapegraph-ai/commit/bed3eed50c1678cfb07cba7b451ac28d38c87d7c))
+* **kg:** working rag kg ([c75e6a0](https://github.com/VinciGit00/Scrapegraph-ai/commit/c75e6a06b1a647f03e6ac6eeacdc578a85baa25b))
+
+
+### Bug Fixes
+
+* error in jsons ([ca436ab](https://github.com/VinciGit00/Scrapegraph-ai/commit/ca436abf3cbff21d752a71969e787e8f8c98c6a8))
+* **logger:** set up centralized root logger in base node ([4348d4f](https://github.com/VinciGit00/Scrapegraph-ai/commit/4348d4f4db6f30213acc1bbccebc2b143b4d2636))
+* **logging:** source code citation ([d139480](https://github.com/VinciGit00/Scrapegraph-ai/commit/d1394809d704bee4085d494ddebab772306b3b17))
+* template names ([b82f33a](https://github.com/VinciGit00/Scrapegraph-ai/commit/b82f33aee72515e4258e6f508fce15028eba5cbe))
+* **node-logging:** use centralized logger in each node for logging ([c251cc4](https://github.com/VinciGit00/Scrapegraph-ai/commit/c251cc45d3694f8e81503e38a6d2b362452b740e))
+* **web-loader:** use sublogger ([0790ecd](https://github.com/VinciGit00/Scrapegraph-ai/commit/0790ecd2083642af9f0a84583216ababe351cd76))
+
+
+### CI
+
+* **release:** 1.2.0-beta.1 [skip ci] ([fd3e0aa](https://github.com/VinciGit00/Scrapegraph-ai/commit/fd3e0aa5823509dfb46b4f597521c24d4eb345f1))
+* **release:** 1.3.0-beta.1 [skip ci] ([191db0b](https://github.com/VinciGit00/Scrapegraph-ai/commit/191db0bc779e4913713b47b68ec4162a347da3ea))
+* **release:** 1.4.0-beta.1 [skip ci] ([2caddf9](https://github.com/VinciGit00/Scrapegraph-ai/commit/2caddf9a99b5f3aedc1783216f21d23cd35b3a8c))
+* **release:** 1.4.0-beta.2 [skip ci] ([f1a2523](https://github.com/VinciGit00/Scrapegraph-ai/commit/f1a25233d650010e1932e0ab80938079a22a296d))
+
+## [1.4.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.4.0-beta.1...v1.4.0-beta.2) (2024-05-19)


 ### Features
@@ -19,13 +98,16 @@

 * add deepseek embeddings ([659fad7](https://github.com/VinciGit00/Scrapegraph-ai/commit/659fad770a5b6ace87511513e5233a3bc1269009))

+
 ## [1.3.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.4...v1.3.0) (2024-05-19)


+
 ### Features

 * add new model ([8c7afa7](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c7afa7570f0a104578deb35658168435cfe5ae1))

+
 ## [1.2.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.3...v1.2.4) (2024-05-17)


README.md

Lines changed: 4 additions & 6 deletions
@@ -22,10 +22,6 @@ The reference page for Scrapegraph-ai is available on the official page of pypy:
 ```bash
 pip install scrapegraphai
 ```
-you will also need to install Playwright for javascript-based scraping:
-```bash
-playwright install
-```

 **Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱

@@ -49,6 +45,7 @@ There are three main scraping pipelines that can be used to extract information
 - `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
 - `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
 - `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
+- `SmartScraperMultiGraph`: multiple page scraper given a single prompt

 It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.

@@ -171,7 +168,7 @@ Feel free to contribute and join our Discord server to discuss with us improveme

 Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).

-[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/gkxQDAjfeX)
+[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/uJN7TYcpNa)
 [![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
 [![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)

@@ -182,13 +179,14 @@ Wanna visualize the roadmap in a more interactive way? Check out the [markmap](h

 ## ❤️ Contributors
 [![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)
+
 ## Sponsors
 <div style="text-align: center;">
 <a href="https://serpapi.com?utm_source=scrapegraphai">
 <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
 </a>
 <a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
-<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 10%;">
+<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
 </a>
 </div>

docs/source/conf.py

Lines changed: 7 additions & 17 deletions
@@ -23,27 +23,17 @@
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

-extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon','sphinx_wagtail_theme']
+extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon']

 templates_path = ['_templates']
 exclude_patterns = []

 # -- Options for HTML output -------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

-# html_theme = 'sphinx_rtd_theme'
-html_theme = 'sphinx_wagtail_theme'
-
-html_theme_options = dict(
-    project_name = "ScrapeGraphAI",
-    logo = "scrapegraphai_logo.png",
-    logo_alt = "ScrapeGraphAI",
-    logo_height = 59,
-    logo_url = "https://scrapegraph-ai.readthedocs.io/en/latest/",
-    logo_width = 45,
-    github_url = "https://github.com/VinciGit00/Scrapegraph-ai/tree/main/docs/source/",
-    footer_links = ",".join(
-        ["Landing Page|https://scrapegraphai.com/",
-        "Docusaurus|https://scrapegraph-doc.onrender.com/docs/intro"]
-    ),
-)
+html_theme = 'furo'
+html_theme_options = {
+    "source_repository": "https://github.com/VinciGit00/Scrapegraph-ai/",
+    "source_branch": "main",
+    "source_directory": "docs/source/",
+}

docs/source/getting_started/installation.rst

Lines changed: 9 additions & 2 deletions
@@ -25,11 +25,18 @@ The library is available on PyPI, so it can be installed using the following com

 It is higly recommended to install the library in a virtual environment (conda, venv, etc.)

-If your clone the repository, you can install the library using `poetry <https://python-poetry.org/docs/>`_:
+If your clone the repository, it is recommended to use a package manager like `rye <https://rye.astral.sh/>`_.
+To install the library using rye, you can run the following command:

 .. code-block:: bash

-    poetry install
+    rye pin 3.10
+    rye sync
+    rye build
+
+.. caution::
+
+    **Rye** must be installed first by following the instructions on the `official website <https://rye.astral.sh/>`_.

 Additionally on Windows when using WSL
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

docs/source/index.rst

Lines changed: 9 additions & 0 deletions
@@ -32,6 +32,15 @@

    modules/modules

+.. toctree::
+   :hidden:
+   :caption: EXTERNAL RESOURCES
+
+   GitHub <https://github.com/VinciGit00/Scrapegraph-ai>
+   Discord <https://discord.gg/uJN7TYcpNa>
+   Linkedin <https://www.linkedin.com/company/scrapegraphai/>
+   Twitter <https://twitter.com/scrapegraphai>
+
 Indices and tables
 ==================


docs/source/introduction/overview.rst

Lines changed: 67 additions & 11 deletions
@@ -6,13 +6,11 @@
 Overview
 ========

-ScrapeGraphAI is a open-source web scraping python library designed to usher in a new era of scraping tools.
-In today's rapidly evolving and data-intensive digital landscape, this library stands out by integrating LLM and
-direct graph logic to automate the creation of scraping pipelines for websites and various local documents, including XML,
-HTML, JSON, and more.
+ScrapeGraphAI is an **open-source** Python library designed to revolutionize **scraping** tools.
+In today's data-intensive digital landscape, this library stands out by integrating **Large Language Models** (LLMs)
+and modular **graph-based** pipelines to automate the scraping of data from various sources (e.g., websites, local files etc.).

-Simply specify the information you need to extract, and ScrapeGraphAI handles the rest,
-providing a more flexible and low-maintenance solution compared to traditional scraping tools.
+Simply specify the information you need to extract, and ScrapeGraphAI handles the rest, providing a more **flexible** and **low-maintenance** solution compared to traditional scraping tools.

 Why ScrapegraphAI?
 ==================
@@ -21,17 +19,75 @@ Traditional web scraping tools often rely on fixed patterns or manual configurat
 ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
 This flexibility ensures that scrapers remain functional even when website layouts change.

-We support many Large Language Models (LLMs) including GPT, Gemini, Groq, Azure, Hugging Face etc.
-as well as local models which can run on your machine using Ollama.
+We support many LLMs including **GPT, Gemini, Groq, Azure, Hugging Face** etc.
+as well as local models which can run on your machine using **Ollama**.

 Library Diagram
 ===============

-With ScrapegraphAI you first construct a pipeline of steps you want to execute by combining nodes into a graph.
-Executing the graph takes care of all the steps that are often part of scraping: fetching, parsing etc...
-Finally the scraped and processed data gets fed to an LLM which generates a response.
+With ScrapegraphAI you can use many already implemented scraping pipelines or create your own.
+
+The diagram below illustrates the high-level architecture of ScrapeGraphAI:

 .. image:: ../../assets/project_overview_diagram.png
    :align: center
    :width: 70%
    :alt: ScrapegraphAI Overview
+
+FAQ
+===
+
+1. **What is ScrapeGraphAI?**
+
+ScrapeGraphAI is an open-source python library that uses large language models (LLMs) and graph logic to automate the creation of scraping pipelines for websites and various document types.
+
+2. **How does ScrapeGraphAI differ from traditional scraping tools?**
+
+Traditional scraping tools rely on fixed patterns and manual configurations, whereas ScrapeGraphAI adapts to website structure changes using LLMs, reducing the need for constant developer intervention.
+
+3. **Which LLMs are supported by ScrapeGraphAI?**
+
+ScrapeGraphAI supports several LLMs, including GPT, Gemini, Groq, Azure, Hugging Face, and local models that can run on your machine using Ollama.
+
+4. **Can ScrapeGraphAI handle different document formats?**
+
+Yes, ScrapeGraphAI can scrape information from various document formats such as XML, HTML, JSON, and more.
+
+5. **I get an empty or incorrect output when scraping a website. What should I do?**
+
+There are several reasons behind this issue, but for most cases, you can try the following:
+
+- Set the `headless` parameter to `False` in the graph_config. Some javascript-heavy websites might require it.
+
+- Check your internet connection. Low speed or unstable connection can cause the HTML to not load properly.
+
+- Try using a proxy server to mask your IP address. Check out the :ref:`Proxy` section for more information on how to configure proxy settings.
+
+- Use a different LLM model. Some models might perform better on certain websites than others.
+
+- Set the `verbose` parameter to `True` in the graph_config to see more detailed logs.
+
+- Visualize the pipeline graphically using :ref:`Burr`.
+
+If the issue persists, please report it on the GitHub repository.
+
+6. **How does ScrapeGraphAI handle the context window limit of LLMs?**
+
+By splitting big websites/documents into chunks with overlaps and applying compression techniques to reduce the number of tokens. If multiple chunks are present, we will have multiple answers to the user prompt, and therefore, we merge them together in the last step of the scraping pipeline.
+
+7. **How can I contribute to ScrapeGraphAI?**
+
+You can contribute to ScrapeGraphAI by submitting bug reports, feature requests, or pull requests on the GitHub repository. Join our `Discord <https://discord.gg/uJN7TYcpNa>`_ community and follow us on social media!
+
+Sponsors
+========
+
+.. image:: ../../assets/serp_api_logo.png
+   :width: 10%
+   :alt: Serp API
+   :target: https://serpapi.com?utm_source=scrapegraphai
+
+.. image:: ../../assets/transparent_stat.png
+   :width: 15%
+   :alt: Stat Proxies
+   :target: https://dashboard.statproxies.com/?refferal=scrapegraph
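
Reviewer note on FAQ item 6 in ``overview.rst``: the answer describes splitting large inputs into overlapping chunks and merging the per-chunk answers in a final step. The sketch below only illustrates that general idea and is not taken from the ScrapeGraphAI code base. The function names ``split_with_overlap`` and ``merge_answers`` are hypothetical, the "tokens" are plain list items, and the merge step is a simple de-duplication standing in for whatever the library actually does (the compression step mentioned in the FAQ is not modelled at all).

```python
from typing import List, Sequence


def split_with_overlap(tokens: Sequence, chunk_size: int, overlap: int) -> List[list]:
    """Split `tokens` into chunks of at most `chunk_size` items.

    Consecutive chunks share `overlap` items so that content cut at a
    chunk boundary still appears in full in at least one chunk.
    Hypothetical helper, not part of the scrapegraphai API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once this chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def merge_answers(answers: Sequence[str]) -> List[str]:
    """Combine per-chunk answers, dropping empty and duplicate entries.

    Assumption: a real pipeline may merge answers differently (for
    example with another LLM call); this is only a stand-in.
    """
    merged = []
    for answer in answers:
        if answer and answer not in merged:
            merged.append(answer)
    return merged


if __name__ == "__main__":
    words = "one two three four five six seven eight nine ten".split()
    for chunk in split_with_overlap(words, chunk_size=4, overlap=1):
        print(chunk)
    print(merge_answers(["price: 10", "", "price: 10", "title: demo"]))
```

The overlap trades a few repeated tokens per chunk for a lower chance of losing information that straddles a boundary, which is why ``overlap`` must stay smaller than ``chunk_size``.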

docs/source/modules/modules.rst

Lines changed: 3 additions & 0 deletions
@@ -1,3 +1,6 @@
+scrapegraphai
+=============
+
 .. toctree::
    :maxdepth: 4


Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+scrapegraphai.builders package
+==============================
+
+Submodules
+----------
+
+scrapegraphai.builders.graph\_builder module
+--------------------------------------------
+
+.. automodule:: scrapegraphai.builders.graph_builder
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+Module contents
+---------------
+
+.. automodule:: scrapegraphai.builders
+   :members:
+   :undoc-members:
+   :show-inheritance:
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+scrapegraphai.docloaders package
+================================
+
+Submodules
+----------
+
+scrapegraphai.docloaders.chromium module
+----------------------------------------
+
+.. automodule:: scrapegraphai.docloaders.chromium
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
+Module contents
+---------------
+
+.. automodule:: scrapegraphai.docloaders
+   :members:
+   :undoc-members:
+   :show-inheritance:
