Pre/beta #302

Merged
merged 102 commits into from
May 26, 2024
be16fec
WIP
skrawcz May 10, 2024
d94195f
WIP
skrawcz May 10, 2024
82afa0e
Working smart scraper graph
skrawcz May 10, 2024
0bcb0fb
Merge pull request #210 from skrawcz/burr
VinciGit00 May 10, 2024
f2bb1cc
Fixes LC document deserialization
skrawcz May 11, 2024
20604bd
Merge pull request #218 from skrawcz/burr
VinciGit00 May 11, 2024
e53766b
feat: add logger integration
VinciGit00 May 14, 2024
0589083
refactoring of loggers
VinciGit00 May 15, 2024
a4700bf
add robot node
VinciGit00 May 15, 2024
0b71b9a
Add a new graph traversal that allows more than one edges out of a graph
mayurdb May 15, 2024
42d2ab1
Merge pull request #250 from mayurdb/graphRevamp
VinciGit00 May 15, 2024
0b5cdd4
Merge pull request #246 from VinciGit00/main
VinciGit00 May 15, 2024
008f8d9
Merge pull request #248 from VinciGit00/main
VinciGit00 May 15, 2024
fd3e0aa
ci(release): 1.2.0-beta.1 [skip ci]
semantic-release-bot May 15, 2024
29d284e
Merge branch 'main' into logger-integration
VinciGit00 May 15, 2024
40260d8
remove asdt
VinciGit00 May 15, 2024
a8fb851
remove asdt
VinciGit00 May 15, 2024
4fe58d9
fix logger
VinciGit00 May 15, 2024
befa48c
update lock
VinciGit00 May 15, 2024
ba8a4f7
removed duplicates
VinciGit00 May 15, 2024
d60438c
Add a n-level deep search support
mayurdb May 15, 2024
1e0b2f7
Merge branch 'pre/beta' into nDeep
mayurdb May 15, 2024
f36b3e3
Merge pull request #254 from mayurdb/nDeep
VinciGit00 May 16, 2024
9483afd
revert
VinciGit00 May 16, 2024
9e9f8f0
removed max depth
VinciGit00 May 16, 2024
02745a4
Merge branch 'main' into pre/beta
VinciGit00 May 17, 2024
3453f72
add graph
VinciGit00 May 17, 2024
8c33ea3
feat(node): knowledge graph node
PeriniM May 17, 2024
73fa31d
feat(base_graph): alligned with main
PeriniM May 17, 2024
191db0b
ci(release): 1.3.0-beta.1 [skip ci]
semantic-release-bot May 17, 2024
0196423
feat(knowledgegraph): add knowledge graph node
PeriniM May 17, 2024
05e511e
add new prompts
VinciGit00 May 17, 2024
bed3eed
feat(multiple_search): working multiple example
PeriniM May 17, 2024
b82f33a
fix: template names
VinciGit00 May 18, 2024
6f62b05
Merge branch 'multi_scraper_graph' of https://github.com/VinciGit00/S…
VinciGit00 May 18, 2024
ff53771
add falcon model
VinciGit00 May 18, 2024
58cc903
feat(multiple): quick fix working
PeriniM May 18, 2024
c75e6a0
feat(kg): working rag kg
PeriniM May 18, 2024
a338383
feat(kg): removed import
PeriniM May 18, 2024
5701afe
add new import
VinciGit00 May 18, 2024
7b3ee4e
feat(docloaders): undetected-playwright
QIN2DIM May 19, 2024
ec8cbca
Merge branch 'pre/beta' into try
VinciGit00 May 19, 2024
82585aa
Merge pull request #270 from VinciGit00/try
VinciGit00 May 19, 2024
2caddf9
ci(release): 1.4.0-beta.1 [skip ci]
semantic-release-bot May 19, 2024
9096da7
Merge pull request #269 from QIN2DIM/feat-undetected-playwright
PeriniM May 19, 2024
f1a2523
ci(release): 1.4.0-beta.2 [skip ci]
semantic-release-bot May 19, 2024
7e5ff4e
Added bedrock examples
JGalego May 20, 2024
6058492
Updated JSON scraper example prompt
JGalego May 20, 2024
9e92b03
Added missing Titan text embedding models
JGalego May 20, 2024
05ecc3a
Added missing logic to extract model_name from model_id
JGalego May 20, 2024
d0a301d
Moved common params up (verbose, headless and loader_kwargs)
JGalego May 20, 2024
3ffa896
Fixed model ID -> model name conversion
JGalego May 20, 2024
0ad78ca
Merge pull request #272 from JGalego/docs/bedrock-examples
VinciGit00 May 20, 2024
c8c3201
Merge pull request #273 from JGalego/bugfix/bedrock-runs
VinciGit00 May 20, 2024
fc58e2d
feat(smart-scraper-multi): add schema to graphs and created SmartScra…
PeriniM May 21, 2024
be4237a
Merge branch 'pre/beta' into multi_scraper_graph
VinciGit00 May 21, 2024
7369a4d
Merge pull request #281 from VinciGit00/multi_scraper_graph
VinciGit00 May 21, 2024
ca436ab
fix: error in jsons
VinciGit00 May 21, 2024
aa14271
Update README.md
VinciGit00 May 21, 2024
6cbd84f
feat(burr-bridge): BurrBridge class to integrate inside BaseGraph
PeriniM May 21, 2024
d96840f
Updates Burr bridge to use class-based API
elijahbenizzy May 21, 2024
cfaf7ee
Merge pull request #284 from DAGWorks-Inc/burr_integration
PeriniM May 21, 2024
654a042
feat(burr-node): working burr bridge
PeriniM May 21, 2024
ac10128
feat(burr): added burr integration in graphs and optional burr instal…
PeriniM May 22, 2024
ffd6015
Update abstract_graph.py
stoensin May 23, 2024
0ba3a59
Update models_tokens.py
VinciGit00 May 23, 2024
f00ed35
Merge branch 'pre/beta' into patch-1
VinciGit00 May 23, 2024
1cb71ed
Merge pull request #289 from stoensin/patch-1
VinciGit00 May 23, 2024
b6f7b64
Merge pull request #290 from VinciGit00/pre/beta
VinciGit00 May 23, 2024
1774b18
refactor of embeddings
VinciGit00 May 23, 2024
b377467
add info
VinciGit00 May 23, 2024
909af8d
refactor gen answ node
VinciGit00 May 23, 2024
6d33a8a
rollback
VinciGit00 May 23, 2024
c93dbe0
Update smart_scraper_graph.py
VinciGit00 May 23, 2024
00a392b
Merge pull request #292 from VinciGit00/refactoring
VinciGit00 May 23, 2024
d00cde6
fix(pdf_scraper): fix the pdf scraper gaph
VinciGit00 May 23, 2024
5fd7633
Update pdf_scraper_graph.py
VinciGit00 May 23, 2024
d139480
fix(logging): source code citation
DiTo97 May 23, 2024
0790ecd
fix(web-loader): use sublogger
DiTo97 May 23, 2024
c807695
feat(verbose): centralized graph logging on debug or warning dependin…
DiTo97 May 23, 2024
4348d4f
fix(logger): set up centralized root logger in base node
DiTo97 May 23, 2024
c251cc4
fix(node-logging): use centralized logger in each node for logging
DiTo97 May 23, 2024
3d0f671
Merge pull request #294 from DiTo97/logger-integration
VinciGit00 May 24, 2024
b913b51
Merge branch 'logger-integration' into pre/beta
VinciGit00 May 24, 2024
e1006f3
ci(release): 1.5.0-beta.1 [skip ci]
semantic-release-bot May 24, 2024
819f071
docs(burr): added dependecies and switched to furo
PeriniM May 24, 2024
8d5eb0b
fix(local_file): fixed textual input pdf, csv, json and xml graph
PeriniM May 24, 2024
a4ee757
Merge branch 'pre/beta' into pdf_scraper_refactoring
PeriniM May 24, 2024
8b032a9
Merge pull request #293 from VinciGit00/pdf_scraper_refactoring
PeriniM May 24, 2024
edf221d
ci(release): 1.5.0-beta.2 [skip ci]
semantic-release-bot May 24, 2024
5684578
fix(kg): removed unused nodes and utils
PeriniM May 24, 2024
90d5691
ci(release): 1.5.0-beta.3 [skip ci]
semantic-release-bot May 24, 2024
d27cad5
docs(graph): added new graphs and schema
PeriniM May 24, 2024
e65faca
Merge branch 'pre/beta' of https://github.com/VinciGit00/Scrapegraph-…
PeriniM May 24, 2024
19b27bb
feat(burr): first burr integration and docs
PeriniM May 24, 2024
f9f6b08
Merge branch 'pre/beta' into burr_integration
PeriniM May 25, 2024
7848060
Merge pull request #299 from VinciGit00/burr_integration
PeriniM May 25, 2024
15b7682
ci(release): 1.5.0-beta.4 [skip ci]
semantic-release-bot May 25, 2024
545374c
docs(faq): added faq section and refined installation
PeriniM May 25, 2024
e43b801
docs: updated requirements
PeriniM May 25, 2024
5fb9115
feat(version): python 3.12 is now supported 🚀
PeriniM May 26, 2024
1f51147
ci(release): 1.5.0-beta.5 [skip ci]
semantic-release-bot May 26, 2024
6 changes: 5 additions & 1 deletion .gitignore
@@ -21,16 +21,20 @@ docs/source/_templates/
docs/source/_static/
.env
venv/
.venv/
.vscode/

# exclude pdf, mp3
*.pdf
*.mp3
*.sqlite
*.google-cookie
*.python-version
examples/graph_examples/ScrapeGraphAI_generated_graph
examples/**/result.csv
examples/**/result.json
main.py
lib/
*.html
.idea


84 changes: 83 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,83 @@
## [1.4.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.3.2...v1.4.0) (2024-05-22)
## [1.5.0-beta.5](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.0-beta.4...v1.5.0-beta.5) (2024-05-26)


### Features

* **version:** python 3.12 is now supported 🚀 ([5fb9115](https://github.com/VinciGit00/Scrapegraph-ai/commit/5fb9115330141ac2c1dd97490284d4f1fa2c01c3))


### Docs

* **faq:** added faq section and refined installation ([545374c](https://github.com/VinciGit00/Scrapegraph-ai/commit/545374c17e9101a240fd1fbc380ce813c5aa6c2e))
* updated requirements ([e43b801](https://github.com/VinciGit00/Scrapegraph-ai/commit/e43b8018f5f360b88c52e45ff4e1b4221386ea8e))

## [1.5.0-beta.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.0-beta.3...v1.5.0-beta.4) (2024-05-25)


### Features

* **burr:** added burr integration in graphs and optional burr installation ([ac10128](https://github.com/VinciGit00/Scrapegraph-ai/commit/ac10128ff3af35c52b48c79d085e458524e8e48a))
* **burr-bridge:** BurrBridge class to integrate inside BaseGraph ([6cbd84f](https://github.com/VinciGit00/Scrapegraph-ai/commit/6cbd84f254ebc1f1c68699273bdd8fcdb0fe26d4))
* **burr:** first burr integration and docs ([19b27bb](https://github.com/VinciGit00/Scrapegraph-ai/commit/19b27bbe852f134cf239fc1945e7906bc24d7098))
* **burr-node:** working burr bridge ([654a042](https://github.com/VinciGit00/Scrapegraph-ai/commit/654a04239640a89d9fa408ccb2e4485247ab84df))


### Docs

* **burr:** added dependecies and switched to furo ([819f071](https://github.com/VinciGit00/Scrapegraph-ai/commit/819f071f2dc64d090cb05c3571aff6c9cb9196d7))
* **graph:** added new graphs and schema ([d27cad5](https://github.com/VinciGit00/Scrapegraph-ai/commit/d27cad591196b932c1bbcbaa936479a030ac67b5))

## [1.5.0-beta.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.0-beta.2...v1.5.0-beta.3) (2024-05-24)


### Bug Fixes

* **kg:** removed unused nodes and utils ([5684578](https://github.com/VinciGit00/Scrapegraph-ai/commit/5684578fab635e862de58f7847ad736c6a57f766))

## [1.5.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.5.0-beta.1...v1.5.0-beta.2) (2024-05-24)


### Bug Fixes

* **pdf_scraper:** fix the pdf scraper gaph ([d00cde6](https://github.com/VinciGit00/Scrapegraph-ai/commit/d00cde60309935e283ba9116cf0b114e53cb9640))
* **local_file:** fixed textual input pdf, csv, json and xml graph ([8d5eb0b](https://github.com/VinciGit00/Scrapegraph-ai/commit/8d5eb0bb0d5d008a63a96df94ce3842320376b8e))

## [1.5.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.4.0...v1.5.0-beta.1) (2024-05-24)


### Features

* **knowledgegraph:** add knowledge graph node ([0196423](https://github.com/VinciGit00/Scrapegraph-ai/commit/0196423bdeea6568086aae6db8fc0f5652fc4e87))
* add logger integration ([e53766b](https://github.com/VinciGit00/Scrapegraph-ai/commit/e53766b16e89254f945f9b54b38445a24f8b81f2))
* **smart-scraper-multi:** add schema to graphs and created SmartScraperMultiGraph ([fc58e2d](https://github.com/VinciGit00/Scrapegraph-ai/commit/fc58e2d3a6f05efa72b45c9e68c6bb41a1eee755))
* **base_graph:** alligned with main ([73fa31d](https://github.com/VinciGit00/Scrapegraph-ai/commit/73fa31db0f791d1fd63b489ac88cc6e595aa07f9))
* **verbose:** centralized graph logging on debug or warning depending on verbose ([c807695](https://github.com/VinciGit00/Scrapegraph-ai/commit/c807695720a85c74a0b4365afb397bbbcd7e2889))
* **node:** knowledge graph node ([8c33ea3](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c33ea3fbce18f74484fe7bd9469ab95c985ad0b))
* **multiple:** quick fix working ([58cc903](https://github.com/VinciGit00/Scrapegraph-ai/commit/58cc903d556d0b8db10284493b05bed20992c339))
* **kg:** removed import ([a338383](https://github.com/VinciGit00/Scrapegraph-ai/commit/a338383399b669ae2dd7bfcec168b791e8206816))
* **docloaders:** undetected-playwright ([7b3ee4e](https://github.com/VinciGit00/Scrapegraph-ai/commit/7b3ee4e71e4af04edeb47999d70d398b67c93ac4))
* **multiple_search:** working multiple example ([bed3eed](https://github.com/VinciGit00/Scrapegraph-ai/commit/bed3eed50c1678cfb07cba7b451ac28d38c87d7c))
* **kg:** working rag kg ([c75e6a0](https://github.com/VinciGit00/Scrapegraph-ai/commit/c75e6a06b1a647f03e6ac6eeacdc578a85baa25b))


### Bug Fixes

* error in jsons ([ca436ab](https://github.com/VinciGit00/Scrapegraph-ai/commit/ca436abf3cbff21d752a71969e787e8f8c98c6a8))
* **logger:** set up centralized root logger in base node ([4348d4f](https://github.com/VinciGit00/Scrapegraph-ai/commit/4348d4f4db6f30213acc1bbccebc2b143b4d2636))
* **logging:** source code citation ([d139480](https://github.com/VinciGit00/Scrapegraph-ai/commit/d1394809d704bee4085d494ddebab772306b3b17))
* template names ([b82f33a](https://github.com/VinciGit00/Scrapegraph-ai/commit/b82f33aee72515e4258e6f508fce15028eba5cbe))
* **node-logging:** use centralized logger in each node for logging ([c251cc4](https://github.com/VinciGit00/Scrapegraph-ai/commit/c251cc45d3694f8e81503e38a6d2b362452b740e))
* **web-loader:** use sublogger ([0790ecd](https://github.com/VinciGit00/Scrapegraph-ai/commit/0790ecd2083642af9f0a84583216ababe351cd76))


### CI

* **release:** 1.2.0-beta.1 [skip ci] ([fd3e0aa](https://github.com/VinciGit00/Scrapegraph-ai/commit/fd3e0aa5823509dfb46b4f597521c24d4eb345f1))
* **release:** 1.3.0-beta.1 [skip ci] ([191db0b](https://github.com/VinciGit00/Scrapegraph-ai/commit/191db0bc779e4913713b47b68ec4162a347da3ea))
* **release:** 1.4.0-beta.1 [skip ci] ([2caddf9](https://github.com/VinciGit00/Scrapegraph-ai/commit/2caddf9a99b5f3aedc1783216f21d23cd35b3a8c))
* **release:** 1.4.0-beta.2 [skip ci] ([f1a2523](https://github.com/VinciGit00/Scrapegraph-ai/commit/f1a25233d650010e1932e0ab80938079a22a296d))

## [1.4.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.4.0-beta.1...v1.4.0-beta.2) (2024-05-19)


### Features
@@ -19,13 +98,16 @@

* add deepseek embeddings ([659fad7](https://github.com/VinciGit00/Scrapegraph-ai/commit/659fad770a5b6ace87511513e5233a3bc1269009))


## [1.3.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.4...v1.3.0) (2024-05-19)



### Features

* add new model ([8c7afa7](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c7afa7570f0a104578deb35658168435cfe5ae1))


## [1.2.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.3...v1.2.4) (2024-05-17)


10 changes: 4 additions & 6 deletions README.md
@@ -22,10 +22,6 @@ The reference page for Scrapegraph-ai is available on the official page of pypy:
```bash
pip install scrapegraphai
```
you will also need to install Playwright for javascript-based scraping:
```bash
playwright install
```

**Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱

@@ -49,6 +45,7 @@ There are three main scraping pipelines that can be used to extract information
- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
- `SmartScraperMultiGraph`: multiple page scraper given a single prompt

It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.

@@ -171,7 +168,7 @@ Feel free to contribute and join our Discord server to discuss with us improveme

Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).

[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/gkxQDAjfeX)
[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/uJN7TYcpNa)
[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)

@@ -182,13 +179,14 @@ Wanna visualize the roadmap in a more interactive way? Check out the [markmap](h

## ❤️ Contributors
[![Contributors](https://contrib.rocks/image?repo=VinciGit00/Scrapegraph-ai)](https://github.com/VinciGit00/Scrapegraph-ai/graphs/contributors)

## Sponsors
<div style="text-align: center;">
<a href="https://serpapi.com?utm_source=scrapegraphai">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
</a>
<a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 10%;">
<img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
</a>
</div>

24 changes: 7 additions & 17 deletions docs/source/conf.py
@@ -23,27 +23,17 @@
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon','sphinx_wagtail_theme']
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon']

templates_path = ['_templates']
exclude_patterns = []

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

# html_theme = 'sphinx_rtd_theme'
html_theme = 'sphinx_wagtail_theme'

html_theme_options = dict(
project_name = "ScrapeGraphAI",
logo = "scrapegraphai_logo.png",
logo_alt = "ScrapeGraphAI",
logo_height = 59,
logo_url = "https://scrapegraph-ai.readthedocs.io/en/latest/",
logo_width = 45,
github_url = "https://github.com/VinciGit00/Scrapegraph-ai/tree/main/docs/source/",
footer_links = ",".join(
["Landing Page|https://scrapegraphai.com/",
"Docusaurus|https://scrapegraph-doc.onrender.com/docs/intro"]
),
)
html_theme = 'furo'
html_theme_options = {
"source_repository": "https://github.com/VinciGit00/Scrapegraph-ai/",
"source_branch": "main",
"source_directory": "docs/source/",
}
11 changes: 9 additions & 2 deletions docs/source/getting_started/installation.rst
@@ -25,11 +25,18 @@ The library is available on PyPI, so it can be installed using the following com

It is highly recommended to install the library in a virtual environment (conda, venv, etc.)

If you clone the repository, you can install the library using `poetry <https://python-poetry.org/docs/>`_:
If you clone the repository, it is recommended to use a package manager like `rye <https://rye.astral.sh/>`_.
To install the library using rye, you can run the following command:

.. code-block:: bash

poetry install
rye pin 3.10
rye sync
rye build

.. caution::

**Rye** must be installed first by following the instructions on the `official website <https://rye.astral.sh/>`_.

Additionally on Windows when using WSL
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
9 changes: 9 additions & 0 deletions docs/source/index.rst
@@ -32,6 +32,15 @@

modules/modules

.. toctree::
:hidden:
:caption: EXTERNAL RESOURCES

GitHub <https://github.com/VinciGit00/Scrapegraph-ai>
Discord <https://discord.gg/uJN7TYcpNa>
Linkedin <https://www.linkedin.com/company/scrapegraphai/>
Twitter <https://twitter.com/scrapegraphai>

Indices and tables
==================

78 changes: 67 additions & 11 deletions docs/source/introduction/overview.rst
@@ -6,13 +6,11 @@
Overview
========

ScrapeGraphAI is a open-source web scraping python library designed to usher in a new era of scraping tools.
In today's rapidly evolving and data-intensive digital landscape, this library stands out by integrating LLM and
direct graph logic to automate the creation of scraping pipelines for websites and various local documents, including XML,
HTML, JSON, and more.
ScrapeGraphAI is an **open-source** Python library designed to revolutionize **scraping** tools.
In today's data-intensive digital landscape, this library stands out by integrating **Large Language Models** (LLMs)
and modular **graph-based** pipelines to automate the scraping of data from various sources (e.g., websites, local files etc.).

Simply specify the information you need to extract, and ScrapeGraphAI handles the rest,
providing a more flexible and low-maintenance solution compared to traditional scraping tools.
Simply specify the information you need to extract, and ScrapeGraphAI handles the rest, providing a more **flexible** and **low-maintenance** solution compared to traditional scraping tools.

Why ScrapegraphAI?
==================
@@ -21,17 +19,75 @@ Traditional web scraping tools often rely on fixed patterns or manual configurat
ScrapegraphAI, leveraging the power of LLMs, adapts to changes in website structures, reducing the need for constant developer intervention.
This flexibility ensures that scrapers remain functional even when website layouts change.

We support many Large Language Models (LLMs) including GPT, Gemini, Groq, Azure, Hugging Face etc.
as well as local models which can run on your machine using Ollama.
We support many LLMs including **GPT, Gemini, Groq, Azure, Hugging Face** etc.
as well as local models which can run on your machine using **Ollama**.

Library Diagram
===============

With ScrapegraphAI you first construct a pipeline of steps you want to execute by combining nodes into a graph.
Executing the graph takes care of all the steps that are often part of scraping: fetching, parsing etc...
Finally the scraped and processed data gets fed to an LLM which generates a response.
With ScrapegraphAI you can use many already implemented scraping pipelines or create your own.

The diagram below illustrates the high-level architecture of ScrapeGraphAI:

.. image:: ../../assets/project_overview_diagram.png
:align: center
:width: 70%
:alt: ScrapegraphAI Overview
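
The idea of combining nodes into a graph and then executing it can be illustrated with a minimal, library-independent sketch. Everything below is hypothetical: the node functions ``fetch``, ``parse`` and ``answer``, the ``run_graph`` helper and the hard-coded HTML are stand-ins chosen for illustration and are not ScrapeGraphAI's actual API. No network request or LLM call is made.

```python
import re
from typing import Callable, Dict, Optional

State = Dict[str, str]
Node = Callable[[State], State]


def fetch(state: State) -> State:
    # Stand-in for a real fetch step: returns fixed HTML instead of downloading.
    return {**state, "html": "<html><body><h1>Hello</h1><p>World</p></body></html>"}


def parse(state: State) -> State:
    # Replace tags with spaces, then collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", state["html"])
    return {**state, "text": " ".join(text.split())}


def answer(state: State) -> State:
    # Stand-in for the LLM step: echoes the prompt next to the parsed text.
    return {**state, "answer": f"{state['prompt']} -> {state['text']}"}


def run_graph(nodes: Dict[str, Node], edges: Dict[str, str],
              entry: str, state: State) -> State:
    # Follow one outgoing edge per node until a node has no successor.
    current: Optional[str] = entry
    while current is not None:
        state = nodes[current](state)
        current = edges.get(current)
    return state


nodes = {"fetch": fetch, "parse": parse, "answer": answer}
edges = {"fetch": "parse", "parse": "answer"}
result = run_graph(nodes, edges, "fetch", {"prompt": "Summarize"})
print(result["answer"])
```

Each node receives the shared state and returns an extended copy, so steps can be reordered or swapped by editing only ``nodes`` and ``edges``.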

FAQ
===

1. **What is ScrapeGraphAI?**

ScrapeGraphAI is an open-source Python library that uses large language models (LLMs) and graph logic to automate the creation of scraping pipelines for websites and various document types.

2. **How does ScrapeGraphAI differ from traditional scraping tools?**

Traditional scraping tools rely on fixed patterns and manual configurations, whereas ScrapeGraphAI adapts to website structure changes using LLMs, reducing the need for constant developer intervention.

3. **Which LLMs are supported by ScrapeGraphAI?**

ScrapeGraphAI supports several LLMs, including GPT, Gemini, Groq, Azure, Hugging Face, and local models that can run on your machine using Ollama.

4. **Can ScrapeGraphAI handle different document formats?**

Yes, ScrapeGraphAI can scrape information from various document formats such as XML, HTML, JSON, and more.

5. **I get an empty or incorrect output when scraping a website. What should I do?**

There are several reasons behind this issue, but for most cases, you can try the following:

- Set the ``headless`` parameter to ``False`` in the ``graph_config``. Some JavaScript-heavy websites might require it.

- Check your internet connection. A slow or unstable connection can prevent the HTML from loading properly.

- Try using a proxy server to mask your IP address. Check out the :ref:`Proxy` section for more information on how to configure proxy settings.

- Use a different LLM. Some models might perform better on certain websites than others.

- Set the ``verbose`` parameter to ``True`` in the ``graph_config`` to see more detailed logs.

- Visualize the pipeline graphically using :ref:`Burr`.

If the issue persists, please report it on the GitHub repository.

6. **How does ScrapeGraphAI handle the context window limit of LLMs?**

ScrapeGraphAI splits large websites and documents into overlapping chunks and applies compression techniques to reduce the number of tokens. When a source yields multiple chunks, each chunk produces its own answer to the user prompt, and these answers are merged in the last step of the scraping pipeline.

7. **How can I contribute to ScrapeGraphAI?**

You can contribute to ScrapeGraphAI by submitting bug reports, feature requests, or pull requests on the GitHub repository. Join our `Discord <https://discord.gg/uJN7TYcpNa>`_ community and follow us on social media!
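
As a concrete illustration of the troubleshooting advice in question 5, a configuration with ``headless`` disabled and ``verbose`` enabled might look like the sketch below. Only the ``headless`` and ``verbose`` keys and the name ``graph_config`` come from the text above; the ``llm`` section, its keys, and the model name are assumptions added to make the dictionary look complete, so check the library's documentation for the exact schema.

```python
# Hypothetical troubleshooting configuration (a sketch, not a reference).
graph_config = {
    "llm": {
        "api_key": "YOUR_API_KEY",  # placeholder, assumed key name
        "model": "gpt-3.5-turbo",   # assumed model identifier
    },
    "headless": False,  # show the browser; may help on JavaScript-heavy sites
    "verbose": True,    # emit more detailed logs while the graph runs
}
```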

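The chunk-and-merge strategy from question 6 can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: ``split_with_overlap`` and ``answer_over_chunks`` are hypothetical helpers, tokens are approximated by whitespace-separated words, and the ``ask`` and ``merge`` callables stand in for the per-chunk LLM call and the final merge step.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def split_with_overlap(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


def answer_over_chunks(tokens: List[str], chunk_size: int, overlap: int,
                       ask: Callable[[List[str]], T],
                       merge: Callable[[List[T]], T]) -> T:
    # One partial answer per chunk, combined in a single final merge step.
    partial = [ask(chunk) for chunk in split_with_overlap(tokens, chunk_size, overlap)]
    return merge(partial)


tokens = "a b c d e f g h i j".split()
chunks = split_with_overlap(tokens, chunk_size=4, overlap=1)
# Toy stand-ins: "answer" a chunk with its length, merge by summing.
merged = answer_over_chunks(tokens, 4, 1, ask=len, merge=sum)
print(chunks, merged)
```

The overlap means a sentence that straddles a chunk boundary appears whole in at least one chunk, at the cost of processing some tokens twice.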
Sponsors
========

.. image:: ../../assets/serp_api_logo.png
:width: 10%
:alt: Serp API
:target: https://serpapi.com?utm_source=scrapegraphai

.. image:: ../../assets/transparent_stat.png
:width: 15%
:alt: Stat Proxies
:target: https://dashboard.statproxies.com/?refferal=scrapegraph
3 changes: 3 additions & 0 deletions docs/source/modules/modules.rst
@@ -1,3 +1,6 @@
scrapegraphai
=============

.. toctree::
:maxdepth: 4

21 changes: 21 additions & 0 deletions docs/source/modules/scrapegraphai.builders.rst
@@ -0,0 +1,21 @@
scrapegraphai.builders package
==============================

Submodules
----------

scrapegraphai.builders.graph\_builder module
--------------------------------------------

.. automodule:: scrapegraphai.builders.graph_builder
:members:
:undoc-members:
:show-inheritance:

Module contents
---------------

.. automodule:: scrapegraphai.builders
:members:
:undoc-members:
:show-inheritance:
21 changes: 21 additions & 0 deletions docs/source/modules/scrapegraphai.docloaders.rst
@@ -0,0 +1,21 @@
scrapegraphai.docloaders package
================================

Submodules
----------

scrapegraphai.docloaders.chromium module
----------------------------------------

.. automodule:: scrapegraphai.docloaders.chromium
:members:
:undoc-members:
:show-inheritance:

Module contents
---------------

.. automodule:: scrapegraphai.docloaders
:members:
:undoc-members:
:show-inheritance: