Commit db012ed

bors[bot] and sanders41 authored
Merge #112

112: Adding config info to allow urls with ports r=curquiza a=sanders41

Closes #103

Co-authored-by: Paul Sanders <[email protected]>
2 parents d7ba692 + 39237c5 commit db012ed

2 files changed: 39 additions, 0 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ src/strategies/__pycache__/
 *.pyc
 .env
 update.sh
+.python-version
 
 *yarn.lock
 

README.md

Lines changed: 38 additions & 0 deletions
@@ -48,6 +48,9 @@ This scraper is used in production and runs on the [MeiliSearch documentation](h
 - [`custom_settings` (optional)](#custom_settings-optional)
 - [`min_indexed_level` (optional)](#min_indexed_level-optional)
 - [`only_content_level` (optional)](#only_content_level-optional)
+- [`js_render` (optional)](#js_render-optional)
+- [`js_wait` (optional)](#js_wait-optional)
+- [`allowed_domains` (optional)](#allowed_domains-optional)
 - [Authentication](#authentication)
 - [Installing Chrome Headless](#installing-chrome-headless)
 - [🤖 Compatibility with MeiliSearch](#-compatibility-with-meilisearch)
@@ -459,6 +462,41 @@ If used, `min_indexed_level` is ignored.
 }
 ```
 
+#### `js_render` (optional)
+
+When `js_render` is set to `true`, the scraper will use ChromeDriver. This is needed for pages that are rendered with JavaScript, for example, pages generated with React, Vue, or applications that are running in development mode: `autoreload` `watch`.
+
+After installing ChromeDriver, provide the path to the bin using the following environment variable `CHROMEDRIVER_PATH` (default value is `/usr/bin/chromedriver`).
+
+The default value of `js_render` is `false`.
+
+```json
+{
+  "js_render": true
+}
+```
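
Reading note on the `js_render` hunk: the environment-variable lookup it describes could be sketched in Python roughly as follows. This is only an illustration under assumptions. The helper name `resolve_chromedriver_path` is hypothetical and is not taken from the scraper's source; only the variable name `CHROMEDRIVER_PATH` and the default `/usr/bin/chromedriver` come from the README text.

```python
import os

# Default stated in the README text of this commit.
DEFAULT_CHROMEDRIVER_PATH = "/usr/bin/chromedriver"


def resolve_chromedriver_path(environ=None):
    """Return the ChromeDriver binary path.

    Hypothetical helper for illustration; not part of the scraper's API.
    `environ` defaults to the process environment and can be replaced
    with a plain dict when experimenting.
    """
    env = os.environ if environ is None else environ
    # Fall back to the documented default when the variable is unset.
    return env.get("CHROMEDRIVER_PATH", DEFAULT_CHROMEDRIVER_PATH)
```

Passing a dict such as `{"CHROMEDRIVER_PATH": "/opt/chromedriver"}` makes it easy to check the override without touching the real environment.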
+
+#### `js_wait` (optional)
+
+This setting can be used when `js_render` is set to `true` and the pages need time to fully load. `js_wait` takes an integer and specifies the number of seconds the scraper should wait for the page to load.
+
+```json
+{
+  "js_render": true,
+  "js_wait": 1
+}
+```
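
Reading note on the `js_wait` hunk: since the text ties `js_wait` to `js_render` and asks for an integer number of seconds, a configuration could be sanity-checked before a run along these lines. The function `check_js_settings` and its messages are hypothetical; the README does not say whether the scraper performs any such validation itself.

```python
def check_js_settings(config):
    """Return a list of problems found in the JS-related settings.

    Hypothetical pre-flight check for illustration; not scraper behaviour.
    """
    problems = []
    js_render = config.get("js_render", False)  # README: default is false
    js_wait = config.get("js_wait")

    if js_wait is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(js_wait, bool) or not isinstance(js_wait, int):
            problems.append("js_wait must be an integer number of seconds")
        elif js_wait < 0:
            problems.append("js_wait must not be negative")
        if not js_render:
            problems.append("js_wait is meant to be used with js_render set to true")
    return problems
```

An empty list means the two settings are consistent with the description in the README.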
+
+#### `allowed_domains` (optional)
+
+This setting specifies the domains that the scraper is allowed to access. In most cases the `allowed_domains` will be automatically set using the `start_urls` and `stop_urls`. When scraping a domain that contains a port, for example `http://localhost:8080`, the domain needs to be manually added to the configuration.
+
+```json
+{
+  "allowed_domains": ["localhost"]
+}
+```
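
Reading note on the `allowed_domains` hunk: the README's example lists `localhost` without the `:8080` suffix. The sketch below uses Python's standard `urllib.parse` to show how the port is part of a URL's network location but not of its host name, and how a matching `allowed_domains` entry could be derived by hand. The helper `hostnames_for` is hypothetical, and this is not a description of how the scraper computes the value internally.

```python
from urllib.parse import urlsplit


def hostnames_for(urls):
    """Collect unique host names, without ports, from a list of URLs.

    Illustrative helper only; the name is an assumption, not scraper code.
    """
    hosts = []
    for url in urls:
        host = urlsplit(url).hostname  # drops the ":8080" part of the netloc
        if host and host not in hosts:
            hosts.append(host)
    return hosts


config = {"start_urls": ["http://localhost:8080"]}
# Add the port-free host name manually, as the README asks for such URLs.
config["allowed_domains"] = hostnames_for(config["start_urls"])
```

Here `urlsplit("http://localhost:8080").netloc` still contains the port, while `.hostname` does not, which matches the form used in the JSON example.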
+
 ### Authentication
 
 __WARNING:__ Please be aware that the scraper will send authentication headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!
