Commit 73e8503

Adding config info to allow urls with ports
1 parent d7ba692 commit 73e8503

2 files changed: +35 -0 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ src/strategies/__pycache__/
 *.pyc
 .env
 update.sh
+.python-version

 *yarn.lock

README.md

Lines changed: 34 additions & 0 deletions
@@ -48,6 +48,9 @@ This scraper is used in production and runs on the [MeiliSearch documentation](h
 - [`custom_settings` (optional)](#custom_settings-optional)
 - [`min_indexed_level` (optional)](#min_indexed_level-optional)
 - [`only_content_level` (optional)](#only_content_level-optional)
+- [`js_render` (optional)](#js_render-optional)
+- [`js_wait` (optional)](#js_wait-optional)
+- [`allowed_domains` (optional)](#allowed_domains-optional)
 - [Authentication](#authentication)
 - [Installing Chrome Headless](#installing-chrome-headless)
 - [🤖 Compatibility with MeiliSearch](#-compatibility-with-meilisearch)
@@ -459,6 +462,37 @@ If used, `min_indexed_level` is ignored.
 }
 ```

+#### `js_render` (optional)
+
+When `js_render` is set to `true`, the scraper uses ChromeDriver. This is needed for pages that are rendered with JavaScript, for example pages generated with React or Vue. The default value is `false`.
+
+```json
+{
+  "js_render": true
+}
+```
+
+#### `js_wait` (optional)
+
+This setting can be used when `js_render` is set to `true` and the pages need time to fully load. `js_wait` takes an integer that specifies the number of seconds the scraper should wait for the page to load.
+
+```json
+{
+  "js_render": true,
+  "js_wait": 1
+}
+```
+
+#### `allowed_domains` (optional)
+
+This setting specifies the domains that the scraper is allowed to access. In most cases, `allowed_domains` is set automatically from `start_urls` and `stop_urls`. When scraping a URL that contains a port, for example `http://localhost:8080`, the domain needs to be added to the configuration manually.
+
+```json
+{
+  "allowed_domains": ["localhost"]
+}
+```
+
 ### Authentication

 __WARNING:__ Please be aware that the scraper will send authentication headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!
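
To illustrate the `allowed_domains` note in the README change: the example pairs the URL `http://localhost:8080` with the entry `"localhost"`, which suggests the entry should be the bare hostname without the port. The sketch below shows one way to derive such entries from a list of start URLs using only the Python standard library. The helper names `host_without_port` and `suggest_allowed_domains` are hypothetical and are not part of the scraper; this is an illustration under that assumption, not the scraper's own logic.

```python
from urllib.parse import urlparse


def host_without_port(url: str) -> str:
    """Return the hostname of `url` with any port stripped (hypothetical helper)."""
    hostname = urlparse(url).hostname
    if hostname is None:
        # Happens when the URL has no scheme, e.g. "localhost:8080".
        raise ValueError(f"no hostname found in {url!r}")
    return hostname


def suggest_allowed_domains(start_urls):
    """Collect unique hostnames in first-seen order (hypothetical helper)."""
    domains = []
    for url in start_urls:
        host = host_without_port(url)
        if host not in domains:
            domains.append(host)
    return domains


if __name__ == "__main__":
    urls = ["http://localhost:8080/docs", "http://localhost:8080/api"]
    print(suggest_allowed_domains(urls))  # ['localhost']
```

The resulting list could then be copied into the `allowed_domains` key of the configuration file by hand.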
