Adding config info to allow urls with ports

sanders41 · sanders41 · commit 73e85034eeec · 2021-03-30T23:33:40.000-04:00
diff --git a/.gitignore b/.gitignore
@@ -8,6 +8,7 @@ src/strategies/__pycache__/
 *.pyc
 .env
 update.sh
+.python-version
 
 *yarn.lock
 
diff --git a/README.md b/README.md
@@ -48,6 +48,9 @@ This scraper is used in production and runs on the [MeiliSearch documentation](h
     - [`custom_settings` (optional)](#custom_settings-optional)
     - [`min_indexed_level` (optional)](#min_indexed_level-optional)
     - [`only_content_level` (optional)](#only_content_level-optional)
+    - [`js_render` (optional)](#js_render-optional)
+    - [`js_wait` (optional)](#js_wait-optional)
+    - [`allowed_domains` (optional)](#allowed_domains-optional)
   - [Authentication](#authentication)
   - [Installing Chrome Headless](#installing-chrome-headless)
 - [🤖 Compatibility with MeiliSearch](#-compatibility-with-meilisearch)
@@ -459,6 +462,37 @@ If used, `min_indexed_level` is ignored.
 }
 ```
 
+#### `js_render` (optional)
+
+When `js_render` is set to `true`, the scraper uses ChromeDriver. This is needed for pages that are rendered with JavaScript, for example pages generated with React or Vue. The default value is `false`.
+
+```json
+{
+  "js_render": true
+}
+```
+
+#### `js_wait` (optional)
+
+Use this setting when `js_render` is set to `true` and the pages need time to fully load. `js_wait` takes an integer that specifies the number of seconds the scraper should wait for the page to load.
+
+```json
+{
+  "js_render": true,
+  "js_wait": 1
+}
+```
+
+#### `allowed_domains` (optional)
+
+This setting specifies the domains that the scraper is allowed to access. In most cases, `allowed_domains` is set automatically from `start_urls` and `stop_urls`. When scraping a URL that contains a port, for example `http://localhost:8080`, the domain needs to be added to the configuration manually.
+
+```json
+{
+  "allowed_domains": ["localhost"]
+}
+```
+
 ### Authentication
 
 __WARNING:__ Please be aware that the scraper will send authentication headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!