GitHub action #12

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged: 4 commits, Apr 8, 2020
22 changes: 22 additions & 0 deletions .github/workflows/test-lint.yml
@@ -0,0 +1,22 @@
name: Test and Lint

on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-18.04

    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.7
        uses: actions/setup-python@v1
        with:
          python-version: 3.7
      - name: Install pipenv
        uses: dschep/install-pipenv-action@v1
      - name: Install dependencies
        run: pipenv install --dev
      - name: Lint with pylint
        run: pipenv run pylint scraper
      - name: Run tests
        run: pipenv run pytest ./scraper/src -k "not _browser"
4 changes: 0 additions & 4 deletions CHANGELOG.md

This file was deleted.

77 changes: 0 additions & 77 deletions CONTRIBUTING.md

This file was deleted.

41 changes: 41 additions & 0 deletions README.md
@@ -9,6 +9,8 @@ A scraper for your documentation website that indexes the scraped content into a
- [About the API Key](#about-the-api-key)
- [Configuration file](#configuration-file)
- [And for the search bar?](#and-for-the-search-bar)
- [Authentication](#authentication)
- [Installing Chrome Headless](#installing-chrome-headless)
- [Development Workflow](#development-workflow)
- [Credits](#credits)

@@ -19,6 +21,8 @@ This project supports Python 3.6+.

### From Source Code

The [`pipenv` command](https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv) must be installed.

Set both environment variables `MEILISEARCH_HOST_URL` and `MEILISEARCH_API_KEY`.
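For example, in a POSIX shell (both values below are placeholders for your own instance and key):

```bash
# Placeholders: point these at your own MeiliSearch instance and API key
export MEILISEARCH_HOST_URL="http://localhost:7700"
export MEILISEARCH_API_KEY="your-api-key"
```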

Then, run:
@@ -111,10 +115,37 @@ After having crawled your documentation, you might need a search bar to improve

For the front-end part, check out the [docs-searchbar.js repository](https://github.com/meilisearch/docs-searchbar.js), which provides a search bar adapted for documentation.

## Authentication

__WARNING:__ Please be aware that the scraper will send authentication headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!

### Basic HTTP <!-- omit in TOC -->

Basic HTTP authentication is supported by setting these environment variables:
- `DOCS_SCRAPER_BASICAUTH_USERNAME`
- `DOCS_SCRAPER_BASICAUTH_PASSWORD`
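A minimal sketch of supplying those credentials, assuming a POSIX shell (both values are placeholders):

```bash
# Placeholder credentials for a site behind Basic HTTP authentication
export DOCS_SCRAPER_BASICAUTH_USERNAME="scraper-user"
export DOCS_SCRAPER_BASICAUTH_PASSWORD="s3cret"
```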

### Cloudflare Access: Identity and Access Management <!-- omit in TOC -->

If you need to scrape sites protected by Cloudflare Access, you must set the appropriate HTTP headers.

The values for these headers are taken from the environment variables `CF_ACCESS_CLIENT_ID` and `CF_ACCESS_CLIENT_SECRET`.
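For example (placeholder values; a real service token comes from your Cloudflare Access dashboard):

```bash
# Placeholder Cloudflare Access service token credentials
export CF_ACCESS_CLIENT_ID="xxxxxxxx.access"
export CF_ACCESS_CLIENT_SECRET="xxxxxxxxxxxxxxxx"
```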

For Google Cloud Identity-Aware Proxy (IAP), set these environment variables (see the sketch below this list):
- `IAP_AUTH_CLIENT_ID`: the [client ID of the application](https://console.cloud.google.com/apis/credentials) you are connecting to
- `IAP_AUTH_SERVICE_ACCOUNT_JSON`: a service account key, generated via [Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccounts) -> Create key -> JSON
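A minimal sketch, assuming you have downloaded a service-account key file (the IDs and path are placeholders):

```bash
# Placeholder values for Google Cloud IAP.
export IAP_AUTH_CLIENT_ID="1234567890-abc.apps.googleusercontent.com"
# Assumption: the variable holds the key file's JSON content;
# it may instead expect a path to the file.
export IAP_AUTH_SERVICE_ACCOUNT_JSON="$(cat /path/to/service-account.json)"
```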

## Installing Chrome Headless

Websites that need JavaScript for rendering are passed through ChromeDriver.<br>
[Download the version](http://chromedriver.chromium.org/downloads) suited to your OS and then set the environment variable `CHROMEDRIVER_PATH`.
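For example, on Linux (the path is a placeholder for wherever you unpacked the binary):

```bash
# Placeholder path; point this at your unpacked chromedriver binary
export CHROMEDRIVER_PATH="/usr/local/bin/chromedriver"
```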

## Development Workflow

### Install and Launch <!-- omit in TOC -->

The [`pipenv` command](https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv) must be installed.

Set both environment variables `MEILISEARCH_HOST_URL` and `MEILISEARCH_API_KEY`.

Then, run:
@@ -123,6 +154,16 @@
$ pipenv install
$ pipenv run ./docs_scraper run <path-to-your-config-file>
```

### Linter and Tests <!-- omit in TOC -->

```bash
$ pipenv install --dev
# Linter
$ pipenv run pylint scraper
# Tests
$ pipenv run pytest ./scraper/src -k "not _browser"
```

### Release <!-- omit in TOC -->

Once the changes are merged on `master`, in your terminal, you must be on the `master` branch and push a new tag with the right version:
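A plausible sequence (the exact commands are truncated in this view), with the version number as a placeholder:

```bash
# Hypothetical release flow; vX.X.X is a placeholder version
git checkout master
git pull origin master
git tag vX.X.X
git push origin vX.X.X
```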
4 changes: 2 additions & 2 deletions scraper/src/strategies/default_strategy.py
@@ -163,8 +163,8 @@ def get_records_from_dom(self, current_page_url=None):
  for meta_node in self.select('//meta'):
      name = meta_node.get('name')
      content = meta_node.get('content')
-     if name and name.startswith('docsearch:') and content:
-         name = name.replace('docsearch:', '')
+     if name and name.startswith('docs-scraper:') and content:
+         name = name.replace('docs-scraper:', '')
      jsonized = to_json(content)
      if jsonized:
          record[name] = jsonized