GitHub action #12

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged: 4 commits, Apr 8, 2020
22 changes: 22 additions & 0 deletions .github/workflows/test-lint.yml
@@ -0,0 +1,22 @@
name: Test and Lint

on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-18.04

    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.7
        uses: actions/setup-python@v1
        with:
          python-version: 3.7
      - name: Install pipenv
        uses: dschep/install-pipenv-action@v1
      - name: Install dependencies
        run: pipenv install --dev
      - name: Lint with pylint
        run: pipenv run pylint scraper
      - name: Run tests
        run: pipenv run pytest ./scraper/src -k "not _browser"
4 changes: 0 additions & 4 deletions CHANGELOG.md

This file was deleted.

77 changes: 0 additions & 77 deletions CONTRIBUTING.md

This file was deleted.

41 changes: 41 additions & 0 deletions README.md
@@ -9,6 +9,8 @@ A scraper for your documentation website that indexes the scraped content into a
- [About the API Key](#about-the-api-key)
- [Configuration file](#configuration-file)
- [And for the search bar?](#and-for-the-search-bar)
- [Authentication](#authentication)
- [Installing Chrome Headless](#installing-chrome-headless)
- [Development Workflow](#development-workflow)
- [Credits](#credits)

@@ -19,6 +21,8 @@ This project supports Python 3.6+.

### From Source Code

The [`pipenv` command](https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv) must be installed.

Set both environment variables `MEILISEARCH_HOST_URL` and `MEILISEARCH_API_KEY`.
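For example, in a POSIX shell (both values below are placeholders for your own instance and key):

```bash
# Placeholders: point these at your own MeiliSearch instance and API key
export MEILISEARCH_HOST_URL="http://localhost:7700"
export MEILISEARCH_API_KEY="your-api-key"
```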

Then, run:
@@ -111,10 +115,37 @@ After having crawled your documentation, you might need a search bar to improve

For the front-end part, check out the [docs-searchbar.js repository](https://github.com/meilisearch/docs-searchbar.js), which provides a search bar adapted for documentation.

## Authentication

__WARNING:__ Please be aware that the scraper will send authentication headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!

### Basic HTTP <!-- omit in TOC -->

Basic HTTP authentication is supported by setting these environment variables:
- `DOCS_SCRAPER_BASICAUTH_USERNAME`
- `DOCS_SCRAPER_BASICAUTH_PASSWORD`
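A minimal sketch of supplying those credentials, assuming a POSIX shell (both values are placeholders):

```bash
# Placeholder credentials for a site behind Basic HTTP authentication
export DOCS_SCRAPER_BASICAUTH_USERNAME="scraper-user"
export DOCS_SCRAPER_BASICAUTH_PASSWORD="s3cret"
```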

### Cloudflare Access: Identity and Access Management <!-- omit in TOC -->

If you need to scrape sites protected by Cloudflare Access, you must set the appropriate HTTP headers.

The values for these headers are taken from the environment variables `CF_ACCESS_CLIENT_ID` and `CF_ACCESS_CLIENT_SECRET`.
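For example (placeholder values; a real service token comes from your Cloudflare Access dashboard):

```bash
# Placeholder Cloudflare Access service token credentials
export CF_ACCESS_CLIENT_ID="xxxxxxxx.access"
export CF_ACCESS_CLIENT_SECRET="xxxxxxxxxxxxxxxx"
```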

For Google Cloud Identity-Aware Proxy (IAP), set these environment variables (see the sketch below this list):
- `IAP_AUTH_CLIENT_ID`: the [client ID of the application](https://console.cloud.google.com/apis/credentials) you are connecting to
- `IAP_AUTH_SERVICE_ACCOUNT_JSON`: a service account key, generated via [Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccounts) -> Create key -> JSON
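A minimal sketch, assuming you have downloaded a service-account key file (the IDs and path are placeholders):

```bash
# Placeholder values for Google Cloud IAP.
export IAP_AUTH_CLIENT_ID="1234567890-abc.apps.googleusercontent.com"
# Assumption: the variable holds the key file's JSON content;
# it may instead expect a path to the file.
export IAP_AUTH_SERVICE_ACCOUNT_JSON="$(cat /path/to/service-account.json)"
```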

## Installing Chrome Headless

Websites that need JavaScript for rendering are passed through ChromeDriver.<br>
[Download the version](http://chromedriver.chromium.org/downloads) suited to your OS and then set the environment variable `CHROMEDRIVER_PATH`.
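For example, on Linux (the path is a placeholder for wherever you unpacked the binary):

```bash
# Placeholder path; point this at your unpacked chromedriver binary
export CHROMEDRIVER_PATH="/usr/local/bin/chromedriver"
```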

## Development Workflow

### Install and Launch <!-- omit in TOC -->

The [`pipenv` command](https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv) must be installed.

Set both environment variables `MEILISEARCH_HOST_URL` and `MEILISEARCH_API_KEY`.

Then, run:
@@ -123,6 +154,16 @@
$ pipenv install
$ pipenv run ./docs_scraper run <path-to-your-config-file>
```

### Linter and Tests <!-- omit in TOC -->

```bash
$ pipenv install --dev
# Linter
$ pipenv run pylint scraper
# Tests
$ pipenv run pytest ./scraper/src -k "not _browser"
```

### Release <!-- omit in TOC -->

Once the changes are merged on `master`, in your terminal, you must be on the `master` branch and push a new tag with the right version:
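A plausible sequence (the exact commands are truncated in this view), with the version number as a placeholder:

```bash
# Hypothetical release flow; vX.X.X is a placeholder version
git checkout master
git pull origin master
git tag vX.X.X
git push origin vX.X.X
```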
4 changes: 2 additions & 2 deletions scraper/src/strategies/default_strategy.py
@@ -163,8 +163,8 @@ def get_records_from_dom(self, current_page_url=None):
  for meta_node in self.select('//meta'):
      name = meta_node.get('name')
      content = meta_node.get('content')
-     if name and name.startswith('docsearch:') and content:
-         name = name.replace('docsearch:', '')
+     if name and name.startswith('docs-scraper:') and content:
+         name = name.replace('docs-scraper:', '')
      jsonized = to_json(content)
      if jsonized:
          record[name] = jsonized