
Commit 2ec8fbd

Merge pull request #12 from meilisearch/github-action

GitHub action

2 parents e1e6746 + a0c4f4c commit 2ec8fbd

File tree

5 files changed: +65 -83 lines changed


.github/workflows/test-lint.yml

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+name: Test and Lint
+
+on: [pull_request]
+
+jobs:
+  test:
+    runs-on: ubuntu-18.04
+
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python 3.7
+      uses: actions/setup-python@v1
+      with:
+        python-version: 3.7
+    - name: Install pipenv
+      uses: dschep/install-pipenv-action@v1
+    - name: Install dependencies
+      run: pipenv install --dev
+    - name: Linter with pylint
+      run: pipenv run pylint scraper
+    - name: Run tests
+      run: pipenv run pytest ./scraper/src -k "not _browser"

CHANGELOG.md

Lines changed: 0 additions & 4 deletions
This file was deleted.

CONTRIBUTING.md

Lines changed: 0 additions & 77 deletions
This file was deleted.

README.md

Lines changed: 41 additions & 0 deletions
@@ -9,6 +9,8 @@ A scraper for your documentation website that indexes the scraped content into a
 - [About the API Key](#about-the-api-key)
 - [Configuration file](#configuration-file)
 - [And for the search bar?](#and-for-the-search-bar)
+- [Authentication](#authentication)
+- [Installing Chrome Headless](#installing-chrome-headless)
 - [Development Workflow](#development-workflow)
 - [Credits](#credits)

@@ -19,6 +21,8 @@ This project supports Python 3.6+.

 ### From Source Code

+The [`pipenv` command](https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv) must be installed.
+
 Set both environment variables `MEILISEARCH_HOST_URL` and `MEILISEARCH_API_KEY`.

 Then, run:
@@ -111,10 +115,37 @@ After having crawled your documentation, you might need a search bar to improve

 For the front part, check out the [docs-searchbar.js repository](https://github.com/meilisearch/docs-searchbar.js), which provides a front-end search bar adapted for documentation.

+## Authentication
+
+__WARNING:__ Please be aware that the scraper will send authentication headers to every scraped site, so use `allowed_domains` to adjust the scope accordingly!
+
+### Basic HTTP <!-- omit in TOC -->
+
+Basic HTTP authentication is supported by setting these environment variables:
+- `DOCS_SCRAPER_BASICAUTH_USERNAME`
+- `DOCS_SCRAPER_BASICAUTH_PASSWORD`
+
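As an illustration, a Basic `Authorization` header can be assembled from these two variables with the standard library alone. This is a minimal sketch, not the scraper's actual implementation, and the function name is hypothetical:

```python
import base64
import os

# Hypothetical helper (not the scraper's code): build a Basic HTTP
# Authorization header from the two environment variables above.
def basic_auth_header():
    username = os.environ.get("DOCS_SCRAPER_BASICAUTH_USERNAME", "")
    password = os.environ.get("DOCS_SCRAPER_BASICAUTH_PASSWORD", "")
    # Basic auth is "user:password" base64-encoded.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}
```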
+### Cloudflare Access: Identity and Access Management <!-- omit in TOC -->
+
+If you happen to scrape sites protected by Cloudflare Access, you have to set the appropriate HTTP headers.
+
+Values for these headers are taken from the environment variables `CF_ACCESS_CLIENT_ID` and `CF_ACCESS_CLIENT_SECRET`.
+
+In the case of Google Cloud Identity-Aware Proxy, please specify these environment variables:
+- `IAP_AUTH_CLIENT_ID`: the [client ID of the application](https://console.cloud.google.com/apis/credentials) you are connecting to
+- `IAP_AUTH_SERVICE_ACCOUNT_JSON`: generated in [Actions](https://console.cloud.google.com/iam-admin/serviceaccounts) -> Create key -> JSON
+
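For the Cloudflare case, the header names that service tokens use are `CF-Access-Client-Id` and `CF-Access-Client-Secret`; a sketch of building them from the environment variables above (the function name is hypothetical, not part of the scraper):

```python
import os

# Sketch only: map the CF_ACCESS_* environment variables onto the
# request headers Cloudflare Access service tokens expect.
def cloudflare_access_headers():
    return {
        "CF-Access-Client-Id": os.environ["CF_ACCESS_CLIENT_ID"],
        "CF-Access-Client-Secret": os.environ["CF_ACCESS_CLIENT_SECRET"],
    }
```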
+## Installing Chrome Headless
+
+Websites that need JavaScript for rendering are passed through ChromeDriver.<br>
+[Download the version](http://chromedriver.chromium.org/downloads) suited to your OS and then set the environment variable `CHROMEDRIVER_PATH`.
+
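A hypothetical helper (not part of the scraper) showing one way a tool could resolve the driver binary, preferring `CHROMEDRIVER_PATH` and falling back to whatever `chromedriver` is on the `PATH`:

```python
import os
import shutil

# Hypothetical: return the ChromeDriver binary path, preferring the
# CHROMEDRIVER_PATH environment variable; fall back to the PATH.
def resolve_chromedriver():
    path = os.environ.get("CHROMEDRIVER_PATH")
    if path and os.path.isfile(path):
        return path
    return shutil.which("chromedriver")
```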
 ## Development Workflow

 ### Install and Launch <!-- omit in TOC -->

+The [`pipenv` command](https://pipenv.readthedocs.io/en/latest/install/#installing-pipenv) must be installed.
+
 Set both environment variables `MEILISEARCH_HOST_URL` and `MEILISEARCH_API_KEY`.

 Then, run:
@@ -123,6 +154,16 @@ $ pipenv install
 $ pipenv run ./docs_scraper run <path-to-your-config-file>
 ```

+### Linter and Tests <!-- omit in TOC -->
+
+```bash
+$ pipenv install --dev
+# Linter
+$ pipenv run pylint scraper
+# Tests
+$ pipenv run pytest ./scraper/src -k "not _browser"
+```
+
 ### Release <!-- omit in TOC -->

 Once the changes are merged on `master`, in your terminal, you must be on the `master` branch and push a new tag with the right version:

scraper/src/strategies/default_strategy.py

Lines changed: 2 additions & 2 deletions
@@ -163,8 +163,8 @@ def get_records_from_dom(self, current_page_url=None):
         for meta_node in self.select('//meta'):
             name = meta_node.get('name')
             content = meta_node.get('content')
-            if name and name.startswith('docsearch:') and content:
-                name = name.replace('docsearch:', '')
+            if name and name.startswith('docs-scraper:') and content:
+                name = name.replace('docs-scraper:', '')
             jsonized = to_json(content)
             if jsonized:
                 record[name] = jsonized
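The rename means site authors tag pages with `docs-scraper:`-prefixed `<meta>` tags instead of `docsearch:`. A standard-library sketch of the same collection logic (the real code selects nodes via `self.select('//meta')` and uses a `to_json` helper; `json.loads` stands in for the latter here):

```python
import json
from html.parser import HTMLParser

# Sketch of the logic in the diff above: collect
# <meta name="docs-scraper:..."> tags into a record, stripping the
# prefix. json.loads stands in for the scraper's to_json helper.
class MetaCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        content = attrs.get("content")
        if name and name.startswith("docs-scraper:") and content:
            key = name.replace("docs-scraper:", "")
            try:
                jsonized = json.loads(content)
            except ValueError:
                return
            if jsonized:
                self.record[key] = jsonized
```

Feeding `'<meta name="docs-scraper:version" content="1.2">'` to the parser stores the decoded value under the stripped key `version`.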

0 commit comments