# Docs Scraper <!-- omit in TOC -->

A scraper for your documentation website that indexes the scraped content into a MeiliSearch instance.

- [Installation and Usage](#installation-and-usage)
  - [From Source Code](#from-source-code)
  - [With Docker](#with-docker)
  - [In a GitHub Action](#in-a-github-action)
  - [About the API Key](#about-the-api-key)
- [Configuration file](#configuration-file)
- [And for the search bar?](#and-for-the-search-bar)
- [Development Workflow](#development-workflow)
- [Credits](#credits)

## Installation and Usage

This project supports Python 3.6+.

### From Source Code

Set both environment variables `MEILISEARCH_HOST_URL` and `MEILISEARCH_API_KEY`.
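
For example, in a Unix shell (the values below are placeholders for your own instance):

```bash
# Both variables are read by the scraper at runtime.
export MEILISEARCH_HOST_URL='http://localhost:7700'
export MEILISEARCH_API_KEY='<your-meilisearch-api-key>'
```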

Then, run:
```bash
$ pipenv install
$ pipenv run ./docs_scraper run <path-to-your-config-file>
```

### With Docker

```bash
$ docker run -t --rm \
    -e MEILISEARCH_HOST_URL=<your-meilisearch-host-url> \
    -e MEILISEARCH_API_KEY=<your-meilisearch-api-key> \
    -v <absolute-path-to-your-config-file>:/docs-scraper/config.json \
    getmeili/docs-scraper:v0.9.0 pipenv run ./docs_scraper config.json
```
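
Once the container exits successfully, you can check that documents were indexed. A minimal sketch, assuming the `docs` index uid from the sample configuration file shown below and a v0.x MeiliSearch instance (the authentication header name differs in newer versions):

```bash
# Search the freshly built index; a non-empty "hits" array means the scrape worked.
curl \
  -H "X-Meili-API-Key: $MEILISEARCH_API_KEY" \
  "$MEILISEARCH_HOST_URL/indexes/docs/search?q=documentation"
```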

### In a GitHub Action

To run after your deployment job:

```yml
run-scraper:
  needs: <your-deployment-job>
  runs-on: ubuntu-18.04
  steps:
    - uses: actions/checkout@master
    - name: Run scraper
      env:
        HOST_URL: ${{ secrets.MEILISEARCH_HOST_URL }}
        API_KEY: ${{ secrets.MEILISEARCH_API_KEY }}
        CONFIG_FILE_PATH: <path-to-your-config-file>
      run: |
        docker run -t --rm \
          -e MEILISEARCH_HOST_URL=$HOST_URL \
          -e MEILISEARCH_API_KEY=$API_KEY \
          -v $CONFIG_FILE_PATH:/docs-scraper/config.json \
          getmeili/docs-scraper:v0.9.0 pipenv run ./docs_scraper config.json
```

Here is the [GitHub Action file](https://github.com/meilisearch/documentation/blob/master/.github/workflows/gh-pages-scraping.yml) we use in production for the MeiliSearch documentation.
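
The two secrets referenced above must exist in your repository. You can add them in the repository settings, or, as a sketch, with the GitHub CLI:

```bash
# The secret names must match the ones used in the workflow above.
gh secret set MEILISEARCH_HOST_URL --body "<your-meilisearch-host-url>"
gh secret set MEILISEARCH_API_KEY --body "<your-meilisearch-api-key>"
```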

### About the API Key

The API key you provide as an environment variable must have permission to add documents to your MeiliSearch instance.

Thus, you need to provide the private key or the master key.
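
A minimal sketch for retrieving those keys, assuming a v0.x MeiliSearch instance where the `/keys` route is readable with the master key (the route and header names may differ in other versions):

```bash
# Lists the keys generated by the instance, including the private key.
curl \
  -H "X-Meili-API-Key: <your-master-key>" \
  "$MEILISEARCH_HOST_URL/keys"
```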

_More about [MeiliSearch authentication](https://docs.meilisearch.com/guides/advanced_guides/authentication.html)._

## Configuration file

A generic configuration file:

```json
{
  "index_uid": "docs",
  "start_urls": ["https://www.example.com/doc/"],
  "sitemap_urls": ["https://www.example.com/sitemap.xml"],
  "stop_urls": [],
  "selectors": {
    "lvl0": {
      "selector": ".docs-lvl0",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": {
      "selector": ".docs-lvl1",
      "global": true,
      "default_value": "Chapter"
    },
    "lvl2": ".docs-content .docs-lvl2",
    "lvl3": ".docs-content .docs-lvl3",
    "lvl4": ".docs-content .docs-lvl4",
    "lvl5": ".docs-content .docs-lvl5",
    "lvl6": ".docs-content .docs-lvl6",
    "text": ".docs-content p, .docs-content li"
  }
}
```

The scraper only indexes the content matched by these selectors, using the `lvl0` to `lvl6` levels to build the hierarchy of each record.
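
Before launching a crawl, it can be worth checking that your configuration file is valid JSON, for example:

```bash
# Prints the parsed configuration, or an error message if the JSON is malformed.
python -m json.tool <path-to-your-config-file>
```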

Here is the [configuration file](https://github.com/meilisearch/documentation/blob/master/.vuepress/scraper/config.json) we use for the MeiliSearch documentation.

## And for the search bar?

After scraping your documentation, you might want a search bar to improve your user experience!

For the front part, check out the [docs-searchbar.js repository](https://github.com/meilisearch/docs-searchbar.js), which provides a front-end search bar adapted for documentation.

## Development Workflow

### Install and Launch <!-- omit in TOC -->

Set both environment variables `MEILISEARCH_HOST_URL` and `MEILISEARCH_API_KEY`.

Then, run:
```bash
$ pipenv install
$ pipenv run ./docs_scraper run <path-to-your-config-file>
```

### Release <!-- omit in TOC -->

Once the changes have been merged into `master`, check out the `master` branch locally and push a new tag with the right version:

```bash
$ git checkout master
$ git pull origin master
$ git tag vX.X.X
$ git push --tag origin master
```

A GitHub Action will be triggered and will push the `latest` and `vX.X.X` versions of the Docker image to [DockerHub](https://hub.docker.com/repository/docker/getmeili/docs-scraper).
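
Once the workflow has finished, you can confirm that the new image is available, for example:

```bash
# Pull the freshly published tag from DockerHub.
docker pull getmeili/docs-scraper:vX.X.X
```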

## Credits