server: passkey challenge / self-extend with context shift demo #5832
Merged
Commits (28):
- `73a7e42` server: tests: add models endpoint scenario
- `0f774a8` server: /v1/models add some metadata
- `1780d96` server: tests: add debug field in context before scenario
- `319ded7` server: tests: download model from HF, add batch size
- `18e739d` server: tests: add passkey test
- `ab5b06b` server: logs: do not truncate log values
- `60113da` server: tests: add group attention params
- `616d7e9` server: do not truncate prompt tokens if self-extend through group at…
- `2495f72` server: logs: do not truncate log values
- `af82fb4` server: revert change on slot n_ctx
- `3b8242a` server: tests - missing EOL at EOF
- `ed60b97` server: tests - fix passkey not using pre/suffix
- `cf4c86e` server: tests - passkey - first good working value of nga
- `f8773f7` server: tests - passkey - limit the number of max tokens to predix
- `a80533e` server: tests - passkey - limit the number of max tokens to predix
- `8abf8d3` server: tests: fix server timeout
- `407cc60` server: tests: fix passkey, add doc, fix regex content matching, fix …
- `178b0c6` server: tests: fix regex content matching
- `9ab72d7` server: tests: schedule slow tests on master
- `9fcfa63` server: tests: schedule slow tests on master
- `61b9791` server: metrics: fix when no prompt processed
- `763ae0a` Merge remote-tracking branch 'origin/tests/server/passkey' into tests…
- `830d0ef` server: tests: CI workflow failed on first scenario failed
- `1aa5ad9` server: tests: fix re content
- `c1f66f0` server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1
- `2cdd21e` server: tests: increase timeout for completion
- `a6ea725` server: tests: keep only the PHI-2 test
- `0c7f5b2` server: tests: passkey add a negative test

All commits authored by phymbert.
````diff
@@ -1,47 +1,67 @@
 # Server tests
 
-Python based server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development) and [behave](https://behave.readthedocs.io/en/latest/):
-* [issues.feature](./features/issues.feature) Pending issues scenario
-* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests
-* [security.feature](./features/security.feature) Security, CORS and API Key
-* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc...
+Python based server tests scenario using [BDD](https://en.wikipedia.org/wiki/Behavior-driven_development)
+and [behave](https://behave.readthedocs.io/en/latest/):
+
+* [issues.feature](./features/issues.feature) Pending issues scenario
+* [parallel.feature](./features/parallel.feature) Scenario involving multi slots and concurrent requests
+* [security.feature](./features/security.feature) Security, CORS and API Key
+* [server.feature](./features/server.feature) Server base scenario: completion, embedding, tokenization, etc...
 
 Tests target GitHub workflows job runners with 4 vCPU.
 
-Requests are using [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), [asyncio](https://docs.python.org/fr/3/library/asyncio.html) based http client.
+Requests are
+using [aiohttp](https://docs.aiohttp.org/en/stable/client_reference.html), [asyncio](https://docs.python.org/fr/3/library/asyncio.html)
+based http client.
 
-Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail. To mitigate it, you can increase values in `n_predict`, `kv_size`.
+Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail.
+To mitigate it, you can increase values in `n_predict`, `kv_size`.
 
 ### Install dependencies
 
 `pip install -r requirements.txt`
 
 ### Run tests
 
 1. Build the server
+
 ```shell
 cd ../../..
 mkdir build
 cd build
 cmake ../
 cmake --build . --target server
 ```
-2. download required models:
-   1. `../../../scripts/hf.sh --repo ggml-org/models --file tinyllamas/stories260K.gguf`
-3. Start the test: `./tests.sh`
+
+2. Start the test: `./tests.sh`
 
 It's possible to override some scenario steps values with environment variables:
-- `PORT` -> `context.server_port` to set the listening port of the server during scenario, default: `8080`
-- `LLAMA_SERVER_BIN_PATH` -> to change the server binary path, default: `../../../build/bin/server`
-- `DEBUG` -> "ON" to enable steps and server verbose mode `--verbose`
-- `SERVER_LOG_FORMAT_JSON` -> if set switch server logs to json format
+
+| variable                 | description                                                                                    |
+|--------------------------|------------------------------------------------------------------------------------------------|
+| `PORT`                   | `context.server_port` to set the listening port of the server during scenario, default: `8080` |
+| `LLAMA_SERVER_BIN_PATH`  | to change the server binary path, default: `../../../build/bin/server`                         |
+| `DEBUG`                  | "ON" to enable steps and server verbose mode `--verbose`                                       |
+| `SERVER_LOG_FORMAT_JSON` | if set switch server logs to json format                                                       |
+| `N_GPU_LAYERS`           | number of model layers to offload to VRAM `-ngl --n-gpu-layers`                                |
 
 ### Run @bug, @wip or @wrong_usage annotated scenario
 
 Feature or Scenario must be annotated with `@llama.cpp` to be included in the default scope.
+
 - `@bug` annotation aims to link a scenario with a GitHub issue.
 - `@wrong_usage` are meant to show user issue that are actually an expected behavior
 - `@wip` to focus on a scenario working in progress
+- `@slow` heavy test, disabled by default
 
 To run a scenario annotated with `@bug`, start:
-`DEBUG=ON ./tests.sh --no-skipped --tags bug`
+
+```shell
+DEBUG=ON ./tests.sh --no-skipped --tags bug
+```
 
 After changing logic in `steps.py`, ensure that `@bug` and `@wrong_usage` scenario are updated.
+
+```shell
+./tests.sh --no-skipped --tags bug,wrong_usage || echo "should failed but compile"
+```
````
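The environment overrides in the table above are picked up by the test harness before the server is launched. As a hypothetical illustration only (this is not the actual `steps.py`, which the diff does not show), the defaults the README documents could be read like this:

```python
import os

# Hypothetical sketch of reading the documented overrides with their
# README defaults; names and structure here are illustrative, not the
# real step-definition code.
def read_test_config(env=os.environ):
    return {
        "server_port": int(env.get("PORT", "8080")),
        "server_path": env.get("LLAMA_SERVER_BIN_PATH", "../../../build/bin/server"),
        "debug": env.get("DEBUG", "") == "ON",
        "log_format_json": "SERVER_LOG_FORMAT_JSON" in env,
        "n_gpu_layers": int(env.get("N_GPU_LAYERS", "0")),
    }

# With no overrides set, every value falls back to the README default.
default_config = read_test_config({})
```

The same pattern explains why `SERVER_LOG_FORMAT_JSON` is documented as "if set": its mere presence in the environment flips the flag, regardless of value.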
```diff
@@ -1,4 +1,5 @@
 # List of ongoing issues
+# run with: DEBUG=ON ./tests.sh --no-skipped --tags bug
 @bug
 Feature: Issues
   # No confirmed issue at the moment
```
New feature file (`@@ -0,0 +1,55 @@`):

```gherkin
# run with: ./tests.sh --no-skipped --tags passkey
@passkey
@slow
Feature: Passkey / Self-extend with context shift

  Background: Server startup
    Given a server listening on localhost:8080

  # Generates a long text of junk and inserts a secret passkey number inside it.
  # Then we query the LLM for the secret passkey.
  # see #3856 and #4810
  Scenario Outline: Passkey
    Given a model file <hf_file> from HF repo <hf_repo>
    And <n_batch> as batch size
    And <n_junk> as number of junk
    And <n_predicted> server max tokens to predict
    And 42 as seed
    And <n_ctx> KV cache size
    And 1 slots
    And <n_ga> group attention factor to extend context size through self-extend
    And <n_ga_w> group attention width to extend context size through self-extend
    # Can be override with N_GPU_LAYERS
    And <ngl> GPU offloaded layers
    Then the server is starting
    Then the server is healthy
    Given available models
    Then model 0 is trained on <n_ctx_train> tokens context
    Given a prefix prompt:
    """
    here is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.
    """
    And a passkey prompt template:
    """
    The pass key is <passkey> Remember it. <passkey> is the pass key.
    """
    And a junk suffix prompt:
    """
    The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.
    """
    And a suffix prompt:
    """
    What is the pass key? The pass key is
    """
    Given a "<passkey>" passkey challenge prompt with the passkey inserted every <i_pos> junk
    And a completion request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>

    Examples:
      | hf_repo             | hf_file           | n_ctx_train | ngl | n_ctx | n_batch | n_ga | n_ga_w | n_junk | i_pos | passkey | n_predicted | re_content      |
      | TheBloke/phi-2-GGUF | phi-2.Q4_K_M.gguf | 2048        | 5   | 8192  | 512     | 4    | 512    | 250    | 50    | 42      | 1           | 42              |
      | TheBloke/phi-2-GGUF | phi-2.Q4_K_M.gguf | 2048        | 5   | 8192  | 512     | 2    | 512    | 250    | 50    | 42      | 1           | \b((?!42)\w)+\b |
      #| TheBloke/Llama-2-7B-GGUF | llama-2-7b.Q2_K.gguf | 4096 | 3 | 16384 | 512 | 4 | 512 | 500 | 300 | 1234 | 5 | 1234 |
      #| TheBloke/Mixtral-8x7B-v0.1-GGUF | mixtral-8x7b-v0.1.Q2_K.gguf | 32768 | 2 | 16384 | 512 | 4 | 512 | 500 | 100 | 0987 | 5 | 0
      # 987 |
```
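To make the scenario concrete, here is a hedged Python sketch of how a passkey challenge prompt can plausibly be assembled from the four prompts above. The real construction lives in the Python step definitions, which this diff does not show; `build_passkey_prompt` and the single-insertion behavior are assumptions for illustration, and the regex at the end mirrors the negative `re_content` row (a completion that is anything but the passkey):

```python
import re

# All names below are hypothetical; the actual assembly is done in the
# behave step definitions.
PREFIX = ("here is an important info hidden inside a lot of irrelevant text. "
          "Find it and memorize them. I will quiz you about the important "
          "information there.")
JUNK = ("The grass is green. The sky is blue. The sun is yellow. "
        "Here we go. There and back again.")
PASSKEY_TEMPLATE = "The pass key is <passkey> Remember it. <passkey> is the pass key."
SUFFIX = "What is the pass key? The pass key is"

def build_passkey_prompt(passkey: int, n_junk: int, i_pos: int) -> str:
    """Concatenate prefix + n_junk junk blocks, inserting the passkey
    sentence after the junk block at index i_pos, then the question suffix."""
    parts = [PREFIX]
    for i in range(n_junk):
        parts.append(JUNK)
        if i == i_pos:
            parts.append(PASSKEY_TEMPLATE.replace("<passkey>", str(passkey)))
    parts.append(SUFFIX)
    return " ".join(parts)

prompt = build_passkey_prompt(42, n_junk=250, i_pos=50)

# Negative Examples row: \b((?!42)\w)+\b only matches a word that does
# not start with "42", i.e. a completion that is NOT the passkey.
assert re.fullmatch(r"((?!42)\w)+", "blue")
assert re.fullmatch(r"((?!42)\w)+", "42") is None
```

With `n_junk=250` (the phi-2 rows above), the resulting prompt is far longer than the model's 2048-token training context, which is exactly what the group attention parameters `n_ga` and `n_ga_w` are there to stretch via self-extend.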