
Server: Enable setting default sampling parameters via command-line #8402


Merged
HanClinto merged 2 commits into ggml-org:master on Jul 9, 2024

Conversation

HanClinto
Collaborator

In the discussion of #8279, @hopto-dot was (rightfully) confused by the fact that the server CLI will accept a grammar parameter, but then not do anything with it for the individual requests that are served from it. @ngxson helpfully suggested:

llama-server does read the grammar if you specify one. It's just not taken by the slot.

You can patch launch_slot_with_task so that slot takes grammar from server_context.params

I was curious what it would take to patch launch_slot_with_task, and -- as best I can tell -- this one-liner seems to be all that's needed to make it work...?
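
For context, the idea is just to seed each slot's sampling parameters from the server-wide defaults when the slot is launched, before any per-request overrides are applied. A minimal sketch of what that might look like (the member names here are illustrative and may not match server.cpp exactly):

// Inside launch_slot_with_task(), before the per-request JSON is applied:
// start the slot from the server-wide defaults so that any field the request
// omits (grammar, temperature, ...) falls back to the CLI values.
// Member names below are illustrative, not necessarily the exact ones in server.cpp.
slot.sparams = params.sparams; // per-request fields still override these afterwards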

To test:

  1. Compiled server:
    make -j LLAMA_CURL=1

  2. Created a grammar file that doesn't permit the usage of the letter 'e':

root ::= [^eE]*
  3. Ran the server with the model set and a default grammar file specified:

./llama-server -mu https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf --grammar-file "./grammars/no-e.gbnf"

  4. Sent a request to the server via curl that did not set a grammar:
curl \
--request POST \
--url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Please write a Haiku about the sun rising in the morning.","n_predict":128}'

Even though I didn't specify any grammar in the request, the generated output complied with the grammar specified when starting the server, and generated poetry without using that particular vowel:

"content": "\n## INPUT\n\n##OUTPUT\nSunlight's warm cafton,\nAkin to a symphony,\nMorning's symphony.\n",

Note that this still lets any user request override individual parameters -- this only sets the default sampling parameters for the server.

Also note that requests made through the web GUI always set the grammar parameter to whatever is in the grammar box on the site (even when that box is blank), so they override anything set when initializing the server (hence why we needed to test this with curl).

If we care about that, then maybe a later PR could prevent the web GUI from sending parameters when they're still set to the default. For now, I'd love to know if anyone can think of any gotchas with my simplistic approach, or if it's really as simple as it looks here.

@HanClinto
Collaborator Author

Some output from a small testing session; I'm enjoying these e-less haikus. :)

Bright light of day
Sun's warmth on my skin
Rising to start

In this second one, it clearly wanted to use "ascends" in the final line, but had to change after the first token was already chosen.

In morning light,
Sorrow turns to joy,
Sun ascists high.

And I don't know what to call this one. It started off great, and then that last line just comes out of nowhere. :D

Sun slowly awaking
Colors fill up, day is born
A brand-spanking-smooth start

All that to say, it's highly unlikely that the model would have generated all this poetry without 'e' if it wasn't using the grammar -- so I think it's effective. :)

@HanClinto HanClinto changed the title from "Load server sampling parameters from the server context by default." to "Server: Enable setting default sampling parameters via command-line" on Jul 9, 2024

@ngxson
Collaborator

Ha, clever solution. Thanks! It works well on my side:

make llama-server && ./llama-server -m ../models/gemma-2-9b-it-Q4_K_M.gguf --grammar-file ./grammars/json.gbnf

Request body:
{
    "messages": [
        {"role": "user", "content": "hello"}
    ],
    "max_tokens": 20,
    "temperature": 0
}

Response:

"content": "{\n  \"response\": \"Hello! 👋 How can I help you today?\"\n}",

@HanClinto
Collaborator Author

Ha, clever solution. Thanks! It works well on my side:

Glad to hear it! :)

I started looking into what it would take to load the chat interface's default parameters from the server, and there is already an endpoint that serves the parameters -- http://localhost:8080/props

As much as I'd love for this to be a one-line PR, handle_props should probably be updated to grab the new global parameters, rather than serializing the parameters of an uninitialized slot to JSON.
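
A rough sketch of the direction that could take (the handler shape, member names, and the sampling_params_to_json helper below are assumptions for illustration, not the actual server.cpp code):

// Hypothetical sketch: have /props report the server-wide default sampling
// parameters instead of serializing an uninitialized slot.
// ctx_server, params.sparams, and sampling_params_to_json() are assumed names.
const auto handle_props = [&ctx_server](const httplib::Request &, httplib::Response & res) {
    json props = {
        { "total_slots",                 ctx_server.params.n_parallel },
        { "default_generation_settings", sampling_params_to_json(ctx_server.params.sparams) },
    };
    res.set_content(props.dump(), "application/json; charset=utf-8");
};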

@HanClinto
Collaborator Author

HanClinto commented Jul 9, 2024

Ugh. That's too much work to do right now. I think I'm just going to merge this as-is and worry about that edge case later. I strongly suspect that this endpoint is rarely (if ever) used.

@HanClinto HanClinto merged commit a59f8fd into ggml-org:master Jul 9, 2024
53 checks passed

@ngxson
Collaborator

ngxson commented Jul 10, 2024

Sorry, I misread your comment (I thought we were talking about the chat template). Yeah, it seems like /props does not return anything other than some basic server props, so it makes sense to return default sampling params (and maybe model params too, though I'm not sure if that should be the job of /models).

@HanClinto
Collaborator Author

Sorry, I misread your comment (I thought we were talking about the chat template). Yeah, it seems like /props does not return anything other than some basic server props, so it makes sense to return default sampling params (and maybe model params too, though I'm not sure if that should be the job of /models).

Okay, sounds good. I'll see what I can do about updating this. I'm not sure whether anyone or anything else uses this endpoint, but unless I have a reason to change it, I'll try to keep the existing format (with all the same fields) and simply update it to return the default server parameters.

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 11, 2024
Server: Enable setting default sampling parameters via command-line (ggml-org#8402)

* Load server sampling parameters from the server context by default.

* Wordsmithing comment
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024
Server: Enable setting default sampling parameters via command-line (ggml-org#8402)

* Load server sampling parameters from the server context by default.

* Wordsmithing comment