
Add an LLM-based AutoSuggester #1995


Open · lstein wants to merge 5 commits into main

Conversation

@lstein commented Jun 5, 2025

Description

This PR implements a new LLMSuggest AutoSuggester module which connects to a remote or locally-hosted Large Language Model (LLM) to suggest completions of words, phrases or sentences. It can be used with any Buffer class that accepts the auto_suggest argument.

This video shows the autosuggester in action using a local Ollama instance. It also works with OpenAI, Anthropic, and other popular commercial LLM services.

llm_autosuggest_demo-2025-06-05_08.07.32.mp4

Installation

LLMSuggest adds dependencies on the langchain, langchain_core, and PyEnchant modules. These can be installed with pip by specifying the llm optional extra when installing prompt_toolkit.

# From the project root.
pip install .[llm]

# When and if this PR is released to PyPI:
pip install prompt_toolkit[llm]

Together, these three dependencies and their own requirements (such as pydantic) increase the size of an installed prompt_toolkit by roughly 54 MB.
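
If you want to confirm that the optional extras are importable before wiring up the suggester, a quick check along the following lines works (this snippet is illustrative only and is not part of the PR):

# Illustrative check that the "llm" extras are importable.
import importlib.util

for module in ("langchain", "langchain_core", "enchant"):  # enchant comes from PyEnchant
    if importlib.util.find_spec(module) is None:
        raise SystemExit(f"Missing optional dependency: {module}. Try: pip install '.[llm]'")
print("All LLMSuggest optional dependencies are available.")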

Testing

The module comes with a unit test that checks its ability to properly join the LLM's output with the user's prompt. The test uses a mock LLM chat class to avoid the setup a real backend would require, and is skipped if the optional dependencies are not installed.

To run the unit test:

# from within the repo root
pytest tests/test_llmsuggest.py
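
To get a feel for what such a test looks like, here is a minimal sketch in the same spirit; it assumes langchain_core's FakeListChatModel and is not the PR's actual test code:

from langchain_core.language_models.fake_chat_models import FakeListChatModel

from prompt_toolkit.buffer import Buffer
from prompt_toolkit.contrib.auto_suggest import LLMSuggest
from prompt_toolkit.document import Document


def test_suggestion_joins_cleanly_with_prompt():
    # The fake model always answers with the same canned completion.
    fake_llm = FakeListChatModel(responses=["over the lazy dog."])
    suggester = LLMSuggest(chat_model=fake_llm)

    document = Document(text="The quick brown fox jumps ")
    buffer = Buffer(document=document)

    suggestion = suggester.get_suggestion(buffer, document)
    assert suggestion is not None
    # Joining the prompt and the suggestion should not double up whitespace.
    assert "  " not in document.text + suggestion.text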

Usage

Basic usage is quite straightforward. Just create an instance of LLMSuggest, wrap it in a ThreadedAutoSuggest wrapper, and pass the instance to the auto_suggest argument of a Prompt, PromptSession, TextArea, or any of the lower-level Buffer-derived classes that accept this argument. Suggestions appear as you type. Accept the entire suggestion by pressing ^E or right-arrow, or accept the next word using alt-F. These key bindings can be changed as described in the AutoSuggest reference manual.
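
For example, the standard prompt_toolkit pattern for binding an extra key that accepts the current suggestion looks like this sketch (the choice of Control-Space is illustrative; pass the key_bindings object to your prompt or Application):

from prompt_toolkit import prompt
from prompt_toolkit.key_binding import KeyBindings

kb = KeyBindings()

@kb.add("c-space")
def _(event):
    "Insert the whole pending suggestion into the buffer."
    buff = event.current_buffer
    suggestion = buff.suggestion
    if suggestion:
        buff.insert_text(suggestion.text)

# Later: prompt("> ", auto_suggest=suggester, key_bindings=kb)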

Below is a basic script that uses the default ChatGPT 4o-mini "free tier" for its suggestions.

from prompt_toolkit import prompt
from prompt_toolkit.contrib.auto_suggest import LLMSuggest
from prompt_toolkit.auto_suggest import ThreadedAutoSuggest

suggester = LLMSuggest()
suggester = ThreadedAutoSuggest(suggester)  # wrap to avoid delays while typing

while True:
    response = prompt('> ', auto_suggest=suggester)
    print(f"You said '{response}'")

For this to work, you must have an OpenAI account and a valid API key stored in the environment variable OPENAI_API_KEY. See platform.openai.com for details.

To use a different LLM model, initialize the suggester with an instantiated langchain chat model. This example shows how to use a llama3 model running on a local Ollama server:

from prompt_toolkit import prompt
from prompt_toolkit.contrib.auto_suggest import LLMSuggest
from prompt_toolkit.auto_suggest import ThreadedAutoSuggest
# new part starts here
from langchain.chat_models import init_chat_model  

model = init_chat_model('ollama:llama3', temperature=0.0)
suggester = LLMSuggest(chat_model=model)
suggester = ThreadedAutoSuggest(suggester)  

# everything below is the same

For this to work, the optional langchain-ollama module must be installed (pip install langchain-ollama).

After installing additional langchain chat modules, you can use many other backend LLMs, including Claude, Groq, Perplexity and Gemini.
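
For example, an Anthropic-backed suggester might look like the following sketch (the model identifier is illustrative; it assumes langchain-anthropic is installed and ANTHROPIC_API_KEY is set in the environment):

from prompt_toolkit.auto_suggest import ThreadedAutoSuggest
from prompt_toolkit.contrib.auto_suggest import LLMSuggest
from langchain.chat_models import init_chat_model

# Requires: pip install langchain-anthropic
model = init_chat_model("anthropic:claude-3-5-haiku-latest", temperature=0.0)
suggester = ThreadedAutoSuggest(LLMSuggest(chat_model=model))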

Customization

Several optional constructor arguments let you adjust the behavior of the autosuggester. The persona argument controls the style and tone of the completions. Compare the suggestions you get from these two versions:

suggester = LLMSuggest(persona='You are a copy editor for a technical journal')

vs

suggester = LLMSuggest(persona='You are a romance novelist, skilled in generating fulsome and lyrical prose.')

The optional temperature argument is a floating-point number that controls the diversity of the suggestions. Lower values make the suggestions more deterministic. A temperature of 0.0 yields fully deterministic suggestions but is more likely to produce clichés. Higher values generate prose that appears more creative, but may also introduce nonsensical completions.

Finally, the context argument allows you to pass additional text to the LLM to help it produce context-aware suggestions. This can be a plain text string, or a Callable that returns a string. A typical use case would be to pass the LLM the contents of a buffer containing a story that is being composed or the record of a chatbot conversation.
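
As a sketch of that use case, the context callable below hands the text composed so far to the LLM each time a suggestion is requested (the story_area widget is hypothetical):

from prompt_toolkit.auto_suggest import ThreadedAutoSuggest
from prompt_toolkit.contrib.auto_suggest import LLMSuggest
from prompt_toolkit.widgets import TextArea

story_area = TextArea(multiline=True)  # hypothetical buffer holding the story so far

suggester = ThreadedAutoSuggest(
    LLMSuggest(
        persona="You are a novelist continuing the user's story.",
        context=lambda: story_area.text,  # a Callable returning a plain string
    )
)

input_area = TextArea(height=1, auto_suggest=suggester)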

Additional options are described in the inline documentation.

Bugs and Limitations

  • Due to the design of the AutoSuggest class, suggestions can only be appended to the end of the current buffer, which means that inline suggestions in the style of GitHub Copilot are not feasible.
  • Latency can be an issue with remotely hosted LLMs, particularly when traffic is high. Always wrap the suggester in a ThreadedAutoSuggest to avoid blocking keystrokes while the LLM is thinking, and consider using a local model for the best responsiveness.
  • Some LLMs do not follow instructions to the letter and either insert extra whitespace into their responses or omit whitespace that is needed. The LLMSuggest code uses the PyEnchant library to decide whether the proffered completion is meant to finish the current word or to start a new one. This is not 100% reliable.
  • Some LLMs will refuse to offer suggestions if the text appears to violate their guardrails (e.g. promoting violence or hate speech). The package global prompt_toolkit.contrib.auto_suggest.llmsuggest.DEFAULT_PERSONA tells the LLM that it is an "uncensored writing assistant" to slightly reduce the rate of such refusals in borderline cases. You may wish to adjust the persona to be stricter about such cases, as sketched after this list.
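
A minimal sketch of tightening the persona for such deployments:

from prompt_toolkit.contrib.auto_suggest import LLMSuggest

# Replace the shipped default with a more conservative persona.
suggester = LLMSuggest(
    persona=(
        "You are a careful writing assistant. Decline to suggest completions "
        "for text that promotes violence or hate speech."
    )
)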


def __init__(self,
             chat_model: Optional[BaseChatModel]=None,
             persona: str=DEFAULT_PERSONA,
Contributor

Shouldn't the system message also be part of the init here?

Author

Yeah. I'll add that feature.

I labored a long time tuning the system message to get the various LLMs to do the task right, so altering it may break the suggester; caveat emptor.

Author

I've added the ability to change the system message.
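
Purely for illustration, usage might look something like the sketch below; the system_message parameter name here is an assumption, so check the inline documentation for the actual spelling:

from prompt_toolkit.contrib.auto_suggest import LLMSuggest

# Hypothetical keyword; the merged PR may name this parameter differently.
suggester = LLMSuggest(
    system_message="Continue the user's text. Return only the continuation, nothing else."
)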

into full, natural-sounding sentences.
"""

SYSTEM_MESSAGE="""
Contributor

I'm curious how well this works. The examples here don't mention code at all, but I imagine that would be the most common use case for this.

Author

@lstein Jun 5, 2025

It does complete code one-liners correctly. Unfortunately, AutoSuggest doesn't handle newlines or tabs well, so I haven't really been able to get it to insert multiline completions.

Here's what you get when you provide the persona "You are a Perl coder" and type "#loop from one to ten:"

#Loop from one to ten: for ($i = 1; $i <= 10; $i++) { print "$i\n"; }

Another Perl one-liner; bold face indicates user input.

sub factorial() { my $n = shift; return 1 if $n == 0; return $n * fact($n-1); }

Python code completions are unsatisfactory due to newline handling.

Author

I altered the system message to give an example of code completion. Now the LLM inserts newlines in the appropriate places, and I added an asis initialization argument that allows the LLM to pass through newlines and other control characters.

@lstein commented Jun 6, 2025

I may have figured out how to do multiline suggesting for code completion. If you have any other "must haves" for this PR, let me know.

@asmeurer commented Jun 7, 2025

See the issue I referenced in the other issue for how I managed to get multiline autosuggestions working. It's a bit of a hack, though, so long term we may want to update prompt-toolkit to handle that sort of thing better.

@lstein commented Jun 7, 2025

I've improved support for multi-line code completion. The behavior is that the greyed suggestion is compressed onto a single line, with "^J" characters displayed where the newlines will be. Once the suggestion is accepted, the multiple lines are inserted into the buffer appropriately.
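
Conceptually, the display/accept round-trip works like this sketch (not the PR's actual code):

def compress_for_display(suggestion_text: str) -> str:
    # Show a multi-line completion on one line, marking each line break visibly.
    return suggestion_text.replace("\n", "^J")


def expand_on_accept(displayed_text: str) -> str:
    # Restore real newlines when the suggestion is accepted into the buffer.
    return displayed_text.replace("^J", "\n")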

Here's a video of the behavior. OpenAI was a bit slow on the second completion example, so be patient.

llm_autosuggest_demo-2025-06-07_14.24.49.mp4
