Update embedders settings, hybrid search, and add tests for AI search methods #1087

Merged
@Strift merged 27 commits into main from feat/add-embedders-settings on May 15, 2025

Conversation

Contributor

@Strift Strift commented Mar 8, 2025

Pull Request

Related issue

Requires #1086

Fixes #1081

What does this PR do?

Update settings to handle embedders

Docs: https://www.meilisearch.com/docs/reference/api/settings#embedders

Add embedders setting: Update methods getEmbedders, updateEmbedders, resetEmbedders. Also, the method updateSettings should be able to accept the new embedders field. Here is the list of the acceptable sub fields:

  • source sub field is available and accepts: ollama, rest, openAi, huggingFace and userProvided
  • apiKey sub field is available (string) - optional because not compatible with all sources. Only for openAi, ollama, rest.
  • model sub field is available (string) - optional because not compatible with all sources. Only for ollama, openAi, huggingFace
  • documentTemplate sub field is available (string) - optional
  • dimensions - optional because not compatible with all sources. Only for openAi, huggingFace, ollama, and rest
  • distribution - optional
  • request - mandatory only if using rest embedder
  • response - mandatory only if using rest embedder
  • documentTemplateMaxBytes - optional
  • revision - optional, only for huggingFace
  • headers - optional, only for rest
  • binaryQuantized - optional
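
The sub-field rules in this list can be expressed as a small pre-flight check. This is only a sketch: `check_embedder` and the rule tables are hypothetical helpers that are not part of meilisearch-python, and they encode only the constraints stated in this description.

```python
# Hypothetical helper: encodes only the constraints listed in the PR description.
ALLOWED_SOURCES = {"openAi", "huggingFace", "ollama", "rest", "userProvided"}

# Sub-fields that the description restricts to specific sources.
RESTRICTED_FIELDS = {
    "apiKey": {"openAi", "ollama", "rest"},
    "model": {"ollama", "openAi", "huggingFace"},
    "revision": {"huggingFace"},
    "headers": {"rest"},
}

# Sub-fields that are mandatory for a given source.
REQUIRED_FIELDS = {"rest": {"request", "response"}}


def check_embedder(config: dict) -> list:
    """Return a list of problems; an empty list means the config looks fine."""
    source = config.get("source")
    if source not in ALLOWED_SOURCES:
        return [f"unknown source: {source!r}"]

    problems = []
    # Flag fields used with a source that does not accept them.
    for field, sources in RESTRICTED_FIELDS.items():
        if field in config and source not in sources:
            problems.append(f"{field!r} is not supported by source {source!r}")
    # Flag mandatory fields that are missing for this source.
    for field in sorted(REQUIRED_FIELDS.get(source, set())):
        if field not in config:
            problems.append(f"{field!r} is required by source {source!r}")
    return problems
```

A check like this could run before calling the settings update, so that a misspelled source such as `openAI` is caught locally rather than by the server.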

Review comment:

  • Add a test to check if the format of each embedder type has the fields you need them to have

Update search to handle vector search and hybrid search

Docs: https://www.meilisearch.com/docs/reference/api/search

Update search method:

  • hybrid search parameter, with sub fields semanticRatio and embedder. embedder is mandatory if hybrid is set.
  • vector parameter is available
  • retrieveVectors parameter available
  • semanticHitCount in search response
  • Accept _semanticScore in the search response (optional)
  • _vectors are returned in the hit objects, when retrieveVectors is true
  • _vectors NOT present in the search response
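
As a rough illustration of the request side of this list, the search options could be assembled as below. `build_search_params` is a hypothetical helper, not a function of the client; the 0.0 to 1.0 range check on `semanticRatio` follows the Meilisearch search documentation linked above.

```python
from typing import List, Optional


def build_search_params(
    semantic_ratio: Optional[float] = None,
    embedder: Optional[str] = None,
    vector: Optional[List[float]] = None,
    retrieve_vectors: bool = False,
) -> dict:
    """Assemble the optional hybrid/vector search parameters (hypothetical helper)."""
    params: dict = {}
    if semantic_ratio is not None or embedder is not None:
        # The description states that embedder is mandatory once hybrid is set.
        if embedder is None:
            raise ValueError("hybrid.embedder is mandatory when hybrid is set")
        hybrid: dict = {"embedder": embedder}
        if semantic_ratio is not None:
            if not 0.0 <= semantic_ratio <= 1.0:
                raise ValueError("semanticRatio must be between 0.0 and 1.0")
            hybrid["semanticRatio"] = semantic_ratio
        params["hybrid"] = hybrid
    if vector is not None:
        params["vector"] = vector
    if retrieve_vectors:
        # When true, hits are expected to carry a _vectors field.
        params["retrieveVectors"] = True
    return params
```

With the Python client, the resulting dictionary would typically be passed as the second argument of `index.search(query, params)`.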

Add similar documents endpoint

Docs: https://www.meilisearch.com/docs/reference/api/similar

  • Implement searchSimilarDocuments associated with the POST /indexes/:uid/similar. Do NOT implement with GET.
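
For orientation, the request that such a method would send might look like the sketch below. `build_similar_request` is a hypothetical helper; the body fields `id`, `embedder` and `limit` are taken from the similar-documents API reference linked above, and the exact method name in the Python client is not assumed here.

```python
from typing import Optional, Tuple, Union


def build_similar_request(
    index_uid: str,
    document_id: Union[str, int],
    embedder: str,
    limit: Optional[int] = None,
) -> Tuple[str, str, dict]:
    """Return (HTTP method, path, JSON body) for a similar-documents call."""
    body: dict = {"id": document_id, "embedder": embedder}
    if limit is not None:
        body["limit"] = limit
    # POST only, as requested in the description (no GET variant).
    return "POST", f"/indexes/{index_uid}/similar", body
```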

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Summary by CodeRabbit

  • New Features

    • Added support for configuring and managing various embedder types, enabling advanced vector and hybrid search capabilities.
    • Introduced hybrid search functionality that combines keyword and semantic search, with adjustable balance between the two.
  • Documentation

    • Updated documentation to include a new section on hybrid search, detailing its usage and configuration options.
  • Tests

    • Added comprehensive tests for vector and hybrid search, as well as for validating the format and configuration of different embedder types.

@Strift Strift marked this pull request as draft March 8, 2025 09:36
@Strift Strift marked this pull request as ready for review March 9, 2025 05:58
@Strift Strift changed the title from "Feat/add embedders settings" to "Update embedders settings, hybrid search, and add tests for AI search methods" Mar 9, 2025
Comment on lines 1956 to 1976
Supported embedder sources:
- 'openAi': OpenAI embedder
- 'huggingFace': HuggingFace embedder
- 'ollama': Ollama embedder
- 'rest': REST API embedder
- 'userProvided': User-provided embedder

Required fields depend on the embedder source:
- 'rest' requires 'request' and 'response' fields
- 'userProvided' requires 'dimensions' field

Optional fields (availability depends on source):
- 'url': The URL Meilisearch contacts when querying the embedder
- 'apiKey': Authentication token for the embedder
- 'model': The model used for generating vectors
- 'documentTemplate': Template defining the data sent to the embedder
- 'documentTemplateMaxBytes': Maximum size of rendered document template
- 'dimensions': Number of dimensions in the chosen model
- 'revision': Model revision hash (only for 'huggingFace')
- 'distribution': Object with 'mean' and 'sigma' fields
- 'binaryQuantized': Boolean to convert vector dimensions to 1-bit values
Member

I wonder whether these docs follow any pattern for describing the fields, and whether this is the correct definition. Can you confirm?

Contributor Author

It's AI-generated based on my input (the meilisearch docs). It does not follow any particular conventions.

I think it might be better to have less information here, and let the user refer to the documentation. I will remove it.


if body:
    for _, v in body.items():
        if "documentTemplateMaxBytes" in v and v["documentTemplateMaxBytes"] is None:
Member

Is this handling done by Meili now?

Contributor Author

Removing it did not trigger any test failure but it might simply be untested, so I added it back to avoid any unwanted side effects
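
The snippet under discussion is cut off before the body of the inner `if`, so what the client actually does there is not shown. One plausible reading, offered purely as an assumption, is that a null `documentTemplateMaxBytes` is dropped before the settings are sent. A sketch of that idea, using a hypothetical helper name:

```python
import copy
from typing import Optional


def strip_null_template_max_bytes(body: Optional[dict]) -> Optional[dict]:
    """Drop documentTemplateMaxBytes entries set to None (assumed behaviour)."""
    if not body:
        return body
    # Work on a copy so the caller's dictionary is left untouched.
    cleaned = copy.deepcopy(body)
    for config in cleaned.values():
        if "documentTemplateMaxBytes" in config and config["documentTemplateMaxBytes"] is None:
            del config["documentTemplateMaxBytes"]
    return cleaned
```

A unit test for exactly this case would also address the concern raised here that the branch might simply be untested.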

assert "default" in response["hits"][0]["_vectors"]


def test_get_similar_documents_with_identical_vectors(empty_index):
Member

If I understood it correctly, you're manually creating the vector for that given document, so you don't need to define any model in your test instance before, right?

Contributor Author

@Strift Strift Mar 20, 2025

If you're referring to the test_get_similar_documents_with_identical_vectors test, that's correct.

I'm only creating an embedder so Meilisearch knows which embedder to use to compute the vector similarity:

# Configure the embedder
settings_update_task = index.update_embedders(
    {
        "default": {
            "source": "userProvided",
            "dimensions": 2,
        }
    }
)
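
To make the setup concrete, here is a sketch of documents that carry their own vectors for a `userProvided` embedder named `default` with 2 dimensions. The document contents and the `vectors_match_dimensions` helper are illustrative assumptions, not code from the test suite.

```python
from typing import List


def vectors_match_dimensions(documents: List[dict], embedder_name: str, dimensions: int) -> bool:
    """Check that every document supplies a vector of the configured length."""
    for doc in documents:
        vector = doc.get("_vectors", {}).get(embedder_name)
        if vector is None or len(vector) != dimensions:
            return False
    return True


# Two documents with identical, manually supplied vectors (illustrative data).
documents = [
    {"id": 1, "title": "First", "_vectors": {"default": [0.5, 0.5]}},
    {"id": 2, "title": "Second", "_vectors": {"default": [0.5, 0.5]}},
]
```

Because the vectors are supplied by hand, no model has to be configured on the test instance; the embedder setting only names the vector space and its size.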

Contributor Author

@Strift Strift left a comment

Ty for the review @brunoocasali, I answered your comments and made the changes

@Strift Strift requested a review from brunoocasali March 21, 2025 08:50
@Strift Strift force-pushed the feat/add-embedders-settings branch from bf4e7cb to 268aa4c Compare March 26, 2025 07:39
Collaborator

@sanders41 sanders41 left a comment

Sorry, I've been busy and only had a chance to take a really quick look at this. Can we test this against an older v1 release of Meilisearch? I'm thinking there may be some changes here that make the package incompatible with older versions, but haven't had time to test myself.

Member

curquiza commented Mar 30, 2025

@Strift I merged the PR of Ellnix first: #1075 and now there are conflicts on your PR

Maybe I shouldn't have merged it
It's either his PR or your PR with conflict anyway...

Sorry again!

@brunoocasali
Member

> I'm thinking there may be some changes here that make the package incompatible with older versions, but haven't had time to test myself.

This is expected @sanders41, since this version introduced the stabilization of the AI capabilities. @curquiza it looks good to me, but I would wait for @sanders41's review before merging :)

@curquiza curquiza requested a review from sanders41 April 3, 2025 17:56
Member

@brunoocasali brunoocasali left a comment

@Strift let's move on with this PR. It looks good enough to me, so let's put it in production and let users guide us with further improvements :)

@Strift Strift merged commit 2b0bd13 into main May 15, 2025
10 checks passed
@Strift Strift deleted the feat/add-embedders-settings branch May 15, 2025 04:42
@Strift Strift added the enhancement New feature or request label May 15, 2025
Contributor

coderabbitai bot commented May 15, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

The changes introduce a new meilisearch.models.embedders module to define structured embedder configuration models, update the index and settings logic to use these models, and improve embedder handling and documentation. Test coverage is expanded for embedders, hybrid/vector search, and similar document retrieval. All embedder-related classes are removed from models/index.py and now reside in the new module.
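
The parse-by-source idea described in this walkthrough can be sketched as follows. The real module is described as Pydantic-based; this stand-in uses standard-library dataclasses so that it runs without dependencies, covers only two sources, and uses class and function names that are assumptions rather than the module's actual API.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class OpenAiEmbedder:
    source: str = "openAi"
    model: Optional[str] = None
    dimensions: Optional[int] = None


@dataclass
class UserProvidedEmbedder:
    source: str = "userProvided"
    dimensions: Optional[int] = None


# Dispatch table: the "source" value selects the model class.
EMBEDDER_BY_SOURCE = {
    "openAi": OpenAiEmbedder,
    "userProvided": UserProvidedEmbedder,
}


def parse_embedders(raw: dict) -> dict:
    """Turn raw embedder settings into typed objects, keyed by embedder name."""
    return {
        name: EMBEDDER_BY_SOURCE[config["source"]](**config)
        for name, config in raw.items()
    }


def serialize_embedders(embedders: dict) -> dict:
    """Convert typed embedders back to plain dicts, omitting unset fields."""
    return {
        name: {key: value for key, value in asdict(embedder).items() if value is not None}
        for name, embedder in embedders.items()
    }
```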

Changes

| File(s) | Change Summary |
| --- | --- |
| README.md | Added a "Hybrid Search" section documenting hybrid search usage, parameters, and example code. |
| meilisearch/models/embedders.py | New module defining Pydantic-based models for embedder configurations (OpenAI, HuggingFace, Ollama, Rest, UserProvided), a distribution class, and a container class for embedders. |
| meilisearch/models/index.py | Removed all embedder classes and the Embedders container (moved to models/embedders.py). Cleaned up imports. |
| meilisearch/index.py | Updated imports to use new embedders module. Improved docstrings for search and settings methods. Enhanced embedder handling in settings methods, including better type annotations and explicit handling of embedder sources. Added serialization for embedder models when updating settings. |
| tests/conftest.py, tests/settings/test_settings.py | Updated imports to source embedder classes from models/embedders.py instead of models/index.py. |
| tests/settings/test_settings_embedders.py | Updated imports. Added comprehensive tests for each embedder type, verifying configuration, required/optional fields, and correct retrieval after update. |
| tests/index/test_index_search_meilisearch.py | Added/extended tests for vector search, hybrid search, retrieval of vectors in results, and similar document search with identical vectors. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Index
    participant EmbeddersModel

    Client->>Index: update_embedders(embedder_dict)
    Index->>EmbeddersModel: Parse embedder_dict by source
    EmbeddersModel-->>Index: Return structured embedder objects
    Index->>Index: Serialize embedders for API
    Index->>MeiliSearch API: PATCH /indexes/:uid/settings/embedders

    Client->>Index: get_embedders()
    Index->>MeiliSearch API: GET /indexes/:uid/settings/embedders
    MeiliSearch API-->>Index: Return embedder configs
    Index->>EmbeddersModel: Parse configs into objects
    EmbeddersModel-->>Index: Return structured embedders
    Index-->>Client: Return embedders object
sequenceDiagram
    participant Client
    participant Index

    Client->>Index: search(query, {hybrid: {semanticRatio, embedder}})
    Index->>MeiliSearch API: POST /indexes/:uid/search (with hybrid params)
    MeiliSearch API-->>Index: Return results with semanticHitCount, etc.
    Index-->>Client: Return search results

Assessment against linked issues

Objective Addressed Explanation
Embedders setting: support all sources and subfields (embedders, methods, fields, serialization) (#1081)
Search: hybrid/vector params, retrieveVectors, semanticHitCount, response fields (#1081)
Similar documents endpoint: implement searchSimilarDocuments via POST (#1081)
Remove _vectors from search response, optional vector field (#1081)

Poem

A bunny hopped through embedders anew,
With OpenAI, HuggingFace, and Ollama too!
Rest and UserProvided, all in a row,
Hybrid searches now smarter, results in tow.
Tests abound for every case,
This codebase leaps with rabbit grace!
🐇✨

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 1603f44 and 44a68a5.

📒 Files selected for processing (8)
  • README.md (1 hunks)
  • meilisearch/index.py (10 hunks)
  • meilisearch/models/embedders.py (1 hunks)
  • meilisearch/models/index.py (1 hunks)
  • tests/conftest.py (1 hunks)
  • tests/index/test_index_search_meilisearch.py (2 hunks)
  • tests/settings/test_settings.py (1 hunks)
  • tests/settings/test_settings_embedders.py (2 hunks)
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

[v1.13] Stabilize AI-powered search
4 participants