Add text-ranking pipeline tag #1267

Merged
merged 5 commits into from
Mar 18, 2025

Conversation

tomaarsen
Member

@tomaarsen tomaarsen commented Mar 11, 2025

Hello!

Pull Request overview

  • Add text-ranking pipeline tag
  • Slightly update the docs for sentence-similarity

Details

This PR adds a text-ranking pipeline tag for reranker models like:

E.g.:

from sentence_transformers import CrossEncoder

# 1. Load a pre-trained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

# 2a. Either: predict scores for a pair of sentences
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])
# => array([ 8.607138 , -4.3200774], dtype=float32)

# 2b. Or: rank a list of passages for a query
query = "How many people live in Berlin?"
passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin is well known for its museums.",
    "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.",
    "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.",
    "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019",
    "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.",
    "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
    "In 2015, the total labour force in Berlin was 1.85 million.",
    "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
    "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.",
]
ranks = model.rank(query, passages)

# Print the scores
print("Query:", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")
"""
Query: How many people live in Berlin?
8.92    The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.
8.61    Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
8.24    An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.
7.60    In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.
6.35    In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
5.42    Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.
3.45    In 2015, the total labour force in Berlin was 1.85 million.
0.33    Berlin is subdivided into 12 boroughs or districts (Bezirke).
-4.24   The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019
-4.32   Berlin is well known for its museums.
"""

I haven't created a spec for the API here, as I think that's better left to those who've created other specs. I think we might already have a Sentence Ranking API that we might not want to break.
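
For illustration only, the input and output shapes implied by the example above could be written down roughly as follows. This is a hypothetical sketch, not an agreed spec: the names `TextRankingInput`, `TextRankingOutputElement`, and `validate_output` are invented here, and only the `corpus_id` and `score` keys are taken from the `model.rank` results shown above.

```python
from typing import List, TypedDict


class TextRankingInput(TypedDict):
    """Hypothetical request shape: one query and the passages to rank."""

    query: str
    passages: List[str]


class TextRankingOutputElement(TypedDict):
    """One ranked result, mirroring the keys used in the example above."""

    corpus_id: int  # index into the input passages
    score: float  # relevance score; higher means more relevant


def validate_output(
    inputs: TextRankingInput, outputs: List[TextRankingOutputElement]
) -> bool:
    """Check that every result points at a real passage and that
    results are sorted by descending score."""
    n = len(inputs["passages"])
    ids_ok = all(0 <= o["corpus_id"] < n for o in outputs)
    scores = [o["score"] for o in outputs]
    return ids_ok and scores == sorted(scores, reverse=True)
```

Whatever spec is eventually chosen may look different, for example to stay compatible with an existing Sentence Ranking API.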

This is slightly blocking the next Sentence Transformers release, as I'd like to know whether I can tag CrossEncoder (a.k.a. reranker) models as text-ranking.
Related to this PR: https://github.com/huggingface-internal/moon-landing/pull/12877 (private repo).

  • Tom Aarsen

I forgot to replace these when I locally renamed the tag from `sentence-ranking` to `text-ranking`
@tomaarsen tomaarsen requested a review from Wauplin March 13, 2025 09:09
Contributor

@Wauplin Wauplin left a comment

👍 on my side but let's wait for an approval from @merveenoyan or @pcuenca who are more used to updating the tasks files.

id: "microsoft/ms_marco",
},
],
demo: {
Contributor

Where is the "demo" data used? The inputs don't seem to follow the API you've described in the PR description (with query: str + passages: List[str]).

Member Author

There is no API for text-ranking yet, nor consensus on what the spec should be. Because of this, I adopted the format from https://huggingface.co/tasks/sentence-similarity

I was under the impression that this was only used to format this box:
image

I can definitely remove the demo section if you prefer.

Contributor

having the one you provided makes sense imo @tomaarsen

Member

But how will it render?

Member Author

No idea, I can't get moon-landing to work anymore, not with the "Easy mode" nor with the default one. Not with Windows, not with WSL, not with Docker, not with local builds. I'll try again some other time.

Member

we can make sure it works after merging (when updating the dependency in moon-landing), no worries 👍

Member Author

Sounds good, thanks

Contributor

@merveenoyan merveenoyan left a comment

thanks a lot (also for adding task page!)

-4.24 The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019
-4.32 Berlin is well known for its museums.
"""
```
Member

I suppose this will be faster, right?

Member Author

You mean the model.rank vs model.predict? No, the model.rank is just a more convenient interface to the model. They're equally fast.
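
To make that relationship concrete, here is a minimal sketch of how a `rank`-style helper can be layered on top of a `predict`-style scorer. It is not the Sentence Transformers implementation: `toy_predict` is a hypothetical stand-in scorer (word overlap) so the snippet runs without downloading a model, and `rank_with_predict` is an invented name. The point is only that ranking wraps the same scoring call with pair construction and a sort.

```python
from typing import Callable, Dict, List, Sequence, Tuple, Union

Pair = Tuple[str, str]
Result = Dict[str, Union[int, float]]


def toy_predict(pairs: Sequence[Pair]) -> List[float]:
    """Hypothetical stand-in for a cross-encoder: the score is the
    number of lowercase words shared by the query and the passage."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]


def rank_with_predict(
    predict: Callable[[Sequence[Pair]], Sequence[float]],
    query: str,
    passages: Sequence[str],
) -> List[Result]:
    """Score every (query, passage) pair once, then sort by descending score."""
    pairs = [(query, passage) for passage in passages]
    scores = predict(pairs)  # the only model call; the rest is bookkeeping
    results: List[Result] = [
        {"corpus_id": i, "score": float(s)} for i, s in enumerate(scores)
    ]
    return sorted(results, key=lambda r: r["score"], reverse=True)


query = "how many people live in berlin"
passages = [
    "berlin is known for museums",
    "how many people live in paris",
    "people live in berlin",
]
ranked = rank_with_predict(toy_predict, query, passages)
for r in ranked:
    print(f"{r['score']:.2f}\t{passages[r['corpus_id']]}")
```

Under these assumptions, the cost is dominated by the single scoring call over all pairs, which is the same work a direct `predict` call would do.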

@tomaarsen tomaarsen merged commit 1540c48 into huggingface:main Mar 18, 2025
4 checks passed
5 participants