
We can have a tokenizer anywhere. #2527


Merged

Narsil merged 12 commits into main from omni_tokenizer on Oct 28, 2024

Conversation

Narsil
Collaborator

@Narsil commented on Sep 17, 2024

What does this PR do?

We can check that it works with google/byt5-small, for instance. (It also deadlocks if multiple validation workers are used, since they all fight for the GIL because I've acquired the GIL in each thread.)
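
As a rough illustration of the GIL issue (this is not the PR's actual code; the PyTokenizer wrapper and its encode method are hypothetical), a pyo3-based tokenizer call would only want to hold the GIL for the duration of each encode, rather than for the lifetime of the worker thread:

use pyo3::prelude::*;
use pyo3::types::PyDict;

// Hypothetical wrapper around a Python tokenizer instance (e.g. a
// transformers tokenizer object).
struct PyTokenizer {
    tokenizer: PyObject,
}

impl PyTokenizer {
    fn encode(&self, text: &str) -> PyResult<Vec<u32>> {
        // Acquire the GIL only for this call; holding it for the whole
        // lifetime of each validation worker is what causes the deadlock
        // described above.
        Python::with_gil(|py| {
            let kwargs = PyDict::new(py);
            kwargs.set_item("add_special_tokens", false)?;
            let ids = self
                .tokenizer
                .call_method(py, "encode", (text,), Some(kwargs))?;
            ids.extract::<Vec<u32>>(py)
        })
    }
}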

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Contributor

@OlivierDehaene left a comment


Very nice!

let tokens: Vec<SimpleToken> = encoding
    .get_ids()
    .iter()
    .zip(encoding.get_offsets())
Contributor


This zip will fail for Python.

Collaborator Author


I "fixed" it by ignoring offsets if not present.

Python non rust will definitely exist in forms where offsets are not available (ByT5) so I'll assume we cannot have them.
In that case we'll send the raw ids and totally ignore the "text" part (since it doesn't have a good definition without offsets in the original query)
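
To sketch what that fallback could look like (the SimpleToken fields and the helper name here are assumptions, not the exact code in this PR): when the encoding has no offsets, emit only the ids, with empty text and zero offsets.

// Assumed token shape; the real SimpleToken in the PR may differ.
struct SimpleToken {
    id: u32,
    text: String,
    start: usize,
    stop: usize,
}

// Hypothetical helper: offsets are optional because slow/Python tokenizers
// (ByT5, for instance) may not provide them.
fn simple_tokens(ids: &[u32], offsets: Option<&[(usize, usize)]>, input: &str) -> Vec<SimpleToken> {
    ids.iter()
        .enumerate()
        .map(|(i, &id)| {
            // With offsets, slice the original query; without them, "text"
            // has no good definition, so leave it empty.
            let (start, stop) = offsets.map(|o| o[i]).unwrap_or((0, 0));
            let text = offsets
                .map(|_| input[start..stop].to_string())
                .unwrap_or_default();
            SimpleToken { id, text, start, stop }
        })
        .collect()
}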

let tokens: Vec<SimpleToken> = encoding
    .get_ids()
    .iter()
    .zip(encoding.get_offsets())
Contributor


Same.

OlivierDehaene previously approved these changes on Sep 17, 2024
Comment on lines +199 to +202
try:
    self.embed_tokens = TensorParallelEmbedding(f"{prefix}.embeddings", weights)
except RuntimeError:
    self.embed_tokens = TensorParallelEmbedding(f"{prefix}.embedding", weights)
Contributor


Confused by this.
Why should it work the second time?

Collaborator Author


No "s": the second attempt uses "embedding" rather than "embeddings". There's a difference between mamba-hf and the original mamba (ssm) checkpoints here.

@Narsil requested a review from OlivierDehaene on October 25, 2024 08:57
@Narsil merged commit 90b226d into main on Oct 28, 2024
11 of 12 checks passed
@Narsil deleted the omni_tokenizer branch on October 28, 2024 04:00