
We can have a tokenizer anywhere. #2527


Merged

Narsil merged 12 commits into main from omni_tokenizer on Oct 28, 2024

Conversation

Narsil
Collaborator

@Narsil commented on Sep 17, 2024

What does this PR do?

We can check that it works with google/byt5-small, for instance. (It also deadlocks if multiple validation workers are used, since they all fight for the GIL because I've acquired the GIL in each thread.)
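
As a rough illustration of the GIL issue (this is not the PR's actual code; the PyTokenizer wrapper and its encode method are hypothetical), a pyo3-based tokenizer call would only want to hold the GIL for the duration of each encode, rather than for the lifetime of the worker thread:

use pyo3::prelude::*;
use pyo3::types::PyDict;

// Hypothetical wrapper around a Python tokenizer instance (e.g. a
// transformers tokenizer object).
struct PyTokenizer {
    tokenizer: PyObject,
}

impl PyTokenizer {
    fn encode(&self, text: &str) -> PyResult<Vec<u32>> {
        // Acquire the GIL only for this call; holding it for the whole
        // lifetime of each validation worker is what causes the deadlock
        // described above.
        Python::with_gil(|py| {
            let kwargs = PyDict::new(py);
            kwargs.set_item("add_special_tokens", false)?;
            let ids = self
                .tokenizer
                .call_method(py, "encode", (text,), Some(kwargs))?;
            ids.extract::<Vec<u32>>(py)
        })
    }
}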

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Contributor

@OlivierDehaene left a comment


Very nice!

let tokens: Vec<SimpleToken> = encoding
    .get_ids()
    .iter()
    .zip(encoding.get_offsets())
Contributor


This zip will fail for Python.

Collaborator Author


I "fixed" it by ignoring offsets if not present.

Python non rust will definitely exist in forms where offsets are not available (ByT5) so I'll assume we cannot have them.
In that case we'll send the raw ids and totally ignore the "text" part (since it doesn't have a good definition without offsets in the original query)
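
To sketch what that fallback could look like (the SimpleToken fields and the helper name here are assumptions, not the exact code in this PR): when the encoding has no offsets, emit only the ids, with empty text and zero offsets.

// Assumed token shape; the real SimpleToken in the PR may differ.
struct SimpleToken {
    id: u32,
    text: String,
    start: usize,
    stop: usize,
}

// Hypothetical helper: offsets are optional because slow/Python tokenizers
// (ByT5, for instance) may not provide them.
fn simple_tokens(ids: &[u32], offsets: Option<&[(usize, usize)]>, input: &str) -> Vec<SimpleToken> {
    ids.iter()
        .enumerate()
        .map(|(i, &id)| {
            // With offsets, slice the original query; without them, "text"
            // has no good definition, so leave it empty.
            let (start, stop) = offsets.map(|o| o[i]).unwrap_or((0, 0));
            let text = offsets
                .map(|_| input[start..stop].to_string())
                .unwrap_or_default();
            SimpleToken { id, text, start, stop }
        })
        .collect()
}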

let tokens: Vec<SimpleToken> = encoding
    .get_ids()
    .iter()
    .zip(encoding.get_offsets())
Contributor


Same.

OlivierDehaene previously approved these changes on Sep 17, 2024
Comment on lines +199 to +202
try:
    self.embed_tokens = TensorParallelEmbedding(f"{prefix}.embeddings", weights)
except RuntimeError:
    self.embed_tokens = TensorParallelEmbedding(f"{prefix}.embedding", weights)
Contributor


Confused by this.
Why should it work the second time?

Collaborator Author


No "s": the second attempt uses "embedding" rather than "embeddings". There's a difference between mamba-hf and the original mamba (ssm) checkpoints here.

@Narsil requested a review from OlivierDehaene on October 25, 2024 08:57
@Narsil merged commit 90b226d into main on Oct 28, 2024
11 of 12 checks passed
@Narsil deleted the omni_tokenizer branch on October 28, 2024 04:00