
Adding Llava-Next (Llava 1.6) with full support. #1709


Merged · 15 commits into main · Apr 9, 2024
Conversation

@Narsil (Collaborator) commented Apr 5, 2024

What does this PR do?

  • Changed all models to extract embed_tokens in order to enable llava to separately call the embeddings and the core model layers.
  • Added VlmCausalLM to inherit from FlashMistral in order to be maximally supported. The only added logic sits on top: it parses images
    into pixel values, preallocates input_ids space for the image embeddings, and passes them to the model.
  • Added CLIP for the vision tower.
  • Didn't add flash attention for the vision tower, since there's no padding anyway.
  • Added a heuristic (potentially incomplete) to calculate the number of image features before computing the CLIP patches (allows for easier reuse of the LLM logic under the hood); see the sketch after this list.
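
For illustration only, here is a minimal sketch of the idea behind the last two bullets: count how many image-feature tokens an image will occupy before running the vision tower, reserve that many placeholder positions in input_ids, then call embed_tokens separately and scatter the CLIP features into the reserved rows. The names get_number_of_features, merge_image_features, IMAGE_TOKEN_ID and the 336/14 CLIP configuration are assumptions for the example, not the exact code added in this PR.

```python
import torch

# Illustrative constants (assumptions, not necessarily the values used in this PR):
# a CLIP ViT-L/14 tower at 336px yields (336 // 14) ** 2 = 576 patches per crop.
IMAGE_SIZE = 336
PATCH_SIZE = 14
IMAGE_TOKEN_ID = 32000  # hypothetical placeholder token id reserved for image slots


def get_number_of_features(num_crops: int) -> int:
    """Heuristic: how many image-feature tokens an image will occupy, computed
    *before* the vision tower runs, so placeholder slots can be preallocated
    in input_ids."""
    patches_per_crop = (IMAGE_SIZE // PATCH_SIZE) ** 2
    return num_crops * patches_per_crop


def merge_image_features(
    input_ids: torch.Tensor,       # [seq_len], containing IMAGE_TOKEN_ID placeholders
    inputs_embeds: torch.Tensor,   # [seq_len, hidden], from model.embed_tokens(input_ids)
    image_features: torch.Tensor,  # [num_image_tokens, hidden], from the CLIP tower + projector
) -> torch.Tensor:
    """Scatter the vision-tower features into the placeholder positions."""
    mask = input_ids == IMAGE_TOKEN_ID
    # If the heuristic and the vision tower disagree on the number of image tokens,
    # this assignment fails with a "shape mismatch ... cannot be broadcast" error.
    inputs_embeds[mask] = image_features.to(inputs_embeds.dtype)
    return inputs_embeds
```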

Still needs to be done:

  • Implement the image parsing on the controller side, to avoid downloading the image n times (once per TP shard), to refuse requests that are too large early, and to avoid issues where truncation actually truncates the image (see the sketch after this list).
  • Make sure it works with quantization properly.
  • Make sure it works with TP>1
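
As a rough illustration of the first remaining item: the validation belongs in the Rust router/controller, but the logic amounts to counting image-feature tokens up front and refusing oversized requests instead of letting truncation eat into the image placeholders. This Python sketch reuses the hypothetical get_number_of_features helper from the sketch above; validate_request, max_input_length and image_crop_counts are assumed names, not this repository's API.

```python
def validate_request(
    text_token_count: int,
    image_crop_counts: list[int],
    max_input_length: int,
) -> int:
    """Reject a request early if its text tokens plus the image-feature tokens
    it will expand into exceed the configured input budget."""
    image_tokens = sum(get_number_of_features(n) for n in image_crop_counts)
    total = text_token_count + image_tokens
    if total > max_input_length:
        raise ValueError(
            f"Request needs {total} input tokens ({image_tokens} of them for images) "
            f"but the limit is {max_input_length}; refusing instead of truncating."
        )
    return total
```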

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Narsil added 6 commits April 5, 2024 16:18
Still unsolved:
- Rust parameter validation (to calculate number of tokens).
- Integration test.
- Validate other text heads.
- Quantization.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@andrewrreed (Member)

Awesome! Documenting that this PR should fix #1689 - right?

@Narsil (Collaborator, Author) commented Apr 8, 2024

> Awesome! Documenting that this PR should fix #1689 - right?

Indeed it would! Not fully general quite yet, though (we need to transfer part of the token-counting logic to Rust, which means more code, but we'll do the transfer slowly but surely).
"Not fully general" means it can still trigger OOMs: the Rust part (in charge of scheduling) may make wrong assumptions about a query's memory use, and can therefore schedule more than the hardware can withstand. (If we make the scheduler too strict, we might disallow some legitimate requests, so we really need to transfer the whole logic in due time.)

@Narsil (Collaborator, Author) commented Apr 8, 2024

Hmm, Docker is tighter on RAM and is leading to OOMs; we probably need to fix the scheduling before merging then.

@drbh (Collaborator) left a comment

nice this looks great! 🙌

I think it might be helpful to add some docs around the request format (an example curl command, etc.) in a later PR

@Narsil Narsil merged commit 4634b00 into main Apr 9, 2024
@Narsil Narsil deleted the llava branch April 9, 2024 19:32
@shuaills

I encountered a bug where most image inputs cause the model to crash with the following error:

RuntimeError: shape mismatch: value tensor of shape [2352, 7168] cannot be broadcast to indexing result of shape [3712, 7168]

What are the expected image input dimensions for the llava model? Do the dimensions [2352, 7168] and [3712, 7168] have any special meaning?

kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024
Nilabhra pushed a commit to TII-AI-Research-Center/text-generation-inference that referenced this pull request May 14, 2024
@pseudotensor

QQ: the docs specifically point to the vicuna 13b 1.6 model, but what about all the other llava-next models, including the latest ones:

e.g.

liuhaotian/llava-v1.6-34b
lmms-lab/llama3-llava-next-8b
lmms-lab/llava-next-72b
lmms-lab/llava-next-110b

@pseudotensor

Also, when I try sharded, I get:

> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 857, in get_model
    raise NotImplementedError("sharded is not supported for AutoModel")
NotImplementedError: sharded is not supported for AutoModel

But I have to use sharded for the 72b model.

@pseudotensor

Also, for liuhaotian/llava-v1.6-34b I get:

docker run -d --restart=always --gpus '"device=7"' \
--shm-size 12g \
-v $HOME/.cache/huggingface/hub/:/data \
-p 30030:80 \
--name next34b \
ghcr.io/huggingface/text-generation-inference:2.0.4 \
--model-id liuhaotian/llava-v1.6-34b --trust-remote-code --max-stop-sequences=10 \
--max-batch-prefill-tokens=32768 --max-input-length 4096 --max-total-tokens 8192
2024-05-28T02:53:53.846667Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 908, in get_model
    raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type llava

2024-05-28T02:53:54.271514Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Traceback (most recent call last):

  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 257, in serve
    asyncio.run(

  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 220, in serve_inner
    model = get_model(

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 908, in get_model
    raise ValueError(f"Unsupported model type {model_type}")

ValueError: Unsupported model type llava
 rank=0
2024-05-28T02:53:54.370653Z ERROR text_generation_launcher: Shard 0 failed to start
2024-05-28T02:53:54.370671Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

@pseudotensor

Even lmms-lab/llama3-llava-next-8b fails the same way.

@jiguanglizipao

This pull request comments out the truncate function and raises a MaxNewTokens error when prompt_length + max_new_tokens is larger than max_total_tokens. Any plan to fix this?

7 participants