
ci: Add llama3 gpu workflow in periodic #399


Merged
seemethere merged 12 commits into main from seemethere/add_llama3_test on Apr 23, 2024

Conversation

@seemethere (Member) commented Apr 23, 2024

Adds a llama3 testing workflow for periodic; the model is downloaded using huggingface-cli.

This is somewhat of a working prototype; I left a couple of TODOs in places where things could be done better if given more time.

Another note: this currently only covers GPU, since this needed to get done fast and I only edited the GPU workflow.
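
For reference, a minimal sketch of the download step the workflow performs. The repo id and checkpoint directory are taken from the logs further down; the token handling is an assumption (the meta-llama repo is gated), and the workflow's actual flags may differ.

# Assumption: an HF_TOKEN secret is available for the gated meta-llama repo
huggingface-cli login --token "$HF_TOKEN"
huggingface-cli download meta-llama/Meta-Llama-3-8B \
  --local-dir checkpoints/meta-llama/Meta-Llama-3-8B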

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 23, 2024
@seemethere seemethere added ciflow/periodic and removed CLA Signed This label is managed by the Meta Open Source bot. ciflow/periodic labels Apr 23, 2024

pytorch-bot bot commented Apr 23, 2024

No ciflow labels are configured for this repo.
For information on how to enable the CIFlow bot, see this wiki.

@seemethere seemethere force-pushed the seemethere/add_llama3_test branch from 2545c9f to 8cff173 Compare April 23, 2024 00:46
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 23, 2024
@seemethere seemethere changed the title from "ci: Add llama3 workflow in periodic" to "ci: Add llama3 gpu workflow in periodic" Apr 23, 2024
@seemethere seemethere force-pushed the seemethere/add_llama3_test branch from 43b91cf to 610ade6 Compare April 23, 2024 01:41
@seemethere seemethere marked this pull request as ready for review April 23, 2024 18:57
@orionr (Contributor) left a comment

I think so long as this works you should go for it.

@mikekgfb (Contributor) left a comment

Thank you!

@seemethere (Member, Author) commented

This PR is currently blocked on the --tiktoken argument being made the default in generate.py.

I did test this locally and ran into an issue with INT4 group-wise quantization:

Logs:
+ python3 -W ignore export.py --dtype bfloat16 --quant '{"linear:int4-gptq" : {"groupsize": 32}}' --checkpoint-path checkpoints/meta-llama/Meta-Llama-3-8B/model.pth --output-dso-path checkpoints/meta-llama/Meta-Llama-3-8B/model.so --device cuda
Using device=cuda
Loading model ...
name Meta-Llama-3-8B
Time to load model: 4.40 seconds
Quantizing the model with: {"linear:int4-gptq" : {"groupsize": 32}}
device: cuda
2024-04-23:15:52:36,269 INFO     [huggingface.py:162] Using device 'cuda'
2024-04-23:15:52:41,107 WARNING  [task.py:763] [Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-04-23:15:52:41,107 WARNING  [task.py:775] [Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-04-23:15:52:41,107 WARNING  [task.py:763] [Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
2024-04-23:15:52:41,107 WARNING  [task.py:775] [Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
2024-04-23:15:52:41,107 WARNING  [task.py:763] [Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
2024-04-23:15:52:41,107 WARNING  [task.py:775] [Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
Repo card metadata block was not found. Setting CardData to empty.
2024-04-23:15:52:42,341 WARNING  [repocard.py:107] Repo card metadata block was not found. Setting CardData to empty.
Obtaining GPTQ calibration inputs on:  ['wikitext']
2024-04-23:15:52:42,390 INFO     [task.py:395] Building contexts for wikitext on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 860.74it/s]
2024-04-23:15:52:42,403 INFO     [evaluator.py:362] Running loglikelihood_rolling requests
  0%|                                                                                                                                                                                                                                                           | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data/users/eliuriegas/torchchat/export.py", line 116, in <module>
    main(args)
  File "/data/users/eliuriegas/torchchat/export.py", line 65, in main
    model = _initialize_model(
  File "/data/users/eliuriegas/torchchat/build/builder.py", line 373, in _initialize_model
    quantize_model(model, builder_args.device, quantize, tokenizer)
  File "/data/users/eliuriegas/torchchat/quantize.py", line 43, in quantize_model
    model = quantizer_class_dict[quantizer](
  File "/data/users/eliuriegas/torchchat/quantize.py", line 1191, in quantized_model
    model_updated_state_dict = self.create_quantized_state_dict(
  File "/home/eliuriegas/local/torchchat/.venv/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/users/eliuriegas/torchchat/quantize.py", line 1077, in create_quantized_state_dict
    inputs = GPTQQuantHandler.get_inputs(
  File "/data/users/eliuriegas/torchchat/quantize.py", line 1050, in get_inputs
    evaluate(
  File "/home/eliuriegas/local/torchchat/.venv/lib/python3.9/site-packages/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/eliuriegas/local/torchchat/.venv/lib/python3.9/site-packages/lm_eval/evaluator.py", line 373, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/eliuriegas/local/torchchat/.venv/lib/python3.9/site-packages/lm_eval/models/huggingface.py", line 817, in loglikelihood_rolling
    token_list=self.tok_encode(string),
  File "/data/users/eliuriegas/torchchat/eval.py", line 133, in tok_encode
    encoded = encode_tokens(self._tokenizer, string, bos=True, device=self._device)
  File "/data/users/eliuriegas/torchchat/generate.py", line 357, in encode_tokens
    tokens = tokenizer.encode(string)
AttributeError: 'NoneType' object has no attribute 'encode'

I'm also skeptical that this will work on an A10G, since it is somewhat memory-constrained compared to the H100 I tested on.
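
One possible way to unblock this before the default changes, sketched under the assumption that export.py accepts the same --tiktoken flag discussed above for generate.py (not verified in this PR), would be to pass the flag explicitly:

# Hypothetical: --tiktoken on export.py mirrors generate.py and is an assumption
python3 -W ignore export.py --tiktoken --dtype bfloat16 \
    --quant '{"linear:int4-gptq" : {"groupsize": 32}}' \
    --checkpoint-path checkpoints/meta-llama/Meta-Llama-3-8B/model.pth \
    --output-dso-path checkpoints/meta-llama/Meta-Llama-3-8B/model.so \
    --device cuda

That should give the GPTQ calibration path a real tokenizer instead of the None that triggers the AttributeError above.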

seemethere and others added 12 commits April 23, 2024 16:09
Adds a llama3 testing workflow for periodic, downloads this using
huggingface-cli.

This is somewhat of a working prototype, I left a couple of TODOs in
places where things could be done better if given more time.

Signed-off-by: Eli Uriegas <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]>
Signed-off-by: Eli Uriegas <[email protected]>
Didn't realize the checkpoint normalization was already taken care of
later on down the line

Signed-off-by: Eli Uriegas <[email protected]>
@seemethere seemethere force-pushed the seemethere/add_llama3_test branch from 96670c0 to 1936561 Compare April 23, 2024 23:10
@seemethere (Member, Author) commented Apr 23, 2024

Yeah, it appears my fears about memory limits were well founded: https://github.com/pytorch/torchchat/actions/runs/8808406315/job/24177435186?pr=399#step:11:809

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacity of 21.99 GiB of which 751.00 MiB is free. Process 7077 has 21.24 GiB memory in use. Of the allocated memory 20.83 GiB is allocated by PyTorch, and 130.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Going to merge anyway so the team can continue to iterate here.
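
As a stopgap for the fragmentation portion of that error, the allocator setting suggested in the message itself could be exported before the run. This is a sketch only; it relieves fragmentation but will not create extra capacity on a ~22 GiB A10G.

# Suggested by the allocator message above; untested on the A10G runner
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# then re-run the same export.py command shown in the logs above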

@seemethere seemethere merged commit 525acfb into main Apr 23, 2024
@seemethere seemethere deleted the seemethere/add_llama3_test branch April 23, 2024 23:27
malfet pushed commits that referenced this pull request on Jul 17, 2024
Labels: ciflow/periodic, CLA Signed