[Distributed Inference] Make torch run work for torchchat and fix TP bugs #877
Thanks for adding this, especially the OOM (device 0) fix.
Tiny nit: update the one TP comment and remove the reference to sequence parallel, since it's not being used now.
Somehow in torchchat we only set the device to "cuda", which makes every rank use cuda:0 and leads to CUDA OOM during checkpoint loading. With this change I can run all the way until the prompt shows up, but we currently have to press enter once for each rank, which is something we need to solve next.
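A minimal sketch of the per-rank device fix, assuming the usual torchrun environment variables; the function names below are illustrative, not torchchat's actual code:

```python
import os
import torch


def init_device() -> torch.device:
    # torchrun sets LOCAL_RANK per process; default to 0 for single-GPU runs.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}")
    # Pin this process to its own GPU so checkpoint loading and kernels
    # do not all land on cuda:0.
    torch.cuda.set_device(device)
    return device


def load_checkpoint(path: str, device: torch.device):
    # Map tensors straight onto this rank's device instead of the
    # default cuda:0, which is what caused the OOM.
    return torch.load(path, map_location=device)
```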
Also, for the TP part we need to use plain tensor parallel, not sequence parallel as we did for training.
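For reference, a hedged sketch of what plain tensor parallel (without sequence parallel) looks like with torch.distributed's tensor-parallel API; the submodule names in the plan are illustrative, not torchchat's exact module paths:

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


def apply_tp(block: nn.Module, tp_size: int) -> nn.Module:
    # Plain tensor parallel: shard projections column-/row-wise and let the
    # RowwiseParallel outputs all-reduce back to replicated activations.
    # No sequence-parallel input/output layouts are involved.
    mesh = init_device_mesh("cuda", (tp_size,))
    plan = {
        "attention.wq": ColwiseParallel(),
        "attention.wk": ColwiseParallel(),
        "attention.wv": ColwiseParallel(),
        "attention.wo": RowwiseParallel(),
        "feed_forward.w1": ColwiseParallel(),
        "feed_forward.w3": ColwiseParallel(),
        "feed_forward.w2": RowwiseParallel(),
    }
    return parallelize_module(block, mesh, plan)
```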
To test torchrun distributed inference, run:

./distributed/run_dist_inference.sh