[Distributed Inference] Make torch run work for torchchat and fix TP bugs #877
Thanks for adding this, especially the OOM (device 0) fix.
Tiny nit: update the one TP comment and remove the reference to sequence parallel, since it's not being used now.
Somehow in torchchat we only set the device to "cuda", which makes every rank use cuda:0 and leads to CUDA OOM during checkpoint loading. With this change I can run all the way until the prompt shows up, but we currently have to press enter once for each rank, which is something we need to solve next.
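A minimal sketch of the per-rank device fix, assuming the usual torchrun environment variables; the function names below are illustrative, not torchchat's actual code:

```python
import os
import torch


def init_device() -> torch.device:
    # torchrun sets LOCAL_RANK per process; default to 0 for single-GPU runs.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(f"cuda:{local_rank}")
    # Pin this process to its own GPU so checkpoint loading and kernels
    # do not all land on cuda:0.
    torch.cuda.set_device(device)
    return device


def load_checkpoint(path: str, device: torch.device):
    # Map tensors straight onto this rank's device instead of the
    # default cuda:0, which is what caused the OOM.
    return torch.load(path, map_location=device)
```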
Also, for the TP part we need to use plain tensor parallel, not sequence parallel as we did for training.
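For reference, a hedged sketch of what plain tensor parallel (without sequence parallel) looks like with torch.distributed's tensor-parallel API; the submodule names in the plan are illustrative, not torchchat's exact module paths:

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


def apply_tp(block: nn.Module, tp_size: int) -> nn.Module:
    # Plain tensor parallel: shard projections column-/row-wise and let the
    # RowwiseParallel outputs all-reduce back to replicated activations.
    # No sequence-parallel input/output layouts are involved.
    mesh = init_device_mesh("cuda", (tp_size,))
    plan = {
        "attention.wq": ColwiseParallel(),
        "attention.wk": ColwiseParallel(),
        "attention.wv": ColwiseParallel(),
        "attention.wo": RowwiseParallel(),
        "feed_forward.w1": ColwiseParallel(),
        "feed_forward.w3": ColwiseParallel(),
        "feed_forward.w2": RowwiseParallel(),
    }
    return parallelize_module(block, mesh, plan)
```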
To test torchrun distributed inference, run:

./distributed/run_dist_inference.sh