[Distributed] Integrate toml for configs, sink distributed launch & DCP work to distributed level #898
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/898
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit be7db92 with merge base ceb9a3a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Overall this looks good to me; I left some nits.
Also, please make sure all CI passes before merging, thanks!
[Distributed] Integrate toml for configs, sink distributed launch & DCP work to distributed level (#898)
* start inference.sh, toml configs
* first toml
* add config_manager
* basic toml load, prep for starting dist
* sink init and add toml parsing
* toml load working
* add distributed logger
* logging working
* ruff and isort
* remove inference.py
* better toml breakout, add tomli if python < 3.11
This PR:
1 - Adds toml support for distributed inference.
Toml files can be spec'ed and stored under /inference_configs (e.g. /inference_configs/llama3_8b.toml).
At distributed launch, the relevant toml file is loaded and parsed, and then used to build the distributed config, especially the pp (pipeline-parallel) and tp (tensor-parallel) dimensions. (A loading sketch follows this list.)
2 - Moves the distributed launch code into world_maker.py in the /distributed folder, exposing single APIs to builder.py. (See the second sketch after this list.)
The idea here is to keep all code local to the /distributed folder and expose only surface APIs to builder. This should make development easier, since all distributed-specific code lives in one place instead of being interspersed between generic torchchat code (builder, etc.) and /distributed code.
3 - Runs ruff and isort to clean up the current /distributed files.
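
A minimal sketch of what the toml loading in point 1 could look like, assuming a hypothetical `[parallel]` table with `pp`/`tp` keys (the actual torchchat schema may differ); the tomllib/tomli fallback mirrors the "add tomli if python < 3.11" commit above:

```python
import sys
from dataclasses import dataclass

# Python 3.11+ ships tomllib in the stdlib; older versions need the
# tomli backport, which exposes the same load/loads API.
if sys.version_info >= (3, 11):
    import tomllib
else:
    import tomli as tomllib


@dataclass
class ParallelDims:
    """Parallelism degrees read from the toml config (hypothetical shape)."""
    pp: int = 1  # pipeline-parallel degree
    tp: int = 1  # tensor-parallel degree


def load_inference_config(path: str = "inference_configs/llama3_8b.toml") -> ParallelDims:
    """Load a toml config and pull out the parallelism dimensions.

    The [parallel] table name and pp/tp keys are illustrative, not the
    exact torchchat config layout.
    """
    with open(path, "rb") as f:  # tomllib requires a binary file handle
        config = tomllib.load(f)
    parallel = config.get("parallel", {})
    return ParallelDims(pp=parallel.get("pp", 1), tp=parallel.get("tp", 1))
```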
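The "surface API" idea in point 2 could look roughly like the sketch below; `launch_distributed` and the mesh shape are assumptions for illustration, not the actual contents of world_maker.py. `init_device_mesh` is the stock PyTorch helper:

```python
# Hypothetical sketch of distributed/world_maker.py: builder.py calls a
# single surface API and never touches torch.distributed setup directly.
import os

from torch.distributed.device_mesh import init_device_mesh


def launch_distributed(pp: int, tp: int):
    """Single entry point exposed to builder.py (name is an assumption).

    Builds a 2-D device mesh from the pp/tp dimensions parsed out of the
    toml config; all process-group setup stays inside /distributed.
    """
    world_size = int(os.environ["WORLD_SIZE"])  # set by torchrun
    if world_size != pp * tp:
        raise ValueError(f"pp * tp ({pp * tp}) must equal WORLD_SIZE ({world_size})")
    # init_device_mesh initializes the default process group if needed and
    # returns a mesh with named "pp" and "tp" dimensions.
    return init_device_mesh("cuda", (pp, tp), mesh_dim_names=("pp", "tp"))
```

With something like this, builder.py would only need `dims = load_inference_config(...)` followed by `mesh = launch_distributed(dims.pp, dims.tp)`, keeping every torch.distributed detail behind the /distributed boundary.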