[refactor] Structure configuration files into classes #3936
Conversation
@@ -0,0 +1,110 @@
import attr
How long do you think we'll need to keep this script around?
Hopefully only for 1-2 releases. Or perhaps it should be released with Trainer 1.0 and removed after that.
docs/Training-Configuration-File.md
Outdated
| `hyperparameters -> buffer_size` | (default = `10240` for PPO and `50000` for SAC) Number of experiences to collect before updating the policy model. Corresponds to how many experiences should be collected before we do any learning or updating of the model. **This should be multiple times larger than `batch_size`**. Typically a larger `buffer_size` corresponds to more stable training updates. In SAC, the max size of the experience buffer - on the order of thousands of times longer than your episodes, so that SAC can learn from old as well as new experiences. <br><br>Typical range: PPO: `2048` - `409600`; SAC: `50000` - `1000000` |
| `hyperparameters -> learning_rate` | (default = `3e-4`) Initial learning rate for gradient descent. Corresponds to the strength of each gradient descent update step. This should typically be decreased if training is unstable, and the reward does not consistently increase. <br><br>Typical range: `1e-5` - `1e-3` |
| `hyperparameters -> learning_rate_schedule` | (default = `linear` for PPO and `constant` for SAC) Determines how learning rate changes over time. For PPO, we recommend decaying learning rate until max_steps so learning converges more stably. However, for some cases (e.g. training for an unknown amount of time) this feature can be disabled. For SAC, we recommend holding learning rate constant so that the agent can continue to learn until its Q function converges naturally. <br><br>`linear` decays the learning_rate linearly, reaching 0 at max_steps, while `constant` keeps the learning rate constant for the entire training run. |
| `max_steps` | (default = `500000`) Total number of experience points that must be collected from the simulation before ending the training process. <br><br>Typical range: `5e5` - `1e7` |
'experience points'
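For context, the PPO defaults documented in the table above map onto a raw trainer configuration of roughly this shape; this is an illustrative sketch only (the variable name is invented, and the nesting simply follows the `hyperparameters ->` prefixes):

# Illustrative only: PPO defaults from the table above, as the raw dict shape
# a trainer configuration loads into; SAC defaults differ where noted.
ppo_behavior_config = {
    "hyperparameters": {
        "buffer_size": 10240,                # 50000 for SAC
        "learning_rate": 3e-4,
        "learning_rate_schedule": "linear",  # "constant" for SAC
    },
    "max_steps": 500000,
}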
@@ -26,9 +20,11 @@ def __init__(
:param encoding_size: The size of the hidden encoding layer for the ICM
param docstring :(
Looks good to my eye -- nice refactor 👍
self.trainer_settings["sequence_length"]
> self.trainer_settings["batch_size"]
and self.trainer_settings["use_recurrent"]
I think this can be:
self.trainer_settings.sequence_length
> self.trainer_settings.batch_size
and self.trainer_settings.use_recurrent
Ah, this method isn't actually used anymore; the check is done in settings.py in _check_batch_size_seq_length. I've removed them from PPO and SAC.
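A minimal sketch of how such a cross-field check can be expressed on an attrs-based settings class, assuming field names that mirror the dictionary keys in the quoted snippet (the real TrainerSettings in settings.py is larger and may organize these fields differently):

import attr

@attr.s(auto_attribs=True)
class TrainerSettingsSketch:
    # Hypothetical, simplified stand-in for the real settings class.
    batch_size: int = 1024
    sequence_length: int = 64
    use_recurrent: bool = False

    def __attrs_post_init__(self) -> None:
        # The check formerly duplicated in the PPO/SAC optimizers: a recurrent
        # policy cannot train with sequence_length greater than batch_size.
        if self.use_recurrent and self.sequence_length > self.batch_size:
            raise ValueError(
                "sequence_length must not exceed batch_size when use_recurrent is true."
            )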
vis_encode_type = policy_network_settings.vis_encode_type
self.tau = hyperparameters.tau
self.burn_in_ratio = 0.0
Looks like this used to be a trainer param but is now just being set to 0.0? Any reason why?
This wasn't configurable before (also set to 0 as a constant at the top of the file). Currently the burn-in feature doesn't actually work properly with values greater than 0. Will fix or remove in a PR that just touches that.
self.trainer_settings["sequence_length"]
> self.trainer_settings["batch_size"]
and self.trainer_settings["use_recurrent"]
Suggested change:
self.trainer_settings.sequence_length
> self.trainer_settings.batch_size
and self.trainer_settings.use_recurrent
Ah, this method isn't actually used anymore; the check is done in settings.py in _check_batch_size_seq_length. I've removed them from PPO and SAC.
)
init_path: Optional[str] = None
output_path: str = "default"
# TODO: Remove parser default and remove from CLI |
Address before merging?
Done!
Proposed change(s)
This PR implements the changes described in this design doc. In particular, it replaces the configuration dictionary (called trainer_params in many files) with objects based on the attrs library. These are more flexible than NamedTuples and support smart defaults, validators, and typing, and they can be structured/unstructured from dicts using the cattrs library.

This PR also assigns reasonable defaults to every hyperparameter, and mlagents-learn now allows training without any configuration file specified (defaulting to PPO with the extrinsic reward signal). In the future, it will allow for defaults that are set relative to other defaults (as done in SelfPlaySettings).

The core of the changes is in settings.py.
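As a rough illustration of those mechanics (the class and field names below are invented for the example and are not the actual classes in settings.py):

import attr
import cattr

@attr.s(auto_attribs=True)
class HyperparamSettingsSketch:
    # Hypothetical, simplified settings class showing attrs defaults and typing.
    learning_rate: float = 3.0e-4
    buffer_size: int = 10240

# Structure a raw dict (e.g. loaded from a YAML config) into a typed object;
# keys missing from the dict fall back to the attrs defaults.
settings = cattr.structure({"learning_rate": 1.0e-4}, HyperparamSettingsSketch)
assert settings.buffer_size == 10240

# Unstructure back to a plain dict, e.g. for writing the resolved config to disk.
as_dict = cattr.unstructure(settings)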
Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)
Attrs: https://www.attrs.org/en/stable/
Cattrs: https://cattrs.readthedocs.io/en/latest/
This PR does not yet refactor how the settings are passed down from Trainer -> Policy -> Optimizer. The other two objects still receive the entire TrainerSettings object.

TODO: Improve testing for error handling in settings.py; update docs.

NOTE: The appearance of output_path as "default" in the configs is a bit odd, as output_path is not settable through the configuration (it's determined by the run-id). I plan to remove output_path in a follow-up PR.

Types of change(s)
Checklist
Other comments