Random Network Distillation for Torch #4473
Conversation
```python
with torch.no_grad():
    target = self._random_network(mini_batch)
    prediction = self._training_network(mini_batch)
    unnormalized_rewards = torch.sum((prediction - target) ** 2, dim=1)
```
Just curious - do you know if `(prediction - target) ** 2` gets treated as `x * x` or `pow(x, 2)`? I'd imagine the former is more efficient to evaluate (for both the forward pass and the gradient).
Torch converts `x ** 2` to `pow(x, 2)`. The two are equivalent, so we can pick the one we are more comfortable with.
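For what it's worth, here is a quick standalone check (not part of this PR) illustrating the point: on tensors, `**` dispatches to the same power op as `torch.pow`, and both produce the same values and the same backward node.

```python
import torch

# Quick check: `x ** 2` and `torch.pow(x, 2)` build the same autograd op.
x = torch.randn(4, requires_grad=True)
a = x ** 2
b = torch.pow(x, 2)

print(torch.equal(a, b))                                     # True
print(type(a.grad_fn).__name__, type(b.grad_fn).__name__)    # both the same backward op (e.g. PowBackward0)
```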
```markdown
@@ -118,6 +119,18 @@ settings:
| `gail -> use_actions` | (default = `false`) Determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to True if you want the agent to mimic the actions from the demonstrations, and False if you'd rather have the agent visit the same states as in the demonstrations but with possibly different actions. Setting to False is more likely to be stable, especially with imperfect demonstrations, but may learn slower. |
| `gail -> use_vail` | (default = `false`) Enables a variational bottleneck within the GAIL discriminator. This forces the discriminator to learn a more general representation and reduces its tendency to be "too good" at discriminating, making learning more stable. However, it does increase training time. Enable this if you notice your imitation learning is unstable, or unable to learn the task at hand. |

### RND Intrinsic Reward

Random Network Distillation (RND) is only available for the PyTorch trainers.
```
Can you explain in more detail why a user would want to enable this? What scenarios does it help solve better, and what are the drawbacks/weaknesses?
What happens if you try to enable this with tensorflow? Or without torch installed?
You get an error that says the intrinsic reward signal type is not recognized.
Out of curiosity (no pun intended), have you tried this with Pyramids-SAC?

It does not do as well (it behaves a bit like Curiosity, which is also having a bad time).
```python
self.optimizer = torch.optim.Adam(
    self._training_network.parameters(), lr=settings.learning_rate
)
self._has_updated_once = False
```
I see we use this flag to zero out the reward before the training network has been trained to predict the random network. Is the idea that this initial reward is misleading in some way? If that's the case, I am not completely sold that that's true.
You are right. We do the same thing in Curiosity: if there has been no update yet, the reward is set to 0, because at that point the reward is just the MSE between two random networks. The first reward would reward any action equally, which in my opinion is the same as rewarding none. But I see your point. I can remove this part; I do not think it will have any impact on training.
If a user runs with `--resume`, the first reward would be ignored?
Yes, you are right. I see the issue. I will remove the `_has_updated_once` variable.
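For context, a minimal sketch (not the PR's exact code; the function and network names are illustrative) of how a flag like `_has_updated_once` gates the RND reward, and why a run restarted with `--resume` would have its first batch of rewards zeroed again:

```python
import torch

def rnd_reward(
    random_network: torch.nn.Module,
    training_network: torch.nn.Module,
    obs: torch.Tensor,
    has_updated_once: bool,
) -> torch.Tensor:
    # The RND intrinsic reward is the squared error between a fixed random
    # network and a trained predictor network on the same observations.
    with torch.no_grad():
        target = random_network(obs)
        prediction = training_network(obs)
        rewards = torch.sum((prediction - target) ** 2, dim=1)
    # Multiplying by the flag zeroes the reward until the first update has run;
    # after a restart with --resume the flag is False again, so the first
    # batch of rewards would also be zeroed.
    return rewards * float(has_updated_once)

# Stand-in networks for illustration only.
random_net = torch.nn.Linear(8, 16)
training_net = torch.nn.Linear(8, 16)
obs = torch.randn(32, 8)
print(rnd_reward(random_net, training_net, obs, has_updated_once=False))  # all zeros
```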
Do we not mind leaving the rewards unnormalized? (In the paper they use the STD of the full returns as a normalizer.)
Yes, you are right. I took some liberties with the original paper. For example, there is also a terminal reward for end states that I simply removed, and I do not consider the agent being done as zeroing the future rewards.
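For reference, a minimal sketch of the normalization the paper describes: divide the raw RND reward by a running estimate of the standard deviation of the discounted intrinsic return. The class and names here are illustrative, not from the PR or ML-Agents.

```python
import numpy as np

class IntrinsicRewardNormalizer:
    """Illustrative sketch: scale RND rewards by a running std of the
    discounted intrinsic return, roughly as in the RND paper."""

    def __init__(self, gamma: float = 0.99) -> None:
        self.gamma = gamma
        self.returns = 0.0
        self.count = 1e-4   # avoid division by zero before the first update
        self.mean = 0.0
        self.m2 = 0.0

    def _update_std(self, value: float) -> None:
        # Welford's online update of the variance of the running return.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def normalize(self, raw_rewards: np.ndarray) -> np.ndarray:
        for r in raw_rewards:
            self.returns = self.gamma * self.returns + float(r)
            self._update_std(self.returns)
        std = float(np.sqrt(self.m2 / self.count)) + 1e-8
        return raw_rewards / std


normalizer = IntrinsicRewardNormalizer()
print(normalizer.normalize(np.array([0.5, 1.2, 0.3])))
```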
Proposed change(s)
Adding Random Network Distillation to the Torch trainers
Tested with Pyramids (it gets to 1.7 reward consistently)
Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)
Types of change(s)
Checklist
Other comments