
Random Network Distillation for Torch #4473


Merged
merged 15 commits from develop-rnd into master on Sep 23, 2020

Conversation

vincentpierre
Contributor

Proposed change(s)

Adding Random Network Distillation to the Torch trainers

[Screenshot attached: Screen Shot 2020-09-11 at 9 38 35 AM]

Tested with Pyramids (it gets to 1.7 reward consistently)

Useful links (GitHub issues, JIRA tickets, ML-Agents forum threads, etc.)

Types of change(s)

  • Bug fix
  • New feature
  • Code refactor
  • Breaking change
  • Documentation update
  • Other (please describe)

Checklist

  • Added tests that prove my fix is effective or that my feature works
  • Updated the changelog (if applicable)
  • Updated the documentation (if applicable)
  • Updated the migration guide (if applicable)

Other comments

with torch.no_grad():
    # Fixed, randomly initialized target network (never trained).
    target = self._random_network(mini_batch)
    # Predictor network, trained elsewhere to match the target's output.
    prediction = self._training_network(mini_batch)
    # Per-sample prediction error is the (unnormalized) intrinsic reward.
    unnormalized_rewards = torch.sum((prediction - target) ** 2, dim=1)
Contributor

Just curious - do you know if (prediction - target) ** 2 gets treated as x*x or pow(x, 2)? I'd imagine the former is more efficient for evaluating (both it and the gradient).

Contributor Author

Torch converts x ** 2 to pow(x, 2). The two are equivalent, so we can pick the one we are more comfortable with.
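
For reference, a quick standalone check (not from the PR, just an illustrative sketch) confirms that the two forms produce identical values and gradients:

```python
import torch

# x ** 2 and torch.pow(x, 2) map to the same operation, so values
# and gradients match exactly.
x = torch.randn(5, requires_grad=True)
a = (x ** 2).sum()
b = torch.pow(x, 2).sum()
assert torch.allclose(a, b)

(grad_a,) = torch.autograd.grad(a, x)
(grad_b,) = torch.autograd.grad(b, x)
assert torch.allclose(grad_a, grad_b)  # both equal 2 * x
```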

@@ -118,6 +119,18 @@ settings:
| `gail -> use_actions` | (default = `false`) Determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to True if you want the agent to mimic the actions from the demonstrations, and False if you'd rather have the agent visit the same states as in the demonstrations but with possibly different actions. Setting to False is more likely to be stable, especially with imperfect demonstrations, but may learn slower. |
| `gail -> use_vail` | (default = `false`) Enables a variational bottleneck within the GAIL discriminator. This forces the discriminator to learn a more general representation and reduces its tendency to be "too good" at discriminating, making learning more stable. However, it does increase training time. Enable this if you notice your imitation learning is unstable, or unable to learn the task at hand. |

### RND Intrinsic Reward

Random Network Distillation (RND) is only available for the PyTorch trainers.
Contributor

Can you explain in more detail why a user would want to enable this? What scenarios does it help solve better, and what are the drawbacks/weaknesses?

Contributor

What happens if you try to enable this with tensorflow? Or without torch installed?

Contributor Author

You get an error that says the intrinsic reward signal type is not recognized.
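
For context, the mechanism the new reward provider relies on can be sketched in a few lines. This is an illustrative sketch only (the class name, method names, and layer sizes are assumptions, not the reward provider class added in this PR): a fixed, randomly initialized target network and a trained predictor, with the per-observation prediction error used as the intrinsic reward.

```python
import torch
import torch.nn as nn


class RNDSketch(nn.Module):
    """Illustrative RND module, not the implementation added in this PR."""

    def __init__(self, obs_size: int, encoding_size: int = 64) -> None:
        super().__init__()
        # Fixed, randomly initialized target network; never trained.
        self._random_network = nn.Sequential(
            nn.Linear(obs_size, encoding_size),
            nn.ReLU(),
            nn.Linear(encoding_size, encoding_size),
        )
        for p in self._random_network.parameters():
            p.requires_grad_(False)
        # Predictor network, trained to match the random target's outputs.
        self._training_network = nn.Sequential(
            nn.Linear(obs_size, encoding_size),
            nn.ReLU(),
            nn.Linear(encoding_size, encoding_size),
        )

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        # Prediction error is large on rarely visited observations,
        # so it acts as an exploration bonus.
        with torch.no_grad():
            target = self._random_network(obs)
            prediction = self._training_network(obs)
            return torch.sum((prediction - target) ** 2, dim=1)

    def loss(self, obs: torch.Tensor) -> torch.Tensor:
        # Only the predictor receives gradients; the target stays fixed.
        with torch.no_grad():
            target = self._random_network(obs)
        prediction = self._training_network(obs)
        return torch.mean(torch.sum((prediction - target) ** 2, dim=1))
```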

@ervteng
Contributor

ervteng commented Sep 11, 2020

Out of curiosity (no pun intended) have you tried this with Pyramids-SAC?

@vincentpierre
Contributor Author

> Out of curiosity (no pun intended) have you tried this with Pyramids-SAC?

It does not do as well (it behaves a bit like Curiosity, which is also having a hard time with SAC).
Will do more tests with SAC.

@vincentpierre vincentpierre self-assigned this Sep 22, 2020
@vincentpierre
Contributor Author

A comparison of RND (orange) and Curiosity (blue) on a single cloud training run (using PPO). SAC still does not want to hear anything related to either Curiosity or RND.
[Plot: Curiosity (blue) vs RND (orange)]

self.optimizer = torch.optim.Adam(
    self._training_network.parameters(), lr=settings.learning_rate
)
# Flag used to zero out the intrinsic reward until the first update
# (removed later in this PR; see the discussion below).
self._has_updated_once = False
Contributor

I see we use this flag to zero out the reward before the training network has been trained to predict the random network. Is the idea that this initial reward is misleading in some way? If that's the case, I am not completely sold that that's true.

Contributor Author

You are right. We do the same thing in Curiosity: if there has been no update yet, the reward is set to 0, because before any update the reward is just the MSE between two random networks. That first reward rewards every action equally, which in my opinion is the same as rewarding none. But I see your point, and I can remove this part; I do not think it will have any impact on training.

Contributor

If a user runs with --resume, the first reward would be ignored?

Contributor Author

Yes, you are right, I see the issue. I will remove the _has_updated_once variable.
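
For reference, a minimal sketch of an update without the flag, reusing the illustrative RNDSketch module above (the optimizer setup mirrors the snippet in this thread; the batch shape and learning rate are placeholder assumptions):

```python
# Train the predictor on a batch of observations; no flag is needed, so
# rewards are reported from the very first step, including after --resume.
rnd = RNDSketch(obs_size=8)
optimizer = torch.optim.Adam(rnd._training_network.parameters(), lr=1e-4)

obs_batch = torch.randn(32, 8)  # stand-in for a mini_batch of observations
loss = rnd.loss(obs_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()

rewards = rnd.intrinsic_reward(obs_batch)  # reported without zeroing
```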

@andrewcoh
Contributor

Do we not mind leaving rewards unnormalized (in the paper they use the STD of full returns as a normalizer)?

@vincentpierre
Contributor Author

> Do we not mind leaving rewards unnormalized (in the paper they use the STD of full returns as a normalizer)?

Yes, you are right. I took some liberties with the original paper. For example, there is also a terminal reward for end states that I simply removed, and I do not treat the agent being done as zeroing the future rewards.
The reason is that our current Reward Provider abstraction does not allow for either of these. I think we could implement these features, but it would be a rather big refactor. It still trains the Pyramids environment correctly.
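
For reference, the normalizer mentioned above can be sketched as a running estimate of the standard deviation of the intrinsic returns, which the raw RND rewards are divided by. This is not part of this PR; the helper name and constants are assumptions:

```python
import numpy as np


class RunningStd:
    """Running standard deviation via Welford's algorithm (illustrative)."""

    def __init__(self) -> None:
        self.mean = 0.0
        self.m2 = 0.0
        self.count = 1e-4  # avoids division by zero before the first update

    def update(self, values: np.ndarray) -> None:
        for v in np.asarray(values, dtype=np.float64).ravel():
            self.count += 1
            delta = v - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (v - self.mean)

    @property
    def std(self) -> float:
        return float(np.sqrt(self.m2 / self.count)) + 1e-8


# Usage: track the std of discounted intrinsic returns and scale rewards by it.
return_std = RunningStd()
return_std.update(np.array([0.3, 0.5, 0.2]))  # placeholder intrinsic returns
normalized_rewards = np.array([0.1, 0.4]) / return_std.std
```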

Contributor

@andrewcoh andrewcoh left a comment

:shipit:

@vincentpierre vincentpierre merged commit 73fa8bd into master Sep 23, 2020
@delete-merged-branch delete-merged-branch bot deleted the develop-rnd branch September 23, 2020 22:11
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 24, 2021