Random Network Distillation for Torch #4473
Conversation
```python
with torch.no_grad():
    target = self._random_network(mini_batch)
    prediction = self._training_network(mini_batch)
    unnormalized_rewards = torch.sum((prediction - target) ** 2, dim=1)
```
Just curious - do you know if `(prediction - target) ** 2` gets treated as `x * x` or `pow(x, 2)`? I'd imagine the former is more efficient to evaluate (for both the forward pass and the gradient).
Torch converts `x ** 2` to `pow(x, 2)`. The two are equivalent, so we can pick the one we are more comfortable with.
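For what it's worth, here is a quick standalone check (not part of this PR) illustrating the point: on tensors, `**` dispatches to the same power op as `torch.pow`, and both produce the same values and the same backward node.

```python
import torch

# Quick check: `x ** 2` and `torch.pow(x, 2)` build the same autograd op.
x = torch.randn(4, requires_grad=True)
a = x ** 2
b = torch.pow(x, 2)

print(torch.equal(a, b))                                     # True
print(type(a.grad_fn).__name__, type(b.grad_fn).__name__)    # both the same backward op (e.g. PowBackward0)
```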
```markdown
@@ -118,6 +119,18 @@ settings:
| `gail -> use_actions` | (default = `false`) Determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to True if you want the agent to mimic the actions from the demonstrations, and False if you'd rather have the agent visit the same states as in the demonstrations but with possibly different actions. Setting to False is more likely to be stable, especially with imperfect demonstrations, but may learn slower. |
| `gail -> use_vail` | (default = `false`) Enables a variational bottleneck within the GAIL discriminator. This forces the discriminator to learn a more general representation and reduces its tendency to be "too good" at discriminating, making learning more stable. However, it does increase training time. Enable this if you notice your imitation learning is unstable, or unable to learn the task at hand. |

### RND Intrinsic Reward

Random Network Distillation (RND) is only available for the PyTorch trainers.
```
Can you explain in more detail why a user would want to enable this? What scenarios does it help solve better, and what are the drawbacks/weaknesses?
What happens if you try to enable this with tensorflow? Or without torch installed?
You get an error that says the intrinsic reward signal type is not recognized.
Out of curiosity (no pun intended), have you tried this with Pyramids-SAC?

It does not do as well (it behaves a bit like Curiosity, which is also having a bad time).
```python
self.optimizer = torch.optim.Adam(
    self._training_network.parameters(), lr=settings.learning_rate
)
self._has_updated_once = False
```
I see we use this flag to zero out the reward before the training network has been trained to predict the random network. Is the idea that this initial reward is misleading in some way? If that's the case, I am not completely sold that that's true.
You are right. We do the same thing in Curiosity: if there has been no update yet, the reward is set to 0, because at that point the reward is just the MSE between two random networks. The first reward would reward any action equally, which in my opinion is the same as rewarding none. But I see your point. I can remove this part; I do not think it will have any impact on training.
If a user runs with `--resume`, the first reward would be ignored?
Yes, you are right. I see the issue. I will remove the `_has_updated_once` variable.
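For context, a minimal sketch (not the PR's exact code; the function and network names are illustrative) of how a flag like `_has_updated_once` gates the RND reward, and why a run restarted with `--resume` would have its first batch of rewards zeroed again:

```python
import torch

def rnd_reward(
    random_network: torch.nn.Module,
    training_network: torch.nn.Module,
    obs: torch.Tensor,
    has_updated_once: bool,
) -> torch.Tensor:
    # The RND intrinsic reward is the squared error between a fixed random
    # network and a trained predictor network on the same observations.
    with torch.no_grad():
        target = random_network(obs)
        prediction = training_network(obs)
        rewards = torch.sum((prediction - target) ** 2, dim=1)
    # Multiplying by the flag zeroes the reward until the first update has run;
    # after a restart with --resume the flag is False again, so the first
    # batch of rewards would also be zeroed.
    return rewards * float(has_updated_once)

# Stand-in networks for illustration only.
random_net = torch.nn.Linear(8, 16)
training_net = torch.nn.Linear(8, 16)
obs = torch.randn(32, 8)
print(rnd_reward(random_net, training_net, obs, has_updated_once=False))  # all zeros
```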
Do we not mind leaving the rewards unnormalized? (In the paper they use the STD of the full returns as a normalizer.)
Yes, you are right. I took some liberties with the original paper. For example, there is also a terminal reward for end states that I simply removed, and I do not consider the agent being done as zeroing the future rewards.
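For reference, a minimal sketch of the normalization the paper describes: divide the raw RND reward by a running estimate of the standard deviation of the discounted intrinsic return. The class and names here are illustrative, not from the PR or ML-Agents.

```python
import numpy as np

class IntrinsicRewardNormalizer:
    """Illustrative sketch: scale RND rewards by a running std of the
    discounted intrinsic return, roughly as in the RND paper."""

    def __init__(self, gamma: float = 0.99) -> None:
        self.gamma = gamma
        self.returns = 0.0
        self.count = 1e-4   # avoid division by zero before the first update
        self.mean = 0.0
        self.m2 = 0.0

    def _update_std(self, value: float) -> None:
        # Welford's online update of the variance of the running return.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def normalize(self, raw_rewards: np.ndarray) -> np.ndarray:
        for r in raw_rewards:
            self.returns = self.gamma * self.returns + float(r)
            self._update_std(self.returns)
        std = float(np.sqrt(self.m2 / self.count)) + 1e-8
        return raw_rewards / std


normalizer = IntrinsicRewardNormalizer()
print(normalizer.normalize(np.array([0.5, 1.2, 0.3])))
```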
Proposed change(s)
Adding Random Network Distillation to the Torch trainers
Tested with Pyramids (it gets to 1.7 reward consistently)
Useful links (Github issues, JIRA tickets, ML-Agents forum threads etc.)
Types of change(s)
Checklist
Other comments