docs/src/How_to_implement_a_new_algorithm.md
Implementing a new algorithm mainly consists of creating your own `AbstractPolicy` (or `AbstractLearner`, see [this section](#using-resources-from-rlcore)) subtype, its action sampling method (by overloading `Base.push!(policy::YourPolicyType, env)`), and implementing its behavior at each stage. However, ReinforcementLearning.jl provides plenty of pre-implemented utilities that you should use to 1) write less code, 2) lower the chances of bugs, and 3) make your code more understandable and maintainable (if you intend to contribute your algorithm).
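For orientation, here is a hypothetical skeleton of such a subtype. The struct, its `rng` field, and the random action choice are made up for illustration, and the action-sampling overload simply mirrors the signature given above:

```julia
using ReinforcementLearning
using Random

# A made-up policy that picks random actions; only the overall shape matters here.
struct MyNewPolicy <: AbstractPolicy
    rng::AbstractRNG
end

# Action sampling, using the overload named in the text above.
Base.push!(p::MyNewPolicy, env::AbstractEnv) = rand(p.rng, action_space(env))
```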
## Using Agents

The recommended way is to use the policy wrapper `Agent`. An agent is itself an `AbstractPolicy` that wraps a policy and a trajectory (also called an experience replay buffer in the RL literature). `Agent` comes with default implementations of `push!(agent, stage, env)` that will probably fit what you need at most stages, so that you don't have to write them again. Looking at the [source code](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/main/src/ReinforcementLearningCore/src/policies/agent.jl/), you can see what the default `Agent` methods do at each stage. If you need a different behavior at some stages, you can overload `Base.push!(agent, stage, env)` for your policy type at those stages.
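As a sketch (the policy type and the bookkeeping inside are hypothetical), overriding a single stage while keeping the default behavior elsewhere could look like this:

```julia
using ReinforcementLearning

struct MyAlgorithmPolicy <: AbstractPolicy end   # hypothetical policy type

# Custom end-of-episode behavior; all other stages keep the default Agent methods.
function Base.push!(agent::Agent{<:MyAlgorithmPolicy}, ::PostEpisodeStage, env::AbstractEnv)
    # e.g. record the terminal state in the trajectory
    push!(agent.trajectory, (state = state(env),))
end
```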
## Updating the policy

Finally, implement the learning function by overloading `RLBase.optimise!(::YourPolicyType, ::Stage, ::Trajectory)`. By default it does nothing at any stage. Overload it for the stage at which you wish to optimise (most often `PreActStage()`, `PostActStage()`, or `PostEpisodeStage()`). This function should loop over the trajectory to sample batches; inside the loop, put whatever is required. For example:
```julia
function RLBase.optimise!(p::YourPolicyType, ::PostEpisodeStage, traj::Trajectory)
    for batch in traj
        optimise!(p, batch)
    end
end
```
where `optimise!(p, batch)` is a function that will typically compute a gradient and update a neural network, or update a tabular policy. What happens inside the loop is up to you, but for clarity it is a good idea to implement an `optimise!(p::YourPolicyType, batch::NamedTuple)` method instead of coding everything in the loop. This is further discussed in the next section on `Trajectory`s. One case where the loop body differs is when you also need to update priorities; see [the PER learner](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/main/src/ReinforcementLearningZoo/src/algorithms/dqns/prioritized_dqn.jl) for an example.
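For illustration only, such a per-batch method could look like the following sketch; the policy struct, its fields, and the batch keys are all assumptions, not part of the package:

```julia
using ReinforcementLearning

# Hypothetical tabular policy; the fields are placeholders for illustration.
struct TabularQPolicy <: AbstractPolicy
    table::Matrix{Float64}   # Q-values, actions × states
    γ::Float64               # discount factor
    α::Float64               # learning rate
end

# Per-batch update, called from the optimise!(p, stage, traj) loop above.
# Terminal-state handling is ignored for brevity.
function RLBase.optimise!(p::TabularQPolicy, batch::NamedTuple)
    for (s, a, r, s′) in zip(batch.state, batch.action, batch.reward, batch.next_state)
        target = r + p.γ * maximum(@view p.table[:, s′])
        p.table[a, s] += p.α * (target - p.table[a, s])
    end
end
```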
## ReinforcementLearningTrajectories
The sampler is the object that will fetch data in your trajectory to create the batches.
## Using resources from RLCore
RL algorithms typically differ only in part and broadly share the same mechanisms. The subpackage RLCore contains utilities that you can reuse to implement your algorithm.
### QBasedPolicy
`QBasedPolicy` is a policy that wraps a Q-value _learner_ (tabular or approximated) and an _explorer_. Use this wrapper to implement a policy that directly uses a Q-value function to decide its next action. In that case, instead of creating an `AbstractPolicy` subtype for your algorithm, define an `AbstractLearner` subtype and specialize `RLBase.optimise!(::YourLearnerType, ::Stage, ::Trajectory)`. This way you will not have to code the interaction between your policy and the explorer yourself. RLCore provides the most common explorers (such as epsilon-greedy, UCB, etc.).
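A minimal sketch, assuming `QBasedPolicy` accepts `learner` and `explorer` keyword arguments and that an `EpsilonGreedyExplorer` is used; the learner type, its Q-table field, and the state-level `forward` method shown here are illustrative assumptions:

```julia
using ReinforcementLearning

# Hypothetical tabular Q-value learner.
struct MyQLearner <: AbstractLearner
    table::Matrix{Float64}   # Q-values, actions × states
end

# Q-values for a given state; the wrapped explorer turns them into an action.
RLCore.forward(L::MyQLearner, s) = @view L.table[:, s]

# Learn from the trajectory at the stage of your choice.
function RLBase.optimise!(L::MyQLearner, ::PostActStage, traj::Trajectory)
    # update L.table from sampled batches here
end

policy = QBasedPolicy(
    learner = MyQLearner(zeros(4, 16)),     # 4 actions, 16 states
    explorer = EpsilonGreedyExplorer(0.1),
)
```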
### Neural and linear approximators
If your algorithm uses a neural network or a linear approximator (trained with `Flux.jl`) to approximate a function, use the `Approximator`. It wraps a `Flux` model and an `Optimiser` (such as Adam or SGD). Your `optimise!(::PolicyOrLearner, batch)` function will then typically compute a gradient and call `RLCore.optimise!(app::Approximator, gradient::Flux.Grads)` with it.
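A rough sketch of that pattern; the `Approximator` constructor form, the loss, and the dummy batch below are placeholders rather than the package's exact API:

```julia
using Flux
using ReinforcementLearning

model = Chain(Dense(4, 32, relu), Dense(32, 2))
app = Approximator(model, Adam(1e-3))      # constructor form assumed; check the docstring

# Inside optimise!(p, batch): compute a gradient and hand it to the Approximator.
s = rand(Float32, 4, 32)                   # dummy batch of 32 states
targets = rand(Float32, 2, 32)             # dummy TD targets
ps = Flux.params(model)
gs = Flux.gradient(ps) do
    Flux.mse(model(s), targets)
end
RLCore.optimise!(app, gs)                  # applies the wrapped optimiser, as described above
```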
Common model architectures are also provided, such as `GaussianNetwork` for continuous policies with a diagonal multivariate Gaussian, and `CovGaussianNetwork` for a full covariance matrix (currently very slow on GPUs).
### Utils

In utils/distributions.jl you will find implementations of Gaussian log-probability functions that are GPU compatible and differentiable, and that avoid the overhead of using Distributions.jl structs.
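For intuition, a GPU-friendly, differentiable diagonal-Gaussian log-density in that spirit can be written as plain broadcasting; this is a generic illustration, not the function shipped in utils/distributions.jl:

```julia
# Elementwise Normal log-density: works on CPU or GPU arrays and is Zygote-differentiable,
# since it is plain broadcasting with no Distributions.jl structs involved.
gaussian_logpdf(μ, σ, x) = -((x .- μ) .^ 2) ./ (2 .* σ .^ 2) .- log.(σ) .- 0.5f0 * log(2f0 * π)

# Sum over action dimensions for a diagonal multivariate Gaussian (dims × batch).
diag_gaussian_logpdf(μ, σ, x) = sum(gaussian_logpdf(μ, σ, x); dims = 1)
```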
## Conventions
Finally, there are a few "conventions" and good practices that you should follow, especially if you intend to contribute to this package (don't worry, we'll be happy to help if needed).
### Random Numbers
ReinforcementLearning.jl aims to provide a framework for reproducible experiments. To that end, make sure that your policy type has a `rng` field and that all random operations (e.g. action sampling) use `rand(your_policy.rng, args...)`. For trajectory sampling, you can set the sampler's rng to that of the policy when creating the agent, or simply let the sampler instantiate its own rng.
### GPU compatibility

Deep RL algorithms are often much faster when the neural networks are updated on a GPU. For now, we only support CUDA.jl as a backend. This means that you will have to think about the transfer of data between the CPU (where the trajectory is) and the GPU memory (where the neural networks are). To do so, you will find in utils/device.jl some functions that do most of the work for you. The ones that you need to know are `send_to_device(device, data)`, which sends data to the specified device, `send_to_host(data)`, which sends data to the CPU memory (it falls back to `send_to_device(Val{:cpu}, data)`), and `device(x)`, which returns the device on which `x` is.
Normally, you should be able to write a single implementation of your algorithm that works on CPU and GPUs thanks to the multiple dispatch offered by Julia.
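As a sketch of the pattern with the three helpers just mentioned (the `model` argument stands for any Flux network and is not a package field):

```julia
using ReinforcementLearning

# Move a sampled batch to wherever the neural network lives, compute, then bring
# the result back to CPU memory, where the trajectory is stored.
function q_values_for_batch(model, batch_states)
    dev = device(model)                    # CPU or GPU, depending on the model
    s = send_to_device(dev, batch_states)  # move the batch next to the model
    q = model(s)
    return send_to_host(q)                 # back to the CPU side
end
```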
docs/src/tips.md
All the cells after the `#+ tangle=true` line in `Your_Experiment.jl` will be extracted into the `ReinforcementLearningExperiments` package automatically. This feature is supported by [Weave.jl](https://weavejl.mpastell.com/stable/usage/#tangle).
## How to enable debug timings for experiment runs?

Call `RLCore.TimerOutputs.enable_debug_timings(RLCore)` and default timings for hooks, policies and optimization steps will be printed. To reset the timer, call `RLCore.TimerOutputs.reset_timer!(RLCore.timer)`. To show the timer results, call `RLCore.timer`.
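For instance, a session could look like the following; only the calls quoted above are used, and the experiment itself is elided:

```julia
using ReinforcementLearning

RLCore.TimerOutputs.enable_debug_timings(RLCore)

# ... run your experiment here, e.g. run(policy, env, stop_condition, hook) ...

RLCore.timer                                     # show the collected timings
RLCore.TimerOutputs.reset_timer!(RLCore.timer)   # reset before the next run
```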
```julia
# Take Learner and Environment, get state, send to RLCore.forward(Learner, State)
forward(L::Le, env::E) where {Le<:AbstractLearner, E<:AbstractEnv} =
    env |> state |> send_to_device(L.approximator) |> x -> forward(L, x) |> send_to_device(env)

function RLBase.optimise!(::AbstractLearner, ::AbstractStage, ::Trajectory) end
```