
Commit 67bcfc4

Merge branch 'main' into mpo-imp
2 parents 9ebbbfa + b54a0b0 commit 67bcfc4

File tree

24 files changed (+295, -65 lines)

.cspell/julia_words.txt

Lines changed: 2 additions & 1 deletion
@@ -5294,4 +5294,5 @@ sqmahal
 logdpf
 devmode
 logpdfs
-kldivs
+kldivs
+Riedmiller

docs/src/How_to_implement_a_new_algorithm.md

Lines changed: 17 additions & 14 deletions
@@ -46,7 +46,7 @@ end
 
 ```
 
-Implementing a new algorithm mainly consists of creating your own `AbstractPolicy` subtype, its action sampling method (by overloading `Base.push!(policy::YourPolicyType, env)`) and implementing its behavior at each stage. However, ReinforcemementLearning.jl provides plenty of pre-implemented utilities that you should use to 1) have less code to write 2) lower the chances of bugs and 3) make your code more understandable and maintainable (if you intend to contribute your algorithm).
+Implementing a new algorithm mainly consists of creating your own `AbstractPolicy` (or `AbstractLearner`, see [this section](#using-resources-from-rlcore)) subtype, its action sampling method (by overloading `RLBase.plan!(policy::YourPolicyType, env)`) and implementing its behavior at each stage. However, ReinforcementLearning.jl provides plenty of pre-implemented utilities that you should use to 1) have less code to write, 2) lower the chances of bugs, and 3) make your code more understandable and maintainable (if you intend to contribute your algorithm).
 
 ## Using Agents
 The recommended way is to use the policy wrapper `Agent`. An agent is itself an `AbstractPolicy` that wraps a policy and a trajectory (also called Experience Replay Buffer in RL literature). Agent comes with default implementations of `push!(agent, stage, env)` that will probably fit what you need at most stages so that you don't have to write them again. Looking at the [source code](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/main/src/ReinforcementLearningCore/src/policies/agent.jl/), we can see that the default Agent calls are
@@ -73,7 +73,7 @@ If you need a different behavior at some stages, then you can overload the `Base
 
 ## Updating the policy
 
-Finally, you need to implement the learning function by implementing `RLBase.optimise!(::YourPolicyType, ::Stage, ::Trajectory)`. By default this does nothing at all stages. Overload it on the stage where you wish to optimise (most often, at `PreActStage` or `PostEpisodeStage`). This function should loop the trajectory to sample batches. Inside the loop, put whatever is required. For example:
+Finally, you need to implement the learning function by implementing `RLBase.optimise!(::YourPolicyType, ::Stage, ::Trajectory)`. By default this does nothing at all stages. Overload it on the stage where you wish to optimise (most often, at `PreActStage()`, `PostActStage()` or `PostEpisodeStage()`). This function should loop the trajectory to sample batches. Inside the loop, put whatever is required. For example:
 
 ```julia
 function RLBase.optimise!(p::YourPolicyType, ::PostEpisodeStage, traj::Trajectory)
@@ -83,7 +83,7 @@ function RLBase.optimise!(p::YourPolicyType, ::PostEpisodeStage, traj::Trajector
 end
 
 ```
-where `optimise!(p, batch)` is a function that will typically compute the gradient and update a neural network, or update tabular policy. What is inside the loop is free to be whatever you need. This is further discussed in the next section on `Trajectory`s.
+where `optimise!(p, batch)` is a function that will typically compute the gradient and update a neural network, or update a tabular policy. What is inside the loop is free to be whatever you need, but it is a good idea to implement an `optimise!(p::YourPolicyType, batch::NamedTuple)` function for clarity instead of coding everything in the loop. This is further discussed in the next section on `Trajectory`s. One case where you may need a different structure is when updating priorities; see [the PER learner](https://github.com/JuliaReinforcementLearning/ReinforcementLearning.jl/blob/main/src/ReinforcementLearningZoo/src/algorithms/dqns/prioritized_dqn.jl) for an example.
 
 ## ReinforcementLearningTrajectories
 
@@ -122,29 +122,32 @@ The sampler is the object that will fetch data in your trajectory to create the
 
 ## Using resources from RLCore
 
-RL algorithms typically only differ partially but broadly use the same mechanisms. The subpackage RLCore contains a lot of utilities that you can reuse to implement your algorithm.
+RL algorithms typically only differ partially but broadly use the same mechanisms. The subpackage RLCore contains some utilities that you can reuse to implement your algorithm.
 
-The utils folder contains utilities and extensions to external packages to fit needs that are specific to RL.jl. We will not list them all here, but it is a good idea to skim over the files to see what they contain. The policies folder notably contains several explorer implementations. Here are a few interesting examples:
+### QBasedPolicy
 
-- `QBasedPolicy` wraps a policy that relies on a Q-Value _learner_ (tabular or approximated) and an _explorer_ .
-RLCore provides several pre-implemented learners and the most common explorers (such as epsilon-greedy, UCB, etc.).
+`QBasedPolicy` is a policy that wraps a Q-value _learner_ (tabular or approximated) and an _explorer_. Use this wrapper to implement a policy that directly uses a Q-value function to
+decide its next action. In that case, instead of creating an `AbstractPolicy` subtype for your algorithm, define an `AbstractLearner` subtype and specialize `RLBase.optimise!(::YourLearnerType, ::Stage, ::Trajectory)`. This way you will not have to code the interaction between your policy and the explorer yourself.
+RLCore provides the most common explorers (such as epsilon-greedy, UCB, etc.).
 
-- If your algorithm use tabular learners, check out the tabular_learner.jl and the tabular_approximator source files. If your algorithms uses deep neural nets then use the `NeuralNetworkApproximator` to wrap an Neural Network and an optimizer. Common policy architectures are also provided such as the `GaussianNetwork`.
+### Neural and linear approximators
 
-- Equivalently, the `VBasedPolicy` learner is provided for algorithms that use a state-value function. Though they are not bundled in the same folder, most approximators can be used with a VBasedPolicy too.
+If your algorithm uses a neural network or a linear approximator to approximate a function trained with `Flux.jl`, use the `Approximator`. An `Approximator`
+wraps a `Flux` model and an `Optimiser` (such as Adam or SGD). Your `optimise!(::PolicyOrLearner, batch)` function will probably consist of computing a gradient
+and then calling `RLCore.optimise!(app::Approximator, gradient::Flux.Grads)`.
 
-<!--- ### Batch samplers
-Since this is going to be outdated soon, I'll write this part later on when Trajectories.jl will be done -->
+Common model architectures are also provided, such as the `GaussianNetwork` for continuous policies with diagonal multivariate Gaussian distributions, and the `CovGaussianNetwork` for full covariance (very slow on GPUs at the moment).
 
-- In utils/distributions.jl you will find implementations of gaussian log probabilities functions that are both GPU compatible and differentiable and that do not require the overhead of using Distributions.jl structs.
+### Utils
+In utils/distributions.jl you will find implementations of Gaussian log-probability functions that are both GPU compatible and differentiable and that do not require the overhead of using Distributions.jl structs.
 
 ## Conventions
 Finally, there are a few "conventions" and good practices that you should follow, especially if you intend to contribute to this package (don't worry we'll be happy to help if needed).
 
 ### Random Numbers
-ReinforcementLearning.jl aims to provide a framework for reproducible experiments. To do so, make sure that your policy type has a `rng` field and that all random operations (e.g. action sampling or trajectory sampling) use `rand(your_policy.rng, args...)`.
+ReinforcementLearning.jl aims to provide a framework for reproducible experiments. To do so, make sure that your policy type has a `rng` field and that all random operations (e.g. action sampling) use `rand(your_policy.rng, args...)`. For trajectory sampling, you can set the sampler's rng to that of the policy when creating an agent, or simply let it instantiate its own rng.
 
-### GPU friendlyness
+### GPU compatibility
 Deep RL algorithms are often much faster when the neural nets are updated on a GPU. For now, we only support CUDA.jl as a backend. This means that you will have to think about the transfer of data between the CPU (where the trajectory is) and the GPU memory (where the neural nets are). To do so you will find in utils/device.jl some functions that do most of the work for you. The ones that you need to know are `send_to_device(device, data)` that sends data to the specified device, `send_to_host(data)` which sends data to the CPU memory (it fallbacks to `send_to_device(Val{:cpu}, data)`) and `device(x)` that returns the device on which `x` is.
 Normally, you should be able to write a single implementation of your algorithm that works on CPU and GPUs thanks to the multiple dispatch offered by Julia.
 

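The QBasedPolicy and Approximator guidance added above can be pulled together in a small sketch. This is not part of the commit: `MyQLearner`, the forwarding method and all hyperparameters are illustrative assumptions; `QBasedPolicy`, `Approximator`, `AbstractLearner` and `EpsilonGreedyExplorer` are the RLCore names referenced in the diff.

```julia
using ReinforcementLearning
using Flux

# Hypothetical learner: it only needs to produce Q-values from a state.
Base.@kwdef struct MyQLearner <: AbstractLearner
    approximator::Approximator
end

# Assumed forwarding: delegate state -> Q-values to the wrapped Approximator's model.
RLCore.forward(learner::MyQLearner, state::AbstractArray) =
    RLCore.forward(learner.approximator, state)

policy = QBasedPolicy(
    learner = MyQLearner(
        approximator = Approximator(
            model = Chain(Dense(4 => 32, relu), Dense(32 => 2)),  # e.g. CartPole-sized in/out
            optimiser = Adam(),
        ),
    ),
    explorer = EpsilonGreedyExplorer(0.1),
)
```

Together with the `q_based_policy.jl` change further down in this commit, `optimise!(policy, stage, trajectory)` forwards to the learner, so the training logic lives entirely in `MyQLearner`'s `optimise!` methods.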
docs/src/tips.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -27,3 +27,8 @@ dependency, remember to update both `docs/Project.toml` and
2727
All the cells after the `#+ tangle=true` line in `Your_Experment.jl` will be extracted into the
2828
`ReinforcementLearningExperiments` package automatically. This feature is
2929
supported by [Weave.jl](https://weavejl.mpastell.com/stable/usage/#tangle).
30+
31+
## How to enable debug timings for experiment runs?
32+
33+
Call `RLCore.TimerOutputs.enable_debug_timings(RLCore)` and default timings for hooks, policies and optimization steps will be printed. How do I reset the timer? Call `RLCore.TimerOutputs.reset_timer!(RLCore.timer)`. How do I show the timer results? Call `RLCore.timer`.
34+

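A usage sketch of this tip (not part of the commit). The policy, environment and stop condition below are illustrative assumptions from the wider ReinforcementLearning.jl ecosystem; only the `RLCore.TimerOutputs` calls come from the tip itself:

```julia
using ReinforcementLearning

RLCore.TimerOutputs.enable_debug_timings(RLCore)  # switch on the debug timers inside RLCore
RLCore.TimerOutputs.reset_timer!(RLCore.timer)    # optional: start from a clean timer

# Any experiment run works; this one is just a placeholder.
run(RandomPolicy(), CartPoleEnv(), StopAfterEpisode(10))

RLCore.timer  # display the collected timings (reset!, plan!, act!, optimise!, hooks, ...)
```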
src/ReinforcementLearningCore/Project.toml

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,6 @@
 name = "ReinforcementLearningCore"
 uuid = "de1b191a-4ae0-4afa-a27b-92d07f46b2d6"
-version = "0.11.0"
+version = "0.11.2"
 
 [deps]
 AbstractTrees = "1520ce14-60c1-5f80-bbc7-55ef81b5835c"
@@ -22,6 +22,7 @@ ReinforcementLearningBase = "e575027e-6cd6-5018-9292-cdc6200d2b44"
 ReinforcementLearningTrajectories = "6486599b-a3cd-4e92-a99a-2cea90cc8c3c"
 Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
 StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
+TimerOutputs = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f"
 UnicodePlots = "b8865327-cd53-5732-bb35-84acbb429228"
 
 [compat]
@@ -41,6 +42,7 @@ Reexport = "1"
 ReinforcementLearningBase = "0.12"
 ReinforcementLearningTrajectories = "^0.1.9"
 StatsBase = "0.32, 0.33, 0.34"
+TimerOutputs = "0.5"
 UnicodePlots = "1.3, 2, 3"
 julia = "1.9"
 

src/ReinforcementLearningCore/src/ReinforcementLearningCore.jl

Lines changed: 4 additions & 0 deletions
@@ -1,5 +1,6 @@
 module ReinforcementLearningCore
 
+using TimerOutputs
 using ReinforcementLearningBase
 using Reexport
 
@@ -14,4 +15,7 @@ include("core/core.jl")
 include("policies/policies.jl")
 include("utils/utils.jl")
 
+# Global timer for TimerOutputs.jl
+const timer = TimerOutput()
+
 end # module

src/ReinforcementLearningCore/src/core/run.jl

Lines changed: 20 additions & 19 deletions
@@ -87,37 +87,38 @@ function _run(policy::AbstractPolicy,
     push!(policy, PreExperimentStage(), env)
     is_stop = false
     while !is_stop
-        reset!(env)
-        push!(policy, PreEpisodeStage(), env)
-        optimise!(policy, PreEpisodeStage())
-        push!(hook, PreEpisodeStage(), policy, env)
+        # NOTE: @timeit_debug statements are used for debug logging
+        @timeit_debug timer "reset!" reset!(env)
+        @timeit_debug timer "push!(policy) PreEpisodeStage" push!(policy, PreEpisodeStage(), env)
+        @timeit_debug timer "optimise! PreEpisodeStage" optimise!(policy, PreEpisodeStage())
+        @timeit_debug timer "push!(hook) PreEpisodeStage" push!(hook, PreEpisodeStage(), policy, env)


         while !reset_condition(policy, env) # one episode
-            push!(policy, PreActStage(), env)
-            optimise!(policy, PreActStage())
-            push!(hook, PreActStage(), policy, env)
+            @timeit_debug timer "push!(policy) PreActStage" push!(policy, PreActStage(), env)
+            @timeit_debug timer "optimise! PreActStage" optimise!(policy, PreActStage())
+            @timeit_debug timer "push!(hook) PreActStage" push!(hook, PreActStage(), policy, env)

-            action = RLBase.plan!(policy, env)
-            act!(env, action)
+            action = @timeit_debug timer "plan!" RLBase.plan!(policy, env)
+            @timeit_debug timer "act!" act!(env, action)

-            push!(policy, PostActStage(), env)
-            optimise!(policy, PostActStage())
-            push!(hook, PostActStage(), policy, env)
+            @timeit_debug timer "push!(policy) PostActStage" push!(policy, PostActStage(), env)
+            @timeit_debug timer "optimise! PostActStage" optimise!(policy, PostActStage())
+            @timeit_debug timer "push!(hook) PostActStage" push!(hook, PostActStage(), policy, env)

             if check_stop(stop_condition, policy, env)
                 is_stop = true
-                push!(policy, PreActStage(), env)
-                optimise!(policy, PreActStage())
-                push!(hook, PreActStage(), policy, env)
-                RLBase.plan!(policy, env) # let the policy see the last observation
+                @timeit_debug timer "push!(policy) PreActStage" push!(policy, PreActStage(), env)
+                @timeit_debug timer "optimise! PreActStage" optimise!(policy, PreActStage())
+                @timeit_debug timer "push!(hook) PreActStage" push!(hook, PreActStage(), policy, env)
+                @timeit_debug timer "plan!" RLBase.plan!(policy, env) # let the policy see the last observation
                 break
             end
         end # end of an episode

-        push!(policy, PostEpisodeStage(), env) # let the policy see the last observation
-        optimise!(policy, PostEpisodeStage())
-        push!(hook, PostEpisodeStage(), policy, env)
+        @timeit_debug timer "push!(policy) PostEpisodeStage" push!(policy, PostEpisodeStage(), env) # let the policy see the last observation
+        @timeit_debug timer "optimise! PostEpisodeStage" optimise!(policy, PostEpisodeStage())
+        @timeit_debug timer "push!(hook) PostEpisodeStage" push!(hook, PostEpisodeStage(), policy, env)

     end
     push!(policy, PostExperimentStage(), env)

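For context on the `@timeit_debug` pattern introduced above: it records into the module's global `timer` only once debug timings are enabled for that module (see the tips.md change earlier in this commit), and otherwise compiles to a no-op. A standalone sketch of the mechanism, with a hypothetical module and function name:

```julia
using TimerOutputs

module TimingDemo
    using TimerOutputs
    const timer = TimerOutput()          # mirrors RLCore's global `timer`
    step!() = @timeit_debug timer "step!" sum(rand(1_000))
end

TimingDemo.step!()                             # does nothing: debug timings are off
TimerOutputs.enable_debug_timings(TimingDemo)  # turn the debug timers on for this module
TimingDemo.step!()                             # now "step!" is recorded in the timer
show(TimingDemo.timer)                         # print the collected timings
```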
src/ReinforcementLearningCore/src/policies/agent/multi_agent.jl

Lines changed: 20 additions & 19 deletions
@@ -108,34 +108,35 @@ function Base.run(
     push!(multiagent_policy, PreExperimentStage(), env)
     is_stop = false
     while !is_stop
-        reset!(env)
-        push!(multiagent_policy, PreEpisodeStage(), env)
-        optimise!(multiagent_policy, PreEpisodeStage())
-        push!(multiagent_hook, PreEpisodeStage(), multiagent_policy, env)
+        # NOTE: @timeit_debug statements are for debug logging
+        @timeit_debug timer "reset!" reset!(env)
+        @timeit_debug timer "push!(policy) PreEpisodeStage" push!(multiagent_policy, PreEpisodeStage(), env)
+        @timeit_debug timer "optimise! PreEpisodeStage" optimise!(multiagent_policy, PreEpisodeStage())
+        @timeit_debug timer "push!(hook) PreEpisodeStage" push!(multiagent_hook, PreEpisodeStage(), multiagent_policy, env)

         while !(reset_condition(multiagent_policy, env) || is_stop) # one episode
             for player in CurrentPlayerIterator(env)
                 policy = multiagent_policy[player] # Select appropriate policy
                 hook = multiagent_hook[player] # Select appropriate hook
-                push!(policy, PreActStage(), env)
-                optimise!(policy, PreActStage())
-                push!(hook, PreActStage(), policy, env)
+                @timeit_debug timer "push!(policy) PreActStage" push!(policy, PreActStage(), env)
+                @timeit_debug timer "optimise! PreActStage" optimise!(policy, PreActStage())
+                @timeit_debug timer "push!(hook) PreActStage" push!(hook, PreActStage(), policy, env)

-                action = RLBase.plan!(policy, env)
-                act!(env, action)
+                action = @timeit_debug timer "plan!" RLBase.plan!(policy, env)
+                @timeit_debug timer "act!" act!(env, action)



-                push!(policy, PostActStage(), env)
-                optimise!(policy, PostActStage())
-                push!(hook, PostActStage(), policy, env)
+                @timeit_debug timer "push!(policy) PostActStage" push!(policy, PostActStage(), env)
+                @timeit_debug timer "optimise! PostActStage" optimise!(policy, PostActStage())
+                @timeit_debug timer "push!(hook) PostActStage" push!(hook, PostActStage(), policy, env)

                 if check_stop(stop_condition, policy, env)
                     is_stop = true
-                    push!(multiagent_policy, PreActStage(), env)
-                    optimise!(multiagent_policy, PreActStage())
-                    push!(multiagent_hook, PreActStage(), policy, env)
-                    RLBase.plan!(multiagent_policy, env) # let the policy see the last observation
+                    @timeit_debug timer "push!(policy) PreActStage" push!(multiagent_policy, PreActStage(), env)
+                    @timeit_debug timer "optimise! PreActStage" optimise!(multiagent_policy, PreActStage())
+                    @timeit_debug timer "push!(hook) PreActStage" push!(multiagent_hook, PreActStage(), policy, env)
+                    @timeit_debug timer "plan!" RLBase.plan!(multiagent_policy, env) # let the policy see the last observation
                     break
                 end

@@ -145,9 +146,9 @@ function Base.run(
             end
         end # end of an episode

-        push!(multiagent_policy, PostEpisodeStage(), env) # let the policy see the last observation
-        optimise!(multiagent_policy, PostEpisodeStage())
-        push!(multiagent_hook, PostEpisodeStage(), multiagent_policy, env)
+        @timeit_debug timer "push!(policy) PostEpisodeStage" push!(multiagent_policy, PostEpisodeStage(), env) # let the policy see the last observation
+        @timeit_debug timer "optimise! PostEpisodeStage" optimise!(multiagent_policy, PostEpisodeStage())
+        @timeit_debug timer "push!(hook) PostEpisodeStage" push!(multiagent_hook, PostEpisodeStage(), multiagent_policy, env)
     end
     push!(multiagent_policy, PostExperimentStage(), env)
     push!(multiagent_hook, PostExperimentStage(), multiagent_policy, env)

src/ReinforcementLearningCore/src/policies/learners.jl

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@ Base.show(io::IO, m::MIME"text/plain", L::AbstractLearner) = show(io, m, convert
 # Take Learner and Environment, get state, send to RLCore.forward(Learner, State)
 forward(L::Le, env::E) where {Le <: AbstractLearner, E <: AbstractEnv} = env |> state |> send_to_device(L.approximator) |> x -> forward(L, x) |> send_to_device(env)
 
+function RLBase.optimise!(::AbstractLearner, ::AbstractStage, ::Trajectory) end
+
 Base.@kwdef mutable struct Approximator{M,O}
     model::M
     optimiser::O

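The new no-op fallback means a custom learner only needs to specialize `optimise!` for the stage it actually trains on; every other stage silently does nothing. A minimal sketch (the learner type and its field are hypothetical):

```julia
using ReinforcementLearning

# Hypothetical learner relying on the fallback above for all other stages.
struct MyTDLearner <: AbstractLearner
    approximator::Approximator
end

# Only PostActStage triggers training; every other stage hits the no-op fallback.
function RLBase.optimise!(learner::MyTDLearner, ::PostActStage, trajectory::Trajectory)
    for batch in trajectory
        # compute a TD target from `batch` and update learner.approximator here
    end
end
```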
src/ReinforcementLearningCore/src/policies/q_based_policy.jl

Lines changed: 7 additions & 5 deletions
@@ -7,6 +7,11 @@ using Functors: @functor
 
 """
     QBasedPolicy(;learner, explorer)
+
+Wraps a learner and an explorer. The learner is a struct that should predict the Q-value of each legal
+action of an environment at its current state. It is typically a table or a neural network.
+QBasedPolicy can be queried for an action with `RLBase.plan!`, the explorer will affect the action selection
+accordingly.
 """
 Base.@kwdef mutable struct QBasedPolicy{L,E} <: AbstractPolicy
     "estimate the Q value"
@@ -37,8 +42,5 @@ end
 RLBase.prob(p::QBasedPolicy{L,Ex}, env::AbstractEnv) where {L<:AbstractLearner,Ex<:AbstractExplorer} =
     prob(p.explorer, forward(p.learner, env), legal_action_space_mask(env))
 
-function RLBase.optimise!(p::QBasedPolicy{L,Ex}, ::PostActStage, trajectory::Trajectory) where {L<:AbstractLearner,Ex<:AbstractExplorer}
-    for batch in trajectory
-        RLBase.optimise!(p.learner, batch)
-    end
-end
+#the internal learner defines the optimization stage.
+RLBase.optimise!(p::QBasedPolicy, s::AbstractStage, trajectory::Trajectory) = RLBase.optimise!(p.learner, s, trajectory)
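
A short note on what the new delegation gives you, reusing the hypothetical `policy` built from `MyQLearner` in the sketch earlier on this page and some `trajectory` of yours:

```julia
# Delegation added in this hunk: the wrapped learner decides when training happens.
RLBase.optimise!(policy, PostActStage(), trajectory)
# is now equivalent to
RLBase.optimise!(policy.learner, PostActStage(), trajectory)
# and does nothing unless the learner specializes optimise! for PostActStage.
```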
