Class
NeuralQLearner
DQN-based agent.
Inherits from: BaseAgent
This agent mixes a deep neural network with Q-learning, as described in the Nature letter "Human-level control through deep reinforcement learning" (Mnih et al.).
Author: Alexis BRENON <alexis.brenon@imag.fr>
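In use, the agent is driven through the perceive/act/reward cycle exposed by the public methods documented below. A minimal sketch of such a loop, where the environment object env and its observe/step methods are hypothetical placeholders:

    local agent = NeuralQLearner(args)  -- `args` is an InitArguments table
    agent:training()
    while true do
        -- `env` and its methods are hypothetical placeholders
        local observation, terminal = env:observe()
        agent:integrate_observation({observation = observation, terminal = terminal})
        local action = agent:get_action()
        agent:give_reward(env:step(action))
    end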
Data Types
ClassArgument
Arguments used to instantiate a network (see the preprocess and inference fields of InitArguments).

Dump
Dump extracted from a NeuralQLearner. All tensors are converted to the default type to avoid GPU incompatibilities.
InferenceNetwork
Main deep neural network. This network is used to get the best action given a history of preprocessed states.
Fields:
- nn.Module network: The actual network
- table input_size: The size of the input {d, w, h}
- table output_size: The size of the output {w, h}
- torch.Tensor parameters: Flat view of the learnable parameters
- torch.Tensor grad_parameters: Flat view of the gradients of the energy w.r.t. the learnable parameters
InitArguments
Table used as arguments for the DQN constructor.
Fields:
- {int,...} observation_size: Size of the observations from the environment {d, w, h}
- table actions: Available actions from the environment (actions[0] is noop)
- ClassArgument preprocess: Parameters of the network used to preprocess states (optional)
- ClassArgument inference: Parameters of the network used for inference
- _ExperiencePool.InitArguments experience_pool: Parameters of the memory of the agent
- number learn_start: Number of steps after which learning starts (default: 0)
- number update_freq: Learning frequency (epoch size) (default: 1)
- number minibatch_size: Number of samples to take to learn (default: 1)
- number n_replay: Number of minibatch learnings during a learning epoch (default: 1)
- boolean rescale_r: Scale rewards (default: false)
- number max_reward: Reward maximum value clipping (optional)
- number min_reward: Reward minimum value clipping (optional)
- number ep_start: Initial value of epsilon (default: 1)
- number ep_end: Final value of epsilon (default: ep_start)
- number ep_endt: Epsilon annealing time (default: 1000000)
- number ep_eval: Epsilon value when evaluating (default: 0.01)
- number lr: Learning rate (default: 0.001)
- number discount: Q-learning discount factor (0 < x < 1) (default: 0.99)
- number clip_delta: Clipping value for delta (default: nil)
- number target_q: How long a target network is valid (default: nil)
- number wc: L2 weight cost (default: 0)
- RMSPropArgument rmsprop: Pre-initialized RMSProp arguments (default: {})
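For illustration, a hypothetical argument table might look like the following; the action list and the numeric values are made-up placeholders, and the contents of the ClassArgument and experience-pool tables are project-specific:

    local args = {
        observation_size = {3, 84, 84},   -- {d, w, h}
        actions = {[0] = 'noop', 'up', 'down', 'left', 'right'},
        inference = {},         -- ClassArgument for the main network
        experience_pool = {},   -- _ExperiencePool.InitArguments
        learn_start = 50000,    -- fill the experience pool before learning
        update_freq = 4,
        minibatch_size = 32,
        max_reward = 1,
        min_reward = -1,        -- clip rewards to [-1, 1]
        ep_start = 1,
        ep_end = 0.1,
        ep_endt = 1000000,
        lr = 0.00025,
        discount = 0.99,
        clip_delta = 1,
        target_q = 10000,       -- refresh the target network every 10000 steps
    }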
{int,...}
A list of integers (here, the sizes of a tensor's dimensions).

PreprocessNetwork
Network used to preprocess raw observations before they are fed to the inference network.
RMSPropArgument
Parameters for the RMSProp implementation.
Todo: Implement gradient descent in sub-classes?
Fields:
- torch.Tensor mean_square: Accumulated average of the squared gradient
- torch.Tensor mean: Accumulated average of the gradient
- number decay: Decay factor of the means
- number mu: Smoothing term
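These fields map naturally onto the centered RMSProp variant used in the DQN line of work. A minimal sketch of one update step, assuming flat tensors w (parameters) and dw (gradients) as provided by InferenceNetwork:

    local function rmsprop_step(w, dw, lr, rmsprop)
        -- Running averages of the gradient and of its square
        rmsprop.mean:mul(rmsprop.decay):add(1 - rmsprop.decay, dw)
        rmsprop.mean_square:mul(rmsprop.decay):addcmul(1 - rmsprop.decay, dw, dw)
        -- Centered second-moment estimate, smoothed by `mu`
        local tmp = rmsprop.mean_square - torch.cmul(rmsprop.mean, rmsprop.mean)
        tmp:add(rmsprop.mu):sqrt()
        -- Descend along the gradient, scaled by the root mean square
        w:addcdiv(-lr, dw, tmp)
    end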
Fields

- function _convert_tensor: Function to convert tensors if necessary. This function must be called to convert tensors/networks to the appropriate format (CUDA or default tensor type) to avoid computation errors caused by inconsistent types.
- number clip_delta: Clipping value for differences between expected Q-values and actual Q-values.
- number discount: Q-learning discount factor (0 < x < 1).
- number ep: Current epsilon value (see the annealing sketch after this list).
- number ep_end: Final value of epsilon.
- number ep_endt: Epsilon annealing time.
- number ep_eval: Epsilon value when evaluating.
- number ep_start: Initial value of epsilon.
- agent._ExperiencePool experience_pool: Experience pool recording interactions. This experience pool acts as a memory for the agent, recording interactions and returning them when necessary (e.g. when learning).
- number experienced_steps: Number of perceived states.
- InferenceNetwork inference: Main neural network.
- number learn_start: Number of steps after which learning starts. Adds a delay so the experience pool can be populated first.
- number learning_epoch: Number of times the agent has learned.
- number lr: Learning rate.
- number max_reward: Reward maximum value clipping.
- number min_reward: Reward minimum value clipping.
- number minibatch_size: Number of samples to take to learn.
- number n_replay: Number of minibatch learnings during a learning epoch.
- PreprocessNetwork preprocess: Preprocessing network.
- number r_max: Maximal encountered reward, used for reward scaling. See also: rescale_r.
- boolean rescale_r: Scale rewards (see r_max).
- RMSPropArgument rmsprop: Parameters for the RMSProp implementation. Todo: Use dedicated classes for GD implementations.
- nn.Container target_network: Current target network.
- number target_q: How long a target network is valid. If not nil, a target network is used during Q-learning to improve the algorithm's convergence.
- number update_freq: Learning frequency (epoch size).
- number wc: L2 weight cost.
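The epsilon-related fields above describe a linear annealing schedule. A sketch of how the current value ep could be derived from them (the exact formula is an assumption, modeled on the original DeepMind DQN code):

    local function current_epsilon(self)
        -- Steps elapsed since learning started
        local steps = math.max(0, self.experienced_steps - self.learn_start)
        -- Fraction of the annealing period still remaining
        local remaining = math.max(0, (self.ep_endt - steps) / self.ep_endt)
        -- Interpolate linearly from ep_start down to ep_end
        return self.ep_end + (self.ep_start - self.ep_end) * remaining
    end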
Metamethods

__init ( args[, dump={}] )
Default constructor.
Parameters:
- InitArguments args
- Dump dump (default: {})
Public Methods

dump ( cycles )
Dump the agent. Redundant information (like network parameters) is removed to save space. Tensors are converted to the default type to avoid GPU incompatibilities.
Overrides: ArcadesComponent:dump
Parameters:
- table cycles: Set of already dumped objects
Returns:
- Dump: A reloadable dump

evaluate ()
Put the agent in evaluation mode. The agent will use a separate experience pool and a dedicated epsilon value.
Overrides: BaseAgent:evaluate
Returns:
- self
get_action ()
Get the action to execute for the current state (see _eGreedy).
get_experienced_interactions ()
Return how many interactions the agent has lived.
Overrides: BaseAgent:get_experienced_interactions
Returns:
- number: Number of interactions done

get_learned_epoch ()
Return how many times the agent actually learned from its experience.
Overrides: BaseAgent:get_learned_epoch
Returns:
- number: Number of times the agent learned
give_reward ( reward )
Reward or punish the agent.
Overrides: BaseAgent:give_reward
Parameters:
- number reward: Reward if positive, punishment if negative
Returns:
- self
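Taken together with the max_reward, min_reward, rescale_r and r_max fields, the reward presumably goes through clipping and rescaling before being stored; a sketch of that step (the exact behavior is an assumption):

    local function clamp_reward(self, reward)
        -- Clip the reward into [min_reward, max_reward] when configured
        if self.max_reward then reward = math.min(reward, self.max_reward) end
        if self.min_reward then reward = math.max(reward, self.min_reward) end
        if self.rescale_r then
            -- Track the largest reward magnitude seen so far and scale by it
            self.r_max = math.max(self.r_max or 1, math.abs(reward))
            reward = reward / self.r_max
        end
        return reward
    end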
integrate_observation ( state )
Integrate the current observation from the environment.
Overrides: BaseAgent:integrate_observation
Parameters:
- state: The current state of the environment
  - torch.Tensor observation: The actual observations
  - boolean terminal: Is this state terminal?
Returns:
- self
training ()
Put the agent in training mode. This is the default mode of an agent.
Overrides: BaseAgent:training
Returns:
- self
Private Methods

_eGreedy ( state )
Get the action to execute according to an epsilon-greedy policy.
Todo: Use dedicated classes for strategies.
Parameters:
- state: Current state
Returns:
- number: Chosen action
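A sketch of the policy, assuming self.ep holds the current epsilon value and the available actions are indexed 1..n:

    local function _eGreedy(self, state)
        if torch.uniform() < self.ep then
            -- Explore: uniformly random action
            return torch.random(1, self.n_actions)  -- `n_actions` is a placeholder
        else
            -- Exploit: best action according to the inference network
            return self:_greedy(state)
        end
    end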
_getQUpdate ( args )
Compute expected action-values as targets of the neural network.
Parameters:
- args: An interaction sample
  - s: Initial state
  - a: Action
  - r: Reward
  - s2: Final state
  - t: Is the state terminal?
Returns:
- torch.Tensor: Expected action-values
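Conceptually this implements the standard Q-learning target from the Nature paper. A per-transition sketch, writing Q' for the target network (or the inference network when target_q is nil):

    -- Target for one transition (s, a, r, s2, t):
    --   target = r                                   if s2 is terminal (t)
    --   target = r + discount * max_a' Q'(s2, a')    otherwise
    local function q_target(self, r, s2, t)
        if t then return r end
        local q2_max = self.target_network:forward(s2):max()
        return r + self.discount * q2_max
        -- The difference (target - Q(s, a)) is then bounded by clip_delta
        -- before back-propagation.
    end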
_greedy ( state )
Get the action to execute according to a greedy policy.
Parameters:
- state: Current state
Returns:
- number: Chosen action
_init_inference_network ( args[, dump={}] )
Initialize the main inference network.
Parameters:
- ClassArgument args
- Dump dump (default: {})
_init_preprocessing_network ( args, observation_size )
Initialize the preprocessing network.
Parameters:
- ClassArgument args
- table observation_size: Size of the input tensor {d, w, h}
_learn ()
Make the agent learn from its recorded experience (see _qLearnMinibatch).
_qLearnMinibatch ()
Apply Q-Learning on a minibatch.
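A sketch of how the documented pieces could fit together in this step; the experience-pool sampling API is an assumption, and rmsprop_step refers to the sketch given with RMSPropArgument:

    local function _qLearnMinibatch(self)
        local batch = self.experience_pool:sample(self.minibatch_size)  -- assumed API
        -- `targets` acts as the gradient of the loss w.r.t. the network
        -- outputs, as in the DeepMind DQN code
        local targets = self:_getQUpdate(batch)
        self.inference.grad_parameters:zero()
        self.inference.network:forward(batch.s)
        self.inference.network:backward(batch.s, targets)
        -- L2 weight cost configured by `wc`
        self.inference.grad_parameters:add(self.wc, self.inference.parameters)
        rmsprop_step(self.inference.parameters, self.inference.grad_parameters,
                     self.lr, self.rmsprop)
    end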