Class
NeuralQLearner
DQN-based agent.
Inherits from: BaseAgent
This agent mixes a deep neural network with Q-learning, as described in the Nature letter "Human-level control through deep reinforcement learning" (Mnih et al.).
Author: Alexis BRENON <alexis.brenon@imag.fr>
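In use, the agent is driven through the perceive/act/reward cycle exposed by the public methods documented below. A minimal sketch of such a loop, where the environment object env and its observe/step methods are hypothetical placeholders:

    local agent = NeuralQLearner(args)  -- `args` is an InitArguments table
    agent:training()
    while true do
        -- `env` and its methods are hypothetical placeholders
        local observation, terminal = env:observe()
        agent:integrate_observation({observation = observation, terminal = terminal})
        local action = agent:get_action()
        agent:give_reward(env:step(action))
    end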
Data Types
ClassArgument
Arguments used to instantiate a network (see the preprocess and inference fields of InitArguments).

Dump
Dump extracted from a NeuralQLearner. All tensors are converted to the default type to avoid GPU incompatibilities.
InferenceNetwork
Main deep neural network. This network is used to get the best action given a history of preprocessed states.
Fields:
- nn.Module network: The actual network
- table input_size: The size of the input {d, w, h}
- table output_size: The size of the output {w, h}
- torch.Tensor parameters: Flat view of the learnable parameters
- torch.Tensor grad_parameters: Flat view of the gradients of the energy w.r.t. the learnable parameters
InitArguments
Table used as arguments for the DQN constructor.
Fields:
- {int,...} observation_size: Size of the observations from the environment {d, w, h}
- table actions: Available actions from the environment (actions[0] is noop)
- ClassArgument preprocess: Parameters of the network used to preprocess states (optional)
- ClassArgument inference: Parameters of the network used for inference
- _ExperiencePool.InitArguments experience_pool: Parameters of the memory of the agent
- number learn_start: Number of steps after which learning starts (default: 0)
- number update_freq: Learning frequency (epoch size) (default: 1)
- number minibatch_size: Number of samples to take to learn (default: 1)
- number n_replay: Number of minibatch learnings during a learning epoch (default: 1)
- boolean rescale_r: Scale rewards (default: false)
- number max_reward: Reward maximum value clipping (optional)
- number min_reward: Reward minimum value clipping (optional)
- number ep_start: Initial value of epsilon (default: 1)
- number ep_end: Final value of epsilon (default: ep_start)
- number ep_endt: Epsilon annealing time (default: 1000000)
- number ep_eval: Epsilon value when evaluating (default: 0.01)
- number lr: Learning rate (default: 0.001)
- number discount: Q-learning discount factor (0 < x < 1) (default: 0.99)
- number clip_delta: Clipping value for delta (default: nil)
- number target_q: How long a target network is valid (default: nil)
- number wc: L2 weight cost (default: 0)
- RMSPropArgument rmsprop: Pre-initialized RMSProp arguments (default: {})
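For illustration, a hypothetical argument table might look like the following; the action list and the numeric values are made-up placeholders, and the contents of the ClassArgument and experience-pool tables are project-specific:

    local args = {
        observation_size = {3, 84, 84},   -- {d, w, h}
        actions = {[0] = 'noop', 'up', 'down', 'left', 'right'},
        inference = {},         -- ClassArgument for the main network
        experience_pool = {},   -- _ExperiencePool.InitArguments
        learn_start = 50000,    -- fill the experience pool before learning
        update_freq = 4,
        minibatch_size = 32,
        max_reward = 1,
        min_reward = -1,        -- clip rewards to [-1, 1]
        ep_start = 1,
        ep_end = 0.1,
        ep_endt = 1000000,
        lr = 0.00025,
        discount = 0.99,
        clip_delta = 1,
        target_q = 10000,       -- refresh the target network every 10000 steps
    }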
{int,...}
A list of integers (here, the sizes of a tensor's dimensions).

PreprocessNetwork
Network used to preprocess raw observations before they are fed to the inference network.
RMSPropArgument
Parameters for the RMSProp implementation.
Todo: Implement gradient descent in sub-classes?
Fields:
- torch.Tensor mean_square: Accumulated average of the squared gradient
- torch.Tensor mean: Accumulated average of the gradient
- number decay: Decay factor of the means
- number mu: Smoothing term
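These fields map naturally onto the centered RMSProp variant used in the DQN line of work. A minimal sketch of one update step, assuming flat tensors w (parameters) and dw (gradients) as provided by InferenceNetwork:

    local function rmsprop_step(w, dw, lr, rmsprop)
        -- Running averages of the gradient and of its square
        rmsprop.mean:mul(rmsprop.decay):add(1 - rmsprop.decay, dw)
        rmsprop.mean_square:mul(rmsprop.decay):addcmul(1 - rmsprop.decay, dw, dw)
        -- Centered second-moment estimate, smoothed by `mu`
        local tmp = rmsprop.mean_square - torch.cmul(rmsprop.mean, rmsprop.mean)
        tmp:add(rmsprop.mu):sqrt()
        -- Descend along the gradient, scaled by the root mean square
        w:addcdiv(-lr, dw, tmp)
    end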
Fields

- function _convert_tensor: Function to convert tensors if necessary. This function must be called to convert tensors/networks to the appropriate format (CUDA or default tensor type) to avoid computation errors caused by inconsistent types.
- number clip_delta: Clipping value for differences between expected Q-values and actual Q-values.
- number discount: Q-learning discount factor (0 < x < 1).
- number ep: Current epsilon value (see the annealing sketch after this list).
- number ep_end: Final value of epsilon.
- number ep_endt: Epsilon annealing time.
- number ep_eval: Epsilon value when evaluating.
- number ep_start: Initial value of epsilon.
- agent._ExperiencePool experience_pool: Experience pool recording interactions. This experience pool acts as a memory for the agent, recording interactions and returning them when necessary (e.g. when learning).
- number experienced_steps: Number of perceived states.
- InferenceNetwork inference: Main neural network.
- number learn_start: Number of steps after which learning starts. Adds a delay so the experience pool can be populated first.
- number learning_epoch: Number of times the agent has learned.
- number lr: Learning rate.
- number max_reward: Reward maximum value clipping.
- number min_reward: Reward minimum value clipping.
- number minibatch_size: Number of samples to take to learn.
- number n_replay: Number of minibatch learnings during a learning epoch.
- PreprocessNetwork preprocess: Preprocessing network.
- number r_max: Maximal encountered reward, used for reward scaling. See also: rescale_r.
- boolean rescale_r: Scale rewards (see r_max).
- RMSPropArgument rmsprop: Parameters for the RMSProp implementation. Todo: Use dedicated classes for GD implementations.
- nn.Container target_network: Current target network.
- number target_q: How long a target network is valid. If not nil, a target network is used during Q-learning to improve the algorithm's convergence.
- number update_freq: Learning frequency (epoch size).
- number wc: L2 weight cost.
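The epsilon-related fields above describe a linear annealing schedule. A sketch of how the current value ep could be derived from them (the exact formula is an assumption, modeled on the original DeepMind DQN code):

    local function current_epsilon(self)
        -- Steps elapsed since learning started
        local steps = math.max(0, self.experienced_steps - self.learn_start)
        -- Fraction of the annealing period still remaining
        local remaining = math.max(0, (self.ep_endt - steps) / self.ep_endt)
        -- Interpolate linearly from ep_start down to ep_end
        return self.ep_end + (self.ep_start - self.ep_end) * remaining
    end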
Metamethods

__init ( args[, dump={}] )
Default constructor.
Parameters:
- InitArguments args
- Dump dump (default: {})
Public Methods

dump ( cycles )
Dump the agent. Redundant information (like network parameters) is removed to save space. Tensors are converted to the default type to avoid GPU incompatibilities.
Overrides: ArcadesComponent:dump
Parameters:
- table cycles: Set of already dumped objects
Returns:
- Dump: A reloadable dump

evaluate ()
Put the agent in evaluation mode. The agent will use a separate experience pool and a dedicated epsilon value.
Overrides: BaseAgent:evaluate
Returns:
- self
get_action ()
Get the action to execute for the current state (see _eGreedy).
get_experienced_interactions ()
Return how many interactions the agent has lived.
Overrides: BaseAgent:get_experienced_interactions
Returns:
- number: Number of interactions done

get_learned_epoch ()
Return how many times the agent actually learned from its experience.
Overrides: BaseAgent:get_learned_epoch
Returns:
- number: Number of times the agent learned
give_reward ( reward )
Reward or punish the agent.
Overrides: BaseAgent:give_reward
Parameters:
- number reward: Reward if positive, punishment if negative
Returns:
- self
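Taken together with the max_reward, min_reward, rescale_r and r_max fields, the reward presumably goes through clipping and rescaling before being stored; a sketch of that step (the exact behavior is an assumption):

    local function clamp_reward(self, reward)
        -- Clip the reward into [min_reward, max_reward] when configured
        if self.max_reward then reward = math.min(reward, self.max_reward) end
        if self.min_reward then reward = math.max(reward, self.min_reward) end
        if self.rescale_r then
            -- Track the largest reward magnitude seen so far and scale by it
            self.r_max = math.max(self.r_max or 1, math.abs(reward))
            reward = reward / self.r_max
        end
        return reward
    end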
integrate_observation ( state )
Integrate the current observation from the environment.
Overrides: BaseAgent:integrate_observation
Parameters:
- state: The current state of the environment
  - torch.Tensor observation: The actual observations
  - boolean terminal: Is this state terminal?
Returns:
- self
training ()
Put the agent in training mode. This is the default mode of an agent.
Overrides: BaseAgent:training
Returns:
- self
Private Methods

_eGreedy ( state )
Get the action to execute according to an epsilon-greedy policy.
Todo: Use dedicated classes for strategies.
Parameters:
- state: Current state
Returns:
- number: Chosen action
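A sketch of the policy, assuming self.ep holds the current epsilon value and the available actions are indexed 1..n:

    local function _eGreedy(self, state)
        if torch.uniform() < self.ep then
            -- Explore: uniformly random action
            return torch.random(1, self.n_actions)  -- `n_actions` is a placeholder
        else
            -- Exploit: best action according to the inference network
            return self:_greedy(state)
        end
    end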
_getQUpdate ( args )
Compute expected action-values as targets of the neural network.
Parameters:
- args: An interaction sample
  - s: Initial state
  - a: Action
  - r: Reward
  - s2: Final state
  - t: Is the state terminal?
Returns:
- torch.Tensor: Expected action-values
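Conceptually this implements the standard Q-learning target from the Nature paper. A per-transition sketch, writing Q' for the target network (or the inference network when target_q is nil):

    -- Target for one transition (s, a, r, s2, t):
    --   target = r                                   if s2 is terminal (t)
    --   target = r + discount * max_a' Q'(s2, a')    otherwise
    local function q_target(self, r, s2, t)
        if t then return r end
        local q2_max = self.target_network:forward(s2):max()
        return r + self.discount * q2_max
        -- The difference (target - Q(s, a)) is then bounded by clip_delta
        -- before back-propagation.
    end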
_greedy ( state )
Get the action to execute according to a greedy policy.
Parameters:
- state: Current state
Returns:
- number: Chosen action
_init_inference_network ( args[, dump={}] )
Initialize the main inference network.
Parameters:
- ClassArgument args
- Dump dump (default: {})
_init_preprocessing_network ( args, observation_size )
Initialize the preprocessing network.
Parameters:
- ClassArgument args
- table observation_size: Size of the input tensor {d, w, h}
_learn ()
Make the agent learn from its recorded experience (see _qLearnMinibatch).
_qLearnMinibatch ()
Apply Q-Learning on a minibatch.
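A sketch of how the documented pieces could fit together in this step; the experience-pool sampling API is an assumption, and rmsprop_step refers to the sketch given with RMSPropArgument:

    local function _qLearnMinibatch(self)
        local batch = self.experience_pool:sample(self.minibatch_size)  -- assumed API
        -- `targets` acts as the gradient of the loss w.r.t. the network
        -- outputs, as in the DeepMind DQN code
        local targets = self:_getQUpdate(batch)
        self.inference.grad_parameters:zero()
        self.inference.network:forward(batch.s)
        self.inference.network:backward(batch.s, targets)
        -- L2 weight cost configured by `wc`
        self.inference.grad_parameters:add(self.wc, self.inference.parameters)
        rmsprop_step(self.inference.parameters, self.inference.grad_parameters,
                     self.lr, self.rmsprop)
    end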