Part 2 : Reinforcement learning algorithms taxonomy

A list of popular reinforcement learning algorithms grouped by category

Photo credit: aaron boris

Pierre Aumjaud

Machine learning researcher

Here is a list of the most common reinforcement learning algorithms grouped by category.

1. Model-free RL algorithms

1.1. value-based

Q-learning = SARSA max - 1992
State Action Reward State-Action (SARSA) - 1994
Deep Q Network (DQN) - 2013
- Double Deep Q Network (DDQN) - 2015
- Deep Recurrent Q Network (DRQN) - 2015
- Dueling Q Network - 2015
- Persistent Advantage Learning (PAL) - 2015
- Bootstrapped Deep Q Network - 2016
- Normalized Advantage Functions (NAF) = Continuous DQN - 2016
- N-Step Q Learning - 2016
- Noisy Deep Q Network (NoisyNet DQN) - 2017
- Deep Q-learning from Demonstrations (DQfD) - 2017
- Categorical Deep Q Network = Distributional Deep Q Network = C51 - 2017
- Rainbow - 2017
- Quantile Regression Deep Q Network (QR-DQN) - 2017
- Implicit Quantile Networks (IQN) - 2018
Mixed Monte Carlo (MMC) - 2017
Neural Episodic Control (NEC) - 2017
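
The "Q-learning = SARSA max" shorthand above refers to the only difference between the two tabular updates: SARSA bootstraps from the next action the agent actually takes, while Q-learning bootstraps from the best next action. Here is a minimal sketch of both updates, assuming a NumPy Q-table indexed as `Q[state, action]`. The function names, the toy table and the hyperparameter values are illustrative assumptions, not taken from a particular library.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: bootstrap from the next action actually taken."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy TD update: bootstrap from the greedy (max) next action."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy Q-table (hypothetical values): 2 states, 2 actions.
Q_sarsa = np.zeros((2, 2))
Q_sarsa[1] = [1.0, 3.0]
Q_qlearn = Q_sarsa.copy()

# Same transition for both; SARSA is told the agent will take action 0 next.
sarsa_update(Q_sarsa, s=0, a=0, r=0.0, s_next=1, a_next=0, alpha=0.5, gamma=1.0)
q_learning_update(Q_qlearn, s=0, a=0, r=0.0, s_next=1, alpha=0.5, gamma=1.0)

print(Q_sarsa[0, 0], Q_qlearn[0, 0])
```

On this toy transition the two estimates differ only because SARSA uses `Q[1, 0]` as the bootstrap value while Q-learning uses the larger `Q[1, 1]`.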

1.2. Policy-based

Cross-Entropy Method (CEM) - 1999
Policy Gradient
- REINFORCE = Vanilla Policy Gradient (VPG) - 1992
- Policy gradient softmax
- Natural Policy Gradient (Optimisation) (NPG) / (NPO) - 2002
- Truncated Natural Policy Gradient (TNPG) - 2016

1.3. Actor-Critic

Advantage Actor Critic (A2C) - 2016
Asynchronous Advantage Actor-Critic (A3C) - 2016
Generalized Advantage Estimation (GAE) - 2015
Trust Region Policy Optimization (TRPO) - 2015
Deterministic Policy Gradient (DPG) - 2014
Deep Deterministic Policy Gradients (DDPG) - 2015
- Distributed Distributional Deterministic Policy Gradients (D4PG) - 2018
- Twin Delayed Deep Deterministic Policy Gradient (TD3) - 2018
Actor-Critic with Experience Replay (ACER) - 2016
Actor Critic using Kronecker-Factored Trust Region (ACKTR) - 2017
Proximal Policy Optimization (PPO) - 2017
- Distributed PPO (DPPO) - 2017
- Clipped PPO (CPPO) - 2017
- Decentralized Distributed PPO (DD-PPO) - 2019
Soft Actor-Critic (SAC) - 2018

2. Model-based RL algorithms

2.1. Dyna-Style Algorithms / Model-based data generation

Dyna-Q - 1990
Embed to Control (E2C) - 2015
Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) - 2018
Stochastic Lower Bound Optimization (SLBO) - 2018
Model-Based Meta-Policy-Optimization (MB-MPO) (meta learning) - 2018
Stochastic Ensemble Value Expansion (STEVE) - 2018
Model-based Value Expansion (MVE) - 2018
Simulated Policy Learning (SimPLe) - 2019
Model Based Policy Optimization (MBPO) - 2019

2.2. Policy Search with Backpropagation through Time / Analytic gradient computation

Differential Dynamic Programming (DDP) - 1970
Linear Quadratic Regulator (LQR), i.e. linear dynamics with a quadratic cost - 1989
Iterative Linear Quadratic Regulator (ILQR) - 2004
Probabilistic Inference for Learning Control (PILCO) - 2011
Iterative Linear Quadratic-Gaussian (iLQG) - 2012
Approximate iterative LQR with Gaussian Processes (AGP-iLQR) - 2014
Guided Policy Search (GPS) - 2013
Stochastic Value Gradients (SVG) - 2015
Policy search with Gaussian Process - 2019

2.3. Shooting Algorithms / sampling-based planning

Random Shooting (RS) - 2017
Cross-Entropy Method (CEM) - 2013
- Deep Planning Network (PlaNet) - 2018
- Probabilistic Ensembles with Trajectory Sampling (PETS-RS and PETS-CEM) - 2018
- Visual Foresight - 2016
Model Predictive Path Integral (MPPI) - 2015
- Planning with Deep Dynamics Models (PDDM) - 2019
Monte-Carlo Tree Search (MCTS) - 2006
- AlphaZero - 2017
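
To make the shooting idea concrete, here is a minimal sketch of Random Shooting and the Cross-Entropy Method used as planners. It assumes a known toy model (`s' = s + a`), a quadratic cost towards a goal, and actions bounded in [-1, 1]; in the algorithms listed above the model is learned from data. All function names and parameter values are illustrative assumptions.

```python
import numpy as np

def rollout_cost(state, actions, goal=1.0):
    """Total cost of an action sequence under the toy model s' = s + a."""
    cost = 0.0
    for a in actions:
        state = state + a
        cost += (state - goal) ** 2
    return cost

def random_shooting(state, horizon=5, n_samples=500, rng=None):
    """Sample action sequences uniformly and return the cheapest one."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    costs = np.array([rollout_cost(state, seq) for seq in candidates])
    return candidates[np.argmin(costs)]

def cem_plan(state, horizon=5, n_samples=200, n_elite=20, n_iters=5, rng=None):
    """Refit a Gaussian over action sequences to the lowest-cost samples."""
    rng = np.random.default_rng() if rng is None else rng
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        candidates = rng.normal(mean, std, size=(n_samples, horizon))
        candidates = np.clip(candidates, -1.0, 1.0)
        costs = np.array([rollout_cost(state, seq) for seq in candidates])
        elites = candidates[np.argsort(costs)[:n_elite]]
        # Small constant keeps the sampling distribution from collapsing to zero width.
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

rng = np.random.default_rng(0)
rs_plan = random_shooting(0.0, rng=rng)
cem_result = cem_plan(0.0, rng=rng)
print(rollout_cost(0.0, rs_plan), rollout_cost(0.0, cem_result))
```

In a model predictive control loop, only the first action of the returned sequence would typically be executed before replanning from the new state.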

2.4. Value-equivalence prediction

3. Other RL algorithms

3.1. General agents

Covariance Matrix Adaptation Evolution Strategy (CMA-ES) - 1996
Episodic Reward-Weighted Regression (ERWR) - 2009
Relative Entropy Policy Search (REPS) - 2010
Direct Future Prediction (DFP) - 2016

3.2. Imitation learning

Behavioral Cloning (BC)
Dataset Aggregation (DAgger) (i.e. query the expert) - 2011
Adversarial Reinforcement Learning
- Generative Adversarial Imitation Learning (GAIL) - 2016
- Adversarial Inverse Reinforcement Learning (AIRL) - 2017
Conditional Imitation Learning - 2017
Soft Q-Imitation Learning (SQIL) - 2019

3.3. Hierarchical RL

Hierarchical Actor Critic (HAC) - 2017

3.4. Memory types

Prioritized Experience Replay (PER) - 2015
Hindsight Experience Replay (HER) - 2017

3.5. Exploration techniques

Epsilon-greedy (ε-greedy)
Boltzmann
Ornstein–Uhlenbeck process
Normal Noise
Truncated Normal Noise
Bootstrapped Deep Q Network
UCB Exploration via Q-Ensembles (UCB)
Noisy Networks for Exploration
Intrinsic Curiosity Module (ICM) - 2017
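
The simplest techniques in this list fit in a few lines each. Below is a minimal sketch of ε-greedy and Boltzmann action selection for discrete actions, and of Ornstein–Uhlenbeck noise for continuous actions. It assumes Q-values are given as a 1-D array and a zero-mean noise process discretised with a simple Euler step; the function names, the class name and the default parameters are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

class OrnsteinUhlenbeckNoise:
    """Temporally correlated, zero-mean noise to add to continuous actions."""

    def __init__(self, size, theta=0.15, sigma=0.2, dt=1e-2, rng=None):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = np.random.default_rng() if rng is None else rng
        self.x = np.zeros(size)

    def sample(self):
        # Euler step: pull the state back towards zero, then add Gaussian noise.
        drift = -self.theta * self.x * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * self.rng.normal(size=self.x.shape)
        self.x = self.x + drift + diffusion
        return self.x

rng = np.random.default_rng(0)
q = np.array([0.0, 1.0, 2.0])
print(epsilon_greedy(q, epsilon=0.1, rng=rng), boltzmann(q, temperature=1.0, rng=rng))
print(OrnsteinUhlenbeckNoise(size=2, rng=rng).sample())
```

A high temperature makes Boltzmann selection close to uniform, while a temperature near zero makes it close to greedy.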

3.6. Meta learning

3.7. Model-free model-based

Imagination-Augmented Agents for Deep Reinforcement Learning (I2A) - 2017
Model-Free Model-Based (MB-MF) - 2017
Normalized Advantage Functions (NAF) - 2016
Temporal Difference Models (TDM) - 2018
