Reinforcement Learning
Learn decision-making policies by maximizing reward.
Reinforcement learning trains agents to make sequences of decisions by interacting with an environment and maximizing cumulative reward. The families below range from value-based and policy-gradient methods to bandits for the explore-vs-exploit tradeoff.
- Use bandits for online decision optimization.
- Use PPO as a default baseline for practical deep RL tasks, and SAC or TD3 for continuous control.
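To make the explore-vs-exploit tradeoff concrete, here is a minimal sketch of an epsilon-greedy multi-armed bandit. It is one simple bandit strategy among several, not a recommended production setup. The function name `run_epsilon_greedy`, the Bernoulli reward model, and the arm probabilities are all assumptions chosen for illustration.

```python
import random


def run_epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Play a simulated Bernoulli bandit with epsilon-greedy action selection.

    true_means: hypothetical success probability of each arm (unknown to the agent).
    Returns (estimated mean reward per arm, number of pulls per arm).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Simulated environment: reward is 1 with probability true_means[arm].
        reward = 1.0 if rng.random() < true_means[arm] else 0.0

        # Incremental update of the sample mean for the chosen arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts


estimates, counts = run_epsilon_greedy([0.2, 0.5, 0.8])
print("estimated means:", [round(e, 2) for e in estimates])
print("pulls per arm:  ", counts)
```

With a small `epsilon`, most pulls should concentrate on the arm with the highest estimated reward while the remaining pulls keep refining the other estimates.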
| # | Algorithm | Best for | Common fields |
|---|---|---|---|
| 1 | Q-Learning / Deep Q-Networks | Discrete action problems | |
| 2 | Policy Gradient Methods | Direct policy optimization | |
| 3 | Actor-Critic Methods | Stable deep RL | |
| 4 | PPO: Proximal Policy Optimization | Practical deep RL baseline | |
| 5 | A3C / A2C | Parallel RL training | |
| 6 | DDPG / TD3 / SAC | Continuous control | |
| 7 | Multi-Armed Bandits | Exploration vs exploitation | |
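
As an illustration of the first row, the following is a minimal sketch of tabular Q-learning on a discrete action problem. The environment is a hypothetical five-state chain invented for this example (actions: left or right, reward 1 for reaching the rightmost state), and the function name and hyperparameters are assumptions rather than recommended values. Deep Q-Networks replace the table with a neural network, which this sketch does not cover.

```python
import random


def q_learning_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain environment.

    Actions: 0 = move left, 1 = move right. The agent starts in state 0;
    reaching the rightmost state yields reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1

    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            # Epsilon-greedy selection; break ties randomly so early
            # episodes (all-zero table) still explore.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1

            # Environment step for the chain.
            next_state = max(0, state - 1) if action == 0 else state + 1
            done = next_state == goal
            reward = 1.0 if done else 0.0

            # Q-learning update: bootstrap from the best next-state value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])

            state = next_state
            if done:
                break
    return q


q = q_learning_chain()
# Greedy action in each non-terminal state (1 = right).
greedy = [max((0, 1), key=lambda a: q[s][a]) for s in range(len(q) - 1)]
print("greedy policy:", greedy)
```

The learned values for moving right should grow toward the goal, since each step away from the reward is discounted by `gamma`.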