Chapter 2: Multi-Armed Bandits

The K-Armed Bandit Problem

Define reward
- A numerical score that provides feedback and defines the goal for an agent.
Understand the temporal nature of the bandit problem
- Agent’s goal is to maximize the accumulated rewards over time.
Define k-armed bandit
- An agent who chooses between “K” actions and receives a reward based on the action it chooses.
Define action-values
- A function describing the expected reward for each action \(q_*(a) \doteq \mathbb{E}[R_t \vert A_t = a] \forall a \in {1,...,k}\).

Define action-value estimation methods
- Sample-Average method is using sample averge as estimation.
Define exploration and exploitation
- exploration is to choose an action to improve knowledge for long-term benefit.
- exploitation is to choose the action to exploit knowledge for short-term benefit.
Select actions greedily using an action-value function
- Argmax Q over a
Define online learning
- The learning procedure where policy interacts with environment to learn optimal actions.
Understand a simple online sample-average action-value estimation method
- After an action, update the action-value estimation as soon as possible.
Define the general online update equation
- NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate).
Understand why we might use a constant step-size in the case of non-stationarity
- If we use sample-average, the step size expontentially decay with time, making the policy update heavily biased towards initial rewards. If the problem is nonstationary, the method is unable to keep up with the changing truth.

Define epsilon-greedy
- explore with probability epsilon, eploit with probability (1 - epsilon)
Compare the short-term benefits of exploitation and the long-term benefits of exploration
- When total steps is small, more exploitation may yield more rewards, but exploration may yield higher rewards at a later time.
Understand optimistic initial values
- Set initial action-value estimates to be higher than the actual expected reward.
Describe the benefits of optimistic initial values for early exploration
- Effectively shifting a bit exploitation to exploration.
Explain the criticisms of optimistic initial values
- not well suited for non-stationary problems, and we don’t usually know what the value should be.
Describe the upper confidence bound action selection method
- uses uncertainty in the value estimates for balancing exploration and exploitation.
- \[A_t \doteq argmax [Q_t(a) + c\sqrt{lnt / N_t(a)}]\]
Define optimism in the face of uncertainty

Understand Markov Decision Processes, or MDPs
- A framework for the problem of agent learning from interaction with the environment to achieve a goal.
Describe how the dynamics of an MDP are defined
- An agent interact with environment at a sequence of discrete time steps.
- At each time step, the agent receives a representation of the environment state and on that basis, the agent selects an action.
- One time step later, the agent receives a numerical reward based on its past actions and finds itself in a new state.
Understand the graphical representation of a Markov Decision Process
- Transition graph.
- Two kinds of nodes: state nodes and action nodes
- Each possible state has a state node
- Each action node connect two state nodes
- Write the reward \(r(s,a,s')\) and probability of transition \(p(s' \vert s,a)\) on the edge.
Explain how many diverse processes can be written in terms of the MDP framework

Describe how rewards relate to the goal of an agent
- Agent’s goal is to maximize total future reward.
Understand episodes and identify episodic tasks
- Interactions naturally break into chuncks called episodes.
- Episodes are independent.
- At terminal state reset to the start state.

Formulate returns for continuing tasks using discounting
- discount the rewards in the future by a factor gamma between 0 and 1.
- \(G_t\) is upper bounded by \(R_{max} \times 1 / (1 - \gamma)\)
- When \(\gamma\) approaches 0, only immediate reward matters, agent become short sighted.
- As \(\gamma\) approaches 1, agent is far-sighted.
Describe how returns at successive time steps are related to each other
- \[G_t = R_{t+1} + \gamma G_{t+1}\]
Understand when to formalize a task as episodic or continuing
- episodic tasks break naturally into independent episodes.
- continuing tasks are assumed to continue indefinitely.

Recognize that a policy is a distribution over actions for each possible state
- A policy maps current state onto a set of porbabilities for taking each action.
- deterministic policy: \(\pi(s)=a\)
- stochastic policy: \(\pi(a \vert s)\) where \(\Sigma_{a\in\mathbb{A}(s)}\pi(a \vert s) = 1\) and \(\pi(a \vert s) \geq 0\)
Describe the similarities and differences between stochastic and deterministic policies
- They both describe what action agent can take at given state.
- For deterministic policy, agent can only take one action per state.
- For stochastic policy, there are mutliple action agent can take per state, each action is associated with a probabilty and sum up to 1.
Identify the characteristics of a well-defined policy
- policies can only depend on the current state.
Generate examples of valid policies for a given MDP
Describe the roles of state-value and action-value functions in reinforcement learning
- state-value function \(v(s) \doteq \mathbb{E}[G_t \vert S_t = s]\) where \(G_t = \Sigma_{k=0}^{\infty} \gamma^k R_{t+k+1}\)
Describe the relationship between value functions and policies
Create examples of valid value functions for a given MDP