### 1.1
I think they would both start playing randomly, then each would start exploiting the bad moves the other makes while playing randomly, both getting better, until eventually both reach an optimal policy against each other.
### 1.2
We could just treat symmetric positions as the same state. It would make learning faster (roughly 4 times faster if we take the symmetries into account).
But if the opponent doesn't take advantage of symmetries, it might play differently in symmetrically equivalent positions, so we shouldn't force the same value onto all of them.
### 1.3
It would learn to play worse, because a purely greedy player can get stuck in a local maximum, never exploring some parts of the game tree.
### 1.4
When we do learn from exploratory moves, the estimated probability of winning should always be slightly underestimated (except for the last moves). When we don't learn from them, the estimates can converge to essentially 0-or-1 values, because at some point the agent _knows_ what the optimal moves are.
If we continue to make exploratory moves, then the set of non-integer probabilities is better, because it actually represents what happens in the games being played.
Maybe, if we just want to maximize wins, the 0/1-style estimates we get by never updating on exploratory moves make us risk averse (in a minimax style), which would result in more wins.
### 1.5
Other improvements: (from 1.4, make the exploration rate go down over time.) Add a discount factor? Use a neural network? Have something that decreases the step size in a smart way? Learn something from exploratory moves, but not too much? Initialize the probabilities better (instead of a flat 0.5 probability of winning, put something bigger if close to a win and less than 0.5 if close to a loss), or just use the average of wins vs. losses reachable from the position (the same value until you get close to the initial board).
A better way to solve the tic-tac-toe problem: do some meta-learning on a bunch of different players so the agent can adapt to new players quickly.
### 2.1
With \epsilon = 0.5 and two actions, the greedy action is selected with probability (1 - \epsilon) + \epsilon/2 = 0.75, since the exploratory random choice can also land on the greedy action.
### 2.2
It definitely occurred at timesteps 4 and 5 (at timestep 4 because A_4 = 2 while Q(2) < 0 = Q(3), and at timestep 5 because A_5 = 3 while Q(2) > Q(3) = 0). It could possibly have occurred at any of the other timesteps, since the random exploratory choice can coincide with the greedy action.
### 2.3
The \eps=0.01 method will be better in the long run. It will select the optimal action about 1.1 times (≈ 0.99/0.9) more often. The expected reward per step is EV = (1-\epsilon) * mean_maximum_action + \epsilon * EV(random action) = (1-\epsilon) * mean_maximum_action, because the action values were sampled from a Gaussian of mean 0.
Therefore, the solution with \epsilon = 0.01 will earn about (0.99 - 0.9) * mean_maximum_action = 0.09 * mean_maximum_action more reward per step.
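A quick numerical sketch of this argument (not part of the repo; it assumes the 10-armed testbed with q*(a) ~ N(0, 1) and fully converged value estimates):
```python
# Estimate E[max of 10 arms ~ N(0, 1)], then the asymptotic per-step reward of
# epsilon-greedy assuming the estimates have converged to the true q*(a).
import numpy as np

rng = np.random.default_rng(0)
mean_max = rng.normal(size=(100_000, 10)).max(axis=1).mean()  # ~1.54

for eps in (0.1, 0.01):
    # with prob (1 - eps) pick the best arm; a random pick has expected value 0
    print(f"eps={eps}: asymptotic reward per step ~ {(1 - eps) * mean_max:.3f}")

print(f"difference ~ {0.09 * mean_max:.3f} reward per step")
```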
### 2.4
Q_{n+1} = Q_1 \prod_{i=1}^n (1 - \alpha_i) + \sum_{i=1}^n \alpha_i R_i \prod_{k=i+1}^{n} (1 - \alpha_k)
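A small sanity-check sketch of this closed form (not from the repo; the step sizes and rewards below are arbitrary made-up numbers):
```python
# Compare the incremental update Q_{k+1} = Q_k + alpha_k (R_k - Q_k)
# against the closed form above, for an arbitrary sequence of step sizes.
import numpy as np

rng = np.random.default_rng(1)
n = 8
Q1 = 0.5
alphas = rng.uniform(0.05, 0.9, size=n)   # alpha_1 .. alpha_n
rewards = rng.normal(size=n)              # R_1 .. R_n

# incremental form
Q = Q1
for a, r in zip(alphas, rewards):
    Q += a * (r - Q)

# closed form: Q_{n+1} = prod(1 - alpha_i) Q_1 + sum_i alpha_i R_i prod_{k>i}(1 - alpha_k)
closed = np.prod(1 - alphas) * Q1
for i in range(n):
    closed += alphas[i] * rewards[i] * np.prod(1 - alphas[i + 1:])

print(Q, closed)  # the two values agree
```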
### 2.5
Main result: the average reward gets up to 2+ with alpha=0.1, but with sample averages it only goes up to about 1.5. From plotting the distance to the true q values, we see that with sample averages we at some point drift further and further away (the problem is nonstationary), whereas with a constant alpha we keep getting closer.
### 2.6
The optimistic greedy method will underperform for the first steps because it tries out the arms initialized at 5. However, there is a spike because it is (almost) guaranteed to try the best arm during the first 10 steps, on average around step 5. After 10 steps it has tried every arm at least once, so it acts greedily on estimates that already reflect some exploration, which makes it a bit better than the realistic \epsilon-greedy. So its percentage of optimal actions grows faster until it outperforms the realistic \epsilon-greedy.
### 2.7
The weight in front of Q_1 is multiplied by 1 - \beta_1 = 0 right at the beginning (since \beta_1 = 1), so there is no initial bias.
More generally, \beta_n = \frac{\alpha}{1 - (1-\alpha)^n}, which tends to \alpha when n -> +inf (the constant step size).
Replacing the \alpha_i by the value of \beta_i in the expression from 2.4 and using a product trick where almost everything simplifies, we get that
the sum of weights is \sum_{i=1}^n \frac{1-\beta}{1 - \beta^i} \beta^{n-i} \frac{1-\beta}{1 - \beta^{n-i+1}}, where \beta = 1 - \alpha.
It does sum to 1 (miraculously) for n = 1 and n = 2.
And the \beta^{n-i} factor makes the weight exponentially smaller for earlier rewards (i = 1) than for the latest reward (i = n).
EDIT: after having coded the function (cf. chapter2/weights.py), it turns out the sum doesn't really equal 1... BUT it does converge to some fixed number, so the estimate is still exponentially recency-weighted, with the total weight settling at a constant.
### 2.8
There is a spike at the 11th step because during the first 10 steps UCB forces each untried arm to be selected once (N_t(a) = 0 makes the bonus maximal); on the 11th step it can finally select the argmax of the estimates instead of a forced exploratory action, which on average is a good arm. Performance then dips again because the uncertainty bonus pushes it back toward the other arms.
### 2.9
\pi_t(a) = \sigma(H_t(a) - H_t(b)), where \sigma is the logistic (sigmoid) function; so with two actions the soft-max distribution is exactly the logistic model.
### 2.10
First case (you can't tell which case you face): the best expected probability of success is 0.5, whatever your policy.
Second case (you are told the case): 0.55, by choosing action 2 in case A and action 1 in case B.
### 2.11
After averaging over 10 bandit runs (cf. [/plots/ex2.11.png](/plots/ex2.11.png)), it seems that epsilon-greedy has the best performance, then greedy with optimistic initialization, then UCB and finally gradient bandit.
My explanation: UCB's exploration decreases over time, so it cannot really adapt to 100 000 steps of random walk. Epsilon-greedy does better than optimistic greedy because of the continued exploration. I don't know why gradient bandit completely fails on random walks (compared to the others).
### 3.1
Task 1: Real Life
- A = {sleep, work}
- S = (awake, asleep)
- P(asleep, 1 | awake, sleep) = 0.9 # going to sleep
P(awake, -1 | awake, sleep) = 0.1 # insomnia
P(asleep, -10 | awake, work) = 0.9 # sleeping at work
P(awake, 2 | awake, work) = 0.1 # working is rewarding
P(asleep, 100 | asleep, sleep) = 0.99 # going into deeper sleep
P(awake, -1000 | asleep, sleep) = 0.01 # waking up after realizing you've tried to dream within a dream, in your bed, then starting to wonder whether life is just a dream
P(asleep, 10 | asleep, work) = 0.8 # dreaming of work in your dream, and staying in your dream
P(awake, -2 | asleep, work) = 0.2 # dreaming about work because late to work, so waking up unconsciously in a bad mood
Task 2: Play Chess against Random player
- A = {every valid chess move}
- S = {possible chess positions; checkmate leads to the terminal state}
- P = {probability of the random player producing each next state, with a reward of +1 if the AI didn't block a check (or cannot), 0 if nothing happens, and -1 if the opponent captured one of your pieces or won}
Task 3: Hardcore Navigation
- A = {decide if you should accelerate, brake, change angle to left or change angle to right}
- S = {image you see from the front camera} (in R^(16x16))
- P = {0 most of the time, -np.inf if accident (terminal state)}
--> Limits to the traditional MDP model:
1. States don't fully capture the actual state of the environment (i.e. you can see the same thing multiple times but be in a different position).
2. Continuous action space.
3. Infinite / continuous state space.
4. Infinite reward.
### 3.2
It's hard to find clear exceptions, because all "goal-directed behaviour" can be reframed as an MDP where "goal achieved" = terminal state, with transitions leading to this state having a reward of 1 and all other transitions having a reward of 0.
However, in practice the MDP framework is not useful when the rewards are not (yet) well defined or are hard to define. For instance, as a clear exception, "understand a text" would be hard to define as a reward for an ML model in February 2020. As for the walking-robot problem, defining what "walking in a human way" or "behaving according to human values" means is somewhat arbitrary and hard to specify clearly.
### 3.3
The natural place to draw the line between the agent and the environment is what the agent can move in less than about 0.1 s (e.g. our muscles). The basis is "it has actual control over that thing". For a robot that limit might be different (it can send signals much quicker, but cannot move as fast). It's somewhat a free choice where you draw the line, but I still feel that "what the agent can actually move within 0.1 s" is a natural boundary for human agents.
### 3.4
state s | action | state s' | reward r   | p(s', r | s, a)
--------|--------|----------|------------|----------------
high    | search | high     | r_{search} | \alpha
high    | search | low      | r_{search} | 1 - \alpha
low     | search | high     | -3         | 1 - \beta
low     | search | low      | r_{search} | \beta
high    | wait   | high     | r_{wait}   | 1
low     | wait   | low      | r_{wait}   | 1
### 3.5
In episodic tasks, S^{+} = S \cup {terminal state}.
In (3.3), considering the terminal state as an absorbing state, we just need to replace S by S^{+}.
If we wanted something more elaborate, we might consider the terminal state as something that transitions into the first state of the next episode after a reset, but Sutton & Barto say "very slightly", so I guess changing S to S^{+} is all that's needed.
### 3.6
The return would be -\gamma^K, where K is the number of steps until failure, which is essentially the same as in the discounted continuing formulation, except that the episodic return only accounts for the first failure rather than all future ones.
### 3.7
With no discounting, every policy that eventually escapes gets a return of +1, so the agent has no incentive to escape quickly. We must also include a penalty for every step, so that it learns to get out as quickly as possible.
### 3.8
G_5 = 0
G_4 = 2
G_3 = 3 + (1/2) * 2 = 4
G_2 = 6 + (1/2) * 4 = 8
G_1 = 2 + (1/2) * 8 = 6
G_0 = -1 + (1/2) * 6 = 2
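A tiny sketch (not from the repo) of the same computation, done backward from G_5 = 0 with G_t = R_{t+1} + \gamma G_{t+1}:
```python
# Compute the returns above backward, with gamma = 0.5.
gamma = 0.5
rewards = [-1, 2, 6, 3, 2]   # R_1 .. R_5

G = 0.0
returns = []
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()
print(returns)  # [2.0, 6.0, 8.0, 4.0, 2.0] = G_0 .. G_4
```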
### 3.9
G_1 = 7 / (1-0.9) = 70
G_0 = 2 + 0.9 * 70 = 65
### 3.10
\sum_{k=0}^{n} \gamma^k = (1 - \gamma^{n+1}) / (1 - \gamma) (just multiply both sides by (1 - \gamma)). It's an increasing sum of positive terms bounded above by 1/(1 - \gamma), so it has a limit; taking the limit, the sum is 1/(1 - \gamma).
### 3.11
\mathbb{E}[R_{t+1}] = \sum_{r \in \mathcal{R}} r \sum_{s'} \sum_{a} p(s', r | s, a) \pi(a | s)
### 3.12
v_{\pi}(s) = \sum_{a} \pi(a|s) q_{\pi}(s, a)
### 3.13
q_{\pi}(s, a) = \sum_{s'} \sum_{r} p(s', r | s, a) (r + \gamma v_{\pi}(s'))
### 3.14
0.7 ~= 0.9 * (1/4) * (2.3 + 0.4 - 0.4 + 0.7) = 0.9 * 0.75 = 0.675
### 3.15
Only the intervals between rewards matter, not their sign. Using (3.8), adding a constant c adds v_c = c / (1 - \gamma) to the value of all states, leaving their relative ordering unchanged.
### 3.16
In episodic tasks it does matter: in maze running, if the reward for each step is made non-negative by adding a constant, the agent may prefer to wander forever rather than finish the task.
### 3.17
q_{\pi}(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a')]
### 3.18
v_{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[q_{\pi}(s, a)]
v_{\pi}(s) = \sum_{a} \pi(a|s) q_{\pi}(s, a)
### 3.19
q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = a]
= \sum_{r} \sum_{s'} p(s', r | s, a) (r + \gamma v_{\pi}(s'))
### 3.20
The optimal state-value function is -2 anywhere from which the driver can reach the green (just use the driver to get onto the green, then the putter), one stroke worse for each additional drive needed from farther out, -1 if already on the green, and 0 in the hole.
### 3.21
q*(s, putter) is -4 if the putt cannot reach the -2 zone of q*(., driver), -3 if it can reach that zone, -2 if the putt is close enough to reach the green, -1 if on the green, and 0 in the hole.
### 3.22
\gamma=0: left
\gamma=0.5: both
\gamma=0.9: right
### 3.23
Let mh = max(q*(h,s), q*(h,w)) and ml=max(q*(l, s), q*(l, w), q*(l, re))
let a = \alpha and b = \beta and g = \gamma
q*(h, s) = rs + g (a.mh + (1-a)ml)
q*(h, w) = rw + g.mh
q*(l, s) = b(rs + g.ml) + (1-b)(-3 + g.mh)
q*(l, w) = rw + g.ml
q*(l, re) = g.mh
### 3.24
Essentially, you write the sum of rewards as 10 + 10 * 0.9^5 + 10 * 0.9^{10} + ... = 10 \sum_{k \ge 0} 0.9^{5k} = 10 / (1 - 0.9^5) ≈ 24.419
### 3.25
v*(s) = max_a q*(s,a)
### 3.26
q*(s, a) = \sum_{r, s'} p(s', r | s, a) [r + \gamma.v*(s')]
### 3.27
\pi*(a|s) = 1 if a \in argmax_a q_*(s, a) else 0
### 3.28
\pi*(a|s) = 1 if a \in argmax_a \sum_{s', r} p(s', r | s, a)(r + \gamma v_*(s')) else 0
### 3.29
v*(s) = max_a \sum_{s'} p(s'|s, a) [r(s, a) + \gamma v*(s')]
v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} p(s'|s, a) [r(s, a) + \gamma v_{\pi}(s')]
q*(s, a) = \sum_{s'} p(s'|s, a) [r(s, a) + \gamma max_{a'} q*(s', a')]
q_{\pi}(s, a) = \sum_{s'} p(s'|s, a) [r(s, a) + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a')]
### 4.1
q_{\pi}(11, down) = -1
q_{\pi}(7, down) = -15
### 4.2
v_{\pi}(15) = -1 + (1/4)(v(12) + v(13) + v(14) + v(15)) = -1 + (1/4)(-22 - 20 - 14 + v(15)), which gives v_{\pi}(15) = -20. After the change to state 13's dynamics it is still -20: since v(15) equals the old value 13 bootstrapped from, none of the other values are perturbed.
### 4.3
(4.3) q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
(4.4) q_{\pi}(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a')]
(4.5) q_{k+1}(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a'|s') q_k(s', a')]
### 4.4
My solution: keep a list of past policies and stop when the iteration loops back to a previously seen policy (see chapter4/figures).
### 4.5
cf. code in chapter4/figures
### 4.6
Intuitive changes needed:
- in 3), instead of making the policy greedy, give the argmax action probability \pi(a|s) = 1 - \epsilon + \epsilon/|A(s)| and every other action \epsilon/|A(s)|.
- in 2), make the evaluation update an expectation weighted by \pi(a|s), since the policy is now stochastic.
- in 1), initialize \pi to an \epsilon-soft policy, i.e. all \pi(a|s) >= \epsilon/|A(s)|.
### 4.7
cf. code in chapter4/figures
### 4.8
What I understand from the policy: it tries to bet exactly the amount needed to reach 50, then bets everything at 50; the intermediate targets are 25 and 75, so bets are centered around reaching 0, 25, 50, 75 or 100.
The point of betting all your money at exactly 50 (and not at 51) is that at 50 you NEED to commit to the all-in plan, but at 51 you have one extra unit of capital that you can use to try to climb higher first.
### 4.9
Results are not stable as theta goes to 1e-15 (about the smallest float), but the value function seems to become sharper and sharper and the peaks follow a clear trend (linear from 0 to 0.5, then from 0.5 to 0).
For p\_{heads} = 0.25 the value function is really close to Figure 4.3. My interpretation is that what really matters here is that p\_{heads} < 0.5.
For p\_{heads} = 0.55, the value function grows really smoothly, and the optimal policy is to always bet one (kind of a random walk with favourable odds?) and then not to move when holding 99 in capital?! Presumably because moving would bootstrap from the value of the final state (which is zero in the implementation).
### 4.10
q_{k+1}(s, a) = \sum_{r, s'} p(s', r | s, a) (r + \gamma max_{a'} q_k(s', a'))
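A minimal sketch of this q-value-iteration update on a made-up tabular MDP (not from the repo; it uses expected rewards r(s, a), which is equivalent to summing over the four-argument p):
```python
import numpy as np

n_states, n_actions = 4, 2
gamma = 0.9
rng = np.random.default_rng(2)
# p[s, a, s'] transition probabilities and r[s, a] expected rewards, made up for illustration
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=(n_states, n_actions))

q = np.zeros((n_states, n_actions))
for _ in range(1000):
    # q_{k+1}(s, a) = r(s, a) + gamma * sum_{s'} p(s'|s, a) * max_{a'} q_k(s', a')
    q = r + gamma * p @ q.max(axis=1)
print(q)
```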
### 5.1
The value function jumps up because you're highly likely to win if you stick with 20 (and you definitely win or draw if you get 21), whereas if you hit with something like 19 you're very likely to bust and lose.
It drops off for the last row on the left because if the dealer shows an ace it's much harder to win.
The frontmost values are higher in the upper diagrams because a sum of 12 is alright when you have a usable ace, but pretty bad otherwise.
### 5.2
Because the same state can never recur within a single blackjack episode, every-visit MC performs exactly the same updates as first-visit MC, so the results would be exactly the same.
### 5.3
The root is the state-action pair; from there the trajectory unrolls until it ends in a terminal state (which is not a state-action pair).
### 5.4
Q_{n+1}(s, a) = Q_n(s, a) + (1/n) (R_n - Q_n(s, a))
cf. the implementation in MonteCarlo ES of chapter5/mc.py
### 5.5
In this case the importance sampling ratio is 1 (there is only one action, so b = \pi anyway), so weighted and ordinary importance sampling are equivalent. The return is 10, so the first-visit estimator is V(s) = 10; the every-visit estimator averages the returns from all 10 visits, V(s) = (10 + 9 + ... + 1)/10 = 5.5.
### 5.6
Same as for V(s), just replace T(s) with T(s, a), i.e. the times of visits to the state-action pair, and start the importance-sampling ratios one step later (\rho_{t+1:T-1}), since the first action is the one being evaluated and doesn't need correcting.
### 5.7
For the weighted importance-sampling method the estimates are biased (toward the behaviour-policy value) at first, so with only a few episodes the first highly weighted returns can push the error up; as more episodes are averaged the bias shrinks and the error falls again.
### 5.8
Intuitively, if we divide by the number of visits (as in every-visit MC), then in the worst case we get a factor 1/k when the length of the episode is k, which gives 0.2 * \sum_k (1.8^k / k); this still diverges (each term is at least 1/k, so it's larger than the harmonic series), so the variance is still infinite.
### 5.9
code in chapter5/mc.py class MonteCarloFirstVisit
### 5.10
Here's how the derivation goes:
1) split the numerator sum in two: W_n G_n plus the sum of W_k G_k up to n-1
2) write the denominator as C_n
3) the second part is then (C_{n-1}/C_n) times V_n
4) write C_{n-1} as C_n - W_n
5) put everything together and you get the result
### 5.11
My intuition: because we exit the inner loop as soon as A_t is not \pi(S_t), and ties are broken consistently, \pi(A_t|S_t) is either 1 or 0; if it's 0 we have already stopped, and if it's 1 then multiplying W by 1/b(A_t|S_t) is the correct importance-sampling update.
### 5.12
cf. chapter5/racetrack.py and figures.py ex5.14 (somehow it's called ex5.14 instead of ex5.12)
### 5.13
The left term equals \sum_s \sum_a b(a|s) \sum_{s',r} p(s', r | s, a) \sum_{a'} b(a'|s') ... \sum_{a^{(T-1)}} (\prod_i \frac{\pi(a^{(i)}|s^{(i)})}{b(a^{(i)}|s^{(i)})}) r.
All the later b factors cancel one by one with the b factors coming from taking the expectation; at each step what's left is a sum over \pi(a^{(i)}|s^{(i)}) (equal to 1) and a sum over the transitions p(s^{(i+1)}, r^{(i)} | s^{(i)}, a^{(i)}) (also equal to 1).
We're left with \sum_s \sum_a \pi(a|s) \sum_{s',r} p(s', r | s, a) r,
or equivalently \sum_s \sum_a b(a|s) \frac{\pi(a|s)}{b(a|s)} \sum_{s', r} p(s', r | s, a) r,
i.e. the right term.
### 5.14
The formula for V(s) carries over to Q(s, a); just replace T(s) with T(s, a), the times of visits to the pair (s, a).
Then it's the same formula \sum_k W_k G_k / \sum_k W_k; we just need to replace G_k by the truncated returns G_{t:h}, and the weights for k are (1 - \gamma) \gamma^k \rho_{t:h-1} for k <= T - t - 2, and \gamma^{T-t-1} \rho_{t:T-1} for k = T - t - 1.
Actually, the weights now depend on both t and h, so it's a sum of W(s, h) G(s, h) divided by a sum of W(s, h).
For the details, see chapter5/mc.py, OffPolicyMCControl, then truncated_weighted_avg_est.
### 6.1
G_t - V_t(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} [\delta_k + \gamma (V_{k+1}(S_{k+1}) - V_k(S_{k+1}))]
So the difference between G_t - V_t(S_t) and the sum of TD errors is bounded by (\gamma / (1 - \gamma)) * max_k ||V_{k+1} - V_k||_{\infty}, which is small when the value function changes little during the episode.
### 6.2
Considering the driving problem, a situation where the TD update might be more efficient than the MC update: when I can't wait until the end of the episode to make my prediction (e.g. online learning for an autonomous vehicle in an unknown environment), or when MC trajectories are too difficult to sample because they require an intractable number of steps to reach the end of the episode (cf. the racetrack example at the end of chapter 5).
HINT CASE: here, with TD updates, we update using the already accurate value of "entering the highway" directly, without having to simulate the rest of the episode. My intuition is that, if we stay on the original scenario, we would need O(n) TD updates where a single MC update would do.
### 6.3
The first episode ended in the left absorbing state, whose value is 0. Only V(A) changed, because for all the other states V(S_t) and V(S_{t+1}) were equal. For A, the TD error was 0 + 1 * 0 - 1/2 = -0.5, so with step size 0.1 its value decreased by 0.05.
### 6.4
Theoretical answer:
- for MC: a lower step size means slower convergence, so it won't improve the results; from the graph, results stop getting better around step size 0.3-0.4, and a higher alpha even starts to be counterproductive.
- for TD: same for lowering the step size; as for a higher step size, it makes convergence quicker but the systematic bias is higher.
Empirical answer (cf. plots/ex6.4.png): using higher step sizes did indeed lead to really high bias (converging to the wrong values with high variance), and smaller step sizes are definitely too slow (no convergence within 100 episodes), though they should give better results over more episodes.
### 6.5
Theoretical answer: my intuition is that the initial value of 0.5 maximizes the variance of the random-walk targets, so the error can rise over the first episodes as longer walks come in; using a smaller or bigger V_init makes it converge faster to a (still somewhat wrong) value.
Empirical answer (cf. plots/ex6.5.png): among the other initializations tried (V_init = 0, 0.25, 0.75 or 1), the effect only showed up clearly with 0.5.
### 6.6
First way (which I used): solve the system of Bellman equations.
Second way: dynamic programming.
I think the system of equations was used because the chain is small enough to compute the exact values directly, whereas DP would only approach them iteratively.
### 6.7
cf. implementation in chapter6/old_pol_td.py
### 6.8
It's exactly the same derivation with Q(S_t, A_t) in place of V(S_t). At the end we get 0 - 0 because we assume Q(S_T, .) = 0 for the terminal state (which is exactly the convention Sarsa needs).
### 6.9
Diagonal moves help; adding "not moving" doesn't help and only slows down learning. Indeed, the optimal solution with diagonal moves takes the same number of moves as if you only had to go right with no wind. EDIT: after printing the final trajectories: because the policy is epsilon-greedy, it doesn't actually follow an optimal trajectory with diagonal/king moves.
For an empirical comparison, see plot/6/ex6.9 (code from chapter6/figures.py -> ex6_9). With diagonal moves it learns the optimal trajectory a bit faster than with diagonals + stay put; both then grow at the same rate, which is much faster than with only left, right, up and down moves.
### 6.10
Stochastic wind with king moves takes 4-5x more timesteps to complete 1000 episodes (tested on 10 different seeds for the stochasticity; cf. plots/ex6.10 vs. plots/ex6.9).
### 6.11
Q-learning is considered an off-policy control method because the target policy is the greedy policy (that's why the target estimate is r + \gamma * Q(s', \pi(s')), with \pi the greedy policy), and it differs from the behaviour policy (the one used to choose the action a that led to r, s').
### 6.12
The formula for the weight updates is the same if both use the greedy policy to derive actions from the Q values. However, action selection can differ: in Sarsa, the next action a' is chosen before the update of Q(s, a), whereas in Q-learning the action following a (call it a') is chosen after the update of Q(s, a). This can change things if we stay in the same state s_0: we might choose a_1 when selecting before updating Q(s_0, a_0) but a_2 when selecting after the update (where a_1 != a_2, and either could be a_0).
### 6.13
cf. the formula implemented in chapter6/double_experted_sarsa.py. Essentially, the "a_max according to Q_1" of double Q-learning becomes "use \pi_1 derived from Q_1 to weight the values of Q_2 in the update of Q_1(s, a)". cf. the plot in plots/ex6.13.png for the difference between double expected Sarsa and expected Sarsa on the environment with S_A and S_B: double expected Sarsa has less bias at the beginning, but then seems to take longer to converge (over the first 300 episodes).
### 6.14
In the car rental problem we're in a state (n_1, n_2), and after moving m cars the afterstate (n_1 - m, n_2 + m) is known deterministically (and can be reached from many different state-action pairs, as in tic-tac-toe). So instead of going through all the different state-action pairs we can evaluate just the roughly n^2 afterstates, which speeds up convergence.
### 7.1
We use that G_{t:t+k} = G_{t:t+k-1} + \gamma^{k-1} \delta_{t+k-1} for k >= 2, and G_{t:t+1} = R_{t+1} + \gamma V(S_{t+1}).
G_{t:t+n} - V(S_t) = G_{t:t+n-1} + \gamma^{n-1} \delta_{t+n-1} - V(S_t)
= ...
= G_{t:t+1} + \sum_{k=1}^{n-1} \gamma^k \delta_{t+k} - V(S_t)
= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \sum_{k=1}^{n-1} \gamma^k \delta_{t+k}
= \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}
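A quick numerical check of this identity (not from the repo; rewards and values are arbitrary numbers, and V is held fixed as the derivation assumes):
```python
import numpy as np

rng = np.random.default_rng(3)
gamma, n = 0.9, 4
rewards = rng.normal(size=n)        # R_{t+1} .. R_{t+n}
values = rng.normal(size=n + 1)     # V(S_t) .. V(S_{t+n}), fixed during the episode

# n-step return and TD errors built from the same fixed V
G = sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * values[n]
td_errors = [rewards[k] + gamma * values[k + 1] - values[k] for k in range(n)]

print(G - values[0], sum(gamma**k * td_errors[k] for k in range(n)))  # equal
```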
### 7.2
See chapter7/figures.py ex_7_2 for details about the experiment and chapter7/plots/ex7.2.png for the plot.
We compare the results from Figure 7.2 with what we get by just replacing G with a sum of TD errors. The averaged RMS error is higher for small values of n (1 and 2), about equal (and better for large step sizes) for n = 4, and then consistently lower than Figure 7.2 for larger values of n.
### 7.3
Theory:
- I think a larger number of states (19) was used so that episodes would be much longer on average (than with just 5 states), which is where n-step returns pay off.
- I expect a smaller walk to make the best value of n smaller, since an n too large for the average episode length can update states the wrong way (e.g. even the extreme right state can get updated toward the reward from the extreme left side if n is large enough and there aren't many states).
- I don't expect the change from -1 to 1 on the left to make much difference for the best n. It just makes it natural to initialize V to 0 and have the value of the terminal state be zero; if we had 0 on the left and an initialization of 0.5 (for all but the absorbing states) I think we would get about the same errors, with maybe a difference in that starting at 0.5 means there's always something to subtract, while starting at 0 means we subtract 0 at the beginning.
Experiments:
(cf. chapter7/figures.py ex_7_3 for details about the experiment and chapter7/plots/ex7.3.png for the plot)
- indeed, with 5 states the best n became as small as... n = 1.
- indeed, changing the left reward to 0 didn't change the best value of n = 4, but the RMS error was smaller overall. I think the left reward of -1 is just there for symmetry, so that initializing V to 0 is natural.
### 7.4
Let m = min(\tau + n, T).
The right-hand term becomes: \sum_{j=0}^{m-1-\tau} \gamma^j R_{\tau+j+1} + \gamma^{m-\tau} Q_{m-1}(S_m, A_m).
If \tau + n < T: m - \tau == n, so we get G_{\tau:\tau+n}.
If \tau + n >= T: then Q_{m-1}(S_m, A_m) = Q_{T-1}(S_T, A_T) = 0, provided we add the convention that Q is always 0 on terminal states.
### 7.5
Essentially, we unroll the recursive relationship and replace the horizon h with \tau + n (to get something like (7.2)), and we get:
G_{\tau:\tau+n} = (sum of \gamma^{i-\tau-1} R_i \rho_{\tau}...\rho_{i-1}) + \rho_{\tau}...\rho_{max_idx} \gamma^{n-1} V_{\tau+n-1}(S_{\tau+n}) + (1 - \rho_{\tau}) V_{\tau+n-1}(S_{\tau})
see chapter7/off_pol_nstep_td.py for the implementation.
### 7.6
The expected value of each \rho is 1, independent of the estimates and the Qs, so you can replace them by 1; then the expected value of the Q_{h-1} estimate is exactly \bar V_{h-1}, so things simplify and we get the same expected return.
### 7.7
implemented in chapter7/off_pol_nstep_expected_sarsa.py
### 7.8
We use that u_t = G_{t:h} - V(S_t) = \rho_t (\delta_t + \gamma u_{t+1}), obtained by adding & subtracting \rho_t \gamma V(S_{t+1});
then, because G_{h:h} = V(S_h), the last term is 0, so we get:
G_{t:h} - V(S_t) = \sum_{k=t}^{h-1} (\prod_{j=t}^{k} \rho_j) \gamma^{k-t} \delta_k
### 7.9
We use that u_t = G_{t:h} - Q(S_t, A_t) = \delta_t + \gamma \rho_{t+1} u_{t+1};
then, using that:
- G_{h:h} - Q(S_h, A_h) = 0 if h < T
- G_{T-1:h} - Q(S_{T-1}, A_{T-1}) = R_T + expV(S_T) - Q(S_{T-1}, A_{T-1}) = \delta_{T-1} (because expV = 0 on terminal states),
we get that G_{t:h} - Q(S_t, A_t) is:
- \sum_{k=t}^{h-1} (\prod_{j=t+1}^{k} \rho_j) \gamma^{k-t} \delta_k if h < T
- the same with h replaced by T if h >= T
### 7.10
Done on the not-so-random walk; cf. ex7.10 in chapter7/figures.py and chapter7/off_pol_nstep_td.py, and chapter7/plots/ex7.10.png.
We get about the same results for the two methods... not really conclusive. On my plot the more complex method is slightly better, but with a different number of states it can go the other way. Maybe there are environments where the results are more conclusive, or there's an error in my code.
### 7.11
G_{t:t+n} - Q(S_t, A_t)
= \delta_t - \gamma expV(S_{t+1}) + \gamma expV(S_{t+1}) - \gamma \pi(A_{t+1}|S_{t+1}) Q(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+1}|S_{t+1}) G_{t+1:t+n}
= \delta_t + \gamma \pi(A_{t+1}|S_{t+1}) (G_{t+1:t+n} - Q(S_{t+1}, A_{t+1}))
Then, for the final proof, we distinguish how it ends, using mathematical induction:
case 1: n == 1
-> if t == T-1: then \delta_t = R_T + \gamma * 0 - Q(S_{T-1}, A_{T-1}) by definition of expV on terminal states, so Q(S_t, A_t) + \delta_t = R_T = G_{T-1:t+n} (by definition).
-> if t < T-1: we get Q(S_t, A_t) + \delta_t = R_{t+1} + \gamma expV(S_{t+1}), which is by definition G_{t:t+n} (defined in 7.15).
case 2: n >= 2, assuming the result for n-1
G_{t:t+n}
= Q(S_t, A_t) + \delta_t + \gamma \pi(A_{t+1}|S_{t+1}) (G_{t+1:t+n} - Q(S_{t+1}, A_{t+1})) (using the recurrence above)
Then we replace the G - Q part on the right by the formula applied to G_{(t+1):(t+1)+(n-1)} (induction hypothesis for n-1);
the right part of the sum becomes: \gamma \pi(A_{t+1}|S_{t+1}) \sum_{k=t+1}^{min(t+1+n-1, T-1)} \delta_k \prod_{i=t+2}^{k} \gamma \pi(A_i|S_i)
Then:
- the min is the same as the one we want,
- after multiplying by \gamma \pi(A_{t+1}|S_{t+1}) the product starts at i = t+1,
- and \delta_t becomes the first term of the sum.
### 8.1
Theory: the n-step method shouldn't do much better, as it still has to reach the goal by random exploration in the first episode. If n = 3 or so, it gets correct updates for the 3 moves before the goal. If n is larger (say 50) the random path is much messier, so many of the updates would point the wrong way.
Practice: running the n-step Sarsa from chapter 7 with n = 5 and n = 50, with n = 5 it did update correctly next to the goal (and next to walls), but that's all; with n = 50 it updated from start to goal almost always correctly, with just a few tiles wrong (like 2 tiles out of 20). (cf. the nstep_sarsa code and ex_8_1 code in figures.py and ex8.1.png in plots)
### 8.2
First phase: finding the hole in the blocking task requires exploration, so it makes sense that the Dyna-Q+ agent found the hole faster thanks to the additional exploration bonus. Also, from footnote 1 (page 168), Dyna-Q+ starts with a prior on never-taken actions of "I'm going to stay in the same place and receive a reward of 0", which is correct when hitting the limits of the map or the wall. So maybe after having found a correct path once, it kind of "assumed" that the rest would give a reward of 0 and not be a path?
Second phase: well, it adapted and found the new path faster because of the exploration bonus (it completes 6 episodes about 2x faster than Dyna-Q after the change).
### 8.3
This one was observed when running the simulations! The Dyna-Q+ agent keeps re-exploring parts of the map from time to time (because of the exploration bonus), so it's slightly suboptimal compared to the plain Dyna-Q agent once the environment is known. This slight difference becomes noticeable over 3k steps (how much depends on the exploration bonus factor \kappa; I used \kappa = 0.001).
### 8.4
Theory: it seems pretty important to change the Q-values themselves. Indeed, by design, if you change the Q-values somewhere, max(Q) changes, so the change propagates through planning to the other Q-values until it reaches the point where the agent is currently blocked. If you only add the bonus to the estimated reward WHEN CHOOSING ACTIONS, then it's only an acting heuristic, not used in planning. So my guess is: it's good at unblocking yourself when only a few steps are needed; if it's a maze where you need to make several good choices (> 5) in a row to reach the goal, it won't work.
Experiments: comparing this variant with the original Dyna-Q+ on the blocking task (cf. plots/ex8.4.png), the weakness of the new version is that it takes more time to find the correct solution when the environment changes, but then its episode lengths are shorter on average because it doesn't need to re-try every part of the map it hasn't explored in a while.
### 8.5
Instead of storing "THE" (s', r) for a transition (s, a), you sample from the empirical distribution of what you already encountered for (s, a) (e.g. if you observed s', r = 1, 0 once and 2, 3 twice, then the first outcome has probability 1/3 and the second 2/3). See the sketch below.
This might perform poorly in a changing environment because, if going up always moved the agent up at the beginning, it will take a long time for the "distribution" above to shift to the correct model (with 1000 wrong instances, getting something 99% correct after a change in the environment would require 99,000 new transitions from the new environment).
For something that is good in both stochastic and changing environments, we would still need a distribution in the model (as mentioned), but this distribution should be rapidly updated for outcomes not seen in a while (for instance, if "up" hasn't caused an upward move in a while, the probability of that outcome in the model could decrease proportionally to 1 / (number of timesteps since last seen)).
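A minimal sketch of the count-based sample model described above (this is an illustration, not the repo's implementation):
```python
import random
from collections import defaultdict

class StochasticModel:
    def __init__(self):
        self.outcomes = defaultdict(list)        # (s, a) -> list of observed (s', r)

    def update(self, s, a, s_next, r):
        self.outcomes[(s, a)].append((s_next, r))

    def sample(self, s, a):
        # sampling a stored outcome uniformly == sampling with empirical probabilities
        return random.choice(self.outcomes[(s, a)])

model = StochasticModel()
model.update("A", "up", "B", 0)
model.update("A", "up", "C", 3)
model.update("A", "up", "C", 3)
print(model.sample("A", "up"))   # ("C", 3) with probability 2/3
```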
### 8.6
This would weaken the case of sample updates over expected updates.
Let's say there's one state-action pair that leads to a reward of 10 ** 20 with probability 10 ** (-15) and 0 otherwise. No matter how many samples you take, you won't do more than about 10 ** 15 samples for this state-action pair (given current hardware limits), so you're very unlikely ever to sample this huge (possible) reward, and you'll badly underestimate an expected value of 10 ** 5 for this pair.
### 8.7
The graph shows the value of a start state under the greedy policy. For b = 1, the two actions from each state each lead deterministically to a single next state.
My guesses for why it's scalloped in the early portions:
- in each regime, the model oscillates between the two actions for each state, which for b = 1 corresponds to two possible successor states.
- it says "all the while assuming the model is correct": with two possible next states, the "wrong" model oscillates between two modes for each state, which is what produces the scalloping.
Parts of the data that support this:
- the lower graph with b = 1 shows the same behaviour for uniform updates,
- it's somewhat visible with b = 3 and not with b = 10, so the scalloping fades as b grows, which fits the explanation above.
Why it doesn't also show for b = 1 on-policy:
- my guess is that the b = 1 on-policy runs converge so quickly that we don't get to see the scalloping.
### 9.1
In linear function approximation, \hat v(s, \bold w) = \bold w^T \bold x(s).
Now let \bold x(s) be a vector of length |S| that one-hot encodes the identity of the state (e.g. (1, 0, ..., 0) for the first state).
Then the linear method is just learning one weight per state, w_1, ..., w_{|S|}, which approximate v_{\pi}(s_1), ..., v_{\pi}(s_{|S|}); the tabular case is exactly linear function approximation with one-hot state features.
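A tiny sketch (made-up 3-state example, not from the repo) showing that the linear semi-gradient TD(0) update with one-hot features reduces to the tabular TD(0) update:
```python
import numpy as np

n_states, alpha, gamma = 3, 0.1, 0.9
x = np.eye(n_states)                 # x(s) is the one-hot encoding of state s
w = np.zeros(n_states)               # w[s] plays the role of the tabular V(s)

s, r, s_next = 0, 1.0, 2             # one hypothetical transition
delta = r + gamma * w @ x[s_next] - w @ x[s]
w += alpha * delta * x[s]            # only w[s] changes, by alpha * delta
print(w)
```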
### 9.2
(n + 1)^k features for dimension k, because each of the k factors in a product can take n + 1 possible exponents (0 through n), giving (n + 1)^k combinations. The reason we use all those combinations is that this is exactly what's needed (no more, no less) to capture all the interactions between the variables up to order n.
### 9.3
n = 2
c_{1,1}, c_{1, 2} = (0, 0)
c_{2,1}, c_{2, 2} = (1, 0)
c_{3,1}, c_{3, 2} = (0, 1)
c_{4,1}, c_{4, 2} = (1, 1)
c_{5,1}, c_{5, 2} = (2, 0)
c_{6,1}, c_{6, 2} = (0, 2)
c_{7,1}, c_{7, 2} = (1, 2)
c_{8,1}, c_{8, 2} = (2, 1)
c_{9,1}, c_{9, 2} = (2, 2)
### 9.4
tiles that are elongated along one dimension, such as stripe tilings from 9.12 (middle), or log stripes.
### 9.5
\alpha = \frac{1}{10 * 98} (\tau = 10 and 98 tilings, so there are 98 active tiles in every feature vector and E[x^T x] = 98; using \alpha = 1/(\tau E[x^T x]) from section 9.5.4)
### 9.6
Using (9.7), the fact that U_t is an unbiased estimate, and the definition of \hat v and \nabla \hat v for linear methods: \mathbb{E}[\bold w_{t+1}^T x(S_t) - v_{\pi}(S_t)] = \mathbb{E}[\bold w_t^T x(S_t) + \alpha (v_{\pi}(S_t) - \bold w_t^T x(S_t)) x(S_t)^T x(S_t) - v_{\pi}(S_t)].
Then use (9.19) with \tau = 1, i.e. \alpha = \frac{1}{\mathbb{E}[x^T x]};
so, on the right side, when we take the expected value of x(S_t)^T x(S_t) and multiply by \alpha, we get 1.
The v_{\pi} and \bold w_t^T x(S_t) terms cancel and we get 0, which means the error is reduced (on average) to zero in one update.
### 9.7
\hat v(s, w) = 1 / (1 + e^{-w^T x(s)})
\nabla \hat v(s, w) = (e^{-w^T x(s)} / (1 + e^{-w^T x(s)})^2) x(s) = \hat v(s, w) (1 - \hat v(s, w)) x(s)
### 9.8
From deriving with the cross-entropy loss, the \hat v (1 - \hat v) factor cancels and the update becomes w_{t+1} = w_t + \alpha (v_{\pi}(S_t) - \hat v(S_t, w_t)) x(S_t), i.e. the same simple form as in the linear case.
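A short sketch (not from the repo) of the logistic value estimate from 9.7 together with the two updates: the squared-error semi-gradient one and the cross-entropy one, where the \hat v (1 - \hat v) factor cancels. The feature vector and target below are made up, and targets are assumed to lie in (0, 1):
```python
import numpy as np

def v_hat(w, x):
    # logistic value estimate sigma(w^T x)
    return 1.0 / (1.0 + np.exp(-w @ x))

def grad_v_hat(w, x):
    # gradient v_hat * (1 - v_hat) * x
    v = v_hat(w, x)
    return v * (1.0 - v) * x

def mse_update(w, x, target, alpha):
    # squared-error semi-gradient update (exercise 9.7)
    return w + alpha * (target - v_hat(w, x)) * grad_v_hat(w, x)

def cross_entropy_update(w, x, target, alpha):
    # cross-entropy update: the sigma'(.) factor cancels (exercise 9.8)
    return w + alpha * (target - v_hat(w, x)) * x

w = np.zeros(4)
x = np.array([1.0, 0.5, -0.3, 2.0])
print(mse_update(w, x, 0.8, 0.1), cross_entropy_update(w, x, 0.8, 0.1))
```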
### 10.1
For Monte Carlo methods we would have U_t = G_t as the target, where G_t is the full return.
It's reasonable not to give pseudocode, as it wouldn't be very different from the Gradient Monte Carlo algorithm (or Monte Carlo Exploring Starts, since it's a control problem).
For Mountain Car, my intuition is that it would take many exploration steps to reach the top, and if it did reach the top it would be hard to credit why. In short, it would perform really poorly; my first guess is that it wouldn't even terminate its first episode.
### 10.2
Same as the Sarsa version, but with an expected value over the possible next actions a, i.e. \sum_a \pi(a|S_p) \hat q(S_p, a, \bold w), where the expectation is taken under, say, the epsilon-greedy policy.
### 10.3
Because this is Mountain Car with a constant reward of -1 per step, the n-step return is roughly -n plus the bootstrap term; with a large n the targets are larger in magnitude and noisier, so the weights move more at each update, hence the higher standard error.
### 10.4
Just remove the step where A_p is chosen before the update in differential semi-gradient Sarsa, and in the TD error use a max_{a_p} \hat q(S_p, a_p, \bold w) term instead of \hat q(S_p, A_p, \bold w); the next action is then selected (e.g. epsilon-greedily) after the update. A sketch follows.
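A pseudocode-style sketch of that differential semi-gradient Q-learning step. The linear q_hat, the feats(s, a) feature function, the actions list and the env.step(a) -> (r, s_next) interface are all assumptions for illustration, not the repo's API:
```python
import numpy as np

def q_hat(w, s, a, feats):
    return w @ feats(s, a)

def step(w, avg_reward, s, env, feats, actions, alpha=0.1, beta=0.01, eps=0.1,
         rng=np.random.default_rng()):
    # behaviour policy: epsilon-greedy on the current estimates
    if rng.random() < eps:
        a = rng.choice(actions)
    else:
        a = max(actions, key=lambda b: q_hat(w, s, b, feats))
    r, s_next = env.step(a)
    # TD error uses max over next actions instead of the action actually taken next
    delta = (r - avg_reward
             + max(q_hat(w, s_next, b, feats) for b in actions)
             - q_hat(w, s, a, feats))
    avg_reward += beta * delta
    w += alpha * delta * feats(s, a)
    return w, avg_reward, s_next
```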
### 10.5
We're missing how to update the weights, which is essentially (10.12) with \hat v instead of \hat q: \bold w_{t+1} = \bold w_t + \alpha \delta_t \nabla \hat v(S_t, \bold w_t).
### 10.6
The average reward is 0.5.
For A: lim_{\gamma -> 1} lim_{h -> \infty} \sum_t \gamma^t (R_{t+1} - \bar r) = lim_{\gamma -> 1} (1/2) \sum_t (-\gamma)^t = lim_{\gamma -> 1} 0.5 / (1 + \gamma) = 1/4.
For B: the same with the opposite sign, so -1/4.
### 10.7
Separating out the r(\pi) part, the reward sequences inside the parentheses are (0, 0, 1, ...) for A, (0, 1, 0, ...) for B and (1, 0, 0, ...) for C, each minus \bar r = 1/3.
For C the sum is \sum_t \gamma^{3t} - (1/3) \sum_t \gamma^t = 1/(1 - \gamma^3) - (1/3)(1/(1 - \gamma)), which tends to 1/3 as \gamma -> 1.
For B you get an extra factor \gamma on the first part, \gamma/(1 - \gamma^3) - (1/3)(1/(1 - \gamma)), whose limit is 0; for A a factor \gamma^2, giving a limit of -1/3. So the differential values are v(C) = 1/3, v(B) = 0 and v(A) = -1/3 (note they sum to 0).
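A numerical check of these limits (not from the repo), assuming the reward cycles written above and \bar r = 1/3:
```python
import numpy as np

def discounted_diff_value(first_reward_at, gamma, horizon=200_000):
    # sum_t gamma^t (R_{t+1} - 1/3) for a reward of 1 every third step
    t = np.arange(horizon)
    rewards = (t % 3 == first_reward_at).astype(float)
    return np.sum(gamma**t * (rewards - 1.0 / 3.0))

for name, k in (("C", 0), ("B", 1), ("A", 2)):
    print(name, round(discounted_diff_value(k, gamma=0.9999), 3))
# prints roughly C 0.333, B 0.0, A -0.333
```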
### 10.8
R_{t+1} - \bar R_t: ..., -1/3, -1/3, 2/3, -1/3, -1/3, 2/3, ...
\delta_t: -1/3 + \hat v(B) - \hat v(A), 2/3 + \hat v(C) - \hat v(B), -1/3 + \hat v(A) - \hat v(C), ...
For the first TD error (at state A): if the estimate of A is too low, the error is positive (at the true differential values every \delta is zero), so the update moves \hat v(A) up, i.e. in the direction that corrects it; the same self-correction keeps \bar R stable.
If we used only the sequence R_{t+1} - \bar R_t, the errors would stay the same periodic sequence no matter what, so \bar R would keep oscillating; the \delta-based approach therefore converges in a more stable way.
### 10.9
Just add a variable o initialized to 0; at each step update it with the recurrence from ex2.7 (o <- o + \beta (1 - o)),
then use \beta / o in place of \beta in the \bar R_t update line.
### 11.1
G_{t:t+n}: same as the definition for n-step semi-gradient Sarsa but with \hat v in place of \hat q; likewise for the w_{t+n} update.
### 11.2
Weight update: like (11.7).
G update:
- replace \bar V by the corresponding expectation of \hat q under \pi
- replace Q by \hat q
- in the continuing case, replace R_{t+1} by R_{t+1} - \bar R_{t+1}
Then what's left for a "semi-gradient version" is to convert the recursive definition in (7.17) into a closed form.
### 11.3
cf. plots/ex11.3 and ex_11_3 in figures.py
### 11.4
\bar RE(\bold w)
= \mathbb{E}[(G_t - \hat v(S_t, \bold w))^2]
= \sum_s \mu(s) \mathbb{E}[(G_t - \hat v(S_t, \bold w))^2 | S_t = s]
= \sum_s \mu(s) \mathbb{E}[(G_t - v_{\pi}(s) + v_{\pi}(s) - \hat v(S_t, \bold w))^2 | S_t = s]
= \sum_s \mu(s) \mathbb{E}[(G_t - v_{\pi}(s))^2 | S_t = s] + \sum_s \mu(s) \mathbb{E}[(\hat v(S_t, \bold w) - v_{\pi}(s))^2 | S_t = s] + 2 \sum_s \mu(s) \mathbb{E}[(G_t - v_{\pi}(s)) (\hat v(S_t, \bold w) - v_{\pi}(s)) | S_t = s]
= \mathbb{E}[(G_t - v_{\pi}(S_t))^2] + \bar VE(\bold w) + 2 \sum_s \mu(s) \mathbb{E}[G_t - v_{\pi}(s) | S_t = s] (\hat v(s, \bold w) - v_{\pi}(s)) # given S_t = s, the right factor is deterministic, so it comes out of the expectation
= \mathbb{E}[(G_t - v_{\pi}(S_t))^2] + \bar VE(\bold w) # because \mathbb{E}[G_t - v_{\pi}(s) | S_t = s] = \mathbb{E}[G_t | S_t = s] - v_{\pi}(s) = 0
### 12.1
G_t^{\lambda}\\
= (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}\\
= (1 - \lambda) G_{t:t+1} + (1 - \lambda) \sum_{n=2}^{\infty} \lambda^{n-1} G_{t:t+n}\\
= (1 - \lambda) G_{t:t+1} + (1 - \lambda) \lambda \sum_{n=2}^{\infty} \lambda^{n-2} (R_{t+1} + \gamma G_{t+1:t+n})\\
= (1 - \lambda) G_{t:t+1} + (1 - \lambda) \lambda \sum_{p=1}^{\infty} \lambda^{p-1} (R_{t+1} + \gamma G_{t+1:t+p+1})\\
= (1 - \lambda) (R_{t + 1} + \gamma \hat v(S_{t + 1}, \bold w_t)) + (1 - \lambda) \lambda [\sum_{p=1}^{\infty} \lambda^{p-1} R_{t+1} + \sum_{p=1}^{\infty} \lambda^{p-1} \gamma G_{t+1:t+p+1}]\\
= R_{t + 1} + \gamma (1 - \lambda) \hat v(S_{t + 1}, \bold w_t) + \gamma \lambda G_{t + 1}^{\lambda}\\
### 12.2
let's assume T > \tau_{\lambda}, \lambda in (0, 1), and let's abbreviate \tau_{\lambda} by \tau
the weight sum (divided by (1 - \lambda)) should be separated in two, so that \tau is the first p to satisfy:
\sum_{n=0}^p \lambda^n \ge \sum_{n=p+1}^{\infty} \lambda^n
i.e. \frac{1 - \lambda^{p+1}}{1 - \lambda} \ge \frac{\lambda^{p+1}}{1 - \lambda}
i.e. \lambda^{p+1} \le 1/2
i.e. (p+1) \ln(\lambda) \le -\ln(2)
i.e. p + 1 \ge \ln(2) / \ln(1/\lambda) (the inequality flips because \ln(\lambda) < 0)
so \tau_{\lambda} is the smallest integer p with \lambda^{p+1} \le 1/2, i.e. roughly \ln(2) / \ln(1/\lambda)
### 12.3
With a fixed \bold w we have:
G_t^{\lambda} - \hat v(S_t, \bold w)\\
= R_{t + 1} + \gamma (1 - \lambda) \hat v(S_{t + 1}, \bold w) + \gamma \lambda G_{t + 1}^{\lambda} - \hat v(S_t, \bold w)\\
= \delta_t + \gamma \lambda (G_{t + 1}^{\lambda} - \hat v(S_{t + 1}, \bold w))\\
= \delta_t + \gamma \lambda (\delta_{t+1} + \gamma \lambda (G_{t + 2}^{\lambda} - \hat v(S_{t + 2}, \bold w)))\\
= \sum_{k=0}^{T-t-1} (\gamma \lambda)^k \delta_{t+k}\\
because G_T^{\lambda} = \hat v(S_T, \bold w) = 0, the recursion stops at T and \delta_{T-1} is the last term of the sum
### 12.4
(1 / \alpha) \sum_{t=0}^{T-1} (\bold w' - \bold w) \text{ (for TD($\lambda)$)}\\
= \sum_{t=0}^{T-1} \delta_t z_t\\
= \delta_0 z_0 + \sum_{t=1}^{T-1} \delta_t (\gamma \lambda z_{t-1} + \nabla \hat v(S_t, \bold w))\\
= \delta_0 \nabla \hat v(S_0, \bold w) + \gamma \lambda \sum_{t=1}^{T-1} \delta_t z_{t-1} + \sum_{t=1}^{T-1} \delta_t \nabla \hat v(S_t, \bold w))\\
= (\gamma \lambda)^0 \sum_{t=0}^{T-0-1} \delta_{t+0} \nabla \hat v(S_t, \bold w) + \gamma \lambda \sum_{t=1}^{T-1} \delta_t z_{t-1}\\
= (\gamma \lambda)^0 T_0 + \gamma \lambda \sum_{t'=0}^{T-2} \delta_{t'+1} z_{t'}\\
= (\gamma \lambda)^0 T_0 + \gamma \lambda \sum_{t=0}^{T-2} \delta_{t+1} (\gamma \lambda z_{t-1} + \nabla \hat v(S_t, \bold w))\\
= \cdots\\
= \sum_{k=0}^{T-1} (\gamma \lambda)^k T_k\\
= \sum_{k=0}^{T-1} (\gamma \lambda)^k \sum_{t=0}^{T-k-1} \delta_{t+k} \nabla \hat v(S_t, \bold w)\\
= \sum_{t=0}^{T-1} \nabla \hat v(S_t, \bold w) [\sum_{k=0}^{T-t-1} (\gamma \lambda)^k \delta_{t+k}]\\
= \sum_{t=0}^{T-1} [G_t^{\lambda} - \hat v(S_t, \bold w)] \nabla \hat v(S_t, \bold w)\\
= (1 / \alpha) \sum_{t=0}^{T-1} (\bold w' - \bold w) \text{ (for the offline $\lambda$-return algorithm)}\\
### 12.5
Using the same recurrence relationship as in 12.1:
G_{t:t+k}^{\lambda} - \hat v(S_t, \bold w_{t-1})
= R_{t + 1} + \gamma (1 - \lambda) \hat v(S_{t + 1}, \bold w_t) + \gamma \lambda G_{t+1:t+k}^{\lambda} - \hat v(S_t, \bold w_{t-1})
= \delta_t' + \gamma \lambda (G_{t+1:t+k}^{\lambda} - \hat v(S_{t + 1}, \bold w_t))
Now, the last term in the brackets is:
G_{t+k-1:t+k}^{\lambda} - \hat v(S_{t + k - 1}, \bold w_{t + k - 2})
= G_{t+k-1:t+k} - \hat v(S_{t + k - 1}, \bold w_{t + k - 2}) (only the right term of (12.9) is non-zero)
= R_{t+k} + \gamma \hat v(S_{t + k}, \bold w_{t+k-1}) - \hat v(S_{t + k - 1}, \bold w_{t + k - 2})
= \delta_{t+k-1}'
thus we get (12.10)
### 12.6
Something like replacing the trace line (the "accumulating traces" or "replacing traces" line) with the dutch-trace update z <- \gamma \lambda z + (1 - \alpha \gamma \lambda z^T x) x should work (in the binary-feature pseudocode, where the \gamma \lambda decay is a separate line, that amounts to z_i <- z_i + 1 - \alpha \gamma \lambda z^T x for the active features).
### 12.7
G_t^{\lambda_s}
= R_{t+1} + \gamma_{t+1} ((1 - \lambda_{t+1}) \hat v(S_{t+1}, \bold w_t) + \lambda_{t+1} G_{t+1}^{\lambda_s})\\
= R_{t+1} + \gamma_{t+1}(1 - \lambda_{t+1}) \hat v(S_{t+1}, \bold w_t) + \gamma_{t+1} \lambda_{t+1} R_{t+2} + \gamma_{t+1} \gamma_{t+2} \lambda_{t+1} ((1 - \lambda_{t+2}) \hat v(S_{t+2}, \bold w_{t+1}) + \lambda_{t+2} G_{t+2}^{\lambda_s})\\
= \sum_{k=1}^{\infty} (\prod_{j=1}^{k-1} \gamma_{t+j} \lambda_{t+j}) (R_{t+k} + \gamma_{t+k} (1 - \lambda_{t+k}) \hat v(S_{t+k}, \bold w_{t+k-1}))
G_{t:h}^{\lambda_s}\\
= \sum_{k=1}^{h - t} (\prod_{j=1}^{k-1} \gamma_{t+j} \lambda_{t+j}) (R_{t+k} + \gamma_{t+k} (1 - \lambda_{t+k}) \hat v(S_{t+k}, \bold w_{t+k-1}))
G_{t:h}^{\lambda_a}\\
= \sum_{k=1}^{h - t} (\prod_{j=1}^{k-1} \gamma_{t+j} \lambda_{t+j}) (R_{t+k} + \gamma_{t+k} (1 - \lambda_{t+k}) \hat q(S_{t+k}, A_{t+k}, \bold w_{t+k-1}))
and for Expected Sarsa:
G_{t:h}^{\lambda_a}\\
= \sum_{k=1}^{h - t} (\prod_{j=1}^{k-1} \gamma_{t+j} \lambda_{t+j}) (R_{t+k} + \gamma_{t+k} (1 - \lambda_{t+k}) \bar V(S_{t+k}))
### 12.8
G_0^{\lambda_s}\\
= \rho_0 \left(R_1 + \gamma_1 \left(\left(1 - \lambda_1\right) V_1 + \lambda_1 G_1^{\lambda_s}\right) \right) + (1 - \rho_0) V_0\\
= V_0 + \rho_0\left(R_1 + \gamma_1 V_1 - V_0 + \gamma_1 \lambda_1 (G_1^{\lambda_s} - V_1)\right)\\
= V_0 + \rho_0\left(\delta_0^s + (\gamma_1 \lambda_1) \left(G_1^{\lambda_s} - V_1\right)\right)\\
= V_0 + \rho_0\left(\delta_0^s + (\gamma_1 \lambda_1) \left[\rho_1\left(\delta_1^s + (\gamma_2 \lambda_2) (G_2^{\lambda_s} - V_2)\right)\right]\right)\\
= V_0 + \rho_0 \sum_{k=0}^{\infty} \delta_{k}^s \prod_{i=1}^{k} \gamma_i \lambda_i \rho_i
### 12.9
Based on ex12.8 and 12.7, it's equation (12.24) but with
(\delta_t^s)' = R_{t+1} + \gamma_{t+1} \hat v(S_{t+1}, \bold w_t) - \hat v(S_t, \bold w_{t-1})
### 12.10
G_0^{\lambda_a} - Q_0\\
= R_1 + \gamma_1 \left(\bar V_1 + \lambda_1 \rho_1 (G_1^{\lambda_a} - Q_1)\right) - Q_0\\
= \delta_0 + \lambda_1 \rho_1 \gamma_1 (G_1^{\lambda_a} - Q_1)
### 12.11
equation (12.27) but with
(\delta_t^s)' = R_{t+1} + \gamma_{t+1} \bar V_t(S_{t+1}) - \hat q(S_t, A_t, \bold w_{t-1})
### 12.12
With the abstraction of (12.27), the steps are literally the same (we don't use anything specific to \hat v);
just replace:
- \hat v by \hat q
- S_t by S_t, A_t
we get a result without multiplying by \rho_t (because it's not in (12.27)!)
### 12.13
For the dutch trace, replace \gamma \lambda by \gamma_t \lambda_t and multiply the whole update by \rho_t.
For the replacing trace, we get z_{i,t} = \rho_t if x_{i,t} = 1, and \gamma_t \lambda_t \rho_t z_{i,t-1} otherwise.
### 12.14
Intuitively, we would need to alternate between two approximate q functions, with updates similar to those of TB(\lambda).
The most straightforward way seems to be to alternate between using \nabla \hat q_1 and \nabla \hat q_2 when updating z_t (and the corresponding weight vector).
### 13.1
Writing down the Bellman equations for v_{\pi} and solving, it turns out that d v_{\pi}(s_1) / dp_{right} has the same sign as -(p^2 - 4p + 2), whose only root in [0, 1] is 2 - sqrt(2) ≈ 0.59. And v_{\pi}(s_1) = -(2 / p) * ((2 - p) / (1 - p)) ≈ -11.66 at this p.
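A quick numerical check of this claim (not from the repo), just evaluating v_{\pi}(s_1) = -2(2 - p) / (p(1 - p)) on a grid of p:
```python
import numpy as np

p = np.linspace(0.01, 0.99, 9801)
v = -2 * (2 - p) / (p * (1 - p))
best = np.argmax(v)
print(p[best], v[best])   # ~0.586 (= 2 - sqrt(2)) and ~ -11.66
```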
### 13.2
In the box on page 199, the second term should be multiplied by \gamma (one factor per unrolling step).
Then, in the proof of the policy gradient theorem, we don't actually get \eta(s) when summing over the probabilities; we need to weight the k-step probabilities by \gamma^k to get the correct (discounted) quantity.
That's why, in the updates, every term corresponding to a state k steps away from the start state gets multiplied by \gamma^k.
And for the steps leading to (13.8), the additional step is to consider one episode, sum over all its time steps, and then conclude for an episode length that can vary.
### 13.3
\nabla \ln \pi (a|s, \theta)\\
= \nabla \left[\ln \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}} \right]\\
= \nabla h(s, a, \theta) - \nabla ln \left(\sum_b e^{h(s, b, \theta)}\right)\\
= \nabla \theta^T x(s, a) - \frac{\sum_b \nabla e^{\theta^T x(s, b)}}{\sum_u e^{h(s, u, \theta)}}\\
= x(s, a) - \sum_b \sum_k \frac{e^{\theta^T x(s, b)}}{\sum_u e^{h(s, u, \theta)}} x(s, b)_k e_k\\
= x(s, a) - \sum_b \pi(b|s, \theta) \sum_k x(s, b)_k e_k\\
= x(s, a) - \sum_b \pi(b|s, \theta) x(s, b)\\
### 13.4
Note that \nabla_{\theta_{\mu}} \mu = x_{\mu} and \nabla_{\theta_{\sigma}} \sigma = \sigma x_{\sigma}.
\nabla \ln \pi(a | s, \theta_{\mu})
= (-1 / (2 \sigma^2)) \nabla (a - \mu(s, \theta))^2
= (1 / \sigma^2) (a - \mu(s, \theta)) x_{\mu}(s)
\nabla \ln \pi(a | s, \theta_{\sigma})
= -\nabla \ln \sigma - (a - \mu)^2 \nabla (2 \sigma^2)^{-1}
= -(\sigma x_{\sigma}) / \sigma + [(a - \mu)^2 * 2 \nabla \sigma] / (2 \sigma^3)
= -x_{\sigma} + (a - \mu)^2 x_{\sigma} / \sigma^2
= ((a - \mu)^2 / \sigma^2 - 1) x_{\sigma}
### 13.5
(a)
P_t = Pr{A_t = 1}
= \pi(1 | S_t, \theta_t)
= e^{h_1} / (e^{h_1} + e^{h_0})
= 1 / (1 + e^{-(h_1 - h_0)})
= 1 / (1 + e^{-\theta_t^T x(S_t)})
(b)
\theta_{t+1} = \theta_t + \alpha \gamma^t G_t \nabla \ln \pi(A_t | S_t, \theta_t)
(c)
Let's start (as recommended) by noting that:
f'(x)
= e^{-x} / (1 + e^{-x})^2
= (1 - f(x)) f(x)
if a = 1:
\nabla \ln \pi(a | s, \theta)
= \nabla f(\theta^T x) / f(\theta^T x)
= (1 - f(\theta^T x)) \nabla \theta^T x
= (1 - f(\theta^T x)) x
= x / (1 + e^{\theta^T x})
if a = 0:
\nabla \ln \pi(a | s, \theta)
= \nabla (1 - f(\theta^T x)) / (1 - f(\theta^T x)) because Pr{A_t = 0} = 1 - Pr{A_t = 1}
= -f'(\theta^T x) x / (1 - f(\theta^T x))
= -f(\theta^T x) x
= -x / (1 + e^{-\theta^T x})
so in both cases \nabla \ln \pi(a | s, \theta) = (a - f(\theta^T x)) x, as sketched below.
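A toy sketch of a REINFORCE update for this Bernoulli-logistic unit (the contextual-bandit setup and features are made up, not from the repo); it uses the eligibility vector (a - f(\theta^T x)) x derived above:
```python
import numpy as np

rng = np.random.default_rng(4)
theta = np.zeros(2)
alpha = 0.1

for _ in range(2000):
    x = np.array([1.0, rng.normal()])          # made-up features, with a bias term
    p1 = 1.0 / (1.0 + np.exp(-theta @ x))      # Pr{A_t = 1}
    a = int(rng.random() < p1)
    reward = 1.0 if a == 1 else 0.0            # toy problem: action 1 is always better
    theta += alpha * reward * (a - p1) * x     # REINFORCE: alpha * G * grad ln pi

print(theta, 1.0 / (1.0 + np.exp(-theta[0])))  # Pr{A=1 | x=(1, 0)} should approach 1
```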