### 1.1
I think they would both start playing randomly, then each would start exploiting the bad moves the other makes while playing randomly, both getting better, until eventually both reach an optimal policy against each other.
### 1.2
We could just treat symmetric positions as the same state. It would make learning faster (roughly 4 times faster if we take the symmetries into account).
But if the opponent doesn't take advantage of symmetries, it might play differently in symmetrically equivalent positions, so we shouldn't force the same value onto all of them.
### 1.3
It would learn to play worse, because a purely greedy player can get stuck in a local maximum, never exploring some parts of the game tree.
### 1.4
When we do learn from exploratory moves, the estimated probability of winning should always be slightly underestimated (except for the last moves). When we don't learn from them, the estimates can converge to essentially 0-or-1 values, because at some point the agent _knows_ what the optimal moves are.
If we continue to make exploratory moves, then the set of non-integer probabilities is better, because it actually represents what happens in the games being played.
Maybe, if we just want to maximize wins, the 0/1-style estimates we get by never updating on exploratory moves make us risk averse (in a minimax style), which would result in more wins.
### 1.5
Other improvements: (from 1.4, make the exploration rate go down over time.) Add a discount factor? Use a neural network? Have something that decreases the step size in a smart way? Learn something from exploratory moves, but not too much? Initialize the probabilities better (instead of a flat 0.5 probability of winning, put something bigger if close to a win and less than 0.5 if close to a loss), or just use the average of wins vs. losses reachable from the position (the same value until you get close to the initial board).
A better way to solve the tic-tac-toe problem: do some meta-learning on a bunch of different players so the agent can adapt to new players quickly.
### 2.1
With \epsilon = 0.5 and two actions, the greedy action is selected with probability (1 - \epsilon) + \epsilon/2 = 0.75, since the exploratory random choice can also land on the greedy action.
### 2.2
It definitely occurred at timesteps 4 and 5 (at timestep 4 because A_4 = 2 while Q(2) < 0 = Q(3), and at timestep 5 because A_5 = 3 while Q(2) > Q(3) = 0). It could possibly have occurred at any of the other timesteps, since the random exploratory choice can coincide with the greedy action.
### 2.3
The \eps=0.01 method will be better in the long run. It will select the optimal action about 1.1 times (≈ 0.99/0.9) more often. The expected reward per step is EV = (1-\epsilon) * mean_maximum_action + \epsilon * EV(random action) = (1-\epsilon) * mean_maximum_action, because the action values were sampled from a Gaussian of mean 0.
Therefore, the solution with \epsilon = 0.01 will earn about (0.99 - 0.9) * mean_maximum_action = 0.09 * mean_maximum_action more reward per step.
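A quick numerical sketch of this argument (not part of the repo; it assumes the 10-armed testbed with q*(a) ~ N(0, 1) and fully converged value estimates):
```python
# Estimate E[max of 10 arms ~ N(0, 1)], then the asymptotic per-step reward of
# epsilon-greedy assuming the estimates have converged to the true q*(a).
import numpy as np

rng = np.random.default_rng(0)
mean_max = rng.normal(size=(100_000, 10)).max(axis=1).mean()  # ~1.54

for eps in (0.1, 0.01):
    # with prob (1 - eps) pick the best arm; a random pick has expected value 0
    print(f"eps={eps}: asymptotic reward per step ~ {(1 - eps) * mean_max:.3f}")

print(f"difference ~ {0.09 * mean_max:.3f} reward per step")
```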
### 2.4
Q_{n+1} = Q_1 \prod_{i=1}^n (1 - \alpha_i) + \sum_{i=1}^n \alpha_i R_i \prod_{k=i+1}^{n} (1 - \alpha_k)
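A small sanity-check sketch of this closed form (not from the repo; the step sizes and rewards below are arbitrary made-up numbers):
```python
# Compare the incremental update Q_{k+1} = Q_k + alpha_k (R_k - Q_k)
# against the closed form above, for an arbitrary sequence of step sizes.
import numpy as np

rng = np.random.default_rng(1)
n = 8
Q1 = 0.5
alphas = rng.uniform(0.05, 0.9, size=n)   # alpha_1 .. alpha_n
rewards = rng.normal(size=n)              # R_1 .. R_n

# incremental form
Q = Q1
for a, r in zip(alphas, rewards):
    Q += a * (r - Q)

# closed form: Q_{n+1} = prod(1 - alpha_i) Q_1 + sum_i alpha_i R_i prod_{k>i}(1 - alpha_k)
closed = np.prod(1 - alphas) * Q1
for i in range(n):
    closed += alphas[i] * rewards[i] * np.prod(1 - alphas[i + 1:])

print(Q, closed)  # the two values agree
```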
### 2.5
Main result: the average reward gets up to 2+ with alpha=0.1, but with sample averages it only goes up to about 1.5. From plotting the distance to the true q values, we see that with sample averages we at some point drift further and further away (the problem is nonstationary), whereas with a constant alpha we keep getting closer.
### 2.6
The optimistic greedy method will underperform for the first steps because it tries out the arms initialized at 5. However, there is a spike because it is (almost) guaranteed to try the best arm during the first 10 steps, on average around step 5. After 10 steps it has tried every arm at least once, so it acts greedily on estimates that already reflect some exploration, which makes it a bit better than the realistic \epsilon-greedy. So its percentage of optimal actions grows faster until it outperforms the realistic \epsilon-greedy.
### 2.7
The weight in front of Q_1 is multiplied by 1 - \beta_1 = 0 right at the beginning (since \beta_1 = 1), so there is no initial bias.
More generally, \beta_n = \frac{\alpha}{1 - (1-\alpha)^n}, which tends to \alpha when n -> +inf (the constant step size).
Replacing the \alpha_i by the value of \beta_i in the expression from 2.4 and using a product trick where almost everything simplifies, we get that
the sum of weights is \sum_{i=1}^n \frac{1-\beta}{1 - \beta^i} \beta^{n-i} \frac{1-\beta}{1 - \beta^{n-i+1}}, where \beta = 1 - \alpha.
It does sum to 1 (miraculously) for n = 1 and n = 2.
And the \beta^{n-i} factor makes the weight exponentially smaller for earlier rewards (i = 1) than for the latest reward (i = n).
EDIT: after having coded the function (cf. chapter2/weights.py), it turns out the sum doesn't really equal 1... BUT it does converge to some fixed number, so the estimate is still exponentially recency-weighted, with the total weight settling at a constant.
### 2.8
There is a spike at the 11th step because during the first 10 steps UCB forces each untried arm to be selected once (N_t(a) = 0 makes the bonus maximal); on the 11th step it can finally select the argmax of the estimates instead of a forced exploratory action, which on average is a good arm. Performance then dips again because the uncertainty bonus pushes it back toward the other arms.
### 2.9
\pi_t(a) = \sigma(H_t(a) - H_t(b)), where \sigma is the logistic (sigmoid) function; so with two actions the soft-max distribution is exactly the logistic model.
### 2.10
First case (you can't tell which case you face): the best expected probability of success is 0.5, whatever your policy.
Second case (you are told the case): 0.55, by choosing action 2 in case A and action 1 in case B.
### 2.11
After averaging over 10 bandit runs (cf. [/plots/ex2.11.png](/plots/ex2.11.png)), it seems that epsilon-greedy has the best performance, then greedy with optimistic initialization, then UCB and finally gradient bandit.
My explanation: UCB's exploration decreases over time, so it cannot really adapt to 100 000 steps of random walk. Epsilon-greedy does better than optimistic greedy because of the continued exploration. I don't know why gradient bandit completely fails on random walks (compared to the others).
### 3.1
Task 1: Real Life
- A = {sleep, work}
- S = (awake, asleep)
- P(asleep, 1 | awake, sleep) = 0.9 # going to sleep
P(awake, -1 | awake, sleep) = 0.1 # insomnia
P(asleep, -10 | awake, work) = 0.9 # sleeping at work
P(awake, 2 | awake, work) = 0.1 # working is rewarding
P(asleep, 100 | asleep, sleep) = 0.99 # going into deeper sleep
P(awake, -1000 | asleep, sleep) = 0.01 # waking up after realizing you've tried to dream within a dream, in your bed, then starting to wonder whether life is just a dream
P(asleep, 10 | asleep, work) = 0.8 # dreaming of work in your dream, and staying in your dream
P(awake, -2 | asleep, work) = 0.2 # dreaming about work because late to work, so waking up unconsciously in a bad mood
Task 2: Play Chess against Random player
- A = {every valid chess move}
- S = {possible chess positions; checkmate leads to the terminal state}
- P = {probability of the random player producing each next state, with a reward of +1 if the AI didn't block a check (or cannot), 0 if nothing happens, and -1 if the opponent captured one of your pieces or won}
Task 3: Hardcore Navigation
- A = {decide if you should accelerate, brake, change angle to left or change angle to right}
- S = {image you see from the front camera} (in R^(16x16))
- P = {0 most of the time, -np.inf if accident (terminal state)}
--> Limits to the traditional MDP model:
1. States don't fully capture the actual state of the environment (i.e. you can see the same thing multiple times but be in a different position).
2. Continuous action space.
3. Infinite / continuous state space.
4. Infinite reward.
### 3.2
It's hard to find clear exceptions, because all "goal-directed behaviour" can be reframed as an MDP where "goal achieved" = terminal state, with transitions leading to this state having a reward of 1 and all other transitions having a reward of 0.
However, in practice the MDP framework is not useful when the rewards are not (yet) well defined or are hard to define. For instance, as a clear exception, "understand a text" would be hard to define as a reward for an ML model in February 2020. As for the walking-robot problem, defining what "walking in a human way" or "behaving according to human values" means is somewhat arbitrary and hard to specify clearly.
### 3.3
The natural place to draw the line between the agent and the environment is what the agent can move in less than about 0.1 s (e.g. our muscles). The basis is "it has actual control over that thing". For a robot that limit might be different (it can send signals much quicker, but cannot move as fast). It's somewhat a free choice where you draw the line, but I still feel that "what the agent can actually move within 0.1 s" is a natural boundary for human agents.
### 3.4
state s | action | state s' | reward r   | p(s', r | s, a)
--------|--------|----------|------------|----------------
high    | search | high     | r_{search} | \alpha
high    | search | low      | r_{search} | 1 - \alpha
low     | search | high     | -3         | 1 - \beta
low     | search | low      | r_{search} | \beta
high    | wait   | high     | r_{wait}   | 1
low     | wait   | low      | r_{wait}   | 1
### 3.5
In episodic tasks, S^{+} = S \cup {terminal state}.
In (3.3), considering the terminal state as an absorbing state, we just need to replace S by S^{+}.
If we wanted something more elaborate, we might consider the terminal state as something that transitions into the first state of the next episode after a reset, but Sutton & Barto say "very slightly", so I guess changing S to S^{+} is all that's needed.
### 3.6
The return would be -\gamma^K, where K is the number of steps until failure, which is essentially the same as in the discounted continuing formulation, except that the episodic return only accounts for the first failure rather than all future ones.
### 3.7
With no discounting, every policy that eventually escapes gets a return of +1, so the agent has no incentive to escape quickly. We must also include a penalty for every step, so that it learns to get out as quickly as possible.
### 3.8
G_5 = 0
G_4 = 2
G_3 = 3 + (1/2) * 2 = 4
G_2 = 6 + (1/2) * 4 = 8
G_1 = 2 + (1/2) * 8 = 6
G_0 = -1 + (1/2) * 6 = 2
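A tiny sketch (not from the repo) of the same computation, done backward from G_5 = 0 with G_t = R_{t+1} + \gamma G_{t+1}:
```python
# Compute the returns above backward, with gamma = 0.5.
gamma = 0.5
rewards = [-1, 2, 6, 3, 2]   # R_1 .. R_5

G = 0.0
returns = []
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()
print(returns)  # [2.0, 6.0, 8.0, 4.0, 2.0] = G_0 .. G_4
```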
### 3.9
G_1 = 7 / (1-0.9) = 70
G_0 = 2 + 0.9 * 70 = 65
### 3.10
\sum_{k=0}^{n} \gamma^k = (1 - \gamma^{n+1}) / (1 - \gamma) (just multiply both sides by (1 - \gamma)). It's an increasing sum of positive terms bounded above by 1/(1 - \gamma), so it has a limit; taking the limit, the sum is 1/(1 - \gamma).
### 3.11
\mathbb{E}[R_{t+1}] = \sum_{r \in \mathcal{R}} r \sum_{s'} \sum_{a} p(s', r | s, a) \pi(a | s)
### 3.12
v_{\pi}(s) = \sum_{a} \pi(a|s) q_{\pi}(s, a)
### 3.13
q_{\pi}(s, a) = \sum_{s'} \sum_{r} p(s', r | s, a) (r + \gamma v_{\pi}(s'))
### 3.14
0.7 ~= 0.9 * (1/4) * (2.3 + 0.4 - 0.4 + 0.7) = 0.9 * 0.75 = 0.675
### 3.15
Only the intervals between rewards matter, not their sign. Using (3.8), adding a constant c adds v_c = c / (1 - \gamma) to the value of all states, leaving their relative ordering unchanged.
### 3.16
In episodic tasks it does matter: in maze running, if the reward for each step is made non-negative by adding a constant, the agent may prefer to wander forever rather than finish the task.
### 3.17
q_{\pi}(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a')]
### 3.18
v_{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[q_{\pi}(s, a)]
v_{\pi}(s) = \sum_{a} \pi(a|s) q_{\pi}(s, a)
### 3.19
q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = a]
= \sum_{r} \sum_{s'} p(s', r | s, a) (r + \gamma v_{\pi}(s'))
### 3.20
The optimal state-value function is -2 anywhere from which the driver can reach the green (just use the driver to get onto the green, then the putter), one stroke worse for each additional drive needed from farther out, -1 if already on the green, and 0 in the hole.
### 3.21
q*(s, putter) is -4 if the putt cannot reach the -2 zone of q*(., driver), -3 if it can reach that zone, -2 if the putt is close enough to reach the green, -1 if on the green, and 0 in the hole.
### 3.22
\gamma=0: left
\gamma=0.5: both
\gamma=0.9: right
### 3.23
Let mh = max(q*(h,s), q*(h,w)) and ml=max(q*(l, s), q*(l, w), q*(l, re))
let a = \alpha and b = \beta and g = \gamma
q*(h, s) = rs + g (a.mh + (1-a)ml)
q*(h, w) = rw + g.mh
q*(l, s) = b(rs + g.ml) + (1-b)(-3 + g.mh)
q*(l, w) = rw + g.ml
q*(l, re) = g.mh
### 3.24
Essentially, you write the sum of rewards as 10 + 10 * 0.9^5 + 10 * 0.9^{10} + ... = 10 \sum_{k \ge 0} 0.9^{5k} = 10 / (1 - 0.9^5) ≈ 24.419
### 3.25
v*(s) = max_a q*(s,a)
### 3.26
q*(s, a) = \sum_{r, s'} p(s', r | s, a) [r + \gamma.v*(s')]
### 3.27
\pi*(a|s) = 1 if a \in argmax_a q_*(s, a) else 0
### 3.28
\pi*(a|s) = 1 if a \in argmax_a \sum_{s', r} p(s', r | s, a)(r + \gamma v_*(s')) else 0
### 3.29
v*(s) = max_a \sum_{s'} p(s'|s, a) [r(s, a) + \gamma v*(s')]
v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} p(s'|s, a) [r(s, a) + \gamma v_{\pi}(s')]
q*(s, a) = \sum_{s'} p(s'|s, a) [r(s, a) + \gamma max_{a'} q*(s', a')]
q_{\pi}(s, a) = \sum_{s'} p(s'|s, a) [r(s, a) + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a')]
### 4.1
q_{\pi}(11, down) = -1
q_{\pi}(7, down) = -15
### 4.2
v_{\pi}(15) = -1 + (1/4)(v(12) + v(13) + v(14) + v(15)) = -1 + (1/4)(-22 - 20 - 14 + v(15)), which gives v_{\pi}(15) = -20. After the change to state 13's dynamics it is still -20: since v(15) equals the old value 13 bootstrapped from, none of the other values are perturbed.
### 4.3
(4.3) q_{\pi}(s, a) = \mathbb{E}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
(4.4) q_{\pi}(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a'|s') q_{\pi}(s', a')]
(4.5) q_{k+1}(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \sum_{a'} \pi(a'|s') q_k(s', a')]
### 4.4
My solution: keep a list of past policies and stop when the iteration loops back to a previously seen policy (see chapter4/figures).
### 4.5
cf. code in chapter4/figures
### 4.6
Intuitive changes needed:
- in 3), instead of making the policy greedy, give the argmax action probability \pi(a|s) = 1 - \epsilon + \epsilon/|A(s)| and every other action \epsilon/|A(s)|.
- in 2), make the evaluation update an expectation weighted by \pi(a|s), since the policy is now stochastic.
- in 1), initialize \pi to an \epsilon-soft policy, i.e. all \pi(a|s) >= \epsilon/|A(s)|.
### 4.7
cf. code in chapter4/figures
### 4.8
What I understand from the policy: it tries to bet exactly the amount needed to reach 50, then bets everything at 50; the intermediate targets are 25 and 75, so bets are centered around reaching 0, 25, 50, 75 or 100.
The point of betting all your money at exactly 50 (and not at 51) is that at 50 you NEED to commit to the all-in plan, but at 51 you have one extra unit of capital that you can use to try to climb higher first.
### 4.9
Results are not stable as theta goes to 1e-15 (about the smallest float), but the value function seems to become sharper and sharper and the peaks follow a clear trend (linear from 0 to 0.5, then from 0.5 to 0).
For p\_{heads} = 0.25 the value function is really close to Figure 4.3. My interpretation is that what really matters here is that p\_{heads} < 0.5.
For p\_{heads} = 0.55, the value function grows really smoothly, and the optimal policy is to always bet one (kind of a random walk with favourable odds?) and then not to move when holding 99 in capital?! Presumably because moving would bootstrap from the value of the final state (which is zero in the implementation).
### 4.10
q_{k+1}(s, a) = \sum_{r, s'} p(s', r | s, a) (r + \gamma max_{a'} q_k(s', a'))
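A minimal sketch of this q-value-iteration update on a made-up tabular MDP (not from the repo; it uses expected rewards r(s, a), which is equivalent to summing over the four-argument p):
```python
import numpy as np

n_states, n_actions = 4, 2
gamma = 0.9
rng = np.random.default_rng(2)
# p[s, a, s'] transition probabilities and r[s, a] expected rewards, made up for illustration
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.normal(size=(n_states, n_actions))

q = np.zeros((n_states, n_actions))
for _ in range(1000):
    # q_{k+1}(s, a) = r(s, a) + gamma * sum_{s'} p(s'|s, a) * max_{a'} q_k(s', a')
    q = r + gamma * p @ q.max(axis=1)
print(q)
```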
### 5.1
The value function jumps up because you're highly likely to win if you stick with 20 (and you definitely win or draw if you get 21), whereas if you hit with something like 19 you're very likely to bust and lose.
It drops off for the last row on the left because if the dealer shows an ace it's much harder to win.
The frontmost values are higher in the upper diagrams because a sum of 12 is alright when you have a usable ace, but pretty bad otherwise.
### 5.2
Because the same state can never recur within a single blackjack episode, every-visit MC performs exactly the same updates as first-visit MC, so the results would be exactly the same.
### 5.3
The root is the state-action pair; from there the trajectory unrolls until it ends in a terminal state (which is not a state-action pair).
### 5.4
Q_{n+1}(s, a) = Q_n(s, a) + (1/n) (R_n - Q_n(s, a))
cf. the implementation in MonteCarlo ES of chapter5/mc.py
### 5.5
In this case the importance sampling ratio is 1 (there is only one action, so b = \pi anyway), so weighted and ordinary importance sampling are equivalent. The return is 10, so the first-visit estimator is V(s) = 10; the every-visit estimator averages the returns from all 10 visits, V(s) = (10 + 9 + ... + 1)/10 = 5.5.
### 5.6
Same as for V(s), just replace T(s) with T(s, a), i.e. the times of visits to the state-action pair, and start the importance-sampling ratios one step later (\rho_{t+1:T-1}), since the first action is the one being evaluated and doesn't need correcting.
### 5.7
For the weighted importance-sampling method the estimates are biased (toward the behaviour-policy value) at first, so with only a few episodes the first highly weighted returns can push the error up; as more episodes are averaged the bias shrinks and the error falls again.
### 5.8
Intuitively, if we divide by the number of visits (as in every-visit MC), then in the worst case we get a factor 1/k when the length of the episode is k, which gives 0.2 * \sum_k (1.8^k / k); this still diverges (each term is at least 1/k, so it's larger than the harmonic series), so the variance is still infinite.
### 5.9
code in chapter5/mc.py class MonteCarloFirstVisit
### 5.10
Here's how the derivation goes:
1) split the numerator sum in two: W_n G_n plus the sum of W_k G_k up to n-1
2) write the denominator as C_n
3) the second part is then (C_{n-1}/C_n) times V_n
4) write C_{n-1} as C_n - W_n
5) put everything together and you get the result
### 5.11
My intuition: because we exit the inner loop as soon as A_t is not \pi(S_t), and ties are broken consistently, \pi(A_t|S_t) is either 1 or 0; if it's 0 we have already stopped, and if it's 1 then multiplying W by 1/b(A_t|S_t) is the correct importance-sampling update.
### 5.12
cf. chapter5/racetrack.py and figures.py ex5.14 (somehow it's called ex5.14 instead of ex5.12)
### 5.13
The left term equals \sum_s \sum_a b(a|s) \sum_{s',r} p(s', r | s, a) \sum_{a'} b(a'|s') ... \sum_{a^{(T-1)}} (\prod_i \frac{\pi(a^{(i)}|s^{(i)})}{b(a^{(i)}|s^{(i)})}) r.
All the later b factors cancel one by one with the b factors coming from taking the expectation; at each step what's left is a sum over \pi(a^{(i)}|s^{(i)}) (equal to 1) and a sum over the transitions p(s^{(i+1)}, r^{(i)} | s^{(i)}, a^{(i)}) (also equal to 1).
We're left with \sum_s \sum_a \pi(a|s) \sum_{s',r} p(s', r | s, a) r,
or equivalently \sum_s \sum_a b(a|s) \frac{\pi(a|s)}{b(a|s)} \sum_{s', r} p(s', r | s, a) r,
i.e. the right term.
### 5.14
The formula for V(s) carries over to Q(s, a); just replace T(s) with T(s, a), the times of visits to the pair (s, a).
Then it's the same formula \sum_k W_k G_k / \sum_k W_k; we just need to replace G_k by the truncated returns G_{t:h}, and the weights for k are (1 - \gamma) \gamma^k \rho_{t:h-1} for k <= T - t - 2, and \gamma^{T-t-1} \rho_{t:T-1} for k = T - t - 1.
Actually, the weights now depend on both t and h, so it's a sum of W(s, h) G(s, h) divided by a sum of W(s, h).
For the details, see chapter5/mc.py, OffPolicyMCControl, then truncated_weighted_avg_est.
### 6.1
G_t - V_t(S_t) = \sum_{k=t}^{T-1} \gamma^{k-t} [\delta_k + \gamma (V_{k+1}(S_{k+1}) - V_k(S_{k+1}))]
So the difference between G_t - V_t(S_t) and the sum of TD errors is bounded by (\gamma / (1 - \gamma)) * max_k ||V_{k+1} - V_k||_{\infty}, which is small when the value function changes little during the episode.
### 6.2
Considering the driving problem, a situation where the TD update might be more efficient than the MC update: when I can't wait until the end of the episode to make my prediction (e.g. online learning for an autonomous vehicle in an unknown environment), or when MC trajectories are too difficult to sample because they require an intractable number of steps to reach the end of the episode (cf. the racetrack example at the end of chapter 5).
HINT CASE: here, with TD updates, we update using the already accurate value of "entering the highway" directly, without having to simulate the rest of the episode. My intuition is that, if we stay on the original scenario, we would need O(n) TD updates where a single MC update would do.
### 6.3
The first episode ended in the left absorbing state, whose value is 0. Only V(A) changed, because for all the other states V(S_t) and V(S_{t+1}) were equal. For A, the TD error was 0 + 1 * 0 - 1/2 = -0.5, so with step size 0.1 its value decreased by 0.05.
### 6.4
Theoretical answer:
- for MC: a lower step size means slower convergence, so it won't improve the results; from the graph, results stop getting better around step size 0.3-0.4, and a higher alpha even starts to be counterproductive.
- for TD: same for lowering the step size; as for a higher step size, it makes convergence quicker but the systematic bias is higher.
Empirical answer (cf. plots/ex6.4.png): using higher step sizes did indeed lead to really high bias (converging to the wrong values with high variance), and smaller step sizes are definitely too slow (no convergence within 100 episodes), though they should give better results over more episodes.
### 6.5
Theoretical answer: my intuition is that the initial value of 0.5 maximizes the variance of the random-walk targets, so the error can rise over the first episodes as longer walks come in; using a smaller or bigger V_init makes it converge faster to a (still somewhat wrong) value.
Empirical answer (cf. plots/ex6.5.png): among the other initializations tried (V_init = 0, 0.25, 0.75 or 1), the effect only showed up clearly with 0.5.
### 6.6
First way (which I used): solve the system of Bellman equations.
Second way: dynamic programming.
I think the system of equations was used because the chain is small enough to compute the exact values directly, whereas DP would only approach them iteratively.
### 6.7
cf. implementation in chapter6/old_pol_td.py
### 6.8
It's exactly the same derivation with Q(S_t, A_t) in place of V(S_t). At the end we get 0 - 0 because we assume Q(S_T, .) = 0 for the terminal state (which is exactly the convention Sarsa needs).
### 6.9
Diagonal moves help; adding "not moving" doesn't help and only slows down learning. Indeed, the optimal solution with diagonal moves takes the same number of moves as if you only had to go right with no wind. EDIT: after printing the final trajectories: because the policy is epsilon-greedy, it doesn't actually follow an optimal trajectory with diagonal/king moves.
For an empirical comparison, see plot/6/ex6.9 (code from chapter6/figures.py -> ex6_9). With diagonal moves it learns the optimal trajectory a bit faster than with diagonals + stay put; both then grow at the same rate, which is much faster than with only left, right, up and down moves.
### 6.10
Stochastic wind with king moves takes 4-5x more timesteps to complete 1000 episodes (tested on 10 different seeds for the stochasticity; cf. plots/ex6.10 vs. plots/ex6.9).
### 6.11
Q-learning is considered an off-policy control method because the target policy is the greedy policy (that's why the target estimate is r + \gamma * Q(s', \pi(s')), with \pi the greedy policy), and it differs from the behaviour policy (the one used to choose the action a that led to r, s').
### 6.12
The formula for the weight updates is the same if both use the greedy policy to derive actions from the Q values. However, action selection can differ: in Sarsa, the next action a' is chosen before the update of Q(s, a), whereas in Q-learning the action following a (call it a') is chosen after the update of Q(s, a). This can change things if we stay in the same state s_0: we might choose a_1 when selecting before updating Q(s_0, a_0) but a_2 when selecting after the update (where a_1 != a_2, and either could be a_0).
### 6.13
cf. the formula implemented in chapter6/double_experted_sarsa.py. Essentially, the "a_max according to Q_1" of double Q-learning becomes "use \pi_1 derived from Q_1 to weight the values of Q_2 in the update of Q_1(s, a)". cf. the plot in plots/ex6.13.png for the difference between double expected Sarsa and expected Sarsa on the environment with S_A and S_B: double expected Sarsa has less bias at the beginning, but then seems to take longer to converge (over the first 300 episodes).
### 6.14
In the car rental problem we're in a state (n_1, n_2), and after moving m cars the afterstate (n_1 - m, n_2 + m) is known deterministically (and can be reached from many different state-action pairs, as in tic-tac-toe). So instead of going through all the different state-action pairs we can evaluate just the roughly n^2 afterstates, which speeds up convergence.
### 7.1
We use that G_{t:t+k} = G_{t:t+k-1} + \gamma^{k-1} \delta_{t+k-1} for k >= 2, and G_{t:t+1} = R_{t+1} + \gamma V(S_{t+1}).
G_{t:t+n} - V(S_t) = G_{t:t+n-1} + \gamma^{n-1} \delta_{t+n-1} - V(S_t)
= ...
= G_{t:t+1} + \sum_{k=1}^{n-1} \gamma^k \delta_{t+k} - V(S_t)
= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) + \sum_{k=1}^{n-1} \gamma^k \delta_{t+k}
= \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}
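A quick numerical check of this identity (not from the repo; rewards and values are arbitrary numbers, and V is held fixed as the derivation assumes):
```python
import numpy as np

rng = np.random.default_rng(3)
gamma, n = 0.9, 4
rewards = rng.normal(size=n)        # R_{t+1} .. R_{t+n}
values = rng.normal(size=n + 1)     # V(S_t) .. V(S_{t+n}), fixed during the episode

# n-step return and TD errors built from the same fixed V
G = sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * values[n]
td_errors = [rewards[k] + gamma * values[k + 1] - values[k] for k in range(n)]

print(G - values[0], sum(gamma**k * td_errors[k] for k in range(n)))  # equal
```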
### 7.2
See chapter7/figures.py ex_7_2 for details about the experiment and chapter7/plots/ex7.2.png for the plot.
We compare the results from Figure 7.2 with what we get by just replacing G with a sum of TD errors. The averaged RMS error is higher for small values of n (1 and 2), about equal (and better for large step sizes) for n = 4, and then consistently lower than Figure 7.2 for larger values of n.
### 7.3
Theory:
- I think a larger number of states (19) was used so that episodes would be much longer on average (than with just 5 states), which is where n-step returns pay off.
- I expect a smaller walk to make the best value of n smaller, since an n too large for the average episode length can update states the wrong way (e.g. even the extreme right state can get updated toward the reward from the extreme left side if n is large enough and there aren't many states).
- I don't expect the change from -1 to 1 on the left to make much difference for the best n. It just makes it natural to initialize V to 0 and have the value of the terminal state be zero; if we had 0 on the left and an initialization of 0.5 (for all but the absorbing states) I think we would get about the same errors, with maybe a difference in that starting at 0.5 means there's always something to subtract, while starting at 0 means we subtract 0 at the beginning.
Experiments:
(cf. chapter7/figures.py ex_7_3 for details about the experiment and chapter7/plots/ex7.3.png for the plot)
- indeed, with 5 states the best n became as small as... n = 1.
- indeed, changing the left reward to 0 didn't change the best value of n = 4, but the RMS error was smaller overall. I think the left reward of -1 is just there for symmetry, so that initializing V to 0 is natural.
### 7.4
Let m = min(\tau + n, T).
The right-hand term becomes: \sum_{j=0}^{m-1-\tau} \gamma^j R_{\tau+j+1} + \gamma^{m-\tau} Q_{m-1}(S_m, A_m).
If \tau + n < T: m - \tau == n, so we get G_{\tau:\tau+n}.
If \tau + n >= T: then Q_{m-1}(S_m, A_m) = Q_{T-1}(S_T, A_T) = 0, provided we add the convention that Q is always 0 on terminal states.
### 7.5
Essentially, we unroll the recursive relationship and replace the horizon h with \tau + n (to get something like (7.2)), and we get:
G_{\tau:\tau+n} = (sum of \gamma^{i-\tau-1} R_i \rho_{\tau}...\rho_{i-1}) + \rho_{\tau}...\rho_{max_idx} \gamma^{n-1} V_{\tau+n-1}(S_{\tau+n}) + (1 - \rho_{\tau}) V_{\tau+n-1}(S_{\tau})
see chapter7/off_pol_nstep_td.py for the implementation.
### 7.6
The expected value of each \rho is 1, independent of the estimates and the Qs, so you can replace them by 1; then the expected value of the Q_{h-1} estimate is exactly \bar V_{h-1}, so things simplify and we get the same expected return.
### 7.7
implemented in chapter7/off_pol_nstep_expected_sarsa.py
### 7.8
We use that u_t = G_{t:h} - V(S_t) = \rho_t (\delta_t + \gamma u_{t+1}), obtained by adding & subtracting \rho_t \gamma V(S_{t+1});
then, because G_{h:h} = V(S_h), the last term is 0, so we get:
G_{t:h} - V(S_t) = \sum_{k=t}^{h-1} (\prod_{j=t}^{k} \rho_j) \gamma^{k-t} \delta_k
### 7.9
We use that u_t = G_{t:h} - Q(S_t, A_t) = \delta_t + \gamma \rho_{t+1} u_{t+1};
then, using that:
- G_{h:h} - Q(S_h, A_h) = 0 if h < T
- G_{T-1:h} - Q(S_{T-1}, A_{T-1}) = R_T + expV(S_T) - Q(S_{T-1}, A_{T-1}) = \delta_{T-1} (because expV = 0 on terminal states),
we get that G_{t:h} - Q(S_t, A_t) is:
- \sum_{k=t}^{h-1} (\prod_{j=t+1}^{k} \rho_j) \gamma^{k-t} \delta_k if h < T
- the same with h replaced by T if h >= T
### 7.10
Done on the not-so-random walk; cf. ex7.10 in chapter7/figures.py and chapter7/off_pol_nstep_td.py, and chapter7/plots/ex7.10.png.
We get about the same results for the two methods... not really conclusive. On my plot the more complex method is slightly better, but with a different number of states it can go the other way. Maybe there are environments where the results are more conclusive, or there's an error in my code.
### 7.11
G_{t:t+n} - Q(S_t, A_t)
= \delta_t - \gamma expV(S_{t+1}) + \gamma expV(S_{t+1}) - \gamma \pi(A_{t+1}|S_{t+1}) Q(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+1}|S_{t+1}) G_{t+1:t+n}
= \delta_t + \gamma \pi(A_{t+1}|S_{t+1}) (G_{t+1:t+n} - Q(S_{t+1}, A_{t+1}))
Then, for the final proof, we distinguish how it ends, using mathematical induction:
case 1: n == 1
-> if t == T-1: then \delta_t = R_T + \gamma * 0 - Q(S_{T-1}, A_{T-1}) by definition of expV on terminal states, so Q(S_t, A_t) + \delta_t = R_T = G_{T-1:t+n} (by definition).
-> if t < T-1: we get Q(S_t, A_t) + \delta_t = R_{t+1} + \gamma expV(S_{t+1}), which is by definition G_{t:t+n} (defined in 7.15).
case 2: n >= 2, assuming the result for n-1
G_{t:t+n}
= Q(S_t, A_t) + \delta_t + \gamma \pi(A_{t+1}|S_{t+1}) (G_{t+1:t+n} - Q(S_{t+1}, A_{t+1})) (using the recurrence above)
Then we replace the G - Q part on the right by the formula applied to G_{(t+1):(t+1)+(n-1)} (induction hypothesis for n-1);
the right part of the sum becomes: \gamma \pi(A_{t+1}|S_{t+1}) \sum_{k=t+1}^{min(t+1+n-1, T-1)} \delta_k \prod_{i=t+2}^{k} \gamma \pi(A_i|S_i)
Then:
- the min is the same as the one we want,
- after multiplying by \gamma \pi(A_{t+1}|S_{t+1}) the product starts at i = t+1,
- and \delta_t becomes the first term of the sum.
### 8.1
Theory: the n-step method shouldn't do much better, as it still has to reach the goal by random exploration in the first episode. If n = 3 or so, it gets correct updates for the 3 moves before the goal. If n is larger (say 50) the random path is much messier, so many of the updates would point the wrong way.
Practice: running the n-step Sarsa from chapter 7 with n = 5 and n = 50, with n = 5 it did update correctly next to the goal (and next to walls), but that's all; with n = 50 it updated from start to goal almost always correctly, with just a few tiles wrong (like 2 tiles out of 20). (cf. the nstep_sarsa code and ex_8_1 code in figures.py and ex8.1.png in plots)
### 8.2
First phase: finding the hole in the blocking task requires exploration, so it makes sense that the Dyna-Q+ agent found the hole faster thanks to the additional exploration bonus. Also, from footnote 1 (page 168), Dyna-Q+ starts with a prior on never-taken actions of "I'm going to stay in the same place and receive a reward of 0", which is correct when hitting the limits of the map or the wall. So maybe after having found a correct path once, it kind of "assumed" that the rest would give a reward of 0 and not be a path?
Second phase: well, it adapted and found the new path faster because of the exploration bonus (it completes 6 episodes about 2x faster than Dyna-Q after the change).
### 8.3
This one was observed when running the simulations! The Dyna-Q+ agent keeps re-exploring parts of the map from time to time (because of the exploration bonus), so it's slightly suboptimal compared to the plain Dyna-Q agent once the environment is known. This slight difference becomes noticeable over 3k steps (how much depends on the exploration bonus factor \kappa; I used \kappa = 0.001).
### 8.4
Theory: it seems pretty important to change the Q-values themselves. Indeed, by design, if you change the Q-values somewhere, max(Q) changes, so the change propagates through planning to the other Q-values until it reaches the point where the agent is currently blocked. If you only add the bonus to the estimated reward WHEN CHOOSING ACTIONS, then it's only an acting heuristic, not used in planning. So my guess is: it's good at unblocking yourself when only a few steps are needed; if it's a maze where you need to make several good choices (> 5) in a row to reach the goal, it won't work.
Experiments: comparing this variant with the original Dyna-Q+ on the blocking task (cf. plots/ex8.4.png), the weakness of the new version is that it takes more time to find the correct solution when the environment changes, but then its episode lengths are shorter on average because it doesn't need to re-try every part of the map it hasn't explored in a while.
### 8.5
Instead of storing "THE" (s', r) for a transition (s, a), you sample from the empirical distribution of what you already encountered for (s, a) (e.g. if you observed s', r = 1, 0 once and 2, 3 twice, then the first outcome has probability 1/3 and the second 2/3). See the sketch below.
This might perform poorly in a changing environment because, if going up always moved the agent up at the beginning, it will take a long time for the "distribution" above to shift to the correct model (with 1000 wrong instances, getting something 99% correct after a change in the environment would require 99,000 new transitions from the new environment).
For something that is good in both stochastic and changing environments, we would still need a distribution in the model (as mentioned), but this distribution should be rapidly updated for outcomes not seen in a while (for instance, if "up" hasn't caused an upward move in a while, the probability of that outcome in the model could decrease proportionally to 1 / (number of timesteps since last seen)).
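A minimal sketch of the count-based sample model described above (this is an illustration, not the repo's implementation):
```python
import random
from collections import defaultdict

class StochasticModel:
    def __init__(self):
        self.outcomes = defaultdict(list)        # (s, a) -> list of observed (s', r)

    def update(self, s, a, s_next, r):
        self.outcomes[(s, a)].append((s_next, r))

    def sample(self, s, a):
        # sampling a stored outcome uniformly == sampling with empirical probabilities
        return random.choice(self.outcomes[(s, a)])

model = StochasticModel()
model.update("A", "up", "B", 0)
model.update("A", "up", "C", 3)
model.update("A", "up", "C", 3)
print(model.sample("A", "up"))   # ("C", 3) with probability 2/3
```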
### 8.6
This would weaken the case of sample updates over expected updates.
Let's say there's one state-action pair that leads to a reward of 10 ** 20 with probability 10 ** (-15) and 0 otherwise. No matter how many samples you take, you won't do more than about 10 ** 15 samples for this state-action pair (given current hardware limits), so you're very unlikely ever to sample this huge (possible) reward, and you'll badly underestimate an expected value of 10 ** 5 for this pair.
### 8.7
The graph shows the value of a start state under the greedy policy. For b = 1, the two actions from each state each lead deterministically to a single next state.
My guesses for why it's scalloped in the early portions:
- in each regime, the model oscillates between the two actions for each state, which for b = 1 corresponds to two possible successor states.
- it says "all the while assuming the model is correct": with two possible next states, the "wrong" model oscillates between two modes for each state, which is what produces the scalloping.
Parts of the data that support this:
- the lower graph with b = 1 shows the same behaviour for uniform updates,
- it's somewhat visible with b = 3 and not with b = 10, so the scalloping fades as b grows, which fits the explanation above.
Why it doesn't also show for b = 1 on-policy:
- my guess is that the b = 1 on-policy runs converge so quickly that we don't get to see the scalloping.
### 9.1
In linear function approximation, \hat v(s, \bold w) = \bold w^T \bold x(s).
Now let \bold x(s) be a vector of length |S| that one-hot encodes the identity of the state (e.g. (1, 0, ..., 0) for the first state).
Then the linear method is just learning one weight per state, w_1, ..., w_{|S|}, which approximate v_{\pi}(s_1), ..., v_{\pi}(s_{|S|}); the tabular case is exactly linear function approximation with one-hot state features.
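A tiny sketch (made-up 3-state example, not from the repo) showing that the linear semi-gradient TD(0) update with one-hot features reduces to the tabular TD(0) update:
```python
import numpy as np

n_states, alpha, gamma = 3, 0.1, 0.9
x = np.eye(n_states)                 # x(s) is the one-hot encoding of state s
w = np.zeros(n_states)               # w[s] plays the role of the tabular V(s)

s, r, s_next = 0, 1.0, 2             # one hypothetical transition
delta = r + gamma * w @ x[s_next] - w @ x[s]
w += alpha * delta * x[s]            # only w[s] changes, by alpha * delta
print(w)
```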
### 9.2
(n + 1)^k features for dimension k, because each of the k factors in a product can take n + 1 possible exponents (0 through n), giving (n + 1)^k combinations. The reason we use all those combinations is that this is exactly what's needed (no more, no less) to capture all the interactions between the variables up to order n.
### 9.3
n = 2
c_{1,1}, c_{1, 2} = (0, 0)
c_{2,1}, c_{2, 2} = (1, 0)
c_{3,1}, c_{3, 2} = (0, 1)
c_{4,1}, c_{4, 2} = (1, 1)
c_{5,1}, c_{5, 2} = (2, 0)
c_{6,1}, c_{6, 2} = (0, 2)
c_{7,1}, c_{7, 2} = (1, 2)
c_{8,1}, c_{8, 2} = (2, 1)
c_{9,1}, c_{9, 2} = (2, 2)
### 9.4
tiles that are elongated along one dimension, such as stripe tilings from 9.12 (middle), or log stripes.
### 9.5
\alpha = \frac{1}{10 * 98} (\tau = 10 and 98 tilings, so there are 98 active tiles in every feature vector and E[x^T x] = 98; using \alpha = 1/(\tau E[x^T x]) from section 9.5.4)
### 9.6
Using (9.7), the fact that U_t is an unbiased estimate, and the definition of \hat v and \nabla \hat v for linear methods: \mathbb{E}[\bold w_{t+1}^T x(S_t) - v_{\pi}(S_t)] = \mathbb{E}[\bold w_t^T x(S_t) + \alpha (v_{\pi}(S_t) - \bold w_t^T x(S_t)) x(S_t)^T x(S_t) - v_{\pi}(S_t)].
Then use (9.19) with \tau = 1, i.e. \alpha = \frac{1}{\mathbb{E}[x^T x]};
so, on the right side, when we take the expected value of x(S_t)^T x(S_t) and multiply by \alpha, we get 1.
The v_{\pi} and \bold w_t^T x(S_t) terms cancel and we get 0, which means the error is reduced (on average) to zero in one update.
### 9.7
\hat v(s, w) = 1 / (1 + e^{-w^T x(s)})
\nabla \hat v(s, w) = (e^{-w^T x(s)} / (1 + e^{-w^T x(s)})^2) x(s) = \hat v(s, w) (1 - \hat v(s, w)) x(s)
### 9.8
From deriving with the cross-entropy loss, the \hat v (1 - \hat v) factor cancels and the update becomes w_{t+1} = w_t + \alpha (v_{\pi}(S_t) - \hat v(S_t, w_t)) x(S_t), i.e. the same simple form as in the linear case.
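A short sketch (not from the repo) of the logistic value estimate from 9.7 together with the two updates: the squared-error semi-gradient one and the cross-entropy one, where the \hat v (1 - \hat v) factor cancels. The feature vector and target below are made up, and targets are assumed to lie in (0, 1):
```python
import numpy as np

def v_hat(w, x):
    # logistic value estimate sigma(w^T x)
    return 1.0 / (1.0 + np.exp(-w @ x))

def grad_v_hat(w, x):
    # gradient v_hat * (1 - v_hat) * x
    v = v_hat(w, x)
    return v * (1.0 - v) * x

def mse_update(w, x, target, alpha):
    # squared-error semi-gradient update (exercise 9.7)
    return w + alpha * (target - v_hat(w, x)) * grad_v_hat(w, x)

def cross_entropy_update(w, x, target, alpha):
    # cross-entropy update: the sigma'(.) factor cancels (exercise 9.8)
    return w + alpha * (target - v_hat(w, x)) * x

w = np.zeros(4)
x = np.array([1.0, 0.5, -0.3, 2.0])
print(mse_update(w, x, 0.8, 0.1), cross_entropy_update(w, x, 0.8, 0.1))
```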
### 10.1
For Monte Carlo methods we would have U_t = G_t as the target, where G_t is the full return.
It's reasonable not to give pseudocode, as it wouldn't be very different from the Gradient Monte Carlo algorithm (or Monte Carlo Exploring Starts, since it's a control problem).
For Mountain Car, my intuition is that it would take many exploration steps to reach the top, and if it did reach the top it would be hard to credit why. In short, it would perform really poorly; my first guess is that it wouldn't even terminate its first episode.
### 10.2
Same as the Sarsa version, but with an expected value over the possible next actions a, i.e. \sum_a \pi(a|S_p) \hat q(S_p, a, \bold w), where the expectation is taken under, say, the epsilon-greedy policy.
### 10.3
Because this is Mountain Car with a constant reward of -1 per step, the n-step return is roughly -n plus the bootstrap term; with a large n the targets are larger in magnitude and noisier, so the weights move more at each update, hence the higher standard error.
### 10.4
Just remove the step where A_p is chosen before the update in differential semi-gradient Sarsa, and in the TD error use a max_{a_p} \hat q(S_p, a_p, \bold w) term instead of \hat q(S_p, A_p, \bold w); the next action is then selected (e.g. epsilon-greedily) after the update. A sketch follows.
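A pseudocode-style sketch of that differential semi-gradient Q-learning step. The linear q_hat, the feats(s, a) feature function, the actions list and the env.step(a) -> (r, s_next) interface are all assumptions for illustration, not the repo's API:
```python
import numpy as np

def q_hat(w, s, a, feats):
    return w @ feats(s, a)

def step(w, avg_reward, s, env, feats, actions, alpha=0.1, beta=0.01, eps=0.1,
         rng=np.random.default_rng()):
    # behaviour policy: epsilon-greedy on the current estimates
    if rng.random() < eps:
        a = rng.choice(actions)
    else:
        a = max(actions, key=lambda b: q_hat(w, s, b, feats))
    r, s_next = env.step(a)
    # TD error uses max over next actions instead of the action actually taken next
    delta = (r - avg_reward
             + max(q_hat(w, s_next, b, feats) for b in actions)
             - q_hat(w, s, a, feats))
    avg_reward += beta * delta
    w += alpha * delta * feats(s, a)
    return w, avg_reward, s_next
```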
### 10.5
We're missing how to update the weights, which is essentially (10.12) with \hat v instead of \hat q: \bold w_{t+1} = \bold w_t + \alpha \delta_t \nabla \hat v(S_t, \bold w_t).
### 10.6
The average reward is 0.5.
For A: lim_{\gamma -> 1} lim_{h -> \infty} \sum_t \gamma^t (R_{t+1} - \bar r) = lim_{\gamma -> 1} (1/2) \sum_t (-\gamma)^t = lim_{\gamma -> 1} 0.5 / (1 + \gamma) = 1/4.
For B: the same with the opposite sign, so -1/4.
### 10.7
Separating out the r(\pi) part, the reward sequences inside the parentheses are (0, 0, 1, ...) for A, (0, 1, 0, ...) for B and (1, 0, 0, ...) for C, each minus \bar r = 1/3.
For C the sum is \sum_t \gamma^{3t} - (1/3) \sum_t \gamma^t = 1/(1 - \gamma^3) - (1/3)(1/(1 - \gamma)), which tends to 1/3 as \gamma -> 1.
For B you get an extra factor \gamma on the first part, \gamma/(1 - \gamma^3) - (1/3)(1/(1 - \gamma)), whose limit is 0; for A a factor \gamma^2, giving a limit of -1/3. So the differential values are v(C) = 1/3, v(B) = 0 and v(A) = -1/3 (note they sum to 0).
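A numerical check of these limits (not from the repo), assuming the reward cycles written above and \bar r = 1/3:
```python
import numpy as np

def discounted_diff_value(first_reward_at, gamma, horizon=200_000):
    # sum_t gamma^t (R_{t+1} - 1/3) for a reward of 1 every third step
    t = np.arange(horizon)
    rewards = (t % 3 == first_reward_at).astype(float)
    return np.sum(gamma**t * (rewards - 1.0 / 3.0))

for name, k in (("C", 0), ("B", 1), ("A", 2)):
    print(name, round(discounted_diff_value(k, gamma=0.9999), 3))
# prints roughly C 0.333, B 0.0, A -0.333
```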
### 10.8
R_{t+1} - \bar R_t: ..., -1/3, -1/3, 2/3, -1/3, -1/3, 2/3, ...
\delta_t: -1/3 + \hat v(B) - \hat v(A), 2/3 + \hat v(C) - \hat v(B), -1/3 + \hat v(A) - \hat v(C), ...
For the first TD error (at state A): if the estimate of A is too low, the error is positive (at the true differential values every \delta is zero), so the update moves \hat v(A) up, i.e. in the direction that corrects it; the same self-correction keeps \bar R stable.
If we used only the sequence R_{t+1} - \bar R_t, the errors would stay the same periodic sequence no matter what, so \bar R would keep oscillating; the \delta-based approach therefore converges in a more stable way.
### 10.9
Just add a variable o initialized to 0; at each step update it with the recurrence from ex2.7 (o <- o + \beta (1 - o)),
then use \beta / o in place of \beta in the \bar R_t update line.
### 11.1
G_{t:t+n}: same as the definition for n-step semi-gradient Sarsa but with \hat v in place of \hat q; likewise for the w_{t+n} update.
### 11.2
Weight update: like (11.7).
G update:
- replace \bar V by the corresponding expectation of \hat q under \pi
- replace Q by \hat q
- in the continuing case, replace R_{t+1} by R_{t+1} - \bar R_{t+1}
Then what's left for a "semi-gradient version" is to convert the recursive definition in (7.17) into a closed form.
### 11.3
cf. plots/ex11.3 and ex_11_3 in figures.py
### 11.4
\bar RE(\bold w)
= \mathbb{E}[(G_t - \hat v(S_t, \bold w))^2]
= \sum_s \mu(s) \mathbb{E}[(G_t - \hat v(S_t, \bold w))^2 | S_t = s]
= \sum_s \mu(s) \mathbb{E}[(G_t - v_{\pi}(s) + v_{\pi}(s) - \hat v(S_t, \bold w))^2 | S_t = s]
= \sum_s \mu(s) \mathbb{E}[(G_t - v_{\pi}(s))^2 | S_t = s] + \sum_s \mu(s) \mathbb{E}[(\hat v(S_t, \bold w) - v_{\pi}(s))^2 | S_t = s] + 2 \sum_s \mu(s) \mathbb{E}[(G_t - v_{\pi}(s)) (\hat v(S_t, \bold w) - v_{\pi}(s)) | S_t = s]
= \mathbb{E}[(G_t - v_{\pi}(S_t))^2] + \bar VE(\bold w) + 2 \sum_s \mu(s) \mathbb{E}[G_t - v_{\pi}(s) | S_t = s] (\hat v(s, \bold w) - v_{\pi}(s)) # given S_t = s, the right factor is deterministic, so it comes out of the expectation
= \mathbb{E}[(G_t - v_{\pi}(S_t))^2] + \bar VE(\bold w) # because \mathbb{E}[G_t - v_{\pi}(s) | S_t = s] = \mathbb{E}[G_t | S_t = s] - v_{\pi}(s) = 0
### 12.1
G_t^{\lambda}\\
= (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}\\
= (1 - \lambda) G_{t:t+1} + (1 - \lambda) \sum_{n=2}^{\infty} \lambda^{n-1} G_{t:t+n}\\
= (1 - \lambda) G_{t:t+1} + (1 - \lambda) \lambda \sum_{n=2}^{\infty} \lambda^{n-2} (R_{t+1} + \gamma G_{t+1:t+n})\\
= (1 - \lambda) G_{t:t+1} + (1 - \lambda) \lambda \sum_{p=1}^{\infty} \lambda^{p-1} (R_{t+1} + \gamma G_{t+1:t+p+1})\\
= (1 - \lambda) (R_{t + 1} + \gamma \hat v(S_{t + 1}, \bold w_t)) + (1 - \lambda) \lambda [\sum_{p=1}^{\infty} \lambda^{p-1} R_{t+1} + \sum_{p=1}^{\infty} \lambda^{p-1} \gamma G_{t+1:t+p+1}]\\
= R_{t + 1} + \gamma (1 - \lambda) \hat v(S_{t + 1}, \bold w_t) + \gamma \lambda G_{t + 1}^{\lambda}\\
### 12.2
let's assume T > \tau_{\lambda}, \lambda in (0, 1), and let's abbreviate \tau_{\lambda} by \tau
the weight sum (divided by (1 - \lambda)) should be separated in two, so that \tau is the first p to satisfy:
\sum_{n=0}^p \lambda^n \ge \sum_{n=p+1}^{\infty} \lambda^n
i.e. \frac{1 - \lambda^{p+1}}{1 - \lambda} \ge \frac{\lambda^{p+1}}{1 - \lambda}
i.e. \lambda^{p+1} \le 1/2
i.e. (p+1) \ln(\lambda) \le -\ln(2)
i.e. p + 1 \ge \ln(2) / \ln(1/\lambda) (the inequality flips because \ln(\lambda) < 0)
so \tau_{\lambda} is the smallest integer p with \lambda^{p+1} \le 1/2, i.e. roughly \ln(2) / \ln(1/\lambda)
### 12.3
With a fixed \bold w we have:
G_t^{\lambda} - \hat v(S_t, \bold w)\\
= R_{t + 1} + \gamma (1 - \lambda) \hat v(S_{t + 1}, \bold w) + \gamma \lambda G_{t + 1}^{\lambda} - \hat v(S_t, \bold w)\\
= \delta_t + \gamma \lambda (G_{t + 1}^{\lambda} - \hat v(S_{t + 1}, \bold w))\\
= \delta_t + \gamma \lambda (\delta_{t+1} + \gamma \lambda (G_{t + 2}^{\lambda} - \hat v(S_{t + 2}, \bold w)))\\
= \sum_{k=0}^{T-t-1} (\gamma \lambda)^k \delta_{t+k}\\
because G_T^{\lambda} = \hat v(S_T, \bold w) = 0, the recursion stops at T and \delta_{T-1} is the last term of the sum
### 12.4
(1 / \alpha) \sum_{t=0}^{T-1} (\bold w' - \bold w) \text{ (for TD($\lambda)$)}\\
= \sum_{t=0}^{T-1} \delta_t z_t\\
= \delta_0 z_0 + \sum_{t=1}^{T-1} \delta_t (\gamma \lambda z_{t-1} + \nabla \hat v(S_t, \bold w))\\
= \delta_0 \nabla \hat v(S_0, \bold w) + \gamma \lambda \sum_{t=1}^{T-1} \delta_t z_{t-1} + \sum_{t=1}^{T-1} \delta_t \nabla \hat v(S_t, \bold w))\\
= (\gamma \lambda)^0 \sum_{t=0}^{T-0-1} \delta_{t+0} \nabla \hat v(S_t, \bold w) + \gamma \lambda \sum_{t=1}^{T-1} \delta_t z_{t-1}\\
= (\gamma \lambda)^0 T_0 + \gamma \lambda \sum_{t'=0}^{T-2} \delta_{t'+1} z_{t'}\\
= (\gamma \lambda)^0 T_0 + \gamma \lambda \sum_{t=0}^{T-2} \delta_{t+1} (\gamma \lambda z_{t-1} + \nabla \hat v(S_t, \bold w))\\
= \cdots\\
= \sum_{k=0}^{T-1} (\gamma \lambda)^k T_k\\
= \sum_{k=0}^{T-1} (\gamma \lambda)^k \sum_{t=0}^{T-k-1} \delta_{t+k} \nabla \hat v(S_t, \bold w)\\
= \sum_{t=0}^{T-1} \nabla \hat v(S_t, \bold w) [\sum_{k=0}^{T-t-1} (\gamma \lambda)^k \delta_{t+k}]\\
= \sum_{t=0}^{T-1} [G_t^{\lambda} - \hat v(S_t, \bold w)] \nabla \hat v(S_t, \bold w)\\
= (1 / \alpha) \sum_{t=0}^{T-1} (\bold w' - \bold w) \text{ (for the offline $\lambda$-return algorithm)}\\
### 12.5
Using the same recurrence relationship as in 12.1:
G_{t:t+k}^{\lambda} - \hat v(S_t, \bold w_{t-1})
= R_{t + 1} + \gamma (1 - \lambda) \hat v(S_{t + 1}, \bold w_t) + \gamma \lambda G_{t+1:t+k}^{\lambda} - \hat v(S_t, \bold w_{t-1})
= \delta_t' + \gamma \lambda (G_{t+1:t+k}^{\lambda} - \hat v(S_{t + 1}, \bold w_t))
Now, the last term in the brackets is:
G_{t+k-1:t+k}^{\lambda} - \hat v(S_{t + k - 1}, \bold w_{t + k - 2})
= G_{t+k-1:t+k} - \hat v(S_{t + k - 1}, \bold w_{t + k - 2}) (only the right term of (12.9) is non-zero)
= R_{t+k} + \gamma \hat v(S_{t + k}, \bold w_{t+k-1}) - \hat v(S_{t + k - 1}, \bold w_{t + k - 2})
= \delta_{t+k-1}'
thus we get (12.10)
### 12.6
Something like replacing the trace line (the "accumulating traces" or "replacing traces" line) with the dutch-trace update z <- \gamma \lambda z + (1 - \alpha \gamma \lambda z^T x) x should work (in the binary-feature pseudocode, where the \gamma \lambda decay is a separate line, that amounts to z_i <- z_i + 1 - \alpha \gamma \lambda z^T x for the active features).
### 12.7
G_t^{\lambda_s}
= R_{t+1} + \gamma_{t+1} ((1 - \lambda_{t+1}) \hat v(S_{t+1}, \bold w_t) + \lambda_{t+1} G_{t+1}^{\lambda_s})\\
= R_{t+1} + \gamma_{t+1}(1 - \lambda_{t+1}) \hat v(S_{t+1}, \bold w_t) + \gamma_{t+1} \lambda_{t+1} R_{t+2} + \gamma_{t+1} \gamma_{t+2} \lambda_{t+1} ((1 - \lambda_{t+2}) \hat v(S_{t+2}, \bold w_{t+1}) + \lambda_{t+2} G_{t+2}^{\lambda_s})\\
= \sum_{k=1}^{\infty} (\prod_{j=1}^{k-1} \gamma_{t+j} \lambda_{t+j}) (R_{t+k} + \gamma_{t+k} (1 - \lambda_{t+k}) \hat v(S_{t+k}, \bold w_{t+k-1}))
G_{t:h}^{\lambda_s}\\
= \sum_{k=1}^{h - t} (\prod_{j=1}^{k-1} \gamma_{t+j} \lambda_{t+j}) (R_{t+k} + \gamma_{t+k} (1 - \lambda_{t+k}) \hat v(S_{t+k}, \bold w_{t+k-1}))
G_{t:h}^{\lambda_a}\\
= \sum_{k=1}^{h - t} (\prod_{j=1}^{k-1} \gamma_{t+j} \lambda_{t+j}) (R_{t+k} + \gamma_{t+k} (1 - \lambda_{t+k}) \hat q(S_{t+k}, A_{t+k}, \bold w_{t+k-1}))
and for Expected Sarsa:
G_{t:h}^{\lambda_a}\\
= \sum_{k=1}^{h - t} (\prod_{j=1}^{k-1} \gamma_{t+j} \lambda_{t+j}) (R_{t+k} + \gamma_{t+k} (1 - \lambda_{t+k}) \bar V(S_{t+k}))
### 12.8
G_0^{\lambda_s}\\
= \rho_0 \left(R_1 + \gamma_1 \left(\left(1 - \lambda_1\right) V_1 + \lambda_1 G_1^{\lambda_s}\right) \right) + (1 - \rho_0) V_0\\
= V_0 + \rho_0\left(R_1 + \gamma_1 V_1 - V_0 + \gamma_1 \lambda_1 (G_1^{\lambda_s} - V_1)\right)\\
= V_0 + \rho_0\left(\delta_0^s + (\gamma_1 \lambda_1) \left(G_1^{\lambda_s} - V_1\right)\right)\\
= V_0 + \rho_0\left(\delta_0^s + (\gamma_1 \lambda_1) \left[\rho_1\left(\delta_1^s + (\gamma_2 \lambda_2) (G_2^{\lambda_s} - V_2)\right)\right]\right)\\
= V_0 + \rho_0 \sum_{k=0}^{\infty} \delta_{k}^s \prod_{i=1}^{k} \gamma_i \lambda_i \rho_i
### 12.9
Based on ex12.8 and 12.7, it's equation (12.24) but with
(\delta_t^s)' = R_{t+1} + \gamma_{t+1} \hat v(S_{t+1}, \bold w_t) - \hat v(S_t, \bold w_{t-1})
### 12.10
G_0^{\lambda_a} - Q_0\\
= R_1 + \gamma_1 \left(\bar V_1 + \lambda_1 \rho_1 (G_1^{\lambda_a} - Q_1)\right) - Q_0\\
= \delta_0 + \lambda_1 \rho_1 \gamma_1 (G_1^{\lambda_a} - Q_1)
### 12.11
equation (12.27) but with
(\delta_t^s)' = R_{t+1} + \gamma_{t+1} \bar V_t(S_{t+1}) - \hat q(S_t, A_t, \bold w_{t-1})
### 12.12
With the abstraction of (12.27), the steps are literally the same (we don't use anything specific to \hat v);
just replace:
- \hat v by \hat q
- S_t by S_t, A_t
we get a result without multiplying by \rho_t (because it's not in (12.27)!)
### 12.13
For the dutch trace, replace \gamma \lambda by \gamma_t \lambda_t and multiply the whole update by \rho_t.
For the replacing trace, we get z_{i,t} = \rho_t if x_{i,t} = 1, and \gamma_t \lambda_t \rho_t z_{i,t-1} otherwise.
### 12.14
Intuitively, we would need to alternate between two approximate q functions, with updates similar to those of TB(\lambda).
The most straightforward way seems to be to alternate between using \nabla \hat q_1 and \nabla \hat q_2 when updating z_t (and the corresponding weight vector).
### 13.1
Writing down the Bellman equations for v_{\pi} and solving, it turns out that d v_{\pi}(s_1) / dp_{right} has the same sign as -(p^2 - 4p + 2), whose only root in [0, 1] is 2 - sqrt(2) ≈ 0.59. And v_{\pi}(s_1) = -(2 / p) * ((2 - p) / (1 - p)) ≈ -11.66 at this p.
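A quick numerical check of this claim (not from the repo), just evaluating v_{\pi}(s_1) = -2(2 - p) / (p(1 - p)) on a grid of p:
```python
import numpy as np

p = np.linspace(0.01, 0.99, 9801)
v = -2 * (2 - p) / (p * (1 - p))
best = np.argmax(v)
print(p[best], v[best])   # ~0.586 (= 2 - sqrt(2)) and ~ -11.66
```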
### 13.2
In the box on page 199, the second term should be multiplied by \gamma (one factor per unrolling step).
Then, in the proof of the policy gradient theorem, we don't actually get \eta(s) when summing over the probabilities; we need to weight the k-step probabilities by \gamma^k to get the correct (discounted) quantity.
That's why, in the updates, every term corresponding to a state k steps away from the start state gets multiplied by \gamma^k.
And for the steps leading to (13.8), the additional step is to consider one episode, sum over all its time steps, and then conclude for an episode length that can vary.
### 13.3
\nabla \ln \pi (a|s, \theta)\\
= \nabla \left[\ln \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}} \right]\\
= \nabla h(s, a, \theta) - \nabla ln \left(\sum_b e^{h(s, b, \theta)}\right)\\
= \nabla \theta^T x(s, a) - \frac{\sum_b \nabla e^{\theta^T x(s, b)}}{\sum_u e^{h(s, u, \theta)}}\\
= x(s, a) - \sum_b \sum_k \frac{e^{\theta^T x(s, b)}}{\sum_u e^{h(s, u, \theta)}} x(s, b)_k e_k\\
= x(s, a) - \sum_b \pi(b|s, \theta) \sum_k x(s, b)_k e_k\\
= x(s, a) - \sum_b \pi(b|s, \theta) x(s, b)\\
### 13.4
Note that \nabla_{\theta_{\mu}} \mu = x_{\mu} and \nabla_{\theta_{\sigma}} \sigma = \sigma x_{\sigma}.
\nabla \ln \pi(a | s, \theta_{\mu})
= (-1 / (2 \sigma^2)) \nabla (a - \mu(s, \theta))^2
= (1 / \sigma^2) (a - \mu(s, \theta)) x_{\mu}(s)
\nabla \ln \pi(a | s, \theta_{\sigma})
= -\nabla \ln \sigma - (a - \mu)^2 \nabla (2 \sigma^2)^{-1}
= -(\sigma x_{\sigma}) / \sigma + [(a - \mu)^2 * 2 \nabla \sigma] / (2 \sigma^3)
= -x_{\sigma} + (a - \mu)^2 x_{\sigma} / \sigma^2
= ((a - \mu)^2 / \sigma^2 - 1) x_{\sigma}
### 13.5
(a)
P_t = Pr{A_t = 1}
= \pi(1 | S_t, \theta_t)
= e^{h_1} / (e^{h_1} + e^{h_0})
= 1 / (1 + e^{-(h_1 - h_0)})
= 1 / (1 + e^{-\theta_t^T x(S_t)})
(b)
\theta_{t+1} = \theta_t + \alpha \gamma^t G_t \nabla \ln \pi(A_t | S_t, \theta_t)
(c)
Let's start (as recommended) by noting that:
f'(x)
= e^{-x} / (1 + e^{-x})^2
= (1 - f(x)) f(x)
if a = 1:
\nabla \ln \pi(a | s, \theta)
= \nabla f(\theta^T x) / f(\theta^T x)
= (1 - f(\theta^T x)) \nabla \theta^T x
= (1 - f(\theta^T x)) x
= x / (1 + e^{\theta^T x})
if a = 0:
\nabla \ln \pi(a | s, \theta)
= \nabla (1 - f(\theta^T x)) / (1 - f(\theta^T x)) because Pr{A_t = 0} = 1 - Pr{A_t = 1}
= -f'(\theta^T x) x / (1 - f(\theta^T x))
= -f(\theta^T x) x
= -x / (1 + e^{-\theta^T x})
so in both cases \nabla \ln \pi(a | s, \theta) = (a - f(\theta^T x)) x, as sketched below.
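A toy sketch of a REINFORCE update for this Bernoulli-logistic unit (the contextual-bandit setup and features are made up, not from the repo); it uses the eligibility vector (a - f(\theta^T x)) x derived above:
```python
import numpy as np

rng = np.random.default_rng(4)
theta = np.zeros(2)
alpha = 0.1

for _ in range(2000):
    x = np.array([1.0, rng.normal()])          # made-up features, with a bias term
    p1 = 1.0 / (1.0 + np.exp(-theta @ x))      # Pr{A_t = 1}
    a = int(rng.random() < p1)
    reward = 1.0 if a == 1 else 0.0            # toy problem: action 1 is always better
    theta += alpha * reward * (a - p1) * x     # REINFORCE: alpha * G * grad ln pi

print(theta, 1.0 / (1.0 + np.exp(-theta[0])))  # Pr{A=1 | x=(1, 0)} should approach 1
```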