Implementation analysis
Guzman GP edited this page Feb 22, 2019 · 5 revisions
These explanations describe the functions in the diagram's pseudocode. Starting with the main ones:
- observe reward value: receives the grid symbol stepped on and returns its respective reward. Its boolean return value determines whether the current episode ends.
- extract possible actions: the actions available along the grid are defined here, in this case as spatial movement directions. If necessary, they can be redesigned for other applications.
- choose action: applies the ε-greedy policy when selecting the action (exploration vs. exploitation). It could be swapped for another strategy, for example Boltzmann exploration.
- learn: the Bellman equation update. This function coordinates most of the previously described ones and fills the q table attribute.
- infer path: deduces the optimal policy from the q table, given an initial state and a maximum number of steps.

Next, the secondary functions used for sensitivity analysis and visualization purposes:
- visualize inferred path: string-logged representation of the states resulting from the optimal policy.
- visualize max quality action: Seaborn and Matplotlib coloured representation of the maximum quality value in each state.
- q_value_ascii_action: translates the resulting quality values into Unicode characters.

Another plotted sensitivity-convergence function can be found in the Jupyter notebook practical example.
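The main functions above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the grid layout, reward values, and hyperparameters (`alpha`, `gamma`, `epsilon`) are assumptions made here for the sake of a runnable example.

```python
import random

# Hypothetical grid: 'S' start, 'G' goal (+1), 'H' hole (-1), '.' free (0).
GRID = ["S..",
        ".H.",
        "..G"]
REWARDS = {"G": 1.0, "H": -1.0, ".": 0.0, "S": 0.0}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def observe_reward_value(state):
    """Reward for the symbol stepped on; the boolean flags episode end."""
    symbol = GRID[state[0]][state[1]]
    return REWARDS[symbol], symbol in ("G", "H")

def extract_possible_actions(state):
    """Spatial movement directions that stay inside the grid."""
    r, c = state
    return [a for a, (dr, dc) in ACTIONS.items()
            if 0 <= r + dr < len(GRID) and 0 <= c + dc < len(GRID[0])]

def choose_action(q_table, state, epsilon=0.2):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    actions = extract_possible_actions(state)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

def learn(q_table, episodes=500, alpha=0.1, gamma=0.9):
    """Bellman update coordinating the functions above; fills the q table."""
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            action = choose_action(q_table, state)
            dr, dc = ACTIONS[action]
            nxt = (state[0] + dr, state[1] + dc)
            reward, done = observe_reward_value(nxt)
            best_next = max((q_table.get((nxt, a), 0.0)
                             for a in extract_possible_actions(nxt)), default=0.0)
            q = q_table.get((state, action), 0.0)
            q_table[(state, action)] = q + alpha * (reward + gamma * best_next - q)
            state = nxt
    return q_table

def infer_path(q_table, start=(0, 0), max_steps=10):
    """Greedy rollout of the learned policy from a given initial state."""
    path, state = [start], start
    for _ in range(max_steps):
        action = choose_action(q_table, state, epsilon=0.0)
        dr, dc = ACTIONS[action]
        state = (state[0] + dr, state[1] + dc)
        path.append(state)
        if observe_reward_value(state)[1]:
            break
    return path

random.seed(0)
q = learn({})
print(infer_path(q))
```

Swapping `choose_action` for a Boltzmann (softmax) strategy, as the page suggests, would only require changing that one function.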
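A possible reading of `q_value_ascii_action` is sketched below, rendering each state's best action as a Unicode arrow. The arrow characters, the `terminals` marker, and the toy q table are assumptions for illustration; the actual mapping in the repository may differ (and the Seaborn/Matplotlib heatmap variant is not reproduced here).

```python
# Unicode arrows assumed here for the four movement directions.
ACTIONS = {"up": "↑", "down": "↓", "left": "←", "right": "→"}

def q_value_ascii_action(q_table, rows, cols, terminals=()):
    """Render, per state, the Unicode arrow of its highest-quality action."""
    lines = []
    for r in range(rows):
        row = ""
        for c in range(cols):
            if (r, c) in terminals:
                row += "·"  # terminal states carry no action
                continue
            best = max(ACTIONS, key=lambda a: q_table.get(((r, c), a), 0.0))
            row += ACTIONS[best]
        lines.append(row)
    return "\n".join(lines)

# Toy q table of {(state, action): value} pairs over a 2x2 grid.
toy_q = {((0, 0), "right"): 0.5, ((0, 1), "down"): 0.7, ((1, 1), "down"): 0.9}
print(q_value_ascii_action(toy_q, 2, 2, terminals={(1, 0)}))
# prints:
# →↓
# ·↓
```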