Implementation analysis
Guzman GP edited this page Feb 22, 2019 · 5 revisions
These explanations describe the functions in the diagram's pseudocode. Starting with the main ones:
- observe reward value: receives the grid symbol stepped on and returns its respective reward. Its boolean return value determines whether the current episode ends.
- extract possible actions: the actions available along the grid are defined here, in this case as spatial movement directions. If necessary, they can be redesigned for other applications.
- choose action: applies the ε-greedy policy when selecting the action (exploration vs. exploitation). It could be swapped for another strategy, for example Boltzmann exploration.
- learn: the Bellman equation update. This function coordinates most of the previously described ones and fills the q table attribute.
- infer path: deduces the optimal policy from the q table, given an initial state and a maximum number of steps.

Next, the secondary functions used for sensitivity analysis and visualization purposes:
- visualize inferred path: string-logged representation of the states resulting from the optimal policy.
- visualize max quality action: Seaborn and Matplotlib coloured representation of the maximum quality value in each state.
- q_value_ascii_action: translates the resulting quality values into Unicode characters.

Another plotted sensitivity-convergence function can be found in the Jupyter notebook practical example.
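The main functions above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the grid layout, reward values, and hyperparameters (`alpha`, `gamma`, `epsilon`) are assumptions made here for the sake of a runnable example.

```python
import random

# Hypothetical grid: 'S' start, 'G' goal (+1), 'H' hole (-1), '.' free (0).
GRID = ["S..",
        ".H.",
        "..G"]
REWARDS = {"G": 1.0, "H": -1.0, ".": 0.0, "S": 0.0}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def observe_reward_value(state):
    """Reward for the symbol stepped on; the boolean flags episode end."""
    symbol = GRID[state[0]][state[1]]
    return REWARDS[symbol], symbol in ("G", "H")

def extract_possible_actions(state):
    """Spatial movement directions that stay inside the grid."""
    r, c = state
    return [a for a, (dr, dc) in ACTIONS.items()
            if 0 <= r + dr < len(GRID) and 0 <= c + dc < len(GRID[0])]

def choose_action(q_table, state, epsilon=0.2):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    actions = extract_possible_actions(state)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

def learn(q_table, episodes=500, alpha=0.1, gamma=0.9):
    """Bellman update coordinating the functions above; fills the q table."""
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            action = choose_action(q_table, state)
            dr, dc = ACTIONS[action]
            nxt = (state[0] + dr, state[1] + dc)
            reward, done = observe_reward_value(nxt)
            best_next = max((q_table.get((nxt, a), 0.0)
                             for a in extract_possible_actions(nxt)), default=0.0)
            q = q_table.get((state, action), 0.0)
            q_table[(state, action)] = q + alpha * (reward + gamma * best_next - q)
            state = nxt
    return q_table

def infer_path(q_table, start=(0, 0), max_steps=10):
    """Greedy rollout of the learned policy from a given initial state."""
    path, state = [start], start
    for _ in range(max_steps):
        action = choose_action(q_table, state, epsilon=0.0)
        dr, dc = ACTIONS[action]
        state = (state[0] + dr, state[1] + dc)
        path.append(state)
        if observe_reward_value(state)[1]:
            break
    return path

random.seed(0)
q = learn({})
print(infer_path(q))
```

Swapping `choose_action` for a Boltzmann (softmax) strategy, as the page suggests, would only require changing that one function.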
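A possible reading of `q_value_ascii_action` is sketched below, rendering each state's best action as a Unicode arrow. The arrow characters, the `terminals` marker, and the toy q table are assumptions for illustration; the actual mapping in the repository may differ (and the Seaborn/Matplotlib heatmap variant is not reproduced here).

```python
# Unicode arrows assumed here for the four movement directions.
ACTIONS = {"up": "↑", "down": "↓", "left": "←", "right": "→"}

def q_value_ascii_action(q_table, rows, cols, terminals=()):
    """Render, per state, the Unicode arrow of its highest-quality action."""
    lines = []
    for r in range(rows):
        row = ""
        for c in range(cols):
            if (r, c) in terminals:
                row += "·"  # terminal states carry no action
                continue
            best = max(ACTIONS, key=lambda a: q_table.get(((r, c), a), 0.0))
            row += ACTIONS[best]
        lines.append(row)
    return "\n".join(lines)

# Toy q table of {(state, action): value} pairs over a 2x2 grid.
toy_q = {((0, 0), "right"): 0.5, ((0, 1), "down"): 0.7, ((1, 1), "down"): 0.9}
print(q_value_ascii_action(toy_q, 2, 2, terminals={(1, 0)}))
# prints:
# →↓
# ·↓
```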