Hi, I’m Ravi Teja.
During my undergraduate studies I worked on a couple of course projects in Reinforcement Learning. In one of them I had to train an agent using SMDP-Q learning to navigate to a goal state in an environment consisting of four rooms connected by four hallways.
In the implementation phase I initially had difficulty modelling the environment and integrating the hallway states into the environment model. After trying different approaches, I found that encoding each position in the four-room grid world as an integer and memoizing the hallway states worked well.
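The encoding described above could be sketched roughly as follows. The grid width, the hallway coordinates and all function names here are illustrative assumptions, not the original implementation.

```python
from functools import lru_cache

# Assumed layout: cells indexed row by row on an 11-column grid.
N_COLS = 11
# Hypothetical (row, col) positions of the four hallway cells.
HALLWAYS = ((2, 5), (5, 2), (5, 8), (8, 5))


def encode(row, col):
    """Map a (row, col) position to a single integer state id."""
    return row * N_COLS + col


def decode(state):
    """Recover the (row, col) position from an integer state id."""
    return divmod(state, N_COLS)


@lru_cache(maxsize=None)
def hallway_states():
    """Compute the set of hallway state ids once and reuse it (memoization)."""
    return frozenset(encode(r, c) for r, c in HALLWAYS)


def is_hallway(state):
    """Check whether an integer state id is one of the hallway cells."""
    return state in hallway_states()
```

With integer state ids, a Q-table can be a plain array or dictionary indexed by state, and the hallway check is a single set lookup.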
The next challenge was modelling the ‘options’ for the SMDP-Q learning algorithm. For each option I had to decide on the initial states, the policy and the termination condition. I had the following alternatives:
– provide a deterministic policy for each option that directs the agent along the shortest path to one of the hallway states (the terminal state) connected to the room (approach 1)
– or learn this policy (from approach 1) separately using Q learning, with the hallway states as terminal states (approach 2)
– or let the agent discover the options (including their initial and terminal states) itself using an ‘options learning algorithm’ (approach 3).
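Whichever alternative is chosen, an option comes down to the three pieces mentioned above: a set of initial states, a policy and a termination condition. A minimal sketch of such a structure is shown here; the class name, field names and the toy five-state corridor are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Option:
    name: str
    initiation_set: FrozenSet[int]    # states where the option may be started
    policy: Callable[[int], int]      # maps a state id to a primitive action
    terminal_states: FrozenSet[int]   # states where the option stops

    def can_start(self, state: int) -> bool:
        return state in self.initiation_set

    def terminates(self, state: int) -> bool:
        return state in self.terminal_states


# Toy example: a corridor with states 0..4 where state 4 plays the hallway.
RIGHT = 1  # hypothetical action id
to_hallway = Option(
    name="to_hallway",
    initiation_set=frozenset({0, 1, 2, 3}),
    policy=lambda state: RIGHT,       # fixed policy, as in approach 1
    terminal_states=frozenset({4}),
)
```

In approach 1 the `policy` field would be hand-written, while in approach 2 it would be read off a learnt Q-table.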
I came across a research paper on approach 3 that describes the ‘options learning algorithm’ in detail. The option learning process lets the agent explore the environment beforehand by creating random tasks (navigate from a start state ‘S’ to a goal state ‘G’). During these random tasks the agent records how frequently each state occurs. The intuition is that a state that occurs frequently across these tasks is likely to be an important, or bottleneck, state. One would expect the algorithm to discover the hallway states as bottleneck states. However, the target states found by the algorithm were close to the hallway states but not exactly the hallway states. This can be explained by the fact that states near a hallway are visited both by trajectories that stay within a room and by trajectories that go from one room to another, which raises their frequency. For this reason I decided not to go with this approach.
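The frequency-counting idea can be illustrated with a small sketch. It is not the paper's algorithm: the function names are hypothetical, and counting each state at most once per trajectory is an assumption made here to keep the example simple.

```python
from collections import Counter


def visit_counts(trajectories):
    """Count in how many trajectories each state appears."""
    counts = Counter()
    for path in trajectories:
        counts.update(set(path))  # assumption: one count per trajectory
    return counts


def candidate_bottlenecks(trajectories, top_k=1):
    """Return the top_k most frequently visited states (ties broken by id)."""
    counts = visit_counts(trajectories)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [state for state, _ in ranked[:top_k]]


# Toy trajectories over integer state ids, made up for illustration.
paths = [[0, 1, 2, 3], [5, 2, 3, 4], [6, 2, 7]]
top_two = candidate_bottlenecks(paths, top_k=2)
```

In this toy data, state 2 appears in all three paths and state 3 in two of them, so they rank first and second. The same ranking on real four-room trajectories is where cells next to a hallway can outrank the hallway itself.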
Regarding approach 2, I noticed that the policy learnt by the agent converges to the optimal policy used in approach 1 when training runs for long enough. I therefore implemented approach 2 so that the agent learns the internal policy of each option. Once the internal policies have been learnt, SMDP-Q learning uses these learnt options to navigate the agent to the goal state in the four-room grid world environment.
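For reference, the standard SMDP-Q learning update for an option o that runs for k steps from state s and ends in s' is Q(s, o) ← Q(s, o) + α [r + γ^k max over o' of Q(s', o') − Q(s, o)], where r is the discounted reward accumulated while the option runs. The sketch below shows one such update on a toy corridor. The helper names, the environment and the hyperparameters are assumptions for illustration, not the original project code.

```python
from collections import defaultdict


def run_option(step, policy, is_terminal, state, gamma, max_steps=1000):
    """Follow an option's policy until it terminates.

    Returns (final_state, discounted_return, number_of_steps).
    """
    total, discount, k = 0.0, 1.0, 0
    while k < max_steps:
        state, reward = step(state, policy(state))
        total += discount * reward
        discount *= gamma
        k += 1
        if is_terminal(state):
            break
    return state, total, k


def smdp_q_update(Q, state, option, ret, next_state, k, options, alpha, gamma):
    """Apply one SMDP-Q update for an option that lasted k steps."""
    best_next = max(Q[(next_state, o)] for o in options)
    target = ret + (gamma ** k) * best_next
    Q[(state, option)] += alpha * (target - Q[(state, option)])


# Toy corridor 0..4: every action moves right; reaching state 4 gives reward 1.
def step(state, action):
    next_state = min(state + 1, 4)
    return next_state, (1.0 if next_state == 4 else 0.0)


Q = defaultdict(float)  # unseen (state, option) pairs default to 0.0
options = ["to_hallway"]
end_state, ret, k = run_option(step, lambda s: 1, lambda s: s == 4, 0, gamma=0.9)
smdp_q_update(Q, 0, "to_hallway", ret, end_state, k, options, alpha=0.5, gamma=0.9)
```

The only difference from one-step Q-learning is that the bootstrap term is discounted by γ^k rather than γ, because the option occupies k time steps.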
This project increased my curiosity about ‘Hierarchical Reinforcement Learning’. There are many other sophisticated option learning techniques, such as intra-option learning and early termination, which I would like to explore. I am interested in working on ML problems that take inspiration from psychology and neuroscience to model primitive human behaviour.
Other than this, I have also implemented a solution to the ‘Cartpole Problem’ using a DQN with experience replay. I also took part in a contest in the Machine Learning course to predict the movie ratings given by a user based on their past behaviour and movie feature vectors. This led to my interest in ‘Deep Reinforcement Learning’ and ‘Statistical Machine Learning’. I’m particularly inspired by the practical impact that Machine Learning research can have.
I recently came across the book ‘Gödel, Escher, Bach’ by Douglas Hofstadter, in which he discusses, through multiple short stories, illustrations and analysis, how inanimate objects come together to form an animate object. This led to my interest in learning about ‘Swarm Intelligence’ and ‘Cohort Learning’ in the context of Machine Learning.
This research programme would help me secure an MS / PhD position at a university where I could continue my research and add to our understanding of intelligence.
Applied on: September 28, 2021