Recent successes in artificial intelligence, such as the victory of AlphaGo and the agent that learned to play Atari games, have brought Deep Reinforcement Learning into the public conversation. But what is it, and how can we actually use it for more than playing video games? (Not that there’s anything wrong with having a bit of fun, of course.)

The reinforcement learning problem goes as follows:

  • An agent receives a state from the environment.
  • The agent takes an action based on this state. The state is a domain-specific, loosely defined abstraction assumed to contain all the information the agent needs to choose this action.
  • The action is taken in order to maximize a total or long-term reward (not dependent only on the current state).
  • The environment responds with a reward, a new state and some metadata (for example, if the agent died).
  • The state is updated and the loop repeats.
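
This loop is easy to express in code. Here is a minimal sketch in Python, assuming the Gymnasium library and its CartPole environment (any environment with the same reset/step interface would do); the “agent” here simply acts at random:

```python
import gymnasium as gym  # assumption: the Gymnasium package is installed

env = gym.make("CartPole-v1")
state, info = env.reset()            # the environment hands the agent an initial state
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()                              # the agent picks an action
    state, reward, terminated, truncated, info = env.step(action)   # reward, new state, metadata
    total_reward += reward
    done = terminated or truncated   # e.g. the agent died, or the episode was cut off
print(f"Episode finished with total reward {total_reward}")
```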

This formalism is often abstracted as a Markov Decision Process (MDP). The theory behind Markov Decision Processes is well established, and proving that a reinforcement learning algorithm converges (meaning that it successfully finds the right policy) reduces to proving the same thing on the underlying MDP. The main difference is that MDPs assume the agent has knowledge of the transition function, whereas in Reinforcement Learning we only observe what the new state is, but have no idea of the underlying mechanism generating that state. So algorithms that work for an MDP, for instance dynamic programming (planning today’s optimal action conditionally on the current state, the likely distribution of states tomorrow, and their value), do not make sense in Reinforcement Learning, because we are missing that information.
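
To make the contrast concrete, here is a sketch of dynamic programming (value iteration) on a toy MDP whose transition probabilities are fully known; the two-state MDP itself is made up for illustration. This is exactly the kind of planning that becomes unavailable when the transition function is not observed:

```python
import numpy as np

# Toy MDP, invented for illustration: 2 states, 2 actions, known transition probabilities.
# P[s, a, s'] = probability of landing in s' after taking action a in state s.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
# R[s, a] = expected immediate reward of taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

# Value iteration: repeatedly back up state values using the known model.
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    V_new = Q.max(axis=1)          # act greedily with respect to the backed-up values
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)          # optimal action in each state, given the model
print("Optimal values:", V, "Optimal policy:", policy)
```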

So what we do in Reinforcement Learning is to estimate from experience the so-called Q-function. For a state s and action a, Q(s,a) denotes the “quality” (expected reward) of taking action a in state s. Depending on the nature of the problem we are facing, whether episodic or non-episodic, different algorithms are used. Episodic here means that there is a certain final state from which there is no further interaction: say, a video game in which either your character (the agent) dies, or a final winning state is reached. Other problems are non-episodic; for instance, learning to read is a non-episodic task. This forces us to think of different ways of updating our knowledge of the Q-function: in the episodic case, we can wait until the end of the episode and take the value of the Q-function as the average reward of playing action a in state s; in the non-episodic case, we update it online, typically with the SARSA algorithm or Q-Learning. Once the Q-function is computed, the policy derived from it is simply choosing, at each state, the action a that maximizes Q(s,a).
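
A tabular Q-Learning sketch might look as follows, again assuming Gymnasium and using its FrozenLake environment because it has small, discrete state and action spaces; the hyperparameter values are illustrative:

```python
import numpy as np
import gymnasium as gym  # assumption: Gymnasium is installed

env = gym.make("FrozenLake-v1")                 # small discrete state/action spaces
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))             # the Q-function as a lookup table

alpha, gamma, epsilon = 0.1, 0.99, 0.1          # illustrative hyperparameters

for episode in range(5000):
    state, info = env.reset()
    done = False
    while not done:
        # epsilon-greedy: mostly exploit the current Q estimates, sometimes explore
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        # Q-Learning update: bootstrap from the best action in the next state
        target = reward + gamma * (0.0 if terminated else np.max(Q[next_state]))
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

# The derived policy: in each state, take the action that maximizes Q(s, a).
policy = Q.argmax(axis=1)
```

SARSA differs only in the update target: instead of the maximum over next actions, it uses the Q-value of the action actually taken next (it is on-policy, while Q-Learning is off-policy).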

Where does deep reinforcement learning come into play? When the state space is small, the methods above work reasonably well: we just update a lookup table “somehow” (depending on the nature of the problem) and we are done. When the state space is large, this won’t do anymore, because we may not be able to store all this information, nor to access it efficiently. One way out of this problem is to parameterize the Q-function by the weights W of a neural network, so that we now have Q(s,a,W), and update those weights incrementally: the workflow follows the reinforcement learning loop above, but now the agent periodically “goes to sleep” and uses a random sample of its previously observed history to update the weights of the network through backpropagation.
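
A heavily simplified sketch of this idea, assuming PyTorch for the network and the gradient step; the architecture, dimensions and hyperparameters are placeholders, and refinements such as target networks are omitted:

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

# Q(s, a, W): a small network mapping a state to one Q-value per action.
# state_dim and n_actions are placeholders for the problem at hand.
state_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# Previously observed history: transitions (state, action, reward, next_state, done).
replay_buffer = deque(maxlen=100_000)

def learn_from_replay(batch_size=32):
    """The 'go to sleep' step: sample random past transitions and backpropagate."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)       # the crucial random sample
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        targets = rewards + gamma * q_net(next_states).max(dim=1).values * (1 - dones)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()                                         # update W by backpropagation
    optimizer.step()
```

During interaction, each transition is appended to `replay_buffer` and `learn_from_replay` is called every few steps.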

The “random sample” bit is crucial and one of the main tricks of the Nature paper where this research first appeared. It helps break correlations between data points, avoiding the situation where you only update your network weights from a limited set of experiences. This speeds up training considerably.

Applications and use cases

How can we use Deep Reinforcement Learning? One application is UI optimization: here, the UI designer is the agent; the state is the context in which the user of an application finds themselves (timestamp, location, user/behavioral data); the action is the layout we present to the user; and the reward is whether a conversion takes place or a prescribed engagement goal is reached.

Another example is classifier selection: this helps us assess the performance of a collection of recommendation systems in the wild, as opposed to relying on “synthetic” offline metrics: our action is the recommendation system from which we issue a recommendation to a website visitor, and we define the reward in a similar way as above. After some time of experimentation, the best recommendation system is chosen, based directly on user feedback.
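
One simple way to run such an experiment is as a multi-armed bandit with an epsilon-greedy strategy over the candidate recommenders; this is a sketch of that idea, not the specific setup described above, and the recommender names and reward signal are placeholders:

```python
import random

recommenders = ["collaborative_filtering", "content_based", "popularity"]  # placeholder names
counts = {r: 0 for r in recommenders}      # how often each recommender has been tried
values = {r: 0.0 for r in recommenders}    # running average reward (e.g. conversion rate)
epsilon = 0.1

def choose_recommender():
    """Pick which recommender serves the next website visitor."""
    if random.random() < epsilon:
        return random.choice(recommenders)             # explore a random system
    return max(recommenders, key=lambda r: values[r])  # exploit the best one so far

def record_feedback(recommender, reward):
    """Update the running average reward with the observed user feedback (e.g. 0 or 1)."""
    counts[recommender] += 1
    values[recommender] += (reward - values[recommender]) / counts[recommender]
```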

Finally, an interesting achievement is Google’s use of Deep Reinforcement Learning to cut the cooling bill of their own data centers by 40%. Although there are certainly other methods to achieve this (control engineers know this very well), it provides a practical use case for the technology.