When I first started learning reinforcement learning - as an intern conducting research at Brookhaven National Laboratory - there was one concept I just couldn’t grasp: the state. Not only was it strange that we could distill the environment into a vector (sometimes a scalar!), but I kept confusing state with its cousin, the observation. Sometimes I heard the two used interchangeably; other times the scientist would make certain to keep them separate. While many good formal explanations of the differences exist, I hope the following presents a more beginner-friendly and thought-provoking illustration of perhaps the most crucial distinction in reinforcement learning.

Case Study #1: Online Tic-Tac-Toe

I won’t go into the spiel of rigorously defining what a Markov Decision Process (MDP) is, but I will leverage the spirit of an MDP as the basis for the following thought experiment. Imagine you are trying to play a game of ultimate tic-tac-toe with your rival on the internet. You two mark your Xs and Os for a while, but all of a sudden you have to use the restroom! Of course, you deeply wish to best your rival in ultimate tic-tac-toe, but your body unfortunately has other plans. So you rush to the toilet, but along the way you spot your sibling and - preferring someone playing on your behalf to an automatic forfeit from being away from the keyboard - you command them to continue the game against your rival. While you’re on the toilet, your sibling sees the grid, thinks for a bit, and begins playing. You get out of the bathroom as soon as possible and kick your sibling off the keyboard. You see the new grid, sigh at their suboptimal plays, and continue the showdown.

state (as an MDP formalism) described informally

Now ask yourself: how was your sibling able to just play the game upon seeing the grid thus far, and how were you able to pick the game right back up upon seeing the final result of your sibling’s plays? That’s because — and this is the main takeaway of an MDP — what really mattered was the grid’s configuration (a.k.a. its state) just before you made your next move. You don’t care how the board got to the way it looks now; the fact of the matter is that this is the configuration, and now you plan your next action.
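To make this concrete, here is a minimal sketch of what a tic-tac-toe state might look like as a vector, along with a move function that needs nothing but that state. The encoding and function names are my own choices for illustration, not anything standard:

```python
import numpy as np

# One possible encoding (my own choice): 0 = empty, 1 = your X, -1 = your rival's O.
EMPTY, X, O = 0, 1, -1

def initial_state(n: int = 3) -> np.ndarray:
    """An empty n-by-n board, flattened into a length n*n vector."""
    return np.zeros(n * n, dtype=np.int8)

def play(state: np.ndarray, square: int, mark: int) -> np.ndarray:
    """Return the next state after placing `mark` on `square`.

    Notice what is *not* an argument: how the board got here. The current
    configuration alone is enough to plan and apply the next move, which is
    the Markov property in action.
    """
    assert state[square] == EMPTY, "square already taken"
    next_state = state.copy()
    next_state[square] = mark
    return next_state
```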

observation (as an RL formalism) described informally

Simply put, an observation is what you perceive after you take some action in the environment you are in. In our tic-tac-toe example, you input your move into the tic-tac-toe program, it updates the grid, and after some time the grid is updated with your opponent’s response. You, as the player, couldn’t care less about how the program is orchestrating these inputs behind the scenes. What matters to you is that the board displays your moves, and more importantly your rival’s moves. The idea of an observation may feel natural and intuitive, and that’s because it is. Observations are ubiquitous in our world: when it rains outside (as it is while I’m writing this), you have no idea what the molecular composition of the precipitating clouds is — all you perceive is the rain.
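If you squint, the loop you and the program are running looks something like the sketch below. The class and method names are my own invention (loosely echoing the step() convention common in RL libraries), not a real API; the point is simply that all you ever receive back is an observation.

```python
import numpy as np

class OnlineTicTacToe:
    """A toy stand-in for the game program, seen from the player's side."""

    def __init__(self, n: int = 3):
        self._board = np.zeros(n * n, dtype=np.int8)  # internal state, hidden from you

    def step(self, square: int, mark: int = 1) -> np.ndarray:
        """Submit your move; the program does its thing and shows you the result."""
        self._board[square] = mark
        # ... your rival's response is orchestrated behind the scenes ...
        return self._board.copy()  # the observation: whatever ends up on your screen

env = OnlineTicTacToe()
what_you_see = env.step(4)  # play the center square; all you get back is what you see
```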

Edge case: observation \(=\) state (huh?)

Ok, I should address an elephant in the room. In the tic-tac-toe example, you might have noticed that your observation of the grid corresponds exactly to the state of the grid. The configuration of the grid is exactly the locations of the Xs, the Os, and any unmarked spots, and that’s precisely what you see on the computer screen as well! Indeed, it is possible for your observation of the environment to be exactly its internal state, and this is what deceived me for so long. But don’t let it deceive you as well — this is merely an exception, not the rule. In many structured, deterministic settings such as board games, the two do coincide. But we should not expect most environments to adhere to this rare equality, and in fact, the ones worth tackling shatter it.
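Put differently, in this glitch-free game the map from state to observation happens to be the identity, a fact worth writing down precisely because it is so easy to take for granted. A tiny sketch, reusing the toy vector encoding from earlier:

```python
import numpy as np

def observe(state: np.ndarray) -> np.ndarray:
    """The fully observable case: what you see is exactly the internal state."""
    return state.copy()  # observation == state here, the exception rather than the rule
```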

Case Study #2: Glitched Tic-Tac-Toe

I hope the next thought experiment illuminates the distinction between state and observation, and more importantly how much more fun solving the environment becomes when state \(\neq\) observation. Imagine that while in the middle of competing against your rival in the same game of ultimate tic-tac-toe, your sibling messes with the display cables behind your monitor. As a result, the screen starts to glitch in such a way that some of the squares on the grid are covered by static and/or obscuring colors. You want to get your sibling away from the monitor, but the match is heated, and there’s so little time given for each player’s turn! Of course you don’t wish to forfeit, so now you just have to deal with what little you observe, and play to the best of your abilities.

Partial Observability (observation \(\neq\) state)

Life is a series of POMDPs - Dr. Mykel Kochenderfer

By mucking with the display cables, your sibling has thrown you into what’s known as a Partially Observable Markov Decision Process (POMDP). While the name may sound scary, the essence of a POMDP is that your observation can no longer fully capture the state of the environment. The state remains untouched — the board’s internal configuration is the same, the letters are still mapped to the same spots in the code — but the glitches on the screen prevent you from knowing what the board completely looks like at any given moment. This is the partially observable part of the MDP. I hope it’s clear how the observation is no longer equal to the state, and I also invite you to think about how you might try to beat your rival in this glitched game of tic-tac-toe. Perhaps the stochasticity of the glitches compels you to jot down the marked locations while you can still see them (in reinforcement learning agents, this could be a replay buffer). In essence, so many more strategies (more formally, policies) for playing this game can now exist because your observations don’t fully explain the state, and thus the complexity of the game greatly increases. Most importantly, the game becomes more interesting to play.
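Here is one way to picture your sibling’s handiwork. The sentinel value and the random masking scheme below are my own illustration (a real POMDP specifies an observation model more carefully), but they capture the key point: the state is intact, while the observation is a corrupted, incomplete view of it.

```python
import numpy as np

GLITCHED = 9  # sentinel for a square you cannot read through the static

def glitched_observation(state: np.ndarray, glitch_prob: float = 0.3,
                         rng=None) -> np.ndarray:
    """Return what you actually see: each square is hidden with some probability."""
    rng = rng or np.random.default_rng()
    obs = state.copy()
    mask = rng.random(obs.shape) < glitch_prob  # which squares the static covers right now
    obs[mask] = GLITCHED
    return obs  # in general obs != state: this is partial observability
```

Because each call hides a different random subset of squares, two observations of the very same state can disagree with each other, which is exactly why jotting down what you saw (or keeping a replay buffer) starts to pay off.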

Working with observations: introducing agent states

Besides being a great tic-tac-toe player, you’re also a great hacker, and so you think about somehow forcing the tic-tac-toe program to significantly lengthen the duration of a turn. This would allow you to fully capture the grid configuration (assuming the glitches covering a particular square eventually migrate to another spot)! But this would be unlikely (and considered cheating!), and in general, it is incredibly difficult to change the environment’s state to accommodate your limited observability. So, you have to work with what you’ve got, and you consider bringing in agent states. Informally, an agent state is constructed by the agent to best encapsulate the environment based on what it has observed thus far. In the normal tic-tac-toe example you may construct your agent state to mimic that of the environment (agent state \(=\) state), but in general this is not likely. In some ways, agent states offer less than what knowing the environment state could offer you, but in many ways they can offer so much more. You may choose your agent state to include not only the locations of the marked and open spots, but also the glitched spots. You can include the ratio of your Xs to your rival’s Os, which could be utilized in some clever strategy. For those who are interested in a mathematical description, you can think of an agent state as the result of the following:

\[(X_1, X_2, X_3, \dots, X_K) = f(O_1, O_2, \dots, O_N)\]

In other words, \(f\) is some agent-designed transformation on its \(N\) observations that generates a state \(X\) comprised of \(K\) components. Going along with the possible choices above, some of the \(X\)s may correspond to grid locations of glitched squares, for instance. Much work in modern reinforcement learning goes into deriving an optimal \(f\) that the agent can leverage, and some methods figure out the components of the agent state for you (e.g. deep reinforcement learning)! Lastly, note that the choice of \(f\) is crucial for your performance. It is entirely possible to construct a poor \(f\) that yields agent states that actually sabotage your performance, where you’d have been better off working with just the observations!
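As a concrete (and deliberately simple) choice of \(f\), the sketch below folds a history of glitched observations into an agent state that remembers the last clearly seen mark in each square and counts how often each square has been unreadable. This is just one possible \(f\) under the toy encoding used earlier, not the one true agent state.

```python
import numpy as np

GLITCHED = 9  # same sentinel as in the glitched-observation sketch

def agent_state(observations: list[np.ndarray]) -> np.ndarray:
    """One possible f: (O_1, ..., O_N) -> (X_1, ..., X_K), with K = 2 * (number of squares)."""
    n = observations[0].shape[0]
    last_seen = np.zeros(n, dtype=np.int8)       # best guess at each square's mark
    glitch_counts = np.zeros(n, dtype=np.int32)  # how often each square was unreadable
    for obs in observations:
        readable = obs != GLITCHED
        last_seen[readable] = obs[readable]      # update squares we could actually see
        glitch_counts[~readable] += 1            # remember which squares keep glitching
    return np.concatenate([last_seen, glitch_counts])
```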

observations and states in our lives

We love to make order out of chaos. But sometimes there is no chaos in our daily life, and the inherent structure of the environment is enough for us to live by. For instance, a typical 9-to-5 senior employee — as implied by the term “9-to-5” — does not have to do much thinking beyond what is pertinent to their job or personal life. However, a budding teenager has their whole life ahead of them! The environments they inhabit are rapidly changing, and their observations of the world may have no bearing on what is actually true. Their raw observations could easily deceive them, so what might they do? They start to build out a perspective — internal structures similar to agent states — that helps them process their observations and act in a way that accomplishes their goals. This transformation from observations to worldview is what we call our mindset, and of course, mindsets can significantly differ. Your \(f\) may be different from my \(f\), and if we have vastly different goals in life, your agent states could serve you well but completely sabotage my life. What’s important here, though, is that \(f\) can be changed, with some effort of course. So act in your environment, update its state, gather some observations, and build out a wicked \(f\).