Conquering Reality - from a Reinforcement Learning perspective
TLDR - Good beliefs lead to a good life. We ask ourselves: “what can we learn about living life from the shortcomings of a class of reinforcement learning agents?” We propose that living is effectively holding some beliefs fixed (assumptions) and behaving under them. We can behave like reinforcement learning agents and update our assumptions from observations, but our preconceptions tend to get in the way of honest updates. Instead, our internal updates should involve repeatedly asking what-ifs about our assumed limitations, playing the theories out, and adopting them if they bear fruit.
Disclaimer: I understand that I am writing this as someone who is relatively privileged. Many people in this world, including my family, are entangled in responsibilities and circumstances that unfortunately limit them from self-actualizing. This post is therefore written for those who are in an environment that affords them the opportunity to find themselves. This is the case for many college students, and as an undergraduate myself, this article is a product of my observations of life so far and of my journey with reinforcement learning.
Reality is a strange thing. On one hand, we tend to think of it as this ebb-and-flow of causalities parametrized by an infinite number of variables. My decision to buy and eat a hot dog depends on many factors: the presence of hot dog vendors, the price set by them, the safety of their hot dogs, perhaps even the looks and greetings they give us! Each factor is again parametrized, perhaps by the economy, the quality of life in the neighborhood, the vendor’s personal life; we get the point. In short, we can think of reality as decided by others, and it may be too complex for us to ever have a grip on it.
But that is reality at large. Let’s talk about our individual realities, which are what we care about anyway (for better or worse). The relationship between reality and an individual’s reality can be roughly formalized as:
\[\text{Reality} = \bigcap_{p \in \text{People}} \text{Reality}_p\]

To begin our discussion, let’s first consider the quintessential Thompson Sampling reinforcement learning agent, who operates (as many state-of-the-art systems do) by sampling from a maintained posterior distribution over parameters. Upon this realization of parameters, the agent constructs an optimal policy and plays the game of interest under that policy. After each observation it receives while playing, the agent tweaks the distribution of parameters so that it better reflects the internal structure of the game, and by working towards perfect knowledge of the game, our agent hopefully converges to an optimal player.
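For the curious, here is a minimal sketch of that loop on a toy Bernoulli bandit. The arm probabilities, horizon, and Beta prior below are assumptions made purely for illustration, not anything a particular agent prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.5, 0.7])   # hidden "internal structure" of the game
alpha = np.ones(3)                       # Beta posterior: 1 + observed successes per arm
beta = np.ones(3)                        # Beta posterior: 1 + observed failures per arm

for t in range(2000):
    theta = rng.beta(alpha, beta)        # sample a realization of parameters (a "theory")
    arm = int(np.argmax(theta))          # act optimally under that realization
    reward = float(rng.random() < true_probs[arm])   # observe the game's response
    alpha[arm] += reward                 # tweak the posterior toward what was seen
    beta[arm] += 1.0 - reward

print(alpha / (alpha + beta))            # posterior means drift toward true_probs
```

The specifics matter less than the shape of the loop: sample a theory, act under it, and let observations reshape the posterior.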
Already we can draw many parallels between how we act in our individual realities and how our counterparts act in the world of reinforcement learning. In our own realities, we maintain a set of variables of life that we hold fixed, more informally known as our assumptions. These assumptions are crucial to living efficiently. For instance, a person commuting to work in NYC makes assumptions about the line status of the subway system, the fare, the behavior of other passengers on the train, and much more. It is only upon these assumptions that we can live our lives, in other words, operate under some policy. The 1/2/3 to Chambers St is broken? Let’s take the A to World Trade Ctr instead. Assuming other passengers may not wear masks, let me pack one.
So let’s suppose this is the case: every person maintains their unique set of assumptions about the world, they play the game of life under some policy parametrized by these assumptions, and then what? A Thompson Sampling agent would advise us to update our assumptions given the consequences, updates that would ideally discard assumptions not worth making and take on new ones that let us perform better in life. But here is the thing: how do we update? Thompson Sampling agents benefit from the power of Bayes’ rule; does the same hold true for us?
Whereas a reinforcement learning agent takes its parameter updates as truth, our assumptions are often so strongly rooted that even the most compelling observations are no match for our stubbornness. Our internal Bayesian updates become biased toward our preconceptions, and against what we simply do not wish to become reality. This could explain why many of us feel stung and uncomfortable when learning about current events, seeking refuge in modern entertainment: television, video games, and so on. Our typical problem-solving is often constrained to how we think a problem should be solved, with our policy composed of methods taught by our parents, teachers, and others. If a different, more complete solution is presented to us, sometimes we absorb it to better understand the domain of that problem. Sometimes, however, we latch on to our ideology.
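As a toy illustration (the Beta-Bernoulli model and the numbers here are my own assumptions, not anything from the agents above), the same handful of discouraging observations barely moves a posterior that starts out entrenched, while an open-minded prior absorbs them readily:

```python
import numpy as np

observations = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])   # mostly failures

def posterior_mean(prior_successes, prior_failures, data):
    # Conjugate Beta-Bernoulli update: add observed successes and failures to the prior counts.
    return (prior_successes + data.sum()) / (prior_successes + prior_failures + len(data))

print(posterior_mean(1, 1, observations))     # open mind: belief drops to about 0.17
print(posterior_mean(500, 5, observations))   # entrenched belief: stays near 0.97
```

Mathematically the entrenched prior is still applying Bayes’ rule; it just needs far more evidence to budge, which is roughly how our preconceptions behave.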
It’s safe to say we are a lot like reinforcement learning agents; after all, they are inspired by us! But in many ways we are better than them. Why? Because while they have to be obedient to their design, we don’t. Indeed, human achievement across the ages arises from the art of pivoting. For emperors, pivoting could mean restructuring the empire upon hearing bad news; for entrepreneurs, pivoting could mean switching markets upon discovering a bad product-market fit for the current audience; for the majority of us, pivoting could just mean adopting a growth mindset and reconsidering the assumptions we constrain our growth with.
State-of-the-art information-seeking Thompson Sampling agents operate under a particular realization of parameters (i.e. a theory) in an \(\epsilon\)-greedy fashion. What this means is that most of the time the agent operates under some generated policy; however, at some random instance, the agent samples a new set of beliefs from the continually updated posterior, effectively producing a new theory to work with. We don’t have to work with a theory for so long or be subject to such randomness. We can leverage our brain’s power as an answering machine to ask ourselves: what am I assuming should not be the case? Some assume work and fun should not be together in a sentence; some assume internships are needed to have a productive summer. Some assume pineapple and pizza do not go well together!
What am I assuming should not be the case?
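Before turning that question on ourselves, here is a hedged sketch of the re-sampling pattern from two paragraphs back; the epsilon value and the Beta posterior are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
epsilon = 0.05   # how often we entertain a brand-new theory

def next_theory(theta, alpha, beta):
    """Mostly keep acting under the current theory; occasionally sample a new one."""
    if rng.random() < epsilon:
        return rng.beta(alpha, beta)   # pivot: draw fresh beliefs from the posterior
    return theta                       # persist: keep playing under the theory we hold

theta = rng.beta(np.ones(3), np.ones(3))                                 # initial theory (uniform prior)
theta = next_theory(theta, np.array([4.0, 2.0, 7.0]), np.array([3.0, 6.0, 1.0]))
```

The what-if question plays the role of that epsilon, except that we get to choose both the moment and the theory we try on.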
Asking the “what-if” that would challenge these assumptions is only the first step. What comes next, akin to how an agent experiments with a theory, is to take the initiative and play out the what-if in the world. It could be that under this theory you accumulate negative rewards, but it could also be that this theory strikes gold. For most theories, the potentially harmful repercussions are exaggerated, in part due to our stubbornness and our tendency to remain in the comfort zone. If the theory does cause harm, pivot.
This idea sounds like we are “hacking reality,” and in some ways challenging our assumptions so frequently may be unnatural, but now I hope we feel somewhat free. We can take any shortcoming we have and, at any moment, try to live in a what-if of our choosing. One indicator of a good reinforcement learning agent is how well it resolves its epistemic uncertainty (knowing what it does not know) about its environment. There’s much active research into how to get agents to figure out which parameters to prioritize learning (recalling our subway commute example, the line status could bear the most information on which trains to take), and generally, algorithms embodying this information-seeking behavior come at a cost in compute time. We’re able to just leverage our wonderful minds and ask them the what-ifs that could help bring forth the realities seen in the shows and fantasies we indulge ourselves in.
So, what’s the takeaway? Reinforcement learning, among other qualities, can serve as a philosophy challenging the relationship we have with the assumptions we fix in our realities. We are effectively only as good a player in the game of life as the beliefs we hold. To realize an enriching reality, we must frequently reconsider our beliefs against the wishes of our innate stubbornness. We begin by asking the what-ifs about our daily habits, then move towards our aspirations and goals. We should not have to settle for less; learn the parameters optimal to your life as fast as possible, and exploit.