Image: Jeff Bezos was a master at regret minimization

“Of all the words of mice and men, the saddest are, ‘It might have been.’”

In computer science, the terms exploration and exploitation mean something quite different from their everyday senses. Exploration is gathering information, and exploitation is using the information you have to get a known good result. Many of life’s best moments are exploitation. A family gathering together on the holidays is exploitation. A music journalist, on the other hand, has to constantly listen to new music; this is exploration.

The tension between exploration and exploitation takes its most concrete form in a scenario called the “multi-armed bandit problem”. The odd name comes from the colloquial term for a casino slot machine, the “one-armed bandit”. Imagine that you walk into a casino full of different slot machines, each with its own odds of a payoff. Naturally, you want to maximise your winnings, so you need to work out which machine is the most lucrative. That means balancing two activities: pulling arms on different machines to test them (exploring) and favouring the most lucrative machines found so far (exploiting).

Imagine being faced with only two machines. One you have played 15 times; 9 times it paid out, and 6 times it didn’t. The other you’ve played twice; it paid out once and once it did not. Dividing a machine’s wins by its total number of pulls gives an estimate of its ‘expected value’. The first machine comes out ahead at 60% (9/15), whereas the second sits at 50% (1/2). Which one is the better machine? Because we have only tried the second machine twice, we don’t really know how good it might be.
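To make the arithmetic concrete, here is a minimal Python sketch of the two machines described above. The dictionary layout and the function name payout_rate are illustrative choices, not anything from the original article.

```python
# Observed record for each machine, taken from the example above.
machines = {
    "first": {"wins": 9, "pulls": 15},
    "second": {"wins": 1, "pulls": 2},
}


def payout_rate(wins: int, pulls: int) -> float:
    """Fraction of pulls that have paid out so far (wins divided by pulls)."""
    return wins / pulls


rates = {name: payout_rate(m["wins"], m["pulls"]) for name, m in machines.items()}

for name, rate in rates.items():
    # The pull count is printed alongside the rate because it shows
    # how much evidence stands behind each estimate.
    print(f"{name}: {rate:.0%} from {machines[name]['pulls']} pulls")
```

The rate alone hides how little we know about the second machine, which is why the number of pulls is shown next to it.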

Choosing a cafe/restaurant, or who to date, is really the same as deciding which arm to pull in life’s casino. Understanding the explore/exploit tradeoff provides insights into how our goals should change as we age and why the most rational course of action isn’t always choosing the best.

When we choose what to eat, who to spend time with, or what city to live in, regret looms large: presented with a set of good options, it is easy to torture ourselves with the consequences of making the wrong choice. These regrets are often about the things we failed to do, the options we never tried. Regret can also be motivating. Before starting Amazon, Bezos had a well-paid job at D. E. Shaw & Co., which he had to give up to start an online bookstore.

Bezos says:

“The framework I found, which made the decision incredibly easy, was what I called — which only a nerd would call — a “regret minimization framework.” So I wanted to project myself forward to age 80 and say, “Okay, now I’m looking back on my life. I want to have minimized the number of regrets I have.” I knew that when I was 80 I was not going to regret having tried this. I was not going to regret trying to participate in this thing called the Internet that I thought was going to be a really big deal. I knew that if I failed I wouldn’t regret that, but I knew the one thing I might regret is not ever having tried. I knew that that would haunt me every day, and so, when I thought about it that way it was an incredibly easy decision.”

We can’t live a life without regret, but we can live a life with minimal regret. In a multi-armed bandit, regret can be assigned a number: it’s the difference between the total payoff obtained by following a particular strategy and the total payoff that theoretically could have been obtained by just pulling the best arm every single time (had we known from the start which one it was).
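As a rough illustration of that definition, the sketch below computes regret in expectation, using each arm's true payout rate rather than the actual coins that dropped. The payout rates and the sequence of choices are made-up values for the example, and the function name regret is hypothetical.

```python
def regret(payout_rates: list[float], arms_pulled: list[int]) -> float:
    """Expected shortfall compared with pulling the best arm on every turn."""
    best = max(payout_rates)
    return sum(best - payout_rates[arm] for arm in arms_pulled)


# Hypothetical true payout rates, which the player would not know in advance.
true_rates = [0.6, 0.5]

# Hypothetical play: four pulls on the worse arm (index 1),
# then six pulls on the better arm (index 0).
choices = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

print(round(regret(true_rates, choices), 2))
```

Each pull on the worse arm costs 0.1 in expectation, so only the four early pulls add to the total. Pulls on the best arm add nothing.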

There are several key points about regret proven by Herbert Robbins and Tze Leung Lai. First, assuming you’re not omniscient, your total amount of regret will probably never stop increasing, even if you pick the best possible strategy, because even the best strategy isn’t perfect every time. Second, regret will increase at a slower rate if you pick the best strategy than if you pick others; what’s more, with a good strategy regret’s rate of growth will go down over time, as you learn more about the problem and are able to make better choices. Third, and most specifically, the minimum possible regret (again assuming non-omniscience) is regret that increases at a logarithmic rate with every pull of the handle.

Logarithmically increasing regret means that we’ll make as many mistakes in our first ten pulls as in the following ninety, and as many in our first year as in the rest of the decade combined. Realistically, we can’t expect to never have regrets. But if we follow a regret-minimising algorithm, we can expect to have fewer new regrets each year than we did the year before.
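The “first ten versus the following ninety” claim can be checked with a few lines of Python. This sketch assumes, purely for illustration, that accumulated regret is exactly proportional to the base-10 logarithm of the number of pulls; real bounds carry constants that are ignored here.

```python
import math


def regret_growth(first_pull: int, last_pull: int) -> float:
    """Growth in log10(pulls) between two points, a stand-in for regret added."""
    return math.log10(last_pull) - math.log10(first_pull)


early = regret_growth(1, 10)    # pulls 1 to 10
later = regret_growth(10, 100)  # pulls 10 to 100

# Under the logarithmic assumption the two spans add the same amount,
# even though the second span contains nine times as many pulls.
print(early, later)
```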

So are there any algorithms that minimise regret? The most popular are known as Upper Confidence Bound algorithms.

A confidence interval expresses the uncertainty in a measurement: it gives the range of plausible values that the quantity being measured could actually have. In a multi-armed bandit problem, an Upper Confidence Bound (UCB) algorithm says to pick the option for which the top of the confidence interval is the highest.

The Upper Confidence Bound algorithm assigns a single number to each arm of the multi-armed bandit. That number is the highest value that the arm could reasonably have, based on the information so far. A UCB algorithm doesn’t care which arm has performed best so far; it chooses the arm that could reasonably perform best in the future. The UCB is always greater than the expected value, but by less and less as we gain more experience with a particular option. For example, a restaurant with one mediocre review still retains a potential for greatness that is absent in a restaurant with hundreds of such reviews.
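The article does not say which formula to use for the bound. One common choice is the UCB1 rule, which adds an exploration bonus of sqrt(2 ln(t) / n) to the observed payout rate, where t is the total number of pulls so far and n is the number of pulls on that arm. The sketch below applies it to the two slot machines from earlier; treat the formula as an assumption made for this example rather than the author’s own method.

```python
import math


def ucb1(wins: int, pulls: int, total_pulls: int) -> float:
    """Observed payout rate plus a bonus that shrinks as the arm is pulled more."""
    observed = wins / pulls
    bonus = math.sqrt(2 * math.log(total_pulls) / pulls)  # assumed UCB1 bonus
    return observed + bonus


total = 15 + 2  # pulls across both machines so far

first = ucb1(wins=9, pulls=15, total_pulls=total)
second = ucb1(wins=1, pulls=2, total_pulls=total)

print(f"first machine:  {first:.2f}")
print(f"second machine: {second:.2f}")
```

The scores come out at roughly 1.21 and 2.18, so the barely tested second machine ranks higher despite its lower observed rate. The scores are only meant for ranking the arms, which is why they can exceed 1.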

UCB algorithms implement a principle dubbed “optimism in the face of uncertainty”. By focusing on the best an option could be, they give a boost to possibilities we know less about. They inject a bit of exploration, leaping to new people and new things because any one of them could be the next big thing.
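To see how this optimism plays out over many pulls, here is a small simulation sketch. It assumes the UCB1 bonus sqrt(2 ln(t) / n), which is one common variant and not something the article specifies. The payout rates, round count, seed and the function name run_ucb1 are all made up for illustration.

```python
import math
import random


def run_ucb1(true_rates: list[float], rounds: int, seed: int = 0) -> list[int]:
    """Play a simulated bandit with UCB1; return how often each arm was pulled."""
    rng = random.Random(seed)
    arms = range(len(true_rates))
    wins = [0] * len(true_rates)
    pulls = [0] * len(true_rates)

    for t in range(1, rounds + 1):
        if t <= len(true_rates):
            arm = t - 1  # try every arm once so each has an observed rate
        else:
            # Optimism: pick the arm whose upper bound is highest.
            arm = max(
                arms,
                key=lambda a: wins[a] / pulls[a]
                + math.sqrt(2 * math.log(t) / pulls[a]),
            )
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:  # simulated payout
            wins[arm] += 1
    return pulls


# Hypothetical machines: one pays out 90% of the time, the other 30%.
pulls = run_ucb1([0.9, 0.3], rounds=2000)
print(pulls)
```

The weaker arm keeps getting the occasional pull as its bonus slowly grows back, but most of the rounds should end up on the stronger arm.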

In the long run, optimism is the best prevention for regret.

Thanks to Christopher Lam for allowing his article to be posted here. His original article can be found on Medium.