On Estimation & Confidence

One of the best interview questions I’ve ever had was also one of the strangest, and it has taken me many years to appreciate the question. The question was:

“Our company makes those little bottles of shampoo you see in hotels. How many bottles of shampoo should we produce next year?”

Now, if you’re anything like me you might be thinking “I have no idea, and I don’t care - this is a stupid question”, and that’s totally fair, but there are some important subtle things at play that we will get into. For now, take a few minutes and just think about the question and what your answer would be.

You might have some follow-ups:

  • How many hotels do we service?
  • How many bottles did we produce last year?
  • How many hotels even exist?
  • How many people even go to hotels?
  • How many bottles went unsold last year?
  • etc, etc

These are all well and good and you should be asking these kinds of questions, but at some point you need to just answer: “I think we should produce X bottles”.

Let’s dig into a possible specific reasoning. You might come up with a basic model:

(num bottles) = (num countries) * (num hotels/country) * (num rooms/hotel) * (num bottles/room) * (num days room in use/room) * ... etc

That’s what I did at least, and it worked okay-ish. I spoke with some of my more clever colleagues who also were in the interview pool, and they told me a different model they’d used which I found eye-opening:

(num bottles) = (num people in world) * (num vacations/person*year) * (num days/vacation) * (num bottles/day) * ... etc

That seemed better to me - but I couldn’t really explain why. As I dug into it I came up with the thought that my approach had more potential error in each of the rates - it felt harder to calculate each of these, and there were a lot of variables to account for and figure out. I felt like I had a better grasp of the variables in the second approach and thus it felt more trustworthy.

The two examples above illustrate two approaches to estimation: the first is a bottom-up form of estimation, and the second is a top-down form of estimation. In bottom-up estimation we are building our set from a small, tight number of things - including more and more as we go. In a top-down estimation we are shedding our set, filtering out more and more as we refine our estimation.

So you might be thinking: “oh, so top-down is more accurate” or you might think maybe I just didn’t do a good job representing a bottom-up estimate (totally fair). I want to argue that neither of those viewpoints is the correct way to estimate something.

I think the correct way to estimate something is to provide both an independent bottom-up and top-down estimate; then you can calculate the difference between your estimates. If your estimates are close - congratulations, you probably have a pretty good estimate. If they are far off (say an order of magnitude or more) then you’re clearly missing something. You need to add more details to at least one of your estimates. This difference gives you a band of confidence - an upper and lower bound of how sure you are. You effectively get a clamping or pigeonholing effect.

This is an extremely powerful tool when it comes to justifying decisions for which you have imperfect data too. Let’s say you’re not sure how many rooms hotels have. You might instead phrase the problem as a boundary value problem: “probably on the low of 30 rooms per hotel and on the high of 200 - who knows!”. That is a perfectly fine methodology, and it will turn your upper and lower bounds into upper and lower ranges. Maybe they even intersect which might make you very suspicious that the real answer lies in that intersection. You won’t always be right, but you’re being data-driven - your informing your decisions with science and mathematics and it’s this persistence and rigor that really counts.

You might be thinking: “aha, but what if I’m wrong or I completely miss some variable”. That is valid too. In fact, the best predictors in the world tend not to be simple - nor do they tend to be complex. Instead they are many and independent. What’s the best way to predict how many bottles of shampoo you’d need? Find 200 ways to predict it and aggregate their results. This approach of coming with a matrix of predictors gives you many knobs which you can tune and perfect over time. This technique comes up in many different places, but one of my favorites is machine learning where it’s called ensemble learning - the idea being that it can be a lot more powerful to run many different types of models, compose their results, and try to iron out their aggregate error than to try and create a single model that perfectly solves the problem. If you’re interested in learning more, I’ve found examples of this come up in proofs for algorithmic analysis and real analysis as well.

There isn’t a limit to how far you can take this concept, but I do feel like there are quick diminishing returns unless you can take advantage of the margins (like in machine learning where small percentages can really pay off). For the lay-person such as myself, simply providing both a top-down and bottom-up estimation to clamp your estimation winds up being good enough for justifying most gut decisions.

On Being Impactful

I’ve felt impactful at a few roles in my life. At Stripe it feels like I’m still getting there, but I at least see a hazy road to success and it does not feel like a hopeless endeavor.

I think impact is relatively easy to measure. Most things in business can be boiled down to numbers, and those numbers can be tied back to dollars and those dollars can be used to compare against other dollars. The point is that dollars can be the common denominator of basically all changes and actions. To me impact of any change or action can be modeled as:

(impact of change) = (duration of time change will be applicable) * (% chance of change/action being hit) * (diff in number of dollars)

A positive impact results in an increase in dollars, a negative impact is represented by a decrease in dollars.

For the sake of it, let’s list out some examples:

  • A change that is applicable for a long time, has a low chance of occurring, and has a low item-price difference → could be very impactful if we wait long enough
  • A change that is applicable for a short time, has a high chance of occurring, and has a low item-price difference → could be very impactful if the chances are high enough
  • A change that is applicable for a short time, has a low chance of occurring, and has a high item-price difference → could be very impactful if the price difference is high enough

In isolation, all of these have a questionable probability of being very impactful (note that “very impactful” implies that it is more impactful than other potential uses of time). However when we combine these then our confidence in high impact increases dramatically.

We have to admit however that it is difficult to know all of these things ahead of time. It may seem like we’ll stick with a new refactoring for a long time, but we might decide in 6 months that we’re going to rewrite it all anyways because of another problem. We may think that something has a high chance of occurring, but we fail to recognize that other changes greatly reduced to likelihood of our change occurring. We might think a new product is going to be high value and thus lead high sales prices, but it turns out users aren’t willing to pay for it and we have to lower the price. We deal with each of these with various estimations of course, and - even when our ideas fail to be impactful - this is the right approach.

What I want to focus on though is the opposite side of the coin: recognizing changes and actions that will have low impact (and subsequently not doing them). I feel that recognizing low impact work is even more valuable than recognizing high impact work as low impact work is a waste of time (assuming you have the ability to take on other work, of course - low impact work can be a great learning tool).

Incident-driven work is a good playground for exploring this idea, and I think that’s the case because incidents are often intense and blood-pressure-raising events. People are often unhappy that something happened the way it did and want to prevent it from happening again. I want to argue that this can lead to expensive low impact work and would like to stress how using a framework to direct all of our work (even work that is recent and stinging) is useful for ensuring we are reasoning rationally about what we choose to spend time on.

Let’s take a simple example:

  • Application XYZ had a 10 minute outage. On investigation it turns out that an operator had incorrectly configured the application, deployed it during off-hours, and did not immediately recognize the problem. Once the outage occurred, it took 15 seconds to recognize the issue. It took 2 minutes to deploy the rollback and reboot the application.

I’m sure you can already start rattling off ideas for changes we could be making here:

  • The application was incorrectly configured. How did this happen? Maybe we should write a script to validate configuration prior to deploys
  • The application was deployed during off-hours. Why? Maybe we should require all off-hours changes to be verified by two on-call engineers prior to being deployed
  • We didn’t immediately recognize the problem. Why did it take so long to recognize? Maybe we need to improve alerting to ensure we are quickly notified when things go wrong
  • Our end-to-end rollback took 2 minutes. Why did it take so long? Maybe we need to rethink our deploy and rollback strategy

The naive approach here is to do all of these, and I couldn’t fault anyone who felt they were all valuable solutions. I would agree that all of these could’ve prevented the incident or made the incident go smoother. I want to dig in further however and argue that I think two of these changes are pretty likely to be low impact.

  • The first change proposes writing a validation script. How often do we see misconfigurations happen? Are they really so common to justify both the development and support of this feature over many years? Will it be worth the additional overhead it will cost us to run this in CI or development overhead it will cost us in process? I would argue that there is not a strong case for this to be high impact, and thus I feel this is a low impact proposal

  • The second change involves adding additional process to off-hours changes. It’s likely that we’ll want to make off-hours changes in the future even though I’m not sure it’d be very high frequency. This process would likely last us many, many years and given it is a human process rather than a technical one it shouldn’t atrophy as technology changes. How often do people deploy off-hours? When they do, how often do things break and how badly? I’m a little concerned that there isn’t enough evidence that this definitely will be high impact. It’s probably a good idea so everyone can have dinner with their families instead of working, however as we’ll look at soon - this doesn’t really feel like it is the need that the incident is calling for. Not enough evidence this would be high impact - thus, low impact for me

  • We could improve our alerting so that we know more quickly when things go wrong. Assuming the operator was truly acting in the company’s best interest, one can probably sympathize with the operator. We should know when things break, and we should know fast. Things will break again and we will want to know quicker in those cases too. Our company is growing and adding new devs all the time, so it’s likely we’ll experience this many times in the future. Faster failure detection is something we’ll benefit from for a long time and would set an example for our other services. Moreover, the difference between ~8 minutes to recognize an issue and a few seconds could be massive considering our company’s volume. This seems like a very high impact piece of work

  • Our rollback took 2 minutes. It’s likely that we’ll need to rollback again in the future. We can look up our bad deploy rates to get an estimate for how much time per year this could save us and subsequently how many dollars that additional availability could save us. This work could also scale to our other systems too. This seems like it could be high impact - we’d like to do a bit more verification, but it is promising and the future wins are exciting

Incidents are the prime candidate for measuring impact, and I feel that incidents usually provide very clear measures you can use for justifying impact. Hopefully I helped demonstrate how you can triage proposals to avoid doing a lot of work that you never reap the rewards from. I don’t mean to talk away all remediations either - sometimes it is the right thing to spend time on the spot fix or the bespoke answer, but usually it is not. I haven’t even looked at the expense of solutions - that is also worth considering to make sure you weigh the cost-benefit.

There are other issues with time-pressured work that I haven’t covered yet but are worth mentioning. Work that is taken on hastily is less likely to have edge cases well-thought through. What happens if your changes don’t scale or you miss important cases that aren’t caught for years? All changes present a risk too. You might be fixing system A, but you skipped out testing how it integrates with system B - hopefully breaking system B didn’t create a bigger problem than your initial problem with system A.

The point is not to never make change - the point is that spending time on work that you could’ve preemptively predicted as low impact is a mistake, and it’s a huge mistake when it has not only low impact but has negative impact (whether that negative impact be accrued immediately or over time). We should put a bigger emphasis on justifying why we are working on something and we we believe it does not fall into those trappings. We should also challenge our ideas to not only estimate the ceiling of our potential impact, but also recognize the potential floor - maybe we’ll decide not to do something altogether, or maybe we’ll find a reason to put even more resources into it if we can’t find any holes.