
Niels Hoven

Watercooler moments
Because it’s so hard to measure, people tend not to think of word-of-mouth as a product feature. But it can be designed for and optimized, just like anything else. Television has been designing and engineering word of mouth virality for years. This essay is about how to do it in software.

Specifically, I want to talk about a tactic that was once prevalent in television that is now beginning to resurface in software: the watercooler moment.

Word of mouth virality is driven by watercooler moments – experiences that are so memorable that you can’t wait to talk about them with your friends at the watercooler the next day.

Famous watercooler moments

In 1980, CBS used the advertising catchphrase “Who shot J.R.?” to promote the TV series Dallas. Viewers had to wait 8 months to find out the answer. A session of Turkish Parliament was even suspended so that legislators could get home to see the answer revealed. It was the highest-rated TV episode in US history at the time, with 83 million people tuned in to discover what happened.

who shot jr magazine cover

When Ellen DeGeneres came out as gay, there was rampant speculation about whether her character on her sitcom Ellen would come out as well. And she did, in an award-winning episode in April 1997 that generated enormous publicity and a nationwide conversation. The episode was the highest-rated episode of Ellen ever, with 42 million people tuned in to see the event.

ellen says "yep, I'm gay"

During the live broadcast of Super Bowl 38’s halftime show, Janet Jackson’s chest was exposed during a dance routine with Justin Timberlake. The moment, which became the most watched moment in TiVo history, resulted in 540,000 complaints to the FCC, “Janet Jackson” becoming the most searched phrase of 2004, and the phrase “wardrobe malfunction” entering the popular lexicon.

Janet Jackson wardrobe malfunction

The fact that moments can be planned or scripted doesn’t make the emotions they create any less genuine. Watercooler moments transcend the boundaries of their medium, sparking conversations in the real world to become communal experiences.

Designing watercooler moments

People are social animals. We have an instinctual desire to tell stories. Stories help us make sense of the world, share useful information, and reinforce bonds. They are the currency of human connection.

Watercooler moments turn a one-off event into a communal experience. People retell the story, share the story, interpret the story, discuss and argue its meaning. Interesting drama involving interesting participants provides endless fodder for discussions of motivations, ethics, and morality.

So creating a compelling story is the first step in creating a watercooler moment. But since you (a software developer) presumably have no script or characters to rely on, your app itself will have to create the story on its own.

Products that generate stories

Unexpected emotions create compelling stories. The more unexpected the event, and the more extreme the emotion, the more powerful the desire is to share it.

Any extreme emotion will get people talking. But while negative ones (outrage, anger, disgust, etc) are exploited to great success by the media, they’re generally not emotions you’d like your product to generate. So for now, let’s focus on tactics that generate unexpected moments of delight.

Example: Asana monster

Asana yeti monster
Emotions don’t necessarily have to be that extreme. Case in point: the little blue yeti that occasionally pops his head up after you move a card in Asana. An unexpected moment of delight can be enough to get people talking. Unconvinced? Just search for “asana narwhal” on Twitter.

Example: Hearthstone

This is the only example I’ve included from gaming, but it’s my favorite due to both the intensity of emotion and the intentionality behind its design. To intentionally engineer watercooler moments, Hearthstone’s designers created a number of cards (such as Millhouse Manastorm, shown below) with probabilistic effects that would, on rare occasions, completely change the course of the game in a spectacular way.
millhouse manastorm hearthstone card
Dramatically snatching victory from the jaws of a punishing defeat (or vice versa) is the sort of intensely emotional experience that you can’t help talking about, no matter which side of it you were on.

Example: Zappos customer service

Zappos uses exceptional customer service to create memorable moments for their customers. Sometimes these stories are so powerful that they even make the news, like overnighting a pair of shoes to a wedding for free because the original pair was routed to the wrong location.

Example: Tinder

Some apps are fortunate enough to generate watercooler moments naturally. Tinder grew to 50 million users in 2 years through word of mouth by allowing people to get laid (or at least matched) on-demand without fear of rejection.

Example: ClassDojo

ClassDojo (shameless plug: come work with me!) has also grown entirely through organic word of mouth. It surprises and delights teachers by solving problems they previously considered intractable: creating classroom community and growing parent involvement. ClassDojo is now used in 90% of US schools.

Creating your own watercooler moments

To create watercooler moments, find opportunities to design experiences that are extremely unexpected (e.g. an albino giraffe), extremely delightful (e.g. flying first class), or both (e.g. a surprise party).

Obviously the best case scenario is that your core use case massively exceeds users’ expectations to the point where they can’t stop talking about it. (Think Napster in 1999.) Another great scenario is if your core use case is a series of unexpectedly delightful moments delivering a variable reward stream of dopamine hits directly to the brain. (Think Tinder, which is basically a slot machine that pays out sex.)

But for those of us not fortunate enough to be working on products whose core use cases tap directly into the brain’s pleasure centers, here are some tactics that might help:

Tactics for creating unexpectedness

  • Probability
  • User behaviors
  • Real world events

The simplest way to introduce unexpectedness into your product is to add some kind of probabilistic event. The celebration monsters in Asana, for example, don’t appear every time a card is moved. If they did, they would be expected and therefore boring and unworthy of comment.

Slack uses randomness to great effect with its randomized loading messages. It’s little touches like these lighthearted random messages that let Slack inject personality and delight into a corporate productivity tool.
Slack loading message
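To make the mechanics concrete, here’s a minimal sketch in Python. The names, probabilities, and messages are all hypothetical (this is not Asana’s or Slack’s actual code); the point is how little machinery it takes to add a rare celebration or a randomized loading message:

```python
import random
from typing import Optional

# Keep the probability low: if the celebration fired every time,
# it would be expected, and therefore boring. (Hypothetical value.)
CELEBRATION_PROBABILITY = 0.05
CELEBRATION_CREATURES = ["yeti", "narwhal", "unicorn", "phoenix"]

# Hypothetical lighthearted loading messages.
LOADING_MESSAGES = [
    "Reticulating splines...",
    "Warming up the hamsters...",
    "Counting backwards from infinity...",
]

def maybe_celebrate() -> Optional[str]:
    """Occasionally return a creature to show after a completed action.

    Returning None is the common case -- the rarity is what makes the
    moment worth mentioning at the watercooler.
    """
    if random.random() < CELEBRATION_PROBABILITY:
        return random.choice(CELEBRATION_CREATURES)
    return None

def loading_message() -> str:
    """Pick a random message to show while the app is loading."""
    return random.choice(LOADING_MESSAGES)
```

Tuning the probability is the interesting part: frequent enough to be discovered, rare enough to stay surprising.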
Another option is using user behaviors, particularly ones outside of core usage. Could something interesting happen if a user accidentally swipes instead of taps? Maybe some parts of the UI that don’t look interactable are actually responsive. Or maybe there are some easter eggs for your users to discover.

Real world events are also good opportunities to deliver unexpected experiences. This is becoming common enough that it doesn’t have the impact that it used to, but it still gets users talking to see snow collecting on the UI during the holidays, or rainbow trails during Pride, or pumpkins on Halloween.

Tactics for creating delight

  • Next level visual polish
    • Animation
    • Particle effects
  • Characters
  • Personal messages from us, or for you
  • Celebrate a real user accomplishment

A classic way to create delight is through UX and visual polish. While a baseline level of polish and usability is expected in any app these days, taking your polish to a level above and beyond is a great opportunity to create delight.

Fabulous is one recent app that made me feel that sense of delight. Its clean yet whimsical UI was so enjoyable to use that even my non-designer friends couldn’t stop talking about it.

There are countless tactics beyond visual polish to “juice” up the delightfulness of an experience, but animations, particle effects, and cute characters are always safe bets.

Personalization is another great way to surprise and delight a user. In a world where we’re used to being on the receiving end of impersonal corporate emails, a message stands out when it is clearly written to me personally, with empathy and understanding for my unique situation. Alternatively, you can get personal on the sender side: open up about yourself and send an authentic email from you, not just from a faceless company.

Finally, recognizing your users’ accomplishments is a great way to delight them. If a user does something exciting in your app, help them celebrate! Maybe they just made their first post, maybe they returned to your app after a month away, maybe they discovered emojis for the first time.
Foursquare mayor popup

Realize that many actions that seem mundane to you still feel like big accomplishments to your users, so help them celebrate! Pop a congratulatory message, shower them with confetti, send them a certificate of accomplishment, or something else creative.

In summary

To grow word of mouth, delight users in surprising ways. Find opportunities to increase delight or increase surprise until people can’t stop talking about you.

Buckets with eggs

Here’s a familiar experience: You’re trying to improve retention so you run a series of experiments. You end up releasing the same control experience to several cohorts, with dramatically different results each time. Your sample size was large, your source of users hasn’t changed, and the tests were close enough together that there shouldn’t be any seasonality effects. What’s going on?

It turns out that there’s a nuance in retention calculations that trips a lot of people up. Let’s call it “Bad Bucketing”, and even some analytics companies are getting it wrong.

Wait, isn’t retention just a standard calculation?

While most metrics have a straightforward intuitive explanation, if you’ve ever rolled your own analytics and done the actual calculations, you’ll quickly realize that even basic metrics require you to make numerous decisions.

(For example, for retention: Are we looking at all events, or only session-start events? Or for conversion: Are we calculating it as a percentage of our active users? Or only the ones who opened the app? Or only the ones that viewed the sales page?)

Frequently the right answer to these decisions is obvious. And sometimes the answer doesn’t really matter that much. But sometimes, the right answer is non-obvious and also REALLY matters. Calculating retention is one of those times.

Calculating retention

As a concept, retention is pretty intuitive. It answers the question of “Do people like my app enough to keep coming back to it?” Retention measures the percentage of users that come back to an app on a specified time scale: usually daily, weekly, or monthly. 1-day retention is frequently described as “What percentage of today’s users come back tomorrow?”, and 1-month retention as “What percentage of this month’s users come back next month?”

Conan, what is best in life? Retention

Retention is one of the most fundamental product metrics. It’s a proxy for product market fit, user lifetimes, and everything that is good. It is arguably the most critical metric to track for any product, but the most common and intuitive way of calculating retention has serious flaws, regardless of sample size.

The Bucket Blunder

The most intuitive way of calculating retention considers each day as a separate bucket. Count the number of new users in today’s bucket – that’s the cohort for today. Then calculate what percent return tomorrow. That percentage is your 1-day retention. This calculation is simple, intuitive… and wrong.

Treating all the users in your bucket the same glosses over the fact that users who show up earlier in the day have to stay engaged for a longer period of time in order to count as retained, as compared to users who show up late in the day.

regular retention

Basic retention: Both green and purple arrive on Day 0 and return on Day 1

A more reliable way to calculate retention is to consider each user’s install time individually. A user counts as retained for one day if they show up between 24 and 48 hours after their initial install. In other words, instead of asking “What percentage of today’s users come back tomorrow?”, ask “What percent of people who install today come back 24 hours later?”

Rolling retention windows: Green user has retained for 1 day after install. Purple user has not returned 1 day after initial install.

How serious is this problem, really?

I’ve seen retention measurements literally double because the user acquisition (UA) bursts happened to hit just right. This results in false celebration now, followed by a wild goose chase when the next test inevitably comes back far lower.

Consider two users: Early Ellie installs at 12:01 am on June 1, and Late Larry installs at 11:59 pm on June 1. Ellie has to engage 24 hours after install to count as retained for 1 day. Larry only has to return 2 minutes later. As a result, installs later in the day will show much higher retention numbers.

The size of this effect further depends on how you’ve defined “being active”. Does a user count as being active on June 2nd if we see any activity from him? Or are we only looking at session start events? If taking any action inside our app qualifies Larry as an active user (a reasonable assumption), then a single 2-minute first session, from 11:59pm to 12:01am, is enough for our system to say he’s been retained for 1 day.
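To make the difference concrete, here’s a minimal sketch in Python. The data structures and function names are hypothetical (this isn’t any particular analytics provider’s implementation), but it shows how calendar-day bucketing and rolling 24-hour windows disagree on exactly this scenario:

```python
from datetime import datetime, timedelta

def bucketed_d1_retention(installs, events):
    """Naive calendar-day bucketing: % of users who install on day D
    and have any event on day D+1. (The intuitive-but-wrong approach.)"""
    returned = 0
    for user, install_time in installs.items():
        next_day = install_time.date() + timedelta(days=1)
        if any(e.date() == next_day for e in events.get(user, [])):
            returned += 1
    return returned / len(installs)

def rolling_d1_retention(installs, events):
    """Rolling windows: % of users with any event 24-48 hours after
    their individual install time."""
    returned = 0
    for user, install_time in installs.items():
        window_start = install_time + timedelta(hours=24)
        window_end = install_time + timedelta(hours=48)
        if any(window_start <= e < window_end for e in events.get(user, [])):
            returned += 1
    return returned / len(installs)

# Early Ellie installs at 12:01am on June 1, Late Larry at 11:59pm on June 1.
installs = {
    "early_ellie": datetime(2018, 6, 1, 0, 1),
    "late_larry": datetime(2018, 6, 1, 23, 59),
}
events = {
    # Ellie never comes back; Larry's only session runs two minutes past midnight.
    "late_larry": [datetime(2018, 6, 2, 0, 1)],
}

print(bucketed_d1_retention(installs, events))  # 0.5 -- Larry counts after a 2-minute session
print(rolling_d1_retention(installs, events))   # 0.0 -- neither user returned a day later
```

Larry’s two-minute session straddling midnight counts as a retained day under bucketing, while the rolling calculation correctly reports that nobody came back a day later.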

How analytics companies are calculating retention

If you don’t feel up for the challenge of rolling your own analytics, one of the benefits of an off-the-shelf solution should be that you don’t have to worry about any of this. Unfortunately, that’s not the case, because all the top analytics providers calculate retention differently.

Consider how incredible it is that after at least a decade of retention being widely recognized as the single most important product metric, there’s still no standardized way to calculate it and each of the top analytics-as-a-service providers is just using their own judgment.

Mixpanel's retention calculation

Mixpanel: YES!

Mixpanel calculates retention correctly. Hooray Mixpanel!

Flurry's retention calculation

Flurry: Intuitive, but wrong

Flurry calls this metric “return rate” and calculates it the intuitive-but-wrong way, but even that is an improvement over their awful retention calculation.

Amplitude retention calculation

Amplitude: Rounded to the nearest hour? Fine, I can live with that

Amplitude changed their calculation recently and now calculates retention correctly for dates after August 18, 2015. They do round to the nearest hour (I’m not sure why, since we have these things called computers that are really good at dealing with clunky numbers) but that’s probably close enough.

Heap's retention calculation

Heap: I’m just confused by this

Heap is unclear. Their description of daily retention looks correct, but their description of weekly retention looks incorrect. I’ve emailed them for clarification. (EDIT 7/13/2018: Heap was very helpful and it sounds like they’re calculating retention correctly, using the same methodology as Mixpanel. Hooray Heap!)

How concerned should I personally be?

This is a particularly serious problem if you tend to burst user acquisition when running your experiments.

If the UA faucet gets turned on early in the morning for experiment 1, but late in the day for experiment 2, v1’s test will be full of Early Ellies, and v2’s test will be full of Late Larrys. The product changes won’t even matter; v2’s retention metrics will dominate v1’s.

Turning on UA at the same time of day for each test doesn’t solve the problem either, because the time required for ad networks to ramp up the volume on your campaign varies from day to day and week to week.

This happens on longer timescales, too. Does your August cohort have great monthly retention? Maybe that’s because all your August users installed during back-to-school in the last week of August, so they only had to stick around for a few days to count as retained in September.

Rolling retention and you

I’ve been referring to “What percent of people who install today come back 24 hours later?” as “rolling retention”, because of the rolling 24-hour buckets that are specific to each user. In rolling retention, any user counts as retained if she returns between 24 and 48 hours after her initial install, no matter what time of day she installed.

(Ideally, we would just call it “retention”, but until everyone starts calculating retention the same way, I guess we’re stuck qualifying the name somehow.)

“What % of people who install today return tomorrow?” is an intuitive question, but gives unreliable results. Instead ask, “What % of people who install today come back 24h later?” On the surface the questions are the same, but the latter gives much more trustworthy results.

If you start calculating retention this way, be aware that there will be some weirdness around the end of your retention curves.

You’ll now need to wait 48 hours to get your day 1 retention, to give the Late Larrys a full 24 hours to return. And while you’re waiting to see if Larry returns for his day 1 retention, there could be an Ellie from the same cohort who’s already come back for day 2.

It’s a pretty minor nuisance, though, and well worth it to have retention metrics that you can actually rely on. Have you run into something similar? If so, I’d love to hear about it.

“Everybody gets so much information all day long that they lose their common sense.” – Gertrude Stein

My first job as a product manager was in games. I worked at Playdom, Zynga’s primary competitor during the social gaming boom of 2009. The sophistication of our data analysis techniques and the platform supporting them played a large role in our eventual $700 million acquisition by Disney.

data-picard

For most companies at the time, “analytics” just meant counting pageviews. If you were really fancy, you could track the order in which users viewed certain pages and assemble a funnel chart to quantify dropoff. Gartner’s report on the state of web analytics in 2009 describes a range of key challenges, like “how to obtain a sustainable return on investment” and “how to choose a vendor”.

web analytics challenges 2009

“Why would we need an analyst to tell us what our hitcounter is saying?”

In contrast, social gaming powerhouses like Zynga and Playdom were custom building their own event-based analytics systems from the ground up. They tracked almost every action that players took in a game, allowing them to deeply understand their users’ needs and build features to fulfill them, rather than simply taking their best guesses.

For me, it was incredibly exciting to be on the cutting edge of analytics. For the first time, we could get real insights into players’ actions, aspirations, and motivations. Games are tremendously complex software products with huge design spaces, and even now it blows my mind that for most of the industry’s history, development decisions were made purely on gut instinct.

The power of these new data analysis techniques seemed limitless. Zynga went from zero to a billion-dollar valuation in under 3 years. And while gaming companies were the first to really showcase the potential of event-based metrics, they certainly weren’t the only ones. There was a digital gold rush as startups popped up left and right to bring the power of quantitative data insights to every industry imaginable.

Perhaps the most famous example of putting data-driven design on a pedestal is Marissa Mayer’s test of 41 shades of blue. (It’s an absurd test for many reasons, not the least of which is that with so many different variations, you’re basically guaranteed to discover a false positive outlier simply due to random noise.)
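(As a rough back-of-the-envelope check: if you treat the variants as 41 independent tests each evaluated at a 5% significance level, the probability that at least one clears the bar by chance alone is 1 − 0.95^41 ≈ 88%.)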

In this brave new world, metrics were king. Why would you need a designer? Everything could be tracked, measured, tested. A good PM was one who could “move the metrics”. MBAs and management consultants were hired by the boatload. One friend told me about the time he had to talk his CEO out of firing all the game designers in the company and replacing them with analysts.

A quick note about the game industry

As an aside, the game development industry has interesting market dynamics because of how many people dream of working in it. In some unglamorous industries (e.g. Alaskan crab fishing, logistics, B2B startups), demand for labor vastly exceeds supply. In games, it’s the opposite – many people stay in games out of passion, even when the money doesn’t justify it, leading to a market that is oversaturated and extremely competitive.

The evolutionary pressures of this absurdly competitive market mean that the pace of product innovation is extremely quick. The quality bar constantly increases, production costs go up, advertising prices rise, margins disappear, and mediocre products fail.

The gaming market’s competitiveness forces rapid innovation just to keep up, and when better tactics emerge, they are quickly adopted and rapidly bubble up to dominate the top of the market. As a result, the gaming market can be a bellwether of trends in the larger tech market, such as the power of the freemium model, microtransactions, sophisticated performance marketing, and strong product visions.

The competitive advantage of a strong product vision became undeniable in early 2012. At that time, Zynga had been around for about 5 years, with a peak market cap over $10 billion, and the company’s success had been repeated on a smaller scale by other strongly “data-driven” gaming companies on Facebook and on mobile.

However, an interesting trend was beginning to occur, with new games like Dragonvale and Hay Day dominating the mobile charts with innovative mechanics supported by a single, unified product vision.

Purely metric-driven iteration with no vision or direction could bring a product to a local maximum, which was good enough in the very early days of mass-market casual gaming. But as the market matured and competition intensified, a local maximum wasn’t good enough. Derivative products and products developed by only metric-driven iteration were vastly inferior to products driven by a strong creative vision from their inception, like Supercell’s Clash of Clans or Pocket Gems’ Episode. That vision was a necessary prerequisite to create a product strong enough to land at the top of the charts.

apple top grossing

Fortnite was announced in 2011 and launched in 2018. Gaming is a tough industry.

And being at the top of the charts is critical – revenue on the Top Grossing Charts follows a power law, with the handful of apps at the very top of the charts making more money than all the rest of the apps put together. As Zynga’s apps slipped down the charts, their inability to adapt to this new world became apparent and their stock price fell 80%.

Data-driven design had failed, just as intuition-driven design had before it. The industry needed a more fundamental shift in perspective. Good teams now design for the long term, guided by intuition but informed by data.

Personally, I like to emphasize the difference between data-driven design (relying on data to make decisions because we have no user empathy) and data-informed design (using data to understand our users, then building features to delight them).

Data-driven design

When I say “data-driven design”, I’m referring to the mentality of “letting the data decide”. In this paradigm, PMs and designers surrender to the fallibility of their intuition, and thus they elect to remain agnostic, using A/B testing to continuously improve their products.

A number of companies I’ve talked to have bragged about the fact that they’ve removed intuition from the decision-making process. It’s comforting to be able to say “We don’t have to depend on intuition because the data tells us what to do!”

Of course, everyone knows that data is noisy, so companies use large test groups and increased rigor to mitigate those concerns. But the real problem isn’t tests giving the wrong answer, so much as it is the assumption that the infinite degrees of freedom of creating a compelling product can be distilled to a limited number of axes of measurement.

With the exception of straightforward changes like pricing, most design changes have complex effects on the overall user experience. As a result, treating metrics as end goals (rather than simply as indicators of good product direction) results in unintended consequences and a degraded user experience. Testing isn’t a magic bullet either. Sometimes this degradation occurs in an unexpected part of the user experience, and sometimes it occurs on a different timescale than the test.

Split tests typically gather data for a period of days or weeks. User lifetimes are typically months or years. If you’re only looking at the data you’ve gathered, it’s easy to unintentionally trade off difficult-to-measure metrics like long term product health in exchange for easy-to-measure short-term metrics like revenue.

Example: Aggressive paywalls

Zoosk is a dating app that built a huge userbase as a Facebook app during the heyday of data-driven design. They’re extremely aggressive with their monetization, with misleading buttons designed to constantly surprise the user with paywalls.

Oh boy, a message!

Gotcha! Paywall!

A company naively focusing on revenue will naturally iterate their way to this point, experimenting with increasingly early and aggressive paywalls and discovering that the spammier the app becomes, the more money they make.

However, while an aggressive approach can be very profitable in the short run, it quickly drives away non-payers and makes it difficult to engage new users. In the dating space, this results in a user experience that becomes worse every month for subscribers.

Sure enough, judging from AppAnnie/SensorTower estimates, Zoosk’s revenue has probably fallen about 50% since their 2014 high of $200 million.

Example: Searches per user

One of my favorite stories is from a friend who worked on improving the search feature at a major tech company. Their target metric was to increase the number of searches per user, and the most efficient way to do that was to make search results worse. My friend likes to think that his team resisted that temptation, but you can never be totally sure of these things.

Example: Brand tradeoffs

If you start a free trial with Netflix, you’ll get an email a few days before the end of the free trial reminding you that your credit card is about to be charged. I’m sure that Netflix has tested this, and I’m sure that they know that this reminder costs them money. However, they’ve presumably decided to keep the reminder email because of its non-quantifiable positive effect on the Netflix brand (or more precisely, to avoid the negative effect of people complaining about forgetting to cancel their free trial).

Netflix email

Short term revenue loss, long term brand gain

Notably, Netflix only reminds you before billing your card for the first time, and not for subsequent charges. At some point, a decision was made that “before the first charge but not before subsequent ones” was the correct place to draw the line on the completely unquantifiable tradeoff between short term revenue loss and long term brand benefits.

Example: Tutorial completion

A standard way to measure the quality of an onboarding experience is to measure what percent of users who start a tutorial actually finish it. However, since there will always be a natural drop off between sessions or over days, one obvious way to increase tutorial throughput is to build a tutorial that attempts to teach all the features in a single session.

Sure enough, tutorial throughput goes up, but now users are getting overwhelmed and confused by the pace of exposure to new menus and features. How to help them find their way? Maybe some arrows! Big, blinking arrows telling the user exactly which button to tap, directing them into submenus 7 levels deep and then back out.

You’ll be able to do this on your own next time, right?

Arrows everywhere can boost tutorial throughput, but all the users will be tapping through on autopilot, defeating the purpose of having the tutorial in the first place! Excessive handholding of users increases tutorial completion (an easy-to-measure metric), but decreases learning and feelings of accomplishment (difficult-to-measure but very important metrics).

Example: Intentionally uninformative communication

“You’ve been invited to a thing! I could tell you where and when it is in the body of this email, but I’d rather force you to visit my website to spam you with ads. Oh, look at how high our DAUs are! Thanks for using Evite!”

email from evite

If this email were helpful, Evite would have to find a different way to make money

Equally frustrating to users: Push notifications that purposely leave out information to force users to open the app. Users will flee to the first viable alternative that actually values the user experience.

Example: User experience

In a purely data-driven culture, justifying investment in user experience is a constant uphill battle.

Generally, fixing a minor UI issue or adding some extra juice to a button press won’t affect the metrics in any kind of a measurable way. User experience is one of those “death by 1000 cuts” things where the benefits don’t become visible until after a significant amount of work has already been put in.

As a result, it’s easy to constantly deprioritize improvements to the user experience under the argument of “why would we fix that issue if it’s not going to move the needle?”

To create great UX requires a leap of faith, a belief that the time you’re investing is worthwhile despite what the metrics say right now.

Hearthstone is a great example. Besides being a great game, it’s full of moments of polish and delight like the finding-opponent animation and interactive backgrounds that are completely unnecessary from a minimum viable product perspective, but absolutely critical for creating a product that feels best-in-class.

Example: Sales popups

When I was at Playdom, we would show popups when an app was first opened. They’d do things like encourage users to send invites, or buy an item on sale, like this popup from Candy Crush does.

candy crush sale popup

Do you want revenue now or a userbase in the future?

I hate these. They degrade the user experience, frustrate the user, hurt the brand, and generally make interacting with your product a less delightful experience.

On the other hand, though, they WORK. And you can stack them: the more sales popups you push users through, the more money you make – right up until the point where all of your metrics fall off a cliff because your users are sick of the crappy experience and have moved on.

It always gave me a bit of schadenfreude to open a competitor’s game and see a sale popup for the first time, because the same pattern always repeated itself: As the weeks went by, more and more aggressive and intrusive popups would invade the user experience, right up until the game disappeared from the charts because all the users churned out.

Even retention isn’t foolproof

As a final note: most of the examples above involve some variation on accidentally degrading retention, but even optimizing for retention doesn’t prevent these mistakes if you’re optimizing over the wrong timescale or for the wrong audience of users.

Typically, companies will look at metrics like 1-day, 7-day, 30-day retention because those numbers tend to correlate highly with user lifetimes. But focusing on cohort retention runs the risk of over-optimizing your product for the new users that you’re measuring, perhaps by over-simplifying your product, or neglecting the features loved by your elder users, or creating features that benefit new users at the expense of your existing audience.

Data-informed design

In contrast to “data-driven design”, which relies on data to drive decisions, in “data-informed design” data is used to understand your users and inform your intuition. This informed intuition then guides decisions, designs, and product direction. And your intuition improves over time as you run more tests, gather more data, and speak to more users.

When I’m making the case for the benefits of introducing intuition back into the decision-making process, there are two benefits that I keep coming back to: leaps of faith, and consistency.

Leaps of faith

Purely data-driven product improvement breaks down when a product needs to get worse in order to get better. (If you’re the sort of person who likes calculus metaphors, continuous improvement gets you to a local maximum, but not to a global maximum.) Major product shifts and innovations frequently require a leap of faith, committing to a product direction with the knowledge that initial metrics may be negative for an extended period of time until the new direction gets dialed in and begins to mature.

When Facebook introduced its newsfeed, hundreds of thousands of users revolted in protest, calling for boycotts and petitioning for removal of the feature. Now we can’t imagine Facebook without it.

Consistency

When products are built iteratively, with decisions made primarily through testing and iteration, there’s no guarantee of a consistent vision. Some teams take pride in the fact that their roadmaps only extend a week into the future. “Our tests will tell us what direction to go next!”

Data-informed design helps your product tell a consistent story. This is the power of a cohesive product vision.

It can be hard to explain exactly WHY a cohesive product vision translates to a better product, and also why it’s so hard to get there purely by data-driven iteration. Perhaps an extremely contrived example can help illustrate my point.

Let’s say you’re designing a new experience. You’re committed to good testing practices, and so over the next several months, you run tests on all 20 features you release. Each test is conclusive at the 5% significance level, and sure enough, users respond very positively to the overall experience that your tests have led you to.

Now, even with rigorous testing at a 5% significance level, 1 out of 20 tests will be wrong, and interestingly enough, 19 of the tests are consistent with the belief that your users are primarily young women, while 1 of them conclusively indicates that your users are middle-aged men.
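(A quick sanity check on that “1 out of 20”: at a 5% significance level, each test has a 1-in-20 chance of coming back “conclusive” by chance alone, so across 20 tests you should expect roughly one spurious result – the probability of at least one is 1 − 0.95^20 ≈ 64%.)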

Allowing your decision-making to be informed by data rather than dictated by it allows the team to say “Let’s just ignore the data from this particular test. Everything else we’ve learned makes us quite confident that we have a userbase of young women, and we believe our product will be better if all our features reflect that assumption.”

Obviously, if more tests come back indicating that your users are middle-aged men, your entire product vision will be thrown into question, but that’s ok. It’s preferable to ignore data in order to build a great product reflecting a unified vision that you’re 95% confident in, rather than creating a Frankenstein with 95% confidence on each individual feature.

The role of data in data-informed design

I believe that saying “just let the data decide” isn’t good product management, it’s an abdication of responsibility. As a PM or a designer, your job is to develop empathy for your users. Use data to understand them, their aspirations, and their motivations, and then take a position on what direction the product needs to move to best serve them.

Sometimes this means knowing your users better than they know themselves, as in the Facebook newsfeed example. More commonly, it means having enough faith in your product vision to recognize early false negatives for what they are, and being willing to grind through the trough of sorrow to realize your product’s potential.

Eric Ries gives an example of a registration flow that he worked on that performed poorly after launch. But based on earlier conversations with users, the team still believed in the design, and chose to continue working on it despite the data. Sure enough, it turned out that there was just one relatively minor design flaw, and once that was discovered, the new flow performed much better.

In this case, it was a relatively small feature with a relatively small flaw. But the same pattern holds on a larger scale as well – as visions become more innovative and ambitious, sometimes it requires commitment to a product vision over an extended period of time to see a product achieve its potential.

When to stop

I’m often asked, “If you know you’re just going to keep building no matter what the data says, then what’s the point in having data at all? How will we know when to kill the project?”

That’s a great question, since it’s often difficult to tell the difference between a false negative and a true negative. But there are two clear red flags to watch for: when a team loses faith in the project, and when a project stops improving. Ed Catmull cites the same criteria in Creativity, Inc. for knowing when one of Pixar’s movies is in trouble. Recognizing when a product is stuck is a challenge for any company committed to creativity and innovation, regardless of medium.

In data-informed design, learning is a continuous and parallel process. Rather than trying to design a rigorous enough test to validate/invalidate a direction at a particular moment in time, data is consistently gathered over time to measure a trajectory. If the team understands their users well, their work should show a general trend of improvement. If the product isn’t improving, or even if the product IS improving, but the metrics aren’t, then that’s a sign that a change is needed.

Some rules of thumb for data-informed design

It can be hard to know how to strike the right balance between data and intuition, but I do have a few rules of thumb:

Protect the user experience

Peter Drucker famously wrote: “What gets measured gets managed.” That’s true, but in my experience, “What gets measured gets manipulated, especially if you are being evaluated on that metric.”

The challenge in product development is recognizing when we’re “teaching to the test”, regardless of whether it’s intentional or not. For anything that we’re measuring, I like to ask “is there a way I could move this metric in a positive way that would actually be really bad for our product long-term?” Then I ask, “is the feature I’m thinking about doing some flavor of that accidentally?”

A few examples of good intentions with potential for unintended consequences:

Metric                Tactic                           Result
Tutorial completion   Shorten the tutorial             Users learn less
Conversion            Create a misleading sales page   Buyer’s remorse
Revenue               Run frequent sales               Users trained to only buy at a discount

Have a “North Star” vision

I always advocate for having a “North Star” vision. This is a product vision months or years away that you believe your users will love, based on your current best understanding of them.

Since products take a lot of iterations to get good, early product development is full of false negatives on the way to that North Star. People love to talk about the idea of “failing fast” or “invalidating an idea early”, but a lot of times that just isn’t possible. The threshold for viability in a minimum viable product isn’t always obvious, and sometimes it does just take a little more polish or a few extra features to turn the corner.

The best way to get a more trustworthy signal is to just keep building and shipping. A North Star lets you maintain your momentum during the inevitable periods of uncertainty. Over time, small sample sizes accumulate, and noise averages out. Evidence about the product direction will build with time.

Treat metrics as indicators/hints, not goals

It’s important to remember that metrics are leading indicators, not end goals. Similar to how taking a test prep class to improve your SAT score doesn’t actually increase your odds of college success, features that overfocus on moving metrics may not actually improve the underlying product.

The most important question that data can answer is “does the team understand the users?” If so, features will resonate and metrics will improve over time. To validate/invalidate a product direction, look at the trajectory of the metrics, not the result of any individual test.

The right time to kill a project is when the trajectory of improvement flattens out at an unacceptably low level. Generally this means that a few features have shipped and flopped, which is an indicator that there’s some kind of critical gap in the team’s understanding of their users.

This also means that it can be difficult to walk away from innovative product/feature ideas quickly. This can be an unpopular opinion in circles that are dogmatic about testing, but the fact of the matter is that I have never seen the “spray and pray” approach work well when it comes to product vision.

I’ve tried a wide range of nutrition hacks over the years (high carb, low carb, paleo, fasting, etc) but the only habit that has actually stuck with me is my spinach smoothies.

For over 10 years now, I’ve had a mostly-vegetable smoothie nearly every day. They’re delicious and nutritious and the only way I know to get 3 or 4 servings of fruits and veggies in about 5 minutes.

Most people know that they should be eating more fruits and vegetables (“5-a-day” was the hot campaign a while ago). Most people aren’t even close to that, and even those who try don’t realize that:

  • an entire bag of fresh salad from the supermarket is only about 1.5 servings of vegetables
  • you should really be eating closer to 9 servings a day

That’s a lot of veggies! I actually tried to do that at one point and after a few weeks of eating absurd quantities of salad, I just got tired of chewing. It was time to find a better way.

After a few false starts, my roommate and I discovered that orange juice was the secret to hiding the taste of spinach in a smoothie. Seriously, you can put an absurd amount of greens in a smoothie and not even taste them if you have an orange juice base. Hence, this recipe:

Super Spinach Smoothie

  • 2 servings frozen spinach/kale (1/3 bag)
  • Orange juice (try half OJ, half water if you want it less sweet)
  • 1 serving frozen berries or pineapple or banana or mango etc

Blend and drink.

Fills two glasses with a little left over. Whether this serves one or two is up to you.

A medium-rare steak is 135 degrees Fahrenheit in the center. For thousands of years, the best way to accomplish this was to put the steak on a really hot grill and attempt to pull it off at just the right time. This is silly. Fortunately, technology has found a better way.

Take your steak, vacuum seal it in a plastic bag, and then lower it into a water bath whose temperature is carefully maintained at exactly 135 degrees. Let the steak come up to the temperature of the surrounding water, then pull it out, sear it with a blowtorch or a hot pan, and you’re ready to serve!

This method of cooking (known as “sous-vide”, French for “under vacuum”) has several advantages.

1) No clean-up. Just open the bag, torch the steak, and you’re ready to serve!
2) No overcooking. Overcooking means accidentally bringing food above your target temperature. With sous-vide, the water bath holds your food at the exact desired temperature, so overcooking is impossible.
3) No food safety concerns. Want a super-rare hamburger, but worried about E. coli? Pasteurization is a function of both temperature and time, so you can pasteurize your meat at a relatively low temperature by holding it there for a couple of hours.

Fish takes about 20 minutes and is perfectly cooked every single time. Chicken is moist and tender in a way I’ve never had it before. Sous-vide duck is amazing. After 2 hours, it’s deep red and juicy, unlike the dry grey stuff I had at Chinese restaurants growing up. Flank steak is one of the most flavorful cuts, but is usually one of the toughest. After 2 days in the sous vide, it’s as tender as filet mignon.

I have no desire to eat out anymore, because the food I make at home is faster and tastier. “Making dinner” now consists of taking a piece of meat (still in its original vacuum-sealed packaging from the supermarket) and dropping it into my sous-vide. For vegetables, I blend a spinach smoothie, or if I’m feeling fancy, I’ll put a tray of broccoli in the oven to roast. I can prepare an entire dinner in less than 60 seconds.

I’m writing this blog post because I’ve had a number of friends ask me how I put together my sous-vide setup. This is the email I’ve been forwarding them:

You can buy a countertop sous vide machine for $450. Alternatively, you can build your own for about $75.

I did neither, and bought a temperature controller that I can plug a rice cooker into. It’s cheaper and more flexible than a dedicated sous-vide machine. It has lower risk of electrocuting me than a DIY solution. And finally, if I ever want to sous vide something larger (say, an entire animal), I can just swap out my rice cooker for a larger heating element and I am good to go.

So without further ado, here’s my sous vide set-up (booze is optional, but recommended):

sous-vide-magic

Temperature Controller, $170
(This is the HD version, which is $10 extra, but get it in case you want to power a bigger heater later.)
Update: I’ve been informed that there are cheaper alternatives.

Perforated plate, $15
(You need something to keep the temperature probe away from the food. I use a small metal cheese grater, which works fine.)

Non-digital rice cooker, $30:
(This is big enough to do 2 flank steaks, a small roast, or a rack of ribs)

(optional, but fun) Cooking torch, $35:
(Get a butane refill from your local smoke shop, or just sear your meat in a pan after cooking it. Caveat – some people believe that butane can flavor the meat, and recommend just getting a blowtorch.)

I get my meat from Trader Joe’s already vacuum sealed. You might eventually want a vacuum sealer or a water bath that can handle larger items, but the above setup has been working great for me.

The definitive guide to sous-vide cooking times and temperatures can be found on Douglas Baldwin’s website. If I can’t find the info I need there, a quick google search usually turns up good suggestions. But to get you started, here are some times and temperatures that have been working well for me:

Food             Temperature (°F)   Time      Notes
Duck breast      135                2 hours   Crisp skin-side down in a pan before serving
Flank steak      131                2 days
Pork loin        137                2 hours
Soy ginger cod   132                20 mins   Find it in Trader Joe’s frozen foods aisle. Thaw first.
Salmon           126                30 mins   Add some slices of lemon if you vacuum seal it yourself
Pork shoulder    140                2 days
Pork chops       138                4 hours
Pork belly       155                2 days    Leave it under the broiler afterwards to get super-crispy, and don’t forget plenty of salt!