Stop letting the data decide

on June 27, 2018

“Everybody gets so much information all day long that they lose their common sense.” – Gertrude Stein

My first job as a product manager was in games. I worked at Playdom, Zynga’s primary competitor during the social gaming boom of 2009. The sophistication of our data analysis techniques and the platform supporting them played a large role in our eventual $700 million acquisition by Disney.

For most companies at the time, “analytics” just meant counting pageviews. If you were really fancy, you could track the order in which users viewed certain pages and assemble a funnel chart to quantify dropoff. Gartner’s report on the state of web analytics in 2009 describes a range of key challenges, like “how to obtain a sustainable return on investment” and “how to choose a vendor”.

“Why would we need an analyst to tell us what our hitcounter is saying?”

In contrast, social gaming powerhouses like Zynga and Playdom were custom building their own event-based analytics systems from the ground up. They tracked almost every action that players took in a game, allowing them to deeply understand their users’ needs and build features to fulfill them, rather than simply taking their best guesses.

For me, it was incredibly exciting to be on the cutting edge of analytics. For the first time, we could get real insights into players’ actions, aspirations, and motivations. Games are tremendously complex software products with huge design spaces, and even now it blows my mind that for most of the industry’s history, development decisions were made purely on gut instinct.

The power of these new data analysis techniques seemed limitless. Zynga went from zero to a billion-dollar valuation in under 3 years. And while gaming companies were the first to really showcase the potential of event-based metrics, they certainly weren’t the only ones. There was a digital gold rush as startups popped up left and right to bring the power of quantitative data insights to every industry imaginable.

Perhaps the most famous example of putting data-driven design on a pedestal is Marissa Mayer’s test of 41 shades of blue. (It’s an absurd test for many reasons, not the least of which is that with so many different variations, you’re basically guaranteed to discover a false positive outlier simply due to random noise.)

In this brave new world, metrics were king. Why would you need a designer? Everything could be tracked, measured, tested. A good PM was one who could “move the metrics”. MBAs and management consultants were hired by the boatload. One friend told me about the time he had to talk his CEO out of firing all the game designers in the company and replacing them with analysts.

A quick note about the game industry

As an aside, the game development industry has interesting market dynamics because of how many people dream of working in it. In some unglamorous industries (e.g. Alaskan crab fishing, logistics, B2B startups), demand for labor vastly exceeds supply. In games, it’s the opposite – many people stay in games out of passion, even when the money doesn’t justify it, leading to a market that is oversaturated and extremely competitive.

The evolutionary pressures of this absurdly competitive market mean that the pace of product innovation is extremely quick. The quality bar constantly increases, production costs go up, advertising prices rise, margins disappear, and mediocre products fail.

The gaming market’s competitiveness forces rapid innovation just to keep up, and when better tactics emerge, they are quickly adopted and rapidly bubble up to dominate the top of the market. As a result, the gaming market can be a bellwether of trends in the larger tech market, such as the power of the freemium model, microtransactions, sophisticated performance marketing, and strong product visions.

The competitive advantage of a strong product vision became undeniable in early 2012. At that time, Zynga had been around for about 5 years, with a peak market cap over $10 billion, and the company’s success had been repeated on a smaller scale by other strongly “data-driven” gaming companies on Facebook and on mobile.

However, an interesting trend was beginning to occur, with new games like Dragonvale and Hay Day dominating the mobile charts with innovative mechanics supported by a single, unified product vision.

Purely metric-driven iteration with no vision or direction could bring a product to a local maximum, which was good enough in the very early days of mass-market casual gaming. But as the market matured and competition intensified, a local maximum wasn’t good enough. Derivative products and products developed by only metric-driven iteration were vastly inferior to products driven by a strong creative vision from their inception, like Supercell’s Clash of Clans or Pocket Gems’ Episode. That vision was a necessary prerequisite to create a product strong enough to land at the top of the charts.

Fortnite was announced in 2011 and launched in 2018. Gaming is a tough industry.

And being at the top of the charts is critical – revenue on the Top Grossing Charts follows a power law, with the handful of apps at the very top of the charts making more money than all the rest of the apps put together. As Zynga’s apps slipped down the charts, their inability to adapt to this new world became apparent and their stock price fell 80%.

Data-driven design had failed, as did intuition-driven design before it. The industry needed a more fundamental shift in perspective. Good teams now design for the long term, guided by intuition but informed by data.

Personally, I like to emphasize the difference between data-driven design (relying on data to make decisions because we have no user empathy) and data-informed design (use data to understand our users, then build features to delight them)

Data-driven design

When I say “data-driven design”, I’m referring to the mentality of “letting the data decide”. In this paradigm, PMs and designers surrender to the fallibility of their intuition, and thus they elect to remain agnostic, using A/B testing to continuously improve their products.

A number of companies I’ve talked to have bragged about about the fact that they’ve removed intuition from the decision making process. It’s comforting to be able to say “We don’t have to depend on intuition because the data tells us what to do!”

Of course, everyone knows that data is noisy, so companies use large test groups and increased rigor to mitigate those concerns. But the real problem isn’t tests giving the wrong answer, so much as it is the assumption that the infinite degrees of freedom of creating a compelling product can be distilled to a limited number of axes of measurement.

With the exception of straightforward changes like pricing, most design changes have complex effects on the overall user experience. As a result, treating metrics as end goals (rather than simply as indicators of good product direction) results in unintended consequences and a degraded user experience. Testing isn’t a magic bullet either. Sometimes this degradation occurs in an unexpected part of the user experience, and sometimes it occurs on a different timescale than the test.

Split tests typically gather data for a period of days or weeks. User lifetimes are typically months or years. If you’re only looking at the data you’ve gathered, it’s easy to unintentionally trade off difficult-to-measure metrics like long term product health in exchange for easy-to-measure short-term metrics like revenue.

Example: Aggressive paywalls

Zoosk is a dating app that built a huge userbase as a Facebook app during the heyday of data-driven design. They’re extremely aggressive with their monetization, with misleading buttons designed to constantly surprise the user with paywalls.

Oh boy, a message!

Gotcha! Paywall!

A company naively focusing on revenue will naturally iterate their way to this point, experimenting with increasingly early and aggressive paywalls and discovering that the spammier the app becomes, the more money they make.

However, while an aggressive approach can be very profitable in the short run, it quickly drives away non-payers and makes it difficult to engage new users. In the dating space, this results in a user experience that becomes worse every month for subscribers.

Sure enough, judging from AppAnnie/SensorTower estimates, Zoosk’s revenue has probably fallen about 50% since their 2014 high of $200 million.

Example: Searches per user

One of my favorite stories is from a friend who worked on improving the search feature at a major tech company. Their target metric was to increase the number of searches per user, and the most efficient way to do that was to make search results worse. My friend likes to think that his team resisted that temptation, but you can never be totally sure of these things.

Example: Brand tradeoffs

If you start a free trial with Netflix, you’ll get an email a few days before the end of the free trial reminding you that your credit card is about to be charged. I’m sure that Netflix has tested this, and I’m sure that they know that this reminder costs them money. However, they’ve presumably decided to keep the reminder email because of its non-quantifiable positive effect on the Netflix brand (or more precisely, to avoid the negative effect of people complaining about forgetting to cancel their free trial).

Short term revenue loss, long term brand gain

Notably, Netflix only reminds you before billing your card for the first time, and not for subsequent charges. At some point, a decision was made that “before the first charge but not before subsequent ones” was the correct place to draw the line on the completely unquantifiable tradeoff between short term revenue loss and long term brand benefits.

Example: Tutorial completion

A standard way to measure the quality of an onboarding experience is to measure what percent of users who start a tutorial actually finish it. However, since there will always be a natural drop off between sessions or over days, one obvious way to increase tutorial throughput is to build a tutorial that attempts to teach all the features in a single session.

Sure enough, tutorial throughput goes up, but now users are getting overwhelmed and confused by the pace of exposure to new menus and features. How to help them find their way? Maybe some arrows! Big, blinking arrows telling the user exactly which button to tap, directing them into submenus 7 levels deep and then back out.

You’ll be able to do this on your own next time, right?

Arrows everywhere can boost tutorial throughput, but all the users will be tapping through on autopilot, contradicting the point of having the tutorial in the first place! Excessive handholding of users increases tutorial completion (an easy to measure metric), but decreases learning and feelings of accomplishment (difficult to measure but very important metrics).

Example: Intentionally uninformative commununication

“You’ve been invited to a thing! I could tell you where and when it is in the body of this email, but I’d rather force you to visit my website to spam you with ads. Oh, look at how high our DAUs are! Thanks for using Evite!”

If this email were helpful, Evite would have to find a different way to make money

Equally frustrating to users: Push notifications that purposely leave out information to force users to open the app. Users will flee to the first viable alternative that actually values the user experience.

Example: User experience

In a purely data-driven culture, justifying investment in user experience is a constant uphill battle.

Generally, fixing a minor UI issue or adding some extra juice to a button press won’t affect the metrics in any kind of a measurable way. User experience is one of those “death by 1000 cuts” things where the benefits don’t become visible until after a significant amount of work has already been put in.

As a result, it’s easy to constantly deprioritize improvements to the user experience under the argument of “why would we fix that issue if it’s not going to move the needle?”

To create great UX requires a leap of faith, a belief that the time you’re investing is worthwhile despite what the metrics say right now.

Hearthstone is a great example. Besides being a great game, it’s full of moments of polish and delight like the finding opponent animation and interactive backgrounds that are completely unnecessary from a minimum viable product perspective, but absolutely critical for creating a product that feels best-in-class.

Example: Sales popups

When I was at Playdom, we would show popups when an app was first opened. They’d do things like encourage users to send invites, or buy an item on sale, like this popup from Candy Crush does.

Do you want revenue now or a userbase in the future?

I hate these. They degrade the user experience, frustrate the user, hurt the brand, and generally make interacting with your product a less delightful experience.

On the other hand, though, they WORK. And you can stack them: the more sales popups you push users through, the more money you make – right up until the point where all of your metrics fall off a cliff because your users are sick of the crappy experience and have moved on.

It always gave me a bit of schadenfreude to open a competitor’s game and see a sale popup for the first time, because the same pattern always repeated itself: As the weeks went by, more and more aggressive and intrusive popups would invade the user experience, right up until the game disappeared from the charts because all the users churned out.

Even retention isn’t foolproof

As a final note, while most of the examples above involve some variation on accidentally degrading retention, even optimizing for retention doesn’t prevent these mistakes from occurring if you’re optimizing for retention over the wrong timescale or for the wrong audience of users.

Typically, companies will look at metrics like 1-day, 7-day, 30-day retention because those numbers tend to correlate highly with user lifetimes. But focusing on cohort retention runs the risk of over-optimizing your product for the new users that you’re measuring, perhaps by over-simplifying your product, or neglecting the features loved by your elder users, or creating features that benefit new users at the expense of your existing audience.

Data-informed design

In contrast to “data-driven design”, which relies on data to drive decisions, in “data-informed design” data is used to understand your users and inform your intuition. This informed intuition then guides decisions, designs, and product direction. And your intuition improves over time as you run more tests, gather more data, and speak to more users.

When I’m making the case for the benefits of introducing intuition back into the decision-making process, there are two benefits that I keep coming back to: leaps of faith, and consistency.

Leaps of faith

Purely data-driven product improvement breaks down when a product needs to get worse in order to get better. (If you’re the sort of person who likes calculus metaphors, continuous improvement gets you to a local maximum, but not to a global maximum.) Major product shifts and innovations frequently require a leap of faith, committing to a product direction with the knowledge that initial metrics may be negative for an extended period of time until the new direction gets dialed in and begins to mature.

When Facebook introduced its newsfeed, hundreds of thousands of users revolted in protest, calling for boycotts and petitioning for removal of the feature. Now we can’t imagine Facebook without it.

Consistency

When products are built iteratively, with decisions made primarily through testing and iteration, there’s no guarantee of a consistent vision. Some teams take pride in the fact that their roadmaps only extend a week into the future. “Our tests will tell us what direction to go next!”

Data-informed design helps your product tell a consistent story. This is the power of a cohesive product vision.

It can be hard to explain exactly WHY a cohesive product vision translates to a better product, and also why it’s so hard to get there purely by data-driven iteration. Perhaps an extremely contrived example can help illustrate my point.

Let’s say you’re designing a new experience. You’re committed to good testing practices, and so over the next several months, you run tests on all 20 features you release. Each test is conclusive at the 5% significance level, and sure enough, users respond very positively to the overall experience that your tests have led you to.

Now, even with rigorous testing at a 5% significance level, 1 out of 20 tests will be wrong, and interestingly enough, 19 of the tests are consistent with the belief that your users are primarily young women, while 1 of them conclusively indicates that your users are middle-aged men.

Allowing your decision-making to be informed by data rather than dictated by it allows the team to say “Let’s just ignore the data from this particular test. Everything else we’ve learned makes us quite confident that we have a userbase of young women, and we believe our product will be better if all our features reflect that assumption.”

Obviously, if more tests come back indicating that your users are middle-aged men, your entire product vision will be thrown into question, but that’s ok. It’s preferable to ignore data in order to build a great product reflecting a unified vision that you’re 95% confident in, rather than creating a Frankenstein with 95% confidence on each individual feature.

The role of data in data-informed design

I believe that saying “just let the data decide” isn’t good product management, it’s an abdication of responsibility. As a PM or a designer, your job is to develop empathy for your users. Use data to understand them, their aspirations, and their motivations, and then take a position on what direction the product needs to move to best serve them.

Sometimes this means knowing your users better than they know themselves, as in the Facebook newsfeed example. More commonly, it means having enough faith in your product vision to recognize early false negatives for what they are, and being willing to grind through the trough of sorrow to realize your product’s potential.

Eric Reis gives an example of a registration flow that he worked on that performed poorly after launch. But based on earlier conversations with users, the team still believed in the design, and chose to continue working on it despite the data. Sure enough, it turned out that there was just one relatively minor design flaw, and once that was discovered, the new flow performed much better.

In this case, it was a relatively small feature with a relatively small flaw. But the same pattern holds on a larger scale as well – as visions become more innovative and ambitious, sometimes it requires commitment to a product vision over an extended period of time to see a product achieve its potential.

When to stop

I’m often asked, “If you know you’re just going to keep building no matter what the data says, then what’s the point in having data at all? How will we know when to kill the project?”

That’s a great question, since it’s often difficult to tell the difference between a false negative and a true negative. But there are two clear red flags to watch for: when a team loses faith in the project, and when a project stops improving. Ed Catmull cites the same criteria in Creativity, Inc. for knowing when one of Pixar’s movies is in trouble. Recognizing when a product is stuck is a challenge for any company committed to creativity and innovation, regardless of medium.

In data-informed design, learning is a continuous and parallel process. Rather than trying to design a rigorous enough test to validate/invalidate a direction at a particular moment in time, data is consistently gathered over time to measure a trajectory. If the team understands their users well, their work should show a general trend of improvement. If the product isn’t improving, or even if the product IS improving, but the metrics aren’t, then that’s a sign that a change is needed.

Some rules of thumb for data-informed design

It can be hard to know how to strike the right balance between data and intuition, but I do have a few rules of thumb:

Protect the user experience

Peter Drucker famously wrote: “What gets measured gets managed.” That’s true, but in my experience, “What gets measured gets manipulated, especially if you are being evaluated on that metric.” Example example examples

The challenge in product development is recognizing when we’re “teaching to the test”, regardless of whether it’s intentional or not. For anything that we’re measuring, I like to ask “is there a way I could move this metric in a positive way that would actually be really bad for our product long-term?” Then I ask, “is the feature I’m thinking about doing some flavor of that accidentally?”

A few examples of good intentions with potential for unintended consequences:

Metric	Tactic	Result
Tutorial completion	Shorten the tutorial.	Users learn less
Conversion	Create misleading sales page.	Buyers remorse
Revenue	Run frequent sales.	Users trained to only buy at a discount

Have a “North Star” vision

I always advocate for having a “North Star” vision. This is a product vision months or years away that you believe your users will love, based off your current best understanding of them.

Since products take a lot of iterations to get good, early product development is full of false negatives on the way to that North Star. People love to talk about the idea of “failing fast” or “invalidating an idea early”, but a lot of times that just isn’t possible. The threshold for viability in a minimum viable product isn’t always obvious, and sometimes it does just take a little more polish or a few extra features to turn the corner.

The best way to get a more trustworthy signal is to just keep building and shipping. A North Star lets you maintain your momentum during the inevitable periods of uncertainty. Over time, small sample sizes accumulate, and noise averages out. Evidence about the product direction will build with time.

Treat metrics as indicators/hints, not goals

It’s important to remember that metrics are leading indicators, not end goals. Similar to how taking a test prep class to improve your SAT score doesn’t actually increase your odds of college success, features that overfocus on moving metrics may not actually improve the underlying product.

The most important question that data can answer is “does the team understand the users?” If so, features will resonate and metrics will improve over time. To validate/invalidate a product direction, look at the trajectory of the metrics, not the result of any individual test.

The right time to kill a project is when the trajectory of improvement flattens out at an unacceptably low level. Generally this means that a few features have shipped and flopped, which is an indicator that there’s some kind of critical gap in the team’s understanding of their users.

This also means that it can be difficult to get away from innovative product/feature ideas quickly. This can be an unpopular opinion in circles that are dogmatic about testing, but the fact of the matter is that I have never seen the “spray and pray” approach work well when it comes to product vision.