Thursday, May 21, 2009

A/B and Qualitative User Testing

Recently, I worked with a company devoted to A/B testing. For those of you who aren't familiar with the practice, A/B testing (sometimes called bucket testing or multivariate testing) is the practice of creating multiple versions of a screen or feature and showing each version to a different set of users in production in order to find out which version produces better metrics. These metrics may include things like "which version of a new feature makes the company more money" or "which landing screen positively affects conversion." Overall, the goal of A/B testing is to allow you to make better product decisions based on the things that are important to your business by using statistically significant data.
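
To make the mechanics concrete, here is a minimal sketch of how a site might split users into buckets; the function name, the hashing scheme, and the even split are assumptions for illustration, not a description of any particular company's system.

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically assign a user to one variant of an experiment.

    Hashing the user id together with the experiment name keeps each user
    in the same bucket on every visit, and keeps separate experiments from
    bucketing users in correlated ways.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Roughly half of users see each version of the landing screen.
print(assign_bucket("user-42", "landing-screen-redesign"))  # prints "A" or "B"
```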

Qualitative user testing, on the other hand, involves showing a product or prototype to a small number of people while observing and interviewing them. It produces a different sort of information, but the goal is still to help you make better product decisions based on user feedback.

Now, a big part of my job involves talking to users about products in qualitative tests, so you might imagine that I would hate A/B testing. After all, wouldn't something like that put somebody like me out of a job? Absolutely not! I love A/B testing. It's a phenomenal tool for making decisions about products. It is not the only tool, however. In fact, qualitative user research combined with A/B testing creates the most powerful system for informing design that I have ever seen. If you're not doing it yet, you probably should be.

A/B Testing

What It Does Well

A/B testing on its own is fantastic for certain things. It can help you:
  • Get statistically significant data on whether a proposed new feature or change actually improves the metrics that matter - numbers like revenue, retention, and customer acquisition
  • Understand more about what your customers are actually doing on your site
  • Make decisions about which features to cut and which to improve
  • Validate design decisions
  • See which small changes have surprisingly large effects on metrics
  • Get user feedback without actually interacting with users

For example, imagine that you are creating a new check out flow for your website. There is a request from your marketing department to include an extra screen that asks users for some demographic information. However, you feel that every additional step in a check out process represents a chance for users to drop out before completing a purchase. By creating two flows in production, one with the extra screen and one without, and showing each flow to only half of your users, you can gather real data on how many purchases are completed by members of each group. This allows you to understand the exact impact on sales and helps you decide whether gathering the demographic information is really worth the cost.
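
Once both flows have been live for a while, the comparison itself is simple arithmetic plus a significance check. The sketch below uses a standard two-proportion z-test; the traffic counts and completion figures are purely hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical results: completed purchases out of users entering each flow.
completions_a, users_a = 4_120, 50_000   # flow without the demographics screen
completions_b, users_b = 3_910, 50_000   # flow with the extra screen

rate_a = completions_a / users_a
rate_b = completions_b / users_b

# Two-proportion z-test: is the gap in completion rate signal or noise?
pooled = (completions_a + completions_b) / (users_a + users_b)
std_err = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
z = (rate_a - rate_b) / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"A: {rate_a:.2%}  B: {rate_b:.2%}  p-value: {p_value:.3f}")
# A p-value below your chosen threshold (say 0.05) suggests the extra screen
# really does cost you purchases, rather than the gap being chance.
```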

Even more appealing, you can get all this user feedback without ever talking to a single user. A/B testing is, by its nature, an engineering solution to a product design problem, which makes it very popular with small, engineering-driven startups. Once the various versions of the feature are released to users, almost anybody can look at the results and understand which option is doing better, so it can all be done without having to recruit or interview test participants.

Of course, A/B testing in production works best on things like web or mobile applications where you can not only show different interfaces to different customers, but where you can also easily switch all of your users to the winning interface without having to ship them a new box full of software or a new physical device. I wouldn't recommend trying it if you're designing, for example, a car.

What It Does Poorly

Now imagine that, instead of adding a single screen to an already existing check out flow, you are tasked with designing an entirely new check out flow that should maximize revenue and minimize the number of people who abandon their shopping carts. In creating the new flow, there are hundreds of design decisions you need to make, both small and large. How many screens should it have? How much up-selling and cross-selling should you do? At what point in the flow do you ask users for payment information? What should the screens look like? Should they have the standard header and footer, or should those be removed to minimize potential distractions for users when purchasing? And on and on and on...

These are all just a series of small decisions, so, in an ideal world, you'd be able to A/B test each one separately, right? Of course, in the real world, this could mean creating an A/B test with hundreds of different variations, each of which has to be shown to enough users to achieve statistical significance. Since you want to roll out your new check out process sometime before the next century, this may not be a particularly appealing option.
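
To see why, a bit of back-of-the-envelope arithmetic helps; the decision count and traffic numbers below are assumptions, but the shape of the problem is the same at any scale.

```python
# Eight independent yes/no design decisions already produce 2**8 combinations.
binary_decisions = 8
variants = 2 ** binary_decisions          # 256 distinct variations
users_per_variant = 10_000                # rough traffic for a stable read

total = variants * users_per_variant
print(f"{variants} variants x {users_per_variant:,} users each "
      f"= {total:,} visitors before you can pick a winner")
# 256 variants x 10,000 users each = 2,560,000 visitors before you can pick a winner
```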

A Bad Solution

Another option would be to fully implement several very different directions for the check out screens and test them all against one another. For example, let's say you implemented four different check out processes with the following features to test against one another:
  • Option 1: Yellow background, three screens, marketing questions, no up-selling, no cross-selling, header, no footer, help link
  • Option 2: Blue background, two screens, no marketing questions, up-selling, no cross-selling, header, footer, no help
  • Option 3: Orange background, four screens, marketing questions, up-selling, cross-selling, no header, footer, live chat help
  • Option 4: White background, one screen, no marketing questions, no up-selling, cross-selling, no header, no footer, live chat help
This might work in companies that have lots of bored engineers sitting around waiting to implement and test several different versions of the same code, most of which will eventually be thrown away. Frankly, I haven't run across a lot of those companies. But even if you did decide to devote the resources to building four different check out flows, the big problem is that, if you get a clear winner, you really don't have a very clear idea of WHY users preferred a particular version of the check out flow over the others. Sure, you can make educated guesses. Perhaps it was the particularly soothing shade of blue. Or maybe it was the fact that there weren't any marketing questions. Or maybe it was the aggressive up-selling. Or maybe that version just had the fewest bugs.

But the fact is, unless you figure out exactly which parts users actually liked and which they didn't like, it's impossible to know that you're really maximizing your revenue. It's also impossible to use those data to improve other parts of your site. After all, what if people HATE the soothing shade of blue, but they like everything else about the new check out process? Think of all the money you'll lose by not going with the yellow or orange or white. Think of all the time you'll waste by making everything else on your site that particular shade of blue, since you think that you've statistically proven that people love it!

What Qualitative Testing Does Well

Despite the many wonderful things about A/B testing, there are a few things that qualitative testing just does better.

Find the Best of All Worlds

Qualitative testing allows you to test wildly different versions of a feature against one another and understand what works best about each of them, thereby helping you develop a solution that has the best parts from all the different options. This is especially useful when designing complicated features that require many individual decisions, any one of which might have a significant impact on metrics. By observing users interacting with the different versions, you can begin to understand the pros and cons of each small piece of the design without having to run each one individually in its own A/B test.

Find Out WHY Users Are Leaving

While a good A/B test (or plain old analytics) can tell you which page a user is on when they abandon a check out flow, it can't tell you why they left. Did they get confused? Bored? Stuck? Distracted? Information like that helps you make better decisions about what exactly it is on the page that is causing people to leave, and watching people use your feature is the best way to gather that information.

Save Engineering Time and Iterate Faster

Generally, qualitative tests are run with rich, interactive wireframes rather than fully designed and tested code. This means that, instead of having your engineers code and test four different versions of the flow, you can have a designer create four different HTML prototypes in a fraction of the time. HTML prototypes are significantly faster to produce since:
  • They don't have to run in multiple browsers, just the one you're testing
  • They don't require any backend code to be written
  • They frequently don't have a polished visual design (unless that's part of what you're testing)
And since making changes to a prototype doesn't require any engineering or QA time, you can iterate much faster, refining the design in hours or days rather than weeks or months.

How Do They Work Together?

Qualitative Testing Narrows Down What You Need to A/B Test

Qualitative testing will let you eliminate the obviously confusing stuff, confirm the obviously good stuff, and narrow down the set of features you want to A/B test to a more manageable size. There will still be questions that are best answered by statistics, but there will be a lot fewer of them.

Qualitative Testing Generates New Ideas for Features and Designs

While A/B testing helps you eliminate features or designs that clearly aren't working, it can't give you new ideas. Users can. If every user you interview gets stuck in the same place, you've identified a new problem to solve. If users are unenthusiastic about a particular feature, you can explore what's missing with them and let them suggest ways to make the product more engaging.

Talking to your users allows you to create a hypothesis that you can then validate with an A/B test. For example, maybe all of the users you interviewed about your check out flow got stuck selecting a shipment method. To address this, you might come up with ideas for a couple of new shipment flows that you can test in production once another quick qualitative test has confirmed that they're less confusing.
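
Before running that production test, it can be worth estimating how much traffic you will need before the result means anything. The helper below uses the standard normal-approximation sample-size formula for comparing two proportions; the function and the conversion figures are illustrative assumptions, not something from any specific tool.

```python
from math import ceil
from statistics import NormalDist

def users_per_variant(p_baseline, p_target, alpha=0.05, power=0.80):
    """Rough sample size per variant to detect a lift from p_baseline to
    p_target with a two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_baseline - p_target) ** 2)

# Hypothetical: the current shipment step converts 60% of carts, and you hope
# the redesigned step pushes that to 63%.
print(users_per_variant(0.60, 0.63))  # on the order of 4,000 users per variant
```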

A/B Testing Creates a Feedback Loop for Researchers

A/B tests can also improve your qualitative testing process by providing statistical feedback to your researchers. I, as a researcher, am going to observe participants during tests in order to see what they like and dislike. I'm then going to make some educated guesses about how to improve the product based on my observations. When I get feedback about which recommendations are the most successful, it helps me learn more about what's important to users so I make better recommendations in the future.

Any Final Words?

Separately, both A/B testing and qualitative testing are great ways to learn more about your users and how they interact with your product. Combined, they are more than the sum of their parts. They form an incredibly powerful tool that can help you make good, user-centered product decisions more quickly and with more confidence than you have ever imagined.

Like the post? Follow me on Twitter!

This post originally appeared on the Sliced Bread Design blog.