Data Science Career Apr 4, 2018

ZS randomv2

A/B testing: how to analyze a winning variant

Hi everyone! I have an A/B testing question that I’m trying to wrap my head around and could use some help.

Let’s say you’re Airbnb and you add a new feature to your mobile app’s search results: a plus sign next to every listing returned in search, which users can click for a quick preview of the listing (e.g., ratings, top 3 reviews, etc.). To book a listing you still need to click on the search result and go through the full booking flow. The hypothesis is that this new feature will drive more bookings per impression (bookings/impressions), which is the success metric. Let’s say the variant (which got this new feature) shows a significant positive lift in the success metric.

My questions:

1. Is this data sufficient to conclude that the new ‘plus sign’ feature drove incremental conversion?
2. Do you ever take the winning variant and deep-dive to figure out how many users in it actually clicked on the ‘plus sign’?

Hope my question was clear. Thank you for taking the time to respond!

@Airbnb @Uber @Meta


Uber BlueApron Apr 4, 2018

What about the novelty effect?

ZS randomv2 OP Apr 4, 2018

Good point. I think we can only look at new users for comparison. Regardless, I think novelty effect wouldn’t drive additional conversion here. So maybe we ignore it for this test?

Airbnb ∞timehoriz Apr 5, 2018

More likely you’d keep a long-term holdout and watch for regression

New

jesus.cpp Apr 4, 2018

1. Yes, assuming that you have no problems with other parts of the experiment design. I think your dilemma is whether you can take credit for conversion even though your experiment is upstream. The answer is yes.
2. If you are seeing a large positive uplift, you would want to dig deeper to see if you can have an even bigger impact by tweaking some aspects. But it is not necessary. It is just good discipline as a follow-on experiment.

ZS randomv2 OP Apr 4, 2018

Thank you jesus.cpp! For #2, when you dig deeper into the winning variant, should you check what % of people who converted actually interacted with the new feature at all? If yes, what is the right percentage to conclude success? E.g., 60% of converting users in the variant played with the feature. Is that good or bad, given that the overall variant conversion was stat. sig. higher than control?

LinkedIn Ioijyytffh Apr 4, 2018

Let’s add a new variant: minus sign, which should bring a significant negative impact ;)

ZS randomv2 OP Apr 4, 2018

Lol

Microsoft Facts Apr 4, 2018

Is this your Airbnb take home interview question? 😛

ZS randomv2 OP Apr 4, 2018

Nope! cleared that a while back. This is just a question I never got a proper answer to. Hence reaching out to the collective intelligence ;)

Expedia Tr@vel Apr 4, 2018

Is p value < 0.05?

ZS randomv2 OP Apr 4, 2018

Yes.

Amazon Pokebowl Apr 4, 2018

There are a few things you’ll need to check:
1) Has your treatment group hit power? You don’t want false negatives.
2) Is the control group actually a proper control group? Is it the symbol that causes the conversion or the fact that there is an object there?
3) Do secondary metrics corroborate the success metric or do they diverge?
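For the power check in point 1, a rough sample-size calculation can be sketched as below. This is a minimal sketch, not anyone's production method: it assumes a two-sided two-proportion z-test, treats impressions as independent (impressions from the same user usually are not), and uses hypothetical baseline and target rates of 10% and 12%. The function name `sample_size_per_arm` is made up for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate observations needed per arm to detect p_variant vs p_control.

    Uses the standard normal-approximation formula for a two-sided
    two-proportion z-test with equal arm sizes.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_control + p_variant) / 2            # average rate under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control) + p_variant * (1 - p_variant))
    ) ** 2
    return ceil(numerator / (p_variant - p_control) ** 2)


# Hypothetical rates: 10% baseline conversion, 12% in the variant.
n_needed = sample_size_per_arm(0.10, 0.12)
print(f"Need roughly {n_needed} observations per arm")
```

If each arm has collected fewer observations than this estimate, a non-significant result says little, which is the false-negative risk mentioned above.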

ZS randomv2 OP Apr 4, 2018

Interesting, what would be an example of a secondary metric corroborating success for a primary metric? In any experiment.

New

iamsomeone Apr 4, 2018

Check for statistical significance

Yelp LolRofl Apr 4, 2018

This

ZS randomv2 OP Apr 4, 2018

Yes, it’s stat. sig. but my question is slightly different. Once you get stat. sig, do you deep dive into your winning variation to see if the converters actually interacted with the new feature you launched? And if yes, what % of converters interacting with your feature is the right %?

Yahoo eyxitsjtj Apr 4, 2018

Did you check the A/A variant for no initial bias in the sample before switching to A/B? You’ll need a very large sample size, since your success metric is an interaction effect, as compared to measuring “View Listing Detail” (or whatever you guys call it), which is an immediate effect.

For causal inferences, I’d suggest doing the test with 2 variants:

Control A: Nothing.
Variant B: Plus Sign (No Preview). Just track user interactions on it. Compare total bookings/impression against Control. Also stratify bookings/impression on users that clicked vs. those that did not.
Variant C: Plus Sign (With Preview). Again stratify bookings/impression on users that clicked vs. those that did not.

B vs A will give you the pure causal effect of just the plus sign on your success metric. The C vs B stratified comparison will tell you if the preview leads to additional bookings. If C vs B shows that additional bookings come in due to the preview, then you know which version of the plus sign you want to launch.

Note: with 2 test variations your alpha should be sliced in half (see Bonferroni correction for simplicity) to lower the probability of false positives.
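The Bonferroni note can be illustrated with a small sketch. It assumes a pooled two-proportion z-test for each variant against control and uses made-up counts (10,000 impressions per arm); the helper name `two_proportion_p_value` and the `arms` data are hypothetical, not taken from any real experiment.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical (bookings, impressions) per arm.
arms = {"A": (1000, 10000), "B": (1200, 10000), "C": (1500, 10000)}

alpha = 0.05
comparisons = [("A", "B"), ("A", "C")]       # each variant vs control
adjusted_alpha = alpha / len(comparisons)    # Bonferroni: 0.05 / 2

results = {}
for base, variant in comparisons:
    p = two_proportion_p_value(*arms[base], *arms[variant])
    results[(base, variant)] = (p, p < adjusted_alpha)
    print(f"{variant} vs {base}: p={p:.2g}, significant={p < adjusted_alpha}")
```

If C vs B is also tested as a planned comparison, the divisor would become three rather than two.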

ZS randomv2 OP Apr 4, 2018

Thank you!! This is very helpful. Just one follow-up. Let’s put in some numbers for ease. Say conversions are (all stat. sig.):

A = 10%
B = 12%
C = 15%

Can I say the C - A = 5% lift is due to the plus sign feature? Or do I need to dig further into the C variant and check something else to prove causality? In the extreme case, what if the conversion lift was ALL purely random and no one really clicked my feature? Wouldn’t it be a bit safer to also check how many converters in C actually clicked the ‘plus sign’ before converting? Apologies for the long post. I wanted to share my thought process clearly.

Yahoo eyxitsjtj Apr 5, 2018

Yes, I am saying that you should check the bookings/impression between users that clicked vs. those that did not click the plus sign (for variants B and C). Without Variant B, you can say that the C - A lift is due to the plus sign feature, but you will not be able to prove whether that happened by just looking at the plus sign (like having some kind of psychological priming effect) or also by interacting with it.

E.g. (the clicker and non-clicker figures appear to be each group’s contribution to the variant’s total, so B adds up to 12% and C to 15%):

A = 10%
B: Clickers = 6%
B: Non-Clickers = 6%
C: Clickers = 9%
C: Non-Clickers = 6%

Total B > A means there is lift due to looking at the plus sign alone. C: Clickers > B: Clickers AND C: Non-Clickers = B: Non-Clickers means the additional lift is from people who interacted with the feature and consumed information before converting. If you see C: Clickers < B: Clickers (stat. sig.) even though Total C > Total B, then that means you can keep the plus sign but remove the module with additional information in it.

Apart from this, you don’t need to do any other checks as long as you are confident that there was no initial bias in the A/A variants.
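The arithmetic in the example above can be sketched as follows. This assumes the clicker and non-clicker percentages are contributions to each variant’s overall bookings/impressions (which is how they add up to the 12% and 15% totals from the earlier post); the counts and names below are hypothetical. Since users choose whether to click, the split is descriptive rather than a randomized comparison.

```python
# Hypothetical counts per variant: impressions, and bookings split by whether
# the booking user clicked the plus sign. Control has no plus sign to click.
variants = {
    "A": {"impressions": 10000, "bookings": {"all": 1000}},
    "B": {"impressions": 10000, "bookings": {"clickers": 600, "non_clickers": 600}},
    "C": {"impressions": 10000, "bookings": {"clickers": 900, "non_clickers": 600}},
}


def contributions(variant):
    """Each group's bookings as a share of the variant's total impressions."""
    n = variant["impressions"]
    return {group: booked / n for group, booked in variant["bookings"].items()}


def total_rate(variant):
    """Overall bookings/impressions for the variant."""
    return sum(variant["bookings"].values()) / variant["impressions"]


for name, v in variants.items():
    parts = {g: f"{c:.1%}" for g, c in contributions(v).items()}
    print(name, f"total={total_rate(v):.1%}", parts)

# How much of the C-over-B lift comes from the clicker group?
lift_c_over_b = total_rate(variants["C"]) - total_rate(variants["B"])
from_clickers = (
    contributions(variants["C"])["clickers"]
    - contributions(variants["B"])["clickers"]
)
print(f"C - B lift: {lift_c_over_b:.1%}, of which from clickers: {from_clickers:.1%}")
```

With these numbers the whole C - B difference sits in the clicker group, which is the pattern the example reads as lift from people who used the preview.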

