Hi everyone! I have an A/B testing question that I’m trying to wrap my head around and could use some help. Let’s say you’re Airbnb and you add a new feature to your mobile app’s search results: a plus sign next to every listing returned in search, which users can click for a quick preview of the listing (e.g., ratings, top 3 reviews). To book a listing you still need to click on the search result and go through the full booking flow. The hypothesis is that this new feature will drive more bookings per impression (bookings/impressions), which is the success metric. Let’s say the variant (which got the new feature) shows a significant positive lift in the success metric. My questions: 1. Is this data sufficient to conclude that the new ‘plus sign’ feature drove incremental conversion? 2. Do you ever take the winning variant and deep-dive to figure out how many users in it actually clicked on the ‘plus sign’? Hope my question was clear. Thank you for taking the time to respond!
1. Yes, assuming you have no problems with other parts of the experiment design. I think your dilemma is whether you can take credit for the conversion even though your change is upstream of it. The answer is yes. 2. If you are seeing a large positive uplift, you would want to dig deeper to see if you can have an even bigger impact by tweaking some aspects. But it is not necessary; it is just good discipline, as a follow-on experiment.
Thank you jesus.cpp! For #2, when you dig deeper into the winning variant, should you check what % of people who converted actually interacted with the new feature at all? If yes, what is the right percentage to conclude success? E.g., say 60% of converting users in the variant played with the feature. Is that good or bad, given that the overall variant conversion was stat. sig. higher than control?
Let’s add a new variant: minus sign, which should bring a significant negative impact ;)
Lol
Is this your Airbnb take home interview question? 😛
Nope! cleared that a while back. This is just a question I never got a proper answer to. Hence reaching out to the collective intelligence ;)
There are a few things you’ll need to check: 1) Has your treatment group hit power, i.e. the sample size needed to detect the effect? You don’t want false negatives. 2) Is the control group actually a proper control group? Is it the symbol that causes the conversion, or the fact that there is an object there? 3) Do secondary metrics corroborate the success metric, or do they diverge?
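The power check in point 1 can be sketched as a sample-size calculation. This is a minimal sketch using the standard normal approximation for a two-sided two-proportion test; the function name `sample_size_per_arm` and the 10% and 12% rates are illustrative assumptions, not figures from a real experiment.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate units needed per arm to detect p_control vs p_variant.

    Uses the normal approximation for a two-sided two-proportion z-test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_control + p_variant) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control) + p_variant * (1 - p_variant))
    ) ** 2
    return ceil(numerator / (p_variant - p_control) ** 2)


# Illustrative rates only: a 10% baseline and a hoped-for 12% in the variant.
n_needed = sample_size_per_arm(0.10, 0.12)
print(f"Units needed per arm: {n_needed}")
```

If the experiment stopped well short of this number, a non-significant result says little, and a significant one is likely to overstate the true lift.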
Interesting, what would be an example of a secondary metric corroborating success for a primary metric? In any experiment.
Check for statistical significance
This
Yes, it’s stat. sig. but my question is slightly different. Once you get stat. sig, do you deep dive into your winning variation to see if the converters actually interacted with the new feature you launched? And if yes, what % of converters interacting with your feature is the right %?
Did you run an A/A test to check for initial bias in the sample before switching to A/B? You’ll need a very large sample size, since your success metric is an interaction effect, as compared to measuring “View Listing Detail” (or whatever you guys call it), which is an immediate effect. For causal inference, I’d suggest doing the test with 2 variants:
Control A: Nothing.
Variant B: Plus sign (no preview). Just track user interactions with it. Compare total bookings/impressions against control. Also stratify bookings/impressions by users who clicked vs. those who did not.
Variant C: Plus sign (with preview). Again, stratify bookings/impressions by users who clicked vs. those who did not.
B vs A will give you the pure causal effect of just the plus sign on your success metric. The stratified C vs B comparison will tell you whether the preview leads to additional bookings. If C vs B shows that additional bookings come in due to the preview, then you know which version of the plus sign you want to launch. Note: with 2 test variants your alpha should be halved (see Bonferroni correction, for simplicity) to lower the probability of false positives.
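The two comparisons and the halved alpha can be sketched as below. This is a minimal sketch using a pooled two-proportion z-test; the counts in `arms` are hypothetical, and the test treats each impression as an independent trial, which is an assumption that may not hold when one user generates many impressions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both arms share one conversion rate (pooled z-test)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical (bookings, impressions) per arm; not real data.
arms = {"A": (1000, 10000), "B": (1200, 10000), "C": (1500, 10000)}
comparisons = [("A", "B"), ("B", "C")]

alpha = 0.05
adjusted_alpha = alpha / len(comparisons)  # Bonferroni: split alpha across the tests

results = {}
for base, variant in comparisons:
    p_value = two_proportion_p_value(*arms[base], *arms[variant])
    results[(base, variant)] = (p_value, p_value < adjusted_alpha)
    print(f"{variant} vs {base}: p = {p_value:.2g}, "
          f"significant at {adjusted_alpha}: {p_value < adjusted_alpha}")
```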
Thank you!! This is very helpful. Just one follow-up: let’s put in #s for ease. Say conversions are (all stat. sig.): A = 10%, B = 12%, C = 15%. Can I say the C - A = 5% (absolute) lift is due to the plus sign feature? Or do I need to dig further into the C variant and check something else to prove causality? In the extreme case, what if the conversion lift was ALL purely random and no one really clicked my feature? Wouldn’t it be a bit safer to also check how many converters in C actually clicked the ‘plus sign’ before converting? Apologies for the long post. I wanted to share my thought process clearly.
Yes, I am saying that you should compare bookings/impressions between users who clicked and users who did not click the plus sign (for variants B and C). Without Variant B, you can say that the C - A lift is due to the plus sign feature, but you will not be able to tell whether it came from just seeing the plus sign (e.g., some kind of psychological priming effect) or from also interacting with it. E.g.:
A = 10%
B: Clickers = 6%, Non-Clickers = 6%
C: Clickers = 9%, Non-Clickers = 6%
Total B > A means there is lift due to seeing the plus sign alone. C: Clickers > B: Clickers AND C: Non-Clickers = B: Non-Clickers means the additional lift is from people who interacted with the feature and consumed the information before converting. If you see C: Clickers < B: Clickers (stat. sig.) even though total C > total B, then you can keep the plus sign but remove the module with additional information in it. Apart from this, you don’t need to do any other checks, as long as you are confident that there was no initial bias in the A/A test.
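The arithmetic behind that example can be sketched as follows. This sketch assumes the clicker and non-clicker figures are each segment's contribution to the arm's overall bookings/impressions (so they sum to the 12% and 15% totals quoted earlier), and that clickers in B and C are comparable groups, which randomization alone does not guarantee since users choose whether to click.

```python
# Segment contributions to each arm's overall bookings/impressions, in
# percentage points, taken from the example in the post above.
# Assumption: segment figures add up to the arm total.
arms = {
    "A": {"all": 10},
    "B": {"clickers": 6, "non_clickers": 6},
    "C": {"clickers": 9, "non_clickers": 6},
}

totals = {arm: sum(segments.values()) for arm, segments in arms.items()}

# Lift from merely showing the plus sign (B has no preview behind it).
lift_from_plus_sign = totals["B"] - totals["A"]
# Additional lift from the preview content itself.
lift_from_preview = totals["C"] - totals["B"]

# Where the additional lift sits: clickers vs non-clickers.
clicker_gap = arms["C"]["clickers"] - arms["B"]["clickers"]
non_clicker_gap = arms["C"]["non_clickers"] - arms["B"]["non_clickers"]

print(f"Totals: {totals}")
print(f"Lift from plus sign alone (B - A): {lift_from_plus_sign} pts")
print(f"Additional lift from preview (C - B): {lift_from_preview} pts")
print(f"Clicker gap: {clicker_gap} pts, non-clicker gap: {non_clicker_gap} pts")
```

With these numbers the whole C - B difference sits in the clicker segment, which is the pattern described above as lift coming from people who used the preview.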
What about the novelty effect?
Good point. I think we could look at only new users for the comparison. Regardless, I think the novelty effect wouldn’t drive additional conversion here. So maybe we ignore it for this test?
More likely you’d keep a long-term holdout and watch for regression
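Watching a holdout for regression could look something like the sketch below. The weekly rates are hypothetical, and the "less than half of the first week's lift" rule is an arbitrary threshold chosen for illustration, not an established standard.

```python
# Hypothetical weekly bookings/impressions: (week, holdout_rate, launched_rate).
# Not real data.
weekly_rates = [
    (1, 0.100, 0.150),
    (2, 0.101, 0.140),
    (3, 0.099, 0.125),
    (4, 0.100, 0.118),
]


def relative_lift(holdout_rate, launched_rate):
    """Lift of the launched experience relative to the holdout group."""
    return (launched_rate - holdout_rate) / holdout_rate


lifts = [relative_lift(holdout, launched) for _, holdout, launched in weekly_rates]

# Arbitrary illustrative rule: flag decay if the latest lift has fallen
# below half of the first week's lift.
lift_decayed = lifts[-1] < 0.5 * lifts[0]

for (week, _, _), lift in zip(weekly_rates, lifts):
    print(f"Week {week}: lift = {lift:.1%}")
print(f"Possible novelty decay: {lift_decayed}")
```

A lift that shrinks steadily toward zero against the holdout would point to a novelty effect; a lift that stays flat suggests the gain is durable.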