When conducting an A/B test (split test), people usually use a proportion test to check significance. I suspect that this test isn't always good enough and might significantly change the results of the test. Has anyone encountered such an issue, and if so, which other method would you suggest?
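For reference, the proportion test in question is usually the pooled two-proportion z-test. A minimal self-contained sketch (the conversion counts below are made up purely for illustration):

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test, the usual A/B significance check.

    x1, x2: conversion counts; n1, n2: sample sizes.
    Returns (z, two_sided_p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 120/10,000 vs 150/10,000 conversions (p ~ 1.2-1.5%).
z, p = two_proportion_z_test(120, 10_000, 150, 10_000)
```

The z statistic relies on the normal approximation to the binomial, which is exactly the assumption questioned below.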

  • I have found that the proportion test relies on a normal approximation to the binomial distribution ( http://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation ). In my case, the proportion p is very low, ~0.01-0.02.

  • Answer:

    You are absolutely right! It is easy to prove it yourself: just put the data on a chart and look at the proportion of observations falling outside the 3-sigma limits. Nevertheless, when the test is applied to a homogeneous population (e.g. the same referral/traffic source, or even the same keyword), it works much better. In other words, segment your data. Try using the new Content Experiments in Google Analytics; it will allow you to see your test data for different segments. Good luck, Zvika

Zvika Jerbi at Quora
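Zvika's "put the data on a chart and count points outside the 3-sigma limits" check can be sketched as a simple p-chart. Everything below (the daily counts, the function name) is hypothetical; the limits follow the standard p-chart formula p-bar ± 3·sqrt(p-bar·(1−p-bar)/n):

```python
import math

def three_sigma_outliers(daily_counts):
    """Flag days whose conversion rate falls outside the 3-sigma control
    limits implied by the binomial/normal approximation (a basic p-chart).

    daily_counts: list of (conversions, visitors) pairs, one per day.
    Returns the indices of out-of-control days.
    """
    total_x = sum(x for x, n in daily_counts)
    total_n = sum(n for x, n in daily_counts)
    p_bar = total_x / total_n  # overall conversion rate
    outliers = []
    for day, (x, n) in enumerate(daily_counts):
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        if abs(x / n - p_bar) > 3 * sigma:
            outliers.append(day)
    return outliers

# Fabricated data: six ~1.5% days plus one wild 4% day.
days = [(15, 1000)] * 6 + [(40, 1000)]
flagged = three_sigma_outliers(days)
```

If the normal approximation held and traffic were homogeneous, only ~0.3% of points should land outside the limits; many more is a hint to segment the data, as the answer suggests.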

Other answers

Violations of that assumption are only likely to skew results if your sample sizes are small, so don't use small samples. Since your proportion p is very low, it stands to reason that you're not going to be able to make a compelling business case with tiny samples anyhow. With a p in the range you mentioned, a sample size of at least 1000 is desirable. (There are popular guidelines calling for sample size times p of at least 5-10.)
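The n·p rule of thumb mentioned above is easy to turn into a minimum-sample-size check. A small sketch (the function name is mine; the rule itself is the guideline quoted in the answer):

```python
import math

def min_n_for_normal_approx(p, rule=10):
    """Smallest n satisfying n*p >= rule and n*(1-p) >= rule,
    the common rule of thumb for trusting the normal approximation."""
    return math.ceil(rule / min(p, 1 - p))

# For p = 0.01 and the stricter rule (n*p >= 10), this gives the
# "sample size of 1000" figure from the answer.
n_min = min_n_for_normal_approx(0.01)
```

Note this is only a validity floor for the approximation, not a power calculation; detecting a small lift at p ~ 1% typically requires far more than 1000 observations per arm.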

Meta Brown

I have done this type of testing with low p (2% and lower), and I have to tell you that the data does behave strangely at that extreme of the p spectrum. My gut feeling is that the underlying assumptions about a uniform distribution do not hold up. My approach was to look for a practical rather than a theoretical solution, i.e. throw the dice enough times to get a feel for what's normal and what's not (aka Monte Carlo). So I took my data and conducted virtual tests: I split the historical subjects randomly into virtual groups of a specific size (with no difference in treatment) and compared how far apart the groups landed. I would encourage you to do the same on past data if possible, or just set up an A/A test for learning. My conclusion was that the real standard deviation varied with the subject pre-qualification criteria: groups of similar subjects gave me a smaller standard deviation (even at the same p) than groups of more diverse subjects. My advice would be to calibrate your tests, based on your underlying customer behavior and your required difference in performance, using real "no difference" test data.

Tanya Zyabkina
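Tanya's virtual A/A idea above can be sketched in a few lines. The historical data here is fabricated (a 2% conversion rate over 20,000 visitors) and the function name is mine; the point is the empirical gap distribution, not these particular numbers:

```python
import random

def aa_simulation(outcomes, group_size, trials=2000, seed=0):
    """Repeatedly split historical 0/1 outcomes into two random 'virtual'
    groups with no treatment difference, recording the absolute gap in
    conversion rates each time. Returns the sorted gaps."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        sample = rng.sample(outcomes, 2 * group_size)
        a, b = sample[:group_size], sample[group_size:]
        diffs.append(abs(sum(a) - sum(b)) / group_size)
    return sorted(diffs)

# Fabricated history: 2% conversion over 20,000 visitors.
history = [1] * 400 + [0] * 19_600
diffs = aa_simulation(history, group_size=1_000)
# Empirical 95th-percentile gap: a real A/B difference smaller than
# this is indistinguishable from no-treatment noise at this sample size.
threshold = diffs[int(0.95 * len(diffs))]
```

Running this against your own historical data (rather than fabricated outcomes) calibrates the "what's normal" threshold to your actual customer mix, which is exactly the segmentation effect the answer describes.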

Have you tried using a heatmap with comparison & segment features to view the significance visually? You can try http://www.miapex.com if you want.

Allen Zhao
