When conducting an A/B test (split test), people usually use a proportion test to check significance. I suspect that this test isn't always good enough and might significantly change the results of the test. Has anyone encountered such an issue, and if so, which other method would you suggest?
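For reference, the proportion test in question is usually the pooled two-proportion z-test. A minimal self-contained sketch (the conversion counts below are made up purely for illustration):

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test, the usual A/B significance check.

    x1, x2: conversion counts; n1, n2: sample sizes.
    Returns (z, two_sided_p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 120/10,000 vs 150/10,000 conversions (p ~ 1.2-1.5%).
z, p = two_proportion_z_test(120, 10_000, 150, 10_000)
```

The z statistic relies on the normal approximation to the binomial, which is exactly the assumption questioned below.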

  • I have found that the proportion test relies on a normal approximation to the binomial distribution ( http://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation ). In my case, the proportion p is very low, ~0.01-0.02.

  • Answer:

    You are absolutely right! It is easy to prove it yourself: just put the data on a chart and look at the proportion of observations falling outside the 3-sigma limits. Nevertheless, when the test is applied to a homogeneous population (e.g. the same referral/traffic source, or even the same keyword), it works much better. In other words, segment your data. Try using the new Content Experiments in Google Analytics; it will allow you to see your test data for different segments. Good luck, Zvika

Zvika Jerbi at Quora
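Zvika's "put the data on a chart and count points outside the 3-sigma limits" check can be sketched as a simple p-chart. Everything below (the daily counts, the function name) is hypothetical; the limits follow the standard p-chart formula p-bar ± 3·sqrt(p-bar·(1−p-bar)/n):

```python
import math

def three_sigma_outliers(daily_counts):
    """Flag days whose conversion rate falls outside the 3-sigma control
    limits implied by the binomial/normal approximation (a basic p-chart).

    daily_counts: list of (conversions, visitors) pairs, one per day.
    Returns the indices of out-of-control days.
    """
    total_x = sum(x for x, n in daily_counts)
    total_n = sum(n for x, n in daily_counts)
    p_bar = total_x / total_n  # overall conversion rate
    outliers = []
    for day, (x, n) in enumerate(daily_counts):
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        if abs(x / n - p_bar) > 3 * sigma:
            outliers.append(day)
    return outliers

# Fabricated data: six ~1.5% days plus one wild 4% day.
days = [(15, 1000)] * 6 + [(40, 1000)]
flagged = three_sigma_outliers(days)
```

If the normal approximation held and traffic were homogeneous, only ~0.3% of points should land outside the limits; many more is a hint to segment the data, as the answer suggests.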

Other answers

Violations of that assumption are only likely to skew results if your sample sizes are small, so don't use small samples. Since your proportion p is very low, it stands to reason that you're not going to be able to make a compelling business case with tiny samples anyhow. With a p in the range you mentioned, a sample size of at least 1000 is desirable. (There are popular guidelines calling for sample size times p of at least 5-10.)
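The n·p rule of thumb mentioned above is easy to turn into a minimum-sample-size check. A small sketch (the function name is mine; the rule itself is the guideline quoted in the answer):

```python
import math

def min_n_for_normal_approx(p, rule=10):
    """Smallest n satisfying n*p >= rule and n*(1-p) >= rule,
    the common rule of thumb for trusting the normal approximation."""
    return math.ceil(rule / min(p, 1 - p))

# For p = 0.01 and the stricter rule (n*p >= 10), this gives the
# "sample size of 1000" figure from the answer.
n_min = min_n_for_normal_approx(0.01)
```

Note this is only a validity floor for the approximation, not a power calculation; detecting a small lift at p ~ 1% typically requires far more than 1000 observations per arm.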

Meta Brown

I have done this type of testing with low p (2% and lower), and I have to tell you that the data does behave strangely at that extreme of the p spectrum. My gut feeling is that the underlying assumptions about a uniform distribution do not hold up. My approach was to look for a practical rather than a theoretical solution, i.e. throw the dice enough times to get a feel for what's normal and what's not (aka Monte Carlo). So I took my data and conducted virtual tests: I split the historical subjects randomly into virtual groups of a specific size (with no difference in treatment) and compared how far apart the groups landed. I would encourage you to do the same on past data if possible, or just set up an A/A test for learning. My conclusion was that the real standard deviation varied with the subject pre-qualification criteria: groups of similar subjects gave me a smaller standard deviation (even at the same p) than groups of more diverse subjects. My advice would be to calibrate your tests, based on your underlying customer behavior and your required difference in performance, using real "no difference" test data.

Tanya Zyabkina
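Tanya's virtual A/A idea above can be sketched in a few lines. The historical data here is fabricated (a 2% conversion rate over 20,000 visitors) and the function name is mine; the point is the empirical gap distribution, not these particular numbers:

```python
import random

def aa_simulation(outcomes, group_size, trials=2000, seed=0):
    """Repeatedly split historical 0/1 outcomes into two random 'virtual'
    groups with no treatment difference, recording the absolute gap in
    conversion rates each time. Returns the sorted gaps."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        sample = rng.sample(outcomes, 2 * group_size)
        a, b = sample[:group_size], sample[group_size:]
        diffs.append(abs(sum(a) - sum(b)) / group_size)
    return sorted(diffs)

# Fabricated history: 2% conversion over 20,000 visitors.
history = [1] * 400 + [0] * 19_600
diffs = aa_simulation(history, group_size=1_000)
# Empirical 95th-percentile gap: a real A/B difference smaller than
# this is indistinguishable from no-treatment noise at this sample size.
threshold = diffs[int(0.95 * len(diffs))]
```

Running this against your own historical data (rather than fabricated outcomes) calibrates the "what's normal" threshold to your actual customer mix, which is exactly the segmentation effect the answer describes.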

Have you tried using a heatmap with comparison & segment features to view the significance visually? You can try http://www.miapex.com if you want.

Allen Zhao
