- Tags:: #📝CuratedNotes , [[Experiment Design]]
We use significance tests to assess whether an observed effect between two groups could plausibly have arisen by chance alone.
## [[Permutation test]]
An interesting thing is that you may perform a significance test simply by **permutation**: pool the data of the two groups, split the pool randomly into groups of the original sizes, compute the difference in whatever statistic we are measuring, repeat many times, and check whether the original difference lies within the range of chance variation. **This does not assume normality and can even be done with different sample sizes** (p. 99 of [[Practical Statistics for Data Scientists]])
![](assets/1640757329_157.png)
Although resampling usually appears as done without replacement, it is not entirely clear whether the resampling in a permutation test should be done with or without replacement (see p. 97 of [[Practical Statistics for Data Scientists]] and this Cross Validated question: [r - Resampling / simulation methods: monte carlo, bootstrapping, jackknifing, cross-validation, randomization tests, and permutation tests - Cross Validated (stackexchange.com)](https://stats.stackexchange.com/questions/104040/resampling-simulation-methods-monte-carlo-bootstrapping-jackknifing-cross?noredirect=1&lq=1)); it seems it can also be done with replacement (as indicated in p. 157 of [[📖 Introductory Statistics and Analytics. A Resampling Perspective]]).
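A minimal sketch of a permutation test on the difference in means (done without replacement, the more common convention; the function name and the example data are made up):

```python
import numpy as np

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Pools both groups, reshuffles the pooled data without replacement,
    and re-splits it into two groups of the original sizes to build the
    null distribution of the difference in means.
    """
    rng = np.random.default_rng(seed)
    observed = np.mean(group_a) - np.mean(group_b)
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)

    diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(pooled)  # sampling without replacement
        diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

    # p-value: share of permuted differences at least as extreme as the observed one
    p_value = np.mean(np.abs(diffs) >= np.abs(observed))
    return observed, p_value

# Illustrative data for two variants (note the different sample sizes)
a = np.array([12.1, 10.3, 14.7, 11.0, 13.5, 12.8])
b = np.array([10.2, 9.8, 11.1, 10.5, 9.9])
obs, p = permutation_test(a, b)
print(f"observed difference = {obs:.2f}, p-value = {p:.3f}")
```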
## [[Bootstrap test]]
Note that you may also do this with a [[Bootstrap test]], by calculating the confidence interval of the difference in means (resampling from each sample category **with replacement**): if the confidence interval contains 0, you cannot reject the null hypothesis. This link between [[confidence intervals]] and hypothesis testing is explained in [[📖 The Art of Statistics]], p. 271.
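A rough sketch of the bootstrap version (percentile interval; the function name and data are illustrative):

```python
import numpy as np

def bootstrap_diff_ci(group_a, group_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the difference in means.

    Resamples each group separately *with replacement*, keeping the original
    sizes, and reports the (alpha/2, 1 - alpha/2) percentiles of the
    resampled differences in means.
    """
    rng = np.random.default_rng(seed)
    group_a, group_b = np.asarray(group_a), np.asarray(group_b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        resample_a = rng.choice(group_a, size=len(group_a), replace=True)
        resample_b = rng.choice(group_b, size=len(group_b), replace=True)
        diffs[i] = resample_a.mean() - resample_b.mean()
    lower, upper = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# If the interval contains 0, we fail to reject the null hypothesis at level alpha
low, high = bootstrap_diff_ci([12.1, 10.3, 14.7, 11.0, 13.5], [10.2, 9.8, 11.1, 10.5])
print(f"95% CI for the difference in means: ({low:.2f}, {high:.2f})")
```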
## P-value and alpha
The **[[p-value]]** is the probability of observing an effect (measured by the test statistic we are using to quantify the difference between the two groups) at least as extreme as the one we have actually seen, assuming the null hypothesis is true. We then compare it against a pre-specified significance level (alpha): if the p-value falls below alpha, we reject the null hypothesis.
![P-value in statistical significance testing.svg](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/P-value_in_statistical_significance_testing.svg/370px-P-value_in_statistical_significance_testing.svg.png)
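In symbols, with $T$ the test statistic and $t_{\text{obs}}$ its observed value (one-sided case shown; a two-sided test uses $|T| \geq |t_{\text{obs}}|$):

$$p = \Pr\big(T \geq t_{\text{obs}} \mid H_0\big)$$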
## A/B testing
So, according to the above, a hypothesis might be true, but chance may prevent you from confirming it: you may fail to reject the null hypothesis even though the effect is real. Before running the test we want to answer: **"Will a hypothesis test reveal a difference between two groups?"** There are four moving parts (fixing any three determines the fourth; see the sketch after the list):
- Sample size
- Effect size we want to detect
- Significance level (alpha) we want to use
- Power (the probability of detecting the effect if it is real)
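A quick sketch of how these four pieces tie together, using statsmodels' power calculator for a two-sample t-test (the chosen effect size, Cohen's d = 0.2, is an illustrative assumption):

```python
# Any three of {effect size, alpha, power, sample size} determine the fourth.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"required sample size per group: {n_per_group:.0f}")  # ~394
```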
It seems that in practice most tests are useless:
![[Pasted image 20230729205354.png|300]] ^b5b8ca
## A/B pricing testing and its impact on survival rate (churn)
If we change prices, it is easy to anticipate there will be an impact on survival rate, which in turn complicates estimating the impact on LTV: lower prices will probably lead to a higher survival rate, and higher prices to a lower one.
This poses a problem when interpreting the results of a short pricing test if we score it with our existing survival curves. You can only be confident in the following scenarios:
- A cheaper variant shows a higher estimated LTV. You can be confident this variant wins: the true LTV will very likely be even higher because of the higher survival rate.
- A more expensive variant shows a lower estimated LTV. You can be confident this variant loses: the true LTV will very likely be even lower because of the lower survival rate.
In the mixed scenarios (a cheaper variant with a lower estimated LTV, or a more expensive variant with a higher estimated LTV) we cannot conclude anything, because the change in survival rate could flip the result; a toy numeric illustration of the bounding argument follows below.
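A hypothetical illustration of the bounding argument in the cheaper-variant case. The toy model (LTV per visitor = conversion rate × monthly price / monthly churn) and all numbers are assumptions for illustration only:

```python
# Control
control_conv, control_price, control_churn = 0.05, 10.0, 0.10
control_ltv = control_conv * control_price / control_churn            # 5.0

# Cheaper variant: the short test measures conversion, but we only have the
# baseline churn curve, so the estimate reuses control_churn.
variant_conv, variant_price = 0.08, 8.0
estimated_variant_ltv = variant_conv * variant_price / control_churn  # 6.4 -> "wins"

# A lower price will likely *improve* retention, so true churn <= baseline churn,
# which can only push the variant's true LTV above the estimate:
plausible_true_churn = 0.08
true_variant_ltv = variant_conv * variant_price / plausible_true_churn  # 8.0

print(control_ltv, estimated_variant_ltv, true_variant_ltv)
# The estimated win is therefore a lower bound; the symmetric argument applies
# to a more expensive variant that loses on estimated LTV.
```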
### What would we do if we wanted to know more about this?
a) Run a longer test in which we directly measure retention.
b) Use surrogate metrics.
(TBD)
## A/B/C testing
### How do we compute the sample size here?
### What is the analysis in this case?
I guess:
1) ANOVA
2) Pairwise comparisons (with a multiple-comparison correction), as sketched below.
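A tentative sketch of that two-step analysis: a one-way ANOVA followed by Tukey's HSD for the pairwise comparisons (the metric values for the three variants are invented):

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Invented metric values for three variants A/B/C
a = np.array([2.1, 2.4, 1.9, 2.6, 2.3])
b = np.array([2.8, 3.0, 2.7, 3.2, 2.9])
c = np.array([2.2, 2.5, 2.0, 2.4, 2.6])

# Step 1: one-way ANOVA -- is there *any* difference among the groups?
f_stat, p_value = f_oneway(a, b, c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Step 2: pairwise comparisons with Tukey's HSD (controls the family-wise error rate)
values = np.concatenate([a, b, c])
labels = ["A"] * len(a) + ["B"] * len(b) + ["C"] * len(c)
print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```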
## Dangers of A/B testing
- Multiple comparisons.
- Peeking (stopping or deciding before the planned fixed sample size is reached); see the simulation sketch below.
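A small simulation sketch of why peeking is dangerous: even when there is no real effect, checking the p-value repeatedly as data accrues and stopping at the first p < 0.05 inflates the false-positive rate well above the nominal 5% (the checkpoint schedule and sample sizes are illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments, max_n = 1000, 1000
checkpoints = range(100, 1001, 100)  # peek every 100 observations per group
false_positives = 0

for _ in range(n_experiments):
    # Both groups come from the same distribution: the null hypothesis is true
    a = rng.normal(0, 1, max_n)
    b = rng.normal(0, 1, max_n)
    # Peek at every checkpoint and "stop" as soon as p < 0.05
    if any(ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in checkpoints):
        false_positives += 1

print(f"false-positive rate with peeking: {false_positives / n_experiments:.2%}")
# Typically well above 5%, versus ~5% for a single fixed-sample test
```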