P-Hacking in Startups

Here's a summary of the key themes discussed in the Hacker News thread, along with supporting quotes:

The Trade-offs Between Rigorous A/B Testing and Startup Speed

A central theme revolves around the balance between rigorous A/B testing and the need for speed and agility in a startup environment. Many commenters argue that an excessive focus on statistical significance can be detrimental, especially in the early stages. The primary trade-off, according to many, is speed versus certainty.

  • Emphasis on Shipping and Iteration: Several users highlight the importance of continuous movement and shipping features, even if they're not perfectly optimized. "Startups need to be always moving. You need to keep turning the wheel to help keep everyone busy and keep them from fretting about your slow growth or high churn metrics" ("brian-armstrong"). This approach prioritizes learning and adapting over achieving statistical certainty with every change.
  • Opportunity Cost of Waiting for Statistical Significance: Commenters acknowledge the opportunity cost of waiting for statistically significant results. "But there’s an opportunity cost that needs to be factored in when waiting for a stronger signal" ("renjimen"), suggesting that in some cases the potential benefit of shipping a change outweighs the risk of a false positive.
  • Balancing Priorities: "Even if you have to be honest with yourself about how much you care about being right, there’s still a place for balancing priorities. Two things can be true at once." ("epgui") This captures the nuanced perspective that statistical rigor isn't always the paramount concern, and decisions often involve weighing multiple factors.
  • Balancing Speed and Certainty: "One solution is to gradually move instances to you[r] most likely solution. But continue a percentage of A/B/n testing as well. This allows for a balancing of speed vs. certainty" ("Nevermark"). A minimal sketch of this gradual-rollout idea follows the list.
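
Nevermark’s gradual rollout maps naturally onto a weighted traffic split: route most users to the current front-runner while a residual slice keeps the A/B/n experiment alive. A minimal sketch in Python; the variant names, weights, and the assign_variant helper are illustrative, not from the thread:

```python
import random
from collections import Counter

# Illustrative setup: three variants, with "B" as the current front-runner.
VARIANTS = ["A", "B", "C"]
LIKELY_WINNER = "B"
ROLLOUT_SHARE = 0.8  # fraction of traffic routed straight to the front-runner

def assign_variant() -> str:
    """Route most users to the likely winner, but keep a residual
    uniform A/B/n split so the experiment keeps collecting evidence."""
    if random.random() < ROLLOUT_SHARE:
        return LIKELY_WINNER
    return random.choice(VARIANTS)

# Sanity check: realized split over simulated traffic.
print(Counter(assign_variant() for _ in range(100_000)))
# Expect roughly 86.7% B (0.8 + 0.2/3) and about 6.7% each for A and C.
```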

The Appropriate Level of Rigor for Different Scenarios

Another prominent theme explores the idea that the level of rigor required for A/B testing should be proportionate to the potential impact and risk associated with the decision.

  • Context Matters: Numerous participants emphasize the need to tailor the testing approach to the specific context. "The sign up flow for your startup does not need the same rigor as medical research. You don’t need transportation engineering standards for your product packaging, either. They’re just totally different levels of risk." ("travisjungroth") This highlights the distinction between high-stakes decisions (e.g., medical treatments) and lower-stakes decisions (e.g., website design).
  • Questioning the Necessity of Strict Rigor: Some challenge the blanket application of rigorous statistical practices, particularly for non-critical decisions. "But does it, really? A lot of companies sell... well, let's say 'not important' stuff. Most companies don't cost peoples' lives when you get it wrong." ("Jemaclus").
  • Focus on Harm Reduction over Absolute Certainty: The goal isn't always to find the absolute best but to avoid making things worse. "Not 'only keep changes that are clearly good' but 'don't keep changes that are clearly bad.'" ("yorwba") A sketch of this one-sided framing follows the list.
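
yorwba’s framing corresponds to a one-sided test for harm rather than a two-sided test for any difference. A minimal sketch, assuming conversion counts per arm and the standard two-proportion z-test; the function name and counts are hypothetical:

```python
from statistics import NormalDist

def harm_check(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """One-sided two-proportion z-test asking only 'is B clearly worse
    than A?'. conv_* are conversion counts, n_* are users per arm."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = NormalDist().cdf(z)  # lower tail: small => B credibly worse
    return p_value, p_value < alpha

p, clearly_bad = harm_check(conv_a=480, n_a=10_000, conv_b=430, n_b=10_000)
print(f"p = {p:.3f}; roll back B: {clearly_bad}")
```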

The Pitfalls of Over-Reliance on A/B Testing

Several posters caution against using A/B testing in situations where it's not appropriate, or where it becomes a crutch.

  • A/B testing as a form of CYA (cover your ass): "Too often people use experimentation for CYA reasons so they can never be blamed for making a misstep" ("parpfish").
  • Product-Market Fit: "It's worth emphasizing though that if your startup hasn't achieved product market fit yet this kind of thing is a huge waste of time! Build features, see if people use them." ("simonw")
  • A/B testing on low-conversion SaaS: "I’ve seen people get too excited to A/B test everything even when it’s not appropriate. For us, changing prices was a common A/B test when the relatively low number of conversions meant the tests took 3 months to run! I believe we’ve moved away from that, now." ("scott_w") A back-of-the-envelope power calculation after this list shows why low-conversion tests run so long.
  • Don't waste your time squeezing signal: If you want to improve the product, improve the product. "Just (just lol) make better changes to your product so that the methods don’t matter. If p=0.00001, that’s going to be a better signal than p=0.05 with every correction in this article." ("travisjungroth")
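
scott_w’s three-month price tests follow directly from sample-size arithmetic: at a low baseline conversion rate, detecting a modest lift takes a lot of traffic. A back-of-the-envelope calculation using the standard two-proportion power approximation; the baseline rate, target lift, and traffic figures are illustrative:

```python
from statistics import NormalDist

def users_per_arm(p_base, rel_lift, alpha=0.05, power=0.8):
    """Approximate n per arm for a two-sided two-proportion test that
    detects a relative lift over a baseline conversion rate."""
    z = NormalDist()
    z_a, z_b = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    p1, p2 = p_base, p_base * (1 + rel_lift)
    p_bar = (p1 + p2) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(num / (p1 - p2) ** 2) + 1

# Illustrative: 1% baseline conversion, trying to detect a 10% relative lift.
n = users_per_arm(p_base=0.01, rel_lift=0.10)
print(n, "users per arm")  # about 160,000 per arm
# At, say, 2,000 visitors/day split across two arms, that's ~160 days.
```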

The Importance of Understanding Statistical Concepts (and Avoiding P-Hacking)

A recurring theme is the fundamental need to understand statistical concepts like p-values, statistical significance, and the dangers of "p-hacking."

  • Understanding P-Hacking: The discussion addresses the risk of finding spurious correlations when analyzing data with too many degrees of freedom. "If you have many metrics that could possibly be construed as 'this was what we were trying to improve', that's many different possibilities for random variation to give you a false positive." ("PollardsRho") A short simulation after this list makes the inflation concrete.
  • Define your goals before experimenting: "If you're explicit at the start of an experiment that you're considering only a single metric a success, it turns any other results you get into 'hmm, this is an interesting pattern that merits further exploration' and not 'this is a significant result that confirms whatever I thought at the beginning.'" ("PollardsRho")
  • Importance of Best Practices: "Basically, are you a statistician? If not, sticking to the best practices in experimentation means your results are going to be meaningful." ("noodletheworld").
  • The arbitrary nature of p=0.05: "0.05 (or any Bayesian equivalent) is not a magic number. It’s really quite high for a default. Harder sciences (the ones not in replication crisis) use much stricter values by default." ("travisjungroth")
  • It's about the strength of the evidence: "Getting people used to talking about the strength of the evidence rather than statistical significance is a massive win most of the time." ("bigfudge").
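
PollardsRho’s point is easy to make concrete with an A/A simulation: with no true effect and ten metrics each judged at p < 0.05, the chance that at least one looks "significant" is roughly 1 - 0.95^10 ≈ 40%. A small Monte Carlo sketch, assuming independent metrics for simplicity (real dashboard metrics are correlated, so treat the exact rate as illustrative):

```python
import random

# A/A simulation: no true effect, many metrics, each judged at p < 0.05.
# Under the null each metric's z-statistic is ~ Normal(0, 1), so we can
# simulate the z-statistics directly instead of simulating raw user data.
random.seed(0)
Z_CRIT = 1.96       # two-sided 5% threshold
N_METRICS = 10
N_TRIALS = 100_000

trials_with_a_hit = sum(
    any(abs(random.gauss(0, 1)) > Z_CRIT for _ in range(N_METRICS))
    for _ in range(N_TRIALS)
)
print(f"family-wise false positive rate: {trials_with_a_hit / N_TRIALS:.1%}")
# Expect about 1 - 0.95**10 ≈ 40%, not the nominal 5%.
```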

Bayesian Approaches and Anytime-Valid Methods

Several commenters advocate using Bayesian statistics and anytime-valid methods in A/B testing.

  • Go Bayesian: "For anyone spinning this stuff up, go Bayesian from the start. You’ll end up there, whether you realize it or not. (People will look at p-values in consideration of prior evidence)." ("travisjungroth") A minimal Beta-Binomial sketch follows the list.
  • Adopt Anytime-Valid Methods: "If you’re going to pick any fanciness from the start (besides Bayes) make it anytime-valid methods. You’re certainly already going to be peeking (as you should) so have your data reflect that." ("travisjungroth")
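
For conversion data, the Bayesian route can start as simply as a Beta-Binomial model: give each variant’s rate a Beta prior, so the posterior after c conversions in n users is Beta(a0 + c, b0 + n - c), and estimate P(B beats A) by sampling both posteriors. A minimal sketch, with a uniform Beta(1, 1) prior and illustrative counts, none of it from the thread:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1, 1), draws=100_000):
    """Beta-Binomial model: with a Beta(a0, b0) prior, a variant with c
    conversions in n users has posterior Beta(a0 + c, b0 + n - c).
    Estimate P(rate_B > rate_A) by sampling both posteriors."""
    a0, b0 = prior
    wins = 0
    for _ in range(draws):
        rate_a = random.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        rate_b = random.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Illustrative counts: read the result as evidence strength, not a verdict.
print(prob_b_beats_a(conv_a=120, n_a=2_400, conv_b=145, n_b=2_380))
```

Anytime-valid methods address the same peeking problem from the frequentist side, adjusting thresholds so that checking results continuously does not inflate the false positive rate.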