A/B Test Design & Experiment Program Builder
Turns a change idea into a statistically valid, ready-to-run A/B test plan and growth-experiment program.
About the skill
What it does
Turns a change idea into a statistically defensible, ready-to-run A/B test design and, optionally, a continuous growth-experimentation program. It first writes a structured hypothesis — Because [observation/data], we believe [change] will cause [expected outcome] for [audience]. We'll know when [metric] — replacing "let's see what happens" with a falsifiable, measurable prediction. It enforces single-variable discipline (choosing the right type among A/B, A/B/n, MVT, Split URL) and defines a primary + secondary + guardrail metric triad. From baseline rate, traffic, and MDE it computes sample size and test duration (Evan Miller / Optimizely power-analysis logic; 95% confidence, 80% power by default). It designs variants, picks traffic allocation (50/50, conservative 90/10, ramping), and decides client-side vs server-side implementation. It generates a pre-launch QA checklist and warns against false-positive traps like peeking, early stopping, and segment cherry-picking.
When to use it
When you want to compare two versions and measure which performs better; when validating a landing/pricing/onboarding change before shipping; when you need answers to "should I test this, how long should I run it, when do I hit significance"; and when you want to move beyond one-off tests to a systematic practice with an ICE-prioritized backlog, experiment-velocity targets, and a winning-pattern playbook.
Method / frameworks
Structured Hypothesis Framework (Because/We believe/Will cause/We'll know) · primary/secondary/guardrail metric separation · MDE + power analysis for sample size and duration · traffic-allocation strategies · ICE scoring (Impact + Confidence + Ease, Sean Ellis) for prioritization · Experiment Velocity leading indicators (experiments/month, 20-30% win rate, backlog depth, cumulative lift) · an Experiment Playbook of proven patterns · weekly/bi-weekly/monthly/quarterly cadence. Statistical rigor via the 95% confidence (p<0.05) threshold and anti-peeking discipline.
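As a rough illustration of ICE prioritization, the sketch below ranks a backlog by score. It assumes each dimension is rated 1 to 10 and that the three ratings are averaged; some teams sum or multiply them instead, and the source does not say which variant the skill uses. The `Idea` class, the `prioritize` helper, and the backlog entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # expected effect on the primary metric, 1-10
    confidence: int  # strength of the supporting evidence, 1-10
    ease: int        # how cheap it is to build and run, 1-10

    @property
    def ice(self) -> float:
        # Assumption: ICE as the mean of the three ratings.
        return (self.impact + self.confidence + self.ease) / 3


def prioritize(ideas):
    """Return ideas sorted from highest to lowest ICE score."""
    return sorted(ideas, key=lambda idea: idea.ice, reverse=True)


if __name__ == "__main__":
    # Made-up backlog entries for illustration.
    backlog = [
        Idea("Shorter signup form", impact=6, confidence=7, ease=9),
        Idea("New pricing page layout", impact=9, confidence=5, ease=3),
        Idea("Onboarding checklist", impact=7, confidence=6, ease=5),
    ]
    for idea in prioritize(backlog):
        print(f"{idea.ice:.1f}  {idea.name}")
```

Averaging keeps the score on the same 1 to 10 scale as the inputs, which makes backlog entries easy to compare at a glance.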
How do I use this skill?
Upload the ab-testing.zip you downloaded as-is — no packaging needed, the format is already correct (folder at root).
- Open Settings → Customize → Skills
- Upload → select the ab-testing.zip you downloaded
- Claude reads SKILL.md; the name + description appear. Ready ✅
Scripts run in Anthropic's code-execution environment (sandbox) — not on your machine.