A/B Test Design & Experiment Program Builder

Turns a change idea into a statistically valid, ready-to-run A/B test plan and growth-experiment program.

About the skill

What it does

Turns a change idea into a statistically defensible, ready-to-run A/B test design and, optionally, a continuous growth-experimentation program.

It first writes a structured hypothesis: "Because [observation/data], we believe [change] will cause [expected outcome] for [audience]. We'll know when [metric]." This replaces "let's see what happens" with a falsifiable, measurable prediction. It enforces single-variable discipline, chooses the right test type (A/B, A/B/n, MVT, or Split URL), and defines a primary, secondary, and guardrail metric triad.

From the baseline rate, traffic, and minimum detectable effect (MDE), it computes sample size and test duration (Evan Miller / Optimizely power-analysis logic; 95% confidence and 80% power by default). It designs variants, picks a traffic allocation (50/50, conservative 90/10, or ramping), and decides between client-side and server-side implementation. It generates a pre-launch QA checklist and warns against false-positive traps such as peeking, early stopping, and segment cherry-picking.
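To make the sample-size and duration step concrete, here is a minimal sketch using the standard two-proportion normal approximation. It is an illustration under stated assumptions, not necessarily the exact formula the skill or the Evan Miller / Optimizely calculators use. The function names, the relative-MDE convention, and the even traffic split are assumptions made for this example.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion test.

    baseline     -- current conversion rate, e.g. 0.05 for 5%
    relative_mde -- smallest relative lift worth detecting, e.g. 0.20 for +20%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must lie strictly between 0 and 1 and differ")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 95% confidence by default
    z_beta = NormalDist().inv_cdf(power)           # 80% power by default
    p_bar = (p1 + p2) / 2

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def duration_days(n_per_variant, daily_visitors, variants=2):
    """Days needed to fill every variant, assuming traffic is split evenly."""
    return math.ceil(n_per_variant * variants / daily_visitors)


if __name__ == "__main__":
    n = sample_size_per_variant(baseline=0.05, relative_mde=0.20)
    print(n, "visitors per variant,", duration_days(n, daily_visitors=1000), "days")
```

Halving the MDE roughly quadruples the required sample, which is why the MDE choice drives test duration more than any other input.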

When to use it

When you want to compare two versions and measure which performs better; when validating a landing/pricing/onboarding change before shipping; when you need answers to "should I test this, how long should I run it, when do I hit significance"; and when you want to move beyond one-off tests to a systematic practice with an ICE-prioritized backlog, experiment-velocity targets, and a winning-pattern playbook.
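For the "when do I hit significance" question, one common check is a pooled two-proportion z-test, evaluated once after the planned sample size has been reached. The sketch below is an assumption about how such a check could look, not the skill's own code. The function name and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so nothing to distinguish
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical counts: control 400/8000, variant 480/8000.
    p = two_proportion_p_value(400, 8000, 480, 8000)
    print("significant at 95%" if p < 0.05 else "not significant", round(p, 4))
```

Running this check repeatedly while the test is still collecting data inflates the false-positive rate, so it belongs at the end of the planned duration.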

Method / frameworks

Structured Hypothesis Framework (Because/We believe/Will cause/We'll know) · primary/secondary/guardrail metric separation · MDE + power analysis for sample size and duration · traffic-allocation strategies · ICE scoring (Impact + Confidence + Ease, Sean Ellis) for prioritization · Experiment Velocity leading indicators (experiments/month, 20-30% win rate, backlog depth, cumulative lift) · an Experiment Playbook of proven patterns · weekly/bi-weekly/monthly/quarterly cadence. Statistical rigor comes from the 95% confidence (p < 0.05) threshold and anti-peeking discipline.
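ICE prioritization can be sketched as a simple sort over a backlog. This example follows the additive form written above (Impact + Confidence + Ease); some teams average or multiply the three scores instead. The 1-10 scale, the class name, and the backlog entries are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ExperimentIdea:
    name: str
    impact: int      # expected effect on the primary metric, 1-10 (assumed scale)
    confidence: int  # strength of the supporting evidence, 1-10
    ease: int        # how cheap the test is to build and run, 1-10

    @property
    def ice(self) -> int:
        # Additive ICE score, matching the "Impact + Confidence + Ease" form.
        return self.impact + self.confidence + self.ease


def prioritize(backlog):
    """Return ideas ordered from highest to lowest ICE score."""
    return sorted(backlog, key=lambda idea: idea.ice, reverse=True)


if __name__ == "__main__":
    backlog = [
        ExperimentIdea("Shorter signup form", impact=6, confidence=7, ease=9),
        ExperimentIdea("New pricing page layout", impact=9, confidence=5, ease=3),
        ExperimentIdea("Onboarding checklist", impact=7, confidence=6, ease=5),
    ]
    for idea in prioritize(backlog):
        print(idea.ice, idea.name)
```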

How do I use this skill?

You don't "run" a skill. After installing it, just describe your task to the agent (for example, ask it to design an A/B test), and the skill activates on its own when the task matches its description.

Upload the ab-testing.zip you downloaded as-is. No repackaging is needed; the format is already correct (folder at root).

Open Settings → Customize → Skills
Upload → select the ab-testing.zip you downloaded
Claude reads SKILL.md; the name + description appear. Ready ✅

Scripts run in Anthropic's code-execution environment (sandbox) — not on your machine.