GTM Strategy

How do you run GTM experiments that produce trustworthy results?

Author

Will Cyniak

Date

June 30, 2026

Read time

min read

How do you run GTM experiments that produce trustworthy results?

TL;DR

Run GTM experiments that produce trustworthy results by doing three things before launch: write one hypothesis with one variable, size the sample so it can detect a realistic effect, and set the ship/iterate/kill rule in advance. Most "experiments" skip all three, then read random noise as a result and roll it out.

Most GTM "experiments" are just two emails and a hunch

A team sends version A to half the list, version B to the other half, sees B win by a few points, and declares B the winner. The list was 180 people. The "win" was four extra replies. That's not an experiment. That's noise wearing a lab coat.

The problem isn't the testing instinct, it's the missing math. With small samples and no pre-set rule, almost any difference looks meaningful, and you'll "learn" things that reverse the next time you run them. Booking.com runs around 25,000 experiments a year because most individual tests are inconclusive, and they know it. The discipline is in the setup, not the dashboard.

The mechanism: decide three things before you launch

A real experiment commits to its rules before it sees data. Three of them:

One hypothesis, one variable. "Leading with the integration story instead of the ROI story will lift reply rate" is testable. Changing the subject line, the CTA, and the sender at once is not — when it wins, you won't know which change did it. Move one thing.

A sample that can detect a realistic effect. Before launching, decide the smallest lift worth acting on and check whether your list is big enough to see it. Use a calculator like Evan Miller's. The standard settings are 95% confidence and 80% power, and at those settings, detecting a small lift on a low base rate takes far more contacts than most outbound tests ever have. If the math says you'd need 6,000 contacts per arm and you have 400, don't run it. You'll get an answer, it just won't be true.

The decision rule, written down. Decide now what each outcome means: what result ships, what result earns another round, what result kills the idea. Teams that skip this re-interpret the data until it agrees with what they already wanted to do.

Prioritize the backlog by lift you can actually measure

Score experiment ideas on expected impact, your confidence in the hypothesis, and the effort to run it — the standard ICE approach. Add one filter most teams miss: can you reach significance with the volume you have? A high-impact idea you can't measure cleanly belongs in a quarterly cohort test, not this week's A/B. Run the ones you can read first.

Result	What it means	Action
Clear win at 95%+	Effect is real and big enough to matter	Ship it and document why
Directional, 85–95%	Promising but not conclusive	Re-run with more volume
Flat	No detectable difference	Keep the simpler version; move on
Clear loss at 95%+	The change hurts	Kill it; record the learning
Underpowered	Sample too small to tell	Don't call it; pool or extend

What to do this week

Take the last GTM test your team called a "win." Find the sample size per variation and the size of the difference. Then ask: if we ran it again next week, would we bet money on the same result? If the honest answer is no, you don't have a finding, you have a coin flip. Size the next one before you send it.

Frequently asked questions

How big does a GTM A/B test need to be? Big enough to detect the smallest lift you'd act on, at 95% confidence and 80% power. For low base rates (like cold-email reply), that's often thousands per variation. Size it with a calculator before launch, not after.

What's the most common mistake in GTM experiments? Changing more than one variable at once, then not knowing which change drove the result. Move one thing per test.

Should we still test if our list is small? Not as a quick A/B. Use longer cohort tests or pooled results over time. A small-sample A/B gives you a confident-looking number that won't replicate.

How RevPack helps

We build the experimentation plumbing: an idea backlog scored by measurable lift, the tracking to read results cleanly inside your CRM, and decision rules wired to what your data can actually support. If your team tests constantly but nothing compounds, that's usually a measurement problem, not a creativity problem.

Book a call →

📚 References

Stefan Thomke — "Building a Culture of Experimentation," Harvard Business Review, March–April 2020. hbr.org
Evan Miller — "Sample Size Calculator (Evan's Awesome A/B Tools)." evanmiller.org

TL;DR:

Trustworthy GTM experiments are decided before launch: one hypothesis with one variable, a sample sized to detect a realistic effect at 95% confidence, and a ship/iterate/kill rule written down in advance. Skip those and you're reading noise.