Where they were when we started

The client was a B2B SaaS company in the mid-market workflow-software category. Annual revenue around $40M, two products, three buyer personas, an existing growth team of seven. They had been running CRO for 18 months when we were brought in. The program had produced 41 completed tests, 12 wins, 18 inconclusive, 11 losses.

Aggregate site-level conversion rate had moved from 2.1% to 2.3% across those 18 months. Neither the team nor leadership was happy with that. Their working hypothesis was that they had hit the ceiling of what CRO could do for them, and they were considering scaling back the program.

We saw something different. The program had not hit a ceiling. The program had been measuring the wrong thing and testing in the wrong order. Each individual test had been competently run; the collection of tests had not been sequenced as a program. The 0.2-point lift over 18 months was the natural result of unsequenced testing.

What we changed in the first month

The first month was the diagnostic — the same one our firm runs at the start of every engagement; we have written separately about why we do it at no cost. The output was a 30-page diagnostic report that surfaced three structural issues.

Issue one: the hypothesis tree was flat. The team was generating test ideas from a shared backlog of "things to try." There was no taxonomy of what each test was actually probing. As a result, the program had run six different tests on hero-section variants without any of them being designed to learn anything generalizable about hero sections. The wins were specific to the page; the losses were specific to the page; nothing compounded.

Issue two: test duration was political, not statistical. Tests were being called at two weeks because that was the cadence the team had committed to internally. Most of the two-week tests had not reached statistical power. The "wins" included tests that would not have survived a longer runtime.

Issue three: success was measured at the surface. Tests measured form-fill rate or click-through rate on the page. Those metrics improved or did not. Nobody was measuring whether the lift translated to qualified pipeline downstream. On a B2B SaaS funnel with a 4-to-6-month sales cycle, the surface metrics were necessary but not sufficient.

The diagnostic recommended three changes corresponding to the three issues, plus an operating-rhythm change for the program itself.

The hypothesis tree

We rebuilt the test backlog as a structured hypothesis tree. The tree had three levels: strategic hypotheses (big claims about how the buyer makes decisions), tactical hypotheses (specific page-level claims), and experiments (the actual tests).

There were three strategic hypotheses for this client: (1) buyers were under-qualifying themselves because they could not tell which product fit their use case; (2) buyers in the larger enterprise segment were bouncing because the marketing site read as mid-market and did not signal enterprise credibility; (3) buyers from referral channels converted at materially higher rates yet were being treated identically to cold-traffic buyers.

[Figure: the hypothesis-tree diagram. One strategic claim ("buyers cannot tell which product fits their use case") branches into three tactical hypotheses (product-page clarity, a comparison tool, use-case content), which in turn branch into eight experiments at the leaves.]
Fig 1. The hypothesis-tree structure that organized the program. Strategic hypotheses sit at the top; tactical hypotheses underneath; specific experiments at the leaves. Outcomes flow upward — each experiment's result updates the tactical hypothesis above it, which updates the strategic hypothesis above that.

Each strategic hypothesis had 4-6 tactical hypotheses beneath it. Each tactical hypothesis spawned 1-3 experiments. The result was a backlog of 60-some experiment slots, organized so that any individual experiment's outcome — win, loss, or null — would feed into the tactical and strategic hypotheses above it.

This is a method. It is not a methodology — there is no special software, no copyrighted framework. What it requires is the discipline to refuse to run tests that do not slot into the tree. The team had to give up the freedom to test "anything interesting" in exchange for the compounding benefit of tests that taught generalizable things.
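For concreteness, a minimal sketch of the tree as a data structure follows. The class names, field names, and the example branch are our own illustrative assumptions, not the client's tooling; the point is only the shape — experiments sit at the leaves and their outcomes roll up.

# Minimal sketch of the hypothesis tree (illustrative names, not client tooling).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Experiment:
    name: str
    outcome: str = "pending"          # "win", "loss", "null", or "pending"

@dataclass
class TacticalHypothesis:
    claim: str
    experiments: List[Experiment] = field(default_factory=list)

    def evidence(self) -> Dict[str, int]:
        # Summarize experiment outcomes so the tactical claim can be re-scored.
        counts = {"win": 0, "loss": 0, "null": 0, "pending": 0}
        for e in self.experiments:
            counts[e.outcome] += 1
        return counts

@dataclass
class StrategicHypothesis:
    claim: str
    tacticals: List[TacticalHypothesis] = field(default_factory=list)

    def evidence(self) -> Dict[str, int]:
        # Roll tactical-level evidence up to the strategic claim.
        total = {"win": 0, "loss": 0, "null": 0, "pending": 0}
        for t in self.tacticals:
            for outcome, n in t.evidence().items():
                total[outcome] += n
        return total

# Example branch: one strategic claim, one tactical claim, two experiment slots.
strategic = StrategicHypothesis(
    claim="Buyers cannot tell which product fits their use case",
    tacticals=[TacticalHypothesis(
        claim="A comparison tool improves self-qualification",
        experiments=[Experiment("test 01"), Experiment("test 02")],
    )],
)
strategic.tacticals[0].experiments[0].outcome = "win"
print(strategic.evidence())           # {'win': 1, 'loss': 0, 'null': 0, 'pending': 1}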

The cadence and the measurement

On cadence, we made the program statistical, not political. Tests run until they reach the pre-declared sample size at the pre-declared power. Tests that do not reach power within 6 weeks get pulled and the hypothesis gets re-examined — usually because traffic to the test surface is too low, in which case we move the test to a higher-traffic surface.
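To make "pre-declared sample size at the pre-declared power" concrete, here is a rough sketch of the arithmetic behind the six-week cutoff, using the standard two-proportion z-test approximation. The baseline rate, minimum detectable effect, and weekly traffic are placeholder numbers, not the client's.

# Rough sketch of the pre-declared sample-size arithmetic (placeholder numbers,
# not the client's), using the standard two-proportion z-test approximation.
from scipy.stats import norm

def sample_size_per_arm(p_baseline: float, p_variant: float,
                        alpha: float = 0.05, power: float = 0.80) -> float:
    """Visitors needed per arm for a two-sided test of two proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    return (z_alpha + z_beta) ** 2 * variance / (p_baseline - p_variant) ** 2

# Placeholder scenario: 2.1% baseline, hoping to detect a lift to 2.5%.
n = sample_size_per_arm(0.021, 0.025)
weekly_traffic_per_arm = 3_000        # placeholder traffic to the test surface
weeks_to_power = n / weekly_traffic_per_arm
print(f"{n:,.0f} visitors per arm, about {weeks_to_power:.1f} weeks at this traffic")
# If weeks_to_power lands well past 6, the rule says don't run the test here:
# move it to a higher-traffic surface or re-examine the hypothesis.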

This change cut the test volume by roughly 30 percent. The team initially read this as a loss. It was not. The remaining tests produced cleaner signal, and the signal compounded faster because each test was answering a real question instead of producing noise.

On measurement, we tied test outcomes to pipeline rather than to surface metrics. Every test that affected a top-of-funnel surface had to track two metrics: the surface metric (click, fill, scroll) and the downstream pipeline contribution measured 60 days after the test concluded. The downstream measurement was lagged and noisy, but it kept the team honest about which "wins" actually mattered.
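As a sketch of what that dual-metric readout can look like, the snippet below joins per-visitor test exposures to sales-qualified opportunities created within 60 days of test close. The table and column names are hypothetical assumptions for illustration, not the client's actual CRM or analytics schema.

# Sketch of a dual-metric readout: surface conversion plus pipeline measured
# 60 days after the test concluded. Column names are hypothetical.
import pandas as pd

def test_readout(exposures: pd.DataFrame, opportunities: pd.DataFrame,
                 test_end: pd.Timestamp) -> pd.DataFrame:
    """exposures: one row per visitor (visitor_id, variant, converted_on_surface).
    opportunities: one row per sales-qualified opp (visitor_id, sql_date, amount)."""
    cutoff = test_end + pd.Timedelta(days=60)
    in_window = (
        opportunities[opportunities["sql_date"] <= cutoff]
        .groupby("visitor_id", as_index=False)
        .agg(sql_date=("sql_date", "min"), amount=("amount", "sum"))
    )
    merged = exposures.merge(in_window, on="visitor_id", how="left")
    merged["became_sql"] = merged["sql_date"].notna()
    return merged.groupby("variant").agg(
        visitors=("visitor_id", "nunique"),
        surface_rate=("converted_on_surface", "mean"),
        sql_rate=("became_sql", "mean"),
        pipeline=("amount", "sum"),
    )

# A variant only counts as a win if surface_rate and sql_rate both hold up;
# surface lift with a flat sql_rate is the pattern that gets rolled back.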

Three tests in the first quarter showed surface lift with no pipeline lift. Those would have been counted as wins under the old measurement and would have been built on. Under the new measurement, the team killed the variants and moved on. This is the un-glamorous part of CRO that almost never makes it into case studies.

What the numbers looked like

Across the six months of the engagement, the program ran 24 tests. Of those, 11 produced statistically significant lifts at the surface; 7 of the 11 also produced pipeline lifts at the 60-day measurement; the other 4 showed surface lifts without pipeline lifts and were rolled back.

The compounding result: qualified-pipeline conversion (from anonymous visit through to sales-qualified opportunity) moved from a baseline of 0.42% to 0.58%. That is a 38% improvement, sustained across the six months. In dollar terms, on the client's paid-traffic mix, that translated to roughly $2.1M of incremental annual pipeline.

The aggregate site-level conversion rate the team had been measuring before — the one that had moved 0.2 points in 18 months — moved 0.4 points in 6 months under the new program. But the team and leadership stopped looking at it. The metric that mattered was downstream, and that one moved enough that everyone in the conversation stopped arguing about whether CRO was hitting a ceiling.

The program had not hit a ceiling. The program had been measuring the wrong thing.

What carried forward

At the end of the engagement, we transitioned program ownership back to the client team. The three structural changes — hypothesis tree, statistical cadence, pipeline-tied measurement — were now native to how they ran CRO. The client has been running the program independently for 14 months as of this writing, and the compounding has continued; their last quarterly readout showed qualified-pipeline conversion at 0.71%.

The lesson we draw from this engagement is not specific to CRO. It applies to any digital optimization program. Programs do not hit ceilings. Programs run out of structure. When a sustained optimization program stalls, the right question is rarely "what new tactic should we try." The right question is "what is wrong with how we are sequencing the work."

Our team has run variants of this engagement pattern with several other clients across consumer subscription, B2B SaaS, and direct-to-consumer commerce. The specific tactics differ; the structural changes are roughly the same. If your team is running a CRO program that feels like it has plateaued, the diagnostic conversation is worth having.