Most Aussie Shopify founders test like teenagers gamble. A bold button colour change here. A new headline they read about in a Twitter thread there. A free shipping bar swap because a competitor just added one. Twelve months later they cannot tell you a single test that moved the number, because nothing was ever properly tracked, scored or stacked.

Here is the brutal benchmark. Optimizely studied 127,000 experiments across their platform and found that the average win rate is roughly 12%. That is the baseline for ad-hoc testing. Eighty-eight out of every hundred ideas you have today will either lose or be inconclusive. Without a system, you are paying for traffic, designer time and developer hours to learn almost nothing.

Mature testing programs flip those odds. Teams that follow a structured prioritisation framework lift their win rate to 22 to 30%, and an internal benchmark from CRO agency Conversion.com shows clients who score every hypothesis before they ship it see a 31% higher win rate than teams running tests by gut feel. That is the difference between a store that lifts conversion 25 to 40% over a year and one that lifts nothing at all.

This article gives you the system. It is called the PIE framework. We use it inside eCommerce Circle with members running $40k to $500k a month on Shopify, and it turns the noisy “what should we test next” debate into a single ranked list you can work through for the next twelve months.

Why Most Shopify A/B Tests Fail (the Hidden Tax of Random Testing)

The problem is not the tools. Shopify itself ships native A/B testing via Rollouts. Shoplift starts at $74 a month and plugs straight into the Theme Customiser. Intelligems runs price and shipping tests cleanly from $49. The barrier to running a test in 2026 is basically zero. Anyone can launch one this afternoon.

The problem is the queue feeding the tools. Every Aussie founder I work with has the same chaotic backlog: a Notion doc with 47 test ideas, a Slack thread of designer suggestions, three things their CRO agency proposed in 2024 that never shipped, and whatever their last podcast guest said worked for their store. There is no ranking. No scoring. No agreed standard for what gets tested first.

So tests get picked the wrong way. The flashy ones win the queue. The complex ones get parked. The tiny ones (a button colour, a font tweak) get shipped because they are easy, even though the lift is tiny and the page they live on gets ten thousand visits a month, not a hundred thousand. Twelve months in, the program has run 18 tests, produced three winners, and delivered zero meaningful conversion lift on the store.

Here is the maths most founders never run. If your store does 50,000 sessions a month at a 2.81% conversion rate (the 2025 Shopify median) and an $85 AOV, you are at roughly $119k a month in revenue. A 25% lift in conversion adds $29,750 a month, which is $357,000 a year. Random testing leaves most of that on the table because the highest-impact tests never get prioritised correctly.
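If you want to run that maths on your own numbers, here is a minimal Python sketch. The function name is ours and purely illustrative. It does not round along the way, so its output lands slightly above the rounded figures quoted above.

```python
def monthly_revenue(sessions: int, conversion_rate: float, aov: float) -> float:
    """Revenue = sessions x conversion rate x average order value."""
    return sessions * conversion_rate * aov


# Inputs from the example above: 50,000 sessions, 2.81% CR, $85 AOV.
baseline = monthly_revenue(50_000, 0.0281, 85.0)

# A 25% relative lift in conversion rate, everything else unchanged.
lifted = monthly_revenue(50_000, 0.0281 * 1.25, 85.0)

extra_per_month = lifted - baseline
extra_per_year = extra_per_month * 12

print(f"Baseline: ${baseline:,.0f} a month")
print(f"Lift:     ${extra_per_month:,.0f} a month, ${extra_per_year:,.0f} a year")
```

Swap in your own sessions, conversion rate and AOV to see what a 25% lift is worth to your store.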

The fix is not more ideas. The fix is a way to rank the ones you already have.

The PIE Framework Explained (Potential, Importance, Ease)

PIE was developed by Canadian CRO agency WiderFunnel and it has held up for more than a decade because it forces three uncomfortable conversations in front of every test idea before a single line of code gets written.

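You score each idea from 1 to 10 across three dimensions: Potential (how much improvement the page can realistically deliver), Importance (how valuable the traffic hitting that page is) and Ease (how simple the test is to build and ship). Then you average the three scores. The highest scores go to the top of the test queue. The lowest get parked.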

The output is a single PIE score, calculated as (P + I + E) / 3. Tests scoring 8.0 or higher go into the next quarter. Tests scoring 6.0 to 7.9 go on the watchlist. Tests below 6.0 get parked or killed. That is it. You now have a defensible, repeatable ranking that does not depend on whoever is loudest in the Monday meeting.
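The scoring rule above is simple enough to sketch in a few lines of Python. Treat this as an illustration, not a finished tool: the class name, the helper functions and the three backlog entries are hypothetical, and only the formula and the 8.0 and 6.0 cut-offs come from the text.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    potential: int   # 1 to 10
    importance: int  # 1 to 10
    ease: int        # 1 to 10

    @property
    def pie_score(self) -> float:
        # PIE score is the plain average of the three dimensions.
        return (self.potential + self.importance + self.ease) / 3


def bucket(score: float) -> str:
    """Map a PIE score to the queue it belongs in."""
    if score >= 8.0:
        return "next quarter"
    if score >= 6.0:
        return "watchlist"
    return "parked or killed"


def rank(ideas):
    """Highest PIE score first."""
    return sorted(ideas, key=lambda idea: idea.pie_score, reverse=True)


# Hypothetical backlog entries, for illustration only.
backlog = [
    Idea("Hypothetical idea 1", potential=4, importance=5, ease=8),
    Idea("Hypothetical idea 2", potential=9, importance=8, ease=7),
    Idea("Hypothetical idea 3", potential=7, importance=6, ease=6),
]

for idea in rank(backlog):
    print(f"{idea.pie_score:.1f}  {bucket(idea.pie_score):<16}  {idea.name}")
```

The same logic fits in a Notion or Airtable formula column. What matters is that the ranking is computed, not argued.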

PIE framework scoring dashboard showing test ideas ranked by Potential, Importance and Ease scores
A live PIE scoring board built inside Notion or Airtable. The top three rows are the next quarter’s test queue.

Building Your Test Idea Inventory (the 5 Sources)

Before you can rank ideas, you need ideas. Most stores starve themselves at this step because they only ever ideate from one source (usually whatever the founder noticed on a competitor site last week). A proper test backlog draws from five distinct inputs, and you should be adding to it every fortnight.

Aim for a backlog of at least 40 to 60 ideas before you start scoring. A small inventory is the first sign you will run out of pipeline by month three.

Test idea inventory in Notion showing categorised hypotheses by funnel stage and data source
A test idea inventory grouped by funnel stage. Notice the source tag on every idea: that is what keeps the pipeline honest.

Scoring Tests with PIE: a Worked Example

Theory is cheap. Let us walk through three real test ideas from an Aussie skincare brand doing roughly $180k a month on Shopify and score them properly.

Without PIE, Test C ships first because it is easy. With PIE, Test B ships first because it has the highest expected return on effort. Test A goes second once the team has capacity. Test C either gets parked or runs as a quick optimisation while the bigger work is in flight. That single shift in sequencing is where the 31% win-rate uplift comes from.

A few scoring rules that keep the team honest. Score independently first (no one shows their numbers until everyone has scored). Then debate the gaps. If one person scored Potential at 9 and another at 4, that is the conversation worth having before you ship. Document the agreed score against the test card so you can audit later whether your team is consistently overestimating or underestimating any of the three dimensions.
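One way to surface "the gaps worth debating" is to compare everyone's independent scores and flag any dimension where the spread is wide. The sketch below is an assumption-laden illustration: the function name, the sample scores and the 3-point gap threshold are ours, so pick whatever threshold suits your team.

```python
def dimensions_to_debate(scores: dict[str, list[int]], max_gap: int = 3) -> list[str]:
    """Return the dimensions where independent scores differ by max_gap or more.

    `scores` maps a dimension name to the list of scores each team member
    gave before seeing anyone else's numbers. The threshold is an assumption.
    """
    return [
        dimension
        for dimension, values in scores.items()
        if max(values) - min(values) >= max_gap
    ]


# Hypothetical independent scores from three team members for one test idea.
independent_scores = {
    "potential": [9, 4, 7],   # a 9 against a 4: worth a conversation
    "importance": [7, 8, 7],
    "ease": [6, 6, 5],
}

print(dimensions_to_debate(independent_scores))
```

The function only tells you where to talk. The agreed score that goes on the test card still comes out of the debate, not the code.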

Sample Size, Traffic Minimums and the 14-Day Rule

This is where most Shopify CRO programs quietly die. A store with 12,000 monthly sessions cannot run the same test calendar as a store with 200,000. Pretending otherwise produces false winners, false losers, and a backlog of tests that “felt like they should have worked” but the data was never trustworthy.

The rules of thumb that keep your testing honest:

This last point is the most important and the most ignored. Smaller stores should not be running ten tests a quarter. They should be running three or four bigger swings that move the dial enough to be detectable. Founders running $40k to $80k a month who try to mimic the testing cadence of a $5m brand burn through their backlog without learning anything because every test runs underpowered.

A/B test sample size calculator showing required visitors per variant for a Shopify store with 2.81 percent baseline conversion rate
Sample size calculation for a Shopify store at 2.81% baseline CR, targeting a 10% minimum detectable effect. Run this for every test before you ship.
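If you would rather run the sample size calculation yourself than trust a black-box calculator, here is a minimal sketch using the standard two-proportion formula. The 95% confidence level and 80% power are our assumptions (they are common defaults, not figures from this article), and the function name is hypothetical. Your testing tool may use a different statistical method, so treat the result as a sanity check.

```python
import math
from statistics import NormalDist


def visitors_per_variant(
    baseline_cr: float,
    relative_mde: float,
    alpha: float = 0.05,   # assumed: two-sided 95% confidence
    power: float = 0.80,   # assumed: 80% power
) -> int:
    """Approximate visitors needed in EACH variant of a two-variant test."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)   # conversion rate if the lift is real
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# 2.81% baseline conversion rate, 10% relative minimum detectable effect.
n = visitors_per_variant(0.0281, 0.10)
print(f"{n:,} visitors per variant, {2 * n:,} in total")
```

Under these assumptions the answer lands in the tens of thousands of visitors per variant, which is why a lower-traffic store is better off chasing bigger swings. Raise `relative_mde` and watch the required sample shrink.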

Your 12-Test Annual Roadmap (the Template)

One test a month. That is the realistic cadence for most Aussie Shopify stores in the $80k to $500k a month range. Twelve tests a year at a 25% win rate (industry benchmark for a structured program) gives you three winners, each delivering a 5 to 15% lift on the metric it targets. Compounded across the funnel, that is where your 25 to 40% annual conversion lift comes from.

Here is the template we hand to members inside the eCommerce Circle workshop. Adapt the column headings to your tool of choice (Notion, Airtable, ClickUp, even a Google Sheet works fine).

A balanced annual roadmap typically looks like 3 PDP tests, 2 collection or category tests, 2 cart and upsell tests, 2 checkout tests, 1 homepage test, 1 email or popup test, and 1 ad landing page test. Skew the mix toward whichever funnel stage your data says is leaking hardest. Our 5-stage conversion funnel audit is the fastest way to figure out where the biggest leak sits today.
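Here is a quick sketch of that mix as data, with a helper for skewing it toward a leaky funnel stage while keeping the year at twelve tests. The stage counts come from the paragraph above. The dictionary layout, the function name and the example swap are illustrative assumptions.

```python
# The balanced mix described above: twelve tests across seven funnel stages.
ROADMAP_MIX = {
    "PDP": 3,
    "collection or category": 2,
    "cart and upsell": 2,
    "checkout": 2,
    "homepage": 1,
    "email or popup": 1,
    "ad landing page": 1,
}


def shift_slot(mix: dict[str, int], from_stage: str, to_stage: str) -> dict[str, int]:
    """Return a copy of the mix with one test moved between stages."""
    if mix[from_stage] < 1:
        raise ValueError(f"No tests left to move out of {from_stage!r}")
    adjusted = dict(mix)
    adjusted[from_stage] -= 1
    adjusted[to_stage] += 1
    return adjusted


# Hypothetical example: the audit says checkout is leaking hardest,
# so the homepage slot is given to a third checkout test.
skewed = shift_slot(ROADMAP_MIX, from_stage="homepage", to_stage="checkout")

print(sum(ROADMAP_MIX.values()), "tests in the balanced mix")
print(skewed)
```

Moving slots rather than adding them keeps the cadence at one test a month, so the skew never turns into an overloaded calendar.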

The CRO Test Post-Mortem (Why You Must Document Losses, Not Just Wins)

Every test gets a one-page post-mortem. Every single one. Winner, loser or flat. Stores that skip this step lose 80% of the value of their testing program because the institutional learning evaporates the moment the next test starts.

A good post-mortem captures six things:

Loser tests are more valuable than most founders realise. A test that disproves a beloved assumption (“our customers want more discount messaging”) saves you years of building in the wrong direction. The discipline of documenting it is what stops the same wrong hypothesis getting re-pitched by someone new on the team eighteen months later.

The Compounding Effect: How a 25 to 30% Win Rate Becomes 25 to 40% Annual Conversion Lift

Here is the simple maths that should keep you committed to the roadmap when month three feels slow and month six feels like nothing is moving. Twelve tests a year. At a structured 25 to 30% win rate, you ship three to four winners. Each winner lifts the metric it targets by 5 to 15% (the realistic range for well-prioritised Shopify tests). Lifts compound across the funnel.

A 10% PDP conversion lift plus a 7% cart conversion lift plus a 5% checkout completion lift does not give you 22% total. It gives you (1.10 x 1.07 x 1.05) – 1 = 23.6% net conversion lift. Combine that with a single AOV winner (say a bundle test that lifts AOV 8%) and your revenue-per-session is up 33% year on year, all from twelve well-scored tests.
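The compounding arithmetic above is easy to check, and easy to rerun with your own winners. A minimal sketch (the function name is ours):

```python
import math


def compound(lifts):
    """Combine independent relative lifts multiplicatively, not additively."""
    return math.prod(1 + lift for lift in lifts) - 1


# PDP +10%, cart +7%, checkout +5%, as in the example above.
funnel_lift = compound([0.10, 0.07, 0.05])

# Add the 8% AOV winner to get the revenue-per-session lift.
revenue_per_session_lift = compound([0.10, 0.07, 0.05, 0.08])

print(f"Conversion lift:          {funnel_lift:.1%}")
print(f"Revenue-per-session lift: {revenue_per_session_lift:.1%}")
```

This assumes each lift applies to a different step and that the steps do not interfere with each other, which is the same simplification the worked example makes.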

That is the lift that comes free with the same ad spend. For a store doing $180k a month, a 33% lift in revenue-per-session is roughly $60k a month more revenue without spending another dollar on Meta or Google. Once every winner is live, that run rate is worth about $720k of incremental revenue over twelve months from a CRO program that costs you tool subscriptions plus your time.

Real Aussie examples back this up. Furniture brand Factory to Home added a FAQ popup to their PDPs and lifted completed orders 18.1% from a single test. Aje rebuilt their mobile experience after recognising that more than 75% of traffic was mobile but the experience was hurting conversion. Incu added advanced filtering to their multi-brand catalogue and lifted conversion 15% in the first week. None of these are dramatic redesigns. They are well-scored, well-implemented tests that shipped because they cleared the prioritisation bar.

For a deeper read on how the individual metric lifts stack into store-level economics, our breakdown of the Profit-Per-Visitor framework shows the unit of measurement that ties testing back to the bottom line. And if you want the channel-level equivalent (testing whether ad channels actually drive incremental revenue, not just claim credit), our geo-holdout incrementality testing guide is the companion read.

The PIE Test Backlog Quickstart Checklist

Print this. Stick it next to your screen. It is the one-pager that makes the system real.

You do not need a CRO agency to run this. You need discipline, a calendar, and the courage to kill the test ideas your team loves but the data says are 6.3 out of 10.

Inside eCommerce Circle, the test backlog is one of the core artefacts every member builds in their first 90 days. We work through it together, score live, and hold each other accountable to actually shipping the top-ranked test rather than the easiest one. If you want a second opinion on your current test queue, or you have not built one yet and are not sure where to start, let’s talk.

Written by

Paul Warren

Helping Shopify brand owners scale smarter through the eCommerce Circle coaching community.
