
Composite metrics

A composite metric is a weighted sum of multiple existing metric bindings rolled into a single decision signal. Use a composite when an experiment is multi-objective and the analyst needs one number to decide a winner instead of squinting at three or four primaries that may disagree. In the experiment builder this is surfaced as the Composite (weighted) measure: pick it on a binding and reference other bindings on the same experiment as its components. Statsig calls these "metric programs"; Eppo bundles equivalent behaviour under "guardrail decision rules." A vs B ships the same capability, with explicit pairwise component-covariance handling and a Decomposition view that breaks the weighted lift back into per-component contributions.

What is it

In A vs B, the Composite (weighted) measure is picked inside the experiment builder's Metrics step — not at metric creation time. Its components are other metric bindings on the same experiment (the saved analysis configurations, referenced by binding ID), not raw project metrics. Each component is paired with a weight. Weights are normalised at save time so they sum to exactly 1.0 — you can type any non-negative numbers (60 and 40, 0.6 and 0.4, 3 and 2) and A vs B canonicalises them. The composite for each variation is the weighted sum of the per-component point estimates; the variance combines the per-component variances with pairwise covariance, so correlated components don't double-count their contributions.
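
The normalisation and weighted sum described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `normalise_weights` and `composite_point_estimate` are hypothetical and are not part of any A vs B API.

```python
def normalise_weights(raw_weights):
    """Scale non-negative raw weights so that they sum to 1.0."""
    if any(w < 0 for w in raw_weights):
        raise ValueError("weights must be non-negative")
    total = sum(raw_weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in raw_weights]


def composite_point_estimate(weights, estimates):
    """Weighted sum of per-component point estimates for one variation."""
    return sum(w * m for w, m in zip(weights, estimates))


# The three ways of typing the same preference canonicalise identically.
for raw in ([60, 40], [0.6, 0.4], [3, 2]):
    print(raw, "->", [round(w, 3) for w in normalise_weights(raw)])
```

Because only the ratio between weights matters, all three inputs reduce to the same pair of normalised weights.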

When to use

Use a composite when the experiment must balance competing goals and deciding by-eye across multiple primaries would be subjective:

  • Engagement vs revenue — a UX change that increases session length but lowers revenue per visitor. Composite = 0.5 × session-length + 0.5 × revenue-per-visitor lets the team decide ex-ante what trade-off is acceptable.
  • Conversion vs retention — an onboarding flow that drives more signups but also more 7-day churn. Composite = 0.4 × signup-rate + 0.6 × 30-day-retention encodes the team's stated preference for stickier users.
  • Satisfaction vs activation — a tutorial step that improves NPS but slows activation. Composite = 0.7 × activation-rate + 0.3 × NPS makes activation the dominant driver while still rewarding satisfaction wins.

Effective-primary warning

If any single weight exceeds 0.85, the metric editor surfaces a warning that the composite is effectively a single-metric primary with a token contribution from the other components. A 0.95 × revenue + 0.05 × NPS composite isn't multi-objective — it's revenue with cosmetic NPS. The save isn't blocked (sometimes that's what the analyst wants), but the warning makes the trade-off deliberate.
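
As a rough illustration of the check, the sketch below flags a composite whose largest normalised weight exceeds 0.85. The 0.85 threshold comes from the text; the function name, the dictionary shape, and the message wording are assumptions made for the example.

```python
EFFECTIVE_PRIMARY_THRESHOLD = 0.85  # threshold stated in the text


def effective_primary_warning(normalised_weights):
    """Return a warning string if one component dominates, else None.

    `normalised_weights` maps a component name to its normalised weight
    (hypothetical shape, chosen for the example).
    """
    name, weight = max(normalised_weights.items(), key=lambda kv: kv[1])
    if weight > EFFECTIVE_PRIMARY_THRESHOLD:
        return (
            f"'{name}' carries {weight:.0%} of the composite; "
            "this is effectively a single-metric primary."
        )
    return None  # no warning; saving is never blocked either way


print(effective_primary_warning({"revenue": 0.95, "nps": 0.05}))
print(effective_primary_warning({"activation": 0.7, "nps": 0.3}))
```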

Example

You're testing a new onboarding flow and the product team wants both higher activation and higher 30-day retention. They've agreed in advance that activation matters twice as much as retention.

  1. Create an Activation Custom Metric (a tracking definition that fires on the first meaningful action) and attach a binding for it to your experiment with the Conversion rate measure.
  2. Create a 30-day retention Custom Metric (fires when a 30-day-active session is recorded) and attach a binding for it with the Conversion rate measure.
  3. In the experiment builder's Metrics step, attach a third binding and pick the Composite (weighted) measure. Add the two bindings above as components with weights 0.66 / 0.33 (auto-normalised to 0.667 / 0.333).
  4. Mark the composite binding as the primary metric for the experiment.

After the experiment has data:

```text
Composite (Onboarding success):
  Control: 0.241 (weighted)
  Variant: 0.268 (weighted)
  Lift: +11.2% (95% CI: +3.4% to +19.5%)
  Badge: Composite (weighted)

Decomposition:
  Activation (weight 0.667): +14.8% lift (contributes +9.9%)
  30-day retention (weight 0.333): +3.9% lift (contributes +1.3%)
```

The Decomposition view makes it obvious that activation drove most of the win and that retention nudged in the same direction. If retention had been flat or negative, this view would surface that immediately instead of hiding it inside the composite.
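
One way to obtain contributions that add up to the composite lift is to divide each component's weighted absolute change by the control composite. The sketch below shows that arithmetic. The exact formula A vs B uses is not stated here, so treat this as an assumption; the per-component control and variant values are hypothetical and are not the figures from the output above.

```python
def decompose_lift(weights, control, variant):
    """Split the composite's relative lift into per-component contributions.

    contribution_i = w_i * (variant_i - control_i) / composite_control
    The contributions sum exactly to the composite's relative lift.
    """
    composite_control = sum(w * c for w, c in zip(weights, control))
    return [
        w * (v - c) / composite_control
        for w, c, v in zip(weights, control, variant)
    ]


weights = [2 / 3, 1 / 3]   # activation, retention
control = [0.30, 0.12]     # hypothetical per-component control estimates
variant = [0.33, 0.126]    # hypothetical per-component variant estimates

contributions = decompose_lift(weights, control, variant)
composite_control = sum(w * c for w, c in zip(weights, control))
composite_variant = sum(w * v for w, v in zip(weights, variant))
composite_lift = composite_variant / composite_control - 1

for name, contrib in zip(["activation", "retention"], contributions):
    print(f"{name}: contributes {contrib:+.1%}")
print(f"composite lift: {composite_lift:+.1%}")
```

The additive property is the useful part: in the output shown earlier, +9.9% and +1.3% sum to the +11.2% composite lift.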

How A vs B computes it

The point estimate is the weighted sum across components:

```text
composite = Σᵢ wᵢ · μᵢ
```

The variance combines per-component variances with pairwise covariance — correlated components don't double-count:

```text
Var(composite) = Σᵢ wᵢ² · Var(Xᵢ) + 2 · Σᵢ Σⱼ>ᵢ wᵢ · wⱼ · Cov(Xᵢ, Xⱼ)
```
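
The variance formula translates directly into code. The sketch below is a plain-Python illustration, not the engine's implementation; the covariance matrix values are hypothetical.

```python
def composite_variance(weights, cov):
    """Variance of sum_i w_i * X_i given a symmetric covariance matrix.

    cov[i][i] is Var(X_i); cov[i][j] is Cov(X_i, X_j).
    """
    n = len(weights)
    # Diagonal terms: w_i^2 * Var(X_i)
    var = sum(weights[i] ** 2 * cov[i][i] for i in range(n))
    # Off-diagonal terms, each unordered pair counted once and doubled
    var += 2 * sum(
        weights[i] * weights[j] * cov[i][j]
        for i in range(n)
        for j in range(i + 1, n)
    )
    return var


weights = [2 / 3, 1 / 3]
cov = [
    [4e-5, 1e-5],  # hypothetical Var(activation), Cov(activation, retention)
    [1e-5, 9e-5],  # hypothetical Cov(retention, activation), Var(retention)
]

with_cov = composite_variance(weights, cov)
independent = composite_variance(weights, [[4e-5, 0.0], [0.0, 9e-5]])
print(f"with covariance: {with_cov:.3e}, assuming independence: {independent:.3e}")
```

With positively correlated components, dropping the covariance term understates the variance, which is why the pairwise term is kept.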

Pairwise covariance between components is computed once per result run from co-observed visitor records and reused across the three stats engines:

  • Frequentist — weighted t-test using the combined variance.
  • Bayesian — Normal posterior with weighted mean and combined variance.
  • Sequential — always-valid bound using the same combined variance (wider than the fixed-horizon CI by design).
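
To make the role of the combined variance concrete, the sketch below builds a fixed-horizon interval for the variant-minus-control difference and a flat-prior Normal probability that the variant is better. It uses a normal approximation rather than the t-test named above (at large sample sizes the two are nearly identical), and the variance inputs are hypothetical. It is not the engines' actual code.

```python
from math import sqrt
from statistics import NormalDist


def composite_diff_ci(control_mean, control_var, variant_mean, variant_var, level=0.95):
    """Normal-approximation CI for the difference of two composite estimates.

    `control_var` and `variant_var` are the combined variances of each arm's
    composite estimate (already including the pairwise covariance terms).
    """
    diff = variant_mean - control_mean
    se = sqrt(control_var + variant_var)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, se, (diff - z * se, diff + z * se)


def prob_variant_better(diff, se):
    """P(variant > control) under a Normal posterior with a flat prior."""
    return 1.0 - NormalDist(mu=diff, sigma=se).cdf(0.0)


# Point estimates from the example; variances are hypothetical.
diff, se, (low, high) = composite_diff_ci(0.241, 1.6e-5, 0.268, 1.7e-5)
print(f"diff={diff:.3f}, 95% CI=({low:.4f}, {high:.4f})")
print(f"P(variant better)={prob_variant_better(diff, se):.4f}")
```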

When the co-observation count for a component pair falls below a threshold (default 30), the engine falls back to an upper-bound variance estimate (treating the pair as maximally correlated) and the results row surfaces "conservative CI" — better to widen the interval than to claim precision the data doesn't support.
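
A minimal sketch of that fallback, assuming "maximally correlated" means replacing the sample covariance with its Cauchy-Schwarz upper bound, sqrt(Var(Xᵢ) · Var(Xⱼ)). Since weights are non-negative, this can only widen the interval. The function name and return shape are hypothetical.

```python
from math import sqrt

MIN_CO_OBSERVATIONS = 30  # default threshold stated in the text


def pair_covariance(var_i, var_j, sample_cov, n_co_observed,
                    threshold=MIN_CO_OBSERVATIONS):
    """Return (covariance_to_use, is_conservative) for one component pair."""
    if n_co_observed < threshold:
        # Too few co-observed visitors: assume correlation = 1, the largest
        # covariance any pair with these variances can have.
        return sqrt(var_i * var_j), True
    return sample_cov, False


print(pair_covariance(4.0, 9.0, 1.5, n_co_observed=12))
print(pair_covariance(4.0, 9.0, 1.5, n_co_observed=500))
```

When the second element is true, the results row would carry the "conservative CI" label.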

CUPED and winsorization

Both CUPED and winsorization apply per component before the weighted sum. If you enable winsorization on the composite, each component is winsorized independently at the configured percentile, then the weighted sum is computed. CUPED runs after winsorization and before the variance combination. This ordering means a single noisy component doesn't poison the composite's variance, and CUPED's variance reduction is applied where it's most effective — component-by-component.
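
The per-component ordering (winsorize first, then CUPED) can be sketched as follows. The text does not say whether winsorization is one-sided or two-sided or how the percentile is computed; this example assumes an upper-tail cap with a nearest-rank percentile, and uses the standard CUPED adjustment with a pre-experiment covariate. All data values are hypothetical.

```python
from math import ceil
from statistics import mean, pvariance


def winsorize_upper(values, percentile):
    """Cap values at the given upper percentile (nearest-rank method)."""
    ordered = sorted(values)
    rank = max(1, ceil(percentile * len(ordered) / 100))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]


def cuped_adjust(y, x):
    """Standard CUPED: y_adj = y - theta * (x - mean(x)), theta = Cov(y, x) / Var(x)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return list(y)  # constant covariate carries no information
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    theta = sxy / sxx
    return [b - theta * (a - mx) for a, b in zip(x, y)]


# One component: in-experiment values y and a pre-experiment covariate x.
y = [2.0, 4.1, 5.9, 8.2, 40.0]  # note the outlier
x = [1.0, 2.0, 3.0, 4.0, 5.0]

y_capped = winsorize_upper(y, 80)       # step 1: winsorize
y_adjusted = cuped_adjust(y_capped, x)  # step 2: CUPED on the capped values
print(y_capped)
print(round(pvariance(y_capped), 3), round(pvariance(y_adjusted), 3))
```

Each component would go through both steps on its own before the weighted sum and the variance combination.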

Sample experiment

A worked example: an onboarding redesign with ~22k visitors per arm. The composite metric "Onboarding success" (66.7% activation + 33.3% retention) moves from 0.241 to 0.268, an 11.2% lift with a 95% CI of +3.4% to +19.5%. The Decomposition shows activation lifted +14.8% (contributing +9.9% to the composite) and retention lifted +3.9% (contributing +1.3%). Both components moved in the same direction, and the team can see the composite isn't hiding a flat or negative component — it's a real win on both fronts.

FAQ

What if I don't know what weights to use?

Start with equal weights (e.g., 0.5 / 0.5) and iterate. Many teams use composite metrics specifically to force the up-front decision about what matters more — it's easier to debate weights before an experiment than to argue about whether a flat secondary disqualifies a winning primary after the fact.

Can a composite reference another composite?

No. Nested composites are blocked at save time. The math extends to nested cases but interpretation gets murky fast, and no industry tool supports them.

What happens if a component metric is paused mid-experiment?

The composite flags as "component unavailable" on the results page and you're prompted to amend the analysis plan (via the sealed-plan amendment flow) before results recompute. The composite is never silently re-weighted onto the remaining components — that would change the statistical interpretation of the test.
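
The "flag, never re-weight" behaviour can be illustrated with a small sketch. The function name, the status strings other than "component unavailable", and the use of `None` to mark a paused component are assumptions made for the example.

```python
def composite_or_flag(weights, estimates):
    """Return (value, status) for one variation.

    `weights` maps binding ID -> normalised weight.
    `estimates` maps binding ID -> point estimate, or None if the
    component is unavailable (hypothetical representation).
    """
    missing = [k for k in weights if estimates.get(k) is None]
    if missing:
        # Deliberately no re-normalisation over the remaining components:
        # that would silently change what the composite measures.
        return None, "component unavailable"
    return sum(weights[k] * estimates[k] for k in weights), "ok"


weights = {"activation": 2 / 3, "retention": 1 / 3}
print(composite_or_flag(weights, {"activation": 0.30, "retention": 0.12}))
print(composite_or_flag(weights, {"activation": 0.30, "retention": None}))
```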

Does changing a weight require an amendment?

Yes, if the metric is referenced by a sealed analysis plan. Composite weight changes are recognised by the amendment-detection layer and flow through the existing sealed-plan amendment review with a written rationale.