Multi-Armed Bandits

Bandit rules let the platform automatically shift traffic toward better-performing variations as reward data accumulates, rather than waiting for a fixed experiment to reach significance. You configure the algorithm, attach a reward metric, and the SDK serves each user the variation most likely to maximise that metric.

What it is

A multi-armed bandit is an adaptive algorithm that balances two competing goals: exploration (trying variations that have too little data to judge confidently) and exploitation (serving the variation that currently looks best). Unlike an A/B test, which holds a fixed traffic split for the entire experiment, a bandit updates its model continuously and re-weights allocations as data arrives.

A vs B supports three algorithms, all operating on the same BanditConfig shape attached to a flag rule:

Epsilon-greedy: serves the best known variation with probability 1 - explorationRate, and a random variation with probability explorationRate. Simple and predictable.
Thompson sampling: models each variation's reward as a Beta(α, β) distribution and samples from it at decision time. Naturally explores more when estimates are uncertain and converges quickly once a clear winner emerges.
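UCB1: selects the variation with the highest upper confidence bound, mean + sqrt(2 ln N / n_i), where N is the total number of decisions so far and n_i is the number of times variation i has been served. Selection involves no randomness: given the same statistics, it always picks the same variation. Suited to cases where you need reproducible, non-random allocation.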
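The epsilon-greedy and UCB1 rules described above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the ArmStats shape, the function names, and the injectable rng parameter are assumptions made for the example.

```typescript
// Hypothetical per-variation statistics; not the platform's actual model shape.
interface ArmStats {
  id: string
  mean: number   // observed average reward
  pulls: number  // times this variation has been served (n_i)
}

// Index of the arm with the highest observed mean (first one wins ties).
function bestIndex(arms: ArmStats[]): number {
  let best = 0
  for (let i = 1; i < arms.length; i++) {
    if (arms[i].mean > arms[best].mean) best = i
  }
  return best
}

// Epsilon-greedy: explore with probability explorationRate, otherwise exploit.
// rng is injectable so the behaviour can be checked deterministically.
function epsilonGreedy(
  arms: ArmStats[],
  explorationRate: number,
  rng: () => number = Math.random,
): string {
  if (rng() < explorationRate) {
    // Explore: uniform pick over all arms, the current best included.
    return arms[Math.floor(rng() * arms.length)].id
  }
  return arms[bestIndex(arms)].id
}

// UCB1: pick the arm maximising mean + sqrt(2 ln N / n_i).
// An arm that has never been served is chosen first, since its bound is unbounded.
function ucb1(arms: ArmStats[]): string {
  const total = arms.reduce((sum, a) => sum + a.pulls, 0)
  let bestId = arms[0].id
  let bestScore = -Infinity
  for (const arm of arms) {
    if (arm.pulls === 0) return arm.id
    const score = arm.mean + Math.sqrt((2 * Math.log(total)) / arm.pulls)
    if (score > bestScore) {
      bestScore = score
      bestId = arm.id
    }
  }
  return bestId
}
```

Note how UCB1 can prefer an arm with a lower mean when it has been served only a handful of times: the confidence term dominates until enough data accumulates.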

When to use it

Prefer a bandit rule over a standard A/B test when:

You have several content variations (e.g., five hero images, three email subject lines) and do not care about statistical inference — you just want the best one served as quickly as possible.
The cost of serving a losing variation is high (e.g., real money, user churn) and you want to minimise regret rather than wait for a fixed experiment to conclude.
You are running a recommendation or personalisation use case where contextual attributes (user segment, page context) should influence which variation is optimal.
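To make "regret" concrete, the following sketch compares the expected reward lost by a fixed 50/50 split with that of an allocation that has shifted toward the better variation. The conversion rates, user counts, and the expectedRegret helper are illustrative assumptions, not platform data or API.

```typescript
// Expected regret of an allocation: for each arm, users served × (best rate − arm rate).
function expectedRegret(rates: number[], served: number[]): number {
  const best = Math.max(...rates)
  return rates.reduce((sum, rate, i) => sum + served[i] * (best - rate), 0)
}

// Illustrative numbers only: two variations converting at 5% and 10%, 1,000 users.
const rates = [0.05, 0.10]
const equalSplit = expectedRegret(rates, [500, 500]) // fixed 50/50 test
const adaptive = expectedRegret(rates, [100, 900])   // allocation that averaged 10/90

console.log(equalSplit.toFixed(1), adaptive.toFixed(1)) // prints "25.0 5.0"
```

Under these assumed numbers, the fixed split gives up about 25 expected conversions, while the adaptive allocation gives up about 5.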

Stick with a standard A/B test when you need clean statistical significance, want to measure multiple metrics simultaneously, or are doing primary metric analysis for a business decision that requires rigorous inference.

How it works

The bandit configuration lives in the flag's rule, and the reward model is stored server-side and refreshed on each datafile update. The SDK reads the current model snapshot and selects an action (variation) at evaluation time:

// BanditConfig shape (part of the flag rule in the datafile)
interface BanditConfig {
  algorithm: 'epsilon-greedy' | 'thompson-sampling' | 'ucb1'
  explorationRate?: number     // only for epsilon-greedy (0–1)
  rewardMetric: string         // event key tracked via client.track()
  actions: Array<{
    id: string
    variationId: string
    contextAttributes?: Record<string, number | string>
  }>
}

// BanditModel snapshot — produced by offline training, read by SDK
interface BanditModel {
  version: string
  algorithm: 'epsilon-greedy' | 'thompson-sampling' | 'ucb1'
  perAction: Record<string, { mean: number; variance: number; samples: number }>
}
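As an illustration of selecting an action from a model snapshot, here is a rough sketch of Thompson sampling over a perAction record. It is not the SDK's code and rests on several assumptions: rewards are binary (0 or 1), so mean × samples approximates the success count; each action starts from a uniform Beta(1, 1) prior; and all function names are hypothetical. Beta draws are built from Gamma draws using the Marsaglia-Tsang method.

```typescript
type Rng = () => number

interface ActionStats { mean: number; variance: number; samples: number }

// Standard normal draw via the Box-Muller transform.
function sampleNormal(rng: Rng): number {
  const u1 = 1 - rng() // in (0, 1], avoids log(0)
  const u2 = rng()
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
}

// Gamma(shape, 1) draw via Marsaglia-Tsang; valid for shape >= 1.
function sampleGamma(shape: number, rng: Rng): number {
  const d = shape - 1 / 3
  const c = 1 / Math.sqrt(9 * d)
  for (;;) {
    const x = sampleNormal(rng)
    const v = (1 + c * x) ** 3
    if (v <= 0) continue
    const u = 1 - rng()
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v
  }
}

// Beta(a, b) draw as X / (X + Y) with X ~ Gamma(a) and Y ~ Gamma(b).
function sampleBeta(a: number, b: number, rng: Rng): number {
  const x = sampleGamma(a, rng)
  const y = sampleGamma(b, rng)
  return x / (x + y)
}

// Draw one sample per action from its posterior and return the action with the largest draw.
function thompsonSelect(
  perAction: Record<string, ActionStats>,
  rng: Rng = Math.random,
): string {
  let bestId = ''
  let bestDraw = -Infinity
  for (const [id, stats] of Object.entries(perAction)) {
    // Assumption: binary rewards, so successes ≈ mean × samples.
    const successes = stats.mean * stats.samples
    const alpha = 1 + successes                    // uniform prior + successes
    const beta = 1 + (stats.samples - successes)   // uniform prior + failures
    const draw = sampleBeta(alpha, beta, rng)
    if (draw > bestDraw) {
      bestDraw = draw
      bestId = id
    }
  }
  return bestId
}
```

With few samples the posteriors are wide and any action can win a draw; as samples grow, the draws concentrate around each mean and the leading action is selected almost every time.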

Each decision is logged as a BanditDecisionLogEntry, which extends the standard DecisionLogEntry with the action ID, the model version, the probability assigned to the action, and the optimality gap (epsilon at decision time for epsilon-greedy; null for UCB1).
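A log entry of this kind might look roughly like the sketch below. The field names are guesses inferred from the description above, the DecisionLogEntry stub stands in for the real type (whose fields are not shown here), and the action ID and model version are placeholder values. The helper shows how the logged probability could be derived for epsilon-greedy, assuming exploration picks uniformly among all k actions.

```typescript
// Stand-in for the standard entry; its real fields are not shown in this section.
interface DecisionLogEntry {
  flagKey: string
  variationKey: string
}

// Hypothetical field names inferred from the prose description.
interface BanditDecisionLogEntry extends DecisionLogEntry {
  actionId: string
  modelVersion: string
  actionProbability: number     // probability the policy assigned to the chosen action
  optimalityGap: number | null  // epsilon for epsilon-greedy, null for UCB1
}

// Under epsilon-greedy with uniform exploration over k actions, the best action
// is chosen with probability (1 - epsilon) + epsilon / k, and every other
// action with probability epsilon / k.
function epsilonGreedyProbability(isBest: boolean, epsilon: number, k: number): number {
  const exploreShare = epsilon / k
  return isBest ? 1 - epsilon + exploreShare : exploreShare
}

// Example entry for the best of five actions at epsilon = 0.1 (placeholder IDs).
const entry: BanditDecisionLogEntry = {
  flagKey: 'hero-image-bandit',
  variationKey: 'variant-b',
  actionId: 'action-b',
  modelVersion: 'model-v1',
  actionProbability: epsilonGreedyProbability(true, 0.1, 5),
  optimalityGap: 0.1,
}
```

Recording the assignment probability alongside each decision is what makes later off-policy analysis possible, since rewards can be re-weighted by how likely each action was to be served.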

Bandit assignments are sticky by default. Once a user is assigned a variation, they keep it until the model re-evaluates them into a new action bucket. See Sticky Bucketing for details on the storage backends.

Per-SDK usage

@avsbhq/browser

import { AvsbClient } from '@avsbhq/browser'

const client = new AvsbClient({ sdkKey: 'sdk_production_abc123' })
await client.onReady()

// Evaluation — same API as any flag
const flag = client.getFlag('hero-image-bandit', 'control')
// flag.source === 'bandit', flag.variationKey === 'variant-b' (example)

// Track the reward metric the bandit is optimising on
document.querySelector('.cta')?.addEventListener('click', () => {
  client.track('hero_cta_click')
})

@avsbhq/node

import { AvsbServer } from '@avsbhq/node'

const server = new AvsbServer({ sdkKey: process.env.AVSB_SDK_KEY })
await server.onReady()

// Server-side evaluation with context
// (userContext is the evaluation context for the current user, defined elsewhere)
const flag = server.getFlag('pricing-plan-bandit', 'starter', userContext)

// Track reward event (e.g., after a plan upgrade)
server.track('plan_upgrade', { value: 49, context: userContext })

avsb-python

import os

from avsb import AvsbServer

server = AvsbServer(sdk_key=os.environ["AVSB_SDK_KEY"])
server.wait_for_ready()

# Evaluate — identical API to non-bandit flags
# (context is the evaluation context for the current user, defined elsewhere)
flag = server.get_flag("email-subject-bandit", "control", context)

# Track reward
server.track("email_open", context=context)

avsb-java

import java.time.Duration;

import cloud.avsb.AvsbServer;
import cloud.avsb.core.EvalContext;

AvsbServer server = AvsbServer.builder()
    .sdkKey(System.getenv("AVSB_SDK_KEY"))
    .build();
server.blockUntilReady(Duration.ofSeconds(5));

// ctx is the EvalContext for the current user, built elsewhere
Flag<String> flag = server.getFlag("email-subject-bandit", "control", ctx);

// Track reward metric
server.track("email_open", ctx);
