One-line definition
Perplexity is the exponentiated average per-token cross-entropy of a language model on a held-out corpus: $\mathrm{PPL} = \exp(H)$, where $H = -\tfrac{1}{N}\sum_{i=1}^{N}\ln p_\theta(x_i \mid x_{<i})$. Bits per token (BPT) is the same quantity in base-2 units: $\mathrm{BPT} = H / \ln 2 = \log_2 \mathrm{PPL}$.
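A quick numeric check of the conversions (the cross-entropy value is made up for illustration): a model averaging $H = 2.3$ nats/token has
$$\mathrm{PPL} = e^{2.3} \approx 10.0, \qquad \mathrm{BPT} = \frac{2.3}{\ln 2} \approx 3.32 \ \text{bits/token}.$$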
Why it matters
Perplexity is the dominant intrinsic metric during pretraining: cheap to compute, smooth, and monotonically related to next-token log-likelihood. Loss curves are usually reported as cross-entropy or perplexity, and scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022, "Chinchilla") are fit in loss / perplexity space.
But: perplexity is a poor evaluation of model usefulness. Lower perplexity does not reliably mean better summarization, instruction following, code, or chat. For these you need task evals (HumanEval, MMLU, MT-Bench, etc.).
The formula
For a tokenized sequence $x_1, \dots, x_N$ and a model $p_\theta$:
$$H = -\frac{1}{N}\sum_{i=1}^{N} \ln p_\theta(x_i \mid x_{<i})$$
$H$ is the average cross-entropy in nats (natural log) per token. Then:
- Perplexity: $\mathrm{PPL} = e^{H}$
- Bits per token: $\mathrm{BPT} = H / \ln 2$
- Bits per byte: $\mathrm{BPB} = \mathrm{BPT}$ divided by mean bytes per token (depends on the tokenizer)
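As a minimal sketch of these conversions in code, assuming you already have per-token log-probabilities from a model (the numbers below are illustrative, not real model outputs):

```python
import math

# Hypothetical per-token log-probabilities (natural log) that a model assigned
# to each token of a held-out sequence; in practice these come from the model's
# forward pass. The values and the byte count are illustrative only.
log_probs = [-1.2, -0.3, -2.7, -0.9, -1.5]
n_bytes = 22  # bytes of the raw text behind these tokens (assumed)

H = -sum(log_probs) / len(log_probs)      # average cross-entropy, nats/token
ppl = math.exp(H)                         # perplexity
bpt = H / math.log(2)                     # bits per token
bpb = bpt * len(log_probs) / n_bytes      # bits per byte = total bits / total bytes

print(f"H={H:.3f} nats/token  PPL={ppl:.2f}  BPT={bpt:.3f}  BPB={bpb:.3f}")
```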
Interpretation of perplexity: the model is "as confused as if it had to choose uniformly among $k$ equally likely next tokens," where $k$ is the perplexity. Since a uniform distribution over $k$ outcomes has entropy $\log_2 k$ bits, a perplexity of 20 means the model's average prediction is as good as a uniform choice over 20 next-token candidates, i.e. about $\log_2 20 \approx 4.3$ bits per token.
Tokenizer dependence
Perplexity is not comparable across tokenizers. Different tokenizers split the same text into different numbers of tokens, so the per-token cross-entropy differs even when the model perfectly captures the underlying language.
To compare across tokenizers, convert to bits per byte (BPB) or bits per character:
$$\mathrm{BPB} = \frac{\text{total cross-entropy in bits}}{\text{total bytes of raw text}} = \mathrm{BPT} \times \frac{N_{\text{tokens}}}{N_{\text{bytes}}}$$
This is invariant to tokenization and is the proper cross-tokenizer comparison metric.
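A small illustration of why BPB, not per-token perplexity, is the right cross-tokenizer comparison; the model names, token counts, and losses below are hypothetical:

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert total negative log-likelihood (nats) over a corpus to bits per byte."""
    return total_nll_nats / (n_bytes * math.log(2))

# The same 1 MB of held-out text scored by two models with different tokenizers.
# Token counts and average losses are made up, chosen only to illustrate the point.
n_bytes = 1_000_000
models = {
    "A (coarse tokenizer)": {"n_tokens": 250_000, "avg_nll_nats": 2.0},
    "B (fine tokenizer)":   {"n_tokens": 400_000, "avg_nll_nats": 1.4},
}

for name, m in models.items():
    ppl = math.exp(m["avg_nll_nats"])  # per-token perplexity: NOT comparable across tokenizers
    bpb = bits_per_byte(m["avg_nll_nats"] * m["n_tokens"], n_bytes)  # comparable
    print(f"model {name}: PPL={ppl:.2f}  BPB={bpb:.3f}")
```

Model B has the lower per-token perplexity only because its tokenizer emits more, easier-to-predict tokens; on BPB, model A actually assigns the text higher total likelihood.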
Typical numbers (for context)
On a clean English text corpus:
| Model | Perplexity (per token) | BPB (bits/byte) |
|---|---|---|
| Trigram (Brown corpus 1992) | ~150 | ~1.5 |
| RNN LM (Mikolov 2010, PTB) | ~80 | — |
| GPT-2 small (Radford 2019, WebText) | ~30 | ~1.0 |
| GPT-3 (Brown 2020, books) | ~10–20 | ~0.7 |
| Frontier 2024–2026 LLM | ~5–10 | ~0.5 |
Shannon's classic experiments put the entropy of printed English at roughly 0.6–1.3 bits per character, so ~0.6 bits/byte is close to the estimated floor for how far this number can fall.
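To relate the two columns, assume an English BPE tokenizer averaging roughly 4–5 bytes per token (an assumption; the exact figure depends on tokenizer and corpus). Then
$$\mathrm{BPT} \approx 0.5 \times (4\text{–}5) \approx 2\text{–}2.5 \ \text{bits/token} \;\Rightarrow\; \mathrm{PPL} = 2^{\mathrm{BPT}} \approx 4\text{–}5.7,$$
which lands near the low end of the frontier-model row above.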
Why perplexity ≠ usefulness
A model that has memorized its training data can approach perplexity 1 on it without generalizing at all. A model with low perplexity on web text may still:
- Generate fluent nonsense in long-form generation.
- Fail at instruction following (perplexity isn’t measured on that distribution).
- Fail at tool use, reasoning, or code (a small perplexity gap can hide a huge capability gap).
Production model evaluation always requires task evals.
Common pitfalls
- Comparing perplexities across tokenizers. Use BPB.
- Reporting perplexity on the training set. Always use a held-out test corpus (a minimal measurement sketch follows this list).
- Reporting perplexity on a domain very different from your eval target. A model with low perplexity on Wikipedia may be terrible at code; pick a test corpus that matches the deployment distribution.
- Treating every perplexity drop as a quality win. Sub-half-bit improvements are common during fine-tuning but often don't translate into user-facing quality.
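To avoid the first two pitfalls, here is a minimal measurement sketch, assuming the Hugging Face transformers and torch packages (gpt2 is just a placeholder checkpoint; a real run should slide a window over a full held-out corpus that matches your deployment domain):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "A held-out paragraph that was not part of the training data."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels=input_ids the model returns the mean cross-entropy in nats/token
    # (averaged over the shifted positions, i.e. all tokens except the first).
    out = model(**enc, labels=enc["input_ids"])

H = out.loss.item()                        # average cross-entropy, nats/token
ppl = math.exp(H)                          # perplexity
bpt = H / math.log(2)                      # bits per token
bpb = bpt * enc["input_ids"].numel() / len(text.encode("utf-8"))  # approximate bits per byte

print(f"PPL={ppl:.2f}  BPT={bpt:.3f}  BPB={bpb:.3f}")
```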
Related
- Tokenization. Affects how perplexity is computed.
- LLM evals: the hardest part. Why intrinsic metrics fail for production LLMs.