🧩 mosaic — tokenizer studio

Every LLM reads tokens, not characters. Train a real byte-pair-encoding tokenizer on the corpus below, then watch any text break into a mosaic of tokens — each tile is one token the model would see. The real Python library runs in your browser; no install, no keys.

GitHub ↗

booting Python runtime…

1 · Train

Vocabulary size · 512 Training corpus (edit to train on your own text)

2 · Tokenize

—

tokens

—

bytes

—

bytes / token

Text to tokenize

Each tile is one token; color is keyed to the token id, so repeated tokens share a color. Spaces show as · and newlines as ⏎. Bigger vocabularies learn longer merges, so common words collapse into single tokens and bytes/token rises — that's compression, and it's why your context window and bill depend on the tokenizer. Python fetched verbatim from src/mosaic.