🧩 mosaic — tokenizer studio

Every LLM reads tokens, not characters. Train a real byte-pair-encoding tokenizer on the corpus below, then watch any text break into a mosaic of tokens — each tile is one token the model would see. The real Python library runs in your browser; no install, no keys.

booting Python runtime…

1 · Train

2 · Tokenize

tokens
bytes
bytes / token

Each tile is one token; color is keyed to the token id, so repeated tokens share a color. Spaces show as · and newlines as . Bigger vocabularies learn longer merges, so common words collapse into single tokens and bytes/token rises — that's compression, and it's why your context window and bill depend on the tokenizer. Python fetched verbatim from src/mosaic.