designpattern.site

Tokenization & Embeddings

How LLMs convert text to numbers — BPE, WordPiece, SentencePiece tokenization, and dense vector embeddings.

Overview

Language models do not read words the way humans do. Under the hood, every model operates entirely on numbers — never on characters or words. Tokenization is the process that bridges raw text and the numeric world: it splits a string into discrete chunks called tokens, and assigns each chunk a unique integer ID. Embeddings then take those integer IDs and map them to dense vectors in a high-dimensional space where geometry encodes meaning.

Understanding this pipeline matters for practical reasons. Token counts determine how much text fits in a model's context window, affect API costs, and influence how well a model handles rare words or non-English languages. Embeddings power semantic search, clustering, classification, and retrieval-augmented generation. These are not internal implementation details to ignore — they shape every decision from prompt design to infrastructure cost.

What Is Tokenization?

Think of tokens as morphemes rather than words. A morpheme is the smallest meaningful unit in a language: "playing" contains the root "play" and the suffix "-ing". Tokenizers work similarly — they break text at boundaries that balance vocabulary size against coverage.

"Hello, world!" might become ["Hello", ",", " world", "!"] — four tokens. "unbelievably" might become ["un", "believ", "ably"] — three tokens. A single Chinese character is often a single token; an unusual technical term in English might span five or six tokens.


The tokenizer converts text to a sequence of integer IDs. The embedding lookup table — a matrix of shape vocab_size x d_model — maps each ID to a learned vector. Those vectors flow through the transformer. The embedding matrix is learned during model training via backpropagation; the tokenizer itself is built beforehand in a separate step and stays fixed.
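In code, the lookup is nothing more than row indexing into a matrix. A toy sketch with made-up sizes and random, untrained values (real tables are learned and far larger):

```python
import random

random.seed(0)
vocab_size, d_model = 8, 4  # toy sizes; real models use e.g. ~50k-200k x 768-4096
# The embedding table E: one row (vector) per token ID
E = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(vocab_size)]

token_ids = [3, 0, 5]                 # what the tokenizer produced
embedded = [E[i] for i in token_ids]  # "lookup" is just row indexing
print(len(embedded), len(embedded[0]))  # 3 vectors, each of length 4
```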

Tokenization Algorithms

Three algorithms dominate modern LLM tokenization. They differ in how they build the vocabulary and where they draw token boundaries.

BPE (Byte Pair Encoding)

BPE was originally a data compression algorithm. Applied to text, it starts with a vocabulary of individual characters (or bytes) and iteratively merges the most frequent adjacent pair. After enough merges, common words like "the" are a single token, while rare words are split into subword pieces.

GPT-2, GPT-3, GPT-4, and the Llama models all use BPE: Llama 1 and 2 apply it through SentencePiece, while Llama 3 uses a byte-level tokenizer in the GPT style. The "byte-level" variant operates on raw UTF-8 bytes rather than Unicode characters, which guarantees that any input — regardless of script — can be tokenized without an "unknown" token.

Step-by-step example on the corpus ["low", "lower", "newest", "widest"]:

  1. Start: l o w, l o w e r, n e w e s t, w i d e s t
  2. Most frequent pair: e s → merge to es: l o w, l o w e r, n e w es t, w i d es t
  3. Next most frequent: es t → merge to est: l o w, l o w e r, n e w est, w i d est
  4. Continue until vocabulary size is reached.

The final vocabulary contains individual characters plus the merged subword units discovered during training.
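The merge loop can be sketched in plain Python. This toy version adds the word frequencies used in the original BPE paper's version of this example (low×5, lower×2, newest×6, widest×3), since frequencies determine which pair wins:

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)  # ties break by first occurrence

def apply_merge(corpus, pair):
    """Rewrite every word, replacing adjacent occurrences of `pair` with one symbol."""
    merged_symbol = pair[0] + pair[1]
    new_corpus = []
    for symbols, freq in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged_symbol)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_corpus.append((out, freq))
    return new_corpus

# Words with their assumed frequencies
corpus = [(list(w), f) for w, f in [("low", 5), ("lower", 2), ("newest", 6), ("widest", 3)]]
merges = []
for _ in range(2):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = apply_merge(corpus, pair)

print(merges)        # [('e', 's'), ('es', 't')]
print(corpus[2][0])  # ['n', 'e', 'w', 'est']
```

A real BPE trainer runs thousands of merge steps and records the merge list, which is then replayed in order to tokenize new text.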

WordPiece

WordPiece, used by BERT and derivatives such as DistilBERT and ELECTRA, chooses merges differently: instead of the most frequent pair, it picks the pair that maximizes the likelihood of the training corpus under the current vocabulary. In practice the results look similar to BPE, but the selection criterion tends to produce vocabularies that generalize slightly better on low-frequency words.

WordPiece uses a ## prefix to mark continuation subwords. This makes it easy to reconstruct word boundaries:

"playing"  →  ["play", "##ing"]
"unbelievably"  →  ["un", "##believ", "##ably"]
"ChatGPT"  →  ["Chat", "##GP", "##T"]

The ## is not arbitrary decoration — it signals "this piece attaches to the previous token without a space."
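Reconstructing words from WordPiece output is therefore mechanical, as this toy helper shows (not the real BERT detokenizer, which also handles casing and special tokens):

```python
def join_wordpiece(tokens):
    """Rejoin WordPiece output: a '##' piece glues onto the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation: attach with no space
        else:
            words.append(tok)     # word start: begin a new word
    return " ".join(words)

print(join_wordpiece(["play", "##ing"]))             # playing
print(join_wordpiece(["un", "##believ", "##ably"]))  # unbelievably
```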

SentencePiece

SentencePiece treats the input as a raw stream of Unicode characters with no language-specific pre-tokenization (no splitting on spaces first). It can apply either BPE or a unigram language model algorithm on top of that stream. Because it is language-agnostic, it handles Japanese, Chinese, Arabic, and mixed-script input without special rules.

T5, mT5, LLaMA 1 and 2, Gemma, and Mistral use SentencePiece. The ▁ character (U+2581, LOWER ONE EIGHTH BLOCK) marks the start of a new word: "Hello world" → ["▁Hello", "▁world"].
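The convention makes decoding trivial: concatenate the pieces and turn each ▁ back into a space. A toy sketch of the convention itself, not of the sentencepiece library:

```python
def sp_decode(pieces):
    """Decode SentencePiece-style pieces: '\u2581' marks a word boundary."""
    return "".join(pieces).replace("\u2581", " ").lstrip(" ")

print(sp_decode(["\u2581Hello", "\u2581world"]))  # Hello world
```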

Comparison Table

Algorithm | Used By | Boundary Strategy | Unknown Tokens
BPE (byte-level) | GPT-2/3/4, Llama 3 | Byte-level, frequency merges | Never (byte fallback)
WordPiece | BERT, DistilBERT | Likelihood-maximizing merges | Rare (uses [UNK])
SentencePiece + BPE | LLaMA 1/2, T5, Gemma | Raw Unicode, no pre-tokenization | Never
SentencePiece + Unigram | mT5, XLNet | Probabilistic segmentation | Never

Token Limits and Cost

Context Window = Token Budget

Every model has a maximum context length measured in tokens, not words or characters. GPT-4o supports 128,000 tokens; Gemini 1.5 Pro supports up to 2 million. Rough rules of thumb for English text: one word is roughly 1.3 tokens on average; one page of prose is roughly 500–750 tokens; one token is roughly 4 characters. Code, URLs, and non-Latin scripts often tokenize less efficiently. Always measure with the actual tokenizer for your model — never assume word counts translate directly.

Token counts also affect API billing. OpenAI charges per million input and output tokens. If a prompt includes a 10,000-word document repeated 100 times across batched requests, you pay for every token on every request. Caching strategies (OpenAI Prompt Caching, Anthropic's cache_control) reduce this cost but require understanding token boundaries to structure prompts correctly.
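To make that arithmetic concrete, here is a back-of-envelope sketch; the 4-characters-per-token heuristic and the per-million prices below are placeholders, not any provider's actual numbers:

```python
def estimate_tokens(text):
    """Crude heuristic: ~4 characters per English token.
    Always verify with the model's real tokenizer before relying on it."""
    return max(1, len(text) // 4)

def estimate_cost_usd(input_tokens, output_tokens, usd_per_m_in, usd_per_m_out):
    """Dollar cost given per-million-token prices (placeholder prices)."""
    return input_tokens / 1e6 * usd_per_m_in + output_tokens / 1e6 * usd_per_m_out

document = "word " * 10_000              # stand-in for a 10,000-word document
per_request = estimate_tokens(document)  # ~12,500 tokens by the heuristic
total_input = per_request * 100          # resent on each of 100 batched requests
cost = estimate_cost_usd(total_input, 0, usd_per_m_in=2.50, usd_per_m_out=10.00)
print(per_request, total_input, round(cost, 2))
```

The point of the exercise: a document resent on every request multiplies its token cost by the request count, which is exactly what prompt caching is designed to avoid.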

What Is an Embedding?

Once the tokenizer produces a sequence of integer IDs, each ID is passed through an embedding lookup table — a matrix E of shape [vocab_size, d_model]. Looking up token ID 42 retrieves row 42 of E: a vector of d_model floating-point numbers (commonly 768 for BERT-base, 1024 for BERT-large, up to 4096 for large GPT-style models).

This matrix is not hand-crafted. It is a learned parameter of the model, updated during training via backpropagation. The training signal pushes tokens that appear in similar contexts toward similar regions of the vector space. After training converges, the geometry of the embedding space encodes semantic relationships.

The classic demonstration: king - man + woman ≈ queen. The vector arithmetic works because "royalty" and "gender" are approximately separable dimensions in the space, an emergent property of training on large text corpora.
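A toy illustration with made-up 2-D vectors, where axis 0 stands for royalty and axis 1 for gender (real spaces have hundreds of dimensions and no such clean, labeled axes):

```python
# Hypothetical 2-D vectors: axis 0 = "royalty", axis 1 = "gender"
vectors = {
    "king":  [1.0,  1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
}
# king - man + woman, computed component-wise
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
print(result == vectors["queen"])  # True: the arithmetic lands on "queen"
```

In a trained embedding space the equality is only approximate: the nearest neighbor of the result vector is usually, not always, the expected word.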

Semantic Similarity and Vector Space


Semantically related tokens cluster together because co-occurrence patterns in training data shape the space. "Dog" and "wolf" appear near each other; "dog" and "compiler" do not. This structure is not hard-coded — it emerges from billions of text examples.

Cosine similarity is the standard metric for comparing two embedding vectors a and b:

similarity = (a · b) / (||a|| × ||b||)

A score of 1.0 means the vectors point in the same direction (identical semantics in the model's representation). A score of 0.0 means orthogonal (unrelated). A score close to -1.0 means opposite directions (antonyms sometimes land here, though this is not guaranteed).
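In code, the formula is a few lines over the dot product and norms (a plain-Python sketch; NumPy or sentence-transformers provide optimized versions):

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```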

Code Examples

The examples below show two things: tokenizing text with tiktoken (the BPE tokenizer used by OpenAI models), and computing sentence embeddings using a pre-trained model.

# pip install tiktoken transformers torch sentence-transformers

# --- Part 1: Tokenization with tiktoken ---
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

text = "Tokenization splits text into tokens."
token_ids = enc.encode(text)
tokens = [enc.decode([tid]) for tid in token_ids]

print(f"Text:      {text}")
print(f"Token IDs: {token_ids}")
print(f"Tokens:    {tokens}")
print(f"Count:     {len(token_ids)}")
# Text:      Tokenization splits text into tokens.
# Token IDs: [seven integers — exact values depend on the encoding version]
# Tokens:    ['Token', 'ization', ' splits', ' text', ' into', ' tokens', '.']
# Count:     7

# --- Part 2: Token embeddings with transformers ---
from transformers import AutoTokenizer, AutoModel
import torch

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("Hello world", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state: shape [batch, seq_len, 768]
token_embeddings = outputs.last_hidden_state  # one vector per token
print(f"Token embeddings shape: {token_embeddings.shape}")
# torch.Size([1, 4, 768])  — [CLS] Hello world [SEP]

# --- Part 3: Sentence embedding via mean pooling ---
attention_mask = inputs["attention_mask"]
token_emb = outputs.last_hidden_state
mask_expanded = attention_mask.unsqueeze(-1).float()
sentence_embedding = (token_emb * mask_expanded).sum(1) / mask_expanded.sum(1)
print(f"Sentence embedding shape: {sentence_embedding.shape}")
# torch.Size([1, 768])

# --- Part 4: Cosine similarity ---
from sentence_transformers import SentenceTransformer, util

st_model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "A dog is running in the park.",
    "A puppy is playing outside.",
    "The compiler threw a syntax error.",
]
embeddings = st_model.encode(sentences, convert_to_tensor=True)

sim_01 = util.cos_sim(embeddings[0], embeddings[1]).item()
sim_02 = util.cos_sim(embeddings[0], embeddings[2]).item()
print(f"Dog / puppy similarity:   {sim_01:.3f}")  # ~0.85
print(f"Dog / compiler similarity: {sim_02:.3f}")  # ~0.08

Token Embeddings vs Sentence Embeddings

A token embedding is a vector for a single token. A sentence embedding is a single vector representing an entire input sequence. These serve different purposes and are obtained differently.

Token embeddings are used inside the model itself — the transformer attends over a sequence of them, refining each token's representation at every layer using context from all other tokens. The final-layer token embeddings carry context-sensitive meaning: the word "bank" will have different final-layer embeddings in "river bank" and "investment bank."

Sentence embeddings compress the entire sequence into one vector, typically by mean-pooling the token embeddings or by using the special [CLS] token's output (as in BERT-style models). They are used for tasks that require comparing two texts: semantic search, duplicate detection, clustering, and retrieval-augmented generation. Models like sentence-transformers/all-MiniLM-L6-v2 and OpenAI's text-embedding-3-small are specifically fine-tuned to produce high-quality sentence embeddings.

Common Misconception

The embedding layer at the start of a transformer is not the same as the contextual representations produced at later layers. The input embedding table maps token IDs to static vectors — "bank" always starts at the same point. By the final layer, "bank" has been transformed by attention and feed-forward operations into a context-sensitive representation that reflects its meaning in this specific sentence.

In Practice

Semantic search and RAG — Sentence embeddings are indexed in vector databases (Pinecone, Weaviate, pgvector, Chroma). At query time, the query is embedded with the same model, and approximate nearest-neighbor search returns the most semantically similar documents. The retrieved chunks are then injected into the LLM prompt.
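The retrieval step reduces to nearest-neighbor search over vectors. A brute-force sketch with made-up 3-D embeddings standing in for real ones (production systems embed with a model and use approximate-nearest-neighbor indexes):

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical document embeddings (real ones have hundreds of dimensions)
index = {
    "doc_dogs":      [0.90, 0.10, 0.00],
    "doc_wolves":    [0.80, 0.20, 0.10],
    "doc_compilers": [0.00, 0.10, 0.90],
}
query_vec = [0.85, 0.15, 0.05]  # pretend: the embedded user query

# Exact brute-force ranking; vector databases replace this with ANN search
ranked = sorted(index, key=lambda doc: cos_sim(index[doc], query_vec), reverse=True)
print(ranked)  # most similar first; the top chunks get injected into the prompt
```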

Classification and clustering — Embedding a dataset of customer support tickets and running k-means clustering reveals natural topic groupings without labeled data. The same approach powers content recommendation and spam detection.

Cross-lingual retrieval — Multilingual embedding models (mBERT, LaBSE, OpenAI's multilingual embeddings) produce vectors where semantically equivalent sentences in different languages land near each other. A query in English can retrieve relevant documents in Japanese.

Token budget management — Production systems use tiktoken or the Hugging Face tokenizer for the target model to count tokens before every API call, trim context windows precisely, and measure prompt template overhead.
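A minimal sketch of budget trimming. The whitespace "tokenizer" below is a stand-in so the example stays self-contained; in production you would pass the real encode/decode pair from tiktoken or the model's Hugging Face tokenizer:

```python
def trim_to_budget(text, budget, encode, decode):
    """Keep only the last `budget` tokens of `text` (the most recent context)."""
    ids = encode(text)
    if len(ids) <= budget:
        return text
    return decode(ids[-budget:])

# Stand-in tokenizer for the demo: whitespace splitting.
def encode(s):
    return s.split()

def decode(toks):
    return " ".join(toks)

history = "turn1 turn2 turn3 turn4 turn5"
print(trim_to_budget(history, 3, encode, decode))  # turn3 turn4 turn5
```

Trimming on token IDs rather than characters guarantees the result actually fits the model's context window.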

Key Papers and Resources

Word2Vec (2013) — Mikolov et al., "Efficient Estimation of Word Representations in Vector Space." Introduced the idea that word vectors trained on large corpora encode semantic relationships through vector arithmetic. The foundational paper for the field. arxiv.org/abs/1301.3781

GloVe (2014) — Pennington et al., "GloVe: Global Vectors for Word Representation." Trained on global co-occurrence statistics rather than local windows; competitive with Word2Vec on many benchmarks. nlp.stanford.edu/projects/glove

BERT (2018) — Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Showed that contextual embeddings from a bidirectional transformer substantially outperform static word embeddings on downstream tasks. arxiv.org/abs/1810.04805

Sentence-BERT (2019) — Reimers and Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." Introduced the siamese fine-tuning approach that makes BERT practical for semantic similarity at scale. arxiv.org/abs/1908.10084

OpenAI Embeddings — The text-embedding-3-small and text-embedding-3-large models are strong general-purpose sentence embedding models. Documentation at platform.openai.com/docs/guides/embeddings.

tiktoken — OpenAI's fast BPE tokenizer library for Python and Node.js: github.com/openai/tiktoken

Connections

Transformer Architecture — Embeddings are the input to the transformer. Positional encodings are added to the embedding vectors before the first attention layer. Understanding embeddings clarifies why transformers are permutation-invariant without positional encodings.

RAG and Retrieval — Sentence embeddings are the core data structure in every retrieval-augmented generation system. The quality of the embedding model directly determines recall: if the embedder does not place relevant documents near the query vector, no amount of LLM sophistication will recover them.

Fine-tuning — When fine-tuning a model, the embedding table is typically updated along with the rest of the weights. Techniques like LoRA can freeze the embedding layer to reduce memory usage, which works well when the vocabulary is already well-covered by pre-training.
