LLM Visualizer

Learn Transformer Concepts Interactively

This tool visualizes core LLM concepts: tokenization, embeddings, and self-attention. See how the different prompt components are combined and then processed by a simplified Transformer model.

How This Visualizer Works

This tool simulates the first few steps of a Transformer model, like those used in Large Language Models, to demonstrate how text is processed. Here's a breakdown of the pipeline:

1. Input & Tokenization

First, all the text from the prompt boxes is combined into a single string. This string is then broken down into a sequence of tokens. The simulation uses a simple tokenizer that splits on spaces and lowercases each word, so here every token is just a word. Real models use more complex subword tokenizers (such as byte-pair encoding) that can split a word into smaller pieces.
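A minimal sketch of this kind of whitespace tokenizer (the function name and cleanup details are illustrative, not the visualizer's actual source):

```ts
// Simple whitespace tokenizer mirroring the simulation described above.
// (Sketch only; the visualizer's real code may handle punctuation differently.)
function tokenize(prompt: string): string[] {
  return prompt
    .toLowerCase()
    .split(/\s+/)                   // split on runs of whitespace
    .filter((t) => t.length > 0);   // drop empties from leading/trailing spaces
}

// tokenize("Explain the Math behind Geometry")
// -> ["explain", "the", "math", "behind", "geometry"]
```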

2. Themed Embeddings

Words are just text; to work with them mathematically, we must convert them into vectors of numbers called embeddings. A good embedding captures the "meaning" of a word, such that similar words have similar vectors.

This simulation uses a simplified, rule-based system to create plausible embeddings. There are predefined themes (e.g., math_cs, geometry), each with a base vector. When a token matches a theme, its embedding is a slightly "noisy" copy of that theme's base vector. Tokens like math and computer therefore get similar embeddings and appear close together in the "Tokens & Embeddings" chart, forming a semantic cluster.
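A sketch of the idea, where the theme names, base vectors, and noise scale are made-up examples rather than the tool's real values:

```ts
// Theme-based embeddings: each theme has a base vector; a matching token
// gets that vector plus small random noise. All values here are illustrative.
const DIM = 4;

const THEME_BASES: Record<string, number[]> = {
  math_cs:  [1.0, 0.2, 0.0, 0.1],
  geometry: [0.1, 0.9, 0.3, 0.0],
};

const TOKEN_TO_THEME: Record<string, string> = {
  math: "math_cs",
  computer: "math_cs",
  triangle: "geometry",
};

function embed(token: string, noiseScale = 0.05): number[] {
  const base = THEME_BASES[TOKEN_TO_THEME[token]] ?? new Array(DIM).fill(0);
  // Small uniform noise so same-theme tokens cluster without coinciding.
  return base.map((v) => v + (Math.random() - 0.5) * 2 * noiseScale);
}

// embed("math") and embed("computer") both land near the math_cs base vector,
// so they form one cluster in the "Tokens & Embeddings" chart.
```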

3. Causal Self-Attention

The Self-Attention mechanism is the core of the Transformer. It lets the model weigh the importance of the other tokens in the sequence when processing a given token. The "Causal" part means a token can only pay attention to itself and the tokens that came before it. The attention weights are computed in three steps, with a code sketch after the list:

  1. Similarity Score: For each token (the "query"), we compute its similarity to itself and to every preceding token (the "keys") by taking the dot product of their embedding vectors.
  2. Scaling: These scores are divided by the square root of the embedding dimension. This keeps the dot products from growing too large and, in real models, helps stabilize training.
  3. Softmax: The scaled scores in each row of the score matrix are passed through a softmax function, which converts them into a probability distribution so that the weights in each row add up to 1. These are the final attention weights.
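A compact sketch of these three steps, assuming attention is computed directly on the embeddings (real Transformers first project them into learned query, key, and value vectors, which this demo skips):

```ts
// Causal self-attention over raw embeddings, following the three steps above.
function causalAttention(embeddings: number[][]): number[][] {
  const n = embeddings.length;
  const dim = embeddings[0].length;
  const weights: number[][] = [];

  for (let i = 0; i < n; i++) {      // token i is the "query"
    const scores: number[] = [];
    for (let j = 0; j <= i; j++) {   // causal: only itself and earlier tokens
      // Step 1: similarity score via dot product with "key" j.
      let dot = 0;
      for (let d = 0; d < dim; d++) dot += embeddings[i][d] * embeddings[j][d];
      // Step 2: scale by the square root of the embedding dimension.
      scores.push(dot / Math.sqrt(dim));
    }
    // Step 3: softmax so each row sums to 1 (subtract max for stability).
    const max = Math.max(...scores);
    const exps = scores.map((s) => Math.exp(s - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    // Future positions (j > i) get weight 0, completing the causal mask.
    weights.push([...exps.map((e) => e / sum), ...new Array(n - i - 1).fill(0)]);
  }
  return weights;
}
```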

The heatmap visualizes these final weights. The color scale is non-linear (a power scale, scalePow) so that the smaller, non-zero attention weights stay visible instead of being washed out.
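Assuming the heatmap is built with d3-scale, the mapping might look like the sketch below; the exponent and colors are illustrative choices. An exponent below 1 stretches the low end of the [0, 1] domain, which is what makes faint weights easier to see.

```ts
import { scalePow } from "d3-scale";

// Power scale with exponent < 1: boosts small attention weights so faint
// heatmap cells stay visible. Exponent and colors are illustrative.
const color = scalePow<string>()
  .exponent(0.4)
  .domain([0, 1])
  .range(["#ffffff", "#08306b"]);

// color(0.05) sits about 30% of the way to full blue rather than 5%,
// because 0.05 ** 0.4 ≈ 0.30.
```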