TransformerTales

Exploring the Creative Side of Transformer AI and LLMs

Unpacking the Mechanics of Transformer LLMs

[Image: A visual representing AI model design, digital art style]

“Deep in the human unconscious is a pervasive need for a logical universe that makes sense. But the real universe is always one step beyond logic.”

Frank Herbert (from Dune)

In every captivating narrative, the foundation lies in the details: the choice of words, the rhythm of sentences, and the subtext beneath the dialogue. In the digital realm, the transformer Large Language Model (LLM) mirrors this intricate process. Like the unseen drafts and revisions behind a published novel, the LLM operates behind the scenes, taking our words and crafting responses that can surprise, inform, or inspire. But what makes this AI “writer” tick? How does it understand our questions and curate its answers? Let’s delve into the mechanisms of this digital storyteller and unveil its inner workings.

Terminology

Attention Mechanism: The Art of Focus
Consider a director in a play, choosing where the spotlight should shine during a crucial scene. The attention mechanism in LLMs does just that. It decides which parts of your input (or query) should receive more focus, ensuring that the model grasps the essence of what you’re trying to convey.
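If you like seeing ideas in code, here is a tiny illustrative sketch of that spotlight at work. The numbers are made up and this is not how any particular model lays out its internals, but the core math (scaled dot-product attention with a softmax) is the real idea:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weigh the values V by how well each query in Q matches each key in K."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how strongly each token relates to the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: the "spotlight" distribution
    return weights @ V, weights                     # blended values, plus the focus pattern

# Three made-up 2-D token vectors standing in for a three-token input.
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
output, attn = scaled_dot_product_attention(x, x, x)
print(attn)  # each row shows how much one token "attends" to the others
```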

Context: The Short-term Memory
LLMs, like storytellers, remember bits of the ongoing story to ensure continuity. They have a ‘context’ which is their short-term memory, keeping track of recent interactions to provide coherent replies.

Model & Parameters: The Brain and Neurons
The overall model is like the brain, and inside it are countless parameters, akin to neurons. They hold the learned patterns and relationships, allowing the model to generate meaningful responses.

Tokens: The Words and Phrases
In writing, every word and phrase crafts the narrative, setting the tone and pace of a story. In LLMs, these individual words or segments of words are termed as tokens. They form the fundamental units that the model reads, interprets, and responds to.
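To make tokens concrete, here is a quick sketch using the open-source tiktoken library (the tokenizer behind several OpenAI models). Other models ship their own tokenizers, so the exact splits and counts below are illustrative rather than universal:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Deep in the human unconscious is a pervasive need for a logical universe."
tokens = enc.encode(text)

print(tokens)                              # the integer IDs the model actually sees
print([enc.decode([t]) for t in tokens])   # the word pieces those IDs stand for
print(len(text.split()), "words ->", len(tokens), "tokens")
```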

Weights: The Learned Experience
Every artist draws from their experiences, recalling past lessons to create something new. LLMs have ‘weights’ which are like memories from their training. These weights determine how the model responds based on patterns it has seen before.

Transformer LLM Architecture

Now that you know the key terms, let’s take a deeper dive into how transformer LLMs actually function:

During training of a transformer Large Language Model (LLM) like ChatGPT, text is broken down into pieces (tokens), and the model learns the relationships between those pieces, storing them as numerical values (the weights that make up its parameters). Imagine a 3D scatterplot where each piece of data is connected to others based on its relevance. Everything a pretrained model will ever know is contained in these weights, which brings us to the next important point: context and persistence.
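To make the scatterplot analogy a little more concrete, here is a toy sketch with hand-picked, invented vectors. A real model learns billions of such values during training, in far more than three dimensions:

```python
import numpy as np

# Invented 3-D "embedding" vectors purely for illustration; a real model learns these.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    """How closely two points in the 'scatterplot' point in the same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["king"], embeddings["queen"]))  # high: closely related concepts
print(cosine(embeddings["king"], embeddings["apple"]))  # low: weakly related concepts
```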

Without retraining, fine-tuning, or applying new instructions in a specific format, a model only retains the knowledge it gained during training and won’t retain any new information between requests. If you’ve seen Finding Nemo, it would be a little like having a conversation with the forgetful fish, Dory.

When you interact with a model (inference), the User Interface (UI) can compensate for a model’s inherent lack of short-term memory by maintaining a history of all previous user requests (prompts) and the AI’s subsequent replies. The UI can then resend the entire conversation history back with each new request. As each request becomes longer, the number of connections between data points grows.
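Here is a minimal sketch of that bookkeeping. The call_model function is a hypothetical stand-in for whatever API or local model the UI actually talks to; the point is simply that the entire transcript is resent on every turn:

```python
def call_model(messages):
    """Hypothetical stand-in for a real API call or a locally loaded model."""
    return f"(model reply to: {messages[-1]['content']!r})"

history = [{"role": "system", "content": "You are a helpful storytelling assistant."}]

def chat(user_prompt):
    # The model itself remembers nothing between calls, so the UI resends
    # the whole transcript (system prompt plus every prior turn) each time.
    history.append({"role": "user", "content": user_prompt})
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

chat("Tell me a story about a forgetful fish.")
chat("What was the fish's name again?")   # only answerable because turn one was resent
print(len(history), "messages will accompany the next request")
```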

As we continue to “chat” with the AI, the conversation history grows until the maximum buffer size (the context window) is reached, at which point the UI might begin removing the oldest or least relevant data. As a result, transformer AIs will eventually forget past information and can lose the thread of a conversation during longer sessions.
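A rough sketch of that trimming step might look like the following. The 4096-token budget and the four-characters-per-token estimate are illustrative assumptions, not values from any particular model:

```python
MAX_CONTEXT_TOKENS = 4096   # illustrative budget; real context windows vary widely

def estimate_tokens(message):
    # Very rough rule of thumb: about four characters of English text per token.
    return max(1, len(message["content"]) // 4)

def trim_history(history):
    """Keep the system prompt, then drop the oldest turns until the rest fits."""
    system, turns = history[:1], history[1:]
    while turns and sum(map(estimate_tokens, system + turns)) > MAX_CONTEXT_TOKENS:
        turns.pop(0)   # this is the moment the AI "forgets" the start of the chat
    return system + turns
```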

Keep in mind that a transformer AI’s responses rely on pattern matching, much like predictive text and auto-complete features, but at a much more sophisticated level. This means the model may seem to invent facts (hallucinate) when it lacks the necessary knowledge or connections.

The extent to which this hallucination occurs depends on the settings sent with your request (e.g., temperature, top_p, and top_k, which we’ll discuss in future posts). This is why we see incorrect or invented facts, invalid logic, and wrong answers to math equations; the AI is not looking up data like a database or solving math problems with arithmetic, but is instead predicting text based on how your request aligns with the patterns it has stored.
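As a preview of those future posts, here is an illustrative (not library-accurate) sketch of how temperature and top_k reshape the model’s next-token probabilities before one token is picked:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy next-token scores (logits) for a made-up five-word vocabulary.
vocab = ["the", "fish", "forgot", "remembered", "spaceship"]
logits = np.array([2.0, 1.5, 1.2, 0.8, -1.0])

def sample(logits, temperature=1.0, top_k=None):
    """Illustrative temperature + top-k sampling, not any library's exact code."""
    scaled = logits / max(temperature, 1e-6)      # low temperature -> sharper, safer picks
    if top_k is not None:
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)  # discard unlikely tokens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return vocab[rng.choice(len(vocab), p=probs)]

print([sample(logits, temperature=0.2) for _ in range(5)])           # mostly "the": predictable
print([sample(logits, temperature=1.5) for _ in range(5)])           # more varied, more risk
print([sample(logits, temperature=1.0, top_k=2) for _ in range(5)])  # only the two best candidates
```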

Putting It All Together

Now that we know the overall operation of transformer AI, let’s run through a roleplaying session as an example.

  1. Start UI: There are several UIs available, such as the oobabooga text-generation-webui, each with slightly different interfaces tailored to research, writing, and chat.
  2. Load Model: There are many different models to choose from, each with their own strengths and “personalities,” and most can be found on the LLM hub Hugging Face. Specifically, The Bloke has a vast library of models already prepared for use on modest computers (see Storytelling at Speed: Evaluating CPU and GPU Configurations for LLM AI).
  3. Set Inference Parameters: Different values for temperature, top_p, and top_k help fine tune how the model will respond. How these parameters affect output will be the topic for an upcoming post.
  4. Set Initial Context: Some UIs allow you to define an initial scenario and/or AI character. This text is sent to the AI but not shown to the user.
  5. Set Greeting: Some UIs allow you to set the scene or have the AI character introduce themselves. Unlike the ‘Initial Context,’ this text is visible to the user before they input their prompt.
  6. Prompt: The user enters their query, or prompt, through the UI.
  7. Tokenization: The prompt is first tokenized, meaning it’s broken down into smaller pieces, often words or parts of words, that the model can understand. Each token is then mapped to a numeric code, or vector, through a process called embedding.
  8. Passing Through the Model: These vectors are input into the model, which is structured with multiple transformer layers. Each layer refines the data, understanding patterns and nuances, so by the time they’ve passed through all layers, the model can generate coherent responses.
  9. Attention Mechanism: As these vectors pass through each layer in the transformer, the attention mechanism comes into play. At each layer, the self-attention mechanism allows the model to weigh the importance of different tokens in the context of the current input sequence. In simple terms, it helps the model to “focus” on the most relevant parts of the input based on the query.
  10. Generating a Response: After passing through all the layers of the transformer, the model produces a sequence of output vectors. These vectors are then converted back into human-readable text (detokenized), which becomes the response you see. (Steps 6 through 10 are sketched in code just below.)
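For readers who want to see steps 6 through 10 end to end, here is a minimal sketch using the Hugging Face transformers library. GPT-2 is used only because it is small enough to run almost anywhere; any causal language model on the hub follows the same pattern, though quality and download size vary:

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Deep in the ruined library, the archivist found a map that"
inputs = tokenizer(prompt, return_tensors="pt")   # step 7: prompt -> token IDs (vectors)

output_ids = model.generate(                      # steps 8-9: transformer layers + attention
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.8,                              # step 3's inference parameters
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # step 10: detokenize
```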

Key Points & Caveats

Transformer LLMs are static, meaning they only contain the data they were trained with; they represent a fixed point in time and cannot learn new data without being retrained or fine-tuned.

Transformer LLMs are inherently stateless, meaning they do not remember any user input from one request to the next – that memory is up to the UI to manage, and it is limited to a fixed size (the context window).

Transformer LLMs generate responses based on their weights, which encode learned relationships between tokens; they do not retrieve facts, perform arithmetic, or comprehend logic. Because of this, answers may be incorrect or outright hallucinations.

Conversely, when used for creative writing, storytelling, and roleplay, responses may include details about real people or echo existing stories and characters, so use caution before including responses in published works.

The quality of responses from a transformer LLM depends on the base model architecture, the quality and amount of data it was trained with, and the method of training. In addition, models should be evaluated to get an estimate of accuracy before being used for non-fiction purposes, and even then results should always be validated.

Conclusion

The transformer Large Language Model, at its core, is a digital storyteller, weaving narratives from its vast reservoir of “learned experiences.” For writers and creatives, grasping its mechanics illuminates the vast potential of this tool. Whether you seek inspiration or a fresh perspective, the LLM stands ready to enhance your creative process. As with any tool, its brilliance is in how we use it, turning the LLM into both muse and mentor on our artistic journeys.
