AI

OK, full disclosure, because you know I like to say when I’m using ‘AI’: this post was mostly created by an LLM. It’s an interesting experiment, and I don’t claim to be an ‘AI’ model creation and platform expert, so let’s see if using an LLM can help me! I was in a meeting a long time ago at MS Victoria with some people (can’t imagine what types of people that would be) and we were talking about this tech. It was so long ago I can’t remember the details or anyone’s faces (useful, that)… but the convo was really good. Anyway, so what is an LLM and how does it work?

~/xservus/notes/how-llms-work.md

How large language models actually work

From a pile of text to a trained model, then from your prompt to the words on screen — without the hand-waving.

Most explanations either skip the maths entirely or drown you in it. This one splits the problem the honest way: first how the model gets built, then what physically happens when a prompt goes in and tokens come out.

01 — Building the model

0x00 Data collection

You assemble an enormous text corpus — web crawl, books, code, reference material — often tens of trillions of tokens. The composition matters more than almost anything else; much of the real engineering is in filtering, deduplication, and balancing the mix (how much code versus prose, and so on).
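To make "filtering and deduplication" concrete, here is a minimal sketch. It assumes exact-match deduplication by hashing normalised text plus a crude length filter. Real pipelines typically add fuzzy deduplication and quality scoring, and the function names and the `min_words` threshold here are illustrative only.

```python
import hashlib

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())

def clean_corpus(docs: list[str], min_words: int = 5) -> list[str]:
    """Drop very short documents and exact duplicates (after normalisation)."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalise(doc)
        if len(norm.split()) < min_words:
            continue  # crude quality filter: too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of something already kept
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate once normalised
    "Click here",                                     # too short to keep
    "def add(a, b): return a + b  # a tiny code sample",
]
for doc in clean_corpus(docs):
    print(doc)
```

Hashing keeps memory use small because only a fixed-size digest is stored per document, not the text itself.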

0x01 Tokenisation

Raw text isn’t fed in as characters or words. A tokeniser — usually byte-pair encoding or a variant — is trained on the corpus to build a fixed vocabulary, typically 50k–200k tokens, where common sequences become single tokens and rare ones split into sub-word pieces. cybersecurity might be one token; an obscure string fragments into several. This vocabulary is then frozen and becomes the model’s entire input/output alphabet.
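A toy version of byte-pair encoding training shows how common sequences collapse into single tokens while rare ones stay fragmented. This is a sketch under simplifying assumptions: it works on characters rather than bytes, treats each word separately, and the tiny corpus is invented for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules; return the rules and the segmented words."""
    words = dict(Counter(tuple(word) for word in corpus))
    merges = []
    for _ in range(num_merges):
        if all(len(symbols) < 2 for symbols in words):
            break  # nothing left to merge
        pair = most_frequent_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# A frequent word and a rare string (both invented for this example).
corpus = ["cybersecurity"] * 10 + ["xqzj"]
merges, words = train_bpe(corpus, num_merges=12)
for symbols, freq in words.items():
    print(freq, list(symbols))
```

With enough merges the frequent word ends up as one symbol, while the rare string is still split into single characters. That is the same behaviour described above, at a much smaller scale.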

0x02 Pretraining

This is the expensive part, the bit that costs millions in compute. The objective is brutally simple: predict the next token given everything before it. Take a sequence, mask the future, ask the model to predict each next token, measure how wrong it was via cross-entropy loss, then adjust the model’s parameters by backpropagation and gradient descent. Repeat batch after batch across the whole corpus (at this scale often only about one full pass), on thousands of GPUs for weeks.
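The training loop can be sketched at toy scale. The example below assumes a drastically simplified "model": one row of logits per previous token (a bigram table) instead of a transformer, with gradients written out by hand rather than computed by backpropagation through layers. Weights start at zero here for reproducibility, whereas real models start from random values. The objective and update rule are the same ones described above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(tokens, vocab_size, steps=200, lr=0.5):
    """Minimise next-token cross-entropy with plain gradient descent."""
    weights = [[0.0] * vocab_size for _ in range(vocab_size)]
    pairs = list(zip(tokens, tokens[1:]))  # (context token, true next token)
    losses = []
    for _ in range(steps):
        grads = [[0.0] * vocab_size for _ in range(vocab_size)]
        loss = 0.0
        for prev, nxt in pairs:
            probs = softmax(weights[prev])
            loss += -math.log(probs[nxt])  # cross-entropy at this position
            for v in range(vocab_size):
                # d(loss)/d(logit_v) = p_v - 1 if v is the target, else p_v
                grads[prev][v] += probs[v] - (1.0 if v == nxt else 0.0)
        for i in range(vocab_size):
            for v in range(vocab_size):
                weights[i][v] -= lr * grads[i][v] / len(pairs)
        losses.append(loss / len(pairs))
    return weights, losses

# A repeating toy sequence of token IDs: 0 -> 1 -> 2 -> 0 -> ...
tokens = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1]
weights, losses = train_bigram(tokens, vocab_size=3)
print(f"average loss: first step {losses[0]:.3f}, last step {losses[-1]:.3f}")
```

The loss starts at the value for a uniform guess over three tokens and falls as the table learns which token follows which. Nothing in the code mentions the pattern itself; it emerges from the prediction pressure alone.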

The model is a transformer — billions of parameters (weights) arranged in stacked layers. Nobody hand-codes any knowledge. The weights start random and the only pressure is “predict the next token better.” Grammar, facts, reasoning patterns, code syntax — all of it emerges as a side effect, because predicting the next token well requires implicitly modelling the structure that produced the text.

0x03 Post-training

A raw pretrained model is just a next-token autocomplete engine — it’ll happily continue your prompt as if it were a web page rather than answer it. Post-training shapes it into something useful:

  • SFT: supervised fine-tuning on curated instruction → good response pairs, teaching the model the format of “be a helpful assistant that answers.”
  • RLHF / DPO: preference optimisation. Show the model pairs of responses ranked by which is better, and tune it toward the preferred kind. This is where helpfulness, refusal behaviour, tone, and safety properties get baked in.
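For the preference step, the DPO loss on a single ranked pair is compact enough to write out. This sketch assumes you already have the summed log-probability of each response under the model being tuned and under a frozen reference copy. The argument names and `beta=0.1` are illustrative choices. RLHF proper (a reward model plus reinforcement learning) is a different procedure and is not shown.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair."""
    # How far the tuned model has moved from the reference on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), computed as a numerically stable softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Untuned model (identical to the reference): loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Model now favours the chosen response: loss drops.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
# Model favours the rejected response: loss rises.
print(dpo_loss(-12.0, -8.0, -10.0, -10.0))
```

Minimising this loss pushes probability toward the preferred response relative to the reference model, which is the "tune it toward the preferred kind" step in the list above.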

Pretraining builds the capability; post-training aims it.

02 — Prompt in, output out

Here is the inference pipeline end to end — what happens between you hitting enter and tokens streaming back.

[Figure: LLM inference pipeline from prompt to output. Your prompt (raw text + history) → Tokenise (text → token IDs) → Embed (IDs → vectors) → Transformer layers (attention + FFN, ×N) → Logits (score every token) → Sample (pick next token) → append token, feed back in and repeat until stop (autoregressive).]

fig.1 — the inference loop

Walking through each stage

Tokenise. Your prompt — plus the whole prior conversation — is split into tokens using the frozen vocabulary from training. Each becomes an integer ID. This is also why models have context windows: there’s a hard ceiling on how many tokens can be in play at once.
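A small sketch of this step: encoding text against a frozen vocabulary and checking the result against a context limit. Everything here is hypothetical, including the six-entry vocabulary and the limit of 8 tokens (real limits are far larger). Greedy longest-match is used as a simple stand-in for applying real BPE merge rules.

```python
VOCAB = {"<unk>": 0, "cyber": 1, "security": 2, " ": 3, "is": 4, "fun": 5}
CONTEXT_WINDOW = 8  # hypothetical limit, chosen small for the example

def encode(text, vocab=VOCAB):
    """Greedy longest-match against a frozen vocabulary (a stand-in for BPE)."""
    ids, i = [], 0
    max_len = max(len(token) for token in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                ids.append(vocab[piece])
                i += size
                break
        else:
            ids.append(vocab["<unk>"])  # nothing matched: emit unknown, skip a char
            i += 1
    return ids

def fits_context(history, prompt, limit=CONTEXT_WINDOW):
    """History and prompt share one token budget."""
    count = len(encode(history + prompt))
    return count <= limit, count

print(encode("cybersecurity is fun"))
print(fits_context("", "cybersecurity is fun"))
print(fits_context("cybersecurity is fun ", "cybersecurity is fun"))
```

The second `fits_context` call fails the check because the earlier conversation counts against the same budget as the new prompt.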

Embed. Each token ID is looked up in an embedding table, becoming a vector of a few thousand numbers. Position information is added, so dog bites man differs from man bites dog. Your prompt is now a stack of vectors — one per token.
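Here is a minimal sketch of the lookup-plus-position step. It assumes the sinusoidal position scheme from the original transformer design; many current models use other schemes, such as rotary embeddings. The embedding table is filled with made-up deterministic values in place of learned ones, and the width of 8 is a toy size.

```python
import math

D_MODEL = 8  # toy width; real models use a few thousand numbers per token

def make_embedding_table(vocab_size, d_model=D_MODEL):
    """Made-up deterministic values standing in for learned embeddings."""
    return [[math.sin(1.0 + tok * d_model + j) for j in range(d_model)]
            for tok in range(vocab_size)]

def positional_encoding(pos, d_model=D_MODEL):
    """Sinusoidal encoding: sine on even indices, cosine on odd ones."""
    vec = []
    for j in range(d_model):
        angle = pos / (10000 ** ((j - j % 2) / d_model))
        vec.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return vec

def embed(token_ids, table):
    """Look up each token's vector and add its position vector."""
    return [[e + p for e, p in zip(table[tok], positional_encoding(pos))]
            for pos, tok in enumerate(token_ids)]

DOG, BITES, MAN = 0, 1, 2
table = make_embedding_table(vocab_size=3)
dog_bites_man = embed([DOG, BITES, MAN], table)
man_bites_dog = embed([MAN, BITES, DOG], table)

# Same token, different position: the vectors differ.
print(dog_bites_man[0] == man_bites_dog[2])
# Same token, same position: the vectors match.
print(dog_bites_man[1] == man_bites_dog[1])
```

Because position is mixed into every vector, the two sentences produce different stacks even though they contain exactly the same tokens.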

Transformer layers (the core). The stack passes through N layers of identical structure, each with its own learned weights (dozens to 100+ in large models). Each layer does two things:

  • Self-attention — every token’s vector is updated by pulling in information from other tokens it deems relevant. “It” looks back to find what it refers to; a closing bracket attends to its opener. This is what lets the model handle context instead of treating each word in isolation.
  • Feed-forward network — a per-token transformation doing the bulk of the computation and recall, applying patterns learned during pretraining.

Residual connections and normalisation keep it trainable. By the top layer, each token’s vector is a richly context-aware representation.
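One layer can be sketched in plain Python. The simplifications are loud ones: a single attention head, no learned query/key/value projections (the normalised inputs are used directly), made-up feed-forward weights, and a pre-norm layout, which is one common arrangement. Real layers are much wider and learn all of these weights.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-5):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def causal_self_attention(xs):
    """Each position mixes in vectors from itself and earlier positions only."""
    d = len(xs[0])
    out = []
    for i, query in enumerate(xs):
        visible = xs[: i + 1]  # causal mask: no looking at future tokens
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in visible]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, visible))
                    for j in range(d)])
    return out

def feed_forward(vec, w_in, w_out):
    """Per-token transformation: expand, apply ReLU, project back."""
    hidden = [max(0.0, sum(x * w for x, w in zip(vec, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]

def transformer_layer(xs, w_in, w_out):
    attended = causal_self_attention([layer_norm(x) for x in xs])
    xs = [[a + b for a, b in zip(x, att)] for x, att in zip(xs, attended)]  # residual
    ffn_out = [feed_forward(layer_norm(x), w_in, w_out) for x in xs]
    return [[a + b for a, b in zip(x, f)] for x, f in zip(xs, ffn_out)]     # residual

D, HIDDEN = 4, 8
# Made-up deterministic weights standing in for learned ones.
w_in = [[math.sin(i + 2 * j) for j in range(D)] for i in range(HIDDEN)]
w_out = [[0.1 * math.cos(3 * i + j) for j in range(HIDDEN)] for i in range(D)]

xs = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.5, 0.0], [0.5, 0.0, 1.0, 0.0]]
out = transformer_layer(xs, w_in, w_out)
print(len(out), "vectors of width", len(out[0]))
```

The layer returns one vector per token at the same width it received, which is what allows N of them to be stacked.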

Logits. The final-position vector is projected back against the vocabulary, producing a raw score (logit) for every token — essentially “how likely is each possible next token.” A softmax turns those scores into a probability distribution.
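As a sketch, the projection is one dot product per vocabulary entry, followed by a softmax. The three-token vocabulary, the 2-number "final vector" and the projection weights below are invented for illustration.

```python
import math

VOCAB = ["cat", "dog", "mat"]

def project_to_logits(final_vector, unembedding):
    """One raw score per vocabulary entry: a dot product with that entry's row."""
    return [sum(x * w for x, w in zip(final_vector, row)) for row in unembedding]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

final_vector = [1.0, 0.0]                            # hypothetical top-layer output
unembedding = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hypothetical projection rows

logits = project_to_logits(final_vector, unembedding)
probs = softmax(logits)
for token, logit, p in zip(VOCAB, logits, probs):
    print(f"{token}: logit={logit:.1f} prob={p:.3f}")
```

Softmax preserves the ranking of the logits, so the highest-scoring token is also the most probable one.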

Sample. One token is chosen from that distribution. This is where decoding settings bite: temperature flattens or sharpens the distribution; top-p and top-k restrict sampling to the most probable candidates. Temperature 0 just takes the argmax (deterministic); higher values add variety.
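The decoding settings can be sketched as a single function. The order of operations is an assumption (temperature, then top-k, then top-p), as is treating temperature 0 as a special case that returns the argmax. Real inference libraries differ in these details.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token ID from logits using common decoding settings."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy argmax
    scaled = [x / temperature for x in logits]  # <1 sharpens, >1 flattens
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank candidates from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most probable tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:  # keep the smallest prefix whose mass reaches top_p
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    weights = [probs[i] for i in order]  # random.choices renormalises these
    return rng.choices(order, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.0]
rng = random.Random(0)  # seeded so a run is repeatable
print("temperature 0:", sample_next(logits, temperature=0))
print("temperature 5:", [sample_next(logits, temperature=5.0, rng=rng) for _ in range(10)])
print("top_k=1:", [sample_next(logits, top_k=1, rng=rng) for _ in range(10)])
```

At temperature 0 the same logits always give the same token. At a high temperature the three candidates become close to equally likely, so the output varies from draw to draw.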

The loop. The chosen token is appended to the sequence and the whole thing runs again to predict the next token. This is autoregressive generation — output is produced one token at a time, each conditioned on everything before it, until a stop token or length limit. (In practice a KV-cache avoids recomputing earlier tokens every step, but the logic is as drawn above.)
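The loop itself is short. In this sketch `toy_model` is a hypothetical stand-in for the whole tokenise-to-logits pipeline: it scores the next token from the last token alone, using a hand-written table. Decoding is greedy and there is no KV-cache.

```python
VOCAB = ["the", "cat", "sat", "<stop>"]
STOP_ID = 3

# Hand-written logits per previous token: the -> cat -> sat -> <stop>.
NEXT_LOGITS = {
    0: [0.0, 3.0, 0.0, 0.0],
    1: [0.0, 0.0, 3.0, 0.0],
    2: [0.0, 0.0, 0.0, 3.0],
    3: [0.0, 0.0, 0.0, 3.0],
}

def toy_model(token_ids):
    """Stand-in for the full forward pass: returns logits for the next token."""
    return NEXT_LOGITS[token_ids[-1]]

def generate(prompt_ids, max_new_tokens=10):
    """Autoregressive generation: predict, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = toy_model(ids)  # conditioned on everything so far
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        if next_id == STOP_ID:
            break  # the model signalled that it is done
        ids.append(next_id)  # feed the choice back in as input
    return ids

print(" ".join(VOCAB[i] for i in generate([0])))  # prints "the cat sat"
```

Both exits from the loop appear here: the stop token ends generation early, and `max_new_tokens` caps it otherwise.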

03 — A closer look at attention

Attention is the part that’s hardest to picture. Here it is on its own: one token “looking at” every other token with varying intensity.

[Figure: self-attention weights from one query token. In “the cat sat on the mat”, the token “sat” attends to every other token with varying strength; line thickness shows attention weight, strongest toward “cat”.]

fig.2 — attention from a single query token
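The picture reduces to a few lines of arithmetic: score the query against every key, then softmax the scores into weights. The 2-number vectors below are hand-picked purely so that “cat” scores highest; real queries and keys are learned and far wider. Following the figure, no causal mask is applied, whereas during generation “sat” would only see tokens up to and including itself.

```python
import math

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Hand-picked key vectors, purely illustrative.
keys = {
    "the": [0.1, 0.0],
    "cat": [2.0, 0.5],
    "sat": [0.5, 0.5],
    "on": [0.0, 0.3],
    "mat": [1.0, 0.2],
}
query = [1.5, 0.5]  # hypothetical query vector for "sat"

def attention_weights(query, key_vectors):
    """Scaled dot-product scores, turned into weights that sum to 1."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in key_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = attention_weights(query, [keys[t] for t in tokens])
for token, weight in zip(tokens, weights):
    # Bar length plays the role of line thickness in the figure.
    print(f"{token:>4} {'#' * round(weight * 40)} {weight:.2f}")
```

The updated vector for “sat” is then the weighted sum of the other tokens' value vectors, so heavily weighted tokens contribute the most.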

04 — The part most explanations skip

Everything above is one mechanism: next-token prediction. There’s no separate “reasoning module” or “knowledge lookup.” The model has no plan for the sentence it’s about to write — it computes a distribution over the next token, samples one, and repeats. Reasoning, instruction-following, “personality” — all of it is emergent behaviour from a system trained only to continue text, then nudged by post-training toward responses humans rate well.

Two consequences that matter operationally:

  • It’s a statistical model of plausible text, not a database of facts. It produces fluent, confident output whether or not the underlying claim is true, because fluency and truth are correlated in the training data but not identical. That’s the structural root of hallucination — it predicts what a plausible answer looks like, which usually but not always coincides with a correct one.
  • Behaviour at inference is shaped by three separate levers: the weights (fixed at training), the context you supply (prompt + history), and the decoding settings (temperature, top-p). The same weights can behave very differently depending on the latter two.

Xservus · security research & advisory · mr-r3b00t

So in short:

“An LLM (large language model) is a computer program that is very good at predicting words.

People train it by showing it a huge amount of writing — books, websites, code, and so on. While it reads, it plays one simple game over and over: “guess the next word.” Each time it guesses wrong, it adjusts itself a tiny bit. After doing this billions of times, it gets very good at the game.

When you type a message to it, it doesn’t really “think” the way a person does. It just looks at your words and works out, one word at a time, what word is most likely to come next. Then it adds that word, looks again, and picks the next one. It keeps going until the answer is finished. That is how it writes whole sentences and paragraphs that sound natural.

Because of how it learned, an LLM can answer questions, write and fix text, explain ideas, summarise long things, and help with code. But it has one important weakness: it is guessing what sounds right, not checking what is true. So it can sound confident and still be wrong. That is why it’s worth double-checking anything important.

In one line: an LLM is a word-prediction machine that learned from huge amounts of text, and it writes by guessing the best next word again and again”

Back to human me

So, the AI we all talk about these days is a next-word guessing machine…

From spellchecker to Skynet? I don’t think so. Humans and machines are symbiotic, and after all, the one who has all the power has the power cord in their hands! Back to the cyber world! I’ve just applied to Claude CVP… let’s see how that goes!

(Oh, and different tools can change some of the above; some have routines that try to check facts etc., so as always, it depends!)