Guide
Large Language Models
How large language models work, how they are trained and fine-tuned, and what it takes to deploy them reliably in production applications.
What Are Large Language Models?
Large language models (LLMs) are neural networks trained on massive text corpora to predict the next token in a sequence. The transformer architecture, introduced in the 2017 paper "Attention Is All You Need," is the foundation for every major LLM today: GPT-4, Claude, Gemini, Llama, Mistral, and their derivatives. The defining characteristic is scale: billions of parameters trained on trillions of tokens from the open web, books, code, and other text sources. At sufficient scale, emergent capabilities appear: reasoning, in-context learning, and instruction following, none of which is trained for explicitly; they arise from the pretraining objective alone.
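The next-token objective can be made concrete with a minimal autoregressive sampling loop. This is a toy sketch: `toy_model` is a hypothetical stand-in (a fixed three-token vocabulary) for a real network that would return logits conditioned on the context.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution over the vocabulary."""
    z = (logits - logits.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def generate(model, tokens, steps, rng):
    """Autoregressive loop: sample the next token from the model's predicted
    distribution, append it to the context, and repeat."""
    for _ in range(steps):
        logits = model(tokens)                      # scores over the vocabulary
        probs = softmax(logits)
        next_token = int(rng.choice(len(probs), p=probs))
        tokens = tokens + [next_token]
    return tokens

# Toy "model": always strongly favors token 2, regardless of context.
toy_model = lambda tokens: np.array([0.1, 0.2, 5.0])
out = generate(toy_model, [0], steps=3, rng=np.random.default_rng(0))
```

Real systems replace the raw sample with temperature, top-k, or nucleus sampling, but the loop structure is the same.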
The Transformer Architecture
The transformer processes input as a sequence of tokens, which are sub-word units produced by a tokenizer (typically byte-pair encoding, BPE). Each token is represented as a dense embedding vector. The attention mechanism allows every token to attend to every other token in the context window, computing a weighted sum of value vectors based on query-key similarity. Multi-head attention runs this process in parallel across multiple "heads", each learning to attend to different aspects of the input. Feed-forward layers process each position independently. These blocks are stacked (typically 32–96 layers for large models), and the final layer produces logit scores over the vocabulary for next-token prediction.
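The core of the mechanism, scaled dot-product attention for a single head, fits in a few lines. This NumPy sketch uses random matrices in place of learned projections; the scaling by the square root of the key dimension keeps dot products in a numerically stable range.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to all keys; the output for each position is a
    weighted sum of value vectors, with weights from query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention runs this computation several times with different learned projections of Q, K, and V, then concatenates the results; decoder-only LLMs also mask the score matrix so tokens cannot attend to later positions.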
Pretraining and Scale
Pretraining is self-supervised learning at extreme scale: the model is trained to predict the next token across trillions of tokens. The loss is the cross-entropy between the predicted distribution and the actual next token. Compute requirements scale with model parameters, dataset size, and training steps, governed by the Chinchilla scaling laws, which suggest roughly 20 training tokens per parameter for compute-optimal models. A 70B-parameter model trained on 2T tokens requires roughly 1–2 million A100 GPU-hours. Pretraining is the most compute-intensive phase but produces a general-purpose base model that can be adapted for many tasks. Training infrastructure at this scale requires distributed systems with careful communication optimization (ZeRO, pipeline parallelism, tensor parallelism).
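The GPU-hour figure can be sanity-checked with the standard back-of-envelope estimate of about 6 FLOPs per parameter per token (forward plus backward pass). The 40% model-FLOPs-utilization figure below is an assumption; real runs vary.

```python
def pretraining_gpu_hours(n_params, n_tokens, flops_per_gpu=312e12, mfu=0.40):
    """Rough training cost: ~6 FLOPs per parameter per training token,
    divided by one GPU's sustained throughput.
    312 TFLOPS is A100 BF16 peak; mfu=0.40 is an assumed utilization."""
    total_flops = 6 * n_params * n_tokens
    seconds = total_flops / (flops_per_gpu * mfu)
    return seconds / 3600

hours = pretraining_gpu_hours(n_params=70e9, n_tokens=2e12)
# ~1.9 million A100 GPU-hours, consistent with the 1-2 million figure above
```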
Fine-Tuning and Alignment
Base LLMs predict text; they do not inherently follow instructions or behave helpfully. Alignment transforms a base model into a useful assistant through several stages. Supervised fine-tuning (SFT) trains on high-quality (prompt, response) pairs demonstrating desired behavior. Reinforcement learning from human feedback (RLHF) uses human preference comparisons to train a reward model, then uses that reward model with PPO or similar RL algorithms to fine-tune the LLM toward preferred outputs. Direct preference optimization (DPO) achieves similar alignment without a separate reward model. Constitutional AI and RLAIF use AI-generated feedback to reduce the human annotation burden. Each stage requires careful data curation: the quality of alignment data is more important than its volume.
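To make the DPO idea concrete, here is a sketch of its per-example loss on scalar sequence log-probabilities. The numbers are illustrative, not from a real model: the loss rewards the policy for increasing its preference for the chosen response relative to a frozen reference model.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the scaled margin
    between the policy's and the reference model's log-probability ratios."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Illustrative values: the policy prefers the chosen response more strongly
# than the reference does, so the margin is positive and loss drops below log 2.
loss = dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1)
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; training drives it lower by widening the chosen/rejected gap.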
Retrieval-Augmented Generation
LLMs have a fixed context window and a knowledge cutoff date. Retrieval-augmented generation (RAG) works around both limits by connecting the model to external knowledge stores at inference time. A query is embedded using a dense retrieval model, the most semantically similar documents are retrieved from a vector database (Pinecone, Weaviate, pgvector), and those documents are injected into the LLM's context alongside the user query. The model generates its response grounded in the retrieved content, reducing hallucination and enabling up-to-date information. RAG is the standard architecture for enterprise LLM applications because it incorporates new knowledge without retraining and provides citation-traceable outputs.
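The retrieve-then-inject pattern can be sketched without any vector database. The 3-dimensional "embeddings" below are made up for illustration; in practice they come from an embedding model and the similarity search runs inside the vector store.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query
    by cosine similarity (normalize, then dot product)."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(sims)[::-1][:k]

def build_rag_prompt(question, documents, indices):
    """Inject the retrieved passages into the prompt so the answer is grounded."""
    context = "\n\n".join(documents[i] for i in indices)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

# Toy corpus with fabricated embeddings standing in for a real embedding model.
docs = ["Doc about GPUs", "Doc about baking", "Doc about CUDA kernels"]
doc_vecs = np.array([[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.8, 0.0, 0.2]])
query_vec = np.array([1.0, 0.0, 0.1])
top = cosine_top_k(query_vec, doc_vecs, k=2)
prompt = build_rag_prompt("What hardware runs CUDA?", docs, top)
```

The resulting prompt, containing only the relevant passages, is what gets sent to the LLM.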
Deploying LLMs in Production
LLM inference is compute-intensive: a 70B parameter model requires at least two A100 80GB GPUs for FP16 inference. Quantization (GPTQ, AWQ, GGUF) reduces memory requirements at a modest quality cost; 4-bit quantized models can run on consumer GPUs. KV cache management is critical for throughput: the key-value states from processed tokens are cached to avoid redundant computation on subsequent tokens. Batching strategies (continuous batching, PagedAttention as implemented in vLLM) dramatically increase throughput by serving multiple requests efficiently. For latency-critical applications, smaller distilled models or speculative decoding (using a small draft model to generate candidates verified by the main model) reduce time-to-first-token. Monitoring hallucination rates, toxicity, and latency regression requires purpose-built LLM observability tooling.
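KV cache size is what usually limits batch size and context length in practice, and it is easy to estimate: two tensors (keys and values) per layer, per request, per token. The configuration below (80 layers, 8 KV heads with grouped-query attention, head dimension 128) is a hypothetical 70B-class setup, not a specific model's published numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Memory for the KV cache: 2 tensors (K and V) per layer, each of shape
    (batch, n_kv_heads, seq_len, head_dim), at the inference precision
    (dtype_bytes=2 for FP16/BF16)."""
    return 2 * n_layers * batch * n_kv_heads * seq_len * head_dim * dtype_bytes

# Hypothetical 70B-class config: 80 layers, 8 KV heads (GQA), head_dim 128,
# serving a batch of 8 requests at 4096 tokens each.
gib = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                     seq_len=4096, batch=8, dtype_bytes=2) / 2**30
# -> 10 GiB on top of the model weights
```

This linear growth with batch and sequence length is exactly why paged KV cache allocation (vLLM's PagedAttention) matters: it avoids reserving the full worst-case cache for every request up front.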
Building with LLMs?
We provide training data, fine-tuning, alignment, and production deployment for language model applications.
Talk to Us