AI Fundamentals — Quick Refresh
May 23, 2026
Big Picture
Training Data → Training → Parameters → Inference → Output
| Phase | What happens |
|---|---|
| Training | Model learns from data; parameters are updated |
| Inference | Model uses fixed parameters to generate responses |
Tokens
The basic unit of text an LLM processes. Words are broken into chunks, not read letter-by-letter.
| Example | Approx. tokens |
|---|---|
| "Hello" | 1 |
| "I love pizza" | 3–4 |
| 75 words | ~100 |
Rule of thumb: 1 token ≈ 0.75 words
Total context = Input + Chat History + System Instructions + Output
Why they matter: cost, context usage, response length, and performance.
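The rule of thumb and the total-context formula above can be turned into a quick estimator. This is only a rough sketch: `estimate_tokens`, `estimate_total_context`, and `WORDS_PER_TOKEN` are names made up for this example, and real counts depend on the tokenizer and the language of the text.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from above; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count from whitespace-separated words."""
    word_count = len(text.split())
    return round(word_count / WORDS_PER_TOKEN)


def estimate_total_context(
    system: str, history: list[str], user_input: str, max_output_tokens: int
) -> int:
    """Total context = system instructions + chat history + input + output."""
    parts = [system, *history, user_input]
    return sum(estimate_tokens(part) for part in parts) + max_output_tokens
```

For an exact count, use the tokenizer that ships with the model you are actually running.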
Context Window
The model's temporary working memory — everything it can "see" during one conversation.
Contains:
- System instructions
- Chat history
- User prompt
- Model output
Does NOT contain: model parameters (those are the model's fixed weights, stored with the model rather than in the window)
When the window fills up: older messages may be dropped, summaries substituted, and details forgotten.
Think of it as a whiteboard with limited space — once it's full, you have to erase something to write more.
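One simple way to "erase the whiteboard" is to drop the oldest messages first. The sketch below assumes each message already has a token count attached; `trim_history` is a hypothetical helper, and real chat applications may summarize old messages instead of dropping them.

```python
def trim_history(
    system_tokens: int, history: list[tuple[str, int]], max_tokens: int
) -> list[tuple[str, int]]:
    """Keep the newest messages that fit in the window.

    history is a list of (message, token_count) pairs, oldest first.
    System instructions are assumed to always stay in the window.
    """
    budget = max_tokens - system_tokens
    kept = []
    # Walk from newest to oldest, keeping messages while they still fit.
    for message, tokens in reversed(history):
        if tokens > budget:
            break
        kept.append((message, tokens))
        budget -= tokens
    kept.reverse()  # restore oldest-first order
    return kept
```

With a 100-token window, 20 tokens of system instructions, and messages of 50, 30, and 40 tokens, the oldest message is the one that gets dropped.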
Parameters
The learned weights stored inside a model — its long-term knowledge baked in during training.
| Size | Typical profile |
|---|---|
| 4B | Small, fast, local-friendly |
| 8B | Balanced |
| 70B | High capability, expensive |
| | Larger models | Smaller models |
|---|---|---|
| Pros | Better reasoning, richer language | Less RAM, faster inference |
| Cons | More compute, slower | Lower capability ceiling |
Parameters = brain size. Context window = working memory. Tokens = the chunks of text going in and out.
Inference
Generating predictions from a trained model.
Prompt → Tokens → Context Window → Model → Next-token prediction → Output
Models generate text one token at a time, each token influenced by everything before it.
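The token-by-token loop can be sketched with a toy stand-in for the model. Everything here is made up for illustration: `TOY_MODEL` is a hand-written lookup table, and unlike a real LLM it looks only at the previous token rather than the whole context.

```python
# Hypothetical toy "model": maps the previous token to next-token probabilities.
# A real LLM conditions on the entire context, not just the last token.
TOY_MODEL = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"<end>": 1.0},
}


def generate(max_new_tokens: int = 10) -> list[str]:
    """Greedy decoding: repeatedly pick the most likely next token."""
    tokens = ["<start>"]
    for _ in range(max_new_tokens):
        next_probs = TOY_MODEL.get(tokens[-1], {"<end>": 1.0})
        next_token = max(next_probs, key=next_probs.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the new token becomes part of the context
    return tokens[1:]  # drop the <start> marker


print(generate())  # ['the', 'cat', 'sat']
```

Picking the single most likely token every time is called greedy decoding; sampling from the probabilities instead is what makes outputs vary between runs.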
Quantization
Compressing a model to use less memory, at a small quality cost.
Example: Llama-3-8B-Q4
- Llama-3 → model family
- 8B → 8 billion parameters
- Q4 → 4-bit precision (compressed)
Benefit: runs on consumer hardware
Tradeoff: slight quality degradation vs. full precision
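A back-of-the-envelope calculation shows why this matters for consumer hardware. The sketch below counts the weights only and assumes a 16-bit baseline for comparison; `weight_memory_gb` is a name made up here, and real quantized files run somewhat larger because they also store scaling metadata.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


print(weight_memory_gb(8e9, 16))  # 16.0 -> 8B parameters at 16 bits each
print(weight_memory_gb(8e9, 4))   # 4.0  -> 8B parameters at 4 bits each
```

Actual memory use at inference time is higher still, since the context being processed also takes up space.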
Embeddings
Converting meaning into numerical vectors so computers can reason about similarity.
- "dog" and "puppy" → vectors that are close together
- "dog" and "spaceship" → vectors far apart
Used in: semantic search, recommendations, RAG, similarity matching
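"Close together" is commonly measured with cosine similarity. In the sketch below the three-dimensional vectors are invented for illustration; real embeddings come from an embedding model and usually have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, not output from any real embedding model.
EMBEDDINGS = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.9, 0.15],
    "spaceship": [0.1, 0.05, 0.95],
}

dog_puppy = cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["puppy"])
dog_spaceship = cosine_similarity(EMBEDDINGS["dog"], EMBEDDINGS["spaceship"])
print(dog_puppy > dog_spaceship)  # True
```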
Attention
The mechanism that lets a model focus on the most relevant parts of its context when generating each token.
Example:
"John gave Sarah a book because she wanted it."
→ Attention links "she" back to "Sarah," not "John."
Attention is the core innovation behind the Transformer architecture (the "T" in GPT and BERT).
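The scoring step can be sketched as scaled dot-product attention for a single query. The two-dimensional query and key vectors below are invented so that "she" lines up with "Sarah"; in a real model they are learned, and the resulting weights are then used to mix value vectors, a step this sketch leaves out.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: list[float], keys: list[list[float]]) -> list[float]:
    """Scaled dot-product attention weights for one query over several keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Made-up vectors for illustration only.
tokens = ["John", "Sarah", "book"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_she = [0.0, 2.0]

weights = attention_weights(query_she, keys)
most_attended = tokens[weights.index(max(weights))]
print(most_attended)  # Sarah
```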
Temperature
Controls how random (creative) the model's outputs are.
| Temperature | Behavior |
|---|---|
| Low (0–0.3) | Predictable, consistent, better suited to factual tasks |
| Medium (0.5–0.7) | Balanced |
| High (0.9–1.5+) | Creative, varied — can become incoherent |
Use low temp for structured tasks (code, data); higher temp for brainstorming or creative writing.
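Under the hood, temperature typically divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch, with made-up logits and a hypothetical function name:

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Temperature 0 is usually handled separately as "always pick the top token".
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.5)
print(low[0] > high[0])  # True
```

At low temperature almost all of the probability lands on the top token; at high temperature it spreads out, so less likely tokens get sampled more often.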
Hallucinations
When the model outputs something plausible but factually wrong — stated with confidence.
Why it happens: models predict what text is likely, not what is true. They don't have a fact-checker.
Mitigation: RAG, grounding with sources, low temperature, explicit instructions to say "I don't know."
RAG (Retrieval-Augmented Generation)
Extending a model's knowledge beyond its training data by retrieving external content at query time.
Question → Retrieve relevant docs → Inject into context → Generate grounded answer
| | Without RAG | With RAG |
|---|---|---|
| Knowledge | Training data only | Current or private data |
| Hallucination risk | Higher | Lower (grounded) |
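The retrieve-then-inject flow can be sketched without any model at all. This toy version ranks documents by shared words, which is an assumption made to keep the example self-contained; real systems usually retrieve with embeddings and a vector index. `retrieve` and `build_prompt` are hypothetical helpers.

```python
def retrieve(question: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the question."""
    question_words = set(question.lower().split())

    def overlap(doc: str) -> int:
        return len(question_words & set(doc.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:top_k]


def build_prompt(question: str, documents: list[str]) -> str:
    """Inject the retrieved text into the prompt that would be sent to the model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


docs = [
    "The office closes at 6pm on weekdays.",
    "Pizza dough needs to rest overnight.",
]
print(build_prompt("When does the office close?", docs))
```

The model never sees the second document, so the answer is grounded in the one passage that matched the question.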
Fine-tuning
Additional training on a specific dataset to specialize a general model for a particular task or domain.
Examples: medical assistant, legal document reviewer, customer support agent
Tradeoff: more capable in-domain, potentially weaker out-of-domain. More expensive than prompting alone.
The Full Stack
Words
↓ tokenized into
Tokens
↓ loaded into
Context Window
↓ processed by
Attention ← focuses on what matters
↓
Model Parameters ← learned knowledge
↓
Inference ← next-token prediction, repeated
↓
Output
30-Second Cheat Sheet
| Concept | One-liner |
|---|---|
| Tokens | Chunks of text the model reads |
| Context window | Temporary memory for one conversation |
| Parameters | Learned knowledge baked into the model |
| Inference | Running the model to get an output |
| Attention | Focus mechanism — what relates to what |
| Embeddings | Meaning represented as vectors |
| Temperature | Creativity dial |
| RAG | Plug in external knowledge at query time |
| Fine-tuning | Specialize a model on new data |
| Hallucination | Confident but wrong output |
| Quantization | Compress model to save memory, slight quality loss |