AI Fundamentals — Quick Refresh

May 23, 2026

Big Picture

Training Data → Training → Parameters → Inference → Output

Phase	What happens
Training	Model learns from data; parameters are updated
Inference	Model uses fixed parameters to generate responses
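To make the two phases concrete, here is a minimal sketch using a one-parameter model y = w * x. The toy data, learning rate, and function names are illustrative assumptions and have nothing LLM-specific about them; the only point is that training changes the parameter while inference just reads it.

```python
def train(data, lr=0.1, steps=50):
    """Training: repeatedly nudge the parameter w to reduce squared error."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad  # the parameter is updated here
    return w


def predict(w, x):
    """Inference: w stays fixed; we only compute an output."""
    return w * x


data = [(1, 2), (2, 4), (3, 6)]  # toy data that follows y = 2x
w = train(data)
print(round(w, 3), round(predict(w, 4), 3))
```

A real LLM has billions of such parameters and a far more complex training loop, but the split between "update the weights" and "use the weights" is the same.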

Tokens

The basic unit of text an LLM processes. Text is split into chunks (whole words, word fragments, or punctuation) rather than being read word-by-word or letter-by-letter.

Example	Approx. tokens
"Hello"	1
"I love pizza"	3–4
75 words	~100

Rule of thumb: 1 token ≈ 0.75 words

Total context = Input + Chat History + System Instructions + Output

Why they matter: cost, context usage, response length, and performance.
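The rule of thumb above can be turned into a quick estimator. This is a rough sketch, assuming the 0.75 words-per-token ratio from the text; real tokenizers differ by model and language, and the function names here are made up for illustration.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; actual ratios vary


def estimate_tokens(word_count):
    """Rough token estimate from a word count, rounded up."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_total_context(input_words, history_words, system_words, output_words):
    """Total context = input + chat history + system instructions + output."""
    parts = (input_words, history_words, system_words, output_words)
    return sum(estimate_tokens(words) for words in parts)


print(estimate_tokens(75))
print(estimate_total_context(150, 600, 75, 300))
```

For exact counts you would run the model's own tokenizer; an estimate like this is only useful for quick budgeting.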

Context Window

The model's temporary working memory — everything it can "see" during one conversation.

Contains:

System instructions
Chat history
User prompt
Model output

Does NOT contain: model parameters (those are the weights stored in the model itself, not in the window)

When the window fills up: older messages may be dropped, summaries substituted, and details forgotten.

Think of it as a whiteboard with limited space — once it's full, you have to erase something to write more.
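One common way to "erase the whiteboard" is to drop the oldest messages until the rest fit. The sketch below assumes a crude word-count stand-in for a tokenizer and a hypothetical `trim_history` helper; production chat systems may instead summarize old turns or pin the system instructions.

```python
def count_tokens(message):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(message.split())


def trim_history(messages, max_tokens):
    """Keep the most recent messages whose combined size fits in max_tokens."""
    kept = []
    used = 0
    # Walk from newest to oldest so recent messages are kept first
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a b c", "d e", "f g h i"]
print(trim_history(history, max_tokens=6))
```

With a budget of 6 "tokens", the oldest message no longer fits and is the one that gets forgotten.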

Parameters

The learned weights stored inside a model — its long-term knowledge baked in during training.

Size	Typical profile
4B	Small, fast, local-friendly
8B	Balanced
70B	High capability, expensive

	Larger models	Smaller models
Pros	Better reasoning, richer language	Less RAM, faster inference
Cons	More compute, slower	Lower capability ceiling

Parameters = brain size. Context window = working memory. Tokens = the chunks of text flowing in and out.

Inference

Generating predictions from a trained model.

Prompt → Tokens → Context Window → Model → Next-token prediction → Output

Models generate text one token at a time, each token influenced by everything before it.
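The token-by-token loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: `next_token` is a lookup table that only inspects the last token, whereas a real model scores every possible next token using the whole context.

```python
def next_token(context):
    """Toy 'model': picks the next token from a fixed table keyed by the last token.

    A real LLM would condition on the entire context, not just context[-1].
    """
    table = {"the": "cat", "cat": "sat", "sat": "down", "down": "<eos>"}
    return table.get(context[-1], "<eos>")


def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "<eos>":  # end-of-sequence marker stops generation
            break
        tokens.append(token)  # the new token becomes part of the context
    return tokens


print(generate(["the"]))
```

The key structural point is that each generated token is appended to the context before the next prediction is made.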

Quantization

Compressing a model to use less memory, at a small quality cost.

Example: Llama-3-8B-Q4

Llama-3 → model family
8B → 8 billion parameters
Q4 → 4-bit precision (compressed)

Benefit: runs on consumer hardware
Tradeoff: slight quality degradation vs. full precision
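A back-of-the-envelope calculation shows why 4-bit precision matters for consumer hardware. This sketch counts weight storage only, assuming a 16-bit baseline for "full precision"; it ignores activations, the context cache, and the small overhead that real quantization formats add.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9


params_8b = 8e9  # 8 billion parameters

print(weight_memory_gb(params_8b, 16))  # 16-bit weights
print(weight_memory_gb(params_8b, 4))   # 4-bit (Q4) weights
```

Under these assumptions an 8B model drops from roughly 16 GB to roughly 4 GB of weights, which is the difference between needing a datacenter GPU and fitting on a typical laptop or gaming card.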

Embeddings

Converting meaning into numerical vectors so computers can reason about similarity.

"dog" and "puppy" → vectors that are close together
"dog" and "spaceship" → vectors far apart

Used in: semantic search, recommendations, RAG, similarity matching
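"Close together" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-picked toy vectors, not output from any real embedding model
vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.75, 0.2],
    "spaceship": [0.1, 0.2, 0.95],
}

print(cosine_similarity(vectors["dog"], vectors["puppy"]))
print(cosine_similarity(vectors["dog"], vectors["spaceship"]))
```

The first score comes out much higher than the second, which is the numerical version of "dog is closer to puppy than to spaceship".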

Attention

The mechanism that lets a model focus on the most relevant parts of its context when generating each token.

Example:
"John gave Sarah a book because she wanted it."
→ Attention links "she" back to "Sarah," not "John."

This is the core innovation behind Transformer-based models (the "T" in GPT, BERT, etc.).
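A minimal sketch of scaled dot-product attention, the standard form used in Transformers, shows how "focus" becomes a set of weights. The 2-dimensional query and key vectors below are hand-picked assumptions so that "she" lines up with "Sarah"; in a real model they are learned, and values are separate learned vectors.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys
    ]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


tokens = ["John", "Sarah", "book"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy key vectors, one per token
values = keys                                 # reuse keys as values for simplicity
query_she = [0.0, 2.0]                        # toy query for the word "she"

weights, _ = attention(query_she, keys, values)
print(tokens[weights.index(max(weights))])
```

The largest weight lands on "Sarah" because her key vector has the highest dot product with the query.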

Temperature

Controls how random (creative) the model's outputs are.

Temperature	Behavior
Low (0–0.3)	Predictable, consistent, focused
Medium (0.5–0.7)	Balanced
High (0.9–1.5+)	Creative, varied — can become incoherent

Use low temp for structured tasks (code, data); higher temp for brainstorming or creative writing.
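Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities. The sketch below assumes three made-up logits; it shows the effect on the distribution, not how any particular API implements sampling. A temperature of exactly 0 is rejected here because it would divide by zero; implementations often handle it as a special case.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # toy scores for three candidate tokens

low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.5)
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At low temperature nearly all the probability sits on the top token, so sampling is almost deterministic; at high temperature the distribution flattens and less likely tokens get picked more often.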

Hallucinations

When the model outputs something plausible but factually wrong — stated with confidence.

Why it happens: models predict what text is likely, not what is true. They don't have a fact-checker.

Mitigation: RAG, grounding with sources, low temperature, explicit instructions to say "I don't know."

RAG (Retrieval-Augmented Generation)

Extending a model's knowledge beyond its training data by retrieving external content at query time.

Question → Retrieve relevant docs → Inject into context → Generate grounded answer

	Without RAG	With RAG
Knowledge	Training data only	Current or private data
Hallucination risk	Higher	Lower (grounded)
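The retrieve-then-inject pipeline can be sketched in a few lines. This toy version ranks documents by shared words, which is an assumption made to keep the example dependency-free; real RAG systems usually rank by embedding similarity. The documents, prompt wording, and function names are all hypothetical.

```python
def overlap_score(question, doc):
    """Count lowercase words shared between the question and a document."""
    return len(set(question.lower().split()) & set(doc.lower().split()))


def retrieve(question, docs, k=1):
    """Return the k documents with the highest overlap score."""
    ranked = sorted(docs, key=lambda doc: overlap_score(question, doc), reverse=True)
    return ranked[:k]


def build_prompt(question, docs, k=1):
    """Inject the retrieved text into the context ahead of the question."""
    context = "\n".join(retrieve(question, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


docs = [
    "The office closes at 6 pm on weekdays.",
    "Our refund policy lasts 30 days.",
]
print(build_prompt("When does the office close?", docs))
```

The resulting prompt is what gets sent to the model, so the answer can be grounded in the retrieved text rather than in training data alone.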

Fine-tuning

Additional training on a specific dataset to specialize a general model for a particular task or domain.

Examples: medical assistant, legal document reviewer, customer support agent

Tradeoff: more capable in-domain, potentially weaker out-of-domain. More expensive than prompting alone.

The Full Stack

Words
  ↓ tokenized into
Tokens
  ↓ loaded into
Context Window
  ↓ processed by
Attention  ←  focuses on what matters
  ↓
Model Parameters  ←  learned knowledge
  ↓
Inference  ←  next-token prediction, repeated
  ↓
Output

30-Second Cheat Sheet

Concept	One-liner
Tokens	Chunks of text the model reads
Context window	Temporary memory for one conversation
Parameters	Learned knowledge baked into the model
Inference	Running the model to get an output
Attention	Focus mechanism — what relates to what
Embeddings	Meaning represented as vectors
Temperature	Creativity dial
RAG	Plug in external knowledge at query time
Fine-tuning	Specialize a model on new data
Hallucination	Confident but wrong output
Quantization	Compress model to save memory, slight quality loss