AI Fundamentals — Quick Refresh

May 23, 2026

Big Picture

Training Data → Training → Parameters → Inference → Output

Phase | What happens
Training | Model learns from data; parameters are updated
Inference | Model uses fixed parameters to generate responses

Tokens

The basic unit of text an LLM processes. Text is split into chunks that are often whole words or word pieces, rather than being read letter by letter.

Example | Approx. tokens
"Hello" | 1
"I love pizza" | 3–4
75 words | ~100

Rule of thumb: 1 token ≈ 0.75 English words (the ratio varies by language and tokenizer)

Total context = Input + Chat History + System Instructions + Output

Why they matter: cost, context usage, response length, and performance.
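
The rule of thumb and the context formula above can be turned into a quick estimator. This is a rough sketch only: it counts whitespace-separated words instead of running a real tokenizer, and the function names and the sample budget numbers are made up for illustration.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; real tokenizers vary


def estimate_tokens(text):
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    if words == 0:
        return 0
    return max(1, round(words / WORDS_PER_TOKEN))


def total_context(system, history, user_input, output):
    """Total context = input + chat history + system instructions + output."""
    return system + history + user_input + output


print(estimate_tokens("Hello"))                 # 1
print(estimate_tokens("I love pizza"))          # 4
print(estimate_tokens(" ".join(["word"] * 75))) # 100

# Hypothetical token counts for one request
print(total_context(system=200, history=1500, user_input=100, output=400))  # 2200
```

For billing or hard limits, an exact count from the model's own tokenizer is the safer choice; an estimate like this is only good for back-of-the-envelope planning.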


Context Window

The model's temporary working memory — everything it can "see" during one conversation.

Contains:

  • System instructions
  • Chat history
  • User prompt
  • Model output

Does NOT contain: model parameters (the learned weights are stored in the model itself, not in the window)

When the window fills up: older messages may be dropped, summaries substituted, and details forgotten.

Think of it as a whiteboard with limited space — once it's full, you have to erase something to write more.
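
One simple way to "erase the whiteboard" is to keep only the most recent messages that fit a token budget. The sketch below assumes that strategy; real chat applications may summarize instead of dropping, and `trim_history`, `rough_count`, and the sample messages are hypothetical.

```python
def trim_history(messages, budget, count_tokens):
    """Keep the newest messages whose combined token count fits within budget."""
    kept = []
    used = 0
    # Walk from newest to oldest so recent turns are kept first
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def rough_count(text):
    # Crude stand-in for a tokenizer: one token per whitespace-separated word
    return len(text.split())


history = [
    "My name is Ada and I like chess",
    "What openings should I study",
    "Try the Italian Game first",
    "Why that one",
]

print(trim_history(history, budget=10, count_tokens=rough_count))
# ['Try the Italian Game first', 'Why that one']
```

With a budget of 10, the first message is dropped, so the model would no longer "know" the user's name. That is the forgetting described above.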


Parameters

The learned weights stored inside a model — its long-term knowledge baked in during training.

Size | Typical profile
4B | Small, fast, local-friendly
8B | Balanced
70B | High capability, expensive

Tradeoff | Larger models | Smaller models
Pros | Better reasoning, richer language | Less RAM, faster inference
Cons | More compute, slower | Lower capability ceiling

Parameters = brain size. Context window = working memory. Tokens = words coming in.


Inference

Generating predictions from a trained model.

Prompt → Tokens → Context Window → Model → Next-token prediction → Output

Models generate text one token at a time, each token influenced by everything before it.
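
The one-token-at-a-time loop can be sketched with a toy stand-in for the model. Everything here is hypothetical: the score table is hand-written, and unlike a real LLM it only looks at the last token rather than the whole context. The loop always picks the highest-scoring token (greedy decoding), which is one choice among several.

```python
# Hand-written toy scores: for each token, how likely each next token is
NEXT_TOKEN_SCORES = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"<end>": 1.0},
}


def predict_next(context):
    """Toy model: returns next-token scores given the context so far.

    A real LLM conditions on the entire context; this table only
    consults the most recent token. Unknown tokens end the sequence.
    """
    return NEXT_TOKEN_SCORES.get(context[-1], {"<end>": 1.0})


def generate(prompt_tokens, max_new_tokens=10):
    """Greedy generation: append the top-scoring token until <end> or the limit."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = predict_next(context)
        next_token = max(scores, key=scores.get)
        if next_token == "<end>":
            break
        context.append(next_token)  # the new token becomes part of the context
    return context


print(generate(["<start>"]))  # ['<start>', 'the', 'cat', 'sat']
```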


Quantization

Compressing a model by storing its weights at lower numerical precision, which uses less memory at a small quality cost.

Example: Llama-3-8B-Q4

  • Llama-3 → model family
  • 8B → 8 billion parameters
  • Q4 → 4-bit precision (compressed)

Benefit: runs on consumer hardware
Tradeoff: slight quality degradation vs. full precision
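
A back-of-the-envelope calculation shows why 4-bit precision matters for consumer hardware. This sketch assumes a 16-bit baseline for "full precision" and counts the weights only; real quantized files carry some overhead, and running a model needs additional memory beyond the weights.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_8b = 8e9  # 8 billion parameters

print(weight_memory_gb(params_8b, 16))  # 16.0 (assumed 16-bit baseline)
print(weight_memory_gb(params_8b, 4))   # 4.0  (4-bit quantized)
```

Under these assumptions the same 8B model shrinks to roughly a quarter of its original weight size.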


Embeddings

Representing text as numerical vectors that capture meaning, so computers can measure how similar two pieces of text are.

  • "dog" and "puppy" → vectors that are close together
  • "dog" and "spaceship" → vectors far apart

Used in: semantic search, recommendations, RAG, similarity matching
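
"Close together" is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from a trained model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-picked vectors, not output from a real embedding model
vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.9, 0.15],
    "spaceship": [0.1, 0.05, 0.95],
}

dog_puppy = cosine_similarity(vectors["dog"], vectors["puppy"])
dog_ship = cosine_similarity(vectors["dog"], vectors["spaceship"])

print(round(dog_puppy, 2), round(dog_ship, 2))
print(dog_puppy > dog_ship)  # True: "dog" is closer to "puppy"
```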


Attention

The mechanism that lets a model focus on the most relevant parts of its context when generating each token.

Example:
"John gave Sarah a book because she wanted it."
→ Attention links "she" back to "Sarah," not "John."

This is the core innovation behind Transformer-based models (the "T" in GPT, BERT, etc.).
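
In Transformers this focus is computed with scaled dot-product attention: each token's query is compared against every token's key, the scores are turned into weights with a softmax, and the weights blend the values. The sketch below handles a single query with hand-picked toy vectors; in a real model the queries, keys, and values are learned projections, and the numbers here are assumptions chosen so that "she" lines up with "Sarah".

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Output is the weighted blend of the value vectors
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


tokens = ["John", "Sarah", "book"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]                    # toy keys
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy values
query_she = [0.0, 3.0]  # toy query for "she", aligned with Sarah's key

weights, output = attend(query_she, keys, values)
for token, weight in zip(tokens, weights):
    print(token, round(weight, 2))
```

The largest weight lands on "Sarah", so the output for "she" is dominated by Sarah's value vector.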


Temperature

Controls how random (creative) the model's outputs are.

Temperature | Behavior
Low (0–0.3) | Predictable, consistent, factual
Medium (0.5–0.7) | Balanced
High (0.9–1.5+) | Creative, varied; can become incoherent

Use low temp for structured tasks (code, data); higher temp for brainstorming or creative writing.
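
Mechanically, temperature is usually applied by dividing the model's raw scores (logits) by the temperature before the softmax. The sketch below assumes that standard formulation; the logits are made up, and because it divides by the temperature it rejects a value of exactly 0.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all probability sits on the top token, which is why outputs are consistent; at high temperature the distribution flattens and less likely tokens get sampled more often.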


Hallucinations

When the model outputs something plausible but factually wrong — stated with confidence.

Why it happens: models predict what text is likely, not what is true. They don't have a fact-checker.

Mitigation: RAG, grounding with sources, low temperature, explicit instructions to say "I don't know."


RAG (Retrieval-Augmented Generation)

Extending a model's knowledge beyond its training data by retrieving external content at query time.

Question → Retrieve relevant docs → Inject into context → Generate grounded answer

Aspect | Without RAG | With RAG
Knowledge | Training data only | Current or private data
Hallucination risk | Higher | Lower (grounded)
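
The retrieve-and-inject steps can be sketched in a few lines. This is a toy under stated assumptions: it ranks documents by shared words instead of embedding similarity, the documents and function names are invented, and the final generation step (sending the prompt to a model) is left out.

```python
def score(question, doc):
    """Count shared lowercase words; a crude stand-in for embedding similarity."""
    q_words = set(question.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words)


def retrieve(question, docs, k=1):
    """Return the k documents that overlap most with the question."""
    return sorted(docs, key=lambda d: score(question, d), reverse=True)[:k]


def build_prompt(question, docs, k=1):
    """Inject the retrieved documents into the prompt as grounding sources."""
    sources = "\n".join(f"- {d}" for d in retrieve(question, docs, k))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say you do not know.\n"
        f"Sources:\n{sources}\n"
        f"Question: {question}"
    )


# Hypothetical private documents the model was never trained on
docs = [
    "The office is closed on public holidays",
    "Refunds are processed within 14 days of the return",
    "Support is available by email on weekdays",
]
question = "How many days do refunds take"

print(build_prompt(question, docs))
```

The prompt that reaches the model now contains the refund policy, so the answer can be grounded in it instead of guessed.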

Fine-tuning

Additional training on a specific dataset to specialize a general model for a particular task or domain.

Examples: medical assistant, legal document reviewer, customer support agent

Tradeoff: more capable in-domain, potentially weaker out-of-domain. More expensive than prompting alone.


The Full Stack

Words
  ↓ tokenized into
Tokens
  ↓ loaded into
Context Window
  ↓ processed by
Attention  ←  focuses on what matters
  ↓
Model Parameters  ←  learned knowledge
  ↓
Inference  ←  next-token prediction, repeated
  ↓
Output

30-Second Cheat Sheet

Concept | One-liner
Tokens | Chunks of text the model reads
Context window | Temporary memory for one conversation
Parameters | Learned knowledge baked into the model
Inference | Running the model to get an output
Attention | Focus mechanism: what relates to what
Embeddings | Meaning represented as vectors
Temperature | Creativity dial
RAG | Plug in external knowledge at query time
Fine-tuning | Specialize a model on new data
Hallucination | Confident but wrong output
Quantization | Compress model to save memory, slight quality loss