A summary of what each level is, and what each can do, before your next meeting
“Perhaps we can use AI?”
This post is a level set. None of what follows is new ground. But having a clear mental model of what “using AI” actually means — specifically, which of four distinct levels of investment and complexity you are discussing — will help you reason independently the next time the phrase appears in a meeting.
Here’s the short version: each level involves a different kind of data, a different kind of team, a different budget, and a different relationship with the underlying technology. Getting the level wrong is the most common and most expensive mistake companies make.
| Level | Analogy | You Are… |
|---|---|---|
| 1. Classic ML | Hiring a statistician | Teaching a model to find patterns in your spreadsheets |
| 2. Building around a Foundation Model | Hiring a consultant | Asking a pre-trained model good questions with your data as context |
| 3. Fine-Tuning a Foundation Model | Sending that consultant on a training course | Adjusting a pre-trained model so it speaks your language natively |
| 4. Building a Language Model | Founding a university | Creating a new model from nothing, using your own data and architecture |
Concepts You Need First
Before we walk through the levels, four concepts keep coming up. If you’re already comfortable with training vs. inference, parameters, foundation models, and structured vs. unstructured data, skip ahead to Level 1.
Training vs. Inference
Every AI model has two phases of life. Training is the learning phase: you feed data into a model and it adjusts its internal parameters (weights) to get better at a task. Inference is the using phase: the trained model receives new input and produces an output. The weights are frozen; no further learning happens.
The distinction matters because training is expensive: powerful hardware and weeks or months of compute, depending on the kind of AI (more on this below). Inference is cheap: often milliseconds on modest hardware. When you use ChatGPT or Claude, you’re paying for inference. Someone else already paid for training, and they’re trying to make that money back via subscriptions.
Parameters and Scale
A parameter is a single adjustable number inside a model. A large language model like GPT-4 has on the order of a trillion parameters (reportedly around 1.76 trillion across its mixture-of-experts architecture [1]); a small model might have one billion. The number matters because more parameters generally mean more capability, but also more cost to train and more hardware to run.
It turns out there are predictable power-law relationships between model size, training data, and performance [2]. Doubling the parameters doesn’t double the quality — you get diminishing returns. And certain abilities — multi-step reasoning, code generation — appear suddenly when a model crosses a size threshold, not gradually [3].
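As a toy illustration of diminishing returns (with made-up constants, not the published coefficients from the scaling-law papers), a power law of the form loss = a·N^(−b) shows why doubling parameters doesn’t double quality:

```python
# Illustrative only: a toy power law loss(N) = a * N**(-b), with invented
# constants, to show diminishing returns from doubling parameter count.
def toy_loss(n_params: float, a: float = 10.0, b: float = 0.07) -> float:
    """Hypothetical loss as a function of parameter count N."""
    return a * n_params ** (-b)

for n in [1e9, 2e9, 4e9, 8e9]:
    print(f"{n:.0e} params -> loss {toy_loss(n):.3f}")
```

Each doubling buys a smaller improvement than the one before it, which is the qualitative shape the scaling-law literature describes.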
Foundation Models
A foundation model is a large model pre-trained on vast, general data (books, websites, code) at enormous expense. GPT-4, Claude, Gemini, Llama, and Mistral are all foundation models. They understand language, reason about problems, and generate text. They are general-purpose: they haven’t been trained specifically for your business.
The existence of foundation models is what makes Levels 2 and 3 possible. You don’t have to create intelligence from scratch. You can borrow it and adapt it.
Structured vs. Unstructured Data
Structured data lives in rows and columns: spreadsheets, databases, CSV exports. Each row is a record, each column is a feature. Customer age, purchase amount, delivery date. This is the territory of classic ML.
Unstructured data is everything else: documents, emails, images, conversations, PDFs, audio. This is the territory of foundation models and deep learning. The crucial question when evaluating an AI approach is: what kind of data does the problem involve?
Level 1: Classic ML
What It Is
Classic machine learning uses mathematical algorithms to find patterns in structured data. You give the model a set of features (inputs) and a target (what you want to predict), and it learns the relationship. A model might learn that customers who haven’t logged in for 30 days and whose last purchase was over 90 days ago are likely to churn.
The algorithms here (random forests, gradient boosting, logistic regression, support vector machines) are well understood, fast, and interpretable. You can typically explain why the model made a specific prediction, which matters in regulated industries and for building stakeholder trust.
Key Concepts
- Feature engineering: You decide which columns matter. “Days since last login” is more useful than raw login timestamps. This human judgment is the main work.
- Training is cheap: Classic ML models train in minutes or hours on a standard laptop. No GPUs required for most problems.
- Explainability is built in: Tools like SHAP can show exactly which features drove each prediction. A fraud detection model can say: “this transaction was flagged because the amount was 4x the average and it originated from a new device.”
- Data requirements are modest: Hundreds to tens of thousands of labeled examples, not millions.
- Tools are off the shelf: Mature libraries such as scikit-learn cover the standard algorithms; most Level 1 projects need no custom modeling code.
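To make Level 1 concrete, here is a minimal churn-model sketch in scikit-learn, using synthetic data and illustrative feature names in place of a real customer table:

```python
# A minimal churn-prediction sketch with scikit-learn. The data is
# synthetic and the feature names are illustrative, not a real schema.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Hypothetical engineered features: days since last login, days since
# last purchase, support tickets filed.
X = np.column_stack([
    rng.integers(0, 120, n),    # days_since_login
    rng.integers(0, 365, n),    # days_since_purchase
    rng.integers(0, 10, n),     # support_tickets
])
# Synthetic label: churn when the customer has gone quiet.
y = ((X[:, 0] > 30) & (X[:, 1] > 90)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
# feature_importances_ gives a first look at which inputs drive predictions
print(dict(zip(["days_since_login", "days_since_purchase", "tickets"],
               model.feature_importances_.round(2))))
```

The whole thing trains in seconds on a laptop, which is the point: at this level the hard work is choosing the features, not running the model.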
What It Costs
Primarily people-time. The compute is negligible — a few dollars of cloud resources. A typical project costs $5K–50K, almost entirely in data preparation and engineering hours [4]. Once deployed, models run on CPUs for fractions of a penny per prediction.
Business Problems That Fit
- Predicting which customers will churn, and why
- Forecasting demand, revenue, or inventory needs
- Detecting fraud in transaction data
- Scoring and prioritizing sales leads
- Optimizing pricing, routing, or scheduling
- Any problem where the answer is a number or a category, derived from data in your databases
When It Doesn’t Fit
Classic ML can’t read a document and summarize it. It can’t draft an email. It can’t have a conversation. If the problem involves understanding or generating language, you need a different level.
Level 2: Building Around a Foundation Model
What It Is
You take an existing foundation model — Claude, GPT-4, Gemini, or an open-source model like Llama — and use it as-is. You don’t change the model’s weights. Instead, you build software around it: carefully crafted prompts, connections to your data, guardrails for safety, and orchestration logic that chains steps together.
The model is the engine. You build the car around it.
Key Concepts
- Prompt engineering: The way you phrase your instructions to the model determines the quality of the output. This is how you “program” the model. You can provide examples (few-shot prompting) to steer its behavior for specialized tasks.
- RAG (Retrieval-Augmented Generation): The most important pattern at this level. Before the model answers a question, your system searches your own documents and feeds relevant chunks to the model as context. The model can now answer questions about your data without ever having been trained on it. Think of it as giving the consultant a briefing packet before each meeting.
- Agents: Multi-step workflows where the model plans, uses tools (search, databases, calculators, APIs), and iterates until it completes a task. A research agent might search for information, read documents, draft a summary, check it against sources, and refine. (If you want more on the complexity ladder from prompts to workflows to agents, I wrote about that in Prompt, Workflow, Agent.)
- No training cost: You pay only for inference — per-token API fees. There is no upfront training investment.
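A toy sketch of the RAG pattern, with word-overlap retrieval standing in for embeddings and a prompt string standing in for the actual model call (the document store and question are invented):

```python
# A toy RAG skeleton. Real systems use embedding-based search and send
# the assembled prompt to a foundation model's API; here retrieval is
# crude word overlap and the "answer" step is just building the prompt.
import re

def tokens(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the question, return top k."""
    q = tokens(question)
    return sorted(docs, key=lambda d: -len(q & tokens(d)))[:k]

docs = [
    "Refunds are processed within 14 days of the return being received.",
    "Our office is open Monday to Friday, 9am to 5pm.",
    "Returns must be initiated within 30 days of purchase.",
]
question = "How long do refunds take to be processed?"
context = "\n".join(retrieve(question, docs))

# In production this prompt would go to a foundation model's API.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

The shape is the whole pattern: search your own documents first, then hand the model the relevant chunks as context alongside the question.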
What It Costs
API pricing runs from under $1 to $15 per million tokens depending on the model [5][6]. A million tokens is roughly 750,000 words. For a small internal tool processing a few hundred queries per day, expect $50–500/month in API costs. For high-volume production systems, costs can reach thousands per month but scale linearly and predictably.
The bigger cost is engineering time to build the system around the model: days to weeks, not months.
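A back-of-envelope version of that estimate, using assumed rates and volumes rather than any provider’s actual price list:

```python
# Back-of-envelope API cost. All three inputs are assumptions; check
# the provider's pricing page for real per-token rates.
price_per_million_tokens = 3.00   # USD, blended input/output (assumed)
tokens_per_query = 2_000          # prompt + response (assumed)
queries_per_day = 300

monthly_tokens = tokens_per_query * queries_per_day * 30
monthly_cost = monthly_tokens / 1_000_000 * price_per_million_tokens
print(f"~${monthly_cost:.0f}/month")  # ~$54/month with these assumptions
```

Costs scale linearly with query volume and tokens per query, which is why Level 2 budgets are easy to forecast.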
Business Problems That Fit
- Summarizing long documents, reports, or meeting transcripts
- Customer support chatbots that answer questions from your knowledge base
- Extracting structured data from unstructured PDFs, emails, or contracts
- Drafting marketing copy, proposals, or internal communications
- Automating research workflows: gathering, reading, synthesizing information
- Any task involving language that a capable intern could do with access to your files
When It Doesn’t Fit
If the model consistently gets domain-specific outputs wrong despite good prompts — wrong terminology, misunderstood conventions, unreliable output formats — you may need to move to Level 3. And if sending your data to a third-party API isn’t acceptable for privacy or regulatory reasons, you either self-host an open model or move to a different level.
Level 3: Fine-Tuning a Foundation Model
What It Is
You take a pre-trained foundation model and continue training it on your own data. The model’s weights are updated so that your domain knowledge becomes part of its internal parameters. It retains its general abilities but gains specialized expertise — your terminology, your formats, your patterns.
The analogy: the foundation model already has a university degree. Fine-tuning is the professional certification course that makes it effective in your specific field.
Key Concepts
- LoRA (Low-Rank Adaptation): This is the technique that makes fine-tuning affordable. Instead of retraining all of the model’s billions of parameters, LoRA freezes the original weights and trains only small additional matrices — typically less than 1% of the total parameters [7]. You can fine-tune a large model on a single GPU instead of a data center.
- You need labeled data: Fine-tuning requires examples of the input/output behavior you want. Typically hundreds to thousands of examples: “given this input, produce this output.” The quality of your examples directly determines the quality of the result.
- Catastrophic forgetting: A real risk. The model may lose some of its general capabilities as it learns your domain. LoRA mitigates this because the original weights stay frozen, but it doesn’t eliminate the risk entirely.
- You own the model: The fine-tuned model runs on your infrastructure. No data leaves your systems during inference. This matters for regulated industries.
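The LoRA idea itself reduces to a few lines of linear algebra. A numpy sketch with illustrative dimensions (not a real training loop, and not the API of any particular fine-tuning library):

```python
# The LoRA idea in numpy: freeze W and learn a low-rank update B @ A.
# Dimensions are illustrative; real layers and ranks vary by model.
import numpy as np

d, k, r = 4096, 4096, 8                  # layer dims and LoRA rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen pre-trained weights
A = rng.standard_normal((r, k)) * 0.01   # trainable
B = np.zeros((d, r))                     # trainable, zero-init so the
                                         # update starts as a no-op
alpha = 16.0

def forward(x: np.ndarray) -> np.ndarray:
    # Effective weight is W + (alpha / r) * B @ A; W itself never changes.
    return x @ (W + (alpha / r) * (B @ A)).T

full = W.size
lora = A.size + B.size
print(f"trainable params: {lora:,} of {full:,} ({lora / full:.1%})")
```

Only A and B are trained, which is where the “less than 1% of parameters” figure comes from, and why the frozen W mitigates catastrophic forgetting.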
What It Costs
A fine-tuning run with LoRA typically costs $500–$3,000 in cloud GPU time [8][9]. The larger cost is preparing the training data — curating, cleaning, and formatting hundreds of high-quality examples. A full project including data preparation might run $5K–$15K. This is a fraction of training from scratch.
Business Problems That Fit
- Making a model reliably produce outputs in a specific format or style that prompting alone cannot achieve
- Building a model that natively speaks your industry jargon (medical, legal, financial, engineering terminology)
- Classification tasks with enough labeled data: routing support tickets, categorizing documents, detecting compliance issues
- Any task where you’ve tried Level 2, it works 80% of the time, but the remaining 20% matters enough to invest in customization
When It Doesn’t Fit
If prompting and RAG at Level 2 solve the problem, fine-tuning adds unnecessary complexity. If you don’t have high-quality labeled examples, the fine-tuning will reflect the quality of what you feed it — garbage in, garbage out still applies. And if the problem isn’t about language at all — it’s about predicting numbers from spreadsheets — you’re at the wrong level entirely. Go back to Level 1.
Level 4: Building a Language Model from Scratch
What It Is
You design a neural network architecture, assemble or create a large training dataset, and train the model from a random starting point. Every parameter is learned from your data. No foundation model is involved. This is the most expensive, slowest, and most technically demanding approach.
The companies that do this at scale are the ones that build the foundation models: OpenAI, Anthropic, Google, Meta. At a smaller scale, organizations build purpose-built models for specialized domains where no existing model has the right training data.
Key Concepts
- The Transformer: The architecture underneath virtually all modern language models. Introduced in 2017 [10], it uses an attention mechanism that lets the model weigh the importance of every word relative to every other word in a passage. This is what allows models to understand context rather than just processing words in sequence.
- Training is the bottleneck: Training a 1B-parameter model from scratch requires weeks of GPU time. A 100B-parameter model requires months on clusters of thousands of GPUs. The compute cost alone can range from $50K to well over $1M [11][12], before you factor in the team and the data.
- Data is everything: The model can only learn what’s in the training data. Curating a high-quality, diverse, domain-appropriate dataset is often the hardest part of the project.
- The result is entirely yours: Complete control over the model’s knowledge, behavior, and deployment. No dependency on any provider. But also: no safety net. Every problem is yours to solve.
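The attention mechanism at the heart of the Transformer is surprisingly compact. A numpy sketch of scaled dot-product attention with toy dimensions:

```python
# Scaled dot-product attention from the Transformer paper [10], in numpy.
# Toy dimensions; real models run many heads over long sequences.
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each position weighs every other."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8              # four "words", eight-dim vectors
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

out, weights = attention(Q, K, V)
print(out.shape)                 # each position gets a context-mixed vector
print(weights.sum(axis=-1))      # attention weights per position sum to 1
```

Each row of `weights` says how much that position attends to every other position, which is the mechanism that lets the model use context rather than reading strictly left to right.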
What It Costs
A small model (1–10B parameters) costs $50K–$500K in compute, plus a team of 3–10 ML engineers over 3–12 months [11]. Larger models scale into the millions of dollars. The total project cost including people, data, and infrastructure typically runs $200K–$2M for a small model and tens of millions for anything approaching frontier scale. For reference, GPT-4’s training cost is estimated between $78M and $192M [12].
Business Problems That Fit
- You operate in a domain where no existing model has adequate training data (a proprietary language, specialized notation, a novel data modality)
- You need to run a model on edge hardware (phones, IoT devices, embedded systems) where size, latency, and offline operation are hard constraints
- The model IS the product — your competitive differentiation comes from having a capability no one else can replicate
- You have deep ML expertise in-house and a strategic reason to invest at this level
When It Doesn’t Fit
Almost always. The existence of high-quality foundation models means Level 4 is the right choice only when Levels 1–3 have been explicitly ruled out. If someone proposes building from scratch as a first step, the most useful question you can ask is: “What did we try at Levels 2 and 3, and why didn’t it work?”
Choosing the Right Level
So when the question arises — “perhaps we can use AI” — here are three questions to ask, in order.
Question 1: What kind of data is the problem about?
If the answer is structured data (databases, spreadsheets, transaction logs, sensor readings) → Level 1: Classic ML. Stop here. This is the most common answer and the most commonly overlooked. Large language models and generative AI grab the headlines, but if you need to do good work with numbers, classic ML is the place to start. Detect fraud? Classic ML. Improve yield? Classic ML.
If the answer is unstructured data (text, documents, emails, images, conversations) → proceed to Question 2.
Question 2: Can an existing model do this with good instructions?
Test it. Spend two hours with Claude or GPT-4. Give it your data, your instructions, some examples. If it works: Level 2. Build wrappers and ship it. This is the fastest path to value and the right starting point for most language tasks. How you build the wrappers depends on the complexity of the task: every major foundation model exposes an API for programmatic access, and frameworks such as LangChain, LlamaIndex, LangGraph, and CrewAI support building more sophisticated tooling around a foundation model.
If it works 80% but fails on domain-specific nuance → proceed to Question 3.
If no existing model understands your data at all → Level 4, but verify this carefully. It’s rare.
Question 3: Is the remaining gap worth closing with customization?
If you have labeled examples of the behavior you want and the cost of errors in the remaining 20% justifies the investment → Level 3: Fine-Tune.
If you don’t have labeled examples, go back to Level 2 and invest in better prompts, better retrieval, or better data before considering fine-tuning.
The governing principle: start from the simplest level that could work and escalate only with evidence. Most business AI problems are solved at Levels 1 or 2.
Summary
| | Classic ML | Building Around | Fine-Tuning | Build from Scratch |
|---|---|---|---|---|
| Data type | Structured (tables) | Unstructured (text) | Unstructured (text) | Any |
| What changes | You train a small ML model | Nothing — you use the model as-is | A fraction of the model’s weights | All weights, from zero |
| Training cost | Negligible | None | $500–$3K compute | $50K–$1M+ compute |
| Time to prototype | Days | Hours to days | Weeks | Months |
| Team needed | Data scientist | Software developer | ML engineer | ML research team |
| Interpretability | High (SHAP, LIME) | Low (black box) | Low | Low |
| Runs on | CPUs | Provider’s cloud or self-hosted | Your GPUs | Your GPU cluster |
| Main risk | Data quality, concept drift | Provider dependency, prompt fragility | Data quality, catastrophic forgetting | Cost, timeline, team capability |
The right answer to “perhaps we can use AI” is not yes or no. It’s: what kind of problem is it, and which level fits?
If you take one thing from this post into your next meeting, make it those three questions. They won’t make you a technical expert. But they’ll help you tell the difference between a problem that needs a spreadsheet model and one that needs a foundation model — and that decision point is where most of the money gets wasted.
References
Model Architecture & Parameters
[1] Exploding Topics. “Number of Parameters in GPT-4 (Latest Data).” August 2024. https://explodingtopics.com/blog/gpt-parameters — Reports GPT-4 uses a Mixture of Experts architecture with approximately 1.76 trillion total parameters (8 experts × 220B parameters each).
[2] Wolfe, Cameron. “Scaling Laws for LLMs: From GPT-3 to o3.” Deep Learning Focus, January 2025. https://cameronrwolfe.substack.com/p/llm-scaling-laws — Comprehensive review of power-law relationships between compute, data, and model performance.
[3] Wei, Jason et al. “Emergent Abilities of Large Language Models.” Transactions on Machine Learning Research, 2022. https://arxiv.org/abs/2206.07682 — Foundational paper documenting abilities that appear at specific scale thresholds.
Classic ML Costs
[4] Toxigon. “How Much Does It Cost to Develop a Machine Learning Model for Your Business.” January 2026. https://toxsl.com/blog/585/how-much-does-it-cost-to-develop-a-machine-learning-model-for-your-business — Industry survey of ML project costs ranging from $5K for simple models to $50K+ for complex implementations.
API Pricing
[5] OpenAI. “API Pricing.” https://openai.com/api/pricing/ — Official pricing documentation showing rates from $0.15/M tokens (GPT-4o Mini) to $60/M tokens (o1 models).
[6] Anthropic. “Claude API Pricing.” https://www.anthropic.com/pricing — Official pricing for Claude models ranging from $0.25/M tokens (Haiku) to $15/M tokens (Opus).
Fine-Tuning Costs
[7] Hu, Edward J. et al. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv, 2021. https://arxiv.org/abs/2106.09685 — Original paper introducing LoRA, demonstrating fine-tuning with <1% of parameters.
[8] 10xStudio. “How much does it cost to fine-tune LLaMA 3.1 with LoRA?” September 2024. https://10xstudio.ai/blog/how-much-does-it-cost-to-finetune-llama-with-lora — Detailed cost breakdown showing $50-$300 per training run for compute.
[9] Stratagem Systems. “LoRA Fine-Tuning Cost 2026.” January 2026. https://www.stratagem-systems.com/blog/lora-fine-tuning-cost-analysis-2026 — Current market rates for fine-tuning projects ($500-$5,000 typical).
Transformer Architecture
[10] Vaswani, Ashish et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NeurIPS 2017). https://arxiv.org/abs/1706.03762 — The foundational paper introducing the Transformer architecture.
Training Costs
[11] Galileo AI. “What is the Cost of Training LLM Models?” February 2026. https://galileo.ai/blog/llm-model-training-cost — Comprehensive breakdown of training costs by model size, from $50K for small models to $100M+ for frontier models.
[12] CUDO Compute. “What is the cost of training large language models?” May 2025. https://www.cudocompute.com/blog/what-is-the-cost-of-training-large-language-models — Industry analysis of compute costs including GPT-4 training estimates ($78M-$192M).
Additional Sources
[13] NVIDIA Developer Blog. “OpenAI Presents GPT-3, a 175 Billion Parameters Language Model.” June 2020. https://developer.nvidia.com/blog/openai-presents-gpt-3-a-175-billion-parameters-language-model/ — Documentation of GPT-3’s architecture and parameter count.
[14] Wikipedia. “Neural scaling law.” https://en.wikipedia.org/wiki/Neural_scaling_law — Overview of scaling laws discovered by Kaplan et al. (2020) and Hoffmann et al. (2022).
[15] Stanford HAI. “Examining Emergent Abilities in Large Language Models.” January 2026. https://hai.stanford.edu/news/examining-emergent-abilities-large-language-models — Recent analysis of emergent capabilities and scale thresholds.
[16] MIT CSAIL. “How to build AI scaling laws for efficient LLM training and budget maximization.” September 2025. https://www.csail.mit.edu/news/how-build-ai-scaling-laws-efficient-llm-training-and-budget-maximization — Research on optimizing training budgets using scaling laws.
[17] Epoch AI. “Frontier language models have become much smaller.” December 2024. https://epoch.ai/gradient-updates/frontier-language-models-have-become-much-smaller — Analysis of parameter efficiency trends in modern LLMs.
[18] Exxact Corporation. “How LoRA Makes AI Fine-Tuning Faster, Cheaper, and More Practical.” November 2025. https://www.exxactcorp.com/blog/deep-learning/ai-fine-tuning-with-lora — Technical overview of LoRA efficiency gains.