The Math Behind AI: What You Actually Need to Know

February 24, 2026 · AI @ VU Team

A lot of AI students treat math courses as hoops to jump through. I get it. When you signed up for an AI degree, you probably didn't picture yourself computing determinants by hand. But here's the thing: almost every "cool" AI technique you'll learn later is built directly on the math you're doing now. This post connects the dots.

Linear algebra is how ML talks about data

Machine learning operates on vectors and matrices. An image is a matrix of pixel values. A dataset is a matrix where each row is a data point and each column is a feature. A neural network layer is, at its core, a matrix multiplication:

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

That's it. That's a neural network layer. $W$ is a weight matrix, $\mathbf{x}$ is your input, and $\mathbf{b}$ is a bias vector. When you learn matrix multiplication in Linear Algebra, you're learning the operation that every neural network runs thousands of times during a single forward pass.
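As a sanity check, here's that layer as a few lines of NumPy. The shapes and numbers are made up purely for illustration:

```python
import numpy as np

# A single dense layer y = Wx + b, mapping 3 inputs to 2 outputs.
W = np.array([[0.2, -0.5, 0.1],
              [0.7,  0.3, -0.2]])   # weight matrix: 2 outputs x 3 inputs
b = np.array([0.1, -0.1])           # bias vector, one entry per output
x = np.array([1.0, 2.0, 3.0])       # input vector

y = W @ x + b                       # the entire layer: one matrix multiply plus a bias
```

Stacking layers like this (with a nonlinearity in between) is all a feedforward network is.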

Some other connections:

Eigenvalues and eigenvectors come up in PCA (Principal Component Analysis), which is one of the most common techniques for reducing the number of dimensions in a dataset. Given a covariance matrix $C$, PCA finds vectors $\mathbf{v}$ where $C\mathbf{v} = \lambda \mathbf{v}$. Those vectors point in the directions of maximum variance in your data.
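Here's a minimal sketch of that eigenvector equation in NumPy, on synthetic data (the mixing matrix is an arbitrary choice to make the data stretched in one direction):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data, stretched along one direction by an arbitrary matrix.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

C = np.cov(X, rowvar=False)            # 2x2 covariance matrix of the data
eigvals, eigvecs = np.linalg.eigh(C)   # eigh: for symmetric matrices; eigenvalues ascending

# Each column v of eigvecs satisfies C v = lambda v.
v = eigvecs[:, -1]                     # eigenvector for the largest eigenvalue:
                                       # the direction of maximum variance
```

Projecting the data onto the top eigenvectors is exactly what PCA does.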

Cosine similarity, which measures how similar two vectors are, is used everywhere in NLP: $\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$. It's how search engines compare documents and how word embeddings measure whether two words mean similar things.
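The formula translates directly into code; a minimal version (the example vectors are illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Parallel vectors score 1, orthogonal vectors score 0, opposite vectors -1.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
sim = cosine_similarity(a, b)   # b is a scaled copy of a, so sim is 1.0
```

Note that scaling a vector doesn't change the score: cosine similarity only cares about direction, which is why it works well for comparing documents of different lengths.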

SVD (Singular Value Decomposition) decomposes any matrix as $A = U\Sigma V^T$. This shows up in recommendation systems (how Netflix suggests movies), image compression, and various NLP techniques.
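A quick NumPy sketch: decompose a small made-up matrix and reconstruct it from its factors.

```python
import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# A = U @ Sigma @ V^T; svd returns the singular values s as a vector.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rec = U @ np.diag(s) @ Vt   # multiplying the factors back together recovers A
```

Keeping only the largest singular values (and the matching columns of $U$ and rows of $V^T$) gives the best low-rank approximation of $A$, which is the idea behind the compression and recommendation applications mentioned above.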

Calculus is how models learn

Gradient descent is the algorithm that trains almost every modern ML model, and it's just calculus. The idea: you have a loss function that measures how wrong your model is, and you want to make it smaller. Calculus tells you which direction to go.

The chain rule from calculus is backpropagation. That's not an analogy; it's literally what backpropagation does. For a composition $f(g(x))$:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

In a neural network, each layer is a function composed with the next. To figure out how changing a weight deep in the network affects the final loss, you apply the chain rule backward through every layer. That's backpropagation.

The gradient descent update rule is:

$$w_{t+1} = w_t - \eta \nabla L(w_t)$$

where $\eta$ is the learning rate and $\nabla L$ is the gradient of the loss with respect to the weights. You compute partial derivatives $\frac{\partial L}{\partial w_j}$ for every weight $w_j$ in the model, then nudge each weight in the direction that reduces the loss.
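To make the update rule concrete, here's gradient descent on a toy one-dimensional loss $L(w) = (w - 3)^2$. The loss, learning rate, and starting point are all arbitrary choices for the demo:

```python
# Toy loss L(w) = (w - 3)^2, whose gradient is 2(w - 3).
def grad(w):
    return 2 * (w - 3)

w = 0.0      # arbitrary starting weight
eta = 0.1    # learning rate, chosen by hand
for _ in range(100):
    w = w - eta * grad(w)   # w_{t+1} = w_t - eta * grad L(w_t)
# After enough steps, w sits at the minimum, w = 3.
```

Real training is the same loop, except the gradient comes from backpropagation and there are millions of weights instead of one.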

Here's a concrete example. Take $f(x,y) = x^2 y + 3y^2$. The gradient is:

$$\nabla f = \left(\frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y}\right) = \left(2xy,\; x^2 + 6y\right)$$

Now imagine $x$ and $y$ are weights in a model and $f$ is a loss function. That gradient tells you exactly how to adjust each weight to make the loss smaller. That's all gradient descent is doing, over and over, with millions of weights instead of two.
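You can check the analytic gradient above against finite differences, a trick that's also handy for debugging hand-written backprop (the evaluation point is arbitrary):

```python
import numpy as np

def f(x, y):
    return x**2 * y + 3 * y**2

def grad_f(x, y):
    # The analytic gradient derived above: (2xy, x^2 + 6y)
    return np.array([2 * x * y, x**2 + 6 * y])

# Central finite differences approximate each partial derivative numerically.
x0, y0, h = 1.5, -2.0, 1e-6
numeric = np.array([
    (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h),
    (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h),
])
analytic = grad_f(x0, y0)   # (-6, -9.75) at this point
```

If the two disagree, your derivative is wrong; automatic-differentiation libraries use exactly this kind of check in their test suites.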

Probability is how you deal with the fact that data is noisy

You never have perfect data. There's always noise, missing values, measurement error. Probability and statistics give you the tools to work with that.

Bayes' rule is everywhere in ML:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

Spam filters use it. Medical diagnosis systems use it. Any model that updates its beliefs based on new evidence is doing some version of Bayes' rule.
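A toy spam-filter update with made-up probabilities shows the rule in action:

```python
# Made-up numbers: 20% of mail is spam, the word "free" appears in 60% of
# spam and 5% of legitimate mail.
p_spam = 0.2
p_word_given_spam = 0.6
p_word_given_ham = 0.05

# P("free") via the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' rule: P(spam | "free") = P("free" | spam) P(spam) / P("free")
p_spam_given_word = p_word_given_spam * p_spam / p_word
```

Seeing the word lifts the spam probability from 20% to 75%: the filter updated its belief based on evidence, which is all Bayes' rule is.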

The normal distribution shows up constantly because of the Central Limit Theorem (more on that below). Its density function is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
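As a quick numerical check that this density is properly normalized (its area is 1), here's the formula in code with a hand-rolled trapezoid rule:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # The density function above, written out directly.
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Integrate the standard normal density over [-8, 8] with the trapezoid rule;
# the tails beyond +/-8 are negligibly small.
xs = np.linspace(-8, 8, 10001)
ys = normal_pdf(xs, mu=0.0, sigma=1.0)
dx = xs[1] - xs[0]
area = dx * (ys.sum() - 0.5 * (ys[0] + ys[-1]))   # comes out essentially 1
```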

Maximum Likelihood Estimation (MLE) is how you find the best parameters $\theta$ for a model. You pick the parameters that make your observed data most probable:

$$\hat{\theta} = \arg\max_\theta \prod_{i=1}^{n} P(x_i \mid \theta)$$

In practice, you minimize the negative log-likelihood instead: products of many small probabilities underflow, sums of logs are numerically stable, and taking the log doesn't change where the maximum is:

$$\hat{\theta} = \arg\min_\theta \left[ -\sum_{i=1}^{n} \log P(x_i \mid \theta) \right]$$
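A small sketch: for Gaussian data with known $\sigma$, scanning the negative log-likelihood over a grid of candidate $\mu$ values recovers the sample mean, which is the known closed-form MLE in this case (the data and grid are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=1000)   # synthetic samples, true mu = 2

def nll(mu, x, sigma=1.0):
    # Negative log-likelihood of i.i.d. Gaussian data under mean mu.
    return np.sum((x - mu)**2 / (2 * sigma**2) + np.log(sigma * np.sqrt(2 * np.pi)))

# Brute-force search over candidate means; the minimum lands at the sample mean.
mus = np.linspace(0, 4, 401)
mu_hat = mus[np.argmin([nll(m, data) for m in mus])]
```

In real models there's no closed form and no feasible grid, so you minimize the same negative log-likelihood with gradient descent instead.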

Expected value and variance are basic but they come up in reinforcement learning (expected reward), risk assessment, and pretty much any situation where you need to summarize a distribution with a few numbers:

$$\mathbb{E}[X] = \sum_x x \, P(X = x), \qquad \mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$
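For a concrete case, take a fair six-sided die:

```python
import numpy as np

# Outcomes 1..6, each with probability 1/6.
x = np.arange(1, 7)
p = np.full(6, 1 / 6)

mean = np.sum(x * p)               # E[X] = 3.5
var = np.sum(x**2 * p) - mean**2   # Var(X) = E[X^2] - (E[X])^2 = 35/12
```

Two numbers summarize the whole distribution: where it sits on average, and how spread out it is.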

Hypothesis testing is how you answer questions like "is model A actually better than model B, or did it just get lucky on this test set?" You need p-values and confidence intervals for this.
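A permutation test is one simple way to answer that question without distributional assumptions; this is a sketch with made-up per-fold accuracies, not the only valid test:

```python
import numpy as np

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test for a difference in means.
    Returns the fraction of label shufflings whose gap is at least as
    large as the observed one -- an approximate p-value."""
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)   # pretend the model labels are arbitrary
        gap = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        count += gap >= observed
    return count / n_perm

# Made-up per-fold accuracies for two models:
acc_a = np.array([0.91, 0.89, 0.93, 0.90, 0.92])
acc_b = np.array([0.88, 0.87, 0.90, 0.86, 0.89])
p = permutation_test(acc_a, acc_b)
```

A small p-value means a gap this large rarely happens by relabeling chance alone, which is evidence the difference between the models is real.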

And the Central Limit Theorem is why averaging works. If you take $n$ i.i.d. samples with mean $\mu$ and variance $\sigma^2$, the distribution of the sample mean converges to a normal distribution:

$$\bar{X}_n \xrightarrow{d} \mathcal{N}\!\left(\mu,\; \frac{\sigma^2}{n}\right)$$

This is why larger datasets give you better estimates. It's also why so many things in nature look normally distributed.
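You can watch the CLT happen in a simulation: average uniform samples, which look nothing like a bell curve, and the means cluster normally around $\mu$ with spread $\sigma/\sqrt{n}$ (the sample sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Uniform on [0, 1]: mean 0.5, variance 1/12 -- decidedly non-normal.
n = 100
sample_means = rng.uniform(0, 1, size=(20000, n)).mean(axis=1)

# The CLT predicts the sample means have std sigma / sqrt(n).
predicted_std = np.sqrt(1 / 12) / np.sqrt(n)
empirical_std = sample_means.std()
```

A histogram of `sample_means` comes out bell-shaped even though the underlying samples are flat, and quadrupling $n$ would halve the spread: that's the $1/\sqrt{n}$ in action.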

How it all maps together

The math courses come in Year 1. The applications come in Years 2 and 3. It feels disconnected while you're doing it, but here's where each thing shows up:

| Math concept | Where it shows up in AI |
| --- | --- |
| Matrix multiplication $W\mathbf{x} + \mathbf{b}$ | Neural network forward pass |
| Eigenvalues $C\mathbf{v} = \lambda\mathbf{v}$ | PCA, spectral clustering |
| Gradient $\nabla L$ | Training any model (gradient descent) |
| Chain rule $\frac{df}{dg} \cdot \frac{dg}{dx}$ | Backpropagation |
| Bayes' rule $P(A \mid B)$ | Probabilistic models, classification |
| Distributions $\mathcal{N}(\mu, \sigma^2)$ | Data modeling, generative models |
| Hypothesis testing | Model evaluation, A/B testing |

Some practical advice

3Blue1Brown's "Essence of Linear Algebra" and "Essence of Calculus" series on YouTube are excellent. They build geometric intuition for stuff that's usually taught purely algebraically. Watch them before or alongside your courses.

When you learn a new math concept, try implementing it in code. Write matrix multiplication in NumPy. Implement gradient descent from scratch for a simple function. The combination of math on paper and code that runs makes things click in a way that neither does alone.

Before Machine Learning starts in year 2, spend a weekend reviewing gradients, matrix operations, and Bayes' rule. Seriously. An afternoon of review will save you hours of confusion later.

math · machine-learning · linear-algebra · calculus · probability