Maximum Likelihood Estimation in a nutshell

Rebecca Patro
3 min read · Jan 17, 2021

Suppose we have a random sample from a population and we know the population is normally distributed. Then our best estimates for the population mean and standard deviation are the sample mean and the sample standard deviation. That's the basic idea behind Maximum Likelihood Estimation.
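As a quick illustration (a minimal sketch of my own, not from the original), here is that estimate in code. One detail worth noting: the MLE of the standard deviation divides by n rather than by n − 1.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.0, scale=2.0, size=1_000)  # draw from a known normal

mu_hat = sample.mean()          # MLE of the population mean
sigma_hat = sample.std(ddof=0)  # MLE of sigma divides by n, not n - 1

print(mu_hat, sigma_hat)        # close to the true values 5.0 and 2.0
```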

Maximum Likelihood Estimation (MLE) helps estimate the parameters of a model. For instance, if we want to fit a Linear Regression model to a data set, MLE can estimate the coefficients.

Likelihood vs Probability

Let's describe this using the Linear Regression model mentioned above.

L(coef | x, y) is the likelihood that these coefficients fit the Linear Model, given this set of data.

P(y | coef, x) is the probability of observing this result from the Linear Regression model with these coefficients and this data.
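One compact way to state the relationship (my phrasing, not from the original) is that both expressions evaluate the same quantity, just with different roles for the arguments:

L(\theta \mid x, y) = P(y \mid x, \theta)

Probability treats the coefficients as fixed and asks about the data; likelihood holds the data fixed and treats the coefficients as the variable.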

Steps for Maximum Likelihood Estimation

Remember that for independent observations we can multiply the individual probabilities of the various x values to get their combined (joint) probability. For continuous data, each of those individual values comes from the Probability Density Function (PDF).

For a Normal Distribution the PDF is given by,
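f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)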

We will be using that a little later.

Step 1. Create the Likelihood function
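For n independent observations x_1, …, x_n drawn from a distribution with parameter θ, the likelihood is the product of the individual densities:

L(\theta \mid x_1, \dots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)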

Step 2. Take the log of the Likelihood function
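Taking the log turns the product into a sum, which is far easier to differentiate:

\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)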

Step 3. The maximum likelihood estimate is the argument that maximizes the likelihood. Thus to find the MLE we take the partial derivative of the log likelihood function with respect to theta, set it to zero, and solve.
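As a quick concrete check (a standard example, not spelled out in the original): for a normal sample with known σ, differentiating the log likelihood with respect to μ gives

\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0 \quad \Rightarrow \quad \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i

which is exactly the sample mean from the opening paragraph.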

Maximizing the Likelihood for Regression with Gaussian Noise

In a Linear model we estimate y using coefficients theta and data x, where x and theta are vectors of size m. Epsilon is a random variable that captures the noise, or error, between the predicted value and the real value.
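y = \theta^T x + \epsilon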

We can assume the noise is normally distributed with zero-mean.
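\epsilon \sim \mathcal{N}(0, \sigma^2)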

Thus the conditional probability distribution of y given x using this Gaussian can be written as
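p(y \mid x; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \theta^T x)^2}{2\sigma^2}\right)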

This is similar to the normal distribution PDF above, but instead of conditioning on the mean we condition on x and theta, which combine to form the mean of the Gaussian for y.

For the sake of brevity I will skip over derivation steps 1 and 2 and show the final MLE objective for theta.
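For n observations (x_i, y_i), applying steps 1 and 2 to the conditional density above gives the log likelihood

\ell(\theta) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - \theta^T x_i\right)^2

Note that maximizing this in theta is the same as minimizing the sum of squared errors, which is why MLE under Gaussian noise coincides with ordinary least squares.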

In order to get the argmax we take the partial derivative and set it to zero.
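Stacking the x_i as the rows of a matrix X and the y_i into a vector y, setting the gradient to zero yields the familiar normal equations:

\hat{\theta} = (X^T X)^{-1} X^T y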

Thus we have a Linear Regression model obtained using Maximum Likelihood Estimation.
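To make this concrete, here is a minimal sketch (my own illustration, with made-up data) that maximizes the Gaussian log likelihood numerically and checks that it matches the closed-form solution above:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))                         # rows are the x_i vectors
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.3, size=n)  # Gaussian noise, sigma = 0.3

def neg_log_likelihood(theta, sigma=0.3):
    # Negative Gaussian log-likelihood; minimizing it maximizes the likelihood.
    # The argmax over theta does not depend on sigma, so a fixed sigma is fine.
    resid = y - X @ theta
    return (n / 2) * np.log(2 * np.pi * sigma**2) + resid @ resid / (2 * sigma**2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(m)).x   # numerical MLE
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)             # normal equations

print(theta_mle)     # both should be close to [2.0, -1.0, 0.5]
print(theta_closed)
```

The two estimates agree up to optimizer tolerance, confirming that the MLE under Gaussian noise is the ordinary least squares solution.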
