Naive Bayes
Note: used with discrete data.
With the previous classifiers, there was an exponential growth in parameter size when increasing the number of attributes. Introducing the Naive Bayes Classifier, we assume that all features \vec{x} are conditionally independent given the class label y, reducing the number of parameters needed. We write p(\vec{x} \mid y). This means that the value of one feature does not influence or provide information about another feature, once we know the class.
📌 The name naive is given to highlight the (very big!) assumption of having independent features, which is often violated, but might nevertheless “work well.”
The Naive Bayes Classifier is actually a family of classifiers that use different distributions for p(\vec{x} \mid y) to model different problems.
- For binary features, a Bernoulli distribution is used.
- For continuous features, a Gaussian distribution is common.
- For categorical features, a categorical distribution is used.
In this case, for the class prior p(y)—which we need to know to compute p(\vec{x} \mid y)—we use a categorical distribution.
❓
- What is the Bayes estimator for (i) classification and 0/1 loss, (ii) regression with \ell_2 loss, (iii) regression with \ell_1 loss? – L1 is Mean Absolute Error (MAE), while L2 is Mean Squared Error (MSE).
- Suppose we know the data distribution p(\boldsymbol x \mid y) exactly. Can the Bayes optimal classifier still be wrong? Explain.
MLE for Naive Bayes
To estimate the parameters of our Naive Bayes classifier, we use MLE. Instead of maximizing the likelihood directly, we maximize the log-likelihood because it simplifies the computation by turning products into sums. The log-likelihood is split into two separate parts:
- Prior class distribution
- Class-conditional probabilities
By maximizing these two parts separately we obtain the MLE estimates.
💡 Add-1 (Laplace) Smoothing
To handle the zero-frequency problem (when a category or feature value is unseen in the training data), we can use add-1 smoothing.
Bayesian Naive Bayes
While MLE provides point estimates for the parameters, it can lead to problems like assigning zero probability to events not seen in the training data and thus needs add-1 smoothing. Another way to address this in the case of Bayesian Naive Bayes, is by incorporating prior beliefs and using Bayesian inference to estimate the parameters.
- Prior distributions: We use Dirichlet priors over the categorical distributions for both the class priors and the conditional probabilities.
- Posterior distributions: By combining the priors with the observed data, we obtain posterior distributions, which give us updated beliefs about the parameters.
This approach naturally smooths the parameter estimates, avoiding zero probabilities.
❓
- What is the Naive Bayes assumption? Why is it useful if this assumption can be made?
- Construct a simple example data distribution where using the Naive Bayes assumption is not appropriate.
- Is the Dirichlet-multinomial model relevant for a general Naive Bayes classifier? And for a Naive Bayes classifier with categorical features? Explain.
- Contrast categorical Naive Bayes using (i) MLE, (ii) MAP, (iii) fully Bayesian inference.
Gaussian Naive Bayes (GNB)
Note: used with continuous data.
Let’s consider the case in which the features are modeled with a continuous probability distribution rather than a discrete one.
The mean and standard deviation of the Gaussian are estimated using MLE.
❓
- Describe generally how Naive Bayes can be used with continuous features.
- Now describe Gaussian Naive Bayes, its assumptions, and its parameters.
- Is Gaussian Naive Bayes a linear classifier? – Yes, it separates binary data, also in multiple dimensions, but always linearly.
- For each assumption, give an example where the assumption does not hold.
- Is the feature distribution p(\boldsymbol x) of a Gaussian Naive Bayes model a Gaussian distribution? Explain your answer!
- What is the homoscedasticity assumption? Is it made by Gaussian Naive Bayes?
- Homoscedasticity refers to the assumption that the variance of the errors (or residuals) is constant across all levels of the independent variables. This assumption is not made by NB.
- Describe the Quadratic Discriminant Analysis (QDA) and Linear Discriminant Analysis (LDA) models.
- Linear Discriminant Analysis (LDA) assumes that each class is normally distributed with the same covariance matrix, resulting in linear decision boundaries between classes. It models p(X \mid Y) as Gaussian distributions and uses Bayes’ theorem to classify new data points.
- Quadratic Discriminant Analysis (QDA) relaxes the assumption of shared covariance matrices, allowing each class to have its own covariance, leading to quadratic decision boundaries that can better handle more complex data distributions.
- Gaussian Naive Bayes can be interpreted as a constrained variant of QDA. Why?
- GNB assumes uncorrelated features (diagonal covariance matrix), while QDA allows correlated features (full covariance matrix). This makes GNB a more constrained and computationally simpler model compared to QDA.
- Neither Gaussian Naive Bayes nor Linear Discriminant Analysis is more powerful than the respective other model. Why?
- Performance depends on how well the data aligns with their assumptions. GNB is better for sparse or independent-feature datasets, while LDA is more suitable for correlated-feature datasets.
- How can we “generate” data from a (i) Gaussian Naive Bayes, (ii) QDA, (iii) LDA model?
- By sampling from these distributions according to the model assumptions, you can generate synthetic data that adheres to each model’s constraints.
Logistic Regression (LR)
Logistic regression is a discriminative classifier that models binary outputs. It uses MAP estimation and empirical risk.
LR assumes that the output variable is binary and follows a Bernoulli distribution.
LR can also be generalized to non-binary problems, for example, by using softmax regression.
📌
The math: Bernoulli Distribution, Odds, and Logistic Functions
The Bernoulli distribution models a binary outcome y \in {0,, 1} with a success probability \theta \in [0, 1]. Its probability mass function is:
p(y \mid \theta) = \begin{cases} \theta & \text{if } y = 1, \newline 1 - \theta & \text{if } y = 0 \end{cases}The odds of success for \theta is defined in terms of the odds function, which measures the relative likelihood of success:
\text{odds}(\theta) = \frac{\theta}{1-\theta}The logit function (log of the odds) transforms probabilities from [0, 1] to the range (- \infty, \infty):
\text{logit}(\theta) = \log\left(\frac{\theta}{1-\theta}\right) = \log(\theta) - \log(1-\theta)The logistic function is the inverse of the logit function, used to map a value \eta \in (-\infty, \infty) back to a probability \theta \in [0, 1]
\theta = \sigma(\eta) = \frac{1}{1 + \exp(-\eta)}The logistic regression model uses a log-odds (logit) transformation to link a linear combination of input features to probabilities. The log-odds of success (y=1) can be expressed as:
\eta = w_0 + \sum_{j=1}^D w_j x_j = w_0 + \langle \vec{w}, \vec{x} \rangle\newline \begin{aligned} &\text{Where:}\newline &- ,w_0\text{ is the {\bf bias} term}\newline &- ,\vec{w}\text{ is the {\bf weight} vector}\newline &- ,\langle \vec{w}, \vec{x} \rangle\text{ is the {\bf dot product} of weights and features} \end{aligned}Probability of Success p(y=1) can be obtained using the logistic (sigmoid) function:
p(y = 1 \mid \vec{x}, \vec{w}) = \sigma(\eta) = \frac{1}{1 + \exp(-\eta)}Probability of Failure p(y = 0) is
p(y = 0 \mid \vec{x}, \vec{w}) = 1 - \sigma(\eta) = \sigma(-\eta)To make the equations cleaner
Combine the bias into the weight vector by defining an additional input feature x_0 = 1.
The probability of success simplifies to:
p(y = 1 \mid \vec{x}, \vec{w}) = \sigma(\langle \vec{w}, \vec{x} \rangle)The general case for any outcome y \in {0, 1} is:
p(y \mid \vec{x}, \vec{w}) = \text{Ber}(y \mid \sigma(\langle \vec{w}, \vec{x} \rangle))❓ Compare Gaussian Naive Bayes and logistic regression classifiers.
LR vs GNB
GNB is a generative model, meaning it models the joint probability p(x,y). LR is a discriminative model, focusing on the conditional probability p(y \mid x). Moreover, for every GNB classifier, there exists an equivalent LR classifier, but not the other way around.
LR is more flexible and often performs better when GNB’s assumptions are violated. On the other hand, GNB converges faster because it requires fewer parameters.
We should choose GNB when:
- The dataset is small.
- The features approximately follow Gaussian distributions within each class.
We should choose LR when:
- The dataset is larger and more complex.
- GNB’s assumptions are violated.
The perceptron model
The perceptron model predicts binary outcomes based on the inner product of the weight vector \vec{w} and the input vector \vec{x} we were using in logistic regression. Specifically, if such product is
- 0 if the dot product is negative
- 1 if it is positive.
💡 Decision boundary of a classifier The decision boundary of a classifier is the set of data points for which the classifier is unsure to which class the points belong to, as multiple classes achieve the highest possible probability score on such points. In the case of the perceptron, it happens when \langle \vec{w}, \vec{x} \rangle = 0 Geometrically, this boundary is a hyperplane that separates the data into two classes.
Alternatively, we can also express \langle \vec{w}, \vec{x} \rangle as
\langle \vec{w}, \vec{x} \rangle = |\vec{w}| \cdot |\vec{x}| \cdot \cos(\theta)where \theta is the angle between \vec{w} and \vec{x}.
This decision thus yields:
- |\theta| > 90^\circ: Negative instances
- |\theta| < 90^\circ: Positive instances
- |\theta| = 90^\circ: Decision boundary
Adding a bias term (offset) allows the decision boundary to shift, forming an affine hyperplane that does not pass through the origin. With or without the bias term, the perceptron is still a linear classifier because the decision boundary remains a hyperplane.
This perceptron model can perfectly separate datapoints in any dimension, but as soon as the data does not meet such criteria it will fails, even in the very simple case of predicting the output of a XOR operation.
💡What about categorical features? If we have a binary problem with categorical features, we can use the so-called one-hot encoding, which encodes categorical values into numerical ones by highlighting the feature that is considered using a 1, while setting all other features to 0. For example, if we have X \in {\texttt{red, green, blue}} and we want to encode the value \texttt{green}, we would create a vector x=(0, , 1, , 0)^{\top}.
Moreover, LR can handle slightly imbalanced classes, although the bias term may not be reliable in such scenarios, the feature weights remain useful.
❓
- Define odds, log odds, logit, and logistic function. How are they related?
- Describe the logistic regression model. What are its assumptions?
- How is the logistic regression model parameterized?
- What is a bias feature? Why is it used?
- What is a logit score? What does it tell you if it is (i) negative, (ii) zero, (iii) positive?
- How can we interpret weights in the logistic regression model?
- How does logistic regression handle categorical variables?
- Relate the perceptron to logistic regression. What’s similar? What’s different?
- What is a decision boundary of a classifier? When is a classifier linear? Is logistic regression linear? Why or why not?
- When is a dataset linearly separable? How does this relate to linear classifiers?
- What problems can arise with MLE for logistic regression? How can these problems be mitigated?
- In the absence of regularization, MLE may overfit the training data. Adding a penalty term can help. Different scales of data. To fix: normalization. Moreover, datasets with a large number of features MLE can converge slowly, so using adaptive learning rates can help.
SoftMax regression
SoftMax regression is a generalization of logistic regression for multi-class classification problems. It extends the regression approach to multiple classes by using a separate linear predictor for each class. The SoftMax function then converts these scores into a probability vector that predicts the likelihood of each class.
Each element in the resulting vector is a non-negative value, and the sum of these values equals 1.
It is important to remember that the scale of inputs doesn’t affect the SoftMax output.
This model is equivalent to LR in the special case of applying the SoftMax function with two outcomes.
SoftMax Regression extends logistic regression to handle multiple classes. Like LR regression, SoftMax regression estimates weights using maximum likelihood, which maximizes the probability of correct classifications across the training set.
❓
- State the definition of the softmax function, including its domain and its range.
- The softmax function is defined as:
\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}} ,,\forall,i \in {1, \dots, k} The domain is \R^k (a vector of real numbers), and the range is [0, , 1], ensuring that the outputs sum to 1 (probabilities).- Describe the softmax regression model. What are its assumptions?
- Extends logistic regression to multiclass classification by modeling the conditional probability of each class as a softmax function of a linear combination of features. It assumes that the classes are mutually exclusive and that the data is linearly separable in a higher-dimensional space.
- How does softmax regression relate to logistic regression?
- It handles more than 2 classes.
- How can we interpret the scores obtained by softmax regression?
- Probabilities, representing the model’s confidence for each class.
- Why are the weight vectors obtained by softmax regression redundant?
- The weight vectors in softmax regression are redundant because the predictions depend only on how the scores compare to each other, not their exact values. This issue is resolved by anchoring one weight vector, usually by setting it to zero.