Elastic Net Regularization
In machine learning, Ridge and Lasso are two common regularization methods. Their objectives are as follows:

Ridge:

$$\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2$$

Lasso:

$$\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_1$$
We know that Ridge penalizes dimensions with large $w_i$ most heavily (the quadratic penalty barely touches weights that are already near zero), so it shrinks the weights smoothly but rarely sets any of them exactly to zero, which results in a dense solution. Lasso penalizes dimensions with small $w_i$ relatively more strongly (the linear penalty keeps a constant pressure even near zero), which drives many weights exactly to zero and results in a sparse solution.
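As a quick sanity check of this dense-versus-sparse behavior, here is a minimal scikit-learn sketch; the synthetic data set, the feature counts, and `alpha=1.0` are illustrative assumptions rather than anything from the discussion above:

```python
# Illustrative check of the dense-vs-sparse claim (synthetic data, assumed settings).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# A made-up regression problem where only 5 of the 50 features are informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks the weights but leaves essentially all of them non-zero (dense);
# Lasso pushes most of them exactly to zero (sparse).
print("non-zero Ridge coefficients:", np.count_nonzero(ridge.coef_))  # ~50
print("non-zero Lasso coefficients:", np.count_nonzero(lasso.coef_))  # far fewer
```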
However, what if we combine Ridge and Lasso? That is, what if we use the following objective:

$$\min_{w} \; \|y - Xw\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$$

Then what will happen to $w$? Will it be sparse or dense?
Actually, this combined regularization is called Elastic Net Regularization[1]. It is a linear combination of L1 and L2 regularization, with the following pros and cons[2]:
Advantages:
- Elastic Net combines the strengths of both Lasso and Ridge. It can select features among a large number of redundant features and also handle highly correlated features.
- When there is multicollinearity among features, Elastic Net can select a group of correlated features to participate together in model building, enhancing model stability.
Disadvantages:
- Parameter tuning: the regularization strength α and the L1/L2 mixing weight ρ need to be selected carefully through methods like cross-validation (see the sketch after this list); otherwise the model may perform poorly.
- Compared to Lasso alone, Elastic Net's solutions may not be sparse enough in extremely sparse problems, which could reduce model interpretability.
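Since tuning α and ρ by cross-validation comes up above, here is a minimal sketch of that step with scikit-learn; the synthetic data and the candidate grids are assumptions for illustration, and scikit-learn names the overall strength `alpha` and the mixing weight `l1_ratio`:

```python
# Illustrative cross-validation sketch for Elastic Net (assumed data and grids).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

# alpha    ~ overall regularization strength (the α above)
# l1_ratio ~ L1/L2 mixing weight (the ρ above)
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
                     alphas=np.logspace(-3, 1, 30),
                     cv=5).fit(X, y)

print("selected alpha:   ", model.alpha_)
print("selected l1_ratio:", model.l1_ratio_)
# Sparsity usually lands between pure Ridge (dense) and pure Lasso (sparsest).
print("non-zero coefficients:", np.count_nonzero(model.coef_))
```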
KL Divergence
When reviewing the Variational Autoencoder, there is an equation in the derivation process:

$$-D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \frac{1}{2}\sum_{j=1}^{J}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
We know that, for KL Divergence, we have:
- $D_{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right] = \sum_x P(x) \log \frac{P(x)}{Q(x)}$.
But why does the KL Divergence in the derivation of the VAE look like this? What is the intuitive explanation of this equation?
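Assuming the equation above is the usual closed-form KL between the diagonal Gaussian posterior $q_\phi(z \mid x) = \mathcal{N}(\mu, \sigma^2 I)$ and the standard normal prior $p(z) = \mathcal{N}(0, I)$, a short numerical sketch can at least check that this closed form is the definition $\mathbb{E}_{z \sim q}[\log q(z) - \log p(z)]$ evaluated analytically (the values of $\mu$ and $\sigma$ below are made up):

```python
# Sketch under an assumption: the VAE KL term above is the closed-form KL between
# a diagonal Gaussian q = N(mu, sigma^2 I) and the standard normal prior p = N(0, I).
# Compare it with a Monte Carlo estimate of the definition E_{z~q}[log q(z) - log p(z)].
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])      # made-up posterior means
sigma = np.array([0.8, 1.5, 0.3])    # made-up posterior standard deviations

# Closed form from the VAE derivation: -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)
kl_closed = -0.5 * np.sum(1.0 + np.log(sigma**2) - mu**2 - sigma**2)

# Monte Carlo estimate of the definition, sampling z ~ q(z|x)
z = rng.normal(loc=mu, scale=sigma, size=(200_000, mu.size))
log_q = norm.logpdf(z, loc=mu, scale=sigma).sum(axis=1)  # log q(z|x), independent dims
log_p = norm.logpdf(z, loc=0.0, scale=1.0).sum(axis=1)   # log p(z)
kl_mc = np.mean(log_q - log_p)

print(f"closed form:  {kl_closed:.4f}")
print(f"Monte Carlo:  {kl_mc:.4f}")  # agrees up to sampling noise
```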
References
[1] https://en.wikipedia.org/wiki/Elastic_net_regularization
[2] https://blog.csdn.net/qq_51320133/article/details/137421397