How does L1-Regularization create Sparsity — Mathematical and Visual Intuition
I have read in many sources (blogs and YouTube videos) that L1-Regularization creates sparsity, but none of them gave me a good intuition of how until I saw this video. The video is behind a paywall, so I will summarize it in this blog.
Let us understand the intuition through the following steps.
Step 1: Let us write down the generalized L1 and L2 mathematical formulations.
Note 1: Here the loss can be anything and will be the same for both the L1 and L2 formulations.
Note 2: The left side of the table is for L2-Regularization and the right side is for L1-Regularization.
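Since the original table is not reproduced here, the following is a sketch of the two objectives, with λ denoting the regularization strength and wᵢ the model weights:

```latex
% L2-Regularization (left side of the table)
J_{L2}(w) = \text{Loss} + \lambda \sum_{i} w_i^2

% L1-Regularization (right side of the table)
J_{L1}(w) = \text{Loss} + \lambda \sum_{i} \lvert w_i \rvert
```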
Step 2: Let us ignore the loss, as it is common to both, and assume that the input is one-dimensional to make things simpler (the same argument works even if the input is n-dimensional).
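Dropping the loss and keeping a single weight w₁, the two penalty terms reduce to:

```latex
R_{L2}(w_1) = \lambda\, w_1^{2}
\qquad\qquad
R_{L1}(w_1) = \lambda\, \lvert w_1 \rvert
```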
Step 3: Let us find the gradient of each of the above formulations.
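Differentiating each penalty with respect to w₁ (for L1 this holds wherever w₁ ≠ 0, since |w₁| is not differentiable at zero):

```latex
\frac{\partial R_{L2}}{\partial w_1} = 2\lambda\, w_1
\qquad\qquad
\frac{\partial R_{L1}}{\partial w_1} = \lambda\, \operatorname{sign}(w_1)
```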
Step 4: Update w₁ using the gradient at the (j+1)ᵗʰ iteration.
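Using these gradients, the gradient-descent updates are:

```latex
% L2-Regularization
w_1^{(j+1)} = w_1^{(j)} - lr \cdot 2\lambda\, w_1^{(j)}

% L1-Regularization
w_1^{(j+1)} = w_1^{(j)} - lr \cdot \lambda \cdot \operatorname{sign}\bigl(w_1^{(j)}\bigr)
```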
lr is the learning rate.
Step 5: Intuition with an example. Let w₁ = 0.05 at the jᵗʰ iteration and let lr = 0.01.
Step 6: Plug the values into the equations.
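Taking λ = 1 for simplicity (an assumption, since the exact value used in the video is not shown here), a single update gives:

```latex
% L2: the step shrinks the weight in proportion to its current value
w_1^{(j+1)} = 0.05 - 0.01 \cdot 2 \cdot 0.05 = 0.049

% L1: the step subtracts a constant amount, independent of the weight's magnitude
w_1^{(j+1)} = 0.05 - 0.01 \cdot \operatorname{sign}(0.05) = 0.04
```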
The L1 update pushed the weight significantly further towards zero than the L2 update did. After a few iterations (five, in this example), the weight under L1-Regularization becomes exactly zero, while the weight under L2-Regularization is still around 0.045. Because the L2 step shrinks the weight in proportion to its current value, the weight under L2-Regularization may come very close to zero but never actually reaches it, whereas the constant-sized L1 step drives it all the way to zero. This is how L1-Regularization creates sparsity.
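A minimal Python sketch of the two update rules (again assuming λ = 1) reproduces these numbers:

```python
# Gradient-descent updates applied to the regularization term alone.
# Assumption: lambda = 1, so the numbers match the worked example above.
lr = 0.01        # learning rate
w_l1 = 0.05      # weight under L1-Regularization
w_l2 = 0.05      # weight under L2-Regularization

for step in range(1, 6):
    # L1 gradient: sign(w1) -> each step subtracts a constant amount
    sign = 1.0 if w_l1 > 0 else (-1.0 if w_l1 < 0 else 0.0)
    w_l1 -= lr * sign

    # L2 gradient: 2 * w1 -> each step shrinks the weight proportionally
    w_l2 -= lr * 2 * w_l2

    print(f"iteration {step}: L1 weight = {w_l1:.4f}, L2 weight = {w_l2:.4f}")

# iteration 5: L1 weight = 0.0000, L2 weight = 0.0452
```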