it has always puzzled me what sparsity means when computation is nonlinear, i.e., when the observation is decoded from a sparse code using nonlinear computation, because the sparse code can very well turn into a dense code somewhere along the nonlinear path from the code to the observation. this puzzle made me write a short note a few years back, and i thought i’d share my thoughts on sparsity here with you:
in my mind, there are three ways to define sparse coding.
- code sparsity: the code is sparse, i.e., $|z|_0 = O(1)$ for $z \in \mathbb{R}^{d'}$.
- computational sparsity: the computation is sparse, i.e., $x = \sum_{k=1}^K z_k w_k$, where the sum runs only over the $K = O(1)$ nonzero entries of the code and $w_k \in \mathbb{R}^d$.
- noise robustness: the computation is robust to perturbation of the parameters: let $\tilde{w} = w + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. the squared error between $x$ and $\tilde{x}$, i.e., $|x - \tilde{x}|_2^2 = |\sum_{k=1}^K z_k w_k - \sum_{k=1}^K z_k \tilde{w}_k|_2^2$, is $O(d \times \sigma^2)$ rather than $O(d' \times d \times \sigma^2)$, because $K \ll d'$ is a constant w.r.t. $d'$.
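here is a quick numerical check of this last view in the linear case. this is a minimal sketch, assuming pytorch, and all the dimensions and the gaussian dictionary below are arbitrary illustrative choices:

```python
# a minimal sketch of noise robustness in the linear case (illustrative dimensions only).
import torch

torch.manual_seed(0)
d, d_prime, K, sigma = 64, 1024, 4, 0.01

W = torch.randn(d, d_prime)                        # dictionary of d' atoms in R^d
z = torch.zeros(d_prime)
z[torch.randperm(d_prime)[:K]] = torch.randn(K)    # sparse code with K nonzero entries

x = W @ z                                          # linear decoding touches only K atoms

W_tilde = W + sigma * torch.randn_like(W)          # perturb every parameter
x_tilde = W_tilde @ z

err = ((x - x_tilde) ** 2).sum().item()
# only the K atoms that are actually used contribute to the error, so it scales like
# K * d * sigma^2 = O(d * sigma^2), not d' * d * sigma^2.
print(err, K * d * sigma ** 2)
```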
these are equivalent if we constrain the decoder to be linear (i.e., $x = \sum_{i=1}^{d'} z_i w_i$), but they are not with a nonlinear decoder. in particular, let us consider a neural net decoder with a single hidden layer, $x = u \max(0, w z)$, where $u \in \mathbb{R}^{d \times d_h}$ and $w \in \mathbb{R}^{d_h \times d'}$. we can then think of how these different notions of sparsity manifest themselves and how we could encourage each of them when training a neural net.
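to make the puzzle from the opening concrete, here is a small sketch of this decoder (dimensions again arbitrary) showing that the sparse code $z$ turns into a dense hidden code along the nonlinear path to $x$:

```python
# a small sketch of the decoder x = u max(0, w z); the code is sparse, but the
# hidden code along the nonlinear path is not (illustrative dimensions only).
import torch

torch.manual_seed(0)
d, d_h, d_prime, K = 64, 512, 1024, 4

u = torch.randn(d, d_h)
w = torch.randn(d_h, d_prime)

z = torch.zeros(d_prime)
z[torch.randperm(d_prime)[:K]] = torch.randn(K)    # code sparsity: |z|_0 = K

h = torch.relu(w @ z)                              # hidden code on the path to x
x = u @ h

print((z != 0).sum().item())                       # K nonzero entries in the code
print((h != 0).sum().item())                       # roughly d_h / 2 nonzero entries
```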
the amount of computation is then $O(d \times d_h + d_h \times d')$, which reduces to $O(d \times d')$ assuming $d_h = O(d')$. even if we impose code sparsity on $z$, the overall computation, now $O(d \times d_h + d_h \times K)$, remains $O(d \times d')$, since the hidden-to-output multiplication still costs $d \times d_h$. in other words, code sparsity does not imply computational sparsity, unlike in linear sparse coding.
based on this observation, one can imagine imposing sparsity on all odd-numbered layers (counting $z$ as the first layer) as well as on the penultimate layer (the one right before $x$) in order to satisfy computational sparsity with a nonlinear decoder. in the example above, this means imposing sparsity on both $z$ and $\max(0, wz)$.
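as a rough sketch of what such a training objective could look like for this example, one could add an l1 penalty on both $z$ and the hidden activation. the penalty and every name below are illustrative choices rather than the only way to do this:

```python
# a hedged sketch: penalize both the code and the penultimate layer so that every
# matrix-vector product along the path involves only a few active units.
import torch

d, d_h, d_prime = 64, 512, 1024
u = torch.randn(d, d_h, requires_grad=True)
w = torch.randn(d_h, d_prime, requires_grad=True)

def decode(z):
    h = torch.relu(w @ z)        # hidden code, the penultimate layer
    return h, u @ h              # reconstruction

def sparse_loss(x, z, lam=1e-3):
    h, x_hat = decode(z)
    recon = ((x - x_hat) ** 2).sum()
    # l1 penalties on z and on max(0, w z), i.e., on both layers whose sparsity
    # determines how many columns of w and u are actually used
    return recon + lam * (z.abs().sum() + h.abs().sum())
```

the codes $z^n$ themselves would be optimized per example or amortized with an encoder, as in usual sparse coding.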
this naive approach to computational sparsity implies noise robustness, as the number of parameters used in the computation is restricted by construction. it does not mean, however, that there is no other way to impose noise robustness. in particular, we can rewrite the whole problem of sparse coding as
$$\min_{z, w, u} \frac{1}{N} \sum_{n=1}^N |x^n - u \max(0, w z^n)|_2^2$$
subject to $$| \text{Jac}_{w,u}\, u \max(0, w z^n) |_F^2 < K d~\text{for all}~n=1,\ldots, N.$$
in other words, the influence of perturbing the parameters on the output must be bounded by a constant multiple of the output dimensionality.
of course it is not tractable to solve this problem exactly, but we can write a regularized proxy problem:
$$\min_{z, w, u} \frac{1}{N} \sum_{n=1}^N \left( |x^n - u \max(0, w z^n)|_2^2 + \lambda | \text{Jac}_{w, u}\, u \max(0, w z^n) |_F^2 \right),$$
where $\lambda$ is the regularization strength. in other words, we look for parameters $w$ and $u$ such that the output is robust to perturbation of the parameters.
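as a rough sketch of how one could compute this penalty without forming the parameter jacobian explicitly, one option is a single hutchinson-style probe, using the identity $\mathbb{E}_{v \sim \mathcal{N}(0, I)} |J^\top v|_2^2 = |J|_F^2$ together with the fact that $J^\top v$ is a vector-jacobian product. everything below is an illustrative choice, not the procedure from the original note:

```python
# a hedged sketch of the regularized proxy with a one-sample estimate of the
# squared frobenius norm of the jacobian of the output w.r.t. (w, u).
import torch

d, d_h, d_prime = 64, 512, 1024
u = torch.randn(d, d_h, requires_grad=True)
w = torch.randn(d_h, d_prime, requires_grad=True)

def decode(z):
    return u @ torch.relu(w @ z)

def regularized_loss(x, z, lam=1e-3):
    x_hat = decode(z)
    recon = ((x - x_hat) ** 2).sum()

    # vector-jacobian product J^T v via autograd; E_v |J^T v|^2 = |J|_F^2
    v = torch.randn_like(x_hat)
    grads = torch.autograd.grad(x_hat @ v, (w, u), create_graph=True)
    jac_fro_sq = sum((g ** 2).sum() for g in grads)

    return recon + lam * jac_fro_sq
```

averaging the estimate over a few probes, or over the examples in a minibatch, would reduce its variance.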
so, which of these notions of sparsity are we referring to, and which do we actually desire, when we talk about sparsity in neural networks?