Theoretical background

Introduction

We are interested in solving the standard linear inverse problem \[\begin{equation}\label{eq:LM-problem} y = Ax + w, \end{equation}\]

where \(y\in {\mathbb{R}}^n\) and \(A\in{\mathbb{R}}^{n\times p}\) are known parameters of the model, \(w\in{\mathbb{R}}^n\) is a random noise vector and \(x\in{\mathbb{R}}^p\) is an unknown vector of parameters we wish to estimate. We assume \(p\gg n\), i.e. the number of features is much greater than the sample size, and hence there may be many potential solutions to the problem\(~\eqref{eq:LM-problem}\).

To resolve this issue and prevent overfitting, we introduce a penalty function \(\phi\) which favors sparse solutions of\(~\eqref{eq:LM-problem}\), i.e. we now look for minimizers of the form \[\begin{equation}\label{eq:SLOPE} \widehat{x} = \operatorname*{argmin}_x \{\, \frac{1}{2}\Vert Ax - y \Vert_2^2 + \phi(x) \,\}. \end{equation}\]

The usual choices of \(\phi\) are the scaled \(l^2\) penalty (Tikhonov regularization) and the \(l^1\) penalty (LASSO). This note concerns sorted \(l^1\) penalized estimation (abbrev. SLOPE), introduced in (Bogdan et al. 2015), which takes \(\phi\) to be the sorted \(l^1\) penalty, i.e. \[ \phi(x)= \phi_{\lambda}(x) = \sum_{i=1}^p \lambda_ix_i^{\downarrow}, \] where \(x_1^\downarrow \ge x_2^\downarrow \ge \ldots \ge x_p^\downarrow\) is the nonincreasing rearrangement of the vector \({\left\vert x \right\vert}=({\left\vert x_1 \right\vert},{\left\vert x_2 \right\vert},\ldots,{\left\vert x_p \right\vert})\) and \(\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0\) are hyperparameters of the model. To lighten notation, we denote \(\Delta_p = \{\, x\in{\mathbb{R}}^p\colon x_1\ge x_2 \ge \ldots \ge x_p \ge 0 \,\}\), so that the above requirements read: \(x^\downarrow,\lambda\in\Delta_p\), where \(x^\downarrow = P{\left\vert x \right\vert}\) for some permutation \(P\) of the set \(\{1,2,\ldots,p\}\). Such a choice of regularizer generalizes \(l^1\) regularization, as can be seen by taking \(\lambda_i\equiv\operatorname{const}\).
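For concreteness, evaluating \(\phi_\lambda\) in R is a one-liner; the helper below (a minimal sketch, the name sorted_l1 is ours) merely illustrates the definition.

# Evaluate the sorted-l1 penalty phi_lambda(x) = sum_i lambda_i * |x|_(i),
# where |x|_(1) >= |x|_(2) >= ... is the nonincreasing rearrangement of |x|.
sorted_l1 <- function(x, lambda) {
  stopifnot(length(x) == length(lambda), !is.unsorted(rev(lambda)), all(lambda >= 0))
  sum(lambda * sort(abs(x), decreasing = TRUE))
}
# With constant lambda this reduces to the usual l1 penalty:
# sorted_l1(c(1, -2, 0.5), rep(1, 3)) == sum(abs(c(1, -2, 0.5)))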

The fact that \(\phi_\lambda\) is non-separable makes the analysis of its theoretical properties much more onerous than in the case of classical (separable) models, cf. (Bogdan et al. 2015; Bu et al. 2019). Nonetheless, it turns out that SLOPE has two advantages over other methods such as LASSO and knockoffs, namely:

  1. it achieves certain minimax estimation properties under particular random designs without requiring any knowledge of the sparsity of the true vector \(x\), cf. (Bellec, Lecué, and Tsybakov 2018);
  2. it controls the false discovery rate in the case of independent predictors, cf. (Bogdan et al. 2015).

We are interested in algorithmic solutions to the problem \(\eqref{eq:SLOPE}\). Since the objective function in \(\eqref{eq:SLOPE}\) is convex but not smooth, one cannot directly apply classical gradient descent and has to turn to other methods.

A natural alternative is the plethora of proximal algorithms, e.g. ISTA (and its accelerated variant FISTA, cf. (A. Beck and Teboulle 2009)) or the more classical alternating direction method of multipliers (ADMM). These methods have been thoroughly studied in the literature, cf. (A. Beck 2017) for a detailed treatment of the subject.

In this note we focus on another approach, via approximate message passing (AMP), considered for the first time in the context of the LASSO problem in (Donoho, Maleki, and Montanari 2009), subsequently developed in e.g. (Bayati and Montanari 2011), and adapted to SLOPE regularization in (Bu et al. 2019) – see e.g. (Zdeborová and Krzakala 2016) for an accessible derivation of the method.

In the subsequent sections we briefly describe some of these approaches.

Proximal methods

Denoting \(g(x)=\frac{1}{2}\Vert Ax - y \Vert_2^2\), the iterative shrinkage-thresholding algorithm (ISTA) iteration for SLOPE with a given \(\lambda\) can be written as:


ISTA-SLOPE:
Input: \(y\in{\mathbb{R}}^n\), \(\lambda\in\Delta_p\), \(A\in{\mathbb{R}}^{n\times p}\)
Initialize \(g(x)=\frac{1}{2}{\left\Vert Ax-y \right\Vert}^2\), \(x\in{\mathbb{R}}^p\), \(t>0\)
while (stopping condition) {
\(x \leftarrow {\operatorname{prox}}_{t\phi_\lambda}\big(x - t\nabla g(x)\big)\);
} return \(x\)


where \(t\) can be thought of as the learning rate and \(\operatorname{prox}\) denotes the proximal operator given by \[ {\operatorname{prox}}_{h}(y) := \operatorname*{argmin}_x \{\, h(x) + \frac{1}{2}\Vert x-y \Vert_2^2 \,\}. \]
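To make the iteration concrete, here is a minimal R sketch of the ISTA-SLOPE loop. The names ista_slope and prox_sorted_l1 are ours; prox_sorted_l1(u, lambda) is assumed to compute \({\operatorname{prox}}_{\phi_\lambda}(u)\) (one possible sketch of it is given after the FastProxSL1 pseudocode below). This is not the implementation used in the numerical experiments later in this note.

# One possible ISTA-SLOPE loop: repeated proximal-gradient steps with a fixed step size t.
# For convergence a safe choice is t <= 1 / norm(A, "2")^2 (inverse Lipschitz constant of the gradient).
ista_slope <- function(y, A, lambda, t, x = rep(0, ncol(A)), max_iter = 100L) {
  for (k in seq_len(max_iter)) {
    grad <- as.numeric(crossprod(A, A %*% x - y))   # gradient of g(x) = 0.5 * ||Ax - y||_2^2
    x <- prox_sorted_l1(x - t * grad, t * lambda)   # proximal step with thresholds t * lambda
  }
  x
}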

A. Beck and Teboulle (2009) introduced a faster version of ISTA, known as FISTA, which is based on the idea of Nesterov momentum. The general form of the algorithm is the following:


FISTA-SLOPE:
Input: \(y\in{\mathbb{R}}^n\), \(\lambda\in\Delta_p\), \(A\in{\mathbb{R}}^{n\times p}\)
Initialize \(g(x)=\frac{1}{2}{\left\Vert Ax-y \right\Vert}^2\), \(x=x_{old}\in{\mathbb{R}}^p\), \(r = 1\), \(t>0\)
while (stopping condition) {
\(u \leftarrow x_{old} + r(x-x_{old})\);
\(x_{old}\leftarrow x\);
\(x \leftarrow {\operatorname{prox}}_{t\phi_\lambda}\big(u - t\nabla g(u)\big)\);
update(\(r\));
} return \(x\)


Here \(r\) can be thought of as an acceleration term, which (if updated correctly through the update rule) can substantially increase the speed of convergence of the algorithm. Note that keeping \(r\equiv 1\) restores ISTA.
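One standard choice of update(\(r\)), following A. Beck and Teboulle (2009) and written in the notation of the pseudocode above, keeps an auxiliary scalar \(s\) (their \(t_k\), initialized to 1) and after each iteration sets \(s_{new} = \frac{1+\sqrt{1+4s^2}}{2}\) and \(r = 1+\frac{s-1}{s_{new}}\), so that \(u = x + \frac{s-1}{s_{new}}(x-x_{old})\). A minimal R sketch (the function name update_r is ours):

# Nesterov-type momentum schedule for FISTA: s_new = (1 + sqrt(1 + 4 s^2)) / 2,
# r = 1 + (s - 1) / s_new.  With s = 1 the first iteration has r = 1 (a plain ISTA step).
update_r <- function(s) {
  s_new <- (1 + sqrt(1 + 4 * s^2)) / 2
  list(r = 1 + (s - 1) / s_new, s = s_new)
}
# usage inside the loop, with st initialized to list(s = 1):  st <- update_r(st$s); r <- st$r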

One of the difficulties in dealing with SLOPE is that the regularizer \(\phi_\lambda\) is non-separable and thus its proximal operator cannot be applied element-wise. (Bogdan et al. 2015) proposed an efficient algorithm that for any \(u\in{\mathbb{R}}^p\) computes \[ \widehat{x}= {\operatorname{prox}}_{\phi_{\lambda}}(u) = \operatorname*{argmin}_x \big\{ \frac{1}{2}{\left\Vert u-x \right\Vert}^2 + \sum_i \lambda_ix_i^\downarrow \big\} = \operatorname*{argmin}_x \big\{\sum_i\big[ \frac{1}{2}(x^\downarrow_i)^2 + x_i^\downarrow\lambda_i - x_iu_i\big]\big\}. \] It is based on the following simple observations that follow immediately from the above formulation:

  1. \({\operatorname{sign}}(\widehat{x}_i)={\operatorname{sign}}(u_i)\) for each \(i\) such that \(\widehat{x}_i\neq 0\);
  2. \(P\widehat{x}={\operatorname{prox}}_{\phi_\lambda}(Pu)\) for any permutation \(P\);
  3. If \(u\in \Delta_p\), then \(\widehat{x}\in\Delta_p\) (i.e., \(u=u^\downarrow \Rightarrow \widehat{x}=\widehat{x}^\downarrow\));

Therefore we can and do assume in the analysis below that \(u\in\Delta_p\) (and hence, by observation 3, \(\widehat{x}\in\Delta_p\)). The optimization procedure now reads: \[ \widehat{x} = \operatorname*{argmin}_{x\in\Delta_p} \big\{\sum_i\big[ \frac{1}{2}x_i^2 - x_i(u-\lambda)_i \big]\big\}, \] whence

  1. \(\widehat{x}\) depends only on the vector \((u - \lambda)\);
  2. If \((u - \lambda)_+\in\Delta_p\) then \(\widehat{x} = (u - \lambda)_+\), where \(v_+ = (\max(v_i,0))_i\);
  3. If \((u-\lambda)_i \le (u-\lambda)_{i+1}\), then \(\widehat{x}_i \le \widehat{x}_{i+1}\); since \(\widehat{x}\in\Delta_p\) also gives \(\widehat{x}_i \ge \widehat{x}_{i+1}\), this forces \(\widehat{x}_i = \widehat{x}_{i+1}\);
  4. Consequently, if \((u-\lambda)\) is nondecreasing along the indices \((i,i+1,\ldots,j)\), then \(\widehat{x}_i = \widehat{x}_{i+1}= \ldots = \widehat{x}_{j}\) and \[ \sum_{i\le r \le j} \widehat{x}_r(u-\lambda)_r = \sum_{i\le r \le j} \widehat{x}_r(\bar{u}-\bar{\lambda}), \] where \(\bar{u},\bar{\lambda}\) are the means of \(u\) and \(\lambda\) along the indices \((i,i+1,\ldots,j)\), so replacing \(u_r,\lambda_r\) with \(\bar{u},\bar{\lambda}\) does not change \(\widehat{x}\).

The above observations justify the procedure below proposed by (Bogdan et al. 2015).


FastProxSL1:
Input: \(u\in{\mathbb{R}}^p\), \(\lambda\in\Delta_p\)
# Define the operator \(H_u(v) = P({\operatorname{sign}}(u_i)v_i)_i\), where \(P\) is a permutation sorting \({\left\vert u \right\vert}\) nonincreasingly, so that \(H_u(u) = u^\downarrow\in\Delta_p\)
\(u' \leftarrow H_u(u)\);
while \((u'-\lambda)_+\notin\Delta_p\) {
  identify nondecreasing and nonconstant segments \(i:j\) of \((u'-\lambda)\)
  replace \(u'_r,\lambda_r\) for \(r\in\{i,i+1,\ldots,j\}\) by their averages \(\bar{u'},\bar{\lambda}\)
} return \(H_u^{-1}(u'-\lambda)_+\);
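The averaging loop above is the pool-adjacent-violators scheme from isotonic regression: in the sorted domain, \(\widehat{x}\) is the positive part of the nonincreasing isotonic fit of \(u'-\lambda\). Below is a minimal R sketch exploiting this via the built-in isoreg(); it is only an illustration (the function name prox_sorted_l1 is ours), not the optimized stack-based routine in R-codes/FastProxSL1.R, whose interface we do not assume.

# SLOPE proximal operator (sketch): sort by magnitude, take the nonincreasing
# isotonic regression of u' - lambda (this is what the repeated averaging computes),
# clip at zero, then undo the sorting and restore the signs.
prox_sorted_l1 <- function(u, lambda) {
  ord <- order(abs(u), decreasing = TRUE)   # permutation P with (|u|)[ord] in Delta_p
  d   <- abs(u)[ord] - lambda               # u' - lambda
  fit <- rev(isoreg(rev(d))$yf)             # nonincreasing isotonic regression of d
  x_sorted <- pmax(fit, 0)                  # (u' - lambda)_+ after the averaging
  x <- numeric(length(u))
  x[ord] <- x_sorted                        # undo the permutation
  sign(u) * x                               # restore the signs (H_u^{-1})
}
# Small check: prox_sorted_l1(c(4, -3, 1), c(3, 1, 0)) should give (1.5, -1.5, 1).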


Approximate message passing

The AMP approach is Bayesian in nature. Namely, we assume that the true entries of \(x\) are i.i.d. draws from some prior distribution, collected in the vector \(X=(X_1,\ldots,X_p)\), satisfying certain integrability conditions, and that the noise vector \(w\) has i.i.d. entries with zero mean and variance \(\sigma_w^2<\infty\). The prior distribution of \(X\) and the parameter \(\sigma_w^2\) play a crucial role in the construction of the algorithm, as will be seen shortly. The basic form of the AMP algorithm for SLOPE is as follows:


AMP-SLOPE:
Input: \(y\in{\mathbb{R}}^n\), \(\alpha\in\Delta_p\), \(A\in{\mathbb{R}}^{n\times p}\)
Initialize: \(g(x)=\frac{1}{2}{\left\Vert Ax-y \right\Vert}^2\), \(x=x_{old}\in{\mathbb{R}}^p\), \(v=\nabla g(x)\), \(\tau=\tau(X)\in{\mathbb{R}}_+\)
while (stopping condition) {
\(x_{old}\leftarrow x\);
\(x \leftarrow {\operatorname{prox}}_{\phi_{\tau\alpha}}\big(x_{old} - v\big)\);
\(v \leftarrow \nabla g(x) + \frac{v}{n} [\nabla{\operatorname{prox}}_{\phi_{\tau\alpha}}(x_{old} - v)]\);
update(\(\tau\));
} return \(x\)


Based on the observations about the proximal operator listed above, it is easy to verify that for any vector \(u\in{\mathbb{R}}^p\) \[ \nabla{\operatorname{prox}}_{\phi_\lambda}(u) = \Vert {\operatorname{prox}}_{\phi_\lambda}(u) \Vert_0^\ast, \quad\text{where}\quad {\left\Vert u \right\Vert}_0^\ast:= \#\{ \text{unique non-zero magnitudes in } {\left\vert u \right\vert} \}, \] and \(\nabla{\operatorname{prox}}\) denotes the divergence (the trace of the Jacobian) of the proximal operator. E.g., \({\left\Vert (0,3,-3,3,1) \right\Vert}_0^\ast = 2\). Therefore, the \(v\) update in the AMP-SLOPE algorithm can be read as \[ v \leftarrow \nabla g(x) + \frac{v}{n} {\left\Vert x \right\Vert}_0^\ast. \]
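In R, \({\left\Vert \cdot \right\Vert}_0^\ast\) can be computed in one line (a sketch; the name norm0star is ours):

# Number of unique non-zero magnitudes of a vector, e.g. norm0star(c(0, 3, -3, 3, 1)) == 2.
norm0star <- function(u) length(unique(abs(u[u != 0])))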

Moreover, the scaling factor \(\tau\) and its update rule in the AMP scheme are dictated by the so-called state evolution equation. Namely, let \[ F(\tau, \alpha) = \sigma_w^2 + \frac{1}{n} {\mathbb{E}}{\left\Vert {\operatorname{prox}}_{\phi_{\tau\alpha}}(X+\tau Z) - X \right\Vert}^2, \quad \tau\in{\mathbb{R}}_+,\,\alpha\in\Delta_p, \] where \(Z\) is a standard Gaussian \(\mathcal{N}(0,I_p)\) vector independent of \(X\) and \(w\). Then, setting \[\begin{equation}\label{eq:SE} \tau_0^2=\sigma_w^2 + \frac{p}{n}\mathbb{E} X_1^2, \quad \tau_{k+1}^2 = F(\tau_k,\alpha), \quad \tau_\ast = \lim_k \tau_k, \end{equation}\] it can be verified that the stationary point of the AMP algorithm is also a minimizer of the SLOPE problem\(~\eqref{eq:SLOPE}\), provided the AMP parameter \(\alpha\) and the SLOPE penalty \(\lambda\) are linked through the calibration \[\begin{equation}\label{eq:lambda-alpha} \lambda = \alpha\tau_\ast\big( 1- \frac{1}{n}\mathbb{E}\Vert{\operatorname{prox}_{\phi_{\tau_\ast\alpha}}(X+\tau_\ast Z)}\Vert_0^\ast \big). \end{equation}\] In particular, to solve\(~\eqref{eq:SLOPE}\) for a given \(\lambda\), one runs AMP with the parameter \(\alpha=\hat{\alpha}\) chosen so that\(~\eqref{eq:lambda-alpha}\) holds.
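To make the state evolution concrete, the map \(F\) and the recursion\(~\eqref{eq:SE}\) can be approximated by Monte Carlo. The sketch below assumes the prox_sorted_l1() helper from above and a prior() function returning an i.i.d. sample of length \(p\) (as in the experiments later); the names F_mc and tau_fixed_point are ours, and the sourced F() may differ in interface.

# Monte Carlo estimate of F(tau, alpha) = sigma_w^2 + (1/n) E || prox_{tau*alpha}(X + tau Z) - X ||^2.
F_mc <- function(tau, alpha, delta, prior, sigma2, mc_iter = 100) {
  p <- length(alpha)
  n <- round(p * delta)
  mse <- replicate(mc_iter, {
    X <- prior()
    Z <- rnorm(p)
    sum((prox_sorted_l1(X + tau * Z, tau * alpha) - X)^2) / n
  })
  sigma2 + mean(mse)
}

# Iterate tau_{k+1}^2 = F(tau_k, alpha) starting from tau_0^2 = sigma_w^2 + (p/n) E X_1^2.
tau_fixed_point <- function(alpha, delta, prior, sigma2, iters = 30, mc_iter = 100) {
  tau2 <- sigma2 + mean(prior()^2) / delta     # Monte Carlo estimate of tau_0^2
  for (k in seq_len(iters)) tau2 <- F_mc(sqrt(tau2), alpha, delta, prior, sigma2, mc_iter)
  sqrt(tau2)
}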

Bu et al. (2019) have shown that, under some technical assumptions, the state evolution recursion\(~\eqref{eq:SE}\) has a unique fixed point (i.e., \(\tau_\ast\) is well defined), and that the solution \(\hat{\alpha}\) to\(~\eqref{eq:lambda-alpha}\) is unique and well defined. They have also proposed a numerical scheme for calculating \(\hat{\alpha}\) based on the bisection method.
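For completeness, the map \(\alpha\mapsto\lambda\) in\(~\eqref{eq:lambda-alpha}\) can itself be estimated by Monte Carlo. The sketch below (the name alpha_to_lambda_mc is ours, and prox_sorted_l1() is the sketch from above) only illustrates the formula; it is not the sourced alpha_to_lambda(), whose interface and internals we do not assume.

# Given alpha and tau_ast, estimate lambda = alpha * tau_ast * (1 - (1/n) E ||prox(X + tau_ast Z)||_0^*).
alpha_to_lambda_mc <- function(alpha, tau_ast, delta, prior, mc_iter = 100) {
  p <- length(alpha)
  n <- round(p * delta)
  norm0star <- function(u) length(unique(abs(u[u != 0])))   # as defined above
  div <- replicate(mc_iter,
    norm0star(prox_sorted_l1(prior() + tau_ast * rnorm(p), tau_ast * alpha)))
  alpha * tau_ast * (1 - mean(div) / n)
}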

Numerical experiments

source("R-codes/FastProxSL1.R")
source("R-codes/alpha-lambda.R")
source("R-codes/IterativeAlgs.R")

Parameters of the model:

set.seed(351759)

p=300          # number of features
delta=0.5      # undersampling ratio n/p
eps=0.1        # expected fraction of non-zero entries of the signal
sigma2=0.2     # noise variance sigma_w^2
n=p*delta      # sample size
A=matrix(rnorm(n*p,mean=0,sd=1/sqrt(p*delta)), n, p)    # Gaussian design with i.i.d. N(0, 1/n) entries
alpha=2*c(rep(1,p*delta),rep(0,p*(1-delta)))            # nonincreasing AMP parameter, alpha in Delta_p
stepSize=10    # step size for ISTA/FISTA
tol=0          # stopping tolerance for the iterative algorithms

prior= function() rnorm(p)*rbinom(p,1,eps)              # Bernoulli-Gaussian prior for the entries of x

x0=prior()      # initial point for ISTA/FISTA
x=prior()       # true signal
y=as.numeric(A%*%x+sqrt(sigma2)*rnorm(n))    # observations y = A x + w
tau <- sqrt(sigma2+eps/delta)                # tau_0 from the state evolution initialization
tau_ast <- alpha_to_tau_ast(alpha, x, delta, tau=tau, sigma2 = sigma2, verbose= FALSE, ergodic = TRUE)  # fixed point of the state evolution
lambdahat <- alpha_to_lambda(alpha, x, delta, tau_ast = tau_ast, sigma2 = sigma2)                       # lambda calibrated to alpha

print(paste0("True loss: ",loss <- sum( (A %*% x - y)^2)/2 + MapToDelta(x)$w %*% MapToDelta(lambdahat)$w))

Fixed point of the state evolution equation:

t <- seq(from = 0, to = 4, length.out = 100)
Ft <- as.numeric(lapply(t, function(x) sqrt(F(x, alpha, delta, prior, sigma2, iter=10))))

Plot the SE:

df <- data.frame(t, t, Ft)
colnames(df) <- c("t","x=y","sqrt(F)")
df <- melt(df, id="t")
ggplot(data = df,
       aes(x=t, y=value, colour=variable))+
  geom_line(size=.5) +
  ggtitle("AMP precision")

Simulations

Compare AMP runs with different Monte Carlo (MC) precision:

AMP0 <-   new("AMP-SLOPE", 
             y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
             alpha=alpha, sigma2=sigma2, F_iter=1)
AMP1 <-   new("AMP-SLOPE", 
             y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
             alpha=alpha, sigma2=sigma2, F_iter=10)
AMP2 <-   new("AMP-SLOPE", 
             y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
             alpha=alpha, sigma2=sigma2, F_iter=10^2)
AMP3 <-   new("AMP-SLOPE", 
             y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
             alpha=alpha, sigma2=sigma2, F_iter=10^3)

AMP_res0 <- runAlg(AMP0)
AMP_res1 <- runAlg(AMP1)
AMP_res2 <- runAlg(AMP2)
AMP_res3 <- runAlg(AMP3)

Plot the results:

t <- c(5:25)
df <- data.frame(t, 
                 AMP_res0@loss[t], 
                 AMP_res1@loss[t], 
                 AMP_res2@loss[t], 
                 AMP_res3@loss[t])
colnames(df) <- c("iteration","AMP0","AMP1","AMP2","AMP3")
df <- melt(df, id="iteration")
ggplot(data = df,
       aes(x=iteration, y=value, colour=variable))+
  geom_line(size=.5) +
  ggtitle("AMP precision")

Compare the different methods:

ISTA <- new("ISTA-SLOPE", y=y, A=A, x=x0, lambda=lambdahat, stepSize=stepSize,
            max_iter=100, tol=tol, init_params=FALSE, verbose=FALSE)
ISTA_res <- runAlg(ISTA)
# print(paste0("ISTA loss=", paste0(tail(ISTA_res@loss,1), collapse = "; "),
#              " after ",ISTA_res@iteration," iterations"))


FISTA <- new("FISTA-SLOPE", y=y, A=A, x=x0, lambda=lambdahat, stepSize=stepSize, 
             tol=tol, max_iter=100, verbose=FALSE)
FISTA_res <- runAlg(FISTA)
# print(paste0("FISTA loss=", paste0(tail(FISTA_res@loss,1), collapse = "; "), 
#              " after ",FISTA_res@iteration," iterations"))

Plot the results:

t <- c(2:100)
df <- data.frame(t, ISTA_res@loss[t], FISTA_res@loss[t], AMP_res3@loss[t])
colnames(df) <- c("iteration","ISTA","FISTA","AMP")
df <- melt(df, id="iteration")
ggplot(data = df,
       aes(x=iteration, y=value, colour=variable))+
  geom_line(size=.5) +
  scale_x_continuous(trans='log10') +
  ggtitle("Loss evolution")

Conclusions

  • AMP works only for noisy models.
  • AMP is very resource-consuming due to the MC simulation part – it could be greatly improved if the integral in the state evolution could be computed analytically, which might be achievable by choosing an appropriate prior.
  • AMP converges faster than ISTA/FISTA to some approximation of the optimum, but the latter methods give more precise estimates (they are more accurate due to the lack of randomness).

References

Bayati, M., and A. Montanari. 2011. “The Dynamics of Message Passing on Dense Graphs, with Applications to Compressed Sensing.” IEEE Trans. Inform. Theory 57 (2): 764–85. doi:10.1109/TIT.2010.2094817.

Beck, A., and M. Teboulle. 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM J. Imaging Sci. 2 (1): 183–202. doi:10.1137/080716542.

Beck, A. 2017. First-Order Methods in Optimization. Vol. 25. MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Optimization Society, Philadelphia, PA. doi:10.1137/1.9781611974997.ch1.

Bellec, P. C., G. Lecué, and A. B. Tsybakov. 2018. “Slope Meets Lasso: Improved Oracle Bounds and Optimality.” Ann. Statist. 46 (6B): 3603–42. doi:10.1214/17-AOS1670.

Bogdan, M., E. van den Berg, C. Sabatti, W. Su, and E. J. Candès. 2015. “SLOPE—adaptive Variable Selection via Convex Optimization.” Ann. Appl. Stat. 9 (3): 1103–40. doi:10.1214/15-AOAS842.

Bu, Z., J. Klusowski, C. Rush, and W. Su. 2019. “Algorithmic Analysis and Statistical Estimation of Slope via Approximate Message Passing.” In Advances in Neural Information Processing Systems, 9366–76. http://par.nsf.gov/biblio/10163278.

Donoho, D., A. Maleki, and A. Montanari. 2009. “Message-Passing Algorithms for Compressed Sensing.” Proceedings of the National Academy of Sciences 106 (45): 18914–9. doi:10.1073/pnas.0909892106.

Zdeborová, L., and F. Krzakala. 2016. “Statistical Physics of Inference: Thresholds and Algorithms.” Advances in Physics 65 (5): 453–552. doi:10.1080/00018732.2016.1211393.