Abstract
The aim of this project is to investigate the approximate message passing algorithm for the SLOPE regularization problem, based on (Bu et al. 2019), and to compare it with classical convex optimization methods. We also perform and analyze numerical experiments in settings that do not fit into the theoretical framework of (Bu et al. 2019).
We consider the linear model \[\begin{equation}\label{eq:LM-problem} y = Ax + w, \end{equation}\]where \(y\in {\mathbb{R}}^n\) and \(A\in{\mathbb{R}}^{n\times p}\) are known parameters of the model, \(w\in{\mathbb{R}}^n\) is a random noise vector and \(x\in{\mathbb{R}}^p\) is an unknown vector of parameters we wish to estimate. We assume \(p\gg n\), i.e. the number of features is much greater than the size of the sample data, and hence there may be many potential solutions to the problem\(~\eqref{eq:LM-problem}\).
To resolve this issue and prevent overfitting, we introduce a penalty function \(\phi\) which favors sparse solutions of\(~\eqref{eq:LM-problem}\), i.e. we now look for minimizers of the form \[\begin{equation}\label{eq:SLOPE} \widehat{x} = \operatorname*{argmin}_x \{\, \frac{1}{2}\Vert Ax - y \Vert_2^2 + \phi(x) \,\}. \end{equation}\]The usual choices of \(\phi\) are the scaled \(l^2\) penalty (Tikhonov regularization) and the \(l^1\) penalty (LASSO). This note concerns sorted \(l^1\) penalized estimation (abbrev. SLOPE), introduced for the first time in (Bogdan et al. 2015), which takes \(\phi\) to be the sorted \(l^1\) penalty, i.e. \[ \phi(x)= \phi_{\lambda}(x) = \sum_{i=1}^p \lambda_ix_i^{\downarrow}, \] where \(x_1^\downarrow \ge x_2^\downarrow \ge \ldots \ge x_p^\downarrow\) is the nonincreasing rearrangement of the vector \({\left\vert x \right\vert}=({\left\vert x_1 \right\vert},{\left\vert x_2 \right\vert},\ldots,{\left\vert x_p \right\vert})\) and \(\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0\) are hyperparameters of the model. To lighten notation, we denote \(\Delta_p = \{\, x\in{\mathbb{R}}^p\colon x_1\ge x_2 \ge \ldots \ge x_p \ge 0 \,\}\), so that the above requirements read: \(x^\downarrow,\lambda\in\Delta_p\), where \(x^\downarrow = P{\left\vert x \right\vert}\) for some permutation \(P\) of the set \(\{1,2,\ldots,p\}\). Such a choice of regularizer is a generalization of the \(l^1\) regularization, as can be seen by taking \(\lambda_i\equiv\operatorname{const}\).
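For illustration, the penalty can be computed in R simply by sorting the magnitudes (a minimal sketch, independent of the code accompanying this note):
# Sorted l1 penalty: phi_lambda(x) = sum_i lambda_i * |x|_(i),
# where |x|_(1) >= |x|_(2) >= ... are the sorted magnitudes of x
sorted_l1 <- function(x, lambda) {
  sum(lambda * sort(abs(x), decreasing = TRUE))
}
sorted_l1(c(0.5, -2, 1), c(3, 2, 1))  # 3*2 + 2*1 + 1*0.5 = 8.5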
The fact that \(\phi_\lambda\) is non-separable makes the analysis of its theoretical properties much more involved than in the case of classical (separable) models, cf. (Bogdan et al. 2015; Bu et al. 2019). Nonetheless, it turns out that SLOPE has two advantages over other regularization methods such as LASSO and knockoffs, namely:
We are interested in algorithmic solutions to the problem \(\eqref{eq:SLOPE}\). Since the objective function in \(\eqref{eq:SLOPE}\) is convex but not smooth, one cannot apply classical gradient descent directly and has to turn to other methods.
A natural alternative is provided by the plethora of proximal algorithms, e.g. ISTA (and its improvement, FISTA, cf. (A. Beck and Teboulle 2009)) or the more classical alternating direction method of multipliers (ADMM). These methods have been thoroughly studied in the literature, cf. (Beck 2017) for a detailed treatment of the subject.
In this note we focus on another approach, via approximate message passing (AMP), considered for the first time in the context of the LASSO problem in (Donoho, Maleki, and Montanari 2009) and subsequently developed in e.g. (Bayati and Montanari 2011), and for the SLOPE regularization in (Bu et al. 2019); see e.g. (Zdeborová and Krzakala 2016) for an accessible derivation of the method.
In the subsequent sections we briefly describe some of these approaches.
Denoting \(g(x)=\frac{1}{2}\Vert Ax - y \Vert_2^2\), the iterative shrinkage-thresholding algorithm (ISTA) iteration for SLOPE with a given \(\lambda\) can be written as:
ISTA-SLOPE:
Input: \(y\in{\mathbb{R}}^n\), \(\lambda\in\Delta_p\), \(A\in{\mathbb{R}}^{n\times p}\)
Initialize \(g(x)=\frac{1}{2}{\left\Vert Ax-y \right\Vert}^2\), \(x\in{\mathbb{R}}^p\), \(t>0\)
while (stopping condition) {
\(x \leftarrow {\operatorname{prox}}_{t\phi_\lambda}\big(x - t\nabla g(x)\big)\);
} return \(x\)
where \(t\) can be thought of as the learning rate and \(\operatorname{prox}\) denotes the proximal operator given by \[ {\operatorname{prox}}_{h}(y) := \operatorname*{argmin}_x \{\, h(x) + \frac{1}{2}\Vert x-y \Vert_2^2 \,\}. \]
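For concreteness, a minimal R sketch of the ISTA-SLOPE loop is given below. It assumes a helper prox_slope(u, lambda) computing \({\operatorname{prox}}_{\phi_\lambda}(u)\) (one possible implementation is sketched in the FastProxSL1 section below); note that \({\operatorname{prox}}_{t\phi_\lambda} = {\operatorname{prox}}_{\phi_{t\lambda}}\), and that for convergence one typically takes \(t\le 1/\Vert A\Vert_2^2\).
# Minimal sketch of ISTA-SLOPE; prox_slope() is an assumed helper computing prox_{phi_lambda}
ista_slope <- function(y, A, lambda, stepSize, x0 = rep(0, ncol(A)), max_iter = 100) {
  x <- x0
  for (k in 1:max_iter) {
    grad <- as.numeric(t(A) %*% (A %*% x - y))               # gradient of g(x) = 0.5*||Ax-y||^2
    x <- prox_slope(x - stepSize * grad, stepSize * lambda)  # proximal gradient step
  }
  x
}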
A. Beck and Teboulle (2009) introduced a faster version of ISTA, known as FISTA, which is based on the idea of Nesterov momentum. The general form of the algorithm is as follows:
FISTA-SLOPE:
Input: \(y\in{\mathbb{R}}^n\), \(\lambda\in\Delta_p\), \(A\in{\mathbb{R}}^{n\times p}\)
Initialize \(g(x)=\frac{1}{2}{\left\Vert Ax-y \right\Vert}^2\), \(x=x_{old}\in{\mathbb{R}}^p\), \(r = 1\), \(t>0\)
while (stopping condition) {
\(u \leftarrow x_{old} + r(x-x_{old})\);
\(x_{old}\leftarrow x\);
\(x \leftarrow {\operatorname{prox}}_{t\phi_\lambda}\big(u - t\nabla g(u)\big)\);
update(\(r\));
} return \(x\)
Here \(r\) can be thought of as an acceleration term, which (if updated correctly through the update rule) can substantially increase the speed of convergence of the algorithm. Note that keeping \(r\equiv 1\) restores ISTA.
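A corresponding R sketch of FISTA-SLOPE, with \(r\) driven by the standard Nesterov update rule (again assuming the prox_slope helper from the ISTA sketch above):
# Minimal sketch of FISTA-SLOPE; the acceleration term r follows the usual Nesterov
# sequence s_k = (1 + sqrt(1 + 4*s_{k-1}^2))/2, giving r = 1 + (s_{k-1} - 1)/s_k
fista_slope <- function(y, A, lambda, stepSize, x0 = rep(0, ncol(A)), max_iter = 100) {
  x <- x_old <- x0
  s_old <- 1
  for (k in 1:max_iter) {
    s <- (1 + sqrt(1 + 4 * s_old^2)) / 2
    r <- 1 + (s_old - 1) / s                       # acceleration term (r = 1 at the first step)
    u <- x_old + r * (x - x_old)                   # extrapolated point
    x_old <- x
    grad <- as.numeric(t(A) %*% (A %*% u - y))
    x <- prox_slope(u - stepSize * grad, stepSize * lambda)
    s_old <- s
  }
  x
}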
One of the difficulties in dealing with SLOPE is that the regularizer \(\phi_\lambda\) is non-separable and thus its proximal operator cannot be applied element-wise. (Bogdan et al. 2015) proposed an efficient algorithm that for any \(u\in{\mathbb{R}}^p\) computes \[ \widehat{x}= {\operatorname{prox}}_{\phi_{\lambda}}(u) = \operatorname*{argmin}_x \big\{ \frac{1}{2}{\left\Vert u-x \right\Vert}^2 + \sum_i \lambda_ix_i^\downarrow \big\} = \operatorname*{argmin}_x \big\{\sum_i\big[ \frac{1}{2}(x^\downarrow_i)^2 + x_i^\downarrow\lambda_i - x_iu_i\big]\big\}. \] It is based on the following simple observations, which follow immediately from the above formulation:
Therefore we can and do assume in the analysis below that \(u\in\Delta_p\). The optimization procedure now reads: \[ \widehat{x} = \operatorname*{argmin}_{x^\downarrow\in\Delta_p} \big\{\sum_i\big[ \frac{1}{2}(x^\downarrow_i)^2 - x_i^\downarrow(u-\lambda)_i \big]\big\}, \] whence
The above observations justify the procedure below, proposed by Bogdan et al. (2015).
FastProxSL1:
Input: \(u\in{\mathbb{R}}^p\), \(\lambda\in\Delta_p\)
# Define the operator \(H_u(v) = P({\operatorname{sign}}(u_i)v_i)_i\) for some permutation \(P\), so that \(H_u(u) = u^\downarrow\in\Delta_p\)
\(u' \leftarrow H_u(u)\);
while \((u'-\lambda)_+\notin\Delta_p\) {
identify nondecreasing and nonconstant segments \(i:j\) of \((u'-\lambda)\)
replace \(u'_r,\lambda_r\) for \(r\in\{i,i+1,\ldots,j\}\) by their averages \(\bar{u'},\bar{\lambda}\)
} return \(H_u^{-1}(u'-\lambda)_+\);
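For illustration, here is an R sketch of this proximal computation. Instead of the literal while-loop above it uses the equivalent single-pass, stack-based averaging (pool adjacent violators); prox_slope is the hypothetical helper used in the earlier sketches, not the implementation from R-codes/FastProxSL1.R.
# Sketch of prox_{phi_lambda}(u): sort |u| in decreasing order, average adjacent blocks of
# |u| - lambda until the result is nonincreasing, threshold at zero, then undo the sorting
# and restore the signs of u (the map H_u^{-1} above)
prox_slope <- function(u, lambda) {
  p <- length(u)
  sgn <- sign(u)
  ord <- order(abs(u), decreasing = TRUE)
  v <- abs(u)[ord]                                 # v is nonincreasing, as is lambda
  blk_start <- blk_end <- integer(p)
  blk_val <- numeric(p)
  k <- 0
  for (i in 1:p) {
    k <- k + 1
    blk_start[k] <- blk_end[k] <- i
    blk_val[k] <- v[i] - lambda[i]
    # merge adjacent blocks while their averages violate the nonincreasing order
    while (k > 1 && blk_val[k - 1] <= blk_val[k]) {
      len1 <- blk_end[k - 1] - blk_start[k - 1] + 1
      len2 <- blk_end[k] - blk_start[k] + 1
      blk_val[k - 1] <- (len1 * blk_val[k - 1] + len2 * blk_val[k]) / (len1 + len2)
      blk_end[k - 1] <- blk_end[k]
      k <- k - 1
    }
  }
  x_sorted <- numeric(p)
  for (j in 1:k) x_sorted[blk_start[j]:blk_end[j]] <- max(blk_val[j], 0)
  out <- numeric(p)
  out[ord] <- x_sorted                             # undo the sorting permutation
  sgn * out                                        # restore signs
}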
The AMP approach is Bayesian in nature. Namely, we assume that the true values of \(x\) are i.i.d. draws from some prior distribution, \(X=(X_1,\ldots,X_p)\), satisfying suitable integrability conditions, and that the noise vector \(w\) has i.i.d. entries with zero mean and variance \(\sigma_w^2<\infty\). The prior distribution of \(X\) and the parameter \(\sigma_w^2\) play a crucial role in the construction of the algorithm, as will be seen shortly. The basic form of the AMP algorithm for SLOPE is as follows:
AMP-SLOPE:
Input: \(y\in{\mathbb{R}}^n\), \(\alpha\in\Delta_p\), \(A\in{\mathbb{R}}^{n\times p}\)
Initialize: \(g(x)=\frac{1}{2}{\left\Vert Ax-y \right\Vert}^2\), \(x=x_{old}\in{\mathbb{R}}^p\), \(v=\nabla g(x)\), \(\tau=\tau(X)\in{\mathbb{R}}_+\)
while (stopping condition) {
\(x_{old}\leftarrow x\);
\(x \leftarrow {\operatorname{prox}}_{\phi_{\tau\alpha}}\big(x_{old} - v\big)\);
\(v \leftarrow \nabla g(x) + \frac{v}{n} [\nabla{\operatorname{prox}}_{\phi_{\tau\alpha}}(x_{old} - v)]\);
update(\(\tau\));
} return \(x\)
Based on the observations 1.-7. above, it is easy to verify that for any vector \(u\in{\mathbb{R}}^p\) \[ \nabla{\operatorname{prox}}_{\phi_\lambda}(u) = \Vert {\operatorname{prox}}_{\phi_\lambda}(u) \Vert_0^\ast, \quad\text{where}\quad {\left\Vert u \right\Vert}_0^\ast:= \#\{ \text{unique non-zero magnitudes in } {\left\vert u \right\vert} \}. \] E.g., \({\left\Vert (0,3,-3,3,1) \right\Vert}_0^\ast = 2\). Therefore, the \(v\) update in the AMP-SLOPE algorithm can be rewritten as \[ v \leftarrow \nabla g(x) + \frac{v}{n} {\left\Vert x \right\Vert}_0^\ast. \]
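As a quick illustration, the modified norm \(\Vert\cdot\Vert_0^\ast\) and the resulting AMP-SLOPE iteration can be sketched in R as follows (prox_slope as in the sketch above; for brevity \(\tau\) is held fixed here, whereas the full algorithm updates it via the state evolution discussed next):
# Number of unique non-zero magnitudes, ||u||_0^*
norm0_star <- function(u) length(unique(abs(u[u != 0])))
norm0_star(c(0, 3, -3, 3, 1))                      # 2, as in the example above

# Minimal AMP-SLOPE loop mirroring the pseudocode above, with the rewritten v-update
amp_slope <- function(y, A, alpha, tau, max_iter = 25) {
  n <- nrow(A)
  x <- rep(0, ncol(A))
  v <- as.numeric(t(A) %*% (A %*% x - y))          # v = grad g(x)
  for (k in 1:max_iter) {
    x_new <- prox_slope(x - v, tau * alpha)
    v <- as.numeric(t(A) %*% (A %*% x_new - y)) + (v / n) * norm0_star(x_new)
    x <- x_new
    # a full implementation would update tau here via the state evolution
  }
  x
}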
Moreover, the scaling factor \(\tau\) and its update rule in the AMP scheme are dictated by the so-called state evolution equation. Namely, let \[ F(\tau, \alpha) = \sigma_w^2 + \frac{1}{n} {\mathbb{E}}{\left\Vert {\operatorname{prox}}_{\phi_{\tau\alpha}}(X+\tau Z) - X \right\Vert}^2, \quad \tau\in{\mathbb{R}}_+,\,\alpha\in\Delta_p, \] where \(Z\) is a \(\mathcal{N}(0,I_p)\) random vector independent of everything else in the model. Then, setting \[\begin{equation}\label{eq:SE} \tau_0^2=\sigma_w^2 + \frac{p}{n}\mathbb{E} X_1^2, \quad \tau_{k+1}^2 = F(\tau_k,\alpha), \quad \tau_\ast = \lim_k \tau_k, \end{equation}\] it can be verified that a stationary point of the AMP algorithm run with \(\alpha=\hat{\alpha}\) is also a minimizer of the SLOPE problem with penalty \(\lambda\), where \(\hat{\alpha}\) is the solution to \[\begin{equation}\label{eq:lambda-alpha} \lambda = \alpha\tau_\ast\big( 1- \frac{1}{n}\mathbb{E}\Vert{\operatorname{prox}_{\phi_{\tau_\ast\alpha}}(X+\tau_\ast Z)}\Vert_0^\ast \big). \end{equation}\]Bu et al. (2019) have shown, under some technical assumptions, that the state evolution\(~\eqref{eq:SE}\) has a unique fixed point (i.e., that \(\tau_\ast\) is well defined), and that the solution \(\hat{\alpha}\) to\(~\eqref{eq:lambda-alpha}\) is unique and well defined. They have also proposed a numerical scheme for calculating \(\hat{\alpha}\) based on the bisection method.
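For a general prior, \(F(\tau,\alpha)\) has no closed form and can be approximated by Monte Carlo. The sketch below (again using the hypothetical prox_slope, and a prior() function returning one draw of \(X\in{\mathbb{R}}^p\), as in the experiments later on) also iterates the state evolution\(~\eqref{eq:SE}\) to approximate \(\tau_\ast\).
# Monte Carlo approximation of F(tau, alpha) = sigma_w^2 + (1/n) E||prox_{tau*alpha}(X + tau*Z) - X||^2
F_mc <- function(tau, alpha, n, prior, sigma2, n_mc = 100) {
  vals <- replicate(n_mc, {
    X <- prior()
    Z <- rnorm(length(X))
    sum((prox_slope(X + tau * Z, tau * alpha) - X)^2)
  })
  sigma2 + mean(vals) / n
}

# Fixed-point iteration of the state evolution: tau_{k+1}^2 = F(tau_k, alpha)
se_fixed_point <- function(alpha, n, prior, sigma2, n_iter = 20) {
  p <- length(prior())
  tau2 <- sigma2 + (p / n) * mean(replicate(50, mean(prior()^2)))   # tau_0^2
  for (k in 1:n_iter) tau2 <- F_mc(sqrt(tau2), alpha, n, prior, sigma2)
  sqrt(tau2)                                                        # approximate tau_*
}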
source("R-codes/FastProxSL1.R")
source("R-codes/alpha-lambda.R")
source("R-codes/IterativeAlgs.R")
Parameters of the model:
set.seed(351759)
p=300                                                   # number of features
delta=0.5                                               # undersampling ratio n/p
eps=0.1                                                 # expected fraction of non-zero entries in x
sigma2=0.2                                              # noise variance sigma_w^2
n=p*delta                                               # sample size
A=matrix(rnorm(n*p,mean=0,sd=1/sqrt(p*delta)), n, p)    # design matrix with i.i.d. N(0,1/n) entries
alpha=2*c(rep(1,p*delta),rep(0,p*(1-delta)))            # nonincreasing AMP regularization sequence
stepSize=10
tol=0
prior= function() rnorm(p)*rbinom(p,1,eps)              # Bernoulli-Gaussian prior on the signal
x0=prior()                                              # initial point for ISTA/FISTA below
x=prior()                                               # true signal
y=as.numeric(A%*%x+sqrt(sigma2)*rnorm(n))               # observations y = Ax + w
tau <- sqrt(sigma2+eps/delta)                            # tau_0^2 = sigma_w^2 + (p/n) E X_1^2
tau_ast <- alpha_to_tau_ast(alpha, x, delta, tau=tau, sigma2 = sigma2, verbose= FALSE, ergodic = TRUE)   # fixed point tau_* of the state evolution
lambdahat <- alpha_to_lambda(alpha, x, delta, tau_ast = tau_ast, sigma2 = sigma2)                        # lambda corresponding to alpha
print(paste0("True loss: ",loss <- sum( (A %*% x - y)^2)/2 + MapToDelta(x)$w %*% MapToDelta(lambdahat)$w))
Fixed point of the state evolution equation:
t <- seq(from = 0, to = 4, length.out = 100)
Ft <- as.numeric(lapply(t, function(x) sqrt(F(x, alpha, delta, prior, sigma2, iter=10))))
Plot the SE:
df <- data.frame(t, t, Ft)
colnames(df) <- c("t","x=y","sqrt(F)")
df <- melt(df, id="t")
ggplot(data = df,
aes(x=t, y=value, colour=variable))+
geom_line(size=.5) +
ggtitle("AMP precision")
AMP0 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=1)
AMP1 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=10)
AMP2 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=10^2)
AMP3 <- new("AMP-SLOPE",
y=y, A=A, prior=prior, tau=tau, lambda=lambdahat, tol=tol, max_iter=25,
alpha=alpha, sigma2=sigma2, F_iter=10^3)
AMP_res0 <- runAlg(AMP0)
AMP_res1 <- runAlg(AMP1)
AMP_res2 <- runAlg(AMP2)
AMP_res3 <- runAlg(AMP3)
Plot the results:
t <- c(5:25)
df <- data.frame(t,
AMP_res0@loss[t],
AMP_res1@loss[t],
AMP_res2@loss[t],
AMP_res3@loss[t])
colnames(df) <- c("iteration","AMP0","AMP1","AMP2","AMP3")
df <- melt(df, id="iteration")
ggplot(data = df,
aes(x=iteration, y=value, colour=variable))+
geom_line(size=.5) +
ggtitle("AMP precision")
ISTA <- new("ISTA-SLOPE", y=y, A=A, x=x0, lambda=lambdahat, stepSize=stepSize,
max_iter=100, tol=tol, init_params=FALSE, verbose=FALSE)
ISTA_res <- runAlg(ISTA)
# print(paste0("ISTA loss=", paste0(tail(ISTA_res@loss,1), collapse = "; "),
# " after ",ISTA_res@iteration," iterations"))
FISTA <- new("FISTA-SLOPE", y=y, A=A, x=x0, lambda=lambdahat, stepSize=stepSize,
tol=tol, max_iter=100, verbose=FALSE)
FISTA_res <- runAlg(FISTA)
# print(paste0("FISTA loss=", paste0(tail(FISTA_res@loss,1), collapse = "; "),
# " after ",FISTA_res@iteration," iterations"))
Plot the results:
t <- c(2:100)
df <- data.frame(t, ISTA_res@loss[t], FISTA_res@loss[t], AMP_res3@loss[t])
colnames(df) <- c("iteration","ISTA","FISTA","AMP")
df <- melt(df, id="iteration")
ggplot(data = df,
aes(x=iteration, y=value, colour=variable))+
geom_line(size=.5) +
scale_x_continuous(trans='log10') +
ggtitle("Loss evolution")
Bayati, M., and A. Montanari. 2011. “The Dynamics of Message Passing on Dense Graphs, with Applications to Compressed Sensing.” IEEE Trans. Inform. Theory 57 (2): 764–85. doi:10.1109/TIT.2010.2094817.
Beck, A., and M. Teboulle. 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM J. Imaging Sci. 2 (1): 183–202. doi:10.1137/080716542.
Beck, A. 2017. First-Order Methods in Optimization. Vol. 25. MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Optimization Society, Philadelphia, PA. doi:10.1137/1.9781611974997.ch1.
Bellec, P. C., G. Lecué, and A. B. Tsybakov. 2018. “Slope Meets Lasso: Improved Oracle Bounds and Optimality.” Ann. Statist. 46 (6B): 3603–42. doi:10.1214/17-AOS1670.
Bogdan, M., E. van den Berg, C. Sabatti, W. Su, and E. J. Candès. 2015. “SLOPE—adaptive Variable Selection via Convex Optimization.” Ann. Appl. Stat. 9 (3): 1103–40. doi:10.1214/15-AOAS842.
Bu, Z., J. Klusowski, C. Rush, and W. Su. 2019. “Algorithmic Analysis and Statistical Estimation of Slope via Approximate Message Passing.” In Advances in Neural Information Processing Systems, 9366–76. http://par.nsf.gov/biblio/10163278.
Donoho, D., A. Maleki, and A. Montanari. 2009. “Message-Passing Algorithms for Compressed Sensing.” Proceedings of the National Academy of Sciences 106 (45): 18914–9. doi:10.1073/pnas.0909892106.
Zdeborová, L., and F. Krzakala. 2016. “Statistical Physics of Inference: Thresholds and Algorithms.” Advances in Physics 65 (5): 453–552. doi:10.1080/00018732.2016.1211393.