The forward stagewise regression (FSR) algorithm roughly works as follows:

- Initilize by setting i = 0, coefficient vector
**b_i**=**0**, and the residual vector**rY_i**=**Y**-**X * b_i**. - Increase (or decrease*) the coefficient of the predictor variable most highly correlated with
**rY_i**by a small step, yielding the current coefficient vector**b_{i+1}**. - Eat potato chips (optional)
- Calculate
**rY_{i+1}**=**Y**-**X*****b_{i+1}**. - Set i = i + 1, repeat steps 2 - 4 until convergence (i.e., consecutive
**b_i**no longer change).

**rY_i**. The step size, that is, the size of the increase (or decrease) in step 2, can be provided by the user, or the optimal value can be determined by cross validation. Step 3 is optional, and should only be performed on a small subset of the iterations.

Although FSR can be performed with the R package lars, I wanted to program it myself, to check out the effect of different step sizes, and to see how the results compare to ordinary least squares regression (OLS). I created a function for performing FSR (called swReg), a function for plotting the coefficient paths (called plot.swReg), and applied it on the Boston housing dataset. Here are the code and the results:

**R functions for performing FSR, plotting coefficient paths and computing cross-validated prediction error**

swReg <- function(X, Y, stepsize = .1, threshold = .1) { sX <- scale(X) sY <- scale(Y) b <- vector(length = ncol(X)) rY <- sY sgn <- 0 lastsgn <- 0 index <- 0 lastindex <- 0 iteration <- 0 path <- list() while(abs(max(t(rY) %*% sX)) > threshold & (index != lastindex | sgn == lastsgn)) { iteration <- iteration + 1 lastindex <- index lastsgn <- sgn index <- which(abs(t(rY) %*% sX) == max(abs(t(rY) %*% sX))) sgn <- sign((t(rY) %*% sX)[index]) b[index] <- b[index] + sgn*stepsize rY <- sY - sX %*% b path[[iteration]] <- b } bUnstand <- (b/attr(sX, "scaled:scale"))*attr(sY, "scaled:scale") intercept <- attr(sY, "scaled:center") - bUnstand %*% attr(sX, "scaled:center") bUnstand <- c(intercept, bUnstand) names(bUnstand) <- c("(Intercept)", colnames(X)) return(list(unstandardized.coef = bUnstand, standardized.coef = b, iteration = iteration, stepsize = stepsize, threshold = threshold, coef.path = data.frame(matrix( unlist(path), ncol = ncol(X), nrow = iteration, byrow = TRUE, dimnames = list(1:iteration, colnames(X)))), data = list(X = X, Y = Y)) ) } plot.swReg <- function(object, legend = TRUE) { plot(object$coef.path[,1], type = "l", ylim = c(min(object$coef.path), max(object$coef.path)), ylab = "coefficient", xlab = "iteration", main = "Standardized coefficient paths", sub = paste("stepsize =", object$stepsize)) for (i in 2:ncol(X)) {lines(object$coef.path[,i], col = i)} if(legend) { legend("topleft", legend = colnames(object$data$X), cex = .5, lty = 1, col = 1:ncol(object$coef.path), y.intersp = .25, x.intersp = .25, bty = "n", seg.len = .5) } } predict.swReg <- function(object, newdata = object$data$X) { cbind(rep(1, times = nrow(newdata)), as.matrix(newdata)) %*% object$unstandardized.coef } swReg.xval <- function(object, k = 10, seed = 42) { set.seed(seed) ids <- peperr::resample.indices(n = nrow(object$data$X), sample.n = 10, method = "cv") models <- list() xvaldatasets <- list() for (i in 1:k){ xvaldatasets[[i]] <- list(train = list(X = object$data$X[ids$sample.index[[i]],], Y = object$data$Y[ids$sample.index[[i]]]), test = list(X = object$data$X[ids$not.in.sample[[i]],], Y = object$data$Y[ids$not.in.sample[[i]]])) models[[i]] <- swReg(xvaldatasets[[i]]$train$X, xvaldatasets[[i]]$train$Y, stepsize = object$stepsize, threshold = object$threshold) xvaldatasets[[i]]$test$xvalpredY <- predict.swReg(models[[i]], newdata = xvaldatasets[[i]]$test$X) } dataset <- cbind(object$data$X, fold = NA, Y = object$data$Y, Ycvpred = NA) coefficients <- list() for(i in 1:k){ dataset[,"fold"][ids$not.in.sample[[i]]] <- i dataset[,"Ycvpred"][ids$not.in.sample[[i]]] <- xvaldatasets[[i]]$test$xvalpredY coefficients[[i]] <- list(unstandardized.coef = models[[i]]$unstandardized.coef, standardized.coef = models[[i]]$standardized.coef) } return(list(dataset = dataset, cv.coefficients = coefficients)) }

**Example: Boston housing data**

## Data preparation: library(MASS) X <- as.matrix(Boston[,-14]) Y <- Boston$medv ## Run forward stagewise regression: tmp1 <- swReg(X, Y, st = .2, thr = .2) tmp2 <- swReg(X, Y, st = .1, thr = .1) tmp3 <- swReg(X, Y, st = .01, thr = .01) tmp4 <- swReg(X, Y, st = .001, thr = .001) ## Plot coefficient paths: par(mfrow=c(2,2)) plot.swReg(tmp1) plot.swReg(tmp2) plot.swReg(tmp3) plot.swReg(tmp4)

**Coefficient paths**

What we see is that the paths actually look very similar, but less iterations are required before convergence with larger step sizes. Also, larger steps sizes provide sparser solutions (i.e., less non-zero coefficients in the final solution). Note that the algorithm is applied on standardized X and Y values, so the step size could be interpreted as the minimal correlation between a predictor and the response variable that the user finds 'relevant'. The FSR function above provides both standardized and unstandardized regression coefficients.

**Comparison with OLS**

What we see is: the larger the step size, the sparser the final solution. With step size equal to .2, only 5 predictor variables are selected. With step size equal to .001, all variables are selected. Also, the smaller the step size, the more the final solution resembles the OLS solution. The absolute distances between the OLS and FSR solutions decrease with step size:

**References**

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression.

*The Annals of Statistics*,*32*(2), 407-499. link to pdf
## No comments:

## Post a Comment