Wednesday, February 19, 2014

Restriction of range: effect on slope and correlation

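In bivariate linear regression (y = a + bx), restriction of range hardly influences the slope parameter, whereas the correlation coefficient is underestimated. In least squares regression, the slope parameter b in y = a + bx is calculated as follows:

$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

The correlation coefficient r is calculated as follows:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$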

Or, put differently: b = cov(x,y) / var(x), and r = cov(x,y) / (sd(x)*sd(y)).
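As a quick sanity check, the two identities can be compared against R's built-in estimators. This is only a minimal sketch: the data, the seed, and the variable names (b_manual, r_manual, and so on) are illustrative assumptions, not part of the example that follows.

```r
# Illustrative data (assumed values, chosen only for this check)
set.seed(1)                          # arbitrary seed for reproducibility
x <- rnorm(200)
y <- 2 + 0.5 * x + rnorm(200)

# Slope and correlation from covariances and variances
b_manual <- cov(x, y) / var(x)
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The same quantities from lm() and cor()
b_lm  <- unname(coef(lm(y ~ x))[2])
r_cor <- cor(x, y)

print(c(b_manual = b_manual, b_lm = b_lm))
print(c(r_manual = r_manual, r_cor = r_cor))
```

The two values in each printed pair should agree up to floating-point error.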

Now, suppose we have a random draw of 1000 observations for two random variables, x and y. x is normally distributed with a mean of 100 and a standard deviation of 15, and y is the sum of x and e, with e normally distributed with a mean of 0 and a standard deviation of 10. So, the estimated regression equation should be something like: y_hat = a + b*x = 0 + 1*x. When x and e are independent, cov(x,y) = var(x) and var(y) = var(x) + var(e), so the squared correlation should be something like var(x)/var(y) = 15^2 / (15^2 + 10^2) = 0.692, and the correlation coefficient between x and y should be about sqrt(0.692) = 0.832.

For the full sample of 1000 observations, the estimates are b = 0.966 and r = 0.816. The estimated regression equation is y = 3.857 + 0.966x.

Further stats:
cov(x,y) = 215.925;
var(x) = 223.513;
var(y) = 313.601;
var(e) = 105.264.

(Note, e and x happen to show some dependence in this sample, cov(x,e) = -7.588.)

Now, if we look only at the observations for which x > 100, we get a subsample with restricted range.

If we calculate the slope and correlation for this subsample, we get estimates b = 0.922 and r = 0.611. The regression equation is y = 8.988 + 0.922x. So, the slope remained nearly the same (0.922 versus 0.966), but the correlation got clearly smaller (0.611 versus 0.816).

Further stats:
cov(x,y) = 71.475;
var(x) = 77.515;
var(y) = 176.427;
var(e) = 110.991.

(Note, e and x happen to show some dependence in this subsample, cov(x,e) = -6.039.)

The cause of the decrease in cor(x,y) is that var(x) decreases, but var(e) remains the same. Therefore, the proportion of var(y) accounted for by x has become smaller. The relationship between x and y itself has not changed, though. Restricting the range shrinks cov(x,y) and var(x) by roughly the same factor (from 215.925 to 71.475, and from 223.513 to 77.515), so their ratio stays put. That ratio is the slope parameter b, which represents the expected increase in y resulting from a unit increase in x, and so it remains more or less the same.
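The reported b and r for both samples can be recomputed from the summary statistics listed above. The sketch below does only that arithmetic; the helper names slope() and corr() and the vector layout are assumptions made for illustration.

```r
# Summary statistics as reported in the post
full  <- c(cov_xy = 215.925, var_x = 223.513, var_y = 313.601)
restr <- c(cov_xy = 71.475,  var_x = 77.515,  var_y = 176.427)

# b = cov(x,y) / var(x);  r = cov(x,y) / (sd(x) * sd(y))
slope <- function(s) unname(s["cov_xy"] / s["var_x"])
corr  <- function(s) unname(s["cov_xy"] / sqrt(s["var_x"] * s["var_y"]))

result <- data.frame(
  sample = c("full", "x > 100"),
  b = c(slope(full), slope(restr)),
  r = c(corr(full), corr(restr))
)
print(result, digits = 3)

# Factor by which each statistic shrinks when the range is restricted
shrink <- restr / full
print(round(shrink, 3))
```

cov(x,y) and var(x) both fall to roughly a third of their full-sample values, which leaves b almost unchanged, while var(y) falls only to a bit over half because var(e) does not shrink. That mismatch is what pulls r down.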

R code for generating this example

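x <- rnorm(1000, 100, 15) # generate x
e <- rnorm(1000, 0, 10) # generate error
y <- x + e # generate y

# plot the full sample (text() needs an open plot to draw on)
plot(x, y)
abline(lm(y ~ x))
text(70, 140, labels = paste("r =", round(cor(x, y), 3)))
text(70, 135, labels = paste("b =", round(coef(lm(y ~ x))[2], 3)))
var(x); var(y); var(e); cov(x, y); cov(x, e)

# select subsample with restricted range and plot again:
keep <- x > 100
x_restr <- x[keep]
y_restr <- y[keep]
plot(x_restr, y_restr)
abline(lm(y_restr ~ x_restr))
text(110, 145, labels = paste("r =", round(cor(x_restr, y_restr), 3)))
text(110, 142, labels = paste("b =", round(coef(lm(y_restr ~ x_restr))[2], 3)))
var(x_restr); var(y_restr); var(e[keep]); cov(x_restr, y_restr); cov(x_restr, e[keep])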
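A single draw could be a fluke, so one way to check the pattern is to repeat the experiment many times and average the estimates. The following is a rough sketch under the same parameters as the code above (x with mean 100 and sd 15, e with sd 10, cut-off at x > 100). The seed, the number of replications (200), and the names one_draw, sims and avg are assumptions introduced here.

```r
set.seed(2014)  # arbitrary seed, only so the run is repeatable

# One replication: estimate b and r on the full and the restricted sample
one_draw <- function(n = 1000) {
  x <- rnorm(n, 100, 15)
  y <- x + rnorm(n, 0, 10)
  keep <- x > 100
  c(b_full  = unname(coef(lm(y ~ x))[2]),
    r_full  = cor(x, y),
    b_restr = unname(coef(lm(y[keep] ~ x[keep]))[2]),
    r_restr = cor(x[keep], y[keep]))
}

sims <- replicate(200, one_draw())  # matrix: 4 statistics x 200 replications
avg  <- rowMeans(sims)
print(round(avg, 3))
```

Averaged over replications, both slopes should sit close to 1, while the restricted-range correlation should come out clearly below the full-sample one.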
