Monday, June 12, 2017

A Two-Step Approach to Finding Hidden Structure in Your Data Through Overfitting

Whenever you specify your hypotheses before exploring your data, you may be likely to end up with null results, which may spoil your fun. Instead, you can have more fun by overfitting your data: By performing exploratory analyses to generate hypotheses and subsequently test those hypotheses by performing confirmatory analyses on the exact same dataset. In what follows, I will provide examples of this two-step approach to overfitting.

Step 1: Explore your dataset to generate hypotheses

For example, we may have a dataset of 100 observations on 10 variables that are actually uncorrelated. To illustrate, let’s generate such a dataset:
set.seed(1)
data <- matrix(rnorm(1000), nrow = 100, ncol = 10)
colnames(data) <- paste("x", 1:10, sep = "")
Let’s explore the data by inspecting the correlations:
library(corrplot)
corrplot(cor(data), method = "number", type = "upper")

We got at least one meaningful correlation there: x4 and x9 show a correlation \(>.2\). Now let’s perform some statistical testing to see if some are correlations significant:
library(Hmisc)
pvalues <- rcorr(data)$P
diag(pvalues) <- 1
corrplot(1-pvalues, type = "upper", method = "number", main = "1 minus p-value")

Two out of 45 correlation coefficients are significant. Surely there must be a publication in these data! Also, we could decide to test one- instead of two-sided, then we have 3 significant associations. But what would be much better than 3 significant correlations? Yes, indeed: a mediation effect!

Step 2: Perform confirmatory analyses on the same dataset

Example 1: Mediation Analysis

Obviously, we hypothesized before gathering our data that x10 is a causal variable, with a partially indirect effect on x9, via x4.
lm1 <- lm(x9 ~ x10, data = data.frame(data))
summary(lm1)
## 
## Call:
## lm(formula = x9 ~ x10, data = data.frame(data))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.46497 -0.74698  0.03043  0.64968  2.86824 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 0.009241   0.107194   0.086   0.9315  
## x10         0.205740   0.102135   2.014   0.0467 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.03976,    Adjusted R-squared:  0.02996 
## F-statistic: 4.058 on 1 and 98 DF,  p-value: 0.04671
Great! The causal and dependent variable are signficantly associated!
lm2 <- lm(x9 ~ x4 + x10, data = data.frame(data))
summary(lm2)
## 
## Call:
## lm(formula = x9 ~ x4 + x10, data = data.frame(data))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.36641 -0.62453  0.04468  0.62737  2.84705 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.001822   0.105719  -0.017   0.9863  
## x4           0.216306   0.107765   2.007   0.0475 *
## x10          0.181501   0.101315   1.791   0.0763 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.07805,    Adjusted R-squared:  0.05904 
## F-statistic: 4.106 on 2 and 97 DF,  p-value: 0.01942
Awesome! We have partial mediation: the effect of x10 is smaller when x4 is included in the equation and even becomes insignificant.

Example 2: Factor Analysis

Confirmatory factor analysis (CFA) offers a great way to generate a lot of outcomes at once: There will always be a parameter estimate, significance test or fit index that you can use to back up any interpretation you prefer. As we have no clue what we are looking for in our data, let’s explore the data first using PCA, which will provide us with the hypotheses we need to subsequently fit a CFA on the same data:
library(psych)
VSS.scree(data)

The scree plot indicates that five components have an eigenvalue $ > 1 $, providing justification for a 5-factor solution. However, such a solution is not very parsimonious for 10 items, so let’s use the elbow criterion and hypothesize that there are 2 factors in our data.
Of course, a better way to go about exploring the factorial structure of the data would be to perform parallel analysis, which would have shown that we would have obtained a very similar scree plot using simulated or resampled data:
fa.parallel(data)

## Parallel analysis suggests that the number of factors =  0  and the number of components =  0
But that spoils all the fun, so let’s ignore these results. Instead, we explore and interpret the 2-component solution:
principal(data, nfactors = 2, rotate = "varimax")
## Principal Components Analysis
## Call: principal(r = data, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##       RC1   RC2    h2   u2 com
## x1   0.07  0.48 0.235 0.76 1.0
## x2  -0.07  0.28 0.082 0.92 1.1
## x3  -0.08 -0.27 0.080 0.92 1.2
## x4   0.64 -0.05 0.408 0.59 1.0
## x5  -0.05  0.52 0.272 0.73 1.0
## x6   0.02 -0.49 0.244 0.76 1.0
## x7  -0.31  0.28 0.176 0.82 2.0
## x8   0.00  0.62 0.382 0.62 1.0
## x9   0.68  0.00 0.456 0.54 1.0
## x10  0.65  0.20 0.461 0.54 1.2
## 
##                        RC1  RC2
## SS loadings           1.40 1.40
## Proportion Var        0.14 0.14
## Cumulative Var        0.14 0.28
## Proportion Explained  0.50 0.50
## Cumulative Proportion 0.50 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.13 
##  with the empirical chi square  153.01  with prob <  5.9e-20 
## 
## Fit based upon off diagonal values = -0.88
A clear structure emerges: x4, x9 and x10 form a factor; x1, x5, x6 and x8 form a factor; x7 loads on both factors. We may need to discard x2 and x3 because they have low correlations with the components.
This is exactly the structure we expected! So let’s use this structure to perform the CFA:
library(lavaan)
model <- '
  F1 =~ x4 + x9 + x10 + x7
  F2 =~ x1 + x5 + x6 + x8 + x7
'
fit <- cfa(model, data)
fitmeasures(fit, c("chisq", "df", "pvalue", "cfi", "srmr", "rmsea"))
##  chisq     df pvalue    cfi   srmr  rmsea 
## 13.777 18.000  0.744  1.000  0.053  0.000
Great! Our model has a non-significant chi-square value, CFI and RMSEA indicate perfect model fit and SRMR indicates a well-fitting model! Let’s look at the parameter estimates:
summary(fit, standardized = TRUE)
## lavaan (0.5-23.1097) converged normally after  61 iterations
## 
##   Number of observations                           100
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic               13.777
##   Degrees of freedom                                18
##   P-value (Chi-square)                           0.744
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   F1 =~                                                                 
##     x4                1.000                               0.427    0.433
##     x9                1.226    0.861    1.423    0.155    0.524    0.484
##     x10               0.898    0.589    1.523    0.128    0.384    0.366
##     x7               -0.418    0.450   -0.929    0.353   -0.179   -0.166
##   F2 =~                                                                 
##     x1                1.000                               0.294    0.329
##     x5                1.426    1.229    1.160    0.246    0.419    0.360
##     x6               -0.997    0.882   -1.130    0.258   -0.293   -0.305
##     x8                1.146    1.011    1.134    0.257    0.337    0.309
##     x7                0.742    0.805    0.922    0.356    0.218    0.203
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   F1 ~~                                                                 
##     F2                0.012    0.035    0.334    0.738    0.094    0.094
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .x4                0.791    0.176    4.483    0.000    0.791    0.812
##    .x9                0.898    0.240    3.746    0.000    0.898    0.766
##    .x10               0.954    0.176    5.425    0.000    0.954    0.866
##    .x7                1.084    0.174    6.223    0.000    1.084    0.938
##    .x1                0.712    0.133    5.362    0.000    0.712    0.892
##    .x5                1.179    0.237    4.969    0.000    1.179    0.870
##    .x6                0.838    0.149    5.634    0.000    0.838    0.907
##    .x8                1.078    0.193    5.594    0.000    1.078    0.905
##     F1                0.183    0.158    1.154    0.249    1.000    1.000
##     F2                0.086    0.101    0.860    0.390    1.000    1.000
Well, none of the factor loadings or factor (co)variances may be significant, but who expects significance with 100 observations? Well, the item’s residual variances were significant, but those are always signficant, because psychological test data is always very noisy (non-significant residual variances, that would be something to worry about!) The completely standardized loadings indicate that this is a very good model, because the indicators and corresponding factors all have correlations \(>.30\). Except for x7, but this item has cross loadings, so this is to be expected.

Conclusion

These examples show us that there may be more results hiding in your data than your pre-specified hypotheses think. You can always find meaningful results by overfitting: Just use the same data for exploratory and confirmatory analyses. Of course, you have to be careful not to include any of the exploratory parts of your analyses in your paper: Just present the results of the confirmatory analyses!