Introduction

Dichotomising continuous outcomes is considered bad statistical practice, but some applications require predicted probabilities rather than predicted values. To obtain predicted values, we recommend modelling the original continuous outcome with linear regression. To obtain predicted probabilities, we recommend not modelling the artificial binary outcome with logistic regression, but modelling both the original continuous outcome and the artificial binary outcome with combined regression.

Installation

Install the current release from CRAN:

install.packages("cornet")

Or install the development version from GitHub:

#install.packages("devtools")
devtools::install_github("rauschenberger/cornet")

Then load and attach the package:

library(cornet)

Example

We simulate data for $n$ samples and $p$ features, in a high-dimensional setting ($p \gg n$). The matrix $\boldsymbol{X}$ with $n$ rows and $p$ columns represents the features, and the vector $\boldsymbol{y}$ of length $n$ represents the continuous outcome.

set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n*p),nrow=n,ncol=p)
beta <- rbinom(n=p,size=1,prob=0.05)
y <- rnorm(n=n,mean=X%*%beta)

We use the function cornet for modelling the original continuous outcome and the artificial binary outcome. The argument cutoff splits the samples into two groups, those with an outcome less than or equal to the cutoff, and those with an outcome greater than the cutoff.

model <- cornet(y=y,cutoff=0,X=X)
model
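
The split induced by cutoff can be reproduced in base R. This is a minimal sketch, not the package's internal code: the names z and cutoff_value are illustrative, the outcome is a stand-in, and coding the upper group as 1 is an assumption.

```r
# Illustrative sketch in base R (does not call cornet).
set.seed(1)
y <- rnorm(n = 100)      # stand-in continuous outcome
cutoff_value <- 0        # plays the role of the argument 'cutoff'

# Assumed coding: 0 if outcome <= cutoff, 1 if outcome > cutoff.
z <- as.integer(y > cutoff_value)

# Number of samples in each group.
table(z)
```

Every sample falls into exactly one group, because values equal to the cutoff are assigned to the lower group.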

The function coef returns the estimated coefficients. The first column is for the linear model (beta), and the second column is for the logistic model (gamma). The first row includes the estimated intercepts, and the other rows include the estimated slopes.

coef <- coef(model)
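
To illustrate how a coefficient matrix with this layout can be read, the sketch below builds a small hypothetical matrix by hand and computes the linear predictors for two new samples. The numbers, the object names (coefs, X_new) and the column names are assumptions for illustration, not output from cornet.

```r
# Hypothetical coefficient matrix laid out as described above:
# row 1 = intercepts, remaining rows = slopes;
# column "beta" = linear model, column "gamma" = logistic model.
coefs <- cbind(beta  = c(0.5, 1.0, 0.0, -2.0),
               gamma = c(0.1, 0.8, 0.0, -1.5))

# Two new samples with three features each (made-up values).
X_new <- matrix(c(1, 0, 1,
                  0, 2, 0), nrow = 2, byrow = TRUE)

# Linear predictor = intercept + features times slopes.
eta_linear   <- coefs[1, "beta"]  + X_new %*% coefs[-1, "beta"]
eta_logistic <- coefs[1, "gamma"] + X_new %*% coefs[-1, "gamma"]

# The logistic model maps its linear predictor to a probability.
prob_logistic <- plogis(eta_logistic)
```

In practice the function predict does this work for fitted models; the sketch only shows what the intercept row and the slope rows mean.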

The function predict returns fitted values for training data, or predicted values for testing data. The argument newx specifies the feature matrix. The output is a matrix with one column for each model.

predict <- predict(model,newx=X)

The function cv.cornet measures the predictive performance of combined regression by nested cross-validation, in comparison with logistic regression.

cv.cornet(y=y,cutoff=0,X=X)

Here we observe that combined regression outperforms logistic regression (lower logistic deviance), and that logistic regression is only slightly better than the intercept-only model.
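
For readers unfamiliar with the comparison metric, the following base R sketch computes a logistic deviance for an intercept-only model and for a set of hypothetical predicted probabilities. The function logistic_deviance, the averaging over samples and the clipping constant eps are assumptions for illustration; the exact scaling used by cv.cornet may differ.

```r
# Mean logistic deviance of predicted probabilities 'prob'
# for a binary outcome 'z' (lower is better).
logistic_deviance <- function(z, prob, eps = 1e-6) {
  prob <- pmin(pmax(prob, eps), 1 - eps)  # guard against log(0)
  mean(-2 * (z * log(prob) + (1 - z) * log(1 - prob)))
}

z <- c(0, 0, 1, 1, 1)  # made-up binary outcome

# Intercept-only model: the observed proportion for every sample.
prob_null <- rep(mean(z), length(z))

# Hypothetical predictions from an informative model.
prob_good <- c(0.2, 0.1, 0.8, 0.9, 0.7)

logistic_deviance(z, prob_null)
logistic_deviance(z, prob_good)
```

A model is only useful if its deviance is clearly below that of the intercept-only model, which is the comparison made in the paragraph above.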

References

Armin Rauschenberger and Enrico Glaab (2024). “Predicting dichotomised outcomes from high-dimensional data in biomedicine”. Journal of Applied Statistics 51(9):1756-1771. doi: 10.1080/02664763.2023.2233057