Introduction

The most widely used methods for prediction include linear regressions, logistic regressions, k-Nearest Neighbors (k-NN)1, support vector machines (SVM)2, neural networks (NNs)3, extreme learning machines (ELM)4, deep learning (DL)5, random forests (RF)6,7 and generalized boosted models (GBM)8,9.

However, each method has its own drawbacks. For instance, linear regression and logistic regression handle linear and log-linear conditions, respectively, but may fail while dealing with nonlinear tasks. k-NNs are sensitive to the local structure of the data, with the best choice for k dependent on the properties of each datasets10. SVMs have uncalibrated class membership probabilities, large memory requirements (O(N2)) and difficult-to-interpret parameters2,11,12. NNs and DL are computationally expensive, with features learnt and tuned iteratively13,14. ELMs do not have sufficient features to handle complex works15. GBMs have high memory consumption and low evaluation speed16, as all base-learners must be evaluated in order to obtain predictions for the model. For RFs, decision trees are axis-parallel, which may lead to suboptimal trees; though oblique random forests provide one way to improve the performance of random forests17, ultimately they may fail on datasets with greater depth18.

We created Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks, boosting and random forests. We compared the performance of RBF with that of seven other methods, using 28 datasets from the UCI (University of California, Irvine) Machine Learning Repository. We then tested RBF on real psoriasis genome-wide association study (GWAS) data.

Methods

Summary

For clarity, features were standardized by subtracting the mean and dividing by standard deviation. The features were then transformed into random features/basis, by gradient boosting of the Random Bits base learner, a 3-layer sparse neural network with random weights and fed to a random forest classifier/regressor to obtain predictions (Fig. 1).

Figure 1
figure 1

The summarized process.

A 3-layer sparse neural network with random weights. Z represents threshold functions.

Random Bits

Our derived feature/basis/base learner is called Random Bits. It is a 3-layer sparse neural network with random weights. Two parameters were used to construct the neural network: twist1 (the number of features connected to each hidden node) and twist2 (the number of hidden nodes).

The features connected with hidden node are randomly assigned and interlayer weights are drawn from a standard normal distribution. The hidden nodes and the top node are the threshold units, with the threshold of each node determined by calculating the linear summation of its input for the ith sample zi and choosing a random zi among the sample as the threshold15.

Boosting Random Bits

In order to generate many Random Bits, we used a gradient boosting scheme with the following pseudocode:

For boostā€‰=ā€‰1 to B:

For stepā€‰=ā€‰1 to S:

Ā Ā Ā Ā Ā 1: residualā€‰=ā€‰Y; MaxVarā€‰=ā€‰0; BestBitā€‰=ā€‰NULL;

Ā Ā Ā Ā Ā 2: For candā€‰=ā€‰1 to C:

Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā 1: Draw a random bit, RB

Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā 2: Calculate the residual explained by RB: Var

Ā Ā Ā Ā Ā Ā Ā Ā Ā Ā 3: if (Varā€‰>ā€‰MaxVar) {MaxVarā€‰=ā€‰Var; BestBitā€‰=ā€‰RB;}

Ā Ā Ā Ā Ā 3: Set the random_bit_pool [(boostā€‰āˆ’ā€‰1)*ā€‰Sā€‰+ā€‰step]ā€‰=ā€‰BestBit

Ā Ā Ā Ā Ā 4: Mean[0]ā€‰=ā€‰E(residual|BestBitā€‰=ā€‰0), Mean[1]ā€‰=ā€‰E(residual|BestBitā€‰=ā€‰1)

Ā Ā Ā Ā Ā 5: residualā€‰=ā€‰residualā€‰āˆ’ā€‰Mean[BestBit];

The algorithm launched B independent boosting chains, each with S steps. Each boosting chain undergoes the standard gradient boosting procedure, starting with a residual of Y and updating every step. In each step, C Random Bits features (Cā€‰>ā€‰100) were generated and the bit with the largest pseudo residual was chosen. The Random Bits from each independent boosting chain were collected to form a large (~10,000) feature pool. The Random Bits were stored in a compressed format requiring 1 bit per Random Bits per sample.

Random Bits Forest

The produced Random Bits are eventually fed to Random Bits Forest. Random Bits Forest is a random forest classifier/regressor, but slightly modified for speed: each tree was grown with a bootstrapped sample and bootstrapped bits, the number of which can be tuned by users. The best bits among all the bootstrapped bits were chosen for each split. By making full use of the binary nature of Random Bits, through special coding and Streaming SIMD Extensions (SSE), acceleration was achieved, such that the modified random forest can afford ~10,000 binary features for large datasets (Nā€‰=ā€‰500,000).

Benchmarking

We benchmarked nine methods: linear regression (Linear), logistic regression (LR), k-Nearest Neighbors (kNN), neural networks (NN), support vector machines (SVM), extreme learning machines (ELM), random forests (RF), generalized boosted models (GBM) and Random Bits Forest (RBF). We used the RBF software available at http://sourceforge.net/projects/random-bits-forest/ and implemented the other eight methods using various R (v3.2.1) packages: stats, RWeka (v0.4-24), nnet (v7.3-8), kernlab (v0.9-19), randomForest (v4.6-10), elmNN (v1.0) and gbm (v2.1). We used ten-fold cross validation (accuracy, sensitivity, specificity and AUC) to evaluate each methodā€™s performance. For methods sensitive to parameter selection, we manually tuned the parameters to obtain the best performance. As we chose the best handpicked parameters for each method respectively, the performance of each method based on the best parameters was comparable with each other. The results of tuning the parameters of sensitive methods on the real psoriasis genome-wide association study (GWAS) dataset were provided as Supplemental Materials 1. Benchmarking was performed on a desktop PC equipped with an AMD FX-8320 CPU and 32GB of memory. SVM, on some large-sample datasets, failed to complete benchmarking within reasonable time (1 week), so those results were left as blank.

Benchmarked UCI Datasets Study

We benchmarked all datasets from the UCI Machine Learning Repository19 that fulfilled the following criteria including: (1) the dataset contains no missing values; (2) the dataset is in dense matrix form; (3) the dataset uses only binary classification; and (4) the dataset had clear instructions and specified the target variable.

We included 14 regression datasets (3D Road Network20, Bike Sharing21, Buzz in social media tomhardware,Buzz in social media twitter,Computer hardware22, Concrete compressive strength23,Forest fire24,Housing25,Istanbul stock exchange26,Parkinsons telemonitoring27,Physicochemical properties of protein tertiary structure, Wine quality28, Yacht hydrodynamics29,Year prediction MSD)30 and 14 classification datasets (Banknote authentication, Blood transfusion service center31,Breast cancer wisconsin diagnostic32,Climate model simulation crashes33,Connectionist bench34,EEG eye state, Fertility35,Habermans survival36,Hill valley with noise37,Indian liver patient38,Ionosphere39,MAGIC gamma telescope40,QSAR biodegradation41,Skin segmentation)42.

Applications on GWAS Dataset Study

We applied each method to a psoriasis genome-wide association (GWAS) genetic dataset43,44 to predict disease outcomes. We obtained the dataset, a part of the Collaborative Association Study of Psoriasis (CASP), from the Genetic Association Information Network (GAIN) database, a partnership of the Foundation for the National Institutes of Health. The data were available at http://dbgap.ncbi.nlm.nih.gov. through dbGaP accession number phs000019.v1.p1. All genotypes were filtered by checking for data quality44. We used 1590 subjects (915 cases, 675 controls) in the general research use (GRU) group and 1133 subjects (431 cases and 702 controls) in the autoimmune disease only (ADO) group. A dermatologist diagnosed all psoriasis cases. Each participantā€™s DNA was genotyped with the Perlegen 500K array. Both cases and controls agreed to sign the consent contract and controls (ā‰„18 years old) had no confounding factors relative to a known diagnosis of psoriasis.

We used both SNP ranking and multiple logistic regression methods, based upon allelic association p-values, for feature selection in training datasets and compared the different methods in both training and testing datasets. First, we trained the model based on the GRU dataset with different numbers of top associated SNPs and then chose the robust and popular method (LR) to select the best number of SNPs as predictors based on the maximum AUC of the independent ADO (testing) dataset (Fig. 2 and Supplemental Materials 2). We then selected the best number (best number of SNPsā€‰=ā€‰50) of top associated SNPs as input variables and evaluated their performance in both the GRU (training) dataset and independent ADO (testing) dataset for each learning algorithm (except LR). To know more information of these selected 50 top associated SNPs, the Pearsonā€™s R squared and Odds Ratio45 were also provided in Supplemental Materials 3.

Figure 2
figure 2

Maximum AUC of the independent ADO testing dataset with different numbers of markers.

To evaluate a classification methodā€™s performance on an imbalanced dataset, we used the area under the receiver operating characteristics (ROC) curve. The area under the curve (AUC) measures the global classification accuracy and is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance46. We used the AUC as a measure of classifier performance for both GRU (training) and ADO (testing) datasets (Table 3, Figs 3 and 4). The 95% confidence interval (CI) of the AUC47, sensitivity, specificity and accuracy of all methods were also calculated by choosing the optimal threshold value.

Table 3 Psoriasis prediction performance with all methods based on best number of SNP subsets.
Figure 3
figure 3

The ROC curve of six best benchmarked methods on the Psoriasis GWAS dataset of independent ADO group using selected best number of SNPs.

Figure 4
figure 4

The average of ten-foldā€™s cross-validation ROC curve of six best benchmarked methods on the Psoriasis GWAS dataset of GRU group using selected best number of SNPs.

Results

Results from UCI Datasets Study

Table 1 shows the regression root-mean-square error (RMSE) of all methods on 14 datasets. RBF was the top performing method in 13 and the second best performing method in 1. In the case (Housing) in which RBF was not the best method, the difference between RBF and the top performing method (RF) was within 2%. RF was the second best performing among the regression datasets. RBFā€™s performance exhibited the greatest improvement over that of the other methods with the 3D Road Network dataset, a shallow task in which the methods predicted the altitude at specific points on a 3D map. However, RBF outperformed RF by allowing non-axis-parallel splitting.

Table 1 Regression RMSE of all methods on 14 datasets.

Table 2 shows the classification error of each method among 14 datasets. RBF was the top performer in 8 datasets, the second best in 5 and the third best for 1. In the cases RBF was not the best method, the difference between RBF and the top performing method was within 2%. SVM was the second best method among classification datasets. RBFā€™s performance exhibited the greatest improvement over that of the other methods with the Hill valley with noise dataset, a deep task in which the methods classified the shape (ā€œhillā€ or ā€œvalleyā€) of a time series with 100 time points. Although all other methods, except neural networks, failed to well perform this task, RBF and its 3-layer random neural network features worked well on this dataset.

Table 2 Classification error of all methods on 14 datasets.

Furthermore, we also observed that the datasets in which RBF performed best were all big datasets (Nā€‰>ā€‰1000 with limited features, Table 1 and Table 2). This is due to the nature of trees, which inherently require larger samples than do regressions.

Results from GWAS dataset study

Figure 2 and Supplemental Materials 2 shows that the ideal number of biomarkers for prediction of psoriasis was 50 in the efficient LR classifier. When the number of biomarkers was less than 20, the AUC of independent ADO (test) dataset was unstable in LR classifier. On the other hand, as the number of biomarkers approached 50, performance improved and stabilized: the best AUC for LR was 0.7063, respectively. Performance did not significantly improve as the number of biomarkers increased over 50.

As seen in Table 3, all benchmarked methods were used to construct effective diagnosis models for psoriasis prediction based on optimal number of SNP subsets. No significant unbalances were found in the training and testing datasets, suggesting the credibility and stability of the prediction models. The average of AUC of 10-fold cross-validation48 in the training dataset and AUC of the independent testing dataset were used to evaluate the performance of all methods. The AUC of each method ranged from 0.6192āˆ’0.6739 in the training dataset and from 0.6563āˆ’0.7239 in the testing dataset. We found that RBF, GBM, SVM and RF were the four top performing methods in both the training dataset and the testing dataset. RBF was the top performer in both the training dataset (AUCā€‰=ā€‰0.6739, 95% CI: [0.5254, 0.8275], sensitivityā€‰=ā€‰0.6317, specificityā€‰=ā€‰0.6490, accuracyā€‰=ā€‰0.6390) and the testing dataset (AUCā€‰=ā€‰0.7239, 95% CI: [0.6930, 0.7548], sensitivityā€‰=ā€‰0.6543, specificityā€‰=ā€‰0.7151, accuracyā€‰=ā€‰0.6920). The ROC curves for each method are also shown in Fig. 3 and Fig. 4 for performance comparison visualization.

Furthermore, RBF appeared to be robust in sensitivity and specificity in both the training and testing datasets. Although the sensitivity and specificity of RBF were not the best for all datasets, its AUC still was the top performer in both GRU (training) and ADO (testing) datasets. This characteristic of RBF is also applicable in the unbalanced dataset, whose prediction performance may be easily influenced by the disease population ratio. In Table 3, we see that although KNN has the second accuracy (accuracyā€‰=ā€‰0.6884) in the testing dataset, its AUC performance (AUCā€‰=ā€‰0.7021) is poor because it pays more attention to specificity (specificityā€‰=ā€‰0.7279) than sensitivity (sensitivity = 0.6241).

Discussion

Random forests are among the top performing algorithms for machine learning, as they are accurate, fast, flexible and mature. Random forest6 is a substantial modification of bagging which builds a large number of de-correlated trees and then averages the trees. The main idea of random forests is to improve the variance reduction of bagging by reducing the correlation between trees without increasing the variance heavily49. And the target is achieved in the tree-growing process by randomly selecting the input variables. Thus, Random Bits Forest mainly focuses on the automated feature engineering of random forests. We also obtain good results if we feed random bits to a regularized linear regression, though, in big data cases, no better than we get from random forests. And the statistical inference50 of random forests equally applies to RBF.

RBF outperforms the random forest algorithm by breaking its two limitations: the limitation to axis-parallel splitting that may lead to suboptimal trees17 and the decision tree depth of two that could fail on dataset with greater depth18. To overcome the first limitation, we used random projections. Because of pre-generation of many (~10,000) random projections, the tree is allowed to grow with more freedom. To overcome the second limitation, we improved naĆÆve random projections with a 3-layer random neural network. We then defined a random neural network based on the original features and took its output as a derived feature/basis. Such additional depth may be crucial for specific datasets (UCI dataset: Hill valley with noise, shown in Table 2).

Compared to oblique random forests, RBF generated non-axis parallel features before random forest while oblique random forests generates oblique splits within the tree-growing process. One crucial improvement to our random projections was to use 3-layer random neural networks as random projection/basis, giving the random forest more depth. Additional layers did not improve accuracy on the benchmarked datasets, potentially because 3-layer neural networks are already universal approximations.

In order to make full use of our ~10,000 bits budget, we need a feature selection procedure rather than naĆÆve random projections. Feature selection was achieved by employing the gradient boosting framework. Instead of directly using the boosting predictions, we collected the boosted basis and fed them into the random forest. First, we found the random bit that best explained the residual and subtracted its effect from the residual to avoid highly correlated random bits. For the Hill valley with noise dataset, this method for feature selection reduced error from 11% to 2.5%, compared with naĆÆve random projections.

In the boosting procedure, we used multiple independent boost chains, originally just for ease of parallel computing. However, multiple chains also reduced the local optimum problem and led to better prediction. For small datasets, 256 boost chains were used.

Large sample (Nā€‰>ā€‰1000) are important for the success of RBF since trees are more flexible models than are linear models and as a result require a larger sample size. For smaller samples, regularization is useful, which was achieved by limiting the bootstrapped sample size. The consequence is that each tree was suboptimal and biased, but the trees are further decorrelated, thus reducing variance. Reducing feature bootstrap also helped to regularize the problem.

In summary, we firstly present Random Bits Forest (RBF), an original classification and regression algorithm that integrates the advantages of neural networks (for learning depth), boosting (for learning width) and random forests (for prediction accuracy). That is the reason why Random Bits Forest will perform better than other methods.

In conclusion, RBF is a novel robust method for machine learning, which is especially effective in datasets with large sample sizes (Nā€‰>ā€‰1000). Our work indicates that RBF performs better if fed with extracted/selected features by using appropriate feature selection methods.

Additional Information

How to cite this article: Wang, Y. et al. Random Bits Forest: a Strong Classifier/Regressor for Big Data. Sci. Rep. 6, 30086; doi: 10.1038/srep30086 (2016).