Random Bits Forest: a Strong Classifier/Regressor for Big Data

Wang, Yi; Li, Yi; Pu, Weilin; Wen, Kathryn; Shugart, Yin Yao; Xiong, Momiao; Jin, Li

doi:10.1038/srep30086

Article
Open access
Published: 22 July 2016

Yi Wang¹, Yi Li¹, Weilin Pu¹, Kathryn Wen², Yin Yao Shugart², Momiao Xiong³ & Li Jin¹

Scientific Reports volume 6, Article number: 30086 (2016) Cite this article

Abstract

Efficiency, memory consumption and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width) and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed well on an independent dataset, a real psoriasis genome-wide association study (GWAS).

Introduction

The most widely used methods for prediction include linear regressions, logistic regressions, k-Nearest Neighbors (k-NN)¹, support vector machines (SVM)², neural networks (NNs)³, extreme learning machines (ELM)⁴, deep learning (DL)⁵, random forests (RF)^6,7 and generalized boosted models (GBM)^8,9.

However, each method has its own drawbacks. For instance, linear regression and logistic regression handle linear and log-linear conditions, respectively, but may fail on nonlinear tasks. k-NNs are sensitive to the local structure of the data, with the best choice of k depending on the properties of each dataset¹⁰. SVMs have uncalibrated class membership probabilities, large memory requirements (O(N²)) and difficult-to-interpret parameters^2,11,12. NNs and DL are computationally expensive, with features learnt and tuned iteratively^13,14. ELMs do not have sufficient features to handle complex tasks¹⁵. GBMs have high memory consumption and low evaluation speed¹⁶, as all base learners must be evaluated to obtain predictions from the model. For RFs, decision trees split axis-parallel, which may lead to suboptimal trees; oblique random forests offer one way to improve on this¹⁷, but they may still fail on datasets with greater depth¹⁸.

We created Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks, boosting and random forests. We compared the performance of RBF with that of seven other methods, using 28 datasets from the UCI (University of California, Irvine) Machine Learning Repository. We then tested RBF on real psoriasis genome-wide association study (GWAS) data.

Methods

Summary

Features were first standardized by subtracting the mean and dividing by the standard deviation. The standardized features were then transformed into random features (a basis) by gradient boosting of the Random Bits base learner, a 3-layer sparse neural network with random weights. The resulting features were fed to a random forest classifier/regressor to obtain predictions (Fig. 1).

Random Bits

Our derived feature/basis/base learner is called Random Bits. It is a 3-layer sparse neural network with random weights. Two parameters were used to construct the neural network: twist1 (the number of features connected to each hidden node) and twist2 (the number of hidden nodes).

The features connected to each hidden node are assigned at random, and the interlayer weights are drawn from a standard normal distribution. The hidden nodes and the top node are threshold units. The threshold of each node is determined by computing the linear summation of its inputs, z_i, for every sample i and then choosing the z_i of one randomly selected sample as the threshold¹⁵.
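The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' C++ implementation: the function name draw_random_bit, the helper threshold_unit, the strict ">" comparison and the use of the standard random module are all choices made here for the example. It also assumes twist1 does not exceed the number of features.

```python
import random

def draw_random_bit(X, twist1, twist2, rng):
    """Return the 0/1 output of one Random Bit for every row of X (hypothetical helper)."""
    n_samples, n_features = len(X), len(X[0])

    def threshold_unit(inputs):
        # inputs holds one list of input values per sample.
        # Weights are drawn from a standard normal distribution.
        w = [rng.gauss(0.0, 1.0) for _ in inputs[0]]
        z = [sum(wi * xi for wi, xi in zip(w, row)) for row in inputs]
        # The threshold is the z value of one randomly chosen sample.
        t = z[rng.randrange(n_samples)]
        # Assumption: the unit fires on strict inequality.
        return [1 if zi > t else 0 for zi in z]

    hidden = []
    for _ in range(twist2):                       # twist2 hidden nodes
        idx = rng.sample(range(n_features), twist1)   # twist1 features per node
        hidden.append(threshold_unit([[row[j] for j in idx] for row in X]))

    # The top node is a threshold unit over the hidden-node outputs.
    hidden_rows = [list(col) for col in zip(*hidden)]
    return threshold_unit(hidden_rows)

# Usage on toy data: 20 samples, 5 standardized-looking features.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(20)]
bit = draw_random_bit(X, twist1=2, twist2=3, rng=rng)
```

Because every threshold is taken from the data itself, each unit's cut always falls inside the observed range of its input.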

Boosting Random Bits

In order to generate many Random Bits, we used a gradient boosting scheme with the following pseudocode:

For boost = 1 to B:
    residual = Y
    For step = 1 to S:
        1: MaxVar = 0; BestBit = NULL
        2: For cand = 1 to C:
            a: Draw a random bit, RB
            b: Calculate the residual explained by RB: Var
            c: if (Var > MaxVar) {MaxVar = Var; BestBit = RB;}
        3: Set random_bit_pool[(boost − 1) * S + step] = BestBit
        4: Mean[0] = E(residual | BestBit = 0), Mean[1] = E(residual | BestBit = 1)
        5: residual = residual − Mean[BestBit]

The algorithm launches B independent boosting chains, each with S steps. Each chain follows the standard gradient boosting procedure, starting with a residual of Y and updating it at every step. In each step, C candidate Random Bits (C > 100) were generated and the bit that explained the most of the current residual was chosen. The Random Bits from all independent boosting chains were collected to form a large (~10,000) feature pool. The Random Bits were stored in a compressed format requiring 1 bit per Random Bit per sample.
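The boosting scheme can be written as a short runnable sketch. Several details are assumptions made for illustration: the "residual explained" is taken to be the between-group sum of squares of the residual, a step is skipped when no candidate explains anything, and draw_bit is a caller-supplied function returning a list of 0/1 values. The names explained, boost_random_bits and toy_draw_bit are hypothetical and do not come from the RBF software.

```python
import random

def explained(residual, bit):
    """Between-group sum of squares of the residual for a 0/1 bit, plus branch means."""
    overall = sum(residual) / len(residual)
    means, gain = {}, 0.0
    for value in (0, 1):
        group = [r for r, b in zip(residual, bit) if b == value]
        means[value] = sum(group) / len(group) if group else 0.0
        gain += len(group) * (means[value] - overall) ** 2
    return gain, means

def boost_random_bits(y, draw_bit, B, S, C):
    """Collect the best bit of every step from B independent chains of S steps."""
    pool = []
    for _ in range(B):
        residual = list(y)                 # each chain starts from Y
        for _ in range(S):
            best_gain, best_bit, best_means = 0.0, None, None
            for _ in range(C):             # C candidate bits per step
                bit = draw_bit()
                gain, means = explained(residual, bit)
                if gain > best_gain:
                    best_gain, best_bit, best_means = gain, bit, means
            if best_bit is None:           # assumption: skip uninformative steps
                continue
            pool.append(best_bit)
            # Subtract the branch mean so later bits explain what is left.
            residual = [r - best_means[b] for r, b in zip(residual, best_bit)]
    return pool

def toy_draw_bit(xs, rng):
    """Stand-in bit generator: threshold one feature at a random sample's value."""
    t = rng.choice(xs)
    return [1 if x > t else 0 for x in xs]

rng = random.Random(1)
xs = [i / 10 for i in range(20)]
y = [1.0 if x > 0.95 else 0.0 for x in xs]
pool = boost_random_bits(y, lambda: toy_draw_bit(xs, rng), B=2, S=3, C=10)
```

Only the selected bits are kept; the boosted predictions themselves are discarded, which matches the use of boosting as a feature selector rather than as the final model.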

Random Bits Forest

The produced Random Bits are eventually fed to Random Bits Forest. Random Bits Forest is a random forest classifier/regressor, but slightly modified for speed: each tree was grown with a bootstrapped sample and bootstrapped bits, the number of which can be tuned by users. The best bits among all the bootstrapped bits were chosen for each split. By making full use of the binary nature of Random Bits, through special coding and Streaming SIMD Extensions (SSE), acceleration was achieved, such that the modified random forest can afford ~10,000 binary features for large datasets (N = 500,000).
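The paper does not describe its "special coding", so the following is only an illustration of why binary features are cheap to split on, not a description of the RBF implementation. It assumes a binary class label and packs one bit per sample into an integer; the class counts on both sides of a candidate split then reduce to a bitwise AND and a population count, the kind of operation that SIMD instructions apply to many samples at once. The names pack_bits and split_counts are hypothetical.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one Python int (sample i maps to bit i)."""
    word = 0
    for i, b in enumerate(bits):
        if b:
            word |= 1 << i
    return word

def split_counts(feature_word, label_word, n_samples):
    """Return ((neg, pos) where feature = 0, (neg, pos) where feature = 1)."""
    ones = bin(feature_word).count("1")                   # samples with feature = 1
    ones_pos = bin(feature_word & label_word).count("1")  # of those, label = 1
    pos = bin(label_word).count("1")                      # all samples with label = 1
    zeros = n_samples - ones
    zeros_pos = pos - ones_pos
    return (zeros - zeros_pos, zeros_pos), (ones - ones_pos, ones_pos)

feature = pack_bits([1, 0, 1, 1, 0])
label = pack_bits([1, 0, 0, 1, 1])
left, right = split_counts(feature, label, n_samples=5)
```

From these four counts any impurity measure for the split can be computed without revisiting individual samples.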

Benchmarking

We benchmarked nine methods: linear regression (Linear), logistic regression (LR), k-Nearest Neighbors (kNN), neural networks (NN), support vector machines (SVM), extreme learning machines (ELM), random forests (RF), generalized boosted models (GBM) and Random Bits Forest (RBF). We used the RBF software available at http://sourceforge.net/projects/random-bits-forest/ and implemented the other eight methods using various R (v3.2.1) packages: stats, RWeka (v0.4-24), nnet (v7.3-8), kernlab (v0.9-19), randomForest (v4.6-10), elmNN (v1.0) and gbm (v2.1). We used ten-fold cross validation, reporting accuracy, sensitivity, specificity and AUC, to evaluate each method’s performance. For methods sensitive to parameter selection, we manually tuned the parameters to obtain the best performance. Because each method was run with its best hand-picked parameters, the methods are compared at their best observed performance. The results of tuning the parameters of sensitive methods on the real psoriasis genome-wide association study (GWAS) dataset are provided as Supplemental Materials 1. Benchmarking was performed on a desktop PC equipped with an AMD FX-8320 CPU and 32GB of memory. SVM failed to complete benchmarking within a reasonable time (1 week) on some large-sample datasets, so those results were left blank.
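For readers unfamiliar with ten-fold cross validation, the splitting step can be sketched as follows. This is a generic illustration rather than the benchmarking code used in the study; the function name k_fold_indices, the shuffling and the seed are assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one test fold, so every prediction used for scoring comes from a model that did not see that sample during training.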

Benchmarked UCI Datasets Study

We benchmarked all datasets from the UCI Machine Learning Repository¹⁹ that fulfilled the following criteria: (1) the dataset contains no missing values; (2) the dataset is in dense matrix form; (3) for classification, the dataset uses only binary classification; and (4) the dataset has clear instructions and specifies the target variable.

We included 14 regression datasets (3D Road Network²⁰, Bike Sharing²¹, Buzz in social media tomhardware, Buzz in social media twitter, Computer hardware²², Concrete compressive strength²³, Forest fire²⁴, Housing²⁵, Istanbul stock exchange²⁶, Parkinsons telemonitoring²⁷, Physicochemical properties of protein tertiary structure, Wine quality²⁸, Yacht hydrodynamics²⁹, Year prediction MSD³⁰) and 14 classification datasets (Banknote authentication, Blood transfusion service center³¹, Breast cancer wisconsin diagnostic³², Climate model simulation crashes³³, Connectionist bench³⁴, EEG eye state, Fertility³⁵, Habermans survival³⁶, Hill valley with noise³⁷, Indian liver patient³⁸, Ionosphere³⁹, MAGIC gamma telescope⁴⁰, QSAR biodegradation⁴¹, Skin segmentation⁴²).

Applications on GWAS Dataset Study

We applied each method to a psoriasis genome-wide association (GWAS) genetic dataset^43,44 to predict disease outcomes. We obtained the dataset, a part of the Collaborative Association Study of Psoriasis (CASP), from the Genetic Association Information Network (GAIN) database, a partnership of the Foundation for the National Institutes of Health. The data were available at http://dbgap.ncbi.nlm.nih.gov through dbGaP accession number phs000019.v1.p1. All genotypes were filtered by checking for data quality⁴⁴. We used 1590 subjects (915 cases, 675 controls) in the general research use (GRU) group and 1133 subjects (431 cases and 702 controls) in the autoimmune disease only (ADO) group. A dermatologist diagnosed all psoriasis cases. Each participant’s DNA was genotyped with the Perlegen 500K array. Both cases and controls provided signed consent, and controls (≥18 years old) had no confounding factors relative to a known diagnosis of psoriasis.

We used both SNP ranking and multiple logistic regression methods, based upon allelic association p-values, for feature selection in training datasets and compared the different methods in both training and testing datasets. First, we trained the model on the GRU dataset with different numbers of top associated SNPs and then used LR, a robust and popular method, to select the best number of SNPs as predictors based on the maximum AUC on the independent ADO (testing) dataset (Fig. 2 and Supplemental Materials 2). We then took the best number (50) of top associated SNPs as input variables and evaluated each learning algorithm (except LR) on both the GRU (training) dataset and the independent ADO (testing) dataset. To give more information on the 50 selected top associated SNPs, the Pearson’s R squared and odds ratio⁴⁵ are also provided in Supplemental Materials 3.
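The text does not spell out how the allelic association p-values were computed. One common form, shown here purely as an assumption, is a chi-square test with 1 degree of freedom on the 2 × 2 table of allele counts in cases and controls, with genotypes coded as the number of minor-allele copies (0, 1 or 2). The names allelic_test and rank_snps and the toy genotypes are hypothetical.

```python
import math

def allelic_test(case_genotypes, control_genotypes):
    """Chi-square statistic and p-value (1 df) from allele counts."""
    a = sum(case_genotypes)                  # minor alleles in cases
    b = 2 * len(case_genotypes) - a          # major alleles in cases
    c = sum(control_genotypes)               # minor alleles in controls
    d = 2 * len(control_genotypes) - c       # major alleles in controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:                           # monomorphic SNP or empty group
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

def rank_snps(genotypes_by_snp, is_case, top_k):
    """Return the names of the top_k SNPs with the smallest p-values."""
    p_values = {}
    for name, genotypes in genotypes_by_snp.items():
        cases = [g for g, flag in zip(genotypes, is_case) if flag]
        controls = [g for g, flag in zip(genotypes, is_case) if not flag]
        p_values[name] = allelic_test(cases, controls)[1]
    return sorted(p_values, key=p_values.get)[:top_k]

snps = {"snp_a": [2, 2, 1, 1, 0, 0, 1, 0], "snp_b": [1, 0, 1, 0, 1, 0, 1, 0]}
is_case = [True] * 4 + [False] * 4
top = rank_snps(snps, is_case, top_k=1)
```

Ranking must be done on the training data only, as in the text, so that the independent testing dataset does not influence which SNPs are selected.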

To evaluate a classification method’s performance on an imbalanced dataset, we used the area under the receiver operating characteristics (ROC) curve. The area under the curve (AUC) measures the global classification accuracy and is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance⁴⁶. We used the AUC as a measure of classifier performance for both GRU (training) and ADO (testing) datasets (Table 3, Figs 3 and 4). The 95% confidence interval (CI) of the AUC⁴⁷, sensitivity, specificity and accuracy of all methods were also calculated by choosing the optimal threshold value.
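The probabilistic reading of the AUC given above translates directly into code: compare every positive instance with every negative instance and count how often the positive one receives the higher score. The sketch below is a generic illustration, not the software used in the study; counting ties as one half is a common convention assumed here, and the function name auc is hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # assumption: ties count as half a win
    return wins / (len(pos) * len(neg))

value = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

Because the measure depends only on how cases are ranked against controls, it does not change with the case-to-control ratio, which is why it suits imbalanced datasets.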

Table 3 Psoriasis prediction performance with all methods based on best number of SNP subsets.

Results

Results from UCI Datasets Study

Table 1 shows the regression root-mean-square error (RMSE) of all methods on 14 datasets. RBF was the top performing method on 13 datasets and the second best on 1. On the one dataset (Housing) where RBF was not the best method, the difference between RBF and the top performing method (RF) was within 2%. RF was the second best performing method across the regression datasets. RBF’s improvement over the other methods was greatest on the 3D Road Network dataset, a shallow task in which the methods predicted the altitude at specific points on a 3D map. Even on this shallow task, RBF outperformed RF by allowing non-axis-parallel splitting.

Table 1 Regression RMSE of all methods on 14 datasets.

Table 2 shows the classification error of each method on 14 datasets. RBF was the top performer on 8 datasets, the second best on 5 and the third best on 1. Where RBF was not the best method, the difference between RBF and the top performing method was within 2%. SVM was the second best method across the classification datasets. RBF’s improvement over the other methods was greatest on the Hill valley with noise dataset, a deep task in which the methods classified the shape (“hill” or “valley”) of a time series with 100 time points. All other methods except neural networks failed to perform well on this task, whereas RBF, with its 3-layer random neural network features, worked well.

Table 2 Classification error of all methods on 14 datasets.

We also observed that the datasets on which RBF performed best were all large (N > 1000 with limited features; Tables 1 and 2). This is due to the nature of trees, which inherently require larger samples than regressions do.

Results from GWAS dataset study

Figure 2 and Supplemental Materials 2 show that the ideal number of biomarkers for predicting psoriasis was 50 with the efficient LR classifier. When the number of biomarkers was less than 20, the AUC of the LR classifier on the independent ADO (testing) dataset was unstable. As the number of biomarkers approached 50, performance improved and stabilized; the best AUC for LR was 0.7063. Performance did not improve significantly as the number of biomarkers increased beyond 50.

As seen in Table 3, all benchmarked methods were used to construct diagnosis models for psoriasis prediction based on the optimal number of SNPs. No significant imbalances were found in the training and testing datasets, supporting the credibility and stability of the prediction models. The mean AUC over 10-fold cross-validation⁴⁸ in the training dataset and the AUC on the independent testing dataset were used to evaluate the performance of all methods. The AUC of the methods ranged from 0.6192 to 0.6739 in the training dataset and from 0.6563 to 0.7239 in the testing dataset. We found that RBF, GBM, SVM and RF were the four top performing methods in both the training dataset and the testing dataset. RBF was the top performer in both the training dataset (AUC = 0.6739, 95% CI: [0.5254, 0.8275], sensitivity = 0.6317, specificity = 0.6490, accuracy = 0.6390) and the testing dataset (AUC = 0.7239, 95% CI: [0.6930, 0.7548], sensitivity = 0.6543, specificity = 0.7151, accuracy = 0.6920). The ROC curves for each method are shown in Figs 3 and 4 for visual comparison.

Furthermore, RBF appeared to be robust in sensitivity and specificity in both the training and testing datasets. Although RBF’s sensitivity and specificity were not the best on every dataset, its AUC was the highest in both the GRU (training) and ADO (testing) datasets. This property also matters for unbalanced datasets, where prediction performance may be easily influenced by the disease population ratio. In Table 3, although KNN has the second-highest accuracy (accuracy = 0.6884) in the testing dataset, its AUC (AUC = 0.7021) is comparatively poor because it favors specificity (specificity = 0.7279) over sensitivity (sensitivity = 0.6241).

Discussion

Random forests are among the top performing algorithms for machine learning, as they are accurate, fast, flexible and mature. Random forest⁶ is a substantial modification of bagging which builds a large number of de-correlated trees and then averages them. The main idea of random forests is to improve the variance reduction of bagging by reducing the correlation between trees without increasing the variance heavily⁴⁹. This is achieved in the tree-growing process by randomly selecting the input variables. Random Bits Forest therefore focuses mainly on automated feature engineering for random forests. We also obtain good results if we feed Random Bits to a regularized linear regression, though in big data cases these are no better than the results from random forests. The statistical inference⁵⁰ of random forests applies equally to RBF.
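The variance argument above can be made concrete with a standard result for averages of correlated variables (see ref. 49). As a sketch, assume M identically distributed trees T_1, …, T_M, each with variance σ² and positive pairwise correlation ρ; these symbols are introduced here for illustration and do not appear in the original text:

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{m=1}^{M} T_m(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{M}\,\sigma^{2}
```

As M grows, the second term vanishes while the first remains, so lowering ρ by randomly selecting input variables is what lowers the variance of the averaged forest, provided σ² does not grow too much in exchange.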

RBF outperforms the random forest algorithm by addressing two of its limitations: the restriction to axis-parallel splitting, which may lead to suboptimal trees¹⁷, and a decision tree depth of two, which can fail on datasets with greater depth¹⁸. To overcome the first limitation, we used random projections. Because many (~10,000) random projections are generated in advance, the tree is allowed to grow with more freedom. To overcome the second limitation, we improved naïve random projections with a 3-layer random neural network. We defined a random neural network based on the original features and took its output as a derived feature/basis. Such additional depth may be crucial for specific datasets (UCI dataset: Hill valley with noise, shown in Table 2).

Compared to oblique random forests, RBF generates non-axis-parallel features before the random forest is grown, whereas oblique random forests generate oblique splits within the tree-growing process. One crucial improvement to our random projections was to use 3-layer random neural networks as the random projection/basis, giving the random forest more depth. Additional layers did not improve accuracy on the benchmarked datasets, potentially because 3-layer neural networks are already universal approximators.

In order to make full use of our ~10,000 bits budget, we need a feature selection procedure rather than naïve random projections. Feature selection was achieved by employing the gradient boosting framework. Instead of directly using the boosting predictions, we collected the boosted basis and fed them into the random forest. First, we found the random bit that best explained the residual and subtracted its effect from the residual to avoid highly correlated random bits. For the Hill valley with noise dataset, this method for feature selection reduced error from 11% to 2.5%, compared with naïve random projections.

In the boosting procedure, we used multiple independent boost chains, originally just for ease of parallel computing. However, multiple chains also reduced the local optimum problem and led to better prediction. For small datasets, 256 boost chains were used.

Large samples (N > 1000) are important for the success of RBF, since trees are more flexible models than linear models and as a result require a larger sample size. For smaller samples, regularization is useful, which was achieved by limiting the bootstrapped sample size. The consequence is that each tree is suboptimal and biased, but the trees are further decorrelated, thus reducing variance. Reducing the feature bootstrap also helped to regularize the problem.

In summary, we present Random Bits Forest (RBF), an original classification and regression algorithm that integrates the advantages of neural networks (for learning depth), boosting (for learning width) and random forests (for prediction accuracy). We attribute RBF’s better performance than the other methods on the benchmarked datasets to this combination.

In conclusion, RBF is a novel robust method for machine learning, which is especially effective in datasets with large sample sizes (N > 1000). Our work indicates that RBF performs better if fed with extracted/selected features by using appropriate feature selection methods.

Additional Information

How to cite this article: Wang, Y. et al. Random Bits Forest: a Strong Classifier/Regressor for Big Data. Sci. Rep. 6, 30086; doi: 10.1038/srep30086 (2016).

References

1. Altman, N. S. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. The American Statistician 46, 175, doi: 10.2307/2685209 (1992).
2. Cortes, C. & Vapnik, V. Support-vector networks. Machine learning 20, 273–297 (1995).
3. Ripley, B. D. Pattern recognition and neural networks. (Cambridge university press, 1996).
4. Huang, G.-B., Zhu, Q.-Y. & Siew, C.-K. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on. 985–990 (IEEE).
5. Bengio, Y. Learning deep architectures for AI. Foundations and Trends® in Machine Learning 2, 1–127 (2009).
6. Breiman, L. Random forests. Machine learning 45, 5–32 (2001).
7. Liaw, A. & Wiener, M. Classification and regression by randomForest. R news 2, 18–22 (2002).
8. Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55, 119–139 (1997).
9. Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189–1232 (2001).
10. Phyu, T. N. In Proceedings of the International MultiConference of Engineers and Computer Scientists. 18–20.
11. Burges, C. J. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery 2, 121–167 (1998).
12. Platt, J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10, 61–74 (1999).
13. Bengio, Y., Courville, A. & Vincent, P. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on 35, 1798–1828 (2013).
14. Tu, J. V. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. Journal of clinical epidemiology 49, 1225–1231 (1996).
15. Wang, Y., Li, Y., Xiong, M. & Jin, L. Random Bits Regression: a Strong General Predictor for Big Data. arXiv preprint arXiv:1501.02990 (2015).
16. Natekin, A. & Knoll, A. Gradient boosting machines, a tutorial. Front Neurorobot 7, 21, doi: 10.3389/fnbot.2013.00021 (2013).
17. Menze, B. H., Kelm, B. M., Splitthoff, D. N., Koethe, U. & Hamprecht, F. A. In Machine Learning and Knowledge Discovery in Databases 453–469 (Springer, 2011).
18. Bengio, Y., Delalleau, O. & Simard, C. Decision trees do not generalize to new variations. Computational Intelligence 26, 449–467 (2010).
19. Bache, K. & Lichman, M. UCI machine learning repository (2013).
20. Kaul, M., Yang, B. & Jensen, C. S. In Mobile Data Management (MDM), 2013 IEEE 14th International Conference on. 137–146 (IEEE).
21. Fanaee-T, H. & Gama, J. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 1–15, doi: 10.1007/s13748-013-0040-3 (2013).
22. Kibler, D., Aha, D. W. & Albert, M. K. Instance‐based prediction of real‐valued attributes. Computational Intelligence 5, 51–57 (1989).
23. Yeh, I.-C. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete research 28, 1797–1808 (1998).
24. Cortez, P. & Morais, A. In Proc. EPIA (eds Neves, J., Santos, M. F. & Machado, J.), 512–523 (2007).
25. Belsley, D. A., Kuh, E. & Welsch, R. E. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. (2005).
26. Akbilgic, O., Bozdogan, H. & Balaban, M. E. A novel Hybrid RBF Neural Networks model as a forecaster. Statistics and Computing 24, 365–375, doi: 10.1007/s11222-013-9375-7 (2013).
27. Tsanas, A., Little, M. A., McSharry, P. E. & Ramig, L. O. Accurate telemonitoring of Parkinson’s disease progression by noninvasive speech tests. IEEE transactions on bio-medical engineering 57, 884–893, doi: 10.1109/TBME.2009.2036000 (2010).
28. Cortez, P., Cerdeira, A., Almeida, F., Matos, T. & Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47, 547–553 (2009).
29. Gerritsma, J., Onnink, R. & Versluis, A. Geometry, resistance and stability of the delft systematic yacht hull series. (Delft University of Technology, 1981).
30. Bertin-Mahieux, T., Ellis, D. P., Whitman, B. & Lamere, P. In ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24–28, Miami, Florida. 591–596 (University of Miami) (2011).
31. Yeh, I. C., Yang, K.-J. & Ting, T.-M. Knowledge discovery on RFM model using Bernoulli sequence. Expert Systems with Applications 36, 5866–5871, doi: 10.1016/j.eswa.2008.07.018 (2009).
32. Street, W. N., Wolberg, W. H. & Mangasarian, O. L. In International Symposium on Electronic Imaging: Science and Technology. 861–870.
33. Lucas, D. et al. Failure analysis of parameter-induced simulation crashes in climate models. Geoscientific Model Development 6, 1157–1171 (2013).
34. Gorman, R. P. & Sejnowski, T. J. Analysis of hidden units in a layered network trained to classify sonar targets. Neural networks 1, 75–89 (1988).
35. Gil, D., Girela, J. L., De Juan, J., Gomez-Torres, M. J. & Johnsson, M. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications 39, 12564–12573 (2012).
36. Haberman, S. J. In Proceedings of the 9th International Biometrics Conference. 104–122.
37. Hall, M. et al. The WEKA data mining software: an update. ACM SIGKDD explorations newsletter 11, 10–18 (2009).
38. Ramana, B. V., Babu, M. S. P. & Venkateswarlu, N. A critical comparative study of liver patients from usa and india: An exploratory analysis. International Journal of Computer Science Issues 9, 506–516 (2012).
39. Sigillito, V. G., Wing, S. P., Hutton, L. V. & Baker, K. B. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest 10, 262–266 (1989).
40. Bock, R. et al. Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 516, 511–528 (2004).
41. Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R. & Consonni, V. Quantitative structure–activity relationship models for ready biodegradability of chemicals. Journal of chemical information and modeling 53, 867–878 (2013).
42. Mattern, W. D., Sommers, S. C. & Kassirer, J. P. Oliguric acute renal failure in malignant hypertension. Am J Med. 52, 187–197 (1972).
43. Fang, S., Fang, X. & Xiong, M. Psoriasis prediction from genome-wide SNP profiles. BMC Dermatol 11, 1, doi: 10.1186/1471-5945-11-1 (2011).
44. Nair, R. P. et al. Sequence and haplotype analysis supports HLA-C as the psoriasis susceptibility 1 gene. The American Journal of Human Genetics 78, 827–851 (2006).
45. Clarke, G. M. et al. Basic statistical analysis in genetic case-control studies. Nat Protoc 6, 121–133, doi: 10.1038/nprot.2010.182 (2011).
46. Fawcett, T. An introduction to ROC analysis. Pattern recognition letters 27, 861–874 (2006).
47. DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).
48. Kohavi, R. In Ijcai. 1137–1145.
49. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Second edn. (2009).
50. Boulesteix, A. L., Janitza, S., Kruppa, J. & König, I. R. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2, 493–507 (2012).

Acknowledgements

The computations involved in this study were supported by the Fudan University High-End Computing Center. The views expressed in this presentation do not necessarily represent the views of the NIMH, NIH, HHS or the United States Government.

Author information

Wang Yi and Li Yi contributed equally to this work.

Authors and Affiliations

Ministry of Education Key Laboratory of Contemporary Anthropology, Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, 200433, China
Yi Wang, Yi Li, Weilin Pu & Li Jin
Division of Intramural Division Programs, Unit on Statistical Genomics, National Institute of Mental Health, National Institutes of Health, Bethesda, MD, USA
Kathryn Wen & Yin Yao Shugart
Human Genetics Center, School of Public Health, University of Texas Houston Health Sciences Center, Houston, Texas, USA
Momiao Xiong

Contributions

Y.W., Y.L. and L.J. conceived the idea, proposed the RBF method and contributed to writing of the paper. Y.W., Y.L., Y.Y.S. and L.J. contributed the theoretical analysis. Y.W. also contributed to the development of RBF software using C++. Y.L. helped maintain RBF software and used R to generate tables and figures for all simulated and real datasets. W.P. and Y.L. used the R package ‘ggplot2’ to plot figures. MMX helped support the psoriasis GWAS dataset and revise the paper. Y.Y.S. and K.W. contributed to scientific discussion and manuscript writing. L.J. contributed to final revision of the paper.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Electronic supplementary material

Supplemental Materials 1

Supplemental Materials 2

Supplemental Materials 3

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About this article

Received: 02 March 2016
Accepted: 28 June 2016
