A parallelized implementation of the random forest learning algorithm.
- Based on Weka's implementation of Breiman's random forest construction.
- Supports continuous features, which can be reused for splits at different depths of a tree.
- Supports information gain or Gini impurity as the split criterion.
- 2x speedup over Weka's random forests on high-dimensional datasets.
- Scalable speedup through OpenMP and Open MPI parallelization.
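The two split criteria listed above can be sketched as follows. This is a minimal illustration, not the code used in this repository: the names `ClassCounts`, `giniImpurity`, `entropy`, `splitScore`, and `SplitCriterion` are assumptions introduced here, and the score is the usual impurity decrease (parent impurity minus the size-weighted impurity of the two children).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-class instance counts for one node, e.g. {positives, negatives}.
// (Hypothetical type, introduced only for this sketch.)
using ClassCounts = std::vector<std::size_t>;

std::size_t total(const ClassCounts& counts) {
    std::size_t n = 0;
    for (std::size_t c : counts) n += c;
    return n;
}

// Gini impurity: 1 - sum_k p_k^2. Zero for a pure node.
double giniImpurity(const ClassCounts& counts) {
    const std::size_t n = total(counts);
    if (n == 0) return 0.0;
    double sumSquares = 0.0;
    for (std::size_t c : counts) {
        const double p = static_cast<double>(c) / static_cast<double>(n);
        sumSquares += p * p;
    }
    return 1.0 - sumSquares;
}

// Shannon entropy in bits: -sum_k p_k log2 p_k. Zero for a pure node.
double entropy(const ClassCounts& counts) {
    const std::size_t n = total(counts);
    if (n == 0) return 0.0;
    double h = 0.0;
    for (std::size_t c : counts) {
        if (c == 0) continue;  // 0 * log(0) is treated as 0
        const double p = static_cast<double>(c) / static_cast<double>(n);
        h -= p * std::log2(p);
    }
    return h;
}

enum class SplitCriterion { InfoGain, Gini };

// Impurity decrease obtained by splitting a node into `left` and `right`.
// Larger is better; with InfoGain this is the information gain.
double splitScore(const ClassCounts& left, const ClassCounts& right,
                  SplitCriterion criterion) {
    assert(left.size() == right.size());
    ClassCounts parent(left.size());
    for (std::size_t k = 0; k < left.size(); ++k) parent[k] = left[k] + right[k];

    const std::size_t n = total(parent);
    if (n == 0) return 0.0;

    double (*impurity)(const ClassCounts&) =
        (criterion == SplitCriterion::Gini) ? &giniImpurity : &entropy;
    const double wl = static_cast<double>(total(left)) / static_cast<double>(n);
    const double wr = static_cast<double>(total(right)) / static_cast<double>(n);
    return impurity(parent) - wl * impurity(left) - wr * impurity(right);
}
```

For a continuous feature, one would typically sort the node's instances by feature value, evaluate `splitScore` at each candidate threshold, and keep the threshold with the highest score. A perfect split of a balanced two-class node scores 1.0 under `InfoGain` and 0.5 under `Gini`.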
In 'Classifier.h', change the following variables:
    NUM_TREES                // Number of trees to construct
    RANDOM_FEATURE_SET_SIZE  // Number of random features to consider when searching for the best split candidate
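To illustrate what these two settings control, here is a hedged sketch. The constant values, the `constexpr` declarations, and the helpers `sampleFeatureSubset` and `makeTreeGenerators` are assumptions made for this example; the actual declarations in 'Classifier.h' may look different.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical values; edit the real ones in Classifier.h.
constexpr std::size_t NUM_TREES = 100;
constexpr std::size_t RANDOM_FEATURE_SET_SIZE = 10;

// Draw up to `subsetSize` distinct feature indices from [0, numFeatures).
// Only these features are examined when searching for the best split.
std::vector<std::size_t> sampleFeatureSubset(std::size_t numFeatures,
                                             std::size_t subsetSize,
                                             std::mt19937& rng) {
    assert(subsetSize > 0);
    std::vector<std::size_t> indices(numFeatures);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::shuffle(indices.begin(), indices.end(), rng);
    indices.resize(std::min(subsetSize, numFeatures));
    return indices;
}

// One independently seeded generator per tree, so that trees built in
// parallel do not share random state.
std::vector<std::mt19937> makeTreeGenerators(unsigned baseSeed) {
    std::vector<std::mt19937> generators;
    generators.reserve(NUM_TREES);
    for (std::size_t t = 0; t < NUM_TREES; ++t) {
        generators.emplace_back(baseSeed + static_cast<unsigned>(t));
    }
    return generators;
}
```

A larger `RANDOM_FEATURE_SET_SIZE` makes each split search more thorough but slower and makes the trees more similar to one another; a larger `NUM_TREES` increases training time roughly linearly.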
In 'TreeBuilder.h', change the following variables:
    MIN_NODE_SIZE            // Minimum size of a node that can be considered as a leaf
    MIN_NODE_SIZE_TO_SPLIT   // Minimum size of a node that can be further split
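One plausible reading of these two thresholds is sketched below. The values and the helper functions `canAttemptSplit`, `isValidSplit`, and `shouldBecomeLeaf` are assumptions for illustration only; check 'TreeBuilder.h' for how the thresholds are actually applied.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical values; edit the real ones in TreeBuilder.h.
constexpr std::size_t MIN_NODE_SIZE = 5;
constexpr std::size_t MIN_NODE_SIZE_TO_SPLIT = 10;

// A node is only considered for splitting if it holds enough instances.
bool canAttemptSplit(std::size_t nodeSize) {
    return nodeSize >= MIN_NODE_SIZE_TO_SPLIT;
}

// A candidate split is only accepted if both children are large enough
// to stand as leaves.
bool isValidSplit(std::size_t leftSize, std::size_t rightSize) {
    return leftSize >= MIN_NODE_SIZE && rightSize >= MIN_NODE_SIZE;
}

// Stopping rule: the node becomes a leaf if it is too small to split or
// if its best candidate split would create an undersized child.
bool shouldBecomeLeaf(std::size_t nodeSize,
                      std::size_t bestLeftSize,
                      std::size_t bestRightSize) {
    assert(bestLeftSize + bestRightSize == nodeSize);
    return !canAttemptSplit(nodeSize) || !isValidSplit(bestLeftSize, bestRightSize);
}
```

Under this reading, raising either threshold yields shallower trees and faster training, at the cost of coarser leaves.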
- Sentiment analysis of 50000 movie reviews from IMDb (25000 for training, 25000 for testing).
- Used the 10/50/200/1000 most frequent words as features and achieved the same accuracies.
- Test environment: Ubuntu GNOME 16.04; VLSCI clusters (for distributed execution).
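The feature extraction described above (keeping only the k most frequent words) can be sketched as follows. This is an assumed reconstruction, not the preprocessing code used for the experiments: the functions `topKWords` and `toFeatureVector`, the whitespace tokenization, and the use of raw counts as feature values are all choices made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, std::size_t>;

// Return the k most frequent whitespace-separated words across all reviews.
// Ties are broken alphabetically so the result is deterministic.
std::vector<std::string> topKWords(const std::vector<std::string>& reviews,
                                   std::size_t k) {
    assert(k > 0);
    std::unordered_map<std::string, std::size_t> freq;
    for (const std::string& review : reviews) {
        std::istringstream tokens(review);
        std::string word;
        while (tokens >> word) ++freq[word];
    }

    std::vector<WordCount> ranked(freq.begin(), freq.end());
    k = std::min(k, ranked.size());
    std::partial_sort(ranked.begin(),
                      ranked.begin() + static_cast<std::ptrdiff_t>(k),
                      ranked.end(),
                      [](const WordCount& a, const WordCount& b) -> bool {
                          if (a.second != b.second) return a.second > b.second;
                          return a.first < b.first;
                      });

    std::vector<std::string> vocabulary;
    vocabulary.reserve(k);
    for (std::size_t i = 0; i < k; ++i) vocabulary.push_back(ranked[i].first);
    return vocabulary;
}

// Map one review to a continuous feature vector: entry i is the number of
// times vocabulary[i] occurs in the review.
std::vector<double> toFeatureVector(const std::string& review,
                                    const std::vector<std::string>& vocabulary) {
    std::unordered_map<std::string, std::size_t> position;
    for (std::size_t i = 0; i < vocabulary.size(); ++i) position[vocabulary[i]] = i;

    std::vector<double> features(vocabulary.size(), 0.0);
    std::istringstream tokens(review);
    std::string word;
    while (tokens >> word) {
        auto it = position.find(word);
        if (it != position.end()) features[it->second] += 1.0;
    }
    return features;
}
```

With k set to 10, 50, 200, or 1000, each review becomes a vector of k continuous features that a random forest can split on.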
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011).
Learning Word Vectors for Sentiment Analysis.
The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).