How can I use a decision tree to classify an unbalanced data set?
My data set has two classes with at most 2000 members in one class and an unlimited number of members in the other class (usually hundreds of thousands). I have read that it is problematic to use a decision tree for such data: how can I adapt my model to use a decision tree (at least in part) in classification?
Answer:
Uniformly downsample the large class until its ratio to the small class is around 3:1, then train your decision tree classifier. If you want to recover calibrated probabilities under the original distribution (for classification itself there is no need, since you only need a threshold), you can apply Bayes' rule to the probability output of the decision tree.
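As a rough illustration of the two steps in this answer (a sketch, not the answerer's own code), the snippet below downsamples the large class to a 3:1 ratio and then maps a probability predicted under the downsampled distribution back to the original one with Bayes' rule. The function names, the `ratio` parameter, and `beta` (the fraction of large-class points that were kept) are assumptions made for the example.

```python
import random


def downsample_majority(majority, minority, ratio=3, seed=0):
    """Uniformly sample the large class down to about ratio:1 against the small class.

    Returns the kept majority points and beta, the fraction of the majority kept.
    """
    target = min(len(majority), ratio * len(minority))
    rng = random.Random(seed)
    kept = rng.sample(majority, target)
    beta = target / len(majority)
    return kept, beta


def correct_probability(p_sampled, beta):
    """Map P(minority | x) from the downsampled distribution to the original one.

    Keeping only a fraction beta of the majority class multiplies the
    minority-vs-majority odds by 1 / beta, so the original odds are
    beta * p_sampled / (1 - p_sampled).
    """
    return beta * p_sampled / (beta * p_sampled + (1.0 - p_sampled))


# Hypothetical data: 200,000 majority points and 2,000 minority points.
majority = list(range(200_000))
minority = list(range(2_000))

kept, beta = downsample_majority(majority, minority, ratio=3)
print(len(kept), beta)                 # 6000 majority points kept, beta = 0.03
print(correct_probability(0.5, beta))  # a 0.5 from the tree maps to roughly 0.029
```

The correction is monotonic, so it does not change how predictions are ranked. That is why a threshold alone is enough when only the class label is needed.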
Tao Xu at Quora
Other answers
I've come across this issue many times in the past. I'm not sure what the best or "correct" solution would be, but I've used a form of tree bagging [1]: construct multiple balanced data sets by drawing multiple random samples from the large (effectively unlimited) class, train a decision tree on each, and then ensemble the results.
[1] http://en.wikipedia.org/wiki/Bootstrap_aggregating#Bagging_for_nearest_neighbour_classifiers
This paper might also be helpful: http://sci2s.ugr.es/keel/pdf/algorithm/articulo/2011-IEEE%20TSMC%20partC-%20GalarFdezBarrenecheaBustinceHerrera.pdf
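A minimal sketch of this balanced bagging idea, using only the standard library. The one-feature decision stump is a stand-in for a real decision tree learner, and every name here (`train_stump`, `balanced_bagging`, `predict`) and the synthetic data are assumptions for illustration, not the answerer's code.

```python
import random
from collections import Counter


def train_stump(rows, labels):
    """Fit a depth-1 decision tree (a stump) on one numeric feature.

    Tries every observed value as a threshold, in both orientations, and
    keeps the split with the fewest training errors.
    """
    best = None
    for threshold in sorted(set(rows)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= threshold else right) != y
                for x, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def balanced_bagging(majority, minority, n_models=11, train=train_stump, seed=0):
    """Train one model per balanced sample: all minority rows plus an
    equally sized random draw from the majority class."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        drawn = rng.sample(majority, len(minority))
        rows = drawn + minority
        labels = [0] * len(drawn) + [1] * len(minority)
        models.append(train(rows, labels))
    return models


def predict(models, x):
    """Majority vote over the ensemble (an odd n_models avoids ties)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Synthetic one-feature data (an assumption for the demo): a large class
# centred at 0 and a small class centred at 3.
data_rng = random.Random(1)
majority = [data_rng.gauss(0.0, 1.0) for _ in range(20_000)]
minority = [data_rng.gauss(3.0, 1.0) for _ in range(200)]

models = balanced_bagging(majority, minority)
print(predict(models, -1.0), predict(models, 5.0))
```

Each model sees every minority point but a different slice of the majority class, so the ensemble as a whole uses far more of the majority data than a single downsampled tree would.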
Jason Zhang
If this is a supervised learning problem (you have a labeled set), why not downsample the second class until the two classes are of comparable size, and train the DT on that? The classifier should still generalise to data from the original distribution.
Anonymous
There are several strategies for learning from unbalanced data. This paper on the issue should help you (https://www.semanticscholar.org/paper/An-insight-into-classification-with-imbalanced-L%C3%B3pez-Fern%C3%A1ndez/ca9ef070d2a424b344b814de1196520da2f34ad7/pdf), but I'll try to summarize some strategies below:
- Undersampling: remove samples from the majority class (the class with more samples) using an undersampling algorithm. Examples: One Sided Selection (OSS), Edited Nearest Neighbors (ENN), Tomek Links, Random Undersampling ...
- Oversampling: generate new samples from the minority class (the class with few samples) using an oversampling algorithm. Examples: SMOTE, BorderlineSMOTE, SPIDER, Random Resampling ...
- Cost-Sensitive Learning: change the decision tree building algorithm so that misclassifying minority class samples costs more than misclassifying majority class samples.
- Ensemble Learning: instead of using a single decision tree, use several decision trees. Check out the Bagging algorithm, Random Forests, Extra Trees Classifiers, Iterative-Classifier-Selection-Bagging (ICSBagging) ...
- Combination: combine undersampling, oversampling and ensemble learning strategies. Most state-of-the-art methods for learning from imbalanced data use a combination of different strategies.
Choose the one that is best for you. In addition, there is this Python package for learning from unbalanced data: https://github.com/fmfn/UnbalancedDataset. It gives you easy access to several strategies so you can evaluate which one is best for you. When evaluating, remember to use an adequate metric such as Area Under the ROC Curve (AUC).
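To see why the choice of metric matters, here is a small sketch comparing accuracy with AUC on a hypothetical test set. The `roc_auc` function is a plain pairwise implementation written for this example, not a call into any particular package, and the data is made up.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive is scored above a randomly chosen negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, predictions):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


# Hypothetical test set: 990 majority (0) and 10 minority (1) samples.
labels = [0] * 990 + [1] * 10

# A useless model that gives every sample the same score and predicts class 0.
constant_scores = [0.0] * 1000
constant_preds = [0] * 1000

print(accuracy(labels, constant_preds))   # 0.99
print(roc_auc(labels, constant_scores))   # 0.5
```

The constant model looks excellent by accuracy but is no better than chance by AUC, which is the behaviour you want from a metric on imbalanced data.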
Dayvid Victor
I also recommend the SMOTE package for R. Here is a small tutorial (text and a video) from Manuel Amunategui: http://amunategui.github.io/smote/index.html Good luck!
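For readers who want to see the idea behind SMOTE before reaching for a package, the core step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions (points are tuples of numbers, base points are drawn at random, distances are Euclidean), not the R package's implementation; the function name and parameters are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=10, k=2)
print(len(new_points))  # 10
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.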
Louis-Marc Leblanc
Search Google for "SMOTE: Synthetic Minority Over-sampling Technique".
Jonathan Zouari