What is the best technique to impute missing values for high-cardinality categorical variables (some variables have over 10,000 unique values in 50,000 examples) using R?
-
I am a PhD student trying to understand data preprocessing. I am working on a dataset from KDD Cup 2009, sponsored by Orange, on fast scoring on a large database to predict customer behavior. I am using the small version (50,000 examples, 230 variables). I have already eliminated the variables with more than 80% missing data, which left me with fewer than 80 variables. Before tackling the missing data, I noticed while inspecting it that I have to deal with high-cardinality categorical variables: some of them have over 13,000 unique values in 50,000 examples. My question is: how should I fill in the missing values for the categorical data, including the high-cardinality variables? A second question: after filling them in with the right method, do I need to normalize and discretize those categorical variables? Thank you. I am working with R, so if you have any suggestions about R packages I would also appreciate that.
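For reference, a minimal base-R sketch of the filtering step described above. The file name, separator, and data frame name are assumptions (the KDD Cup 2009 small training file is typically tab-separated), so adjust to match your actual files.

# Load the Orange small training table; file name and separator are assumptions
orange_small <- read.csv("orange_small_train.data", sep = "\t",
                         na.strings = c("", "NA"))

# Fraction of missing values in each variable
miss_frac <- colMeans(is.na(orange_small))

# Keep only variables with at most 80% missing values
kept <- orange_small[, miss_frac <= 0.80]

# Inspect the cardinality of the remaining categorical variables
cat_vars <- names(kept)[sapply(kept, function(x) is.character(x) || is.factor(x))]
sapply(kept[cat_vars], function(x) length(unique(x[!is.na(x)])))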
-
Answer:
Some quick and general responses here; no specific answers, because I have never worked with this dataset or this kind of problem before. Normally I keep variables with less than 20% missing values and do the subsequent analysis without imputation. Regarding question 1: there are many ways to impute missing values, but the key question is whether you really need to impute them at all. If yes, then ask which imputation method makes sense for your variables. You can search for the keywords "R impute categorical variable" to start with. For the second question, first ask yourself whether normalizing and discretizing these variables makes sense, and how you would interpret the results.
Tian Dechao at Quora
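As one concrete example of the "many ways to impute" mentioned above, here is a simple single-imputation sketch in base R that fills missing entries of a categorical variable with its most frequent level (mode imputation). The data frame "kept" and column name "Var192" are hypothetical, carried over from the sketch above; this is just one option, not a recommendation for this dataset.

# Replace NA values in a categorical variable with its most frequent level
impute_mode <- function(x) {
  tab <- table(x, useNA = "no")
  mode_level <- names(tab)[which.max(tab)]
  x[is.na(x)] <- mode_level
  x
}

# "Var192" is a hypothetical categorical column in the filtered data
kept$Var192 <- impute_mode(as.character(kept$Var192))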
Other answers
A common way to impute categorical variables is to simply label the missing values as "unknown". If you want to actually guess the correct value, there are many methods (e.g. KNN) that can impute based on models, but I don't think they will work well when you have this many unique values. Also, if a large portion of the data is missing, then multiple imputation is usually a safer solution than single imputation. For your specific situation, 13,000 unique values suggests the variable is probably not useful, because it is hard to detect an underlying pattern when there are only ~4 examples per unique value. However, if it turns out that 12,990 of the values have only one example each while each of the remaining 10 values has 3,701 examples, then it could be useful; in that case, you would simply recode the 12,990 rare values as "other". My recommendation is to first take the variables that don't require extensive cleaning and use them to build a baseline model, then start adding the more difficult variables to see if they improve on that baseline (a sketch of the recoding follows below).
Li Yang
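A minimal base-R sketch of the two steps described in the answer above: treat missing values as their own "unknown" level, and collapse rare levels into "other". The data frame "kept", the column name "Var220", and the count threshold of 30 are illustrative assumptions.

# Work on a character copy of one hypothetical high-cardinality column
x <- as.character(kept$Var220)

# 1. Label missing values as their own "unknown" category
x[is.na(x)] <- "unknown"

# 2. Collapse levels that occur fewer than 30 times into "other"
counts <- table(x)
rare <- names(counts)[counts < 30]
x[x %in% rare] <- "other"

kept$Var220 <- factor(x)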