Are there articles that describe practical methods to select a machine learning algorithm?
-
I have a task related to text patterns. Selecting a supervised machine learning algorithm for a given task is not trivial. There are many software packages that offer different machine learning algorithms (e.g., scikit-learn). One approach is to try different algorithms (starting with linear models) and to select the simplest algorithm that provides the best results (best accuracy on the test set). I am looking for articles that discuss this heuristic (practical) method of selecting a machine learning model for a task.
-
Answer:
Since your question is under the "Classification" topic, I will assume a supervised learning task; however, most of what I write below applies to other ML problems. While it might be tempting to focus on choosing some state-of-the-art machine learning technique that yields the highest accuracy on the data you have right now, several other factors are much more important for a "real world" machine learning system.
1. Training data. This is arguably the most important ingredient of any machine learning system. Make sure the training data is as clean as you can get it. In practice, you are more likely to gain a meaningful boost in classifier accuracy by using better data than by using so-so data with a theoretically better algorithm. Also, "there is no data like more data": do you have all the data you could possibly get? Finally, try to "learn on the boundary": give your model examples that lie on the boundaries between classes, instead of only using "easy" examples.
2. Features. Think about what kind of features you can extract from your data. Some features will be harder to compute than others, so you need to decide which features your system can and cannot use. Generally, I would advise using as many meaningful features as possible. Experiment with different feature sets, but be aware of overfitting, which is less likely to be an issue if you have enough data.
3. Scalability and performance. Once you have a lot of data (but you can't have enough ;) ) with many features, you are likely to face scalability issues. Hence, a simpler model would be a better choice.
4. Ease of debugging. For a production system it is extremely important to be able to answer the question "Why was x classified as class C?". This will help you decide whether the problem is the training data, the model, the model parameters, or simply a bug in the code. Something like linear regression is much easier to "debug" than a multi-layer neural network.
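To make the debugging point concrete, here is a minimal sketch of how a linear model can answer "why was x classified as class C?". It assumes a logistic-style linear classifier; the weights, bias, feature names, and the helper explain are all hypothetical and chosen only for illustration. In practice the weights would be read from a trained model.

```python
import math

# Hypothetical weights for a toy spam classifier (assumption, not trained values).
WEIGHTS = {"free": 1.8, "winner": 2.1, "meeting": -1.5, "invoice": -0.4}
BIAS = -0.5


def explain(features):
    """Return P(spam) and the per-feature contributions, largest magnitude first."""
    # Each feature contributes weight * value to the linear score;
    # features without a weight contribute nothing.
    contributions = {
        name: WEIGHTS.get(name, 0.0) * value for name, value in features.items()
    }
    score = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-score))
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, ranked


# Word counts for one hypothetical message.
probability, ranked = explain({"free": 2, "meeting": 1, "hello": 1})
print(f"P(spam) = {probability:.3f}")
for name, contribution in ranked:
    print(f"{name:>8}: {contribution:+.2f}")
```

Because the score is a plain sum, each feature's share of the decision can be read off directly, which is what makes a linear model easier to inspect than a multi-layer network.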
Once you've carefully thought through and experimented with 1 - 4, if you still have a question about which model to choose for classification, here are a few (less important) pieces of advice:
a) for cases where the feature space is huge (>1000s of features), choose something simple like logistic regression, but be sure that you have many more data points than features;
b) try SVMs, especially if you think you might overfit;
c) for cases where the feature space is small (<1000s) and the problem is hard, try boosted decision trees;
d) if you think non-linear functions of your features can yield some information gain, try non-linear models and kernel methods.
To summarize: don't focus on the model too much; focus on the system as a whole, keeping in mind points 1 - 4. You should be able to easily experiment with new data, features and models. Also, check out Apache Mahout (http://mahout.apache.org/); it has many algorithms implemented in a map-reduce setting on top of Hadoop.
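The heuristic from the question (try several models, starting with linear ones, and keep the simplest that performs about as well as the best) can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data is synthetic (make_classification stands in for a real dataset), the three candidates and their ordering by complexity are one possible choice, and TOLERANCE is a hypothetical threshold for treating accuracies as tied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for a real dataset (assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

# Candidates listed from simplest to most complex.
candidates = [
    ("logistic_regression",
     make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("linear_svm", make_pipeline(StandardScaler(), LinearSVC())),
    ("boosted_trees", GradientBoostingClassifier(random_state=0)),
]

# Hypothetical threshold: accuracy gaps smaller than this count as ties.
TOLERANCE = 0.01

# Mean 5-fold cross-validated accuracy per candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates
}
best = max(scores.values())

# Pick the first (simplest) candidate within TOLERANCE of the best score.
chosen = next(name for name, _ in candidates if scores[name] >= best - TOLERANCE)

for name, score in scores.items():
    print(f"{name:>20}: {score:.3f}")
print("chosen:", chosen)
```

Cross-validation is used here instead of a single test-set score so that the test set is not consumed by model selection; the final chosen model would still need to be evaluated once on held-out data.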
Oleksii Kuchaiev at Quora
Other answers
The paper "An Introduction to Variable and Feature Selection" by Guyon and Elisseeff, available at http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf, discusses the process of obtaining a machine learning predictor from given data and provides a list of 10 questions (at the end of the introductory section) to ask along the way. Question 8 in the list addresses the choice of predictor. Though the focus of the questions is on feature and variable selection, the list gives a more complete picture of the learning process.
Abhinav Maurya
Oleksii hit on most of the practical considerations that are important. My two cents is that you should always start with logistic regression (with regularization) or Naïve Bayes.
Nitin Madnani
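For a text task, the suggested starting points (regularized logistic regression or Naive Bayes) might look like the sketch below. It assumes scikit-learn and TF-IDF bag-of-words features; the tiny corpus and its labels are hypothetical placeholders, so the printed accuracies say nothing about real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace with real labeled text.
texts = [
    "win a free prize now", "free money click here", "claim your free reward",
    "cheap pills free offer", "you are a winner claim now", "free offer ends today",
    "meeting moved to monday", "please review the attached report",
    "lunch at noon tomorrow", "project status update attached",
    "can we reschedule the meeting", "notes from the review meeting",
]
labels = ["spam"] * 6 + ["ham"] * 6

baselines = {
    # C is the inverse regularization strength: smaller C means a stronger penalty.
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

# Mean 3-fold cross-validated accuracy for each baseline.
baseline_scores = {
    name: cross_val_score(model, texts, labels, cv=3).mean()
    for name, model in baselines.items()
}
for name, score in baseline_scores.items():
    print(f"{name}: mean accuracy {score:.2f}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted only on each training fold, so the cross-validated score is not inflated by words seen in the held-out fold.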