How to select all articles and their similar articles from MySQL?

Are there articles that describe practical methods to select a machine learning algorithm?

  • I have a task related to text patterns. Selecting a supervised machine learning algorithm for a given task is not trivial. There are many software packages that offer different machine learning algorithms (e.g. scikit-learn). One approach is to try different algorithms (starting with linear models...) and to select the simplest algorithm that provides the best results (best accuracy on the test set). I am looking for articles that discuss this heuristic (practical) method of selecting a machine learning model for a task.

  • Answer:

    Since your question is under the "Classification" topic, I will assume this is a supervised learning task; however, most of what I write below applies to other ML problems as well. While it might be tempting to focus on choosing some state-of-the-art machine learning technique that yields the highest accuracy on the data you have right now, several other factors are much more important for a "real world" machine learning system.

    1. Training data. This is arguably the most important ingredient of any machine learning system. Make sure the training data is as clean as you can get it. In practice, you are more likely to gain a meaningful boost in classifier accuracy by using better data than by using so-so data with a theoretically better algorithm. Also, "there is no data like more data": do you have all the data you could possibly get? Finally, try to "learn on the boundary": give your model examples that lie on the boundaries between classes, instead of only using "easy" examples.

    2. Features. Think about what kinds of features you can extract from your data. Some features will be harder to compute than others, so you need to decide which features your system can and cannot use. Generally, I would advise using as many meaningful features as possible. Experiment with different feature sets, but be aware of overfitting, which is less likely to be an issue if you have enough data.

    3. Scalability and performance. Once you have a lot of data (but you can't have enough ;) ) with many features, you are likely to face scalability issues. Hence, a simpler model would be a better choice.

    4. Ease of debugging. For a production system it is extremely important to be able to answer the question "Why was x classified as class C?". This will help you decide whether the problem is the training data, the model, the model parameters or simply a bug in the code. Something like linear regression is much easier to "debug" than a multi-layer neural network.

    Once you've carefully thought through and experimented with 1 - 4, if you still have a question about which model to choose for classification, here are a few (less important) pieces of advice:

    a) for cases where the feature space is huge (>1000s of features), choose something simple like logistic regression, but be sure that you have way more data points than you have features;
    b) try SVMs, especially if you think you can overfit;
    c) for cases where the feature space is small (<1000s of features) and the problem is hard, try boosted decision trees;
    d) if you think non-linear functions of your features can yield some information gain, then try non-linear models and kernel methods.

    To summarize: don't focus on the model too much; focus on the system as a whole, keeping in mind points 1 - 4. You should be able to easily experiment with new data, features and models. Also, check Apache Mahout - http://mahout.apache.org/ - it has many algorithms implemented in a map-reduce setting on top of Hadoop.
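The heuristic from the question (try several models, starting with linear ones, and keep the simplest that performs about as well as the best) can be sketched with scikit-learn, using the model families from suggestions a) to c) above. This is only a minimal sketch under stated assumptions: the data is synthetic, the simple-to-complex ordering is a judgment call, and `pick_simplest` and its `tolerance` parameter are hypothetical names introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; replace with your own features and labels).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# Candidates listed from simplest to most complex (the ordering is a judgment call).
candidates = [
    ("logistic regression",
     make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("RBF-kernel SVM",
     make_pipeline(StandardScaler(), SVC(kernel="rbf"))),
    ("boosted trees",
     GradientBoostingClassifier(random_state=0)),
]


def pick_simplest(candidates, X, y, tolerance=0.01, cv=5):
    """Hypothetical helper: return the first (simplest) candidate whose mean
    cross-validated accuracy is within `tolerance` of the best mean score."""
    scores = {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in candidates
    }
    best = max(scores.values())
    for name, _ in candidates:
        if scores[name] >= best - tolerance:
            return name, scores


chosen, scores = pick_simplest(candidates, X, y)
for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
print("chosen:", chosen)
```

Cross-validation is used here instead of a single test-set score so that the final test set stays untouched until a model has been chosen; the tolerance of 0.01 is arbitrary and should reflect how much accuracy you are willing to trade for simplicity.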

Oleksii Kuchaiev at Quora


Other answers

The paper "An Introduction to Variable and Feature Selection" by Guyon and Elisseeff, available at http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf, discusses the process of obtaining a machine learning predictor from given data, and provides a list of 10 questions (at the end of the introductory section) to be asked throughout the process. Question 8 in the list discusses the choice of predictor. Though the focus of the questions is on feature and variable selection, the list provides a more complete picture of the learning process.

Abhinav Maurya

Oleksii hit on most of the practical considerations that are important. My two cents is that you should always start with logistic regression (with regularization) or Naïve Bayes.
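For a text task like the one in the question, these two baselines might look as follows in scikit-learn. This is a rough sketch, not a recommended configuration: the tiny corpus and its labels are made up for illustration, TF-IDF is just one assumed choice of features, and `C=1.0` is simply scikit-learn's default regularization strength (smaller `C` means stronger regularization).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only (an assumption, not real data).
texts = [
    "the team won the match in extra time",
    "the striker scored two goals",
    "the coach praised the defence after the game",
    "the new phone has a faster processor",
    "the laptop ships with more memory",
    "the update fixes a bug in the operating system",
]
labels = ["sports", "sports", "sports", "tech", "tech", "tech"]

# Two simple baselines sharing the same TF-IDF features.
baselines = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    # LogisticRegression is L2-regularized by default; C is the inverse
    # of the regularization strength.
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(C=1.0, max_iter=1000)
    ),
}

new_texts = [
    "the goalkeeper saved a penalty",
    "the processor update improves battery life",
]

predictions = {}
for name, model in baselines.items():
    model.fit(texts, labels)
    predictions[name] = list(model.predict(new_texts))
    print(name, predictions[name])
```

With real data, both pipelines would be scored with cross-validation before trying anything more complex; a corpus this small says nothing about which baseline is better.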

Nitin Madnani
