How is industrial, product-oriented machine learning software development different from academic machine learning research or industrial research prototype development?

  • I'm interested in differences like "we use simpler algorithms," "we have to highly optimize for running time," "we more carefully architect the code to adhere to overall design guidelines," "we spend more time debugging," "we develop new algorithms much less often," "we take direction from management rather than exploring our own ideas," "we spend most of our time doing things other than machine learning development," etc. Basically, I want to know how academia or prototype-driven industrial research differs from a machine learning shop at a place like MS Bing, Google, Apple, Netflix, etc.

  • Answer:

    A fantastic answer has already been provided. Let me add some more major differences.

    In industrial ML development, usability (UX) is a very important factor; in academic research, much less so. This mirrors the general comparison between industrial and academic software.

    Industrial ML development tends to focus on implementing existing state-of-the-art algorithms and dealing with software design issues. Developers typically have no time budget for investigating a new ML algorithm and are discouraged from doing so. Academic researchers, on the other hand, need to work on new ideas so that they can publish.

    Academic research is driven by competitions or challenges with fixed, typically clean datasets and fixed tasks. Ultimately, research needs some metric to be deemed successful. In industry, data almost always evolves and typically needs cleaning. Even the data schema can change frequently. So a lot of industrial ML development effort goes not into the core ML algorithms but into data sourcing, formatting, cleaning, feature engineering, etc. The objectives may evolve as well. For example, with the same collected data, your objective can change from simple exclusive (single-label) classification to multilabel or hierarchical classification.

    ML research is (almost) all about accuracy. Most academic papers mention only accuracy and no computational cost. This is partly driven by the competition culture mentioned earlier. Even if a paper mentions an impressive train/evaluate time, what matters more is whether it is the top performer on a popular benchmark (e.g. MNIST or ImageNet). Note that this is not typical in other research areas. In Numerical Analysis (my former background), for example, accuracy and computational cost are always reported together. In industrial ML work, computational performance is absolutely critical.

    On (ML) extendability: industrial ML development can be classified into two categories: generic ML software for people to use (e.g. RapidMiner, Mahout, etc.) and software that has to perform really well on specific high-impact tasks (e.g. Web Search, Image Search). The former needs to be extendable (i.e. able to solve multiple tasks, with multiple data formats, from multiple data sources, etc.), much more so than academic software.

    There are actually a lot more differences, which I'd be glad to discuss if anyone is interested.
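To illustrate the point about reporting accuracy and computational cost together, here is a minimal sketch in Python. Everything in it is hypothetical and chosen for illustration only: the `Report` and `evaluate` names, the toy `train_threshold` classifier, and the tiny dataset are assumptions, not something taken from the answer.

```python
import time
from dataclasses import dataclass


@dataclass
class Report:
    """Accuracy and wall-clock cost, kept side by side (hypothetical structure)."""
    accuracy: float
    train_seconds: float
    predict_seconds: float


def evaluate(train_fn, train_x, train_y, test_x, test_y):
    """Train, predict, and time both phases so cost is reported with accuracy."""
    start = time.perf_counter()
    predict = train_fn(train_x, train_y)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = [predict(x) for x in test_x]
    predict_seconds = time.perf_counter() - start

    correct = sum(p == y for p, y in zip(predictions, test_y))
    return Report(correct / len(test_y), train_seconds, predict_seconds)


def train_threshold(xs, ys):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    threshold = (mean0 + mean1) / 2
    return lambda x: int(x > threshold)


# Made-up data, only to exercise the functions above.
report = evaluate(
    train_threshold,
    [0.0, 1.0, 4.0, 5.0], [0, 0, 1, 1],  # training inputs and labels
    [0.5, 4.5], [0, 1],                  # test inputs and labels
)
print(report)
```

The timings depend on the machine, so only the shape of the report matters here: a model comparison that carries all three fields makes it harder to pick an algorithm on accuracy alone.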

Kenneth Tran at Quora

Other answers

The things that are different and harder in real-world applications are integration and scale. The things that are less of a problem are accuracy and generality.

Academic research doesn't really address how data is stored, ingested, and output. This is usually the hard part. Where academia produces code, it's typically in R, MATLAB, or a rough draft in a systems language. These are not suitable for direct use in almost any modern production environment.

Scale is difficult too. Academic papers tend to focus on accuracy rather than performance, and where they do focus on performance, it's on the throughput of the whole model-building phase. In the real world, you are more often concerned with latency: how fast can I update with new data and get an answer to one new question?

What is less hard in the real world is accuracy. It's typically true that a "good" learning setup gets you most of the value that the optimal one would. Netflix would still have been a fine company without all the research that went into making their predictions 10% better, for example. So you need not use the latest, greatest, most complex algorithm. Fast, robust, and extensible is better.

If you are talking about developing one machine learning system for one product, rather than making a product out of machine learning, then I think generality is an easier problem in the real world. You don't have to make a system that solves a whole class of problems, and you can make assumptions and inject domain knowledge specific to your problem. (That's not true if you are trying to make a machine learning product. Then you still have the problem of making one solution for many problems, which is hard.)
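The throughput-versus-latency distinction can be made concrete with a small measurement sketch in Python. The `measure` helper and the stand-in `predict` function are hypothetical, and the choice of median and 99th-percentile latency is an assumption about what a serving team might track, not something stated above.

```python
import statistics
import time


def measure(predict, requests):
    """Time each call separately (latency) and the whole run (throughput)."""
    latencies = []
    run_start = time.perf_counter()
    for request in requests:
        call_start = time.perf_counter()
        predict(request)
        latencies.append(time.perf_counter() - call_start)
    total = time.perf_counter() - run_start

    latencies.sort()
    # Index of the 99th-percentile sample, clamped to the last element.
    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "throughput_per_s": len(requests) / max(total, 1e-12),
        "median_latency_s": statistics.median(latencies),
        "p99_latency_s": latencies[p99_index],
    }


def predict(x):
    """Stand-in for a real model call; any callable would do."""
    return 2 * x


stats = measure(predict, list(range(1000)))
print(stats)
```

A batch job can score well on throughput while individual calls are still slow, which is why the per-call numbers are collected separately from the total.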

Sean Owen

One way to add to the excellent answers already submitted is to focus on the different motivations of the two environments. In academia, research focuses on advancing the science, and individuals are motivated by recognition of their personal contribution. In any but the most toxic industrial settings, development focuses on technology that satisfies a customer's requirements, and individuals are rewarded for the contribution to the team. Most customers don't want the bleeding edge of science; they want reliable and efficient satisfaction of their own business requirements. These differences may lead to some frustration as newly minted grads find themselves optimizing someone else's work rather than conducting their personal research.

Phil Parkman
