Is Java a better choice to implement machine learning algorithms than C++?
-
Hi, I know R, Python, and C++, and most of my work is in the machine learning domain. I use:
- R to quickly understand the problem
- Python for quick prototyping
- C++ if needed
But lately I have been looking at job descriptions from other companies, and there seems to be a lot of buzz around Java. I think it's mainly because Hadoop and related tools are written mostly in Java. So, is it worthwhile to learn Java? Or does C++ versus Java not matter when applying for those jobs, as long as one understands the concepts?
-
Answer:
When applying for a job, knowing a language is never a bad thing. If you know either Java or C++, you can get up to speed on the other pretty fast, but it's better if you already know the language the team uses. Neither language is inherently better; the two are actually very similar. The choice has more to do with the preferences of the people who start the project: what libraries they know best, what development environment they prefer, etc. C++ has a speed advantage over Java much (but not all) of the time. Java is often better for big collaborative projects, since its late linking and garbage collection help shorten debugging cycles. Over time, each language has improved in ways that diminish the other's advantages. So, if Hadoop is an important tool in the machine learning culture, go learn Java. If you know C++ you really won't find it that hard.
Joshua Engel at Quora
Other answers
Your question and your description are actually asking two different questions.

Personally, I prefer writing ML algorithms in functional programming languages rather than object-oriented languages. ML algorithms are just much more natural to write in functional programming languages. However, I quickly ran into performance issues whenever I tried doing it. That may be because of my inexperience in functional programming or because of inherent limitations of the languages I tried.

Professionally, I have been writing ML algorithms in industry for over 5 years now. We have primarily used Java. With ML algorithms, being freed from memory management is a huge bonus. I can tell you this from second-hand experience: one of my colleagues wrote the https://github.com/sudar/Yahoo_LDA code entirely in C++, and my discussions with him have assured me that Java is a safe haven. Also, on the plus side, it was easier to integrate our code written in Java with Hadoop*. Having said that, I doubt it would be easy to replicate in Java the kind of efficiency he was able to squeeze out of the code by using C++ (you can read about it in this paper [1]). So, there is a definite trade-off here. I believe even John Langford [2] has a very similar opinion [3].

From the perspective of jobs, having experience in either C++ or Java is fine. Many of my colleagues had prior experience in C++, and they were quite comfortable transitioning to Java and working in it (though they never liked Java :-) and that was often a topic of discussion over lunch).
[1] http://www.vldb.org/pvldb/vldb2010/pvldb_vol3/R63.pdf
[2] http://hunch.net/~jl/
[3] http://hunch.net/?p=230
* I am not saying it is particularly hard to integrate code written in other languages with Hadoop. One of my colleagues used to integrate code written in R with Hadoop. But these hacks had their own pain points, which one probably doesn't want to go through, especially when trying to run the code over large datasets with deadlines to meet.
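For readers wondering what integrating non-Java code with Hadoop can look like: one common route is Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout. The answer above does not say which mechanism was actually used, so the following is only a minimal sketch of that general idea. The word-count task, the function names, and the "map"/"reduce" command-line convention are all made up for illustration.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'token<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for token in line.split():
            yield f"{token}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key. Input must already be sorted by key,
    which is what the shuffle/sort phase between map and reduce provides."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Hypothetical convention: run the script with "map" or "reduce" as its argument.
STEPS = {"map": map_lines, "reduce": reduce_lines}

if __name__ == "__main__" and sys.argv[1:2] and sys.argv[1] in STEPS:
    for record in STEPS[sys.argv[1]](sys.stdin):
        print(record)
```

Because both steps are plain functions over lines of text, they can be checked locally by sorting the mapper output and feeding it to the reducer, long before a cluster is involved. The pain points mentioned in the footnote tend to show up later, in packaging, dependencies, and debugging on the cluster.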
Arun Iyer
The answer to your question is that Java is certainly the norm for implementing ML algorithms these days. For every reasonably widely used ML library that you find implemented in C++, I can probably point to three written in Java.

As a point of clarification, I should mention that, in the community, the term "ML algorithm" refers specifically to the core mathematical relationship that governs how something is learned: for example, stochastic gradient descent is a learning algorithm which is used to find a special sort of decision boundary, and is especially used in the context of SVMs. What is not an algorithm is the part of your program where you do something other than straight-up ML: data processing, munging, and so on. If you're having trouble telling what is and is not an ML algorithm, remember that the ML algorithm is virtually always handled by a library, and everything else is usually not part of your ML algorithm.

With this in mind, when I say that ML algorithms are usually implemented in Java these days, what I mean, literally, is: ML libraries are usually implemented in Java. If you find yourself implementing actual ML algorithms, there are many advantages to doing so in Java:
- Both industry and the ML community are used to Java at this point, so getting people to adopt it (and maintain it) is less of an issue.
- Java is "marketable" -- it's stable and its quirks are relatively well-known.
- Java code is maintainable and readable, and it's usually very performant.
- Hardening a Java library and making it release-ready is comparatively easy.
... and so on. There are disadvantages too, of course: Java isn't "sexy", for example.

And, now that I've said all this, the other shoe must drop: you will almost certainly not be doing this most of the time you're writing ML-related code. You will rely on heavily optimized libraries that implement the algorithms you're interested in -- and they're pretty much all implemented.
No, instead, what you will be doing a lot of is the scaffolding: processing data, turning knobs, and (most importantly, by far) finding the right features. And if you work in industry, you will probably have to harden this to work at scale.

This last point in particular leads me to believe that what you meant to ask was something along the lines of, "if I work in ML, is it better to know C++ or Java?" The answer to that question is more complicated, because there's no unifying reason to use one language in all the cases where you might call an ML library.

My advice breaks down into one major point: emphasize technologies over languages. Have a healthy, robust knowledge of a diverse and useful set of technologies. That's not to say you shouldn't know languages. If you're looking for a starter list, this is what I recommend:
- Extremely strong knowledge of Java. Read Effective Java, and know it like the back of your hand.
- Extremely strong knowledge of lots of scripting languages (bash, awk, Python, and Perl in particular). Seriously, this is important.
- Extremely strong knowledge of distributed data processing technology -- Pig, Hive, Hadoop, Caffeine, etc.
- Extremely strong knowledge of data protocols, e.g., RPC (Thrift, Protocol Buffers, etc.)
- Some knowledge of backend data stores, e.g., MongoDB or MySQL
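To make the distinction between "the ML algorithm" and everything around it concrete, here is a minimal sketch of the stochastic gradient descent example mentioned in this answer, applied to a linear SVM (hinge loss with an L2 penalty). It is an illustration under stated assumptions, not how any particular library does it: the function names, the learning rate, the regularization strength, and the toy data are all made up, and the data is visited in a fixed order rather than sampled at random so that the run is reproducible.

```python
def train_linear_svm(points, labels, lr=0.1, reg=0.01, epochs=50):
    """SGD-style training on the L2-regularized hinge loss.

    Labels must be +1 or -1. Hyperparameter defaults are illustrative only.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        # Assumption: fixed pass order; textbook SGD would sample randomly.
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty shrinks the weights a little on every step.
            w = [(1.0 - lr * reg) * wi for wi in w]
            if margin < 1.0:
                # Hinge loss is active: move the boundary away from x.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Made-up, linearly separable toy data.
points = [(2.0, 2.0), (3.0, 1.0), (-2.0, -1.0), (-1.0, -3.0)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(points, labels)
print([predict(w, b, x) for x in points])
```

Everything in `train_linear_svm` is what this answer calls the algorithm, and in practice a library would supply it. Choosing the features that go into `points`, cleaning them, and tuning `lr` and `reg` is the scaffolding where most of the working time goes.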
Anonymous
There are 2 use cases to consider if you are actually implementing ML algos, and not just doing some data science:
1. Large-scale, parallel solvers that can be run on a single, shared-memory machine
2. Very large-scale problems that actually require a cluster of machines

For (1), the performance bottleneck is usually cache misses, so a high-performance implementation requires both a cache-aware memory model and, frequently, access to BLAS libraries. Java cannot support either. For this reason, most standalone, high-performance ML algos have been written in C or C++ (LibLinear, Vowpal Wabbit, GraphLab, etc.).

In contrast, for (2), the bottleneck is usually loading the data into memory in the first place (I/O bound) and, in some extreme cases, even being able to store a single feature vector in memory on one node. Here, the GraphLab guys have made great progress, but industry mostly stores its data in Hadoop, so it is necessary to either code in Java or use something like the C bindings to the HDFS libraries.

There is a case (3), where you are actually implementing ML in hardware, or on some special-purpose architecture. If you are asking about learning C or Java, chances are this is not an issue.
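As a rough illustration of what "cache-aware" means in case (1), here is a sketch of loop tiling (blocking) for matrix multiplication: the loops are reordered to work on small blocks so that, in a compiled language, the data being reused stays in cache. This is only a sketch of the access pattern. The function name and tile size are made up, pure Python will not show the speedup, and in practice this kind of kernel is what an optimized BLAS library provides.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply matrices a (n x k) and b (k x m) block by block.

    The result is the same as a plain triple loop; only the order in
    which elements are visited changes.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops step over blocks; inner loops stay inside one block.
    for i0 in range(0, n, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):
                    row_c, row_a = c[i], a[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b))
```

In C or C++ the tile size would be chosen to match the cache sizes of the target machine, which is the kind of low-level control over memory layout this answer has in mind.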
Charles H Martin
It sounds like you are curious about implementing ML in industry, which I don't know much about. For research and academia, I can say, with reasonable confidence, NO. For research, I believe C++, Matlab, and Python are more popular. Matlab is easy to use. Python is like Matlab, but free. C++ is for high-performance.
David Krueger