Why can't R be used to write production grade code? Why is Python not used for prototyping also?
Most data scientists say that they use R for prototyping and Python/Java for production-grade code. What happens when the libraries that were used while prototyping in R are not available in Python? In such situations, if you use the Rpy2 library to run R inside Python, can you still make that code production grade?
Answer:
I would dispute the premise of the question. R can be and is used in production. The question really should focus on what type of production environment and task. Building a web server for a high-traffic website in pure R would be a mistake; so would building a fluid dynamics simulation in R. But you can deploy a production analysis web server in your organization for hundreds of users in Shiny, probably faster than with any web framework out there; and the number of corporate applications targeting fewer than 1000 users is surprisingly high, as is the number of data sets smaller than 50GB (which R can handle just fine). R can also be used for large-scale batch data analysis tasks (especially on Linux). Another advantage of R is its quite painless package management. On the other hand, for large-scale real-time analytical tasks, I would use neither R nor Python, but Clojure (feel free to replace that with Scala, F# or your language of choice).

I think the answer below (by A. Sharma) assumes a bit too much about the nature of the production environment, and repeats a few misconceptions about R. I will try to address some of these point by point, because they are so common.

"[Python] Relatively very fast"

I would point out that the Julia benchmarks are meaningless out of context, as they implement the same algorithm across all languages to showcase the versatility of Julia. But native R code can perform these tasks much, much faster if one uses, say, memoization instead of recursion. Besides that, the correct way to see R is not unlike Python: a very flexible language able to orchestrate tasks, which are usually executed using fast, optimized functions written in Fortran, C and C++, so that the performance differential is usually negligible.

"Beautifully written code"

It is easier to write ugly code in R. I would not mind having a little more syntactic sugar here and there, and fewer conversions/promotions, which are targeted toward interactive use. But R code can be amazingly elegant, and offers superior capabilities to program on the language. Check the code of any of the large committers to R packages, like H. Wickham, M. Dowle, or R. Francois.

"Better implementation of OOPs concepts"

No doubt here, but I am at peace with it; and I would recommend that my pythonista friends stop writing classes.

"Libraries to furnish almost anything which can be done in R"

This is quite patently false. Exercise: take any of the CRAN task views pages (http://cran.r-project.org/web/views/), and try to map the libraries to Python. It is not even close, not by a long shot. And many of these packages embody a lot of domain-specific knowledge. They are written by the top experts in their fields.

"Scalable (Hadoop/Pig friendly)"

As mentioned above, "scalability" is in the eye of the beholder, and language perceptions change very fast. I remember that only 10 years ago people sneered at Python being used in any production role, Django was dissed in public, etc. (there are still such naysayers, btw). For Hadoop, check http://cran.r-project.org/web/views/HighPerformanceComputing.html as well as the Revolution Analytics products.

"Better functionality to interact with other systems"

I think Python is top-class here, bar none, but you won't miss anything in R. It interacts with all known DBs, and has supported, tight-knit integration with a number of large vendors (Oracle, Teradata, Netezza, Tibco, Tableau...).

"Fast prototyping (You can use it as scripting language with Numpy/Pandas/... or use the OOPs concept to build a wrapper around your application to interact with other systems)"

This is odd. If anything, R's prototyping is *too* fast, because the language is expressive. You end up taking shortcuts and being too clever!

And to answer the second part of the question: Python is used for production and prototyping, all the time. So I'll dispute that premise too. And, I would add, Python is still a beautiful language, with lots of development tools, a very large and friendly community, plenty of corporate support, and a complete set of libraries across many application domains. Need I say more?
Giuseppe Paleologo at Quora
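To illustrate the memoization remark in the answer above, here is a minimal sketch. It is written in Python rather than R, and it assumes the benchmark in question is the usual recursive Fibonacci, which the answer does not name. The function names `fib_naive` and `fib_memo` are made up for this example.

```python
from functools import lru_cache


def fib_naive(n: int) -> int:
    """Plain recursion: the same subproblems are recomputed many times."""
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_memo(n: int) -> int:
    """Same recurrence, but each value is computed once and then cached."""
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)
```

Both functions return the same values; the memoized one does a number of calls linear in `n` instead of exponential, which is why a benchmark that fixes the naive algorithm says little about what a language can do in practice.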
Other answers
I want to add something about the role of R in prototyping. A primary reason that R is an important tool for data science is the large number of analytical tools that have been implemented in R and nowhere else. Anyone who doubts this is simply not very knowledgeable about the number of analytical tools in existence. The details of your domain matter here, but as a rule, no other platform comes close to the number of algorithms available.

Equally importantly, these libraries tend to play well together. Because R core is built around data structures like the data.frame and the time series, library writers have been able to build on data structures relevant to their domain, instead of implementing these from scratch. This both makes it easier to deploy an analytical library for R and promotes compatibility among libraries.

When you prototype in R, you can very quickly try out a large number of analytical tools. By the end of the prototype stage, you have identified the small number of algorithms that you actually want to put into production. Most likely, you will not productionalize these with R (although you might), but rather re-implement them from scratch in the platform of your choice.

I believe that one of the things that distinguishes the data science workflow from other engineering workflows is the sheer complexity of options. In engineering, it makes little sense to have an entire platform just for 'trying out' different algorithms. But in analytics, there is a lot more algorithmic complexity at play. You can't simply guess which algorithms you need because you are experienced: you have to experiment. An entire platform for quick experimentation is exactly what you need to support analytical work, which, by the way, has always been the reason that R exists in the first place.
Yasmin Lucero
A few things:

1) OOP can be helpful, it's not always a waste, and a good object system would be helpful. In that respect, R's syntactic confusion between functions and methods isn't all that great (not to mention defining classes via function calls, and the S3/S4 duality), though one can get used to it. Python's use of classes and Java beans are examples of terrible programming practices, but OOP itself can be very powerful - look at good OOP-FP hybrids like Scala or OCaml. What counts as OOP is also debatable - Python doesn't have proper encapsulation, but Java does. Java doesn't have friend methods, but C++ does. Paul Graham had a nice article about how it's not easy to agree on what OOP really is - to me it's encapsulation, inheritance and polymorphism, but would multiple inheritance be included (Python) or prohibited (Java), and are we counting inheritance-based or generic-based polymorphism? Most of all, does it matter to most data scientists and statisticians, or just to people who are also professional software engineers and who care more about such stuff?

2) If R is primarily functional as opposed to object-oriented, then why doesn't it support tail-call optimization? If I'm mistaken about this lack of support, let me know. Would non-CS-savvy users care, as they write their loops or simply vectorize?

3) In a multi-core world, having a single-threaded runtime and scaling via multi-processing is akin to writing web applications using CGI in 2014. The overhead of OS context-switching is very big (e.g. when using doMC et al. for multi-core processing) - then again, if your code is interpreted 100x slower than C++ or JITted Java/Scala, maybe that's a small penalty in the grand scheme of things. It may not be a problem for small datasets though; in those cases R's fast prototyping is way more important than performance.

4) R can still be fast "when written in C++", which is how many R libraries are written anyway. With 3k+ libraries, you may never have to write C++ yet still benefit from the reasonable speed of existing libraries. The bytecode compiler helps quite a bit as well.

5) R's garbage collector sucks compared to any modern platform, but most especially relative to the JVM. So does R's memory management in general, with passing by copy as opposed to by reference. R doesn't sport a garbage collector like Haskell's, which can clean up a gigabyte of garbage per second and so makes doing everything by copy affordable. In languages such as Scala, you can do stuff by reference yet you can build immutable data structures if you want - it depends on the use case. Again, whether these things are important or not depends on the size of the data and on the frequency of object creation and destruction (e.g. is it a real-time system fed by Storm/Spark Streaming/Samza or just a batch process run once in a while).

6) Would you use R in production since it's GPL? Given the GPL loophole for SaaS applications, maybe - but R also has AGPL components these days, so that would still be a problem. For shipping software, GPL is a no-no. You may get commercial R from Revolution Analytics, but that can get expensive.

7) What's an adequate set of libraries depends on what you do. If it's NLP, Python's NLTK could be sufficient, same with scikit-learn for machine learning. Python, Java, Scala etc. are totally inadequate when it comes to statistics - that's where R really shines. Maybe if what you do is machine vision then you should go with C++ or the Python wrapper for OpenCV? If you're working on computational fluid dynamics, mechanical engineering, etc., then MATLAB or Octave would be better choices? If you do operations research all day, then you could use Python with a professional solver like Gurobi? It all depends on what your goals are.

8) What's production-quality depends on use cases. My data science applications deal with terabytes of data, so I use Scala and Java. However, for a small dataset, I would clearly do the analysis in R - it's pointless to be a CS or software engineering purist when model development productivity is at stake. But for huge data, it's hard to beat the JVM platform - you will pay the price though when it comes to the speed at which you develop models, since you may need to develop the algorithms from scratch, and only then apply them.

Let's not be fanbois of any one platform, tool or language - each has its strengths and weaknesses. :)
Marek Kolodziej
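Point 2 above is about tail-call optimization. As a rough sketch of what is at stake, the example below uses Python, whose standard CPython implementation also does not eliminate tail calls; it is not R code, and the function names are invented for illustration. The tail-recursive form exhausts the call stack on deep inputs, while the equivalent loop does not.

```python
def sum_to_recursive(n: int, acc: int = 0) -> int:
    """Tail-recursive sum of 1..n: the recursive call is the last operation."""
    if n == 0:
        return acc
    return sum_to_recursive(n - 1, acc + n)


def sum_to_loop(n: int) -> int:
    """The same computation as a loop, which is roughly what tail-call
    optimization would turn the recursion into."""
    acc = 0
    while n > 0:
        acc += n
        n -= 1
    return acc


def overflows_stack(n: int) -> bool:
    """Return True if the recursive version hits the recursion limit."""
    try:
        sum_to_recursive(n)
    except RecursionError:
        return True
    return False
```

With the default recursion limit, `sum_to_recursive` works for small `n` but fails for something like `100_000`, whereas `sum_to_loop` handles it without trouble. This is why users of such languages tend to write loops or vectorized code, as the answer suggests.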
I can't really speak to the R part, as I don't use it often. That's because, in my opinion, Python can be used for prototyping and is actually really well suited for it. The libraries available, like matplotlib, pandas, statsmodels, and scikit-learn, plus Python's built-in libraries, give you a good place to start from for most types of problems without having to reinvent the wheel.

Then, when you've done your experimentation and have a proof of concept, all that code is already in a readable, robust, and widely known programming language which you (or a pro software engineer) can optimize, turn into a module, write additional custom pieces for, and combine with other Python modules to make something actually useful. Want to make a web API that runs some lengthy analysis and returns data and plots later? Hook up your Python prototype to Flask/Django and a job queue like Celery or RQ. Need to parallelize it on some monster server with 16 cores? Look into joblib or multiprocessing. Etc., etc.

Ultimately Python has a huge community both in data science and outside of it, and as a result code has already been written to solve many problems associated with building production systems of all different sorts; oftentimes, it takes more than just statistics and basic plotting to make something truly useful.

That said, if you're comfortable with R, that's probably what you should be prototyping in for now: the whole point of prototyping is to jump in and write something quick to see if it's worth the effort of developing something robust. If you generate some really valuable analysis in R, there's probably a way to replicate that analysis in Python, and if there isn't, you're in luck: it's a proper programming language where you (or someone else) can implement the necessary methods.
Brian Lange
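The parallelization suggestion above can be sketched with the standard library alone. This is only an illustration under assumptions: `heavy_analysis` is a hypothetical stand-in for a CPU-bound step, and the sketch uses `concurrent.futures.ProcessPoolExecutor` (built on `multiprocessing`) rather than joblib so that it needs no third-party packages.

```python
import math
from concurrent.futures import ProcessPoolExecutor
from typing import List, Sequence


def heavy_analysis(n: int) -> float:
    """Hypothetical CPU-bound step: sum of square roots below n."""
    return sum(math.sqrt(i) for i in range(n))


def run_serial(sizes: Sequence[int]) -> List[float]:
    """Baseline: run every task one after another in this process."""
    return [heavy_analysis(n) for n in sizes]


def run_parallel(sizes: Sequence[int], workers: int = 4) -> List[float]:
    """Run each task in a worker process so tasks can use separate cores.

    Results come back in the same order as the inputs.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(heavy_analysis, sizes))
```

When running this as a script, call `run_parallel` from under an `if __name__ == "__main__":` guard, since worker processes may re-import the module on some platforms. Whether the parallel version is actually faster depends on how large each task is relative to the cost of starting processes and moving data between them.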
Thanks for the A2A.

First of all, we need to understand what production-grade code is. To me, code which satisfies the following criteria is "production grade":
- it runs fast, if not fastest
- it's scalable
- it's reusable
- it satisfies the project requirements
- it's able to interact with other systems
- it's stable
- it's maintainable
- it has minimal dependency on OS/libraries

R is a language built by statisticians for statisticians. It has a very limited core team who maintain its code and accept new code which may boost its performance. Check out the benchmark of R with other languages like Python, C++... Normally, what people mean by R is GNU R, which is just one implementation of R. There are other implementations, like http://www.pqr-project.org/ and https://github.com/allr/fastr, which are relatively faster. Kindly go through this if you want to understand in detail: http://adv-r.had.co.nz/Performance.html

Data scientists like R because of the flexibility it provides when you want to work with data and build some data science model. It is a highly dynamically typed scripting language, so you can just write code as if you were writing in your notebook. It is so intuitive, with all the packages and visualization functionality. It is a known fact in the R community that if you want to implement something in data mining, you should first check the R package list; you might find an implementation.

Demerits of R:
- It is slow (and can't handle large data)
- Highly dynamic programming language
- Not many people in the team know R, because a DS team comprises Data Engineers (SQL/Hive/Pig fellows), Data Scientists (R/Python/... fellows), Product Managers (Excel, SQL)...

Now comes the question of what happens when you want your analysis to go live in production. Data scientists either convert their code into languages like Python/Java/C++ by themselves or with the help of some other colleagues. I don't know Java and don't like it either (due to some personal issue), but if we talk about Python, the advantages it has are:
- Relatively very fast
- Beautifully written code
- Better implementation of OOPs concepts
- Libraries to furnish almost anything which can be done in R
- Scalable (Hadoop/Pig friendly)
- Better functionality to interact with other systems
- Fast prototyping (You can use it as scripting language with Numpy/Pandas/... or use the OOPs concept to build a wrapper around your application to interact with other systems)

Answering your other questions:

What happens when the libraries which were used while prototyping in R are not available in Python?
-- A very rare situation, but you can use R then.

In such situations, if you use the Rpy2 library to use R inside Python, can you still make that code production grade?
-- Of course. Let's say your system runs a DS model once a week on small data. You can use R if the data fits in memory and you have the patience to let it run.

I'll share a personal experience with you. There was some code written in R which used to take 5 days to complete in a multicore environment. When the same code was translated to Python, the time was reduced to 1.5 hrs. I can't provide many details, but yeah, it was a good thing that we did.
Ankit Sharma
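For the weekly-batch scenario described above, one way to keep an R step inside a Python pipeline is sketched below. Note the swap: this does not use Rpy2, which embeds R in the Python process; instead it shells out to the `Rscript` command-line tool through `subprocess`, so the R step runs in its own process. The helper names `build_rscript_command` and `run_r` are hypothetical, and the sketch assumes `Rscript` is on the PATH wherever the job runs.

```python
import shutil
import subprocess
from typing import List, Optional


def build_rscript_command(expression: str) -> List[str]:
    """Build the argument list for evaluating one R expression."""
    return ["Rscript", "-e", expression]


def run_r(expression: str, timeout_s: float = 60.0) -> Optional[str]:
    """Evaluate an R expression in a separate process and return its stdout.

    Returns None when Rscript is not installed, so callers can fall back.
    Raises subprocess.CalledProcessError if R exits with an error, and
    subprocess.TimeoutExpired if it runs longer than timeout_s seconds.
    """
    if shutil.which("Rscript") is None:
        return None
    result = subprocess.run(
        build_rscript_command(expression),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_s,
    )
    return result.stdout
```

A call such as `run_r("cat(1 + 1)")` returns R's printed output as a string when R is available. Keeping the R step behind a process boundary means a failure or timeout in R surfaces as an ordinary Python exception that the scheduler can handle, at the cost of passing data through files or text instead of sharing objects in memory.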
It shouldn't be 'can't' but 'don't'.

Yes, R is not statically typed. But is JavaScript, which is quite popular in web development, statically typed? Yes, R is not fast. But not all applications need to be fast. Yes, R's OOP (S3, S4) is not OOP in the strict sense. But how did programmers develop applications before OOP? Yes, R is not scalable yet. But not all data has to be distributed.

However, it is undeniable that the above make R harder to adopt. Other reasons may be:
- most mainstream programmers (.NET, JVM) don't know R well
- most statisticians (or R enthusiasts) don't know how to develop an application
- its licence may require them to open their full source code
Jaehyeon Kim
I think the main overarching reason is that Python is a real programming language, while R is statistical scripting. And to maintain a production (i.e. large) system you need a real programming language. Even today, many software projects fail due to the overwhelming complexity of large code bases. Many programming design principles have been invented so that the "limited" human mind can deal with this challenge. Python does an excellent job supporting advanced programming, while R hardly tries.

To sum up:
* production code can become large and complex
* programming large software is very, very hard
-> only Python tried to support advanced programming concepts so that large production systems can be maintained

Python is lacking some statistical packages, since the interest in Python for statistics/machine learning started considerably later, but it is catching up, since a huge community has formed to write machine learning and visualization packages. Now you can decide which of the following two will happen faster to catch up with the other option:
* Python developers use an excellent, maintainable language to write additional packages if needed
* someone rewrites R from scratch to make it suitable for writing software projects

Considerable players push Python as the future of data science:
http://continuum.io/press/continuum-receives-darpa-xdata-funding
http://ipython.org/microsoft-donation-2013.html
Bernd Schmidt