What are the most useful data mining / analysis / science tools?
-
I recently saw a video of Hilary Mason, Chief Scientist at Bitly, talking about the tools her team uses. She mentioned that they use a lot of Python (NumPy, SciPy, scikit-learn), as well as Hadoop HDFS, Redis, and D3.js. What tools do you find most useful? Does anyone use Gephi, pandas, the Julia language, or Tableau Public? Do you use R, Matlab, or Octave? What about Python: what other Python tools are there? Is MongoDB useful, or other forms of NoSQL? Please list the tools and what you use them for. Thank you.
-
Answer:
We're still waiting on the "big data stack" to really flesh out; at Recorded Future we have a pretty mixed bag of tools. For data collection (a big part of our business), our pipeline is written primarily in Java/Scala/Python, and data is stored in a number of backends (MongoDB and MySQL, and more recently Dynamo), with RabbitMQ controlling messaging. We index with Sphinx and are getting going with ElasticSearch/Lucene. For analysis, we're mostly an R shop, and use Hadoop (with Java/Python/R) as necessary to do large-scale data filtering, aggregation, and clustering. Day to day, I make heavy use of Python, R, and good old UNIX to get my data into shape and do modeling and exploratory data analysis. Spotfire is also a popular tool here for data visualization and analysis, though we've also baked interactive D3 visualizations into our product. I rarely go a day without producing a ggplot2 plot. Once we put our models into production, those generally run in Java or Python on realtime data. To Sean's point: I don't think that "data science" == "machine learning". While ML techniques should be in a data scientist's toolbox, I think there are a lot of basic descriptive statistics, traditional statistical modeling, data manipulation, and data visualization techniques that should be in there, too.
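As a small illustration of the descriptive-statistics point, here is a minimal Python sketch that uses only the standard library's statistics module. The numbers and the variable names are made up for illustration; they are not Recorded Future data, and this is not how any particular team does it.

```python
import statistics

# Hypothetical daily event counts (invented for this example); one day is an outlier.
daily_counts = [12, 15, 11, 14, 90, 13, 16]

# A few basic descriptive statistics that are worth checking before any modeling.
summary = {
    "mean": statistics.mean(daily_counts),
    "median": statistics.median(daily_counts),
    "stdev": statistics.stdev(daily_counts),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The gap between the mean and the median is the kind of thing a quick summary surfaces: a single outlying day pulls the mean well above the typical value.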
Evan Sparks at Quora
Other answers
We maintain a GitHub repo with lots of resources and links to people in data science. You can follow further updates from: https://bitly.com/awesomedatascience
Awesome Data Science
An open source Data Science repository to learn and apply towards solving real world problems.
Table of contents
- https://github.com/okulbilisim/awesome-datascience#motivation
- https://github.com/okulbilisim/awesome-datascience#infographic
- https://github.com/okulbilisim/awesome-datascience#what-is-data-science
- https://github.com/okulbilisim/awesome-datascience#colleges
- https://github.com/okulbilisim/awesome-datascience#moocs
- https://github.com/okulbilisim/awesome-datascience#data-sets
- https://github.com/okulbilisim/awesome-datascience#bloggers
- https://github.com/okulbilisim/awesome-datascience#facebook-accounts
- https://github.com/okulbilisim/awesome-datascience#twitter-accounts
- https://github.com/okulbilisim/awesome-datascience#youtube-videos--channels
- https://github.com/okulbilisim/awesome-datascience#toolboxes---environment
- https://github.com/okulbilisim/awesome-datascience#journals-publications-and-magazines
- https://github.com/okulbilisim/awesome-datascience#presentations
- https://github.com/okulbilisim/awesome-datascience#other-awesome-lists
Motivation
This part is for dummies who are new to Data Science. This is a shortcut path to start studying Data Science. Just follow the steps to answer the questions, "What is Data Science and what should I study to learn Data Science?"
First of all, Data Science is one of the hottest topics on the Computer and Internet farmland nowadays. People have gathered data from applications and systems until today, and now is the time to analyze it. The next steps are producing suggestions from the data and creating predictions about the future. You can find the biggest question for Data Science and hundreds of answers from experts. Our favorite data scientist is https://twitter.com/clarecorthell.
She is an expert in data-related systems and a hacker, and has been working at a company as a data scientist. http://datasciencemasters.org/. This web site helps you understand exactly how to study to become a professional data scientist.
Secondly, our favorite programming language for #DataScience nowadays is Python. Python's pandas library (http://pandas.pydata.org/) has full functionality for collecting and analyzing data. We use Anaconda (https://store.continuum.io/cshop/anaconda/) to play with data and to create applications. This is the life cycle (https://github.com/okulbilisim/awesome-datascience/blob/master/DataScience-Life-Cycle.md) to begin a Data Science project.
Infographic
- A visual guide to Becoming a Data Scientist in 8 Steps by https://www.datacamp.com/ (preview: http://i.imgur.com/AfFMkHe.jpg)
- Mindmap on required skills (http://i.imgur.com/FxsL3b8.png). Swami Chandrasekaran made a http://nirvacana.com/thoughts/becoming-a-data-scientist/.
- by https://twitter.com/kzawadz via https://twitter.com/MktngDistillery/status/538671811991715840, http://www.marketingdistillery.com/2014/11/29/is-data-science-a-buzzword-modern-data-scientist-defined/
What is Data Science?
- https://www.oreilly.com/ideas/what-is-data-science
- http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1
- http://www.datascientists.net/what-is-data-science
- http://www.becomingadatascientist.com/2014/02/14/what-is-a-data-scientist/
- http://en.wikipedia.org/wiki/Data_science
- http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/
- https://ischool.syr.edu/media/documents/2012/3/DataScienceBook1_1.pdf
COLLEGES https://github.com/ryanswanstrom/awesome-datascience-colleges https://github.com/okulbilisim/awesome-datascience/blob/master/datascience.berkeley.edu https://dsi.virginia.edu/ http://datasciencedegree.wisconsin.edu/ MOOC's https://datasense.withgoogle.com/course https://www.coursera.org/course/datasci https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage https://www.coursera.org/specialization/datamining http://cs109.org/ http://schoolofdata.org/ http://www.openintro.org/ http://datascience.sg/categories/MOOC/ http://www.cs171.org/#!index.md https://www.coursera.org/course/procmin http://www.cs.ox.ac.uk/projects/DeepLearn/ https://www.youtube.com/playlist?list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu http://www.cs.ox.ac.uk/activities/machinelearning/ http://www.cs.ubc.ca/~nando/540-2013/lectures.html https://github.com/DataScienceSpecialization/courses Data Sets http://academictorrents.com/ http://hadoopilluminated.com/hadoop_illuminated/Public_Bigdata_Sets.html http://catalog.data.gov/dataset - The home of the U.S. Government's open data http://www.census.gov/ https://www.freebase.com/ http://usgovxml.com/ http://enigma.io/ - Navigate the world of public data - Quickly search and analyze billions of public records published by governments, companies and organizations. http://datahub.io/ http://aws.amazon.com/datasets http://databib.org/ http://www.datacite.org/ https://www.quandl.com/ - Get the data you need in the form you want; instant download, API or direct to your app. 
http://figshare.com/ http://dev.maxmind.com/geoip/legacy/geolite/ http://hadoopilluminated.com/hadoop_illuminated/Public_Bigdata_Sets.html http://data.ohouston.org/ https://www.kaggle.com/wiki/DataSources http://www.1000genomes.org/data https://www.freebase.com/ http://www.google.com/publicdata/directory http://data.worldbank.org/ http://nyctaxi.herokuapp.com/ http://www.opendataphilly.org/ Connecting people with data for Philadelphia http://ahmetkurnaz.net/en/statistical-data-sources/ A blog post includes many data set databases http://grouplens.org/datasets/ Sample movie (with ratings), book and wiki datasets http://archive.ics.uci.edu/ml/ - contains data sets good for machine learning https://bitly.com/bundles/hmason/1 by https://bitly.com/u/hmason/bundles http://www.ncdc.noaa.gov/ http://www.climatedata.us/ (related: http://toolkit.climate.gov/) http://www.reddit.com/r/datasets http://maplight.org/data - provides a variety of data free of charge for uses that are freely available to the general public. Click on a data set below to learn more http://ghdx.healthdata.org/catalog - Institute for Health Metrics and Evaluation - a catalog of health and demographic datasets from around the world and including IHME results http://research.stlouisfed.org/fred2/ http://www.nyu.edu/projects/politicsdatalab/data_classic_sources.html https://github.com/datasciencemasters/data http://www.unicef.org/statistics/ http://data.unicef.org/ http://data.un.org/ http://sedac.ciesin.columbia.edu/ http://gdeltproject.org/ http://www.scb.se/en_/ http://www.datasciencecentral.com/profiles/blogs/great-github-list-of-public-data-sets http://data.stackexchange.com/ - an open source tool for running arbitrary queries against public data from the Stack Exchange network. Bloggers http://blog.wesmckinney.com/ - Wes McKinney Blog. http://miningthesocialweb.com/ - Mining The Social Web. 
http://www.gregreda.com/ - Greg Reda Personal Blog http://kldavenport.com/ - Kevin Davenport Personal Blog http://jvns.ca/ - Recurse Center alumna http://www.cse.unr.edu/~hkardes/ - Personal Web Page http://seanjtaylor.com/ - Personal Web Page http://drewconway.com/ - Personal Web Page http://www.hilarymason.com/ - Personal Web Page http://complexdiagrams.com/ - Personal Blog http://hairysun.com/ - Personal Blog http://www.becomingadatascientist.com/ Documenting my path from "SQL Data Analyst pursuing an Engineering Master's Degree" to "Data Scientist" http://allthingsds.wordpress.com/ - AllThings Data Science http://www.mdmgeek.com/ - Tech Blog on Master Data Management And Every Buzz Surrounding It http://datasciencemasters.org/ - The Open Source Data Science Masters http://cloudofdata.com/ Based in the UK and working globally, Cloud of Data's consultancy services help clients understand the implications of taking data and more to the Cloud. http://datasciencelondon.org/ Data Science London is a non-profit organization dedicated to the free, open, dissemination of data science. We are the largest data science community in Europe. We are more than 3,190 data scientists and data geeks in our community. http://datawrangling.com/ by Peter Skomoroch. MACHINE LEARNING, DATA MINING, AND MORE http://www.johnmyleswhite.com/ Personal Blog - Data Science Questions and Answers from experts http://openresearch.wordpress.com/ a PhD student at Berkeley http://blog.starbridgepartners.com/ MDS, Inc. Helps Build Careers in Data Science, Advanced Analytics, Big Data Architecture, and High Performance Software Engineering http://www.louisdorard.com/blog/ a technology guy with a penchant for the web and for data, big and small http://machinelearningmastery.com/ about helping professional programmers to confidently apply machine learning algorithms to address complex problems. 
http://www.danielforsyth.me/ - Personal Blog http://www.datascienceweekly.org/ - Weekly News Blog http://blog.revolutionanalytics.com/ - Data Science Blog http://www.r-bloggers.com/ - R Bloggers http://practicalquant.blogspot.com/ Big data http://www.mickaellegal.com/ a data enthusiast who gets hooked on solving intriguing problems and crafting beautiful stories and visualizations with data. Over the past 5 years, he has applied statistics to solve problems in government, brain sciences, and most recently, retail. http://datascopeanalytics.com/ data-driven consulting and design http://yet-another-data-blog.blogspot.com.tr/ Yet Another Data Blog http://spenczar.com/ a data scientist at Twitch. I handle the whole data pipeline, from tracking to model-building to reporting. http://www.kdnuggets.com/ Data Mining, Analytics, Big Data, Data Science (not a blog, a portal) http://www.metabrown.com/blog/ - Personal Blog http://www.datascientists.net/ is building the data scientist culture. http://whatsthebigdata.com/ is some of, all of, or much more than the above and this blog explores its impact on information technology, the business world, government agencies, and our lives. http://www.micfarris.com/ Focusing on science, datascience, business, technology, and channeling inner geekness! 
http://magnus-notitia.blogspot.com.tr/ - Magnus Notitia http://newdatascientist.blogspot.com/ How a Social Scientist Jumps into the World of Big Data http://harvarddatascience.com/ - Thoughts on Statistical Computing and Visualization http://101.datascience.community/ - Learning To Be A Data Scientist http://www.chioka.in/kaggle-competition-solutions/ http://datascientistjourney.wordpress.com/category/data-science/ http://nyctaxi.herokuapp.com/ http://learninglover.com/blog/ http://getprismatic.com/story/1406683266166?utm_medium=email http://www.dataists.com/ http://www.data-mania.com/ http://data-magnum.com/ https://www.mapr.com/blog http://fastml.com/ http://www.p-value.info/ - Musings on data science, machine learning and stats. http://datascopeanalytics.com/what-we-think/ http://tarrysingh.com/ http://datascientistjourney.wordpress.com/category/data-science/ http://www.data-mania.com/index.php/easyblog http://filedrawer.wordpress.com/ - Chris Said's science blog http://www.emilio.ferrara.name/ http://datanews.tumblr.com/ http://www.reddit.com/r/textdatamining/ http://www.periscopic.com/#/news http://hilaryparker.com/ http://datastori.es/ http://datasciencelab.wordpress.com/ http://www.kennybastani.com/ http://blog.smola.org/ http://blog.data-miners.com/ http://blog.okcupid.com/ http://flowingdata.com/ - Visualization and Statistics http://www.calculatedriskblog.com/ http://www.applieddatalabs.com/ - content and news about data-driven business. 
https://beta.oreilly.com/learning http://blog.dominodatalab.com/ http://iamtrask.github.io/ - A Machine Learning Craftsmanship Blog Facebook Accounts https://www.facebook.com/data https://www.facebook.com/Bigdatascientist https://www.facebook.com/DataScience101 https://www.facebook.com/DataScienceDay/ https://www.facebook.com/nycdatascience https://www.facebook.com/pages/Data-science/431299473579193?ref=br_rs https://www.facebook.com/pages/Data-Science-London/226174337471513 https://www.facebook.com/DataScienceTechnologyCorporation?ref=br_rs https://www.facebook.com/groups/1394010454157077/?ref=br_rs https://www.facebook.com/centerdatasciences?ref=br_rs https://www.facebook.com/groups/bigdatahadoop/ https://www.facebook.com/groups/data.analytics/ https://www.facebook.com/groups/434352233255448/ https://www.facebook.com/groups/rhadoop/ https://www.facebook.com/groups/bigdatalearnings/ https://www.facebook.com/groups/bigdatastatistics/ https://www.facebook.com/groups/BigDataExpert/ https://www.facebook.com/groups/machinelearningforum/ https://www.facebook.com/groups/dataminingsocialnetworks/ Twitter Accounts https://twitter.com/BigDataCombine - Rapid-fire, live tryouts for data scientists seeking to monetize their models as trading strategies https://twitter.com/BigDataGal - Data Viz Wiz | Data Journalist | Growth Hacker | Author of Data Science for Dummies (2015) https://twitter.com/analyticbridge - Big Data, Data Science, Predictive Modeling, Business Analytics, Hadoop, Decision and Operations Research. https://twitter.com/greenbacker - Director of Data Science at @ExploreAltamira https://twitter.com/Chris_Said - Data scientist at Twitter https://twitter.com/clarecorthell - Dev, Design, Data Science @mattermark #hackerei https://twitter.com/DadiCharles - #datascientist @Ekimetrics. 
#machinelearning #dataviz #DynamicCharts #Hadoop #R #Python #NLP #Bitcoin #dataenthousiast https://twitter.com/DataScienceCtrl - Data Science Central is the industry's single resource for Big Data practitioners. https://twitter.com/ds_ldn Data Science. Big Data. Data Hacks. Data Junkies. Data Startups. Open Data https://twitter.com/BecomingDataSci - Documenting my path from SQL Data Analyst pursuing an Engineering Master's Degree to Data Scientist https://twitter.com/TedOBrien93 - Mission is to help guide & advance careers in Data Science & Analytics https://twitter.com/datasciencetips - Tips and Tricks for Data Scientists around the world! #datascience #bigdata https://twitter.com/DataVisualizati - DataViz, Security, Military https://twitter.com/DataScienceX https://twitter.com/deeplearning4j - https://twitter.com/dpatil - White House Data Chief, VP @ RelateIQ. https://twitter.com/DominoDataLab https://twitter.com/drewconway - Data nerd, hacker, student of conflict. https://twitter.com/jabawack - #Networks, #MachineLearning and #DataScience. I work on #Social Media. Postdoc at @IndianaUniv https://twitter.com/erinbartolo - Running with #BigData--enjoying a love/hate relationship with its hype. @iSchoolSU #DataScience Program Mgr. https://twitter.com/gjreda Working @ GrubHub about data and pandas https://twitter.com/kdnuggets - KDnuggets President, Analytics/Big Data/Data Mining/Data Science expert, KDD & SIGKDD co-founder, was Chief Scientist at 2 startups, part-time philosopher. https://twitter.com/hakan_kardes - Data Scientist https://twitter.com/hmason - Data Scientist in Residence at @accel. https://twitter.com/hackingdata ReTweeting about data science https://twitter.com/johnmyleswhite Scientist at Facebook and Julia developer. 
Author of Machine Learning for Hackers and Bandit Algorithms for Website Optimization. Tweets reflect my views only. https://twitter.com/BDataScientist - Principal Data Scientist @ Microsoft Data Science Team https://twitter.com/b0rk - Hacker - Pandas - Data Analyze https://twitter.com/kncukier - The Economist's Data Editor and co-author of Big Data (http://big-data-book.com ). https://twitter.com/KevinLDavenport - Organizer of http://sddatascience.com https://twitter.com/krees - Interactive data visualization and tools. Data flaneur. https://twitter.com/KirkDBorne - DataScientist, PhD Astrophysicist, Top #BigData Influencer. https://twitter.com/lmrei - PhD Student. Programming, Mobile, Web. Artificial Intelligence, Intelligent Robotics Machine Learning, Data Mining, Natural Language Processing, Data Science. https://twitter.com/ML_toparticles - Live Content Curated by top 1K Machine Learning Experts https://twitter.com/Agent_Analytics - Data Analytics Recruitment Specialist at Salt (@SaltJobs) | Analytics - Insight - Big Data - Datascience https://twitter.com/__mharrison__ - Opinions of full-stack Python guy, author, instructor, currently playing Data Scientist. Occasional fathering, husbanding, ult|goalt-imate, organic gardening. https://twitter.com/ptwobrussell - Mining the Social Web. https://twitter.com/mertnuhoglu Data Scientist at BizQualify, Developer https://twitter.com/mrogati - Data @ Jawbone. Turned data into stories & products at LinkedIn. Text mining, applied machine learning, recommender systems. Ex-gamer, ex-machine coder; namer. https://twitter.com/noahi - Visualization & interaction designer. Practical cyclist. Author of vis books:http://www.oreillynet.com/pub/au/4419 https://twitter.com/PaulMiller - Cloud Computing/ Big Data/ Open Data Analyst & Consultant. Writer, Speaker & Moderator. Gigaom Research Analyst. https://twitter.com/peteskomoroch - Creating intelligent systems to automate tasks & improve decisions. 
Entrepreneur, ex Principal Data Scientist @LinkedIn. Machine Learning, ProductRei, Networks https://twitter.com/MDMGeek - Solution Architect @ IBM, Master Data Management, Data Quality & Data Governance Blogger. Data Science, Hadoop, Big Data & Cloud. https://twitter.com/q_datascience Quora's data science topic https://twitter.com/Rbloggers - Tweet blog posts from the R blogosphere, data science conferences and (!) open jobs for data scientists. https://twitter.com/randhindi https://twitter.com/randal_olson - Computer scientist researching artificial intelligence. Data tinkerer. Community leader for @DataIsBeautiful. #OpenScience advocate. https://twitter.com/EROLRecep - Data Science geek @ UALR https://twitter.com/ryanorban - Data scientist, genetic origamist, hardware aficionado https://twitter.com/seanjtaylor - Social Scientist. Hacker. Facebook Data Science Team. Keywords: Experiments, Causal Inference, Statistics, Machine Learning, Economics. https://twitter.com/silviakspiva - #DataScience at Cisco https://twitter.com/spenczar_n - Data nerd https://twitter.com/tozCSS - Enjoys ABM, SNA, DM, ML, NLP, HI, Python, Java. Top percentile kaggler/data scientist https://twitter.com/anskarl - Complex Event Processing, Big Data, Artificial Intelligence and Machine Learning. Passionate about programming and open-source. https://twitter.com/Terry_Timko - InfoGov; Bigdata; Data as a Service; Data Science; Open, Social & Business Data Convergence https://twitter.com/TextMining_r https://twitter.com/TonyBaer - IT analyst with Ovum covering Big Data & data management with some systems engineering thrown in. https://twitter.com/tonyojeda3 - Data Scientist | Author | Entrepreneur. Co-founder @DataCommunityDC. Founder @DistrictDataLab. #DataScience #BigData #DataDC https://twitter.com/vambati - Data Science @ PayPal. #NLP, #machinelearning; PhD, Carnegie Mellon alumni (Blog: http://allthingsds.wordpress.com ) https://twitter.com/wesmckinn - Pandas (Python Data Analysis library). 
https://twitter.com/WileyEd - Senior Manager - @Seagate Big Data Analytics | @McKinsey Alum | #BigData + #Analytics Evangelist | #Hadoop, #Cloud, #Digital, & #R Enthusiast https://twitter.com/datanews - The data news crew at @WNYC. Practicing data-driven journalism, making it visual and showing our work. @SkymindIO's open-source deep learning for the JVM. Integrates with Hadoop, Spark. Distributed GPU/CPUs | http://nd4j.org | http://www.skymind.io Youtube Videos & Channels https://www.youtube.com/watch?v=WXHM_i-fgGo Andrew Ng: Deep Learning, Self-
Huseyin Mert
I use Apache Mahout (http://mahout.apache.org) and Myrrix (http://myrrix.com), but then I would because I'm involved with these. Whether it is useful depends on what you're doing. Many things you've listed are languages or environments (R, Matlab, Octave). Those are more for prototyping ML algorithms on your "workbench" and not any part of a production system's operation. I suppose I put NumPy in the same bucket; it's more of a library, sure, but for speed I'm not sure I'd base a production ML system on something in Python. For prototyping, for non-critical stuff -- sure, if you're a Python person. Mongo has nothing to do with ML per se, nor does Redis, HDFS, NoSQL DBs. Hadoop doesn't either but if you're trying to do ML on Hadoop (or, Java in general) I think Mahout is kind of the main starting point for any such adventures right now. If that's you -- yes, recommend Mahout, and by extension Myrrix. Otherwise, I don't think they'd be useful.
Sean Owen
In terms of analysis, R and Python have already been mentioned as well as SAS and other statistical software. When it comes to data collection, though, Kimono (http://kimonolabs.com) can be an extremely useful tool. With Kimono, you can scrape almost any website and get your data in a structured format, ready for analysis. This makes it incredibly easy to pair with Python or R for further processing, and you can even do some simple but powerful analysis with the new modify results feature (http://blog.kimonolabs.com/2015/01/26/write-javascript-functions-to-transform-your-api-results/). What makes Kimono awesome is how easy it is to use. You just grab the data you want from almost any site by clicking, and soon you've got your API. People use Kimono for all kinds of data analysis, from sentiment analysis (http://blog.kimonolabs.com/2014/12/17/guest-blog-sentiment-analysis-on-web-scraped-data-with-kimono-and-monkeylearn/) to graphing beer descriptors (http://scottjanish.com/consensus-review-heady-topper/). Full disclosure, I do work for Kimono, but I have to say it's an amazing tool.
Katie Lundsgaard
I'll try to round off some of the Python mentions.
* Classic Python numerical/scientific combo: NumPy, SciPy, matplotlib, and a bunch of the scikit stuff (scikit-learn for machine learning, scikit-image for image processing) are great and a no-brainer for R&D. They follow Matlab-like conventions most of the time, which can help, especially if you have people coming from academia who are more at home with that.
* From there, it depends on what you're doing:
- Datasets: before hitting the DFS/NoSQL (e.g. HDFS, Mongo, etc.) highway, you can, for example, use the HDF5 data format for pretty fast serialization and random access to fairly large datasets in a local/development environment. h5py is a nice library for working with this format, especially in conjunction with NumPy; check out their quick start guide. I'm not sure how well maintained this format is, but for prototyping you won't need to deal with the operational hassle of setting up clusters, daemons, etc.
- Image processing / vision: OpenCV is obviously a good start. Their latest releases (2.4.x) have the cv2 Python library, which works natively with NumPy arrays as well. Lots of functionality is built in.
- NLP and text mining: nltk is very actively developed and already has a lot of great functionality and easy-to-use datasets.
- Graphs / graph algorithms: one thing that wasn't mentioned. This is obviously a huge issue for most 'big data' applications, where it's a lot about mining for relationships/correlations between different variables. NetworkX is a very complete Python package for most classic graph data structures and algorithms. It was also recently featured in a PyCon talk (google for 'pycon 2012 graph processing'). My experience has been that it's great for prototyping but not geared so much toward bigger datasets (hundreds of thousands of nodes/edges at most; you might get a bit farther with sparse graphs, though).
- Distributed processing: if you do end up with most of your code in Python and want to move towards Hadoop / MapReduce, there are a couple of Python frameworks that let you write MapReduce jobs and use Hadoop streaming to run them on a Hadoop cluster. Have a look at Dumbo and, more recently, Hadoopy (I never tried Hadoopy myself, but it sounds promising; Dumbo is great and has good documentation, but takes some tinkering to get a handle on performance and what's under the hood).
- In general, once performance becomes an issue:
* Have a look at Cython (it lets you mix static C type declarations into Python code and optimizes certain operations by compiling into low-level Python C API code; you can get huge speed gains in numerical contexts, looping, etc.).
* SWIG for wrapping small C/C++ modules in Python.
* One more pretty amazing project with respect to NumPy and performance that I recently found out about is numexpr - http://code.google.com/p/numexpr/ . I have just played with it following their examples, but it seems like a huge win already.
My experience has been that with these technologies you can certainly run production-grade code in a timely manner (whatever that means in your case, be it real-time processing of some sort or an analytics backend).
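The Hadoop streaming idea above can be sketched in plain Python, without Dumbo or Hadoopy. Streaming passes records between stages as lines of text, conventionally tab-separated key/value pairs, and hands the reducer its input sorted by key. The word-count task, the function names, and the run_locally helper below are assumptions chosen for illustration; the helper only simulates the map, sort, and reduce stages in one process and is not part of Hadoop.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; assumes the records arrive sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Hypothetical helper: simulate map -> sort -> reduce without a cluster."""
    return list(reducer(sorted(mapper(lines))))


print(run_locally(["the cat", "The dog the"]))
```

On a real cluster, the mapper and reducer would typically be separate scripts that read from standard input and write to standard output; testing the logic locally like this first avoids a lot of slow round trips to the cluster.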
Adam Ever-Hadani
There are different reasons why you would choose each of those packages. They are largely means to the same end, but the tools are used in different contexts. Become a master of one first. Don't obsess over the specific tool; rather, consider how you're using it to most efficiently handle data and communicate your "data story" through graphs and models.
SAS has been around for a long time (1960s I believe), and is still used in many business and enterprise applications. It runs on its own programming language and is completely closed-source through the company.
Excel, although basic compared to the rest, is a nice one-size-fits-all tool for graphing and modeling spreadsheet data. There is a lot less overhead to Excel for accomplishing many things.
R and Python are becoming equally popular now; however, Python has many uses outside of just data analysis. The open source tools and datasets are driving a lot of this popularity. RStudio for R is a very robust tool for organizing your workspace and data, for which I haven't found a direct Python alternative. Python can potentially be translated into production code, whereas R is far too slow for that.
Matlab is used extensively in academia and supports many high-quality mathematical and graphing functions. Python and R are now catching up, but for a long time these capabilities were exclusive to Matlab.
Kaushik Kasi
If you'd like to use Python (NumPy, SciPy, etc.) on Hadoop, ping us at http://mortardata.com; our PaaS is built for exactly this sort of work. Other tools that people love include Pandas [1] and IPython [2].
[1] http://pandas.pydata.org/
[2] http://ipython.org/
K. Young
I guess a lot of enterprise users still prefer to use SAS: it is easy to deploy, has a large existing knowledge base, and is a solid enterprise product. It is not very friendly when it comes to defining your own functions and procedures, and that is where R and Python come in. These are what I use the most.
Aneesh Oberoi