What are some of the software engineering principles every data scientist should know? How do you learn them?
-
This question is particularly relevant to data scientists who work in a software-engineering-driven environment. Examples: version control, unit testing, and the concepts of continuous integration (CI) and regression testing (in the software, not statistical, sense). On the tool side: using an issue-tracking system like JIRA and learning the jargon of the development cycle (standup, sprint, epic, etc.). Historically, statisticians have worked more on an individual basis, and the end product is usually an academic paper. Software engineers often work collaboratively and simultaneously, and the end product is usually code. How and where can data scientists learn these software engineering principles?
-
Answer:
Here are some principles that I feel apply more to data scientists than to software engineers. Data scientists should know how to construct tables properly to take advantage of indexes and speed up common queries. When writing queries to extract data from large databases, data scientists should understand how to minimize query runtimes (especially for queries involving complicated joins) so that the queries scale with the size of the data. Data scientists should also practice good abstraction and factoring in their code, to prevent the proliferation of many lines of code that do the same thing. This is very important for keeping code and queries reusable, which comes up constantly in analysis, data extraction, data munging, and dashboard-building tasks. It should be balanced against keeping the code clean and readable. Perhaps most commonly, data scientists must be experts at cleaning, reading, formatting, and standardizing data I/O (a.k.a. data munging). You need to know exactly what kind of data you expect your function to take in and what kind of data you expect it to return. If the input is dirty, you must clean it! For more advanced applications, data scientists must practice proper parallel programming to take advantage of multiple machines and make some massive computations feasible.
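To make the abstraction and input-validation points concrete, here is a minimal Python sketch; the function name, columns, and cleaning rules are hypothetical illustrations, not something taken from the answer above.

    import pandas as pd

    def clean_user_events(df: pd.DataFrame) -> pd.DataFrame:
        """Validate and standardize a raw events frame so every
        downstream query can assume the same clean schema."""
        required = {"user_id", "event_time", "amount"}
        missing = required - set(df.columns)
        if missing:
            raise ValueError(f"missing columns: {missing}")

        out = df.copy()
        # Coerce types once, at the boundary, instead of in every analysis.
        out["event_time"] = pd.to_datetime(out["event_time"], errors="coerce")
        out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
        # Drop rows that failed coercion rather than letting them poison joins.
        return out.dropna(subset=["event_time", "amount"])

Factoring the cleaning into one reusable function like this, rather than repeating it in every script, is the kind of abstraction the answer describes.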
William Chen at Quora
Other answers
Maybe also a bit about algorithmic complexity, time/space trade-offs, an intuitive sense of 32-bit and 64-bit memory limitations, how MapReduce works, the limitations of double-precision floating-point arithmetic, etc.
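A quick sketch of the floating-point point, using only the Python standard library:

    import math

    # Double-precision floats cannot represent most decimal fractions
    # exactly, which shows up when comparing or summing them.
    print(0.1 + 0.2 == 0.3)   # False
    print(0.1 + 0.2)          # 0.30000000000000004

    # Order of operations changes the rounding error; math.fsum
    # tracks the lost low-order bits and recovers the exact sum.
    vals = [1e16, 1.0, -1e16]
    print(sum(vals))          # 0.0 -- the 1.0 is swallowed by rounding
    print(math.fsum(vals))    # 1.0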
Rob Weir
For data scientists, I would not suggest learning SE best practices just for the sake of learning them. You will learn them when you need them. Version control should be used for both code and documents. And if possible, the code should support "click-and-go" (a similar concept to the nightly build in SE): all of your results and figures can be reproduced by running a single main program or script. However, I seldom use unit testing during data analysis, because in an exploratory process, how can you know the expected result before you have done the analysis? Only when I am about to finish a project do I add more documentation and logging and write a set of unit tests. The unit tests serve as a guard in case I update the code in the future. For some projects, where the data have already been preprocessed by others, we just use CSV and JSON files to store everything. But for a project that involves long-running data crawling, data cleaning, and modeling, we see the need for a database so that all versions of the data are well managed. In terms of teamwork, my experience from several data mining contests (KDD Cup 2013, the Stack Overflow closed-question challenge, the Nokia MDC, etc.) is that splitting the work requires insight into the data mining task itself, rather than SE practices. The key is to decouple the big task into several independent pieces so that everyone can work on their own and communicate with each other as little as possible. I am not saying there is no exchange of ideas; on the contrary, there is a lot of discussion. However, too much exchange of code and data causes the team to spend precious time on communication and coordination rather than on data analysis. In summary, good practices cannot be learned by reading rules -- they can only be gained the hard way, in the process of doing actual data mining tasks.
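A minimal sketch of the kind of guard test described above, using Python's built-in unittest module; the normalize function and its test cases are invented for illustration:

    import unittest

    def normalize(scores):
        """Scale a list of numbers linearly onto the [0, 1] range."""
        lo, hi = min(scores), max(scores)
        if lo == hi:
            # Constant input: define the output as all zeros.
            return [0.0 for _ in scores]
        return [(s - lo) / (hi - lo) for s in scores]

    class TestNormalize(unittest.TestCase):
        def test_endpoints(self):
            out = normalize([3, 7, 11])
            self.assertEqual(out[0], 0.0)
            self.assertEqual(out[-1], 1.0)

        def test_constant_input(self):
            # The edge case most likely to regress in a future refactor.
            self.assertEqual(normalize([5, 5]), [0.0, 0.0])

    if __name__ == "__main__":
        unittest.main()

Written once at the end of the project, tests like these catch regressions when the code is updated later, exactly the "guard" role described above.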
Yin Zhu
I think the list in the question details is pretty good. Most software developers learn this material on the job; in fact, there's a pretty good case to be made that that's the whole point of an entry-level development position. I don't see any reason why it should be different for data scientists.
Justin Rising
This is an interesting question that I've thought about to a considerable degree. I think the answer is founded in another question: what kind of data scientist? There are "theorist" data scientists who do data analysis, data modeling, and algorithm design. Then there are "experimentalist" data scientists who implement production systems and write code. Can a single data scientist do both? Sure, but that person is going to be a superman and should earn two or three salaries at once. How likely is that? Not very. As a theorist myself, I often work with in-house IT people who program in Python, for example, and we implement production systems as a team. The theorists and the experimentalists have two different skill sets. The former are adept at computer science, machine learning, mathematical statistics, and probability theory. The latter are systems people capable of putting together hardware and cloud solutions, architecting Hadoop clusters, spinning up and configuring servers on EC2, and programming scalable production algorithms in Python. Would you really want a Ph.D. statistician programming your e-commerce website? And vice versa, would you really want your Linux sysadmin developing your stochastic gradient descent algorithm?
Daniel Gutierrez
Most data scientists can get away with being mediocre software engineers. There are so few capable data scientists relative to demand that most companies can't afford to be too demanding. But I cannot stress enough how amazingly valuable it is to be a data scientist with top-notch engineering skills. If you're one of these rare souls, please contact me! Seriously, one thing most data scientists struggle with is top-down design. Most data scientists write one-off scripts. These scripts are hard to maintain, hard to extend, hard to refactor, etc. Understanding how to write well-abstracted, modular code is very valuable. And no, I'm not saying you need to follow some specific design pattern for every quick ad-hoc analysis.
Benjamin Paul Rollert
As it turns out, there are a number of skills that software developers often take for granted but that new data scientists don't possess -- and might not even have heard of. I looked into what these skills might be. I'll walk through the most common responses below, but I'd say the unifying theme for all of them is that many new data scientists don't know how to collaborate effectively. Perhaps they used to be academics and only worked alone or with a single collaborator. Perhaps they're self-taught. Regardless of the reason, writing code in an environment where many other people (and "other people" includes yourself at some later date) will be looking at, trying to understand, and using your code or its outcomes is a skill that has to be learned.
Writing modular, reusable code
Chances are good that you'll be asked to submit your code for a code review at some point. This is normal and doesn't mean people are skeptical of your work. Many organizations require that code be reviewed by multiple people before it is merged into production. Writing clean, reusable code will make this process easier for everyone and will lower the probability that you'll be rewriting significant portions of your code after the review.
Documentation / Commenting
Because other people are going to be reading your code, you need to learn how to write meaningful, informative comments, as well as documentation for the code that you write. It is a very good practice (although one most people probably don't follow) to write comments and documentation before you actually write the code. This is the coding equivalent of writing an outline before you write a paper. Unlike comments, documentation is a document written in English (or whatever language you speak), rather than in a programming language, that explains the purpose of the code you are writing, how it operates, example use cases, who to contact for support, and so on. It can be as simple as a README added to the directory where your code lives, or as elaborate as a full-fledged manual printed and given to users.
Version control
Many people use git as their version control system, although you may also encounter Mercurial (abbreviated hg) and Subversion (abbreviated svn). Terminology and exact workflows differ slightly, but the basic premise is usually the same. All of the code is stored in one or more repositories (repos), and within each repo you may have several branches -- different versions of the code. In each repo, the branch that everyone treats as the starting/reference point is called the master branch. GitHub is a service that hosts repos (both public and private) and provides a common set of tools for interacting with git.
Testing
There's a good possibility that, if you have no formal computer science training, you don't even know what I mean by "testing." I'm talking about writing code that checks your code for bugs across a variety of situations. The most basic type of test you can write is called a unit test: you write tests that describe the expected behavior of your code and that fail when that behavior is not produced. I'm working on another post about testing for data scientists, so I won't go into too much detail here, but I'd just like to state that it's a very important topic. Your code is going to interact with other code and with messy data, and you need to know what will happen when it runs on data that isn't as pristine as the data you're working with now.
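For instance, a unit test for messy input might look like the following sketch, written with Python's built-in unittest module; the parse_price function is a hypothetical example, not something from this post:

    import unittest

    def parse_price(raw):
        """Turn strings like '$1,234.50' or '1234.5' into a float,
        returning None for values that cannot be parsed."""
        if raw is None:
            return None
        cleaned = str(raw).strip().replace("$", "").replace(",", "")
        try:
            return float(cleaned)
        except ValueError:
            return None

    class TestParsePrice(unittest.TestCase):
        def test_clean_and_messy_inputs(self):
            self.assertEqual(parse_price("$1,234.50"), 1234.5)
            self.assertEqual(parse_price(" 19.99 "), 19.99)
            # Dirty real-world values should fail safely, not crash.
            self.assertIsNone(parse_price("N/A"))
            self.assertIsNone(parse_price(None))

    if __name__ == "__main__":
        unittest.main()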
Logging
Logging records what your code did and where it failed, so you can diagnose problems after the fact. For instance, if your logs tell you that the code didn't run because the file containing the products wasn't found, you immediately know to figure out whether the file was deleted, the network was down, and so on. If the code partially runs and fails on a specific product or customer, you can go inspect that particular piece of data and fix your code so it won't happen again. Disk space is cheap, so log generously. It's a lot easier (and faster) to go through a big directory of logs than to try to reproduce an unusual error in a large codebase or dataset. Make logging work for you -- log more things than you think you'll need. Be smart about logging when functions are called or when steps in your program are executed.
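A minimal sketch of generous logging with Python's standard logging module; the file names and pipeline step are invented for illustration:

    import logging

    logging.basicConfig(
        filename="pipeline.log",
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    log = logging.getLogger(__name__)

    def load_products(path):
        log.info("loading products from %s", path)
        try:
            with open(path) as f:
                rows = f.read().splitlines()
        except FileNotFoundError:
            # The failure mode described above: record it, then re-raise.
            log.error("products file not found: %s", path)
            raise
        log.info("loaded %d product rows", len(rows))
        return rows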
Rebecca Black
A lot of people ask me: how do I become a data scientist? I think the short answer is: as with any technical role, it isn't necessarily easy or quick, but if you're smart, committed, and willing to invest in learning and experimentation, then of course you can do it.
JAVA DEVELOPERS
If you're a Java developer, you are familiar with software engineering principles and thrive on crafting software systems that perform complex tasks. Data science is all about building "data products," essentially software systems that are based on data and algorithms.
PYTHON DEVELOPERS
If you're a Python developer, you are familiar with software development and scripting, and may have already used some Python libraries that are common in data science, such as NumPy and SciPy. Python has great support for data science applications, especially with libraries such as NumPy/SciPy, Pandas, and Scikit-learn, IPython for exploratory analysis, and Matplotlib for visualizations. To deal with large datasets, learn more about Hadoop and its integration with Python via streaming.
STATISTICIANS AND APPLIED SCIENTISTS
If you're coming from a statistics or machine learning background, it's likely you've already been using tools like R, Matlab, or SAS for years to perform regression analysis, clustering analysis, classification, or similar machine learning tasks. R, Matlab, and SAS are amazing tools for statistical analysis and visualization, with mature implementations of many machine learning algorithms. However, these tools are typically used for data exploration and model development, and are rarely used in isolation to build production-grade data products. In most cases, you need to mix in various other software components in Java or Python, and integrate with data platforms like Hadoop, when building end-to-end data products.
BUSINESS ANALYSTS
If your background is SQL, you have been using data for many years already and fully understand how to use data to gain business insights. Using Hive, which gives you access to large datasets on Hadoop with familiar SQL primitives, is likely to be an easy first step into the world of big data. Data science often entails developing data products that use machine learning and statistics at a level that SQL cannot describe well or implement efficiently. Therefore, the next important step toward data science is to understand these types of algorithms (such as recommendation engines, decision trees, and NLP) at a deeper theoretical level, and to become familiar with current implementations in tools such as Mahout, WEKA, or Python's Scikit-learn.
HADOOP DEVELOPERS
If you're a Hadoop developer, you already know the complexities of large datasets and cluster computing. You are probably also familiar with Pig, Hive, and HBase, and probably have experience in Java. A good first step is to gain a deep understanding of machine learning and statistics, and of how these algorithms can be implemented efficiently for large datasets. A good first place to look is Mahout, which implements many of these algorithms over Hadoop. Another area to look into is "data cleanup." Many algorithms assume a certain basic structure in the data before modeling begins. Unfortunately, in real life, data is quite "dirty," and making it ready for modeling tends to take up a large bulk of the work in data science. Hadoop is often a tool of choice for large-scale data cleanup and preprocessing prior to modeling.
FINAL THOUGHTS
Real-Life Application - The road to data science is not a walk in the park. You have to learn a lot of new disciplines and programming languages and, most importantly, gain real-world experience. This takes time, effort, and a personal investment. But what you find at the end of the road is quite rewarding.
Resources - There are many resources you might find useful: books, training, and presentations. You might find Cloudera useful for such training.
Assistance - One great way to get started on real-world problems and to train all of the above skills is to observe what the whole world is saying about data science and analytics, from the latest trends to career prospects. Check out this link to explore the world of analytics: http://www.jigsawacademy.com/explore-analytics/
Ashok Rampal
8 Data Skills to Get You Hired
This is the core set of 8 data science competencies you should develop:
1. Basic Tools: No matter what type of company you're interviewing with, you're likely going to be expected to know how to use the tools of the trade. This means a statistical programming language, like R or Python, and a database querying language like SQL.
2. Basic Statistics: At least a basic understanding of statistics is vital for a data scientist. This is also true for machine learning, but one of the more important aspects of your statistics knowledge will be understanding when different techniques are a valid approach. Statistics is important at all types of companies, but especially at data-driven companies where the product is not data-focused and product stakeholders depend on your help to make decisions and to design and evaluate experiments.
3. Machine Learning: If you're at a large company with huge amounts of data, or at a company where the product itself is especially data-driven, you'll want to be familiar with machine learning methods: k-nearest neighbors, random forests, ensemble methods, etc. -- all of the machine learning buzzwords. It's true that a lot of these techniques can be implemented using R or Python libraries, so it's not necessarily a dealbreaker if you're not the world's leading expert on how the algorithms work. It is more important to understand the broad strokes and to really understand when it is appropriate to use different techniques.
4. Multivariable Calculus and Linear Algebra: You may be asked to derive some of the machine learning or statistics results you employ elsewhere in your interview. Even if you're not, your interviewer may ask some basic multivariable calculus or linear algebra questions, since these form the basis of many of these techniques. You may wonder why a data scientist needs to understand this when there are so many out-of-the-box implementations in sklearn or R. The answer is that, at a certain point, it can become worthwhile for a data science team to build its own implementations in house. Understanding these concepts matters most at companies where the product is defined by the data and where small improvements in predictive performance or algorithm optimization can lead to huge wins.
5. Data Munging: Often, the data you're analyzing is going to be messy and difficult to work with. Because of this, it's really important to know how to deal with imperfections in data. Some examples include missing values, inconsistent string formatting (e.g., "New York" versus "new york" versus "ny"), and date formatting ("2014-01-01" vs. "01/01/2014", Unix time vs. timestamps, etc.); a short munging sketch follows this list. This matters most at small companies where you're an early data hire, or at data-driven companies where the product is not data-related (particularly because the latter have often grown quickly without much attention to data cleanliness), but this skill is important for everyone to have.
6. Data Visualization & Communication: Visualizing and communicating data is incredibly important, especially at young companies making data-driven decisions for the first time, or at companies where data scientists are viewed as people who help others make data-driven decisions. Communicating means describing your findings, or the way techniques work, to both technical and non-technical audiences. On the visualization side, it can be immensely helpful to be familiar with tools like ggplot and d3.js. It is important to be familiar not just with the tools for visualizing data, but also with the principles of visually encoding data and communicating information.
7. Software Engineering: If you're interviewing at a smaller company and are one of the first data science hires, it can be important to have a strong software engineering background. You'll be responsible for handling a lot of data logging and potentially for developing data-driven products.
8. Thinking Like a Data Scientist: Companies want to see that you're a (data-driven) problem solver. At some point during your interview process, you'll probably be asked about a high-level problem -- for example, a test the company may want to run or a data-driven product it may want to develop. It's important to think about which things matter and which don't. How should you, as the data scientist, interact with the engineers and product managers? What methods should you use? When do approximations make sense?
Engineers can always take it a notch higher and enroll in a good, credible analytics institute like Jigsaw Academy to make sure they have the right skills when entering the field of analytics.
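As promised in the data munging item above, here is a short hypothetical sketch of standardizing inconsistent strings and date formats; it assumes pandas, and the column names, alias table, and formats are invented for illustration:

    import pandas as pd

    CITY_ALIASES = {"new york": "New York", "ny": "New York"}

    def standardize(df):
        out = df.copy()
        # Collapse inconsistent spellings onto one canonical city name.
        key = out["city"].str.strip().str.lower()
        out["city"] = key.map(CITY_ALIASES).fillna(out["city"].str.strip())
        # Try each known date format; values matching neither become NaT.
        iso = pd.to_datetime(out["signup_date"], format="%Y-%m-%d", errors="coerce")
        us = pd.to_datetime(out["signup_date"], format="%m/%d/%Y", errors="coerce")
        out["signup_date"] = iso.fillna(us)
        return out

    raw = pd.DataFrame({
        "city": ["New York", "new york", " ny ", "Boston"],
        "signup_date": ["2014-01-01", "01/01/2014", "2014-02-03", "bad value"],
    })
    print(standardize(raw))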
Pratap Singh