How broad and deep should unit testing be when building a data analysis program?
-
As you know, unit testing is critical for making sure your algorithms are correct. I also believe that when your requirements or purpose change quickly, it makes little sense to write unit tests. In addition, for a data analysis program, constructing test data for unit tests would take a great deal of time. Could anyone shed some light on how unit testing is done in rapid data analysis work?
-
Answer:
I don't think anyone uses traditional unit tests in rapid prototype development for data analysis. There are multiple reasons for this:
1) Very often you will be using or building on standard tools or libraries that have already been tested and have intuitive APIs. So the complexity of your own code will be limited, and there aren't substantial algorithmic mistakes you can make.
2) Very often data analysis tools do not give a precise answer. For example, they may be based on stochastic gradient descent or Markov chain Monte Carlo, so in theory running them twice can give numerically different results.
The technique roughly equivalent to testing in predictive data analysis is cross validation. This involves splitting your data into distinct training and validation/test sets. You do your analysis (fit the model) on the training set, and then you evaluate the predictive performance of the model on the held-out test set. This helps avoid the common problem of overfitting, and it also highlights most bugs or algorithmic mistakes that may exist in your code. Strictly speaking, this process is more analogous to integration testing.
Very often, when researchers develop new techniques, they test their algorithms on synthetic toy datasets where they know roughly what to expect. You can think of this process as a form of testing, although in very few cases do researchers apply the level of rigour that software engineers would.
Ferenc Huszár at Quora
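To make the hold-out idea from the answer above concrete, here is a minimal sketch that is not part of the original answer. It assumes a synthetic toy dataset drawn from a known line, a least-squares fit as a stand-in for "your analysis", and a single train/test split (the simplest form of cross validation). All function names, the seed, and the tolerances are illustrative assumptions.

```python
import random

def make_toy_data(n, slope=2.0, intercept=1.0, noise=0.1, seed=0):
    """Synthetic data from a known line, so we know roughly what to expect."""
    rng = random.Random(seed)  # fixed seed keeps the check repeatable
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for the real analysis)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mean_squared_error(model, xs, ys):
    """Average squared prediction error of the fitted line on (xs, ys)."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def holdout_check(n=200, train_fraction=0.75):
    """Fit on the training part, score on the held-out part."""
    xs, ys = make_toy_data(n)
    split = int(n * train_fraction)
    model = fit_line(xs[:split], ys[:split])
    test_mse = mean_squared_error(model, xs[split:], ys[split:])
    return model, test_mse

model, test_mse = holdout_check()

# Tolerances are deliberately loose: the point is to catch gross bugs,
# not to pin down exact numbers.
assert abs(model[0] - 2.0) < 0.1   # recovered slope is near the true slope
assert abs(model[1] - 1.0) < 0.2   # recovered intercept is near the true one
assert test_mse < 0.05             # held-out error is near the noise variance
```

Because the tools in question may be stochastic, the sketch fixes the random seed and compares against tolerances instead of exact values.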
Other answers
In http://blog.mpacula.com/2011/02/17/unit-testing-statistical-software/, the author of http://blog.mpacula.com/ suggests the following testing ideas:
a. Count your outcomes.
b. Keep probability distributions legal (non-negative and summing to one).
c. See that probability marginals make sense.
d. Compare your results to a simple baseline, to see whether they give any improvement over the baseline.
e. Whenever possible, use a mathematically sound score instead of magic numbers.
Yuval Feinstein
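Checks a through d from the list above can be written as plain assertions. The following is a rough sketch and is not taken from the linked post: the helper names, the toy joint distribution, and the labels and predictions are all made up for illustration.

```python
import math

def is_legal_distribution(probs, tol=1e-9):
    """Check b: every probability is non-negative and the total is one."""
    return all(p >= 0.0 for p in probs) and math.isclose(sum(probs), 1.0, abs_tol=tol)

def marginal(joint, axis):
    """Helper for check c: sum a joint table {(a, b): p} down to one variable."""
    totals = {}
    for outcome, p in joint.items():
        totals[outcome[axis]] = totals.get(outcome[axis], 0.0) + p
    return totals

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def majority_baseline(labels):
    """Helper for check d: always predict the most common label."""
    most_common = max(set(labels), key=labels.count)
    return [most_common] * len(labels)

# Made-up joint distribution over (weather, carries_umbrella).
joint = {("rain", True): 0.3, ("rain", False): 0.1,
         ("dry", True): 0.1, ("dry", False): 0.5}
weather = marginal(joint, axis=0)

# Made-up labels and model predictions, for illustration only.
labels = [1, 1, 1, 0, 1, 0, 1, 1]
predictions = [1, 1, 1, 0, 1, 0, 1, 0]

assert len(predictions) == len(labels)              # a. count your outcomes
assert is_legal_distribution(joint.values())        # b. legal distribution
assert is_legal_distribution(weather.values())      # c. marginal is legal too
assert math.isclose(weather["rain"], 0.4)           # c. and has the expected value
assert accuracy(predictions, labels) > accuracy(majority_baseline(labels), labels)  # d.
```

In a real project the baseline would normally be computed from training labels and scored on held-out data; the sketch reuses one list only to stay short.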
Always make sure you have report summaries and control totals that tie out whenever you make a change, so you can use them as a sanity check before you go to production. You can set tolerance values for the maximum or minimum of a range. If you make a change that breaks the rules, construct another rule! That is an agile way of doing this. After a while your control checks will get better and better.
Ralph Winters
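One possible reading of the control-total advice above, sketched under assumptions: the record layout, the "amount" field, the rounding step, and the tolerance values are hypothetical and stand in for whatever the real pipeline does.

```python
def control_totals(rows):
    """Summary figures that should tie out before and after a change."""
    return {
        "row_count": len(rows),
        "amount_sum": sum(r["amount"] for r in rows),
    }

def tie_out_failures(before, after, tolerances):
    """Names of the controls whose difference exceeds the allowed tolerance."""
    return [name for name in before
            if abs(after[name] - before[name]) > tolerances[name]]

def out_of_range(rows, low, high):
    """Rows whose amount falls outside the allowed [low, high] range."""
    return [r for r in rows if not low <= r["amount"] <= high]

# Hypothetical input and a cleaning step that rounds amounts to cents.
raw = [{"amount": 120.504}, {"amount": 80.25}, {"amount": 99.999}]
cleaned = [{"amount": round(r["amount"], 2)} for r in raw]

# Row counts must match exactly; sums may drift slightly because of rounding.
tolerances = {"row_count": 0, "amount_sum": 0.01}

# The intended change ties out and stays in range.
assert tie_out_failures(control_totals(raw), control_totals(cleaned), tolerances) == []
assert out_of_range(cleaned, low=0.0, high=1000.0) == []

# A buggy change that silently drops a row is caught by both controls.
broken = cleaned[:-1]
assert tie_out_failures(control_totals(raw), control_totals(broken), tolerances) == [
    "row_count", "amount_sum"]
```

When a legitimate change breaks one of these rules, the answer's suggestion amounts to adding or adjusting an entry in `tolerances` (or a new control in `control_totals`) so the check set grows over time.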
In a data project I would test:
- Data formats and media. When there is a script that reads data from a DB and writes it into a text file, I'll have a test with golden DB contents to parse and a golden output file to compare against.
- API calls. When there is a way to invoke an API to produce the result, I would like this API call to be tested. If the API allows adding data, the test can be short and complete. If the API relies on some data being present or uploaded, I would not hesitate to write some code to make fake data available.
- Modeling. If your tool builds models and applies them, a good end-to-end test would be one that generates some train and test datasets according to some pattern and then uses your tool to train and then apply the model.
- End-to-end. A regression test that serves production-like traffic.
I consider the following simple tools must-haves: 1) Dump a fraction of user traffic into a file. 2) Replay "queries" from this file.
Hope this helps. Thanks, Dima
Dima Korolev
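The golden-file test and the dump/replay tools from the answer above might look roughly like this. It is a sketch under assumptions: the `users` table, the tab-separated export format, the `handle` function, and the JSON-lines log layout are all hypothetical, and an in-memory `io.StringIO` stands in for the traffic file.

```python
import io
import json
import random
import sqlite3

def make_golden_db():
    """Golden DB contents: a small, fixed in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(2, "Grace"), (1, "Ada")])
    return conn

def export_users(conn):
    """Hypothetical export script: dump the users table as tab-separated text."""
    rows = conn.execute("SELECT id, name FROM users ORDER BY id")
    return "".join(f"{uid}\t{name}\n" for uid, name in rows)

GOLDEN_OUTPUT = "1\tAda\n2\tGrace\n"  # the golden output "file"

def handle(query):
    """Hypothetical production handler whose behaviour we want to pin down."""
    return {"q": query["q"], "length": len(query["q"])}

def dump_sample(queries, handler, out, fraction, seed=0):
    """Tool 1: write a random fraction of traffic, with responses, as JSON lines."""
    rng = random.Random(seed)
    for query in queries:
        if rng.random() < fraction:
            out.write(json.dumps({"query": query, "response": handler(query)}) + "\n")

def replay(log, handler):
    """Tool 2: re-run logged queries and return those whose response changed."""
    mismatches = []
    for line in log:
        record = json.loads(line)
        if handler(record["query"]) != record["response"]:
            mismatches.append(record["query"])
    return mismatches

# Golden test: the export of the golden DB must match the golden output.
assert export_users(make_golden_db()) == GOLDEN_OUTPUT

# Replay test: fraction=1.0 keeps the example deterministic; a real dump
# would sample a much smaller fraction of traffic.
traffic = [{"q": "unit testing"}, {"q": "cross validation"}, {"q": "golden files"}]
log = io.StringIO()
dump_sample(traffic, handle, log, fraction=1.0)
log.seek(0)
assert replay(log, handle) == []  # unchanged handler reproduces every response
```

Replaying the same log against a modified handler then lists exactly the queries whose responses differ, which is the regression signal.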