What are the pros and cons of having complex SQL queries vs simple SQL queries but using other programming languages to handle the complexity?
Suppose I have some analysis to do on a set of data. Is it advisable to write simple SQL queries to extract the data and then rely on a programming language like Python to do the more complicated analysis? Or should I write a complex SQL query to handle everything? What are the pros and cons of each approach?
Answer:
My rule: if it can be done in SQL and isn't excessively painful or ugly, do it in SQL. This is for a couple of main reasons:
- Relational databases tend to be great at optimizing relational queries. If you're doing a typical relational operation (e.g. a SELECT, a JOIN, or an aggregation), it makes sense to do this in the database. I also tend to make ample use of WITH clauses to get to the appropriate result set.
- The code's closer to the data. It's perfectly clear what a query.sql file is doing when pointed at the appropriate database. Doing the operation in Python adds another level of abstraction, another I/O operation, and tends to make the code harder to understand (a JOIN in SQL looks much better than a JOIN in Python).
To cover one common objection: "But I barely know SQL! I know Python in and out." If you're ever touching relational data, stop. Go learn SQL. Fully grok relational operations. You'll save yourself a ton of time and pain in the future - SQL is meant for handling relational data well.
Of course, there are many reasons why using Python would be the right call:
- Complex machine learning models shouldn't be implemented at the database level (given current typical technologies).
- If you're writing many custom functions or stored procedures, there's a good chance you should be loading the data into Python.
- If the operation is a 200-line behemoth of a SQL query and a 5-line Python function, it should be in Python.
- If you need to heavily unit test a certain operation, that's generally much easier to do in Python.
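To make the comparison concrete, here is a minimal sketch of the same join-and-aggregate done once with a WITH clause in SQL and once by hand in Python. The customers/orders schema and all of the data are made up for illustration, and an in-memory SQLite database stands in for whatever database you actually use.

```python
import sqlite3

# Hypothetical schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5), (4, 3, 2.5);
""")

# SQL version: a WITH clause builds per-customer totals, then a JOIN
# and GROUP BY roll them up per region.
QUERY = """
WITH customer_totals AS (
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
)
SELECT c.region, COUNT(*) AS customers, SUM(t.total) AS revenue
FROM customers AS c
JOIN customer_totals AS t ON t.customer_id = c.id
GROUP BY c.region
ORDER BY c.region
"""
sql_rows = conn.execute(QUERY).fetchall()

# Python version: fetch both tables with simple queries, then join and
# aggregate by hand.
customers = conn.execute("SELECT id, region FROM customers").fetchall()
orders = conn.execute("SELECT customer_id, amount FROM orders").fetchall()

totals = {}
for customer_id, amount in orders:
    totals[customer_id] = totals.get(customer_id, 0.0) + amount

by_region = {}
for customer_id, region in customers:
    if customer_id in totals:  # inner-join semantics: skip customers with no orders
        count, revenue = by_region.get(region, (0, 0.0))
        by_region[region] = (count + 1, revenue + totals[customer_id])

python_rows = sorted(
    (region, count, revenue) for region, (count, revenue) in by_region.items()
)
```

Both versions produce the same rows, but the Python one has to pull every row of both tables out of the database first and spells out the join logic by hand.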
Ben Hamner at Quora
Other answers
I've rarely had to make a query that wouldn't be 10x uglier in an imperative language. And, as the other good answers say, you usually want to do the work close to the data (especially if it involves aggregation).
Toby Thain
Here are some thoughts in no particular order:
1. If you can do it in SQL without too much pain, you should. SQL is very good at what it does.
2. 'Too much pain' relates to how many things I will have to look up in order to draft the SQL, how hard it will be to verify that the SQL did what I expected, and how hard it will be to debug.
3. If an analysis is a one-off, I prioritize easy-to-write code over performant code; if it will be run repeatedly, I do the opposite. If the analysis is likely to be put into production, then I lean more on SQL, both because it tends to be more performant and because it is readable to a wider range of technical people.
4. If you are writing very complex SQL to perform essential tasks, then there is probably something inadequate in your data architecture. A well-designed and maintained data architecture is essential to successfully putting analytics into production. In some cases, this just means you need more views.
5. Here is a list of tasks that I often do on both sides of this divide: converting date-time strings, rescaling numbers (changing units), and calculating statistics derived from other columns. On the other hand, I almost always perform joins on the SQL side of the divide. And if a calculation requires iteration across rows, e.g. to calculate an exponential smoothing, then I do it in a higher-level language.
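As an illustration of a calculation that iterates across rows, here is a minimal sketch of simple exponential smoothing done in Python after a plain SELECT. The readings table, its values, and the alpha of 0.5 are all assumptions chosen for the example.

```python
import sqlite3

# Hypothetical table and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (day INTEGER PRIMARY KEY, value REAL);
    INSERT INTO readings VALUES (1, 10.0), (2, 20.0), (3, 30.0);
""")


def exponential_smoothing(values, alpha):
    """Simple exponential smoothing.

    s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    """
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(x)  # seed the series with the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# The SQL side stays trivial: fetch the ordered column, nothing more.
values = [v for (v,) in conn.execute("SELECT value FROM readings ORDER BY day")]
smoothed = exponential_smoothing(values, alpha=0.5)
```

Each smoothed value depends on the previous one, which is what makes this awkward to express as a set-based query and natural as a loop.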
Yasmin Lucero
Pros
- The query is much more succinct in SQL than in the caller language - you'll find your program simplified.
- Stored procedures (if available) can abstract away even more complexity than a single query.
- You work much closer to the data - i.e. less network traffic and bandwidth spent on "back and forth chatter".
- Potentially more performant (beyond less network traffic), as the SQL query optimizer can probably do a better job.
Cons
- Portability will suffer, as not all databases handle the same SQL variant.
- Some problems are simply not SQL's strong suit (for example, graph traversal) and will be better implemented upstream.
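For the graph traversal case, one way to handle it upstream is to fetch the edge list with a simple query and walk it in application code. The sketch below assumes a hypothetical edges table with made-up rows and uses a breadth-first search.

```python
import sqlite3
from collections import deque

# Hypothetical edge table and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a', 'b'), ('b', 'c'), ('a', 'd'), ('e', 'f');
""")

# One simple query pulls the edges; the traversal itself happens in Python.
adjacency = {}
for src, dst in conn.execute("SELECT src, dst FROM edges"):
    adjacency.setdefault(src, []).append(dst)


def reachable(start):
    """Return the set of nodes reachable from start (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


reachable_from_a = reachable("a")
```

This only makes sense when the edge list fits comfortably in memory; for a very large graph the trade-off may look different.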
Yinso Chen
Try to find a workaround in the SQL query that avoids cursors and other row-by-row iteration. Too many iterations in a query add load on the server, and slow server-side operations slow down the application. You need to find the right balance between query logic and programming-language logic.
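A small sketch of what avoiding iteration can look like: instead of looping over rows and issuing one UPDATE per row, a single set-based statement does the whole job. The products table, its rows, and the 50% discount are assumptions made up for the example.

```python
import sqlite3

# Hypothetical table and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, price REAL);
    INSERT INTO products VALUES (1, 'book', 10.0), (2, 'toy', 8.0), (3, 'book', 20.0);
""")

# Row-by-row alternative (the pattern to avoid): select the matching ids,
# then run one UPDATE per id inside a loop or cursor.

# Set-based version: one statement updates every matching row at once.
conn.execute("UPDATE products SET price = price * 0.5 WHERE category = 'book'")
conn.commit()

prices = conn.execute("SELECT id, price FROM products ORDER BY id").fetchall()
```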
Vinay Bysani
If it's an operation that visits a lot of data, the more you can do on the db engine, the better (especially if it's something done frequently). Also, SQL is very good at "in-line" operations, so you can have an extremely complex computation in a WHERE or select-list and SQL won't be terribly slow. What would be slow is table-scans on a big table or badly optimized joins, as is always the case.
In particular, you want to avoid "in-code joins" if possible. It is nearly impossible for your app-side code to execute an efficient join that doesn't turn into what DB people would call a nested-loop join, which is already the worst type of join a db can execute and may be dozens of times slower in app-side code versus in the db engine, due to the extra network traffic involved.
That said, there are plenty of times when you have to do operations in app-side code because SQL isn't good at it or stored procedures are too slow relative to fast app code. In particular, stored procedures in many db engines are interpreted, while compiled code will beat the pants off them in many operations, even with network overhead. (I recently converted an icky MySQL stored procedure to a bit of compiled app code + setup query that reduced the time spent for this operation from many hours to a few minutes.)
If you have to do this sort of thing, try to come up with a query that fetches all the data you need out to your app, including any joins, so your app-side code can just iterate through the select-list and execute its computation. Best results would be if your app-side code operates on a simple result-set and doesn't need to visit the db for anything else.
If you need to use a lookup table for something, it's probably best to either join to what you need in the initial query or simply fetch the lookup table contents into your app and store it in a local data structure, versus trying to visit the db when you need something out of it while processing your result set (this operation is what I call an app-side join).
As for writing results back to the db (assuming your app wants to do this), if performance is an issue, you may want to batch up results and send them in large, multi-row INSERTs versus sending a single INSERT with one row at a time.
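The last two points can be sketched as follows: load the lookup table into a local dict once, compute over a simple result set, and write the results back in multi-row INSERTs. The rates/payments tables, the conversion rates, and the insert_in_batches helper are all hypothetical, and the batch size of 2 is only there to keep the example small.

```python
import sqlite3

# Hypothetical tables and made-up rates, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rates (currency TEXT PRIMARY KEY, usd_rate REAL);
    CREATE TABLE payments (id INTEGER PRIMARY KEY, currency TEXT, amount REAL);
    CREATE TABLE payments_usd (payment_id INTEGER, usd REAL);
    INSERT INTO rates VALUES ('USD', 1.0), ('EUR', 2.0);
    INSERT INTO payments VALUES (1, 'USD', 5.0), (2, 'EUR', 3.0), (3, 'EUR', 1.5);
""")

# Fetch the lookup table once into a local dict instead of querying it per row.
rates = dict(conn.execute("SELECT currency, usd_rate FROM rates"))

# Iterate over one simple result set; no further db visits while processing.
results = [
    (payment_id, amount * rates[currency])
    for payment_id, currency, amount in conn.execute(
        "SELECT id, currency, amount FROM payments ORDER BY id"
    )
]


def insert_in_batches(rows, batch_size=2):
    """Write rows back using multi-row INSERT statements, batch_size rows each."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        placeholders = ", ".join(["(?, ?)"] * len(batch))
        params = [value for row in batch for value in row]
        conn.execute(
            f"INSERT INTO payments_usd (payment_id, usd) VALUES {placeholders}",
            params,
        )
    conn.commit()


insert_in_batches(results)
stored = conn.execute(
    "SELECT payment_id, usd FROM payments_usd ORDER BY payment_id"
).fetchall()
```

In a real application the batch size would be much larger, bounded by whatever limit your database places on statement size or bound parameters.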
Greg Kemnitz
An important reason: Many managers and other analysts are familiar with SQL and can HELP you develop the necessary queries. They often change. Doing it in SQL is also "self documenting", portable, and easier to understand than whatever language you are programming in. I do this all the time, and it ends up being more collaborative, and often faster, than if I just did it myself.
Ralph Winters
My two cents: it largely depends on the analysis you want to do with the data and the size / nature of the data you've got. For one thing, there are different kinds of SQL, with different performance concerns / capabilities. In some cases, if your data set is very large and running a SQL query for complex computation is very slow, Python will be a better choice for obvious reasons. Generally, SQL is good for binding data logic in routines. For example, if there's some data cleansing / basic data manipulation that you want to get done frequently or on a regular basis, you might want to set up an automated routine task by calling a stored procedure in SQL. On the other hand, there are some tasks that are intrinsically hard (or at least non-trivial) for SQL to handle, such as random number generation. Therefore, it's easier to use Python to generate a random sample. That said, if you are only familiar with one of SQL and Python, you might want to discuss your plan of action with an expert in either (or both) languages based on the data set and the task you have. I believe that'll land you on a nice solution faster than asking a vague question here. :)
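For the random sample case, a minimal sketch might fetch the candidate ids with a simple query and draw the sample in Python. The users table, the sample size of 10, and the seed of 42 are assumptions for the example; the fixed seed is only there to make the sample reproducible.

```python
import random
import sqlite3

# Hypothetical table and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(1, 101)])

# Simple SQL to get the candidates, Python to draw the sample.
ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]

rng = random.Random(42)          # seeded generator so the sample can be reproduced
sample_ids = rng.sample(ids, 10)  # 10 distinct ids, sampled without replacement
```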
Ji Li
I saw a coworker, a senior Java programmer (with a PhD), struggling to get some code written. He was behind on his deadline. I asked him what he was trying to do and if I could help.
He said he needed data from two tables. He needed all data from the first table and only matching data from the second table. Where there was no matching data, he still wanted the data from the first table, but with nulls for the objects where there was no match in the second table.
"What's the problem? That's an outer join in SQL," I replied.
He said he didn't know how to use an outer join, so he had spent many hours trying to code the task from scratch using Java. He was fetching all the data from both tables, matching it up, and discarding data from the second dataset that didn't match. He was trying to make the matching more efficient by implementing some sophisticated sorting algorithm and a B-tree data structure.
"No, really, the database can do that easily, it'll run faster, require less code, and you'll only fetch the results of the match. I'll show you how," I offered.
His face grew dark, and he literally said to me, "I forbid you from showing me an outer join!"
He had invested so much in implementing his data structure that he couldn't bear to admit he had wasted his time. He would rather finish his pages of Java code, and never know how to do the same task in one line of SQL.
Don't be that guy. Use the right tool for the job.
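For readers who have not met it, this is roughly what the outer join in the story looks like. The employees/badges tables and their rows are made up for the example, and the query is run through Python's sqlite3 module only so that it is self-contained.

```python
import sqlite3

# Hypothetical tables and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE badges (employee_id INTEGER, badge_no TEXT);
    INSERT INTO employees VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO badges VALUES (1, 'B-100'), (3, 'B-300');
""")

# All rows from the first table, matching rows from the second,
# and NULLs (None in Python) where the second table has no match.
rows = conn.execute("""
    SELECT e.name, b.badge_no
    FROM employees AS e
    LEFT OUTER JOIN badges AS b ON b.employee_id = e.id
    ORDER BY e.id
""").fetchall()
```

Bob has no badge row, so he still appears in the result with None in the badge column.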
Bill Karwin
It all depends on the nature of the analysis task. There are usually two kinds of task: one has an explicit computational target for statistical analysis; the other has no specific target and is for exploratory discovery and analysis.
For the former, general structured-data computation can be done directly with complex SQL queries. If the work mainly involves string analysis, it should be done with simple SQL plus Perl or Python. For predictive analysis, you should use R or SPSS. For large-scale data computation, you should use Hadoop or esProc. If the target of the analysis is too complex and needs to be decomposed into simpler steps, R and esProc are more suitable.
For exploratory analysis, a simple SQL fetch plus an analysis language is obviously the best choice. Such analysis requires you to make guesses and judgments based on the current data, and then decide which algorithm to use according to the characteristics of the data. In SQL it is difficult to observe the results of each step, and it is very difficult to achieve distributed computing, so SQL is not suitable for exploratory analysis. In contrast, esProc and R are more suitable for this kind of work.
There are many cases showing the differences and connections among R, SQL, esProc, Perl, Python, Hadoop, etc. at the blog http://datakeyword.blogspot.com/, which could help you choose the right tools.
Davina May