What are some good "beginner level" data modeling/analytics approaches to kick start a data science/analytics team?

Be the "data guy" my current boss needs (and future bosses want)?

  • I'm in academia. Two related questions: 1) Can you recommend readings, forums, programs, and tools under the broad heading "data analysis"? 2) What's the next job I should be looking for?

    Apologies in advance for identity-obfuscating vagueness from the Sock. I'm in academia. My new boss's boss came in as a data-driven administrator. He's an outlier in the discipline: he's interested in data for a field that doesn't do much quantitative analysis. He has a PhD in a math- and statistics-heavy discipline and is leading a unit bereft of anything resembling quantification or visualization savviness (from the faculty, staff, or administration). Which, I think, is precisely why he was hired.

    Happily, I've found some success under the new regime. My niche role rewards technical knowledge and data collection (but doesn't necessarily require it). I was among the first to walk into his office with well-formatted Excel charts drawn from a robust Access database I developed. But... I want to do more.

    My education is mostly non-technical (after a false start as an engineering student that had 18-year-old me sitting in front of a Sun terminal plodding through Fortran 77, and sleeping through differential equations), but I've come to my current role because I have a high tolerance for dealing with information overload and searching for answers.

    Here are some things I'm taking stabs at:

    - Web scraping: I've figured out the browser extensions and XPath syntax that make it easier to pull data off the web. I've learned enough Python to get BeautifulSoup (and to a lesser extent, Scrapy) running some Hello World scripts.
    - Visualization: Boss is sending me to a Tufte course and I've skimmed through a couple of his books.
    - I get a kick out of r/dataisbeautiful.
    - R and Stata? I know enough to drop the names, but haven't explored them much.
    - PDF text parsing: I get a lot of data as text-only PDFs. This is a terrible state of affairs (that's not going to change any time soon).

    Beyond the technical challenges, I'm suffering some "grass is greener" envy. For many of the tasks and projects I'm contemplating, there are analysis firms and packages that could almost certainly do it. But, as they've explained, neither we nor our academic niche have the budget to make a business case for them to take any interest. Which makes me think that I ought to develop these skills to employ for the bigger fish they're working with. The most helpful career advice would come if I were willing to explain my niche, but broadly: what are the jobs related to the issues and skills I've mentioned? Thanks!

  • Answer:

    If you've already learned enough Python to get BeautifulSoup working, I think you should look at NumPy, the plotting capabilities of Matplotlib, and if you need scientific data analysis, the analysis capabilities of SciPy. NumPy is a professional-grade package that now provides the foundation for a wide range of scientific data analysis - it is fantastic. I don't have as glowing a report for Matplotlib, but let's say it is regularly improving, and things have gotten a lot more stable since it went 1.0. Working through the NumPy and Matplotlib tutorials online should get you a solid footing with numerical data analysis and visualization. (And if you need to install the whole thing, look at the academic licensing for the Enthought Python distribution - or Canopy, as it is now styled.)
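
To give a feel for what the NumPy tutorials lead up to, here is a minimal sketch of a first numerical analysis. The yearly counts are invented purely for illustration, and the variable names are hypothetical; only `np.array`, `np.diff`, `mean`, and `np.polyfit` are real NumPy calls.

```python
import numpy as np

# Hypothetical yearly counts (illustrative numbers only, not real data).
years = np.array([2010, 2011, 2012, 2013, 2014])
counts = np.array([120, 135, 150, 160, 185])

# Basic summary: average level and year-over-year change.
mean_count = counts.mean()
yearly_change = np.diff(counts)
pct_change = yearly_change / counts[:-1] * 100

# Least-squares straight line through the points.
# Years are shifted to start at 0 to keep the fit numerically well behaved.
slope, intercept = np.polyfit(years - years[0], counts, deg=1)

print("mean:", mean_count)
print("change per year:", yearly_change)
print("percent change:", np.round(pct_change, 1))
print("fitted trend (counts per year):", round(slope, 2))
```

The same `years` and `counts` arrays could then be handed to Matplotlib's `plt.plot` for a first chart, which is roughly where the Matplotlib tutorial picks up.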

Admiral Sock at Ask.Metafilter.Com


Other answers

Coursera is offering a paid Data Science specialization (https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage) that I've been window-shopping for a while. I have no idea how good it is or whether it's worth the time/money investment, but it's there. Codeschool has a free Try R course (https://www.codeschool.com/courses/try-r). School of Data has a ton of free courses (http://schoolofdata.org/courses/).

ourobouros

In my mind, there's no reason to pay for JHU's Data Science specialization on Coursera unless your work is footing the bill and you're not burning through a finite, small pile of money. You can take all of the courses for free except the capstone, and I haven't heard good things about the capstone. The learning curve for the specialization is harsh if you haven't done any substantive coding, and the level of explanation and forum discussion is highly variable between the courses. That said, I'm in an analogous situation and learning bits of R and Python all the same, using those and other resources. Try the Codecademy or DataCamp intros to get you started and see if anything clicks for you. On the statistics front, if you need an intro (or a refresher) I can highly recommend the statistics course (https://www.coursera.org/course/statistics) by Dr. Mine Çetinkaya-Rundel at Duke. It starts again in early March, and as an added bonus, the labs are all conducted in R (using DataCamp) and very well put together.

deludingmyself

Hard to answer without really knowing what type of data you're working with, or what you're trying to do with it. R is free, and with the R Commander (Rcmdr) plugin you can run it from a menu-driven interface rather than by coding directly. Sounds like you could use a basic statistics class. Most of them teach a statistical analysis package along the way. R is open source, Stata is user-friendly, and SAS is quite common though more for experts than dabblers. But don't forget: the first part of making decisions with data is having data. If you're talking about making work decisions, then the big job is collecting it. If you track information, often a simple analysis ("how much time and money are we spending on this task? how much benefit are we getting from it? is it worth it?") is all you need to make an informed decision.

entropone
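
The "simple analysis" entropone describes (time and money spent on a task versus the benefit from it) can be sketched in a few lines of plain Python. This is only an illustration: the `Task` class, its field names, and every number below are hypothetical, and putting a value on "benefit" is the hard part that no code solves for you.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One tracked activity. All figures are made up for illustration."""
    name: str
    hours_per_month: float
    hourly_cost: float       # staff cost per hour
    monthly_benefit: float   # estimated value produced per month

    @property
    def monthly_cost(self) -> float:
        # Time spent converted to money.
        return self.hours_per_month * self.hourly_cost

    @property
    def net_benefit(self) -> float:
        # Positive means the task pays for itself under these estimates.
        return self.monthly_benefit - self.monthly_cost


tasks = [
    Task("Manual report formatting", hours_per_month=20, hourly_cost=30.0, monthly_benefit=400.0),
    Task("Survey follow-up", hours_per_month=5, hourly_cost=30.0, monthly_benefit=500.0),
]

# List the worst-performing tasks first so they get attention.
for task in sorted(tasks, key=lambda t: t.net_benefit):
    verdict = "worth it" if task.net_benefit > 0 else "reconsider"
    print(f"{task.name}: cost {task.monthly_cost:.2f}, net {task.net_benefit:+.2f} ({verdict})")
```

A spreadsheet does the same job; the point is that the tracking matters more than the tool.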

Various departments at my university do "statistical computing" classes that will cover various computing packages, ranging from entire semesters covering R to a couple of days to learn Stata (which is much more user friendly). You might check in math/stats, but also in various social science departments. In my experience, most TAs would let outside people sit in unofficially/off the books as long as they didn't do annoying things like try to dominate the Q&A or take up a ton of time in office hours. As far as which to learn, I would start with Stata simply because you can probably learn it in a few days on your own, using online user manuals and/or a couple of books checked out of the library. I don't have much of a computing background and I taught it to myself. I think R would be tougher to tackle on your own unless you are already pretty programming savvy, so for that I would look for a workshop or class.

rainbowbrite

Are you in a city? Check out a Tableau demo. You probably won't use this stuff in academia, but at least you could get a taste of how people make this easy for themselves.

oceanjesse

For PDF text parsing: if you have access to a Unix server, I use pdftotext (http://linux.die.net/man/1/pdftotext) to get text output that is easier to process automatically. Outside of Unix, you might find some desktop apps simply by searching for pdftotext on Google, Yahoo, etc. Seems like you're otherwise well on your way to building a tool box. Just be ready to itemize these skills so you can get the most out of your resume if/when the time comes. I don't know if these departments have died out or are still alive in companies, but you'll be seeking out Data Insights and Data Warehouse groups to get employed in, along with companies that do analytics, targeted advertising, and other related data-intensive work. Good luck! Sounds like you've been employed in a rather important part of science; not the most interesting or fulfilling, sure, but important.

JoeXIII007
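
Since the asker already knows some Python, here is a rough sketch of how the pdftotext suggestion above might be wired into a script. It assumes the `pdftotext` command is installed and on the PATH; the `-layout` option (preserve the page's physical layout) and `-` (write to standard output) are standard pdftotext arguments. The helper names `pdf_to_text` and `split_columns`, and the two-or-more-spaces rule for splitting columns, are assumptions made for illustration and will not suit every PDF.

```python
import re
import subprocess


def pdf_to_text(pdf_path: str) -> str:
    """Run `pdftotext -layout` on a file and return the extracted text.

    Requires the pdftotext command-line tool to be installed.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,  # raise an error if pdftotext fails
    )
    return result.stdout


def split_columns(text: str) -> list[list[str]]:
    """Split layout-preserved text into rows of cells.

    Assumption: columns are separated by runs of two or more spaces,
    which is common in -layout output but not guaranteed.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(re.split(r"\s{2,}", line))
    return rows


# Example with text shaped like pdftotext output (made-up content).
sample = "Name        Dept       Count\nJane Doe    History    12\n\n"
print(split_columns(sample))
```

In real use the two pieces would be chained as `split_columns(pdf_to_text("report.pdf"))`, where `report.pdf` is a placeholder file name.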

The job title you're looking for is 'Data Scientist'. It's a great field to get into right now if this is a subject you're interested in: demand for competent data scientists greatly exceeds supply. I'll add another recommendation for the free data science specialization track on Coursera. It'll give you a good, structured overview of the field, and after completing it you can decide which topics are of most interest to you. Regarding software, R and Python are both widely used (as is SAS, but the other two are open source).

btkuhn

Instead of going to academic departments, does your university have a data and information services division of its library/ies? They often offer workshops and access to public data sets.

Madamina

I strongly recommend Udacity's data science courses (https://www.udacity.com/courses#!/data-science). They have quite a lot of data science material, and it's all free to view.

greytape
