How to merge datasets in Stata conditionally?

Recommended Stata alternatives?

  • What is the best free substitute for Stata? I use Stata at university, but only have access to it during term-time. During vacation, I'd like a free program that can open large Stata (or SPSS or SAS) datasets, manipulate the data (pooling datasets, generating variables, etc.), and run regressions. I don't expect to use any particularly complex statistical methods. Ease of use is a priority: I have only a few weeks and a fair amount of work to do, so although I could probably get by with a command line, a GUI solution would be great. Graphs would be nice, but are not essential. Windows support would be ideal, but Linux-only is acceptable. I can use Google, so I'm not just looking for a list of open source statistical software; rather, I'd like recommendations of good Stata substitutes. Thanks!

  • Answer:

    Here's an example of how to subset data with the mtcars dataset that comes with R:

        > mtcars[1:5,]
                           mpg cyl disp  hp drat    wt  qsec vs am gear carb
        Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
        Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
        Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
        Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
        Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
        > mtcars.mpg = subset(mtcars, select=c(mpg, cyl, hp))
        > mtcars.mpg[1:5,]
                           mpg cyl  hp
        Mazda RX4         21.0   6 110
        Mazda RX4 Wag     21.0   6 110
        Datsun 710        22.8   4  93
        Hornet 4 Drive    21.4   6 110
        Hornet Sportabout 18.7   8 175
        > ?lm
        > summary(lm(mpg ~ hp, mtcars.mpg))

        Call:
        lm(formula = mpg ~ hp, data = mtcars.mpg)

        Residuals:
            Min      1Q  Median      3Q     Max
        -5.7121 -2.1122 -0.8854  1.5819  8.2360

        Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
        (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
        hp          -0.06823    0.01012  -6.742 1.79e-07 ***
        ---
        Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        Residual standard error: 3.863 on 30 degrees of freedom
        Multiple R-Squared: 0.6024, Adjusted R-squared: 0.5892
        F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

        > library(lattice)
        > xyplot(mpg ~ hp, mtcars.mpg, type=c("p", "r"))

    Listen to ROU_Xenophobe.

matthewr at Ask.Metafilter.Com


Other answers

"I suppose my question is: in the short term, what does R do for me that makes up for the extra learning effort?"

Nothing. Well, nothing + epsilon: R will do some things that Stata won't, and R is good for simulation, if you care about that. But if you can't afford Stata, then the choice isn't between Stata and R, it's between R and other free stuff. I would be utterly astonished if you found a freeware workalike or work-similar to Stata. No doubt there is other free stuff that will manipulate data and run regressions, but my sense is that it's generally either crippled in some way, user-hostile, or doesn't have a useful community around it. R isn't crippled (if you can do it, you can do it with R), and the community built up around it is very useful and keeps expanding its capabilities. If you have to learn anything about any software other than Stata, it should be R, because what you learn won't be useless to you in a year.

ROU_Xenophobe

And yeah, getting a CSV out of Stata is as easy as "outsheet using filename, comma". I just have no idea how to do that in SPSS or SAS, because I've developed a real loathing for them. SPSS I just never liked. And SAS pissed me off real bad in like 1995 when I found that while every version of SAS uses the same godawful syntax from 1787 or whenever it first came out, you couldn't transparently move a SAS-for-Unix dataset to SAS-for-PCs, even with the same version number. Fuck you, SAS.

ROU_Xenophobe
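To illustrate the CSV route described above, here is a minimal R sketch of reading such an export. The file contents and column names are invented for illustration, and a temporary file stands in for the output of "outsheet using filename, comma". It assumes the file fits in memory; setting a column's class to "NULL" in colClasses makes read.csv skip that column, which may help with very wide files.

```r
# Stand-in for a CSV exported from Stata with: outsheet using filename, comma
# (hypothetical file and columns, created here so the example runs on its own)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,wage,age,region",
             "1,12.5,34,north",
             "2,9.75,51,south",
             "3,15.0,28,north"),
           csv_path)

# Read only the columns of interest. A colClasses entry of "NULL" tells
# read.csv to skip that column entirely (here: region).
wages <- read.csv(csv_path,
                  colClasses = c("integer", "numeric", "integer", "NULL"))

str(wages)
```

Declaring the column types up front also saves read.csv from having to guess them, which tends to matter more as files get larger.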

R will read all of those formats (see http://cran.r-project.org/doc/manuals/R-data.html#EpiInfo-Minitab-SAS-S_002dPLUS-SPSS-Stata-Systat). I can't say that it is especially easy to use at first, but there are numerous GUIs for it, on both Linux and Windows.

grouse

R (http://www.r-project.org/) can read Stata datasets, and there's a Windows version. Do you need to run Stata scripts too?

mkb

No, I don't think I need Stata scripts. Some kind of logging would be nice though. One thing I should have emphasised is that the datasets I'm using are very large (650 MB or more). I need to be able to open these files and do things like generating new variables based on existing data, but I'm only going to be running regressions on about a dozen variables (out of well over a thousand), so I'll probably want to extract the desired variables into a new, smaller dataset.

matthewr
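The workflow described above (open a Stata file, keep a handful of variables, generate new ones, save a smaller dataset) might look roughly like this in R. This is a sketch under assumptions: the variable names are invented, a tiny file is written first so the example is self-contained, and it uses read.dta from the foreign package, which ships with standard R installations and reads older Stata formats (versions 5 to 12). R holds the whole dataset in memory, so a 650 MB file needs a machine with enough RAM.

```r
library(foreign)  # provides read.dta() and write.dta()

# Build a small stand-in dataset and save it in Stata format
# (hypothetical variables, only so the example can run on its own).
full <- data.frame(id     = 1:4,
                   income = c(21000, 35000, 18000, 52000),
                   hours  = c(35, 40, 20, 45),
                   unused = c(0, 1, 0, 1))
dta_path <- tempfile(fileext = ".dta")
write.dta(full, dta_path)

# Load the Stata file and keep only the variables needed for the regressions.
survey <- read.dta(dta_path)
small  <- subset(survey, select = c(id, income, hours))

# Generate new variables from existing ones,
# roughly what "gen log_income = log(income)" does in Stata.
small$log_income <- log(small$income)
small$full_time  <- as.integer(small$hours >= 35)

# Save the smaller dataset in R's own format for quick reloading later.
rds_path <- tempfile(fileext = ".rds")
saveRDS(small, rds_path)
```

Reloading with readRDS(rds_path) then skips the import step entirely in later sessions.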

I'm not sure what kind of logging Stata does, but R automatically saves all the commands you run to a history file. This is invaluable to me when I come back several months later and can't figure out just how I created that particular plot.

grouse
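On the logging question, one possible approach (a sketch, not the only way) is R's sink() function, which diverts printed output to a file, somewhat like a Stata log. Note that sink() records output only, not the commands themselves; in an interactive session, savehistory() can write the command history to a file as well. The log file here is a temporary file chosen for illustration.

```r
# Hypothetical log file location; any writable path would do.
log_file <- tempfile(fileext = ".txt")

# Start diverting printed output to the log file.
# split = TRUE keeps echoing the output to the console as well.
sink(log_file, split = TRUE)

print(summary(mtcars$mpg))                       # descriptive statistics
print(coef(lm(mpg ~ hp, data = mtcars)))         # regression coefficients

# Stop logging and restore normal output.
sink()

# The log can be read back or opened in any text editor.
log_lines <- readLines(log_file)
```

Calling sink() with no arguments is what closes the log, so it is worth making sure it runs even if an earlier command fails.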

Not free, but if your university is a participant in Stata's GradPlan, a permanent license for Intercooled ("normal") Stata 10 is only $155. And then you have Stata 10.x forever and can throw it on your laptop, etc. If it's got to be free, R. R is not user friendly. At best, it vaguely tolerates users with a grumpy sniff. In your shoes, I would load all of the datasets I think I'm going to use and dump them to CSV from their home software rather than screw around with beating R into importing proprietary datasets. Learning the other parts of R will be enough of a pain in the ass, and it should be easy to make Stata/SPSS/SAS vomit a CSV.

ROU_Xenophobe

The ?lm above gets the help for lm. If you want an R book I highly recommend An Introduction to S and S-Plus by Phil Spector, which works just fine for R. Unfortunately the online introductions I have seen all suck, as do several of the books with R in their name.

grouse

ROU_Xenophobe: Unfortunately, it has to be free. No GradPlan, so the cheapest Stata is £320. I'll definitely follow the CSV suggestion, thanks.

Many thanks for the R code, grouse. As you and ROU say, R is not particularly easy to use. I can see that if the choice were between Stata and R, perhaps the effort of learning R would pay off in the long run. But I've already chosen Stata, and I'm looking for a temporary substitute for it rather than a full-blown long-term replacement. With that in mind, I suppose my question is: in the short term, what does R do for me that makes up for the extra learning effort?

matthewr
