How to plot a histogram?

How do I plot a histogram of data which is very large in GNUPlot / Octave or any other scientific tool?

  • I have a dataset of float values that I want to analyze. It has around 92 million data points (24 million of them unique), ranging from 1e-13 to 0.1. To get an idea of how they are distributed, I sorted them and ran uniq -c to get each unique value preceded by its count. It would be nice if I could get a histogram of these values where the application decides on an appropriate binning. As a preliminary attempt, I simply tried plotting the values on the X axis against their counts on the Y axis, but gnuplot fails with the error "out of memory for expanding curve points". Is there a way I could see a histogram?

  • Answer:

    I don't know about gnuplot, but I looked up your problem and found this page: http://stats.stackexchange.com/questions/7348/more-efficient-plot-functions-in-r-when-millions-of-points-are-present

    From it, there is this answer (http://stats.stackexchange.com/a/7356, by Dirk Eddelbuettel, http://stats.stackexchange.com/users/334/dirk-eddelbuettel, Feb 18 '11): Look at the hexbin package (http://cran.r-project.org/package=hexbin), which implements a paper/method by Dan Carr. The vignette (http://cran.r-project.org/web/packages/hexbin/vignettes/hexagon_binning.pdf) has more details, which I quote below:

    "Hexagon binning is a form of bivariate histogram useful for visualizing the structure in datasets with large n. The underlying concept of hexagon binning is extremely simple: the xy plane over the set (range(x), range(y)) is tessellated by a regular grid of hexagons; the number of points falling in each hexagon is counted and stored in a data structure; and the hexagons with count > 0 are plotted using a color ramp or by varying the radius of the hexagon in proportion to the counts. The underlying algorithm is extremely fast and effective for displaying the structure of datasets with n >= 10^6. If the size of the grid and the cuts in the color ramp are chosen in a clever fashion, then the structure inherent in the data should emerge in the binned plots. The same caveats apply to hexagon binning as apply to histograms, and care should be exercised in choosing the binning parameters."

    Check out the link; there are other solutions there as well. Let me know if any are useful.

Steven Dillard at Quora
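Since the question notes the data is already aggregated with `uniq -c`, another route (a sketch in Python/NumPy, which counts as "any other scientific tool" per the question; the in-memory sample lines are illustrative stand-ins for the real `uniq -c` output) is a weighted histogram: pass the unique values as data and their counts as weights, so the plotter never sees all 92 million points. Because the values span 1e-13 to 0.1, log-spaced bins are the natural choice.

```python
# Sketch: histogram from pre-aggregated "count value" pairs, as produced
# by `sort | uniq -c`, instead of feeding all 92M raw points to a plotter.
# Python/NumPy is an assumption; the sample lines below are illustrative.
import numpy as np

lines = ["3 1e-13", "5 2.5e-7", "2 0.09"]  # stand-in for the uniq -c output
counts, values = [], []
for line in lines:
    c, v = line.split()
    counts.append(int(c))
    values.append(float(v))
counts = np.array(counts)
values = np.array(values)

# Values span 1e-13 to 0.1, so use log-spaced bins; the range is taken
# slightly wider than the data to avoid floating-point edge effects.
bins = np.logspace(-14, 0, 29)

# Weighted histogram: each unique value contributes its count.
hist, edges = np.histogram(values, bins=bins, weights=counts)

print(hist.sum())  # total number of original data points: 10
```

The resulting `hist` array can then be plotted as a bar chart (e.g. with matplotlib) without any memory pressure, since its size depends only on the number of bins.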


Other answers

Take a random subsample of your data. You don't need all 92 million points to get an idea of the distribution. If you look at a histogram with 10k points versus 100k points, I doubt you'll notice much difference, and even less so for 100k versus 92M. As far as automatically figuring out bin size: I'm less familiar with Octave, but R has a few built-in algorithms you can try out. Also, a kernel density estimate (basically a smoothed histogram) will give you an even better picture of your distribution.

```r
# make some fake data
data <- rnorm(100000, 0, 1)

# plot a histogram using two different algorithms for determining bin size
hist(data, breaks="Scott")  # minimizes error for normal distributions
hist(data, breaks="FD")     # based on inter-quartile ranges

# plot a kernel density estimate
plot(density(data))

Sam Thomson
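The subsampling advice above can also be sketched in Python/NumPy (an assumption, since the answer used R; the array here is synthetic and the names are illustrative). NumPy's `np.histogram` supports the same automatic bin-width estimators, e.g. `bins="fd"` for Freedman-Diaconis:

```python
# Sketch of the subsample-then-plot idea in Python/NumPy -- an assumption,
# since the answer above used R; the data and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=1_000_000)  # stand-in for the 92M points

# A random 100k subsample is plenty to see the shape of the distribution.
sample = rng.choice(data, size=100_000, replace=False)

# Automatic bin selection, analogous to R's breaks="FD" (Freedman-Diaconis).
hist, edges = np.histogram(sample, bins="fd")

print(sample.size, hist.sum())  # every sampled point lands in some bin
```

The `hist`/`edges` pair can then be handed to any plotting front end, so the expensive full dataset is touched only once, during sampling.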
