Can a mathematically sound prediction interval have a negative lower bound?

  • I have used R to form a 95% prediction interval for the number of endemic species on an island. My lower bound is negative. Is that mathematically sound?

    The linear model behind the prediction interval uses these predictors:

    Area: surface area of the island, hectares
    DistSC: distance from Santa Cruz, kilometres
    Elevation: elevation of the highest point, metres

    The model is coded as:

    selected.model <- lm(ES ~ Area + Elevation + DistSC + I(Elevation^2) + (Elevation:DistSC) + (Area:Elevation))

    Stepwise regression was performed to find this "best" model. I'm not exactly sure how a prediction interval works, and I just want to make sure it is OK. Obviously a negative number of species is impossible, but I know the interval takes into account the uncertainty of the mean as well as the scatter of the data.

  • Answer:

    Mathematics is reality-agnostic, so your negative lower prediction bound can certainly be mathematically sound. I would argue, however, that it is a good indication that you are using the wrong mathematics: here, Ordinary Least Squares (whose prediction intervals assume normally distributed errors) applied to count data (where a normal distribution makes no sense). I would suggest using Poisson regression or a similar method that is better suited to count data.
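To make the contrast concrete, here is a minimal R sketch. It does not use the asker's island data: the data frame `dat`, the single predictor `x`, and the coefficients are simulated and purely hypothetical. It fits both an ordinary linear model and a Poisson regression to the same counts. The Poisson interval shown is a rough plug-in interval that ignores uncertainty in the estimated coefficients, so treat it as an illustration rather than a recommended procedure.

```r
# Simulated, hypothetical count data (NOT the island data from the question).
set.seed(1)
x <- runif(60, 0, 3)
y <- rpois(60, lambda = exp(0.2 + 0.9 * x))
dat <- data.frame(x = x, y = y)
new <- data.frame(x = 0.1)  # a point where the expected count is small

# Ordinary least squares: the prediction interval is symmetric around the
# fitted value, so nothing stops its lower bound from going below zero.
ols <- lm(y ~ x, data = dat)
pi_ols <- predict(ols, newdata = new, interval = "prediction", level = 0.95)
print(pi_ols)

# Poisson regression with the default log link: the fitted mean is exp(...),
# which is always positive.
pois <- glm(y ~ x, family = poisson, data = dat)
mu_hat <- predict(pois, newdata = new, type = "response")

# Rough plug-in prediction interval: Poisson quantiles at the fitted mean.
# Assumption: parameter uncertainty is ignored, so this is somewhat too narrow.
pi_pois <- qpois(c(0.025, 0.975), lambda = mu_hat)
print(pi_pois)
```

With counts this small near `x = 0.1`, the `lwr` column of `pi_ols` falls below zero, while the Poisson quantiles cannot, because the Poisson distribution has no mass on negative values.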

user42835 at Cross Validated

Other answers

This suggests to me that you haven't used an analytic approach with an appropriate transformation of the outcome. With count data, for instance, the popular models (Poisson regression or negative binomial regression in particular) model the log of the expected count as a linear function of the predictors. Any predicted value from such a model is exponentiated and is therefore positive. Similarly, when you use the predict.glm function with se.fit set to TRUE for these models, you get standard errors on the log scale, from which you can build symmetric intervals for the log of the expected count. Exponentiating the endpoints gives intervals that are strictly positive. (Strictly speaking, these are confidence intervals for the expected count; a prediction interval for a new observation must also account for the scatter of the counts around that mean.) You'll notice that the exponentiated predictions are the same as you would get from setting type='response' in the predict function. However, if you ask for both type='response' and se.fit=TRUE, R returns standard errors on the response scale, and a symmetric interval built from those can still dip below 0, because the link transformation of the GLM makes the appropriate intervals asymmetric. Build the interval on the link scale and then exponentiate.

There are additive count models, just as there are additive risk models for binary endpoints, but I think their results can be difficult to interpret, and they behave untenably for predictions near the boundary of the support (0 for count data). As such, I'd be dubious about not only your negative predictions but all other predictions from your model.
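The link-scale recipe described above can be sketched in R as follows. The data are again simulated and hypothetical (the names `dat`, `x`, `y`, and `new` are illustrative, not from the question), and the result is a 95% confidence interval for the expected count based on a normal approximation on the log scale. It is not a full prediction interval for a new observation.

```r
# Simulated, hypothetical count data (NOT the island data from the question).
set.seed(2)
x <- runif(60, 0, 3)
y <- rpois(60, lambda = exp(0.2 + 0.9 * x))
dat <- data.frame(x = x, y = y)

fit <- glm(y ~ x, family = poisson, data = dat)
new <- data.frame(x = c(0.1, 1.5, 2.9))

# Fitted values and standard errors on the link (log) scale.
p <- predict(fit, newdata = new, type = "link", se.fit = TRUE)

# Symmetric interval on the log scale, then exponentiate the endpoints.
# The back-transformed interval is asymmetric and strictly positive.
z <- qnorm(0.975)
ci <- data.frame(
  fit = exp(p$fit),
  lwr = exp(p$fit - z * p$se.fit),
  upr = exp(p$fit + z * p$se.fit)
)
print(ci)

# The exponentiated link-scale fit matches type = "response".
resp <- predict(fit, newdata = new, type = "response")
print(all.equal(as.numeric(ci$fit), as.numeric(resp)))
```

Because `exp()` is monotone and positive, the lower endpoint can get close to zero but never reach or cross it, which is the property the linear model's interval lacks.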

AdamO
