[Disclaimer: I'm a "punk statistician", not a real one :-P]
You're saying that there may be an arbitrarily complex relationship between your variables and the target, and you want to model this relationship... that's generally a really difficult problem. Typically the way one goes about this is to assume the relationship follows some parametric model: e.g., linear, polynomial, exponential. Then you can try to fit the model to the data and see how good the fit is. In the simplest case you'd test for a linear relationship and look at correlation or R2 score, as you mentioned. As you pick more and more complex models, you need to start worrying about the bias variance dilemma [i.e., your model might get a perfect fit, but it'll be fitting noise in the training data; see Bias–variance dilemma].
The obvious question is: how do I pick a model to fit? No easy answer here, unfortunately. One way to do this is to just look at the data and get a feel for what the possible relationships may look like. Then pick the SIMPLEST model that you feel may capture the relationship (see again bias-variance). The ultimate test is to see how well your model can predict the target for new (or held out) data points.
Another option is to fit a non-parametric model. These models are great if your ultimate goal is to make predictions for novel data points (e.g., what will be the disease incidence if temperature rises to 45, and elevation is X?). You could try random forest regression [see 3.2.3.3.2. sklearn.ensemble.RandomForestRegressor], or radial basis functions [see Radial basis function]. The main disadvantage to these types of models is that while they may be able to make good predictions, they are not as interpretable. When relationships are really complex though, this is your best bet.
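The RBF idea can be sketched in a few lines of Python. This is a minimal kernel-smoother version I wrote for illustration (the function name and the toy data are made up, not taken from sklearn): each prediction is a Gaussian-weighted average of the observed targets.

```python
import math

def rbf_predict(x_new, xs, ys, gamma=1.0):
    """Predict y at x_new as a Gaussian-kernel-weighted average of the observed ys."""
    weights = [math.exp(-gamma * (x_new - x) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Toy data: y roughly follows x**2
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.1, 3.9, 9.2, 15.8]

print(rbf_predict(2.5, xs, ys, gamma=2.0))
```

Note that `gamma` plays the role of a bandwidth: a large `gamma` makes the prediction hug the nearest observed point, a small one averages over everything. This is exactly the interpretability trade-off mentioned above: good predictions, but no tidy parametric story.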
If the one independent variable is categorical (information about group membership) and all of the outcome variables are quantitative, roughly normally distributed, and associations among them are linear, then Multivariate Analysis of Variance (MANOVA) could be used. (In SPSS this is obtained using GLM).
If all of the variables are quantitative, you could set up a regression using the scores for your “causal” independent variable as values of Y, and the scores for your outcome variables as values of X1, X2 etc. The regression would show you whether there is some weighted linear combination of scores on the quantitative X variables that is strongly related to the score on the single Y variable.
Y’ = b0 + b1X1 + b2X2 + … + bkXk
It’s conventional to think of the X’s as “causes” of Y in this regression situation, but the analysis is just a way of finding out whether scores on Y are related to some combination of scores on the X’s. Regression does not “know” anything about your assumptions regarding possible cause/effect.
Unless your research design was experimental, neither analysis tells you anything about cause and “effect”. Each analysis only tells you if the scores for your set of outcomes are statistically significantly related to the score on your single predictor variable.
I'll be unkind now. There is a high chance you will not get any more answers because you don't respect the people you're asking help from.
You could commit a bit more time to stating your question properly and to double-checking your spelling and grammar.
What is your hypothesis? What is your continuous variable?
What I understand from your brief description is that you want to compare if the guys choosing one color have a higher average in their continuous variable than the ones who are not choosing the color (same applies for color combinations).
Let's say that you have color blue, and the continuous variable is height. What is your hypothesis? Do you want for example to show that people who choose color blue (e.g. N1=15) are taller than those not choosing color blue (e.g N2=42)?
Then calculate the average for each group (the blues and the non-blues) and measure their difference. You can use permutation testing to check whether your result is statistically significant; a nice property of permutation testing is that it makes next to zero assumptions. It works by randomly permuting the members of the two groups while keeping the group sizes the same. So take the 15 + 42 = 57 people (your observations), assign them randomly into groups of 15 and 42 members, and calculate their mean difference again. Repeat this process many times, depending on the significance threshold you want. For example, repeat it 10,000 times and then sort the results (the mean differences). If the real difference you calculated is higher than, say, 9,900 of the permuted differences, your result is significant at a threshold of p < 0.01.
The conclusion of that test is that the groups are not defined randomly but rather there is a relationship between the color choice and the height, so that randomly divided groups would not have such a high difference as the one you observed from your data.
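The procedure above can be sketched in a few lines of Python (the group values here are made up for illustration):

```python
import random

def perm_test(group_a, group_b, n_perm=10000, seed=0):
    """One-tailed permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = group_a + group_b
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a = pooled[:len(group_a)]
        b = pooled[len(group_a):]
        if sum(a) / len(a) - sum(b) / len(b) >= observed:
            count += 1
    return count / n_perm  # fraction of permutations at least as extreme

# Hypothetical heights: the "blue" group vs everyone else
blue = [180, 182, 179, 185, 181]
other = [170, 172, 171, 169, 173, 168]
p = perm_test(blue, other)
print(p)
```

The returned value is the permutation p-value: if it falls below your chosen threshold (say 0.01), the observed difference is unlikely to be an accident of how the groups happened to split.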
Good luck with that
Environmental analysis involves examining external and internal factors to understand opportunities, challenges, and trends affecting an organization or system. Several models and frameworks are commonly used to analyze and interpret environmental factors:
1. PESTEL Analysis
Focuses on macro-environmental factors:
- Political: Government policies, regulations, and political stability.
- Economic: Market trends, inflation rates, economic growth, and exchange rates.
- Social: Cultural trends, demographics, and consumer behaviors.
- Technological: Innovations, R&D, and technological changes.
- Environmental: Sustainability issues, climate change, and environmental regulations.
- Legal: Laws affecting business, including labor laws, consumer protections, and intellectual property rights.
2. SWOT Analysis
Examines internal and external factors:
- Strengths: Internal advantages or capabilities.
- Weaknesses: Internal limitations or deficiencies.
- Opportunities: External chances for growth or improvement.
- Threats: External risks or challenges.
3. Porter’s Five Forces
Assesses industry-level competitiveness:
- Threat of new entrants: Barriers to entry for potential competitors.
- Bargaining power of buyers: Influence of customers on pricing and terms.
- Bargaining power of suppliers: Influence of suppliers on costs and inputs.
- Threat of substitutes: Availability of alternative products or services.
- Industry rivalry: Intensity of competition among existing players.
4. Value Chain Analysis
Breaks down internal operations to identify areas for value creation or improvement:
- Primary activities: Inbound logistics, operations, outbound logistics, marketing, and sales.
- Support activities: Procurement, technology development, HR management, and firm infrastructure.
5. Scenario Analysis
Explores multiple future possibilities by creating scenarios based on varying assumptions about key uncertainties (e.g., technological developments, regulatory changes, or market conditions).
6. Ansoff Matrix
Helps identify strategic growth opportunities:
- Market penetration: Existing products in existing markets.
- Product development: New products in existing markets.
- Market development: Existing products in new markets.
- Diversification: New products in new markets.
7. STEEP Analysis
Similar to PESTEL, but focuses on:
- Social
- Technological
- Economic
- Ecological
- Political factors.
8. McKinsey 7S Framework
Evaluates organizational effectiveness by focusing on seven interrelated factors:
- Strategy, Structure, Systems (hard elements).
- Shared values, Skills, Style, Staff (soft elements).
9. Benchmarking
Compares an organization’s performance, practices, or processes to industry standards or competitors to identify areas for improvement.
10. Environmental Scanning
A continuous process of collecting information on external trends and events using tools like:
- Surveys and reports.
- Media monitoring.
- Big data analytics.
11. BCG Matrix
Used to analyze a portfolio of business units or products:
- Stars: High growth, high market share.
- Cash cows: Low growth, high market share.
- Question marks: High growth, low market share.
- Dogs: Low growth, low market share.
12. Ecosystem Mapping
Identifies key stakeholders, relationships, and dynamics within a specific environment or system to assess opportunities for collaboration or intervention.
13. Force Field Analysis
Analyzes driving and restraining forces influencing change, helping decision-makers address barriers and leverage strengths.
14. Critical Success Factor (CSF) Analysis
Identifies areas essential for achieving organizational objectives and evaluates external factors impacting these areas.
These models provide diverse perspectives on analyzing the environment, allowing organizations to make informed strategic decisions and adapt to changing contexts effectively.
Multiple regression does not speak to the question of whether there is a correlation between two variables. To answer that, you must do a univariate regression. If the t-statistic on the slope coefficient is significantly different from zero, you can assert that the correlation between the dependent and independent variable is significantly different from zero with the same significance level.
One way to think about the slope coefficient for a variable in a multiple regression is it tells you whether the independent variable is correlated to the dependent variable after adjusting both for all the other independent variables. That’s not precisely true, but it gives you the general idea.
For example, you might find that owning a bicycle had a negative correlation to weight in a univariate analysis, but if you added amount of weekly exercise and age as independent variables, the effect of owning a bicycle on weight became insignificant. That would suggest that while bicycle ownership is correlated with weight, it operates as an indirect reflection of age and exercise, without statistically significant direct effect on weight.
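Here is a toy numerical illustration of that point in Python (all data fabricated): x1 is built to track x2, and y depends only on x2, so x1 looks predictive on its own but its slope collapses once x2 enters the model.

```python
def slope(x, y):
    """Univariate least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def two_var_slopes(x1, x2, y):
    """Solve the 2x2 normal equations for y = b0 + b1*x1 + b2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [a - m1 for a in x1]
    c2 = [a - m2 for a in x2]
    cy = [a - my for a in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

# Fabricated data: y is driven by x2; x1 merely tracks x2
x2 = [1, 2, 3, 4, 5, 6]
x1 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]   # noisy copy of x2
y  = [2, 4, 6, 8, 10, 12]             # exactly 2 * x2

print(slope(x1, y))        # sizeable univariate slope: x1 "looks" predictive
b1, b2 = two_var_slopes(x1, x2, y)
print(b1, b2)              # b1 is essentially 0 once x2 is in the model
```

This is the bicycle/weight situation in miniature: the univariate slope of x1 is large, but its partial slope vanishes once the confounder is adjusted for.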
Statistical analysis does NOT "cause" anything!
However, depending on the nature of the data, it may reveal a causal connection.
Some programs like NVivo seek to do a similar thing with qualitative data, but this is NOT the same thing.
Because in humans the outcome is (almost) always affected by factors that you wonder if they matter.
In research I was closely involved in, two treatments for club foot were being compared to see which would be better able to achieve a good outcome without surgery.
Sounds pretty straight forward, doesn’t it? To make a long story short, one method required a lot of treatment of the feet at home, but had fewer trips to the hospital than the other, which was mostly hospital treatments of the foot. It turns out that the home based treatment was better, but required a lot of dedication to the treatment plan. Problem: dedication was hit and miss. When the parents did not stick to the treatment plan, the result was usually disappointing.
It turns out that the parents who stuck to the plan tended to come from certain socioeconomic groups, while the disappointing outcomes clustered among parents from other socioeconomic groups. If that information had not been recorded, the mixed results for the home treatment would have been hard to explain.
Hope that helps.
The wording of the question is a little hard to follow, but it sounds like a standard 8x2 factorial design in ANOVA, or a simple interaction in multiple regression, but split by the latter variable into two analyses (which is a standard practice to some, but not ideal since it loses power).
The simple solution, if you want to go the regression route (and it sounds like you do), is to use interaction terms in your regression. In linear regression, that’s as simple as multiplying each of your eight variables by your key “relatable conditions” variable, i.e. a second variable, such that your regression looks like this:
outcome = b0 + b1(var1) + b2(var2) + b3(var1)(var2) + … (repeat for the other vars) + e
In which each b (which is meant to be a beta, sorry) is a coefficient, and you basically do it 8 times against that second key variable. The interaction term (b3 in this case) represents the change in the effect of variable 1 in the presence of variable 2 (or per unit increase in variable 2); in linear regression it can simply be added to the main effects (in this case, b1 and b2). It's much better than splitting this into two regressions, since it gives you more statistical power: your sample includes both sides of that key variable (left/right, whatever).
Now, if you're really feeling savvy, technically you can compare the results of two multiple regression analyses, especially if you aren't as concerned about statistical power. In meta-analysis we do this all the time by converting to an effect-size metric, like Hedges' g. That said, if something isn't significant, that is sometimes treated as a "0" effect size, so it only really works with significant findings unless you're willing to "fake" it with a weight.
Suffice it to say, it’s easiest to just run everything at once using the interaction term.
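To make the interpretation concrete, here's a tiny Python sketch with made-up coefficients: with the interaction term in the model, the slope of var1 is b1 when var2 = 0 and b1 + b3 when var2 = 1.

```python
# Hypothetical fitted coefficients for: outcome = b0 + b1*v1 + b2*v2 + b3*v1*v2
b0, b1, b2, b3 = 1.0, 0.5, 2.0, -0.3

def predict(v1, v2):
    return b0 + b1 * v1 + b2 * v2 + b3 * v1 * v2

# Slope of v1 within each level of a binary second variable:
slope_when_v2_is_0 = predict(1, 0) - predict(0, 0)   # = b1
slope_when_v2_is_1 = predict(1, 1) - predict(0, 1)   # = b1 + b3
print(slope_when_v2_is_0, slope_when_v2_is_1)
```

So instead of running two separate regressions and eyeballing the difference, the single model hands you that difference directly as b3, with a standard error attached.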
You look at the current scientific literature on the subject, typically what’s been published about it within the past five years. You can go a little further back than that, but expect it to be considered outdated.
But even if a disorder has a genetic basis—guess what? You can still do something about it. You're not necessarily doomed. If you have a psychological disorder such as schizophrenia (which has a genetic component), you can still choose to take your medicine, get plenty of sleep, go to your doctor, do your therapy, get help from your treatment team, go to support groups, etc.
In other words, you have choices, despite your genetics.
Disclaimer:
This answer is for general informational purposes only and is not a substitute for professional medical, psychiatric, or psychotherapeutic advice. If you think you may have a medical emergency, call your doctor or (in the United States) 911 immediately. Always seek the advice of your doctor or therapist before starting or changing treatment. All case examples have had identities and other significant details changed in order to protect confidentiality. Quora users who provide responses to health-related questions are intended third-party beneficiaries with certain rights under Quora's Terms of Service (https://www.quora.com/about/tos).
- Without some knowledge of the process that you are examining it is impossible to give a good answer to this question
- I would suggest that you return to the theory underlying the process and specify some kind of model.
- When you have specified your model you should use your dataset to verify that the model and your data are consistent.
- Then you estimate the model and make deductions
- Doing multiple searches of correlations and models will lead to spurious results.
- You should talk to your teacher and determine what he wants you to do. Your lecture notes or recommended textbook may help. Perhaps a similar model and dataset have already been studied in your lectures and that is what the lecturer wants you to apply
Treat each combination of the two variables as a single variable.
Then calculate the correlation coefficient for the third variable against this combination variable. This would give you the impact that one variable has on different combinations of the other two variables.
Correlation Coefficient: Simple Definition, Formula, Easy Steps
Correlation coefficients are used in statistics to measure how strong a relationship is between two variables. There are several types of correlation coefficient: Pearson’s correlation (also called Pearson’s R) is a correlation coefficient commonly used in linear regression. If you’re starting out in statistics, you’ll probably learn about Pearson’s R first. In fact, when anyone refers to the correlation coefficient, they are usually talking about Pearson’s.
First you have to do a simple transformation: shift the average of the distances to an origin (a fixed point), so you have each point's distance relative to that origin, and likewise shift the average of the pH values to the origin.
To get a preliminary idea, plot the distances along the X axis and the soil pH along the Y axis to see how the values are scattered. A pattern will emerge that may suggest which statistical tool to use. Is there a correlation between the two, and if so, is it positive or negative? This assumes only two variables are being considered.
Fitting a simple linear regression line will also help predict the pH level at a given distance, after converting back to the original measures of location. This is the simplest way to analyze the data.
You might also consider other features of the data, such as whether other components in the soil affect the pH levels equally or differently. But that would need more analyses of the soil.
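If it helps, here's a minimal Python sketch of that simple-linear-regression step (the distance and pH numbers are invented for illustration):

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
            sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return intercept, slope

# Invented data: soil pH vs distance (m) from a fixed origin
distance = [0, 10, 20, 30, 40, 50]
ph       = [7.8, 7.5, 7.1, 6.9, 6.4, 6.2]

b0, b1 = fit_line(distance, ph)
print(b0, b1)           # a negative slope here means pH falls with distance
print(b0 + b1 * 25)     # predicted pH 25 m from the origin
```

Once the line is fitted in the shifted coordinates, converting a prediction back to the original location is just a matter of adding the means back in.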
By definition, dependent variables are affected by independent variables, so presumably you want to know which test to use when you have three categorical independent variables and one dependent variable?
The answer depends on (among other things) the level of measurement of the dependent variable (nominal, ordinal, interval, or ratio) and the design of the experiment (e.g., between-groups, within-groups, mixed).
For example, if your dependent variable is interval or ratio, an ANOVA might be appropriate, though, if your dependent variable is nominal, a chi square test might be.
Based on the information in your question, there are various ways your analysis might be approached, so providing a little more information (e.g., level of measure of dependent variable, experimental design, hypotheses) would enable respondents to provide a more precise answer.
Statistical analysis helps in analysing the data by using appropriate concepts, measures, formulae, and tests, mainly based on the mean and standard deviation.
I think the application of significance tests, confidence limits, errors, and probability may indicate that the results are approaching reality (as you mentioned).
Link between variables:
The relationship between the X and Y variables may be an association in numbers or in terms (quantitative or qualitative: an increase or a decrease).
In qualitative terms: good/bad, large/small.
It may be causal or empirical.
In addition to the logistic regression analysis, you might also try the decision trees or random forest algorithms. It might be argued that these are “data science” rather than “statistical analysis”, but that’s a definitional issue. Either of these methods will work for predicting a binary output. For that matter, a neural net type method will also work.
I have tried all three on the same dataset in the past (predicting success of graduate students in various university programs) and they came to broadly similar conclusions. Some notes about those comparisons (all done using SPSS, though I could have used R or Python, I suppose):
- Logistic regression was superior in understanding the importance of each variable, and happened to be slightly better at prediction. But it can be hard to interpret (e.g. no “easy to understand” R-square measure to quote to your employer/supervisor/client).
- Decision trees were similar to logistic regression, both in terms of understanding the importance of the different variables and in prediction. It could be argued that logistic regression gives a more “fine-grained” understanding of the importance of the different variables (I would probably support that position). However, decision trees are easy to understand and explain to non-statisticians, which is a definite advantage. You also don’t have to do a lot of variable transformations (e.g. for categorical data), which are necessary for logistic regression.
- A simple perceptron neural net (in SPSS) had a similar level of predictive power to the other methods, but it doesn’t score well on the understanding/explainability factors. However, if that isn’t an issue, it does have the advantage of potentially accepting many more variables in the model. Of course, putting a lot of variables into a model increases the problem of over-fitting, so there’s that. You could also step this up to much more sophisticated “deep-learning” algorithms, which could be advantageous, depending on the research problem.
So, you have choices. In practical terms, it probably depends on what method you know best (or are willing to commit the time to learn), what your data seems to call for, and who your audience is.
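For what it's worth, the logistic-regression option can be sketched in plain Python with gradient descent (the data here are invented; in practice you'd use SPSS, R, or a Python library as discussed above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, lr=0.1, epochs=5000):
    """One-feature logistic regression fitted by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            err = sigmoid(b0 + b1 * xi) - yi   # prediction error for this point
            g0 += err
            g1 += err * xi
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Invented data: admission test score vs pass (1) / fail (0)
score  = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0]
passed = [0,   0,   0,   0,   1,   1,   1,   1]

b0, b1 = fit_logistic(score, passed)
p_low = sigmoid(b0 + b1 * 1.5)
p_high = sigmoid(b0 + b1 * 5.5)
print(p_low, p_high)   # low vs high predicted probability of passing
```

The fitted coefficients play the same interpretive role as the SPSS output: b1 tells you how the log-odds of the binary outcome shift per unit of the predictor.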
Clearly, the easiest approach is to use multiple regression analysis. However, it depends on what you want to find out.
METHOD ONE
If you want to know the relationship between each IV and the DV separately, then you can run a series of simple regressions (one IV per regression) or just examine the correlations between each IV and the DV.
METHOD TWO
If you want to find out the relationship between each IV and the DV — but controlling for the effects of all other IV’s — then run a regression using all the IV’s simultaneously. This is usually the best approach.
METHOD THREE
If you want to organize the IV’s into new variables that measure some common characteristic, then run the IV’s through principal components analysis (PCA) and use the PCA variables instead of your original IV’s. This approach can work well if you have a large number of IV’s that are highly correlated. However, interpretation can be very difficult. Thus, this approach has to be used with care.
My advice is to use multiple regression analysis (Method 2). For most people, this approach gives sensible answers and is easy to interpret.
If you already know that, say, x, y, z are independent, then you can regress w on all of them to get partial regression coefficients for each:
[math]w=ax+by+cz+d[/math]
You can do this in R by loading your x,y,z,w data into a data frame:
MyData <- data.frame(x = x, y = y, z = z, w = w)
summary(lm(w ~ x + y + z, data = MyData))
What this will do is, assuming the relationships are linear, give you the independent contributions of each of x, y and z to w via a partial regression coefficient.
Broad Factors Analysis, commonly called the PEST Analysis, is a key component of external analysis. A Broad Factors Analysis assesses and summarizes the four macro-environmental factors — political, economic, socio-demographic (social), and technological. These factors have significant impacts on a business’s operating environment, posing opportunities and threats to the company and all of its competitors. Broad Factors Analysis is widely used in strategic analysis and planning because it helps companies determine the risks and opportunities in the marketplace. That, in turn, becomes an important consideration when companies are developing corporate and business strategies.
Me, the first thing I do is graph the data and then calculate a correlation coefficient. Does it form a line, or does it look like a shotgun blast? If it's a shotgun blast, you're done. If it forms a pattern other than a line but is some other identifiable pattern, such as an exponential or logarithmic curve, I fit an equation to it and re-calculate the correlation coefficient.
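That workflow can be sketched in Python. The data here are hypothetical, built with an exponential trend so the effect of linearizing is visible:

```python
import numpy as np

# Made-up data with an exponential relationship: y ~ exp(0.5 * x)
rng = np.random.default_rng(1)
x = np.linspace(0.0, 4.0, 50)
y = np.exp(0.5 * x) * rng.lognormal(sigma=0.05, size=x.size)

# Raw Pearson correlation: strong but not perfect, since the trend is curved
r_raw = np.corrcoef(x, y)[0, 1]

# Fit the identifiable pattern (exponential) by taking logs,
# then re-calculate the correlation on the linearized data
r_log = np.corrcoef(x, np.log(y))[0, 1]
```

The log-transformed correlation comes out higher than the raw one, which is the signal that the exponential fit captures the pattern better than a straight line.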
I agree with Quora User wholeheartedly - as someone who has quite a bit of expertise in time series.
If this is critical, hire someone with the chops to do it. You can learn how, but it would probably be cheaper to have someone do it for you. It would certainly take less time.
And for the love of <insert random religious/whatever figure here>, do NOT use Excel…
Pearson correlation coefficient - Wikipedia
If you meet the necessary assumptions, you can convert the r into a t and run a Student’s t-test. That’s what most software will spit out.
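The conversion is t = r·sqrt(n − 2)/sqrt(1 − r²), tested against a Student's t distribution with n − 2 degrees of freedom. A minimal sketch with made-up data:

```python
import math
import numpy as np

# Hypothetical sample of two moderately related variables
rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)

# Pearson r, then the standard conversion to a t statistic with n - 2 df
r = np.corrcoef(x, y)[0, 1]
t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r**2)
# Compare |t| against a Student's t distribution with n - 2 degrees of
# freedom (e.g., scipy.stats.t.sf(abs(t), n - 2) * 2 for a two-sided p).
```

This is exactly the t that most software spits out next to the correlation.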
“Relationship between variables” is a very fuzzy notion. I suspect that what you are trying to ask, without actually putting it this way, is: “How can statistical methods determine if there is or isn’t a cause-and-effect relationship between two variables?”
Have a look at this link for the Framingham Heart Study, a data collection and analysis project that is now in its 76th year. They have been using statistical methods to winkle out the relationships between environmental, behavioral, and genetic variables and the likelihood that individuals will develop heart disease.
“When it launched in 1948 the original goal of the Framingham Heart Study (FHS) was to identify common factors or characteristics that contribute to cardiovascular disease.”
SPSS is a software tool not a form of analysis. You can do most of the analytics using SPSS if you wish.
As for which analysis you should use that depends on what you are trying to do and what data you have.
I strongly suggest you give some thought to what you are asking and provide the context.
I don’t know if there is a ‘proper’ correlation for any type of variable since there are many possible ways to evaluate pairwise dependence. The commonsense or conventional meaning of ‘correlation’ refers to the Pearson correlation, which is a test for linearity.
Spearman correlations use the same formula as the Pearson but first transform the variables into rank ordering. Hence, it is a test for monotonic dependence.
Probably the most general dependence metric is Szekely’s distance correlation.
This paper compares and contrasts many dependence metrics, but it is not exhaustive.
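The rank-transform identity for Spearman can be checked directly in Python. The data below are made up (and have no ties, so a simple rank helper suffices):

```python
import numpy as np

def ranks(a):
    """Return 1-based ranks of a 1-D array with distinct values."""
    order = np.argsort(a)
    r = np.empty_like(order)
    r[order] = np.arange(1, a.size + 1)
    return r

rng = np.random.default_rng(3)
x = rng.normal(size=40)
y = np.exp(x)  # monotonic but nonlinear transform of x

pearson = np.corrcoef(x, y)[0, 1]                  # < 1: penalizes curvature
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]   # = 1: perfectly monotonic
```

Since y is a monotonic function of x, the ranks agree exactly, so the Spearman correlation is 1 even though the Pearson correlation is not. That's the sense in which Spearman tests monotonic rather than linear dependence.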
A test of the correlation coefficient [math]\rho[/math] would be the best choice to determine whether there is a significant relationship between two variables.
If you want more details, check:
If you already have a sample data you can try using my online step by step solution calculator to perform the test:
https://app.stat-solution.com/#!/hypothesis/test-of-correlation
Do visit the link below for such online step by step automated solutions:
https://www.stat-solution.com/
Commonly used statistical analysis methods in scientific studies include:
1. Descriptive Statistics;
2. Inferential Statistics;
3. Regression Analysis;
4. Hypothesis Testing;
5. Sample Size Determination; and
6. Correlation Analysis.
The ratio of genetic variance over total variance is called heritability. Its exact definition can vary quite a bit, but it is ultimately determined by how you decide to make your estimates.
Researchers estimate heritability by performing heritability studies.
Why is it challenging to study genetic contributions to behaviour?
The concept of heritability was adopted from its usage by breeders.
In humans, parent-offspring regression and twin studies are the bread-and-butter for heritability studies.
More sophisticated and extensive studies can make use of extended familial relationships (i.e. pedigrees) and even (“realized”) genetic relationships based on molecular markers.
It’s important to note that a lot of survey designs and estimate adjustments are done for confounding factors (particularly the environment). These are regularly insufficient, however. Human heritability studies, especially those dealing with behavior, psychology, and intelligence, are often criticized for this reason.
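As a toy illustration of the parent-offspring regression mentioned above (all numbers are made up): under the standard quantitative-genetics model, the slope of offspring trait values regressed on midparent values estimates narrow-sense heritability.

```python
import numpy as np

# Hypothetical trait data; the true heritability we try to recover is 0.6
rng = np.random.default_rng(4)
n = 500
h2_true = 0.6

midparent = rng.normal(size=n)  # midparent trait deviations
# offspring deviation = h^2 * midparent deviation + environmental noise
offspring = h2_true * midparent + rng.normal(scale=0.5, size=n)

# Regression slope of offspring on midparent = cov / var: the h^2 estimate
h2_hat = np.cov(midparent, offspring)[0, 1] / np.var(midparent, ddof=1)
```

In real studies the hard part is not this arithmetic but the confounding: shared environment inflates the parent-offspring resemblance, which is exactly the criticism raised above.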
It depends on the disorder and the existing research. Some disorders have a lot of research and even genetic tests, so you could probably find an answer and believe it with more certainty than in other cases.
To my knowledge, there are some disorders or illnesses that are directly caused by the environment. Examples include alcohol intoxication caused by excessive drinking; physical and psychological illness caused by high levels of mercury or toxins.
Many disorders and illnesses have a genetic basis, as evidenced by genetic tests for breast cancer and trends in mental illness within families. The nature-nurture paradigm might be relevant here, since lifestyle factors are so important in mental health and chronic disease.
Since gene research and medicine are constantly evolving, if you were to ask this question in ten or twenty years, you would have more data to draw from and more disorders to research.
That depends on how the ranked data for your DV works. Are you giving people several options and then asking them to rank them by preference? For example, if colors were ranked, a person might rank blue first, green second, red third, etc. In that case, it’s best to create several variables and treat each choice as its own variable. So, in this case, you would have a variable called “blue” and it would get a value of 1, while another variable called “green” would get a value of 2, and so on. That would give you as many variables as you had choices for ranking.
You have a few choices on how to analyze that kind of data but probably the most important question is what are you trying to accomplish? What the purpose or practical outcome of this analysis? If you can say more about your project and what the variables are like, then I might be able to give a more specific and useful suggestion. I’d love to hear from you about that.
Bart
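The recoding Bart describes can be sketched in a few lines of Python (the colors and the respondent's ordering are hypothetical):

```python
# Each ranked option becomes its own variable whose value is that
# option's rank in the respondent's preference order (1 = first choice).
options = ["blue", "green", "red"]
response = ["blue", "green", "red"]  # one respondent's preference order

recoded = {color: response.index(color) + 1 for color in options}
# → {"blue": 1, "green": 2, "red": 3}
```

This yields as many variables as there were choices to rank, one column per option, which is the layout most analyses of ranked data expect.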
OK, so first you need to understand your outcome and how it is measured. In your case, is your outcome measured at different time points, or does it have multiple measurements? If yes, then how is it measured: e.g., is it a proportion measured over time, or a continuous measurement like hemoglobin (Hb), diastolic BP, or systolic BP measured over time?
All of this obviously depends on your research question and your hypothesis.
If you need further help, do contact me at my email ID, where we can discuss further.
And, as always, choosing a statistical test is not what matters; the reasoning behind it, and how appropriate it is for your data, is what holds the power.
Yes and no - and the “statistics” part is irrelevant. First you need to define what “relationship” means. And then “variables.” Those definitions can be, and often are, so nebulous that any “relationship” can be inferred.
One logical postulate should be kept in mind: from a false premise, any conclusion follows.
For example, I’ve recently been amusing myself reading St. Augustine’s musings about the “relationship” between himself and “god.” First he defines “god” as “infinite” and “omnipotent”, and himself as “humble” and “unworthy.” Then he proceeds to define a relationship between those two variables that presumes his characterization of “god” is correct.