A practical, example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From the lesson

Module 3A: Multiple Regression Methods

This module extends linear and logistic methods to allow for the inclusion of multiple predictors in a single regression model.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

All right everyone, welcome to Section B.

Here we're going to look at Multiple Logistic Regression,

talk about the basics of model selection.

Basically, reiterate what we said for linear regression.

And show how to estimate proportions or

probabilities from logistic regression models with multiple predictors.

And actually make comparisons on the odds ratio scale between

groups who differ by more than one predictor.

So hopefully, by the end of this lecture,

you'll understand the linearity assumption as it applies to multiple logistic

regression with regards to the linear relationship between the log odds and

continuous predictors in the multiple logistic regression model.

We will just briefly explain different strategies for picking a quote, unquote,

final multiple logistic regression model among candidate models.

And use the results of multiple logistic regression models to compare groups who

differ by more than one predictor.

And estimate proportions or probabilities for groups given their x values.

So, let's talk about the estimation process for Logistic Regression.

Just like it was with simple Logistic Regression,

the algorithm to estimate the equation of the multiple logistic regression is

called maximum likelihood estimation.

So given the data, the estimates for the intercept and slopes,

however many there are given the number of xs, are the values that make the observed

data set most likely among all choices for the intercept and the slopes chosen.

So this actually is an iterative process, requires numerical algorithms and

it must be done by the computer.

And I don't think that's any surprise.

None of us would want to do this by hand anyway.
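
As a rough illustration of what the computer does here, the following is a minimal sketch of one common iterative scheme, Newton-Raphson, run on simulated data. The data are simulated and are not the breast feeding data from the lecture; the function name `fit_logistic_newton`, the sample size, and the use of the lecture's rounded coefficients as the "true" values are all assumptions made for the example. Statistical packages may use a different algorithm or extra safeguards.

```python
import numpy as np


def fit_logistic_newton(X, y, n_iter=25, tol=1e-10):
    """Maximum likelihood for logistic regression via Newton-Raphson.

    X must already contain a column of 1s for the intercept.
    """
    beta = np.zeros(X.shape[1])  # start with every coefficient at 0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # current fitted probabilities
        gradient = X.T @ (y - p)                    # slope of the log likelihood
        weights = p * (1.0 - p)
        hessian = X.T @ (X * weights[:, None])      # curvature (negated)
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:              # estimates stopped changing
            break
    return beta


# Simulated data only (an assumption for illustration, not the real study).
rng = np.random.default_rng(0)
n = 2000
age = rng.uniform(12, 36, n)          # age in months
male = rng.integers(0, 2, n)          # 1 = male, 0 = female
X = np.column_stack([np.ones(n), age, male])
true_beta = np.array([7.2, -0.24, 0.27])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))

beta_hat = fit_logistic_newton(X, y)
print("estimated intercept and slopes:", beta_hat)
```

Each pass through the loop nudges the intercept and slopes toward the values that make the observed outcomes most likely, which is why this is not something to attempt by hand.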

And the resulting shape we're estimating in terms of the log odds,

just like with linear regression when we have more than one predictor, the logistic

regression model's no longer estimating a line, a single line, describing

the relationship between the log odds and our predictor in two-dimensional space.

It's actually describing an object in multidimensional space,

which we can't really visualize in three dimensions, per se.

So what is the linearity assumption with logistic regression?

Remember, we used lowess plots in simple logistic regression

when we had a continuous predictor to see and

assess whether the relationship between the log odds of our binary outcome and

our continuous predictor was linear or roughly linear in nature.

And in logistic regression, when we have continuous predictors, and

there's other predictors in the model, this just extends to the notion that

the adjusted relationship between the log odds for

a binary outcome and this predictor is linear in nature.

There aren't really any visual tools we can use to see the adjusted relationship.

But there are techniques,

like something we saw with Cox Regression previously:

comparing the results of a model where we fit x as a single continuous

predictor versus putting it in as categories, and looking at how the results compare.

So, when faced with potentially many possible predictors, how does a researcher

go about choosing a best model, if that is indeed his or her goal?

And certainly,

it's not necessary to come up with one final multiple regression model.

Sometimes, as we've seen, researchers are interested in presenting the results from

several models and comparing them.

But model building and selection is a combination of science, statistics and

the research goals.

The same as it was with linear regression modeling and

also the same as it will be for Cox.

So if the goal is to maximize the precision of adjusted estimates,

then the best strategy is to keep only those predictors that are statistically

significant in the final model, with all the caveats about power issues, etc.

But estimating things that quote, unquote don't need to be there,

that don't add extra value or information about the outcome, will take

away from the precision of those things that do add information about the outcome.

And so throwing out that, if you will, dead weight from the model will

maximize the precision of the predictors that are associated with the outcome.

If the goal is to present results comparable to results of similar analyses

presented by other researchers, on similar or different populations, for

example, if we want to look at breast feeding practices as a function of child's

age and sex in Nepali children and compare that to research that's been done in

African populations, European populations, Asian populations and the United States.

And everybody else had actually presented their results adjusted for age and sex,

then we would want to do the same, even if the association with one or both was

not statistically significant, so that we could compare our findings to theirs.

If the goal was to show what happens to the magnitude of association with

different levels of adjustment, then there isn't really one final model.

But you'd want to present the results from several models that include different

subsets

or combinations of adjustment variables to show how robust the association is.

If you're looking at one main association and seeing how much it's affected

by potential confounding, looking at the results across several models can

help assess the degree to which the original association is confounded.

And if the goal is prediction, well again, this is a slightly more complicated story,

and we'll discuss it briefly a little later in the course.

But let's talk about predictions, though, how we could, given the results of

the regression model, estimate probabilities from the resulting model for

different subgroups of the population which are included in our sample.

So, just recall,

these are the results from the logistic regression for

predictors of breast feeding when we considered age of the child and sex.

I'll keep it to one of the smaller models here just for illustrative purposes.

And this was the model that had both predictors in it, age and sex of the child.

So suppose we're using this to estimate probability.

So, the probability or proportion of children that are breastfed by different

age and sex groups based on the results of this model.

Well, here's what the model looks like on the log odds scale.

The logistic regression model that generated those

results was ln(odds of breast feeding) = an intercept

of 7.2 + -0.24 times x1, where x1 is age in months,

+ 0.27 times x2, where x2 is sex (1 for male, 0 for female).

And again, I could call these x anything or switch the order,

as long as I knew what each referred to and assigned the proper slope value to them.

So if we had this model and I wanted to estimate the probability or proportion of

female children 22 months old that are breastfed, how could we do this?

Well, what can we estimate straight up for this group, given the equation?

Well, we could say,

the log odds of being breastfed when age = 22 and

sex = female, which is the 0 for

the sex variable, = intercept

7.2 + -0.24 times

22 + 0.27 times (0).

So this is just like we did with linear regression, and we'll get a number here, 1.92.

The problem here, though, is in linear regression, we were done,

this number was on the scale we wanted.

The problem here is we're still a couple steps removed from the scale we want.

So this is the log odds, so if we wanted the odds estimate for

this group, we'd take e and raise it to the 1.92 power.

And that would give us an odds of 6.82. And

then we could get the estimated proportion or

probability by taking the estimated odds over 1 plus the odds,

which is 6.82 over 7.82, which is about 0.87.

So we estimate that 87% or so of female children,

22 months old are being breastfed.
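
The three steps just described (log odds, then odds, then probability) can be sketched in a few lines of Python. The coefficients are the rounded values quoted in the lecture; the function and constant names are made up for this illustration.

```python
import math

# Rounded coefficients quoted in the lecture (log odds scale).
INTERCEPT = 7.2
SLOPE_AGE = -0.24   # per month of age
SLOPE_MALE = 0.27   # sex coded 1 = male, 0 = female


def log_odds_breastfed(age_months, male):
    """Step 1: plug the x values into the regression equation."""
    return INTERCEPT + SLOPE_AGE * age_months + SLOPE_MALE * male


def probability_from_log_odds(log_odds):
    """Steps 2 and 3: exponentiate to get odds, then odds / (1 + odds)."""
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


log_odds = log_odds_breastfed(age_months=22, male=0)   # about 1.92
odds = math.exp(log_odds)                              # about 6.82
prob = probability_from_log_odds(log_odds)             # about 0.87

print(f"log odds = {log_odds:.2f}, odds = {odds:.2f}, probability = {prob:.2f}")
```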

Just wanted to show you something.

In papers, you generally won't get the results on the log scale.

If they gave us all the information, including the baseline odds, we could take

the log of the odds ratios and the log of the baseline odds and recreate the equation.

Let me show you how this works on the exponentiated scale.

If we had, let's recall, the log odds from

the way we did it was equal to the intercept,

7.2 + the slope of -0.24 times 22.

But actually, I'm going to just exponentiate it without adding.

So if I exponentiate this portion here,

e to the (7.2 + -0.24 times 22).

You may recall, if you're familiar with exponents,

this is equal to e to the 7.2 times e to the (-0.24 times 22).

And we could rewrite this as e to

the 7.2 times e to the -0.24,

raised to the 22nd power.

But e to the 7.2 is just this baseline odds, this 1,333.

And e to the -0.24 is just the odds ratio, per one month difference in age.

So if we took the baseline odds of 1,333 times the odds ratio

per one unit difference in age raised to the 22nd power and

we multiply this all out, this would give us the odds of 6.82.

And then we could convert that to a probability.
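
A small sketch of this "fun with exponents" check, assuming only the rounded coefficients from the lecture. Note that exponentiating the rounded intercept 7.2 gives a baseline odds of roughly 1,339 rather than the 1,333 quoted, presumably because the quoted value comes from an unrounded intercept; the variable names are illustrative.

```python
import math

INTERCEPT = 7.2     # log of the baseline odds (rounded, from the lecture)
SLOPE_AGE = -0.24   # log of the odds ratio per month of age (rounded)
AGE_MONTHS = 22     # female children, so the sex term is 0 and drops out

# What a paper would typically report: exponentiated quantities.
baseline_odds = math.exp(INTERCEPT)        # roughly 1,339 with the rounded intercept
odds_ratio_per_month = math.exp(SLOPE_AGE) # about 0.79

# Route 1: stay on the exponentiated scale.
odds_from_ratios = baseline_odds * odds_ratio_per_month ** AGE_MONTHS

# Route 2: take logs to recreate the equation, add, then exponentiate.
recreated_log_odds = math.log(baseline_odds) + math.log(odds_ratio_per_month) * AGE_MONTHS
odds_from_logs = math.exp(recreated_log_odds)

print(f"odds via ratios = {odds_from_ratios:.2f}, odds via logs = {odds_from_logs:.2f}")
```

Both routes give the same odds, about 6.82, so which one to use is a matter of comfort.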

So this just provides an alternative to trying to recreate the equation from

published results that are presented on the exponentiated scale.

But if you're more comfortable and want to do this at some point, there's no shame in

taking the odds ratios, taking their logs, and

taking the log of the baseline odds, if it's given, to recreate the equation.

And do it the way we just did.

So if I were presenting a paper and I wanted to put something along with these

logistic regression results that really showed what the impact of sex and

age was on the resulting probabilities of being breastfed,

it might be nice to include a graphic like this.

And these curves are estimated by, basically,

going through all ages in the age range for both sexes.

And predicting, via that equation, the proportion of children who were breastfed

in each age and sex group, and then plotting them on a graph.
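
The points behind curves like these could be computed along the following lines. This is only a sketch using the rounded coefficients from the lecture; it prints a few values rather than drawing the graph, and the function and variable names are invented for the example.

```python
import math

INTERCEPT, SLOPE_AGE, SLOPE_MALE = 7.2, -0.24, 0.27  # rounded lecture values


def predicted_probability(age_months, male):
    """Estimated proportion breastfed for a given age and sex (1 = male)."""
    log_odds = INTERCEPT + SLOPE_AGE * age_months + SLOPE_MALE * male
    return 1.0 / (1.0 + math.exp(-log_odds))


ages = list(range(12, 37))  # every month from 12 to 36
female_curve = [predicted_probability(a, 0) for a in ages]
male_curve = [predicted_probability(a, 1) for a in ages]

# Vertical gap between the curves at each age: male minus female.
risk_difference = [m - f for m, f in zip(male_curve, female_curve)]

for a, f, m, d in zip(ages, female_curve, male_curve, risk_difference):
    if a % 6 == 0:  # print a few ages only
        print(f"{a:>2} months: female {f:.3f}, male {m:.3f}, difference {d:.3f}")
```

Plotting `female_curve` and `male_curve` against `ages` with any graphing tool would give the two curves.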

And so what you can see here is that across the range, from 12 to

36 months, for children of both sexes at 12 months,

a very high proportion, almost 100%, are being breastfed.

But by the time we get to 36 months, that's on the order of, and

the scale starts at about 20%,

so on the order of or less than 20%. And this shows that

both groups decreased pretty rapidly in the probability over that time period.

But this vertical difference at any point shows the difference in the estimated

proportions.

The risk difference, if you will, for being breastfed for

male compared to female children of the same age.

And so this really takes those results that gave us relative comparisons.

And puts them on an absolute scale to help us understand what a

resulting odds ratio of 0.79, or a reduction of 21% per month of

age, means in terms of the actual proportions being breastfed.

We can also do comparisons between groups that differ by more than one x value at once.

So maybe we want to estimate the odds ratio of being breastfed for

the group we just looked at, female children, 22 months old,

compared to male children, who are 19 months old.

So, if we did this the brute force way,

we would actually write out the equation on the log odds scale for both groups.

We've already done it for this group, but there's the math behind it, and

we could do it for this group there.

Take the intercept 7.2, plus the slope times the age of 19 months + 0.27 times 1,

because they're male.

And the sum total of these things, if we take the difference, is -0.99.

And that estimates the difference in the log odds of being breastfed for the first

group, female 22 months old, compared to the second group, males 19 months old.

If we look at that, if we exponentiate that, e to the -0.99

will give us the odds ratio comparing these two groups, which is about 0.4.

So the first group on the odds scale has a substantially lower odds

of being breastfed than the second, both because of the age difference and

the sex difference,

because males are more likely to be breastfed than females.

But let's look at this piece by piece.

If you notice, the intercept cancels.

And then the part that's due to age is the slope for

age times the difference in ages.

And then the part that's due to sex is the slope

of sex times the difference in sexes.

So when all the dust settles, we have this part that's because of the age differences

and this part that's because of the sex contributing to this difference.

And if we were to exponentiate this sum here,

this was just more fun with exponents,

we'd get e to the (-0.24 times 3, since 22 - 19 is 3,

+ -0.27), because [INAUDIBLE].

We're comparing the group coded 0 to the group coded 1 in this comparison.

This is equal to e to the -0.24(3),

times e to the -0.27.

And the odds ratio equals e to the -0.24 to

the third power divided by e to the 0.27.

So this equals the odds ratio,

e to the -0.24,

is that reduction per month of age,

0.79 to the third power divided

by the odds ratio for being male of 1.3.

And that's because the comparison

we have on top is 22 months female,

19 months male.

So this is the part, the three month difference because of age, and

the female to male is just the inversion of the male to female ratio.

So, you see it breaks down to these parts.

This is exactly equal to this up here.

But I just wanted to highlight that the differences of more than one predictor

just result in components from each, from the logistic regression model.
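
The brute force route and the component-by-component route can be checked against each other with a short sketch. It assumes the rounded coefficients from the lecture; the helper name `log_odds` and the variable names are illustrative.

```python
import math

INTERCEPT, SLOPE_AGE, SLOPE_MALE = 7.2, -0.24, 0.27  # rounded lecture values


def log_odds(age_months, male):
    return INTERCEPT + SLOPE_AGE * age_months + SLOPE_MALE * male


# Brute force: difference in log odds, females 22 months vs males 19 months.
log_or_brute = log_odds(22, 0) - log_odds(19, 1)

# Piece by piece: the intercept cancels, leaving one component per predictor.
age_part = SLOPE_AGE * (22 - 19)   # three-month age difference
sex_part = SLOPE_MALE * (0 - 1)    # group coded 0 compared to group coded 1
log_or_parts = age_part + sex_part

# Same thing on the exponentiated scale: (OR per month)^3 divided by OR for male.
or_from_ratios = math.exp(SLOPE_AGE) ** 3 / math.exp(SLOPE_MALE)

odds_ratio = math.exp(log_or_brute)  # about 0.37, i.e. roughly 0.4
print(f"log OR = {log_or_brute:.2f}, OR = {odds_ratio:.2f}")
```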

Let's look quickly at predictors of obesity from the NHANES.

And we looked at a couple different models.

We had potential predictors including HDL level,

male sex, age in four categories, and marital status. We looked at the unadjusted models, and then we

looked at the model that had HDL, sex, and age, and then the one that had all four.

And we saw that marital status, whether we

adjusted for other factors or not, was not statistically significantly associated with obesity.

So, let's use model two to make some predictions.

So here's what the model two equation looks like on the log odds scale.

So, the baseline odds was the exponentiated intercept; its log was -0.5.

I'm going to make x1 here arbitrarily HDL; the log of its odds ratio, 0.956, is -0.045.

I'm actually going to put age in here even though it didn't appear next in the table.

And then there are going to be three predictors for age: this will be

the second age group, 30 to 46 years; this will be a 1 if it's that group, 0 if not.

This'll be the third age group, 46 to 62 years, the indicator for that.

This'll be the greater than or equal to 62 group, and then this last thing I

put in out of order is a 1 if they're male, 0 if they're female.

So in any case I got that from the computer, but if you take the logs of

these respective quantities and match them up you'll see that you get those slopes.

So suppose I wanted to estimate proportions.

Let's just estimate for the practice.

Estimate the proportion or probability that 50 year old males with an HDL level

of 80 milligrams per deciliter are obese.

So just plug in these numbers, they get the intercept.

The slope of HDL is -0.045, we multiply that times 80.

They are 50 years old, so they're in the third age group.

So that's turned on, the other two drop out.

And so they get a 0.67 for that.

And they are male, so they get a 0.97 for

being male and the sum of these parts is -2.46.

So if we exponentiate this log odds to get the odds,

we get 0.085, and

if we use that to estimate a resulting proportion or probability,

it comes out to be 0.078,

or 7.8%. And we could go through and do this for other groups.
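
The same plug-in calculation for the obesity model can be sketched as follows. Only the slopes actually quoted in the lecture are used (intercept, HDL, the 46 to 62 age indicator, and male); the slopes for the other two age indicators are not given, so this sketch only covers someone in the 46 to 62 group. Names are made up for the example.

```python
import math

# Rounded values quoted in the lecture for model two (log odds scale).
INTERCEPT = -0.5
SLOPE_HDL = -0.045        # per mg/dL of HDL
SLOPE_AGE_46_TO_62 = 0.67 # indicator for the third age group
SLOPE_MALE = 0.97         # 1 = male, 0 = female

hdl = 80                  # mg/dL
in_age_46_to_62 = 1       # a 50 year old falls in this group; other indicators are 0
male = 1

log_odds = (INTERCEPT
            + SLOPE_HDL * hdl
            + SLOPE_AGE_46_TO_62 * in_age_46_to_62
            + SLOPE_MALE * male)          # about -2.46
odds = math.exp(log_odds)                 # about 0.085
prob = odds / (1.0 + odds)                # just under 0.08

print(f"log odds = {log_odds:.2f}, odds = {odds:.3f}, probability = {prob:.3f}")
```

Carrying full precision gives a probability of roughly 0.079; rounding the odds to 0.085 first gives the 0.078 mentioned above.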

But if we were trying to present the gestalt, the overall

results on the probability scale based on the result of these regressions,

we could present the table of the regression results to get the relative

comparisons on the odds scale and look at the significance and confidence intervals.

But then we could present graphs like this.

I just arbitrarily tried to get all the information we had in a

hopefully digestible way, and to show both the sex and age differences I presented

separate curves for the four age groups for males and females.

But put them all on the same scale, so

that we could see the association with age.

The increasing or decreasing association, and the fact that at any given HDL level

the group at lowest risk is the youngest group. But

for all four groups,

their likelihood of being obese decreases with increasing HDL, and they start

to converge around very high levels of HDL in terms of the predicted probability.

Same thing goes for females, but

if you actually look at the difference at a given point for females to males,

fixing the HDL level within an age group, you can see the probability among females

is lower than males of the same HDL and age.

So this is one way we might present the results so that not only do we see that the odds ratio is

significant but we can also see what this means in terms of the resulting estimated

proportions and the magnitude in different groups.

So, in summary multiple logistic regression results can be used to estimate

probabilities or proportions of binary outcomes for

a given subset in a population given the predictor values for the subset.

And multiple logistic results can be used to estimate odds ratios between groups who

differ by more than one characteristic or predictor.

And confidence intervals for both the predicted probabilities and

the odds ratios comparing differences in multiple predictors can be estimated.

And the principle is the same: for the odds ratios it would be on the log scale,

and you take the log plus or minus two standard errors; for

the proportions it would be similar to p hat plus or minus two standard errors.

But the trick is estimating the standard errors because there's multiple components

that go into estimating that.

A computer can handle that but there wouldn't be a straightforward way to get

those by hand, but you can certainly ask the computer to give those to

you if you're interested in presenting those as well.

So, for example, on those predicted probability curves we could have

put confidence bands around the curve to estimate the upper and lower

limits of the confidence interval for each estimated proportion on the curve.

If we ever wanted to do that.
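
To give a feel for why the standard error has "multiple components", here is a hedged sketch for the odds ratio comparing females 22 months old to males 19 months old. The variances and covariance of the slopes below are placeholder numbers invented purely for illustration; they are not from the lecture or the study, and in practice they would come from the software's estimated covariance matrix. The function names are also made up.

```python
import math


def se_of_log_odds_ratio(d_age, d_male, var_age, var_male, cov_age_male):
    """Standard error of d_age*slope_age + d_male*slope_male.

    The intercept cancels in the comparison, so only the slopes' variances
    and their covariance are needed.
    """
    variance = (d_age ** 2 * var_age
                + d_male ** 2 * var_male
                + 2 * d_age * d_male * cov_age_male)
    return math.sqrt(variance)


def odds_ratio_with_ci(log_or, se_log_or, multiplier=2.0):
    """Log odds ratio plus or minus two standard errors, then exponentiate."""
    lower = math.exp(log_or - multiplier * se_log_or)
    upper = math.exp(log_or + multiplier * se_log_or)
    return math.exp(log_or), lower, upper


# PLACEHOLDER values (hypothetical, for illustration only).
VAR_AGE, VAR_MALE, COV_AGE_MALE = 0.0004, 0.01, 0.0

log_or = -0.24 * (22 - 19) + 0.27 * (0 - 1)   # -0.99, from the lecture's slopes
se = se_of_log_odds_ratio(3, -1, VAR_AGE, VAR_MALE, COV_AGE_MALE)
or_hat, ci_lower, ci_upper = odds_ratio_with_ci(log_or, se)

print(f"OR = {or_hat:.2f}, interval from {ci_lower:.2f} to {ci_upper:.2f} (placeholder inputs)")
```

The interval printed here means nothing on its own, since the inputs are placeholders; the point is only the shape of the calculation.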

All right, anyway we are going to look at some examples of logistic regression from

the published literature in the next section.

Thank you.
