A practical and example filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From the lesson

Module 3B: More Multiple Regression Methods

This set of lectures extends the techniques debuted in lecture set 3 to allow for multiple predictors of a time-to-event outcome using a single, multivariable regression model.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

Greetings. In Section C here we're going to talk

a little bit about handling non-linear relationships with

a continuous predictor in regression and doing something different.

And we'll talk about the potential advantages in certain situations above and

beyond categorizing the continuous predictor.

We're going to talk about something called the spline approach.

So in this section, you'll get a brief overview of another method for handling

non-linearity in a regression method that allows for a piecewise approach

to estimating the relationship between an outcome and a continuous predictor.

To get this started, we're going to look at yet another arm circumference example.

This time on a sample of 1,000 Nepalese children who are between 0 and

60 months old.

So I have a pretty wide age range, from birth to five years.

We're going to try and quantify the relationship between arm circumference and

weight using regression.

And here's a scatterplot that shows the arm circumference and

weight values for the 1,000 children in the sample.

So take a look at this for a moment.

What do you think the nature of the relationship is between arm circumference,

or the average arm circumference and child's weight?

Well, you might first think let's try fitting a linear association.

The outcome is continuous and the predictor is continuous.

Let's try linear regression and put the line on our scatterplot and see how we do.

And for the most part, at least at the upper weights,

it looks like we're fitting the data pretty well.

But down here at or below around five kilograms it looks like we're missing

a key characteristic of the relationship here, a potential characteristic anyway.

By fitting one line to this overall cloud of points.

We may do pretty well with one line.

But if we're really trying to better understand growth rates in terms of

arm circumference as a

function of weight, we may miss some of the story down in the lower weights.

We could categorize weight in four groups.

So for example, I categorized weight into quartiles and

estimated the mean arm circumference for

each of the four weight quartiles, which is the middle of each square on the graph.

I just put the square here so we could see what the means look like.

And you can see, well, that'll do nicely, and that'll capture the essence of the story,

and we'll see a slightly larger jump from the first quartile to the second

than the respective jumps from the second to the third, and third to the fourth quartiles.

And this might, for most purposes, be sufficient.

But if we're interested in the rate of change, if you will, the amount of

change in arm circumference per kilogram change in weight across this entire

range of weights, then this categorization approach is not going to help us do that.

It's going to tell us, on average,

in the four different weight groups, how the mean arm circumference is different.

But it won't talk about, in each of those weight groups,

what the relationship between the arm circumference and weight is.

So, you may have thought possibly looking at this picture,

it looks like this is somewhat curvilinear.

Maybe we could fit a curve to these data, and we could actually,

we're not limited in regression to linear terms.

We could actually put in a term for

weight and a squared version and fit a curve like this.

And this might be good if our only interest was in predicting arm

circumference given somebody's weight.

But the tricky thing here is, if we want to use these results to quantify,

give a numerical summary of the association between arm circumference and

weight in the form of a slope or slopes,

we're not going to be able to do it if we fit a curve.

Because a curve is structured such that the slope or the difference or

the change in average arm circumference per unit change in

weight is different at each weight.

So there's not one or two numerical summaries we can pull out of this to say,

here's the association between arm circumference and

weight in this weight range.

And here it is in another weight range.

So, for quantification purposes, this is not necessarily a very useful approach.
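
To make that concrete, here is the quadratic case written out. The coefficient symbols are generic placeholders, not estimates from these data. The slope of the fitted curve depends on the weight value itself, so no single number summarizes it over a range.

```latex
\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_1^2,
\qquad
\frac{d\hat{y}}{dx_1} = \hat\beta_1 + 2\hat\beta_2 x_1
```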

So there's one more approach we could take.

Maybe we could actually still estimate the relationship between arm circumference and

weight, taking weight as continuous, but allow this relationship to

change across the weight range.

It could change, but each piece would be represented by a line, and

that's the idea of the spline.

Kind of think of "sp" for

split and line for line, put it together and you get spline.

So what I am showing here are the results of a regression that I'm going to detail in

a minute where I actually allow there to be two different line slopes.

In one regression model estimating the relationship between arm

circumference and weight.

And I picked, based on this scatterplot,

the change point to allow the relationship to change at five kilograms.

So, before we detail this,

let's talk generally about the linear spline approach.

This allows for non-linearity to be investigated via fitting lines with

differing slopes across the continuous predictor range.

The researcher or a statistician or

a person analyzing the data can pick the points where the line slope can change.

There are also methods for allowing the computer to do that, but

we won't be able to get into details about that.

And then the slope changes across the

range of the x variable can be estimated at multiple points.

We only chose to do one change for these data, but

we'll show another example with more.

And I like to think of this as really a non-linear sort of

a form of effect modification.

If you think about it, non-linearity occurs when an outcome

predictor relationship is different for different ranges of the predictor.

For example, the relationship between arm circumference and

weight may depend on weight.

Maybe the trajectory relating arm circumference to weight is different for

lower weights than it is for higher weights.

So you can sort of think of this as an outcome/predictor relationship being

modified by the predictor itself.

And the way we're going to handle this is very analogous to how we dealt with

interaction, or effect modification,

in a regression context with the interaction term.

We're going to create something similar to that to estimate these changing slopes.

So, again, what we're going to do now is use a technique that allows us not only

to estimate differing slopes, relating arm circumference to weight for

different weight ranges, but we'll also be able to test whether this change is

statistically significant or not.

In other words, whether the data supports the change at the population level.

So let's look at how this is set up.

If I want to set this up, it's going to look a little messy at first, but

we'll parse it.

I want to estimate the association between arm circumference and weight, but

I want to allow for a spline or

a changing slope at five kilograms, this is how I'm going to do it.

These are the results I get from the computer.

I'm going to estimate a line that includes the slope for x1, where x1 is weight.

And then, and this may be reminiscent of interaction terms, it includes another

copy of x1, but subtracting off the point where I want to estimate the change.

The reason we have to subtract off that point is so that the two segments connect.

If we didn't subtract off the point where we estimate the change, there'd be a jump,

potentially, between the two lines, and we want to estimate a continuous function here.

But this piece here is very much like an interaction term. We sometimes use

the notation of a plus sign above this to indicate that this extra term,

the spline term, is not activated

if we're looking at the relationship between y and

x1 for x1 values less than the point where we want to estimate the change.

And then this just gets turned on as a copy of itself, x1 minus 5 or

weight minus 5,

if we're looking at the association between y and x1 at greater than or

equal to the cut point or the change point of 5.
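
As a rough sketch of how this setup could be coded, the block below builds the "plus" term and fits the model by ordinary least squares. The data are simulated, not the Nepalese sample; the function names `hinge` and `spline_design` are made up for illustration, and the true values used in the simulation are simply borrowed from the rounded estimates discussed in this section.

```python
import numpy as np

def hinge(x, knot):
    """Linear spline term (x - knot)+ : 0 below the knot, x - knot at or above it."""
    return np.maximum(x - knot, 0.0)

def spline_design(x, knots):
    """Design matrix with an intercept, x itself, and one hinge column per knot."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x] + [hinge(x, k) for k in knots]
    return np.column_stack(cols)

# Simulated data for illustration only (assumed values, not the real sample).
rng = np.random.default_rng(0)
weight = rng.uniform(1.0, 15.0, size=1000)
arm = 6.25 + 1.17 * weight - 0.86 * hinge(weight, 5.0) + rng.normal(0.0, 0.5, size=1000)

# Fit arm circumference on weight and the spline term with a change point at 5 kg.
X = spline_design(weight, knots=[5.0])
beta, *_ = np.linalg.lstsq(X, arm, rcond=None)
b0, b1, b2 = beta

print("slope below 5 kg:", round(b1, 2))
print("change in slope at 5 kg:", round(b2, 2))
print("slope at or above 5 kg:", round(b1 + b2, 2))
```

Because the hinge column is zero below the change point and grows from zero at the change point, the two fitted segments meet at 5 kilograms without a jump.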

So, let's see why this works out.

So, if we're looking at the relationship between our outcome and our predictor,

arm circumference and weight, for children whose weights are less than five,

then that piece, the (x1 minus 5)-plus piece, is turned off; it's zero.

And so for children whose weights are less than five,

our equation's pretty straightforward and simple.

We get that the estimated relationship between mean arm circumference and

weight is equal to an intercept of 6.25 plus a slope of 1.17 times weight.

So, on average a one kilogram difference in weight is

associated with a 1.17 centimeter difference in arm circumference for

children whose weights are less than five kilograms.

What happens if we're looking at or beyond the cut point?

Well, we get this term back.

Remember, when we're at or beyond the cut point, this is just equal to

what it is in parentheses, x1 or weight minus 5 kilograms.

If we do a little algebra and multiply this out, you'll see,

just like when we had interaction terms, we just get another copy of our predictor.

And we also have to multiply this slope times the negative 5 here.

And if we redistribute this, pulling the terms over that don't involve x1,

and separating those that do, it looks like this.

Such that when all the dust settles we have a different intercept of 10.75 and

our slope for weight here is the sum of the slope that we had for

children less than 5 kilograms, plus the extra coefficient for the spline term.

And this when summed together turns out to be 0.31.
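
Written out with the rounded coefficients quoted in this section (6.25, 1.17, and the change in slope of negative 0.86), the algebra looks roughly like this. The notation is an assumption, since the slide itself is not reproduced here.

```latex
\begin{aligned}
\hat{y} &= 6.25 + 1.17\,x_1 - 0.86\,(x_1 - 5)^{+} \\
x_1 < 5:\quad \hat{y} &= 6.25 + 1.17\,x_1 \\
x_1 \ge 5:\quad \hat{y} &= 6.25 + 1.17\,x_1 - 0.86\,x_1 + 0.86 \times 5 \\
&= (6.25 + 4.30) + (1.17 - 0.86)\,x_1 \\
&= 10.55 + 0.31\,x_1
\end{aligned}
```

With these rounded inputs the second intercept comes out near 10.55 rather than the 10.75 stated, which may be a slip in the transcript; the slope of 0.31 is unaffected. Both segments give about 12.10 centimeters at exactly 5 kilograms, which is what connecting the lines requires.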

So, what we're allowing here is, for children who are less than five kilograms,

we estimate that a one kilogram difference in weight is

associated with an average difference in arm circumference of 1.17 centimeters.

For children who are greater than or

equal to five kilograms the trajectory shifts down substantially. There's

still a positive association between arm circumference and weight.

But now, when we compare two groups of children who differ by one kilogram,

where both groups are greater than or equal to five kilograms,

the average difference in arm circumference is now 0.31 centimeters.

So, we see that the growth relationship, if you will,

slows down after five kilograms.

So just to show what we've got on this graphic here,

the slope for this first segment here is that 1.17.

And then as soon as we hit five, we see the trajectory change and

the slope here is 0.31.

If we were to extend this first

segment down all the way to the y axis, we'd hit that intercept of 6.25.

If we were to extend the second segment

down to the y axis, we get that, and I didn't draw it to scale here,

that intercept that gets created by adding the original

intercept to the extra piece that comes from the spline term, 10.75.

But you can see here that we're now able to quantify the change in average

arm circumference per one kilogram difference in weight separately and

differently for these two weight ranges still using linear regression.

This is really nice because we could also test now,

that was an estimate based on our data, but we could actually test whether,

at the population level of all such children, zero to five years old,

there is a real, statistically significant change in

the relationship between arm circumference and weight after five kilograms.

And that's the key in the testing.

This is our estimate of that change piece.

But we test whether that's zero or not to get a test of whether these data support,

show evidence of a change in the relationship between arm circumference and

weight, at five kilograms, at the population level.

If we were to do this, the p-value is very low, and so these data

show a statistically significant change in the association at that level.

We could also get confidence intervals from the computer, and

certainly if you have the information you could do it by hand as well,

but the standard error for these things is harder to get.

But the computer will give us confidence intervals for each of these slopes.

So for the first slope, the estimate was 1.17 and

the confidence interval went from 1.02 to 1.32.

The slope after five, which is the combination of 1.17 plus that negative

0.86 which quantifies the negative change in the slope before five to after five.

The slope after this is 0.31 with a confidence interval of 0.29 to 0.32.

And the reason this is much narrower than the first confidence interval is that

there are more data points in the after five kilogram range compared to before.

But what we can see here with these data is that the relationship between arm

circumference and weight is positive and

statistically significant for both groups of children,

those less than five kilograms and those greater than five kilograms.

But, not only are the estimates different but

the confidence intervals don't overlap and that's consistent with this test result we

had that the change was real and statistically significant.
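
Here is a minimal sketch of why the standard error for the second slope is harder to get by hand: it is the standard error of a sum of two coefficients, so it needs their covariance as well as their variances. The data are simulated under assumed values (not the real sample), `ols_with_cov` is a hypothetical helper, and the intervals use the usual normal approximation with 1.96.

```python
import numpy as np

def ols_with_cov(X, y):
    """Ordinary least squares: coefficient estimates and their estimated covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])  # residual variance
    return beta, sigma2 * XtX_inv

# Simulated data for illustration only (assumed true values).
rng = np.random.default_rng(1)
w = rng.uniform(1.0, 15.0, size=1000)
spline_term = np.maximum(w - 5.0, 0.0)
y = 6.25 + 1.17 * w - 0.86 * spline_term + rng.normal(0.0, 0.5, size=1000)
X = np.column_stack([np.ones_like(w), w, spline_term])

beta, cov = ols_with_cov(X, y)

# Test of the change in slope: is the spline coefficient different from zero?
change = beta[2]
z_change = change / np.sqrt(cov[2, 2])

# Slope before the change point and its approximate 95% interval.
slope_before = beta[1]
se_before = np.sqrt(cov[1, 1])

# Slope after the change point is a sum, so its variance includes the covariance.
slope_after = beta[1] + beta[2]
se_after = np.sqrt(cov[1, 1] + cov[2, 2] + 2.0 * cov[1, 2])
ci_after = (slope_after - 1.96 * se_after, slope_after + 1.96 * se_after)

print("z statistic for the change:", round(z_change, 1))
print("slope after, approximate 95% CI:", tuple(round(v, 2) for v in ci_after))
```

In this simulation most of the points sit above 5 kilograms, so the interval for the second slope comes out narrower than the one for the first, in line with the explanation given for the real data.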

Let's look at another example.

Let's go back to the NHANES data.

Let's look at obesity and age, the relationship.

We had previously controlled for age in other analyses, and categorized age

because there was some evidence from our Lowess plot, that at least the unadjusted

association between the log odds of being obese and age was not linear.

And that's fine.

We created age quartiles, and

just estimated the log odds for each of the four age quartiles.

And then used the results to get odds ratios for

quartiles two through four compared to the reference quartile one.

But, suppose we were actually interested in the change in the log odds and hence,

the change in the odds ratio per year of age, across this entire age range.

And we did some preliminary analysis and

saw that it may not be purely linear on the regression scale, what could we do?

[SOUND] Well, we could fit a model that allows for changes.

Just using this graphic, I'm going to estimate that the changes occurred at

40 years old and 60 years old.

That's a subjective decision, and you may make a slightly different decision and

could model it similarly to what I'm doing here.

But what we're now doing with this approach is estimating potentially three different

associations between the log odds of obesity and age, three different lines.

And the way to handle this would be an extension of what we did before.

We put in our main predictor of age, that's x1.

And then we create these spline terms at 40 and 60.

So, just as a refresher, this term here would not even be activated until

we hit 40 years, at which point it would take on the value of age minus 40.

And this term here would not be activated until we were dealing with people 60 years

and older, at which point it would take on the value of x1 or age minus 60.

And we could also control for

other things like we had done previously, like HDL and sex,

things that may differ across the age groups and also be related to obesity.

So splines can be done in a simple regression context, like we saw before,

but also in multiple regression context.

I want you to focus on the big picture, but

some of you are more interested in the mechanics than others, and

if you're interested in the mechanics, see if you can get my results here.

But what this does is if we're looking at the different age ranges we get different

slopes estimating the relationship between the log odds of obesity and

age depending on the age range.

For the less than 40 group,

it's just the main slope for age, that generically labeled beta one hat.

If we look at the 40 to 60 group, then the overall association is the starting slope,

the one for the less than 40 group, plus the coefficient for that spline term for after 40.

The second spline coefficient doesn't come into play until we actually hit 60,

in which case the relationship between the log odds of obesity and

age is described by the sum of the relationship for

the first group plus the coefficient for the spline term for the second age group

after 40 plus the coefficient for the third age group, after 60.

So you can see these things cumulatively add together.

These are log odds ratios and

we could exponentiate them to get the respective odds ratio of being obese

associated with a one year difference in age for each of these age ranges.
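
For those interested in the mechanics, the accumulation can be sketched in a few lines. The three coefficients below are hypothetical values chosen for illustration only; they are not the NHANES estimates, and the helper name `age_terms` is made up.

```python
import numpy as np

knots = [40.0, 60.0]

# Hypothetical log odds coefficients (assumed for illustration):
# the main slope for age, then the change in slope at 40 and at 60.
b_age, b_40, b_60 = 0.04, -0.04, -0.01
coefs = np.array([b_age, b_40, b_60])

# Slopes on the log odds scale for ages <40, 40 to 60, and >=60 add up cumulatively.
segment_slopes = np.cumsum(coefs)

# Exponentiate to get the odds ratio per one-year difference in age in each range.
odds_ratios = np.exp(segment_slopes)

def age_terms(age):
    """Columns for age, (age - 40)+ and (age - 60)+, as they would enter the model."""
    age = np.asarray(age, dtype=float)
    return np.column_stack([age] + [np.maximum(age - k, 0.0) for k in knots])

for label, oratio in zip(["<40", "40 to 60", ">=60"], odds_ratios):
    print(label, round(float(oratio), 3))
```

The same `age_terms` columns could sit alongside other predictors such as HDL and sex in a multiple logistic regression; only the age-related coefficients are summed to get each segment's slope.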

So, let's see what the results look like.

And this might be a way we'd present it, in a table.

I'm going to show you the results when we didn't adjust for HDL levels and

sex, and after we did.

And you can see that numerically, actually,

when you first look at these odds ratios, they are pretty much similar.

Although the odds ratios for the 40 to 60-year-old group and the greater than or

equal to 60-year-old group get closer to one after adjustment.

But let's think about what each of these things are telling us,

let's look at the adjusted situation.

This odds ratio here, of 1.04, is the relative odds of being

obese for two groups of persons who differ by one year in age,

where both groups are less than 40 years old.

So 30 year olds to 29 year olds, 35 year olds to 34 year olds, et cetera.

Each additional year of age is associated with a 4% increase in the odds of obesity.

When we go to 40 to 60 years, I've already combined the slopes and

exponentiated them so we don't need to do anything.

We can take this odds ratio at face value.

There is no association between increasing age and obesity.

The relative odds is constant, right.

The odds ratio for a one year difference in age for

those between 40 and 60 years old is one, indicating no difference in the odds.

And it is not statistically significant.

At greater than or equal to 60 years,

there is a decreased odds of being obese associated with increased age.

A decrease of 1% per additional year of age

for people greater than or equal to 60 years, but this is not statistically significant.

So, the big picture here and

probably what we're seeing evidence of is there's this shift at or

about 40 years old where age no longer is associated with the odds of being obese.

But in younger years it is.

And so if I were actually doing the analysis, and not necessarily trying to

replicate the results that someone else did on a separate population, I may

go back and re-run this with a single spline term at 40 years to get a better

combined estimate of the change, or lack of change, if you will, after 40 years.

Let me show you an example that is from the literature.

This is from an article on soda consumption and physical education classes

from the American Journal of Public Health.

And I'll just read you the abstract here because they bring in this idea

of non-linearity.

They say we examine the association of adolescents' beverage

consumption with physical activity and

study how their school beverage environment influences the association.

We use the nationally representative data from the 2007

Early Childhood Longitudinal Study, Kindergarten Cohort.

And then we examine non-linear associations of

eighth graders' self-report of beverage consumption, milk, 100% juices, or

soft drinks, with moderate to vigorous physical activity and

physical education participation using piecewise linear regression models.

That's a synonym for splines.

In their results they say we found a non-linear association of

participation in physical education class, PE class, with beverage consumption,

especially in schools with vending machines and those selling soft drinks.

For students participating in physical education less than three days per week,

beverage consumption was not significantly associated with

participation in physical education class frequency.

For students participating in physical education three to five days per week,

one more day of participation in phys ed class,

was associated with 0.43 more times per week of soft drink consumption and

0.41 fewer glasses per week of milk consumption.

So, what they're saying here is that this relationship,

the per-unit change in number of soft drinks consumed on average, or

number of glasses of milk consumed on average, as a function of number of days

of physical education, is different for those who get less physical education,

zero to two days per week, versus those who get more, three to five days per week.

They actually showed the results of these regressions, and they do it with three

outcomes, soft drinks or sodas, milk, which they refer to as well, and juice.

And they share the results of several different models, but

you'll want to notice model three and model four.

So, you'll notice they have two different entries under participation in

PE class, and they say spline, zero to two days and three to five days.

So, these are not actual categorical indicators; when they put the word spline,

they're saying that this slope is the per-unit association in the outcome,

average soft drinks consumed per number of days of phys ed.

And the authors estimate separate slopes for zero to two days and

three to five days.

And they also do this in model four,

where they adjust for amount of moderate to vigorous physical activity.

Everything in this model is also, and down here they say, adjusted for

adolescents' gender, age, race, ethnicity, et cetera, et cetera.

And they actually explain how to

interpret the results from these piecewise regression lines.

And they indicate that some of these p values are testing for

the difference between two slopes for zero to two days versus three to five days

in participation in physical education class, in piecewise linear regression.

So they're just explaining what they did.

Let's look at what they did for soft drinks.

These are the results for soft drinks.

And I'm going to actually detail the results of this model four,

where they included all the adjustment factors they mentioned, age, etc, but

are also highlighting the association with moderate to vigorous physical activity and

this participation in physical education class.

Here's what the model looks like, what they ran.

We had a slope of negative 0.26 for number of days of moderate to vigorous physical activity.

So increased number of days is associated with a decrease in average number of

soft drinks consumed adjusting for amount of physical education and

the other things in the model.

And then they had these two pieces for

number of days participating in the phys ed class.

I'll call x2 the number of days participating, and

then the spline at three days, x2 minus 3.

So this is only turned on or activated for children who participated in three, four,

or five days of physical education.

So if you look at the results of this, when we look at adolescents who

participated for less than three days, this spline piece disappears,

it's not activated and the sole slope of number of

days of physical education is this negative 0.18 which is what they reported.

And it was not statistically significant.

So even though there was a negative association between average beverages

consumed, average sodas consumed, and number of days of physical education, it

was not statistically significant, that's what they referred to in the abstract.

For the group who had physical education for three or

more days, three, four or five days we have to turn on this spline term.

And if you turn this on, activate it, and then do some algebra and regrouping,

when all the dust settles, the slope for number of days of physical education

all combined is the original slope plus this coefficient for the spline term.

And if you do that, if you add these together,

this is the 0.43 they were quoting in the abstract and

that they showed in that table:

that the association, for three or

more days of physical education between average sodas consumed and

the amount of physical education is positive and statistically significant.

Even after adjustment for moderate to vigorous physical activity and

the other adjustment factors.
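
As a rough check on how the pieces fit together, the arithmetic can be written out. It assumes the 0.43 quoted in the abstract is the combined slope for three to five days from this same model; the spline coefficient below is only what that assumption implies, not a number the authors reported.

```latex
\hat\beta_{\text{3 to 5 days}} = \hat\beta_{x_2} + \hat\beta_{\text{spline}}
\;\Rightarrow\;
0.43 = -0.18 + \hat\beta_{\text{spline}}
\;\Rightarrow\;
\hat\beta_{\text{spline}} \approx 0.61
```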

So they give confidence intervals for

each of these pieces back in that table I showed you before, and

they also noted that this change in slope here, they didn't show this piece.

They actually kindly added the two together so

we didn't have to do the adding in their table.

They gave us the before and after slopes and did not show us the change piece but

they noted that the change was statistically significant.

So, in summary, linear splines offer an alternative to categorizing

a continuous predictor when investigating and/or

handling potential non-linearity in an outcome-exposure

association estimated with regression, whether it be simple or multiple.

And this approach is very useful when the per unit change in a measure of

association, the mean difference in arm circumference per one kilogram of weight,

the change in the relative odds of obesity per year of age, et cetera,

when this change is of scientific interest but

the association is not necessarily linear on the regression scale.

So this allows us to fit several lines to describe the relationship on

the regression scale.

For linear regression, the slopes themselves tell the story, and for

logistic or Cox proportional hazards regression,

we'd have to exponentiate the results to get the differing ratio results.

But this is a nice approach when that change per unit is of interest.
