A practical and example-filled tour of simple and multiple regression techniques (linear, logistic, and Cox PH) for estimation, adjustment, and prediction.

From the course by Johns Hopkins University

Statistical Reasoning for Public Health 2: Regression Methods

From the lesson

Module 1B: More Simple Regression Methods

In this module, more detail is given regarding Cox regression, and its similarities to and differences from the other two regression models from module 1A. The basic structure of the model is detailed, as well as its assumptions, and multiple examples are presented.

- John McGready, PhD, MS, Associate Scientist, Biostatistics

Bloomberg School of Public Health

So in this section we'll put some numbers around what we were talking about in Section A,

and look at the results from

simple Cox Regression where we have binary or categorical predictors.

So at the end of this lecture section hopefully you'll be

able to interpret the slopes from

simple Cox Regression models as log hazard ratios

and the exponentiated version of the slopes as hazard ratios.

So let's look at our famous primary biliary cirrhosis trial, the randomized clinical

trial on 312 patients with primary biliary cirrhosis,

or PBC, studied at the Mayo Clinic in Rochester, Minnesota.

You may recall that the 312 patients were randomized to receive either

received DPCA or placebo upon their enrollment in the study.

And they were followed from enrollment until death or censoring, whether they dropped out

or made it to the end of the study without having died.

And the follow-up period was up to 12 years for the individuals enrolled in the study.

So the question we might have is what is

the association between treatment and patient survival?

Can we quantify that?

So previously, in Statistical Reasoning 1, we did a visual assessment.

We looked at the Kaplan-Meier survival estimates and then we computed

the incidence rate ratios based on the number

of events and the time at risk in both groups.

And we saw visually from the Kaplan-Meier estimates that there

was little visual evidence of one treatment being superior to the other;

the curves sort of occupied the same territory

and crossed over each other at various points.

And we did quantify that with

the incidence rate ratio, but

I have not yet shown you a quantitative measure from a regression model.

So let's estimate this incidence rate ratio

or hazard ratio using Cox regression to show how this works.

So what the regression does is fit

a model of this sort, where the estimated log hazard of

mortality for subjects at a given time in the follow-up period is equal to

this time-specific intercept, if you will, plus the slope times our predictor.

And here, our predictor of interest is binary and I've made the choice to model as a

one if a subject was randomized to

the DPCA group and a zero if they were randomized to the placebo.

And as long as we're comparing these two groups at the same time

in the follow-up period,

this intercept piece holds for both of them,

and the group who was randomized to receive DPCA gets this extra beta1 hat.

So at any given time the difference in

the log hazard between these two groups is simply this slope, beta1 hat.

So in other words, just to reiterate,

this slope estimates the difference in the log hazard for

individuals in the drug group minus the log hazard of

mortality for individuals in the placebo group at the same time,

t, in the follow-up period or, re-expressed mathematically,

the slope is the log of the hazard ratio of mortality comparing the drug group,

to the placebo group at the same point in time.
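
Written out, the model just described can be sketched as follows. The notation here is an assumption for illustration and is not taken from the lecture slides: the hazard at time t is written as lambda(t, x1), the time-specific intercept as the log of lambda_0(t), and x1 is one for DPCA and zero for placebo.

```latex
\begin{align*}
\ln \hat{\lambda}(t, x_1) &= \ln \hat{\lambda}_0(t) + \hat{\beta}_1 x_1 \\
\ln \hat{\lambda}(t, x_1 = 1) - \ln \hat{\lambda}(t, x_1 = 0) &= \hat{\beta}_1 \\
\hat{\beta}_1 &= \ln\!\left( \frac{\hat{\lambda}(t, x_1 = 1)}{\hat{\lambda}(t, x_1 = 0)} \right) = \ln\big(\widehat{HR}\big)
\end{align*}
```

The intercept term cancels in the subtraction, which is why the comparison does not depend on the particular time t.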

So if we actually ran this on the computer, we'd get regression results that look like this,

where our slope estimate was equal to 0.057.

This is the estimated log hazard ratio.

So this shows, at least in this study,

an elevated log hazard of mortality for those who

received the drug compared to those who got the placebo.

If we exponentiate this to get an estimated hazard ratio,

the exponentiated version is equal

to approximately 1.06.

So at least in the study sample, we estimated that those in the drug group had

a slightly elevated risk or hazard of mortality over the follow-up period compared to

those in the placebo group, elevated by roughly 6%.
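
As a quick check of that arithmetic, here is a minimal Python sketch that exponentiates the slope quoted in the lecture. The variable names are my own and are only for illustration.

```python
import math

# Slope (estimated log hazard ratio) quoted in the lecture, DPCA vs. placebo.
beta1_hat = 0.057

# Exponentiating the log hazard ratio gives the estimated hazard ratio.
hazard_ratio = math.exp(beta1_hat)

# Percent difference in hazard relative to the reference (placebo) group.
percent_change = (hazard_ratio - 1) * 100

print(f"estimated hazard ratio: {hazard_ratio:.2f}")
print(f"percent elevation in hazard: {percent_change:.0f}%")
```

A slope this close to zero gives a hazard ratio close to one, which is what "roughly 6% higher" is expressing.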

What is this intercept? Well, in this study,

this estimates the log hazard of mortality at any given time,

t, in the follow-up period. To fully specify it, this is the log hazard at time

t when x1 is equal to zero, that is, when we're looking at the placebo group.

So this tracks the log risk of mortality

over time for those individuals randomized to the placebo group.

So this changes as a function of time.

But again the important thing to note is this model

estimates the comparison between the group coded one,

the drug group, to the placebo at any point in time to be

constant on the log hazard scale.

And when exponentiated we get a ratio,

a constant ratio, of relative hazards over the entire follow-up period.

So wherever this log hazard is at that given point in time we're looking

at for the placebo group we get the log hazard

for the drug group by adding that slope of .057.
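
To make the "constant ratio" idea concrete, here is a small Python sketch. The placebo log hazards below are made-up numbers chosen only for illustration; the lecture does not give them. Only the slope of 0.057 comes from the lecture.

```python
import math

beta1_hat = 0.057  # slope from the lecture (log hazard ratio, DPCA vs. placebo)

# HYPOTHETICAL placebo-group log hazards at a few follow-up times (in years).
# These values are invented for illustration; they are not from the PBC trial.
placebo_log_hazard = {1: -3.2, 4: -2.7, 8: -2.4, 12: -2.9}

ratios = []
for year, log_h0 in placebo_log_hazard.items():
    log_h1 = log_h0 + beta1_hat  # drug group: add the slope to the intercept
    ratio = math.exp(log_h1) / math.exp(log_h0)
    ratios.append(ratio)
    print(f"year {year:>2}: placebo {log_h0:.3f}, DPCA {log_h1:.3f}, ratio {ratio:.3f}")
```

The intercept moves up and down over time, but the ratio of the two hazards printed on each row stays the same.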

So let's look at the infant mortality and

prenatal vitamins example using this Cox regression approach.

We've already done this by estimating

incidence rate ratios based on the raw data, by hand,

and looking at the Kaplan-Meier curve estimates and like the DPCA study

with primary biliary cirrhosis we saw very little visual evidence of

an association between mortality of the infant and

which vitamin group the mother was

randomized to in the prenatal period whether she got vitamin A,

beta carotene, or placebo.

But how could we model this and estimate relative risks or

relative hazards at any given time point using Cox regression?

Well we've got three groups here,

and again this is just a general regression model,

so the protocol would be to estimate

two binary indicators for two of the groups and leave one out as the reference.

So what I've chosen to do is make the reference group,

the placebo group and their value of x1 and x2 is zero.

This is the group that all other groups will be compared to.

For x1 equals one,

I'm going to let that indicate the beta carotene group.

And this implies that x2 is equal to zero for those in the beta carotene group.

For x2 equal to one, and hence x1 equal to zero,

this will indicate subjects whose mothers were

randomized to receive vitamin A during the prenatal period.
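
The coding scheme just described can be sketched in Python as follows. The function name and the group label strings are hypothetical; only the coding itself (placebo as reference, x1 for beta carotene, x2 for vitamin A) comes from the lecture.

```python
def encode_group(group):
    """Return the indicator pair (x1, x2) for a treatment group label.

    Placebo is the reference group, so it is coded (0, 0).
    """
    x1 = 1 if group == "beta carotene" else 0
    x2 = 1 if group == "vitamin A" else 0
    return x1, x2

for group in ["placebo", "beta carotene", "vitamin A"]:
    print(group, encode_group(group))
```

Three groups need only two indicators, because the reference group is identified by both indicators being zero.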

So what we get is a result like this on the log hazard scale.

So this slope here compares those who are indicated by x1 equals one,

beta carotene to the reference group which was placebo.

But this is on the log hazard scale, so it is a difference in log hazards.

This compares those in the vitamin A group to those on placebo.

And this value here is the log risk of

mortality for children whose mothers were randomized to placebo

during the prenatal period at any point in

the follow-up time period of the study which was up to 180 days.

So this is our starting point.

In order to get the log hazards for the other groups we add

their respective slope to that starting point for the placebo.

So if we exponentiate this slope of .02 for the beta carotene group,

we'll get the relative hazard or

hazard ratio of mortality in the follow-up period for children whose mothers received

beta carotene during the prenatal period compared to children whose mothers

received placebo, and it's roughly equal to 1.02.

And if we do the same thing for the vitamin A group relative to the placebo group,

exponentiate that, we get a relative hazard or hazard ratio of 1.06, slightly higher.
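
Here is a short Python sketch of those two conversions. The beta carotene slope (0.02) and the vitamin A hazard ratio (1.06) are the values quoted in the lecture; the vitamin A slope is not quoted, so the code backs it out from the hazard ratio, and that implied value is only approximate. The last line, comparing vitamin A to beta carotene, is my own addition.

```python
import math

# Beta carotene vs. placebo: slope quoted in the lecture.
beta1_hat = 0.02
hr_beta_carotene = math.exp(beta1_hat)

# Vitamin A vs. placebo: hazard ratio quoted in the lecture.
hr_vitamin_a = 1.06
# Implied slope on the log hazard scale (an approximation, since 1.06 is rounded).
beta2_implied = math.log(hr_vitamin_a)

# Comparing the two non-reference groups: ratio of hazard ratios,
# equivalently the exponentiated difference in slopes.
hr_a_vs_bc = math.exp(beta2_implied - beta1_hat)

print(f"beta carotene vs. placebo HR: {hr_beta_carotene:.2f}")
print(f"implied vitamin A slope: {beta2_implied:.3f}")
print(f"vitamin A vs. beta carotene HR: {hr_a_vs_bc:.2f}")
```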

But this jibes with what we saw in the Kaplan-Meier curves, that essentially the risk of

mortality over the follow up period was very similar

among the three prenatal vitamin groups,

among infants born to mothers in

the three prenatal groups in the first 180 days following birth.

Here's an article that presents the results of

a randomized trial of a home-based intervention of early feeding practices.

So as per the authors the intervention consisted of five or six home visits from

a specially trained research nurse delivering

a staged home-based intervention in the antenatal period and at one,

three, five, nine, and 12 months.

So this was an intervention to help new mothers implement early breastfeeding.

And so here's a Kaplan-Meier curve that traces the breastfeeding rate,

the proportion of children who are still being breastfed,

in the follow-up period, up to a year of follow-up, by

week for those in the control group and those in the intervention group.

And it's difficult to see in this graph but the dotted line

here is the control and the solid line here is the intervention.

So the proportion, this tracks the proportion,

who are still being breastfed,

the proportion who have yet to have

the event beyond a certain point of time and we can see that

the curves are higher for

the intervention group indicating more retention with breastfeeding.

So what the authors reported

to quantify the difference,

and this is taken directly from the article, is that, as compared with the control group,

the hazard ratio for stopping breast feeding in the intervention group was 0.82.

So in other words, the relative hazard of breastfeeding being stopped,

at any point in time in the yearlong follow-up period,

for children in the intervention group

compared to children in the control group was 0.82.

So 18% lower hazard,

lower risk of breastfeeding being stopped.

And that jibes with what we saw in the Kaplan-Meier curves because more children,

a higher percentage, were being

breastfed as a function of time compared to the control group.

This comes from a Cox regression model that

the authors estimated. If they had run it on the computer, they would have gotten

a slope for x1, if x1 were

coded as one for the intervention group and zero for the control group, of negative

0.2, because e to the negative 0.2 is approximately equal to 0.82,

their estimated hazard ratio.

So the authors used Cox regression and that's what

the results would have looked like on the log hazard scale.
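
Going in the other direction, from a reported hazard ratio back to the slope, can be sketched like this in Python. Only the 0.82 comes from the article as quoted in the lecture; the variable names are illustrative.

```python
import math

reported_hr = 0.82  # hazard ratio for stopping breastfeeding, as quoted

# The slope on the log hazard scale is the natural log of the hazard ratio.
implied_slope = math.log(reported_hr)

# Rounding the slope to -0.2 and exponentiating recovers roughly the same ratio.
hr_from_rounded_slope = math.exp(-0.2)

# A hazard ratio below one corresponds to a lower hazard in the intervention group.
percent_lower = (1 - reported_hr) * 100

print(f"implied slope: {implied_slope:.3f}")
print(f"exp(-0.2): {hr_from_rounded_slope:.2f}")
print(f"percent lower hazard: {percent_lower:.0f}%")
```

A negative slope always exponentiates to a hazard ratio below one, and a positive slope to a ratio above one.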

So in summary again the slopes from Cox regression models with

binary or categorical predictors compare the log hazard of

the time to event outcome between two groups at the same time in the follow-up period.

And the idea is that that difference in

the log hazard is constant regardless

of where we're comparing them in the follow-up period.

And then the slopes can be exponentiated to get

the estimated hazard ratio for the groups being compared.

And we get that constant ratio of hazards

regardless of where we're comparing them in the follow-up period.
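
For a sense of what the computer is doing when it produces such a slope, here is a rough, self-contained Python sketch. It is not the software or the method used in the lecture: it assumes the standard Cox partial likelihood with no tied event times, fits it by Newton-Raphson, and uses a tiny invented data set. Every name and number in it is hypothetical.

```python
import math

# HYPOTHETICAL toy data, not from any study in the lecture. Each row is
# (follow-up time, event: 1 = event / 0 = censored, x: 1 = treated / 0 = reference).
data = [
    (1, 1, 1), (2, 1, 0), (3, 0, 1), (4, 1, 1),
    (5, 1, 0), (6, 0, 0), (7, 1, 1), (8, 1, 0),
]

def score_and_information(beta):
    """Derivative of the partial log-likelihood and its negative second derivative."""
    score, information = 0.0, 0.0
    for t_i, event_i, x_i in data:
        if not event_i:
            continue  # censored subjects contribute only through the risk sets
        # Risk set: everyone still under follow-up at time t_i.
        risk_set = [(math.exp(beta * x_j), x_j) for t_j, _, x_j in data if t_j >= t_i]
        total_weight = sum(w for w, _ in risk_set)
        mean_x = sum(w * x for w, x in risk_set) / total_weight
        mean_x_sq = sum(w * x * x for w, x in risk_set) / total_weight
        score += x_i - mean_x
        information += mean_x_sq - mean_x ** 2
    return score, information

def fit_slope(max_iter=50, tol=1e-10):
    """Newton-Raphson iterations starting from a slope of zero."""
    beta = 0.0
    for _ in range(max_iter):
        score, information = score_and_information(beta)
        step = score / information
        beta += step
        if abs(step) < tol:
            break
    return beta

beta_hat = fit_slope()
print(f"estimated slope (log hazard ratio): {beta_hat:.3f}")
print(f"estimated hazard ratio: {math.exp(beta_hat):.3f}")
```

Notice that the time-specific intercept never appears in the calculation; only the ordering of the event times and who is still at risk matter, which fits with the slope being a comparison that holds at any time in the follow-up period.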

In the next section we'll look at doing the same thing

but where x is a continuous variable.
