This method is very similar to the post hoc pair-wise comparisons
that you may have conducted as a follow-up to running an analysis of variance
in the second course of this specialization, Data Analysis Tools.
That is, reference group coding allows us to compare
each group of our explanatory variable, referred to as the comparison groups,
to another group, which is referred to as the reference group.
For example, if our response variable is the number of nicotine dependence
symptoms, reference coding allows us to compare the number of nicotine dependence
symptoms for each group of our categorical variable to that of a designated reference group.
However, unlike an analysis of variance post hoc test, for
which we conduct the comparisons after running the ANOVA,
the comparisons here are part of the estimation of the multiple regression model.
This allows us to examine explanatory variable group differences on the response
variable after adjusting for the other explanatory variables in the model.
To demonstrate how to analyze a categorical explanatory variable with
three or more categories, we will return to our NESARC data multiple
regression analysis, predicting number of nicotine dependence symptoms from
multiple explanatory variables.
We could also add an ethnicity-race explanatory variable.
Our ethnicity-race variable has four categories coded 0 = Hispanic,
1 = non-Hispanic White, 2 = non-Hispanic Black, and
3 = non-Hispanic Other ethnic or racial group.
In this example, what we want to know is whether Hispanic individuals have more or
fewer nicotine dependence symptoms compared to individuals from the other
three racial-ethnic groups.
That is, we want to compare Hispanic individuals, the reference group,
to individuals from the other racial-ethnic groups, the comparison groups,
on the number of nicotine dependence symptoms after controlling for
the other explanatory variables in the model.
To do this, we will use the same smf.ols function
that we used to test our earlier multiple regression model.
So we have our regression equation, in which our NDsymptoms response variable
is being predicted by the explanatory variables DYSLIFE, MAJORDEPLIFE,
numbercigsmoked_c, age_c, and SEX.
We add our ethnicity race variable,
ETHRACE, to the list of explanatory variables.
But to tell Python that it is a categorical variable,
we need to type a capital C and then put the name of the categorical variable in
parentheses after the capital C.
In this example,
we want to compare the Hispanic group to the three other ethnicity race groups.
So this will be our reference group.
If you remember, our ethnicity race variable is coded 0 for Hispanic.
The default in
Python is reference group coding, which in Python is called treatment coding.
And the default reference category is the first level of the variable, which here
is the group with a value equal to 0, that is, Hispanic in this case.
Since this is what we're looking for
in this example, we do not need to add any code to change the default.
If we hadn't added a capital C with the ETHRACE variable in parentheses, Python
would have assumed that our ethnicity race variable was a quantitative variable, so
the regression coefficient would make no sense.
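As a rough sketch of the call just described, the model could be set up as below. The NESARC data are not included here, so the data frame is a made-up stand-in (an assumption for illustration only); the variable names follow the ones mentioned above, and the simulated values will not reproduce the results discussed in this lesson.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the NESARC data: every value below is made up
# for illustration and does not reproduce the lecture's results.
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "DYSLIFE": rng.integers(0, 2, n),           # 0/1 indicator
    "MAJORDEPLIFE": rng.integers(0, 2, n),      # 0/1 indicator
    "numbercigsmoked_c": rng.normal(0, 10, n),  # centered quantitative
    "age_c": rng.normal(0, 3, n),               # centered quantitative
    "SEX": rng.integers(0, 2, n),               # 0/1 indicator
    "ETHRACE": rng.integers(0, 4, n),           # codes 0-3, as in the lecture
})
data["NDsymptoms"] = (2.0 + 1.3 * data["MAJORDEPLIFE"]
                      + 0.03 * data["numbercigsmoked_c"]
                      + rng.normal(0, 1.5, n))

# C(...) marks ETHRACE as categorical; with default treatment coding the
# first level (code 0) serves as the reference group.
formula = ("NDsymptoms ~ DYSLIFE + MAJORDEPLIFE + numbercigsmoked_c"
           " + age_c + SEX + C(ETHRACE)")
model = smf.ols(formula, data=data).fit()
print(model.summary())
```

Dropping the C( ) wrapper from ETHRACE in this sketch would produce a single slope for the numeric codes rather than three group comparisons.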
Here's the output.
Basically it is the same output that we see with the smf.ols function.
But, if we look at our table of parameter estimates, we see that there are three
regression coefficients for our categorical ethnicity race variable.
Note that there is no estimate for the Hispanic reference group.
The T. and the number after it tell us that the treatment,
that is, reference group, parameterization was used, and
the number is the categorical variable code for the group.
For example, the non-Hispanic White group in our ETHRACE variable was coded 1.
So the T.1 indicates that it is the regression coefficient for the comparison
of the non-Hispanic White ethnicity-race group to our Hispanic reference group.
The three regression coefficients compare each of our ethnicity race groups
to the Hispanic group.
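To see why the reference group has no estimate of its own, it can help to look at the indicator columns that treatment coding creates. The sketch below assumes the patsy library, which statsmodels formulas have traditionally relied on, is installed; the tiny four-row data frame is made up and simply lists one row per ETHRACE code.

```python
import pandas as pd
from patsy import dmatrix

# One row per ETHRACE code, using the coding given in the lecture:
# 0 = Hispanic, 1 = non-Hispanic White, 2 = non-Hispanic Black, 3 = Other.
codes = pd.DataFrame({"ETHRACE": [0, 1, 2, 3]})

# Default treatment coding: the first level (0) gets no indicator column,
# so it is represented by zeros in all three T.1, T.2, T.3 columns.
design = dmatrix("C(ETHRACE)", codes, return_type="dataframe")
print(design)
```

Each comparison group gets its own 0/1 column, while the reference group is the row where all three columns are 0, so each coefficient is read as a difference from that group.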
We can see that none of these three groups were significantly different from
the Hispanic group in number of nicotine dependence symptoms
because the p values all exceed our alpha level of .05.
As with the previous regression analysis, we see that lifetime major depression and
number of cigarettes smoked are positively associated with number of nicotine
dependence symptoms.
If we wanted to make other comparisons, for example, to compare non-Hispanic White
to non-Hispanic Black, then we would need to override the default reference group so
that the value of 1 in the ETHRACE variable,
which indicates the non-Hispanic White group, is used as the reference group.
The code here shows how to do it.
It's mostly the same code, but now because we are not using the default,
we need to add some code to tell Python to continue to use the treatment or
reference group coding and designate the reference group.
We do this by adding a comma after the name of our ETHRACE variable in
parentheses.
Then Treatment with a capital T.
And then within another set of parentheses, reference=1,
so that the full term reads C(ETHRACE, Treatment(reference=1)).
This additional Python code provides a comparison of
the three other ethnicity race groups to the non-Hispanic White group.
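A minimal sketch of overriding the default reference group might look like the following. Again, the data frame is a made-up stand-in for NESARC (an assumption for illustration), so the estimates and p values it prints are not the ones discussed in this lesson.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the NESARC data; all values are made up.
rng = np.random.default_rng(1)
n = 400
data = pd.DataFrame({
    "DYSLIFE": rng.integers(0, 2, n),
    "MAJORDEPLIFE": rng.integers(0, 2, n),
    "numbercigsmoked_c": rng.normal(0, 10, n),
    "age_c": rng.normal(0, 3, n),
    "SEX": rng.integers(0, 2, n),
    "ETHRACE": rng.integers(0, 4, n),  # codes 0-3, as in the lecture
})
data["NDsymptoms"] = (2.0 + 1.3 * data["MAJORDEPLIFE"]
                      + 0.03 * data["numbercigsmoked_c"]
                      + rng.normal(0, 1.5, n))

base = ("NDsymptoms ~ DYSLIFE + MAJORDEPLIFE + numbercigsmoked_c"
        " + age_c + SEX + ")

# Default treatment coding: code 0 (Hispanic) is the reference group.
default_fit = smf.ols(base + "C(ETHRACE)", data=data).fit()

# Explicit treatment coding with code 1 (non-Hispanic White) as reference.
white_ref_fit = smf.ols(base + "C(ETHRACE, Treatment(reference=1))",
                        data=data).fit()

# Show only the ETHRACE comparisons and their p values.
print(white_ref_fit.params.filter(like="ETHRACE"))
print(white_ref_fit.pvalues.filter(like="ETHRACE"))
```

Changing the reference group changes which comparisons are reported, not the fit of the model. In this sketch, the T.0 coefficient with reference=1 is the T.1 coefficient from the default coding with its sign flipped, since both describe the same Hispanic versus non-Hispanic White difference.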
Here's the output.
Now the group coded 1 no longer has a parameter estimate, and
the other coefficients, for T.0, T.2, and T.3, compare each of the other three
racial-ethnic groups to the non-Hispanic White group.
Participants in the non-Hispanic Other ethnic-racial group
had a significantly greater number of nicotine dependence symptoms compared to
non-Hispanic White participants.
There are no significant differences for Hispanic and
non-Hispanic Black participants compared to non-Hispanic White participants.