Hi @ColinHawco ,
These are good questions to work through. I don't know about ComBat specifically, but here's an example to help think through the questions about demeaning and dummy coding.
tl;dr: both demeaning and not demeaning work, but the two setups produce parameters with different interpretations.
import patsy
import numpy as np
rng = np.random.default_rng(1)
X = patsy.dmatrix("a", patsy.demo_data("a", nlevels=6, min_rows=100))
$ X
DesignMatrix with shape (102, 6)
Intercept a[T.a2] a[T.a3] a[T.a4] a[T.a5] a[T.a6]
1 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 1 0
1 0 0 0 0 1
1 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 1 0
1 0 0 0 0 1
1 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 1 0
1 0 0 0 0 1
1 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 1 0
1 0 0 0 0 1
1 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 1 0
1 0 0 0 0 1
[72 rows omitted]
Terms:
'Intercept' (column 0)
'a' (columns 1:6)
(to view full data, use np.asarray(this_obj))
# copy X and demean
X_demeaned = (X - X.mean(axis=0, keepdims=True))
# add back intercept column
X_demeaned[:,0] = 1
y = rng.normal(size=X.shape[0])
b_hat = np.linalg.pinv(X) @ y
b2_hat = np.linalg.pinv(X_demeaned) @ y
$ b_hat
array([-0.02506107, -0.12104892, -0.13902483, 0.06096995, 0.12607575,
-0.19719659])
$ b2_hat
array([-0.07009851, -0.12104892, -0.13902483, 0.06096995, 0.12607575,
-0.19719659])
Notice that the first parameter, the "intercept", matches the grand mean of the data when using the demeaned matrix.
$ y.mean()
-0.07009851317080741
Without demeaning, the intercept parameter corresponds to the mean of the first (baseline) group:
$ y[~X[:, 1:].any(axis=1)].mean()
-0.025061072220387864
But all of the other parameters are the same:
$ np.allclose(b_hat[1:], b2_hat[1:])
True
If you want to convince yourself of some other features of demeaning (e.g., identical goodness of fit and residuals), take a look at the results generated by statsmodels:
import statsmodels.api as sm
sm.OLS(y, X).fit().summary()
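If statsmodels isn't handy, here's a minimal numpy-only sketch of the same check; the by-hand dummy matrix (mimicking the cyclic group labels in the patsy output above) and the seed are assumptions, not the exact objects from the snippets above. Because the raw and demeaned designs span the same column space, the fitted values, and hence the residuals and goodness of fit, come out identical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 102, 6

# Hand-built dummy coding (mimics the patsy design above: group 0 is baseline)
labels = np.tile(np.arange(k), n // k + 1)[:n]
X = np.column_stack([np.ones(n)] + [(labels == j).astype(float) for j in range(1, k)])
y = rng.normal(size=n)

# Demean the dummy columns, then restore the intercept column
X_dm = X - X.mean(axis=0, keepdims=True)
X_dm[:, 0] = 1.0

# Fitted values from each design via the pseudoinverse
yhat_raw = X @ (np.linalg.pinv(X) @ y)
yhat_dm = X_dm @ (np.linalg.pinv(X_dm) @ y)

print(np.allclose(yhat_raw, yhat_dm))  # True: same fit, so same residuals
```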
Since it sounds like the binary case is intuitive, it may help to view each dummy column as an isolated binary variable indicating whether a scan was collected at scanner "x" versus any of the others (e.g., second scanner vs. not second scanner is one binary variable, third scanner vs. not third is another, etc.).
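For instance (the scanner labels here are hypothetical), a one-vs-rest dummy column is literally just that binary variable:

```python
import numpy as np

scanner = np.array(["s1", "s2", "s3", "s1", "s2", "s3"])  # hypothetical labels
is_s2 = (scanner == "s2").astype(int)  # binary: second scanner vs. not
print(is_s2)  # [0 1 0 0 1 0]
```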
Regarding the number of columns, whether this matters again depends on the situation. In general, there are ways to get parameter estimates even when the model is overparameterized (more columns than can be uniquely identified):
X_all = patsy.dmatrix("0 + a", patsy.demo_data("a", nlevels=6, min_rows=100))
X_overloaded = np.hstack((np.ones(shape=(X_all.shape[0], 1)), X_all))
np.linalg.pinv(X_overloaded) @ y
# array([-0.06008444,  0.03502337, -0.08602556, -0.10400146,  0.09599332,
#         0.16109912, -0.16217323])
Though, issues can arise because the parameters are no longer identified; e.g., you could get the same predictions from the 7-element vector above as from another vector with 1 added to the first element and 1 subtracted from each of the others. If it helps, think about why you'd use just one column to represent a binary variable.
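To make that concrete, here's a sketch with a by-hand version of the overparameterized matrix (the cyclic labels and seed are assumptions mirroring the demo data above): shifting the intercept up by 1 and every dummy coefficient down by 1 changes the parameters but not the predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 102, 6
labels = np.tile(np.arange(k), n // k + 1)[:n]

# Intercept PLUS a dummy for every level: one more column than identifiable
dummies = np.column_stack([(labels == j).astype(float) for j in range(k)])
X_over = np.column_stack([np.ones(n), dummies])
y = rng.normal(size=n)

b = np.linalg.pinv(X_over) @ y  # the minimum-norm solution
b_shift = b.copy()
b_shift[0] += 1.0               # add 1 to the intercept...
b_shift[1:] -= 1.0              # ...subtract 1 from every dummy coefficient

# Different parameter vectors, identical predictions
print(np.allclose(X_over @ b, X_over @ b_shift))  # True
```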
Here, the pseudoinverse's minimum-norm solution makes the "intercept" equal the sum of the other coefficients:
b3=np.linalg.pinv(X_overloaded) @ y
(b3[0] - b3[1:].sum())
# 7.632783294297951e-17
And that first element plus each of the other coefficients gives the group averages:
b3[0] + b3[1:]
# array([-0.02506107, -0.14611   , -0.1640859 ,  0.03590888,  0.10101468,
#        -0.22225767])
All of this gets even trickier with multiple factors/categories. But similar ideas apply. In particular, if you’re comparing across groups (especially if their membership is encoded in the design matrix), make sure to think through how the presence of other factors and their default/baseline values will influence your results.
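As one small illustration of the multi-factor point (the two binary factors and their balanced layout here are made up): with treatment coding and a saturated model (main effects plus interaction), the intercept ends up being the mean of the cell where both factors sit at their baseline levels, so "the intercept" silently depends on which levels you picked as baselines.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 80
f1 = np.repeat([0, 1], 40)  # factor 1: baseline level 0, then level 1
f2 = np.tile([0, 1], 40)    # factor 2: alternating baseline/level 1
y = rng.normal(size=n)

# Treatment-coded design: intercept, f1 dummy, f2 dummy, interaction
X = np.column_stack([np.ones(n), f1, f2, f1 * f2]).astype(float)
b = np.linalg.pinv(X) @ y

# The intercept is the mean of the (baseline, baseline) cell
baseline_cell = y[(f1 == 0) & (f2 == 0)].mean()
print(np.allclose(b[0], baseline_cell))  # True
```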
(And, again, I'm not sure about ComBat, so I'm no help with those considerations.)