Hi @ColinHawco ,
These are good questions to work through. I don't know about ComBat specifically, but here's an example to help think through the questions about demeaning and dummy coding.
tl;dr: both demeaning and not demeaning work, but the two setups produce parameters with different interpretations.
import patsy
import numpy as np
rng = np.random.default_rng(1)
X = patsy.dmatrix("a", patsy.demo_data("a", nlevels=6, min_rows=100))
$ X
DesignMatrix with shape (102, 6)
Intercept a[T.a2] a[T.a3] a[T.a4] a[T.a5] a[T.a6]
1 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 1 0
1 0 0 0 0 1
1 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 1 0
1 0 0 0 0 1
1 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 1 0
1 0 0 0 0 1
1 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 1 0
1 0 0 0 0 1
1 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 1 0
1 0 0 0 0 1
[72 rows omitted]
Terms:
'Intercept' (column 0)
'a' (columns 1:6)
(to view full data, use np.asarray(this_obj))
# copy X and demean
X_demeaned = (X - X.mean(axis=0, keepdims=True))
# add back intercept column
X_demeaned[:,0] = 1
y = rng.normal(size=X.shape[0])
b_hat = np.linalg.pinv(X) @ y
b2_hat = np.linalg.pinv(X_demeaned) @ y
$ b_hat
array([-0.02506107, -0.12104892, -0.13902483, 0.06096995, 0.12607575,
-0.19719659])
$ b2_hat
array([-0.07009851, -0.12104892, -0.13902483, 0.06096995, 0.12607575,
-0.19719659])
Notice that the first parameter, the "intercept", matches the grand mean of the data when using the demeaned matrix.
$ y.mean()
-0.07009851317080741
Without demeaning, the intercept parameter corresponds to the mean of the first (baseline) group:
$ y[~X[:, 1:].any(axis=1)].mean()
-0.025061072220387864
But all of the other parameters are the same:
$ np.allclose(b_hat[1:], b2_hat[1:])
True
If you want to convince yourself of some other features of demeaning (e.g., identical goodness of fit and residuals), take a look at the results generated by statsmodels:
import statsmodels.api as sm
sm.OLS(y, X).fit().summary()
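If statsmodels isn't handy, here's a minimal numpy-only sketch of the same check; the by-hand dummy matrix (mimicking the cyclic group labels in the patsy output above) and the seed are assumptions, not the exact objects from the snippets above. Because the raw and demeaned designs span the same column space, the fitted values, and hence the residuals and goodness of fit, come out identical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 102, 6

# Hand-built dummy coding (mimics the patsy design above: group 0 is baseline)
labels = np.tile(np.arange(k), n // k + 1)[:n]
X = np.column_stack([np.ones(n)] + [(labels == j).astype(float) for j in range(1, k)])
y = rng.normal(size=n)

# Demean the dummy columns, then restore the intercept column
X_dm = X - X.mean(axis=0, keepdims=True)
X_dm[:, 0] = 1.0

# Fitted values from each design via the pseudoinverse
yhat_raw = X @ (np.linalg.pinv(X) @ y)
yhat_dm = X_dm @ (np.linalg.pinv(X_dm) @ y)

print(np.allclose(yhat_raw, yhat_dm))  # True: same fit, so same residuals
```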
Since it sounds like the binary case is intuitive, it may help to view each dummy column as an isolated binary variable indicating whether a scan was collected at scanner "x" versus any of the others (e.g., second scanner vs. not second scanner is one binary variable, third scanner vs. not third is another, etc.).
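For instance (the scanner labels here are hypothetical), a one-vs-rest dummy column is literally just that binary variable:

```python
import numpy as np

scanner = np.array(["s1", "s2", "s3", "s1", "s2", "s3"])  # hypothetical labels
is_s2 = (scanner == "s2").astype(int)  # binary: second scanner vs. not
print(is_s2)  # [0 1 0 0 1 0]
```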
Regarding the number of columns, whether this matters again depends on the situation. In general, there are ways to get parameter estimates even when the model is overparameterized (more columns than can be uniquely identified):
X_all = patsy.dmatrix("0 + a", patsy.demo_data("a", nlevels=6, min_rows=100))
X_overloaded = np.hstack((np.ones(shape=(X_all.shape[0], 1)), X_all))
np.linalg.pinv(X_overloaded) @ y
# array([-0.06008444,  0.03502337, -0.08602556, -0.10400146,  0.09599332,
#         0.16109912, -0.16217323])
Though, issues can arise because the parameters are no longer identified; e.g., you could get the same predictions from the 7-element vector above as from another vector with 1 added to the first element and 1 subtracted from each of the others. If it helps, think about why you'd use just one column to represent a binary variable.
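To make that concrete, here's a sketch with a by-hand version of the overparameterized matrix (the cyclic labels and seed are assumptions mirroring the demo data above): shifting the intercept up by 1 and every dummy coefficient down by 1 changes the parameters but not the predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 102, 6
labels = np.tile(np.arange(k), n // k + 1)[:n]

# Intercept PLUS a dummy for every level: one more column than identifiable
dummies = np.column_stack([(labels == j).astype(float) for j in range(k)])
X_over = np.column_stack([np.ones(n), dummies])
y = rng.normal(size=n)

b = np.linalg.pinv(X_over) @ y  # the minimum-norm solution
b_shift = b.copy()
b_shift[0] += 1.0               # add 1 to the intercept...
b_shift[1:] -= 1.0              # ...subtract 1 from every dummy coefficient

# Different parameter vectors, identical predictions
print(np.allclose(X_over @ b, X_over @ b_shift))  # True
```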
Here, the pseudoinverse's minimum-norm solution makes the "intercept" equal the sum of the other coefficients:
b3=np.linalg.pinv(X_overloaded) @ y
(b3[0] - b3[1:].sum())
# 7.632783294297951e-17
And that first element plus each of the other coefficients gives the group averages:
b3[0] + b3[1:]
# array([-0.02506107, -0.14611   , -0.1640859 ,  0.03590888,  0.10101468,
#        -0.22225767])
All of this gets even trickier with multiple factors/categories. But similar ideas apply. In particular, if you’re comparing across groups (especially if their membership is encoded in the design matrix), make sure to think through how the presence of other factors and their default/baseline values will influence your results.
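As one small illustration of the multi-factor point (the two binary factors and their balanced layout here are made up): with treatment coding and a saturated model (main effects plus interaction), the intercept ends up being the mean of the cell where both factors sit at their baseline levels, so "the intercept" silently depends on which levels you picked as baselines.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 80
f1 = np.repeat([0, 1], 40)  # factor 1: baseline level 0, then level 1
f2 = np.tile([0, 1], 40)    # factor 2: alternating baseline/level 1
y = rng.normal(size=n)

# Treatment-coded design: intercept, f1 dummy, f2 dummy, interaction
X = np.column_stack([np.ones(n), f1, f2, f1 * f2]).astype(float)
b = np.linalg.pinv(X) @ y

# The intercept is the mean of the (baseline, baseline) cell
baseline_cell = y[(f1 == 0) & (f2 == 0)].mean()
print(np.allclose(b[0], baseline_cell))  # True
```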
(And, again, I'm not sure about ComBat, so I'm no help with those considerations.)