
Biological sex classification with structural MRI data shows increased misclassification in transgender women
C. Flindt, K. Förster, et al.
This study by Claes Flindt and colleagues examines whether brain structure in transgender women differs from that of cisgender men and women. A multivariate classifier trained on structural MRI data is applied to transgender women before and after hormone therapy.
Introduction
The study investigates whether structural brain patterns in transgender women (TW; biological sex male, perceived gender female) differ from those of cisgender (CG) men and women, and whether such differences lead a multivariate classifier to misclassify biological sex more often. Prior work shows sex-related differences in gray matter volume (e.g., higher gray matter volume in CG-men; larger limbic structures in CG-women), although sexual differentiation is less pronounced in the brain than in physical appearance. ROI studies often implicate the putamen and insula in TW, though findings are heterogeneous and come mostly from samples studied before hormone treatment. Cross-sex hormone treatment (CHT) may induce region-specific structural changes, yet longitudinal evidence is limited. Multivariate methods can capture distributed patterns and identify atypical cases; prior classifiers trained on small samples suggested decreased separability of biological sex in transgender individuals but raised concerns about external validity. This study has two aims: to train and validate a robust sex classifier on large CG samples (with and without depression) and apply it to TW to test for increased misclassification, and to examine univariate volumetric differences (insula, putamen) in TW pre- and post-CHT relative to CG groups.
Literature Review
- Sex differences in brain structure: CG-men typically show higher gray matter volumes; CG-women show larger limbic structures. Sexual differentiation in the brain is less clear-cut than in physical traits, complicating binary categorization.
- Transgender neuroanatomy (ROI-based): Alterations in the putamen and insula have been reported repeatedly in TW relative to CG groups, but the direction of effects varies across studies, and most samples were studied before CHT. CHT (estradiol-based) is associated with regional changes in volume and cortical thickness; large longitudinal studies are scarce, and some report null findings.
- Multivariate approaches: Prior studies using multivariate classification reported decreased biological sex classification accuracy in transgender versus cisgender individuals. However, small training samples may inflate apparent accuracy and reduce external validity, motivating classifiers trained on large, diverse datasets and validated across independent cohorts, including clinical samples (e.g., MDD).
Methodology
Data and participants:
- Cisgender training and first validation: N = 1753 CG healthy participants from three cohorts (Münster Neuroimaging Cohort, BiDirect, FOR2107). Psychiatric history ruled out via SCID (DSM-IV). A random 20% holdout (N = 351; female = 219, male = 132) served as the first validation set; the remaining N = 1402 were used for training/testing.
- Balancing and cross-validation: The training subset was sex-balanced by random undersampling to N = 1218 (female = 609, male = 609). Tenfold cross-validation produced balanced training sets of 1096 per fold.
- Clinical validation (second validation): An independent MDD sample (see Results/Table 2; N = 1404 with 853 CG-women, 551 CG-men) assessed potential effects of depression/comorbidity on classifier generalization.
- Third validation (scanner generalization): CG controls collected with the TW application sample (same scanner/protocol) to test scanner invariance.
- Transgender application sample: TW recruited from an outpatient clinic; screened via SCID-I/II for comorbidities. TW included pre- and post-CHT states. Matched CG controls were scanned under identical conditions. (Tables 3 and 4 summarize the analyzed application subset: N = 60 total; CG-women = 15, CG-men = 19, TW = 26; TW subgroups: treatment-naive N = 8, post-CHT N = 18.)
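The holdout and balancing steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the class counts (1012 women, 741 men) are inferred from the reported figures on the assumption that men were the smaller class and were retained in full. Because the split here is random and unstratified, the balanced size will generally differ slightly from the reported 1218.

```python
import random

def split_and_balance(labels, holdout_frac=0.2, seed=0):
    """Hold out a random fraction, then undersample the larger class in the rest."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_holdout = round(holdout_frac * len(idx))
    holdout, rest = idx[:n_holdout], idx[n_holdout:]
    # Group the remaining indices by class label.
    by_class = {}
    for i in rest:
        by_class.setdefault(labels[i], []).append(i)
    # Randomly keep as many subjects per class as the smallest class has.
    n_min = min(len(members) for members in by_class.values())
    balanced = [i for members in by_class.values() for i in rng.sample(members, n_min)]
    return holdout, sorted(balanced)

def kfold_train_sizes(n_samples, k=10):
    """Training-set size of each fold when n_samples are split into k folds."""
    base, extra = divmod(n_samples, k)
    return [n_samples - (base + (1 if fold < extra else 0)) for fold in range(k)]

# Assumed class counts (inferred, see the note above): 1753 subjects in total.
labels = ["F"] * 1012 + ["M"] * 741
holdout, balanced = split_and_balance(labels)
print(len(holdout), len(balanced))
print(kfold_train_sizes(1218))
```

With 1218 balanced subjects and ten folds, each training set holds 1096 or 1097 subjects, which agrees with the reported figure of 1096 per fold.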
Imaging and preprocessing:
- Structural MRI acquisition and VBM preprocessing followed published protocols for each cohort; gray matter (GM) segments used as inputs. GM images resliced to 3×3×3 mm³ voxels to reduce dimensionality while preserving local morphometry.
Multivariate analysis (sex classification):
- Support Vector Machine (Scikit-learn) with principal component analysis for dimensionality reduction. Maximum PCs limited to number of training subjects per fold (1096).
- Hyperparameter optimization via Bayesian search (Scikit-Optimize): kernels (rbf vs linear), C (10^-1 to 10^3), gamma (10^-10 to 10^7). 100 parameter sets evaluated within nested tenfold CV. Performance quantified via ROC-AUC and balanced accuracy on the holdout.
- Scanner variability: Multiple scanners intentionally included to learn scanner-invariant features; no explicit scanner correction. Performance compared across scanners showed no significant differences.
- Metrics reported include accuracy, balanced accuracy (bACC), precision, recall/TPR, F1, and AUC. For comparisons in which only one true class is present (e.g., TW vs CG-men, all biologically male), the true positive rate for the male class (TPR male) was compared between groups using Fisher’s exact tests.
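A PCA-plus-SVM pipeline with a search over the kernel, C, and gamma ranges listed above might look like the following sketch. Several things here are assumptions rather than the authors' setup: the data are synthetic, `RandomizedSearchCV` from scikit-learn stands in for the Bayesian search from Scikit-Optimize, only 10 parameter sets and 5 folds are used to keep the example fast, and the outer loop of the nested cross-validation is omitted.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for vectorized gray matter images (assumption, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))
y = np.repeat([0, 1], 60)
X[y == 1, :5] += 1.0  # give the two classes a small mean difference

# PCA() with default settings keeps min(n_samples, n_features) components,
# so the component count is capped by the number of training subjects.
pipe = Pipeline([("pca", PCA()), ("svm", SVC())])

# Search ranges taken from the text: C in [1e-1, 1e3], gamma in [1e-10, 1e7].
param_distributions = {
    "svm__kernel": ["rbf", "linear"],
    "svm__C": loguniform(1e-1, 1e3),
    "svm__gamma": loguniform(1e-10, 1e7),
}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=10,  # the study evaluated 100 parameter sets
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Keeping PCA inside the pipeline means the components are re-estimated on each training fold, so no information from the held-out fold leaks into the dimensionality reduction.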
Univariate analysis:
- ROI-based VBM focusing on insula and putamen; group comparisons included TW pre-CHT, TW post-CHT, CG-men, CG-women.
- Analyses controlled for total intracranial volume, age, and sexual orientation; TFCE with FWE correction applied. Results thresholded with minimum cluster size (k ≥ 22). Whole-brain exploratory analyses also conducted (see supplementary).
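For orientation, the covariate-adjusted voxelwise model and the TFCE statistic can be written roughly as below. The notation is introduced here for illustration and does not come from the study: $v$ indexes voxels, $i$ indexes subjects, $h_v$ is the test statistic at voxel $v$, and $e(h)$ is the extent of the cluster containing $v$ at threshold $h$. The exponents $E$ and $H$ are often set to 0.5 and 2; this summary does not state which values were used.

```latex
\begin{align}
  \mathrm{GM}_i(v) &= \beta_0(v) + \beta_1(v)\,\mathrm{group}_i + \beta_2(v)\,\mathrm{TIV}_i
                     + \beta_3(v)\,\mathrm{age}_i + \beta_4(v)\,\mathrm{orientation}_i + \varepsilon_i(v) \\
  \mathrm{TFCE}(v) &= \int_{h_0}^{h_v} e(h)^{E}\, h^{H}\, \mathrm{d}h
\end{align}
```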
Key Findings
Multivariate classifier performance:
- Hyperparameters selected: rbf kernel, C = 273, gamma = 2.4×10^-5.
- First validation (CG holdout, N = 351): bACC = 94.01%; accuracy = 94.87%; AUC = 0.99. TPR female = 99.9%; TPR male = 88.5% (Table 1).
- Second validation (MDD, N = 1404: 853 women, 551 men): bACC = 92.06%; accuracy = 93.16%; AUC = 0.99. TPR female = 97.2%; TPR male = 86.9% (Table 2). No significant difference vs first validation (Fisher’s tests).
- Third validation (CG controls from TW study): bACC = 94.03%. TPR CG-women = 100% (19/19); TPR CG-men = 93.3% (14/15) (Tables 3–4).
- Application to TW (N = 26): TPR male = 61.54% (16/26), indicating significantly increased misclassification of biological sex in TW compared to CG-men (Fisher’s test significant; Table 4). Subgroups: treatment-naive TW TPR male = 87.5% (7/8); post-CHT TW TPR male = 50.0% (9/18), both differing significantly from CG-men (Table 4). Predicted male probabilities showed broader uncertainty in TW than in CG (Figure 1).
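The rate and test arithmetic behind these comparisons can be reproduced with the standard library alone. The sketch below is illustrative: the function names are hypothetical, and it assumes the comparison group is the 14 of 15 correctly classified CG-men from the third validation, which this summary does not state explicitly.

```python
from math import comb

def tpr(correct, total):
    """True positive rate: share of a group assigned its true class."""
    return correct / total

def balanced_accuracy(tpr_a, tpr_b):
    """Mean of the two per-class true positive rates."""
    return (tpr_a + tpr_b) / 2

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    lo, hi = max(0, col1 - row2), min(row1, col1)

    def weight(k):
        # Unnormalized hypergeometric probability of k in the top-left cell.
        return comb(col1, k) * comb(n - col1, row1 - k)

    observed = weight(a)
    # Sum all tables that are no more probable than the observed one.
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / comb(n, row1)

# TW: 16 of 26 classified male; CG-men (assumed comparison group): 14 of 15.
print(round(100 * tpr(16, 26), 2))
print(fisher_exact_two_sided(16, 10, 14, 1))
# MDD validation: per-class rates of 97.2% and 86.9%.
print(balanced_accuracy(0.972, 0.869))
```

Under this assumption the two-sided p-value falls below 0.05, in line with the reported significant difference for the full TW sample.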
Univariate ROI findings (insula, putamen):
- TW-pre vs CG-women: larger volumes in insula and putamen for TW-pre (e.g., right putamen TFCE 257.58, p<0.001; right insula TFCE 100.64, p=0.001).
- TW-pre vs CG-men: TW-pre showed larger putamen volume than CG-men in multiple clusters.
- TW-post vs CG-women: TW-post had higher insula volume but no differences in putamen under rigorous correction; some reported clusters indicate lower volumes in TW-post relative to CG-women in select regions.
- TW-post vs CG-men: TW-post showed lower volumes in insula and putamen compared to CG-men.
- TW-pre vs TW-post: TW-post exhibited lower volumes than TW-pre in both insula and putamen (multiple significant clusters; e.g., right putamen TFCE 464.56, p<0.001).
- CG-men vs CG-women: CG-men showed larger volumes in insula and putamen than CG-women (e.g., right insula TFCE 109.93, p<0.001).
- Whole-brain exploratory analyses corroborated regional differences (supplementary). Results remained generally consistent after excluding TW with psychiatric comorbidities, though sample sizes were limited.
Discussion
A robust SVM classifier trained on large multi-cohort CG data accurately distinguishes biological sex and generalizes to independent CG samples, including individuals with MDD, supporting scanner- and cohort-invariant performance. When applied to TW, the classifier’s true positive rate for biological male sex is substantially reduced, with many TW classified as female, especially among those post-CHT. This supports the hypothesis that TW brain structure does not align neatly with either biological sex or perceived gender but shows a distinct pattern.
The multivariate results align with prior studies showing decreased sex-classification separability in transgender individuals, while addressing concerns about small-sample overfitting by using a large training sample and multiple validation samples. The univariate ROI analyses demonstrate treatment-state-dependent differences: TW-pre show larger insula and putamen volumes compared to CG-women (and in putamen vs CG-men), whereas TW-post show reduced volumes relative to TW-pre and differences from both CG groups, suggesting that CHT is associated with region-specific structural changes that may contribute to altered classifier outputs.
Together, findings indicate that neuroanatomical patterns in TW, influenced by hormonal status, challenge binary sex categorization by structural MRI, supporting a dimensional view of brain sex characteristics. The increased misclassification in TW cannot be attributed to depression or scanner effects, strengthening the interpretation that it reflects genuine neurobiological differences related to transgender identity and/or hormonal treatment.
Conclusion
The study presents a high-performing, externally validated structural MRI-based biological sex classifier for CG individuals. Applied to TW, the classifier shows markedly increased misclassification of biological male sex, particularly after cross-sex hormone treatment, indicating that TW brain structure differs from both CG-men and CG-women. Univariate analyses of insula and putamen further support treatment-dependent structural differences in TW.
These results contribute to evidence for a distinct neuroanatomical pattern in TW and support a dimensional rather than binary characterization of brain sex-related features. Future research should include larger and longitudinal TW samples (including transgender men), carefully disentangle effects of gender dysphoria, depression, and medication, and investigate hormonal influences on classifier sensitivity. The publicly available classifier may also be applied to other conditions with sex-skewed prevalence to probe potential sex-atypical neurodevelopmental patterns.
Limitations
- Validation strategy: Although a rigorous train/holdout with nested CV and multiple external validations was used, alternative strategies (e.g., repeated nested k-fold) might yield different generalization estimates or learning order effects.
- TW sample size: The transgender sample—particularly the pre-CHT subgroup—was small, limiting power and precision of subgroup comparisons. Replication with larger samples and longitudinal designs is needed.
- Generalizability to transgender men: Only TW were studied; inclusion of transgender men is necessary to assess generality across transgender populations.
- Confounds: While depression was addressed using an MDD validation sample, further work should isolate effects of gender dysphoria, comorbidities, and medications.
- Classifier sensitivity asymmetry: The classifier showed higher sensitivity for female classification; the physiological basis and hormonal contributions require further investigation.
- Reporting inconsistencies: Some sample size discrepancies across sections highlight the need for harmonized reporting and may reflect OCR or editorial differences; conclusions rely on tabulated/validated results.