Fine-scale population structure and demographic history of British Pakistanis

E. Arciero, S. A. Dogra, et al.

Explore the intriguing genetic landscape of British Pakistanis! This research, conducted by a diverse team of authors, uncovers the strong influence of cultural practices on population structure, revealing extensive identity-by-descent sharing that heightens the risk of recessive diseases. Discover how consanguinity patterns shape genomic diversity in Pakistanis.
Introduction

The study examines how the biraderi (clan-based, largely endogamous) social structure and consanguinity have shaped genetic diversity, fine-scale structure, and recent demographic history among Pakistani-ancestry individuals in the UK. Motivated by high burdens of cardiometabolic disease and congenital anomalies in British Pakistanis, and limited prior fine-scale genetic analyses, the authors aim to quantify population structure, divergence, effective population size changes, endogamy and consanguinity patterns, and their implications for recessive disease risk using the large Born in Bradford cohort. The work addresses gaps left by macro-scale studies and limited sample sizes, focusing on recent history and subgroup-specific patterns.

Literature Review

Prior work shows that South Asians derive ancestry from Ancestral North Indian (ANI) and Ancestral South Indian (ASI) components, with Pakistanis enriched for ANI. Large resources (HGDP, 1000 Genomes, GenomeAsia) documented elevated homozygosity and heterogeneity across Pakistani ethnic groups but lacked fine-scale subgroup resolution and rich metadata. Historical and ethnographic literature describes the biraderi system’s origins, the evolution of endogamy (strengthened during the Gupta, Mughal, and British colonial eras), and complex subgroup identities (e.g., Rajput, Jatt, Gujjar, Syed, Qureshi, Malik, Sheikh), but with limited genetic corroboration and little information on timing. Previous small-scale genetic studies (e.g., using microsatellites) offered only limited insight into fine structure. Recent studies highlight how demography and structure affect polygenic metrics and disease risk prediction, underscoring the need for detailed population-specific analyses.

Methodology

  • Data: 7,180 Pakistani-ancestry individuals and 6,818 White British from Born in Bradford (BiB); primary analyses used 2,200 unrelated Pakistani mothers genotyped on Illumina CoreExome (251,853 autosomal SNPs after QC). Additional data: 1,616 unrelated Pakistani children (CoreExome), 228 unrelated Pakistani fathers (GSA; 1,056 Y-SNPs), and exome sequencing for 2,484 Pakistani mothers.
  • Quality control: PLINK-based SNP/individual missingness filters, HWE filters, sex checks, duplicates removed; ancestry assignment via PCA; array-specific QC.
  • Relatedness estimation: KING PropIBD was used to conservatively exclude third-degree or closer relatives, validated with self-reported relationships and simulations.
  • Subgroup definition: Self-reported biraderi/tribal labels were cleaned into 56 subgroups; for demographic analyses, homogeneous subsets were defined using fineSTRUCTURE clusters and FST thresholds, pooling very similar subgroups (Awaan/Syed; Bains/Rajput-B; Jatt/Choudhry).
  • Population structure analyses: PCA (EIGENSOFT) and ADMIXTURE on LD-pruned SNPs; UMAP on top 20 PCs; fineSTRUCTURE/ChromoPainter for haplotype-based clustering; Y haplogroups (yhaplo) and mtDNA haplogroups (HaploGrep). Genetic–geographic correlation via Mantel test on assigned villages of origin.
  • Demographic inference: Divergence times estimated with NeON (LD decay), with jackknife CIs and sensitivity analyses (removing highly autozygous individuals). Gene flow and drift via Treemix (with migration edges) and f3-statistics. Recent effective population size (Ne) trajectories inferred with IBDNe from IBD segments (IBDseq primary; GERMLINE sensitivity), with multiple filters/sensitivity analyses.
  • Endogamy and consanguinity: IBD scores (extent of founder events) computed from GERMLINE IBD segments (5–30 cM) and standardized to 1000G Finns. ROH calling using bcftools/roh (primary), with PLINK and GARLIC for comparisons; FROH computed per individual.
  • Parental relatedness classifier: A neural network classifier (R nnet) was trained on simulated pedigrees (sibling, avuncular up to 3 generations, first cousins up to 3 generations, first cousin once removed, second cousins, unrelated) to infer parental relatedness from ROH patterns (>10 cM), using 15 features including lengths of top 10 ROHs and binned ROH length frequencies.
  • Coalescent-based validation: Applied theory to predict ROH and IBD footprints given historical Ne trajectories (from IBDNe) and consanguinity rates (from self-report vs neural-net-inferred), comparing predictions to observed footprints in Pathan (Cluster 8) and Jatt/Choudhry (Cluster 10).
  • Recessive disease risk simulations: From exome-sequenced mothers, ascertained pathogenic/likely pathogenic variants in autosomal recessive developmental disorder genes (LOFTEE HC PTVs and ClinVar 2-star missense/inframe). Simulated pairings to estimate fraction of at-risk couples for intra-biraderi vs inter-biraderi unions, and for first-cousin unions (sampling pairs to match observed kinship distribution). Variance and permutation-based p-values computed; assumptions include full penetrance and under-ascertainment leading to underestimation of absolute risk.
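
The ROH-based quantities described above (per-individual FROH and the classifier's 15 input features) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the bin edges in `ASSUMED_BIN_EDGES_CM`, the zero-padding of missing top-10 entries, and the autosome length passed in the usage example are all assumptions made here, since the summary does not state them.

```python
from bisect import bisect_right
from typing import List, Sequence

# Hypothetical bin edges (cM) for ROH longer than 10 cM; NOT taken from the paper.
# They define five bins: [10,15), [15,20), [20,30), [30,50), [50, inf).
ASSUMED_BIN_EDGES_CM = (10.0, 15.0, 20.0, 30.0, 50.0)


def froh(roh_lengths_bp: Sequence[float], autosome_length_bp: float) -> float:
    """Fraction of the autosomal genome covered by ROH segments."""
    if autosome_length_bp <= 0:
        raise ValueError("autosome_length_bp must be positive")
    return sum(roh_lengths_bp) / autosome_length_bp


def roh_features(
    roh_lengths_cm: Sequence[float],
    top_n: int = 10,
    bin_edges: Sequence[float] = ASSUMED_BIN_EDGES_CM,
) -> List[float]:
    """Build top_n + len(bin_edges) features from one individual's ROH lengths."""
    min_cm = bin_edges[0]
    # Keep only long ROH (> 10 cM with the assumed edges), longest first.
    long_rohs = sorted((x for x in roh_lengths_cm if x > min_cm), reverse=True)

    # Lengths of the longest ROH, zero-padded if fewer than top_n exist.
    top = long_rohs[:top_n]
    top = top + [0.0] * (top_n - len(top))

    # Relative frequency of long ROH falling in each length bin.
    counts = [0] * len(bin_edges)
    for length in long_rohs:
        counts[bisect_right(bin_edges, length) - 1] += 1
    total = len(long_rohs)
    freqs = [c / total if total else 0.0 for c in counts]

    return top + freqs


# Toy input: four ROH lengths in cM (illustrative values only).
features = roh_features([12.0, 55.0, 8.0, 22.0])
print(len(features))  # 10 top lengths + 5 bin frequencies

# Toy FROH: two ROH totalling 144 Mb over an assumed 2.88 Gb autosome.
print(froh([50e6, 94e6], autosome_length_bp=2.88e9))
```

A feature vector of this kind would then be fed to whatever classifier is trained on the simulated pedigrees; the classifier itself is not sketched here.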

Key Findings
  • Fine-scale structure: PCA and ADMIXTURE revealed distinct subgroup patterns; fineSTRUCTURE identified clusters dominated by Pathan (Cluster 8), Bains/Rajput-B (Cluster 9), and Jatt/Choudhry (Cluster 10). Rajput split into Rajput-A and Rajput-B; Rajput-B clustered with Bains and had a higher proportion (>40%) of a specific ADMIXTURE component.
  • Divergence: Homogeneous subgroups diverged within the last ~70 generations (~1500–2000 years). Within clusters, Bains vs Rajput-B and Jatt vs Choudhry had near-zero divergence and FST < 0.001.
  • Gene flow and drift: Treemix indicated strongest drift in Bains/Rajput-B, followed by Qasabi and Pathan. Migration edges suggested gene flow into Kashmiri from Pathan and from Bains/Rajput-B and Jatt/Choudhry (jackknife p < 0.001). f3-statistics supported Kashmiri admixture.
  • Effective population size (Ne): All subgroups showed relatively low Ne over the past 50 generations compared to White British. Several groups (Bains/Rajput-B, Jatt/Choudhry, Pathan) showed a marked Ne reduction ~10–20 generations ago followed by recovery; others showed progressive Ne decline over the last ~15 generations.
  • IBD and founder effects: Pakistani subgroups exhibited substantially higher IBD scores than most 1000 Genomes populations, comparable to isolated Indian groups; highest in Bains/Rajput-B, followed by Qasabi and Jatt/Choudhry.
  • Homozygosity: Mean FROH in Pakistani mothers 0.048 vs 0.0004 in White British (t-test p < 1×10⁻¹⁵). Children with UK-born parents had lower FROH than those with Pakistan-born parents; mixed-origin parents showed the highest FROH among children subsets.
  • Consanguinity reporting vs inference: 57% of Pakistani mothers reported parental relatedness; 63% reported being related to their child’s father. Neural network classification (92% accuracy across three classes) inferred that 78% of mothers were offspring of second cousins or closer, versus 57% by self-report (z test p < 1×10⁻¹⁵); for children, 58% were inferred versus 63% reported (p = 0.0008). Self-reports of first-cousin parentage agreed with the inferred class (first cousins or closer) in 81% of cases, but self-reports were less reliable for first cousins once removed and second cousins.
  • Subgroup differences: Consanguinity varied by subgroup: 67% of Qasabi had unrelated parents, whereas nearly all Bains/Rajput-B had parents who were second cousins or closer. Even among individuals whose parents were inferred to be unrelated, FROH differed by subgroup; groups with higher recent Ne (Arain, Awaan/Syed, Kashmiri) had lower FROH. In a two-way ANOVA on FROH, inferred consanguinity explained 62% of variance (p < 1×10⁻¹⁵) and subgroup 5% (p < 1×10⁻¹⁵); the interaction was not significant.
  • Coalescent validation: Observed ROH and IBD footprints in Pathan and Jatt/Choudhry matched expectations using neural-net-inferred kinship (Pathan kinship 0.035; Jatt/Choudhry 0.057) rather than self-reported estimates (0.016; 0.028), supporting under-reporting of consanguinity.
  • Y chromosome: No clear clustering of Y haplogroups by subgroup among fathers, despite patrilineal biraderi structure.
  • Disease risk simulation: Intra-biraderi unions had significantly higher risk of both partners carrying P/LP variants in the same gene than inter-biraderi unions (notably Bains and Jatts; one-sided p = 2×10⁻⁴ and p < 1×10⁻⁴, respectively). First-cousin unions had >10-fold higher risk than intra-biraderi unions.
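
The intra- versus inter-biraderi comparison in the last finding rests on counting couples in which both partners carry a P/LP variant in the same gene. A minimal sketch of that counting step is given below. It is an illustration under stated assumptions, not the study's procedure: it enumerates every possible pair in a tiny made-up dataset instead of sampling pairings, it ignores the kinship matching used for first-cousin unions, and the individual IDs, subgroup labels "A"/"B", gene names, and the function `at_risk_fractions` are all hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# individual ID -> (subgroup label, genes in which a P/LP variant is carried)
Carriers = Dict[str, Tuple[str, Set[str]]]


def at_risk_fractions(people: Carriers) -> Tuple[float, float]:
    """Return (intra-subgroup, inter-subgroup) fractions of at-risk pairs.

    A pair is 'at risk' when both members carry a variant in the same gene.
    """
    intra_pairs = intra_risk = inter_pairs = inter_risk = 0
    for (_, (group_a, genes_a)), (_, (group_b, genes_b)) in combinations(
        people.items(), 2
    ):
        at_risk = bool(genes_a & genes_b)  # shared carrier gene
        if group_a == group_b:
            intra_pairs += 1
            intra_risk += at_risk
        else:
            inter_pairs += 1
            inter_risk += at_risk
    intra = intra_risk / intra_pairs if intra_pairs else 0.0
    inter = inter_risk / inter_pairs if inter_pairs else 0.0
    return intra, inter


# Made-up toy data, for illustration only.
toy: Carriers = {
    "A1": ("A", {"GENE1"}),
    "A2": ("A", {"GENE1"}),
    "A3": ("A", set()),
    "B1": ("B", {"GENE2"}),
    "B2": ("B", {"GENE1", "GENE2"}),
    "B3": ("B", set()),
}

intra, inter = at_risk_fractions(toy)
print(f"intra-subgroup at-risk fraction: {intra:.3f}")
print(f"inter-subgroup at-risk fraction: {inter:.3f}")
```

In a real analysis the exhaustive enumeration would typically be replaced by repeated random pairings, which also yields the variance and permutation-based p-values the summary mentions.
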
Discussion

The study demonstrates that the biraderi social stratification system underlies strong and recent fine-scale genetic structure in British Pakistanis, with subgroup divergences beginning within the last ~70 generations and evidence of recent reductions in effective population size consistent with historically strengthened endogamy. Elevated IBD sharing and homozygosity across subgroups reflect both endogamy and high rates of consanguinity. Machine learning and coalescent-based analyses indicate that self-reported parental relatedness underestimates true consanguinity levels, and that multiple generations of consanguinity contribute to observed ROH patterns. The findings clarify how cultural marriage practices shape genomic diversity and have direct implications for medical genetics: intra-biraderi unions confer increased recessive disease risk compared to inter-biraderi unions, though far less than first-cousin unions. Recognizing subgroup identity (biraderi) alongside close relatedness can refine risk assessment, study design, and interpretation in genetic and clinical contexts, including GWAS control for fine-scale structure and considerations for carrier screening and founder variants within specific subgroups.
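
For orientation when reading kinship values such as those quoted under Key Findings, the textbook kinship coefficients of the relationship classes named in this summary can be computed by path counting. The sketch below assumes non-inbred common ancestors and a single generation of consanguinity, so it is a baseline only; the dictionary `RELATIONSHIPS` and the function `kinship` are names introduced here for illustration. Under these assumptions, the expected inbreeding coefficient of a child equals the kinship coefficient of its parents.

```python
from fractions import Fraction

# relationship -> (number of shared common ancestors,
#                  meioses on the path linking the two relatives via one ancestor)
RELATIONSHIPS = {
    "full siblings": (2, 2),
    "avuncular": (2, 3),
    "first cousins": (2, 4),
    "first cousins once removed": (2, 5),
    "second cousins": (2, 6),
}


def kinship(n_common_ancestors: int, path_meioses: int) -> Fraction:
    """Kinship coefficient by path counting, assuming non-inbred ancestors."""
    return n_common_ancestors * Fraction(1, 2) ** (path_meioses + 1)


for name, (ancestors, meioses) in RELATIONSHIPS.items():
    phi = kinship(ancestors, meioses)
    print(f"{name}: {phi} = {float(phi):.4f}")
```

Repeated consanguinity across generations pushes realized values above these single-generation baselines.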

Conclusion

This work provides the largest fine-scale analysis of population structure and recent demographic history in Pakistani-ancestry individuals, revealing subgroup-specific clusters shaped by the biraderi system, divergence within the last ~70 generations, recent Ne reductions, and substantial IBD and ROH burdens. A neural network and coalescent modeling show that consanguinity is under-reported and varies markedly by subgroup; endogamy contributes additional homozygosity and recessive disease risk. Clinically, recording biraderi information could aid in identifying subgroup-specific risks and founder mutations and in tailoring genomic health services. Future research should: (1) expand sampling across more Pakistani ethnic groups (e.g., Baloch, Sindhi) in the UK and Pakistan; (2) employ larger whole-genome sequencing datasets (including more Y-chromosome markers) for finer demographic inference; (3) analyze clinical cohorts to estimate absolute risks and identify subgroup-enriched disorders; and (4) develop methods to control for recent fine-scale structure and leverage IBD to improve rare variant discovery.

Limitations
  • Sample composition: Many of the 56 reported subgroups had small sample sizes, limiting subgroup-level inference; the cohort is primarily of Pathan, Punjabi, and Kashmiri ancestry and was sampled in the UK, limiting generalizability to all Pakistani groups.
  • Genotyping platform: Reliance on SNP arrays constrained fine-scale demographic inference; more comprehensive whole-genome sequencing would provide richer information (e.g., more Y-chromosome markers).
  • Relatedness estimation: High endogamy complicates disentangling relatedness from structure; conservative removal of putative relatives may have inflated recent Ne and reduced FST, potentially underestimating divergence times.
  • Admixture effects: Gene flow among subgroups likely influenced the inferred Ne trajectories and the split-time estimates, neither of which explicitly modelled admixture.
  • Temporal interpretation: Generation time uncertainty limits precise historical attribution of demographic changes (e.g., Ne declines 15–20 generations ago).
  • Consanguinity inference: The neural-net classifier may misclassify individuals because only a fixed set of relationships was modelled and ROH distributions overlap between relationship classes; the coalescent approach had limited power to detect historical changes in consanguinity rates.
  • Disease risk simulations: Absolute risk is underestimated because of variant ascertainment (many pathogenic missense/inframe variants were omitted), and full penetrance was assumed; the marital pairing simulations may not fully reflect social mating patterns.