AI-based automation of enrollment criteria and endpoint assessment in clinical trials in liver diseases

J. S. Iyer, D. Juyal, et al.

AIM-MASH is an AI-based tool designed to standardize histologic scoring in metabolic dysfunction-associated steatohepatitis (MASH) clinical trials. It produces reproducible predictions that align closely with consensus scores, reducing inter-rater variability and providing a more sensitive measure of patient response.
Introduction

MASH is the progressive form of metabolic dysfunction-associated steatotic liver disease and a leading cause of cirrhosis, hepatocellular carcinoma and liver transplantation. Its incidence and associated healthcare burden are increasing and, recently, resmetirom received the first regulatory approval for MASH. Histologic surrogate endpoints are accepted in MASH trials for enrollment, risk stratification and endpoint assessment, focusing on macrovesicular steatosis, lobular inflammation, hepatocellular ballooning and fibrosis. However, limited sensitivity of existing scoring systems and substantial inter- and intra-pathologist variability, particularly in identifying ballooned hepatocytes, undermine reliability, power and outcomes of trials. Regulatory agencies (FDA, EMA) endorse histopathologic assessments but variability persists even among experts. This study addresses the need for a reproducible, quantitative and scalable approach by developing an AI-based digital pathology tool (AIM-MASH) to automate and standardize histologic scoring and assessment in MASH clinical trials.

Literature Review

Prior frameworks such as the MASH Clinical Research Network (CRN) scoring and related systems (for example, SAF, FLIP) are widely used and form the basis for regulatory guidance, yet studies have documented suboptimal reliability of biopsy evaluation, with high inter-reader variability and misclassification affecting enrollment and outcomes. Efforts to harmonize scoring guidelines continue, but variability in key features like hepatocellular ballooning recognition remains high. Advances in AI for digital pathology have shown potential for quantitative and reproducible whole-slide image assessments, but they have not been widely adopted or qualified for clinical trial use. These gaps motivate robust AI-driven solutions aligned with regulatory expectations and clinical trial workflows.

Methodology

AIM-MASH comprises multiple convolutional neural network (CNN) and graph neural network (GNN) models that produce histologic readouts. Training data included 8,747 H&E and 7,660 Masson’s trichrome (MT) whole-slide images (WSIs) from six completed phase 2b/3 MASH clinical trials, augmented with PSC and HBV datasets, with 103,579 pixel-level annotations provided by 59 MASH-expert pathologists. Data were split at the patient level into training (~70%), validation (~15%) and held-out test (~15%) sets, balancing disease severity metrics where possible.

The CNNs included: (1) an artifact segmentation model to distinguish evaluable tissue from background and artifacts on H&E/MT; (2) H&E tissue segmentation for macrovesicular steatosis, hepatocellular ballooning, lobular inflammation and other features (portal inflammation, microvesicular steatosis, interface hepatitis, normal hepatocytes); (3) MT segmentation for nonpathologic septal/subcapsular regions, pathologic fibrosis, bile ducts and blood vessels. Models used architectures inspired by residual and inception networks with softmax loss, data augmentation (random crops, rotations, color perturbations, noise), distributionally robust optimization, and mixup; inputs were zero-mean normalized. CNN outputs produced pixel-level maps, slide-level overlays and proportionate area measurements (raw mm² and artifact-normalized proportions) of histology features.

GNNs were trained on CNN-derived superpixel graphs to predict MASH CRN ordinal grades/stages (steatosis, lobular inflammation, ballooning from H&E; fibrosis from trichrome) and corresponding continuous scores. Nodes carried spatial, topological and logit-derived features; edges connected k-nearest neighbors. To mitigate reader bias, the GNNs incorporated mixed effects with pathologist-specific bias parameters that were learned during training and discarded at inference, enabling unbiased predictions aligned to multi-reader consensus.
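The k-nearest-neighbor edge construction mentioned above can be illustrated with a minimal sketch. It assumes each superpixel is represented only by a 2D centroid and that neighbors are chosen by Euclidean distance; the function name `knn_edges`, the toy centroids and the choice of `k` are hypothetical and are not taken from the study, whose actual graph construction is not detailed in this summary.

```python
import math

def knn_edges(coords, k):
    """Return directed (source, target) edges linking each node to its k nearest nodes.

    coords: list of (x, y) superpixel centroids (hypothetical representation).
    Ties in distance are broken by the lower node index so the result is deterministic.
    """
    n = len(coords)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of nodes")
    edges = []
    for i, p in enumerate(coords):
        # Rank every other node by Euclidean distance to node i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (math.dist(p, coords[j]), j),
        )
        edges.extend((i, j) for j in others[:k])
    return edges

# Toy centroids: three superpixels in one cluster, two in another.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
edges = knn_edges(centroids, k=2)
```

The edges are directed, so a node can be a neighbor of another without the reverse holding; a real pipeline might symmetrize them before passing the graph to a GNN.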
Continuous scoring mapped ordinal bins to unit intervals (for example, steatosis grade 0 mapped to 0–1, grade 1 to 1–2), using learned logit cutoffs and piecewise linear mapping with tail constraints for outer bins.

Quality control included annotator vetting, annotation QC, iterative model/overlay review, internal test characterization and evaluation on an external, out-of-distribution held-out dataset.

Statistical analyses: repeatability was assessed via ten independent deployments on the same analytic performance test set (percentage agreement); accuracy was assessed by agreement rates versus consensus of three expert pathologists in an external held-out dataset (MLOO treating the model as a fourth reader; bootstrapped CIs). Clinical utility assessments included enrollment and endpoint concordance against consensus; efficacy comparisons used Cochran-Mantel-Haenszel tests stratified by diabetes status and baseline cirrhosis (manual). Continuous score interpretability used Kendall’s tau correlations with mean pathologist scores, and correlations with noninvasive tests (NITs). Prognostic utility used Kaplan-Meier and Cox regression to predict progression to cirrhosis (F3 baseline) or liver-related events (F4 baseline) with cutoffs maximizing hazards; ROC analyses compared continuous vs ordinal AUCs.
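The mapping from ordinal bins to unit intervals can be sketched as a piecewise linear function of a latent model output. This is an illustration under stated assumptions, not the study's implementation: the cutoff values are made up, the name `continuous_score` is hypothetical, and the "tail constraint" is assumed here to mean that the two unbounded outer bins are given a fixed width (`tail_width`) and clipped.

```python
from bisect import bisect_right

def continuous_score(z, cutoffs, tail_width=1.0):
    """Map a latent value z to a continuous score so that ordinal bin k covers [k, k + 1].

    cutoffs: strictly increasing thresholds separating the ordinal bins
             (hypothetical values; in the study these are learned).
    tail_width: assumed finite width given to the two unbounded outer bins.
    """
    if any(a >= b for a, b in zip(cutoffs, cutoffs[1:])):
        raise ValueError("cutoffs must be strictly increasing")
    k = bisect_right(cutoffs, z)  # index of the ordinal bin containing z
    if k == 0:
        lo, hi = cutoffs[0] - tail_width, cutoffs[0]    # lower tail bin
    elif k == len(cutoffs):
        lo, hi = cutoffs[-1], cutoffs[-1] + tail_width  # upper tail bin
    else:
        lo, hi = cutoffs[k - 1], cutoffs[k]             # interior bin
    frac = (z - lo) / (hi - lo)
    frac = min(max(frac, 0.0), 1.0)  # clip so outer bins stay within their unit interval
    return k + frac

# Three hypothetical cutoffs define four grades (0 to 3), as for steatosis.
cutoffs = [-1.0, 0.5, 2.0]
mid_grade_1 = continuous_score(-0.25, cutoffs)  # halfway through the grade 1 bin
```

With these toy cutoffs a latent value halfway between the first two thresholds yields 1.5, and the score is continuous across thresholds because a value exactly on a cutoff maps to the integer boundary.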

Key Findings

• Perfect repeatability: Ten independent AIM-MASH reads per WSI yielded 100% model-model agreement (κ = 1), exceeding reported intra-pathologist agreements (37–74%).
• Agreement versus consensus (external held-out dataset): steatosis κ = 0.74 (95% CI 0.71–0.77), ballooning κ = 0.70 (0.66–0.73), lobular inflammation κ = 0.67 (0.64–0.71), fibrosis κ = 0.62 (0.58–0.65). Model-consensus agreement exceeded any individual pathologist vs consensus and mean pairwise pathologist agreement.
• Enrollment criteria (STELLAR-3/4, n = 605 WSIs): MAS ≥ 4 vs < 4 agreement 0.82 (95% CI 0.79–0.85) for AIM-MASH vs 0.81 (0.78–0.83) average pathologist; F1–F3 vs F4 agreement 0.97 (0.95–0.98) vs 0.96 (0.95–0.97), respectively.
• Endpoint assessment (held-out phase 2b dataset): fibrosis improvement without MASH worsening agreement 0.80 (0.76–0.84) for both AIM-MASH and pathologists; MASH resolution without fibrosis worsening 0.86 (0.82–0.89) for AIM-MASH vs 0.82 (0.79–0.86) pathologists; ≥2-point MAS reduction 0.79 (0.74–0.83) vs 0.77 (0.74–0.81).
• Retrospective efficacy (ATLAS, baseline to week 48): AIM-MASH detected higher responder proportions in CILO+FIR versus central reader: ≥2-point MAS reduction 60% vs 35%; fibrosis improvement without MASH worsening 27% vs 16%; MASH resolution without fibrosis worsening 24% vs 5%. Placebo-adjusted response rates were greater with AIM-MASH (MAS reduction: 35% vs 25%; fibrosis improvement: 11% vs 9%; MASH resolution: 11% vs 5%).
• Continuous scores correlated with mean pathologist scores and with relevant NITs: continuous fibrosis with FibroScan liver stiffness (r = 0.33, P < 0.001), FIB-4 (τ = 0.23, P < 0.001), ELF (τ = 0.22, P < 0.001), TIMP1 (τ = 0.11, P = 0.02), PIIINP (τ = 0.14, P < 0.01); continuous steatosis with MRI-PDFF (r = 0.52, P < 0.001); collagen proportionate area correlated positively with continuous fibrosis (r = 0.56, P < 0.001) and inversely with continuous steatosis (r = −0.16, P < 0.001).
• Continuous fibrosis (cFib) was more sensitive than CPA to treatment-induced change in ATLAS responders: greater reduction in cFib for treated vs placebo (Mann-Whitney U = 20.0, P = 0.02), while fibrosis area change was not significant (U = 39.0, P = 0.21). cFib decreased in responders and increased in nonresponders.
• Prognostic value: cFib thresholds (3.6 for F3, 4.6 for F4) stratified rapid vs slow progressors (F3 log-rank = 31.0, P = 2.6×10⁻⁷; F4 log-rank = 4.8, P = 0.028). Continuous scores had higher AUCs than ordinal for predicting progression to cirrhosis (0.66 vs 0.59) and LREs (0.61 vs 0.54).
• Abstract highlights: AIM-MASH continuous scores strongly predicted progression-free survival in stage 3 (P < 0.0001) and stage 4 (P = 0.03) fibrosis and detected greater continuous fibrosis change in treatment responders versus placebo (P = 0.02).
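To make the agreement statistics above concrete, the following sketch computes percentage agreement and an unweighted Cohen's kappa between two sets of ordinal reads. The scores are invented toy data, the function names are hypothetical, and the unweighted form is an assumption for illustration; this summary does not state which kappa variant the study used.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two readers assign the same category."""
    if len(a) != len(b) or not a:
        raise ValueError("reads must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each reader's marginal category frequencies.
    p_expected = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / n**2
    if p_expected == 1.0:
        return 1.0  # both readers used a single identical category
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy fibrosis stages for eight slides (not study data).
model_reads = [0, 1, 2, 3, 1, 2, 0, 3]
consensus_reads = [0, 1, 2, 2, 1, 2, 0, 3]

agreement = percent_agreement(model_reads, consensus_reads)
kappa = cohen_kappa(model_reads, consensus_reads)
repeat_kappa = cohen_kappa(model_reads, model_reads)  # identical repeated reads
```

Identical repeated reads give 100% agreement and κ = 1, which is the situation the repeatability finding describes for a deterministic model.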

Discussion

AIM-MASH addresses a critical limitation in MASH clinical trials: variability and limited sensitivity of manual histologic scoring. By leveraging CNN-based feature segmentation and GNN-based mixed-effects scoring, the system aligns with expert consensus while avoiding individual reader biases and achieves perfect computational repeatability. AIM-MASH reproduced enrollment decisions and endpoint assessments with accuracy comparable to expert pathologists and, in retrospective analyses, identified a greater proportion of treatment responders and larger placebo-adjusted responses. Continuous scoring enhanced sensitivity to subordinal histologic changes, correlated with independent noninvasive biomarkers, and provided superior prognostic discrimination for clinical progression compared with ordinal scores. These capabilities suggest AIM-MASH can standardize and augment pathologist assessments, improve trial powering and reliability, and increase sensitivity to therapeutic effects, potentially improving clinical trial outcomes and patient benefit.

Conclusion

AIM-MASH is a robust, reproducible AI tool for automated evaluation of MASH histology that recapitulates pathologist consensus for enrollment and endpoint determinations and enhances detection of treatment response. Its continuous scoring framework captures subordinal changes, correlates with noninvasive biomarkers, and improves prognostic stratification for clinical outcomes. The approach can reduce inter-rater variability, harmonize with FDA/EMA-endorsed histologic endpoints, and support more sensitive, reproducible assessment of therapeutic efficacy. Ongoing analytical and clinical validation across scanners, sites and datasets, along with regulatory qualification activities with FDA and EMA, will further establish its prospective utility. Future work should refine nonlinear mappings of disease progression/regression, define clinically meaningful continuous thresholds and changes, and explore additional AI-derived biomarkers (for example, portal inflammation, bile duct features) for risk prediction and trial enrichment.

Limitations

• Continuous scoring maps disease changes to a linear scale, which may not reflect the nonlinear nature of MASH progression/regression; similar absolute changes in different score ranges may not represent equivalent biological change.
• Clinically meaningful thresholds and minimal clinically important differences for continuous scores are not yet established; associations with improved outcomes require further definition.
• Additional analytical validation is needed across pre-analytic factors (scanner models, operators, staining/biopsy/section quality) to confirm generalizability.
• The study is retrospective in parts; prospective validation in trial workflows is required. Some datasets and proprietary code are not publicly available, limiting independent replication of the full pipeline (though analysis code for figures is provided).
