Fertil Steril. 2025 Aug 26:S0015-0282(25)01845-X. doi: 10.1016/j.fertnstert.2025.08.021. Online ahead of print.
ABSTRACT
OBJECTIVE: To evaluate the stability and reliability of artificial intelligence (AI) models and approaches for embryo selection and rank ordering in in vitro fertilization (IVF).
DESIGN: A laboratory-based study evaluating the stability and consistency of Single Instance Learning (SIL) models that assess embryos individually, predicting live-birth outcomes based solely on each embryo's morphological features. Fifty replicate convolutional neural networks (CNNs) with varying initialization parameters were trained and tested across two independent fertility center datasets. Model performance was assessed through embryo rank ordering, critical error rates, and inter-model variability. Interpretability analyses using Gradient-weighted Class Activation Mapping (GradCAM) and t-distributed stochastic neighbor embedding (t-SNE) were conducted to explore decision-making discrepancies among replicate models.
SUBJECTS: The study used retrospective embryo image datasets from two centers: Massachusetts General Hospital (MGH), with 10,713 embryos from 1,258 patients, and Weill Cornell Fertility Center (Cornell), with 648 embryos from 53 patients.
MAIN OUTCOME MEASURES: Consistency in embryo ranking (Kendall's W), frequency of critical errors (instances where low-quality embryos were top-ranked), and inter-model variability across datasets.
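As an illustration of the ranking-consistency measure, the following is a minimal sketch of Kendall's W (coefficient of concordance) for m replicate models scoring the same n embryos. It is not the authors' code: the function names, the input layout (one row of scores per model), and the assumption of no tied scores are all hypothetical choices made for this example. W ranges from 0 (no agreement among models) to 1 (identical rankings).

```python
def rank_scores(scores):
    """Return ranks 1..n (1 = lowest score). Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def kendalls_w(score_matrix):
    """Kendall's W without tie correction.

    score_matrix[j][i] is the score that model j assigns to embryo i
    (hypothetical layout: m rows of models, n columns of embryos).
    """
    m = len(score_matrix)      # number of replicate models ("raters")
    n = len(score_matrix[0])   # number of embryos being ranked
    rank_rows = [rank_scores(row) for row in score_matrix]
    # Sum of the ranks each embryo receives across all models.
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    # Squared deviation of the rank sums from their mean.
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

For example, three models that order four embryos identically give W = 1, while two models that produce exactly reversed orderings give W = 0.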
RESULTS: AI models demonstrated poor consistency in embryo rank ordering (Kendall's W ∼0.35) and exhibited high critical error rates (∼15%), often ranking lower-quality embryos above viable ones. Significant inter-model variability was observed even among models with similar predictive accuracies (AUC ∼60%). When tested on data from a different fertility center, model instability increased (error variance delta: 46.07%²), highlighting sensitivity to distribution shifts. Interpretability analyses revealed divergent decision-making strategies among replicate models, despite identical architectures and training protocols.
CONCLUSION: SIL AI models for IVF embryo selection exhibit substantial instability and inconsistency, undermining their clinical reliability. High inter-model variability and critical error rates raise concerns about their suitability for real-world deployment. This study highlights the need for more stable AI frameworks and robust evaluation metrics tailored to the clinical demands of IVF.
PMID:40876725 | DOI:10.1016/j.fertnstert.2025.08.021