Phenotype-Guided In Silico Molecular Generation Using Large Language Models

Submitted by root on Saturday, 01/03/2026 - 00:00
Complex diseases often emerge from coordinated, system-level changes in cellular state that are difficult to address with target-centric drug discovery. Phenotypic drug discovery offers a principled alternative but remains constrained by the cost and limited scalability of pathologically relevant assays. Here we present GEMGen, a large language model-based framework that performs in silico phenotypic drug discovery by generating small molecules directly from transcriptomic representations of cellular states. GEMGen encodes desired phenotypic transitions as text-based representations of up- and down-regulated gene sets, enabling transferable modeling across experimental platforms and data modalities. Trained on large-scale chemical perturbation data, GEMGen robustly identifies phenotype-oriented compounds, as well as mechanistically related but structurally distinct candidates, across multiple benchmarks. Applied to signatures induced by genetic perturbations, GEMGen produces small molecules that phenocopy gene knockdown effects and identifies chemically novel inhibitors, including previously unreported KEAP1 inhibitors that activate NRF2 signaling. Extending this approach to a disease-relevant model of fibrosis, GEMGen generates compounds that reverse profibrotic transcriptional programs and cellular phenotypes. These results establish a scalable framework for translating transcriptomic phenotypes into candidate therapeutic molecules, enabling systems-level exploration of vast chemical space and offering a complementary in silico counterpart to physical phenotypic drug screens.
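
To make the idea of a text-based representation of up- and down-regulated gene sets concrete, the following is a minimal sketch in Python. It is not GEMGen's actual encoding: the function names, the log fold-change threshold, the top-k cutoff, the prompt wording, and the example expression values are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def split_signature(
    log_fold_changes: Dict[str, float],
    top_k: int = 3,
    threshold: float = 1.0,
) -> Tuple[List[str], List[str]]:
    """Split a differential expression signature into up and down gene lists.

    The threshold and top_k values are hypothetical, not taken from the paper.
    """
    # Genes at or above the threshold, strongest increase first.
    up = sorted(
        (g for g, v in log_fold_changes.items() if v >= threshold),
        key=lambda g: (-log_fold_changes[g], g),
    )
    # Genes at or below the negative threshold, strongest decrease first.
    down = sorted(
        (g for g, v in log_fold_changes.items() if v <= -threshold),
        key=lambda g: (log_fold_changes[g], g),
    )
    return up[:top_k], down[:top_k]


def signature_to_text(up: List[str], down: List[str]) -> str:
    """Render the two gene lists as plain text (an assumed prompt format)."""
    up_text = ", ".join(up) or "none"
    down_text = ", ".join(down) or "none"
    return f"Up-regulated genes: {up_text}. Down-regulated genes: {down_text}."


# Made-up log fold-change values, for illustration only.
example_signature = {
    "HMOX1": 2.4,
    "NQO1": 1.8,
    "GCLM": 1.2,
    "ACTB": 0.1,
    "COL1A1": -1.5,
    "ACTA2": -2.0,
}

up_genes, down_genes = split_signature(example_signature)
prompt = signature_to_text(up_genes, down_genes)
print(prompt)
```

Because such a representation relies only on gene symbols and the direction of change, the same text format could in principle be built from any platform that yields a ranked differential expression list, which is one way to read the transferability claim above.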