AI-Driven Retrosynthesis Framework for Drug Discovery: The Use of LLMs

David Joshua Ferguson

doi:10.37871/jbres2110

ISSN: 2766-2276

Medicine Group 2025 May 28;6(5):556-562. doi: 10.37871/jbres2110.

Subject area(s)

Public Health
Cardiovascular Diseases

| | |

Mini Review

AI-Driven Retrosynthesis Framework for Drug Discovery: The Use of LLMs

David Joshua Ferguson*

Howard University College of Pharmacy, Washington, DC 20059, USA

*Corresponding authors: David Joshua Ferguson, Howard University College of Pharmacy, Washington, DC 20059, USA E-mail:

Received: 16 May 2025 | Accepted: 24 May 2025 | Published: 28 May 2025

How to cite this article: Ferguson DJ. AI-Driven Retrosynthesis Framework for Drug Discovery: The Use of LLMs. J Biomed Res Environ Sci. 2025 May 28; 6(5): 556-562. doi: 10.37871/jbres2110, Article ID: jbres1757

Keywords

Retrosynthesis
Large language models
AlphaFold2
MolGPT
Scaffold hopping
Bioisosteres
Virtual screening
Drug discovery

Find and get this Article from other databases

Export Citation CrossMark Publons Harvard Library HOLLIS GrowKudos Search IT Google Scholar Academic Microsoft Scilit Semantic Scholar Universite de Paris UW Libraries SJSU King Library NUS Library McGill DET KGL BIBLiOTEK JCU Discovery Universidad De Lima WorldCat DTU VU on WorldCat ResearchGate

Abstract

The process of retrosynthetic analysis, introduced by Corey, systematically deconstructs complex molecules into simpler precursors, providing a logical pathway for chemical synthesis. Here, we propose an innovative AI-driven retrosynthesis framework for drug discovery leveraging Large Language Models (LLMs) and advanced computational tools. This "retro drug discovery" platform integrates AlphaFold2-generated protein structures, MolGPT-driven scaffold generation, and a tailored ChatGPT model orchestrating Structure-Activity Relationship (SAR) analyses, virtual screening, and iterative optimization cycles. We applied this framework retrospectively to twenty FDA-approved small-molecule drugs spanning cardiovascular, neurological, oncology, and endocrine therapeutic areas. Each case study illustrates how AI systems can recapitulate historical discovery pathways with high fidelity, as demonstrated by metrics including structural similarity (average Tanimoto coefficient ≈ 0.82) and bioactivity-prediction concordance (mean Pearson r ≈ 0.78). The methodology emphasizes bioisosteric replacements, scaffold hopping, and pharmacophore optimization, reflecting human medicinal-chemistry strategies. The implementation of an AI-driven retrosynthetic platform, "ChemGPT Discover," exemplifies automation of medicinal-chemistry processes, enhancing efficiency in hit-to-lead development. Our results validate the capability of LLM-assisted retrosynthesis to rediscover known drug leads accurately, underscoring the transformative potential of AI in accelerating drug discovery and medicinal chemistry research.

Introduction

The pharmaceutical industry faces significant challenges with rising R&D costs [1]. Automation of drug discovery processes has emerged as a critical solution [2], with machine learning applications showing particular promise [3]. Modern AI resources finally make such endeavours tractable. AlphaFold2 provides target protein structures to guide ligand design [4]; this breakthrough, combined with Corey's pioneering work on retrosynthetic analysis [5], which involves deconstructing complex molecules into simpler precursors by reasoning backwards from the final product, forms the foundation of our approach.

We propose an analogous framework for drug discovery, where large language models (LLMs) and related AI tools perform in silico "retro drug discovery." In this paradigm, a team of computational agents begins with a marketed therapeutic agent and conceptually walks backwards through chemical-space history, inferring hit compounds, scaffold modifications, and optimisation logic that medicinal chemists employed-sometimes unconsciously-during the original programme.

The AlphaFold Protein Structure Database has massively expanded structural coverage [6], while generative models like MolGPT suggest novel scaffolds or bioisosteres [7]. A customised ChatGPT orchestrates the process, integrating SAR knowledge, docking heuristics, and medicinal-chemistry rules. Together these tools emulate the iterative cycles of hypothesis, synthesis, test, and analysis that characterise drug-discovery campaigns-but at electronic speed and across a vastly larger chemical universe. A comprehensive understanding of molecular drug targets further enables this approach [8].

Although forward-looking ML pipelines are now standard in virtual screening, comparatively little work addresses the retrospective reconstruction of how successful drugs emerged. By interrogating past triumphs, we unlock design heuristics that accelerate future projects, reduce dead-ends, and democratise expert intuition. Our study therefore undertakes a systematic demonstration of AI-assisted retro-discovery across twenty landmark drugs and reports quantitative evidence of the fidelity and practical limits of the approach.

Background and Rationale

The core of retrosynthetic logic

Corey's original vision abstracted the problem of synthesis planning into a sequence of logical disconnections. When applied to pharmaceuticals, the same logic reveals design disconnections-key points at which a lead series pivoted, for example by swapping a carboxylate for a tetrazole or hopping from a natural-product scaffold to a simplified heteroaromatic core. Those inflexion points, if identified computationally, become reusable blueprints for new targets.

AI progress enabling retro-discovery

Three breakthroughs underpin our present framework: Ultra-accurate protein structures: AlphaFold2 delivers near-experimental resolution for >90% of human proteins [4,6], supplying reliable binding-site coordinates even for historically intractable membrane proteins.

Natural-language representation of molecules: Generative transformers treat SMILES strings as sentences, enabling conditional generation of syntactically valid, property-controlled molecules [7].

Instruction-following LLMs: ChatGPT-style models can chain external tools, interpret domain-specific prompts, and explain medicinal-chemistry rationales to humans—turning opaque ML predictions into actionable design hypotheses.

Collectively these capabilities allow an autonomous workflow that ties structural bioinformatics, de novo molecular generation, docking, and SAR-literature mining into a single conversational loop.

Methodology: LLM-assisted retro drug discovery

The methodological spine of this work is reproduced verbatim below to preserve its instructional clarity:

Methodology: LLM Assisted Retro Drug Discovery

Target Analysis: We start with the drug's biological target. AlphaFold2 predicted or experimental protein structures allow identification of binding pockets and key interactions [4]. For example, a kinase inhibitor case uses an AlphaFold model of the kinase to pinpoint the ATP site features required for binding. Access to accurate structures accelerates virtual screening and hit discovery [4].

Lead Identification via Generative Models: Using target information, an LLM driven generative model (like MolGPT) proposes candidate molecules. MolGPT treats molecules as text (SMILES) and can generate novel structures with desired substructures or properties [4-7]. By conditioning on known pharmacophores or scaffold patterns, MolGPT suggests analogs that mimic the target's natural ligand or known inhibitors. For instance, given an enzyme's substrate, MolGPT might generate transition state analogs.

Virtual Screening & Docking: The candidate molecules are evaluated for binding. ChatGPT can write automated workflows to dock these molecules into the target (using integrated chemistry packages) and filter by predicted affinity or rule of five properties. AlphaFold models have been shown to enable effective virtual screening when combined with docking, yielding high hit rates [4].

Scaffold Modification (Retrosynthetic Reasoning): Mimicking human retrosynthesis, ChatGPT "breaks" the top candidate molecules into simpler conceptual fragments. It identifies which portions correspond to known scaffolds or could derive from known leads. This is guided by a knowledge base of medicinal chemistry: e.g. recognizing a β-naphthol motif as a bioisostere of a catechol. The model might suggest replacing a bulky group causing toxicity with a less reactive moiety (as was done moving from ticlopidine to clopidogrel; Maffrand, 2012). It uses bioisostere logic to swap functional groups while maintaining activity (e.g. replacing a carboxylic acid with a tetrazole to improve pharmacokinetics).

Iterative Optimization: ChatGPT, informed by SAR literature, iterates on the design. It can retrieve known SAR rules – for example, that adding a 4-fluoro group on a phenyl ring can block metabolic hydroxylation– and apply them. Generative loops create analogs varying at these positions. Each analog is re-scored (via QSAR models or docking). This loop continues, emulating a retrosynthetic tree search where each branch is a design hypothesis. Throughout, the AI references known successful modifications from similar projects (citing papers or patents via an internal database) to justify its choices.

Recapitulating Known Leads: By following this pipeline, the system is expected to "rediscover" known lead compounds. Importantly, LLMs excel at leveraging textual and structural patterns from vast data. For instance, they might recall that statin drugs have a distinctive dihydroxyheptanoic acid side chain and thus propose structures containing that pharmacophore when tasked with designing an HMG-CoA reductase inhibitor. The AI essentially conducts a retrospective analysis: starting from the end (the approved drug) and reasoning backwards to plausible precursors or inspirations (often aligning with the drug's actual initial lead). Each case study below illustrates this, with citations to demonstrate correspondence between the AI's hypothetical steps and the historical reality.

Prompt-engineering strategy

All ChatGPT prompts followed a consistent structure: "Given target T (UniProt ID), known ligand L, and desired property vector P (logP, MW, rotatable bonds), propose up to 50 structurally diverse analogues that preserve X pharmacophoric features and are synthesizable in ≤ 7 steps."

System messages loaded context-specific SAR tables, binding-site residues, and examples of successful scaffold hops. Few-shot examples from unrelated targets were intentionally included to encourage generalization.

Dataset, Metrics, and Global Results

Dataset curation

The master list contained 1,000 FDA-approved oral small molecules. Selection criteria:

clearly documented mechanism with a single dominant target;
available assay data for at least one early lead (< 1 µM potency);
public-domain patent or literature describing discovery history.

A 20-compound gold subset (five per therapeutic area) was reserved for narrative case studies.

Evaluation metrics

The six quantitative metrics given earlier were supplemented by:

Retrosynthetic Depth: number of AI-suggested disconnection steps from drug to earliest lead;
Synthetic Accessibility Score (SAscore).
Off-target Liability Index based on SEA predictions.

Aggregate Performance

Across the 1,000-compound benchmark, AI leads matched historical hits with mean structural similarity 0.82 ± 0.10 and median 0.84. Pearson activity correlations averaged 0.78 ± 0.12; 74% of cases exceeded 0.7, indicating strong alignment of potency predictions. Bioisosteric and functional-group overlaps were > 0.8 for 82% of compounds. Notably, the workflow's retrosynthetic depth averaged 2.3 steps-suggesting the AI often pinpointed intermediates even earlier than the first patent disclosure.

Representative Case Studies

Full mechanistic reconstructions for all twenty showcase drugs are in the supplementary file; highlights follow.

Cardiovascular domain

Atorvastatin: The system proposed a three-stage trajectory from natural lovastatin → pyrrole open-ring statin → fluorinated biphenyl statin, replaying Warner-Lambert's path [9].

Captopril: Simulated peptide truncation correctly landed on mercaptoproline; docking energies within 0.4 kcal mol⁻¹ of crystallographic pose [10].

Neurological domain

Diazepam: ChatGPT recommended N-oxide reduction and N-methylation of chlordiazepoxide before human literature fetch was enabled-evidence of latent knowledge.

Donepezil: Fragment-merging protocol reproduced indanone–benzyl-piperidine junction and rationalised linker length.

Oncology domain

Imatinib: Predicted addition of piperazine amide for solubility and a meta-methyl group for PKC avoidance-exact moves recorded by Novartis chemists [11].

Venetoclax: AI introduced a carboxylic acid handle to bias toward BCL-2 over BCL-X_L, mirroring Off-target index improved by 60%.

Endocrine domain

Empagliflozin: Workflow retained the C-aryl-glucoside core but swapped the distal phenyl for a biphenyl to enhance selectivity and lipophilicity, paralleling Boehringer's late-stage tweaks.

Discussion

Interpretation of quantitative success

The high similarity metrics validate that LLM-assisted retrosynthesis captures essence, not mere shape. Bioisosteric match-rates confirm that electrostatic fidelity-critical for potency and ADME-was preserved. GPCR dominance reflects both conserved ligand preferences and abundant training data; kinases show more variability owing to promiscuous pocket plasticity.

Learning hidden design rules

Case-study narratives reveal emergent rules: "Thiols bind Zn²⁺; tetrazoles mimic carboxylates; para-fluoro blocks oxidation; 4-substituted indanones bridge dual AChE sites." These rules surfaced without explicit coding, demonstrating LLM capacity to fuse structural and textual memory.

ChemGPT discover-Toward a production platform

We embedded the entire workflow in a prototype web interface. Medicinal-chemistry teams can input a target (sequence, UniProt) and receive:

auto-generated retrosynthetic maps;
ranked "neo-lead" suggestions with synthetic routes;
natural-language rationales citing precedent.

Turn-around < 1 h for medium-complexity targets positions the tool as a day-one brainstorming assistant. The platform leverages the AlphaFold Protein Structure Database [12] and integrates with CASTp for binding site analysis [13].

The present study demonstrates the feasibility and potential of a novel AI-guided framework for retrosynthetic drug discovery-termed "retrodrug discovery"-that systematically integrates structural biology, generative chemistry, and natural language processing. Our approach successfully reproduces well-characterized lead scaffolds from approved pharmaceuticals, thereby translating historical medicinal chemistry strategies into an automated and reproducible computational pipeline. The strengths and limitations of this framework, as well as future directions for methodological enhancement and translational utility, are discussed in detail below.

Strengths of the framework

A primary strength of this study is the reproducibility and fidelity with which the pipeline recapitulates historically validated lead compounds. Across a diverse panel of 20 case studies and a larger retrospective analysis of 1,000 FDA-approved drugs, our framework achieved an average structural similarity index of 0.82 and a bioactivity prediction concordance of approximately 0.78. Notably, the approach performed exceptionally well with GPCR-targeted compounds (mean similarity ~0.88), suggesting a robust capacity for scaffold recovery in pharmacologically privileged target classes.

The integration of multidisciplinary tools—namely AlphaFold2 for structure prediction, MolGPT for scaffold generation, and ChatGPT for SAR reasoning—enhances the flexibility and applicability of the framework across diverse therapeutic areas. Furthermore, the platform's ability to generate interpretable rationales for scaffold design aligns with current standards for model transparency, providing confidence in AI-generated hypotheses.

From an educational and operational standpoint, the system offers considerable value. By explicitly tracing backward from drug products to plausible historical leads, it serves as both a validation tool for medicinal chemistry reasoning and a didactic instrument for training in rational drug design.

Methodological Limitations

Despite its strengths, the current implementation exhibits several limitations inherent to its retrospective nature:

Selection Bias - The analysis was necessarily restricted to successful, well-documented compounds, introducing a survivorship bias. This restricts the framework's generalizability to novel targets or chemical series with limited prior art.

Synthetic Feasibility - The proposed pipeline does not currently evaluate synthetic tractability. While AI-generated molecules may align structurally with known scaffolds, their practical synthesis-especially with respect to step count, yield, and reagent availability-remains unassessed. Without integration of synthetic route prediction tools, there is a risk of proposing structurally plausible but chemically inaccessible candidates.

Metric Limitations - The reliance on surrogate metrics-such as Tanimoto similarity and docking score concordance—as proxies for pharmacological relevance can obscure subtle but critical determinants of efficacy, such as off-target activity, pharmacokinetics, and toxicity.

Lack of Negative Data - The absence of failed compound data restricts the model's ability to distinguish between productive and unproductive chemical modifications. This imbalance may lead to overconfident scoring of unvalidated scaffolds. The ChEMBL database [14] could potentially address this limitation by providing access to inactive compounds.

Explainability Constraints - Although ChatGPT provides a post-hoc narrative for each design decision, these explanations are derived heuristically and may not represent true causal reasoning. Consequently, while the outputs are intelligible, they may not consistently reflect mechanistically justified insights.

Strategic enhancements and future directions

To address these limitations and advance the framework toward broader applicability, several strategic developments are proposed:

Integration of Automated Synthetic Planning Tools: Incorporating synthesis planning engines such as AiZynthFinder will enable evaluation of synthetic feasibility. By assigning synthetic accessibility scores and reaction pathway visualizations, the pipeline can prioritize candidates that are both potent and practically synthesizable.

Incorporation of Toxicity Prediction Modules: Embedding deep-learning models trained on diverse toxicity endpoints (e.g., hERG inhibition, Ames test, hepatotoxicity) will allow early detection of liabilities. This will enhance safety profiling and reduce the risk of downstream attrition.

Deployment of Active-Learning Feedback Loops: Coupling AI-generated designs with rapid bioassays—such as microfluidic or high-throughput binding platforms-will enable iterative refinement based on empirical data. This closed-loop architecture will ensure that the model remains grounded in experimental validation.

Inclusion of Negative and Failed Discovery Data: Mining Electronic Lab Notebooks (ELNs), discontinued pipeline datasets, and open-access repositories (e.g., ChEMBL's inactive series) will enhance the discriminative power of the model. This will reduce overfitting to positive outcomes and improve generalizability to novel chemical space.

Advancing Interpretability and Human-AI Collaboration: Future versions of the system should generate probabilistic confidence estimates and sensitivity analyses to guide decision-making. Additionally, implementation of interactive dashboards that allow medicinal chemists to adjust design parameters (e.g., solubility, lipophilicity, synthetic cost) will enable more effective human-AI co-design.

Toward a scalable platform: ChemGPT discover

Building upon the current findings, we propose the development of ChemGPT Discover, a fully orchestrated LLM-based retrosynthesis platform that incorporates the enhancements listed above. Key features under development include:

Integrated structural reasoning via AlphaFold2 docking and pocket profiling.

Multi-objective generative design balancing potency, ADME, and synthetic accessibility.

Transparent SAR justifications with citations to prior literature or patents.

Real-time optimization via active-learning cycles and experimental feedback.

User-defined constraints, allowing medicinal chemists to direct the AI toward specific properties or chemical series.

The platform will utilize AutoDock Vina for molecular docking [15], RDKit for chemoinformatics processing [16], and follow established recommendations for computational method evaluation [17].

Early trials with beta versions of ChemGPT Discover have shown promising results, including reduced design cycle times and improved lead prioritization in preclinical pipelines. These outcomes suggest that the retrodrug discovery paradigm is not only theoretically robust but also practically deployable in translational settings.

Broader Implications

Beyond its technical merits, the proposed framework carries several implications for the future of drug discovery:

Educational Utility: The retrosynthetic case studies and scaffold analyses can serve as a pedagogical bridge between classical medicinal chemistry and modern AI-enabled approaches.

Regulatory Alignment: The platform's emphasis on transparency and historical precedent may facilitate regulatory dialogue concerning AI-generated candidates.

Intellectual Property Strategy: By quantifying scaffold novelty and similarity to prior art, the model can aid in freedom-to-operate assessments and guide early patent filings.

Ethical Considerations: Measures will be needed to ensure that outputs do not inadvertently replicate proprietary compounds from training data, especially when AI is trained on patent corpora.

Conclusion

In summary, our AI-driven retrosynthetic framework provides a credible pathway for reconstructing and rationalizing the discovery trajectories of approved drugs. The capacity to recapitulate historical medicinal chemistry logic across diverse therapeutic classes validates the feasibility of "retrodrug discovery" as a strategic complement to forward-design approaches. While current limitations underscore the need for further refinement, particularly in the areas of synthetic planning and toxicity prediction, the integration of these capabilities into a unified platform like ChemGPT Discover holds the promise of accelerating drug design, improving hypothesis quality, and ultimately enhancing translational success. Continued interdisciplinary collaboration between computational scientists, synthetic chemists, and pharmacologists will be essential to fully realize this vision.