Machine-learned potentials — neural network force fields trained on quantum mechanical data — have achieved QM-level accuracy at a fraction of the compute cost. The hype around them in the molecular simulation field is justified for certain applications. For relative binding free energy calculations in drug discovery, the story is more nuanced. This post explains the current state, where each approach is stronger, and where we think the hybrid landscape is heading.
What a machine-learned potential is
A machine-learned potential (MLP) is a neural network that takes atomic coordinates as input and outputs energies and forces, trained on a dataset of quantum mechanical calculations (typically DFT with a functional such as B3LYP or ωB97X-D3). The result is a function that approximates the QM potential energy surface without the cost of a direct QM evaluation at every step.
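To make the "coordinates in, energies and forces out" interface concrete, here is a minimal sketch. It is not a neural network: a Morse pair potential stands in for the trained model, and the parameter values (`d_e`, `a`, `r0`) are arbitrary illustrative assumptions. What carries over to a real MLP is that the forces are the negative gradient of the predicted energy, which the finite-difference helper checks.

```python
import math


def morse_energy_and_forces(coords, d_e=0.2, a=1.5, r0=1.5):
    """Toy stand-in for an MLP: coordinates in, (energy, forces) out.

    A real MLP replaces the Morse pair term with a trained network.
    d_e, a and r0 are arbitrary illustrative values, not fitted parameters.
    """
    n = len(coords)
    energy = 0.0
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            delta = [coords[i][k] - coords[j][k] for k in range(3)]
            r = math.sqrt(sum(d * d for d in delta))
            x = math.exp(-a * (r - r0))
            energy += d_e * (1.0 - x) ** 2
            # Derivative of the pair energy with respect to the distance r.
            de_dr = 2.0 * d_e * a * (1.0 - x) * x
            for k in range(3):
                # Force on atom i is -dE/dr * (r_i - r_j) / r; atom j gets the opposite.
                f = -de_dr * delta[k] / r
                forces[i][k] += f
                forces[j][k] -= f
    return energy, forces


def numerical_force(energy_fn, coords, atom, axis, h=1e-5):
    """Central finite-difference estimate of one force component."""
    plus = [list(c) for c in coords]
    minus = [list(c) for c in coords]
    plus[atom][axis] += h
    minus[atom][axis] -= h
    return -(energy_fn(plus)[0] - energy_fn(minus)[0]) / (2.0 * h)


if __name__ == "__main__":
    geometry = [[0.0, 0.0, 0.0], [1.2, 0.1, 0.0], [0.3, 1.4, 0.2]]
    energy, forces = morse_energy_and_forces(geometry)
    print("energy:", energy)
    print("analytic force on atom 0:", forces[0])
```

Comparing the analytic forces against `numerical_force` is the kind of sanity check worth running on any potential, learned or classical, before using it for dynamics.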
The landmark architectures, SchNet (Schütt 2018), DimeNet (Gasteiger 2020), NequIP (Batzner 2022), and MACE (Batatia 2022), progressively improved accuracy and transferability. All of them predict energies that are invariant to rotating or translating the molecule. NequIP and MACE go further with equivariant message passing: their internal features are vectors and higher-order tensors that rotate with the molecule, which improves data efficiency and generalization to geometries not in the training data.
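The symmetry requirement can be illustrated without any neural network. The sketch below is not NequIP or MACE; it only demonstrates the two properties those models are built around. Scalar quantities such as interatomic distances (and therefore any energy computed from them) are unchanged by a rigid rotation and translation, while vector quantities rotate with the molecule. The helper names and the test geometry are our own illustrative choices.

```python
import math


def pairwise_distances(coords):
    """Sorted interatomic distances: a simple invariant description of a geometry."""
    n = len(coords)
    return sorted(
        math.dist(coords[i], coords[j])
        for i in range(n)
        for j in range(i + 1, n)
    )


def rotate_z(vec, angle):
    """Rotate a 3-vector about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = vec
    return (c * x - s * y, s * x + c * y, z)


def transform(coords, angle, shift):
    """Apply the same rotation and translation to every atom."""
    return [
        tuple(r + t for r, t in zip(rotate_z(p, angle), shift))
        for p in coords
    ]


def centroid_to_atom(coords, index):
    """Vector from the geometric centroid to one atom (a vector-valued feature)."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    return tuple(coords[index][k] - centroid[k] for k in range(3))


if __name__ == "__main__":
    mol = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.2, 1.3, 0.4)]
    moved = transform(mol, 0.7, (3.0, -2.0, 5.0))
    # Invariant: the distances are the same before and after the rigid motion.
    print(pairwise_distances(mol))
    print(pairwise_distances(moved))
    # Equivariant: the vector feature of the moved molecule equals the rotated original.
    print(rotate_z(centroid_to_atom(mol, 0), 0.7))
    print(centroid_to_atom(moved, 0))
```

An invariant model only ever sees quantities like the distance list. An equivariant model also carries features like the centroid vector through its layers and guarantees they transform consistently.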
General-purpose MLPs trained on large chemical datasets (ANI-2x, AIQM1, MACE-OFF) are now available and cover a large fraction of drug-like chemical space. These aren't specialized models for a single molecule; they're pretrained potentials that can be applied to novel compounds without custom fitting.
Where MLPs currently excel
MLPs outperform classical force fields in applications that require QM-level accuracy for isolated molecules or small molecular systems:
- Conformational energy profiles: Torsional barriers for heterocycles, ring puckering energetics, and relative conformer energies are substantially more accurate with MLPs than with classical force fields. This matters for computing the strain energy of a ligand bound in a non-minimum-energy conformation, a quantity that enters FEP through the ligand's intramolecular energy terms.
- Proton transfer and tautomer stability: Predicting the tautomeric form of a drug-like molecule at physiological pH is a QM-level problem that MLPs handle well. Classical force fields typically assume a single tautomer form, which introduces errors when the target tautomer differs from the lowest-energy gas-phase form.
- Short-timescale local sampling: For short MD simulations focused on the immediate neighborhood of a crystal structure conformation (few-nanosecond enhanced sampling of the binding site), MLPs can provide more accurate energies than classical force fields, particularly for strained ring systems.
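Two of the quantities in the list above reduce to simple arithmetic once the energies are available. The sketch below assumes you already have relative energies in kcal/mol from some potential (MLP or otherwise). It converts them to Boltzmann populations at 298.15 K and computes a strain energy as the bound-conformer energy minus the lowest conformer energy. The function names are hypothetical, and the sketch treats the inputs as if they were free energies, which ignores entropic and solvation contributions that matter for real tautomer and conformer ratios.

```python
import math

# R*T in kcal/mol at 298.15 K (R = 0.0019872041 kcal/(mol*K)).
KT_KCAL_298 = 0.0019872041 * 298.15


def boltzmann_populations(rel_energies_kcal, kt=KT_KCAL_298):
    """Fractional populations of states from their relative energies."""
    e_min = min(rel_energies_kcal)
    # Shift by the minimum so the largest weight is exactly 1.
    weights = [math.exp(-(e - e_min) / kt) for e in rel_energies_kcal]
    total = sum(weights)
    return [w / total for w in weights]


def strain_energy(bound_energy_kcal, conformer_energies_kcal):
    """Energy of the bound conformation above the lowest-energy conformer."""
    return bound_energy_kcal - min(conformer_energies_kcal)


if __name__ == "__main__":
    # Hypothetical conformer energies in kcal/mol, for illustration only.
    conformers = [0.0, 0.5, 2.0]
    print(boltzmann_populations(conformers))
    print(strain_energy(3.1, conformers))
```

Because populations depend exponentially on the energy gap, an error of a few tenths of a kcal/mol in a conformer or tautomer energy visibly shifts the predicted ratio, which is why accuracy on these small-molecule quantities matters.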
Where classical FEP with a good force field currently wins for drug discovery
For relative binding free energy calculations — the specific application of predicting ΔΔG between two ligands binding to the same protein — classical FEP with OPLS4 or AMBER currently outperforms end-to-end MLP approaches. The reasons are:
The protein environment. MLPs trained on small-molecule QM data don't accurately describe protein-ligand interactions. The protein has ~5,000–20,000 atoms; the interactions between protein residues and ligand require parameterization of a high-dimensional chemical space that current general-purpose MLPs haven't fully covered. AMBER and OPLS4, which combine decades of experimental calibration with QM reference data, remain more accurate for protein force field terms.
The long timescales. FEP requires sampling thermodynamic ensembles — this means running MD simulations of tens to hundreds of nanoseconds across multiple lambda windows. Classical force fields are computationally inexpensive enough to make this feasible on GPU hardware at reasonable cost. MLPs are 10–100× more expensive per MD step than classical force fields, making long-timescale ensemble sampling cost-prohibitive for routine use.
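A back-of-the-envelope estimate shows how the per-step cost compounds. Only the 10–100× slowdown range comes from the text above. The window count, simulation length per window, and classical throughput in ns/day are placeholder assumptions, so substitute numbers from your own protocol and hardware.

```python
def fep_sampling_ns(n_lambda_windows, ns_per_window, n_legs=2):
    """Total simulated time for one ligand pair: windows x length x legs.

    n_legs=2 assumes one leg in the protein complex and one in solvent.
    """
    return n_lambda_windows * ns_per_window * n_legs


def gpu_hours(total_ns, ns_per_day):
    """GPU-hours needed to simulate total_ns at a given throughput."""
    return 24.0 * total_ns / ns_per_day


if __name__ == "__main__":
    # Placeholder protocol: 12 lambda windows of 20 ns each, two legs.
    total_ns = fep_sampling_ns(n_lambda_windows=12, ns_per_window=20)
    # Placeholder classical throughput on one GPU.
    classical_ns_per_day = 200.0

    print("total sampling (ns):", total_ns)
    print("classical GPU-hours:", gpu_hours(total_ns, classical_ns_per_day))
    for slowdown in (10, 100):  # range quoted in the text
        hours = gpu_hours(total_ns, classical_ns_per_day / slowdown)
        print(f"MLP at {slowdown}x slowdown, GPU-hours:", hours)
```

With these placeholder inputs the classical estimate is 57.6 GPU-hours per ligand pair, and the MLP estimates are 576 and 5,760 GPU-hours. Multiplied across the hundreds of pairs in a typical lead optimization campaign, the gap decides whether the calculation is routine or exceptional.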
Existing validation data. The OPLS and AMBER force field families have been validated against thousands of protein-ligand binding data points accumulated over two decades. The accuracy benchmarks exist and are reliable. MLP-based binding free energy calculations lack this depth of prospective validation data.
The hybrid approach: OPLS4 + ΔML correction
The most productive near-term approach combines classical force fields for the protein and solvation terms with an MLP correction layer applied to the ligand intramolecular energy. This is the approach DrugSynq uses.
The rationale: the dominant accuracy-limiting factors in classical FEP are (1) ligand torsional parameters for heterocyclic scaffolds that aren't well represented in the OPLS4 training set, and (2) the intramolecular strain energy of the ligand in the bound conformation. Both are addressable with a targeted MLP correction trained on QM torsional scans and conformer energies of drug-like fragments.
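In its generic form, a Δ-learning correction of this kind swaps the classical ligand intramolecular energy for the MLP one and leaves every other term untouched. The sketch below shows only that generic energy composition. It is not a description of DrugSynq's implementation, and the three energy callables are hypothetical stand-ins for whatever force field and MLP engines you use.

```python
from typing import Callable, Sequence

Coords = Sequence[Sequence[float]]
EnergyFn = Callable[[Coords], float]


def delta_ml_energy(
    system_coords: Coords,
    ligand_indices: Sequence[int],
    mm_total: EnergyFn,
    mm_ligand_intra: EnergyFn,
    mlp_ligand_intra: EnergyFn,
) -> float:
    """Classical total energy plus an MLP-minus-MM correction on the ligand.

    E_hybrid = E_MM(system) + [E_MLP(ligand) - E_MM_intra(ligand)]

    Protein, solvent, and protein-ligand interaction terms stay classical;
    only the ligand's intramolecular energy is replaced.
    """
    ligand_coords = [system_coords[i] for i in ligand_indices]
    correction = mlp_ligand_intra(ligand_coords) - mm_ligand_intra(ligand_coords)
    return mm_total(system_coords) + correction


if __name__ == "__main__":
    # Hypothetical constant-valued energy functions, for illustration only.
    coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
    hybrid = delta_ml_energy(
        coords,
        ligand_indices=[0, 1],
        mm_total=lambda c: -100.0,
        mm_ligand_intra=lambda c: 3.5,
        mlp_ligand_intra=lambda c: 2.0,
    )
    print(hybrid)
```

The design appeal is that the MLP is only ever evaluated on the ligand, typically tens of atoms, so the expensive model never sees the thousands of protein and solvent atoms it was not trained to describe.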
On our benchmark set (14 clinical targets, 312 congeneric pairs), the OPLS4 + ΔML hybrid achieves an RMSE of 0.9 kcal/mol, compared to 1.2 kcal/mol for OPLS4 alone and 2.1 kcal/mol for docking scores. The improvement is concentrated in compounds with unusual heterocyclic scaffolds where OPLS4 torsional parameters were less reliable, which are exactly the kinds of novel scaffolds that appear in lead optimization programs.
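For readers who want to put those RMSE values in context, the sketch below shows how an RMSE over predicted versus experimental ΔΔG values is computed, and how an error in kcal/mol maps to a multiplicative error in the affinity ratio through exp(ΔΔG/RT). The helper names are our own, and the conversion assumes 298.15 K and the standard relation between binding free energy and dissociation constant.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def rmse(predicted, experimental):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def fold_error(delta_kcal, temperature_k=298.15):
    """Multiplicative error in an affinity ratio implied by a ddG error."""
    return math.exp(delta_kcal / (R_KCAL * temperature_k))


if __name__ == "__main__":
    for err in (0.9, 1.2, 2.1):  # RMSE values quoted in the text
        print(f"{err} kcal/mol -> about {fold_error(err):.1f}-fold")
```

At room temperature, 0.9 kcal/mol corresponds to roughly a 4.6-fold error in the predicted affinity ratio and 1.2 kcal/mol to roughly 7.6-fold, so a 0.3 kcal/mol improvement is meaningful when ranking close analogs.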
The longer-term trajectory
MLPs are improving rapidly. The compute cost per MD step is decreasing as hardware accelerates and models become more efficient. The chemical coverage of pretrained potentials is expanding. It's plausible that within 3–5 years, a protein-scale MLP accurate enough for binding free energy prediction will exist with acceptable computational cost.
The pathway there runs through: (1) training data that covers protein-ligand interfaces, not just isolated small molecules; (2) active learning strategies that efficiently identify where current MLPs are inaccurate for drug-like geometries; (3) hardware that makes nanosecond MLP-MD competitive with current GPU OPLS4-MD costs.
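Point (2) can be sketched with one common active learning strategy, query by committee: train several models, and send the geometries where they disagree most back for QM labeling. The text above does not commit to a particular strategy, so treat this as one illustrative option. The threshold and the example predictions are hypothetical.

```python
import statistics


def committee_disagreement(predictions):
    """Population standard deviation of ensemble energy predictions for one geometry."""
    return statistics.pstdev(predictions)


def select_for_labeling(ensemble_predictions, threshold):
    """Indices of geometries whose disagreement exceeds the threshold.

    ensemble_predictions[i] holds every committee member's energy for
    geometry i. Results are ordered from most to least uncertain.
    """
    scored = [
        (committee_disagreement(preds), index)
        for index, preds in enumerate(ensemble_predictions)
    ]
    return [index for score, index in sorted(scored, reverse=True) if score > threshold]


if __name__ == "__main__":
    # Hypothetical energies (kcal/mol) from a three-member committee.
    predictions = [
        [1.0, 1.0, 1.0],  # members agree: likely well covered by training data
        [1.0, 2.0, 3.0],  # strong disagreement: candidate for a QM calculation
        [0.0, 0.5, 1.0],  # moderate disagreement
    ]
    print(select_for_labeling(predictions, threshold=0.3))
```

The selected geometries are then computed at the QM reference level and added to the training set, which concentrates the expensive calculations where the current model is least reliable.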
Until that point, the honest engineering answer is: use the best available validated method. For ΔΔG prediction in congeneric lead series against well-characterized protein targets, that is currently physics-based FEP with a hybrid ML correction, not end-to-end ML potentials.
What this means for computational chemists evaluating tools
When a vendor claims "ML-based molecular simulation," the first question to ask is: ML for what part of the calculation, and how was it validated? There is a meaningful difference between:
- ML-based QSAR models trained on bioactivity data (useful for lead prioritization but not physics-grounded)
- ML-based ADMET prediction models trained on in vitro assay data (useful for risk scoring)
- ML correction terms applied to classical FEP calculations (our approach — physics grounded with ML accuracy improvements)
- End-to-end ML potentials replacing classical force fields (promising but not yet production-ready for protein-ligand FEP)
Each of these has a different accuracy profile, set of failure modes, and appropriate use case. Calling all of them "ML simulation" obscures the distinction. A drug discovery team evaluating computational prediction tools should ask specifically: what is the method, what is the training data, what is the prospective validation benchmark, and for which target classes and scaffold types was it tested?
The answer to "ML potentials vs. physics-based simulation" isn't a single winner. It's a set of decisions about which part of the problem each approach handles best — and building systems that use the appropriate tool for each subproblem.