ADMET Prediction Accuracy Benchmarks: What the Numbers Actually Mean

Dr. Sophie Chen Apr 22, 2026 11 min read

Published ADMET prediction benchmarks often report AUROC values of 0.88–0.94. Those numbers are real — but the experimental conditions that produced them matter enormously. This post unpacks the benchmarking design choices that determine whether a published accuracy figure translates to predictive performance on your compound series.

The three ways an ADMET benchmark can mislead you

ADMET benchmarks fail to transfer to real programs for three interconnected reasons: temporal leakage, prevalence mismatch, and chemical space bias. Most published benchmarks have at least one of these problems. Many have all three.

1. Temporal leakage

Training and test compounds are often split randomly rather than by date. Compounds from the same series tend to be made and assayed around the same time, so a random split puts structurally similar compounds on both sides of the split. The model "sees" close analogs of its test molecules during training, which inflates apparent generalization. A temporal split (train on data through year X, test on data after year X) more accurately reflects what happens when you run a novel compound series through the model.

For the DrugSynq ADMET suite, training and test sets are split by data acquisition date. Test compounds entered the database after model training was completed. This is a prospective validation protocol — the model had no information about the test series when it was being trained.
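
A minimal sketch of a date-based split, to make the idea concrete. The `Record` type, its field names, and the example data are assumptions for illustration and not DrugSynq's actual schema. Duplicate detection here is plain string matching, which only works if the SMILES have already been canonicalized.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    smiles: str      # assumed to be canonical SMILES
    acquired: date   # date the measurement entered the database
    label: int       # 1 = flagged/active, 0 = clean/inactive


def temporal_split(records, cutoff):
    """Train on records acquired on or before `cutoff`, test on later ones."""
    train = [r for r in records if r.acquired <= cutoff]
    test = [r for r in records if r.acquired > cutoff]
    # Drop test compounds whose structure already appears in training,
    # so a re-measured compound cannot leak across the date boundary.
    seen = {r.smiles for r in train}
    test = [r for r in test if r.smiles not in seen]
    return train, test


# Hypothetical records, for illustration only.
records = [
    Record("CCO", date(2021, 3, 1), 0),
    Record("c1ccccc1", date(2022, 7, 15), 1),
    Record("CCN", date(2023, 2, 10), 0),
    Record("CCO", date(2023, 5, 2), 0),  # re-assayed later; excluded from test
]
train, test = temporal_split(records, cutoff=date(2022, 12, 31))
print(len(train), len(test))
```

A date cutoff removes exact repeats but not close analogs made after the cutoff, so a similarity filter between train and test is a reasonable additional check.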

2. Prevalence mismatch

AUROC measures how well a model ranks positives above negatives, averaged across all classification thresholds, which makes it largely insensitive to class prevalence. A hERG inhibition classifier benchmarked on a dataset where 30% of compounds are inhibitors can show a similar AUROC on a lead series where 60% are inhibitors, but its precision at any fixed threshold will be substantially different.

For cardiotoxicity endpoints (hERG, QT prolongation liability), real-world medicinal chemistry programs often have higher prevalence of flagged compounds because chemists iteratively add features that increase potency — and many potency-enhancing modifications also increase hERG risk. An AUROC calculated on a random database sample doesn't reflect this.
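
The dependence of precision on prevalence follows directly from Bayes' rule. The sketch below holds sensitivity and specificity fixed at the hERG values quoted later in this post (0.84 and 0.89) and varies only the fraction of true inhibitors. Assuming the operating point stays constant across populations is a simplification, and the function name is our own.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same classifier, three hypothetical populations.
for prevalence in (0.10, 0.30, 0.60):
    value = ppv(0.84, 0.89, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {value:.2f}")
```

Under these assumptions PPV moves from roughly 0.46 at 10% prevalence to 0.77 at 30% and 0.92 at 60%, even though nothing about the model's ranking ability, and therefore its AUROC, has changed.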

Complementary metrics to AUROC: Matthews Correlation Coefficient (MCC) is more informative when classes are imbalanced because it accounts for all four quadrants of the confusion matrix. Positive predictive value (PPV, also called precision) is directly useful for synthesis prioritization: it tells you what fraction of your model's "high risk" predictions are true positives. If a model has AUROC 0.91 but PPV 0.55, then 45% of the compounds it flags as hERG risks are false alarms. That's acceptable in a screening context, but it should be stated explicitly when those flags are driving synthesis exclusions.
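
To see why MCC is the stricter metric, here is a small sketch that computes it from raw confusion-matrix counts next to plain accuracy. The counts are invented for illustration: 1,000 compounds at 10% prevalence, with sensitivity 0.84 and specificity 0.89.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts: 100 true inhibitors, 900 non-inhibitors.
counts = dict(tp=84, fn=16, tn=801, fp=99)
print(f"accuracy = {accuracy(**counts):.3f}")
print(f"MCC      = {mcc(**counts):.3f}")
```

With these counts accuracy is 0.885 while MCC is only about 0.57, because the 99 false positives outnumber the 84 true positives. Accuracy hides that; MCC does not.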

3. Chemical space bias

Most large ADMET training datasets (ChEMBL, PubChem BioAssay) consist of compounds that were previously synthesized and submitted for assay. They underrepresent chemical space that hasn't been explored, which is exactly where novel scaffolds and fragment-derived leads live. Models trained primarily on historical data will show accuracy degradation at the boundary of explored chemical space.

Applicability domain estimation partially addresses this: the model reports a confidence score alongside the prediction, lower when the input compound is structurally distant from training data. Any deployment of ADMET prediction should include applicability domain checking. A prediction marked "outside applicability domain" should be treated with substantially more uncertainty than one within it.
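
One common way to estimate applicability domain is the similarity of a query compound to its nearest neighbor in the training set. The sketch below is an illustration of that general idea and not a description of how DrugSynq computes its confidence score. Fingerprints are represented as sets of on-bit indices (in practice they would come from a cheminformatics toolkit), and the 0.4 threshold is an arbitrary placeholder that would need tuning per endpoint.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def applicability(query_fp, training_fps, threshold=0.4):
    """Return (nearest-neighbor similarity, in-domain flag) for a query."""
    nearest = max((tanimoto(query_fp, fp) for fp in training_fps), default=0.0)
    return nearest, nearest >= threshold


# Toy fingerprints, for illustration only.
training_fps = [frozenset({1, 2, 3, 4}), frozenset({3, 4, 5, 6})]

print(applicability(frozenset({1, 2, 3, 5}), training_fps))  # close analog
print(applicability(frozenset({7, 8, 9}), training_fps))     # novel scaffold
```

The first query shares three of five bits with a training compound and lands in domain; the second shares none and is flagged as outside it.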

What the DrugSynq ADMET accuracy numbers actually represent

The DrugSynq ADMET suite reports prospective AUROC and MCC values against temporally held-out test sets. The benchmarked endpoints and their prospective metrics:

hERG Inhibition (AUROC 0.91, MCC 0.73, n=3,410 test compounds): High confidence due to a large training set and well-standardized assay conditions in the literature. Sensitivity 0.84 means the model catches 84% of true hERG inhibitors. Specificity 0.89 means 11% of non-inhibitors are incorrectly flagged, a manageable false positive rate for a screening application.
CYP3A4 Inhibition (AUROC 0.88, MCC 0.68): CYP inhibition data is noisier than hERG data because assay formats differ across labs. The model was trained on standardized IC50 values but variability remains. For CYP endpoints, treat the model as a tier-1 flag, not a final answer — a flagged compound should be confirmed with a fluorescence-based or LC-MS/MS assay before deprioritization.
Metabolic Stability (HLM) (AUROC 0.85, MCC 0.61): Human liver microsome half-life classification. More variable than hERG because metabolic stability is highly sensitive to compound concentration and protein content in the assay. Intrinsic clearance predictions (CLint) are more reproducible than binary stable/unstable classifications, but the binary metric is what most discovery teams act on.
Aqueous Solubility (AUROC 0.87, MCC 0.66): Kinetic solubility at pH 7.4. Solubility is notoriously variable across measurement methods (nephelometry, UV, HPLC-based). The model was trained on thermodynamic solubility where available; kinetic solubility data is used as a secondary source. Predictions near the threshold (20–100 µg/mL) carry higher uncertainty.
Caco-2 Permeability (AUROC 0.89, MCC 0.71): A proxy for passive transcellular permeability. Well suited to modeling because Caco-2 assay conditions are relatively standardized across CROs. The main limitation is that a passive permeability prediction doesn't capture active transport contributions (P-gp efflux, OATP uptake), which matter for CNS exposure and hepatic uptake respectively. DrugSynq reports Caco-2 passive permeability and P-gp substrate probability as separate predictions.

How to evaluate any vendor's ADMET accuracy claims

Before accepting a published benchmark, ask these five questions:

1. Is the test set prospective or retrospective? Prospective means the test compounds are absent from the training data and come from a later time period. Retrospective means a random split from the same pool.
2. What is the train/test split date? For prospective validation, the boundary date should be published. If it isn't, the "prospective" claim is unverifiable.
3. Are the test set compounds published, with SMILES? Reproducibility requires access to the test set. If you can't reproduce the reported number, you can't trust it.
4. What prevalence is the benchmark calculated at? A hERG benchmark at 30% positive prevalence does not transfer to a lead series at 60% positive prevalence. Ask for MCC and PPV, not just AUROC.
5. Is the applicability domain reported? For novel scaffolds, what percentage of compounds fall within the model's applicability domain? A model that simply refuses to predict 40% of your series as out-of-domain is less useful than one that still returns those predictions with a flag telling you which are reliable.

Integrating ADMET predictions into a medicinal chemistry program

ADMET predictions are not binary gates. The most productive way to use them in lead optimization is as a multi-parameter input to a desirability function or MPO score, weighted by what matters for your program.

A CNS program weights BBB penetration and P-gp efflux heavily; hERG matters but is secondary to CNS exposure. An oncology program targeting a kinase may accept higher CYP inhibition because the drug is expected to be co-dosed with CYP inhibitors anyway. A program where oral bioavailability is critical weights aqueous solubility and metabolic stability above hERG.

DrugSynq's configurable MPO weights allow this weighting to be set per project. The output is a single ranked synthesis queue, not a data dump requiring interpretation. But the weights should be set by a scientist who understands the program — not defaulted to equal weights across all endpoints.
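
To illustrate how program-specific weights change a ranking, here is a sketch using a weighted geometric mean of per-endpoint desirabilities, one common way to build an MPO score. It is not DrugSynq's scoring function. The endpoint names, desirability values (1.0 = ideal, 0.0 = unacceptable), and weights are all hypothetical.

```python
import math


def mpo_score(desirabilities, weights):
    """Weighted geometric mean of per-endpoint desirabilities in [0, 1]."""
    total_weight = sum(weights[name] for name in desirabilities)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    log_sum = 0.0
    for name, d in desirabilities.items():
        w = weights[name]
        if w == 0:
            continue  # endpoint ignored for this program
        if d <= 0:
            return 0.0  # one unacceptable weighted endpoint zeroes the score
        log_sum += w * math.log(d)
    return math.exp(log_sum / total_weight)


def rank(compounds, weights):
    """Compound names sorted from best to worst MPO score."""
    return sorted(compounds, key=lambda c: mpo_score(compounds[c], weights),
                  reverse=True)


# Hypothetical desirabilities, e.g. 1 - predicted probability of a liability.
compounds = {
    "cmpd_A": {"herg": 0.9, "solubility": 0.4, "hlm_stability": 0.7},
    "cmpd_B": {"herg": 0.5, "solubility": 0.9, "hlm_stability": 0.8},
}
oral_weights = {"herg": 1, "solubility": 3, "hlm_stability": 3}
herg_first_weights = {"herg": 3, "solubility": 1, "hlm_stability": 1}

print(rank(compounds, oral_weights))
print(rank(compounds, herg_first_weights))
```

The same two compounds swap places depending on the weights: cmpd_B leads when solubility and stability dominate, cmpd_A leads when hERG dominates. That is why defaulting to equal weights is a decision in itself.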

Finally: ADMET predictions identify risk, not certainty. A compound flagged for hERG risk may assay clean; a compound that passes all 12 ADMET endpoints may still fail in cell-based assays due to off-target activity. Predictions inform prioritization; they don't replace experimental safety pharmacology.