1Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
2Department of Chemistry, Carnegie Mellon University, Pittsburgh, PA, USA
3Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI, USA
4Department of Materials Science and Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
5NVIDIA, Santa Clara, CA, USA
📄 Paper · 📖 Citation · ⚙️ Setup · 🔗 GitHub
*Corresponding author: olexandr@olexandrisayev.com
Molecular geometry is crucial for biological activity and chemical reactivity; however, computational methods for generating 3D structures are limited by the vast scale of conformational space and the complexities of stereochemistry. Here we present an approach that combines an expansive dataset of molecular conformers with generative diffusion models to address this problem. We introduce ChEMBL3D, which contains over 250 million molecular geometries for 1.8 million drug-like compounds, optimized using AIMNet2 neural network potentials to a near-quantum mechanical accuracy with implicit solvent effects included. This dataset captures complex organic molecules in various protonation states and stereochemical configurations.
We then developed LoQI (Low-energy QM Informed conformer generative model), a stereochemistry-aware diffusion model that learns molecular geometry distributions directly from this data. Through graph augmentation, LoQI accurately generates molecular structures with targeted stereochemistry, representing a significant advance in modeling capabilities over previous generative methods. The model outperforms traditional approaches, achieving up to tenfold improvement in energy accuracy and effective recovery of optimal conformations. Benchmark tests on complex systems, including macrocycles and flexible molecules, as well as validation with crystal structures, show LoQI can perform low energy conformer search efficiently.
Note on Implementation: LoQI is built upon the Megalodon architecture developed, adapting it specifically for stereochemistry-aware conformer generation with the ChEMBL3D dataset.
- ChEMBL3D Dataset: 250+ million AIMNet2-optimized conformers for 1.8M drug-like molecules
- Stereochemistry-Aware: First all-atom diffusion model with explicit stereochemical encoding
- Quantum Mechanical Accuracy: Near-DFT accuracy with implicit solvent effects
- Superior Performance: Up to 10x improvement in energy accuracy over traditional methods
- Complex Molecule Support: Handles macrocycles, flexible molecules, and challenging stereochemistry
Installation will usually take up to 20 minutes.
- OS tested by authors:
- Ubuntu 24.04 LTS (latest stable Ubuntu LTS at time of writing)
- Other platforms:
- Expected to work, but if installation is not out-of-the-box, use the PyTorch Geometric installation guide for your exact Python/PyTorch/CUDA combination: https://pytorch-geometric.readthedocs.io/en/latest/install/installation.html
- Tested inference hardware:
- GPU: NVIDIA RTX 3090 (24 GB VRAM)
- CPU: AMD Ryzen 9 5950X
- Recommended GPU memory:
- 16-24 GB VRAM for comfortable inference/evaluation with larger molecules and higher batch sizes
- Minimum practical GPU memory:
- 8 GB VRAM can run inference, but requires reduced batch sizes
- CPU-only:
- Possible, but not recommended and not systematically studied by the authors
OOM mitigation for larger molecules:
- reduce inference batch size (
--batch_sizein sampling, ordata.inference_batch_sizein config) - if using evaluation/optimization, also reduce optimization batch size (
evaluation.energy_metrics_args.batchsize)
# Clone the repository
git clone https://github.com/isayevlab/LoQI.git
cd LoQI
# Create and activate conda environment
conda create -n loqi python=3.10 -y
conda activate loqi
# Install core dependencies
pip install -r requirements.txt
# Install this package in editable mode (adds src to PYTHONPATH)
pip install -e .If you prefer a fully conda-based setup (recommended for RDKit), you can install RDKit via conda-forge before running pip install -r requirements.txt.
Training and evaluation use the ChEMBL3D data releases below.
Release 1: Full ChEMBL3D Quantum-Accurate conformer dataset
- URL: https://kilthub.cmu.edu/articles/dataset/_b_ChEMBL3D_Quantum-Accurate_3D_Conformers_for_ChEMBL_at_Scale_b_/31428449
- DOI: https://doi.org/10.1184/R1/31428449
Release 2: Processed dataset + LoQI checkpoints (diffusion + flow matching)
- URL: https://kilthub.cmu.edu/articles/dataset/LoQI_Scalable_Low-Energy_Molecular_Conformer_Generation_with_Quantum_Mechanical_Accuracy/31441570
- DOI: https://doi.org/10.1184/R1/31441570
- Includes:
loqi.ckptloqi_flow.ckptchembl3d_stereo/processed dataset
For this repository, place downloaded assets with this layout:
LoQI/
data/
loqi.ckpt
loqi_flow.ckpt
chembl3d_stereo/
processed/
...
AimNet2 model path expected by configs:
src/megalodon/metrics/aimnet2/cpcm_model/wb97m_cpcms_v2_0.jpt
The repository includes a Streamlit interface for interactive conformer generation, postprocessing, and visualization.
Use the app-specific installation and usage instructions from app/README.md (recommended, as app dependencies are separated from core training/inference dependencies).
Quick start from repo root:
pip install -r app/requirements.txt
streamlit run app/app.pyMake sure that src content is available in your PYTHONPATH (e.g., export PYTHONPATH="./src:$PYTHONPATH") if LoQI is not installed locally (pip install -e .).
# LoQI conformer generation model
python scripts/train.py --config-name=loqi outdir=./outputs train.gpus=1 data.dataset_root="./chembl3d_data"
# LoQI flow-matching conformer generation model
python scripts/train.py --config-name=loqi_flow outdir=./outputs train.gpus=1 data.dataset_root="data/chembl3d_stereo"
# Customize training parameters
python scripts/train.py --config-name=loqi \
outdir=./outputs \
train.gpus=2 \
train.n_epochs=800 \
train.seed=42 \
data.batch_size=150 \
optimizer.lr=0.0001# Generate conformers for a single molecule
python scripts/sample_conformers.py \
--config scripts/conf/loqi/loqi.yaml \
--ckpt data/loqi.ckpt \
--input "c1ccccc1" \
--output outputs/benzene_conformers.sdf \
--n_confs 10 \
--batch_size 1
# Generate conformers with evaluation (requires 3D input, e.g., SDF with low energy conformer)
python scripts/sample_conformers.py \
--config scripts/conf/loqi/loqi.yaml \
--ckpt data/loqi.ckpt \
--input data/ethanot_low_energy.sdf \
--output outputs/ethanol_conformers.sdf \
--n_confs 100 \
--batch_size 10 \
--eval
# Optional postprocessing: AIMNet2 optimization + iRMSD unique-set pruning
python scripts/sample_conformers.py \
--config scripts/conf/loqi/loqi_flow.yaml \
--ckpt data/loqi_flow.ckpt \
--input "CC(=O)Oc1ccccc1C(=O)O" \
--output outputs/aspirin_opt_unique.sdf \
--n_confs 50 \
--batch_size 50 \
--postprocess optimization+irmsd \
--optimization_batch_size 64 \
--opt_fmax 0.05 \
--opt_max_nstep 250 \
--irmsd_rthr 0.125Recent sampling updates in scripts/sample_conformers.py:
- input validation + SMILES revalidation (canonical roundtrip), with unsupported-element/radical checks
- atom-aware dynamic batching for inference (
--atom-aware-batching,--target-molecule-size,--shuffle) - optional hydrogen addition for SMILES inputs (
--add-hs/--no-add-hs) - no RDKit conformer initialization for SMILES; zero-initialized coordinates are used
- if input is SDF with conformers, existing 3D coordinates are used
- optional postprocessing (
--postprocess none|optimization|optimization+irmsd)
On the tested setup (RTX 3090 + Ryzen 9 5950X), inference for a typical ChEMBL molecule takes approximately 0.1 seconds per conformer when processed within a batch. See System and Hardware Requirements above for VRAM guidance and OOM mitigation.
Note: Make sure you define correct paths for dataset and AimNet2 model in loqi.yaml. The relative path of AimNet2 model is src/megalodon/metrics/aimnet2/cpcm_model/wb97m_cpcms_v2_0.jpt.
Sampling steps: --n_steps defaults to 25. Diffusion models were trained with 25 steps and are not expected to work well for other values. Flow-matching models can be run with different step counts.
Use scripts/performance_test.py to:
- sample 1000 molecules each with atom counts 10, 25, 50, and 100 from
data/chembl3d_stereo/processed/train_h.pt - select molecules deterministically (first
Nper size in dataset order) - export per-molecule SDF inputs
- measure per-molecule generation and optimization times
conda run -n mega env PYTHONPATH=./src TORCH_COMPILE_DISABLE=1 \
python scripts/performance_test.py \
--dataset_pt data/chembl3d_stereo/processed/train_h.pt \
--sizes 10,25,50,100 \
--n_per_size 100 \
--outdir outputs/performance_test \
--config scripts/conf/loqi/loqi.yaml \
--ckpt data/loqi.ckpt \
--n_confs 100 \
--generation_batch_size 1By default, optimization settings are taken from the selected config
(evaluation.energy_metrics_args.batchsize and evaluation.energy_metrics_args.opt_params).
Outputs:
outputs/performance_test/selected_manifest.csv(selected molecules + per-molecule SDF path)outputs/performance_test/size_<N>/mol_*.sdf(one input SDF per selected molecule)outputs/performance_test/size_<N>_selected.sdf(combined SDF per size)outputs/performance_test/timings_per_molecule.csv(generation/optimization timing per molecule)
LoQI Models:
loqi.yaml- LoQI stereochemistry-aware conformer generation modelnextmol.yaml- Alternative configuration for NextMol-style generationloqi_flow.yaml- LoQI flow-matching conformer generation model
If you use LoQI in your research, please cite our paper:
@article{nikitin2025scalable,
title={Scalable Low-Energy Molecular Conformer Generation with Quantum Mechanical Accuracy},
author={Nikitin, Filipp and Anstine, Dylan M and Zubatyuk, Roman and Paliwal, Saee Gopal and Isayev, Olexandr},
year={2025}
}This work builds upon the Megalodon architecture. If you use the underlying architecture, please also cite:
@article{reidenbach2025applications,
title={Applications of Modular Co-Design for De Novo 3D Molecule Generation},
author={Reidenbach, Danny and Nikitin, Filipp and Isayev, Olexandr and Paliwal, Saee},
journal={arXiv preprint arXiv:2505.18392},
year={2025}
}