SMSD is an open-source toolkit for exact substructure search and maximum common substructure (MCS) finding in chemical graphs. It runs on Java, C++ (header-only), and Python, and builds on established algorithms from the graph-isomorphism literature (VF2++, McSplit, McGregor, Horton, Vismara).
<dependency>
<groupId>com.bioinceptionlabs</groupId>
<artifactId>smsd</artifactId>
<version>5.3.0</version>
</dependency>

curl -LO https://github.com/asad/SMSD/releases/download/v5.3.0/smsd-5.3.0-jar-with-dependencies.jar
java -jar smsd-5.3.0-jar-with-dependencies.jar \
--Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

pip install smsd

import smsd
result = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
mcs = smsd.mcs("c1ccccc1", "c1ccc2ccccc2c1")
# Tautomer-aware MCS
mcs = smsd.mcs("CC(=O)C", "CC(O)=C", tautomer_aware=True)
# Similarity upper bound (fast pre-filter)
sim = smsd.similarity("c1ccccc1", "c1ccc(O)cc1")
fp = smsd.fingerprint("c1ccccc1", kind="mcs")

git clone https://github.com/asad/SMSD.git
# Add SMSD/cpp/include to your include path; no other dependencies needed

#include "smsd/smsd.hpp"
auto mol1 = smsd::parseSMILES("c1ccccc1");
auto mol2 = smsd::parseSMILES("c1ccc(O)cc1");
bool isSub = smsd::isSubstructure(mol1, mol2, smsd::ChemOptions{});
auto mcs = smsd::findMCS(mol1, mol2, smsd::ChemOptions{}, smsd::McsOptions{});

git clone https://github.com/asad/SMSD.git
cd SMSD
# Java
mvn -U clean package
# C++
mkdir cpp/build && cd cpp/build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
# Python (run from the repository root, not from cpp/build)
cd python && pip install -e .

docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

SMSD runs a funnel of algorithms in order of cost. Cheaper methods (seed-and-extend, McSplit partition refinement) handle most cases quickly; McGregor bond-grow extension covers hard instances.
| Algorithm | Based on |
|---|---|
| Seed-and-extend | Bond-growth from rare-label seeds with backtracking |
| McSplit + RRSplit | Partition refinement — McSplit (McCreesh et al., 2017) with RRSplit maximality pruning |
| Bron-Kerbosch | Product-graph clique with Tomita pivoting + k-core reduction |
| McGregor extension | Forced-assignment bond-grow frontier (McGregor, 1982) |
| Coverage-driven termination | Label-frequency upper bound; stops early when the bound is tight |
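
The cost-ordered funnel and the coverage-driven stop in the table above can be sketched in a few lines of Python. This is an illustration only, not SMSD's implementation: `label_upper_bound`, `run_funnel`, and the two toy strategies are hypothetical names, and atoms are reduced to bare element labels.

```python
from collections import Counter
from typing import Callable, List, Sequence


def label_upper_bound(labels_a: Sequence[str], labels_b: Sequence[str]) -> int:
    """A common subgraph cannot use more atoms of a label than the scarcer
    molecule offers, so the multiset intersection bounds the MCS size."""
    return sum((Counter(labels_a) & Counter(labels_b)).values())


def run_funnel(strategies: List[Callable[[], int]], bound: int) -> int:
    """Run strategies cheapest-first and stop once one reaches the bound."""
    best = 0
    for strategy in strategies:
        best = max(best, strategy())
        if best >= bound:  # bound is tight: later stages cannot improve
            break
    return best


benzene = ["C"] * 6
phenol = ["C"] * 6 + ["O"]
bound = label_upper_bound(benzene, phenol)

calls = []


def cheap() -> int:  # stands in for seed-and-extend
    calls.append("cheap")
    return 6


def expensive() -> int:  # stands in for a McGregor-style extension
    calls.append("expensive")
    return 6


best = run_funnel([cheap, expensive], bound)
print(bound, best, calls)
```

Because the cheap stage already reaches the bound of 6 atoms, the expensive stage is never called.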
| Variant | Flag |
|---|---|
| MCIS (induced) | induced=true |
| MCCS (connected) | default |
| MCES (edge subgraph) | maximizeBonds=true |
| dMCS (disconnected) | disconnectedMCS=true |
| N-MCS (multi-molecule) | findNMCS() |
| Weighted MCS | atomWeights |
| Scaffold MCS | findScaffoldMCS() |
| Tautomer-aware MCS | ChemOptions.tautomerProfile() |
Substructure search uses VF2++ (Jüttner & Madarasi, 2018) with a FASTiso/VF3-Light matching order, 3-level NLF (neighbourhood label frequency) pruning, and bit-parallel candidate domains.
Rings are computed via Horton's candidate generation (Horton, 1987) combined with 2-phase GF(2) Gaussian elimination following Vismara (1997) for relevant cycles, and orbit-based automorphism grouping for Unique Ring Families (URFs).
| Output | Description |
|---|---|
| SSSR / MCB | Smallest Set of Smallest Rings, i.e. a minimum cycle basis |
| RCB | Relevant Cycle Basis — all chemically meaningful rings |
| URF | Unique Ring Families — groups symmetry-equivalent rings by automorphism orbit |
Cubane: MCB=5, RCB=6, URF=1 — all six square faces fall in one automorphism orbit.
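
The cubane numbers can be checked by hand with a short, self-contained Python sketch. It does not use SMSD's ring finder; it only computes the cyclomatic number (the size of any minimum cycle basis) and counts the four-membered cycles of the cube graph. The function names are made up for this example.

```python
from itertools import combinations


def cyclomatic_number(n_atoms, bonds):
    """bonds - atoms + components: the size of any minimum cycle basis."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n_atoms
    for a, b in bonds:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1
    return len(bonds) - n_atoms + components


def count_four_cycles(n_atoms, bonds):
    """Count 4-cycles through pairs of atoms that share two neighbours."""
    neighbours = [set() for _ in range(n_atoms)]
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    total = 0
    for u, w in combinations(range(n_atoms), 2):
        shared = len(neighbours[u] & neighbours[w])
        total += shared * (shared - 1) // 2
    return total // 2  # every 4-cycle is seen once from each diagonal


# Cubane carbon skeleton as the cube graph: atoms are 3-bit corner codes,
# bonded when the codes differ in exactly one bit.
cubane_bonds = [(a, b) for a, b in combinations(range(8), 2)
                if bin(a ^ b).count("1") == 1]

print(cyclomatic_number(8, cubane_bonds), count_four_cycles(8, cubane_bonds))
```

With 12 bonds, 8 atoms and one component the basis size is 12 - 8 + 1 = 5, while the cube has six square faces, matching MCB=5 and RCB=6.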
| Option | Values |
|---|---|
| Chirality | R/S tetrahedral, E/Z double bond |
| Isotope | matchIsotope=true |
| Tautomers | keto/enol, amide, thione/thiol, nitroso/oxime, phenol/quinone, lactam/lactim, imidazole NH |
| completeRingsOnly | whole SSSR rings included or excluded |
| Ring fusion | IGNORE / PERMISSIVE / STRICT |
| Ring-matches-ring | ring atoms match only ring atoms |
| Bond order | STRICT / LOOSE / ANY |
| Aromaticity | STRICT / FLEXIBLE |
Preset profiles
| Profile | Use when |
|---|---|
| ChemOptions() | default: chemically rigorous, ring topology enforced |
| ChemOptions.tautomerProfile() | tautomer-aware matching |
| ChemOptions.fmcsProfile() | loose topology, mirrors RDKit FMCS defaults |
| Tool | Description |
|---|---|
| R-group decomposition | decomposeRGroups() |
| Path fingerprint | graph-aware, tautomer-invariant |
| MCS fingerprint | MCS-aware, auto-sized |
| RASCAL screening | O(V+E) similarity upper bound for pre-filtering |
| Canonical labeling | Bliss-style individualization-refinement |
| Canonical SMILES | deterministic, toolkit-independent |
| Canonical SMARTS | toSMARTS() C++ header — invariant scaffold notation with configurable predicates |
| Isomorphism check | via canonical hash comparison |
| Reaction atom mapping | mapReaction() |
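
To show why a similarity upper bound can be computed in O(V+E), here is a simplified Python sketch in the spirit of RASCAL screening. It is an assumption-laden illustration rather than SMSD's actual bound: it only intersects atom-label and bond-label multisets and plugs the counts into the RASCAL similarity formula (v + e)^2 / ((V1 + E1)(V2 + E2)). The name `similarity_upper_bound` and the input layout are hypothetical.

```python
from collections import Counter


def similarity_upper_bound(atoms_a, bonds_a, atoms_b, bonds_b):
    """atoms: list of element labels; bonds: list of (i, j, order) triples.

    One pass over atoms and one over bonds, hence linear in V + E.
    """
    def bond_labels(atoms, bonds):
        # A bond is labelled by its (sorted) end-atom labels plus its order.
        return Counter((tuple(sorted((atoms[i], atoms[j]))), order)
                       for i, j, order in bonds)

    v = sum((Counter(atoms_a) & Counter(atoms_b)).values())
    e = sum((bond_labels(atoms_a, bonds_a)
             & bond_labels(atoms_b, bonds_b)).values())
    denom = (len(atoms_a) + len(bonds_a)) * (len(atoms_b) + len(bonds_b))
    return (v + e) ** 2 / denom if denom else 0.0


ring = [(i, (i + 1) % 6, "ar") for i in range(6)]   # aromatic six-ring
benzene_atoms, benzene_bonds = ["C"] * 6, ring
phenol_atoms, phenol_bonds = ["C"] * 6 + ["O"], ring + [(0, 6, 1)]

ub = similarity_upper_bound(benzene_atoms, benzene_bonds,
                            phenol_atoms, phenol_bonds)
print(round(ub, 3))
```

A pair whose bound falls below the chosen similarity threshold can be discarded without running an exact MCS search.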
The search algorithms run on MolGraph, a plain-array molecule representation. Any
cheminformatics toolkit can plug in via a simple builder:
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ CDK │ │ RDKit │ │ OpenBabel│ │ Your Tool│
└────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │ │
v v v v
MolGraph MolGraph MolGraph MolGraph
│ │ │ │
└─────────────┴─────────────┴─────────────┘
│
┌─────────v─────────┐
│ SearchEngine │
│ (no toolkit │
│ dependency) │
└───────────────────┘
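
The adapter idea in the diagram can be illustrated with a small Python sketch. The class and function below are hypothetical stand-ins, not the real MolGraph API: they only show what a plain-array molecule and a toolkit-neutral builder might look like.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PlainMolGraph:
    """Hypothetical plain-array molecule: no toolkit objects inside."""
    atomic_numbers: List[int]
    neighbours: List[List[int]] = field(default_factory=list)
    bond_order: Dict[Tuple[int, int], int] = field(default_factory=dict)


def build_graph(atomic_numbers, bonds):
    """Builder any toolkit adapter could call with (atom, atom, order) triples."""
    graph = PlainMolGraph(list(atomic_numbers), [[] for _ in atomic_numbers])
    for a, b, order in bonds:
        graph.neighbours[a].append(b)
        graph.neighbours[b].append(a)
        graph.bond_order[(min(a, b), max(a, b))] = order
    return graph


# Ethanol heavy atoms: C-C-O
ethanol = build_graph([6, 6, 8], [(0, 1, 1), (1, 2, 1)])
print(ethanol.neighbours)
```

An adapter for a given toolkit would only need to walk that toolkit's atoms and bonds once and hand the resulting arrays to such a builder.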
# Substructure
java -jar smsd-5.3.0-jar-with-dependencies.jar \
--Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -
# MCS
java -jar smsd-5.3.0-jar-with-dependencies.jar \
--mode mcs --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc2ccccc2c1" \
--json - --json-pretty
# SMARTS query
java -jar smsd-5.3.0-jar-with-dependencies.jar \
--Q SIG --q "[NH2;!$(N=O)]" --T SMI --t "CCN" --json -
# SDF batch
java -jar smsd-5.3.0-jar-with-dependencies.jar \
--Q SMI --q "c1ccccc1" --T SDF --t compounds.sdf --json -

import java.util.Map;
import com.bioinception.smsd.core.*;
// Substructure
SMSD smsd = new SMSD("c1ccccc1", "c1ccc(O)cc1", new ChemOptions());
boolean found = smsd.isSubstructure();
// MCS: strict (default)
Map<Integer,Integer> mcs = smsd.findMCS(true, true, 5000L);
// MCS: tautomer-aware (separate variable, so the snippet compiles as one block)
SMSD taut = new SMSD("CC(=O)C", "CC(O)=C", ChemOptions.tautomerProfile());
Map<Integer,Integer> tautMcs = taut.findMCS();
// MCS: loose topology (mirrors RDKit FMCS)
SMSD loose = new SMSD(mol1, mol2, ChemOptions.fmcsProfile());
Map<Integer,Integer> looseMcs = loose.findMCS();
// RASCAL similarity upper bound
double ub = smsd.similarityUpperBound();
// N-MCS across a set of molecules
var common = SearchEngine.findNMCS(molecules, new ChemOptions(), 0.8, 10000);
// R-group decomposition
var rgroups = SearchEngine.decomposeRGroups(core, molecules, new ChemOptions(), 10000);
// Skip CDK preprocessing (molecules already standardised)
SMSD fast = new SMSD(myQuery, myTarget, new ChemOptions(), false);
// Direct API
boolean hit = SearchEngine.isSubstructure(query, target, opts, 5000L);

cd web && mvn package -DskipTests
java -jar target/smsd-web-5.3.0-jar-with-dependencies.jar
# Open http://localhost:7070

- Paste SMILES → 2D structure rendering
- Colour-highlighted matched atoms
- Substructure and MCS results side-by-side
- Export as SVG or SMILES
- REST API: /api/validate, /api/sub, /api/mcs, /api/depict, /api/export
| Platform | Notes |
|---|---|
| Java 11–25 | CDK 2.11, available on Maven Central |
| C++ 17 | Header-only, no external dependencies, Ubuntu / macOS / Windows |
| Python 3.8+ | pybind11 bindings, pip install smsd |
| OpenMP | Multi-threaded batch processing |
| CUDA | GPU batch RASCAL screening |
| Docker | Pre-built container |
| Browser | Web UI at localhost:7070 |
| Format | Read | Write |
|---|---|---|
| SMILES | Java, C++ | Java, C++ |
| SMARTS | Java, C++ | C++ |
| MOL V2000 | Java, C++ | C++ |
| SDF | Java, C++ | — |
| Mol2 | Java | — |
| PDB | Java | — |
| CML | Java | — |
src/main/java/com/bioinception/smsd/
cli/SMSDcli.java CLI + MolIO + OutputUtil
core/SearchEngine.java All algorithms
core/SMSD.java Public API facade
core/ChemOptions.java Configuration (includes fmcsProfile, tautomerProfile)
core/Standardiser.java Preprocessing + SMARTS
cpp/include/smsd/
mol_graph.hpp Molecule model + ChemOps
vf2pp.hpp VF2++ substructure
mcs.hpp MCS funnel (6 strategies)
ring_finder.hpp URF ring finder (Horton/GF(2)/Vismara)
smiles_parser.hpp Standalone SMILES parser
smarts_parser.hpp SMARTS pattern matcher
mol_reader.hpp MOL/SDF V2000 reader/writer
batch.hpp OpenMP batch processing
rascal.hpp RASCAL screening
bitops.hpp Portable 64-bit bit-ops (GCC / Clang / MSVC)
python/smsd/
__init__.py Python API
core.py pybind11 bindings
benchmarks/
benchmark_python_vs_rdkit.py smsd (pip) vs RDKit, like-for-like
benchmark_all.py Java + Python + C++ vs RDKit
results_python.tsv Latest benchmark results
diverse_molecules.txt 1,003 benchmark molecules
web/ Web UI (Javalin + PWA)
paper/ Manuscript draft
pip install smsd vs RDKit 2025.09.2 — same machine, same Python process, best of 5 runs.
Times in milliseconds. Full data: benchmarks/results_python.tsv
| Pair | Category | SMSD (ms) | RDKit (ms) | SMSD MCS | RDKit MCS |
|---|---|---|---|---|---|
| Cubane (self) | Cage | 0.003 | 0.241 | 8 | 8 |
| Adamantane (self) | Symmetric | 0.003 | 0.256 | 10 | 10 |
| Coronene (self) | PAH | 0.006 | 0.727 | 24 | 24 |
| NAD / NADH | Cofactor | 0.012 | timeout | **44** | 33 |
| Methane / Ethane | Trivial | 0.008 | 0.036 | 1 | 1 |
| Benzene / Phenol | Heteroatom | 0.011 | 0.155 | 6 | 6 |
| Benzene / Toluene | Aromatic | 0.014 | 0.286 | 6 | 6 |
| Caffeine / Theophylline | N-methyl diff | 0.016 | 0.354 | 13 | 13 |
| Guanine keto / enol | Tautomer | 0.005 | 0.280 | **11** | 10 |
| ATP / ADP | Nucleotide | 0.085 | 0.897 | 27 | 27 |
| Ibuprofen / Naproxen | NSAID | 0.069 | 3.5 | 15 | 15 |
| Morphine / Codeine | Alkaloid | 0.049 | 550.5 | 20 | 20 |
| Aspirin / Acetaminophen | Drug pair | 0.532 | 0.250 | **10** | 7 |
| RDKit known failure #1585 | Edge case | 25.0 | timeout | **29** | 24 |
| Strychnine / Quinine | Alkaloid | 39.6 | 437.9 | 19 | 21 |
| Atorvastatin / Rosuvastatin | Statin | 1 079 | 11.8 | **25** | 15 |
| Erythromycin / Azithromycin | Macrolide | 2 333 | timeout | 50 | 50 |
| Paclitaxel / Docetaxel | Taxane | 2 405 | timeout | **56** | 53 |
| PEG-12 / PEG-16 | Polymer | 2 591 | 2.2 | 40 | 40 |
Bold MCS = SMSD found a larger common substructure than RDKit. timeout = 10 s limit reached.
On most pairs SMSD is faster and finds an MCS of the same size or larger.
RDKit is quicker on three pairs: Aspirin/Acetaminophen, Atorvastatin/Rosuvastatin, and PEG-12/PEG-16.
On the first two it returns a smaller MCS (7 vs 10 and 15 vs 25 atoms) while SMSD keeps going to find the exact maximum; on PEG both tools report 40 atoms.
Strychnine/Quinine (19 vs 21) reflects SMSD's default ring-topology constraint; passing
ChemOptions.fmcsProfile() reproduces RDKit's looser result.
Run python benchmarks/benchmark_python_vs_rdkit.py to reproduce on your own machine.
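
For readers who want to time their own pairs, a "best of 5 runs" measurement can be sketched as follows. This is a generic harness, not the code in the benchmark script; `best_of` is a hypothetical helper and the lambda is a placeholder workload that you would replace with an MCS call.

```python
import time


def best_of(fn, runs=5):
    """Return the fastest wall-clock time of `runs` calls, in milliseconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


# Placeholder workload; swap in e.g. a call to the MCS function under test.
elapsed_ms = best_of(lambda: sum(range(10_000)))
print(f"{elapsed_ms:.3f} ms")
```

Taking the minimum rather than the mean reduces the influence of scheduler noise and cache warm-up on very short runs.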
- 1 046 Java tests — heterocycles, reactions, drug pairs, tautomers, stereochemistry, ring perception, URF families, hydrogen handling, adversarial edge cases
- C++ tests — substructure, MCS, URF ring finder, hydrogen handling, canonical labeling, SMILES round-trip (Ubuntu / macOS / Windows)
- Python tests — full API coverage including hydrogen handling and charged species
Contributions and bug reports are welcome via GitHub Issues.
| Change | Details |
|---|---|
| Python web server | pip install 'smsd[web]' → smsd-web launches a Flask server with the identical REST API as the Java/Javalin server (/api/validate, /api/sub, /api/mcs, /api/depict, /api/export). Both backends share the same static frontend via symlink. |
| GPU auto-detection (CUDA) | SMSD_BUILD_CUDA=AUTO — silently enables CUDA batch screening if nvcc is found. pip install 'smsd[gpu]' ensures the CUDA 12 runtime on Linux/Windows. |
| Metal/MPS on macOS | SMSD_BUILD_METAL=AUTO — detects Metal.framework and builds smsd_metal automatically on macOS. No extra install needed; Metal ships with every Mac since 2013. On Apple Silicon (M1–M4) molecule descriptors are passed to the GPU with zero copy via unified memory. gpu_is_available() and gpu_device_info() work identically on all platforms. |
| gpu.hpp dispatch | smsd::gpu::batchRascalScreenAuto() dispatches CUDA → Metal → CPU/OpenMP in priority order. smsd::gpu::deviceInfo() returns e.g. "Metal GPU: Apple M2 Pro [OpenMP 5.0, 10 threads]", "GPU: Tesla T4 [OpenMP 4.5, 8 threads]", or "CPU: OpenMP 4.5, 16 threads". |
| Parallel Java batch | SearchEngine.batchMCS() and batchSubstructure() now use ForkJoinPool + IntStream.range().parallel() — scales across all available processors by default (numThreads=0). |
| Thread-safe CLI batch | SMSDcli.runBatch() — applyHydrogenOptions(query) is called once before the parallel block; the pre-processed IAtomContainer is shared read-only across threads. Eliminates CDK data-race on query.container. |
| popcount64 infinite recursion fix | The GCC/Clang branch in smsd_bindings.cpp called smsd_popcount64(x) from inside itself; changed to __builtin_popcountll(x). |
| Missing Python binding exports | to_smarts, gpu_is_available, gpu_device_info, batch_substructure, batch_mcs were listed in __init__.py but absent from the pybind11 module. All five are now exported. |
| server.py gpu_info fix | _gpu_info() called gpu_device_info() in both the if-branch and the else-branch. Simplified to a single call; the function already embeds a "GPU:" / "CPU:" prefix. |
| --threads CLI flag | smsd … --threads N pins the Java CLI batch to N worker threads (0 = all processors). |
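
The priority-order dispatch described for gpu.hpp follows a common fallback pattern, sketched here in Python. All names are hypothetical and the three backends are toy stand-ins with hard-coded availability; the real detection logic lives in the C++ header and is not reproduced.

```python
from typing import Callable, List, Sequence, Tuple

# (name, availability probe, batch runner); purely illustrative backends
Backend = Tuple[str, Callable[[], bool], Callable[[Sequence[float]], List[float]]]


def dispatch(backends: Sequence[Backend], batch: Sequence[float]):
    """Return (backend name, results) from the first backend whose probe succeeds."""
    for name, is_available, run in backends:
        if is_available():
            return name, run(batch)
    raise RuntimeError("no backend available")


def cpu_screen(batch):
    # Toy "screening" that just echoes its input.
    return list(batch)


backends = [
    ("cuda", lambda: False, cpu_screen),   # pretend no CUDA device was found
    ("metal", lambda: False, cpu_screen),  # pretend Metal is unavailable
    ("cpu", lambda: True, cpu_screen),     # CPU fallback is always present
]
used, scores = dispatch(backends, [0.857, 0.5])
print(used, scores)
```

Keeping the CPU path last in the list means callers always get a result, whatever hardware is present.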
| Change | Details |
|---|---|
| toSMARTS() (C++) | Canonical SMARTS writer in smiles_parser.hpp. Uses the same Morgan-based DFS traversal as toSMILES(); atom primitives use [#Z] base with ;a/;A, ;+/;-, ;H<n>, ;R/;!R, isotope prefix. SmartsWriteOptions controls which predicates are emitted. |
| URF symmetry fix | computeURFs() orbit-signature merge now guards with a vertex-transitivity check — orbit-sig equality only triggers a merge when every atom in both cycles shares the same orbit value. Fixes naphthalene (URF=2) and spiro (URF=2) returning 1. Fix applied to both C++ and Java. |
| SMILES ring-closure fix | Double ring-closure bug in the canonical SMILES DFS writer — both endpoints of a back-edge were independently assigning ring-closure numbers (producing c12ccccc12 for benzene). Fixed with an assignedRings bond-key set. Applies to both toSMILES() and toSMARTS(). |
| defaultValences() expanded | Added H, Al, Si, As, Se, Sb, Te; computeImplicitH now returns correct values for [SeH], [SiH4], [AsH3] and other elements previously missing from the table. |
| computeImplicitH cation fix | Group-15/16 cation expansion now covers As and Se in addition to N, P, O, S. |
| 441 C++ comprehensive tests | test_smiles_comprehensive.cpp extended with SMARTS write group (15 cases) and corrected cholesterol atom count. |
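
The double ring-closure bug and its fix can be reproduced with a toy DFS SMILES writer in Python. This is a minimal sketch, not the SMSD writer: it ignores bond orders and canonical ordering, supports fewer than ten ring closures, and assumes a connected molecule. `write_smiles` and its `dedupe` switch are invented for the demonstration; only the idea of a bond-key set comes from the changelog entry.

```python
def write_smiles(symbols, bonds, dedupe=True):
    """Tiny DFS SMILES writer; dedupe=False re-creates the double-closure bug."""
    adj = [[] for _ in symbols]
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    visited = set()
    parent = {}
    assigned_rings = {}                 # bond key -> ring-closure digit
    closures = [[] for _ in symbols]    # digits printed after each atom
    next_digit = 1

    def find_closures(u, came_from):
        nonlocal next_digit
        visited.add(u)
        for v in adj[u]:
            if v == came_from:
                continue
            if v in visited:            # back-edge: seen from both endpoints
                key = frozenset((u, v))
                if dedupe and key in assigned_rings:
                    continue            # already numbered from the other end
                assigned_rings[key] = next_digit
                closures[u].append(next_digit)
                closures[v].append(next_digit)
                next_digit += 1
            else:
                parent[v] = u
                find_closures(v, u)

    def emit(u):
        text = symbols[u] + "".join(str(d) for d in closures[u])
        children = [v for v in adj[u] if parent.get(v) == u]
        for i, v in enumerate(children):
            branch = emit(v)
            text += branch if i == len(children) - 1 else "(" + branch + ")"
        return text

    find_closures(0, None)
    return emit(0)


benzene_ring = [(i, (i + 1) % 6) for i in range(6)]
print(write_smiles(["c"] * 6, benzene_ring))                # c1ccccc1
print(write_smiles(["c"] * 6, benzene_ring, dedupe=False))  # c12ccccc12
```

Keying the set on the unordered atom pair is what makes the second visit of the same back-edge a no-op.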
| Change | Details |
|---|---|
| URF ring finder | Horton + 2-phase GF(2) + orbit-based grouping. Cubane: MCB=5, RCB=6, URF=1. |
| fmcsProfile() | New preset: loose topology matching for direct RDKit comparisons. |
| Strict aromaticity fix | Ring-membership guard removed from bondsCompatible(); aromatic bonds no longer match aliphatic chains in STRICT mode. |
| ppx loop fix | enforceCompleteRings and largestConnected now run to convergence; no more oscillation on fused rings. |
| Disconnected MCS fix | Fast bail-out gated by connectedOnly; disconnectedMCS=true finds multi-component matches correctly. |
| McGregor DFS refactor | Flat int[] arrays throughout — zero autoboxing on the hot path. |
| Windows C++ build | bitops.hpp portable bit-ops wrapper; __builtin_ctzll / __builtin_popcountll replaced with MSVC-safe equivalents. |
| 1 046 tests | Hydrogen handling (implicit/explicit), charged species, stereo notation variants. |
If you use SMSD in your research, please cite:
Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics, 1:12, 2009.
Syed Asad Rahman — BioInception PVT LTD
Apache License 2.0 — see LICENSE