SMSD — Substructure & MCS Search for Chemical Graphs

SMSD is an open-source toolkit for exact substructure search and maximum common substructure (MCS) finding in chemical graphs. It runs on Java, C++ (header-only), and Python, and builds on established algorithms from the graph-isomorphism literature (VF2++, McSplit, McGregor, Horton, Vismara).


Install

Java (Maven)

<dependency>
  <groupId>com.bioinceptionlabs</groupId>
  <artifactId>smsd</artifactId>
  <version>5.3.0</version>
</dependency>

Java (Download JAR)

curl -LO https://github.com/asad/SMSD/releases/download/v5.3.0/smsd-5.3.0-jar-with-dependencies.jar

java -jar smsd-5.3.0-jar-with-dependencies.jar \
  --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

Python (pip)

pip install smsd

import smsd

result = smsd.substructure_search("c1ccccc1", "c1ccc(O)cc1")
mcs    = smsd.mcs("c1ccccc1", "c1ccc2ccccc2c1")

# Tautomer-aware MCS
mcs    = smsd.mcs("CC(=O)C", "CC(O)=C", tautomer_aware=True)

# Similarity upper bound (fast pre-filter)
sim    = smsd.similarity("c1ccccc1", "c1ccc(O)cc1")

fp     = smsd.fingerprint("c1ccccc1", kind="mcs")

C++ (Header-Only)

git clone https://github.com/asad/SMSD.git
# Add SMSD/cpp/include to your include path; no other dependencies are needed

#include "smsd/smsd.hpp"

auto mol1 = smsd::parseSMILES("c1ccccc1");
auto mol2 = smsd::parseSMILES("c1ccc(O)cc1");

bool isSub = smsd::isSubstructure(mol1, mol2, smsd::ChemOptions{});
auto mcs   = smsd::findMCS(mol1, mol2, smsd::ChemOptions{}, smsd::McsOptions{});

Build from Source

git clone https://github.com/asad/SMSD.git
cd SMSD

# Java
mvn -U clean package

# C++
mkdir cpp/build && cd cpp/build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Python (run from the repository root, not from cpp/build)
cd python && pip install -e .

Docker

docker build -t smsd .
docker run --rm smsd --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

Algorithms

SMSD runs a funnel of algorithms in order of cost. Cheaper methods (seed-and-extend, McSplit partition refinement) handle most cases quickly; McGregor bond-grow extension covers hard instances.

MCS

Algorithm                    Based on
Seed-and-extend              Bond-growth from rare-label seeds with backtracking
McSplit + RRSplit            Partition refinement: McSplit (McCreesh et al., 2017) with RRSplit maximality pruning
Bron-Kerbosch                Product-graph clique with Tomita pivoting + k-core reduction
McGregor extension           Forced-assignment bond-grow frontier (McGregor, 1982)
Coverage-driven termination  Label-frequency upper bound; stops early when the bound is tight
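
The funnel and its early-stop rule can be sketched in a few lines of Python. This is an illustration only, not SMSD's implementation: label_frequency_bound, run_funnel, and the two placeholder strategies are hypothetical names, and atoms are reduced to plain element labels.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Mapping = Dict[int, int]          # query atom index -> target atom index
Strategy = Callable[[], Mapping]  # hypothetical: one MCS strategy per callable


def label_frequency_bound(labels_a: Sequence[str], labels_b: Sequence[str]) -> int:
    """Upper bound on MCS size: a label cannot be matched more often than it
    occurs in the molecule where it is rarer."""
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    return sum(min(n, count_b[label]) for label, n in count_a.items())


def run_funnel(strategies: List[Strategy], upper_bound: int) -> Mapping:
    """Run strategies from cheapest to most expensive, keep the largest
    mapping, and stop as soon as a mapping reaches the upper bound."""
    best: Mapping = {}
    for strategy in strategies:
        candidate = strategy()
        if len(candidate) > len(best):
            best = candidate
        if len(best) >= upper_bound:
            break  # the bound is tight, so no larger mapping can exist
    return best


# Benzene vs phenol, heavy atoms only.
benzene = ["C"] * 6
phenol = ["C"] * 6 + ["O"]
bound = label_frequency_bound(benzene, phenol)

calls: List[str] = []


def cheap() -> Mapping:  # placeholder for a fast strategy
    calls.append("cheap")
    return {i: i for i in range(6)}


def expensive() -> Mapping:  # placeholder for a slow strategy
    calls.append("expensive")
    return {i: i for i in range(6)}


result = run_funnel([cheap, expensive], bound)
print(bound, len(result), calls)
```

Because the cheap strategy already reaches the bound of 6, the expensive one is never called.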

MCS Variants

Variant                 Flag
MCIS (induced)          induced=true
MCCS (connected)        default
MCES (edge subgraph)    maximizeBonds=true
dMCS (disconnected)     disconnectedMCS=true
N-MCS (multi-molecule)  findNMCS()
Weighted MCS            atomWeights
Scaffold MCS            findScaffoldMCS()
Tautomer-aware MCS      ChemOptions.tautomerProfile()

Substructure Search (VF2++)

VF2++ (Jüttner & Madarasi, 2018) with FASTiso/VF3-Light matching order, 3-level NLF pruning, and bit-parallel candidate domains.
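
To make the pruning idea concrete, here is a minimal sketch of a one-level neighbour-label-frequency (NLF) filter. It is not SMSD's VF2++ code: the function names and the adjacency-dict representation are assumptions made for the example, and the real implementation uses three NLF levels and bit-parallel domains.

```python
from collections import Counter
from typing import Dict, List

Adjacency = Dict[int, List[int]]


def neighbour_label_counts(labels: List[str], adj: Adjacency, atom: int) -> Counter:
    """Count how often each label occurs among the neighbours of `atom`."""
    return Counter(labels[n] for n in adj[atom])


def nlf_feasible(q_labels: List[str], q_adj: Adjacency, q_atom: int,
                 t_labels: List[str], t_adj: Adjacency, t_atom: int) -> bool:
    """A target atom can host a query atom only if the labels agree and the
    target atom has at least as many neighbours of every label as the query
    atom needs."""
    if q_labels[q_atom] != t_labels[t_atom]:
        return False
    need = neighbour_label_counts(q_labels, q_adj, q_atom)
    have = neighbour_label_counts(t_labels, t_adj, t_atom)
    return all(have[label] >= n for label, n in need.items())


# Query: C-O. Target: C-C-O (heavy atoms of ethanol).
q_labels, q_adj = ["C", "O"], {0: [1], 1: [0]}
t_labels, t_adj = ["C", "C", "O"], {0: [1], 1: [0, 2], 2: [1]}

# Candidate target atoms for query atom 0 (a carbon with one oxygen neighbour).
candidates = [t for t in range(len(t_labels))
              if nlf_feasible(q_labels, q_adj, 0, t_labels, t_adj, t)]
print(candidates)  # [1]
```

Only the carbon bonded to oxygen survives the filter, so the backtracking search never tries the terminal methyl carbon.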

Ring Perception

Rings are computed via Horton's candidate generation (Horton, 1987) combined with 2-phase GF(2) Gaussian elimination following Vismara (1997) for relevant cycles, and orbit-based automorphism grouping for Unique Ring Families (URFs).

Output      Description
SSSR / MCB  Smallest Set of Smallest Rings (minimum cycle basis)
RCB         Relevant Cycle Basis: every ring that appears in at least one minimum cycle basis
URF         Unique Ring Families: groups symmetry-equivalent rings by automorphism orbit

Cubane: MCB=5, RCB=6, URF=1 — all six square faces fall in one automorphism orbit.
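
The MCB count for cubane can be checked by hand: any minimum cycle basis has E - V + C rings (bonds minus atoms plus connected components). A small self-contained sketch, with cycle_basis_size as a hypothetical helper name:

```python
from typing import List, Tuple


def cycle_basis_size(num_atoms: int, bonds: List[Tuple[int, int]]) -> int:
    """Cyclomatic number E - V + C, the size of any minimum cycle basis."""
    parent = list(range(num_atoms))

    def find(x: int) -> int:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_atoms
    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(bonds) - num_atoms + components


# Cubane skeleton: the 8 corners of a cube, numbered 0..7 as 3-bit codes.
# Two corners are bonded when their codes differ in exactly one bit.
cubane_bonds = [(v, v ^ (1 << bit))
                for v in range(8) for bit in range(3)
                if v < (v ^ (1 << bit))]

print(len(cubane_bonds), cycle_basis_size(8, cubane_bonds))  # 12 5
```

With 12 bonds, 8 atoms, and 1 component the result is 5, one fewer than the six square faces, because the sixth face is the sum of the other five.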

Chemistry Options

Option             Values
Chirality          R/S tetrahedral, E/Z double bond
Isotope            matchIsotope=true
Tautomers          keto/enol, amide, thione/thiol, nitroso/oxime, phenol/quinone, lactam/lactim, imidazole NH
completeRingsOnly  whole SSSR rings included or excluded
Ring fusion        IGNORE / PERMISSIVE / STRICT
Ring-matches-ring  ring atoms match only ring atoms
Bond order         STRICT / LOOSE / ANY
Aromaticity        STRICT / FLEXIBLE

Preset profiles

Profile                        Use when
ChemOptions()                  default: chemically rigorous, ring topology enforced
ChemOptions.tautomerProfile()  tautomer-aware matching
ChemOptions.fmcsProfile()      loose topology, mirrors RDKit FMCS defaults

Additional Tools

Tool                   Description
R-group decomposition  decomposeRGroups()
Path fingerprint       graph-aware, tautomer-invariant
MCS fingerprint        MCS-aware, auto-sized
RASCAL screening       O(V+E) similarity upper bound for pre-filtering
Canonical labeling     Bliss-style individualization-refinement
Canonical SMILES       deterministic, toolkit-independent
Canonical SMARTS       toSMARTS() in the C++ headers: invariant scaffold notation with configurable predicates
Isomorphism check      via canonical hash comparison
Reaction atom mapping  mapReaction()
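
As an illustration of how a linear-time similarity upper bound can act as a pre-filter, the sketch below bounds the number of shared atoms and bonds by counting labels, then plugs those bounds into the RASCAL similarity formula (shared atoms + shared bonds)^2 / (size of A x size of B). This is a simplified bound written for this example, not SMSD's screening code; all function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Bond = Tuple[int, int, int]  # (atom index, atom index, bond order)


def multiset_overlap(a: Counter, b: Counter) -> int:
    """Size of the multiset intersection of two counters."""
    return sum(min(n, b[key]) for key, n in a.items())


def bond_types(labels: List[str], bonds: List[Bond]) -> Counter:
    """Count bonds by (sorted endpoint labels, bond order)."""
    return Counter((tuple(sorted((labels[i], labels[j]))), order)
                   for i, j, order in bonds)


def similarity_upper_bound(labels_a: List[str], bonds_a: List[Bond],
                           labels_b: List[str], bonds_b: List[Bond]) -> float:
    """Upper bound on the RASCAL-style similarity of two labelled graphs.
    One pass over atoms and bonds, so the cost is O(V + E)."""
    size_a = len(labels_a) + len(bonds_a)
    size_b = len(labels_b) + len(bonds_b)
    if size_a == 0 or size_b == 0:
        return 0.0
    atoms = multiset_overlap(Counter(labels_a), Counter(labels_b))
    bonds = multiset_overlap(bond_types(labels_a, bonds_a),
                             bond_types(labels_b, bonds_b))
    return (atoms + bonds) ** 2 / (size_a * size_b)


# Ethane vs ethanol, heavy atoms only.
ethane = (["C", "C"], [(0, 1, 1)])
ethanol = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ub = similarity_upper_bound(*ethane, *ethanol)
print(round(ub, 3))  # 0.6
```

A batch screen would skip the exact MCS for any pair whose bound falls below the chosen similarity threshold, since the true similarity can only be lower.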

Toolkit Independence

The search algorithms run on MolGraph, a plain-array molecule representation. Any cheminformatics toolkit can plug in via a simple builder:

┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
│  CDK     │  │  RDKit   │  │ OpenBabel│  │ Your Tool│
└────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │             │             │
     v             v             v             v
   MolGraph    MolGraph      MolGraph      MolGraph
     │             │             │             │
     └─────────────┴─────────────┴─────────────┘
                        │
              ┌─────────v─────────┐
              │   SearchEngine    │
              │  (no toolkit      │
              │   dependency)     │
              └───────────────────┘
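
A plain-array representation of this kind might look like the following sketch. SMSD's actual MolGraph fields and builder are not shown in this README, so PlainMolGraph and build_graph are hypothetical names used only to show the idea: a toolkit adapter needs to supply nothing more than atom and bond lists.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PlainMolGraph:
    """Hypothetical toolkit-free molecule: parallel arrays indexed by atom."""
    atomic_numbers: List[int]
    neighbours: List[List[int]]   # neighbours[i] = atoms bonded to atom i
    bond_orders: List[List[int]]  # bond_orders[i][k] = order of bond to neighbours[i][k]


def build_graph(atomic_numbers: Iterable[int],
                bonds: Iterable[Tuple[int, int, int]]) -> PlainMolGraph:
    """Build the arrays from (atom, atom, order) triples. An adapter for CDK,
    RDKit, or Open Babel would produce these triples from its own objects."""
    numbers = list(atomic_numbers)
    neighbours: List[List[int]] = [[] for _ in numbers]
    orders: List[List[int]] = [[] for _ in numbers]
    for a, b, order in bonds:
        neighbours[a].append(b)
        orders[a].append(order)
        neighbours[b].append(a)
        orders[b].append(order)
    return PlainMolGraph(numbers, neighbours, orders)


# Ethanol heavy atoms: C-C-O.
ethanol = build_graph([6, 6, 8], [(0, 1, 1), (1, 2, 1)])
print(ethanol.neighbours)  # [[1], [0, 2], [1]]
```

Keeping the search code on arrays like these is what lets the engine stay free of any toolkit dependency.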

CLI

# Substructure
java -jar smsd-5.3.0-jar-with-dependencies.jar \
  --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc(O)cc1" --json -

# MCS
java -jar smsd-5.3.0-jar-with-dependencies.jar \
  --mode mcs --Q SMI --q "c1ccccc1" --T SMI --t "c1ccc2ccccc2c1" \
  --json - --json-pretty

# SMARTS query
# Single quotes keep the shell from expanding "!" and "$(...)" inside the SMARTS
java -jar smsd-5.3.0-jar-with-dependencies.jar \
  --Q SIG --q '[NH2;!$(N=O)]' --T SMI --t "CCN" --json -

# SDF batch
java -jar smsd-5.3.0-jar-with-dependencies.jar \
  --Q SMI --q "c1ccccc1" --T SDF --t compounds.sdf --json -

Java API

import com.bioinception.smsd.core.*;
import java.util.Map;

// Substructure
SMSD smsd = new SMSD("c1ccccc1", "c1ccc(O)cc1", new ChemOptions());
boolean found = smsd.isSubstructure();

// MCS: strict (default)
Map<Integer,Integer> mcs = smsd.findMCS(true, true, 5000L);

// MCS: tautomer-aware
SMSD taut = new SMSD("CC(=O)C", "CC(O)=C", ChemOptions.tautomerProfile());
Map<Integer,Integer> tautMcs = taut.findMCS();

// MCS: loose topology (mirrors RDKit FMCS); mol1 and mol2 are supplied by the caller
SMSD loose = new SMSD(mol1, mol2, ChemOptions.fmcsProfile());
Map<Integer,Integer> looseMcs = loose.findMCS();

// RASCAL similarity upper bound
double ub = smsd.similarityUpperBound();

// N-MCS across a set of molecules
var common = SearchEngine.findNMCS(molecules, new ChemOptions(), 0.8, 10000);

// R-group decomposition
var rgroups = SearchEngine.decomposeRGroups(core, molecules, new ChemOptions(), 10000);

// Skip CDK preprocessing (molecules already standardised)
SMSD fast = new SMSD(myQuery, myTarget, new ChemOptions(), false);

// Direct API (query, target, and opts are supplied by the caller)
boolean hit = SearchEngine.isSubstructure(query, target, opts, 5000L);

Web Application

cd web && mvn package -DskipTests
java -jar target/smsd-web-5.3.0-jar-with-dependencies.jar
# Open http://localhost:7070

  • Paste SMILES → 2D structure rendering
  • Colour-highlighted matched atoms
  • Substructure and MCS results side-by-side
  • Export as SVG or SMILES
  • REST API: /api/validate, /api/sub, /api/mcs, /api/depict, /api/export

Platform Support

Platform     Notes
Java 11–25   CDK 2.11, available on Maven Central
C++ 17       Header-only, no external dependencies, Ubuntu / macOS / Windows
Python 3.8+  pybind11 bindings, pip install smsd
OpenMP       Multi-threaded batch processing
CUDA         GPU batch RASCAL screening
Docker       Pre-built container
Browser      Web UI at localhost:7070

File Formats

Format     Read       Write
SMILES     Java, C++  Java, C++
SMARTS     Java, C++  C++
MOL V2000  Java, C++  C++
SDF        Java, C++
Mol2       Java
PDB        Java
CML        Java

Repository Layout

src/main/java/com/bioinception/smsd/
  cli/SMSDcli.java          CLI + MolIO + OutputUtil
  core/SearchEngine.java    All algorithms
  core/SMSD.java            Public API facade
  core/ChemOptions.java     Configuration (includes fmcsProfile, tautomerProfile)
  core/Standardiser.java    Preprocessing + SMARTS

cpp/include/smsd/
  mol_graph.hpp             Molecule model + ChemOps
  vf2pp.hpp                 VF2++ substructure
  mcs.hpp                   MCS funnel (6 strategies)
  ring_finder.hpp           URF ring finder (Horton/GF(2)/Vismara)
  smiles_parser.hpp         Standalone SMILES parser
  smarts_parser.hpp         SMARTS pattern matcher
  mol_reader.hpp            MOL/SDF V2000 reader/writer
  batch.hpp                 OpenMP batch processing
  rascal.hpp                RASCAL screening
  bitops.hpp                Portable 64-bit bit-ops (GCC / Clang / MSVC)

python/smsd/
  __init__.py               Python API
  core.py                   pybind11 bindings

benchmarks/
  benchmark_python_vs_rdkit.py  smsd (pip) vs RDKit, like-for-like
  benchmark_all.py              Java + Python + C++ vs RDKit
  results_python.tsv            Latest benchmark results
  diverse_molecules.txt         1,003 benchmark molecules

web/                        Web UI (Javalin + PWA)
paper/                      Manuscript draft

Benchmarks

The pip-installed smsd package compared with RDKit 2025.09.2: same machine, same Python process, best of 5 runs. Times are in milliseconds. Full data: benchmarks/results_python.tsv

Pair                         Category       SMSD (ms)  RDKit (ms)  SMSD MCS  RDKit MCS
Cubane (self)                Cage           0.003      0.241       8         8
Adamantane (self)            Symmetric      0.003      0.256       10        10
Coronene (self)              PAH            0.006      0.727       24        24
NAD / NADH                   Cofactor       0.012      timeout     44*       33
Methane / Ethane             Trivial        0.008      0.036       1         1
Benzene / Phenol             Heteroatom     0.011      0.155       6         6
Benzene / Toluene            Aromatic       0.014      0.286       6         6
Caffeine / Theophylline      N-methyl diff  0.016      0.354       13        13
Guanine keto / enol          Tautomer       0.005      0.280       11*       10
ATP / ADP                    Nucleotide     0.085      0.897       27        27
Ibuprofen / Naproxen         NSAID          0.069      3.5         15        15
Morphine / Codeine           Alkaloid       0.049      550.5       20        20
Aspirin / Acetaminophen      Drug pair      0.532      0.250       10*       7
RDKit known failure #1585    Edge case      25.0       timeout     29*       24
Strychnine / Quinine         Alkaloid       39.6       437.9       19        21
Atorvastatin / Rosuvastatin  Statin         1079       11.8        25*       15
Erythromycin / Azithromycin  Macrolide      2333       timeout     50        50
Paclitaxel / Docetaxel       Taxane         2405       timeout     56*       53
PEG-12 / PEG-16              Polymer        2591       2.2         40        40

* = SMSD found a larger MCS than RDKit. MCS columns give the MCS size in atoms. timeout = 10 s limit reached.

SMSD is faster on 16 of the 19 pairs and finds the same or a larger MCS on all but one. RDKit is quicker on three pairs: Aspirin/Acetaminophen, Atorvastatin/Rosuvastatin, and PEG-12/PEG-16. On the first two RDKit stops early with a smaller answer (7 vs 10 and 15 vs 25 atoms) while SMSD keeps going to find the exact maximum; on the PEG pair both tools report 40 atoms. Strychnine/Quinine (19 vs 21) reflects SMSD's default ring-topology constraint; passing ChemOptions.fmcsProfile() reproduces RDKit's looser result.

Run python benchmarks/benchmark_python_vs_rdkit.py to reproduce on your own machine.
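
For timing other pairs the same way, a best-of-N harness along the following lines is enough. This is a sketch of the measurement protocol described above, not the benchmark script itself; best_of is a hypothetical helper, and the lambda stands in for whatever call you want to time.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], runs: int = 5) -> float:
    """Call `fn` `runs` times and return the fastest wall-clock time in ms.
    Taking the minimum filters out noise from other processes."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best


# Placeholder workload; replace the lambda with the MCS call under test.
elapsed = best_of(lambda: sum(range(10_000)))
print(f"{elapsed:.3f} ms")
```

Running both toolkits through the same helper in one Python process keeps the comparison like-for-like.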


Tests

  • 1 046 Java tests — heterocycles, reactions, drug pairs, tautomers, stereochemistry, ring perception, URF families, hydrogen handling, adversarial edge cases
  • C++ tests — substructure, MCS, URF ring finder, hydrogen handling, canonical labeling, SMILES round-trip (Ubuntu / macOS / Windows)
  • Python tests — full API coverage including hydrogen handling and charged species

Contributions and bug reports are welcome via GitHub Issues.


What's New in 5.3.0

Change                             Details
Python web server                  pip install 'smsd[web]', then smsd-web launches a Flask server with the same REST API as the Java/Javalin server (/api/validate, /api/sub, /api/mcs, /api/depict, /api/export). Both backends share the same static frontend via symlink.
GPU auto-detection (CUDA)          SMSD_BUILD_CUDA=AUTO silently enables CUDA batch screening if nvcc is found. pip install 'smsd[gpu]' ensures the CUDA 12 runtime on Linux/Windows.
Metal/MPS on macOS                 SMSD_BUILD_METAL=AUTO detects Metal.framework and builds smsd_metal automatically on macOS. No extra install needed; Metal ships with every Mac since 2013. On Apple Silicon (M1–M4) molecule descriptors are passed to the GPU with zero copy via unified memory. gpu_is_available() and gpu_device_info() work identically on all platforms.
gpu.hpp dispatch                   smsd::gpu::batchRascalScreenAuto() dispatches CUDA → Metal → CPU/OpenMP in priority order. smsd::gpu::deviceInfo() returns e.g. "Metal GPU: Apple M2 Pro [OpenMP 5.0, 10 threads]", "GPU: Tesla T4 [OpenMP 4.5, 8 threads]", or "CPU: OpenMP 4.5, 16 threads".
Parallel Java batch                SearchEngine.batchMCS() and batchSubstructure() now use ForkJoinPool + IntStream.range().parallel(), scaling across all available processors by default (numThreads=0).
Thread-safe CLI batch              In SMSDcli.runBatch(), applyHydrogenOptions(query) is called once before the parallel block; the pre-processed IAtomContainer is shared read-only across threads. This eliminates a CDK data race on query.container.
popcount64 infinite recursion fix  The GCC/Clang branch in smsd_bindings.cpp called smsd_popcount64(x) from inside itself; changed to __builtin_popcountll(x).
Missing Python binding exports     to_smarts, gpu_is_available, gpu_device_info, batch_substructure, and batch_mcs were listed in __init__.py but absent from the pybind11 module. All five are now exported.
server.py gpu_info fix             _gpu_info() called gpu_device_info() in both the if-branch and the else-branch. Simplified to a single call; the function already embeds a "GPU:" / "CPU:" prefix.
--threads CLI flag                 smsd … --threads N pins the Java CLI batch to N worker threads (0 = all processors).

What's New in 5.2.2

Change                       Details
toSMARTS() (C++)             Canonical SMARTS writer in smiles_parser.hpp. Uses the same Morgan-based DFS traversal as toSMILES(); atom primitives use [#Z] base with ;a/;A, ;+/;-, ;H<n>, ;R/;!R, isotope prefix. SmartsWriteOptions controls which predicates are emitted.
URF symmetry fix             computeURFs() orbit-signature merge now guards with a vertex-transitivity check: orbit-signature equality triggers a merge only when every atom in both cycles shares the same orbit value. Fixes naphthalene (URF=2) and spiro (URF=2) returning 1. Fix applied to both C++ and Java.
SMILES ring-closure fix      Double ring-closure bug in the canonical SMILES DFS writer: both endpoints of a back-edge were independently assigning ring-closure numbers (producing c12ccccc12 for benzene). Fixed with an assignedRings bond-key set. Applies to both toSMILES() and toSMARTS().
defaultValences() expanded   Added H, Al, Si, As, Se, Sb, Te; computeImplicitH now returns correct values for [SeH], [SiH4], [AsH3] and other elements previously missing from the table.
computeImplicitH cation fix  Group-15/16 cation expansion now covers As and Se in addition to N, P, O, S.
441 C++ comprehensive tests  test_smiles_comprehensive.cpp extended with SMARTS write group (15 cases) and corrected cholesterol atom count.

What's New in 5.2.1

Change                  Details
URF ring finder         Horton + 2-phase GF(2) + orbit-based grouping. Cubane: MCB=5, RCB=6, URF=1.
fmcsProfile()           New preset: loose topology matching for direct RDKit comparisons.
Strict aromaticity fix  Ring-membership guard removed from bondsCompatible(); aromatic bonds no longer match aliphatic chains in STRICT mode.
ppx loop fix            enforceCompleteRings and largestConnected now run to convergence; no more oscillation on fused rings.
Disconnected MCS fix    Fast bail-out gated by connectedOnly; disconnectedMCS=true finds multi-component matches correctly.
McGregor DFS refactor   Flat int[] arrays throughout, with zero autoboxing on the hot path.
Windows C++ build       bitops.hpp portable bit-ops wrapper; __builtin_ctzll / __builtin_popcountll replaced with MSVC-safe equivalents.
1 046 tests             Hydrogen handling (implicit/explicit), charged species, stereo notation variants.

Citation

If you use SMSD in your research, please cite:

Rahman SA, Bashton M, Holliday GL, Schrader R, Thornton JM. Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics, 1:12, 2009.


Author

Syed Asad Rahman, BioInception PVT LTD

License

Apache License 2.0 — see LICENSE