SvABA (formerly Snowman) is an SV and indel caller for short-read BAMs.
It performs genome-wide local assembly with a vendored SGA, realigns
contigs with BWA-MEM, and scores variants by reassembled read support.
Tumor/normal, trios, and single-sample modes are supported; variants
are emitted as VCF plus a verbose tab-delimited companion
(bps.txt.gz) that carries the full per-sample evidence.
License: GNU GPLv3. Uses the SeqLib API for BAM I/O, BWA-MEM alignment, interval trees, and the assembly front-end.
CMake is required; htslib is an external dependency (no bundled copy). If htslib is installed system-wide, svaba auto-detects it via pkg-config or the default include/lib search paths:
git clone --recursive https://github.com/walaj/svaba
cd svaba && mkdir build && cd build
cmake .. && make -jFor a non-standard htslib install location, point at it explicitly:
cmake .. -DHTSLIB_DIR=/path/to/htslib-1.xx
make -jTo link jemalloc at compile time (recommended on Linux for -p 16+
runs — eliminates allocator contention, no LD_PRELOAD needed):
cmake .. -DUSE_JEMALLOC=ON
make -jThe binary lands at build/svaba. To install it system-wide (or
into this repo's bin/), use CMake's install target — the default
prefix is /usr/local and needs sudo, or override it with
-DCMAKE_INSTALL_PREFIX:
make install # /usr/local/bin/svaba
cmake --install build --prefix $(pwd)/.. # repo-local bin/svabaBuild type defaults to RelWithDebInfo (-O2 -g -DNDEBUG). The
vendored SeqLib/bwa and SeqLib/fermi-lite hardcode -O2 in their
Makefiles; see CLAUDE.md for how to push them to -O3 -mcpu=native
(typically a 5–15% wall-time win).
By default svaba assembles contigs with fermi-lite (ropebwt2-based,
the faster path). The vendored SGA (String Graph Assembler) is still
fully supported but off by default — switch it on at compile time by
passing -DSVABA_ASSEMBLER_FERMI=0 to CMake:
cmake .. -DSVABA_ASSEMBLER_FERMI=0
make -jThe choice is a single compile-time symbol in
src/svaba/SvabaAssemblerConfig.h; you can also flip it by editing that
file directly and rebuilding. Runtime logs report the active
assembler under svaba::kAssemblerName, so you can always confirm
which engine a build was compiled with.
On Linux, running svaba with jemalloc as the allocator typically
wins 10–20 % wall-time at -p 16+ (large WGS, tumor/normal). svaba's
per-thread assembly loop generates heavy malloc/free traffic and
glibc's default allocator serializes on its arena locks under that
contention pattern; jemalloc's thread-local arenas sidestep the
problem. The repo ships a drop-in wrapper at ./svaba_jemalloc that
LD_PRELOADs the system libjemalloc.so.2 and then exec's svaba:
# one-liner replacement for `svaba run ...`:
./svaba_jemalloc run -t tumor.bam -n normal.bam -G ref.fa -a my_run -p 16 \
--blacklist tracks/hg38.combined_blacklist.bedInstall jemalloc first (apt install libjemalloc2, yum install jemalloc, dnf install jemalloc). The wrapper auto-detects the
library under the standard distro paths; if yours is elsewhere, set
JEMALLOC_LIB=/path/to/libjemalloc.so.2. If the svaba binary isn't on
your $PATH, set SVABA=/path/to/svaba.
For very high concurrency (-p 24+), also pass jemalloc's own tuning
knobs via MALLOC_CONF:
MALLOC_CONF=background_thread:true,narenas:24,dirty_decay_ms:10000 \
./svaba_jemalloc run -p 24 ...narenas should match or modestly exceed the thread count;
background_thread:true asks jemalloc to reclaim dirty pages on a
background thread instead of in the hot path.
macOS users should not use jemalloc. Apple's native libmalloc (with its nanomalloc fast path for small allocations) and the DYLD_INSERT_LIBRARIES mechanism combine to run 5–10× slower than system malloc on this exact workload in our A/B tests. The wrapper refuses to run on Darwin for that reason. Stick with the system allocator on macOS.
Three steps. Run the caller, merge + dedup + index the per-thread
outputs, then convert the deduped bps.txt.gz to VCF. The bundled
combined blacklist is strongly recommended — it keeps svaba out of
well-known pileup / high-complexity regions that would otherwise
dominate wall-clock time with no real calls to show for it. See
tracks/README.md for customizing or extending the blacklist.
# 1. Call: tumor/normal on chr22 with 4 threads, bundled blacklist
svaba run -t tumor.bam -n normal.bam -G ref.fa -a my_run -p 4 \
-k chr22 \
--blacklist tracks/hg38.combined_blacklist.bed
# 2. Post-process: merge per-thread BAMs, sort+dedup+index,
# sort+dedup bps.txt.gz, filter r2c to PASS.
scripts/svaba_postprocess.sh -t 8 -m 4G my_run
# 3. Convert the deduped bps.txt.gz to VCFv4.5 (SV + indel)
svaba tovcf -i my_run.bps.sorted.dedup.txt.gz -b tumor.bam -a my_runA single-sample call drops -n. Any number of cases/controls can be
jointly assembled; prefix on the sample ID drives case/control routing
(t* = case, n* = control).
SvABA is a multi-tool binary. svaba help lists everything. The main
ones:
svaba run performs the whole assembly + variant-calling pipeline.
Takes a BAM (or many) plus a reference, a blacklist, and an output ID.
Emits bps.txt.gz, per-sample VCFs, contigs.bam, runtime.txt, and
(with --dump-reads) per-thread *.discordant.bam,
*.corrected.bam, *.weird.bam, and *.r2c.txt.gz.
svaba postprocess sorts, deduplicates, @PG-stamps, and indexes the
per-thread BAMs and the bps.txt.gz emitted by svaba run. Typically
invoked via scripts/svaba_postprocess.sh which also merges per-thread
files and builds the PASS / PASS-somatic r2c subsets.
svaba tovcf converts a deduplicated bps.txt.gz into VCFv4.5 output
(one SV VCF, one indel VCF; somatic distinguished by the SOMATIC
INFO flag). Clean intrachrom events emit as symbolic <DEL>/<DUP>/
<INV>; everything else stays paired BND. Input is assumed already
sorted/deduped by svaba_postprocess.sh — use --dedup to opt back
into the legacy internal dedup.
The SOMATIC flag is stamped when a record's somatic LOD
(INFO/SOMLOD) meets or exceeds a configurable cutoff; the default
is 1.0, which is a reasonable "confident somatic" gate and keeps
marginal calls out of the somatic set. Tune it with --somlod:
svaba tovcf -i deduped.bps.txt.gz -b tumor.bam -a my_run # default somlod >= 1
svaba tovcf ... --somlod 2.0 # stricter: only strong somatic calls
svaba tovcf ... --somlod 0.0 # permissive: flag anything with LO_s >= 0The raw score is always written to INFO/SOMLOD regardless of the
cutoff, so downstream bcftools view -i 'INFO/SOMLOD >= 3' still
works if you want to re-threshold after the fact without regenerating
the VCF.
svaba refilter re-runs the LOD cutoffs / PASS logic on an existing
bps.txt.gz with new thresholds, regenerating VCFs without rerunning
assembly. Use it when you want to tune sensitivity/specificity after
the fact.
${ID}.bps.txt.gz is the primary output — one row per breakpoint,
with a v3 schema of 52 core columns + per-sample blocks (see
BreakPoint::header for column names). The 52nd column is a unique
bp_id of the form bpTTTNNNNNNNN that joins back to the BAM aux
tags and the VCF EVENT= field. ${ID}.contigs.bam holds every
assembled contig, ${ID}.runtime.txt holds per-region timing, and
${ID}.log carries the run log.
The VCF files (${ID}.sv.vcf.gz, ${ID}.indel.vcf.gz) declare
VCFv4.5, use symbolic alleles where unambiguous, and carry the
canonical scoring in INFO: MAXLOD (variant-vs-error, per-sample
max), SOMLOD (somatic LLR), SOMATIC (flag), and SVCLAIM
(evidence class). VCF QUAL defaults to . — filter on FILTER=PASS
or the two LOD fields, not QUAL. See CLAUDE.md for the full scoring
model.
Opt-in outputs (gated behind --dump-reads): ${ID}.r2c.txt.gz is a
structured, re-plottable dump of every contig + its r2c-aligned reads;
${ID}.corrected.bam / ${ID}.weird.bam / ${ID}.discordant.bam
carry per-read evidence streams. These can run to tens of GB on deep
samples, so they're off by default.
svaba run emits per-thread, unsorted BAMs and a raw per-thread
r2c.txt.gz. Merge + sort + dedup + filter them with one command:
scripts/svaba_postprocess.sh -t 8 -m 4G my_runFive idempotent steps: merge per-thread BAMs and r2c files, run
svaba postprocess (sort + stream-dedup + @PG-stamp + index), sort
and dedup bps.txt.gz with PASS filter, emit PASS / PASS-somatic
subsets of r2c.txt.gz, and (optional) demux the BAMs by source
prefix. See CLAUDE.md for the full flag surface.
All-HTML, no server, drop files in from file://. Entry point:
viewer/index.html. The primary viewer is bps_explorer.html —
sortable bps.txt.gz table with chip filters, per-sample detail
panel, log10 histograms for somlod/maxlod/span, and click-to-IGV
navigation. r2c_explorer.html re-plots the structured r2c TSV in
browser (contig ruler, fragment rows, indel || marker rows with
labels, per-read gap-expanded CIGAR, bp_id filter dropdown).
runtime_explorer.html visualizes runtime.txt; comparison.html
does side-by-side diffs of two runs.
viewer/alignments_viewer.html still renders the legacy
alignments.txt.gz ASCII output, but new runs don't produce that
file — r2c_explorer.html is the replacement.
tracks/hg38.combined_blacklist.bed is the ready-made blacklist;
feed it to svaba run --blacklist. It is a regeneratable union of
component BED files in tracks/ (ENCODE, high-runtime regions, manual
curations, simple repeats, non-standard contigs, and a
low-mappability blacklist derived from paired mosdepth runs). See
tracks/README.md for the recipe.
Compile with -DSVABA_TRACE_READ to get per-decision-point stderr output
for a single read (by QNAME) from BAM ingestion through to output tagging.
Zero runtime cost when not compiled in (all trace macros compile to no-ops).
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DCMAKE_CXX_FLAGS='-DSVABA_TRACE_READ="\"LH00306:129:227V5CLT4:6:1204:38807:7191\""'
cmake --build build -j$(nproc)
# Run on just the region containing the read (faster iteration)
svaba run -t tumor.bam -n normal.bam -G ref.fa -p 4 \
-k chr2:215000000-216000000 --dump-reads -a trace_run 2> read_trace.log
grep READ_TRACE read_trace.logThe trace covers every gate the read hits:
| Stage | Log prefix | What it tells you |
|---|---|---|
| BAM read & filter | (various, from SvabaBamWalker) | Dup/QC/blacklist gates, rule_pass, NM salvage, adapter filter, quality trim, buffer admission |
| BFC error correction | BFC corrected: |
Whether BFC changed the sequence and by how much |
| R2C alignment | R2C SKIP: / R2C HIT: |
Perfect-ref-match skip, or which contigs the read aligned to (AS score, CIGAR) |
| Native realignment | NATIVE_REALIGN: |
Whether the corrected seq was re-aligned to ref or reused the BAM CIGAR |
| Split coverage scoring | SPLIT_COV enter |
Entry into per-BP scoring with r2c coords and breakpoint positions |
| R2C vs native gate | SPLIT_COV TP8 |
The critical score comparison: r2c_score, native_score, both CIGARs, NM values |
| Indel-at-break check | SPLIT_COV TP9 |
Whether an r2c CIGAR indel lands at the breakpoint |
| Span check | SPLIT_COV TP10 |
Whether the read spans the breakpoint(s), with exact coord bounds |
| DEL-covers-break | SPLIT_COV TP11 |
Whether an r2c deletion masks the breakpoint |
| Final verdict | SPLIT_COV CREDITED / NOT CREDITED / SKIPPED |
Did this read count as a variant supporter? |
| Output tagging | OUTPUT TAG bi:Z |
BP id stamped on the read, confidence, somatic status |
Example trace (abridged) for a read supporting a somatic 1bp deletion:
[READ_TRACE:SvabaBamWalker.cpp:203] read=LH00306:129:... flag=163 mapq=60 pos=chr2:215869800 ...
[READ_TRACE:SvabaBamWalker.cpp:236] PASS mr.isValid rule_pass=1
[READ_TRACE:SvabaBamWalker.cpp:340] ADDED to read buffer (n=847)
[READ_TRACE:SvabaRegionProcessor.cpp:377] BFC corrected: changed=NO ...
[READ_TRACE:SvabaRegionProcessor.cpp:808] R2C HIT: contig=c_fermi_chr2_... AS=150 rc=0 CIGAR=75M1D74M
[READ_TRACE:SvabaRegionProcessor.cpp:978] NATIVE_REALIGN: reusing BAM CIGAR (seq unchanged by BFC)
[READ_TRACE:BreakPoint.cpp:379] SPLIT_COV enter contig=c_fermi_chr2_... r2c_start=340 r2c_end=489 ...
[READ_TRACE:BreakPoint.cpp:476] SPLIT_COV TP8 PASS r2c>native r2c_score=150 native_score=143 ...
[READ_TRACE:BreakPoint.cpp:601] SPLIT_COV TP10 span check issplit1=1 issplit2=1 one_split=1
[READ_TRACE:BreakPoint.cpp:697] SPLIT_COV CREDITED as variant supporter sample=t000 tumor=1
[READ_TRACE:SvabaRegionProcessor.cpp:1346] OUTPUT TAG bi:Z bp_id=bp00100000042 confidence=PASS somatic=1
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DCMAKE_CXX_FLAGS='-DSVABA_TRACE_CONTIG="\"c_fermi_chr2_215869501_215894501_13C\""'
cmake --build build -j$(nproc)Shows assembly filtering (TP1–TP5), variant identification (TP6), per-read split-coverage scoring (TP8–TP11), breakpoint confidence (TP13–TP16, TP23).
cmake -B build \
-DCMAKE_CXX_FLAGS='-DSVABA_TRACE_READ="\"LH00306:129:227V5CLT4:6:1204:38807:7191\"" \
-DSVABA_TRACE_CONTIG="\"c_fermi_chr2_215869501_215894501_13C\""'
cmake --build build -j$(nproc)Independent prefixes ([READ_TRACE:...] vs [TRACE:...]) — grep for either.
cmake ... -DCMAKE_CXX_FLAGS='-DSVABA_TRACE_ALL_READS=1' # every read
cmake ... -DCMAKE_CXX_FLAGS='-DSVABA_TRACE_ALL=1' # every contigBest combined with a small -k region.
# From bps.txt.gz — get contig name (col 30) and bp_id (col 52)
zcat results.bps.txt.gz | awk -F'\t' '$1=="chr2" && $2 > 215869000 && $2 < 215870000'
# From the corrected BAM — reads tagged with a specific bp_id
samtools view results.corrected.bam chr2:215869000-215870000 | grep "bi:Z:.*bp00100000042"
# From the r2c TSV — all reads for a contig
zcat results.r2c.txt.gz | awk -F'\t' '$2 == "c_fermi_chr2_215869501_215894501_13C" && $1 == "read"' | cut -f8cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DCMAKE_CXX_FLAGS='-DSVABA_KMER_RESTRICT="\"CCATGCAGAGTGTTGAAGAAAAGGC\""'
cmake --build build -j$(nproc)After BFC error correction, only reads whose corrected sequence
contains the specified kmer (or its reverse complement) survive.
All other reads get to_assemble = false — they won't enter fermi
assembly, r2c alignment, corrected.bam, or scoring. The log prints
kept/dropped counts per region.
Use case: you see a suspicious contig and want to know whether
assembly still produces it when restricted to reads from a particular
locus. Pick a 25-mer unique to that locus, compile with
SVABA_KMER_RESTRICT, and run on the same region:
svaba run -t tumor.bam -n normal.bam -G ref.fa -k chr11:5000000-5100000 \
-a kmer_test --dump-reads -p 1If the chimeric contig still assembles, the kmer-carrying reads are sufficient to produce it. If it disappears, reads from elsewhere were required.
Can be combined with read/contig tracing:
cmake -B build \
-DCMAKE_CXX_FLAGS='-DSVABA_KMER_RESTRICT="\"CCATGCAGAGTGTTGAAGAAAAGGC\"" \
-DSVABA_TRACE_CONTIG="\"c_fermi_chr11_5000001_5100001_3C\""'cmake ... -DCMAKE_CXX_FLAGS='-DSVABA_R2C_NATIVE_GATE=0'Restores old behavior where any r2c-spanning read credits as variant supporter. Will reintroduce false-positive somatic calls from homology-trap reads.
BWA-MEM sometimes places a contig fragment on an alt contig (e.g.
chr11_JH159136v1_alt) when chr11 proper has an equally good
alignment. If the alt later gets blacklisted, the real breakpoint
is lost.
svaba now requests secondary alignments for contigs and runs a
post-alignment step (preferStandardChromosomes): for each
primary/supplementary fragment on a non-standard chromosome
(ChrID > maxMateChrID, default 23), it looks for a secondary
alignment on a standard chromosome that:
- Covers ≥ 80% of the same query range (reciprocal overlap)
- Has AS ≥ 95% of the non-standard alignment's AS
If found, the standard-chr alignment is promoted (gets the
primary/supplementary flag) and the non-standard one is demoted
to secondary. The contig trace (SVABA_TRACE_CONTIG) logs each
swap with both chromosome IDs and alignment scores.
This is controlled by --max-mate-chr (same constant used for
mate-region lookup). --non-human sets it to -1, which disables
alt-demotion entirely (no chromosome is "non-standard").
Germline-only. Raise the mate-lookup threshold so only larger
discordant clusters trigger a cross-region lookup — more conservative
and usually appropriate for a single-sample germline run where we
don't have a built-in control to guard against mapping artifacts
(a tumor would typically see smaller, subclonal clusters we want
to chase down). Add --single-end to disable mate lookups entirely
if you want to be even more conservative:
svaba run -t germline.bam -p 8 --mate-min 6 -a germline_run \
-G ref.fa \
--blacklist tracks/hg38.combined_blacklist.bedTargeted assembly over a list of regions (BED, chr, or IGV-style string):
svaba run -t sample.bam -k targets.bed -a exome_cap -G ref.fa
svaba run -t sample.bam -k chr17:7,541,145-7,621,399 -a TP53 -G ref.faDump all per-read evidence (large output, only for debugging a specific call):
svaba run -t sample.bam -G ref.fa -a debug_run --dump-reads
scripts/svaba_postprocess.sh -t 8 -m 4G debug_runNon-human genome (e.g. mouse, zebrafish, C. elegans). By default
svaba assumes a human reference: mate-region lookups skip chromosomes
past chrY (ChrID > 23), and insert-size learning samples only the
primary assembly (chr1–chrY). --non-human removes these gates so
every contig in the reference is treated equally:
svaba run -t mouse.bam -G mm39.fa -a mouse_run --non-human -p 16The flag sets maxMateChrID = -1 internally. You can also fine-tune
individually: --max-mate-chr 19 (mouse has 19 autosomes + X + Y =
ChrIDs 0–20 in a typical reference), --min-mate-mapq 10 (require
MAPQ ≥ 10 on the primary read before chasing its mate).
Snapshot where a long-running job currently is:
tail my_run.logCLAUDE.md at the repo root is the crash-safety-net doc — conventions,
file landmarks, build-system quirks (the -O2 hardcoding in
submodules), the somatic LOD model, the postprocess pipeline details,
performance notes, and open investigations. Always update CLAUDE.md
when you change something non-obvious.
scripts/svaba_local_function.sh is a sourceable bash helper library
with svaba_* utilities for grepping contigs, following a bp_id
through the output set, opening locations in IGV, etc. Source it from
your shell rc file to use.
Please file bug reports, feature requests, and questions on the GitHub issues tracker: https://github.com/walaj/svaba/issues.
SvABA is developed by Jeremiah Wala in the Rameen Berkoukhim lab at Dana-Farber Cancer Institute, in collaboration with the Cancer Genome Analysis team at the Broad Institute. Particular thanks to Cheng-Zhong Zhang (HMS DBMI) and Marcin Imielinski (Weill Cornell / NYGC).
Additional thanks to Jared Simpson for SGA, Heng Li for htslib and BWA, and the SeqLib contributors.
The SvABA 2.0 release and its documentation were built with the assistance of OpenAI Codex and Anthropic Claude, with extensive human-in-the-loop review, testing, and decision-making throughout.