Kalign

Kalign is a fast multiple sequence alignment program for biological sequences written in C with Python bindings.

🚀 Key Features

  • 🔥 High Performance: Fast multiple sequence alignment with multi-threading support
  • ⚡ Smart Threading: Auto-detects CPU cores and uses N-1 threads by default (max 16) for optimal performance
  • 🔧 Cross-Platform: Works on Linux and macOS with multiple build systems (CMake, Zig)
  • 📊 Multiple Formats: FASTA, MSF, Clustal, Stockholm, PHYLIP support
  • 🧬 Sequence Types: Optimized for protein, DNA, RNA, and divergent sequences
  • ⚡ SIMD Optimizations: Vectorized code for x86_64 systems (SSE4.1, AVX, AVX2)
  • 🐍 Python Integration: Modern Python package with comprehensive bioinformatics ecosystem support

Installation

From Source (Primary)

Prerequisites

  • C compiler (GCC, Clang, or MSVC)
  • CMake (3.18 or higher)
  • OpenMP (optional, for parallelization)

Basic Build

# Download and extract latest release
tar -zxvf kalign-<version>.tar.gz
cd kalign-<version>

# Build
mkdir build && cd build
cmake ..
make
make test
make install

macOS with Homebrew

On macOS, install dependencies first:

# Install dependencies
brew install cmake

# For OpenMP support (recommended)
brew install libomp

# Clone and build
git clone https://github.com/TimoLassmann/kalign.git
cd kalign
mkdir build && cd build
cmake ..
make
make test
make install

Note: On macOS, Kalign automatically configures OpenMP with Homebrew's libomp installation at /opt/homebrew/opt/libomp/ (Homebrew's default prefix on Apple Silicon Macs).

Alternative Build Systems

Building with Zig requires Zig version 0.12.

Zig Build (for cross-compilation):

zig build

Debug Build:

cmake -DCMAKE_BUILD_TYPE=Debug ..
make

Without OpenMP:

cmake -DUSE_OPENMP=OFF ..
make

Python Package

For development or latest features, install from source:

pip install git+https://github.com/TimoLassmann/kalign.git

For enhanced bioinformatics ecosystem integration:

pip install "kalign[biopython] @ git+https://github.com/TimoLassmann/kalign.git"    # + Biopython integration
pip install "kalign[skbio] @ git+https://github.com/TimoLassmann/kalign.git"        # + scikit-bio integration  
pip install "kalign[all] @ git+https://github.com/TimoLassmann/kalign.git"          # Full ecosystem support

Usage

Command Line Interface

Usage: kalign  -i <seq file> -o <out aln> 

Options:

   --format           : Output format. [Fasta]
   --type             : Alignment type (rna, dna, internal). [rna]
                        Options: protein, divergent (protein) 
                                 rna, dna, internal (nuc). 
   --gpo              : Gap open penalty. []
   --gpe              : Gap extension penalty. []
   --tgpe             : Terminal gap extension penalty. []
   -n/--nthreads      : Number of threads. [auto: N-1, max 16]
   --version (-V/-v)  : Prints version. [NA]

Threading Behavior

New in this version: Kalign automatically detects your system's CPU cores and uses N-1 threads by default (leaving one core free), with a maximum of 16 threads. This provides good performance out-of-the-box while maintaining system responsiveness.

  • Auto-detection: Uses CPU cores - 1 (e.g., 15 threads on a 16-core system)
  • Maximum cap: Never uses more than 16 threads
  • Manual override: Use -n/--nthreads to specify a custom thread count
  • Single-threaded: Use -n 1 to disable parallelization
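
The default described above can be written as a one-line rule. The following Python sketch only illustrates that rule and is not Kalign's C implementation; the function name default_thread_count and the lower bound of one thread on single-core machines are assumptions.

```python
import os

MAX_DEFAULT_THREADS = 16  # cap stated in the list above


def default_thread_count(cpu_cores=None):
    """Return the N-1 default, capped at 16 threads.

    Assumption: at least one thread is always used, even on a
    single-core machine.
    """
    if cpu_cores is None:
        cpu_cores = os.cpu_count() or 1
    return max(1, min(cpu_cores - 1, MAX_DEFAULT_THREADS))


if __name__ == "__main__":
    for cores in (1, 4, 16, 64):
        print(cores, "cores ->", default_thread_count(cores), "threads")
```

Passing -n/--nthreads on the command line overrides this default.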

Input Formats

Kalign accepts:

  • Unaligned sequences: FASTA format
  • Pre-aligned sequences: FASTA, MSF, or Clustal format (gaps will be removed and sequences re-aligned)
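
To picture what "gaps will be removed" means for pre-aligned input, here is a small Python sketch that strips gap characters from aligned FASTA text. It is an illustration only, not Kalign's parser: the function name read_fasta_ungapped and the choice of "-" and "." as gap characters are assumptions.

```python
def read_fasta_ungapped(text, gap_chars="-."):
    """Parse FASTA text and return (name, sequence) pairs without gaps.

    Assumption: gaps are written as '-' or '.'; other conventions
    would need a different gap_chars value.
    """
    strip_gaps = str.maketrans("", "", gap_chars)
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line.translate(strip_gaps))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


if __name__ == "__main__":
    aligned = ">a\nAT-CG\n>b\nA--CG\n"
    for name, seq in read_fasta_ungapped(aligned):
        print(name, seq)
```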

Sequence Types

Kalign automatically detects sequence types but offers manual control via --type:

  • protein: Uses CorBLOSUM66_13plus substitution matrix (default for protein)
  • divergent: Uses Gonnet 250 substitution matrix for highly divergent proteins
  • dna: DNA parameters (match: +5, mismatch: -4, gap open: -8, gap ext: -6)
  • rna: Optimized parameters for RNA alignments
  • internal: Like DNA but encourages internal gaps (terminal gap penalty: 8)

Fine-tune with --gpo (gap open), --gpe (gap extension), and --tgpe (terminal gap extension).
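
To make the DNA parameters concrete, the Python sketch below scores an already aligned pair of sequences with the values listed above (match +5, mismatch -4, gap open -8, gap extension -6). It is a teaching aid rather than Kalign's scoring code. Two assumptions are made: the first column of a gap run costs the gap open penalty and each further column costs the gap extension penalty, and terminal gaps are scored like internal ones. The function name score_pairwise is hypothetical.

```python
def score_pairwise(a, b, match=5, mismatch=-4, gap_open=-8, gap_ext=-6):
    """Score two aligned sequences of equal length.

    Assumption: a gap run of length k costs gap_open + (k - 1) * gap_ext,
    and terminal gaps get no special treatment.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    prev_gap_row = None  # row (0 or 1) holding the gap in the previous column
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            gap_row = 0 if x == "-" else 1
            # Continuing a run in the same row extends it; otherwise open.
            score += gap_ext if prev_gap_row == gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score


if __name__ == "__main__":
    print(score_pairwise("ATCGATCG", "ATC--TCG"))  # 6 matches, one gap of length 2
```

Under these assumptions the example scores 6 × 5 − 8 − 6 = 16. Making the gap penalties larger in magnitude via --gpo and --gpe would make such gapped columns more costly relative to mismatches.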

Python API

import kalign

# Align DNA sequences
sequences = [
    "ATCGATCGATCG",
    "ATCGTCGATCG", 
    "ATCGATCATCG"
]

aligned = kalign.align(sequences, seq_type="dna")
for seq in aligned:
    print(seq)

For comprehensive Python documentation, see README-python.md and the python-docs directory.
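
If the aligned sequences need to go to a file, a small helper can format them as FASTA. The sketch below assumes, as the example above suggests, that the result of kalign.align is a list of aligned strings in input order; the helper to_fasta and the sequence names are hypothetical, and the package may already ship its own I/O helpers (see README-python.md).

```python
def to_fasta(names, aligned, width=60):
    """Format aligned sequences as FASTA text, wrapping lines at `width`.

    Assumption: `aligned` is a list of strings in the same order as `names`.
    """
    if len(names) != len(aligned):
        raise ValueError("need exactly one name per sequence")
    lines = []
    for name, seq in zip(names, aligned):
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # In real use, `aligned` would come from kalign.align(...).
    aligned = ["ATCG-", "AT-CG"]
    print(to_fasta(["seq1", "seq2"], aligned), end="")
```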

Examples

Basic Usage

Pass sequences via stdin:

cat input.fa | kalign -f fasta > out.afa

Combine multiple input files:

kalign seqsA.fa seqsB.fa seqsC.fa -f fasta > combined.afa

Use optimal threading (auto-detected):

kalign -i sequences.fa -o aligned.afa  # Uses N-1 threads automatically

Custom threading:

kalign -i sequences.fa -o aligned.afa -n 8  # Use exactly 8 threads

Format Conversion

MSF format:

kalign -i BB11001.tfa -f msf -o out.msf

Clustal format:

kalign -i BB11001.tfa -f clu -o out.clu

Re-align existing alignment:

kalign -i BB11001.msf -o out.afa
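
The command lines above can also be driven from a script. The following Python sketch builds the argument list using only the flags shown in this README (-i, -o, -f, -n, --type) and runs it with subprocess. The function names build_kalign_command and run_kalign are hypothetical, and the kalign executable is assumed to be on PATH.

```python
import subprocess


def build_kalign_command(input_path, output_path, fmt=None, threads=None,
                         seq_type=None):
    """Return the kalign argument list for the given options."""
    cmd = ["kalign", "-i", str(input_path), "-o", str(output_path)]
    if fmt is not None:
        cmd += ["-f", fmt]            # output format, e.g. "msf" or "clu"
    if threads is not None:
        cmd += ["-n", str(threads)]   # overrides the auto-detected default
    if seq_type is not None:
        cmd += ["--type", seq_type]   # e.g. "protein", "dna", "rna"
    return cmd


def run_kalign(input_path, output_path, **options):
    """Run kalign and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_kalign_command(input_path, output_path, **options),
                   check=True)


if __name__ == "__main__":
    print(" ".join(build_kalign_command("BB11001.tfa", "out.msf",
                                        fmt="msf", threads=8)))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.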

Library Integration

CMake Integration

Link Kalign into your C/C++ projects:

find_package(kalign)
target_link_libraries(<target> kalign::kalign)

Direct inclusion:

if (NOT TARGET kalign)
  add_subdirectory(<path_to_kalign>/kalign EXCLUDE_FROM_ALL)
endif ()
target_link_libraries(<target> kalign::kalign)

Python Module Development

Local development:

uv pip install -e .

Build Python module with CMake:

mkdir build && cd build
cmake -DBUILD_PYTHON_MODULE=ON ..
make

Cutting a New Python Release (PyPI)

This repo is set up to publish to PyPI from GitHub Actions on version tags (v*) via .github/workflows/wheels.yml.

  1. Bump versions
     • Update the Python package version in pyproject.toml ([project].version).
     • Update the C/C++ library version in CMakeLists.txt (KALIGN_LIBRARY_VERSION_{MAJOR,MINOR,PATCH}).
     • (Optional but recommended) Add an entry to ChangeLog.
  2. Sanity check locally (recommended)

python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip build twine
python -m build
twine check dist/*

  3. Configure trusted publishing on PyPI
     • Add this GitHub repo/workflow as a trusted publisher in the PyPI project settings.
     • Configure it to match the workflow/environment used in .github/workflows/wheels.yml (environment pypi).
  4. Tag and push

git tag vX.Y.Z
git push origin vX.Y.Z

That tag push triggers the wheel + sdist build, install tests, and then uploads to PyPI.

Testing publishing on TestPyPI (optional)

If you want to test the publishing pipeline without uploading to the real PyPI project, this repo also supports a manual TestPyPI publish:

  1. Create a TestPyPI API token and add it as the GitHub secret TEST_PYPI_API_TOKEN.
  2. Run the GitHub Actions workflow Build Python Wheels manually and set publish_target = testpypi.

To install from TestPyPI while resolving dependencies from PyPI:

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple <package-name>

Performance

Benchmark Results

Kalign performs well for both speed and accuracy:

Balibase

[Figure: Balibase benchmark scores]

Bralibase

[Figure: Bralibase benchmark scores]

Performance Features

  • Multi-threading: Automatic CPU core detection with OpenMP parallelization
  • SIMD optimizations: Vectorized algorithms on x86_64 systems (SSE4.1, AVX, AVX2)
  • Bit-parallel algorithms: Myers' algorithm for efficient alignment
  • Memory optimization: Custom allocation strategies for large datasets
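
For readers unfamiliar with bit-parallel methods, the Python sketch below shows the general idea behind Myers' bit-vector algorithm, here in its global edit distance form. It packs one column of the dynamic programming matrix into an integer and updates it with a handful of bitwise operations per character. This is a textbook illustration, not Kalign's C code, and how Kalign applies the algorithm internally is not described in this README. The function name myers_edit_distance is hypothetical.

```python
def myers_edit_distance(a, b):
    """Levenshtein distance between a and b using bit-parallel updates."""
    m = len(a)
    if m == 0:
        return len(b)
    # peq[ch] has bit i set where a[i] == ch.
    peq = {}
    for i, ch in enumerate(a):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    pv, mv = mask, 0   # vertical +1 / -1 differences of the current column
    score = m          # distance between a and the empty prefix of b
    for ch in b:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | ~(xh | pv)   # horizontal +1 differences
        mh = pv & xh           # horizontal -1 differences
        # The last row's horizontal difference updates the running score.
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift in +1 at row 0: the first row of the matrix grows by one
        # per column when computing a global distance.
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv & mask
    return score


if __name__ == "__main__":
    print(myers_edit_distance("ATCGATCGATCG", "ATCGTCGATCG"))
```

Python integers are unbounded, so this sketch handles patterns of any length in one "word"; a C implementation would process the pattern in blocks of machine-word size.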

Performance Tips

  • Let auto-threading work: The default N-1 threading usually provides good performance
  • Large datasets: Consider using --type internal for sequences with many gaps
  • Memory: For very large alignments, monitor memory usage and consider reducing thread count
  • x86_64 systems: SIMD optimizations provide additional speedup on Intel/AMD processors

Contributing

We welcome contributions! See our Contributing Guide for details on:

  • Reporting bugs and requesting features
  • Development environment setup
  • Code style guidelines
  • Pull request process

Community Standards

This project follows the Contributor Covenant Code of Conduct. By participating, you agree to uphold this code.

System Requirements

  • Linux: GCC 4.8+ or Clang 3.4+
  • macOS: Xcode 8+ or Homebrew GCC/Clang
  • Memory: ~1GB RAM per 10,000 sequences (typical)
  • CPU: Any modern processor (additional optimizations on x86_64)

Troubleshooting

Common Issues

macOS OpenMP: If you see OpenMP-related errors on macOS:

brew install libomp
# Kalign automatically finds Homebrew's OpenMP installation

Python module: For Python installation issues:

pip install --upgrade pip setuptools wheel
pip install git+https://github.com/TimoLassmann/kalign.git

Threading: If performance seems slow, check thread detection:

kalign --help  # Shows current thread default
kalign -i test.fa -n 1 -o out.fa  # Force single-threaded for testing

For more troubleshooting, see python-docs/python-troubleshooting.md.

Citation

Please cite Kalign in your publications:

  1. Lassmann, Timo. Kalign 3: multiple sequence alignment of large data sets. Bioinformatics (2019).

Previous Versions

  1. Lassmann, Timo, Oliver Frings, and Erik LL Sonnhammer. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Research 37.3 (2008): 858-865.

  2. Lassmann, Timo, and Erik LL Sonnhammer. Kalign: an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6.1 (2005): 298.

License

Kalign is licensed under the GNU General Public License v3.0. See COPYING for details.