seq-fetch-cli

Command-line tool for downloading biological data from NCBI/ENA with progress tracking and validation.

Features

  • 📊 Data Summary: Show detailed summary of sequencing data before download
  • 📥 Progress Tracking: Real-time progress bars during downloads
  • ✅ MD5 Verification: Automatic checksum verification after download
  • 🔍 Gzip Validation: Verify gzip file integrity for compressed files
  • 📝 Incomplete Tracking: Automatically track failed/incomplete downloads
  • ⏭ Skip Completed: Never re-download already completed files
  • 🔄 Retry Support: Easy retry mechanism for incomplete downloads

Installation

Prerequisites

First, make sure the seq-fetch library is installed:

cd ../seq-fetch
pip install -e .

Install seq-fetch-cli

cd ../seq-fetch-cli
pip install -e .

This installs the seq-fetch command into the active Python environment, so it can be run from any directory.

Quick Start

Download FastQ Files

# Download a single run
seq-fetch download SRR10617884

# Download to specific directory
seq-fetch download SRR10617884 -o ./data

# Download multiple runs
seq-fetch download SRR10617884 SRR10617885 SRR10617886

Download SRA Files

# Download SRA format instead of FastQ
seq-fetch download SRR10617884 --type sra

View Data Summary Without Downloading

# Show run summary
seq-fetch summary SRR10617884

# Show sample summary
seq-fetch summary SAMN14684814 --type sample

Commands

download

Download files for one or more accessions.

seq-fetch download [OPTIONS] ACCESSIONS...

Options:
  -o, --output-dir PATH     Output directory (default: current directory)
  -t, --type [fastq|sra]    File type to download (default: fastq)
  --no-summary              Skip data summary before download
  --no-progress             Skip progress bars
  --no-md5                  Skip MD5 verification
  --no-gzip                 Skip gzip validation
  --max-retries INTEGER     Max retry attempts (default: 3)
  -r, --record-file PATH    Custom incomplete records file

Examples:

# Basic download with all validations
seq-fetch download SRR10617884

# Silent download (no summary, no progress)
seq-fetch download SRR10617884 --no-summary --no-progress

# Download without verification (faster but less safe)
seq-fetch download SRR10617884 --no-md5 --no-gzip

# Download multiple accessions
seq-fetch download SRR10617884 SRR10617885 -o ./data

summary

Show detailed summary of sequencing data without downloading.

seq-fetch summary ACCESSION [OPTIONS]

Options:
  -t, --type [run|sample|study]  Accession type (default: run)

Example Output:

============================================================
Run Summary: SRR10617884
============================================================

Title: Illumina NovaSeq 6000 sequencing
Platform: ILLUMINA
Instrument: NovaSeq 6000
Library Strategy: RNA-Seq
Sample: SAMN14684814
Study: SRP123456

📦 FastQ Files (2):
  - SRR10617884_1.fastq.gz (2.50 GB) [1]
  - SRR10617884_2.fastq.gz (2.48 GB) [2]

  Total Size: 4.98 GB
============================================================

incomplete

List all incomplete or failed downloads.

seq-fetch incomplete [OPTIONS]

Options:
  -r, --record-file PATH    Custom incomplete records file
  --by-reason [md5_mismatch|gzip_invalid|download_failed]  Filter by reason

Example Output:

Incomplete Downloads (2 records):
================================================================================

⚠ SRR10617884
  File: SRR10617884_1.fastq.gz
  Type: fastq
  Reason: md5_mismatch
  Size: 2.50 GB
  Retries: 3
  Time: 2026-02-27T10:30:00

⚠ SRR10617885
  File: SRR10617885.fastq.gz
  Type: fastq
  Reason: gzip_invalid
  Size: 1.80 GB
  Retries: 2
  Time: 2026-02-27T11:00:00
================================================================================

Tip: Use 'seq-fetch retry' to retry downloading these files.

retry

Retry downloading incomplete files.

seq-fetch retry [OPTIONS] [ACCESSIONS]...

Options:
  -o, --output-dir PATH     Output directory
  -r, --record-file PATH    Custom incomplete records file
  --all                     Retry all incomplete downloads
  --type [fastq|sra]        File type (default: fastq)

Examples:

# Retry all incomplete downloads
seq-fetch retry --all

# Retry specific accession
seq-fetch retry SRR10617884

# Retry with custom output directory
seq-fetch retry --all -o ./data

verify

Manually verify a downloaded file.

seq-fetch verify FILE [OPTIONS]

Options:
  --md5 TEXT       Expected MD5 checksum
  --gzip           Also validate gzip format

Examples:

# Verify MD5 only
seq-fetch verify sample.fastq.gz --md5 abc123def456

# Verify gzip integrity
seq-fetch verify sample.fastq.gz --gzip

# Verify both
seq-fetch verify sample.fastq.gz --md5 abc123 --gzip
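
For readers who want to see what these two checks amount to, here is a minimal Python sketch using only the standard library. It is an illustration, not seq-fetch's own implementation; the function names md5_of_file and gzip_is_valid are hypothetical.

```python
import gzip
import hashlib
import zlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_valid(path, chunk_size=1024 * 1024):
    """Return True if the whole file decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end so truncation and CRC errors surface.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

Reading in chunks keeps memory use flat for multi-gigabyte files. A file passes the MD5 check when md5_of_file(path) equals the expected checksum compared case-insensitively.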

Workflow Examples

Typical Download Workflow

# 1. First, check the data summary
seq-fetch summary SRR10617884

# 2. Download with all validations
seq-fetch download SRR10617884 -o ./data

# 3. If interrupted or failed, check incomplete files
seq-fetch incomplete

# 4. Retry incomplete downloads
seq-fetch retry --all -o ./data

Batch Download Workflow

# Download multiple runs
seq-fetch download SRR10617884 SRR10617885 SRR10617886 -o ./data

# Check if any failed
seq-fetch incomplete

# Retry only the failed ones
seq-fetch retry --all -o ./data
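
For larger batches, the accession list can be kept in a text file and turned into a single download command. The sketch below shows one way to do that in Python. The helper names read_accessions and build_download_command are hypothetical, and the one-accession-per-line file format is an assumption; only flags documented above are used.

```python
import shlex


def read_accessions(text):
    """Parse one accession per line, ignoring blanks and '#' comments."""
    accessions = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            accessions.append(line)
    return accessions


def build_download_command(accessions, output_dir="./data", max_retries=3):
    """Build the argument list for one `seq-fetch download` call."""
    if not accessions:
        raise ValueError("at least one accession is required")
    return [
        "seq-fetch", "download", *accessions,
        "-o", output_dir,
        "--max-retries", str(max_retries),
    ]


runs = read_accessions("SRR10617884\nSRR10617885  # replicate 2\n\nSRR10617886\n")
command = build_download_command(runs, output_dir="./data")
print(shlex.join(command))
# To execute it, pass the list to subprocess.run(command, check=False).
```

Building the command as a list rather than a string avoids shell quoting problems when paths contain spaces.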

Download Without Re-downloading Completed Files

The tool checks for files that already exist in the output directory and skips those that are complete, so re-running the same command is safe:

# First run - downloads all files
seq-fetch download SRR10617884 SRR10617885 -o ./data

# Second run - skips already completed files
seq-fetch download SRR10617884 SRR10617885 -o ./data
# Output: "File exists, verifying: ..."

Incomplete Records

Incomplete downloads are automatically tracked in ~/.seq-fetch/incomplete.json.

Each record includes:

  • Accession number
  • File path
  • File type (fastq/sra)
  • Failure reason (md5_mismatch, gzip_invalid, download_failed)
  • Expected MD5
  • File size
  • Retry count
  • Timestamp

This allows you to:

  1. Know exactly which files need to be re-downloaded
  2. Understand why they failed
  3. Retry only the incomplete files without affecting completed ones
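
The record file can also be inspected from a script. The sketch below groups failed accessions by reason. The on-disk schema is not documented here, so it assumes the file holds a JSON array of objects with "accession" and "reason" keys; those key names, and the helper names, are hypothetical and should be checked against a real record file.

```python
import json
from collections import defaultdict
from pathlib import Path

# Default location stated in this README.
DEFAULT_RECORD_FILE = Path.home() / ".seq-fetch" / "incomplete.json"


def load_records(record_file=DEFAULT_RECORD_FILE):
    """Load records, assuming the file holds a JSON array of objects."""
    record_file = Path(record_file)
    if not record_file.exists():
        return []
    with record_file.open(encoding="utf-8") as handle:
        return json.load(handle)


def group_by_reason(records):
    """Map each failure reason to the accessions that failed for it."""
    grouped = defaultdict(list)
    for record in records:
        # "reason" and "accession" are assumed key names.
        reason = record.get("reason", "unknown")
        grouped[reason].append(record.get("accession"))
    return dict(grouped)
```

For example, group_by_reason(load_records()) would make it easy to see whether failures cluster around one reason, such as md5_mismatch, before deciding how to retry.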

Configuration

All options can be combined:

seq-fetch download SRR10617884 \
    -o ./data \
    --no-summary \
    --no-progress \
    --max-retries 5 \
    --record-file ./custom_records.json

Troubleshooting

MD5 Verification Failed

If MD5 verification fails repeatedly:

  1. Check your network connection
  2. Consider that the source file on ENA may itself be corrupted
  3. As a last resort, download with --no-md5 (not recommended, because corruption will then go undetected)

Gzip Validation Failed

If gzip validation fails:

  1. Suspect an incomplete (interrupted) download first
  2. Retry the download: seq-fetch retry SRRXXXXXXX
  3. Check that the output disk has enough free space
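
The disk space check can be scripted before starting a large download. This is a minimal sketch using Python's standard library; the helper name has_room_for and the 5% safety margin are assumptions, and the required size would come from the total reported by seq-fetch summary.

```python
import shutil


def has_room_for(directory, required_bytes, safety_margin=0.05):
    """Return True if `directory` has room for the download plus a margin."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_bytes * (1 + safety_margin)


# Example: a download of roughly 5 GB into the current directory.
if not has_room_for(".", 5 * 1024**3):
    print("Not enough free space for this download")
```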

Files Being Re-downloaded

If completed files are being re-downloaded:

  1. Make sure you're using the same output directory as the original download
  2. Check the incomplete records: seq-fetch incomplete
  3. The file may have failed validation; the Reason field of its record shows why

License

MIT License

Acknowledgments

  • Built on top of the seq-fetch library
  • Data provided by ENA (European Nucleotide Archive)
