The weave docs batch command enables efficient batch processing of multiple documents from a directory. It supports parallel processing, automatic retry on failures, progress tracking, and comprehensive reporting.
- Batch Processing: Process all supported files in a directory automatically
- Parallel Processing: Configure multiple workers with the `--parallel` flag
- Smart Retry: Automatic retry on failures with configurable attempts
- Progress Tracking: Track already-processed files using `.processed` metadata files
- Visual Progress: Emojis and colored output showing real-time progress
- Time Estimation: Estimated time remaining based on processing speed
- Comprehensive Reporting: CSV reports with detailed processing statistics
- Text files: `.txt`, `.md`, `.json`, `.yaml`, `.yml`
- PDF files: `.pdf` (with text extraction and image extraction)
- Image files: `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.webp`
```bash
weave docs batch --directory <dir> --collection <name> [options]
```

- `--directory`, `-d`: Directory containing documents to process
- `--collection`, `-c`: Collection name for documents
- `--parallel`, `-p`: Number of parallel workers (default: 1)
- `--retry`: Number of retry attempts for failed documents (default: 2)
- `--chunk-size`, `-s`: Chunk size for text content in characters (default: 5000)
- `--force-reprocess`: Reprocess all files, ignoring `.processed` files (default: false)
- `--image-collection`: Collection name for extracted PDF images (default: same as the main collection)
- `--skip-small-images`: Skip small images when extracting from PDFs (default: true)
- `--min-image-size`: Minimum image size in bytes (default: 5120 = 5 KB)
- `--batch-size`: Number of images to process before pausing for memory cleanup (default: 10)
- `--create-report`: Create a CSV report of processing results (default: batch-report.csv)
Process all documents in a directory:

```bash
weave docs batch --directory ./documents --collection MyDocs
```

Use 5 parallel workers for faster processing:

```bash
weave docs batch --directory ./documents --collection MyDocs --parallel 5
```

Process PDFs with custom image extraction settings:

```bash
weave docs batch \
  --directory ./pdfs \
  --collection MyPDFs \
  --image-collection MyImages \
  --min-image-size 10240 \
  --batch-size 5 \
  --parallel 3
```

Generate a detailed CSV report:

```bash
weave docs batch \
  --directory ./documents \
  --collection MyDocs \
  --create-report processing-report.csv \
  --parallel 3
```

Use open-source embedding models for completely free operation:

```bash
# Setup (one-time)
pip install sentence-transformers

# Batch ingest with OSS embeddings (saves $240/year on 10M tokens)
weave docs batch \
  --directory ./documents \
  --collection MyDocs_OSS \
  --embedding sentence-transformers/all-mpnet-base-v2 \
  --parallel 3

# Fast variant (384 dimensions instead of 768)
weave docs batch \
  --directory ./documents \
  --collection MyDocs_Fast \
  --embedding sentence-transformers/all-MiniLM-L6-v2 \
  --parallel 5

# With Ollama (local LLM embeddings)
weave docs batch \
  --directory ./documents \
  --collection MyDocs_Ollama \
  --embedding ollama/nomic-embed-text \
  --parallel 3
```

OSS benefits for batch processing:
- Zero API Costs: Completely free, no OpenAI charges
- Privacy: All embeddings generated locally
- Quality: 90%+ retention vs OpenAI for most use cases
- Speed: Similar performance to OpenAI for batch operations
- All VDBs: Works with all 10 supported vector databases
Reprocess all files, even if already processed:

```bash
weave docs batch \
  --directory ./documents \
  --collection MyDocs \
  --force-reprocess
```

The command scans the specified directory recursively for all supported file types:

```
./documents/
├── file1.txt        ✓ Supported
├── file2.pdf        ✓ Supported
├── image1.jpg       ✓ Supported
├── data.csv         ✗ Not supported
└── subdirectory/
    └── file3.md     ✓ Supported
```
For each file processed, a `.processed` metadata file is created:

```json
{
  "file_path": "/path/to/document.txt",
  "processed_at": "2025-10-31T10:30:00Z",
  "success": true,
  "text_chunks": 5,
  "images": 0,
  "processing_time_ms": 1234,
  "file_size": 25000,
  "retry_count": 0,
  "chunks_failed": 0,
  "images_failed": 0
}
```

This allows the batch processor to:
- Skip successfully processed files on subsequent runs
- Retry failed files automatically
- Track processing statistics per file
Workers process files concurrently based on the `--parallel` setting:

```
Worker 1: file1.txt → file4.pdf → file7.jpg
Worker 2: file2.md  → file5.txt → file8.png
Worker 3: file3.pdf → file6.md  → file9.txt
```
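This fan-out can be sketched with a thread pool that hands each worker the next file as it becomes free. A minimal illustration, assuming thread-based workers; `process_file` is a hypothetical stand-in for weave's per-file pipeline:

```python
# Sketch of the worker pool: N workers pull files concurrently.
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_file(path):
    # Placeholder for chunking, embedding, and upload of one file.
    return (path, "success")

def run_batch(files, parallel=3):
    """Process files with up to `parallel` concurrent workers."""
    results = {}
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = {pool.submit(process_file, f): f for f in files}
        for fut in as_completed(futures):
            path, status = fut.result()
            results[path] = status
    return results
```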
Real-time progress with visual indicators:

```
Batch Processing Configuration:
  Directory: ./documents
  Collection: MyDocs
  Parallel workers: 3
  Retry attempts: 2
  Chunk size: 5000 chars

Found 15 file(s) to process
Processing 15 file(s)...

📝 ✓ [1/15] document1.txt (3 chunks) - 1.2s - ETA: 16s
📄 ✓ [2/15] report.pdf (10 chunks, 5 images) - 3.5s - ETA: 45s
🖼️ ✓ [3/15] photo.jpg - 0.8s - ETA: 9s
❌ ✗ [4/15] corrupted.pdf - Failed to read file (after 2 retries)
📝 ✓ [5/15] notes.md (2 chunks) - 0.5s - ETA: 5s
...
```
After processing, a summary is displayed:

```
============================================================
Batch Processing Summary
============================================================
Total files:       15
Successful:        13
Failed:            2
Text chunks:       125
Images extracted:  23
Total time:        1m 45s
Avg time/file:     7s

Failed files:
  - corrupted.pdf: Failed to read file
  - invalid.txt: Permission denied
============================================================
```
If `--create-report` is specified, a CSV file is generated:

```csv
Timestamp,File,Status,TextChunks,Images,ProcessingTimeMs,FileSize,RetryCount,ChunksFailed,ImagesFailed,Error
2025-10-31 10:30:15,document1.txt,success,3,0,1234,5000,0,0,0,
2025-10-31 10:30:18,report.pdf,success,10,5,3456,150000,1,0,0,
2025-10-31 10:30:19,photo.jpg,success,0,1,890,25000,0,0,0,
2025-10-31 10:30:22,corrupted.pdf,failed,0,0,567,0,2,0,0,Failed to read file
```

Success:
- File processed completely
- All chunks/images created successfully
- `.processed` file created with `success: true`
- File skipped on subsequent batch runs (unless `--force-reprocess`)

Partial success:
- File processed but some chunks/images failed
- `.processed` file created with failure counts
- Considered successful overall, but with warnings

Failed:
- File failed to process after all retry attempts
- `.processed` file created with `success: false`
- File will be retried on subsequent batch runs

Skipped:
- File has a valid `.processed` file with `success: true`
- Automatically skipped to save time
- Can be overridden with `--force-reprocess`
Begin with 1-3 workers and increase based on system performance:

```bash
# Start conservative
weave docs batch --directory ./docs --collection MyDocs --parallel 3

# Increase if the system can handle it
weave docs batch --directory ./docs --collection MyDocs --parallel 10
```

Choose the retry count to match your environment:

- Default (2): Good for network hiccups
- Higher (5+): For unstable connections
- Lower (1): For fast failure detection
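The retry behavior can be sketched as a simple attempt loop that records the retry count written into the `.processed` metadata. An illustration only, not weave's implementation; `process_file` is a hypothetical callable:

```python
# Sketch of retry-on-failure: attempt the file up to 1 + retries times.
def process_with_retry(path, process_file, retries=2):
    """Return (success, retry_count, error) after at most 1 + retries attempts."""
    error = None
    for attempt in range(retries + 1):
        try:
            process_file(path)
            return (True, attempt, None)
        except Exception as exc:  # a real tool would catch narrower errors
            error = str(exc)
    return (False, retries, error)
```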
```bash
# High retry count for unreliable networks
weave docs batch --directory ./docs --collection MyDocs --retry 5
```

For large batches with PDFs:
```bash
# Reduce batch size to prevent memory issues
weave docs batch \
  --directory ./pdfs \
  --collection Docs \
  --batch-size 5 \
  --parallel 2
```

Track processing details for auditing:
```bash
weave docs batch \
  --directory ./documents \
  --collection MyDocs \
  --create-report "$(date +%Y%m%d)-batch-report.csv"
```

Review failed files in the report and retry with adjusted settings:
```bash
# First run
weave docs batch --directory ./docs --collection MyDocs

# Review failures, then reprocess failed files
weave docs batch \
  --directory ./docs \
  --collection MyDocs \
  --retry 5 \
  --parallel 1  # Slower but more reliable
```

For directories with many large PDFs:
```bash
weave docs batch \
  --directory ./pdf-library \
  --collection PDFDocs \
  --image-collection PDFImages \
  --parallel 2 \
  --batch-size 5 \
  --min-image-size 10240 \
  --chunk-size 3000 \
  --retry 3 \
  --create-report pdf-batch-report.csv
```

Rationale:
- Low parallel count (2) prevents memory exhaustion
- Small batch size (5) for image processing
- Higher min image size (10KB) filters tiny images
- Smaller chunks (3000) for better granularity
- More retries (3) for large file reliability
For directories with text, images, and PDFs:
```bash
weave docs batch \
  --directory ./mixed-content \
  --collection AllDocs \
  --image-collection AllImages \
  --parallel 5 \
  --chunk-size 5000 \
  --create-report mixed-batch-report.csv
```

Regular batch runs will automatically skip processed files:
```bash
# Daily batch job
weave docs batch \
  --directory /data/daily-docs \
  --collection DailyDocs \
  --parallel 5 \
  --create-report "/reports/$(date +%Y%m%d)-batch.csv"
```

Solution: Reduce parallel workers and batch size
```bash
weave docs batch \
  --directory ./docs \
  --collection MyDocs \
  --parallel 1 \
  --batch-size 3
```

Solution: Increase parallel workers if the system has capacity
```bash
weave docs batch \
  --directory ./docs \
  --collection MyDocs \
  --parallel 10
```

Solution: Check error messages in the report and increase the retry count
```bash
weave docs batch \
  --directory ./docs \
  --collection MyDocs \
  --retry 5 \
  --parallel 1  # Single worker for debugging
```

Solution: Check for `.processed` files or use `--force-reprocess`
```bash
# Remove all .processed files to reprocess everything
find ./docs -name "*.processed" -delete

# Or use the force-reprocess flag
weave docs batch \
  --directory ./docs \
  --collection MyDocs \
  --force-reprocess
```

Note: Time estimates improve as more files are processed; initial estimates may be off.
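The improving estimate can be sketched as the running average of per-file time multiplied by the number of files remaining, which is why early estimates swing and later ones settle. An illustration of the idea, not weave's code:

```python
# Sketch of ETA estimation: average seconds per completed file times
# the number of files still to process.
def estimate_eta(elapsed_seconds, completed, total):
    """Return estimated seconds remaining, or None before the first file finishes."""
    if completed == 0:
        return None
    avg = elapsed_seconds / completed
    return avg * (total - completed)
```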
- Text files: CPU-bound, benefit from higher parallelism
- PDFs with images: I/O and memory-bound, use moderate parallelism
- Images: I/O-bound, moderate parallelism
| Workload Type | Parallel | Batch Size | Retry |
|---|---|---|---|
| Mostly text | 5-10 | N/A | 2 |
| Mostly images | 3-5 | N/A | 2 |
| Mostly PDFs | 2-3 | 5-10 | 3 |
| Mixed | 3-5 | 10 | 2 |
| Large PDFs | 1-2 | 3-5 | 3 |
```bash
#!/bin/bash
# daily-batch.sh
DATE=$(date +%Y%m%d)
REPORT_DIR="/var/reports"
DOCS_DIR="/var/documents/incoming"

weave docs batch \
  --directory "$DOCS_DIR" \
  --collection DailyDocs \
  --parallel 5 \
  --retry 3 \
  --create-report "$REPORT_DIR/$DATE-batch-report.csv"

# Clean up old .processed markers if needed
# find "$DOCS_DIR" -name "*.processed" -mtime +7 -delete
```
```yaml
# .github/workflows/process-docs.yml
name: Process Documents
on:
  push:
    paths:
      - 'documents/**'

jobs:
  process:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Batch process documents
        run: |
          weave docs batch \
            --directory ./documents \
            --collection CI_Docs \
            --parallel 3 \
            --create-report batch-report.csv
      - name: Upload report
        uses: actions/upload-artifact@v2
        with:
          name: batch-report
          path: batch-report.csv
```

```
{
  "file_path": "string",           // Full path to original file
  "processed_at": "RFC3339",       // ISO 8601 timestamp
  "success": boolean,              // Overall success status
  "error": "string",               // Error message if failed
  "text_chunks": integer,          // Number of text chunks created
  "images": integer,               // Number of images extracted
  "processing_time_ms": integer,   // Processing time in milliseconds
  "file_size": integer,            // File size in bytes
  "retry_count": integer,          // Number of retry attempts
  "chunks_failed": integer,        // Number of chunks that failed
  "images_failed": integer         // Number of images that failed
}
```

| Column | Type | Description |
|---|---|---|
| Timestamp | string | Processing timestamp |
| File | string | Filename (basename) |
| Status | string | "success" or "failed" |
| TextChunks | integer | Number of text chunks |
| Images | integer | Number of images |
| ProcessingTimeMs | integer | Processing time in milliseconds |
| FileSize | integer | File size in bytes |
| RetryCount | integer | Number of retries |
| ChunksFailed | integer | Number of failed chunks |
| ImagesFailed | integer | Number of failed images |
| Error | string | Error message if failed |
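A report with these columns can be post-processed to pull out failures for a follow-up run. A minimal sketch using the column names documented above (this tooling is not part of the weave CLI):

```python
# Sketch: read a batch report CSV and list files whose Status is
# "failed", along with their error messages.
import csv
import io

def failed_files(report_text):
    """Return [(file, error), ...] for rows with Status == "failed"."""
    reader = csv.DictReader(io.StringIO(report_text))
    return [(row["File"], row["Error"]) for row in reader
            if row["Status"] == "failed"]
```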
- User Guide - Complete user documentation
- Weave vs RagMe - Schema differences
- Demo - Interactive examples