BiospexOcrProcessor

Purpose

Performs optical character recognition (OCR) on images using Tesseract.js. It is tuned for extracting label text and filters out noise such as rulers and barcodes.

Workflow

Trigger: Runs automatically on S3 ObjectCreated events (typically after BiospexImageFetcher uploads an image).
Analysis: Retrieves the image and its metadata (from the S3 object's Metadata field).
OCR: Runs Tesseract.js in page segmentation mode (PSM) 1, which is automatic page segmentation with orientation and script detection (OSD), using the English and Latin language packs.
Filtering: Processes extracted text blocks to remove:
- Small "noise" blocks (area below 2000 square pixels).
- Thin vertical or horizontal lines (rulers).
- Low-confidence results (confidence below 20%).
Cleanup: Deletes the source image from S3 after processing.
Callback: Sends the extracted text and status back to the Laravel app via SQS.
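
The filtering step above can be sketched as follows. This is a minimal illustration, not the processor's actual code: it assumes each text block looks like { text, confidence, bbox: { x0, y0, x1, y1 } }, which matches the general shape of Tesseract.js results. The area and confidence thresholds come from the list above; the aspect-ratio cutoff used to detect rulers (MAX_ASPECT_RATIO) and the function names are hypothetical.

```javascript
// Thresholds from the workflow description above.
const MIN_AREA_PX = 2000;    // blocks with a smaller area are treated as noise
const MIN_CONFIDENCE = 20;   // percent; blocks below this are dropped

// ASSUMPTION: the README does not say how "thin" is defined.
// Here a block more than 10 times longer than it is wide (or vice versa)
// is treated as a ruler-like line.
const MAX_ASPECT_RATIO = 10;

// Returns true when a block should be kept.
// ASSUMPTION: block = { text, confidence, bbox: { x0, y0, x1, y1 } }.
function keepBlock(block) {
  const width = block.bbox.x1 - block.bbox.x0;
  const height = block.bbox.y1 - block.bbox.y0;
  if (width <= 0 || height <= 0) return false;              // degenerate box
  if (width * height < MIN_AREA_PX) return false;           // small noise block
  const aspect = Math.max(width / height, height / width);
  if (aspect > MAX_ASPECT_RATIO) return false;              // thin line (ruler)
  if (block.confidence < MIN_CONFIDENCE) return false;      // low-confidence guess
  return true;
}

// Joins the text of all surviving blocks, one block per line.
function filterBlocks(blocks) {
  return blocks
    .filter(keepBlock)
    .map((block) => block.text.trim())
    .filter((text) => text.length > 0)
    .join("\n");
}
```

Keeping the thresholds as named constants makes them easy to tune against real specimen images without touching the filter logic.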

Inputs/Outputs

Inputs (S3 Event):
- S3 Object Key: The path to the image in the bucket.
- S3 Object Metadata: queue-id, file-id, subject-id, updates-url.
Outputs:
- SQS: Extracted text and status notification sent to updates-url.
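
As a rough sketch of how these inputs and outputs fit together: the S3 event record supplies the bucket and object key, the object's metadata supplies the four fields listed above, and the result is packaged as an SQS message. The helper names and the JSON field names in the message body are hypothetical, and the sketch assumes updates-url holds the SQS queue URL. The actual payload expected by the Laravel listener may differ.

```javascript
// Metadata fields listed under Inputs.
const REQUIRED_METADATA = ["queue-id", "file-id", "subject-id", "updates-url"];

// Extracts bucket and key from one S3 event record.
// Object keys in S3 events are URL-encoded, with spaces sent as "+".
function parseS3Record(record) {
  return {
    bucket: record.s3.bucket.name,
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
  };
}

// Builds SendMessage-style parameters (QueueUrl, MessageBody).
// ASSUMPTION: the body field names (queueId, fileId, ...) are illustrative only.
function buildUpdateMessage(metadata, text, status) {
  const missing = REQUIRED_METADATA.filter((name) => !metadata[name]);
  if (missing.length > 0) {
    throw new Error(`Missing S3 metadata: ${missing.join(", ")}`);
  }
  return {
    QueueUrl: metadata["updates-url"],
    MessageBody: JSON.stringify({
      queueId: metadata["queue-id"],
      fileId: metadata["file-id"],
      subjectId: metadata["subject-id"],
      status,
      text,
    }),
  };
}
```

Validating the metadata up front means a misconfigured upload fails with a clear error instead of producing a message the listener cannot route.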

Related Components

Laravel Command: App\Console\Commands\SqsListenerOcrUpdate (Listens for success status).
Laravel Job: App\Jobs\TesseractOcrUpdateJob (Processes the extracted text in the Laravel database).
Related Lambda: BiospexImageFetcher (The primary source of images for this processor).

Deployment

Use the deploy.sh script for interactive deployment to AWS (Region: us-east-2).
