BiospexOcrProcessor

Purpose

Performs optical character recognition (OCR) on images using Tesseract.js. It is optimized for label text extraction while filtering out noise like rulers or barcodes.

Workflow

  1. Trigger: Invoked automatically by S3 ObjectCreated events (typically after BiospexImageFetcher uploads an image).
  2. Analysis: Retrieves the image and its metadata (from the S3 object's Metadata field).
  3. OCR: Runs Tesseract.js with page segmentation mode (PSM) 1, automatic page segmentation with orientation and script detection (OSD), using the English and Latin language packs.
  4. Filtering: Processes extracted text blocks to remove:
    • Small "noise" blocks (bounding-box area < 2000 px²).
    • Thin vertical/horizontal lines (rulers).
    • Low-confidence results (confidence < 20%).
  5. Cleanup: Deletes the source image from S3 after processing.
  6. Callback: Sends the extracted text and status back to the Laravel app via SQS.
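
The filtering step above can be sketched roughly as follows. This is an illustration under assumptions, not the processor's actual code: it assumes each text block carries `text`, `confidence` (0 to 100), and a `bbox` with `x0`, `y0`, `x1`, `y1`, as in Tesseract.js block output. The area and confidence thresholds come from the list above; `MAX_ASPECT_RATIO` and the function names are hypothetical, since the real ruler criterion is not documented here.

```javascript
// Thresholds taken from the workflow description.
const MIN_AREA_PX = 2000;    // drop blocks with a smaller bounding-box area
const MIN_CONFIDENCE = 20;   // drop blocks below this confidence (percent)
// Hypothetical stand-in for the "thin line" (ruler) check.
const MAX_ASPECT_RATIO = 20;

// Returns true when a block looks like real label text.
function keepBlock(block) {
  const width = block.bbox.x1 - block.bbox.x0;
  const height = block.bbox.y1 - block.bbox.y0;
  if (width <= 0 || height <= 0) return false;            // degenerate box
  if (width * height < MIN_AREA_PX) return false;         // small noise
  const ratio = Math.max(width, height) / Math.min(width, height);
  if (ratio > MAX_ASPECT_RATIO) return false;             // thin line, likely a ruler
  if (block.confidence < MIN_CONFIDENCE) return false;    // low-confidence guess
  return true;
}

// Joins the text of the surviving blocks, one block per line.
function filterBlocks(blocks) {
  return blocks
    .filter(keepBlock)
    .map((block) => block.text.trim())
    .join("\n");
}
```

Keeping the three checks in one predicate makes it easy to tune a threshold without touching the OCR call itself.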

Inputs/Outputs

  • Inputs (S3 Event):
    • S3 Object Key: The path to the image in the bucket.
    • S3 Object Metadata: queue-id, file-id, subject-id, updates-url.
  • Outputs:
    • SQS: Extracted text and status notification sent to updates-url.
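
As a rough sketch of how these inputs and outputs fit together: S3 event notifications URL-encode the object key (with spaces as `+`), so the key needs decoding before the object is fetched, and the user metadata keys listed above arrive lowercased when the object is read. The helper names and the JSON field names in the message body below are hypothetical, and treating `updates-url` as the SQS queue URL is an assumption based on the output description.

```javascript
// Extract bucket and decoded key from one record of an S3 event.
function parseS3Record(record) {
  return {
    bucket: record.s3.bucket.name,
    // Keys in S3 notifications are URL-encoded, with spaces encoded as "+".
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
  };
}

// Build the callback message from the object's metadata and the OCR result.
// The body field names are illustrative, not the actual message contract.
function buildCallbackMessage(metadata, text, status) {
  return {
    queueUrl: metadata["updates-url"], // assumed to be the SQS queue URL
    body: JSON.stringify({
      queueId: metadata["queue-id"],
      fileId: metadata["file-id"],
      subjectId: metadata["subject-id"],
      status,
      text,
    }),
  };
}
```

The returned `queueUrl` and `body` would then be passed to whatever SQS client the Lambda uses to send the message.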

Related Components

  • Laravel Command: App\Console\Commands\SqsListenerOcrUpdate (Listens for success status).
  • Laravel Job: App\Jobs\TesseractOcrUpdateJob (Processes the extracted text in the Laravel database).
  • Related Lambda: BiospexImageFetcher (The primary source of images for this processor).

Deployment

Use the deploy.sh script for interactive deployment to AWS (Region: us-east-2).