Evaluation: TTS#619

Open
AkhileshNegi wants to merge 2 commits into main from feature/tts-evaluation

Conversation

@AkhileshNegi AkhileshNegi commented Feb 23, 2026

Summary

Target issue is #535
Explain the motivation for making this change. What existing problem does the pull request solve?

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Please add here if any other information is required for the reviewer.

Summary by CodeRabbit

Release Notes

  • New Features

    • TTS evaluation system for assessing speech synthesis quality
    • Dataset creation and management for text samples
    • Evaluation runs across multiple TTS providers
    • Feedback system for synthesized audio quality
  • Documentation

    • API documentation for TTS evaluation endpoints and workflows

@AkhileshNegi AkhileshNegi self-assigned this Feb 23, 2026
@AkhileshNegi AkhileshNegi added the enhancement New feature or request label Feb 23, 2026
@AkhileshNegi AkhileshNegi marked this pull request as ready for review February 23, 2026 05:21

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough

Adds a complete TTS evaluation system: an Alembic migration for the results schema, API routes for managing datasets and evaluation runs, CRUD operations, batch submission to the Gemini API, audio processing utilities, polling/cron integration, and supporting domain models with a human feedback mechanism.

Changes

Cohort / File(s) Summary
Database Schema
backend/app/alembic/versions/047_add_tts_evaluation_tables.py
Creates tts_result table with comprehensive TTS evaluation fields: text input, audio URL, metadata, provider, status, scores, human feedback, error messages, and foreign keys. Adds three indexes and downgrade path.
API Documentation
backend/app/api/docs/tts_evaluation/*
Adds endpoint documentation for dataset CRUD operations, evaluation run management (start, list, get with filtering/pagination), result retrieval, and human feedback updates.
API Routes - Router Setup
backend/app/api/routes/evaluations/__init__.py, backend/app/api/routes/tts_evaluations/router.py, backend/app/api/routes/tts_evaluations/__init__.py
Configures TTS evaluation API router with /evaluations/tts prefix, integrates three sub-routers (dataset, evaluation, result), and exposes in main evaluations route group.
API Routes - Datasets
backend/app/api/routes/tts_evaluations/dataset.py
Implements POST/GET/GET-by-id endpoints for dataset management with language validation, pagination, sample_count metadata, and project scoping.
API Routes - Evaluations
backend/app/api/routes/tts_evaluations/evaluation.py
Implements endpoints to start TTS evaluation runs (loads samples from storage, creates results, submits batch), list/get runs with optional result inclusion and provider/status filtering.
API Routes - Results
backend/app/api/routes/tts_evaluations/result.py
Implements GET and PATCH endpoints for retrieving TTS results and updating human feedback (is_correct, comment) with project scoping.
Models - TTS Evaluation
backend/app/models/tts_evaluation.py
Defines SQLModel TTSResult table and Pydantic request/response schemas: dataset creation/representation, result feedback, run creation with model validation against SUPPORTED_TTS_MODELS, and paginated result representation.
CRUD - Datasets
backend/app/crud/tts_evaluations/dataset.py
Implements create (with unique constraint handling), get by id, and list (with pagination) for TTS datasets; integrates with public schema mapping.
CRUD - Runs
backend/app/crud/tts_evaluations/run.py
Implements create, get by id, list (with dataset/status filters), and update operations for evaluation runs with EvaluationType.TTS validation and TTSEvaluationRunPublic mapping.
CRUD - Results
backend/app/crud/tts_evaluations/result.py
Implements bulk create, get by id, list with filtering/pagination, status updates, human feedback updates, and query helpers (get pending, count by status).
CRUD - Batch Submission
backend/app/crud/tts_evaluations/batch.py
Handles batch submission: groups results by model, constructs Gemini batch requests, submits jobs, tracks batch_job_id, marks failures with error messages, and updates run status to "processing".
CRUD - Polling/Cron
backend/app/crud/tts_evaluations/cron.py
Implements polling loop: aggregates pending runs per project, polls batch jobs, processes completed batches (downloads results, extracts base64 audio, wraps as WAV, uploads to storage, updates results), tracks processed/failed counts with extensive logging and error handling.
CRUD - Integration
backend/app/crud/tts_evaluations/__init__.py
Re-exports CRUD operations (dataset, run, result, batch, cron) for convenient public access.
Cron Integration
backend/app/crud/evaluations/cron.py
Updates existing cron to call poll_all_pending_tts_evaluations and merge TTS summary with text/STT processing totals and details.
Batch API Integration
backend/app/core/batch/gemini.py, backend/app/core/batch/__init__.py
Adds create_tts_batch_requests to construct Gemini TTS batch payloads (texts, voice, optional style) mirroring STT structure; exports from batch module.
Services - Audio Utilities
backend/app/services/tts_evaluations/audio.py
Provides PCM-to-WAV conversion and duration calculation from PCM data (sample rate, bits, channels).
Services - Constants
backend/app/services/tts_evaluations/constants.py
Defines default TTS configuration: DEFAULT_TTS_MODEL, DEFAULT_VOICE_NAME, DEFAULT_STYLE_PROMPT.
Services - Dataset Management
backend/app/services/tts_evaluations/dataset.py
Implements upload_tts_dataset workflow: converts samples to CSV, uploads to object storage, computes metadata, creates database record with transaction rollback on failure. Includes CSV parsing helper.
Services - Init
backend/app/services/tts_evaluations/__init__.py
Package initialization with module docstring.
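The PCM-to-WAV wrapping and duration math attributed to backend/app/services/tts_evaluations/audio.py can be sketched as follows. This is a minimal stdlib illustration; the function names and the 24 kHz / 16-bit / mono defaults are assumptions, not the PR's actual signatures:

```python
import io
import wave


def pcm_to_wav(pcm_data: bytes, sample_rate: int = 24000,
               bits_per_sample: int = 16, channels: int = 1) -> bytes:
    """Wrap raw PCM bytes in a WAV container using the stdlib wave module."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(bits_per_sample // 8)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_data)
    return buffer.getvalue()


def pcm_duration_seconds(pcm_data: bytes, sample_rate: int = 24000,
                         bits_per_sample: int = 16, channels: int = 1) -> float:
    """Duration = total bytes / (bytes per sample * channels * sample rate)."""
    bytes_per_second = sample_rate * (bits_per_sample // 8) * channels
    return len(pcm_data) / bytes_per_second
```

Note that wave writes the RIFF header on close, so the bytes are only complete after the `with` block exits.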

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as API Routes
    participant CRUD as CRUD Layer
    participant DB as Database
    participant ObjectStore as Object Storage
    participant GeminiBatch as Gemini Batch API

    rect rgba(100, 150, 200, 0.5)
    Note over Client,GeminiBatch: 1. Start TTS Evaluation
    Client->>API: POST /evaluations/tts/runs (dataset_id, models)
    API->>CRUD: get_tts_dataset_by_id
    CRUD->>DB: Query dataset
    DB-->>CRUD: Return dataset
    CRUD-->>API: Dataset
    API->>ObjectStore: Load sample texts from CSV
    ObjectStore-->>API: Samples
    API->>CRUD: create_tts_run
    CRUD->>DB: Insert evaluation run
    DB-->>CRUD: Run with ID
    API->>CRUD: create_tts_results (for each sample × model)
    CRUD->>DB: Bulk insert results (status=pending)
    DB-->>CRUD: Results
    end

    rect rgba(150, 200, 100, 0.5)
    Note over CRUD,GeminiBatch: 2. Submit Batch for Processing
    API->>CRUD: start_tts_evaluation_batch
    CRUD->>CRUD: Group results by provider
    CRUD->>API: create_tts_batch_requests (texts, voice, style)
    API-->>CRUD: Batch request payloads
    CRUD->>GeminiBatch: Submit batch job
    GeminiBatch-->>CRUD: batch_job_id
    CRUD->>DB: Update run (status=processing, batch_job_id)
    DB-->>CRUD: Updated run
    CRUD-->>API: Success payload
    API-->>Client: HTTP 200 with run
    end

    rect rgba(200, 150, 100, 0.5)
    Note over CRUD,ObjectStore: 3. Poll & Process Results (Cron)
    CRUD->>DB: Query runs with status=processing
    DB-->>CRUD: Pending runs
    loop For each pending run
        CRUD->>GeminiBatch: Poll batch job status
        GeminiBatch-->>CRUD: Batch results (when complete)
        CRUD->>CRUD: Extract base64 audio from response
        CRUD->>CRUD: Decode PCM, wrap as WAV, calculate duration
        CRUD->>ObjectStore: Upload WAV file
        ObjectStore-->>CRUD: object_store_url
        CRUD->>DB: Update result (url, metadata, status=completed)
        DB-->>CRUD: Updated result
    end
    CRUD->>DB: Update run (status=completed)
    DB-->>CRUD: Updated run
    end

    rect rgba(200, 100, 150, 0.5)
    Note over Client,CRUD: 4. Review Results & Provide Feedback
    Client->>API: GET /evaluations/tts/runs/{run_id}
    API->>CRUD: get_tts_run_by_id + get_results_by_run_id
    CRUD->>DB: Query run and results
    DB-->>CRUD: Run + results with object_store_urls
    CRUD-->>API: Run with results
    API-->>Client: HTTP 200 with run and results
    Client->>API: PATCH /evaluations/tts/results/{result_id} (is_correct, comment)
    API->>CRUD: update_tts_human_feedback
    CRUD->>DB: Update result feedback fields
    DB-->>CRUD: Updated result
    CRUD-->>API: Updated result
    API-->>Client: HTTP 200 with updated result
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • ProjectTech4DevAI/kaapi-backend#405: Adds core evaluation, batch, and cron infrastructure that this PR builds upon for TTS-specific batch request creation and polling integration.

Suggested reviewers

  • Prajna1999
  • kartpop
  • rajagopalmotivate

Poem

🐰 Hops with glee at TTS delight,
Audio wrapped in WAV so bright,
Batches batch, the Gemini way,
Datasets dance, results at play,
Feedback flows from human ears,
Speech evaluation clears the fears! 🎵

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check (❓ Inconclusive): The title 'Evaluation: TTS' is vague and lacks specific detail about what was actually implemented. While it relates to the PR's overall purpose, it doesn't convey the main change or scope. Resolution: provide a more descriptive title that captures the primary change, such as 'Add TTS evaluation feature with dataset management and batch processing' or 'Implement end-to-end TTS evaluation workflow'.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage (✅ Passed): Docstring coverage is 95.12%, which is sufficient. The required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (13)
backend/app/services/tts_evaluations/audio.py (1)

3-5: Unused import: struct.

struct is imported but never used — the wave module handles header construction internally.

🧹 Proposed fix
 import io
-import struct
 import wave
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/tts_evaluations/audio.py` around lines 3 - 5, The import
statement for the symbol struct in backend/app/services/tts_evaluations/audio.py
is unused and should be removed to clean up the module imports; update the
top-of-file imports by deleting the struct import so only io and wave remain
(leave any other existing imports untouched).
backend/app/alembic/versions/047_add_tts_evaluation_tables.py (2)

133-150: Redundant single-column index ix_tts_result_run_id.

Both idx_tts_result_feedback(evaluation_run_id, is_correct) and idx_tts_result_status(evaluation_run_id, status) have evaluation_run_id as the leading column, so either composite index already satisfies queries that filter only on evaluation_run_id. The standalone index ix_tts_result_run_id is redundant and adds write overhead without benefit.

♻️ Proposed fix: remove the redundant index

In upgrade():

-    op.create_index(
-        "ix_tts_result_run_id",
-        "tts_result",
-        ["evaluation_run_id"],
-        unique=False,
-    )

In downgrade():

-    op.drop_index("ix_tts_result_run_id", table_name="tts_result")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/alembic/versions/047_add_tts_evaluation_tables.py` around lines
133 - 150, The migration creates a redundant single-column index
ix_tts_result_run_id on tts_result.evaluation_run_id because the composite
indexes idx_tts_result_feedback (evaluation_run_id, is_correct) and
idx_tts_result_status (evaluation_run_id, status) already cover queries
filtering only by evaluation_run_id; remove the op.create_index call for
"ix_tts_result_run_id" from upgrade(), and also remove the corresponding
op.drop_index for "ix_tts_result_run_id" from downgrade() so the migration no
longer creates or drops that redundant index (leave idx_tts_result_feedback and
idx_tts_result_status unchanged).

103-114: Consider adding server_default for timestamp columns.

inserted_at and updated_at are nullable=False with no server_default. If a row is ever inserted outside the ORM (e.g., a manual SQL fix, a migration backfill, or another service), the insert will fail. Adding server_default=sa.func.now() provides a safety net.

♻️ Suggested change
         sa.Column(
             "inserted_at",
             sa.DateTime(),
             nullable=False,
+            server_default=sa.func.now(),
             comment="Timestamp when the result was created",
         ),
         sa.Column(
             "updated_at",
             sa.DateTime(),
             nullable=False,
+            server_default=sa.func.now(),
             comment="Timestamp when the result was last updated",
         ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/alembic/versions/047_add_tts_evaluation_tables.py` around lines
103 - 114, The timestamp columns inserted_at and updated_at in the Alembic
revision (columns named "inserted_at" and "updated_at") are non-nullable but
lack server defaults; update their Column definitions to include a
server_default (e.g., server_default=sa.func.now() or sa.text('now()')) so
inserts outside the ORM won't fail, and keep the existing nullable=False and
comments intact; apply this change to both Column(...) calls in this migration.
backend/app/crud/evaluations/cron.py (1)

49-66: Minor inconsistency in dict access style (pre-existing).

Lines 49–61 use direct key access (summary["processed"]) while Lines 62–65 use .get("details", []). Both are safe since all three polling functions guarantee these keys, but you might consider making the access style uniform for consistency. Not blocking since this pattern was pre-existing before TTS was added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/evaluations/cron.py` around lines 49 - 66, Summary: Make
dict access style consistent by using .get for all summary keys. Fix: Replace
direct key indexing for processed/failed/still_processing in the computations of
total_processed, total_failed, and total_still_processing to use
text_summary.get("processed", 0), stt_summary.get("processed", 0),
tts_summary.get("processed", 0) (and similar for "failed" and
"still_processing"), and keep all_details using .get("details", []); this
ensures uniform access for text_summary, stt_summary, and tts_summary and safe
defaults if keys are ever missing.
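The uniform .get access suggested above could be centralized in a small merge helper. This is a hypothetical sketch; the summary keys are taken from the review text, and merge_polling_summaries is not a function in the PR:

```python
def merge_polling_summaries(*summaries: dict) -> dict:
    """Combine per-type polling summaries using uniform .get access,
    so a missing key falls back to a safe default instead of raising."""
    return {
        "processed": sum(s.get("processed", 0) for s in summaries),
        "failed": sum(s.get("failed", 0) for s in summaries),
        "still_processing": sum(s.get("still_processing", 0) for s in summaries),
        "details": [d for s in summaries for d in s.get("details", [])],
    }
```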
backend/app/api/routes/tts_evaluations/result.py (2)

64-83: Duplicated TTSResultPublic construction across both endpoints.

The mapping from TTSResult → TTSResultPublic (including the metadata_ extraction) is repeated verbatim. Consider extracting a small helper like _to_public(result: TTSResult) -> TTSResultPublic to keep it DRY, especially since this same pattern appears in the evaluation routes too.

Also applies to: 109-128

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/tts_evaluations/result.py` around lines 64 - 83, The
TTSResult→TTSResultPublic mapping is duplicated; extract a helper function (e.g.
_to_public(result: TTSResult) -> TTSResultPublic) that centralizes building
TTSResultPublic (including extracting duration_seconds and size_bytes from
result.metadata_). Replace the inline constructions in this file’s endpoints
(the current return block that creates TTSResultPublic and the similar block at
lines 109-128) with calls to _to_public(result) so both endpoints reuse the same
conversion logic.

44-62: Redundant existence check before update_tts_human_feedback.

update_tts_human_feedback (per backend/app/crud/tts_evaluations/result.py lines 221-273) already validates existence via get_tts_result_by_id and raises HTTPException(404) if not found. The pre-check on lines 44-52 duplicates that query.

You can remove the pre-check and rely on the CRUD layer's 404 handling, or remove the internal check from update_tts_human_feedback and keep it here—but having both is an unnecessary extra DB query.

♻️ Suggested simplification
 def update_result_feedback(
     _session: SessionDep,
     auth_context: AuthContextDep,
     result_id: int,
     feedback: TTSFeedbackUpdate = Body(...),
 ) -> APIResponse[TTSResultPublic]:
     """Update human feedback on a TTS result."""
     logger.info(
         f"[update_result_feedback] Updating feedback | "
         f"result_id: {result_id}, is_correct: {feedback.is_correct}"
     )
 
-    # Verify result exists and belongs to this project
-    existing = get_tts_result_by_id(
-        session=_session,
-        result_id=result_id,
-        org_id=auth_context.organization_.id,
-        project_id=auth_context.project_.id,
-    )
-
-    if not existing:
-        raise HTTPException(status_code=404, detail="Result not found")
-
     # Update feedback
     result = update_tts_human_feedback(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/tts_evaluations/result.py` around lines 44 - 62,
Remove the redundant existence check that calls get_tts_result_by_id before
calling update_tts_human_feedback: delete the block that queries existing and
raises HTTPException when falsy, and rely on update_tts_human_feedback to
perform the existence validation (or conversely, if you prefer validation in the
route, remove the internal check inside update_tts_human_feedback); locate the
redundant code by the get_tts_result_by_id call and the subsequent if not
existing / raise HTTPException block and remove it so only one DB existence
check remains for result_id with auth_context.organization_.id and
auth_context.project_.id.
backend/app/crud/tts_evaluations/batch.py (1)

152-153: Bare Exception raised when all batch submissions fail.

Using raise Exception(...) loses context and makes it hard to distinguish this failure from unexpected errors upstream (e.g., in the API route's except Exception handler at evaluation.py line 119). Consider defining or reusing a domain-specific exception.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/tts_evaluations/batch.py` around lines 152 - 153, Replace
the bare raise Exception(...) used when batch_jobs is empty with a
domain-specific exception: define a new exception class (e.g.,
BatchSubmissionError) in this module or a shared exceptions module and raise
that instead (include the original error details or chain underlying exceptions
via "from" where applicable) so callers can distinguish batch submission
failures; update any upstream handlers if needed to catch BatchSubmissionError
instead of a generic Exception.
backend/app/crud/tts_evaluations/dataset.py (1)

117-173: Consider using model_validate to reduce boilerplate in the mapping.

The manual field-by-field construction of TTSDatasetPublic (lines 157-169) is repeated here and in the API route's get_dataset. If TTSDatasetPublic fields align with EvaluationDataset, Pydantic's model_validate (or from_orm) can eliminate this boilerplate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/tts_evaluations/dataset.py` around lines 117 - 173, Replace
the manual field-by-field construction in list_tts_datasets with Pydantic model
validation/mapping to remove boilerplate: instead of building
TTSDatasetPublic(...) for each EvaluationDataset, call
TTSDatasetPublic.model_validate(dataset) (or
TTSDatasetPublic.model_validate(dataset, mode="from_attributes") / use from_orm
equivalent) to convert the ORM EvaluationDataset into the Pydantic
TTSDatasetPublic; if needed, enable TTSDatasetPublic model_config
(from_attributes/orm_mode) so model_validate/from_orm can read attributes from
EvaluationDataset, and update the mapping in list_tts_datasets (and mirror the
same change in get_dataset) to return [TTSDatasetPublic.model_validate(d) for d
in datasets] and keep the same returned total.
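The model_validate approach might look like this sketch (Pydantic v2; EvaluationDataset is a dataclass stand-in and the field set is assumed):

```python
from dataclasses import dataclass

from pydantic import BaseModel, ConfigDict


@dataclass
class EvaluationDataset:  # stand-in for the ORM row (fields assumed)
    id: int
    name: str
    language: str
    sample_count: int


class TTSDatasetPublic(BaseModel):
    # from_attributes lets model_validate read plain object attributes
    model_config = ConfigDict(from_attributes=True)

    id: int
    name: str
    language: str
    sample_count: int


def list_public(datasets: list[EvaluationDataset]) -> list[TTSDatasetPublic]:
    """One-line mapping replaces the field-by-field construction."""
    return [TTSDatasetPublic.model_validate(d) for d in datasets]
```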
backend/app/crud/tts_evaluations/result.py (3)

62-64: Redundant flush() before commit().

session.commit() implicitly flushes, so session.flush() on Line 63 is unnecessary here.

Simplify
     session.add_all(results)
-    session.flush()
     session.commit()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/tts_evaluations/result.py` around lines 62 - 64, Remove the
redundant explicit flush before commit: the block that calls
session.add_all(results) followed by session.flush() and session.commit() should
drop the session.flush() call because session.commit() implicitly flushes; keep
session.add_all(results) and then call session.commit() (reference:
session.add_all, session.flush, session.commit in the tts_evaluations/result.py
function).

320-328: Simplify dict comprehension to dict() constructor.

Per static analysis (Ruff C416), {status: count for status, count in rows} can be simplified.

Fix
-    return {status: count for status, count in rows}
+    return dict(rows)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/tts_evaluations/result.py` around lines 320 - 328, The dict
comprehension building the status->count map can be simplified: replace the
final return that currently constructs {status: count for status, count in rows}
with a direct dict conversion of the query result (rows) so the code returns
dict(rows); locate the query using TTSResult (the statement variable) and the
rows variable set from session.exec(statement).all(), and change only the return
expression to use dict(rows).

177-219: update_tts_result flushes but does not commit — document this caller obligation.

This function intentionally uses flush() to support batch contexts (e.g., process_completed_tts_batch in cron.py commits once at the end). However, if called from a non-batch context, uncommitted changes will be silently lost. A brief docstring note about the caller's commit responsibility would prevent misuse.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/tts_evaluations/result.py` around lines 177 - 219, Update
the update_tts_result docstring to explicitly state that this function calls
session.flush() but does not commit the transaction, so callers (e.g.,
process_completed_tts_batch) are responsible for calling session.commit() when
not running inside a batch; reference the function name update_tts_result, the
model TTSResult, and the use of now() to indicate this only persists flushed
changes until the caller commits.
backend/app/models/tts_evaluation.py (1)

207-230: Validator parameter name valid_model is misleading; default duplicates the constant.

  1. The field_validator parameter (Line 220) is named valid_model but it receives the raw input before validation. A name like models or v (Pydantic convention) would be clearer.

  2. The default factory on Line 213 hard-codes ["gemini-2.5-pro-preview-tts"] while SUPPORTED_TTS_MODELS already holds the same list. Reuse the constant to avoid drift.

Suggested changes
     models: list[str] = Field(
-        default_factory=lambda: ["gemini-2.5-pro-preview-tts"],
+        default_factory=lambda: list(SUPPORTED_TTS_MODELS),
         description="List of TTS models to use",
         min_length=1,
     )
 
     `@field_validator`("models")
     `@classmethod`
-    def validate_models(cls, valid_model: list[str]) -> list[str]:
+    def validate_models(cls, v: list[str]) -> list[str]:
         """Validate that all models are supported."""
-        if not valid_model:
+        if not v:
             raise ValueError("At least one model must be specified")
-        unsupported = [m for m in valid_model if m not in SUPPORTED_TTS_MODELS]
+        unsupported = [m for m in v if m not in SUPPORTED_TTS_MODELS]
         if unsupported:
             raise ValueError(
                 f"Unsupported model(s): {', '.join(unsupported)}. "
                 f"Supported models are: {', '.join(SUPPORTED_TTS_MODELS)}"
             )
-        return valid_model
+        return v
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/models/tts_evaluation.py` around lines 207 - 230, In
TTSEvaluationRunCreate, update the validator and default to be clearer and less
brittle: rename the validator parameter in validate_models (the method decorated
with `@field_validator`("models")) from valid_model to a conventional name like
models or v to reflect it receives the raw input, and change the models Field
default_factory from the hard-coded ["gemini-2.5-pro-preview-tts"] to reuse
SUPPORTED_TTS_MODELS (e.g., default_factory=lambda: SUPPORTED_TTS_MODELS) so the
default cannot drift from the supported list; keep the validation logic in
validate_models the same but refer to the renamed parameter.
backend/app/crud/tts_evaluations/run.py (1)

178-230: update_tts_run lacks org/project/type scoping, unlike other functions in this module.

get_tts_run_by_id and list_tts_runs both filter by organization_id, project_id, and EvaluationType.TTS. This function queries only by run_id (Line 202), which is inconsistent. While current callers already have valid run IDs, adding scoping here would be more defensive.

Also, the if value is not None guard (Line 217) means you can never explicitly clear a field (e.g., reset error_message to None on a retry). Consider using a sentinel value or an explicit "fields to update" set if clearing is needed later.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/tts_evaluations/run.py` around lines 178 - 230, The
update_tts_run function currently selects EvaluationRun only by run_id; change
its lookup to also filter by organization_id, project_id and EvaluationType.TTS
(similar to get_tts_run_by_id and list_tts_runs) so the update is properly
scoped; additionally replace the current updates dict + "if value is not None"
loop (which prevents clearing fields like error_message) with an explicit
fields_to_update parameter or a sentinel (e.g., use a special UNSET constant)
and only update attributes present in that set/whose value is not UNSET while
allowing None to be written, then persist via
session.add/session.commit/session.refresh and keep logging with update_tts_run
and updated_at = now().
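The sentinel approach for distinguishing "not provided" from an explicit None could be sketched like so; UNSET and the Run stand-in are hypothetical, not the PR's code:

```python
from typing import Any

# Module-level sentinel: identity-distinct from every real value, including None
UNSET: Any = object()


class Run:  # stand-in for EvaluationRun (fields assumed)
    def __init__(self) -> None:
        self.status = "processing"
        self.error_message = "boom"


def update_run(run: Run, status: Any = UNSET, error_message: Any = UNSET) -> Run:
    """Update only the fields the caller actually passed.

    Unlike an `if value is not None` guard, this allows explicitly
    clearing a field (e.g. resetting error_message on retry) by passing None.
    """
    if status is not UNSET:
        run.status = status
    if error_message is not UNSET:
        run.error_message = error_message
    return run
```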
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/api/routes/tts_evaluations/evaluation.py`:
- Around line 277-297: The GET handler get_tts_evaluation_run must include the
same run_metadata that start_tts_evaluation sets on TTSEvaluationRunPublic so
payloads match; update the TTSEvaluationRunWithResults construction in
get_tts_evaluation_run to populate run_metadata (e.g.,
run_metadata=run.run_metadata or the appropriate attribute on the run object)
and return it in the APIResponse (or remove from POST if you choose consistency
the other way), ensuring the field name matches TTSEvaluationRunPublic's
run_metadata.
- Line 130: Replace the HTTPException detail that interpolates the raw exception
(the raise HTTPException(status_code=500, detail=f"Batch submission failed:
{e}") line) with a generic message like "Batch submission failed" and do not
include the exception text; keep the existing server-side logging (the logger
calls around lines 120-123) to record full error details for debugging. Ensure
only the generic message is returned to the client while the original exception
is logged server-side.
- Around line 165-169: The helper _get_sample_texts_from_dataset uses
forward-string types that are unresolved by type checkers; import
EvaluationDataset and use a concrete sqlmodel Session type instead of the
FastAPI dependency alias SessionDep: add an import for EvaluationDataset (from
app.models) at the top of the file, replace the parameter type SessionDep with
Session in the _get_sample_texts_from_dataset signature, and ensure Session is
imported from sqlmodel (add it to the existing sqlmodel imports if missing).
- Around line 132-138: get_tts_run_by_id may return None but the code assumes
run is non-null; add a null guard after the refresh call that checks the run
variable returned by get_tts_run_by_id and handles the None case (e.g., raise a
clear HTTPException/return a 404 or appropriate error using the same
auth_context/project/org info) instead of dereferencing run.id later; update the
block around run = get_tts_run_by_id(...) to validate run and provide a
descriptive error path so subsequent accesses (run.id, run.status, etc.) are
safe.

In `@backend/app/crud/tts_evaluations/batch.py`:
- Around line 98-118: The config dict passed to start_batch_job uses the bare
variable model while GeminiBatchProvider was constructed with model_path =
f"models/{model}", causing create_batch (which prefers config.get("model",
self._model)) to receive an un-prefixed model name; update the config to pass
the prefixed model_path (i.e., use model_path instead of model) so the
provider/Google API gets the "models/..." name expected by
GeminiBatchProvider.create_batch; apply the same change in the STT batch code
where a bare model is passed.

In `@backend/app/crud/tts_evaluations/cron.py`:
- Around line 466-474: The string passed as the subdirectory argument to
upload_to_object_store uses an unnecessary f-string prefix; update the call in
the block that prepares audio upload (where audio_filename is set and
upload_to_object_store is invoked) to use a plain string for subdirectory
(replace f"evaluations/tts/audio" with "evaluations/tts/audio") so there is no
redundant f-prefix in the subdirectory parameter.
- Around line 506-518: The exception handler in process_completed_tts_batch
currently re-raises without rolling back the SQLAlchemy session, which can leave
session in an invalid state and cause later commits (e.g., in
poll_all_pending_tts_evaluations → update_tts_run) to fail; fix by calling
session.rollback() (or using a nested savepoint / begin_nested around the
batch-result loop) before re-raising so the session is returned to a usable
state, ensuring you reference the same session variable and keep the existing
logger.error call and raise behavior.
- Around line 275-368: The code sets any_succeeded but never uses it and may
mark runs as "completed" even when some batch jobs failed; remove the unused
any_succeeded variable and ensure final_status reflects failures: change the
finalization logic around final_status (used before calling update_tts_run and
in the return) to set final_status to "failed" (or "completed_with_errors") when
any_failed is True (instead of basing it solely on pending == 0), and then call
update_tts_run and compute action from any_failed as done now so DB state and
returned action are consistent; update references in the block that computes
final_status, error_message, update_tts_run, and the returned dict accordingly
(functions/variables to edit: any_succeeded, any_failed, final_status,
update_tts_run, and the return payload).

In `@backend/app/services/tts_evaluations/dataset.py`:
- Around line 96-145: Currently _upload_samples_to_object_store swallows all
exceptions and returns None, which allows upload_tts_dataset to create a dataset
with object_store_url=None; instead, surface the failure so the dataset creation
aborts or the caller is notified: remove the broad except swallowing or re-raise
after logging (or raise a specific UploadError) from
_upload_samples_to_object_store, and update upload_tts_dataset to catch that
specific error and return/propagate an HTTP/functional error rather than
proceeding; reference _upload_samples_to_object_store, upload_tts_dataset and
_get_sample_texts_from_dataset when making these changes.
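A minimal sketch of surfacing the failure instead of swallowing it. `UploadError`, `upload_samples_or_raise`, and the `storage.upload` interface here are illustrative stand-ins, not the repository's actual names:

```python
class UploadError(RuntimeError):
    """Raised when uploading dataset samples to the object store fails."""


def upload_samples_or_raise(storage, csv_content: str, filename: str) -> str:
    """Upload and return the object-store URL, raising on any failure."""
    try:
        url = storage.upload(content=csv_content, filename=filename)
    except Exception as exc:
        # Chain the original error so logs keep the root cause.
        raise UploadError(f"upload of {filename} failed") from exc
    if not url:
        raise UploadError(f"upload of {filename} returned no URL")
    return url
```

The caller (`upload_tts_dataset`) can then catch `UploadError` and return an error response instead of persisting a dataset row with `object_store_url=None`.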

---

Nitpick comments:
In `@backend/app/alembic/versions/047_add_tts_evaluation_tables.py`:
- Around line 133-150: The migration creates a redundant single-column index
ix_tts_result_run_id on tts_result.evaluation_run_id because the composite
indexes idx_tts_result_feedback (evaluation_run_id, is_correct) and
idx_tts_result_status (evaluation_run_id, status) already cover queries
filtering only by evaluation_run_id; remove the op.create_index call for
"ix_tts_result_run_id" from upgrade(), and also remove the corresponding
op.drop_index for "ix_tts_result_run_id" from downgrade() so the migration no
longer creates or drops that redundant index (leave idx_tts_result_feedback and
idx_tts_result_status unchanged).
- Around line 103-114: The timestamp columns inserted_at and updated_at in the
Alembic revision (columns named "inserted_at" and "updated_at") are non-nullable
but lack server defaults; update their Column definitions to include a
server_default (e.g., server_default=sa.func.now() or sa.text('now()')) so
inserts outside the ORM won't fail, and keep the existing nullable=False and
comments intact; apply this change to both Column(...) calls in this migration.
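A sketch of the suggested column definitions (column names and comments are taken from the migration; the exact table context is omitted):

```python
import sqlalchemy as sa

# Both timestamp columns gain a server-side default so inserts that bypass
# the ORM (raw SQL, other services) still succeed.
inserted_at = sa.Column(
    "inserted_at",
    sa.DateTime(),
    server_default=sa.func.now(),
    nullable=False,
    comment="Timestamp when the record was created",
)
updated_at = sa.Column(
    "updated_at",
    sa.DateTime(),
    server_default=sa.func.now(),
    nullable=False,
    comment="Timestamp when the record was last updated",
)
```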

In `@backend/app/api/routes/tts_evaluations/result.py`:
- Around line 64-83: The TTSResult→TTSResultPublic mapping is duplicated;
extract a helper function (e.g. _to_public(result: TTSResult) ->
TTSResultPublic) that centralizes building TTSResultPublic (including extracting
duration_seconds and size_bytes from result.metadata_). Replace the inline
constructions in this file’s endpoints (the current return block that creates
TTSResultPublic and the similar block at lines 109-128) with calls to
_to_public(result) so both endpoints reuse the same conversion logic.
- Around line 44-62: Remove the redundant existence check that calls
get_tts_result_by_id before calling update_tts_human_feedback: delete the block
that queries existing and raises HTTPException when falsy, and rely on
update_tts_human_feedback to perform the existence validation (or conversely, if
you prefer validation in the route, remove the internal check inside
update_tts_human_feedback); locate the redundant code by the
get_tts_result_by_id call and the subsequent if not existing / raise
HTTPException block and remove it so only one DB existence check remains for
result_id with auth_context.organization_.id and auth_context.project_.id.

In `@backend/app/crud/evaluations/cron.py`:
- Around line 49-66: Summary: Make dict access style consistent by using .get
for all summary keys. Fix: Replace direct key indexing for
processed/failed/still_processing in the computations of total_processed,
total_failed, and total_still_processing to use text_summary.get("processed",
0), stt_summary.get("processed", 0), tts_summary.get("processed", 0) (and
similar for "failed" and "still_processing"), and keep all_details using
.get("details", []); this ensures uniform access for text_summary, stt_summary,
and tts_summary and safe defaults if keys are ever missing.

In `@backend/app/crud/tts_evaluations/batch.py`:
- Around line 152-153: Replace the bare raise Exception(...) used when
batch_jobs is empty with a domain-specific exception: define a new exception
class (e.g., BatchSubmissionError) in this module or a shared exceptions module
and raise that instead (include the original error details or chain underlying
exceptions via "from" where applicable) so callers can distinguish batch
submission failures; update any upstream handlers if needed to catch
BatchSubmissionError instead of a generic Exception.
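A sketch of the domain-specific exception; the guard function is a hypothetical illustration, not the module's actual structure:

```python
class BatchSubmissionError(Exception):
    """Raised when batch jobs could not be submitted to the provider."""


def ensure_batches_submitted(batch_jobs: list) -> list:
    # Replace `raise Exception(...)` with an error callers can catch
    # selectively, without swallowing unrelated failures.
    if not batch_jobs:
        raise BatchSubmissionError("no batch jobs were created for this run")
    return batch_jobs
```

Upstream handlers can then write `except BatchSubmissionError` and translate it into an HTTP error while letting other exceptions propagate.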

In `@backend/app/crud/tts_evaluations/dataset.py`:
- Around line 117-173: Replace the manual field-by-field construction in
list_tts_datasets with Pydantic model validation/mapping to remove boilerplate:
instead of building TTSDatasetPublic(...) for each EvaluationDataset, call
TTSDatasetPublic.model_validate(dataset) (or
TTSDatasetPublic.model_validate(dataset, mode="from_attributes") / use from_orm
equivalent) to convert the ORM EvaluationDataset into the Pydantic
TTSDatasetPublic; if needed, enable TTSDatasetPublic model_config
(from_attributes/orm_mode) so model_validate/from_orm can read attributes from
EvaluationDataset, and update the mapping in list_tts_datasets (and mirror the
same change in get_dataset) to return [TTSDatasetPublic.model_validate(d) for d
in datasets] and keep the same returned total.

In `@backend/app/crud/tts_evaluations/result.py`:
- Around line 62-64: Remove the redundant explicit flush before commit: the
block that calls session.add_all(results) followed by session.flush() and
session.commit() should drop the session.flush() call because session.commit()
implicitly flushes; keep session.add_all(results) and then call session.commit()
(reference: session.add_all, session.flush, session.commit in the
tts_evaluations/result.py function).
- Around line 320-328: The dict comprehension building the status->count map can
be simplified: replace the final return that currently constructs {status: count
for status, count in rows} with a direct dict conversion of the query result
(rows) so the code returns dict(rows); locate the query using TTSResult (the
statement variable) and the rows variable set from
session.exec(statement).all(), and change only the return expression to use
dict(rows).
- Around line 177-219: Update the update_tts_result docstring to explicitly
state that this function calls session.flush() but does not commit the
transaction, so callers (e.g., process_completed_tts_batch) are responsible for
calling session.commit() when not running inside a batch; reference the function
name update_tts_result, the model TTSResult, and the use of now() to indicate
this only persists flushed changes until the caller commits.

In `@backend/app/crud/tts_evaluations/run.py`:
- Around line 178-230: The update_tts_run function currently selects
EvaluationRun only by run_id; change its lookup to also filter by
organization_id, project_id and EvaluationType.TTS (similar to get_tts_run_by_id
and list_tts_runs) so the update is properly scoped; additionally replace the
current updates dict + "if value is not None" loop (which prevents clearing
fields like error_message) with an explicit fields_to_update parameter or a
sentinel (e.g., use a special UNSET constant) and only update attributes present
in that set/whose value is not UNSET while allowing None to be written, then
persist via session.add/session.commit/session.refresh and keep logging with
update_tts_run and updated_at = now().
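A minimal sketch of the sentinel pattern, assuming a keyword-based update helper (names are illustrative):

```python
class _Unset:
    """Sentinel distinguishing 'argument not provided' from an explicit None."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()


def build_updates(**fields) -> dict:
    # Only fields the caller actually passed are applied, so an explicit
    # None (e.g. error_message=None) can clear a column.
    return {k: v for k, v in fields.items() if v is not UNSET}


# error_message=None clears the field; status stays untouched.
updates = build_updates(status=UNSET, error_message=None)
```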

In `@backend/app/models/tts_evaluation.py`:
- Around line 207-230: In TTSEvaluationRunCreate, update the validator and
default to be clearer and less brittle: rename the validator parameter in
validate_models (the method decorated with `@field_validator`("models")) from
valid_model to a conventional name like models or v to reflect it receives the
raw input, and change the models Field default_factory from the hard-coded
["gemini-2.5-pro-preview-tts"] to reuse SUPPORTED_TTS_MODELS (e.g.,
default_factory=lambda: SUPPORTED_TTS_MODELS) so the default cannot drift from
the supported list; keep the validation logic in validate_models the same but
refer to the renamed parameter.

In `@backend/app/services/tts_evaluations/audio.py`:
- Around line 3-5: The import statement for the symbol struct in
backend/app/services/tts_evaluations/audio.py is unused and should be removed to
clean up the module imports; update the top-of-file imports by deleting the
struct import so only io and wave remain (leave any other existing imports
untouched).

status="failed",
error_message=str(e),
)
raise HTTPException(status_code=500, detail=f"Batch submission failed: {e}")

⚠️ Potential issue | 🟠 Major

Internal error details leaked to the client.

f"Batch submission failed: {e}" exposes the raw exception message (which could contain credentials, internal paths, or stack details) in the HTTP 500 response body. Return a generic message and rely on the server-side log (line 120-123) for debugging.

🔧 Suggested fix
-        raise HTTPException(status_code=500, detail=f"Batch submission failed: {e}")
+        raise HTTPException(status_code=500, detail="Batch submission failed")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/tts_evaluations/evaluation.py` at line 130, Replace
the HTTPException detail that interpolates the raw exception (the raise
HTTPException(status_code=500, detail=f"Batch submission failed: {e}") line)
with a generic message like "Batch submission failed" and do not include the
exception text; keep the existing server-side logging (the logger calls around
lines 120-123) to record full error details for debugging. Ensure only the
generic message is returned to the client while the original exception is logged
server-side.

Comment on lines +132 to +138
# Refresh run to get updated status
run = get_tts_run_by_id(
session=_session,
run_id=run.id,
org_id=auth_context.organization_.id,
project_id=auth_context.project_.id,
)

⚠️ Potential issue | 🟡 Minor

Possible None dereference after refreshing the run.

get_tts_run_by_id can return None, but line 142 accesses run.id (and subsequent fields) without a null check. While unlikely to fail in practice (the run was just created), adding a guard prevents an AttributeError from surfacing as an opaque 500.

🛡️ Suggested guard
     # Refresh run to get updated status
     run = get_tts_run_by_id(
         session=_session,
         run_id=run.id,
         org_id=auth_context.organization_.id,
         project_id=auth_context.project_.id,
     )
+    if not run:
+        raise HTTPException(status_code=500, detail="Failed to refresh evaluation run")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/tts_evaluations/evaluation.py` around lines 132 - 138,
get_tts_run_by_id may return None but the code assumes run is non-null; add a
null guard after the refresh call that checks the run variable returned by
get_tts_run_by_id and handles the None case (e.g., raise a clear
HTTPException/return a 404 or appropriate error using the same
auth_context/project/org info) instead of dereferencing run.id later; update the
block around run = get_tts_run_by_id(...) to validate run and provide a
descriptive error path so subsequent accesses (run.id, run.status, etc.) are
safe.

Comment on lines +165 to +169
def _get_sample_texts_from_dataset(
session: "SessionDep",
dataset: "EvaluationDataset",
project_id: int,
) -> list[str]:

⚠️ Potential issue | 🟡 Minor

EvaluationDataset is not imported — NameError for type checkers.

The string annotation "EvaluationDataset" on line 167 avoids a runtime NameError, but the name is never imported, so type checkers (and IDEs) cannot resolve it. Import it at the top of the file.

Also, "SessionDep" on line 166 is a FastAPI dependency alias; for a plain helper function, Session (from sqlmodel) is the more accurate type.

🔧 Suggested fix

Add at the top of the file:

from app.models import EvaluationDataset

Then update the signature:

 def _get_sample_texts_from_dataset(
-    session: "SessionDep",
-    dataset: "EvaluationDataset",
+    session: Session,
+    dataset: EvaluationDataset,
     project_id: int,
 ) -> list[str]:

And add Session to the existing sqlmodel import if not already present:

from sqlmodel import Session
🧰 Tools
🪛 Ruff (0.15.1)

[error] 167-167: Undefined name EvaluationDataset

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/tts_evaluations/evaluation.py` around lines 165 - 169,
The helper _get_sample_texts_from_dataset uses forward-string types that are
unresolved by type checkers; import EvaluationDataset and use a concrete
sqlmodel Session type instead of the FastAPI dependency alias SessionDep: add an
import for EvaluationDataset (from app.models) at the top of the file, replace
the parameter type SessionDep with Session in the _get_sample_texts_from_dataset
signature, and ensure Session is imported from sqlmodel (add it to the existing
sqlmodel imports if missing).

Comment on lines +277 to +297
return APIResponse.success_response(
data=TTSEvaluationRunWithResults(
id=run.id,
run_name=run.run_name,
dataset_name=run.dataset_name,
type=run.type,
language_id=run.language_id,
models=run.providers,
dataset_id=run.dataset_id,
status=run.status,
total_items=run.total_items,
score=run.score,
error_message=run.error_message,
organization_id=run.organization_id,
project_id=run.project_id,
inserted_at=run.inserted_at,
updated_at=run.updated_at,
results=results,
results_total=results_total,
),
metadata={"results_total": results_total},

⚠️ Potential issue | 🟡 Minor

run_metadata missing from GET response, present in POST response.

start_tts_evaluation (line 153-156) includes run_metadata={"voice_name": ..., "style_prompt": ...} in TTSEvaluationRunPublic, but get_tts_evaluation_run does not populate it. Callers fetching a run after creation will see different payloads. Either include it consistently in both, or omit it from both and serve it via a dedicated field on the model.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/tts_evaluations/evaluation.py` around lines 277 - 297,
The GET handler get_tts_evaluation_run must include the same run_metadata that
start_tts_evaluation sets on TTSEvaluationRunPublic so payloads match; update
the TTSEvaluationRunWithResults construction in get_tts_evaluation_run to
populate run_metadata (e.g., run_metadata=run.run_metadata or the appropriate
attribute on the run object) and return it in the APIResponse (or remove from
POST if you choose consistency the other way), ensuring the field name matches
TTSEvaluationRunPublic's run_metadata.

Comment on lines +98 to +118
model_path = f"models/{model}"
batch_provider = GeminiBatchProvider(
client=gemini_client.client, model=model_path
)

try:
batch_job = start_batch_job(
session=session,
provider=batch_provider,
provider_name="google",
job_type="tts_evaluation",
organization_id=org_id,
project_id=project_id,
jsonl_data=jsonl_data,
config={
"model": model,
"tts_provider": model,
"evaluation_run_id": run.id,
"voice_name": DEFAULT_VOICE_NAME,
"style_prompt": DEFAULT_STYLE_PROMPT,
},

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the actual file content around lines 98-118
cat -n backend/app/crud/tts_evaluations/batch.py | sed -n '90,125p'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1512


🏁 Script executed:

# Find the GeminiBatchProvider class to see how create_batch uses the model parameter
fd -t f '*.py' | xargs grep -l "class GeminiBatchProvider" | head -1

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 358


🏁 Script executed:

# Search for GeminiBatchProvider.create_batch method
rg -n "def create_batch" --type=py -A20 backend/app/core/batch/

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 4866


🏁 Script executed:

# Look for DEFAULT_MODEL in GeminiBatchProvider
rg -n "DEFAULT_MODEL\|class GeminiBatchProvider" --type=py -B5 -A15

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Check how STT batch code handles model paths for comparison
rg -n "model_path\|models/" --type=py backend/app/crud/tts_evaluations/ -C2

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Get the full GeminiBatchProvider class to see DEFAULT_MODEL and the implementation
cat -n backend/app/core/batch/gemini.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 20920


🏁 Script executed:

# Check how create_batch uses the model variable
sed -n '68,120p' backend/app/core/batch/gemini.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2177


🏁 Script executed:

# Search for other uses of GeminiBatchProvider to see how model is typically passed
rg -n "GeminiBatchProvider" --type=py -B3 -A3

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 17820


🏁 Script executed:

# Look for how the model is actually used in the batch request creation
sed -n '88,115p' backend/app/core/batch/gemini.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1076


🏁 Script executed:

# Check the STT batch code to see how it handles the config model parameter
sed -n '138,170p' backend/app/crud/stt_evaluations/batch.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1119


🏁 Script executed:

# Look at the full context of config in STT batch
sed -n '140,180p' backend/app/crud/stt_evaluations/batch.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1457


Model name inconsistency: config passes bare model name while provider expects "models/" prefix.

Line 98 constructs model_path = f"models/{model}" and passes it to GeminiBatchProvider, but line 113 passes the bare model (without "models/" prefix) in the config dict. Inside GeminiBatchProvider.create_batch (line 88), the config value takes priority (config.get("model", self._model)), so the Gemini API receives the model name without the prefix at line 109—contradicting the provider's DEFAULT_MODEL = "models/gemini-2.5-pro" convention. This will cause the batch job creation to fail with an invalid model error.

The same pattern exists in the STT batch code (backend/app/crud/stt_evaluations/batch.py line 154), indicating a systemic issue.

🐛 Suggested fix — use the prefixed path in config
         model_path = f"models/{model}"
         batch_provider = GeminiBatchProvider(
             client=gemini_client.client, model=model_path
         )

         try:
             batch_job = start_batch_job(
                 session=session,
                 provider=batch_provider,
                 provider_name="google",
                 job_type="tts_evaluation",
                 organization_id=org_id,
                 project_id=project_id,
                 jsonl_data=jsonl_data,
                 config={
-                    "model": model,
+                    "model": model_path,
                     "tts_provider": model,
                     "evaluation_run_id": run.id,
                     "voice_name": DEFAULT_VOICE_NAME,
                     "style_prompt": DEFAULT_STYLE_PROMPT,
                 },
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/tts_evaluations/batch.py` around lines 98 - 118, The config
dict passed to start_batch_job uses the bare variable model while
GeminiBatchProvider was constructed with model_path = f"models/{model}", causing
create_batch (which prefers config.get("model", self._model)) to receive an
un-prefixed model name; update the config to pass the prefixed model_path (i.e.,
use model_path instead of model) so the provider/Google API gets the
"models/..." name expected by GeminiBatchProvider.create_batch; apply the same
change in the STT batch code where a bare model is passed.

Comment on lines +275 to +368
all_terminal = True
any_succeeded = False
any_failed = False
errors: list[str] = []

for batch_job in batch_jobs:
provider_name = batch_job.config.get("tts_provider", "unknown")

# Skip batch jobs already in terminal state
if batch_job.provider_status in TERMINAL_STATES:
if batch_job.provider_status == BatchJobState.SUCCEEDED.value:
any_succeeded = True
else:
any_failed = True
errors.append(
f"{provider_name}: {batch_job.error_message or batch_job.provider_status}"
)
continue

# Poll batch job status
poll_batch_status(
session=session,
provider=batch_provider,
batch_job=batch_job,
)

session.refresh(batch_job)
provider_status = batch_job.provider_status

logger.info(
f"[poll_tts_run] {log_prefix} Batch status | "
f"batch_job_id={batch_job.id} | provider={provider_name} | "
f"state={provider_status}"
)

if provider_status not in TERMINAL_STATES:
all_terminal = False
continue

# Batch reached terminal state - process it
if provider_status == BatchJobState.SUCCEEDED.value:
await process_completed_tts_batch(
session=session,
run=run,
batch_job=batch_job,
batch_provider=batch_provider,
)
any_succeeded = True
else:
any_failed = True
errors.append(
f"{provider_name}: {batch_job.error_message or provider_status}"
)

if not all_terminal:
return {
"run_id": run.id,
"run_name": run.run_name,
"type": "tts",
"previous_status": previous_status,
"current_status": run.status,
"action": "no_change",
}

# All batch jobs are done - finalize the run
status_counts = count_results_by_status(session=session, run_id=run.id)
pending = status_counts.get(JobStatus.PENDING.value, 0)
failed_count = status_counts.get(JobStatus.FAILED.value, 0)

final_status = "completed" if pending == 0 else "processing"
error_message = None
if any_failed:
error_message = "; ".join(errors)
elif failed_count > 0:
error_message = f"{failed_count} synthesis(es) failed"

update_tts_run(
session=session,
run_id=run.id,
status=final_status,
error_message=error_message,
)

action = "completed" if not any_failed else "failed"

return {
"run_id": run.id,
"run_name": run.run_name,
"type": "tts",
"previous_status": previous_status,
"current_status": final_status,
"action": action,
**({"error": error_message} if error_message else {}),
}

⚠️ Potential issue | 🟠 Major

Run finalization logic has inconsistencies: any_succeeded is unused; final_status can be "completed" even when batch jobs failed.

Two issues:

  1. any_succeeded (Lines 276, 286, 322) is assigned but never read. The static analysis flag is correct.

  2. When any_failed is True but all results are non-pending, final_status is set to "completed" (Line 344) while action is "failed" (Line 358). A run with failed batch jobs shouldn't be marked "completed" in the DB—consider setting final_status = "failed" (or a distinct state like "completed_with_errors") when any_failed is True.

Proposed fix
     all_terminal = True
-    any_succeeded = False
     any_failed = False
     errors: list[str] = []
 
     for batch_job in batch_jobs:
         provider_name = batch_job.config.get("tts_provider", "unknown")
 
         # Skip batch jobs already in terminal state
         if batch_job.provider_status in TERMINAL_STATES:
-            if batch_job.provider_status == BatchJobState.SUCCEEDED.value:
-                any_succeeded = True
-            else:
+            if batch_job.provider_status != BatchJobState.SUCCEEDED.value:
                 any_failed = True
                 errors.append(
                     f"{provider_name}: {batch_job.error_message or batch_job.provider_status}"
                 )
             continue
         ...
         if provider_status == BatchJobState.SUCCEEDED.value:
             await process_completed_tts_batch(...)
-            any_succeeded = True
         else:
             any_failed = True
             ...
 
     ...
-    final_status = "completed" if pending == 0 else "processing"
+    if any_failed:
+        final_status = "failed"
+    elif pending == 0:
+        final_status = "completed"
+    else:
+        final_status = "processing"
     ...
-    action = "completed" if not any_failed else "failed"
+    action = final_status
🧰 Tools
🪛 Ruff (0.15.1)

[error] 322-322: Local variable any_succeeded is assigned to but never used

Remove assignment to unused variable any_succeeded

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/tts_evaluations/cron.py` around lines 275 - 368, The code
sets any_succeeded but never uses it and may mark runs as "completed" even when
some batch jobs failed; remove the unused any_succeeded variable and ensure
final_status reflects failures: change the finalization logic around
final_status (used before calling update_tts_run and in the return) to set
final_status to "failed" (or "completed_with_errors") when any_failed is True
(instead of basing it solely on pending == 0), and then call update_tts_run and
compute action from any_failed as done now so DB state and returned action are
consistent; update references in the block that computes final_status,
error_message, update_tts_run, and the returned dict accordingly
(functions/variables to edit: any_succeeded, any_failed, final_status,
update_tts_run, and the return payload).
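The proposed finalization logic can be sketched as a small pure function (the status strings follow the diff above; the helper name is illustrative):

```python
def finalize_status(pending: int, any_failed: bool) -> str:
    # Failures take precedence, so the DB never records "completed" for a
    # run whose batch jobs failed.
    if any_failed:
        return "failed"
    return "completed" if pending == 0 else "processing"
```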

Comment on lines +466 to +474
# Upload WAV to S3
audio_filename = f"{uuid.uuid4()}.wav"
audio_url = upload_to_object_store(
storage=storage,
content=wav_data,
filename=audio_filename,
subdirectory=f"evaluations/tts/audio",
content_type="audio/wav",
)

⚠️ Potential issue | 🟡 Minor

Remove extraneous f prefix from string literal without placeholders.

Line 472: f"evaluations/tts/audio" contains no interpolation placeholders.

Fix
                     audio_url = upload_to_object_store(
                         storage=storage,
                         content=wav_data,
                         filename=audio_filename,
-                        subdirectory=f"evaluations/tts/audio",
+                        subdirectory="evaluations/tts/audio",
                         content_type="audio/wav",
                     )
🧰 Tools
🪛 Ruff (0.15.1)

[error] 472-472: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/tts_evaluations/cron.py` around lines 466 - 474, The string
passed as the subdirectory argument to upload_to_object_store uses an
unnecessary f-string prefix; update the call in the block that prepares audio
upload (where audio_filename is set and upload_to_object_store is invoked) to
use a plain string for subdirectory (replace f"evaluations/tts/audio" with
"evaluations/tts/audio") so there is no redundant f-prefix in the subdirectory
parameter.

Comment on lines +506 to +518
error_message=batch_result.get("error", "Unknown error"),
)
failed_count += 1

session.commit()

except Exception as e:
logger.error(
f"[process_completed_tts_batch] Failed to process batch results | "
f"batch_job_id={batch_job.id}, error={str(e)}",
exc_info=True,
)
raise

⚠️ Potential issue | 🟠 Major

No session.rollback() before re-raising; partial flushes may leave the session in an inconsistent state.

If the exception on Line 512 is a database-level error (e.g., IntegrityError), the session is invalidated but never rolled back. The caller's subsequent update_tts_run → session.commit() (in poll_all_pending_tts_evaluations) will then also fail, and the run won't be marked as "failed".

Consider adding a rollback before re-raising, or wrapping the batch-result processing in a nested savepoint so partial progress can be preserved while still leaving the session usable.

Minimal fix: rollback before re-raise
     except Exception as e:
         logger.error(
             f"[process_completed_tts_batch] Failed to process batch results | "
             f"batch_job_id={batch_job.id}, error={str(e)}",
             exc_info=True,
         )
+        session.rollback()
         raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/tts_evaluations/cron.py` around lines 506 - 518, The
exception handler in process_completed_tts_batch currently re-raises without
rolling back the SQLAlchemy session, which can leave session in an invalid state
and cause later commits (e.g., in poll_all_pending_tts_evaluations →
update_tts_run) to fail; fix by calling session.rollback() (or using a nested
savepoint / begin_nested around the batch-result loop) before re-raising so the
session is returned to a usable state, ensuring you reference the same session
variable and keep the existing logger.error call and raise behavior.

Comment on lines +96 to +145
def _upload_samples_to_object_store(
    session: Session,
    project_id: int,
    dataset_name: str,
    samples: list[TTSSampleCreate],
) -> str | None:
    """Upload TTS samples as CSV to object store.

    Args:
        session: Database session
        project_id: Project ID for storage credentials
        dataset_name: Dataset name for filename
        samples: List of samples to upload

    Returns:
        Object store URL if successful, None otherwise
    """
    try:
        storage = get_cloud_storage(session=session, project_id=project_id)

        csv_content = _samples_to_csv(samples)

        filename = generate_timestamped_filename(dataset_name, "csv")
        object_store_url = upload_to_object_store(
            storage=storage,
            content=csv_content,
            filename=filename,
            subdirectory="tts_datasets",
            content_type="text/csv",
        )

        if object_store_url:
            logger.info(
                f"[_upload_samples_to_object_store] Upload successful | "
                f"url={object_store_url}"
            )
        else:
            logger.info(
                "[_upload_samples_to_object_store] Upload returned None | "
                "continuing without object store storage"
            )

        return object_store_url

    except Exception as e:
        logger.warning(
            f"[_upload_samples_to_object_store] Failed to upload | {e}",
            exc_info=True,
        )
        return None

⚠️ Potential issue | 🟡 Minor

Silent upload failure creates a dataset that cannot be used for evaluation runs.

If _upload_samples_to_object_store fails (lines 140-145), it returns None and upload_tts_dataset still creates a dataset record with object_store_url=None. Downstream, _get_sample_texts_from_dataset in evaluation.py (line 180) returns an empty list when object_store_url is absent, causing any evaluation run against this dataset to fail with "Could not load samples from dataset".

Consider either propagating the upload error so the dataset isn't created at all, or at minimum warning the caller in the response.
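One way to surface the failure — a sketch with a hypothetical UploadError and an injected upload callable standing in for the real upload_to_object_store; none of these names are from the PR:

```python
class UploadError(RuntimeError):
    """Raised when the object-store upload fails or returns no URL (hypothetical)."""


def upload_samples_or_raise(upload_fn, csv_content: str, filename: str) -> str:
    """Wrap the upload so callers either get a usable URL or an exception,
    never a silent None that produces an unusable dataset record."""
    try:
        url = upload_fn(csv_content, filename)
    except Exception as exc:
        raise UploadError(f"upload of {filename} failed") from exc
    if url is None:
        raise UploadError(f"upload of {filename} returned no URL")
    return url
```

upload_tts_dataset could then catch UploadError and return an error response instead of creating the dataset row.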

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/tts_evaluations/dataset.py` around lines 96 - 145,
Currently _upload_samples_to_object_store swallows all exceptions and returns
None, which allows upload_tts_dataset to create a dataset with
object_store_url=None; instead, surface the failure so the dataset creation
aborts or the caller is notified: remove the broad except swallowing or re-raise
after logging (or raise a specific UploadError) from
_upload_samples_to_object_store, and update upload_tts_dataset to catch that
specific error and return/propagate an HTTP/functional error rather than
proceeding; reference _upload_samples_to_object_store, upload_tts_dataset and
_get_sample_texts_from_dataset when making these changes.


Labels

enhancement New feature or request
