Fix: Resolve macOS FileNotFoundError in LLM_PARSE for docx (Issue #177) by notsooamit · Pull Request #178 · oidlabs-com/Lexoid

notsooamit · 2026-04-21T18:31:47Z

Fixes Issue #177

Description:
The test_parsing_docx_type test was failing for LLM_PARSE on macOS with a FileNotFoundError, and on native Windows due to COM object pathing constraints. While fixing this, I also fortified the test suite to not require live API keys and patched a fragile bounding-box check.

Changes Included:

Core Pathing Fix (conversion_utils.py): Updated convert_doc_to_pdf to dynamically resolve input_path and temp_dir using os.path.abspath(). The underlying docx2pdf utility relies on AppleScript (macOS) and COM objects (Windows), both of which strictly require absolute file paths. Feeding them relative paths caused silent failures.
Test Suite Stabilization (test_parser.py): Added a @patch to mock parse_llm_doc. Previously, running this test locally without valid API keys in a .env file would trigger a live network failure. This mock ensures the .docx to .pdf conversion is still rigorously tested without hitting API rate limits or requiring local credentials.
API Robustness (api.py): Wrapped the fragile bounding box check (result["segments"][0].get("bboxes")) in a safe condition. If an LLM parse fails or returns an empty segment list, the router now gracefully triggers fallback logic instead of crashing with an IndexError.

Testing:
Verified locally on a native Linux environment (Ubuntu/WSL) and natively on Windows. Test suite passes 100%

Copilot

Pull request overview

This PR addresses docx→pdf conversion path issues that caused LLM_PARSE failures on macOS/Windows, and hardens parsing behavior and tests around those failures (Issue #177).

Changes:

Ensure .doc/.docx→PDF conversion uses absolute paths for compatibility with docx2pdf on macOS/Windows.
Mock parse_llm_doc in the docx parsing test to avoid requiring live LLM credentials for that specific test.
Prevent IndexError when checking for bounding boxes if a parse returns an empty segments list.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`tests/test_parser.py`	Adds mocking for `parse_llm_doc` in the docx test to avoid live LLM calls for that case.
`lexoid/core/conversion_utils.py`	Converts docx conversion inputs (`input_path`, `temp_dir`) to absolute paths before calling platform-specific converters.
`lexoid/api.py`	Guards bbox detection against empty `segments` to avoid crashes during bbox routing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-21T18:36:03Z

+@pytest.mark.asyncio
+@pytest.mark.parametrize("model", models)
+async def test_llm_parse(model):
+    input_data = "examples/inputs/test_1.pdf"
+    expected_ouput_path = "examples/outputs/test_1.md"
+    config = {"parser_type": "LLM_PARSE", "model": model, "verbose": True}
+    result = parse(input_data, **config)["raw"]


The PR description says the test suite no longer requires live API keys, but test_parser.py still includes multiple LLM_PARSE tests (e.g., test_llm_parse, test_jpg_parse, token usage/cost tests) that will hit real providers unless keys are configured. If the intent is to make the default test run keyless/offline, consider skipping these tests when required env vars are missing (or isolating them under an integration marker) and keeping only the conversion/pathing check mocked.

Copilot · 2026-04-21T18:36:04Z

+    input_data = "examples/inputs/test_1.pdf"
+    expected_ouput_path = "examples/outputs/test_1.md"
+    config = {"parser_type": "LLM_PARSE", "model": model, "verbose": True}
+    result = parse(input_data, **config)["raw"]
+    assert isinstance(result, str)
+
+    # Compare the result with the expected output
+    expected_ouput = open(expected_ouput_path, "r").read()
+    # save the result to a file


Typo in variable names (expected_ouput_path / expected_ouput) makes the test harder to read/grep and is easy to fix while touching this file; consider renaming to expected_output_path / expected_output.

Copilot · 2026-04-21T18:36:04Z

+    return_bboxes = kwargs.get("return_bboxes", False)
+    has_bboxes = False
+    if result.get("segments"):
+        has_bboxes = bool(result["segments"][0].get("bboxes"))
+    bbox_framework = kwargs.get("bbox_framework", None)
+    framework = kwargs.get("framework", DEFAULT_STATIC_FRAMEWORK)
+    bbox_framework_different = bbox_framework and bbox_framework != framework
+    if return_bboxes and (not has_bboxes or bbox_framework_different):


The new has_bboxes guard avoids an IndexError when segments is empty, but it also means parse_chunk will now silently return an error/empty result from parse_llm_doc instead of raising and triggering retry_with_different_parser_type fallback. Since parse_llm_doc can return {..., "segments": [], "error": ...} on repeated failures, consider explicitly detecting result.get("error") or not result.get("segments") and raising a controlled exception (or directly retrying with STATIC_PARSE when routed=True) so AUTO routing can still fall back reliably.

pramitchoudhary · 2026-04-23T01:27:18Z

@notsooamit thanks again for looking into this. 🙏

It looks like the formatting or something else is causing most of the changes to appear as new files.
This might be difficult to verify.
The project is using Ruff, which mimics black closely, if I think.

Sharing .vscode config can be dumped settings.json
.vscode/settings.json

{
  "[python]": {
        "editor.defaultFormatter": "charliermarsh.ruff",
        "editor.codeActionsOnSave": {
            // Organize imports using Ruff on save
            "source.organizeImports.ruff": "explicit",
            // Apply all auto-fixable problems on save
            "source.fixAll": "explicit"
        }
    },
    "python.linting.enabled": true,
    "python.linting.ruffEnabled": true,
    "python.linting.lintOnSave": true,
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": false,
    "ruff.lint.args": ["--select=E,W,F,I001,PL", "--ignore=E203", "--per-file-ignores=**/__init__.py:F401"],
    "python.formatting.blackArgs": ["--line-length", "120"],
    "files.autoSave": "onWindowChange",
    "files.trimTrailingWhitespace": true,
    "git.ignoreLimitWarning": true,
    "cSpell.words": [
      "sonner"
    ],
}

notsooamit · 2026-04-23T16:28:52Z

Hey @pramitchoudhary! You were completely right, my local formatter ran with an 88-character limit instead of the project's 120-character limit, causing a massive formatting diff.

I reset the branch to upstream/main to wipe out the formatting artifacts and surgically re-applied the fixes. The PR diff is now clean (+26 -8) and ready for review!

notsooamit · 2026-04-23T16:51:02Z

Just a quick note on the second commit: While doing native Windows testing, I realized that if os.path.abspath(temp_dir) is called after temp_path is constructed, docx2pdf is still handed a relative output path, causing Microsoft Word to crash upon saving. The second commit simply moves the path resolution to the top of the function so both the input and output paths are strictly absolute before the COM object ever sees them!

pramitchoudhary · 2026-04-26T03:49:04Z

Hi @notsooamit proposed solution is not ideal.
Will it be possible to adopt the following suggestions and verify?

def _is_valid_pdf(path: str) -> bool:
    """Return True when the PDF exists and is non-empty."""
    return os.path.exists(path) and os.path.getsize(path) > 0


def _convert_with_soffice(input_path: str, out_dir: str) -> None:
    """Convert DOC/DOCX to PDF using LibreOffice-compatible binaries."""
    soffice_bin = shutil.which("soffice") or shutil.which("lowriter")
    if not soffice_bin:
        raise RuntimeError("LibreOffice binary not found (soffice/lowriter).")

    subprocess.run(
        [
            soffice_bin,
            "--headless",
            "--convert-to",
            "pdf",
            "--outdir",
            out_dir,
            input_path,
        ],
        check=True,
    )


def convert_doc_to_pdf(input_path: str, temp_dir: str) -> str:
    input_path = os.path.abspath(input_path)
    temp_dir = os.path.abspath(temp_dir)
    os.makedirs(temp_dir, exist_ok=True)
    doc_name = os.path.basename(input_path)

    temp_path = os.path.join(
        temp_dir, os.path.splitext(os.path.basename(input_path))[0] + ".pdf"
    )
    logger.info(f"DOCX->PDF conversion started for {doc_name}")

    # Linux: use LibreOffice path only.
    if "linux" in sys.platform.lower():
        _convert_with_soffice(input_path, temp_dir)
        if not _is_valid_pdf(temp_path):
            raise RuntimeError(
                f"LibreOffice conversion finished, but PDF is missing/empty: {temp_path}."
            )
        return temp_path

    # Non-Linux (macOS/Windows): try docx2pdf first, then fallback to soffice.
    try:
        logger.debug(f"Converting document {doc_name} to PDF using docx2pdf...")
        docx2pdf.convert(input_path, temp_path)
    except Exception as docx_err:
        logger.warning(
            f"docx2pdf failed for {doc_name}, will fallback to soffice: {docx_err}"
        )

    if _is_valid_pdf(temp_path):
        return temp_path

    logger.debug("Falling back to LibreOffice conversion via soffice/lowriter...")
    _convert_with_soffice(input_path, temp_dir)

    if not _is_valid_pdf(temp_path):
        raise RuntimeError(
            "DOCX to PDF conversion failed with both docx2pdf and soffice. "
            f"Expected output: {temp_path}."
        )

    return temp_path

test_parser.py: Remove mock parsing calls that will hide the issue.

pramitchoudhary · 2026-04-26T03:51:20Z

@dilithjay also can we swtich to soffice instead of lowriter?

Copilot AI review requested due to automatic review settings April 21, 2026 18:31

Copilot started reviewing on behalf of notsooamit April 21, 2026 18:32 View session

Copilot AI reviewed Apr 21, 2026

View reviewed changes

notsooamit force-pushed the fix-macos-docx-sandbox branch from 8815af4 to ac7d0ac Compare April 21, 2026 18:43

pramitchoudhary assigned notsooamit Apr 23, 2026

pramitchoudhary added the bug Something isn't working label Apr 23, 2026

pramitchoudhary requested a review from dilithjay April 23, 2026 01:20

pramitchoudhary linked an issue Apr 23, 2026 that may be closed by this pull request

Test fails on test_parsing_docx_type #177

Open

notsooamit added 2 commits April 23, 2026 14:56

fix: resolve macOS FileNotFoundError and stabilize API error handling

14902a6

fix: resolve macOS FileNotFoundError and stabilize API error handling

e543390

notsooamit force-pushed the fix-macos-docx-sandbox branch from ac7d0ac to e543390 Compare April 23, 2026 16:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Resolve macOS FileNotFoundError in LLM_PARSE for docx (Issue #177)#178

Fix: Resolve macOS FileNotFoundError in LLM_PARSE for docx (Issue #177)#178
notsooamit wants to merge 2 commits into
oidlabs-com:mainfrom
notsooamit:fix-macos-docx-sandbox

notsooamit commented Apr 21, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

pramitchoudhary commented Apr 23, 2026

Uh oh!

notsooamit commented Apr 23, 2026

Uh oh!

notsooamit commented Apr 23, 2026

Uh oh!

pramitchoudhary commented Apr 26, 2026

Uh oh!

pramitchoudhary commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

notsooamit commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

pramitchoudhary commented Apr 23, 2026

Uh oh!

notsooamit commented Apr 23, 2026

Uh oh!

notsooamit commented Apr 23, 2026

Uh oh!

pramitchoudhary commented Apr 26, 2026

Uh oh!

pramitchoudhary commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

notsooamit commented Apr 21, 2026 •

edited

Loading