Skip to content

Fix: Resolve macOS FileNotFoundError in LLM_PARSE for docx (Issue #177)#178

Open
notsooamit wants to merge 2 commits into
oidlabs-com:mainfrom
notsooamit:fix-macos-docx-sandbox
Open

Fix: Resolve macOS FileNotFoundError in LLM_PARSE for docx (Issue #177)#178
notsooamit wants to merge 2 commits into
oidlabs-com:mainfrom
notsooamit:fix-macos-docx-sandbox

Conversation

@notsooamit
Copy link
Copy Markdown

@notsooamit notsooamit commented Apr 21, 2026

Fixes Issue #177

Description:
The test_parsing_docx_type test was failing for LLM_PARSE on macOS with a FileNotFoundError, and on native Windows due to COM object pathing constraints. While fixing this, I also fortified the test suite to not require live API keys and patched a fragile bounding-box check.

Changes Included:

  1. Core Pathing Fix (conversion_utils.py): Updated convert_doc_to_pdf to dynamically resolve input_path and temp_dir using os.path.abspath(). The underlying docx2pdf utility relies on AppleScript (macOS) and COM objects (Windows), both of which strictly require absolute file paths. Feeding them relative paths caused silent failures.
  2. Test Suite Stabilization (test_parser.py): Added a @patch to mock parse_llm_doc. Previously, running this test locally without valid API keys in a .env file would trigger a live network failure. This mock ensures the .docx to .pdf conversion is still rigorously tested without hitting API rate limits or requiring local credentials.
  3. API Robustness (api.py): Wrapped the fragile bounding box check (result["segments"][0].get("bboxes")) in a safe condition. If an LLM parse fails or returns an empty segment list, the router now gracefully triggers fallback logic instead of crashing with an IndexError.

Testing:
Verified locally on a native Linux environment (Ubuntu/WSL) and natively on Windows. Test suite passes 100%

Copilot AI review requested due to automatic review settings April 21, 2026 18:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses docx→pdf conversion path issues that caused LLM_PARSE failures on macOS/Windows, and hardens parsing behavior and tests around those failures (Issue #177).

Changes:

  • Ensure .doc/.docx→PDF conversion uses absolute paths for compatibility with docx2pdf on macOS/Windows.
  • Mock parse_llm_doc in the docx parsing test to avoid requiring live LLM credentials for that specific test.
  • Prevent IndexError when checking for bounding boxes if a parse returns an empty segments list.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/test_parser.py Adds mocking for parse_llm_doc in the docx test to avoid live LLM calls for that case.
lexoid/core/conversion_utils.py Converts docx conversion inputs (input_path, temp_dir) to absolute paths before calling platform-specific converters.
lexoid/api.py Guards bbox detection against empty segments to avoid crashes during bbox routing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_parser.py Outdated
Comment on lines +27 to +33
@pytest.mark.asyncio
@pytest.mark.parametrize("model", models)
async def test_llm_parse(model):
input_data = "examples/inputs/test_1.pdf"
expected_ouput_path = "examples/outputs/test_1.md"
config = {"parser_type": "LLM_PARSE", "model": model, "verbose": True}
result = parse(input_data, **config)["raw"]
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says the test suite no longer requires live API keys, but test_parser.py still includes multiple LLM_PARSE tests (e.g., test_llm_parse, test_jpg_parse, token usage/cost tests) that will hit real providers unless keys are configured. If the intent is to make the default test run keyless/offline, consider skipping these tests when required env vars are missing (or isolating them under an integration marker) and keeping only the conversion/pathing check mocked.

Copilot uses AI. Check for mistakes.
Comment thread tests/test_parser.py Outdated
Comment on lines +30 to +38
input_data = "examples/inputs/test_1.pdf"
expected_ouput_path = "examples/outputs/test_1.md"
config = {"parser_type": "LLM_PARSE", "model": model, "verbose": True}
result = parse(input_data, **config)["raw"]
assert isinstance(result, str)

# Compare the result with the expected output
expected_ouput = open(expected_ouput_path, "r").read()
# save the result to a file
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in variable names (expected_ouput_path / expected_ouput) makes the test harder to read/grep and is easy to fix while touching this file; consider renaming to expected_output_path / expected_output.

Copilot uses AI. Check for mistakes.
Comment thread lexoid/api.py Outdated
Comment on lines +159 to +166
return_bboxes = kwargs.get("return_bboxes", False)
has_bboxes = False
if result.get("segments"):
has_bboxes = bool(result["segments"][0].get("bboxes"))
bbox_framework = kwargs.get("bbox_framework", None)
framework = kwargs.get("framework", DEFAULT_STATIC_FRAMEWORK)
bbox_framework_different = bbox_framework and bbox_framework != framework
if return_bboxes and (not has_bboxes or bbox_framework_different):
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new has_bboxes guard avoids an IndexError when segments is empty, but it also means parse_chunk will now silently return an error/empty result from parse_llm_doc instead of raising and triggering retry_with_different_parser_type fallback. Since parse_llm_doc can return {..., "segments": [], "error": ...} on repeated failures, consider explicitly detecting result.get("error") or not result.get("segments") and raising a controlled exception (or directly retrying with STATIC_PARSE when routed=True) so AUTO routing can still fall back reliably.

Copilot uses AI. Check for mistakes.
@notsooamit notsooamit force-pushed the fix-macos-docx-sandbox branch from 8815af4 to ac7d0ac Compare April 21, 2026 18:43
@pramitchoudhary pramitchoudhary added the bug Something isn't working label Apr 23, 2026
@pramitchoudhary
Copy link
Copy Markdown
Contributor

@notsooamit thanks again for looking into this. 🙏

It looks like the formatting or something else is causing most of the changes to appear as new files.
This might be difficult to verify.
The project is using Ruff, which mimics black closely, if I think.

Sharing .vscode config can be dumped settings.json
.vscode/settings.json

{
  "[python]": {
        "editor.defaultFormatter": "charliermarsh.ruff",
        "editor.codeActionsOnSave": {
            // Organize imports using Ruff on save
            "source.organizeImports.ruff": "explicit",
            // Apply all auto-fixable problems on save
            "source.fixAll": "explicit"
        }
    },
    "python.linting.enabled": true,
    "python.linting.ruffEnabled": true,
    "python.linting.lintOnSave": true,
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": false,
    "ruff.lint.args": ["--select=E,W,F,I001,PL", "--ignore=E203", "--per-file-ignores=**/__init__.py:F401"],
    "python.formatting.blackArgs": ["--line-length", "120"],
    "files.autoSave": "onWindowChange",
    "files.trimTrailingWhitespace": true,
    "git.ignoreLimitWarning": true,
    "cSpell.words": [
      "sonner"
    ],
}

@pramitchoudhary pramitchoudhary linked an issue Apr 23, 2026 that may be closed by this pull request
@notsooamit notsooamit force-pushed the fix-macos-docx-sandbox branch from ac7d0ac to e543390 Compare April 23, 2026 16:07
@notsooamit
Copy link
Copy Markdown
Author

Hey @pramitchoudhary! You were completely right, my local formatter ran with an 88-character limit instead of the project's 120-character limit, causing a massive formatting diff.

I reset the branch to upstream/main to wipe out the formatting artifacts and surgically re-applied the fixes. The PR diff is now clean (+26 -8) and ready for review!

@notsooamit
Copy link
Copy Markdown
Author

Just a quick note on the second commit: While doing native Windows testing, I realized that if os.path.abspath(temp_dir) is called after temp_path is constructed, docx2pdf is still handed a relative output path, causing Microsoft Word to crash upon saving. The second commit simply moves the path resolution to the top of the function so both the input and output paths are strictly absolute before the COM object ever sees them!

@pramitchoudhary
Copy link
Copy Markdown
Contributor

Hi @notsooamit proposed solution is not ideal.
Will it be possible to adopt the following suggestions and verify?

def _is_valid_pdf(path: str) -> bool:
    """Return True when the PDF exists and is non-empty."""
    return os.path.exists(path) and os.path.getsize(path) > 0


def _convert_with_soffice(input_path: str, out_dir: str) -> None:
    """Convert DOC/DOCX to PDF using LibreOffice-compatible binaries."""
    soffice_bin = shutil.which("soffice") or shutil.which("lowriter")
    if not soffice_bin:
        raise RuntimeError("LibreOffice binary not found (soffice/lowriter).")

    subprocess.run(
        [
            soffice_bin,
            "--headless",
            "--convert-to",
            "pdf",
            "--outdir",
            out_dir,
            input_path,
        ],
        check=True,
    )


def convert_doc_to_pdf(input_path: str, temp_dir: str) -> str:
    input_path = os.path.abspath(input_path)
    temp_dir = os.path.abspath(temp_dir)
    os.makedirs(temp_dir, exist_ok=True)
    doc_name = os.path.basename(input_path)

    temp_path = os.path.join(
        temp_dir, os.path.splitext(os.path.basename(input_path))[0] + ".pdf"
    )
    logger.info(f"DOCX->PDF conversion started for {doc_name}")

    # Linux: use LibreOffice path only.
    if "linux" in sys.platform.lower():
        _convert_with_soffice(input_path, temp_dir)
        if not _is_valid_pdf(temp_path):
            raise RuntimeError(
                f"LibreOffice conversion finished, but PDF is missing/empty: {temp_path}."
            )
        return temp_path

    # Non-Linux (macOS/Windows): try docx2pdf first, then fallback to soffice.
    try:
        logger.debug(f"Converting document {doc_name} to PDF using docx2pdf...")
        docx2pdf.convert(input_path, temp_path)
    except Exception as docx_err:
        logger.warning(
            f"docx2pdf failed for {doc_name}, will fallback to soffice: {docx_err}"
        )

    if _is_valid_pdf(temp_path):
        return temp_path

    logger.debug("Falling back to LibreOffice conversion via soffice/lowriter...")
    _convert_with_soffice(input_path, temp_dir)

    if not _is_valid_pdf(temp_path):
        raise RuntimeError(
            "DOCX to PDF conversion failed with both docx2pdf and soffice. "
            f"Expected output: {temp_path}."
        )

    return temp_path
  1. test_parser.py: Remove mock parsing calls that will hide the issue.

@pramitchoudhary
Copy link
Copy Markdown
Contributor

@dilithjay also can we swtich to soffice instead of lowriter?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test fails on test_parsing_docx_type

3 participants