Fix: Resolve macOS FileNotFoundError in LLM_PARSE for docx (Issue #177)#178
Fix: Resolve macOS FileNotFoundError in LLM_PARSE for docx (Issue #177)#178notsooamit wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses docx→pdf conversion path issues that caused LLM_PARSE failures on macOS/Windows, and hardens parsing behavior and tests around those failures (Issue #177).
Changes:
- Ensure
.doc/.docx→PDF conversion uses absolute paths for compatibility withdocx2pdfon macOS/Windows. - Mock
parse_llm_docin the docx parsing test to avoid requiring live LLM credentials for that specific test. - Prevent
IndexErrorwhen checking for bounding boxes if a parse returns an emptysegmentslist.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
tests/test_parser.py |
Adds mocking for parse_llm_doc in the docx test to avoid live LLM calls for that case. |
lexoid/core/conversion_utils.py |
Converts docx conversion inputs (input_path, temp_dir) to absolute paths before calling platform-specific converters. |
lexoid/api.py |
Guards bbox detection against empty segments to avoid crashes during bbox routing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @pytest.mark.asyncio | ||
| @pytest.mark.parametrize("model", models) | ||
| async def test_llm_parse(model): | ||
| input_data = "examples/inputs/test_1.pdf" | ||
| expected_ouput_path = "examples/outputs/test_1.md" | ||
| config = {"parser_type": "LLM_PARSE", "model": model, "verbose": True} | ||
| result = parse(input_data, **config)["raw"] |
There was a problem hiding this comment.
The PR description says the test suite no longer requires live API keys, but test_parser.py still includes multiple LLM_PARSE tests (e.g., test_llm_parse, test_jpg_parse, token usage/cost tests) that will hit real providers unless keys are configured. If the intent is to make the default test run keyless/offline, consider skipping these tests when required env vars are missing (or isolating them under an integration marker) and keeping only the conversion/pathing check mocked.
| input_data = "examples/inputs/test_1.pdf" | ||
| expected_ouput_path = "examples/outputs/test_1.md" | ||
| config = {"parser_type": "LLM_PARSE", "model": model, "verbose": True} | ||
| result = parse(input_data, **config)["raw"] | ||
| assert isinstance(result, str) | ||
|
|
||
| # Compare the result with the expected output | ||
| expected_ouput = open(expected_ouput_path, "r").read() | ||
| # save the result to a file |
There was a problem hiding this comment.
Typo in variable names (expected_ouput_path / expected_ouput) makes the test harder to read/grep and is easy to fix while touching this file; consider renaming to expected_output_path / expected_output.
| return_bboxes = kwargs.get("return_bboxes", False) | ||
| has_bboxes = False | ||
| if result.get("segments"): | ||
| has_bboxes = bool(result["segments"][0].get("bboxes")) | ||
| bbox_framework = kwargs.get("bbox_framework", None) | ||
| framework = kwargs.get("framework", DEFAULT_STATIC_FRAMEWORK) | ||
| bbox_framework_different = bbox_framework and bbox_framework != framework | ||
| if return_bboxes and (not has_bboxes or bbox_framework_different): |
There was a problem hiding this comment.
The new has_bboxes guard avoids an IndexError when segments is empty, but it also means parse_chunk will now silently return an error/empty result from parse_llm_doc instead of raising and triggering retry_with_different_parser_type fallback. Since parse_llm_doc can return {..., "segments": [], "error": ...} on repeated failures, consider explicitly detecting result.get("error") or not result.get("segments") and raising a controlled exception (or directly retrying with STATIC_PARSE when routed=True) so AUTO routing can still fall back reliably.
8815af4 to
ac7d0ac
Compare
|
@notsooamit thanks again for looking into this. 🙏 It looks like the formatting or something else is causing most of the changes to appear as new files. Sharing .vscode config can be dumped settings.json |
ac7d0ac to
e543390
Compare
|
Hey @pramitchoudhary! You were completely right, my local formatter ran with an 88-character limit instead of the project's 120-character limit, causing a massive formatting diff. I reset the branch to upstream/main to wipe out the formatting artifacts and surgically re-applied the fixes. The PR diff is now clean (+26 -8) and ready for review! |
|
Just a quick note on the second commit: While doing native Windows testing, I realized that if os.path.abspath(temp_dir) is called after temp_path is constructed, docx2pdf is still handed a relative output path, causing Microsoft Word to crash upon saving. The second commit simply moves the path resolution to the top of the function so both the input and output paths are strictly absolute before the COM object ever sees them! |
|
Hi @notsooamit proposed solution is not ideal.
|
|
@dilithjay also can we swtich to |
Fixes Issue #177
Description:
The
test_parsing_docx_typetest was failing forLLM_PARSEon macOS with aFileNotFoundError, and on native Windows due to COM object pathing constraints. While fixing this, I also fortified the test suite to not require live API keys and patched a fragile bounding-box check.Changes Included:
conversion_utils.py): Updatedconvert_doc_to_pdfto dynamically resolveinput_pathandtemp_dirusingos.path.abspath(). The underlyingdocx2pdfutility relies on AppleScript (macOS) and COM objects (Windows), both of which strictly require absolute file paths. Feeding them relative paths caused silent failures.test_parser.py): Added a@patchto mockparse_llm_doc. Previously, running this test locally without valid API keys in a.envfile would trigger a live network failure. This mock ensures the.docxto.pdfconversion is still rigorously tested without hitting API rate limits or requiring local credentials.api.py): Wrapped the fragile bounding box check (result["segments"][0].get("bboxes")) in a safe condition. If an LLM parse fails or returns an empty segment list, the router now gracefully triggers fallback logic instead of crashing with anIndexError.Testing:
Verified locally on a native Linux environment (Ubuntu/WSL) and natively on Windows. Test suite passes 100%