fix: skip empty knowledge-base embedding batches#6106
fix: skip empty knowledge-base embedding batches#6106stablegenius49 wants to merge 1 commit intoAstrBotDevs:masterfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a crucial fix to enhance the robustness of the knowledge-base ingestion process. It prevents system crashes that previously occurred when Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request addresses a bug where empty batches passed to FaissVecDB.insert_batch() would cause a crash. The fix correctly adds a check to handle empty contents by returning early. A new unit test is also included to verify this behavior and prevent regressions. The changes are correct and well-tested. I have one minor suggestion in the new test file to improve code style for better maintainability.
| vec_db.document_storage = AsyncMock() | ||
| vec_db.embedding_storage = AsyncMock() | ||
|
|
||
| result = await FaissVecDB.insert_batch(vec_db, []) |
There was a problem hiding this comment.
For improved readability and adherence to standard Python conventions, it's better to call instance methods on the object instance directly. Calling the method on the class and passing the instance explicitly is less common and can be confusing for other developers. Using a keyword argument for contents also improves clarity.
| result = await FaissVecDB.insert_batch(vec_db, []) | |
| result = await vec_db.insert_batch(contents=[]) |
There was a problem hiding this comment.
Hey - 我在这里给出了一些高层次的反馈:
- 可以考虑将
if not contents的提前返回移动到构造metadatas和ids之前,这样在空批次的情况下就不会分配永远不会被使用的列表。 - 在测试中,与其使用
FaissVecDB.__new__,不如通过它的公开构造函数(或一个专门的测试 helper/factory)来构造FaissVecDB,这样测试设置能更好地反映真实的使用场景。
给 AI Agent 的提示
Please address the comments from this code review:
## Overall Comments
- Consider moving the `if not contents` early-return before constructing `metadatas` and `ids` so you don't allocate lists that will never be used in the empty-batch case.
- In the test, instead of using `FaissVecDB.__new__`, it may be clearer to construct `FaissVecDB` via its public initializer (or a dedicated test helper/factory) so the test setup better reflects real-world usage.帮我变得更有用!请在每条评论上点击 👍 或 👎,我会根据你的反馈来改进代码评审。
Original comment in English
Hey - I've left some high level feedback:
- Consider moving the
if not contentsearly-return before constructingmetadatasandidsso you don't allocate lists that will never be used in the empty-batch case. - In the test, instead of using
FaissVecDB.__new__, it may be clearer to constructFaissVecDBvia its public initializer (or a dedicated test helper/factory) so the test setup better reflects real-world usage.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider moving the `if not contents` early-return before constructing `metadatas` and `ids` so you don't allocate lists that will never be used in the empty-batch case.
- In the test, instead of using `FaissVecDB.__new__`, it may be clearer to construct `FaissVecDB` via its public initializer (or a dedicated test helper/factory) so the test setup better reflects real-world usage.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Fixes #3709
Modifications / 改动点
FaissVecDB.insert_batch()when the parser/chunker produces an empty batch instead of flowing into the embedding/document stores and crashing withtuple index out of rangeThis keeps PDF uploads from blowing up when a file yields zero indexable chunks (for example, scanned/image-only PDFs or PDFs whose extracted text is empty) and preserves the existing upload flow instead of surfacing an opaque vector-db exception.
Screenshots or Test Results / 运行截图或测试结果
Verification Steps:
Result:
Result:
Checklist / 检查清单
requirements.txt和pyproject.toml文件相应位置。/ I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations inrequirements.txtandpyproject.toml.Summary by Sourcery
在 FAISS 向量数据库的插入流程中安全地处理空知识库批次,当未生成可索引内容时避免出错。
Bug Fixes:
Tests:
Original summary in English
Summary by Sourcery
Handle empty knowledge-base batches safely in the FAISS vector database insert path to avoid errors when no indexable content is produced.
Bug Fixes:
Tests: