fix: skip empty knowledge-base embedding batches by stablegenius49 · Pull Request #6106 · AstrBotDevs/AstrBot

stablegenius49 · 2026-03-12T07:48:03Z

Modifications / 改动点

short-circuit FaissVecDB.insert_batch() when the parser/chunker produces an empty batch instead of flowing into the embedding/document stores and crashing with tuple index out of range
add a focused regression test to ensure empty knowledge-base batches return cleanly and never touch the embedding/document storage layer

This keeps PDF uploads from blowing up when a file yields zero indexable chunks (for example, scanned/image-only PDFs or PDFs whose extracted text is empty) and preserves the existing upload flow instead of surfacing an opaque vector-db exception.

This is NOT a breaking change. / 这不是一个破坏性变更。

Screenshots or Test Results / 运行截图或测试结果

Verification Steps:

python3.11 -m pytest tests/unit/test_faiss_vec_db.py -q

Result:

1 passed in 1.01s

python3.11 -m ruff check astrbot/core/db/vec_db/faiss_impl/vec_db.py tests/unit/test_faiss_vec_db.py

Result:

All checks passed!

Checklist / 检查清单

😊 如果 PR 中有新加入的功能，已经通过 Issue / 邮件等方式和作者讨论过。/ If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
👀 我的更改经过了良好的测试，并已在上方提供了“验证步骤”和“运行截图”。/ My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
🤓 我确保没有引入新依赖库，或者引入了新依赖库的同时将其添加到了 requirements.txt 和 pyproject.toml 文件相应位置。/ I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.
😮 我的更改没有引入恶意代码。/ My changes do not introduce malicious code.

Summary by Sourcery

在 FAISS 向量数据库的插入流程中安全地处理空知识库批次，当未生成可索引内容时避免出错。

Bug Fixes:

当解析后的内容列表为空时，阻止批量插入尝试执行嵌入和存储操作，从而避免在空上传场景下的运行时错误。

Tests:

添加一个异步回归测试，确保对空批次的插入返回空结果，并且从不调用嵌入层或文档存储层。

Original summary in English

Summary by Sourcery

Handle empty knowledge-base batches safely in the FAISS vector database insert path to avoid errors when no indexable content is produced.

Bug Fixes:

Prevent batch insert from attempting embedding and storage operations when the parsed contents list is empty, avoiding runtime errors on empty uploads.

Tests:

Add an async regression test ensuring empty batch inserts return an empty result and never call embedding or document storage layers.

gemini-code-assist · 2026-03-12T07:48:17Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial fix to enhance the robustness of the knowledge-base ingestion process. It prevents system crashes that previously occurred when FaissVecDB.insert_batch() received empty content batches, a common scenario with certain PDF types. The change ensures a smoother and more resilient data upload experience by gracefully handling these edge cases.

Highlights

Empty Batch Handling: Implemented a short-circuit mechanism in FaissVecDB.insert_batch() to prevent crashes when processing empty batches of content, specifically addressing issues with knowledge-base uploads from files like scanned PDFs that yield no indexable chunks.
Regression Testing: Added a dedicated regression test to ensure that FaissVecDB.insert_batch() correctly handles empty content batches by returning an empty list and avoiding unnecessary calls to embedding and document storage.

Changelog

astrbot/core/db/vec_db/faiss_impl/vec_db.py
- Added a conditional check at the beginning of insert_batch to immediately return an empty list if contents is empty, preventing further processing.
tests/unit/test_faiss_vec_db.py
- Added a new test file containing test_insert_batch_skips_empty_contents to verify that FaissVecDB.insert_batch correctly handles an empty list of contents by returning an empty result and not invoking embedding or document storage methods.

Activity

The author provided verification steps including pytest and ruff checks, both showing successful results.
The author confirmed the change is not breaking.
The author confirmed the changes are well-tested and provided verification steps.
The author confirmed no new dependencies were introduced.
The author confirmed no malicious code was introduced.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request addresses a bug where empty batches passed to FaissVecDB.insert_batch() would cause a crash. The fix correctly adds a check to handle empty contents by returning early. A new unit test is also included to verify this behavior and prevent regressions. The changes are correct and well-tested. I have one minor suggestion in the new test file to improve code style for better maintainability.

gemini-code-assist · 2026-03-12T07:49:20Z

tests/unit/test_faiss_vec_db.py

+    vec_db.document_storage = AsyncMock()
+    vec_db.embedding_storage = AsyncMock()
+
+    result = await FaissVecDB.insert_batch(vec_db, [])


For improved readability and adherence to standard Python conventions, it's better to call instance methods on the object instance directly. Calling the method on the class and passing the instance explicitly is less common and can be confusing for other developers. Using a keyword argument for contents also improves clarity.

Suggested change

result = await FaissVecDB.insert_batch(vec_db, [])

result = await vec_db.insert_batch(contents=[])

sourcery-ai

Hey - 我在这里给出了一些高层次的反馈：

可以考虑将 if not contents 的提前返回移动到构造 metadatas 和 ids 之前，这样在空批次的情况下就不会分配永远不会被使用的列表。
在测试中，与其使用 FaissVecDB.__new__，不如通过它的公开构造函数（或一个专门的测试 helper/factory）来构造 FaissVecDB，这样测试设置能更好地反映真实的使用场景。

给 AI Agent 的提示

Please address the comments from this code review:

## Overall Comments
- Consider moving the `if not contents` early-return before constructing `metadatas` and `ids` so you don't allocate lists that will never be used in the empty-batch case.
- In the test, instead of using `FaissVecDB.__new__`, it may be clearer to construct `FaissVecDB` via its public initializer (or a dedicated test helper/factory) so the test setup better reflects real-world usage.

Sourcery 对开源项目是免费的——如果你觉得我们的评审有帮助，欢迎分享 ✨

_{帮我变得更有用！请在每条评论上点击 👍 或 👎，我会根据你的反馈来改进代码评审。}

Original comment in English

Hey - I've left some high level feedback:

Consider moving the if not contents early-return before constructing metadatas and ids so you don't allocate lists that will never be used in the empty-batch case.
In the test, instead of using FaissVecDB.__new__, it may be clearer to construct FaissVecDB via its public initializer (or a dedicated test helper/factory) so the test setup better reflects real-world usage.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- Consider moving the `if not contents` early-return before constructing `metadatas` and `ids` so you don't allocate lists that will never be used in the empty-batch case.
- In the test, instead of using `FaissVecDB.__new__`, it may be clearer to construct `FaissVecDB` via its public initializer (or a dedicated test helper/factory) so the test setup better reflects real-world usage.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

fix: skip empty knowledge-base embedding batches

b3df734

auto-assign bot requested review from LIghtJUNction and advent259141 March 12, 2026 07:48

dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Mar 12, 2026

gemini-code-assist bot reviewed Mar 12, 2026

View reviewed changes

sourcery-ai bot reviewed Mar 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: skip empty knowledge-base embedding batches#6106

fix: skip empty knowledge-base embedding batches#6106
stablegenius49 wants to merge 1 commit intoAstrBotDevs:masterfrom
stablegenius49:pr-factory/issue-3709-empty-batch-upload

stablegenius49 commented Mar 12, 2026 •

edited by sourcery-ai bot

Loading

Uh oh!

gemini-code-assist bot commented Mar 12, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Mar 12, 2026

Uh oh!

sourcery-ai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	result = await FaissVecDB.insert_batch(vec_db, [])
	result = await vec_db.insert_batch(contents=[])

Uh oh!

Conversation

stablegenius49 commented Mar 12, 2026 • edited by sourcery-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Modifications / 改动点

Screenshots or Test Results / 运行截图或测试结果

Checklist / 检查清单

Summary by Sourcery

Summary by Sourcery

Uh oh!

gemini-code-assist bot commented Mar 12, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stablegenius49 commented Mar 12, 2026 •

edited by sourcery-ai bot

Loading