Conversation
📝 Walkthrough

Adds build-time installation of a pinned spaCy model to the backend Docker image and refactors the PII remover to use a shared, cached NLP engine (configured via the public CONFIGURATION).

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CI as Docker Build
    participant Image as Backend Image
    participant App as Backend App
    participant Provider as NlpEngineProvider
    participant Nlp as NLP Engine
    participant Analyzer as AnalyzerEngine
    participant Validator as PIIRemover
    CI->>Image: build (ARG SPACY_MODEL_WHEEL_URL)<br/>RUN pip install model wheel
    Image->>App: deploy image
    App->>Provider: instantiate with CONFIGURATION
    Provider->>Nlp: create_engine()
    Nlp-->>App: NLP engine instance
    App->>Analyzer: AnalyzerEngine(nlp_engine, recognizers...)
    Analyzer->>Validator: cached analyzer returned
    Validator-->>App: perform PII detection & anonymization
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🧹 Nitpick comments (1)
backend/app/core/validators/pii_remover.py (1)
75-75: Set `TOKENIZERS_PARALLELISM` once at module load, not per instance.

At line 75, this mutates process-wide environment state from inside `__init__`. Prefer a one-time module-level `setdefault` to avoid repeated global writes and to preserve externally provided values.

Suggested change

```diff
 import os
+os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
@@
-        os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/core/validators/pii_remover.py` at line 75, The code currently sets os.environ["TOKENIZERS_PARALLELISM"] = "false" inside the PII remover initializer (inside __init__), which mutates process-wide env on each instance; move this to a one-time module-level call using os.environ.setdefault("TOKENIZERS_PARALLELISM", "false") at the top of backend/app/core/validators/pii_remover.py so the environment is only set once at import time and existing external values are preserved; remove the assignment from the __init__ method in the class (the initializer) so no per-instance global writes occur.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/core/validators/pii_remover.py`:
- Around line 77-80: The PIIRemover currently creates a new NlpEngineProvider
and AnalyzerEngine inside __init__ (via
NlpEngineProvider(nlp_configuration=CONFIGURATION) and
AnalyzerEngine(nlp_engine=nlp_engine)), causing expensive spaCy model loads per
request; move creation to module-level so the NlpEngineProvider / nlp_engine and
the AnalyzerEngine are initialized once and reused by PIIRemover instances
(e.g., module-level cached variables like _GLOBAL_NLP_ENGINE and
_GLOBAL_ANALYZER used by PIIRemover.__init__), and also set
TOKENIZERS_PARALLELISM at module import time instead of inside __init__ to avoid
repeated os.environ assignments; ensure references to CONFIGURATION,
NlpEngineProvider, AnalyzerEngine, and build_guard remain correct.
In `@backend/Dockerfile`:
- Around line 45-46: The Dockerfile currently runs "python -m spacy download
en_core_web_lg" unpinned and after copying app sources; change this to pin a
specific model artifact and decouple it from app changes by (1) replacing the
unpinned download step (the RUN python -m spacy download en_core_web_lg line)
with a deterministic model install using a versioned wheel or model package
(e.g., install a specific en_core_web_lg==<version> wheel or reference a model
lock), and (2) move that model-install step into a stable build layer before
COPY ./app so app edits won't invalidate the cached model layer; alternatively
produce the model as a separate build artifact/image stage and copy it into the
final image to ensure reproducible builds.
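A hedged sketch of what this prompt asks for — the wheel version, URL, and layer layout below are assumptions for illustration, not the repository's actual Dockerfile:

```dockerfile
# Pin the model as a versioned wheel so builds are reproducible, and install
# it in a layer that precedes the app COPY so app edits keep this layer cached.
ARG SPACY_MODEL_WHEEL_URL=https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl
RUN pip install --no-cache-dir "${SPACY_MODEL_WHEEL_URL}"

# App sources change often; copying them after the model install preserves
# the cached model layer across app-only rebuilds.
COPY ./app ./app
```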
---
Nitpick comments:
In `@backend/app/core/validators/pii_remover.py`:
- Line 75: The code currently sets os.environ["TOKENIZERS_PARALLELISM"] =
"false" inside the PII remover initializer (inside __init__), which mutates
process-wide env on each instance; move this to a one-time module-level call
using os.environ.setdefault("TOKENIZERS_PARALLELISM", "false") at the top of
backend/app/core/validators/pii_remover.py so the environment is only set once
at import time and existing external values are preserved; remove the assignment
from the __init__ method in the class (the initializer) so no per-instance
global writes occur.
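The `setdefault` behavior this comment relies on can be demonstrated in isolation (a minimal standalone sketch, independent of the actual module):

```python
import os

# setdefault writes only when the key is absent, so a value supplied by the
# deployment environment is preserved rather than clobbered at import time.
os.environ.pop("TOKENIZERS_PARALLELISM", None)
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
assert os.environ["TOKENIZERS_PARALLELISM"] == "false"  # default applied

os.environ["TOKENIZERS_PARALLELISM"] = "true"  # simulate an external setting
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
assert os.environ["TOKENIZERS_PARALLELISM"] == "true"  # external value kept
```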
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/core/validators/README.md`:
- Around line 168-169: Update the README.md statement about the spaCy model to
reflect the PR changes: state that en_core_web_lg is installed at build time in
the Dockerfile (so containerized deployments will not trigger a runtime
download), note that pii_remover.py is configured to use this pre-installed
model, and clarify that the “fallback” download behavior only applies when
running locally without the Docker image; replace the current line claiming
Presidio attempts to download the model at runtime with these three concise
points referencing Dockerfile, pii_remover.py, and en_core_web_lg.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- backend/README.md
- backend/app/core/validators/README.md
✅ Files skipped from review due to trivial changes (1)
- backend/README.md
@coderabbitai full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/core/validators/pii_remover.py`:
- Around line 83-87: The cache key in _get_cached_analyzer currently includes
duplicate entity entries causing redundant cache entries; normalize the key by
removing duplicates from entity_types before sorting so equivalent lists (e.g.,
["IN_PAN"] vs ["IN_PAN","IN_PAN"]) map to the same recognizer_key, then continue
using that canonical recognizer_key to read/write _ANALYZER_CACHE and call
_build_analyzer(entity_types) as needed (update any call sites within
_get_cached_analyzer to pass the deduplicated list if required).
In `@backend/app/tests/validators/test_pii_remover.py`:
- Around line 82-100: The test mutates module globals _ANALYZER_CACHE and
_GLOBAL_NLP_ENGINE and does not restore them; change the test
(test_cached_analyzer_registers_only_requested_indian_recognizers) to save the
original values of pii_remover._ANALYZER_CACHE and
pii_remover._GLOBAL_NLP_ENGINE before clearing/setting them, then restore those
originals in a finally block (or use pytest's monkeypatch to set and revert
attributes) so module state is returned after calling
pii_remover._get_cached_analyzer; ensure the same symbols (_ANALYZER_CACHE,
_GLOBAL_NLP_ENGINE, _get_cached_analyzer) are used so later tests are not
polluted by mocks.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- backend/Dockerfile
- backend/app/core/validators/README.md
- backend/app/core/validators/pii_remover.py
- backend/app/tests/validators/test_pii_remover.py
```python
def _get_cached_analyzer(entity_types: list[str]) -> AnalyzerEngine:
    recognizer_key = tuple(sorted(t for t in entity_types if t in INDIA_RECOGNIZERS))
    if recognizer_key not in _ANALYZER_CACHE:
        _ANALYZER_CACHE[recognizer_key] = _build_analyzer(entity_types)
    return _ANALYZER_CACHE[recognizer_key]
```
Normalize cache key to avoid duplicate-key fragmentation.
Line 84 keeps duplicate entity entries in the key. Equivalent requests like ["IN_PAN"] and ["IN_PAN", "IN_PAN"] create different cache entries unnecessarily.
🧩 Suggested key canonicalization
```diff
 def _get_cached_analyzer(entity_types: list[str]) -> AnalyzerEngine:
-    recognizer_key = tuple(sorted(t for t in entity_types if t in INDIA_RECOGNIZERS))
+    recognizer_key = tuple(sorted({t for t in entity_types if t in INDIA_RECOGNIZERS}))
     if recognizer_key not in _ANALYZER_CACHE:
         _ANALYZER_CACHE[recognizer_key] = _build_analyzer(entity_types)
     return _ANALYZER_CACHE[recognizer_key]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/core/validators/pii_remover.py` around lines 83 - 87, The cache
key in _get_cached_analyzer currently includes duplicate entity entries causing
redundant cache entries; normalize the key by removing duplicates from
entity_types before sorting so equivalent lists (e.g., ["IN_PAN"] vs
["IN_PAN","IN_PAN"]) map to the same recognizer_key, then continue using that
canonical recognizer_key to read/write _ANALYZER_CACHE and call
_build_analyzer(entity_types) as needed (update any call sites within
_get_cached_analyzer to pass the deduplicated list if required).
```diff
 def test_cached_analyzer_registers_only_requested_indian_recognizers():
     with patch(
         "app.core.validators.pii_remover.NlpEngineProvider"
     ) as mock_provider, patch(
         "app.core.validators.pii_remover.AnalyzerEngine"
-    ) as mock_analyzer, patch("app.core.validators.pii_remover.AnonymizerEngine"):
-        registry = mock_analyzer.return_value.registry
-
-        PIIRemover(
-            entity_types=[
-                "IN_AADHAAR",
-                "IN_PAN",
-                "IN_PASSPORT",
-                "IN_VEHICLE_REGISTRATION",
-                "IN_VOTER",
-            ],
-            threshold=0.5,
-        )
-
-        # Called once per recognizer
-        assert registry.add_recognizer.call_count == 5
+    ) as mock_analyzer:
+        pii_remover._ANALYZER_CACHE.clear()
+        pii_remover._GLOBAL_NLP_ENGINE = None
+        analyzer_instance = mock_analyzer.return_value
+
+        pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
+        pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
+
+        mock_provider.assert_called_once_with(
+            nlp_configuration=pii_remover.CONFIGURATION
+        )
+        mock_provider.return_value.create_engine.assert_called_once()
+        mock_analyzer.assert_called_once()
+        assert analyzer_instance.registry.add_recognizer.call_count == 2
```
Restore mutated module globals to keep tests isolated.
Lines 88-90 overwrite shared module state and leave it changed after the test. This can pollute later tests with mocked analyzer/engine instances.
🧪 Suggested isolation-safe pattern
```diff
 def test_cached_analyzer_registers_only_requested_indian_recognizers():
     with patch(
         "app.core.validators.pii_remover.NlpEngineProvider"
     ) as mock_provider, patch(
         "app.core.validators.pii_remover.AnalyzerEngine"
     ) as mock_analyzer:
-        pii_remover._ANALYZER_CACHE.clear()
-        pii_remover._GLOBAL_NLP_ENGINE = None
-        analyzer_instance = mock_analyzer.return_value
-
-        pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
-        pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
-
-        mock_provider.assert_called_once_with(
-            nlp_configuration=pii_remover.CONFIGURATION
-        )
-        mock_provider.return_value.create_engine.assert_called_once()
-        mock_analyzer.assert_called_once()
-        assert analyzer_instance.registry.add_recognizer.call_count == 2
+        old_cache = dict(pii_remover._ANALYZER_CACHE)
+        old_nlp_engine = pii_remover._GLOBAL_NLP_ENGINE
+        try:
+            pii_remover._ANALYZER_CACHE.clear()
+            pii_remover._GLOBAL_NLP_ENGINE = None
+            analyzer_instance = mock_analyzer.return_value
+
+            pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
+            pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
+
+            mock_provider.assert_called_once_with(
+                nlp_configuration=pii_remover.CONFIGURATION
+            )
+            mock_provider.return_value.create_engine.assert_called_once()
+            mock_analyzer.assert_called_once()
+            assert analyzer_instance.registry.add_recognizer.call_count == 2
+        finally:
+            pii_remover._ANALYZER_CACHE.clear()
+            pii_remover._ANALYZER_CACHE.update(old_cache)
+            pii_remover._GLOBAL_NLP_ENGINE = old_nlp_engine
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def test_cached_analyzer_registers_only_requested_indian_recognizers():
    with patch(
        "app.core.validators.pii_remover.NlpEngineProvider"
    ) as mock_provider, patch(
        "app.core.validators.pii_remover.AnalyzerEngine"
    ) as mock_analyzer:
        old_cache = dict(pii_remover._ANALYZER_CACHE)
        old_nlp_engine = pii_remover._GLOBAL_NLP_ENGINE
        try:
            pii_remover._ANALYZER_CACHE.clear()
            pii_remover._GLOBAL_NLP_ENGINE = None
            analyzer_instance = mock_analyzer.return_value

            pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
            pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])

            mock_provider.assert_called_once_with(
                nlp_configuration=pii_remover.CONFIGURATION
            )
            mock_provider.return_value.create_engine.assert_called_once()
            mock_analyzer.assert_called_once()
            assert analyzer_instance.registry.add_recognizer.call_count == 2
        finally:
            pii_remover._ANALYZER_CACHE.clear()
            pii_remover._ANALYZER_CACHE.update(old_cache)
            pii_remover._GLOBAL_NLP_ENGINE = old_nlp_engine
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/tests/validators/test_pii_remover.py` around lines 82 - 100, The
test mutates module globals _ANALYZER_CACHE and _GLOBAL_NLP_ENGINE and does not
restore them; change the test
(test_cached_analyzer_registers_only_requested_indian_recognizers) to save the
original values of pii_remover._ANALYZER_CACHE and
pii_remover._GLOBAL_NLP_ENGINE before clearing/setting them, then restore those
originals in a finally block (or use pytest's monkeypatch to set and revert
attributes) so module state is returned after calling
pii_remover._get_cached_analyzer; ensure the same symbols (_ANALYZER_CACHE,
_GLOBAL_NLP_ENGINE, _get_cached_analyzer) are used so later tests are not
polluted by mocks.
Summary
Target issue is #29.
Explain the motivation for making this change. What existing problem does the pull request solve?
Post deployment, the first call to the Guardrails API takes a few seconds (5-10 seconds) because the spaCy English language model (en_core_web_lg) is installed on demand for the PII remover. We will experience similar issues in validators planned for this quarter, so we should set up a deployment script that pre-installs all the requirements so the APIs are ready to process from the get-go.
There are two ways to do this:

1. Update Dockerfile (build-time install) - chosen approach
The spaCy model (en_core_web_lg) is downloaded during the image build and baked into the container. This follows the principle of immutable infrastructure: containers should be fully assembled at build time.

2. Runtime install
The model is downloaded when the app starts or on the first request, which causes issues such as the slow first request described above.
In pii_remover.py, the PIIRemover currently creates a new NlpEngineProvider and AnalyzerEngine inside `__init__`, causing expensive spaCy model loads per request. I have moved creation to module level so the AnalyzerEngine instance is initialized once and reused.

Checklist
Before submitting a pull request, please ensure that you mark these tasks.
Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.

Notes
Please add here if any other information is required for the reviewer.
Summary by CodeRabbit