
PII - Deployment script #68

Open
rkritika1508 wants to merge 5 commits into main from feat/deployment-script

Conversation

@rkritika1508
Collaborator

@rkritika1508 rkritika1508 commented Feb 26, 2026

Summary

Target issue is #29.
Post deployment, the first call to the Guardrails API takes several seconds (5-10 s) because the spaCy English language model (en_core_web_lg) is installed on demand for the PII remover. We will hit similar issues in the validators planned for this quarter, so we should set up a deployment script that pre-installs all the requirements, leaving the APIs ready to process requests from the get-go.

There are two ways to do this:
1. Update the Dockerfile (build-time install) - chosen approach
   The spaCy model (en_core_web_lg) is downloaded during the image build and baked into the container. This follows the principle of immutable infrastructure: containers should be fully assembled at build time. This gives us:
   • Faster startup
   • Safe autoscaling
   • No runtime internet dependency
2. Runtime download in Python (on startup / first request)
   The model is downloaded when the app starts or on the first request. This can cause issues such as:
   • Slower cold starts
   • Non-deterministic behavior (the model version may change)
   • Scaling events triggering repeated downloads

In pii_remover.py, PIIRemover previously created a new NlpEngineProvider and AnalyzerEngine inside __init__, causing an expensive spaCy model load per request. I have moved creation to module level so the AnalyzerEngine instance is initialized once and reused.

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is covered by test cases.


Summary by CodeRabbit

  • Chores
    • Improved PII detection by preinstalling an upgraded NLP model in the backend image, reducing startup time and avoiding runtime internet dependency.
    • Introduced lazy, cached analyzer initialization for more efficient PII processing.
  • Documentation
    • Added backend setup and runtime notes with a local fallback command to install the NLP model if needed.

@rkritika1508 rkritika1508 self-assigned this Feb 26, 2026
@rkritika1508 rkritika1508 linked an issue Feb 26, 2026 that may be closed by this pull request
@coderabbitai

coderabbitai bot commented Feb 26, 2026

📝 Walkthrough

Adds build-time installation of a pinned spaCy model to the backend Docker image and refactors the PII remover to use a shared, cached NLP engine (configured via a new public CONFIGURATION) that supplies the AnalyzerEngine; tests and docs updated accordingly.

Changes

  • Docker build (backend/Dockerfile): adds build ARG SPACY_MODEL_WHEEL_URL and a pip install RUN step to install a pinned spaCy model wheel during image build.
  • PII validator logic (backend/app/core/validators/pii_remover.py): adds a public CONFIGURATION, global NLP engine and analyzer caches (_GLOBAL_NLP_ENGINE, _ANALYZER_CACHE), introduces INDIA_RECOGNIZERS, _build_analyzer, and _get_cached_analyzer; switches PIIRemover to retrieve the cached analyzer; sets TOKENIZERS_PARALLELISM.
  • Tests (backend/app/tests/validators/test_pii_remover.py): updates tests to patch and assert _get_cached_analyzer behavior, verify NlpEngineProvider usage, and test selective cached analyzer registration.
  • Docs (backend/README.md, backend/app/core/validators/README.md): documents the en_core_web_lg requirement, notes model pre-installation in Docker, and adds a local-dev fallback command.
  • Metadata / manifests (manifest_file, requirements.txt, pyproject.toml): referenced in the summary; manifest/requirements changes noted (no file-level details in the diff).

Sequence Diagram(s)

sequenceDiagram
  participant CI as Docker Build
  participant Image as Backend Image
  participant App as Backend App
  participant Provider as NlpEngineProvider
  participant Nlp as NLP Engine
  participant Analyzer as AnalyzerEngine
  participant Validator as PIIRemover

  CI->>Image: build (ARG SPACY_MODEL_WHEEL_URL)\nRUN pip install model wheel
  Image->>App: deploy image
  App->>Provider: instantiate with CONFIGURATION
  Provider->>Nlp: create_engine()
  Nlp-->>App: NLP engine instance
  App->>Analyzer: AnalyzerEngine(nlp_engine, recognizers...)
  Analyzer->>Validator: cached analyzer returned
  Validator-->>App: perform PII detection & anonymization

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped inside the image bright,

and tucked a model in at night.
With cached engines, tidy and keen,
I scrub the PII, neat and green.
A little hop, a quieter scene.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title check: ❓ Inconclusive. The title 'PII - Deployment script' is vague and does not clearly convey the main purpose of the changes, which is to optimize spaCy model loading by pre-installing it at build time via Docker. Resolution: consider a more descriptive title such as 'Pre-install spaCy model in Docker for faster PII validator startup'.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch feat/deployment-script



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
backend/app/core/validators/pii_remover.py (1)

75-75: Set TOKENIZERS_PARALLELISM once at module load, not per instance.

At Line 75, this mutates global process env from inside __init__. Prefer a one-time module-level setdefault to avoid repeated global writes and preserve externally provided values.

Suggested change
 import os
+os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
@@
-        os.environ["TOKENIZERS_PARALLELISM"] = "false"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/validators/pii_remover.py` at line 75, The code currently
sets os.environ["TOKENIZERS_PARALLELISM"] = "false" inside the PII remover
initializer (inside __init__), which mutates process-wide env on each instance;
move this to a one-time module-level call using
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false") at the top of
backend/app/core/validators/pii_remover.py so the environment is only set once
at import time and existing external values are preserved; remove the assignment
from the __init__ method in the class (the initializer) so no per-instance
global writes occur.
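The difference between direct assignment and `setdefault` can be shown with a short standard-library sketch: `setdefault` writes only when the key is absent, so an operator-supplied value survives module import.

```python
import os

# Simulate an operator setting the variable before the app starts.
os.environ["TOKENIZERS_PARALLELISM"] = "true"
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")  # no-op: key exists
print(os.environ["TOKENIZERS_PARALLELISM"])  # true

# Without an external value, setdefault installs the default once.
os.environ.pop("TOKENIZERS_PARALLELISM")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
print(os.environ["TOKENIZERS_PARALLELISM"])  # false
```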
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/core/validators/pii_remover.py`:
- Around line 77-80: The PIIRemover currently creates a new NlpEngineProvider
and AnalyzerEngine inside __init__ (via
NlpEngineProvider(nlp_configuration=CONFIGURATION) and
AnalyzerEngine(nlp_engine=nlp_engine)), causing expensive spaCy model loads per
request; move creation to module-level so the NlpEngineProvider / nlp_engine and
the AnalyzerEngine are initialized once and reused by PIIRemover instances
(e.g., module-level cached variables like _GLOBAL_NLP_ENGINE and
_GLOBAL_ANALYZER used by PIIRemover.__init__), and also set
TOKENIZERS_PARALLELISM at module import time instead of inside __init__ to avoid
repeated os.environ assignments; ensure references to CONFIGURATION,
NlpEngineProvider, AnalyzerEngine, and build_guard remain correct.

In `@backend/Dockerfile`:
- Around line 45-46: The Dockerfile currently runs "python -m spacy download
en_core_web_lg" unpinned and after copying app sources; change this to pin a
specific model artifact and decouple it from app changes by (1) replacing the
unpinned download step (the RUN python -m spacy download en_core_web_lg line)
with a deterministic model install using a versioned wheel or model package
(e.g., install a specific en_core_web_lg==<version> wheel or reference a model
lock), and (2) move that model-install step into a stable build layer before
COPY ./app so app edits won't invalidate the cached model layer; alternatively
produce the model as a separate build artifact/image stage and copy it into the
final image to ensure reproducible builds.
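A sketch of what this could look like (the base image, paths, and wheel URL below are illustrative assumptions, not the repository's actual Dockerfile):

```dockerfile
# Illustrative fragment only: the model install is pinned and layered
# before the application code so app edits don't invalidate the layer.
FROM python:3.12-slim
WORKDIR /code

# Dependency layer (changes rarely)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Pinned spaCy model layer: a versioned wheel, not `spacy download`
ARG SPACY_MODEL_WHEEL_URL=https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl
RUN pip install --no-cache-dir "${SPACY_MODEL_WHEEL_URL}"

# Application layer (changes often) comes last
COPY ./app ./app
```

Because the model layer sits above `COPY ./app`, editing application code reuses the cached model layer and rebuilds stay fast.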


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09931fb and 61ebdf7.

📒 Files selected for processing (2)
  • backend/Dockerfile
  • backend/app/core/validators/pii_remover.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/core/validators/README.md`:
- Around line 168-169: Update the README.md statement about the spaCy model to
reflect the PR changes: state that en_core_web_lg is installed at build time in
the Dockerfile (so containerized deployments will not trigger a runtime
download), note that pii_remover.py is configured to use this pre-installed
model, and clarify that the “fallback” download behavior only applies when
running locally without the Docker image; replace the current line claiming
Presidio attempts to download the model at runtime with these three concise
points referencing Dockerfile, pii_remover.py, and en_core_web_lg.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61ebdf7 and 7a17bfa.

📒 Files selected for processing (2)
  • backend/README.md
  • backend/app/core/validators/README.md
✅ Files skipped from review due to trivial changes (1)
  • backend/README.md

@rkritika1508
Collaborator Author

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Feb 26, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/core/validators/pii_remover.py`:
- Around line 83-87: The cache key in _get_cached_analyzer currently includes
duplicate entity entries causing redundant cache entries; normalize the key by
removing duplicates from entity_types before sorting so equivalent lists (e.g.,
["IN_PAN"] vs ["IN_PAN","IN_PAN"]) map to the same recognizer_key, then continue
using that canonical recognizer_key to read/write _ANALYZER_CACHE and call
_build_analyzer(entity_types) as needed (update any call sites within
_get_cached_analyzer to pass the deduplicated list if required).

In `@backend/app/tests/validators/test_pii_remover.py`:
- Around line 82-100: The test mutates module globals _ANALYZER_CACHE and
_GLOBAL_NLP_ENGINE and does not restore them; change the test
(test_cached_analyzer_registers_only_requested_indian_recognizers) to save the
original values of pii_remover._ANALYZER_CACHE and
pii_remover._GLOBAL_NLP_ENGINE before clearing/setting them, then restore those
originals in a finally block (or use pytest's monkeypatch to set and revert
attributes) so module state is returned after calling
pii_remover._get_cached_analyzer; ensure the same symbols (_ANALYZER_CACHE,
_GLOBAL_NLP_ENGINE, _get_cached_analyzer) are used so later tests are not
polluted by mocks.
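The save/restore idea can be sketched with a stand-in namespace object in place of the real `pii_remover` module (all names here are illustrative):

```python
import types

# Stand-in for the pii_remover module and its globals.
mod = types.SimpleNamespace(
    _ANALYZER_CACHE={("IN_PAN",): "cached"},
    _GLOBAL_NLP_ENGINE="engine",
)

def run_isolated(test_fn):
    # Save originals, mutate freely inside the test, restore in finally.
    old_cache = dict(mod._ANALYZER_CACHE)
    old_engine = mod._GLOBAL_NLP_ENGINE
    try:
        mod._ANALYZER_CACHE.clear()
        mod._GLOBAL_NLP_ENGINE = None
        test_fn()
    finally:
        mod._ANALYZER_CACHE.clear()
        mod._ANALYZER_CACHE.update(old_cache)
        mod._GLOBAL_NLP_ENGINE = old_engine

def fake_test():
    assert mod._ANALYZER_CACHE == {}
    assert mod._GLOBAL_NLP_ENGINE is None

run_isolated(fake_test)
print(mod._ANALYZER_CACHE)  # {('IN_PAN',): 'cached'}
```

With pytest, the `monkeypatch` fixture achieves the same effect and reverts attributes automatically after each test.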

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a17bfa and f13d377.

📒 Files selected for processing (4)
  • backend/Dockerfile
  • backend/app/core/validators/README.md
  • backend/app/core/validators/pii_remover.py
  • backend/app/tests/validators/test_pii_remover.py

Comment on lines +83 to +87
def _get_cached_analyzer(entity_types: list[str]) -> AnalyzerEngine:
    recognizer_key = tuple(sorted(t for t in entity_types if t in INDIA_RECOGNIZERS))
    if recognizer_key not in _ANALYZER_CACHE:
        _ANALYZER_CACHE[recognizer_key] = _build_analyzer(entity_types)
    return _ANALYZER_CACHE[recognizer_key]

⚠️ Potential issue | 🟡 Minor

Normalize cache key to avoid duplicate-key fragmentation.

Line 84 keeps duplicate entity entries in the key. Equivalent requests like ["IN_PAN"] and ["IN_PAN", "IN_PAN"] create different cache entries unnecessarily.

🧩 Suggested key canonicalization
 def _get_cached_analyzer(entity_types: list[str]) -> AnalyzerEngine:
-    recognizer_key = tuple(sorted(t for t in entity_types if t in INDIA_RECOGNIZERS))
+    recognizer_key = tuple(sorted({t for t in entity_types if t in INDIA_RECOGNIZERS}))
     if recognizer_key not in _ANALYZER_CACHE:
         _ANALYZER_CACHE[recognizer_key] = _build_analyzer(entity_types)
     return _ANALYZER_CACHE[recognizer_key]
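The effect of the set comprehension can be seen in a small stand-alone sketch (`INDIA_RECOGNIZERS` here is a hypothetical subset of the real constant):

```python
# Duplicate entity names collapse to one canonical cache key.
INDIA_RECOGNIZERS = {"IN_PAN", "IN_AADHAAR", "IN_PASSPORT"}

def cache_key(entity_types: list[str]) -> tuple:
    # Set comprehension dedupes; sorted() makes the key order-independent.
    return tuple(sorted({t for t in entity_types if t in INDIA_RECOGNIZERS}))

print(cache_key(["IN_PAN"]))                               # ('IN_PAN',)
print(cache_key(["IN_PAN", "IN_PAN"]))                     # ('IN_PAN',)
print(cache_key(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"]))  # ('IN_AADHAAR', 'IN_PAN')
```

Both duplicate-containing and order-shuffled inputs map to the same key, so equivalent requests share one analyzer instead of fragmenting the cache.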

Comment on lines +82 to +100
def test_cached_analyzer_registers_only_requested_indian_recognizers():
    with patch(
        "app.core.validators.pii_remover.NlpEngineProvider"
    ) as mock_provider, patch(
        "app.core.validators.pii_remover.AnalyzerEngine"
    ) as mock_analyzer:
        pii_remover._ANALYZER_CACHE.clear()
        pii_remover._GLOBAL_NLP_ENGINE = None
        analyzer_instance = mock_analyzer.return_value

        pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
        pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])

        mock_provider.assert_called_once_with(
            nlp_configuration=pii_remover.CONFIGURATION
        )
        mock_provider.return_value.create_engine.assert_called_once()
        mock_analyzer.assert_called_once()
        assert analyzer_instance.registry.add_recognizer.call_count == 2

⚠️ Potential issue | 🟠 Major

Restore mutated module globals to keep tests isolated.

Lines 88-90 overwrite shared module state and leave it changed after the test. This can pollute later tests with mocked analyzer/engine instances.

🧪 Suggested isolation-safe pattern
 def test_cached_analyzer_registers_only_requested_indian_recognizers():
     with patch(
         "app.core.validators.pii_remover.NlpEngineProvider"
     ) as mock_provider, patch(
         "app.core.validators.pii_remover.AnalyzerEngine"
     ) as mock_analyzer:
-        pii_remover._ANALYZER_CACHE.clear()
-        pii_remover._GLOBAL_NLP_ENGINE = None
-        analyzer_instance = mock_analyzer.return_value
-
-        pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
-        pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
-
-        mock_provider.assert_called_once_with(
-            nlp_configuration=pii_remover.CONFIGURATION
-        )
-        mock_provider.return_value.create_engine.assert_called_once()
-        mock_analyzer.assert_called_once()
-        assert analyzer_instance.registry.add_recognizer.call_count == 2
+        old_cache = dict(pii_remover._ANALYZER_CACHE)
+        old_nlp_engine = pii_remover._GLOBAL_NLP_ENGINE
+        try:
+            pii_remover._ANALYZER_CACHE.clear()
+            pii_remover._GLOBAL_NLP_ENGINE = None
+            analyzer_instance = mock_analyzer.return_value
+
+            pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
+            pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
+
+            mock_provider.assert_called_once_with(
+                nlp_configuration=pii_remover.CONFIGURATION
+            )
+            mock_provider.return_value.create_engine.assert_called_once()
+            mock_analyzer.assert_called_once()
+            assert analyzer_instance.registry.add_recognizer.call_count == 2
+        finally:
+            pii_remover._ANALYZER_CACHE.clear()
+            pii_remover._ANALYZER_CACHE.update(old_cache)
+            pii_remover._GLOBAL_NLP_ENGINE = old_nlp_engine
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def test_cached_analyzer_registers_only_requested_indian_recognizers():
    with patch(
        "app.core.validators.pii_remover.NlpEngineProvider"
    ) as mock_provider, patch(
        "app.core.validators.pii_remover.AnalyzerEngine"
    ) as mock_analyzer:
        old_cache = dict(pii_remover._ANALYZER_CACHE)
        old_nlp_engine = pii_remover._GLOBAL_NLP_ENGINE
        try:
            pii_remover._ANALYZER_CACHE.clear()
            pii_remover._GLOBAL_NLP_ENGINE = None
            analyzer_instance = mock_analyzer.return_value

            pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])
            pii_remover._get_cached_analyzer(["EMAIL_ADDRESS", "IN_AADHAAR", "IN_PAN"])

            mock_provider.assert_called_once_with(
                nlp_configuration=pii_remover.CONFIGURATION
            )
            mock_provider.return_value.create_engine.assert_called_once()
            mock_analyzer.assert_called_once()
            assert analyzer_instance.registry.add_recognizer.call_count == 2
        finally:
            pii_remover._ANALYZER_CACHE.clear()
            pii_remover._ANALYZER_CACHE.update(old_cache)
            pii_remover._GLOBAL_NLP_ENGINE = old_nlp_engine


Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deployment: Setup Script

1 participant