feat(data-masking): add Data Masking utility#5143

Draft
svozza wants to merge 16 commits into main from 001-data-masking

Conversation

@svozza
Contributor

@svozza svozza commented Mar 29, 2026

Summary

Changes

This PR is an experiment in delivering a full feature, end to end, using spec-driven development and agentic coding. As such, I have set it as a draft. We may or may not merge this, but if we do, it will only be after a thorough review by the team. The purpose of this PR is as much to provoke discussion as it is to implement a feature. I would appreciate it if @dreamorosi and @sdangol could look at the code and give their opinions.

From my perspective, I think this was a very successful experiment: I am happy with the code quality and I also took the opportunity to add property tests, which are a perfect fit for this sort of logic.

One thing I would note is that this probably worked so well because @walmsles defined the issue so clearly, and because we could use the Python implementation as a reference.

One place I differed from the proposed implementation was that I don't batch the calls to KMS. This simplifies the API and means we mirror the Python implementation exactly. While ordinarily this would be a performance concern, we use the caching feature in the AWS Cryptography library to ensure that we only ever make one call to KMS when encrypting multiple fields.
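To make the caching argument concrete, here is a minimal sketch of the idea. It does not use the AWS Encryption SDK API: `createCachingKeySource`, `fakeKms`, and `encryptField` are hypothetical names, the "encryption" is a toy string tag, and the 5-minute max age is an arbitrary assumption. It only shows how a cached data key lets several field encryptions share a single key-generation call.

```typescript
// Shape of a data key as this sketch models it (illustrative only).
type DataKey = { plaintextKey: string; createdAt: number };

// Wraps a key generator so that a key is reused until it is older than maxAgeMs.
const createCachingKeySource = (
  generateDataKey: () => DataKey,
  maxAgeMs: number,
  now: () => number = Date.now
): (() => DataKey) => {
  let cached: DataKey | undefined;
  return () => {
    if (cached === undefined || now() - cached.createdAt > maxAgeMs) {
      cached = generateDataKey();
    }
    return cached;
  };
};

// Hypothetical stand-in for a KMS call; it only counts invocations.
let kmsCalls = 0;
const fakeKms = (): DataKey => {
  kmsCalls += 1;
  return { plaintextKey: `key-${kmsCalls}`, createdAt: Date.now() };
};

const getKey = createCachingKeySource(fakeKms, 300_000);

// Toy "encryption": tags the value with the key id. NOT real cryptography.
const encryptField = (value: string): string =>
  `${getKey().plaintextKey}:${value}`;

const encrypted = ['4111-1111', '123-45-6789', 'alice@example.com'].map(
  encryptField
);
```

In this sketch, three fields are processed but `fakeKms` is invoked once, which is the property the unbatched design relies on.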

What's included

  • @aws-lambda-powertools/data-masking package with erase, encrypt, and decrypt operations
  • AWSEncryptionSDKProvider using KMS envelope encryption (@aws-crypto/client-node as optional peer dep)
  • Field selection via dot notation, [*] array wildcards, and * object wildcards
  • Custom masking rules (regex, dynamic length, custom strings)
  • Encryption context (AAD) for integrity and authenticity
  • Prototype pollution protection
  • Unit tests with property-based testing via fast-check
  • E2e test scaffolding (CDK stack + Lambda handlers)
  • Documentation mirroring the Python Powertools data masking docs
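As a rough illustration of the three kinds of custom masking rule listed above (custom string, dynamic length, regex), here is a minimal sketch. The `MaskingRule` shape, the `maskValue` function, and the `'*****'` default are assumptions made for this example, not the package's actual API.

```typescript
// Hypothetical rule shape; the option names are illustrative only.
type MaskingRule =
  | { kind: 'custom'; mask: string }
  | { kind: 'dynamic' }
  | { kind: 'regex'; pattern: RegExp; replacement: string };

// Assumed default mask for this sketch.
const DEFAULT_MASK = '*****';

const maskValue = (value: string, rule?: MaskingRule): string => {
  // No rule: replace the whole value with the fixed default mask.
  if (rule === undefined) return DEFAULT_MASK;
  // Custom string: replace the whole value with the caller's mask.
  if (rule.kind === 'custom') return rule.mask;
  // Dynamic length: one asterisk per character of the original value.
  if (rule.kind === 'dynamic') return '*'.repeat(value.length);
  // Regex: replace only the parts of the value that match the pattern.
  return value.replace(rule.pattern, rule.replacement);
};

// Keep the last four digits of a card number, mask the rest.
const maskedCard = maskValue('4111111111111111', {
  kind: 'regex',
  pattern: /\d(?=\d{4})/g,
  replacement: '*',
});
// maskedCard === '************1111'
```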

A note on field path resolution

The Python implementation uses jsonpath_ng for field selection, which natively supports both querying and path extraction for write-back. We considered using a JavaScript JSONPath library (e.g. jsonpath-plus) to match this approach, but decided against it for a few reasons:

  • jsonpath-plus (9.8m weekly downloads) is marked as unmaintained by its maintainers
  • We already have @aws-lambda-powertools/jmespath in-house

Instead, we use JMESPath to validate expressions and a small (~20 line) custom walker to resolve wildcards ([*] and *) into concrete paths for write-back. JMESPath is read-only by design so it can't be used for path extraction directly, but the walker is simple, well-tested, and avoids any new dependencies.

There is, however, another library, jsonpath, that has 3.8m weekly downloads. I am always hesitant to introduce new dependencies to the project, but I think if we want feature parity with Python we will need to take on the dependency. I would like to hear the maintainers' thoughts before committing to this course of action, though.

Issue number: closes #4960


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@pull-request-size pull-request-size bot added the size/XXL PRs with 1K+ LOC, largely documentation related label Mar 29, 2026
@svozza svozza force-pushed the 001-data-masking branch 4 times, most recently from b8ad946 to 8bf61c2 Compare March 29, 2026 19:00
@svozza
Contributor Author

svozza commented Mar 29, 2026

A note here: I wasted a couple of hours getting the end-to-end tests to work because the linter wouldn't allow us to create an async function without the await keyword, and the LLM ended up accidentally making the Lambda function sync because of this. There are perfectly valid reasons to use the async keyword in a function that doesn't use await, e.g., when the value we return is a call to an async function: return someAsyncFunction(value). This has caused me issues multiple times in the past and I think we should disable this rule. I want a developer or an LLM to be able to know a function is async, and the best way to signal that is the async keyword, not the presence of await in the body. In fact, the Biome docs already say that this rule is not recommended:

Summary

  • Rule available since: v1.4.0
  • Diagnostic Category: lint/suspicious/useAwait
  • This rule isn’t recommended, so you need to enable it.

I will raise a separate issue and PR to handle this.
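For reference, the pattern in question looks roughly like the sketch below. `someAsyncFunction`, `handler`, and `syncLookingHandler` are placeholder names, not code from this PR. The first handler is the form the rule flags. The second shows what tends to happen when `async` is dropped to satisfy the linter: the signature no longer advertises the function as async, and a synchronous throw escapes instead of becoming a rejected promise.

```typescript
// Placeholder for any promise-returning call.
const someAsyncFunction = async (value: string): Promise<string> => {
  await Promise.resolve();
  return value.toUpperCase();
};

// `async` without `await`: the form the useAwait rule complains about.
// The keyword still guarantees that callers always receive a Promise.
const handler = async (value: string): Promise<string> => {
  if (value === '') throw new Error('empty input'); // becomes a rejection
  return someAsyncFunction(value);
};

// Same body without `async`: satisfies the rule, but the throw below
// now happens synchronously, before any Promise is returned.
const syncLookingHandler = (value: string): Promise<string> => {
  if (value === '') throw new Error('empty input'); // thrown synchronously
  return someAsyncFunction(value);
};
```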

@svozza
Contributor Author

svozza commented Apr 2, 2026

I have removed the JMESPath dependency here. It's an unnecessary dependency because we only use it for reading, and it could mislead users into thinking we support the whole JMESPath spec for writes. In fact, for this MVP, we only support the following write paths:

  • dot notation (obj.key)
  • wildcard for object keys (obj.*.key)
  • wildcard for arrays (obj.[*].key)

I am very confident in the implementation of these three cases as they are validated with property tests, which are much more thorough than traditional unit tests.
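To show what resolving these three path forms into concrete paths can look like, here is a minimal sketch. It is not the code in this PR: `resolvePaths`, `toSegments`, and the `FORBIDDEN` key list are hypothetical, and accepting both `obj.[*]` and `obj[*]` spellings is an assumption of the example. Dangerous segments such as `__proto__` are rejected up front, in the spirit of the prototype pollution protection mentioned in the description.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Segments refused by this sketch to avoid prototype pollution on write-back.
const FORBIDDEN = new Set(['__proto__', 'constructor', 'prototype']);

const isPlainObject = (v: Json): v is { [key: string]: Json } =>
  typeof v === 'object' && v !== null && !Array.isArray(v);

// "a.*.b", "a.[*].b" and "a[*].b" all become ['a', '*' or '[*]', 'b'].
const toSegments = (path: string): string[] =>
  path
    .replace(/\.?\[\*\]/g, '.[*]')
    .split('.')
    .filter((segment) => segment !== '');

// Expands wildcards against the data and returns every concrete path found.
const resolvePaths = (data: Json, path: string): string[][] => {
  const segments = toSegments(path);
  if (segments.some((segment) => FORBIDDEN.has(segment))) {
    throw new Error(`Unsafe path segment in "${path}"`);
  }
  const results: string[][] = [];
  const walk = (node: Json, index: number, prefix: string[]): void => {
    const segment = segments[index];
    if (segment === undefined) {
      results.push(prefix); // consumed every segment: a concrete path
      return;
    }
    if (segment === '[*]') {
      if (!Array.isArray(node)) return;
      node.forEach((child, i) => walk(child, index + 1, [...prefix, String(i)]));
      return;
    }
    if (!isPlainObject(node)) return;
    const hasKey = Object.prototype.hasOwnProperty.call(node, segment);
    const keys =
      segment === '*'
        ? Object.keys(node).filter((key) => !FORBIDDEN.has(key))
        : hasKey
          ? [segment]
          : [];
    for (const key of keys) {
      const child = node[key];
      if (child !== undefined) walk(child, index + 1, [...prefix, key]);
    }
  };
  walk(data, 0, []);
  return results;
};

const payload: Json = {
  customers: { alice: { ssn: '111' }, bob: { ssn: '222' } },
  orders: [{ card: 'a' }, { card: 'b' }],
};

const ssnPaths = resolvePaths(payload, 'customers.*.ssn');
// [['customers', 'alice', 'ssn'], ['customers', 'bob', 'ssn']]
const cardPaths = resolvePaths(payload, 'orders.[*].card');
// [['orders', '0', 'card'], ['orders', '1', 'card']]
```

Paths that do not match the data resolve to an empty list in this sketch, so a write-back step would simply have nothing to do for them.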

svozza added 16 commits April 2, 2026 17:15
Add new @aws-lambda-powertools/data-masking package with support for:
- Irreversible field erasure with default or custom masking rules
- Field-level and full-payload encryption/decryption via AWS Encryption SDK
- Encryption context for integrity and authenticity
- Dot notation and [*] wildcard field selection
- Prototype pollution protection

Includes unit tests with property-based testing (fast-check),
e2e test scaffolding, and user-facing documentation.
…card support

Replace jmespath validation with native path resolution, add object wildcard
(.*) support alongside existing array wildcard ([*]), update JSDoc, add
property-based tests, encryption context provider tests, and enable
data-masking in CI test matrix.
…functions

Reduce cyclomatic complexity in walk by extracting wildcard/literal
segment resolution into a helper. Convert module-level functions to
arrow expressions per project code standards.
@sonarqubecloud

sonarqubecloud bot commented Apr 2, 2026



Development

Successfully merging this pull request may close these issues.

Feature request: Add DataMasking utility for encrypting and masking sensitive data

1 participant