feat: add src/regex.ts core module + aggregate() regex filter#113
feat: add src/regex.ts core module + aggregate() regex filter#113
Conversation
- isRegexQuery(): detects /pattern/flags token in query string - buildApiQuery(): derives a safe GitHub API query from a regex query - top-level alternation A|B|C → "A OR B OR C" (GitHub OR operator) - otherwise: longest word-char literal sequence outside [...], escapes, metacharacters - qualifier tokens (filename:, language:, path:) preserved unchanged - <3 char result → warn (caller must use --regex-hint) - invalid regex → null filter + warn - aggregate(): new optional regexFilter parameter filters CodeMatch entries by testing at least one TextMatch.fragment against the RegExp Closes #111 Part of #110
|
Coverage after merging feat/regex-core into main will be
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment.
Pull request overview
Adds a pure, testable core for regex-style queries (/pattern/flags) by extracting an API-safe literal term while returning a RegExp for local post-filtering, plus a new optional regexFilter capability in aggregation.
Changes:
- Added
src/regex.tswithisRegexQuery()andbuildApiQuery()for regex token detection, API query derivation, and localRegExpcompilation. - Extended
aggregate()with an optionalregexFilter?: RegExpto dropCodeMatchentries whose fragments don’t satisfy the regex. - Added/expanded unit tests for regex query parsing/derivation and aggregation filtering behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/regex.ts | New core module that detects /.../flags, derives a GitHub API query term, and returns a compiled RegExp for local filtering. |
| src/regex.test.ts | New unit tests validating regex detection, API term derivation (literal/OR), qualifier handling, flags, and warn/invalid cases. |
| src/aggregate.ts | Adds optional regexFilter support to filter matches by textMatches.fragment. |
| src/aggregate.test.ts | Adds coverage for aggregate() behavior with/without a regex filter, multi-fragment cases, flags, and empty fragments. |
- regex.ts: strip y (sticky) flag alongside g to prevent stateful RegExp.test() across independent fragments - regex.ts: preserve all non-regex tokens (free-text and qualifiers alike) in buildApiQuery — previously only qualifier-prefixed tokens were kept, silently dropping terms like 'useFeatureFlag NOT deprecated' - regex.ts: fix splitTopLevelAlternation escape detection — replace prev==='\\' check with an 'escaped' boolean so that \\|foo correctly identifies | as top-level (the double backslash escapes a backslash, not the pipe) - aggregate.ts: reset regexFilter.lastIndex = 0 before each .test() call to avoid false negatives with global/sticky regexes - regex.test.ts: add .toBe(true) to the missing assertion (line 16) - regex.test.ts: fix test description for qualifier-preservation case (remove non-existent :\s*\" from name) - regex.test.ts: add tests for y-flag stripping, free-text preservation, and escaped-backslash-before-pipe corner case
|
Coverage after merging feat/regex-core into main will be
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
- regex.ts: raise branch length threshold from >= 1 to >= 3 in extractApiTerm — short branches like /a|bc/ now fall through to the longest-literal path which enforces the existing '< 3 chars → warn' rule, preventing extremely broad API queries such as 'a OR bc' - aggregate.ts: align regexFilter type with buildApiQuery return value; accept RegExp | null | undefined (via RegExp | null optional param) and update guard from !== undefined to != null so callers can pass the result of buildApiQuery directly without a null→undefined conversion - regex.test.ts: add test for short-branch alternation fallback (/a|bc/) - aggregate.test.ts: add test for null regexFilter treated as no-op
|
Coverage after merging feat/regex-core into main will be
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The flag character class now includes all current JS RegExp flags: g i m s u y (already present) + d (hasIndices, ES2022) and v (unicodeSets, ES2023). The comment is updated to list them explicitly. This ensures /pattern/s (dotAll), /pattern/d, and /pattern/v are tokenized correctly and the regex token is replaced in the API query rather than left as a raw /.../flags string unsafe for the GitHub API. Add test case for /pattern/s to assert the flag is recognized and the API query is correctly derived. Closes review comment on PR #113
|
Coverage after merging feat/regex-core into main will be
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
…stIndex leak
src/regex.ts — extractRegexToken():
- Exclude \r and \n from the character class ([^/\\] → [^/\\\r\n])
so the comment ("doesn't contain unescaped newlines") matches reality.
- Add (?=$|\s) end-of-token boundary after the flags so a word like
/useState/iSomething is not mis-tokenised (the non-flag suffix stops
the match before it can anchor to whitespace/end-of-string).
src/regex.test.ts:
- isRegexQuery('/useState/iSomething') → false (boundary regression)
- buildApiQuery('/useState/iSomething') → passthrough, regexFilter null
src/aggregate.ts — aggregate():
- Reset regexFilter.lastIndex = 0 after the some() loop so callers
that reuse the same RegExp instance don't observe a stale lastIndex
after aggregate() returns.
src/aggregate.test.ts:
- New test: verify lastIndex is 0 after aggregate() when using /g regex
|
Coverage after merging feat/regex-core into main will be
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
…ents
src/regex.ts — extractRegexToken():
- Remove 'e' from the allowed flag set: [gimsuydev]* -> [gimsuydv]*
'e' is not a valid JavaScript RegExp flag; its presence would cause
/pattern/e to be tokenised as a regex and then fail compilation,
instead of being passed through unchanged as a plain query token.
src/regex.ts — splitTopLevelAlternation():
- Clarify truncated inline comments:
depth: 'tracks unescaped ( nesting'
→ 'tracks unescaped \"(\" nesting depth outside character classes'
inClass: 'tracks [...]'
→ 'tracks whether we are currently inside a character class [...]'
src/regex.test.ts:
- isRegexQuery('/pattern/e') → false (regression test)
- buildApiQuery('/pattern/e') → passthrough, regexFilter null
|
Coverage after merging feat/regex-core into main will be
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
…warn src/aggregate.ts — aggregate(): - Capture regexFilter.lastIndex before filtering and restore it afterward (instead of hard-coding 0). This ensures aggregate() has no observable side effects on the caller's RegExp instance: a caller that set lastIndex intentionally before calling aggregate() will find it unchanged on return. src/aggregate.test.ts: - Updated test: pre-seed a non-zero lastIndex and assert it is restored to that exact value after aggregate() returns (not forced to 0). src/regex.ts — buildApiQuery(): - Capture the thrown Error and include err.message in the warn string. Callers now see the engine's own diagnostic (e.g. 'Invalid regular expression: /[/: Unterminated character class') instead of a generic 'Invalid regex pattern …' that loses the root cause. src/regex.test.ts: - Extended the invalid-regex test to assert the raw token (/[/) appears in the warn string for easy identification.
|
Coverage after merging feat/regex-core into main will be
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
… test title
src/regex.ts — longestLiteralSequence():
- Remove 'a' and 'e' from isSpecialEscape. In JS without u/v, \a and \e
are identity escapes that simply match the literal characters 'a' and 'e',
so they should be accumulated like any other word char, not break the
sequence. Only true non-literal regex escapes remain (b, B, d, D, s, S,
w, W, p, P, u, U, x, X, 0-9, n, r, t, f, v).
src/regex.ts — buildApiQuery():
- Expand the comment explaining why both g and y are stripped:
g: GitHub returns at most a few fragments, not all occurrences.
y: sticky makes RegExp.test() stateful via lastIndex.
Both are intentional; all other flags are kept unchanged.
src/regex.test.ts:
- Fix test title: was missing ']' after first character class
('/'['\"axios['\"]:/') → correct ('/['\"']axios['\"]:/'), now
matching the actual query string under test.
|
Coverage after merging feat/regex-core into main will be
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The previous split(/\s+/) + join approach would break GitHub quoted phrases (e.g. "feature flag" /from.*axios/) by splitting 'feature flag' into two separate tokens and losing the quote grouping. Fix: use q.replace(raw, term).trim() to replace only the regex token in-place, leaving every other character in the original query unchanged. This preserves quoted phrases, extra whitespace handling, and all qualifiers byte-for-byte. Regression test added: '"feature flag" /from.*axios/' → '"feature flag" axios'
|
Coverage after merging feat/regex-core into main will be
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
String.replace(raw, term) replaces the first *substring* occurrence of raw,
which can target the wrong position when the same text appears earlier in the
query as a non-token prefix (e.g. '/useState/iSomething /useState/i' — the
string '/useState/i' is a substring of the first token but the actual matched
token is the second one).
Fix:
- Add `index: number` to RegexToken, storing the exact start position of the
token within the original query (computed from m.index + leading-space offset).
- Use slice-based splice in buildApiQuery():
q.slice(0, index) + term + q.slice(index + raw.length)
This replaces only the boundary-validated token at its exact location,
regardless of whether the same raw text appears elsewhere.
Regression test added:
'/useState/iSomething /useState/i' → '/useState/iSomething useState'
|
Coverage after merging feat/regex-core into main will be
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Coverage after merging feat/regex-core into main will be
Coverage Report
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Summary
Part of #110 — closes #111.
Implements the pure-function core that enables regex query support in
github-code-search. The GitHub REST API does not support/pattern/syntax, so we work around it by:No I/O, no side effects — only pure functions eligible for unit testing.
What changed
src/regex.ts(new)Two public exports:
isRegexQuery(q)— returnstrueif the query contains a/pattern/or/pattern/flagstoken.buildApiQuery(q)— deterministic algorithm:apiQuery/from.*['"]axios/axios[a-zA-Z0-9_-]sequence outside[...], escapes, metacharacters/TODO|FIXME|HACK/TODO OR FIXME OR HACKORoperator/require\(['"]old-lib['"]\)/old-libfilename:package.json /["']axios["']:/filename:package.json axios/[~^]?[0-9]+\.[0-9]+/""+ warn/[/(invalid)""+ warn + null filterplain textplain textgflag is stripped (GitHub doesn't return all inline occurrences).iand other flags are kept and applied client-side.src/aggregate.tsAdded optional
regexFilter?: RegExpparameter toaggregate(). When set, aCodeMatchis kept only if at least oneTextMatch.fragmentmatches the regex. Existing callers unaffected (parameter is optional).src/regex.test.ts(new) — 22 test casesAll acceptance criteria from the issue covered: literal extraction, top-level alternation, qualifier preservation, flag handling, warn cases, invalid regex, plain-text passthrough.
src/aggregate.test.ts6 new cases covering the
regexFilterparameter: keep/exclude by fragment match, multi-fragment, case-insensitive flag, emptytextMatchesarray, backward-compat (undefined filter).How to test
Notes for reviewer
longestLiteralSequenceonly accumulates[a-zA-Z0-9_-]— this is intentional. Punctuation that looks like a literal (e.g.\() produces poor GitHub search terms and is excluded.[...]are skipped entirely (content like0-9is useless as a search term).>=comparison when updatingbestmeans later sequences of equal length win — this ensuresold-libbeatsrequirein/require\(['"]old-lib['"]\)/.Next
#112 wires this into the CLI (
searchAction), adds--regex-hint, UX messaging, docs and C4 diagram update.