
fix(lapis): allow non-ASCII characters in advanced queries #1628

Open
fengelniederhammer wants to merge 6 commits into main from 1603-check-umlauts-and-other-non-ascii-characters-in-advanced-queries



@fengelniederhammer fengelniederhammer commented Apr 2, 2026

resolves #1603

Problem

Non-ASCII characters (umlauts, accented letters, Cyrillic, CJK, etc.) in unquoted advanced query values were silently dropped by the ANTLR lexer, producing wrong results with no error. For example:

  • division=Zürich was parsed as division=Zrich → 0 results
  • division.regex=Graubünden was parsed as division.regex=Graubnden → 0 results

Quoted values like division='Zürich' already worked correctly, since the QUOTED_STRING lexer rule accepts any character.
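To illustrate the failure mode, here is a toy model in Python (not the actual ANTLR lexer, and `lex_value_ascii_only` is a hypothetical name): a tokenizer that only recognizes ASCII letters and digits and skips everything else without raising an error reproduces the examples above.

```python
import re

# Toy stand-in for a lexer whose rules only cover ASCII letters and digits.
# Characters that match no rule are skipped without any error being raised.
ASCII_ONLY = re.compile(r"[A-Za-z0-9]")


def lex_value_ascii_only(value: str) -> str:
    """Return the value as such a lexer would effectively see it."""
    return "".join(ASCII_ONLY.findall(value))


# "\u00fc" is the precomposed character "ü"
print(lex_value_ascii_only("Z\u00fcrich"))      # Zrich
print(lex_value_ascii_only("Graub\u00fcnden"))  # Graubnden
```

The query is still syntactically valid after the character is dropped, which is why the result was an empty response rather than an error.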

Fix

Added a UNICODE_LETTER lexer rule ([\p{Letter}]) and included it in the charOrNumber parser rule. This makes unquoted values behave consistently with quoted ones for any Unicode letter. ASCII letters continue to be matched by the existing AZ lexer rules (which take priority by rule order), so all existing parsing — nucleotide/amino acid symbols, keywords (NOT, MAYBE, ISNULL, etc.) — is unaffected.
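The rule-order behavior can be sketched in Python as follows. This is a simplified model under stated assumptions, not the grammar itself: `classify` and `lex_value` are hypothetical helpers, the per-letter ASCII rules are modeled by returning the upper-cased letter as the token name, and `NUMBER` is an assumed token name. Unicode general categories starting with `L` correspond to `\p{Letter}`.

```python
import unicodedata


def classify(ch: str) -> str:
    """Return an illustrative token name for a single character."""
    # ASCII letters are checked first, mirroring rule-order priority:
    # the existing per-letter rules win over the new fallback rule.
    if ch.isascii() and ch.isalpha():
        return ch.upper()
    if ch.isascii() and ch.isdigit():
        return "NUMBER"  # assumed token name
    # Fallback for any other Unicode letter (categories Lu, Ll, Lt, Lm, Lo).
    if unicodedata.category(ch).startswith("L"):
        return "UNICODE_LETTER"
    raise ValueError(f"no rule matches {ch!r}")


def lex_value(value: str) -> list[tuple[str, str]]:
    """Tokenize an unquoted value character by character."""
    return [(classify(ch), ch) for ch in value]


tokens = lex_value("Z\u00fcrich")
print([name for name, _ in tokens])
# ['Z', 'UNICODE_LETTER', 'R', 'I', 'C', 'H']
print("".join(ch for _, ch in tokens))  # Zürich
```

In this model no character is lost: ASCII letters keep their existing token types, and only characters that previously matched nothing fall through to `UNICODE_LETTER`.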

Non-ASCII characters are also now valid in field name and gene/segment name positions, where they will produce a meaningful "field/gene not found" error rather than a silent wrong result or a confusing syntax error.

(also see antlr/antlr4#1688 for some background info)

PR Checklist

  • All necessary documentation has been adapted.
  • All necessary changes are explained in the llms.txt.
  • The implemented feature is covered by an appropriate test.



Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Development

Successfully merging this pull request may close these issues.

Check Umlauts and other non-ASCII characters in advanced queries

2 participants