Performance: ByteTables + targeted byte-walking for hot parse paths #2070
Draft
ByteTables provides pre-computed 256-entry boolean lookup arrays for byte classification (IDENT_START, IDENT_CONT, WORD, DIGIT, WHITESPACE) and named constants for delimiter bytes (NEWLINE, DASH, DOT, HASH). bench_quick.rb measures parse µs, render µs, and object allocations for the theme benchmark suite.
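As a rough illustration of the lookup-table idea, the sketch below builds a few frozen 256-entry boolean arrays. The module name `ByteTablesSketch` and the `build` helper are hypothetical, and the exact membership of each table in the PR is not shown here. `IDENT_START` and `IDENT_CONT` are left out because their contents are not stated.

```ruby
# Hypothetical sketch, not the PR's actual ByteTables module.
module ByteTablesSketch
  # Build a frozen 256-entry boolean array; entry b is true when the block accepts byte b.
  def self.build
    Array.new(256) { |b| yield(b) ? true : false }.freeze
  end

  DIGIT = build { |b| b >= 48 && b <= 57 } # '0'..'9'

  # ASCII word bytes, mirroring \w: letters, digits, underscore (no hyphen).
  WORD = build do |b|
    (b >= 97 && b <= 122) || (b >= 65 && b <= 90) || DIGIT[b] || b == 95
  end

  # Assumed to cover the bytes String#strip treats as whitespace:
  # null, tab, LF, VT, FF, CR, space.
  WHITESPACE = build { |b| b == 0 || b == 32 || (b >= 9 && b <= 13) }

  # Named delimiter bytes.
  NEWLINE = 10
  DASH    = 45
  DOT     = 46
  HASH    = 35
end

byte = "7".getbyte(0)
ByteTablesSketch::DIGIT[byte] # one array index instead of a chained range comparison
```

The point of the design is that classification becomes a single `TABLE[byte]` read, with the tables built once at load time and never mutated.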
parse_number: replace INTEGER_REGEX/FLOAT_REGEX matching and StringScanner loop with a single byte-walking pass using ByteTables::DIGIT. Avoids MatchData allocation and StringScanner reset on every call. Expression.parse: only call String#strip when leading/trailing whitespace is actually present (checked via ByteTables::WHITESPACE). Avoids allocating a new String on ~4,464 calls per compile.
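A minimal sketch of both ideas, assuming the behavior described above (integers, negatives, trailing dots, multi-dot truncation, rejection of trailing non-digits). The module and method names are hypothetical, and the PR's implementation may differ in detail.

```ruby
# Hypothetical sketch; names and structure are illustrative only.
module NumberSketch
  DIGIT      = Array.new(256) { |b| b >= 48 && b <= 57 }.freeze
  WHITESPACE = Array.new(256) { |b| b == 0 || b == 32 || (b >= 9 && b <= 13) }.freeze
  DASH = 45
  DOT  = 46

  module_function

  # Single forward pass over the bytes; returns Integer, Float, or nil.
  def parse_number(markup)
    len = markup.bytesize
    pos = 0
    pos += 1 if markup.getbyte(0) == DASH
    return nil if pos >= len                        # "" or "-"
    return nil unless DIGIT[markup.getbyte(pos)]    # must start with a digit

    first_dot = nil  # offset of the first dot, if any
    cut = nil        # offset of the second dot, where the float is truncated
    while pos < len
      byte = markup.getbyte(pos)
      if byte == DOT
        if first_dot.nil?
          first_dot = pos
        elsif cut.nil?
          cut = pos
        end
      elsif !DIGIT[byte]
        return nil                                  # e.g. trailing alpha in "1.2.3a"
      end
      pos += 1
    end

    return Integer(markup, 10) if first_dot.nil?
    (cut ? markup.byteslice(0, cut) : markup).to_f
  end

  # Call String#strip only when the first or last byte is whitespace,
  # so already-clean input is returned without allocating a new String.
  def strip_if_needed(markup)
    return markup if markup.empty?
    if WHITESPACE[markup.getbyte(0)] || WHITESPACE[markup.getbyte(markup.bytesize - 1)]
      markup.strip
    else
      markup
    end
  end
end

NumberSketch.parse_number("-42")    # => -42
NumberSketch.parse_number("123.")   # => 123.0
NumberSketch.parse_number("1.2.3")  # => 1.2
NumberSketch.parse_number("1.2.3a") # => nil
```

Because the guard only decides whether to call `String#strip`, the result is the same as an unconditional `strip` as long as the table covers every byte `strip` would remove.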
Skip the expensive recursive VariableParser regex for simple lookups like 'product.title' (~90% of real-world cases). SIMPLE_LOOKUP_RE validates the input is a plain a.b.c chain (no brackets, no quotes). On match, byte-walks on dots to split segments instead of invoking the regex engine. Falls through to the original VariableParser scan for complex inputs.
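The fast path might look roughly like the sketch below. The regex here is an assumed stand-in for `SIMPLE_LOOKUP_RE` (the real pattern is not shown above), derived from the `[\w-]+\??` segment form in the VariableParser regex. The method name is hypothetical.

```ruby
# Hypothetical sketch; the PR's SIMPLE_LOOKUP_RE may be defined differently.
module LookupSketch
  # Assumed shape: one or more [\w-]+ segments, each with an optional '?', joined by dots.
  SIMPLE_LOOKUP_SKETCH_RE = /\A[\w-]+\??(?:\.[\w-]+\??)*\z/
  DOT = 46

  module_function

  # Returns the dot-separated segments for a simple chain, or nil so the
  # caller can fall through to the full regex-based scan.
  def split_simple_lookup(markup)
    return nil unless SIMPLE_LOOKUP_SKETCH_RE.match?(markup)

    segments = []
    start = 0
    pos = 0
    len = markup.bytesize
    while pos < len
      if markup.getbyte(pos) == DOT
        segments << markup.byteslice(start, pos - start)
        start = pos + 1
      end
      pos += 1
    end
    segments << markup.byteslice(start, len - start)
    segments
  end
end

LookupSketch.split_simple_lookup("product.title") # => ["product", "title"]
LookupSketch.split_simple_lookup("items[0].name") # => nil (complex input, use the slow path)
```

`match?` is used for validation because it does not allocate a `MatchData`; the byte-walk then allocates only the segment strings.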
Add try_parse_tag_token that parses {%...%} tag tokens using
getbyte/byteslice + ByteTables lookup arrays instead of the
FullToken regex with 4 capture groups. Allocates only the 2
strings needed (tag_name, markup) vs 4+ from regex captures.
Uses ByteTables::WORD (no hyphen) for tag name scanning,
matching TagName = /#|\w+/ exactly. Falls back to FullToken
regex when the fast path returns nil.
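The following is a sketch of how such a byte-walking tag parser could be written, under the assumption that it returns the tag name, the markup, and a count of newlines skipped before the markup. The module name, return shape, and local tables are hypothetical stand-ins for the PR's `BlockBody` method and `ByteTables`.

```ruby
# Hypothetical sketch; the actual try_parse_tag_token may differ.
module TagTokenSketch
  WORD = Array.new(256) { |b|
    (b >= 48 && b <= 57) || (b >= 65 && b <= 90) || (b >= 97 && b <= 122) || b == 95
  }.freeze
  # Bytes matched by regex \s: space, \t, \n, \v, \f, \r.
  WHITESPACE = Array.new(256) { |b| b == 32 || (b >= 9 && b <= 13) }.freeze
  LBRACE = 123
  RBRACE = 125
  PERCENT = 37
  DASH = 45
  HASH = 35
  NEWLINE = 10

  module_function

  # Returns [tag_name, markup, newlines] or nil when the token is not a
  # well-formed tag, in which case the caller would fall back to the regex.
  def try_parse_tag_token(token)
    len = token.bytesize
    return nil if len < 4
    return nil unless token.getbyte(0) == LBRACE && token.getbyte(1) == PERCENT
    return nil unless token.getbyte(len - 2) == PERCENT && token.getbyte(len - 1) == RBRACE

    pos = 2
    stop = len - 2                                               # exclusive end of the tag body
    pos += 1 if pos < stop && token.getbyte(pos) == DASH         # "{%-"
    stop -= 1 if stop > pos && token.getbyte(stop - 1) == DASH   # "-%}"

    newlines = 0
    while pos < stop && WHITESPACE[token.getbyte(pos)]
      newlines += 1 if token.getbyte(pos) == NEWLINE
      pos += 1
    end

    # Tag name is either a single '#' or a run of word bytes (no hyphen).
    name_start = pos
    if pos < stop && token.getbyte(pos) == HASH
      pos += 1
    else
      pos += 1 while pos < stop && WORD[token.getbyte(pos)]
    end
    return nil if pos == name_start
    tag_name = token.byteslice(name_start, pos - name_start)

    while pos < stop && WHITESPACE[token.getbyte(pos)]
      newlines += 1 if token.getbyte(pos) == NEWLINE
      pos += 1
    end

    [tag_name, token.byteslice(pos, stop - pos), newlines]
  end
end

TagTokenSketch.try_parse_tag_token("{% if x %}")           # => ["if", "x ", 0]
TagTokenSketch.try_parse_tag_token("{%- assign a = 1 -%}") # => ["assign", "a = 1 ", 0]
TagTokenSketch.try_parse_tag_token("{{ x }}")              # => nil
```

Only two strings are allocated per token (the two `byteslice` results); the leading and inter-token whitespace is walked over rather than captured.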
36 tests covering the three optimization sites:
Expression.parse_number (13 tests):
- Simple integers, negatives, floats, trailing dots
- Multi-dot truncation (1.2.3 → 1.2)
- Rejection of non-numeric input and trailing alpha (1.2.3a)
Expression.parse strip guard (7 tests):
- Leading, trailing, both-sides whitespace
- No-strip-needed case (no allocation)
- Null byte stripping (matches String#strip behavior)
VariableLookup.simple_lookup? (8 tests):
- Accepts: single names, dotted chains, question marks, hyphens
- Rejects: brackets, empty, leading/trailing dots, double dots
VariableLookup fast path equivalence (7 tests):
- name/lookups/command_flags match for simple and deep chains
- Bracket inputs fall through to regex path correctly
BlockBody.try_parse_tag_token (10 tests):
- Simple tags, whitespace control variants ({%-, -%}, both)
- No-markup tags, hash comments, newline counting
- Hyphenated names stop at hyphen (matching TagName = /\w+/)
- Malformed tokens return nil (fallback to FullToken regex)
Summary
Add pre-computed byte lookup tables (ByteTables) and apply byte-walking to three hot parsing paths, replacing regex matching and StringScanner usage in the most frequently called parse methods.

+208 / -61 across 5 files (net +147 lines of production code).
Results (Ruby 4.0.2 + YJIT, theme benchmark)
What changed
1. ByteTables module (new, 44 lines)

Pre-computed 256-entry boolean arrays for byte classification: IDENT_START, IDENT_CONT, WORD, DIGIT, WHITESPACE. A single array index (TABLE[byte]) replaces 3-5 chained comparisons per byte check. Built once at load time, frozen.

2. Expression.parse_number: byte-walk instead of regex + StringScanner

Replaced INTEGER_REGEX/FLOAT_REGEX matching and the StringScanner byte loop with a single forward pass using ByteTables::DIGIT. Avoids MatchData allocation and StringScanner reset per call. Handles all edge cases: negative numbers, multi-dot floats (1.2.3.4), trailing dots (123.), and rejects trailing non-numeric bytes (1.2.3a).

Also guards Expression.parse's String#strip call: it only allocates when whitespace is actually present (~4,464 avoided allocations per compile).

3. VariableLookup: fast path for simple identifier chains

SIMPLE_LOOKUP_RE validates that input is a plain a.b.c chain (no brackets, no quotes; ~90% of real-world lookups). On match, byte-walks on dots instead of invoking the recursive VariableParser regex (/\[(?>[^\[\]]+|\g<0>)*\]|[\w-]+\??/). Falls through to the original path for complex inputs.

4. BlockBody.try_parse_tag_token: byte-walk tag tokens

Parses {%...%} tokens using getbyte/byteslice + ByteTables instead of the FullToken regex with 4 capture groups. Allocates only the 2 strings needed (tag_name, markup) vs 4+ from regex captures. Uses ByteTables::WORD (no hyphen) for tag name scanning, matching TagName = /#|\w+/ exactly. Falls back to the FullToken regex on nil.

Design principles
nil/false, the original regex path runs. Zero risk for edge cases.

Review process
This was developed iteratively with multi-agent code review covering:
IDENT_CONT, ? suffix in tag names, multi-dot trailing alpha in parse_number)
WHITESPACE table vs String#strip
match? + byte-walk is near-optimal (pure byte-walk is 2.8× slower than regex for validation)