Skip to content

Add std::encoding::xml — XML parser and serializer#3074

Closed
ChristianReifberger wants to merge 2 commits intoc3lang:masterfrom
ChristianReifberger:add-xml-parser-stdlib
Closed

Add std::encoding::xml — XML parser and serializer#3074
ChristianReifberger wants to merge 2 commits intoc3lang:masterfrom
ChristianReifberger:add-xml-parser-stdlib

Conversation

@ChristianReifberger
Copy link
Copy Markdown
Contributor

@ChristianReifberger ChristianReifberger commented Mar 27, 2026

  • Recursive-descent parser supporting elements, attributes, text, CDATA, comments, PIs, DOCTYPE, XML declaration, UTF-8 BOM
  • Full entity expansion: named (& < > ' ") and numeric (decimal A and hex A)
  • Serializer (encode/tencode) with text and attribute escaping
  • Public API: parse/parse_string/tparse/tparse_string, XmlDoc.encode/tencode/free, XmlNode.get_attr/has_attr/child/children_named/text_content and operators(len)/[]/&[]
  • Comprehensive unit tests in test/unit/stdlib/encoding/xml.c3

Types

  • XmlDoc — the parsed document; owns all allocated nodes via its allocator
  • XmlNode — a single node in the tree; owns its name, content, attrs, and children
  • XmlNodeType — enum: ELEMENT, TEXT, CDATA, COMMENT, PI
  • faultdef UNEXPECTED_CHARACTER, UNEXPECTED_EOF, INVALID_TAG, MISMATCHED_TAG, INVALID_ATTRIBUTE, MAX_DEPTH_REACHED
  • int max_depth = 128 — configurable nesting limit (matches std::encoding::json)

API

Method Description
xml::parse(allocator, stream) Parse from any InStream, heap-allocated; returns XmlDoc?
xml::parse_string(allocator, s) Parse from a String; returns XmlDoc?
xml::tparse(stream) Parse from InStream, temp-allocated
xml::tparse_string(s) Parse from String, temp-allocated
XmlDoc.free() Free all owned memory
XmlDoc.encode(allocator) Serialize to XML text with declaration
XmlDoc.tencode() encode using temp allocator
XmlNode.get_attr(name) → String? Attribute value, or NOT_FOUND
XmlNode.has_attr(name) → bool True if attribute exists
XmlNode.child(tag) → XmlNode*? First child element with given tag, or NOT_FOUND
XmlNode.children_named(tag) → XmlNode*[] Temp-allocated slice of all child elements with given tag
XmlNode.text_content() → String Concatenated text of all TEXT/CDATA children
XmlNode.len → usz Number of children (@operator(len))
XmlNode[i] → XmlNode* Index into children (@operator([]))
&XmlNode[i] → XmlNode** Reference to child slot (@operator(&[]))

Implementation notes

  • The parser is a recursive-descent reader over any InStream. A single shared DString scratch (stack-allocated via stack_mem) is used for all intermediate string building; only strings that survive past the current parse step are heap-copied.
  • Closing-tag names are compared directly against ctx.scratch without allocation (read_name_scratch), saving one heap round-trip per element close.
  • Attribute names are heap-allocated for the HashMap.set call then immediately freed, since HashMap{String, String} copies its keys (COPY_KEYS=true).
  • XmlNode.free() recurses through children; leaf nodes skip the attr/children teardown since they are never initialized. defer catch node.free() in parse_element ensures partially-constructed nodes are cleaned up on any parse error.
  • encode() builds output into a temp-backed ByteWriter and copies the result into the target allocator at the end, keeping the hot path allocation-free.

Parser features

  • Elements, self-closing elements, attributes (single and double quoted)
  • Text content with entity expansion: named (&amp; &lt; &gt; &apos; &quot;) and numeric (decimal &#65; and hex &#x41;)
  • CDATA sections (<![CDATA[...]]>)
  • Comments (<!-- ... -->) — skipped during parsing
  • Processing instructions (<?target data?>) — skipped during parsing
  • XML declaration (<?xml ... ?>) — skipped
  • DOCTYPE (<!DOCTYPE ...>) — skipped, including internal subsets with [...]
  • UTF-8 BOM (EF BB BF) — consumed silently
  • Configurable nesting depth limit (max_depth, default 128)

Tests

test/unit/stdlib/encoding/xml.c3 covers: simple elements, nested elements, text content, attributes (single/double quoted), entity expansion (all five named entities, decimal and hex numeric), CDATA sections, comments (inline and prologue), XML declaration, DOCTYPE (simple and with internal subset), mixed content, children_named, whitespace around root, encode/decode roundtrip, tencode self-closing, temp-allocator variants, and error cases (MISMATCHED_TAG, UNEXPECTED_EOF, INVALID_ATTRIBUTE, MAX_DEPTH_REACHED). Includes a comprehensive integration test exercising all features in a single document with a parse → encode → parse roundtrip.

@ChristianReifberger ChristianReifberger marked this pull request as ready for review March 27, 2026 14:40
@ManuLinares
Copy link
Copy Markdown
Member

Hey, I've been testing the std::encoding::xml against libxml2 with a verification tool, and found a few things:

  1. Custom entities break parsing. It doesn't handle things like &hsize5; (common in DocBook), throwing UNEXPECTED_CHARACTER.
  2. Whitespace bug in attributes. Numeric references like &#xA; (newline) are being flattened into spaces instead of being preserved.
  3. Comments are lost. The parser currently just skips them entirely. This is fine.

Maybe you would like to add these benchmarks? benchmarks/stdlib/encoding/xml.c3 xml.txt

@ChristianReifberger
Copy link
Copy Markdown
Contributor Author

Hey, I've been testing the std::encoding::xml against libxml2 with a verification tool, and found a few things:

  1. Custom entities break parsing. It doesn't handle things like &hsize5; (common in DocBook), throwing UNEXPECTED_CHARACTER.
  2. Whitespace bug in attributes. Numeric references like &#xA; (newline) are being flattened into spaces instead of being preserved.
  3. Comments are lost. The parser currently just skips them entirely. This is fine.

Maybe you would like to add these benchmarks? benchmarks/stdlib/encoding/xml.c3 xml.txt

Thank's a lot for the feedback! I fixed 1., 2. and added the benchmarks.

Regarding 3. (comments) I actually missed that I had created XMLNodeType.COMMENT and XMLNodeType.PI (Processing Instruction, e.g. <?xml-stylesheet type="text/xsl" href="style.xsl"?>, <?php ... ?> or <?app-version 2.0?>), which I have both now implemented. Both can be included with a .keep_comments or keep_processing_instructions option.

Additionally I extended it so top-level things are stored as prologue and epilogue (e.g. comments or Processing Instructions at top level)

@ManuLinares
Copy link
Copy Markdown
Member

C14N verification against other tools is not going well. I would mark this as draft and revisit

@ChristianReifberger
Copy link
Copy Markdown
Contributor Author

ChristianReifberger commented Mar 29, 2026

Thanks again for the input. I used your steps and created a verification tool /test/tools/xml_c14n_test.c3 that verifies against xmllint --c14n.

After building the tool, you can provide it either a file or a directory of files that undergo the following steps:

  1. parsed with xmllint (baseline)
  2. parse with c3
  3. encode with c3
  4. written to a tmp file
  5. verify tmp file with xmllint --c14n
  6. compare outputs.

Based on that I fixed the implementation and added the missing features, e.g. external entity resolving.

For my tests I used the W3C XML Test Suite, which can be found here: https://dev.w3.org/cvsweb/2001/XML-Test-Suite/xmlconf/ / https://www.w3.org/XML/Test/xmlconf-20020606.htm

All of these calls yield no errors now:

  • /tmp/xml_c14n_test /tmp/xmlconf/xmltest/valid -> 327 passed
  • /tmp/xml_c14n_test /tmp/xmlconf/ibm/valid -> 298 passed
  • /tmp/xml_c14n_test /tmp/xmlconf/sun/valid -> 55 passed

The c3 implementation is currently more lenient than xmllint --c14n, i.e. in some cases C3 allows invalid xml where xmllint --c14n just fails. Maybe you can let me know if the goal of the stdlib is strict xmllint --c14n conformity or if that is fine.

@ManuLinares
Copy link
Copy Markdown
Member

Hi, I can't see clearly the changes because you force-pushed.

Was this removal intentional?

        // §2.11: \r\n and standalone \r normalize to \n.
        if (c == '\r')
        {
            char peek = read_next(ctx)!;
            if (peek != '\n') pushback(ctx, peek);
            ctx.line++;
            ctx.scratch.append(' ');
            continue;
        }

@ChristianReifberger
Copy link
Copy Markdown
Contributor Author

Yeah I am really sorry about that. I messed up my local git history by pulling in master wrong and then had to reconstruct the correct history with force push so the master commits unrelated to this PR don't show up. Still having my battles on that front.

Regarding your question: Yes, the removal was intentional. read_attr_value() gets its chars from read_next(), which already normalizes CRLF in lines 430+.

@ManuLinares
Copy link
Copy Markdown
Member

A test against https://www.w3.org/XML/Test/#releases (https://www.w3.org/XML/Test/xmlts20130923.tar.gz), with a failing example:

<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
<!ENTITY space "&#38;#32;">
]>
<foo><foo/>&space;<foo/></foo>

xmllint --c14n parsed output:

<foo><foo></foo> <foo></foo></foo>

c3 output:

<foo><foo></foo>&amp;#32;<foo></foo></foo>

From the one you provided (the tests from https://www.w3.org/XML/Test/xmlts20020606.zip) I get many errors and an infinite recursion segfaults on lib/std/encoding/xml.c3:600. Entity declaration parsing seems somewhat broken, from the example above there is a double-expansion escaping issue. Mostly I see a lot of code repetition and any adherence to any xml spec.

I don't know how to proceed, I would define a smaller scope because 2000+ lines of code are very hard to review but that is only my opinion.

ChristianReifberger and others added 2 commits April 17, 2026 19:11
- Recursive-descent parser supporting elements, attributes, text,
  CDATA, comments, PIs, DOCTYPE, XML declaration, UTF-8 BOM
- Full entity expansion: named (&amp; &lt; &gt; &apos; &quot;)
  and numeric (decimal &c3lang#65; and hex &#x41;)
- Serializer (encode/tencode) with text and attribute escaping
- Public API: parse/parse_string/tparse/tparse_string,
  XmlDoc.encode/tencode/free, XmlNode.get_attr/has_attr/child/
  children_named/text_content and @operator(len)/[]/&[]
- Comprehensive unit tests in test/unit/stdlib/encoding/xml.c3

added std::encoding::xml benchmarks

fix custom entities (e.g. &hsize5) breaking parsing; added docbook test

fix xml attr whitespace normalization per §3.3.3: literal \n/\r/\t → space, &#xA; preserved

added optional comment parsing

xml: add processing instructions, DOCTYPE, xml declaration, epilogue, encode options, bugfixes

- Rename PI → PROCESSING_INSTRUCTION in XmlNodeType
- Add XmlParseOptions: keep_processing_instructions, max_depth (removes global)
- Add XmlEncodeOptions: version field
- Add XmlDoctype struct with name, public_id, system_id
- Add XmlDoc.prologue, epilogue, xml_declaration, doctype fields
- parse_pi: returns XmlNode* instead of void; xml declaration captured separately
- parse_doctype: replaces skip_doctype; reads name/PUBLIC/SYSTEM ids
- encode: emits DOCTYPE, prologue/epilogue nodes; respects XmlEncodeOptions.version
- Fix CDATA ]]> detection: replace 3-char lookahead with bracket-count state machine
- Fix max_depth: moved from global to XmlContext, populated from XmlParseOptions
- Fix numeric character references: validate codepoint ≤ 0x10FFFF before append_char32
- Fix attribute \r\n normalization: collapses to single space per XML §2.11
- Add 20 new tests (45 total)

xml: fix entity overflow, PI leak, DOCTYPE leak; add pretty-print and Unicode names

- append_entity: always consume to ';' even when name exceeds 63-char buffer,
  preventing parser desync on overlong entity names
- parse_pi: replace manual EOF-only cleanup with defer catch, fixing a leak
  when skip_whitespace propagates a non-EOF IO error
- parse_document: free DOCTYPE string members before the struct on duplicate
  DOCTYPE, matching the cleanup pattern used everywhere else
- is_name_start/is_name_char: accept bytes >= 0x80 for UTF-8 encoded element
  and attribute names (e.g. <图书>, <作者>)
- XmlAttrMap/XmlNodeList: remove @Private so external code can name the types
- XmlEncodeOptions: add indent field; empty string (default) preserves existing
  compact output; non-empty enables block layout — children indented when any
  sibling is an element/comment/PI, inline otherwise
- Add tests for Unicode names and all pretty-print variants

c14n conformity

xml: reject NULs and preserve long entity refs
@lerno lerno force-pushed the add-xml-parser-stdlib branch from 0b4b2de to 6c6842f Compare April 17, 2026 17:42
@lerno
Copy link
Copy Markdown
Collaborator

lerno commented Apr 17, 2026

I've reviewed this code. It has many parts which are repetitive, like I counted 5 hex string -> integer conversions that were mostly identical. The style of the code and lack of succinctness is problematic. It's clearly strongly LLM assisted at the least. It would take me maybe 8 hours or more to take this code and whip it into a shape which is possible to merge with the stdlib. It should be about 1/3 of it's current size I believe. While correctly using optionals etc it doesn't actually leverage them, leading to unnecessarily long winded code that's also contributing to the expanse of the source. So I will say it's possible to accept this kind of "enterprisey" style of code into the standard library at this point. I recommend you take this and make it into a separate library that you can offer people so that it doesn't come to waste.

@lerno
Copy link
Copy Markdown
Collaborator

lerno commented Apr 17, 2026

For the reasons outline above, I am therefore sad to say that I need to close this PR.

@lerno lerno closed this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants