Add std::encoding::xml — XML parser and serializer by ChristianReifberger · Pull Request #3074 · c3lang/c3c

ChristianReifberger · 2026-03-27T14:36:41Z

Recursive-descent parser supporting elements, attributes, text, CDATA, comments, PIs, DOCTYPE, XML declaration, UTF-8 BOM
Full entity expansion: named (& < > ' ") and numeric (decimal A and hex A)
Serializer (encode/tencode) with text and attribute escaping
Public API: parse/parse_string/tparse/tparse_string, XmlDoc.encode/tencode/free, XmlNode.get_attr/has_attr/child/children_named/text_content and operators(len)/[]/&[]
Comprehensive unit tests in test/unit/stdlib/encoding/xml.c3

Types

XmlDoc — the parsed document; owns all allocated nodes via its allocator
XmlNode — a single node in the tree; owns its name, content, attrs, and children
XmlNodeType — enum: ELEMENT, TEXT, CDATA, COMMENT, PI
faultdef UNEXPECTED_CHARACTER, UNEXPECTED_EOF, INVALID_TAG, MISMATCHED_TAG, INVALID_ATTRIBUTE, MAX_DEPTH_REACHED
int max_depth = 128 — configurable nesting limit (matches std::encoding::json)

API

Method	Description
`xml::parse(allocator, stream)`	Parse from any `InStream`, heap-allocated; returns `XmlDoc?`
`xml::parse_string(allocator, s)`	Parse from a `String`; returns `XmlDoc?`
`xml::tparse(stream)`	Parse from `InStream`, temp-allocated
`xml::tparse_string(s)`	Parse from `String`, temp-allocated
`XmlDoc.free()`	Free all owned memory
`XmlDoc.encode(allocator)`	Serialize to XML text with declaration
`XmlDoc.tencode()`	`encode` using temp allocator
`XmlNode.get_attr(name) → String?`	Attribute value, or `NOT_FOUND`
`XmlNode.has_attr(name) → bool`	True if attribute exists
`XmlNode.child(tag) → XmlNode*?`	First child element with given tag, or `NOT_FOUND`
`XmlNode.children_named(tag) → XmlNode*[]`	Temp-allocated slice of all child elements with given tag
`XmlNode.text_content() → String`	Concatenated text of all `TEXT`/`CDATA` children
`XmlNode.len → usz`	Number of children (`@operator(len)`)
`XmlNode[i] → XmlNode*`	Index into children (`@operator([])`)
`&XmlNode[i] → XmlNode**`	Reference to child slot (`@operator(&[])`)

Implementation notes

The parser is a recursive-descent reader over any InStream. A single shared DString scratch (stack-allocated via stack_mem) is used for all intermediate string building; only strings that survive past the current parse step are heap-copied.
Closing-tag names are compared directly against ctx.scratch without allocation (read_name_scratch), saving one heap round-trip per element close.
Attribute names are heap-allocated for the HashMap.set call then immediately freed, since HashMap{String, String} copies its keys (COPY_KEYS=true).
XmlNode.free() recurses through children; leaf nodes skip the attr/children teardown since they are never initialized. defer catch node.free() in parse_element ensures partially-constructed nodes are cleaned up on any parse error.
encode() builds output into a temp-backed ByteWriter and copies the result into the target allocator at the end, keeping the hot path allocation-free.

Parser features

Elements, self-closing elements, attributes (single and double quoted)
Text content with entity expansion: named (& < > ' ") and numeric (decimal A and hex A)
CDATA sections (<![CDATA[...]]>)
Comments () — skipped during parsing
Processing instructions (<?target data?>) — skipped during parsing
XML declaration (<?xml ... ?>) — skipped
DOCTYPE (<!DOCTYPE ...>) — skipped, including internal subsets with [...]
UTF-8 BOM (EF BB BF) — consumed silently
Configurable nesting depth limit (max_depth, default 128)

Tests

test/unit/stdlib/encoding/xml.c3 covers: simple elements, nested elements, text content, attributes (single/double quoted), entity expansion (all five named entities, decimal and hex numeric), CDATA sections, comments (inline and prologue), XML declaration, DOCTYPE (simple and with internal subset), mixed content, children_named, whitespace around root, encode/decode roundtrip, tencode self-closing, temp-allocator variants, and error cases (MISMATCHED_TAG, UNEXPECTED_EOF, INVALID_ATTRIBUTE, MAX_DEPTH_REACHED). Includes a comprehensive integration test exercising all features in a single document with a parse → encode → parse roundtrip.

ManuLinares · 2026-03-27T21:46:31Z

Hey, I've been testing the std::encoding::xml against libxml2 with a verification tool, and found a few things:

Custom entities break parsing. It doesn't handle things like &hsize5; (common in DocBook), throwing UNEXPECTED_CHARACTER.
Whitespace bug in attributes. Numeric references like 
 (newline) are being flattened into spaces instead of being preserved.
Comments are lost. The parser currently just skips them entirely. This is fine.

Maybe you would like to add these benchmarks? benchmarks/stdlib/encoding/xml.c3 xml.txt

ChristianReifberger · 2026-03-28T09:44:28Z

Hey, I've been testing the std::encoding::xml against libxml2 with a verification tool, and found a few things:

Custom entities break parsing. It doesn't handle things like &hsize5; (common in DocBook), throwing UNEXPECTED_CHARACTER.

Whitespace bug in attributes. Numeric references like 
 (newline) are being flattened into spaces instead of being preserved.

Comments are lost. The parser currently just skips them entirely. This is fine.

Maybe you would like to add these benchmarks? benchmarks/stdlib/encoding/xml.c3 xml.txt

Thank's a lot for the feedback! I fixed 1., 2. and added the benchmarks.

Regarding 3. (comments) I actually missed that I had created XMLNodeType.COMMENT and XMLNodeType.PI (Processing Instruction, e.g. <?xml-stylesheet type="text/xsl" href="style.xsl"?>, <?php ... ?> or <?app-version 2.0?>), which I have both now implemented. Both can be included with a .keep_comments or keep_processing_instructions option.

Additionally I extended it so top-level things are stored as prologue and epilogue (e.g. comments or Processing Instructions at top level)

ManuLinares · 2026-03-28T16:10:54Z

C14N verification against other tools is not going well. I would mark this as draft and revisit

ChristianReifberger · 2026-03-29T08:27:44Z

Thanks again for the input. I used your steps and created a verification tool /test/tools/xml_c14n_test.c3 that verifies against xmllint --c14n.

After building the tool, you can provide it either a file or a directory of files that undergo the following steps:

parsed with xmllint (baseline)
parse with c3
encode with c3
written to a tmp file
verify tmp file with xmllint --c14n
compare outputs.

Based on that I fixed the implementation and added the missing features, e.g. external entity resolving.

For my tests I used the W3C XML Test Suite, which can be found here: https://dev.w3.org/cvsweb/2001/XML-Test-Suite/xmlconf/ / https://www.w3.org/XML/Test/xmlconf-20020606.htm

All of these calls yield no errors now:

/tmp/xml_c14n_test /tmp/xmlconf/xmltest/valid -> 327 passed
/tmp/xml_c14n_test /tmp/xmlconf/ibm/valid -> 298 passed
/tmp/xml_c14n_test /tmp/xmlconf/sun/valid -> 55 passed

The c3 implementation is currently more lenient than xmllint --c14n, i.e. in some cases C3 allows invalid xml where xmllint --c14n just fails. Maybe you can let me know if the goal of the stdlib is strict xmllint --c14n conformity or if that is fine.

ManuLinares · 2026-04-08T17:11:06Z

Hi, I can't see clearly the changes because you force-pushed.

Was this removal intentional?

        // §2.11: \r\n and standalone \r normalize to \n.
        if (c == '\r')
        {
            char peek = read_next(ctx)!;
            if (peek != '\n') pushback(ctx, peek);
            ctx.line++;
            ctx.scratch.append(' ');
            continue;
        }

ChristianReifberger · 2026-04-08T17:27:12Z

Yeah I am really sorry about that. I messed up my local git history by pulling in master wrong and then had to reconstruct the correct history with force push so the master commits unrelated to this PR don't show up. Still having my battles on that front.

Regarding your question: Yes, the removal was intentional. read_attr_value() gets its chars from read_next(), which already normalizes CRLF in lines 430+.

ManuLinares · 2026-04-08T20:03:07Z

A test against https://www.w3.org/XML/Test/#releases (https://www.w3.org/XML/Test/xmlts20130923.tar.gz), with a failing example:

<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
<!ENTITY space "&#38;#32;">
]>
<foo><foo/>&space;<foo/></foo>

xmllint --c14n parsed output:

<foo><foo></foo> <foo></foo></foo>

c3 output:

<foo><foo></foo>&amp;#32;<foo></foo></foo>

From the one you provided (the tests from https://www.w3.org/XML/Test/xmlts20020606.zip) I get many errors and an infinite recursion segfaults on lib/std/encoding/xml.c3:600. Entity declaration parsing seems somewhat broken, from the example above there is a double-expansion escaping issue. Mostly I see a lot of code repetition and any adherence to any xml spec.

I don't know how to proceed, I would define a smaller scope because 2000+ lines of code are very hard to review but that is only my opinion.

@Private

- Recursive-descent parser supporting elements, attributes, text, CDATA, comments, PIs, DOCTYPE, XML declaration, UTF-8 BOM - Full entity expansion: named (& < > ' ") and numeric (decimal &c3lang#65; and hex A) - Serializer (encode/tencode) with text and attribute escaping - Public API: parse/parse_string/tparse/tparse_string, XmlDoc.encode/tencode/free, XmlNode.get_attr/has_attr/child/ children_named/text_content and @operator(len)/[]/&[] - Comprehensive unit tests in test/unit/stdlib/encoding/xml.c3 added std::encoding::xml benchmarks fix custom entities (e.g. &hsize5) breaking parsing; added docbook test fix xml attr whitespace normalization per §3.3.3: literal \n/\r/\t → space, 
 preserved added optional comment parsing xml: add processing instructions, DOCTYPE, xml declaration, epilogue, encode options, bugfixes - Rename PI → PROCESSING_INSTRUCTION in XmlNodeType - Add XmlParseOptions: keep_processing_instructions, max_depth (removes global) - Add XmlEncodeOptions: version field - Add XmlDoctype struct with name, public_id, system_id - Add XmlDoc.prologue, epilogue, xml_declaration, doctype fields - parse_pi: returns XmlNode* instead of void; xml declaration captured separately - parse_doctype: replaces skip_doctype; reads name/PUBLIC/SYSTEM ids - encode: emits DOCTYPE, prologue/epilogue nodes; respects XmlEncodeOptions.version - Fix CDATA ]]> detection: replace 3-char lookahead with bracket-count state machine - Fix max_depth: moved from global to XmlContext, populated from XmlParseOptions - Fix numeric character references: validate codepoint ≤ 0x10FFFF before append_char32 - Fix attribute \r\n normalization: collapses to single space per XML §2.11 - Add 20 new tests (45 total) xml: fix entity overflow, PI leak, DOCTYPE leak; add pretty-print and Unicode names - append_entity: always consume to ';' even when name exceeds 63-char buffer, preventing parser desync on overlong entity names - parse_pi: replace manual EOF-only cleanup with defer catch, fixing a leak when skip_whitespace propagates a non-EOF IO error - parse_document: free DOCTYPE string members before the struct on duplicate DOCTYPE, matching the cleanup pattern used everywhere else - is_name_start/is_name_char: accept bytes >= 0x80 for UTF-8 encoded element and attribute names (e.g. <图书>, <作者>) - XmlAttrMap/XmlNodeList: remove @Private so external code can name the types - XmlEncodeOptions: add indent field; empty string (default) preserves existing compact output; non-empty enables block layout — children indented when any sibling is an element/comment/PI, inline otherwise - Add tests for Unicode names and all pretty-print variants c14n conformity xml: reject NULs and preserve long entity refs

lerno · 2026-04-17T17:48:50Z

I've reviewed this code. It has many parts which are repetitive, like I counted 5 hex string -> integer conversions that were mostly identical. The style of the code and lack of succinctness is problematic. It's clearly strongly LLM assisted at the least. It would take me maybe 8 hours or more to take this code and whip it into a shape which is possible to merge with the stdlib. It should be about 1/3 of it's current size I believe. While correctly using optionals etc it doesn't actually leverage them, leading to unnecessarily long winded code that's also contributing to the expanse of the source. So I will say it's possible to accept this kind of "enterprisey" style of code into the standard library at this point. I recommend you take this and make it into a separate library that you can offer people so that it doesn't come to waste.

lerno · 2026-04-17T17:49:35Z

For the reasons outline above, I am therefore sad to say that I need to close this PR.

ChristianReifberger marked this pull request as ready for review March 27, 2026 14:40

ChristianReifberger force-pushed the add-xml-parser-stdlib branch from a8660ac to 0b4b2de Compare April 8, 2026 12:12

ChristianReifberger and others added 2 commits April 17, 2026 19:11

Merge with master and small fixup.

6c6842f

lerno force-pushed the add-xml-parser-stdlib branch from 0b4b2de to 6c6842f Compare April 17, 2026 17:42

lerno closed this Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add std::encoding::xml — XML parser and serializer#3074

Add std::encoding::xml — XML parser and serializer#3074
ChristianReifberger wants to merge 2 commits intoc3lang:masterfrom
ChristianReifberger:add-xml-parser-stdlib

ChristianReifberger commented Mar 27, 2026 •

edited

Loading

Uh oh!

ManuLinares commented Mar 27, 2026

Uh oh!

ChristianReifberger commented Mar 28, 2026

Uh oh!

ManuLinares commented Mar 28, 2026

Uh oh!

ChristianReifberger commented Mar 29, 2026 •

edited

Loading

Uh oh!

ManuLinares commented Apr 8, 2026

Uh oh!

ChristianReifberger commented Apr 8, 2026

Uh oh!

ManuLinares commented Apr 8, 2026

Uh oh!

lerno commented Apr 17, 2026

Uh oh!

lerno commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

ChristianReifberger commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Types

API

Implementation notes

Parser features

Tests

Uh oh!

ManuLinares commented Mar 27, 2026

Uh oh!

ChristianReifberger commented Mar 28, 2026

Uh oh!

ManuLinares commented Mar 28, 2026

Uh oh!

ChristianReifberger commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ManuLinares commented Apr 8, 2026

Uh oh!

ChristianReifberger commented Apr 8, 2026

Uh oh!

ManuLinares commented Apr 8, 2026

Uh oh!

lerno commented Apr 17, 2026

Uh oh!

lerno commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ChristianReifberger commented Mar 27, 2026 •

edited

Loading

ChristianReifberger commented Mar 29, 2026 •

edited

Loading