Add std::encoding::xml — XML parser and serializer#3074
ChristianReifberger wants to merge 2 commits into c3lang:master
|
Hey, I've been testing the
Maybe you would like to add these benchmarks? |
Thanks a lot for the feedback! I fixed 1. and 2. and added the benchmarks. Regarding 3. (comments), I actually missed that I had created XMLNodeType.COMMENT and XMLNodeType.PI (Processing Instruction). Additionally, I extended it so that top-level items (e.g. comments or processing instructions at top level) are stored as prologue and epilogue. |
|
C14N verification against other tools is not going well. I would mark this as draft and revisit |
|
Thanks again for the input. I used your steps and created a verification tool. After building the tool, you can give it either a file or a directory of files, which then undergo the following steps:
Based on that, I fixed the implementation and added the missing features, e.g. external entity resolution. For my tests I used the W3C XML Test Suite, which can be found here: https://dev.w3.org/cvsweb/2001/XML-Test-Suite/xmlconf/ / https://www.w3.org/XML/Test/xmlconf-20020606.htm
All of these calls yield no errors now:
The c3 implementation is currently more lenient than |
Force-pushed: a8660ac to 0b4b2de
|
Hi, I can't clearly see the changes because you force-pushed. Was this removal intentional? |
|
Yeah, I am really sorry about that. I messed up my local git history by pulling in master incorrectly and then had to reconstruct the correct history with a force push, so that the master commits unrelated to this PR don't show up. Still fighting my battles on that front. Regarding your question: yes, the removal was intentional. read_attr_value() gets its chars from read_next(), which already normalizes CRLF in lines 430+. |
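For illustration, the layering described above (line-end normalization in the low-level reader, so the attribute reader does not repeat it) might look roughly like this. It is a sketch in C rather than C3: the function names are borrowed from the comment, but the Reader struct, the signatures, and the bodies are assumptions, not the PR's code.

```c
#include <stddef.h>

/* Hypothetical cursor over an in-memory buffer; not the PR's XmlContext. */
typedef struct {
    const char *data;
    size_t len;
    size_t pos;
} Reader;

/* Returns the next character with XML 1.0 section 2.11 line-end handling:
 * "\r\n" and a lone "\r" both come back as a single '\n'. Returns -1 at end. */
static int read_next(Reader *r)
{
    if (r->pos >= r->len) return -1;
    unsigned char c = (unsigned char)r->data[r->pos++];
    if (c != '\r') return c;
    /* Swallow the '\n' of a CRLF pair so the pair yields one '\n'. */
    if (r->pos < r->len && r->data[r->pos] == '\n') r->pos++;
    return '\n';
}

/* Reads an attribute value up to the closing quote. Because it is built on
 * read_next, it never sees a raw '\r' and only has to map '\n' and '\t' to a
 * space (section 3.3.3). cap must be at least 1; output is NUL-terminated
 * and silently truncated if the buffer is too small. */
static size_t read_attr_value(Reader *r, char quote, char *out, size_t cap)
{
    size_t n = 0;
    int c;
    while ((c = read_next(r)) != -1 && c != quote) {
        if (c == '\n' || c == '\t') c = ' ';
        if (n + 1 < cap) out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

With the input a, CR, LF, b, CR, c, TAB, d followed by a closing quote, this sketch yields the value "a b c d": each line break becomes exactly one space without any CRLF handling in read_attr_value itself.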
|
A test against https://www.w3.org/XML/Test/#releases (https://www.w3.org/XML/Test/xmlts20130923.tar.gz), with a failing example:

<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
<!ENTITY space "&#32;">
]>
<foo><foo/>&space;<foo/></foo>

<foo><foo></foo> <foo></foo></foo>

c3 output:

From the one you provided (the tests from https://www.w3.org/XML/Test/xmlts20020606.zip) I get many errors, and an infinite recursion segfaults on

I don't know how to proceed. I would define a smaller scope, because 2000+ lines of code are very hard to review, but that is only my opinion. |
- Recursive-descent parser supporting elements, attributes, text, CDATA, comments, PIs, DOCTYPE, XML declaration, UTF-8 BOM
- Full entity expansion: named (&amp; &lt; &gt; &apos; &quot;) and numeric (decimal &#65; and hex &#x41;)
- Serializer (encode/tencode) with text and attribute escaping
- Public API: parse/parse_string/tparse/tparse_string, XmlDoc.encode/tencode/free, XmlNode.get_attr/has_attr/child/children_named/text_content and @operator(len)/[]/&[]
- Comprehensive unit tests in test/unit/stdlib/encoding/xml.c3

added std::encoding::xml benchmarks

fix custom entities (e.g. &hsize5) breaking parsing; added docbook test

fix xml attr whitespace normalization per §3.3.3: literal \n/\r/\t → space, &#10; preserved

added optional comment parsing

xml: add processing instructions, DOCTYPE, xml declaration, epilogue, encode options, bugfixes
- Rename PI → PROCESSING_INSTRUCTION in XmlNodeType
- Add XmlParseOptions: keep_processing_instructions, max_depth (removes global)
- Add XmlEncodeOptions: version field
- Add XmlDoctype struct with name, public_id, system_id
- Add XmlDoc.prologue, epilogue, xml_declaration, doctype fields
- parse_pi: returns XmlNode* instead of void; xml declaration captured separately
- parse_doctype: replaces skip_doctype; reads name/PUBLIC/SYSTEM ids
- encode: emits DOCTYPE, prologue/epilogue nodes; respects XmlEncodeOptions.version
- Fix CDATA ]]> detection: replace 3-char lookahead with bracket-count state machine
- Fix max_depth: moved from global to XmlContext, populated from XmlParseOptions
- Fix numeric character references: validate codepoint ≤ 0x10FFFF before append_char32
- Fix attribute \r\n normalization: collapses to single space per XML §2.11
- Add 20 new tests (45 total)

xml: fix entity overflow, PI leak, DOCTYPE leak; add pretty-print and Unicode names
- append_entity: always consume to ';' even when name exceeds 63-char buffer, preventing parser desync on overlong entity names
- parse_pi: replace manual EOF-only cleanup with defer catch, fixing a leak when skip_whitespace propagates a non-EOF IO error
- parse_document: free DOCTYPE string members before the struct on duplicate DOCTYPE, matching the cleanup pattern used everywhere else
- is_name_start/is_name_char: accept bytes >= 0x80 for UTF-8 encoded element and attribute names (e.g. <图书>, <作者>)
- XmlAttrMap/XmlNodeList: remove @private so external code can name the types
- XmlEncodeOptions: add indent field; empty string (default) preserves existing compact output; non-empty enables block layout: children indented when any sibling is an element/comment/PI, inline otherwise
- Add tests for Unicode names and all pretty-print variants

c14n conformity

xml: reject NULs and preserve long entity refs
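The CDATA fix listed in the commit notes (a bracket-count state machine instead of a 3-character lookahead) can be sketched as follows. This is an illustrative C version under stated assumptions, not the PR's C3 code: scan_cdata and its signature are hypothetical, and it works on an in-memory buffer rather than a stream. The idea is to hold back runs of ']' until the next character shows whether the run ends in the terminator.

```c
#include <stddef.h>
#include <string.h>

/* Scans CDATA content that starts right after "<![CDATA[".
 * Copies the content into out (NUL-terminated) and returns the number of
 * input bytes consumed, including the closing "]]>". Returns 0 if the
 * section is unterminated or out is too small. Hypothetical helper. */
static size_t scan_cdata(const char *in, size_t len, char *out, size_t cap)
{
    size_t n = 0;        /* bytes written to out so far */
    size_t brackets = 0; /* current run of ']' that has not been emitted yet */

    for (size_t i = 0; i < len; i++) {
        char c = in[i];
        if (c == ']') {
            brackets++;
            continue;
        }
        if (c == '>' && brackets >= 2) {
            /* The last two ']' form the terminator; earlier ones are content. */
            brackets -= 2;
            if (n + brackets + 1 > cap) return 0;
            memset(out + n, ']', brackets);
            n += brackets;
            out[n] = '\0';
            return i + 1;
        }
        /* The run was not a terminator: flush held brackets, then this char,
         * keeping one byte in reserve for the final NUL. */
        if (n + brackets + 2 > cap) return 0;
        memset(out + n, ']', brackets);
        n += brackets;
        brackets = 0;
        out[n++] = c;
    }
    return 0; /* reached the end of input without seeing "]]>" */
}
```

For the input a]]]>rest the sketch returns 5 and yields the content "a]", the case where a matcher that restarts from scratch after a mismatch would overlook the terminator.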
Force-pushed: 0b4b2de to 6c6842f
|
I've reviewed this code. Many parts of it are repetitive; for example, I counted 5 hex string -> integer conversions that were mostly identical. The style of the code and its lack of succinctness are problematic. It's clearly strongly LLM-assisted at the least. It would take me maybe 8 hours or more to take this code and whip it into a shape that could be merged into the stdlib. It should be about 1/3 of its current size, I believe. While it correctly uses optionals etc., it doesn't actually leverage them, leading to unnecessarily long-winded code that also contributes to the size of the source. So I will say it's not possible to accept this kind of "enterprisey" style of code into the standard library at this point. I recommend you take this and make it into a separate library that you can offer people so that it doesn't go to waste. |
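As an illustration of the duplication point above, several near-identical hex-to-integer loops could in principle share one bounded digit parser, which also covers the code point limit (≤ 0x10FFFF) mentioned in the commit notes. The sketch below is in C rather than C3, and parse_uint and parse_char_ref are hypothetical names, not functions from the PR or the C3 standard library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parses len digits in the given base (10 or 16) into *out.
 * Fails on empty input, on a character that is not a digit in that base,
 * or when the value would exceed max. Hypothetical shared helper. */
static bool parse_uint(const char *s, size_t len, unsigned base,
                       uint32_t max, uint32_t *out)
{
    if (len == 0) return false;
    uint32_t value = 0;
    for (size_t i = 0; i < len; i++) {
        char c = s[i];
        unsigned digit;
        if (c >= '0' && c <= '9') digit = (unsigned)(c - '0');
        else if (c >= 'a' && c <= 'f') digit = (unsigned)(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F') digit = (unsigned)(c - 'A') + 10;
        else return false;
        if (digit >= base || digit > max) return false;
        /* Overflow-safe form of: value * base + digit > max */
        if (value > (max - digit) / base) return false;
        value = value * base + digit;
    }
    *out = value;
    return true;
}

/* Decodes the body of a numeric character reference, i.e. the text between
 * "&#" and ";" (such as "65" or "x41"), rejecting values above U+10FFFF.
 * Further XML Char restrictions (e.g. surrogates, U+0000) are not checked. */
static bool parse_char_ref(const char *body, size_t len, uint32_t *cp)
{
    if (len > 0 && body[0] == 'x')
        return parse_uint(body + 1, len - 1, 16, 0x10FFFF, cp);
    return parse_uint(body, len, 10, 0x10FFFF, cp);
}
```

Every call site that needs a decimal or hex value would then go through parse_uint with its own limit, so the range check lives in one place.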
|
For the reasons outlined above, I am therefore sad to say that I need to close this PR. |
Types
- XmlDoc: the parsed document; owns all allocated nodes via its allocator
- XmlNode: a single node in the tree; owns its name, content, attrs, and children
- XmlNodeType: enum with ELEMENT, TEXT, CDATA, COMMENT, PI
- faultdef UNEXPECTED_CHARACTER, UNEXPECTED_EOF, INVALID_TAG, MISMATCHED_TAG, INVALID_ATTRIBUTE, MAX_DEPTH_REACHED
- int max_depth = 128: configurable nesting limit (matches std::encoding::json)

API
- xml::parse(allocator, stream): InStream, heap-allocated; returns XmlDoc?
- xml::parse_string(allocator, s): String; returns XmlDoc?
- xml::tparse(stream): InStream, temp-allocated
- xml::tparse_string(s): String, temp-allocated
- XmlDoc.free()
- XmlDoc.encode(allocator)
- XmlDoc.tencode(): encode using temp allocator
- XmlNode.get_attr(name) → String?: NOT_FOUND
- XmlNode.has_attr(name) → bool
- XmlNode.child(tag) → XmlNode*?: NOT_FOUND
- XmlNode.children_named(tag) → XmlNode*[]
- XmlNode.text_content() → String: TEXT/CDATA children
- XmlNode.len → usz (@operator(len))
- XmlNode[i] → XmlNode* (@operator([]))
- &XmlNode[i] → XmlNode** (@operator(&[]))

Implementation notes
- InStream. A single shared DString scratch (stack-allocated via stack_mem) is used for all intermediate string building; only strings that survive past the current parse step are heap-copied.
- ctx.scratch without allocation (read_name_scratch), saving one heap round-trip per element close.
- HashMap.set call then immediately freed, since HashMap{String, String} copies its keys (COPY_KEYS=true).
- XmlNode.free() recurses through children; leaf nodes skip the attr/children teardown since they are never initialized.
- defer catch node.free() in parse_element ensures partially-constructed nodes are cleaned up on any parse error.
- encode() builds output into a temp-backed ByteWriter and copies the result into the target allocator at the end, keeping the hot path allocation-free.

Parser features
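- Entity expansion: named (&amp; &lt; &gt; &apos; &quot;) and numeric (decimal &#65; and hex &#x41;)
- CDATA sections (<![CDATA[...]]>)
- Comments (<!-- ... -->): skipped during parsing
- Processing instructions (<?target data?>): skipped during parsing
- XML declaration (<?xml ... ?>): skipped
- DOCTYPE (<!DOCTYPE ...>): skipped, including internal subsets with [...]
- UTF-8 BOM (EF BB BF): consumed silently
- Nesting depth limit (max_depth, default 128)

Tests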
test/unit/stdlib/encoding/xml.c3 covers: simple elements, nested elements, text content, attributes (single/double quoted), entity expansion (all five named entities, decimal and hex numeric), CDATA sections, comments (inline and prologue), XML declaration, DOCTYPE (simple and with internal subset), mixed content, children_named, whitespace around root, encode/decode roundtrip, tencode self-closing, temp-allocator variants, and error cases (MISMATCHED_TAG, UNEXPECTED_EOF, INVALID_ATTRIBUTE, MAX_DEPTH_REACHED). Includes a comprehensive integration test exercising all features in a single document with a parse → encode → parse roundtrip.