fix(doc): post-process docs +fetch output to improve round-trip fidelity#214
fix(doc): post-process docs +fetch output to improve round-trip fidelity#214herbertliu wants to merge 6 commits intomainfrom
docs +fetch output to improve round-trip fidelity#214Conversation
Port fixExportedMarkdown() from internal cli to improve round-trip fidelity when fetched Markdown is re-imported via +update / +create. Five fixes applied to MCP fetch-doc raw output: 1. fixBoldSpacing: remove trailing spaces before ** / *, strip redundant ** from ATX headings (CommonMark compliance) 2. fixSetextAmbiguity: insert blank line before "---" following a paragraph to prevent Setext H2 misparse 3. fixBlockquoteHardBreaks: insert bare ">" between consecutive "> " lines so create-doc preserves each line as its own paragraph 4. fixTopLevelSoftbreaks: insert blank line between adjacent top-level content lines (Lark exports blocks with single \n; parsers collapse them into one paragraph without the blank line) 5. Same blank-line insertion inside <lark-td> cells for table content Result: round-trip diff 166 lines → 1 line (image token regeneration, unavoidable by design). Accuracy ≈ 99.8%. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cover all five fix functions with table-driven tests: - fixBoldSpacing: trailing space removal and heading bold stripping - fixSetextAmbiguity: blank line before --- to prevent Setext H2 - fixBlockquoteHardBreaks: bare > inserted between consecutive blockquote lines - fixTopLevelSoftbreaks: blank line between top-level blocks, lark-td cells - fixExportedMarkdown: end-to-end integration across all fixes New code patch coverage for markdown_fix.go: 87-100% per function. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ut emoji aliases Three improvements to fixExportedMarkdown post-processing: 1. fixTopLevelSoftbreaks: promote <callout> and <quote-container> from opaque blocks to content containers, same as <lark-td>. Adjacent content lines inside these containers now get a blank line inserted between them so create-doc preserves paragraph structure instead of collapsing all lines into one. 2. fixCalloutEmoji: new function that replaces named emoji aliases (e.g. emoji="warning") with actual Unicode emoji characters (e.g. emoji="⚠️ "). fetch-doc sometimes emits named aliases that create-doc does not recognize, causing it to fall back to an unrelated emoji. 3. Tests updated to reflect new callout/quote-container behaviour and cover fixCalloutEmoji with table-driven cases. Round-trip diff on test doc (839 lines): 24 lines → 19 lines. Remaining diff: whiteboard token loss (create-doc design limitation, 4 whiteboards × ~2 lines each) + 1 emoji mapped by create-doc server. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… tracking
fetch-doc / create-doc adds a spurious blank line before each closing
code fence, e.g.:
```
last line
<- blank added by server
```
Add fixCodeBlockTrailingBlanks() to remove these blanks after fetch.
Also fix fixTopLevelSoftbreaks() to use the same simple toggle rule:
any ```lang line opens a code block; only a plain ``` closes it.
Previously the function toggled on every ``` line, causing lines inside
a nested fence (e.g. ```go inside ```markdown) to be treated as
top-level content and receive spurious blank-line insertions.
Round-trip diff for the Lark CLI dev spec: 200 lines → 156 lines.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
📝 WalkthroughWalkthroughA markdown post-processing module is added to normalize Lark-exported content through sequential transformations: spacing fixes, setext disambiguation, blockquote handling, softbreak insertion, emoji conversion, and code block cleanup. The docs fetch handler now applies these transformations to markdown results. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
shortcuts/doc/markdown_fix.go (1)
148-167: Duplicate doc comment block should be removed.The
fixCodeBlockTrailingBlankscomment is duplicated verbatim, which adds noise and risks drift.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shortcuts/doc/markdown_fix.go` around lines 148 - 167, The file contains a duplicated doc comment for fixCodeBlockTrailingBlanks; remove the redundant copy so only one explanatory comment remains immediately above the fixCodeBlockTrailingBlanks declaration, keeping the original wording, formatting and examples intact and ensuring surrounding blank lines/paragraph breaks stay consistent with Go doc comment style for that function.shortcuts/doc/markdown_fix_test.go (1)
90-125: Add fenced-code regression cases for early fixers.Please add explicit tests asserting that
fixBoldSpacing,fixSetextAmbiguity, andfixBlockquoteHardBreaksdo not modify content inside fenced code blocks.Example tests to add
+func TestFixersDoNotModifyInsideFences(t *testing.T) { + input := "```md\n**x **\n> a\n> b\nline\n---\n```" + + if got := fixBoldSpacing(input); got != input { + t.Fatalf("fixBoldSpacing modified fenced code: %q", got) + } + if got := fixSetextAmbiguity(input); got != input { + t.Fatalf("fixSetextAmbiguity modified fenced code: %q", got) + } + if got := fixBlockquoteHardBreaks(input); got != input { + t.Fatalf("fixBlockquoteHardBreaks modified fenced code: %q", got) + } +}Also applies to: 127-173, 175-203
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@shortcuts/doc/markdown_fix_test.go` around lines 90 - 125, Add regression tests in shortcuts/doc/markdown_fix_test.go to ensure the early-fixer functions do not alter fenced code blocks: create inputs containing a fenced code block (triple backticks) with markdown that would otherwise trigger fixes (e.g., bold spacing, setext-like lines, blockquote lines) and call fixBoldSpacing, fixSetextAmbiguity, and fixBlockquoteHardBreaks with those inputs, asserting each returns the original input unchanged; reference the existing test structure and functions fixBoldSpacing, fixSetextAmbiguity, and fixBlockquoteHardBreaks when adding these assertions so they run alongside the other TestFix... cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@shortcuts/doc/markdown_fix.go`:
- Around line 30-33: The three transforms (fixBoldSpacing, fixSetextAmbiguity,
fixBlockquoteHardBreaks) currently run over entire markdown and mutate content
inside fenced code blocks; update each function so it scans lines, tracks a
boolean inCode toggled on/off when encountering fenced-code delimiters (e.g. a
trimmed line == "```" or other fence variants), and only apply the transform
logic to chunks outside fenced blocks while appending fenced-block lines
verbatim (use a flush mechanism to process accumulated non-code chunk with
existing transform and then clear it); ensure the fence detection correctly
toggles inCode and preserves fence lines and inner code unchanged to maintain
round-trip fidelity.
---
Nitpick comments:
In `@shortcuts/doc/markdown_fix_test.go`:
- Around line 90-125: Add regression tests in shortcuts/doc/markdown_fix_test.go
to ensure the early-fixer functions do not alter fenced code blocks: create
inputs containing a fenced code block (triple backticks) with markdown that
would otherwise trigger fixes (e.g., bold spacing, setext-like lines, blockquote
lines) and call fixBoldSpacing, fixSetextAmbiguity, and fixBlockquoteHardBreaks
with those inputs, asserting each returns the original input unchanged;
reference the existing test structure and functions fixBoldSpacing,
fixSetextAmbiguity, and fixBlockquoteHardBreaks when adding these assertions so
they run alongside the other TestFix... cases.
In `@shortcuts/doc/markdown_fix.go`:
- Around line 148-167: The file contains a duplicated doc comment for
fixCodeBlockTrailingBlanks; remove the redundant copy so only one explanatory
comment remains immediately above the fixCodeBlockTrailingBlanks declaration,
keeping the original wording, formatting and examples intact and ensuring
surrounding blank lines/paragraph breaks stay consistent with Go doc comment
style for that function.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 90a5d8be-dcb4-4422-9fbc-1e021a9cab75
📒 Files selected for processing (4)
.gitignoreshortcuts/doc/docs_fetch.goshortcuts/doc/markdown_fix.goshortcuts/doc/markdown_fix_test.go
| md = fixBoldSpacing(md) | ||
| md = fixSetextAmbiguity(md) | ||
| md = fixBlockquoteHardBreaks(md) | ||
| md = fixTopLevelSoftbreaks(md) |
There was a problem hiding this comment.
Critical: fenced code content can be corrupted by early transforms.
fixBoldSpacing, fixSetextAmbiguity, and fixBlockquoteHardBreaks run on full markdown and currently modify text inside fenced code blocks. That changes literal code content (e.g. > a\n> b, **x **, line\n---) and breaks round-trip fidelity.
Suggested fix (apply these transforms only outside fenced blocks)
func fixExportedMarkdown(md string) string {
- md = fixBoldSpacing(md)
- md = fixSetextAmbiguity(md)
- md = fixBlockquoteHardBreaks(md)
+ md = applyOutsideCodeFences(md, fixBoldSpacing)
+ md = applyOutsideCodeFences(md, fixSetextAmbiguity)
+ md = applyOutsideCodeFences(md, fixBlockquoteHardBreaks)
md = fixTopLevelSoftbreaks(md)
md = fixCalloutEmoji(md)
md = fixCodeBlockTrailingBlanks(md)
@@
return md
}
+
+func applyOutsideCodeFences(md string, fn func(string) string) string {
+ lines := strings.Split(md, "\n")
+ var out []string
+ var chunk []string
+ inCode := false
+
+ flush := func() {
+ if len(chunk) == 0 {
+ return
+ }
+ out = append(out, strings.Split(fn(strings.Join(chunk, "\n")), "\n")...)
+ chunk = chunk[:0]
+ }
+
+ for _, line := range lines {
+ trimmed := strings.TrimSpace(line)
+ if strings.HasPrefix(trimmed, "```") {
+ if !inCode {
+ flush()
+ inCode = true
+ } else if trimmed == "```" {
+ inCode = false
+ }
+ out = append(out, line)
+ continue
+ }
+ if inCode {
+ out = append(out, line)
+ } else {
+ chunk = append(chunk, line)
+ }
+ }
+ flush()
+ return strings.Join(out, "\n")
+}Also applies to: 50-60, 77-94
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@shortcuts/doc/markdown_fix.go` around lines 30 - 33, The three transforms
(fixBoldSpacing, fixSetextAmbiguity, fixBlockquoteHardBreaks) currently run over
entire markdown and mutate content inside fenced code blocks; update each
function so it scans lines, tracks a boolean inCode toggled on/off when
encountering fenced-code delimiters (e.g. a trimmed line == "```" or other fence
variants), and only apply the transform logic to chunks outside fenced blocks
while appending fenced-block lines verbatim (use a flush mechanism to process
accumulated non-code chunk with existing transform and then clear it); ensure
the fence detection correctly toggles inCode and preserves fence lines and inner
code unchanged to maintain round-trip fidelity.
🚀 PR Preview Install Guide🧰 CLI updatenpm i -g https://pkg.pr.new/larksuite/cli/@larksuite/cli@57b030392205285cb255d8aaa4af7800064508da🧩 Skill updatenpx skills add larksuite/cli#feat/docs-export-import-fix -y -g |
Greptile SummaryThis PR adds a
Confidence Score: 5/5Safe to merge — all findings are P2 style/doc issues; no runtime defects on the changed path. All four comments are P2 (duplicate docstring, missing entry in top-level comment, an edge-case regex behaviour unlikely to be triggered by Lark exports, and a missing blank between adjacent container tags). None affect correctness for the documented use cases, and the round-trip improvement is verified on real documents. No P0/P1 issues found. shortcuts/doc/markdown_fix.go — duplicate docstring and headingBoldRe edge case worth a follow-up test. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[fetch-doc MCP response] --> B{markdown field present?}
B -- yes --> C[fixExportedMarkdown]
B -- no --> G[OutFormat unchanged]
C --> C1[fixBoldSpacing\nremove trailing spaces in ** / *\nstrip redundant ** in ATX headings]
C1 --> C2[fixSetextAmbiguity\ninsert blank before bare --- lines]
C2 --> C3[fixBlockquoteHardBreaks\ninsert blank > between blockquote lines]
C3 --> C4[fixTopLevelSoftbreaks\ninsert blank between top-level blocks\nand inside callout / quote-container / lark-td]
C4 --> C5[fixCalloutEmoji\nreplace named aliases with Unicode emoji]
C5 --> C6[fixCodeBlockTrailingBlanks\nremove spurious blank before closing fence]
C6 --> C7[collapse 3+ newlines to 2\ntrim trailing newlines]
C7 --> G[OutFormat\nwrite title + markdown to output]
Reviews (1): Last reviewed commit: "chore: ignore .tmp/ test directory" | Re-trigger Greptile |
| // fixCodeBlockTrailingBlanks removes blank lines that appear immediately before | ||
| // a closing ``` fence inside a fenced code block. fetch-doc / create-doc | ||
| // sometimes inserts an extra blank line at the end of code block content: | ||
| // | ||
| // ``` | ||
| // last line | ||
| // ← spurious blank added by server | ||
| // ``` | ||
| // | ||
| // Removing it keeps the round-trip diff clean. | ||
| // fixCodeBlockTrailingBlanks removes blank lines that appear immediately before | ||
| // a closing ``` fence inside a fenced code block. fetch-doc / create-doc | ||
| // sometimes inserts an extra blank line at the end of code block content: | ||
| // | ||
| // ``` | ||
| // last line | ||
| // ← spurious blank added by server | ||
| // ``` | ||
| // | ||
| // Removing it keeps the round-trip diff clean. | ||
| // Nested fences (e.g. ```go inside ```markdown) are handled by tracking |
There was a problem hiding this comment.
| // fixExportedMarkdown applies post-processing to Lark-exported Markdown to | ||
| // improve round-trip fidelity on re-import: | ||
| // | ||
| // 1. fixBoldSpacing: removes trailing whitespace before closing ** / *, | ||
| // and strips redundant ** from ATX headings. | ||
| // | ||
| // 2. fixSetextAmbiguity: inserts a blank line before any "---" that immediately | ||
| // follows a non-empty line, preventing it from being parsed as a Setext H2. | ||
| // | ||
| // 3. fixBlockquoteHardBreaks: inserts a blank blockquote line (">") between | ||
| // consecutive blockquote content lines so create-doc preserves line breaks. | ||
| // | ||
| // 4. fixTopLevelSoftbreaks: inserts a blank line between adjacent non-empty | ||
| // lines at the top level and inside content containers (callout, | ||
| // quote-container, lark-td). Code fences are left untouched. | ||
| // | ||
| // 5. fixCalloutEmoji: replaces named emoji aliases (e.g. emoji="warning") with | ||
| // actual Unicode emoji characters that create-doc understands. |
There was a problem hiding this comment.
| var ( | ||
| boldTrailingSpaceRe = regexp.MustCompile(`(\*\*\S[^*]*?)\s+(\*\*)`) | ||
| italicTrailingSpaceRe = regexp.MustCompile(`(\*\S[^*]*?)\s+(\*)`) | ||
| headingBoldRe = regexp.MustCompile(`(?m)^(#{1,6})\s+\*\*(.+?)\*\*\s*$`) |
There was a problem hiding this comment.
headingBoldRe can mangle headings with multiple disjoint bold spans
headingBoldRe uses a non-greedy .+?, but Go's RE2 engine still finds the leftmost-longest overall match. For a heading like # **foo** and **bar**, the engine anchors at # **, then the non-greedy (.+?) expands until \*\*\s*$ can match — which only succeeds when (.+?) captures foo** and **bar (the final ** in the line acting as the closing delimiter). The replacement then emits # foo** and **bar, leaving stray ** markers that render as literal text.
In practice Lark's fetch-doc wraps entire heading content in one bold pair, so this edge case is unlikely to be triggered. It may be worth adding a guard or a test to make this explicit.
// Reproducer (currently handled incorrectly):
// "# **foo** and **bar**" → "# foo** and **bar"
|
|
||
| // Container opening/closing tags are structural — skip them. | ||
| isContainerTag := false | ||
| for _, cc := range contentContainers { | ||
| if strings.HasPrefix(trimmed, cc[0]) || strings.HasPrefix(trimmed, "</"+cc[0][1:]) { | ||
| isContainerTag = true | ||
| break | ||
| } | ||
| } |
There was a problem hiding this comment.
Adjacent top-level container blocks get no separating blank line
isContainerTag is true for both <callout> opening and </callout> closing tags, so the blank-line insertion is skipped for those lines. As a result, two consecutive callout (or <quote-container>) blocks at the top level will have no blank line between the </callout> and the following <callout>:
</callout>
<callout emoji="⚠️"> ← no blank line inserted here
...
This may not matter if the downstream importer tolerates the missing separator, but it is an inconsistency with the stated goal of inserting blank lines between adjacent top-level blocks.
|
|
| /log/ | ||
|
|
||
| # Temporary test files | ||
| .tmp/ |
There was a problem hiding this comment.
Is this ".gitignore" change necessary for this PR? If ".tmp/" is not produced by these changes, I would prefer to leave it out to keep the PR focused.
| // Removing it keeps the round-trip diff clean. | ||
| // Nested fences (e.g. ```go inside ```markdown) are handled by tracking | ||
| // nesting depth so only the outermost closing fence is affected. | ||
| func fixCodeBlockTrailingBlanks(md string) string { |
There was a problem hiding this comment.
This pass looks inherently lossy: it removes any blank line immediately before a closing fence, but an empty trailing line can be real code content. That means some code blocks can no longer round-trip exactly.
| // Structural table tags (<lark-table>, <lark-tr>, <lark-td> and their closing | ||
| // counterparts) never trigger blank-line insertion themselves. Fenced code | ||
| // blocks (``` ... ```) are left completely untouched. | ||
| func fixTopLevelSoftbreaks(md string) string { |
There was a problem hiding this comment.
This logic inserts blank lines between any adjacent non-empty top-level lines, so it also rewrites list syntax. - a\n- b becomes a loose list, and - a\n continuation becomes a second paragraph inside the item. That seems risky for round-trip fidelity. I’d suggest either narrowing the heuristic or adding explicit list/list-continuation tests before merging.
| headingBoldRe = regexp.MustCompile(`(?m)^(#{1,6})\s+\*\*(.+?)\*\*\s*$`) | ||
| ) | ||
|
|
||
| func fixBoldSpacing(md string) string { |
There was a problem hiding this comment.
This still rewrites inline code spans. Example: `**hello **` becomes `**hello**`. Since the PR goal is round-trip fidelity, should these transforms also skip inline code, not just fenced code?
| // | ||
| // 5. fixCalloutEmoji: replaces named emoji aliases (e.g. emoji="warning") with | ||
| // actual Unicode emoji characters that create-doc understands. | ||
| func fixExportedMarkdown(md string) string { |
There was a problem hiding this comment.
Even if we make fixBoldSpacing / fixSetextAmbiguity / fixBlockquoteHardBreaks fence-aware, fenced code is still not safe here: fixCalloutEmoji and the final \n\n\n collapse both run globally and can rewrite literal code samples. For example, <callout emoji="warning"> inside a fenced block will be changed, and intentional multiple blank lines in code will be collapsed.
Summary
fetch-doc(MCP) exports Markdown with several spec violations — missing blank lines between blocks, trailing spaces in bold markers, blockquote lines collapsing, etc. — that cause ~80% of document structure to be lost on re-import viacreate-doc. This PR adds afixExportedMarkdown()post-processing pipeline todocs +fetchthat corrects these issues at the CLI layer.Changes
shortcuts/doc/docs_fetch.go: CallfixExportedMarkdown()on themarkdownfield returned byfetch-docbefore outputting.shortcuts/doc/markdown_fix.go(new): Post-processing pipeline with five fix functions:fixBoldSpacing: removes trailing spaces before**/*closing delimiters; strips redundant**inside ATX headingsfixSetextAmbiguity: inserts a blank line before---that immediately follows a non-blank line, preventing Setext H2 misparsefixBlockquoteHardBreaks: inserts an empty>separator between consecutive blockquote lines so each becomes its own paragraph on re-importfixTopLevelSoftbreaks: inserts blank lines between adjacent top-level content blocks, and between content lines inside<callout>,<quote-container>, and<lark-td>containers; code fences are left untouchedfixCalloutEmoji: replaces named emoji aliases (e.g.emoji="warning") with Unicode emoji thatcreate-docrecognisesfixCodeBlockTrailingBlanks: removes spurious blank lines thatcreate-doc/fetch-docinserts before each closing```fenceshortcuts/doc/markdown_fix_test.go(new): Table-driven unit tests for all fix functions.Test Plan
go test ./shortcuts/doc/...)Related Issues
Summary by CodeRabbit
Bug Fixes
Tests
Chores