Parsing React-like Props in Markdown Comments with oparser

September 19, 2024 · javascript, dev-tools

A while back I wrote about syncing remote content into markdown files with markdown-magic. The short version: you drop an HTML comment into a .md file, and the library replaces the block underneath it with generated content.

The comment carries options. That little detail is where all the pain lives.

<!-- doc-start code src="./examples/1_simple.js" lines="1-20" -->
content gets replaced here
<!-- end-docs -->

src and lines are options. Easy enough when it's two strings. But I wanted authors to write options the way they already write React props, including arrays, objects, and nested config:

<!-- doc-start table columns=['name', 'price'] style={{ align: 'left' }} -->

The problem isn't generating the table. The problem is turning the string between the comment markers into a real JavaScript object. That string is written by a human, in a text file, with no editor, no autocomplete, and no linter telling them they forgot a quote.

That parser became its own package: oparser.

Why not just JSON.parse?

The obvious move is to make people write JSON and call it a day.

<!-- doc-start table {"columns": ["name", "price"], "style": {"align": "left"}} -->

JSON.parse is strict on purpose, and that strictness is exactly wrong for hand-authored config. JSON forces double quotes on every key and string, forbids trailing commas, and explodes on a single missing brace. Nobody hand-writes JSON correctly inside an HTML comment on the first try.
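To make that strictness concrete, here is a minimal sketch that feeds JSON.parse a few near-miss inputs of the kind described above. The sample strings and the jsonParseError helper are my own illustrations, not part of oparser or markdown-magic.

```javascript
// Each input is "almost JSON": the kind of thing a person types by hand.
const looseInputs = [
  '{columns: ["name", "price"]}',     // unquoted key
  "{'align': 'left'}",                // single quotes instead of double
  '{"columns": ["name", "price",]}',  // trailing comma
  '{"style": {"align": "left"}',      // missing closing brace
]

// Hypothetical helper: returns null on success, or the error name on failure.
function jsonParseError(text) {
  try {
    JSON.parse(text)
    return null
  } catch (err) {
    return err.name
  }
}

const results = looseInputs.map(jsonParseError)
console.log(results)
// [ 'SyntaxError', 'SyntaxError', 'SyntaxError', 'SyntaxError' ]
```

Every one of those is a small, understandable slip, and every one is rejected outright.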

What people actually type looks more like this:

columns=[name, price]
style={{ align: left }}
enabled
title=Hello world

No quotes on the keys. No quotes on obvious strings. A bare enabled flag with no value. A value with a space in it. Every one of those is a JSON.parse crash, and every one of those is something a reasonable person would expect to just work.

So the parser has to be forgiving. It has to take loose, human input and do the obvious thing.

What forgiving actually means

oparser exposes a parse() function that turns a loose string into an object:

const { parse } = require('oparser')

parse(`
  width={999}
  enabled=TRUE
  title="Hello world"
  tags=[one, "two, too", "three]still text"]
  style={{ color: 'red', label: "b{c}" }}
`)

// {
//   width: 999,
//   enabled: true,
//   title: 'Hello world',
//   tags: ['one', 'two, too', 'three]still text'],
//   style: { color: 'red', label: 'b{c}' }
// }

Look at what it had to figure out without being told:

width={999} is a number, not the string "999".
enabled=TRUE is a boolean, case-insensitive.
title="Hello world" keeps the space because it's quoted.
tags=[...] is an array, and the comma inside "two, too" is not a delimiter because it's inside quotes.
"three]still text" contains a ] that is not the end of the array.
style={{ ... }} is a nested object, and "b{c}" has a { that is not a new object.

The hard part of parsing loose config is knowing when a special character is structural and when it's just a character inside a string. Quotes are the signal. Most naive parsers split on delimiters before they account for quoting, which is why commas inside strings break them.
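Here is a rough sketch of that difference, using the array contents from the example above. Neither function is taken from oparser. naiveSplit and splitTopLevel are hypothetical names, and the sketch assumes no escaped quotes inside strings.

```javascript
// Naive approach: split on commas first, never look at quotes.
function naiveSplit(inner) {
  return inner.split(',').map((part) => part.trim())
}

// Quote-aware approach: walk the string once, tracking whether we are
// inside a quoted string or nested inside brackets/braces.
function splitTopLevel(inner) {
  const items = []
  let current = ''
  let quote = null // the quote character we are inside, or null
  let depth = 0    // nesting depth of [ ] and { } outside quotes

  for (const ch of inner) {
    if (quote) {
      if (ch === quote) quote = null // closing quote
    } else if (ch === '"' || ch === "'") {
      quote = ch                     // opening quote
    } else if (ch === '[' || ch === '{') {
      depth++
    } else if (ch === ']' || ch === '}') {
      depth--
    } else if (ch === ',' && depth === 0) {
      // A real delimiter: finish the current item and start a new one.
      items.push(current.trim())
      current = ''
      continue
    }
    current += ch
  }
  if (current.trim()) items.push(current.trim())
  return items
}

const inner = 'one, "two, too", "three]still text"'

console.log(naiveSplit(inner))
// [ 'one', '"two', 'too"', '"three]still text"' ]

console.log(splitTopLevel(inner))
// [ 'one', '"two, too"', '"three]still text"' ]
```

The items still carry their surrounding quotes here. Stripping them is a separate, later step.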

A few more cases that show the "do the obvious thing" philosophy.

Bare keys become true, so flags work like JSX boolean props:

parse(`disabled isLoading`)
// { disabled: true, isLoading: true }

Unquoted URLs survive intact, brackets, query strings, hashes and all:

parse(`url=https://example.com?ids[]=1&ids[]=2#section`)
// { url: 'https://example.com?ids[]=1&ids[]=2#section' }

Comments outside quotes get stripped, but # and // inside a string are preserved:

parse(`
  width=100       // ignored
  height=200      # ignored
  label="keep # and // inside quotes"
`)
// { width: 100, height: 200, label: 'keep # and // inside quotes' }
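One way to get that behavior, and to keep unquoted URLs intact at the same time, is to treat # or // as a comment only when it sits outside quotes and follows whitespace. This is a sketch of that heuristic under my own assumptions, not oparser's actual rule. stripComments is a hypothetical helper, and it ignores escaped quotes.

```javascript
// Remove a trailing comment from one line, leaving quoted text and
// URL-like values (https://..., ...#section) untouched.
function stripComments(line) {
  let quote = null
  for (let i = 0; i < line.length; i++) {
    const ch = line[i]
    if (quote) {
      if (ch === quote) quote = null // leaving the quoted region
      continue
    }
    if (ch === '"' || ch === "'") {
      quote = ch // entering a quoted region
      continue
    }
    const startsComment = ch === '#' || (ch === '/' && line[i + 1] === '/')
    const afterSpace = i === 0 || /\s/.test(line[i - 1])
    if (startsComment && afterSpace) {
      return line.slice(0, i).trimEnd()
    }
  }
  return line
}

console.log(stripComments('width=100       // ignored'))
// width=100
console.log(stripComments('label="keep # and // inside quotes"'))
// label="keep # and // inside quotes"
console.log(stripComments('url=https://example.com?ids[]=1&ids[]=2#section'))
// url=https://example.com?ids[]=1&ids[]=2#section
```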

And because the original goal was React-like props, JSX and arrow functions inside braces are kept as literal strings instead of being mangled:

parse(`elem={<Component type="text" />}`)
// { elem: '<Component type="text" />' }

parse(`onClick={() => console.log('hi')}`)
// { onClick: "() => console.log('hi')" }

How it works under the hood

The forgiving behavior isn't magic. It's mostly about respecting quotes before doing anything else. The pipeline looks roughly like this:

1. Trim the input and unwrap any outer quotes.
2. Protect quoted regions by swapping spaces and special characters inside strings for temporary placeholders, so the next steps can't mistake them for structure.
3. Scan the string character by character, building up key and value buffers.
4. Track [, {, and quote boundaries so the scanner knows when an array or object actually closes.
5. Convert each value: detect booleans, numbers, null, and parse loose object/array syntax.
6. Restore the protected characters back into the final strings.
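The value-conversion step is the easiest one to picture in isolation. A minimal sketch, assuming a single scalar value has already been cut out of the input. coerceValue is a hypothetical name, and the exact rules oparser applies may differ.

```javascript
// Turn one raw scalar into a boolean, null, number, or string.
function coerceValue(raw) {
  const text = raw.trim()
  const lower = text.toLowerCase()

  // Booleans and null, case-insensitive (so TRUE and True both work).
  if (lower === 'true') return true
  if (lower === 'false') return false
  if (lower === 'null') return null

  // Only treat the text as a number if the whole thing is numeric,
  // so a range like 22-44 stays a string.
  if (/^-?\d+(\.\d+)?$/.test(text)) return Number(text)

  // Strip one pair of matching outer quotes, if present.
  const quoted = text.match(/^(["'])([\s\S]*)\1$/)
  if (quoted) return quoted[2]

  return text
}

console.log(coerceValue('999'))           // 999
console.log(coerceValue('TRUE'))          // true
console.log(coerceValue('"Hello world"')) // Hello world
console.log(coerceValue('22-44'))         // 22-44
```

Checking for quotes last means a quoted "999" stays the string '999', which matches the intuition that quoting something is how an author says "this is text".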

Step 2 is the whole trick. By neutralizing the contents of quoted strings before tokenizing, a comma inside "two, too" simply isn't visible as a comma when the array gets split. The structure-detection logic only ever sees real structural characters. Then the placeholders get swapped back at the end so the values come out exactly as written.

That ordering is the difference between a parser that handles tags=["a, b", "c"] and one that quietly returns ['"a', 'b"', '"c"'].
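The protect-then-restore idea can be sketched in a few lines. This is an illustration of the technique, not oparser's source. The placeholder scheme (private-use Unicode code points) and the names protectQuoted, restore, and splitArray are all assumptions of mine, and escaped quotes are not handled.

```javascript
// Characters that would look structural if they leaked out of a string.
const SPECIAL = [',', '[', ']', '{', '}', ' ', '=']

// Map each special character to a private-use code point and back.
const toPlaceholder = new Map(
  SPECIAL.map((ch, i) => [ch, String.fromCharCode(0xe000 + i)])
)
const fromPlaceholder = new Map(
  [...toPlaceholder].map(([ch, placeholder]) => [placeholder, ch])
)

// Swap special characters for placeholders, but only inside quotes.
function protectQuoted(input) {
  return input.replace(/"[^"]*"|'[^']*'/g, (quoted) =>
    [...quoted].map((ch) => toPlaceholder.get(ch) || ch).join('')
  )
}

// Swap placeholders back to the characters they stand for.
function restore(text) {
  return [...text].map((ch) => fromPlaceholder.get(ch) || ch).join('')
}

// With quoted regions protected, a plain split(',') is safe.
function splitArray(source) {
  const inner = protectQuoted(source).replace(/^\[|\]$/g, '')
  return inner
    .split(',')
    .map((item) => restore(item.trim()).replace(/^(["'])(.*)\1$/, '$2'))
}

console.log(splitArray('["a, b", "c"]'))
// [ 'a, b', 'c' ]

console.log(splitArray('[one, "two, too", "three]still text"]'))
// [ 'one', 'two, too', 'three]still text' ]
```

The split itself is exactly the naive one. It only becomes correct because the quoted regions were neutralized first.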

Wiring it into markdown-magic

Inside markdown-magic, the block parser pulls the raw option string out of the comment and hands it straight to oparser:

const { parse } = require('oparser')

const paramString = params.trim()
const parsedOptions = paramString ? parse(paramString) : {}

That's the entire integration. The block parser figures out where the options are (everything after the transform name, before the closing -->), and oparser figures out what they mean.
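For a feel of the block parser's half of that split, here is a simplified sketch that pulls the transform name and the raw option string out of an opening comment. The regular expression and the readOpenComment helper are hypothetical. markdown-magic's real block parser handles more cases, and this version would stop early on a value that contains -->.

```javascript
// Hypothetical pattern: "<!-- doc-start", a transform name,
// then everything up to the closing "-->".
const OPEN_COMMENT = /<!--\s*doc-start\s+([\w-]+)([\s\S]*?)-->/

function readOpenComment(line) {
  const match = line.match(OPEN_COMMENT)
  if (!match) return null
  return { transform: match[1], params: match[2].trim() }
}

console.log(
  readOpenComment('<!-- doc-start code src="./examples/1_simple.js" lines="1-20" -->')
)
// { transform: 'code', params: 'src="./examples/1_simple.js" lines="1-20"' }
```

The params string is what would then be handed to parse().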

This split is why markdown-magic could move from its old colon-and-ampersand syntax to React-like props without rewriting the core. The legacy syntax looked like this:

<!-- DOCS:START (CODE:src=./path/to/file.js&lines=22-44) -->

The modern syntax reads like JSX:

<!-- doc-start code src="./path/to/file.js" lines="22-44" -->

Both end up as { src: './path/to/file.js', lines: '22-44' }. The library still detects the old : or ? separator after the transform name and routes those blocks to a legacy parser for backwards compatibility, but everything new flows through oparser. The transform author just receives a clean options object and never thinks about parsing at all.
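The legacy form is close enough to a URL query string that a rough reader for it fits in a few lines. This is my own sketch built on the standard URLSearchParams class, not the legacy parser markdown-magic ships. parseLegacy is a hypothetical name, and URLSearchParams will also decode + and percent-escapes, which may or may not match the original behavior.

```javascript
// Hypothetical legacy reader:
// "(CODE:src=./a.js&lines=22-44)" -> { transform, options }
function parseLegacy(text) {
  const match = text.match(/^\(\s*([A-Za-z_]+)\s*[:?]\s*([\s\S]*?)\s*\)$/)
  if (!match) return null
  // Treat the part after the separator as key=value pairs joined by &.
  const options = Object.fromEntries(new URLSearchParams(match[2]))
  return { transform: match[1], options }
}

console.log(parseLegacy('(CODE:src=./path/to/file.js&lines=22-44)'))
// { transform: 'CODE', options: { src: './path/to/file.js', lines: '22-44' } }
```

Either way, what reaches the transform is the same plain object.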

Where else this is useful

The "loose key-value text to object" problem shows up far more often than you'd expect once you go looking for it:

CLI argument parsing where you want users to be able to "mess up" flag syntax and still get the right result.
CMS or frontmatter-style config authored by non-engineers.
Shortcode / directive systems in any markdown or templating pipeline.
Annotations in comments, like the doc-gen blocks here, where strict syntax would punish authors for typos.

Anywhere a human types config into a text field and a strict parser would reject it for a missing quote, a forgiving parser does the obvious thing instead.

Wrap

The lesson I keep relearning: the format your tool accepts and the format it works with internally don't have to match. Internally markdown-magic wants a clean options object. Externally I want authors to scribble JSX-ish props and have it just work. A small forgiving parser in between is what buys you both.

Parser: oparser on GitHub — npm install oparser
The tool it powers: markdown-magic
Background on the syncing workflow: How to synchronize remote content in markdown files

If you're building anything that takes hand-written config, try giving people the forgiving version first. The strict parser can always run after.


David Wells
