Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tabs are not preserved in raw output #3440

Closed
dead-claudia opened this issue Sep 2, 2024 · 3 comments · Fixed by #3438
Closed

Tabs are not preserved in raw output #3440

dead-claudia opened this issue Sep 2, 2024 · 3 comments · Fixed by #3438
Labels
has PR The issue has a Pull Request associated

Comments

@dead-claudia
Copy link

dead-claudia commented Sep 2, 2024

Marked version: v14.1.0

Describe the bug
Tabs are expanded to 4 spaces in the output.

To Reproduce
Steps to reproduce the behavior:

  1. Run npx marked --tokens against the raw version of https://github.com/MithrilJS/mithril.js/blob/7b819f354610d5102dc4c707cee2167eef7fbf47/assets/README.md
  2. Inspect the output at line 389. (It's the raw field starting with 1. Open `assets/logo.svg` in Inkscape.)

Expected behavior
Tabs should be present inside the nested bulleted list's raw output in raw form.


I was attempting to get proper column counting of offsets in a repo that (mostly) uses tabs for indentation (mainly for historical reasons), and this throws that off. The end goal is to try to get line numbers to catch docs issues better. My walker is inspired by discussion in #2134 and looks roughly like this:

const visitCellList = (startOffset, parentOffset, cells, parent) => {
	for (const cell of cells) {
		parentOffset = visitList(startOffset, parentOffset, cell.tokens, parent)
	}
	return parentOffset
}

const visitList = (startOffset, parentOffset, tokens, parent) => {
	for (const child of tokens) {
		// ⬇ This right here is where I'm running into issues with tabs.
		const delta = parent.raw.indexOf(child.raw, parentOffset) - parentOffset
		const length = delta + child.raw.length
		visit(startOffset + delta, child)
		parentOffset += length
		startOffset += length
	}
	return parentOffset
}

const visit = (startOffset, token) => {
	const endOffset = startOffset + token.raw.length

	switch (token.type) {
		// some checks with specific node types omitted - they're irrelevant to the bug

		case "list":
			visitList(startOffset, 0, token.items, token)
			break

		case "table": {
			let parentOffset = visitCellList(startOffset, 0, token.header, token)
			startOffset += parentOffset
			for (const row of token.rows) {
				parentOffset = visitCellList(startOffset, parentOffset, row, token)
				startOffset += parentOffset
			}
			break
		}

		default:
			if (token.tokens !== undefined) {
				visitList(startOffset, 0, token.tokens, token)
			}
	}
}

visitList(0, 0, marked.lexer(contents), {raw: contents.replace(/\t/g, "    ")})

It's that delta calculation in the third line that's tripping me up. I could work around it by rolling a delta adjustment function that considers source tabs as equal to 4 spaces in the output, but it'd be very awkward to write and rather long for what's already a pretty lengthy file to begin with.

Of course, walkTokens won't work directly, as I need to actually work with the structure itself to make deltas work properly.

@dead-claudia
Copy link
Author

Okay, I came up with a pretty nasty workaround. Other functions are the same, except for visitList and the new advanceTabViaSpaceReplacement (the workaround).

// This function exists strictly to eat the spaces as I go through tabs in the raw source
const advanceTabViaSpaceReplacement = (offset, raw, start, end) => {
    while (start < end) {
        const real = contents.charCodeAt(offset++)
        const synthetic = raw.charCodeAt(start++)
        if (
            real === 0x09 && synthetic === 0x20 &&
            raw.charCodeAt(start) === 0x20 &&
            raw.charCodeAt(++start) === 0x20 &&
            raw.charCodeAt(++start) === 0x20
        ) {
            start++
        }
    }

    return offset
}

// This function is modified from the version in the original comment
const visitList = (startOffset, parentOffset, tokens, parent) => {
    for (const child of tokens) {
        const nextIndex = parent.raw.indexOf(child.raw, parentOffset)
        const innerStart = advanceTabViaSpaceReplacement(startOffset, parent.raw, parentOffset, nextIndex)
        const outerStart = advanceTabViaSpaceReplacement(innerStart, child.raw, 0, child.raw.length)
        parentOffset = nextIndex + child.raw.length
        startOffset = outerStart
        visit(innerStart, child)
    }
    return parentOffset
}

@dead-claudia
Copy link
Author

Okay, here's the live version of what I did to resolve my issue, if you want to see.

@UziTech
Copy link
Member

UziTech commented Sep 2, 2024

Should be fixed by #3438

@UziTech UziTech added the has PR The issue has a Pull Request associated label Sep 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
has PR The issue has a Pull Request associated
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants