Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Markdown comparison includes headings unnecessarily #452

Closed
adaboese opened this issue Jan 1, 2024 · 7 comments
Closed

Markdown comparison includes headings unnecessarily #452

adaboese opened this issue Jan 1, 2024 · 7 comments

Comments

@adaboese
Copy link

adaboese commented Jan 1, 2024

it('diffs markdown (2)', () => {
  const a =
    '## AIMD generated content\n\nJust for fun, here is the same article scrambled using AIMD anti-AI detection techniques:';
  const b =
    "## AIMD generated content\n\nJust for fun, here is the same article scrambled using AIMD anti-AI detection techniques. If you're curious about building depth on related topics and establishing yourself as an authority, check out our detailed guide on [Understanding Topical Authority](https://aimd.app/blog/2023-12-26-revolutionizing-the-marketing-hierarchy-why-topical-authority-is-the-new-currency).";

  const diff = diffSentences(a, b);
});

Produces:

[
  {
    count: 1,
    added: undefined,
    removed: true,
    value: '## AIMD generated content\n' +
      '\n' +
      'Just for fun, here is the same article scrambled using AIMD anti-AI detection techniques:'
  },
  {
    count: 4,
    added: true,
    removed: undefined,
    value: '## AIMD generated content\n' +
      '\n' +
      "Just for fun, here is the same article scrambled using AIMD anti-AI detection techniques. If you're curious about building depth on related topics and establishing yourself as an authority, check out our detailed guide on [Understanding Topical Authority](https://aimd.app/blog/2023-12-26-revolutionizing-the-marketing-hierarchy-why-topical-authority-is-the-new-currency)."
  }
]

It is not clear why ## AIMD generated content\n is marked as removed.

@adaboese
Copy link
Author

adaboese commented Jan 1, 2024

Interestingly, if I paste the same content into http://incaseofstairs.com/jsdiff/, I cannot replicate the issue.

## AIMD generated content

Just for fun, here is the same article scrambled using AIMD anti-AI detection techniques:
## AIMD generated content

Just for fun, here is the same article scrambled using AIMD anti-AI detection techniques. If you're curious about building depth on related topics and establishing yourself as an authority, check out our detailed guide on [Understanding Topical Authority](https://aimd.app/blog/2023-12-26-revolutionizing-the-marketing-hierarchy-why-topical-authority-is-the-new-currency).

@adaboese
Copy link
Author

adaboese commented Jan 1, 2024

@kpdecker does http://incaseofstairs.com/jsdiff/ do something different than what's in my code?

@adaboese
Copy link
Author

adaboese commented Jan 1, 2024

oh, it is not using sentences. It uses words diff. I will evaluate that.

@adaboese
Copy link
Author

adaboese commented Jan 1, 2024

Looks like switching to words algorithm comes with tradeoffs though.

it.only('recognizes different sentences', () => {
  const a = 'This is a sentence. This is another sentence.';
  const b = 'This is a sentence. And this is a different sen.';
  const abDiff = diff(a, b);

  expect(abDiff).toStrictEqual([
    {
      count: 17,
      value: 'This is a sentence. This is ',
    },
    {
      added: true,
      count: 9,
      removed: undefined,
      value: 'a different',
    },
    {
      count: 9,
      value: ' sentence.',
    },
  ]);
});

This produces a fairly unreadable diff.

[
  { count: 9, value: 'This is a sentence. ' },
  { count: 1, added: undefined, removed: true, value: 'This' },
  { count: 1, added: true, removed: undefined, value: 'And' },
  { count: 1, value: ' ' },
  { count: 2, added: true, removed: undefined, value: 'this ' },
  { count: 2, value: 'is ' },
  { count: 1, added: undefined, removed: true, value: 'another' },
  { count: 1, added: true, removed: undefined, value: 'a' },
  { count: 1, value: ' ' },
  { count: 1, added: undefined, removed: true, value: 'sentence' },
  { count: 3, added: true, removed: undefined, value: 'different sen' },
  { count: 1, value: '.' }
]

@adaboese
Copy link
Author

adaboese commented Jan 1, 2024

Based on my tests https://github.com/google/diff-match-patch is producing the best results when used with diff_cleanupSemantic. Looks I could continue using jsdiff if I just extract logic of diff_cleanupSemantic.

@adaboese
Copy link
Author

adaboese commented Jan 1, 2024

Hm. That won't work as easy as I thought it would. It self-references the constructor and bunch of prototype functions. It would be nice if jsdiff supported equivalent logic as that package does not appear to be actively maintained.

@ExplodingCabbage
Copy link
Collaborator

diffSentences, right now, purports to simply diff two strings of "text" - not of Markdown. But the rules you want only make sense if the text is Markdown and the tokenizer knows that.

We could make newlines be treated as a sentence break, but that wouldn't actually give you what you want here, because Markdown lets you hard-wrap paragraphs with newlines that have no effect on the rendered result. Nor would treating only double-newlines as sentence breaks work, since a heading can be terminated with a single newline. The only solution is a tokenizer that recognises the start of a heading (or perhaps also a bullet point or numbered list item, if you want to treat list items as always being distinct sentences).

I thus don't really see a possible change to jsdiff that would give you what you want without also baking an assumption into diffSentences that its arguments are Markdown, thus I think I probably shouldn't try to "fix" this issue with a change to the library. To get the behaviour you want, you can instead write some custom logic to split the Markdown documents you're diffing into "sentences" (per whatever precise notion of "sentence" you want to use and then pass the result to diffArrays, and my instinct (admittedly without too much research or deep thought) is that that's the only strategy the library should support, for three reasons:

  • the exact rules about what terminates a heading or list item or whatever may well be dependent on the Markdown dialect you're using; I know Stack Overflow's flavour of Markdown and GitHub's flavour of Markdown used to be quite different, and don't know whether things are more standardized today.
  • there are several questions I can think of about how to split Markdown into "sentences" that don't have obvious, objective answer (e.g. is each list item in a bulleted or numbered list a distinct sentence? Do inline images in a paragraph constitute sentence breaks? How should code blocks be handled?)
  • In general, I would like to encourage people diffing text in computer languages to take an approach of running a language-specific parser/tokenizer, combining or splitting language tokens as they see fit for their particular purposes, and passing an array of tokens to diffArrays, rather than bloating jsdiff with more language-specific diff functions.

@ExplodingCabbage ExplodingCabbage closed this as not planned Won't fix, can't repro, duplicate, stale Aug 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants