Auto-detect language per-line is guaranteed to produce poor results #392

joshgoebel · 2021-06-04T12:48:13Z

Hey, current maintainer of Highlight.js here. This just came to my attention via #391.

You're looping over lines and then calling highlightAuto on every line (when you don't have a known language). This is not recommended and guaranteed to produce poor results. Auto-detect is not intended to be useful with such little data and the noise will often (as reported in #391) be much higher than the signal - you're just as likely to get random languages than anything useful. There will be color, but often all wrong.

If you do wish to use auto-detect you should pass us the ENTIRE document (or at the very least all the available lines from the document/diff), then look at the language we determine it to be, then use that language for every single line.

You'll have to take this approach with version 11 anyways since you'll have to do the highlighting in a single pass (rather than per-line). So calling highlightAuto upfront for all available lines and letting it use the greater amount of content available for it's auto-detection... then splitting that result back out into the individual lines you need - already highlighted for you.

You'll have to do it twice of source, once each for the before and after streams.

The text was updated successfully, but these errors were encountered:

rtfpessoa · 2021-06-04T17:07:49Z

Hi @joshgoebel,

Thanks for bringing this up.

I was under the assumption that the code was fine, and I even updated it based on your suggestion highlightjs/highlight.js#2277 (comment) leading to 2416b3d

If I understand correctly you are saying that the only option to correctly highlight unknown languages would be to pass as much contents as possible and then split then line by line?

Does this also apply to code I already know the language?

The HighlightAuto usage is just a best effort try, since I am assuming the rest is doing a good job.

joshgoebel · 2021-06-04T21:06:29Z

If I understand correctly you are saying that the only option to correctly highlight unknown languages would be to pass as much contents as possible and then split then line by line?

Well, "correctly highlight" might be overstating it. The way IMHO to "correctly attempt" it is to give us all the content and let us decide what it is... that will at least have consistent results, even if not 100% correct. And it's not the "only" way... you could do a "pre-highlight" (of all the lines) first to get the language... then do your line by line approach... but again v11 will require a different implementation, so I don't think it's even worth pursuing that avenue.

Does this also apply to code I already know the language?

I opened this issue solely because of the random behavior with auto-detect when it's used line by line. But yes, with v11 how you are doing this will have to be rewritten because the engine does not expose it's internal parsing state to the outside anymore.

The HighlightAuto usage is just a best effort try, since I am assuming the rest is doing a good job.

But if the results are random and haphazard I'd suggest it'd be better to default to nothing.

rtfpessoa · 2021-06-04T22:04:13Z

@joshgoebel I believe I am not understanding exactly what you mean. So maybe if I try to reduce the problem space it will help me understand.
For now, let's ignore the fact that we might not know the language for some files, consider that I am already targeting version 11 and that I can never get all the code in the files since diffs by default have a limited context.

I would like to understand what would be the best way possible to do the highlight.

If I invoke highlight.js for each line individually like this hljs.highlight(codeString, { language, ignoreIllegals: true }) is it going to be a problem in v11?
If yes, what would be the alternative? Would I have to pass all the lines I have for it to work?

iHiD · 2021-06-04T22:25:52Z

(@rtfpessoa This whole thread might be useful reading: https://meta.stackexchange.com/q/355852/188348)

rtfpessoa · 2021-06-04T22:36:40Z

(@rtfpessoa This whole thread might be useful reading: https://meta.stackexchange.com/q/355852/188348)

@iHiD thanks for the reference. Will definitely read it.

joshgoebel · 2021-06-05T05:43:08Z

If I invoke highlight.js for each line individually like this hljs.highlight(codeString, { language, ignoreIllegals: true }) is it going to be a problem in v11?

That's not really what you want to do as it will break on any scopes that persist past the end of a line boundary. What you'd really want to do:

Collect all sequential diff lines
Do this for both the "original" and the "changed" versions
So you'll have say 10 lines of code in two strings now, "before" and "after"
Highlight both blocks of code.
Split the lines (this will require some small amount of parsing and fixing up tags to end at newlines and re-open on the following line)... that can be done in 5-10 lines though as our HTML output is VERY clean easy easy to parse.

You'd really need to do this for each section of a diff (if they are non-sequential). So if a diff included 3 discrete changes, ~10 lines each then you'd be grouping each of those 3 changes into blocks and then highlighting all 3 blocks. Then splitting them apart again to get at the individual highlighted lines.

rtfpessoa mentioned this issue Jan 25, 2022

[Question] About nodeStream and mergeStream #424

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto-detect language per-line is guaranteed to produce poor results #392

Auto-detect language per-line is guaranteed to produce poor results #392

joshgoebel commented Jun 4, 2021 •

edited

Loading

rtfpessoa commented Jun 4, 2021

joshgoebel commented Jun 4, 2021

rtfpessoa commented Jun 4, 2021

iHiD commented Jun 4, 2021

rtfpessoa commented Jun 4, 2021

joshgoebel commented Jun 5, 2021 •

edited

Loading

Auto-detect language per-line is guaranteed to produce poor results #392

Auto-detect language per-line is guaranteed to produce poor results #392

Comments

joshgoebel commented Jun 4, 2021 • edited Loading

rtfpessoa commented Jun 4, 2021

joshgoebel commented Jun 4, 2021

rtfpessoa commented Jun 4, 2021

iHiD commented Jun 4, 2021

rtfpessoa commented Jun 4, 2021

joshgoebel commented Jun 5, 2021 • edited Loading

joshgoebel commented Jun 4, 2021 •

edited

Loading

joshgoebel commented Jun 5, 2021 •

edited

Loading