-
Notifications
You must be signed in to change notification settings - Fork 282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Auto-detect language per-line is guaranteed to produce poor results #392
Comments
Hi @joshgoebel, Thanks for bringing this up. I was under the assumption that the code was fine, and I even updated it based on your suggestion highlightjs/highlight.js#2277 (comment) leading to 2416b3d If I understand correctly you are saying that the only option to correctly highlight unknown languages would be to pass as much contents as possible and then split then line by line? Does this also apply to code I already know the language? The HighlightAuto usage is just a best effort try, since I am assuming the rest is doing a good job. |
Well, "correctly highlight" might be overstating it. The way IMHO to "correctly attempt" it is to give us all the content and let us decide what it is... that will at least have consistent results, even if not 100% correct. And it's not the "only" way... you could do a "pre-highlight" (of all the lines) first to get the language... then do your line by line approach... but again v11 will require a different implementation, so I don't think it's even worth pursuing that avenue.
I opened this issue solely because of the random behavior with auto-detect when it's used line by line. But yes, with v11 how you are doing this will have to be rewritten because the engine does not expose it's internal parsing state to the outside anymore.
But if the results are random and haphazard I'd suggest it'd be better to default to nothing. |
@joshgoebel I believe I am not understanding exactly what you mean. So maybe if I try to reduce the problem space it will help me understand. I would like to understand what would be the best way possible to do the highlight. If I invoke highlight.js for each line individually like this |
(@rtfpessoa This whole thread might be useful reading: https://meta.stackexchange.com/q/355852/188348) |
@iHiD thanks for the reference. Will definitely read it. |
That's not really what you want to do as it will break on any scopes that persist past the end of a line boundary. What you'd really want to do:
You'd really need to do this for each section of a diff (if they are non-sequential). So if a diff included 3 discrete changes, ~10 lines each then you'd be grouping each of those 3 changes into blocks and then highlighting all 3 blocks. Then splitting them apart again to get at the individual highlighted lines. |
Hey, current maintainer of Highlight.js here. This just came to my attention via #391.
You're looping over lines and then calling
highlightAuto
on every line (when you don't have a known language). This is not recommended and guaranteed to produce poor results. Auto-detect is not intended to be useful with such little data and the noise will often (as reported in #391) be much higher than the signal - you're just as likely to get random languages than anything useful. There will be color, but often all wrong.If you do wish to use auto-detect you should pass us the ENTIRE document (or at the very least all the available lines from the document/diff), then look at the language we determine it to be, then use that language for every single line.
You'll have to take this approach with version 11 anyways since you'll have to do the highlighting in a single pass (rather than per-line). So calling
highlightAuto
upfront for all available lines and letting it use the greater amount of content available for it's auto-detection... then splitting that result back out into the individual lines you need - already highlighted for you.You'll have to do it twice of source, once each for the before and after streams.
The text was updated successfully, but these errors were encountered: