Expose guess_lexer() functionality to guess lexers for text. #83
Comments
Mrm. I've mostly been against auto-guessing because of false positives, but I guess we can give it a whirl. Would you suggest adding a 'Magic Guess' language to the dropdown and setting it as default? You are correct that it is more consistent with steck.
Yeah, a dropdown option would be what I'm after. It need not be the default until we have a good feeling about the functionality, though.
Mrm, if it's not the default it would become the default very shortly afterwards, since it's a solution to the problem of people not selecting lexers. The other option would be to have people select the lexer on view (or repaste with a certain lexer). Do you want to take a look or shall I fix this up over the weekend?
I can give this a try.
I did a quick test-drive and the results were quite bad. It often did a fallback to
--- a/pinnwand/handler/website.py
+++ b/pinnwand/handler/website.py
@@ -5,6 +5,7 @@ from datetime import datetime
 import docutils.core
 import tornado.web
+from pygments.lexers import guess_lexer
 from pinnwand import database, path, utility, error
@@ -174,6 +175,16 @@ class CreateAction(Base):
             raise error.ValidationError()
         for (lexer, raw, filename) in zip(lexers, raws, filenames):
+            log.info(f"CreateAction.post: lexer is {lexer}")
+            if lexer == 'AUTO':
+                try:
+                    lexer = guess_lexer(raw).aliases[0]
+                    log.info(f"CreateAction.post: guessed lexer is {lexer}")
+                except ValueError:
+                    # Fall back to plain text
+                    log.info(f"CreateAction.post: guess lexer fallback text")
+                    lexer = "text"
+
             if lexer not in utility.list_languages():
                 log.info("CreateAction.post: a file had an invalid lexer")
                 raise error.ValidationError()
diff --git a/pinnwand/template/part/lexer-select.html b/pinnwand/template/part/lexer-select.html
index 5e6ed35..571c4c4 100644
--- a/pinnwand/template/part/lexer-select.html
+++ b/pinnwand/template/part/lexer-select.html
@@ -1,4 +1,5 @@
<select name="lexer">
+ <option value="AUTO">Autodetect</option>
{% if handler.application.configuration.preferred_lexers %}
{% for key in handler.application.configuration.preferred_lexers %}
<option value="{{ key }}"{% if selected == key %} selected="selected"{% end %}>{{ lexers[key] }}</option>
Does it perform better if a lower limit is added, for example 'only guess if at least n characters of text have been provided', and otherwise just use the 'text' lexer?
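To make the lower-limit idea concrete, here is a minimal sketch. The name `choose_lexer`, the `MIN_GUESS_CHARS` cut-off of 200 and the injectable `guess` parameter are all made up for illustration and are not part of pinnwand; only `guess_lexer` and its `ValueError`-derived failure come from Pygments.

```python
from typing import Callable, Optional

# Hypothetical cut-off; the thread does not settle on a value for n.
MIN_GUESS_CHARS = 200


def choose_lexer(raw: str, min_chars: int = MIN_GUESS_CHARS,
                 guess: Optional[Callable] = None) -> str:
    """Return a lexer alias for `raw`, guessing only when there is enough text."""
    if len(raw.strip()) < min_chars:
        # Too little input to guess from, so do not even try.
        return "text"
    if guess is None:
        # Imported lazily so the function can be tried with a stand-in guesser.
        from pygments.lexers import guess_lexer as guess
    try:
        aliases = guess(raw).aliases
    except ValueError:
        # pygments.util.ClassNotFound is a subclass of ValueError.
        return "text"
    # Some lexers may carry no aliases at all; treat that like a failed guess.
    return aliases[0] if aliases else "text"
```

Whether a character count is the right measure (rather than, say, a line count) is an open question; the sketch only shows where such a check would sit relative to the guess.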
Sampled this repository using this script: guess.py
#!/usr/bin/env python3
Basically I think we should just fall back to
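For anyone wanting to repeat this kind of sampling, a rough sketch follows. It is not the guess.py used above (that script is not reproduced here); `sample_guesses` and its parameters are hypothetical names. It compares what `guess_lexer` says about a file's content with what `get_lexer_for_filename` says about its name, both of which are real Pygments functions.

```python
from collections import Counter
from pathlib import Path
from typing import Callable, Optional


def sample_guesses(root: str, guess: Optional[Callable] = None,
                   by_filename: Optional[Callable] = None) -> Counter:
    """Count (filename-based lexer, content-guessed lexer) pairs under `root`."""
    if guess is None or by_filename is None:
        # Imported lazily so stand-in functions can be passed in instead.
        from pygments.lexers import get_lexer_for_filename, guess_lexer
        guess = guess or guess_lexer
        by_filename = by_filename or get_lexer_for_filename

    tally: Counter = Counter()
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        try:
            expected = by_filename(path.name).name
        except ValueError:
            continue  # no filename-based lexer to compare against
        try:
            guessed = guess(text).name
        except ValueError:
            guessed = "(no guess)"
        tally[(expected, guessed)] += 1
    return tally
```

Calling `sample_guesses(".").most_common()` in a checkout would list the most frequent pairs first, which makes the mismatches easy to spot. Treating the filename-based lexer as the ground truth is itself an assumption.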
The same test on a per line basis:
Some interesting results here; it seems like we can do two or three things.
I'd say 1 or 2 have the preference. There's a 4th which would be based on filename but these are rarely supplied. We likely also want to treat Python 2 as Python (3) by default.
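The filename-based option and the Python 2 remark could be combined roughly as below. This is only a sketch: `resolve_lexer`, `PYTHON2_ALIASES` and the injectable parameters are hypothetical, and it assumes the Python 2 lexer reports the aliases `python2`/`py2` while `python` means Python 3, as in recent Pygments releases. `guess_lexer` and `guess_lexer_for_filename` are the real Pygments functions.

```python
from typing import Callable, Optional

# Assumed aliases of the Python 2 lexer; folded into plain "python" (Python 3).
PYTHON2_ALIASES = {"python2", "py2"}


def resolve_lexer(raw: str, filename: str = "",
                  guess: Optional[Callable] = None,
                  guess_for_filename: Optional[Callable] = None) -> str:
    """Pick a lexer alias: filename first (if supplied), then content, then 'text'."""
    if guess is None or guess_for_filename is None:
        # Imported lazily so stand-in functions can be passed in instead.
        from pygments.lexers import guess_lexer, guess_lexer_for_filename
        guess = guess or guess_lexer
        guess_for_filename = guess_for_filename or guess_lexer_for_filename

    candidates = []
    if filename:
        # Filenames are rarely supplied, but are a strong hint when present.
        candidates.append(lambda: guess_for_filename(filename, raw))
    candidates.append(lambda: guess(raw))

    for candidate in candidates:
        try:
            aliases = candidate().aliases
        except ValueError:
            continue  # ClassNotFound subclasses ValueError; try the next source
        if aliases:
            alias = aliases[0]
            return "python" if alias in PYTHON2_ALIASES else alias
    return "text"
```

The result would still need to pass the existing `utility.list_languages()` check from the patch above before being accepted.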
Syntax highlighting is very important to maintain readability, but alas people are lazy.
This is where the guess_lexer function comes in. Do you think autodetection of the lexer using this is feasible? Steck already guesses based on mimetype, so pastes submitted via the web could benefit from a similar feature. https://github.com/supakeen/steck/blob/master/steck.py#L68