
Question: is pure content-based detection possible? #186

Closed
quasilyte opened this issue Jan 8, 2019 · 4 comments

@quasilyte

If a code snippet is extracted from somewhere and you no longer have the original filename, it seems impossible to detect its language with enry.

For example, user might have a source code like:

err := cmd.Run()
if err != nil {
  log.Fatal(err)
}

It could be inferred to be Go.

I guess this falls outside of enry's features and responsibilities?

@creachadair
Contributor

I don't know all the details of the original linguist library that Enry emulates, but I think Enry uses more or less the same tagging strategy and requires a filename to seed the set of possible languages. It uses file contents to narrow the selection, but I don't believe linguist supports using the content alone.

In general one can get reasonable guesses from content alone: Many languages have enough shibboleths to at least narrow the field. That's partly how file(1) works, for example. But that isn't how linguist is constructed, so it'd be a fairly substantial change.
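The shibboleth idea can be sketched in a few lines of Go. The patterns and language set below are invented for illustration; they are not taken from linguist, enry, or file(1):

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative "shibboleths": content patterns that strongly suggest a
// language when they appear in a snippet.
var shibboleths = []struct {
	lang string
	re   *regexp.Regexp
}{
	{"Go", regexp.MustCompile(`\w+ := |(?m)^package \w+`)},
	{"Shell", regexp.MustCompile(`(?m)^#!/bin/(ba)?sh`)},
	{"Python", regexp.MustCompile(`(?m)^def \w+\(.*\):`)},
}

// guessLanguage returns the first language whose pattern matches,
// or "Unknown" when none fires.
func guessLanguage(snippet string) string {
	for _, s := range shibboleths {
		if s.re.MatchString(snippet) {
			return s.lang
		}
	}
	return "Unknown"
}

func main() {
	snippet := "err := cmd.Run()\nif err != nil {\n\tlog.Fatal(err)\n}\n"
	fmt.Println(guessLanguage(snippet)) // the := assignment gives Go away
}
```

A real detector would need many more patterns and a way to resolve conflicts between them, which is roughly where a trained classifier comes in.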

Cc @bzz in case I missed something.

@bzz
Contributor

bzz commented Jan 10, 2019

@quasilyte TL;DR: there is one option you can try, but overall the main use case for Enry is file-level language detection, so its high-level API always assumes the file name is present.

I'm curious if you could share a little bit about your use case here, for cases where the file names are not available?


More specifically, as @creachadair noted above, enry follows the design of github linguist, so it consists of a number of Strategies for detecting the language of a given file.

Typically, if there is no single 100% match, all strategies are applied sequentially to narrow down the possible options. They are:

  • GetLanguagesByExtension
  • GetLanguagesByFilename
  • GetLanguagesByShebang
  • GetLanguagesByContent
  • GetLanguagesByClassifier
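The sequential-narrowing design can be sketched like this. The strategy signature and the toy stand-in bodies below are illustrative assumptions, not enry's real internals:

```go
package main

import (
	"fmt"
	"strings"
)

// strategy takes the current candidate set and returns a narrower one.
type strategy func(filename string, content []byte, candidates []string) []string

// detect applies strategies in order and stops as soon as a single
// language remains.
func detect(filename string, content []byte, strategies []strategy) string {
	var candidates []string
	for _, s := range strategies {
		if narrowed := s(filename, content, candidates); len(narrowed) > 0 {
			candidates = narrowed
		}
		if len(candidates) == 1 {
			return candidates[0]
		}
	}
	if len(candidates) > 0 {
		return candidates[0] // still ambiguous: fall back to the first candidate
	}
	return "Unknown"
}

// Toy stand-ins for two of the strategies listed above.
func byExtension(filename string, _ []byte, _ []string) []string {
	if strings.HasSuffix(filename, ".go") {
		return []string{"Go"}
	}
	if strings.HasSuffix(filename, ".h") {
		return []string{"C", "C++", "Objective-C"}
	}
	return nil
}

func byContent(_ string, content []byte, candidates []string) []string {
	if strings.Contains(string(content), "@interface") {
		return []string{"Objective-C"}
	}
	return candidates
}

var chain = []strategy{byExtension, byContent}

func main() {
	// The ambiguous .h extension yields three candidates; the content
	// strategy then narrows them to one.
	fmt.Println(detect("foo.h", []byte("@interface Foo : NSObject"), chain))
}
```

Note how the filename seeds the candidate set in the first step, which is why the high-level API needs it.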

There is one strategy, though, GetLanguagesByClassifier, that uses a Bayesian classifier of the content (trained on these samples) and does not use the filename at all. So it's the closest thing to what you are looking for.

The good news is that it's exposed as a low-level public API, and you can try using it directly through enry.GetLanguageByClassifier by passing in an empty filename.

The bad news is that, as it's a "last resort" strategy, the API is designed to disambiguate between a given set of languages (typically, the guesses from previous strategies). So you have to pass in a slice of potential language name aliases.

If you pass in aliases for all possible languages (the LanguagesByAlias keys from the link above), the accuracy of the predictions most probably will not be that high. But you should try it yourself and see if it works for your case.
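The shape of such a candidate-restricted classification can be illustrated with a self-contained toy. The token weights here are invented; enry's real classifier is a Bayesian model trained on linguist's sample corpus, not a token counter:

```go
package main

import (
	"fmt"
	"strings"
)

// tokenHints maps a language to tokens that hint at it. Purely
// illustrative: a stand-in for a trained model's learned features.
var tokenHints = map[string][]string{
	"Go":     {":=", "func ", "package "},
	"Shell":  {"#!/bin/sh", "echo ", "esac"},
	"Python": {"def ", "import ", "self"},
}

// classify scores the content against the given candidates only,
// mirroring the "disambiguate between a given set" API shape.
func classify(content string, candidates []string) string {
	best, bestScore := "", -1
	for _, lang := range candidates {
		score := 0
		for _, tok := range tokenHints[lang] {
			score += strings.Count(content, tok)
		}
		if score > bestScore {
			best, bestScore = lang, score
		}
	}
	return best
}

func main() {
	snippet := "err := cmd.Run()\nif err != nil {\n\tlog.Fatal(err)\n}\n"
	fmt.Println(classify(snippet, []string{"Go", "Python", "Shell"}))
}
```

The key point carries over to the real API: the larger and noisier the candidate set, the weaker the discrimination, which is why passing all of LanguagesByAlias tends to hurt accuracy.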

On GetLanguagesByContent: although it seems applicable to your case, it is not, as it uses regexp heuristics only to disambiguate between a limited number of ~60 cases where multiple languages share the same file extension, like .h for C, Objective-C, and C++, etc.
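For the .h case, that kind of regexp heuristic looks roughly like this. The patterns below are invented for the sketch and are not linguist's actual rules:

```go
package main

import (
	"fmt"
	"regexp"
)

// hHeuristics tries content markers that pick one of the languages
// sharing the ambiguous .h extension.
var hHeuristics = []struct {
	lang string
	re   *regexp.Regexp
}{
	{"Objective-C", regexp.MustCompile(`@(interface|implementation|end)\b`)},
	{"C++", regexp.MustCompile(`\b(template|namespace|class)\b`)},
}

// disambiguateH returns the first language whose marker fires,
// defaulting to C when nothing does.
func disambiguateH(content []byte) string {
	for _, h := range hHeuristics {
		if h.re.Match(content) {
			return h.lang
		}
	}
	return "C"
}

func main() {
	fmt.Println(disambiguateH([]byte("namespace foo { class Bar {}; }")))
}
```

This is why the strategy needs the extension first: the heuristics only make sense within a known, small set of colliding languages.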

Hope this helps! And I'm looking forward to learning more about your use case.

@quasilyte
Author

I'm curious if you could share a little bit about your use case here, for cases where the file names are not available?

For example, you extracted various snippets from a book, a README, or any other source.
There can be 2-3 different languages there, for example bash, go, and various config files.
In other words, the source of those snippets might give a clue about how many different languages there are, but it doesn't annotate the snippets with a language itself, so you need to do some classification.

Instead of rolling something of my own, I tried enry.
I could generate N fake filenames, like foo.json, foo.bash, foo.go, and try 3 different content-based matching calls, but it seems that, for instance, Go matching does not use any content-based decision making, so if the extension is ".go", the file is treated as a Go source file. I had no better ideas there.

Hope this helps! And I'm looking forward to learning more about your use case.

It does help. Thank you!
I'll try what you described, since I usually have a set of the most likely languages that could be there.

@bzz
Contributor

bzz commented Jun 1, 2019

Closing as there is no further discussion and the question seems to be answered.

@bzz bzz closed this as completed Jun 1, 2019