
Question: is pure content-based detection possible? #186

Closed
quasilyte opened this issue Jan 8, 2019 · 4 comments

@quasilyte

If a code snippet is extracted from somewhere and you no longer have the original filename, it seems impossible to detect its language with enry.

For example, user might have a source code like:

err := cmd.Run()
if err != nil {
  log.Fatal(err)
}

It could be inferred to be Go.

I guess this falls outside of enry's features and responsibilities?

@creachadair
Contributor

I don't know all the details of the original linguist library that Enry emulates, but I think Enry uses more or less the same tagging strategy and requires a filename to seed the set of possible languages. It uses file contents to narrow the selection, but I don't believe linguist supports using the content alone.

In general one can get reasonable guesses from content alone: Many languages have enough shibboleths to at least narrow the field. That's partly how file(1) works, for example. But that isn't how linguist is constructed, so it'd be a fairly substantial change.
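The shibboleth idea can be sketched in a few lines of Go. The patterns and language set below are invented for illustration; they are not taken from linguist, enry, or file(1):

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative "shibboleths": content patterns that strongly suggest a
// language when they appear in a snippet.
var shibboleths = []struct {
	lang string
	re   *regexp.Regexp
}{
	{"Go", regexp.MustCompile(`\w+ := |(?m)^package \w+`)},
	{"Shell", regexp.MustCompile(`(?m)^#!/bin/(ba)?sh`)},
	{"Python", regexp.MustCompile(`(?m)^def \w+\(.*\):`)},
}

// guessLanguage returns the first language whose pattern matches,
// or "Unknown" when none fires.
func guessLanguage(snippet string) string {
	for _, s := range shibboleths {
		if s.re.MatchString(snippet) {
			return s.lang
		}
	}
	return "Unknown"
}

func main() {
	snippet := "err := cmd.Run()\nif err != nil {\n\tlog.Fatal(err)\n}\n"
	fmt.Println(guessLanguage(snippet)) // the := assignment gives Go away
}
```

A real detector would need many more patterns and a way to resolve conflicts between them, which is roughly where a trained classifier comes in.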

Cc @bzz in case I missed something.

@bzz
Contributor

bzz commented Jan 10, 2019

@quasilyte TL;DR: there is one option you can try, but overall the main use case for Enry is file-level language detection, so its high-level API always assumes the file name is present.

I'm curious if you could share a little bit about your use case here, for cases where the file names are not available?


More specifically, as @creachadair noted above, enry follows the design of github linguist, so it consists of a number of Strategies for detecting the language of a given file.

Typically, if there is no single 100% match, all strategies are applied sequentially to narrow down the possible options. They are:

  • GetLanguagesByExtension
  • GetLanguagesByFilename
  • GetLanguagesByShebang
  • GetLanguagesByContent
  • GetLanguagesByClassifier
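The sequential-narrowing design can be sketched like this. The strategy signature and the toy stand-in bodies below are illustrative assumptions, not enry's real internals:

```go
package main

import (
	"fmt"
	"strings"
)

// strategy takes the current candidate set and returns a narrower one.
type strategy func(filename string, content []byte, candidates []string) []string

// detect applies strategies in order and stops as soon as a single
// language remains.
func detect(filename string, content []byte, strategies []strategy) string {
	var candidates []string
	for _, s := range strategies {
		if narrowed := s(filename, content, candidates); len(narrowed) > 0 {
			candidates = narrowed
		}
		if len(candidates) == 1 {
			return candidates[0]
		}
	}
	if len(candidates) > 0 {
		return candidates[0] // still ambiguous: fall back to the first candidate
	}
	return "Unknown"
}

// Toy stand-ins for two of the strategies listed above.
func byExtension(filename string, _ []byte, _ []string) []string {
	if strings.HasSuffix(filename, ".go") {
		return []string{"Go"}
	}
	if strings.HasSuffix(filename, ".h") {
		return []string{"C", "C++", "Objective-C"}
	}
	return nil
}

func byContent(_ string, content []byte, candidates []string) []string {
	if strings.Contains(string(content), "@interface") {
		return []string{"Objective-C"}
	}
	return candidates
}

var chain = []strategy{byExtension, byContent}

func main() {
	// The ambiguous .h extension yields three candidates; the content
	// strategy then narrows them to one.
	fmt.Println(detect("foo.h", []byte("@interface Foo : NSObject"), chain))
}
```

Note how the filename seeds the candidate set in the first step, which is why the high-level API needs it.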

There is one strategy, though, GetLanguagesByClassifier, that uses a Bayesian classifier of the content (trained on these samples) and does not use the filename at all. So it's the closest thing to what you are looking for.

The good news is that it's exposed as a low-level public API, and you can try using it directly through enry.GetLanguageByClassifier by passing in an empty filename.

The bad news is that, as it's a "last resort" strategy, the API is designed to disambiguate between a given set of languages (typically, the guesses from previous strategies). So you have to pass in a slice of potential language name aliases.

If you pass in aliases for all possible languages (the LanguagesByAlias keys from the link above), the accuracy of the predictions most probably will not be that high. But you should try it yourself and see if it works for your case.
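The shape of such a candidate-restricted classification can be illustrated with a self-contained toy. The token weights here are invented; enry's real classifier is a Bayesian model trained on linguist's sample corpus, not a token counter:

```go
package main

import (
	"fmt"
	"strings"
)

// tokenHints maps a language to tokens that hint at it. Purely
// illustrative: a stand-in for a trained model's learned features.
var tokenHints = map[string][]string{
	"Go":     {":=", "func ", "package "},
	"Shell":  {"#!/bin/sh", "echo ", "esac"},
	"Python": {"def ", "import ", "self"},
}

// classify scores the content against the given candidates only,
// mirroring the "disambiguate between a given set" API shape.
func classify(content string, candidates []string) string {
	best, bestScore := "", -1
	for _, lang := range candidates {
		score := 0
		for _, tok := range tokenHints[lang] {
			score += strings.Count(content, tok)
		}
		if score > bestScore {
			best, bestScore = lang, score
		}
	}
	return best
}

func main() {
	snippet := "err := cmd.Run()\nif err != nil {\n\tlog.Fatal(err)\n}\n"
	fmt.Println(classify(snippet, []string{"Go", "Python", "Shell"}))
}
```

The key point carries over to the real API: the larger and noisier the candidate set, the weaker the discrimination, which is why passing all of LanguagesByAlias tends to hurt accuracy.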

On GetLanguagesByContent: although it seems applicable to your case, it is not, as it uses regexp heuristics only to disambiguate between a limited number of ~60 cases where multiple languages share the same file extension, like .h for C, Objective-C, and C++, etc.
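For the .h case, that kind of regexp heuristic looks roughly like this. The patterns below are invented for the sketch and are not linguist's actual rules:

```go
package main

import (
	"fmt"
	"regexp"
)

// hHeuristics tries content markers that pick one of the languages
// sharing the ambiguous .h extension.
var hHeuristics = []struct {
	lang string
	re   *regexp.Regexp
}{
	{"Objective-C", regexp.MustCompile(`@(interface|implementation|end)\b`)},
	{"C++", regexp.MustCompile(`\b(template|namespace|class)\b`)},
}

// disambiguateH returns the first language whose marker fires,
// defaulting to C when nothing does.
func disambiguateH(content []byte) string {
	for _, h := range hHeuristics {
		if h.re.Match(content) {
			return h.lang
		}
	}
	return "C"
}

func main() {
	fmt.Println(disambiguateH([]byte("namespace foo { class Bar {}; }")))
}
```

This is why the strategy needs the extension first: the heuristics only make sense within a known, small set of colliding languages.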

Hope this helps! And I'm looking forward to learning more about your use case.

@quasilyte
Author

I'm curious if you could share a little bit about your use case here, for cases where the file names are not available?

For example, you extracted various snippets from a book, a README, or any other source.
There can be 2-3 different languages there, for example bash, go, and various config files.
In other words, the source of those snippets might give a clue about how many different languages there are, but it doesn't annotate the snippets with a language itself, so you need to do some classification.

Instead of rolling something of my own, I tried enry.
I could generate N fake filenames, like foo.json, foo.bash, foo.go, and try 3 different content-based matching calls, but it seems that, for instance, Go matching does not use any content-based decision making, so if the extension is ".go", the file is treated as a Go source file. I had no better ideas there.

Hope this helps! And I'm looking forward to learning more about your use case.

It does help. Thank you!
I'll try what you described, since I usually have a set of the most likely languages that could be there.

@bzz
Contributor

bzz commented Jun 1, 2019

Closing as there is no further discussion and the question seems to be answered.

@bzz bzz closed this as completed Jun 1, 2019