Breakdown of django/django is different from Linguist #204

vmarkovtsev · 2019-03-06T16:22:11Z

As @Guillemdb found out while doing the ML challenge for source{d}, enry reports wrong results for django/django. Ground truth:

enry:

The text was updated successfully, but these errors were encountered:

bzz · 2019-03-06T16:49:56Z

Thank you for the report, indeed numbers look a bit off! It's documented that some of the strategies do not produce the same results in https://github.com/src-d/enry#divergences-from-linguist

So far I'm not aware of the attempts to verify the accuracy of enry that goes beyond the samples dir of Linguist (as part of the tests, we make sure each strategy agrees with Linguist on every file there) and am looking forward to measuring and improving it under this issue!

Here are the results of reproducing this comparison locally with enry v1.7.7 and linguist 7.2.0 CLI

mkdir /tmp/linguist-django; cd /tmp/linguist-django
git clone --dept 1 git@github.com:django/django.git
cd -

enry -prog /tmp/linguist-django/django
67.85%	Python
31.04%	Modelica
1.02%	JavaScript
0.03%	Makefile
0.03%	Shell
0.03%	Smarty

./bin/github-linguist /tmp/linguist-django/django
95.80%  Python
1.90%   JavaScript
1.66%   HTML
0.63%   CSS
0.01%   Shell
0.00%   Smarty
0.00%   Makefile

Good start on this issue would consist of:

shell/shell/go script for CI to compare order of magnitude of the results of this breakdown (high-level)
dig deeper in CLI code to make sure "apples" are compared to "apples"
check if Modelica get's miss-classified somehow
integration test that over the same big codebase file by file and compare results with linguist "gold" results (low-level)

I also checked v1.6.8 and this does not seem to be a regression, so it most probably has been like that for a while.

Hope this helps, and I will be looking into this more next week.

bzz · 2019-03-15T16:19:37Z

Django uses .mo files for localization translation that are known as Modelica to linguist (even if they are not).

So, from a first glance - it seems like in linguist a content classifier was able to differentiate those but on enry it may be the second case of miss-classification from https://github.com/src-d/enry#divergences-from-linguist and most probably due to similar cause as #194

To 100% verify that one would need to run django.mo though linguist and enry with instrumentation from #188 enabled and create an issue Bayesian classifier cann't distinguish "Modelica" vs "XXX".

Wild guess is that fixing #193 would allow to get rid of all such cases, so I'm going to allocate some time next week and fix.

bzz · 2019-03-25T16:16:17Z

Ok, the above explanation is although very probable, is not 100% plausible - it may also be another bug in enry cli

On a dir with 2 files

find /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES
 /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES/django.mo
 /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES/django.po

One is reported as binary with 30 sloc (wc -l is sure it's 29) and another as Gettext Catalog.

enry /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES/django.mo
django.mo: 30 lines (30 sloc)
  type:      Binary
  mime_type: text/plain
  language:

enry -prog /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES/django.po
django.po: 1301 lines (975 sloc)
  type:      Text
  mime_type: text/plain
  language:  Gettext Catalog

But if the whole dir is analyzed, it becomes

enry -prog /tmp/linguist-django/django/django/conf/locale/be/LC_MESSAGES
50.00%	Gettext Catalog
50.00%	Modelica

Trying to detect language for django.mo with linguist returns nil what is consistent with individual file-level predictions.

bzz · 2019-03-25T16:44:36Z

So, by changing our CLI to:

call GetLanguage on every file, instead of what we do now - would remove wired Modelica
apply result filtering, equivalent to linguist
and not using -prog, so HTML and CSS is not excluded

should allow to get same CLI results as in Linguist.
Will submit a change a bit later.

vmarkovtsev · 2019-03-25T16:50:15Z

@bzz But how does it work right now if it doesn't call GetLanguage on every file?!

bzz · 2019-03-29T14:11:42Z

As one specific application using enry library, enry CLI instead of relying on high-level API of GetLanguages tries to avoid reading every file and use more low-level API in a smart way.

Some API calls do not need to have a content of the file, e.g GetLanguageByExtension and others always require it e.g GetLanguage thus if a first one already returns a plausible candidate - the file does not need to be read.

I would say that CLI should mimic github linguist CLI output/logic, at least in default configuration, and now it does not. Going to submit a change that does this soon.

This includes next main changes: - default: print only Programming and Markup types as Linguist does - `-prog` option replaced with `-all`, to allow for previous behavior - always use GetLanguage as main source of truth that fixes src-d#204 and perf will be restored under src-d#212 Signed-off-by: Alexander Bezzubov <bzz@apache.org>

bzz · 2019-04-04T20:11:07Z

As soon as #214 merged I'm going to make a new release where the CLI outputs will be the same, so it's going to help avoiding any such miss-understanding in the future #214 (comment) (no language detection logic changes)

bzz · 2019-04-09T21:09:38Z

While a release is still blocked by CI, latest master results are

./enry /tmp/linguist-django/django
95.87%	Python
1.85%	JavaScript
1.65%	HTML
0.63%	CSS
0.01%	Shell
0.00%	Smarty
0.00%	Makefile

./bin/github-linguist /tmp/linguist-django/django
95.86%  Python
1.85%   JavaScript
1.65%   HTML
0.63%   CSS
0.01%   Shell
0.00%   Smarty
0.00%   Makefile

bzz added good first issue question labels Mar 6, 2019

bzz added bug enhancement and removed question bug labels Mar 6, 2019

bzz changed the title ~~Wrong results on django/django~~ Breakdown of django/django is different from Linguist Mar 6, 2019

bzz mentioned this issue Mar 15, 2019

Do not return empty lang. #205

Merged

bzz added bug and removed enhancement good first issue labels Mar 15, 2019

bzz added this to the v1.8.0 milestone Mar 15, 2019

bzz mentioned this issue Apr 3, 2019

CLI: sync report logic \w Linguist #214

Merged

2 tasks

bzz closed this as completed in #214 Apr 9, 2019

bzz modified the milestones: v2.1.x, v1.7.3 Apr 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Breakdown of django/django is different from Linguist #204

Breakdown of django/django is different from Linguist #204

vmarkovtsev commented Mar 6, 2019

bzz commented Mar 6, 2019 •

edited

Loading

bzz commented Mar 15, 2019 •

edited

Loading

bzz commented Mar 25, 2019 •

edited

Loading

bzz commented Mar 25, 2019 •

edited

Loading

vmarkovtsev commented Mar 25, 2019

bzz commented Mar 29, 2019 •

edited

Loading

bzz commented Apr 4, 2019 •

edited

Loading

bzz commented Apr 9, 2019

Breakdown of django/django is different from Linguist #204

Breakdown of django/django is different from Linguist #204

Comments

vmarkovtsev commented Mar 6, 2019

bzz commented Mar 6, 2019 • edited Loading

bzz commented Mar 15, 2019 • edited Loading

bzz commented Mar 25, 2019 • edited Loading

bzz commented Mar 25, 2019 • edited Loading

vmarkovtsev commented Mar 25, 2019

bzz commented Mar 29, 2019 • edited Loading

bzz commented Apr 4, 2019 • edited Loading

bzz commented Apr 9, 2019

bzz commented Mar 6, 2019 •

edited

Loading

bzz commented Mar 15, 2019 •

edited

Loading

bzz commented Mar 25, 2019 •

edited

Loading

bzz commented Mar 25, 2019 •

edited

Loading

bzz commented Mar 29, 2019 •

edited

Loading

bzz commented Apr 4, 2019 •

edited

Loading