Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve SourcePawn heuristics #5479

Merged
merged 6 commits into from
Jul 27, 2021
Merged

Improve SourcePawn heuristics #5479

merged 6 commits into from
Jul 27, 2021

Conversation

rtldg
Copy link
Contributor

@rtldg rtldg commented Jul 24, 2021

Some of the SourcePawn include files (.inc) in https://github.com/shavitush/bhoptimer/tree/master/addons/sourcemod/scripting/include are being picked up as C++ for the repository's language percentage.

addons/sourcemod/scripting/include/closestpos.inc (via linguist)
addons/sourcemod/scripting/include/convar_class.inc (on language page)
addons/sourcemod/scripting/include/dhooks.inc (on language page)
addons/sourcemod/scripting/include/shavit.inc (via linguist)

Description

This adds some more patterns to heuristics.yml to match for the following SourcePawn syntax:

  • methodmap ExtendedType < BaseType
  • stock returntype FunctionName()
  • native returntype FunctionName()
  • forward returntype FunctionName()

Checklist:

@rtldg rtldg requested a review from a team as a code owner July 24, 2021 04:30
@rtldg
Copy link
Contributor Author

rtldg commented Jul 24, 2021

Looks like it caused the Pawn tests to fail as it detected it as SourcePawn. Can I have some guidance on what I should do about this?

@lildude
Copy link
Member

lildude commented Jul 24, 2021

You'll need to adjust your heuristic to ensure it only captures SourcePawn-specific syntax. If that's not possible, then your only option is to close this PR and implement an override in your repo.

@rtldg
Copy link
Contributor Author

rtldg commented Jul 24, 2021

I changed the heuristic that was catching Pawn files and the tests pass on my local computer now.
The file shavit.inc is still recognized as C++ though. Would it be fine to add that as a SourcePawn sample? I didn't know that adding a sample affects the classifier at the start but it seems just dropping that file into the samples would've caused all the files to be correctly recognized as SourcePawn and I wouldn't have needed to add heuristics.

@lildude
Copy link
Member

lildude commented Jul 25, 2021

Would it be fine to add that as a SourcePawn sample? I didn't know that adding a sample affects the classifier at the start but it seems just dropping that file into the samples would've caused all the files to be correctly recognized as SourcePawn and I wouldn't have needed to add heuristics.

You could, and adding it would have the effect of helping correctly identify all the files in your repo as it increases the chances the tokens match. I think a question needs to asked first: what specifically about the file (on its own) makes it SourcePawn as opposed to Pawn or C++? Identifying that could help improve the heuristic even more so others get the benefit too.

Also, keep in mind we don't aim for 100% accuracy with Linguist, so an override may need to be used in particularly tricky or vague scenarios.

@rtldg
Copy link
Contributor Author

rtldg commented Jul 25, 2021

I think a question needs to asked first: what specifically about the file (on its own) makes it SourcePawn as opposed to Pawn or C++?

Mainly the SourceMod SharedPlugin and SetNTVOptional stuff. Back in 2014 or so SourcePawn's syntax was expanded though to be more C/C++ -like (while still maintaining compatibility with the old syntax). Things like methodmap, view_as<float>(somerandomint), and enum struct are now good indicators that a file is SourcePawn instead of Pawn if the SharedPlugin or SetNTVOptional stuff isn't in a file.

As for differentiating between SourcePawn and C++, the methodmap, view_as<float>(somerandomint), and the function modifiers forward, stock, and native would be good ways. C++ also has enum struct so that wouldn't work. The forward, stock, and native things already caused problems with making Pawn be recognized as SourcePawn so I'm not sure how that can be used well.

So for why that file is recognized as C++: Making a skeleton .inc file with only the content asdf inside causes the file to be recognized as C++ for me so I think the C++ classifier is just too strong and the heuristics aren't able to override it since modern SourcePawn already looks similar to C++.

@lildude
Copy link
Member

lildude commented Jul 25, 2021

Making a skeleton .inc file with only the content asdf inside causes the file to be recognized as C++ for me so I think the C++ classifier is just too strong and the heuristics aren't able to override it since modern SourcePawn already looks similar to C++.

There are no heuristics so the classifier is the only thing doing the guessing and it is notoriously bad at small samples, especially if there are no language-specific tokens in it. This happens a lit with C-family header files.

asdf is also a really terrible way to try test a language analyser based on content on its own without any other context, you'd never guess the language… its not even English 😁

@lildude
Copy link
Member

lildude commented Jul 25, 2021

Oh yes, C++ came out as the result for your test because it's alphabetically before Pawn and SourcePawn and not because of the content or anything the classifier knows about as none of the samples for these languages have asdf as a token 😁

@lildude
Copy link
Member

lildude commented Jul 26, 2021

There are no heuristics

Whoops, ignore that. I was mixing up my discussions. However, it's still kinda true as the heuristics wouldn't match asdf so wouldn't help in this scenario.

@lildude
Copy link
Member

lildude commented Jul 26, 2021

Mainly the SourceMod SharedPlugin and SetNTVOptional stuff.

🤔 I've taken a closer look at your shavit.inc file and it has both of those as does the heuristic so the current heuristic should be correctly detecting the language of that file. I'm not sure why it's not doing so though.

@lildude
Copy link
Member

lildude commented Jul 26, 2021

Worked it out... those lines happen quite late into the file and the heuristics only analyse the first 50kb of the file:

https://github.com/github/linguist/blob/c492ac0d5bc697033c4261e8bfb597949ea882e2/lib/linguist/heuristics.rb#L6-L24

I'm not keen on increasing this as it's set low to keep things performant so I think it makes sense to add a sample that contains these tokens for the few cases like yours where we fall through to the classifier.

@rtldg
Copy link
Contributor Author

rtldg commented Jul 27, 2021

Worked it out... those lines happen quite late into the file and the heuristics only analyse the first 50kb of the file:

That definitely explains what was happening then.

I'm not keen on increasing this as it's set low to keep things performant so I think it makes sense to add a sample that contains these tokens for the few cases like yours where we fall through to the classifier.

Understandable. A couple of good samples that I think would be good to use:

  • shavit.inc - enum struct, SharedPlugin, SetNTVOptional, char[] buffer, default parameter values
    • Actually, this is still too big for the heuristics to reach the end of. Although I could just move the SharedPlugin and SetNTVOptional stuff to the top and/or strip the non-license comments.
  • dhooks.inc - typeset, methodmap, SetNTVOptional
  • bhopstats.inc - methodmap, methodmap properties, SharedPlugin, SetNTVOptional

And I guess test_heuristics.rb:test_inc_by_heuristics should have SourcePawn moved out of the nil key and swapped to *.inc

Also, I found is that mfile.inc isn't SourcePawn. It's actually Pawn and seems to be for open.mp, a GTA: San Andreas mod based on API used the authors (Minokon & Y-Less) and commit it was added in

I'll wait to see if you have any thoughts before adding more commits for any of it.

@lildude
Copy link
Member

lildude commented Jul 27, 2021

Understandable. A couple of good samples that I think would be good to use:

* [`shavit.inc`](https://github.com/shavitush/bhoptimer/blob/da9144f1b5fbc2a9d7d2153f3a584a75c7218a90/addons/sourcemod/scripting/include/shavit.inc) - `enum struct`, `SharedPlugin`, `SetNTVOptional`, `char[] buffer`, default parameter values

* [`dhooks.inc`](https://github.com/peace-maker/DHooks2/blob/008a2fd0220c7999e3c6337faf0e69c3b04f7bda/sourcemod_files/scripting/include/dhooks.inc) - `typeset`, `methodmap`, `SetNTVOptional`

* [`bhopstats.inc`](https://github.com/shavitush/bhopstats/blob/9ad328ccdf7ac44f61c190d7fcebe7420d3b9ece/scripting/include/bhopstats.inc) - `methodmap`, methodmap properties, `SharedPlugin`, `SetNTVOptional`

I think adding all three will be good for the classifier.

And I guess test_heuristics.rb:test_inc_by_heuristics should have SourcePawn moved out of the nil key and swapped to *.inc

Yup. 👍

Also, I found is that mfile.inc isn't SourcePawn. It's actually Pawn and seems to be for open.mp, a GTA: San Andreas mod based on API used the author (Y-Less) and commit it was added

If you're 💯 on this, then lets move it to Pawn.

@rtldg
Copy link
Contributor Author

rtldg commented Jul 27, 2021

Number of errors (39) is above the acceptable threshold (38)

Looks like the classifier cross-validation error count increased from that I think from these:

  • Pascal/GraphicConfiguration.inc BAD (SourcePawn)
    • dhooks.inc causes this for some reason
  • Pawn/fixed.inc BAD (SourcePawn)
  • Pawn/mfile.inc BAD (SourcePawn)

Is this a problem or a just a situation where the ACCEPTABLE_ERRORS gets bumped? It should only be decreased. 🤔

https://github.com/github/linguist/blob/a3a3b8b330d296589f103d322303826a4c2bcfda/script/cross-validation#L3-L5

@lildude
Copy link
Member

lildude commented Jul 27, 2021

Is this a problem or a just a situation where the ACCEPTABLE_ERRORS gets bumped?

Yes, this is a problem as we're trying to improve the classifier. The increase in errors is an indication that these changes have caused something to now be incorrectly classified.

master has the following known misclassifications around Pawn and SourcePawn:

Pawn/fixed.inc BAD (SourcePawn)
SourcePawn/mfile.inc BAD (C++)

This branch has these:

Pascal/GraphicConfiguration.inc BAD (SourcePawn)  <--- This is new and definitely unwanted as it was correct before
Pawn/fixed.inc BAD (SourcePawn)
Pawn/mfile.inc BAD (SourcePawn)

I suspect this might be happening because of the amount of duplication of syntax in the large files you've added which appears to be swaying the classifier towards SourcePawn.

  • dhooks.inc causes this for some reason

If this is the case, then lets remove it. This should take us back down to the two known incorrect classifications which will improve if and when more diverse samples are added for Pawn in future.

Unrelated, it looks like the boo grammar has gone MIA. I'll see if I can fix that in a separate PR.

@rtldg
Copy link
Contributor Author

rtldg commented Jul 27, 2021

  • dhooks.inc causes this for some reason

If this is the case, then lets remove it. This should take us back down to the two known incorrect classifications which will improve if and when more diverse samples are added for Pawn in future.

Alright, done for that.

Unrelated, it looks like the boo grammar has gone MIA. I'll see if I can fix that in a separate PR.

That got me too when I bootstrapped a fresh repo and had to keep alt-tabbing back to my terminal wondering why it was taking so long.

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good and looks good to me now. Thanks for being so patient. 🙇

@lildude lildude merged commit ca8b002 into github-linguist:master Jul 27, 2021
@lildude lildude changed the title Add a few SourcePawn heuristics Improve SourcePawn heuristics Jul 27, 2021
kalkin added a commit to kalkin/file-expert that referenced this pull request Sep 19, 2021
…5479)

* Extend SourthePawn regex
* Move mfile.inc to Pawn samples
* Add more `.inc` SourcePawn samples
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants