Improve SourcePawn heuristics #5479
Looks like it caused the Pawn tests to fail as it detected it as SourcePawn. Can I have some guidance on what I should do about this? |
You'll need to adjust your heuristic to ensure it only captures SourcePawn-specific syntax. If that's not possible, then your only option is to close this PR and implement an override in your repo. |
I changed the heuristic that was catching Pawn files and the tests pass on my local computer now. |
You could, and adding it would help correctly identify all the files in your repo, as it increases the chances the tokens match. I think a question needs to be asked first: what specifically about the file (on its own) makes it SourcePawn as opposed to Pawn or C++? Identifying that could help improve the heuristic even more so others get the benefit too. Also, keep in mind we don't aim for 100% accuracy with Linguist, so an override may need to be used in particularly tricky or vague scenarios. |
Mainly the SourceMod As for differentiating between SourcePawn and C++, the So for why that file is recognized as C++: Making a skeleton |
There are no heuristics, so the classifier is the only thing doing the guessing, and it is notoriously bad with small samples, especially if there are no language-specific tokens in them. This happens a lot with C-family header files.
|
Oh yes, C++ came out as the result for your test because it's alphabetically before Pawn and SourcePawn and not because of the content or anything the classifier knows about as none of the samples for these languages have |
Whoops, ignore that. I was mixing up my discussions. However, it's still kinda true as the heuristics wouldn't match |
🤔 I've taken a closer look at your |
Worked it out... those lines happen quite late into the file and the heuristics only analyse the first 50kb of the file. I'm not keen on increasing this, as it's set low to keep things performant, so I think it makes sense to add a sample that contains these tokens for the few cases like yours where we fall through to the classifier. |
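A rough illustration of why late-occurring tokens are missed, assuming the heuristics only ever see a fixed-size prefix of the file. The 50 KB figure comes from the comment above; the constant name, the helper function, and the pattern are hypothetical and are not Linguist's actual code.

```python
import re

# Assumed limit, taken from the "first 50kb" figure in the comment above.
HEURISTICS_LIMIT = 50 * 1024

# Hypothetical pattern for illustration only; not the rule in heuristics.yml.
METHODMAP = re.compile(r"^\s*methodmap\s+\w+", re.MULTILINE)


def heuristic_matches(data: str, limit: int = HEURISTICS_LIMIT) -> bool:
    """Apply the pattern to the first `limit` characters only.

    The sample text is ASCII, so characters and bytes coincide here.
    """
    return METHODMAP.search(data[:limit]) is not None


# 18 characters per line * 4000 lines = 72000 characters, past the cutoff.
padding = "// filler comment\n" * 4000
early_token_file = "methodmap Player < Handle\n" + padding
late_token_file = padding + "methodmap Player < Handle\n"

print(heuristic_matches(early_token_file))  # True
print(heuristic_matches(late_token_file))   # False
```

The second file contains exactly the same token, but because it sits beyond the prefix that is inspected, no heuristic fires and the decision would fall through to the classifier.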
That definitely explains what was happening then.
Understandable. A couple of good samples that I think would be good to use:
And I guess Also, I found is that I'll wait to see if you have any thoughts before adding more commits for any of it. |
I think adding all three will be good for the classifier.
Yup. 👍
If you're 💯 on this, then let's move it to Pawn. |
Looks like the classifier cross-validation error count increased from that I think from these:
Is this a problem or just a situation where the |
Yes, this is a problem as we're trying to improve the classifier. The increase in errors is an indication that these changes have caused something to now be incorrectly classified.
This branch has these:
I suspect this might be happening because of the amount of duplication of syntax in the large files you've added which appears to be swaying the classifier towards SourcePawn.
If this is the case, then let's remove it. This should take us back down to the two known incorrect classifications, which will improve if and when more diverse samples are added for Pawn in future. Unrelated, it looks like the boo grammar has gone MIA. I'll see if I can fix that in a separate PR. |
Alright, done for that.
That got me too when I bootstrapped a fresh repo and had to keep alt-tabbing back to my terminal wondering why it was taking so long. |
All good and looks good to me now. Thanks for being so patient. 🙇
…5479)
* Extend SourcePawn regex
* Move mfile.inc to Pawn samples
* Add more `.inc` SourcePawn samples
Some of the SourcePawn include files (.inc) in https://github.com/shavitush/bhoptimer/tree/master/addons/sourcemod/scripting/include are being picked up as C++ for the repository's language percentage.
Description
This adds some more patterns to `heuristics.yml` to match the following SourcePawn syntax:
* `methodmap ExtendedType < BaseType`
* `stock returntype FunctionName()`
* `native returntype FunctionName()`
* `forward returntype FunctionName()`
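As a minimal sketch, the four constructs listed above could be expressed as regular expressions along these lines. The patterns, the function name, and the sample lines are illustrative assumptions; they are not the patterns this PR adds to `heuristics.yml`.

```python
import re

# Illustrative patterns only; the real rules live in heuristics.yml.
SOURCEPAWN_HINTS = [
    # methodmap ExtendedType < BaseType
    re.compile(r"^\s*methodmap\s+\w+\s*<\s*\w+", re.MULTILINE),
    # stock/native/forward followed by a return type and a function name
    re.compile(r"^\s*(?:stock|native|forward)\s+\w+\s+\w+\s*\(", re.MULTILINE),
]


def looks_like_sourcepawn(source: str) -> bool:
    """Return True if any illustrative SourcePawn pattern matches."""
    return any(p.search(source) for p in SOURCEPAWN_HINTS)


# Made-up sample lines for demonstration.
print(looks_like_sourcepawn("methodmap Timer < Handle"))              # True
print(looks_like_sourcepawn("native int GetClientTime(int client);")) # True
print(looks_like_sourcepawn("native GetPlayerName(playerid);"))       # False
print(looks_like_sourcepawn("int main() { return 0; }"))              # False
```

Requiring an explicit return type between the keyword and the function name is the design choice that keeps the third sample, a declaration without one, from matching. Whether that is enough to separate real-world Pawn from SourcePawn is not something this sketch establishes.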
Checklist: