Ignore Scopes in word count [including proof of concept] #99
Comments
This code is a bit nicer. Now keeping line breaks.

```js
// Ignoring comments, quotes, and fenced code; not ignoring punctuation
// (scopes deactivated with a leading '!' are filtered out below).
let scopes = ['comment.*', 'quote.*', '!punctuation.*', 'fenced.code.*'];
let editor = atom.workspace.getActiveTextEditor();
let ScopeSelector = require('first-mate').ScopeSelector;
let selector = new ScopeSelector(scopes.filter(s => !s.startsWith('!')).join(', '));
// Walk the tokenized lines, drop tokens whose scopes match the selector,
// and rejoin with '\n' so line breaks survive.
let text = editor.displayBuffer.tokenizedBuffer.tokenizedLines.map(
    line => line.tokens.filter(
        token => !selector.matches(token.scopes)
    ).map(
        token => token.value
    ).join('')
).join('\n');
console.log(text.match(/\S+/g).length);
```

As the scope selector functions like CSS selectors (combined with commas), there are two ways to ignore a scope (or deactivate it temporarily, see above): either filter it out of the list beforehand (here done via an exclamation mark), or prefix it with a random word such as `not`, so that `not punctuation.*` can never be matched and is thus ignored. The first option is cleaner, the second one easier.
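A quick way to see the selector behaviour in isolation (an untested sketch; it assumes `first-mate` resolves from the console just as in the snippet above, and the commented return values are what one would expect, not verified output):

```js
let { ScopeSelector } = require('first-mate');

// Comments and quotes are ignored when counting.
let ignore = new ScopeSelector('comment.*, quote.*');
console.log(ignore.matches(['text.md', 'comment.block.md'])); // expect: truthy (token ignored)
console.log(ignore.matches(['text.md']));                     // expect: falsy  (token counted)

// The 'not' trick: 'not punctuation.*' requires a scope literally called
// 'not', which no grammar defines, so that branch can never match.
let deactivated = new ScopeSelector('comment.*, not punctuation.*');
console.log(deactivated.matches(['punctuation.definition.heading.md'])); // expect: falsy
```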
@mbroedl Also, for testing my particular case (usage of this plugin with Fountain), I wrote a bit of code which excludes things that should not be considered, at the point where `stripText` is evaluated:

```coffee
stripText: (text, editor) ->
  ...
  # Keep only lines that contain letters and do not start with '=', '|', or '#'.
  arr = text.split('\n')
  out = []
  for i of arr
    if arr[i].match(/[a-z]/gmu) && !arr[i].match(/^[=|#]/gmu)
      out.push arr[i]
  out = out.join('\n')
  # Strip /* block */ and // line comments from what remains.
  text = out.replace(/(\/\*[^*]*\*\/)|(\/\/[^*]*)/g, '')
  text
```

So here is my conclusion about that: having to maintain a separate fork of the plugin just to add a minimal code snippet is too bad. Filters right in the settings can be good, but having the possibility to extend the filtering with a custom code snippet in the init.coffee file would be better. It would allow complex functions, and limiting them to certain grammars.
I think this concept of hook is called Service Hub. This is how I extended some Pigments functions. Here is the doc: https://github.com/atom/service-hub. Or maybe it is just Services: https://flight-manual.atom.io/behind-atom/sections/interacting-with-other-packages-via-services/ (I don't know if there is a difference).
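To make the idea concrete, a hypothetical sketch of such a hook (the service name `wordcount-filters`, the registry object, and its `addFilter` method are all made up for illustration; the package provides no such service today):

```js
// package.json of the consuming package (or a user's own package) would declare:
//   "consumedServices": {
//     "wordcount-filters": { "versions": { "^1.0.0": "consumeFilterRegistry" } }
//   }
module.exports = {
  consumeFilterRegistry(registry) {
    // Register a custom text filter limited to the Fountain grammar.
    return registry.addFilter({
      grammarScopes: ['source.fountain'],
      // Drop section/synopsis lines before counting.
      filter: (text) => text.replace(/^[=#].*$/gm, ''),
    });
  },
};
```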
@X-Raym Although I prefer out-of-the-box packages, I see the point of your suggestions.
@mbroedl The point would be to let the user write their own text filtering functions in their init.coffee file. I know some packages allow such features, like Pigments for adding custom named colors. Though, I don't know how to implement that. But regarding your proposition: it is able to get the syntax of the line, so it is not just text filtering, right? In that case, we could imagine a filter at this level, and a filter at the text variable level (filtering there doesn't give access to the original syntax).
I see, yeah. I haven't implemented something like that yet, but it shouldn't be too bad? Here's the logic in pigments, I suppose: Yes, my proposition gets the tokenised syntax (i.e. what your grammar defines, such as fountain, markdown, asciidoc, or whatever) and then filters everything required. So everything you see when syntax highlighting (and whatever is defined but not highlighted). So I suppose if your fountain grammar includes all the regexes already and tokenises the text properly, there might not even be a need for a service? Although there might be cases where you would want to ignore stop words in the word count, or prepositions, but not nouns... OK, that's getting quite advanced.
@mbroedl I have to admit having tokenized syntax would make filtering far easier, as I wouldn't have to reimplement regexes at the filter level. So yes, this can be a very good solution too! I would only have to put a list of desired syntax scopes in a text field in the settings, right? That would be very quick to set up :) Maybe one text field for including syntax, and one to exclude some (depending on how many scopes you need, one or the other would be quicker to set up).
It's getting interesting :D Having a tokenized syntax filter would satisfy my immediate needs; a text filter is really for advanced usage.
@mbroedl Oh no, I imagined simple text input fields in the word count settings with a CSV list of syntax scopes to keep or exclude. But your solution is by far the most user friendly and flexible, I like it!
Yeah, I suppose your suggestion would be a first implementation. I wouldn't actually keep two fields, because of the excessive number of grammars; there'd be too much that's in neither list. So I would suggest only having a field to exclude grammars? What do you think? Do you have any use cases where it would be reasonable to have an inclusion list instead? (Apart from the activation-grammar list that already exists?)
@mbroedl In fountain there are only a few fields which are valuable for word count: dialogue text, description, and transition. The rest has to be excluded. But fortunately there aren't too many syntax elements, so it would be possible to do this with an exclusion list. Longer, but possible. Still, I think an inclusion list is the safest choice. It gives fewer surprises: "just put there what you want", with no surprise from a syntax which should have been excluded but you didn't notice. The counter simply displays what you want.
Mhmmm... do all the things you want to exclude have some kind of grammar structure in common, so that it'd be easy to exclude them? For Markdown (and other grammars I am aware of) there is no such thing as 'this is actual text without any fuss'; the whole document carries the same label, even if it's links, markup, etc. Is the whole fountain document tokenised in depth? (You can check by opening the command palette and running 'Editor: Log Cursor Scope' at any given position, and seeing whether the output differs between dialogue text, description, etc.) When you do that in a markdown document, you just have a main category 'text.md' that accounts for every single character in the document (see the example below). I'll give you an example.
This translates to the following hierarchical syntax tree with the language-markdown package:
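Something like the following (a reconstruction for illustration only; the exact scope names language-markdown emits may differ):

```
Source line:  A *simple* example.

Tokens:
  "A "        scopes: ["text.md"]
  "*"         scopes: ["text.md", "markup.italic.md", "punctuation.md"]
  "simple"    scopes: ["text.md", "markup.italic.md"]
  "*"         scopes: ["text.md", "markup.italic.md", "punctuation.md"]
  " example"  scopes: ["text.md"]
  "."         scopes: ["text.md", "punctuation.md"]
```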
So to count all words in this case I would need to exclude all punctuation below the document level (text.md). Apart from your example (and the hypothetical one I made), I don't think there are many cases for requiring only a certain grammar to be counted. Looks like we're starting to understand each other!
I just realised that if you go via inclusion you could only ever count words in your fountain files (assuming other grammars do not have dialogue.* markers, for example)?
@mbroedl I see the inclusion problem. Not sure how to tackle this. Here is the scope log on my Fountain file, with dialogue name, dialogue text, and description (things I want in my count):
It seems indeed to have generic text too.
We could possibly just pass on whatever is in the text field to the scope selector? So I guess in your case you could have something like the examples below (I haven't trialled it). Forwarding the text field into the scope selector would certainly be most powerful, but it might be harder to debug for the user (I cannot find good scope selector documentation), and it would also remove any possibility to manage that conveniently in a context menu (if users apply complex patterns, that is). EDIT: When using a comma with a positive (inclusive) match, then in my case excluding comments AND quotes would need to be `comment.*, quote.*`.
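For illustration, the field contents might look like this (the fountain scope names here are assumptions based on the discussion above, not taken from an actual grammar):

```
# Exclusion style: ignore these scopes when counting
comment.*, quote.*, punctuation.*

# Inclusion style (the fountain case): count only these scopes
dialogue.*, description.*, transition.*
```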
@mbroedl Don't you think the syntax for the field could simply be a CSV with atom scope names like that?
Yes, but: (a) then you need two fields, one for count-only and one for exclude-from-count; and I believe (b) is a highly problematic thing which would require a huge warning, which shouldn't be the aim of a seemingly easy word count package. This could be resolved by adding extra scopes, etc., and (a) and (b) could be resolved by using the ScopeSelector. So why not use it? Most people will in fact be able to use it like you suggested, because the scope selector would function exactly this way (although I'd like to invert it, as said), as they want to exclude things (punctuation, language, grammar, etc.) from their count, and not only count certain things. What could be a compromise is to forward the field contents directly to the scope selector, and then have a checkbox to invert it:
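A rough sketch of how such a settings pair could look in an Atom package (the key names and defaults are hypothetical, not taken from the actual package):

```js
// Hypothetical config schema: one scope-selector field plus an invert checkbox.
module.exports = {
  config: {
    scopeSelector: {
      type: 'string',
      default: 'comment.*, punctuation.*',
      description: 'Scope selector matched against each token; matching tokens are ignored in the count.',
    },
    invertScopeSelector: {
      type: 'boolean',
      default: false,
      description: 'Count only the matching tokens instead of ignoring them.',
    },
  },
};
```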
A few hours later... (I really shouldn't have time for this, haha.) @X-Raym: Could you try whether mbroedl/atom-wordcount works for you?
@mbroedl Hmm, not sure how to install this. What am I missing to cleanly install a package from a forked repo?
In the folder of that package, go to the command line and run `apm install`, then `apm link` (details below). (Note: none of these ways link the package to the development mode in atom, so make sure you keep a backup!)
Hmmm.
Is there another way to install it? I simply tried to copy-paste the modified files, but there is still this first-mate error.
Try `apm rebuild`. I'm not an expert in Windows, nor with apm, though. The oniguruma package is part of the wordcount package, so not related to my changes. :/
@mbroedl I tried. How did you make the change on your side?
I just reinstalled it to check, and in my case I just go into the directory of the package and type `apm install`, and then it worked. :/
Sorry I cannot help more... :/
You could try `npm install` instead, but I don't know if that would make a difference, as apm just forwards all commands...
@mbroedl When you say the directory of the package, it can be anywhere the zip has been extracted, right? Even simply E:\Desktop\atom-wordcount-master?
Ok, so you would execute that in the folder of the package:
`E:\Desktop\atom-wordcount-master> apm install` to install all the modules needed for the package*.
Then you would link it: `apm link` if you want it always activated, or `apm link --dev` if you want it activated in dev mode, which is usually recommended.
If you linked it to the dev mode, you need to start atom as such (e.g. `atom --dev`).

* see also `apm help install` etc.:

```
...
Install the given Atom package to ~/.atom/packages/<package_name>.
If no package name is given then all the dependencies in the package.json
file are installed to the node_modules folder in the current working
directory.
```
@mbroedl Still no luck. I think there is something I am missing here. I will not bother you more with this, so you can focus on actual feature development (which should be way more stimulating), and I'll try to see if I can find some doc somewhere. I hope other people will join the discussion and have more luck with the installation!
@mbroedl I can't test, but I can still provide feedback on integration ideas! 😄
@mbroedl Have you tested your mod with other files than fountain? How well does it perform?
Also, as far as I can see, this will not solve every problem. I think having a function to modify the text variable output could still be nice, in case there are things to exclude which are not marked by a syntax (you spoke about links in Markdown?).
@mbroedl - are you still working on this? Looks like a solid approach and would add tremendous value. Let us know if you need help. I'd be happy to install and test as well.
@davidlday I haven't been working on it since my last post here, but the approach above is still up and running and I'd say fully functional. I just rebased it on the master branch. PS: I don't know if the package will work with the tree-sitter parsers or only with the traditional regex one, as I'm only using the latter. The approach should be the same, but I believe the interface to the tokenised buffer is different. EDIT: Note that I've implemented a settings-only approach, and the fancy graphics I made above never became reality...
@mbroedl I'll see if I can do some testing later today.
@mbroedl I'm on macOS Mojave and I'm getting the following when doing
I'll try on one of my Linux boxes in a bit and let you know if I have any luck. EDIT: Installs without problem on my Linux laptop (Ubuntu). Will do a little testing/validation as I have time over the next day or so. EDIT2: And I managed to get it installed on my Mac as well. Problem with xcode-select, nothing to do with your code.
@davidlday Thanks for sorting these issues out!
@davidlday Thanks for reviving this. Are you coordinating the merge of this? Please let me know if you need any more access/publish rights.
@OleMchls More than happy to help! Yes, I can coordinate merging, but might tag you if I'm uncertain on something. Will also let you know if I hit any access issues.

@mbroedl Accessing tokenizedBuffer (lines 94 & 99 in wordcount-view.coffee) is an undocumented part of the API. I was going over the API docs this morning and think there might be another way of getting tokens. Grammar provides two methods for tokenizing. Do you think something like this would work instead?

```coffee
editorGrammar = editor.getGrammar()
editorGrammar.tokenizeLines(editor.getText())
```

There's no equivalent ... Take a look and let me know what you think.
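Putting the pieces together, the word count from the proof of concept could presumably be rebuilt on the documented calls like this (an untested sketch; it assumes a traditional regex grammar, since tree-sitter grammars are discussed separately below):

```js
const { ScopeSelector } = require('first-mate');

const editor = atom.workspace.getActiveTextEditor();
const ignored = new ScopeSelector('comment.*, quote.*');

// Tokenize via the documented Grammar API instead of the private
// tokenizedBuffer, then drop ignored tokens line by line.
const text = editor.getGrammar()
  .tokenizeLines(editor.getText())
  .map(tokens => tokens
    .filter(token => !ignored.matches(token.scopes))
    .map(token => token.value)
    .join(''))
  .join('\n');

const words = text.match(/\S+/g);
console.log(words ? words.length : 0);
```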
@mbroedl I did a little poking around in the console, and it looks like this definitely won't work for tree-sitter grammars.

```js
let tokenLines = [];
for (let i = 0; i < editor.getScreenLineCount(); i++) {
  tokenLines[i] = editor.tokensForScreenRow(i);
}
```

However, ...
Thanks for doing some digging, @davidlday! For later stages, I assume that if ... The main reason why I implemented it using the undocumented API (which I had found in other packages) is that I was worried about the overhead of tokenizing the editor once for rendering, and then tokenizing it again for counting words. I thought this would cause too much unnecessary overhead?
I think ... Regarding the undocumented API vs. parsing twice, I don't really feel strongly one way or the other, but using undocumented APIs feels a little more risky since there's no contract around their behavior. I'd defer to @OleMchls on a final call, but really, either implementation is a step forward. I tagged it in an issue thread on linter-languagetool, which is looking to do something similar. You might peek at that thread to see if there's anything of use there that I might have missed. There might even be an opportunity for a base atom package that natural-language extensions could use to cut out the noise of computer symbols. Thanks for picking this back up, btw!
I see your concerns about the undocumented grammar API. In my own repository I feel alright about it, but the atom directory is a different thing altogether. There were quite a few modifications necessary to adjust the parser to the ... Thanks for linking the thread! I don't really have time to write a base atom package (nor the knowledge of how to do that), but having the tokens exposed in an API (maybe including selectors) would be a great functionality which I'm sure many people would make use of! Might look into that once I'm a bit less busy. I've also changed the default settings to work well with generic markup documents (i.e. ignoring comments and punctuation).
That's a tough call. I would be OK with using the API; there is a way to define private functions. So, for the time being, we can use that. Ideally, we could reach out to the atom team (maybe in the linked issue) and gather their opinion. If the API gets removed someday, there will be someone opening an issue here. I want to take the opportunity to second @davidlday on this:
👍
Resulting from a discussion with @X-Raym in the recently accepted PR #94, I had a bit of a think about why it should not be possible to ignore scopes rather than re-inventing regexes that are already implemented as grammars.
Taking tokenised lines is theoretically possible. Together with a scope selector it is possible to filter (positively or negatively) all text elements that match certain scopes.
This would further allow ignoring punctuation (much requested, although now dealt with in a new word count regex), different grammars, etc. (see e.g. #55 and #65).
See the minimal example above, which filters out all scopes that are comments, quotes, or punctuation in their respective grammar. It should be copyable to the console of atom 1:1.
At the moment it reduces all lines into a single buffer to make looping easier. This destroys line breaks, and may thus not be desirable, but I thought that as a proof of concept this would be sufficient for now. (It does change the final word count a little bit, though.) It should be no difficulty to instead loop through the tokens and keep those that do not match the ignored selectors.
Maybe instead of the many current settings, one could have one text box where all to-be-ignored scopes are listed, and then a little pop-up menu on right-clicking the word count (similar to the one for the minimap) would allow activating/deactivating certain scopes in the filtering.
This would reduce bulk in the settings (see discussions elsewhere in this package).
Disclaimer:
(A) The tokenised lines are not documented and thus subject to change without further notice, although the current selector seems to have been stable for at least two years.
(B) I do not know how much the speed of calculation would suffer from this way of filtering (compared to regex, and compared to not filtering).
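As for concern (B), a quick, unscientific way to gauge the cost would be to wrap the proof-of-concept snippet in a console timer (a sketch only; the snippet itself is the one from above):

```js
console.time('scope-filtered count');
// ... run the proof-of-concept snippet from above here ...
console.timeEnd('scope-filtered count'); // prints the elapsed milliseconds
```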
Looking forward to any discussions!