Readability metrics #712
Comments
I already have code for this that I wrote a couple of years ago. It is for English only, as counting syllables is not a trivial matter and has to be tailored for each language. I arrived at an algorithm that counts syllables fairly accurately without having to look up each word, which is the only really accurate method. My code calculates the Flesch–Kincaid Readability Score. The code was once in one of the utility files of nW, but I took it out since it wasn't used. I've considered putting it back in again as well. Thanks for the reminder! |
My level of confidence in that sort of "metrics" is such that I would say that as long as it counts syllables in English, it would probably be OK for French as well. These are statistical-empirical tools, and measuring different parts of a homogeneous text gives similar but different results, so a good approximation might be sufficient. Any promoter or advocate to speak up? |
The English function I wrote does not work for Norwegian for instance. I tested it when I wrote it just to see. A decent first approach is to count diphthongs and single vowels. The challenge in English is that the letter The second issue is of course that the grading metric is designed for English specifically, and unless you want to just extract the fairly esoteric score, you need a grading scale for other languages too. They can of course easily be added if they exist. I'm wondering how much work it would be to make a simple machine learning training set for this problem and then train it on a full dictionary using supervised learning. |
Every time I run a readability tool over my writing, for example Hemingway, I'm reminded that:
So a readability score displayed next to the word counter would be a welcome addition to novelWriter. It would be nice if novelWriter had a switch that would color sentences that are hard to read or suffer from passive voice, which is what the Hemingway app does. I'm afraid that might be a huge distraction from novelWriter's main reason for existing. |
Those are pretty advanced language analysis features, and probably also too heavy to run with Python. I am not a fan of tools like Hemingway anyway. While they help with grammar, which is useful, they also have a feel of over-teching writing to the point of generating uniformity and conformity. That's not a good thing in my book. The readability score is a more neutral metric because it gives you a number without any judgement on whether it is good or bad. The readability score only needs to match the target audience's reading level. My main concern about adding it is that it adds a tool that is language-specific. I want nW to be less English-oriented. English is not my own first language either. If we can collaborate to make it a bit more language-independent, then I'm all for adding a field next to the word counter that lists the values and the total. The Flesch–Kincaid score is based on average syllables per word and words per sentence, so keeping track of those two values throughout the project isn't too hard. I can add this to the indexer. Syllables is still the trickier one to generalise. |
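For reference, the two standard Flesch formulas only need three running totals: sentences, words and syllables. A minimal sketch follows, using the published constants; the function names are hypothetical and this is not code from nW:

```python
def flesch_reading_ease(sentences: int, words: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    if sentences <= 0 or words <= 0:
        raise ValueError("need at least one sentence and one word")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(sentences: int, words: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: roughly a US school grade."""
    if sentences <= 0 or words <= 0:
        raise ValueError("need at least one sentence and one word")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

Because both formulas depend only on the two ratios, the totals can be accumulated per document and summed for the whole project before computing the score.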
Just for reference, this is how to programmatically count syllables in English:
It is accurate enough that the mistakes it makes don't affect the final score. One issue is to determine exactly when 'y' makes a vowel sound. We'd need a similar set of rules for each supported language, and I have no idea how well any of this would work for a non-European language. |
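As an illustration of the kind of rule-based counting discussed here, a rough sketch is given below. It is not the nW code; the function name and the specific rules are assumptions chosen for the example. It counts groups of consecutive vowels, treats `y` as a vowel everywhere, and discounts a final `e` that is likely silent:

```python
import re

VOWELS = "aeiouy"  # assumption: 'y' is always treated as a vowel


def estimate_syllables(word: str) -> int:
    """Estimate the syllable count of an English word by counting vowel groups."""
    letters = re.sub(r"[^a-z]", "", word.lower())
    if not letters:
        return 0
    # Each run of consecutive vowels (single vowel or diphthong) counts once.
    count = len(re.findall(r"[aeiouy]+", letters))
    # "-le" after a consonant is pronounced ("table"), so its 'e' is kept.
    consonant_le = (
        letters.endswith("le") and len(letters) > 2 and letters[-3] not in VOWELS
    )
    # A single final 'e' after a consonant is usually silent ("make", "whale").
    silent_e = (
        len(letters) > 2
        and letters.endswith("e")
        and letters[-2] not in VOWELS
        and not consonant_le
    )
    if silent_e and count > 1:
        count -= 1
    return max(count, 1)
```

Treating `y` as a vowel in every position is the weak spot: a word like "beyond" collapses into a single vowel group and is undercounted, which is exactly the sort of case that needs an extra rule.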
There was a tool some years ago that used the OpenOffice hyphenation dictionaries to overcome the multilingual problem. Edit: it still exists and has even evolved, now using LibreOffice dictionaries. Have a look at PyHyphen on PyPI. |
Yeah, this is what I was referring to when I mentioned that the only accurate method is to look up words (although hyphenation dictionaries don't always produce the correct syllable count). I researched this back when I wrote that code. I'm reluctant to depend on external tools hosted on PyPI, as such packages aren't always very dependable. I've had enough problems with pyenchant, which I use for spell checking. Perhaps the hyphenation package can be used to train a machine learning implementation though. It's an interesting topic in general. Edit: Another alternative is to write a small module to parse the LibreOffice hyphenation dictionaries directly, like Pyphen does. |
Yes, I just wanted to show it as a means to bridge that multi-language issue, and my idea was along the lines of the last alternative you added. |
A benefit of that approach is that the hyphenation dictionaries are available for download, so nW could automatically download the ones the user wants. Looking at the Pyphen package, the parsing isn't that complicated. Since Pyphen is written in pure Python, it's also a good option to just include it in a lib folder in nW instead of depending on an external package. |
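To give an idea of what such parsing involves: hyphenation dictionaries of this kind are built on Liang-style patterns, where digits between letters mark how strongly a break is wanted (odd) or forbidden (even), and the highest digit wins at each position. The sketch below is a simplified illustration, not Pyphen's code. The function names and the toy patterns are invented for the example, and real dictionary files also contain header lines and other details that are ignored here.

```python
import re


def parse_pattern(pattern):
    """Split a pattern like 'a1b' into its letters 'ab' and weights [0, 1, 0]."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)  # weight of the gap before letters[pos]
        else:
            pos += 1
    return letters, weights


def break_points(word, patterns, left_min=2, right_min=2):
    """Return indices k where word[:k] | word[k:] is an allowed break."""
    work = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(work) + 1)
    for start in range(len(work)):
        for end in range(start + 1, len(work) + 1):
            weights = patterns.get(work[start:end])
            if weights is None:
                continue
            for offset, weight in enumerate(weights):
                scores[start + offset] = max(scores[start + offset], weight)
    # scores[i] belongs to the gap before work[i], and work[i] is word[i - 1].
    return [
        k
        for k in range(left_min, len(word) - right_min + 1)
        if scores[k + 1] % 2 == 1
    ]


# Toy patterns, made up for illustration only (not from any real dictionary).
TOY_PATTERNS = dict(parse_pattern(p) for p in ["a1b", "ha2b"])
```

With these toy patterns, `break_points("cabin", TOY_PATTERNS)` allows a break after "ca", while the `ha2b` pattern suppresses the same break in "habit". The number of break points plus one would then serve as the syllable estimate, with the caveat already mentioned that hyphenation points and syllables don't always coincide.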
As for the suggestion by @johnblommers on integration with grammar tools like Hemingway, there's a feature issue, #515, where the discussion is about integrating with LanguageTool, which is open source. Since you can run a local language server for free, or optionally connect to a hosted one, this may be a good fit for novelWriter, especially since it supports multiple languages. As for the metrics part, I will have a look at adding the syllable algorithm for English to the indexer. Perhaps if I add a proper class for syllable calculations we could try out multiple approaches. However, the very idea of counting syllables to determine complexity doesn't translate well outside of European languages, and not necessarily to Germanic languages either, where several words are often joined into one long compound that English would write with spaces or hyphens. These words aren't necessarily more complex or harder to read just because they contain many syllables. This practice is very common in Norwegian and of course infamous in German. |
I'm planning to redesign the Build Tool for the next release cycle. I've been thinking of adding an extra feature to the build tool for analysing the structure of the manuscript. Checking the complexity of the text could be one option. It requires a fair bit of processing, so it isn't well suited for the text editor. If I add it, I'd like to give a breakdown so that the user has some idea of the various stats that go into these various metrics. It would probably make it more useful than just stating a score based on English alone. It would also allow the user to select a way to estimate syllables. My own algorithm for English provides a good estimate, but it really is tuned to English. Letting the user select the algorithm is probably a good solution. Adding a module of such algorithms would also make it possible for other users to contribute some for other languages. It would be a nice area to accept contributions actually. |
I know the subject is a matter of discussion, and many people (including myself) attach little importance to the Flesch, Gunning, SMOG or similar methods of readability assessment.
However, there are publishers, especially in children's and teen literature, who are very attached to this kind of tool and expect to get the figures with the manuscript. (I have seen a target readability index specified in several translation contracts.)
The computation can be done manually, counting sentences and words and syllables, but it is very tedious while the computer could do it in seconds, so that might be a good point to envision including something into a future version of nW.