LanguageTool integration #515
There is already a Python wrapper on PyPI that seems to be actively developed: https://pypi.org/project/language-tool-python/ This may be sufficient for the integration. Needs testing.
Very interesting possibility.
Cool! It may be possible to learn from what TeXstudio ( https://www.texstudio.org/ ) is doing -- they integrate LanguageTool, and they can launch it on demand. It seems to work well. It is a bit heavy to run locally though -- I'm using it with the full English ngram data, which takes ~10 GB of disk space and 6 GB of RAM (although the latter may just be that it expands if there is space, and there is lots of space here).
You can also use https://pypi.org/project/language-check/ which works fine. The Python wrapper essentially forwards its calls to the Java application running locally. For huge text files (which are typical for novels) I wouldn't recommend using it remotely because it will add huge latency. Even using it locally takes time depending on the project size. I know this because I have implemented integration for this in Manuskript: olivierkes/manuskript#747 You basically want to preprocess most files of your novel and cache the results from LanguageTool, because dynamically requesting files, paragraphs or similar takes too much time. Edit: I have to correct myself... don't use language-check, use language-tool-python. The former is just the abandoned original with some unfixed issues.
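For reference, a rough sketch of what local usage with language-tool-python could look like (not tested against novelWriter; the sample text and printed fields are just illustration, and the wrapper downloads and starts its own local Java server):

```python
# Minimal sketch: check a single paragraph with language-tool-python.
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")  # starts a local LT server

text = "This are a sentence with a error in it."
matches = tool.check(text)

for match in matches:
    print(f"{match.ruleId}: {match.message}")
    print(f"  at offset {match.offset}, length {match.errorLength}")
    print(f"  suggestions: {match.replacements[:3]}")

tool.close()  # shut down the local server when done
```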
For French texts I prefer to use Grammalecte (an evolution of LightProof) rather than LanguageTool (they share the same lexical base, as the two developers work in parallel). But since they don't find the same errors, one can use both. Grammalecte is written in Python and has been integrated in several apps, but it only exists for French. It is integrated in LibreOffice (also OpenOffice, but that was behind on versions last time I used it) and we also have it in Sigil, so I won't cry for it to be integrated into nW. Just to be complete: I noticed there exists pygrammalecte, a Grammalecte wrapper in Python: https://pypi.org/project/pygrammalecte/
How slow would it be if you ran it on a single file instead of the whole novel? Say a scene file of ~1000 words or a chapter of ~5000 words? I wouldn't run it in real time, I think. Maybe as a command button. Now that the internationalisation of novelWriter is near complete, this is one of the features I'm considering looking into next.
@vkbo The problem is that LanguageTool uses a server to process each text. Even if you use it locally, you have to add in the latency from transferring the text to the server and receiving its results. So using it remotely will hugely depend on the servers keeping up with potentially multiple people trying to process their texts at the same time, their individual internet connections, and bandwidth limitations on both ends. Most of this latency is independent of the actual size of the text because it is an additional overhead. Sure, if you have very low bandwidth it is worse to transfer huge text files, but transferring many small pieces instead, adding overhead for each request, makes it worse. If possible you should separate out the changed passages in the text (changed sentences) and transfer them all together. Then cache or store the responses locally so you don't have to request as much every time. Even if you add a command button, users will tend to spam the button if the process takes too long (I think I had about 1~3 seconds in some tests locally - so not even remotely). That's the reason why I would recommend processing as little as possible on change (best in a second thread, so it doesn't affect input latency while writing). If you add an explicit button which has to be pressed first, users will notice the processing time even more. ^^'
Also, because you have asked about specific numbers: I think this count is quite unproblematic locally, but it can still be noticed remotely (so it could take about 1 second in some situations, I assume). However, if you don't want to restrict users to a specific size in their chapters, I would recommend picking the changes out of the text (which is usually much less) rather than sending whole files or chapters.
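To illustrate the caching idea, here is a rough sketch (plain Python, all names made up for illustration) of keying cached results by a hash of each paragraph, so only changed paragraphs ever get sent to LanguageTool again:

```python
# Sketch: cache check results per paragraph, keyed by a hash of the text.
# check_paragraph is a placeholder for whatever backend does the actual check.
import hashlib

class ParagraphCheckCache:

    def __init__(self, check_paragraph):
        self._check = check_paragraph   # e.g. a wrapper around tool.check()
        self._cache = {}                # text hash -> list of issues

    def check(self, paragraphs):
        """Return issues per paragraph, re-checking only changed text."""
        results = []
        for text in paragraphs:
            key = hashlib.sha1(text.encode("utf-8")).hexdigest()
            if key not in self._cache:
                self._cache[key] = self._check(text)
            results.append(self._cache[key])
        return results
```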
novelWriter has an upper limit on file size of 5 MB, which is when the Python/Qt interface starts to run into real issues due to the syntax highlighter running on the GUI thread. There's also a user-defined soft cap, defaulting to 800 kB, that disables some automated features like full document spell checking to reduce load on the syntax highlighter. If this tool is integrated with the highlighter, but runs on a set of rules processed and cached by an off-GUI thread, and also obeys the soft cap, this may work. I'll have a look at your Manuskript implementation, but I'm not very familiar with that project. I've tried to keep the editor as lightweight as possible to avoid latency when the user is writing. Python is, after all, very slow at real-time stuff. I am considering placing text analysis in a separate dialog box entirely, with its own highlighter, and having it update the editor's text when completed. A bit like the traditional spell checking dialog in office apps. It aligns more with the distraction-free philosophy to keep them separate and not clutter the text with all sorts of highlights when you're focused on getting "words to paper". I don't know. I need to think a bit more about this. Thanks a lot for taking the time to provide feedback and insights.
That's the way I prefer.
I've been looking into doing some work on this potential feature over the last day or so, and it seems entirely feasible, especially with the use of language-tool-python and the work done in olivierkes/manuskript#747. There are some important questions to answer in regard to the approach, however.
In any case, I'd be interested in doing work on this feature if a semi-clear plan could be made for its implementation.
Thanks for offering to help on this @Ryex. With other and more pressing features to implement, this one has been on hold for quite some time, as you can see from the time stamps. I am not very familiar with these toolsets in the first place, so there is a bit of a threshold for me to get started on this. I would be very happy for some assistance on this! I am uncertain how it would be best to implement this in practice. The current spell checker is integrated into the syntax highlighter, and is a major bottleneck on large documents when they are first opened. Since it is not currently possible to run the syntax highlighter off the GUI thread, this becomes even trickier. The highlighter, when only used for actual highlighting though, is very fast, and the updates are done on a line by line basis. That is, every time a line is changed, that line is re-highlighted and re-spellchecked. As I've indicated, there is a performance issue associated with the initial spell check of a large document when it is opened. For any text analysis integration, finding a suitable way to run the tool in the background, and caching between editing sessions, is probably essential for a smooth experience. We could implement a cache for the regular spell checker first to see if it can be done. Thankfully, it is fairly easy to trigger an update on a paragraph change using the QTextEdit or QTextDocument signals for this. There is already a thread pool, currently only used for the word counter, that could also be used for queueing up both spell check tasks and analysis tasks. The highlighter would then just read from a buffer of pre-computed highlight regions. As for your points specifically:
GUI Design Ideas

Say we want to add these features "on top" of the current editor, so that the text can be dynamically edited in-place. I think perhaps a coloured gutter bar in the text margin could show which paragraph is being analysed. An expandable panel below the editor window, similar to the references panel in the viewer, can hold the needed information for the tool. I already want to add a "Problems" list that can report various errors in your text as an alternative, or addition, to the underlines, much like the Problems tab in VSCode. The LanguageTool analysis interface can occupy another tab in this panel, have a few real-time settings (if needed) and a prev/next button to iterate through paragraphs, and provide its feedback and proposed changes if it produces any (I haven't played with these tools in a while).

Implementation
So, in conclusion, there are a bunch of other features that can be tied together to make a more complete toolset for processing text. All of these would benefit from a redesign of the syntax highlighter and the addition of a panel below the editor. It would also create a framework where new features could be added. I do want this to be modular enough that the user can be provided with a selection of options that support languages other than English.
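As a rough illustration of the thread pool idea above (class and signal names are made up, and this is only a sketch of the pattern, not novelWriter code), a paragraph check could run as a QRunnable that posts its result into a buffer the highlighter reads from:

```python
# Sketch: run one paragraph check off the GUI thread via QThreadPool and
# deliver the result back over a signal into a pre-computed results buffer.
import sys
from PyQt5.QtCore import (
    QCoreApplication, QObject, QRunnable, QThreadPool, QTimer, pyqtSignal, pyqtSlot
)

class CheckSignals(QObject):
    resultReady = pyqtSignal(int, list)  # block number, list of issues

class ParagraphCheckTask(QRunnable):
    """Runs one paragraph through a checker callable in the thread pool."""

    def __init__(self, blockNo, text, checker):
        super().__init__()
        self.signals = CheckSignals()
        self._blockNo = blockNo
        self._text = text
        self._checker = checker  # e.g. a cached LanguageTool wrapper

    @pyqtSlot()
    def run(self):
        issues = self._checker(self._text)
        self.signals.resultReady.emit(self._blockNo, issues)

if __name__ == "__main__":
    app = QCoreApplication(sys.argv)
    resultBuffer = {}  # blockNo -> issues; a highlighter would read from this

    def onResultReady(blockNo, issues):
        resultBuffer[blockNo] = issues
        print("Block", blockNo, "->", issues)
        app.quit()

    task = ParagraphCheckTask(0, "Some paragraph text.", lambda t: ["dummy issue"])
    task.signals.resultReady.connect(onResultReady)
    QThreadPool.globalInstance().start(task)
    QTimer.singleShot(2000, app.quit)  # safety timeout for the demo
    sys.exit(app.exec_())
```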
I have to admit, when I was looking into it I was a bit put off by the tight integration of the spell check and the highlighter. My first instinct was to separate the two before moving on. I'm glad to see that concern is shared. I think aiming for the modular approach first is the smart move. While slapping on a LibreOffice-esque dialog would be relatively easy, it would by no means be user friendly or clean.

Proposal
If this is the path to take, this discussion should probably be split off into another issue. Feel free to ping me, as I would like to work on this. It would make my writing process much smoother.
When I first started this project back in 2018, I hadn't written much in Python (I was primarily working in Fortran at the time), and nothing in Qt, so there are a lot of old implementations that are not at all optimal. Some are also related to supporting pre-3.6 Python versions. I am slowly rewriting core parts of the code to be more Pythonic and reflective of newer Python releases. Since 3.6 is also now dropped, it may be a good time to start adding typing info. It makes it easier to collaborate on the code. The spell checker integration was fine when it was only spell checking, and it was assumed users would only have single documents of maximum a few thousand words. After all, a lookup is made on every single word during highlighting, so it is fairly obvious that this doesn't scale well. It is possible to store metadata in the editor on a per-block basis. ID-ing a block by its block position is not really viable since this is merely an index in a list, and it will change all the time when the user inserts text, causing the "Problems" dictionary to have to be updated. I'm wondering if, instead of storing a single cache dictionary in memory, there should instead be a metadata object stored with the text block itself in the QTextDocument. The data can be piped to a separate file on save, or dumped at the bottom of the file as serialised JSON. Then we don't have to consider how to associate a text block with its custom metadata. During save, the block ID is frozen, and can be used for the metadata. It will increase the save time slightly, but I doubt it would be noticeable to the user. Overlapping regions are a non-issue from the highlighter's point of view (although it may be visually messy). The highlighter uses character format merging to set the format. Proceeding:
Since this is a rather large rewrite of core features, I will set up a milestone for it and pull in the relevant issues. I can also set up a project (kanban) for tasks if you prefer to work this way, since we may be splitting tasks. I already have releases 1.7 and 1.8 planned, so this could be suitable for a 1.9 release. The changes here should not interfere much with 1.8, which focuses only on the "Build Project Tool", and most of 1.7 is already in main. 1.7 is mostly a rewrite of the project data structure to lift a lot of restrictions and solve a few blockers for 1.8. The timeline here is thus on the order of around half a year, give or take. I do a minor release every few months, depending on how much spare time I have to spend on this. I could create a branch for this now, so it is possible to start without interfering with the 1.8 release.
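As a sketch of the per-block metadata idea mentioned above (names are hypothetical), check results could be attached directly to the text block via QTextBlockUserData, so they follow the block around as the user edits, instead of being looked up by block index:

```python
# Sketch: attach analysis results to a text block via QTextBlockUserData.
from PyQt5.QtGui import QTextBlockUserData

class BlockCheckData(QTextBlockUserData):
    """Holds analysis results for one text block."""

    def __init__(self, issues=None):
        super().__init__()
        self.issues = issues or []  # e.g. (offset, length, message) tuples

# Inside a QSyntaxHighlighter.highlightBlock() this could be read back as:
#
#   data = self.currentBlockUserData()
#   if isinstance(data, BlockCheckData):
#       for offset, length, message in data.issues:
#           self.setFormat(offset, length, self._errFormat)
#
# while a background task would call block.setUserData(BlockCheckData(issues))
# once its results are ready.
```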
I'd like to add another perspective: that of iA Writer. iA Writer is actually quite a nice program, and does Markdown (with some creature-comfort additions). It's basically a text editor geared towards writers of documents. It does its job quite well... but, alas, it doesn't support Linux :( iA Writer doesn't do actual grammar/style checking; instead it just highlights the nouns/verbs/adverbs/adjectives/etc. (whichever you select in a nice drop-down menu), and leaves the grammar and style checking to the user. This allows the application to perform much better, and there's no need to rely on a web service or include a multi-gigabyte program in the download. LanguageTool integration would be overkill, in my humble opinion.
I am also thinking this is overkill. I don't think I want to move novelWriter down this route at all, considering the current "AI" hype (i.e. rebranded LLMs). I do plan to add a text analysis framework, which I hope to design in a plug-in-like manner. It can include anything from counters to statistics to text analysis implementations, and the user can select and run them on single documents or the entire project. I think that is a better approach, and users with different language backgrounds can contribute language-specific tools.
Add a tool that can integrate with a locally run or an online provided instance of the LanguageTool embedded HTTP Server, see: https://dev.languagetool.org/http-server.
Thanks to @kyrsjo for reminding me of this. I recognise it, and have considered it before.
It should only be loosely integrated with novelWriter, and could be part of a larger toolbox of text analysis options that I'm anyway considering writing. Perhaps it would be useful to wrap the LanguageTool server in an independent, simple GUI launcher that can be run externally. The novelWriter side of it should only access it through the HTTP API, which would also allow the user to connect to hosted instances of the API.
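For reference, a minimal sketch of what talking to a LanguageTool HTTP server could look like from the novelWriter side (the local URL and port are just an example; the endpoint and parameters follow the public /v2/check API, and the same function would work against a hosted instance):

```python
# Sketch: query a LanguageTool HTTP server (local or hosted) via /v2/check.
import json
import urllib.parse
import urllib.request

def lt_check(text, language="en-US", server="http://localhost:8081"):
    """Send text to a LanguageTool server and return its list of matches."""
    data = urllib.parse.urlencode({"text": text, "language": language}).encode("utf-8")
    request = urllib.request.Request(f"{server}/v2/check", data=data)
    with urllib.request.urlopen(request, timeout=10) as response:
        result = json.loads(response.read().decode("utf-8"))
    return result.get("matches", [])

if __name__ == "__main__":
    for match in lt_check("This are a test."):
        print(match["message"], match.get("replacements", []))
```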