When a Paratext project is uploaded to Serval as a file and a corpus is created from it, it would be nice to be able to run a gauntlet of analysis tests to see how it is expected to perform for different NLP tasks, including (but not limited to) AI drafting with NLLB-200. The basic flow would be this:
Architecture: Serval level
On the Corpora endpoint, add a [GET] analysis endpoint.
When a corpus is created, the analysis is automatically queued.
When a corpus (or a file that the corpus is built from) is updated, the analysis is automatically re-run.
You can request the analysis at any time; if it is not complete, the endpoint will return a status saying "not complete" (see the polling sketch below).
Analysis status will be tracked within the corpus IRepository.
Analysis data will be a JSON file stored on disk.
Analysis files will be deleted when they are superseded or when the corpus is deleted.
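As a rough illustration of the client-facing flow, here is a minimal Python sketch of a consumer polling the proposed analysis endpoint. The base URL, the `/corpora/{id}/analysis` route, and the `status` field are assumptions for this proposal, not existing Serval API surface.

```python
import time

import requests

SERVAL_URL = "https://serval-api.org/api/v1"  # hypothetical base URL
CORPUS_ID = "corpus123"                       # hypothetical corpus id

def get_analysis(session: requests.Session) -> dict:
    """Poll the proposed GET analysis endpoint until the analysis is complete."""
    while True:
        # Route and response fields are assumptions for this proposal.
        resp = session.get(f"{SERVAL_URL}/corpora/{CORPUS_ID}/analysis")
        resp.raise_for_status()
        body = resp.json()
        if body.get("status") == "not complete":
            time.sleep(5)  # the analysis was queued automatically; wait and retry
            continue
        return body  # the JSON analysis report stored on disk server-side

if __name__ == "__main__":
    with requests.Session() as s:
        s.headers["Authorization"] = "Bearer <token>"  # Serval uses bearer auth
        print(get_analysis(s))
```

Returning a "not complete" status rather than an error keeps the client logic simple: the same route can be polled from the moment the corpus is created.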
Architecture: Machine level
A new gRPC endpoint called analysis.
The machine-engine server will queue the job.
The machine-job server will run the job in Python directly in Kubernetes (not ClearML, no S3 bucket interaction) to keep the response time under 30 seconds.
The results will be a JSON file that is either written to disk or returned over gRPC (a sketch of the job's shape follows below).
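On the machine side, the job itself could be little more than a dispatcher that runs each analyzer and serializes one JSON report, which keeps it well under the 30-second budget. The sketch below is an assumption about the job's shape (the `ANALYZERS` registry and the report field names are made up for illustration), not the actual machine-job implementation.

```python
import json
import time
from pathlib import Path
from typing import Callable

# Hypothetical analyzer registry: name -> callable taking the extracted books
# (book name -> USFM text) and returning a JSON-serializable dict.
ANALYZERS: dict[str, Callable[[dict[str, str]], dict]] = {}

def run_analysis_job(books: dict[str, str], out_path: Path | None = None) -> dict:
    """Run every registered analyzer and emit a single JSON report.

    The report is either written to out_path (shared disk) or returned so the
    gRPC handler can send it back, matching the two delivery options above.
    """
    report = {"startedAt": time.time(), "results": {}}
    for name, analyzer in ANALYZERS.items():
        try:
            report["results"][name] = analyzer(books)
        except Exception as exc:  # one failing analyzer should not sink the whole job
            report["results"][name] = {"error": str(exc)}
    report["finishedAt"] = time.time()
    if out_path is not None:
        out_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report
```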
What types of analysis could be done?
Before we implement these, we need to review them for suitability, determine whether research is needed, and decide what format we should use to provide the results to SF.
Check for all instances of unparsable USFM.
Parse each book and see if there are any failures. Don’t let the user select books that fail, but tell them that they are failing and need to be updated.
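A minimal sketch of this check, assuming the job has a USFM parser available: the `parse_book` callable and the `*.SFM` file glob are placeholders, and the real job would use whatever parser the Machine pipeline already relies on.

```python
from pathlib import Path

def check_unparsable_usfm(project_dir: Path, parse_book) -> dict:
    """Try to parse every USFM book file and collect failures.

    parse_book is whatever USFM parser the job uses; it is passed in here
    because the exact API is not fixed yet. Returns a report mapping each
    failing book file to its error message.
    """
    failures = {}
    for sfm_path in sorted(project_dir.glob("*.SFM")):
        try:
            parse_book(sfm_path.read_text(encoding="utf-8-sig"))
        except Exception as exc:
            # The UI should list these books as unselectable and show why.
            failures[sfm_path.name] = str(exc)
    return {"unparsableBooks": failures, "parsable": not failures}
```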
Check for malformed USFM that will likely not be parsed as intended:
We can also create algorithms to look for "suspicious" USFM, including lines that don't begin with a verse number, missing verses, chapter markers without a chapter number, verses in the wrong order, duplicate verses, etc. We can respond "nicely" to these when they occur, but we should warn the user that something is wrong.
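A sketch of what such heuristics could look like, using plain regexes over the raw USFM. It assumes simple integer verse numbers (ranges like `\v 1-2` or segments like `\v 1a` would need extra handling), and further heuristics, such as flagging text lines with no marker at all, could be added in the same loop.

```python
import re
from collections import Counter

CHAPTER_RE = re.compile(r"\\c\b\s*(\d+)?")  # \c with an optional chapter number
VERSE_RE = re.compile(r"\\v\b\s*(\d+)?")    # \v with an optional verse number

def find_suspicious_usfm(usfm: str) -> list[dict]:
    """Flag USFM patterns that will parse but probably don't match intent."""
    issues = []
    current_chapter = None
    verses_seen: Counter = Counter()
    last_verse = 0
    for line_no, line in enumerate(usfm.splitlines(), start=1):
        chap = CHAPTER_RE.match(line)
        if chap:
            if chap.group(1) is None:
                issues.append({"line": line_no, "issue": "chapter marker without a number"})
                continue
            current_chapter = int(chap.group(1))
            last_verse = 0
            continue
        verse = VERSE_RE.match(line)
        if verse:
            if verse.group(1) is None:
                issues.append({"line": line_no, "issue": "verse marker without a number"})
                continue
            num = int(verse.group(1))
            if verses_seen[(current_chapter, num)]:
                issues.append({"line": line_no,
                               "issue": f"duplicate verse {current_chapter}:{num}"})
            elif num <= last_verse:
                issues.append({"line": line_no,
                               "issue": f"verse {num} out of order in chapter {current_chapter}"})
            elif num > last_verse + 1:
                issues.append({"line": line_no,
                               "issue": f"verses {last_verse + 1}-{num - 1} appear to be missing"})
            verses_seen[(current_chapter, num)] += 1
            last_verse = max(last_verse, num)
    return issues
```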
Detecting incorrect versification
We can run the algorithms we already have for this, specifically to detect that the verses present in the books match the expected versification and that there are no extra (or missing) verses at the end of some chapters (where the rest of the chapter is filled out).
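A simplified sketch of the end-of-chapter check, assuming the observed and expected last-verse numbers have already been extracted (e.g. the expected side from the project's versification file); the dict shapes here are illustrative only.

```python
def check_versification(observed: dict, expected: dict) -> dict:
    """Compare observed verse counts per chapter against the expected versification.

    observed and expected both map book -> {chapter: last verse number}.
    Per the note above, only chapters that are otherwise filled out are worth
    flagging; that filter is left to the caller in this sketch.
    """
    problems = []
    for book, chapters in observed.items():
        for chapter, last_verse in chapters.items():
            expected_last = expected.get(book, {}).get(chapter)
            if expected_last is None:
                problems.append({"book": book, "chapter": chapter,
                                 "issue": "chapter not in expected versification"})
            elif last_verse > expected_last:
                problems.append({"book": book, "chapter": chapter,
                                 "issue": f"{last_verse - expected_last} extra verse(s) at end"})
            elif last_verse < expected_last:
                problems.append({"book": book, "chapter": chapter,
                                 "issue": f"{expected_last - last_verse} verse(s) missing at end"})
    return {"versificationIssues": problems}
```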
Detecting non-normalized text (mixed scripts, spelling issues, etc.)?
From this analysis we can learn a lot about scripts and odd characters; how would we present it to the user?
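One possible shape for this analysis, using only the standard library's unicodedata module: approximate the script from the Unicode character name, flag control/format/unassigned characters, and report whether the text is NFC-normalized. A proper Script property lookup would need an extra package, so treat this as a rough sketch.

```python
import unicodedata
from collections import Counter

def character_report(text: str) -> dict:
    """Summarize scripts and unusual characters so mixed or non-normalized
    text can be surfaced to the user.

    Script detection is approximate: it uses the first word of the Unicode
    character name (e.g. LATIN, GREEK, ARABIC), not the true Script property.
    """
    scripts: Counter = Counter()
    suspicious: Counter = Counter()
    for ch in text:
        if ch.isspace() or ch.isdigit():
            continue
        category = unicodedata.category(ch)
        if category.startswith("L"):
            name = unicodedata.name(ch, "UNKNOWN")
            scripts[name.split(" ")[0]] += 1
        elif category in {"Cc", "Cf", "Co", "Cn"}:  # control, format, private-use, unassigned
            suspicious[f"U+{ord(ch):04X}"] += 1
    return {
        "scripts": dict(scripts.most_common()),
        "suspiciousCharacters": dict(suspicious.most_common(20)),
        "isNotNfcNormalized": text != unicodedata.normalize("NFC", text),
    }
```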
Book metrics
Verse counts, completion status, words per verse, characters per verse
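These metrics are straightforward to compute once the corpus is extracted into verse texts; a small sketch follows, where the verse-dict input shape and the optional expected verse count (for the completion ratio) are assumptions.

```python
def book_metrics(verses: dict[str, str], expected_verse_count: int | None = None) -> dict:
    """Compute simple per-book metrics from a verse reference -> verse text map."""
    texts = [t for t in verses.values() if t.strip()]  # only non-empty verses count
    word_counts = [len(t.split()) for t in texts]
    char_counts = [len(t) for t in texts]
    n = len(texts)
    return {
        "verseCount": n,
        "completion": (n / expected_verse_count) if expected_verse_count else None,
        "wordsPerVerse": (sum(word_counts) / n) if n else 0.0,
        "charactersPerVerse": (sum(char_counts) / n) if n else 0.0,
    }
```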
NLLB Tokenization metrics
Number of characters not recognized by the NLLB tokenizer
Number of characters per token
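A sketch of how these could be measured with the Hugging Face NLLB tokenizer. The unknown-token count is used here as a proxy for "characters not recognized", and the checkpoint name is just one of the published NLLB-200 models.

```python
from transformers import AutoTokenizer

def nllb_tokenization_metrics(verses: list[str],
                              model_name: str = "facebook/nllb-200-distilled-600M") -> dict:
    """Measure how well the NLLB tokenizer covers the text.

    Counts <unk> tokens as a proxy for unrecognized characters and reports the
    average number of characters per token.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    total_chars = total_tokens = unk_tokens = 0
    for verse in verses:
        ids = tokenizer(verse, add_special_tokens=False)["input_ids"]
        total_tokens += len(ids)
        total_chars += len(verse)
        unk_tokens += sum(1 for i in ids if i == tokenizer.unk_token_id)
    return {
        "unknownTokens": unk_tokens,
        "charactersPerToken": (total_chars / total_tokens) if total_tokens else 0.0,
    }
```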
Detecting versification misalignment
Testing for "verses being off by one" would need research, but it could be deterministic, such as correlating sentence lengths per chapter and making sure that they match to a specified degree, or by running an actual alignment and looking for significantly misaligned verse chunks. This could be a comparison against a standard translation, either the Greek and Hebrew or the KJV, etc.
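As a starting point for the deterministic variant, here is a sketch that correlates verse lengths per chapter against a reference translation at small shifts; a consistently better correlation at a non-zero shift hints at an off-by-one problem. The function name and thresholds are illustrative, and a real check would likely pick the reference and shift range after research.

```python
from statistics import StatisticsError, correlation  # Python 3.10+

def detect_offset(target_lengths: list[int], reference_lengths: list[int],
                  max_shift: int = 2) -> dict:
    """Correlate per-verse lengths in a chapter against a reference translation,
    trying small shifts; the best-correlating non-zero shift suggests misalignment."""
    best_shift, best_r = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        xs, ys = [], []
        for i, t in enumerate(target_lengths):
            j = i + shift
            if 0 <= j < len(reference_lengths):
                xs.append(t)
                ys.append(reference_lengths[j])
        if len(xs) < 3:
            continue
        try:
            r = correlation(xs, ys)
        except StatisticsError:  # constant or too-short input
            continue
        if r > best_r:
            best_shift, best_r = shift, r
    return {"bestShift": best_shift, "correlation": best_r, "suspicious": best_shift != 0}
```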