When a Paratext project is uploaded to Serval as a file and a corpus is created from it, it would be nice to be able to run a gauntlet of analysis tests to see how it is expected to perform for different NLP tasks, including (but not limited to) AI drafting with NLLB-200. The basic flow would be this:
Architecture: Serval level
On the Corpora endpoint, add a [GET] analysis endpoint.
When a corpus is created, the analysis is automatically queued.
When a corpus (or a file that the corpus is built from) is updated, the analysis is automatically re-run.
You can request the analysis at any time; if it is not complete, the endpoint will return a status saying "not complete" (see the polling sketch below).
Analysis status will be tracked within the corpus IRepository.
Analysis data will be a JSON file stored on disk.
Analysis files will be deleted when they are superseded or when the corpus is deleted.
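As a rough illustration of the client-facing flow, here is a minimal Python sketch of a consumer polling the proposed analysis endpoint. The base URL, the `/corpora/{id}/analysis` route, and the `status` field are assumptions for this proposal, not existing Serval API surface.

```python
import time

import requests

SERVAL_URL = "https://serval-api.org/api/v1"  # hypothetical base URL
CORPUS_ID = "corpus123"                       # hypothetical corpus id

def get_analysis(session: requests.Session) -> dict:
    """Poll the proposed GET analysis endpoint until the analysis is complete."""
    while True:
        # Route and response fields are assumptions for this proposal.
        resp = session.get(f"{SERVAL_URL}/corpora/{CORPUS_ID}/analysis")
        resp.raise_for_status()
        body = resp.json()
        if body.get("status") == "not complete":
            time.sleep(5)  # the analysis was queued automatically; wait and retry
            continue
        return body  # the JSON analysis report stored on disk server-side

if __name__ == "__main__":
    with requests.Session() as s:
        s.headers["Authorization"] = "Bearer <token>"  # Serval uses bearer auth
        print(get_analysis(s))
```

Returning a "not complete" status rather than an error keeps the client logic simple: the same route can be polled from the moment the corpus is created.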
Architecture: Machine level
A new gRPC endpoint called analysis.
The machine-engine server will queue the job.
The machine-job server will run the job in Python directly in Kubernetes (not ClearML, no S3 bucket interaction) to keep the response time under 30 seconds.
The results will be a JSON file that is either written to disk or returned over gRPC (a sketch of the job's shape follows below).
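On the machine side, the job itself could be little more than a dispatcher that runs each analyzer and serializes one JSON report, which keeps it well under the 30-second budget. The sketch below is an assumption about the job's shape (the `ANALYZERS` registry and the report field names are made up for illustration), not the actual machine-job implementation.

```python
import json
import time
from pathlib import Path
from typing import Callable

# Hypothetical analyzer registry: name -> callable taking the extracted books
# (book name -> USFM text) and returning a JSON-serializable dict.
ANALYZERS: dict[str, Callable[[dict[str, str]], dict]] = {}

def run_analysis_job(books: dict[str, str], out_path: Path | None = None) -> dict:
    """Run every registered analyzer and emit a single JSON report.

    The report is either written to out_path (shared disk) or returned so the
    gRPC handler can send it back, matching the two delivery options above.
    """
    report = {"startedAt": time.time(), "results": {}}
    for name, analyzer in ANALYZERS.items():
        try:
            report["results"][name] = analyzer(books)
        except Exception as exc:  # one failing analyzer should not sink the whole job
            report["results"][name] = {"error": str(exc)}
    report["finishedAt"] = time.time()
    if out_path is not None:
        out_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report
```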
What types of analysis could be done?
Before we implement these, we need to review them for suitability, determine whether research is needed, and decide what format we should use to provide the results to SF.
Check for all instances of unparsable USFM.
Parse each book and see if there are any failures. Don’t let the user select books that fail, but tell them that they are failing and need to be updated.
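A minimal sketch of this check, assuming the job has a USFM parser available: the `parse_book` callable and the `*.SFM` file glob are placeholders, and the real job would use whatever parser the Machine pipeline already relies on.

```python
from pathlib import Path

def check_unparsable_usfm(project_dir: Path, parse_book) -> dict:
    """Try to parse every USFM book file and collect failures.

    parse_book is whatever USFM parser the job uses; it is passed in here
    because the exact API is not fixed yet. Returns a report mapping each
    failing book file to its error message.
    """
    failures = {}
    for sfm_path in sorted(project_dir.glob("*.SFM")):
        try:
            parse_book(sfm_path.read_text(encoding="utf-8-sig"))
        except Exception as exc:
            # The UI should list these books as unselectable and show why.
            failures[sfm_path.name] = str(exc)
    return {"unparsableBooks": failures, "parsable": not failures}
```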
Check for malformed USFM that will likely not be parsed as intended:
We can also create algorithms to look for "suspicious" USFM, including lines that don't begin with a verse number, missing verses, chapter markers without a chapter number, verses in the wrong order, duplicate verses, etc. We can respond "nicely" to these when they occur, but we should warn the user that something is wrong.
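A sketch of what such heuristics could look like, using plain regexes over the raw USFM. It assumes simple integer verse numbers (ranges like `\v 1-2` or segments like `\v 1a` would need extra handling), and further heuristics, such as flagging text lines with no marker at all, could be added in the same loop.

```python
import re
from collections import Counter

CHAPTER_RE = re.compile(r"\\c\b\s*(\d+)?")  # \c with an optional chapter number
VERSE_RE = re.compile(r"\\v\b\s*(\d+)?")    # \v with an optional verse number

def find_suspicious_usfm(usfm: str) -> list[dict]:
    """Flag USFM patterns that will parse but probably don't match intent."""
    issues = []
    current_chapter = None
    verses_seen: Counter = Counter()
    last_verse = 0
    for line_no, line in enumerate(usfm.splitlines(), start=1):
        chap = CHAPTER_RE.match(line)
        if chap:
            if chap.group(1) is None:
                issues.append({"line": line_no, "issue": "chapter marker without a number"})
                continue
            current_chapter = int(chap.group(1))
            last_verse = 0
            continue
        verse = VERSE_RE.match(line)
        if verse:
            if verse.group(1) is None:
                issues.append({"line": line_no, "issue": "verse marker without a number"})
                continue
            num = int(verse.group(1))
            if verses_seen[(current_chapter, num)]:
                issues.append({"line": line_no,
                               "issue": f"duplicate verse {current_chapter}:{num}"})
            elif num <= last_verse:
                issues.append({"line": line_no,
                               "issue": f"verse {num} out of order in chapter {current_chapter}"})
            elif num > last_verse + 1:
                issues.append({"line": line_no,
                               "issue": f"verses {last_verse + 1}-{num - 1} appear to be missing"})
            verses_seen[(current_chapter, num)] += 1
            last_verse = max(last_verse, num)
    return issues
```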
Detecting incorrect versification
We can run the algorithms we already have for this, specifically to detect that the verses present in the books match the expected versification and that there are no extra (or missing) verses at the end of some chapters (where the rest of the chapter is filled out).
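A simplified sketch of the end-of-chapter check, assuming the observed and expected last-verse numbers have already been extracted (e.g. the expected side from the project's versification file); the dict shapes here are illustrative only.

```python
def check_versification(observed: dict, expected: dict) -> dict:
    """Compare observed verse counts per chapter against the expected versification.

    observed and expected both map book -> {chapter: last verse number}.
    Per the note above, only chapters that are otherwise filled out are worth
    flagging; that filter is left to the caller in this sketch.
    """
    problems = []
    for book, chapters in observed.items():
        for chapter, last_verse in chapters.items():
            expected_last = expected.get(book, {}).get(chapter)
            if expected_last is None:
                problems.append({"book": book, "chapter": chapter,
                                 "issue": "chapter not in expected versification"})
            elif last_verse > expected_last:
                problems.append({"book": book, "chapter": chapter,
                                 "issue": f"{last_verse - expected_last} extra verse(s) at end"})
            elif last_verse < expected_last:
                problems.append({"book": book, "chapter": chapter,
                                 "issue": f"{expected_last - last_verse} verse(s) missing at end"})
    return {"versificationIssues": problems}
```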
Detecting non-normalized text (mixed scripts, spelling issues, etc.)?
From this analysis we can learn a lot about scripts and odd characters; how would we present it to the user?
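One possible shape for this analysis, using only the standard library's unicodedata module: approximate the script from the Unicode character name, flag control/format/unassigned characters, and report whether the text is NFC-normalized. A proper Script property lookup would need an extra package, so treat this as a rough sketch.

```python
import unicodedata
from collections import Counter

def character_report(text: str) -> dict:
    """Summarize scripts and unusual characters so mixed or non-normalized
    text can be surfaced to the user.

    Script detection is approximate: it uses the first word of the Unicode
    character name (e.g. LATIN, GREEK, ARABIC), not the true Script property.
    """
    scripts: Counter = Counter()
    suspicious: Counter = Counter()
    for ch in text:
        if ch.isspace() or ch.isdigit():
            continue
        category = unicodedata.category(ch)
        if category.startswith("L"):
            name = unicodedata.name(ch, "UNKNOWN")
            scripts[name.split(" ")[0]] += 1
        elif category in {"Cc", "Cf", "Co", "Cn"}:  # control, format, private-use, unassigned
            suspicious[f"U+{ord(ch):04X}"] += 1
    return {
        "scripts": dict(scripts.most_common()),
        "suspiciousCharacters": dict(suspicious.most_common(20)),
        "isNotNfcNormalized": text != unicodedata.normalize("NFC", text),
    }
```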
Book metrics
Verse counts, completion status, words per verse, characters per verse
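These metrics are straightforward to compute once the corpus is extracted into verse texts; a small sketch follows, where the verse-dict input shape and the optional expected verse count (for the completion ratio) are assumptions.

```python
def book_metrics(verses: dict[str, str], expected_verse_count: int | None = None) -> dict:
    """Compute simple per-book metrics from a verse reference -> verse text map."""
    texts = [t for t in verses.values() if t.strip()]  # only non-empty verses count
    word_counts = [len(t.split()) for t in texts]
    char_counts = [len(t) for t in texts]
    n = len(texts)
    return {
        "verseCount": n,
        "completion": (n / expected_verse_count) if expected_verse_count else None,
        "wordsPerVerse": (sum(word_counts) / n) if n else 0.0,
        "charactersPerVerse": (sum(char_counts) / n) if n else 0.0,
    }
```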
NLLB Tokenization metrics
Number of characters not recognized by the NLLB tokenizer
Number of characters per token
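A sketch of how these could be measured with the Hugging Face NLLB tokenizer. The unknown-token count is used here as a proxy for "characters not recognized", and the checkpoint name is just one of the published NLLB-200 models.

```python
from transformers import AutoTokenizer

def nllb_tokenization_metrics(verses: list[str],
                              model_name: str = "facebook/nllb-200-distilled-600M") -> dict:
    """Measure how well the NLLB tokenizer covers the text.

    Counts <unk> tokens as a proxy for unrecognized characters and reports the
    average number of characters per token.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    total_chars = total_tokens = unk_tokens = 0
    for verse in verses:
        ids = tokenizer(verse, add_special_tokens=False)["input_ids"]
        total_tokens += len(ids)
        total_chars += len(verse)
        unk_tokens += sum(1 for i in ids if i == tokenizer.unk_token_id)
    return {
        "unknownTokens": unk_tokens,
        "charactersPerToken": (total_chars / total_tokens) if total_tokens else 0.0,
    }
```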
Detecting versification misalignment
Testing for "verses being off by one" would need research, but it could be deterministic, such as correlating sentence lengths per chapter and making sure that they match to a specified degree, or by running an actual alignment and looking for significantly misaligned verse chunks. This could be a comparison against a standard translation, either the Greek and Hebrew or the KJV, etc.
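As a starting point for the deterministic variant, here is a sketch that correlates verse lengths per chapter against a reference translation at small shifts; a consistently better correlation at a non-zero shift hints at an off-by-one problem. The function name and thresholds are illustrative, and a real check would likely pick the reference and shift range after research.

```python
from statistics import StatisticsError, correlation  # Python 3.10+

def detect_offset(target_lengths: list[int], reference_lengths: list[int],
                  max_shift: int = 2) -> dict:
    """Correlate per-verse lengths in a chapter against a reference translation,
    trying small shifts; the best-correlating non-zero shift suggests misalignment."""
    best_shift, best_r = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        xs, ys = [], []
        for i, t in enumerate(target_lengths):
            j = i + shift
            if 0 <= j < len(reference_lengths):
                xs.append(t)
                ys.append(reference_lengths[j])
        if len(xs) < 3:
            continue
        try:
            r = correlation(xs, ys)
        except StatisticsError:  # constant or too-short input
            continue
        if r > best_r:
            best_shift, best_r = shift, r
    return {"bestShift": best_shift, "correlation": best_r, "suspicious": best_shift != 0}
```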