Workflow for automated translations and removal of duplicate statements using GitHub Actions #18

Closed
dankim444 opened this issue Jul 19, 2024 · 7 comments · Fixed by #19

@dankim444
Contributor

dankim444 commented Jul 19, 2024

When new statements are added, we should have separate jobs within one workflow that can translate all our new statements and remove any duplicate statements that are generated after the translations are complete. These changes should then be committed to the repository.

One job, called 'Translate Statements', will translate new files in the raw_statements folder. The workflow will be configured to run on workflow_dispatch, so it must be run manually from the Actions tab on GitHub. We should also be able to pass the specific files we want to translate as inputs when the workflow is invoked, so we have more manual control over what gets translated in the folder. Steps to run the workflow are found here: https://docs.github.com/en/actions/using-workflows/manually-running-a-workflow.

The second job, called 'Remove Any Duplicates', will remove any statements that cause duplicates across translations from the original statement file, and will remove the same indices from the corresponding translation files so that all statement files keep a consistent structure. This job should then run format_checker.py to confirm there are no errors at the end.
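
To make the index-aligned removal described above concrete, here is a minimal sketch. It assumes each statement file has already been loaded into a list of strings, that the original and every translation have the same length, and that rows are compared after trimming whitespace and lowercasing. The function names and the in-memory layout are hypothetical and are not taken from the repository.

```python
def duplicate_indices(rows: list[str]) -> set[int]:
    """Return indices of rows that repeat an earlier row (the first one is kept)."""
    seen: set[str] = set()
    dupes: set[int] = set()
    for i, text in enumerate(rows):
        key = text.strip().lower()  # assumption: case/whitespace-insensitive match
        if key in seen:
            dupes.add(i)
        else:
            seen.add(key)
    return dupes


def drop_duplicates_everywhere(
    original: list[str], translations: dict[str, list[str]]
) -> tuple[list[str], dict[str, list[str]]]:
    """Drop every index that is a duplicate in ANY file from ALL files."""
    to_drop = duplicate_indices(original)
    for lang, rows in translations.items():
        if len(rows) != len(original):
            raise ValueError(f"{lang} has {len(rows)} rows, expected {len(original)}")
        to_drop |= duplicate_indices(rows)
    keep = [i for i in range(len(original)) if i not in to_drop]
    cleaned_original = [original[i] for i in keep]
    cleaned_translations = {
        lang: [rows[i] for i in keep] for lang, rows in translations.items()
    }
    return cleaned_original, cleaned_translations
```

Because the same `keep` list is applied to every file, row N in the original still corresponds to row N in each translation afterwards, which is the property format_checker.py would presumably verify.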

@JamesPHoughton

Probably this wants to be a workflow dispatch action so that we don't make a ton of excess API calls.

Could make it so that if the number of new statements is below some threshold, do it automatically, otherwise post an error or a message requiring manual approval.
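
The threshold idea above could be sketched roughly as follows. The cutoff of 50 statements and the mode names are placeholders chosen for illustration; the thread does not specify either.

```python
def decide_run_mode(new_statement_count: int, threshold: int = 50) -> str:
    """Pick how the translation job should proceed for a batch of new statements.

    `threshold` is a hypothetical default; the real value would be a project decision.
    """
    if new_statement_count < 0:
        raise ValueError("new_statement_count must be non-negative")
    if new_statement_count == 0:
        return "skip"  # nothing new to translate
    if new_statement_count < threshold:
        return "auto"  # small batch: translate without intervention
    return "manual_approval"  # large batch: stop and ask a maintainer
```

A workflow step could call this and fail with a message when the result is `"manual_approval"`, leaving the manual workflow_dispatch run as the fallback.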

@dankim444
Contributor Author

Here is a screenshot of running the workflow:
image

The first input field allows you to pass in the names of the .csv files you want to translate. The second takes in the elicitation and the third takes in the committer. So, say I add test.csv containing English statements to the raw_statements folder: I can run this workflow with the necessary parameters, and the workflow will commit and push all translated files to the main branch of the repository. I think this is a much easier way to manually control our input files and push the translations in one centralized pipeline.
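
For illustration, the file-name input described above might be handled by a script along these lines. The thread does not say how multiple names are separated, so the comma-separated format is an assumption, and `parse_file_list` is a hypothetical helper rather than code from the repository.

```python
from pathlib import Path


def parse_file_list(raw_input: str, folder: str = "raw_statements") -> list[Path]:
    """Turn a comma-separated string of .csv names into paths under `folder`."""
    names = [name.strip() for name in raw_input.split(",") if name.strip()]
    paths: list[Path] = []
    for name in names:
        if not name.endswith(".csv"):
            raise ValueError(f"Expected a .csv file name, got: {name!r}")
        # Keep only the final path component so an input cannot point outside the folder.
        paths.append(Path(folder) / Path(name).name)
    return paths
```

For example, `parse_file_list("test.csv")` yields a single path pointing at test.csv inside raw_statements.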

@markwhiting
Member

Interesting. That seems like a good setup. Perhaps we should also have a good way to check whether translations are complete and to estimate the cost of completing them?

@dankim444
Contributor Author

dankim444 commented Jul 22, 2024

Should I use the standard text translation pricing tier of $15 per million characters found here: https://aws.amazon.com/translate/pricing/ ?

Here's what that would look like in the workflow:
image
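
As a rough sketch of the arithmetic behind the $15 per million characters figure quoted above: the function below assumes that each target language is billed separately for the full source character count, and it hard-codes the rate rather than fetching it. `estimate_cost` is a hypothetical name.

```python
from decimal import Decimal

# USD per one million characters, as quoted in the comment above.
PRICE_PER_MILLION_CHARS = Decimal("15")


def estimate_cost(char_count: int, target_languages: int = 1) -> Decimal:
    """Estimate the cost in USD of translating `char_count` source characters.

    Assumption: every target language is charged for the full source text.
    """
    if char_count < 0 or target_languages < 0:
        raise ValueError("counts must be non-negative")
    total_chars = char_count * target_languages
    return Decimal(total_chars) * PRICE_PER_MILLION_CHARS / Decimal(1_000_000)
```

Using `Decimal` avoids floating-point drift when many small per-file costs are summed.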

@dankim444
Contributor Author

Here is a screenshot of the Calculate Translation Cost workflow that is triggered after each new push/commit. The price is retrieved via the AWS Price List API to display the most up-to-date pricing information. Note that observable_gpt4o's character count and total cost differ from the previous screenshot because I calculated these after removing duplicate translations.

image

@markwhiting
Member

Great!

Should this be on the file level or the corpus level, e.g., something like:

There are 200 statements that don't currently have translations in at least one language. It will cost approximately $30 to complete these translations.

@dankim444
Contributor Author

It should be on the file level. So say we haven't run any translations on only test2.csv, then the intended output would be:
"test2.csv still needs to be translated into 9 new languages. This would require translating 351 characters.
It will cost approximately $0.01 to complete these translations."

Let me know if this is fine!
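
A file-level message of the kind proposed above could be produced with something like the sketch below. The thread does not say whether the reported character count is per language or a total; this version multiplies the source character count by the number of missing languages, which is an assumption, as are the function names and the default rate of $15 per million characters.

```python
def missing_languages(target: set[str], existing: set[str]) -> list[str]:
    """Languages in `target` that the file has not been translated into yet."""
    return sorted(target - existing)


def file_report(
    filename: str,
    source_chars: int,
    target: set[str],
    existing: set[str],
    usd_per_million: float = 15.0,  # assumed rate; the real one would be looked up
) -> str:
    """Build a per-file summary of outstanding translations and their cost."""
    missing = missing_languages(target, existing)
    if not missing:
        return f"{filename} is fully translated."
    total_chars = source_chars * len(missing)  # assumption: billed once per language
    cost = total_chars * usd_per_million / 1_000_000
    noun = "language" if len(missing) == 1 else "languages"
    return (
        f"{filename} still needs to be translated into {len(missing)} new {noun}. "
        f"This would require translating {total_chars} characters. "
        f"It will cost approximately ${cost:.2f} to complete these translations."
    )
```

Summing `total_chars` over all files would give the corpus-level figure asked about earlier in the thread, so the two views need not be mutually exclusive.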
