Add data attestation as a requirement #481

roedoejet · 2024-06-20T22:15:29Z

PR Goal?

This PR makes data attestations mandatory for running any model that points to valid data.

Fixes?

Feedback sought?

This can of course be circumvented, but do you feel like it's enough of a deterrent for now? I will add more documentation in another PR. Maybe try running a current model you have that does not have the permissions added and see what happens? It should fail with an intelligible error message.

Priority?

alpha

Tests added?

Tests updated, none added.

How to test?

The wizard should now produce a config with permissions_obtained: True for each dataset, and that config should be runnable. Other configs will not be runnable, nor will existing checkpoints.

Confidence?

medium

Version change?

It normally would be, but we are still in alpha. It's a breaking change, so any existing models will have to add permissions_obtained: True for each dataset in the config.
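To make the migration concrete, here is a minimal sketch of how one might list the datasets in an already-loaded config that still lack the attestation. The `datasets` and `label` keys and the helper name are hypothetical stand-ins for illustration, not the actual EveryVoice config layout; only the `permissions_obtained` field comes from this PR.

```python
from typing import Any


def datasets_missing_attestation(config: dict[str, Any]) -> list[str]:
    """Return labels of datasets that do not set permissions_obtained to True.

    Assumes a hypothetical layout: config["datasets"] is a list of dicts,
    each optionally carrying a "label" and a "permissions_obtained" flag.
    """
    missing = []
    for index, dataset in enumerate(config.get("datasets", [])):
        # Only an explicit True counts as an attestation.
        if dataset.get("permissions_obtained") is not True:
            missing.append(dataset.get("label", f"dataset #{index}"))
    return missing
```

Any label returned by this helper would need permissions_obtained: True added before the config becomes runnable again.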

Related PRs?

none

fixes #465

codecov · 2024-06-20T22:17:35Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.96%. Comparing base (181ba61) to head (7eb38e0).

❗ Current head 7eb38e0 differs from pull request most recent head 6eb551e

Please upload reports for the commit 6eb551e to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #481      +/-   ##
==========================================
+ Coverage   73.90%   73.96%   +0.05%     
==========================================
  Files          45       45              
  Lines        2859     2865       +6     
  Branches      443      444       +1     
==========================================
+ Hits         2113     2119       +6     
  Misses        661      661              
  Partials       85       85


github-actions · 2024-06-20T22:18:35Z

CLI load time: 0:00.23
Pull Request HEAD: 6eb551eeb1048cbddf3a061361e674f0352610e6
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package

joanise

Bug: I deleted permissions_obtained: true from the config file the wizard wrote for me, and then ran everyvoice preprocess config/everyvoice-text-to-spec.yaml and it ran without error. Instead, it should have failed with the permission message.

everyvoice train text-to-spec config/everyvoice-text-to-spec.yaml also proceeded without telling me about the missing permissions.

The concept is good and sufficient: there is no way for us to verify they have permission, all we can do is force the user to declare to the software that they do, and this PR (once the bugs above are fixed) does that.

joanise · 2024-06-21T13:41:40Z

everyvoice/config/preprocessing_config.py

+    @field_validator("permissions_obtained")
+    def check_permissions(cls, permissions_obtained: bool) -> bool:
+        if not permissions_obtained:
+            raise ValueError(


Coverage says this raise statement is not exercised by unit testing. It should be easy to add one test case where you create a dataset without specifying permissions_obtained=True, maybe just by duplicating the test case you had to fix with permissions_obtained=True but leaving that argument out; that would exercise this line.
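A rough sketch of what such a test case could look like, assuming pydantic v2. `DatasetSketch` is a hypothetical stand-in for the real dataset config class, and the error message is invented; the sketch also assumes the field is declared so that its default gets validated, otherwise the first test would not raise.

```python
import unittest

from pydantic import BaseModel, Field, ValidationError, field_validator


class DatasetSketch(BaseModel):
    """Hypothetical stand-in for the real dataset config class."""

    # Assumption: the default is validated, so omitting the field is rejected.
    permissions_obtained: bool = Field(default=False, validate_default=True)

    @field_validator("permissions_obtained")
    @classmethod
    def check_permissions(cls, permissions_obtained: bool) -> bool:
        if not permissions_obtained:
            # Placeholder message, not the project's actual wording.
            raise ValueError("permissions_obtained must be set to True")
        return permissions_obtained


class PermissionsTest(unittest.TestCase):
    def test_missing_permissions_is_rejected(self):
        # No permissions_obtained argument: exercises the raise statement.
        with self.assertRaises(ValidationError):
            DatasetSketch()

    def test_explicit_false_is_rejected(self):
        with self.assertRaises(ValidationError):
            DatasetSketch(permissions_obtained=False)

    def test_explicit_true_is_accepted(self):
        dataset = DatasetSketch(permissions_obtained=True)
        self.assertTrue(dataset.permissions_obtained)
```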

roedoejet · 2024-06-21T17:08:00Z

Bug: I deleted permissions_obtained: true from the config file the wizard wrote for me, and then ran everyvoice preprocess config/everyvoice-text-to-spec.yaml and it ran without error. Instead, it should have failed with the permission message.

everyvoice train text-to-spec config/everyvoice-text-to-spec.yaml also proceeded without telling me about the missing permissions.

The concept is good and sufficient: there is no way for us to verify they have permission, all we can do is force the user to declare to the software that they do, and this PR (once the bugs above are fixed) does that.

Thanks for the quick review! I forgot to add validate_default=True, so the default value (permissions_obtained=False) was never actually validated: someone would have had to explicitly set it to False for the check to fire. This is now fixed, I believe. Thanks!!
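For anyone following along, a minimal sketch of the behaviour described here, assuming pydantic v2. The two model classes, the shared helper and the error message are hypothetical illustrations, not the actual EveryVoice code.

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


def _require_permission(value: bool) -> bool:
    if not value:
        # Placeholder message, not the project's actual wording.
        raise ValueError("permissions_obtained must be set to True")
    return value


class DatasetUnchecked(BaseModel):
    # The default is not validated, so omitting the field skips the validator.
    permissions_obtained: bool = False

    @field_validator("permissions_obtained")
    @classmethod
    def check_permissions(cls, value: bool) -> bool:
        return _require_permission(value)


class DatasetChecked(BaseModel):
    # validate_default=True makes the validator run on the default as well.
    permissions_obtained: bool = Field(default=False, validate_default=True)

    @field_validator("permissions_obtained")
    @classmethod
    def check_permissions(cls, value: bool) -> bool:
        return _require_permission(value)


def is_rejected(model_cls, **kwargs) -> bool:
    """Return True if constructing model_cls(**kwargs) fails validation."""
    try:
        model_cls(**kwargs)
    except ValidationError:
        return True
    return False
```

With these definitions, `is_rejected(DatasetUnchecked)` is False (the bug reported above), while `is_rejected(DatasetChecked)` is True; both classes reject an explicit `permissions_obtained=False`.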

joanise

Looks good now.

joanise · 2024-06-21T18:56:11Z

Hmm, approved, but something unrelated seems to be failing in CI. It would be best to fix that right away, before merging if you can.

roedoejet added 2 commits June 20, 2024 15:11

feat: require data permissions for all models

476ea7d

fixes #465

chore: update schemas

4555605

roedoejet requested review from joanise and SamuelLarkin and removed request for joanise June 20, 2024 22:17

joanise requested changes Jun 21, 2024

fix: always validate permissions_obtained

919ab4a

also add a unittest

roedoejet requested a review from joanise June 21, 2024 17:32

joanise approved these changes Jun 21, 2024

roedoejet mentioned this pull request Jun 21, 2024

logs_and_checkpoints folder is erroneously created #483

Closed

roedoejet and others added 5 commits June 21, 2024 14:06

fix: point test to temp folder

396dd55

style: apply black

7eb38e0

fix: remove lower and nfc_normalization from default cleaners

6746b4c

fixes #321

style: apply black

a92742f

fix: set global cleaners as the union of dataset cleaners in the wizard

6eb551e

this should be fixed at a later date by fixing #359

roedoejet merged commit c928f9b into main Jun 21, 2024
2 checks passed

roedoejet deleted the dev.ap/data-attestation branch June 21, 2024 21:53
