Improve handling of inputs #348
2ce7fbc to 15e3226
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #348      +/-   ##
==========================================
+ Coverage   71.49%   73.15%   +1.66%
==========================================
  Files          39       43       +4
  Lines        2543     2786     +243
  Branches      390      459      +69
==========================================
+ Hits         1818     2038     +220
- Misses        653      664      +11
- Partials       72       84      +12
271778a to 7325149
def5f34 to 469b515
469b515 to 3bff09b
I ran a few tests successfully but got lots of warnings. (I did not really look at the results; I just ran the full pipeline all the way to synth using CRK data.) However, I noticed a bug when you use a filelist that includes a header line during the wizard. For example, I gave it a file that included a "header line" like below:
Versus using the same file but ignoring the header line: no crash on preprocess.
I turned this into an issue: #363
03df507 to 4966efd
Thanks Aidan for your great effort! It seems that we need to include
Yeah, I had the same reaction when I started testing, but it's because I had not reinstalled. It's already declared, but you need to rerun
Some disorganized feedback...
Some new mypy warnings that should be fixable:
Thanks for the feedback @joanise !
Yes, I also worry about this. I think we either:
Related, but orthogonal to that, perhaps we should remove comma
Now we may have a symbol defined twice in
Not sure if this is a desired operation. Perhaps we can modify the
Again, thanks Aidan for the great effort in making the input so much better!
Thanks Tim! I just added 4a0dc56 and 3fdc561, which I think fix this. It removes
Wow this is a big one, may be too big. It was hard to review.
Humm starting yet another review.
I forgot a suggestion.
1483d56 to 8178cfc
the only thing that is required is that the filelist has the appropriate column headers
Co-authored-by: Samuel Larkin <samuel.larkin@nrc-cnrc.gc.ca>
fixes #370 according to @marctessier
refactor to process cleaners on filelist-lists before parsing into a list of dicts
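As a rough illustration of the commit messages above (the required column headers, and running cleaners on the filelist while it is still a list of lists), here is a minimal sketch. It is not the code from this PR: the function names, the pipe delimiter, the basename column, and the example cleaners are all assumptions made for the example. Only the header names characters, phones, and arpabet come from the PR description.

```python
import csv
import io

# Header names taken from the PR description; everything else is assumed.
TEXT_COLUMNS = ("characters", "phones", "arpabet")


def collapse_whitespace(text):
    """Example cleaner (hypothetical): trim and squeeze runs of whitespace."""
    return " ".join(text.split())


def clean_rows(rows, cleaners):
    """Apply cleaners to the text columns of a list-of-lists filelist.

    rows[0] is the header row; cleaning happens before any dicts are built.
    """
    header, body = rows[0], rows[1:]
    targets = [i for i, name in enumerate(header) if name in TEXT_COLUMNS]
    if not targets:
        raise ValueError(f"filelist needs one of these headers: {TEXT_COLUMNS}")
    cleaned = [header]
    for row in body:
        row = list(row)  # copy so the caller's rows are not mutated
        for i in targets:
            for cleaner in cleaners:
                row[i] = cleaner(row[i])
        cleaned.append(row)
    return cleaned


def rows_to_dicts(rows):
    """Turn a cleaned list of lists (header first) into a list of dicts."""
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def load_filelist(handle, cleaners, delimiter="|"):
    """Read a delimited filelist, clean it as lists, then parse into dicts."""
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        raise ValueError("empty filelist")
    return rows_to_dicts(clean_rows(rows, cleaners))


if __name__ == "__main__":
    sample = io.StringIO("basename|characters\nLJ001|  Hello   World \n")
    print(load_filelist(sample, [str.lower, collapse_whitespace]))
```

Cleaning the rows before building dicts keeps the cleaners unaware of column names beyond the header lookup, which may be part of the motivation for the refactor, though that is a guess.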
8e69328 to f656bf2
Thanks so much for the super helpful review! I think this is good to go now, modulo the bug that you reported when training. I'm happy to merge it now and sort that bug out later though, if all looks good otherwise?
Let's merge this now, and do further improvements as new PRs.
It's in good enough shape to merge, and we know this is where we want to work from.
Fixes #216
Fixes: #361
This PR implements the following:
characters, phones, or arpabet. The text column header is no longer used by the preprocessor. The Wizard converts the text column in a filelist to either characters, phones, or arpabet.
phones.
characters and indicates that the language of their dataset is one that is supported by a supported g2p engine, then phones will be automatically calculated, and phone tokens will be guessed using IPA tokenization library ipatok.
h/e/l/l/o h/o/w a/r/e y/o/u). The DataLoader then encodes them on-the-fly. Multi-hot phonological features are still saved to disk.
characters, phones, or phonological features.
everyvoice synthesize from-text model.ckpt --text 'hello' --text-representation 'characters')
Minor changes:
features.py to use numpy everywhere instead of just adding lists of ints together.
Related PRs:
Remaining issues:
Remove lowercase_ascii default from symbol set. #361
Sort headers in processed filelist #364
text an alias of characters to support backwards compatibility #371
To illustrate the major changes related to preprocessing, previously for LJ text we would provide a filelist with an ambiguous text column and receive the following:
Whereas now, still providing the same information, we get the following:
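Separately from the filelist comparison, the slash-delimited token format mentioned in the summary (h/e/l/l/o h/o/w a/r/e y/o/u) and the idea of encoding tokens on-the-fly could be sketched roughly as below. This is an illustration under assumptions, not EveryVoice's DataLoader: tokenize, build_vocab, and encode are hypothetical names, and keeping the space as its own token and reserving id 0 for padding are choices made only for this example.

```python
def tokenize(processed, token_sep="/", word_sep=" "):
    """Split 'h/e/l/l/o h/o/w' into tokens, with a space token between words."""
    tokens = []
    for i, word in enumerate(processed.split(word_sep)):
        if i > 0:
            tokens.append(word_sep)  # assumption: word boundaries are tokens too
        tokens.extend(t for t in word.split(token_sep) if t)
    return tokens


def build_vocab(symbols):
    """Map each symbol to an integer id; 0 is left free for padding (assumed)."""
    return {symbol: i for i, symbol in enumerate(symbols, start=1)}


def encode(processed, vocab):
    """Encode a slash-delimited string at load time rather than ahead of time."""
    return [vocab[token] for token in tokenize(processed)]


if __name__ == "__main__":
    text = "h/e/l/l/o h/o/w a/r/e y/o/u"
    vocab = build_vocab(sorted(set(tokenize(text))))
    print(tokenize(text))
    print(encode(text, vocab))
```

Storing the delimited string and encoding it when an item is loaded means a change to the symbol set does not require rewriting the preprocessed text, which seems to be the benefit the summary is pointing at.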