Alignment fails if the transcript file has a header row #65
Comments
The csv module has a Sniffer class that implements has_header(): https://docs.python.org/3/library/csv.html I would like to contribute to this issue, although I don't know where to start :D I have never worked with such a big project. Could you give me some guidance?
First, thanks, that documentation link alone is a great help! I'd be happy to orient you to the code. The program is divided into two main parts: FAVE-align and FAVE-extract. These correspond to two subfolders of the We can trace back from the error to the transcript which should give you a sense of what functions will need modified.
The most immediate problem you will face is that we do not use the
I'd suggest implementing the |
Wow! Thank you for this detailed description. I will try to understand the structure and see how I can implement an improvement. Is there a way to test this beforehand? Or should I just make some CSV files and test with those dummy files? I played around a bit with has_header(). To identify a header, the data types in the header must differ from those in the following rows. Is this also the case in your CSV files? If so, this should not be a problem :)
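For readers following along, here is a minimal sketch of what csv.Sniffer.has_header() does with a tab-separated sample. The column layout and the row contents are assumptions made up for illustration; they are not taken from a real FAVE transcript.

```python
import csv

# Hypothetical tab-separated transcript rows:
# speaker ID, speaker name, start time, end time, text.
HEADER = "speaker_id\tname\tstart\tend\ttext\n"
ROWS = (
    "s1\tAlice\t0.00\t1.52\thello\n"
    "s2\tBob\t1.52\t3.10\tgoodbye\n"
    "s1\tAlice\t3.10\t4.75\tthanks\n"
)

sniffer = csv.Sniffer()

# has_header() compares the first row against the rows after it:
# a numeric column whose first cell is not numeric counts as a vote
# for "this sample has a header".
with_header = sniffer.has_header(HEADER + ROWS)
without_header = sniffer.has_header(ROWS)

print(with_header, without_header)
```

Note that has_header() is a heuristic over a text sample, so it needs the file contents (or a chunk of them) before any parsing starts.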
Maybe we can fix this error with one of the following:
What I don't know is how to start in such a large project :D I can add a line of code here and there, but how could I test it? Could you help me with this? :-)
We use pytest (tutorial) to test our code and you can find them in the Since this issue involves file read-write, you will want to look at test_cmudictionary.py in the dev branch which also deals with file read-write functions. You'll notice that it's a little sloppy and somewhat complex. I will write some tests for this issue which you can use, so don't feel like you need to figure out the testing suite alongside the project itself. Looking at
Yes, this is the case. Headers will be strings, the data will be a mix of floating point numbers and strings.
Possible, but risky. We get a ValueError which is a really common error type in python. If an error occurs for some other reason, we risk more bugs. For example, we're focused on headers right now, but what if a user accidentally puts data in the wrong column? A try-except block will skip over it, and the program will go on like nothing is wrong when, in fact, there's an error that the user needs to fix. So we can wind up skipping more than just headers.
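The risk can be sketched with a small example. Everything here is hypothetical (the rows, the column order, and the function name parse_times_lenient); it only illustrates how a broad try-except treats a header row and a genuinely malformed data row the same way.

```python
# Hypothetical rows: a header, one good row, and one row where the text
# was accidentally typed into the start-time column.
rows = [
    "speaker_id\tname\tstart\tend\ttext",
    "s1\tAlice\t0.00\t1.52\thello",
    "s2\tBob\tgoodbye\t3.10\t1.52",
]


def parse_times_lenient(rows):
    """Skip every row whose time fields are not floats (the risky approach)."""
    parsed, skipped = [], []
    for number, row in enumerate(rows, start=1):
        fields = row.split("\t")
        try:
            parsed.append((float(fields[2]), float(fields[3])))
        except ValueError:
            # The header and the malformed data row are indistinguishable here.
            skipped.append(number)
    return parsed, skipped


parsed, skipped = parse_times_lenient(rows)
print(parsed)
print(skipped)
```

Row 3 is dropped along with the header, and nothing tells the user that their data needs fixing.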
I think this is on the right track, but instead of passing that information along, could TranscriptProcessor.read_transcription_file() just remove the header row? I don't think any other functions expect a header row. It's just there to make the document more human readable. If I remember correctly, I think using |
Adds test for JoFrhwld#65 also adds more code debt for turning the CMU dictionary excerpt into a fixture.
Hi Chris, my idea to fix the header alignment right now is:
I will read up a bit more on pytest to better understand what is going on :-D
Yeah, fixing how we test code which relies on the cmudictionary is on my to-do list. I wrote a similar test over on my branch for this issue which will show you how to incorporate the cmu dictionary into the test. My current solution is to copy and paste some code from the cmudictionary tests, but that's not the best long term solution.
This will work, but remember that file reads and writes can be very slow. Our users work with data sets that can be very large, and this would effectively double the run time. I'm just thinking of this now, but maybe we don't need the csv library at all. If all it does is check whether the data types of the second and later rows match the first, maybe we can just implement our own test? The third element of a data line will always be a float, so in read_transcription_file() we could just test whether the third field of the first line converts to a float:

    def read_transcription_file(self):
        ...
        try:
            float(lines[0].split('\t')[2])
        except ValueError:
            # Log a warning about having detected a header row
            lines = lines[1:]  # skips header row, see https://stackoverflow.com/questions/4796764
        self.lines = lines

Generally, we should not overwrite files without a user's explicit permission. If they don't have a backup, we could wind up causing data loss, and we should avoid creating that situation. |
Okay, this is great :) The overwrite would be necessary to replace the old file with the new one (without the header). But you are right, and I also did not think about the extra time it would take.
Resolves #65 by checking the data type of the first time field. If it's not a float, we assume it's a header row and remove it from the returned list. Otherwise the function returns as previously. Squashed commit of DerMoehre's PR #73 Co-authored-by: JoFrhwld <JoFrhwld@gmail.com> Co-authored-by: Christian Brickhouse <chrisbrickhouse@users.noreply.github.com>
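The behaviour described in this commit message can be sketched as a standalone function. This is not the code from the PR: drop_header_row is a hypothetical name, and the sketch assumes the transcript has already been read into a list of tab-separated strings with the start time in the third field.

```python
def drop_header_row(lines):
    """Return `lines` without a leading header row, if one is detected.

    Hypothetical helper: the first row is treated as a header when its
    third tab-separated field (the start time) cannot be parsed as a float.
    """
    if not lines:
        return lines
    try:
        float(lines[0].split("\t")[2])
    except ValueError:
        # Not a time stamp, so assume a header row and drop it.
        return lines[1:]
    return lines


data = [
    "s1\tAlice\t0.00\t1.52\thello",
    "s2\tBob\t1.52\t3.10\tgoodbye",
]
with_header = ["speaker_id\tname\tstart\tend\ttext"] + data

print(drop_header_row(with_header) == data)
print(drop_header_row(data) == data)
```

Only the first row is inspected, so a malformed time stamp further down the file still raises an error later instead of being skipped silently. A first row with fewer than three fields raises IndexError in this sketch rather than being dropped.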
At line 175 in aligner.align() the program loops through the lines of the transcript TSV. If the file has a header row, the alignment fails at line 192 because the headers (strings) cannot be coerced into time stamps (floats).