Alignment fails if the transcript file has a header row #65

chrisbrickhouse · 2022-08-29T23:26:54Z

At line 175 in aligner.align() the program loops through the lines of the transcript TSV. If the file has a header row, the alignment fails at line 192 because the headers (strings) cannot be coerced into time stamps (floats).

This failure mode should be avoided if possible. We should detect whether a header row is present, and skip it if so. This check-and-skip functionality should probably be added in TranscriptProcessor, and might already be built into the csv reader.
Logging and error reporting could also be improved here. Currently the error message says that something couldn't be converted to a float, but not where the problem is or what the value was. FAVE should report the line number and its contents, which would make transcript problems easier to identify.
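
A rough sketch of what more helpful error reporting might look like. The helper name parse_timestamp and the five-column sample lines are made up for illustration and are not taken from the FAVE code; the only thing assumed from this issue is that the third field of a line should be a time stamp.

```python
def parse_timestamp(field, line_number, line):
    """Convert one time field to float, naming the offending line on failure."""
    try:
        return float(field)
    except ValueError as err:
        # Re-raise with the line number and contents so the user can find the row.
        raise ValueError(
            f"line {line_number}: cannot read {field!r} as a time stamp "
            f"in {line.rstrip()!r}"
        ) from err


# Hypothetical transcript lines; the third field is assumed to be a time stamp.
lines = [
    "id\tname\tstart\tend\ttext\n",
    "S1\tAnna\t0.5\t2.1\thello there\n",
]

message = None
try:
    for number, line in enumerate(lines, start=1):
        fields = line.split("\t")
        parse_timestamp(fields[2], number, line)
except ValueError as err:
    message = str(err)
    print(message)
```

The message still surfaces as a ValueError, so existing callers would behave the same; only the text becomes more specific.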

JoFrhwld · 2022-09-15T21:29:36Z

issue

DerMoehre · 2022-09-30T07:51:05Z

In the csv module, there is a Sniffer class that implements a has_header function.
It returns True if the first row appears to be a header row.

https://docs.python.org/3/library/csv.html
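
A small sketch of how has_header could be tried out on tab-separated text. The column names and rows below are invented sample data, not a real FAVE transcript, and the result comes from a heuristic, so it may differ on other inputs.

```python
import csv

# Invented sample data: five tab-separated columns, time stamps in columns 3 and 4.
with_header = (
    "speaker_id\tspeaker_name\tstart\tend\ttext\n"
    "S1\tAnna\t0.5\t2.1\thello there\n"
    "S1\tAnna\t2.1\t4.0\thow are you today\n"
    "S1\tAnna\t4.0\t5.5\tfine\n"
)
# The same data with the first line removed.
without_header = with_header.split("\n", 1)[1]

sniffer = csv.Sniffer()
header_found = sniffer.has_header(with_header)
header_missing = sniffer.has_header(without_header)
print(header_found, header_missing)
```

has_header compares the first row against the rows that follow, so it needs a sample with several data rows to work from.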

I would like to contribute to this issue, although I don't know where to start :D I have never worked with such a big project. Could you give me some guidance?

chrisbrickhouse · 2022-09-30T20:45:15Z

First, thanks, that documentation link alone is a great help! I'd be happy to orient you to the code.

The program is divided into two main parts: FAVE-align and FAVE-extract. These correspond to two subfolders of the fave directory: align and extract. This issue corresponds to align, which has a number of Python files inside. Each file contains one class which has the same name as the file (or, at least, should contain only one class; I think there are a couple that don't follow that convention). FAVE-align is mainly implemented by the Aligner class specified in aligner.py. The aligner takes a transcript file, technically a TSV (tab-separated values) file, which is processed by the TranscriptProcessor class in fave/align/transcriptprocessor.py. Together, these two files are the source of the problem in this issue.

We can trace back from the error to the transcript, which should give you a sense of what functions will need to be modified.

A TranscriptProcessor class instance is stored as Aligner.transcript
Aligner.read_transcription_file() calls TranscriptProcessor.read_transcription_file() which reads the file into memory and stores a list of strings in TranscriptProcessor.lines which can be accessed through Aligner.transcript.lines (see 1)
TranscriptProcessor.check_transcription_file() populates TranscriptProcessor.trans_lines attribute as a list of strings.
Aligner.align() at line 175 defines a loop over these lines (and some other stuff). The first item of this loop is sometimes a header row.
When the item is a header row, Aligner.align() fails at line 192 because it tries to cast a string (the header) to a floating point number which is impossible. The program crashes.

The most immediate problem you will face is that we do not use the csv library anywhere in this process (I thought we did). You can either:

Modify TranscriptProcessor.read_transcription_file() to use csv instead of our current f.readlines() approach. Best long-term plan, but might require some changes to functions outside the traceback I gave.
Keep our non-csv approach and modify Aligner.align() so that it does its own header check. Easier short-term fix, but will be harder to maintain in the future.

I'd suggest implementing the csv library, but either would work as a bug fix.

DerMoehre · 2022-10-01T11:18:25Z

Wow! Thank you for this detailed description. I will try to understand the structure and see how I can implement an improvement.

Is there a chance to test this before? Or should I just make some csv files and test it with the dummy files?

I played a bit around with has_header(). To identify the header, the datatypes should be different in the header and the following rows. Is this also in your csv? If so, this should not be a problem :)

DerMoehre · 2022-10-01T16:53:52Z

Maybe we can fix this error with one of the following:

We add a try:except block at the position, where the division is made. If the error occurs, the line will be skipped
We start with adding the csv. The places I identified would be in the transcriptprocessor at line 246. Maybe it would be enough to check there for header, and give this information somehow to align

The thing I don't know is, how to start in such a large project :D I can add some line of code here and there, but how could I test this? Could you help me with this? :-)

chrisbrickhouse · 2022-10-01T21:36:54Z

> Is there a chance to test this before?

We use pytest to test our code, and you can find the tests in the tests directory. Inside the project directory, if you run python -m pytest tests/ then you should be able to test all of the code on your own computer; it's the same command we run in our workflow, so if the tests pass on your machine they will probably pass on GitHub.

Since this issue involves file read-write, you will want to look at test_cmudictionary.py in the dev branch which also deals with file read-write functions. You'll notice that it's a little sloppy and somewhat complex. I will write some tests for this issue which you can use, so don't feel like you need to figure out the testing suite alongside the project itself.

Looking at test_cmudictionary.py you'll see that the function definitions take an argument: tmp_path. This lets pytest know that we will be using a temporary folder to read and write files in this test. Using that tmp_path object, we create a temporary file and then write the testing text. Then we call the function with the temporary file and see if everything works as intended.
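
As a rough illustration of the tmp_path pattern, here is a sketch of such a test. The function read_lines_without_header is a hypothetical stand-in for the code under test, not the real TranscriptProcessor method, and the sample transcript line is invented.

```python
from pathlib import Path


def read_lines_without_header(path):
    """Hypothetical stand-in for the function under test."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    if lines:
        try:
            float(lines[0].split("\t")[2])
        except ValueError:
            # The third field is not a number, so treat the row as a header.
            lines = lines[1:]
    return lines


def test_header_row_is_dropped(tmp_path):
    # pytest passes in tmp_path, a pathlib.Path to a fresh temporary directory.
    transcript = tmp_path / "transcript.txt"
    transcript.write_text(
        "id\tname\tstart\tend\ttext\n"
        "S1\tAnna\t0.5\t2.1\thello there\n",
        encoding="utf-8",
    )
    assert read_lines_without_header(transcript) == [
        "S1\tAnna\t0.5\t2.1\thello there"
    ]
```

pytest collects any function whose name starts with test_ and supplies the tmp_path argument itself, so the test needs no setup or cleanup code of its own.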

> To identify the header, the datatypes should be different in the header and the following rows

Yes, this is the case. Headers will be strings, the data will be a mix of floating point numbers and strings.

> We add a try:except block at the position, where the division is made. If the error occurs, the line will be skipped

Possible, but risky. We get a ValueError which is a really common error type in python. If an error occurs for some other reason, we risk more bugs. For example, we're focused on headers right now, but what if a user accidentally puts data in the wrong column? A try-except block will skip over it, and the program will go on like nothing is wrong when, in fact, there's an error that the user needs to fix. So we can wind up skipping more than just headers.
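
To make the risk concrete, here is a toy sketch with invented rows. It is not the Aligner.align() loop, only the same skip-on-ValueError idea applied to three sample rows.

```python
rows = [
    ["id", "name", "start", "end", "text"],       # header row
    ["S1", "Anna", "0.5", "2.1", "hello there"],  # well-formed row
    ["S1", "Anna", "hello again", "2.1", "4.0"],  # text in the wrong column
]

kept, skipped = [], []
for row in rows:
    try:
        float(row[2])
    except ValueError:
        # The header and the malformed row are indistinguishable here.
        skipped.append(row)
    else:
        kept.append(row)

print(len(kept), len(skipped))
```

The malformed third row is dropped just as quietly as the header, which is the situation we want the user to hear about.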

> Maybe it would be enough to check there for header, and give this information somehow to align

I think this is on the right track, but instead of passing that information along, could TranscriptProcessor.read_transcription_file() just remove the header row? I don't think any other functions expect a header row. It's just there to make the document more human readable. If I remember correctly, I think using next() on a csv.reader object skips the first line which might be easier than passing along a variable or checking an attribute.
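
A minimal sketch of the next() idea, assuming we already know a header is present. The in-memory sample stands in for an open transcript file, and the column names are invented.

```python
import csv
import io

# Invented transcript content; a real caller would pass an open file instead.
handle = io.StringIO(
    "id\tname\tstart\tend\ttext\n"
    "S1\tAnna\t0.5\t2.1\thello there\n"
    "S1\tAnna\t2.1\t4.0\thow are you\n"
)

reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
header = next(reader)  # consumes the first row so later iteration never sees it
rows = list(reader)    # only the data rows remain
print(header[2], len(rows))
```

Note that next() here skips the first row unconditionally, so some check for a header would still have to come first.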


DerMoehre · 2022-10-02T14:10:24Z

Hi Chris,
thank you again, for your detailed information :)
I played a bit around with the pytest and tried to write a small test for the read_transcription_file(). I could pass the example csv file, but it stopped because of the missing cmudictionary.
I think, I understand the code a bit better now, so still some progress ;-)

my idea to fix the header alignment right now is:

import csv and use it only in the transcriptprocessor.py in the function read_transcription_file()
with the opened file, it will check for header
if it has a header, another csv will be opened and all but the first line will be written in this file and saved
after that, it will override the old csv with the new
with the new csv, the original code will run through
(it worked for a local csv file on my computer)

I will try to read a bit more into pytest, to understand better what is going on :-D

chrisbrickhouse · 2022-10-02T21:25:45Z

> I could pass the example csv file, but it stopped because of the missing cmudictionary.

Yeah, fixing how we test code which relies on the cmudictionary is on my to-do list. I wrote a similar test over on my branch for this issue which will show you how to incorporate the cmu dictionary into the test. My current solution is to copy and paste some code from the cmudictionary tests, but that's not the best long term solution.

> if it has a header, another csv will be opened and all but the first line will be written in this file and saved

This will work, but remember that file read-write can be very slow. Our users work with data sets that can be very large, and this would effectively double the run time.

I'm just thinking of this now, but maybe we don't need to implement the csv library. If all it does is look at the data types of the 2nd and later rows to see if they match the first, maybe we can just implement our own test? The 3rd element of a line will always be a float, so in read_transcription_file() we could just test to see if float(lines[0].split('\t')[2]) raises an error. This is similar to your idea of using a try-except block, but instead of putting it in Aligner.align(), we move it earlier in the chain. Something like:

def read_transcription_file(self):
    ...
    try:
        float(lines[0].split('\t')[2])
    except ValueError:
        # Log a warning about having detected a header row
        # lines is a list, so slice off the header row; next() only works
        # on iterators, see https://stackoverflow.com/questions/4796764
        lines = lines[1:]
    self.lines = lines

> after that, it will override the old csv with the new

Generally, we should not overwrite files without a user's explicit permission. If they don't have a backup, we could wind up causing data loss, and we should avoid creating that situation.

DerMoehre · 2022-10-03T07:41:59Z

okay this is great :)
I will try to implement the code of yours into the file and see, how it works :)

The overwrite would be necessary to replace the old file with the new one (without the header). But you are right, I also did not think about the extra time it would take.

Resolves #65 by checking the data type of the first time field. If it's not a float, we assume it's a header row and remove it from the returned list. Otherwise the function returns as previously. Squashed commit of DerMoehre's PR #73 Co-authored-by: JoFrhwld <JoFrhwld@gmail.com> Co-authored-by: Christian Brickhouse <chrisbrickhouse@users.noreply.github.com>


chrisbrickhouse mentioned this issue Aug 29, 2022

[Milestone] Release 2.1 #62

Closed

13 tasks

chrisbrickhouse added this to the Version 2.1 milestone Sep 23, 2022

chrisbrickhouse added the bug (Something broke) and good first issue (Easy to do and suitable for new developers) labels Sep 23, 2022

chrisbrickhouse added a commit to chrisbrickhouse/FAVE that referenced this issue Oct 1, 2022

Test for TP.read_transcription_file

dbdc50e

Adds test for JoFrhwld#65 also adds more code debt for turning the CMU dictionary excerpt into a fixture.

chrisbrickhouse closed this as completed Oct 6, 2022

chrisbrickhouse mentioned this issue Oct 18, 2022

Feature/parselmouth fix #85

Open
