write_tsv: remove line breaks and tab characters from cell values #50

Open
knh11545 opened this issue Mar 10, 2017 · 2 comments

@knh11545

Problem: write_tsv converts a report to a TSV-formatted file. Line feed, carriage return, and tab characters in cell values (stemming from the original SUSHI report) frequently break subsequent processing of the TSV files by other software. The formal data quality of reports in the wild is often poor.

Suggestion: Optionally remove any such characters from cell values in the generated TSV. Maybe this can be done via a hook/callback to allow for future extension should the need for any other processing arise.
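A minimal sketch of what such a hook could look like, written against Python's standard csv module rather than pycounter itself. The names `strip_breaks`, `write_rows_as_tsv`, and `cell_hook` are hypothetical and are not part of pycounter's API; how write_tsv builds its rows internally is not shown in this thread, so the rows here are plain lists.

```python
import csv
import io
import re

# Hypothetical helpers for illustration only; not pycounter functions.
_BREAK_RUN = re.compile(r"[\t\r\n]+")


def strip_breaks(value):
    """Replace each run of tab, CR, or LF with one space and trim the ends."""
    return _BREAK_RUN.sub(" ", value).strip()


def write_rows_as_tsv(rows, fileobj, cell_hook=None):
    """Write rows as TSV, passing every cell through cell_hook if one is given."""
    writer = csv.writer(fileobj, delimiter="\t", lineterminator="\n")
    for row in rows:
        cells = [str(cell) for cell in row]
        if cell_hook is not None:
            cells = [cell_hook(cell) for cell in cells]
        writer.writerow(cells)


buf = io.StringIO()
write_rows_as_tsv(
    [["Journal of\nTesting", "Pub\tlisher", 5]],
    buf,
    cell_hook=strip_breaks,
)
print(repr(buf.getvalue()))
```

Passing a callable keeps the sanitizing step optional and leaves room for other per-cell processing later, which is the extension point the suggestion asks for.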

@Wooble
Member

Wooble commented Mar 14, 2017

Are you seeing examples where other tools that are aware of quoting & escaping are having problems with the files generated? (I.e., is pycounter currently incorrectly quoting or escaping, which would make this a fairly serious bug?) I've been fortunate to not have to deal with any SUSHI vendors who are bad in this specific way. Yet.

(I'm definitely in favor of producing reports that can be processed by anything that needs to consume them. Additionally it's a violation of the standard to output any tabs in the spreadsheet reports' cells at all, so cleaning those up should probably be the default behavior; even properly quoting them sounds like it produces non-compliant reports.)
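To make the quoting question concrete, here is a small illustration using only Python's standard csv module; it is not pycounter's code, and the sample row is invented. It shows that a quoted cell containing a line feed or tab round-trips through a quote-aware reader, while a consumer that simply splits on line breaks and tabs sees extra rows and columns.

```python
import csv
import io

# Invented sample row: one cell holds a line feed, another holds a tab.
row = ["Journal of\nTesting", "Pub\tlisher", "5"]

# Write with the csv module's default minimal quoting.
buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerow(row)
text = buf.getvalue()

# A quote-aware reader reassembles the original cells.
parsed = list(csv.reader(io.StringIO(text), delimiter="\t"))

# A naive consumer that ignores quoting splits the record apart.
naive = [line.split("\t") for line in text.splitlines()]

print(parsed == [row])
print(len(naive), [len(fields) for fields in naive])
```

So a file can be correctly quoted and still fail in tools that do not honor quoting, which is a separate question from whether such cells are allowed in a COUNTER report at all.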

@knh11545
Author

As a librarian I have many years of experience in processing COUNTER reports from many vendors, in tabular format as well as in SUSHI report format. Automatic processing of raw report data has failed in most cases due to poor technical data quality. One of the errors we encountered most often is line break characters and the like in cell values. These characters frequently break the processing of reports in an automated tool chain built from various standard software. Unfortunately, we have to do a great deal of painstaking manual editing and correcting of reports before they can be processed for integration into our home-grown ERM system. While line breaks in SUSHI XML reports do not pose a problem per se, reports fetched via SUSHI often need to be converted to tabular format, and that is when automatic processing breaks.

I use the Biblio::COUNTER Perl module for importing and forked it to support recent COUNTER versions. The original module author states:

Because the COUNTER Codes of Practice are so poorly written and documented, with incomplete specifications and inconsistent terminology, it has been necessary to make certain assumptions and normalizations in the code and documentation of this module.

Besides the problems with Biblio::COUNTER, I was also unable to import a pycounter-generated TSV with R's read.table(). I have not yet figured out where it breaks.

I think it is fair to say that it is generally acknowledged among programmers that processing CSV-like file formats holds many difficulties due to loose format definitions. See also Text::CSV. So COUNTER reports are just another example.

Project COUNTER's validator service shows neither errors nor warnings for the files that fail for me due to whitespace issues. I found nothing in the COUNTER documents about newline or tab characters in cell values. Therefore, I am not sure it is justified to call this a "violation of the standard" as it currently stands. However, I asked Project COUNTER not to allow newlines or tabs in field/cell values in the upcoming Release 5 (I have no answer yet). They have a public consultation running.

So some software may fail where other software succeeds. I don't see that pycounter is behaving incorrectly, as it is not clear what is correct.

However, to be on the safe side and to reduce possible problems, it would be nice if there was an option in pycounter's write_tsv to sanitize newlines and tabs.
