Trouble decoding and encoding twix VB format during anonymization #52

josephmje · 2017-05-30T17:04:27Z

Hi,

I'm having some trouble with the anonymize_twix() function. My file is read as the VB format.

Decoding to windows-1252 produces the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 418277: character maps to <undefined>

Decoding using latin-1 works (as shown in the load_twix_vb() function). However, I can't write the anonymized header to a new file. Using windows-1252, there is again an issue with the 0x8f character.

The text was updated successfully, but these errors were encountered:

twix files have a header encoded in the latin-1 encoding, previous versions of suspect used an incorrect windows-1252 in the anonymize_twix() function.

bennyrowland · 2017-06-13T11:55:43Z

Hi Joseph,
Sorry for the delay, I have been away for a couple of weeks, but back working on Suspect now. I don't typically make much use of anonymize_twix() as I can usually get away with using the original files. When I first wrote the twix handling code, I didn't know what the correct encoding was, and found that windows-1252 worked for the files I was working with at the time (it shares a lot of characters with latin-1). Later when I found a file that didn't work with windows-1252 and changed the load_twix functions to use latin-1, I forgot to make the change in the anonymize_twix() function.
I have fixed this in the master so if you are using that feel free to test it out and make sure it works for you. I am also going to address some of your other issues with anonymize_twix() and will then create a v0.3.1 hot fix release on PyPI, hopefully some time in the next day or two, so if you have any other requests, now is the time to make them.
Ben
P.S. don't be afraid to nag a bit on issues like this, it is a useful way for devs to figure out what is important to people, and incentivise them to work on something if they get a gentle reminder every few days or so.

josephmje · 2017-06-13T13:02:35Z

Hi Ben,

Thank you for your reply. This is my first time working with twix files so your code was very informative. I was working on adapting your anonymize_twix() function for our purposes.
Yes, encoding and decoding with latin-1 works.
As for the twix headers, below are the tags I've been modifying. The instances 'tPatientName', 'PatientsName' and 'lPatientSex' were missing from suspect.

patient_name = "(<ParamString.\"t?Patients?Name\">\s*\{\s*)(\".+\")(\s*\}\n)"
patient_id = "(<ParamString.\"PatientID\">\s*\{\s*)(\".+\")(\s*\}\n)"
patient_birthday = "(<ParamString.\"PatientBirthDay\">\s*\{\s*)(\".+\")(\s*\}\n)"
patient_gender = "(<ParamLong.\"l?PatientSex\">\s*\{\s*)(\d+)(\s*\}\n)"
patient_age = "(<ParamDouble.\"flPatientAge\">\s*\{\s*)(<Precision> 6 \s*)(\d+\.\d*)(\s*\}\n)"
patient_weight = "(<ParamDouble.\"flUsedPatientWeight\">\s*\{\s*)(<Precision> 6 \s*)(\d+\.\d*)(\s*\}\n)"
patient_height = "(<ParamDouble.\"flPatientHeight\">\s*\{\s*<Unit> \"\[mm\]\"\s*)(<Precision> 6 \s*)(\d+\.\d*)(\s*\}\n)"
study_description = "(<ParamString.\"tStudyDescription\">\s*\{\s*)(\s*\}\n)"

Also, when anonymizing MR scans, our site typically replaces the PatientName and PatientID with unique codes. I tried doing this on the twix files and modified the initial 4 byte integer to point to the new starting point of the ADC data. However, the twix files could not be processed after this. Does the ADC data also contain starting locations that would also need to be modified?
I noticed that suspect instead replaces those fields with an equivalent number of "x" characters, keeping the twix header length the same. Was this done to keep the twix files viable?

Thanks,
Michael

bennyrowland · 2017-06-13T14:13:25Z

Hi Michael,

First, sorry for calling you Joseph above. That looks like a nice collection of regex you have assembled there. Mine has definitely been an evolving collection and ripe for some reorganisation. I knew that the patient name strings were all significantly different, which is why I extracted the actual name string and replaced that, to be sure that I had not missed any of the different tags. This is ok for names, but in principle numerical strings such as weight/age/height could be a perfect match for some other scan parameters (particularly if set to 0) and so cannot be blindly replaced that way, we have to rely on having all the tags considered. I will incorporate these into the anonymise function.

Regarding length, you are correct that I took the easy option of replacing characters one by one to maintain the length of the overall file, even though this does slightly weaken the anonymisation. As far as I can tell, this is not actually necessary to preserve the file structure, as everything is done by offsets subsequently. I have tested it using a fixed string of characters instead and suspect is able to correctly re-load the anonymised file, even though the header string is a different length. What exactly do you mean by "could not be processed"? In suspect or with some other software? Although suspect is fine with the modified file, part of the reason I kept a fixed length was in case it was important to some other software, probably not necessary though.

I will work on modifying the anonymize_twix() function to accept replacement patient_id and patient_name strings instead of only using x characters. Do you call the function from within python or do you use the script that comes as part of the install?

josephmje · 2017-06-13T14:31:07Z

I created a pull request with some of my changes. I didn't try processing the anonymized file with suspect but I should (d'oh!). I was using Gannet and that failed.

I call the function from within Python.

bennyrowland · 2017-06-13T15:54:09Z

Ok, I will have a look. Might take me a little while to dig out a VB MEGAPRESS file though. Note that the current version of suspect assumes the header string stays of fixed length when writing out the answer, so is not correct when changes are made. Once your PR is in I will look at successfully changing the length of the files and using custom anonymous names/ids.

bennyrowland · 2017-06-14T14:22:52Z

A little bit more info - the headers in the twix file are actually a collection of separate files on the scanner which get lumped together to make up the overall header. Each of these sections begins with:
uint32 length of the name of the file
string of length bytes containing the name of the file
a null byte
uint32 length of the contents of the file
string of length bytes with the contents

Thus changing the length of the text has to be corrected in two places, at the very beginning and also at the start of the subfile. Suspect doesn't treat the components separately and so is not affected. The way that Gannet reads the files is to parse each file component separately, and to extract whatever parameters it comes across, rather than looking for specific ones. If the length of one subfile changes then it gets confused about where to look for the name of the next one and this is what causes the error.

In order to fix this, the lengths of the individual subfiles need to be monitored and correctly modified. However, I don't have any knowledge of the structure of these header subfiles, whether the set of files is always constant or changes etc. This would require considerable work restructuring the way the twix files are read in to separate out the subfiles and extract the different parameters from each one, in order to change specific values. In addition because the name appears with different tags in different areas it would have to be changed everywhere separately. Do you want to take a look at it?

josephmje · 2017-06-14T16:11:02Z

Thanks for the info Ben. Yes, I tried digging around for more on how the twix files are organized but info is pretty scarce. I came across this thesis: http://repositorio.ucp.pt/bitstream/10400.14/14819/1/Master%20Thesis%20Jos%C3%A9%20Santos.pdf

Perhaps, looking more at the how Gannet reads the file will help.
I'd be interested in looking more into it when I get some time. But as of now, I think masking the PHI instead of inserting strings is good enough for most purposes.

bennyrowland added a commit that referenced this issue Jun 13, 2017

Fixes #52 by changing the encoding for twix headers

a1bbae6

twix files have a header encoded in the latin-1 encoding, previous versions of suspect used an incorrect windows-1252 in the anonymize_twix() function.

bennyrowland self-assigned this Jun 13, 2017

bennyrowland added the in progress label Jun 13, 2017

bennyrowland closed this as completed in c2cea1c Jun 13, 2017

bennyrowland removed the in progress label Jun 13, 2017

bennyrowland reopened this Jun 13, 2017

bennyrowland added the in progress label Jun 13, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trouble decoding and encoding twix VB format during anonymization #52

Trouble decoding and encoding twix VB format during anonymization #52

josephmje commented May 30, 2017

bennyrowland commented Jun 13, 2017

josephmje commented Jun 13, 2017

bennyrowland commented Jun 13, 2017

josephmje commented Jun 13, 2017

bennyrowland commented Jun 13, 2017

bennyrowland commented Jun 14, 2017

josephmje commented Jun 14, 2017

Trouble decoding and encoding twix VB format during anonymization #52

Trouble decoding and encoding twix VB format during anonymization #52

Comments

josephmje commented May 30, 2017

bennyrowland commented Jun 13, 2017

josephmje commented Jun 13, 2017

bennyrowland commented Jun 13, 2017

josephmje commented Jun 13, 2017

bennyrowland commented Jun 13, 2017

bennyrowland commented Jun 14, 2017

josephmje commented Jun 14, 2017