Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trouble decoding and encoding twix VB format during anonymization #52

Open
josephmje opened this issue May 30, 2017 · 7 comments
Open
Assignees

Comments

@josephmje
Copy link
Contributor

Hi,

I'm having some trouble with the anonymize_twix() function. My file is read as the VB format.

Decoding to windows-1252 produces the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 418277: character maps to <undefined>

Decoding using latin-1 works (as shown in the load_twix_vb() function). However, I can't write the anonymized header to a new file. Using windows-1252, there is again an issue with the 0x8f character.

bennyrowland added a commit that referenced this issue Jun 13, 2017
twix files have a header encoded in the latin-1 encoding, previous
versions of suspect used an incorrect windows-1252 in the
anonymize_twix() function.
@bennyrowland bennyrowland self-assigned this Jun 13, 2017
@bennyrowland
Copy link
Member

Hi Joseph,
Sorry for the delay, I have been away for a couple of weeks, but back working on Suspect now. I don't typically make much use of anonymize_twix() as I can usually get away with using the original files. When I first wrote the twix handling code, I didn't know what the correct encoding was, and found that windows-1252 worked for the files I was working with at the time (it shares a lot of characters with latin-1). Later when I found a file that didn't work with windows-1252 and changed the load_twix functions to use latin-1, I forgot to make the change in the anonymize_twix() function.
I have fixed this in the master so if you are using that feel free to test it out and make sure it works for you. I am also going to address some of your other issues with anonymize_twix() and will then create a v0.3.1 hot fix release on PyPI, hopefully some time in the next day or two, so if you have any other requests, now is the time to make them.
Ben
P.S. don't be afraid to nag a bit on issues like this, it is a useful way for devs to figure out what is important to people, and incentivise them to work on something if they get a gentle reminder every few days or so.

@josephmje
Copy link
Contributor Author

Hi Ben,

Thank you for your reply. This is my first time working with twix files so your code was very informative. I was working on adapting your anonymize_twix() function for our purposes.
Yes, encoding and decoding with latin-1 works.
As for the twix headers, below are the tags I've been modifying. The instances 'tPatientName', 'PatientsName' and 'lPatientSex' were missing from suspect.

patient_name = "(<ParamString.\"t?Patients?Name\">\s*\{\s*)(\".+\")(\s*\}\n)"
patient_id = "(<ParamString.\"PatientID\">\s*\{\s*)(\".+\")(\s*\}\n)"
patient_birthday = "(<ParamString.\"PatientBirthDay\">\s*\{\s*)(\".+\")(\s*\}\n)"
patient_gender = "(<ParamLong.\"l?PatientSex\">\s*\{\s*)(\d+)(\s*\}\n)"
patient_age = "(<ParamDouble.\"flPatientAge\">\s*\{\s*)(<Precision> 6 \s*)(\d+\.\d*)(\s*\}\n)"
patient_weight = "(<ParamDouble.\"flUsedPatientWeight\">\s*\{\s*)(<Precision> 6 \s*)(\d+\.\d*)(\s*\}\n)"
patient_height = "(<ParamDouble.\"flPatientHeight\">\s*\{\s*<Unit> \"\[mm\]\"\s*)(<Precision> 6 \s*)(\d+\.\d*)(\s*\}\n)"
study_description = "(<ParamString.\"tStudyDescription\">\s*\{\s*)(\s*\}\n)"

Also, when anonymizing MR scans, our site typically replaces the PatientName and PatientID with unique codes. I tried doing this on the twix files and modified the initial 4 byte integer to point to the new starting point of the ADC data. However, the twix files could not be processed after this. Does the ADC data also contain starting locations that would also need to be modified?
I noticed that suspect instead replaces those fields with an equivalent number of "x" characters, keeping the twix header length the same. Was this done to keep the twix files viable?

Thanks,
Michael

@bennyrowland
Copy link
Member

Hi Michael,

First, sorry for calling you Joseph above. That looks like a nice collection of regex you have assembled there. Mine has definitely been an evolving collection and ripe for some reorganisation. I knew that the patient name strings were all significantly different, which is why I extracted the actual name string and replaced that, to be sure that I had not missed any of the different tags. This is ok for names, but in principle numerical strings such as weight/age/height could be a perfect match for some other scan parameters (particularly if set to 0) and so cannot be blindly replaced that way, we have to rely on having all the tags considered. I will incorporate these into the anonymise function.

Regarding length, you are correct that I took the easy option of replacing characters one by one to maintain the length of the overall file, even though this does slightly weaken the anonymisation. As far as I can tell, this is not actually necessary to preserve the file structure, as everything is done by offsets subsequently. I have tested it using a fixed string of characters instead and suspect is able to correctly re-load the anonymised file, even though the header string is a different length. What exactly do you mean by "could not be processed"? In suspect or with some other software? Although suspect is fine with the modified file, part of the reason I kept a fixed length was in case it was important to some other software, probably not necessary though.

I will work on modifying the anonymize_twix() function to accept replacement patient_id and patient_name strings instead of only using x characters. Do you call the function from within python or do you use the script that comes as part of the install?

@josephmje
Copy link
Contributor Author

I created a pull request with some of my changes. I didn't try processing the anonymized file with suspect but I should (d'oh!). I was using Gannet and that failed.

I call the function from within Python.

@bennyrowland
Copy link
Member

Ok, I will have a look. Might take me a little while to dig out a VB MEGAPRESS file though. Note that the current version of suspect assumes the header string stays of fixed length when writing out the answer, so is not correct when changes are made. Once your PR is in I will look at successfully changing the length of the files and using custom anonymous names/ids.

@bennyrowland
Copy link
Member

A little bit more info - the headers in the twix file are actually a collection of separate files on the scanner which get lumped together to make up the overall header. Each of these sections begins with:
uint32 length of the name of the file
string of length bytes containing the name of the file
a null byte
uint32 length of the contents of the file
string of length bytes with the contents

Thus changing the length of the text has to be corrected in two places, at the very beginning and also at the start of the subfile. Suspect doesn't treat the components separately and so is not affected. The way that Gannet reads the files is to parse each file component separately, and to extract whatever parameters it comes across, rather than looking for specific ones. If the length of one subfile changes then it gets confused about where to look for the name of the next one and this is what causes the error.

In order to fix this, the lengths of the individual subfiles need to be monitored and correctly modified. However, I don't have any knowledge of the structure of these header subfiles, whether the set of files is always constant or changes etc. This would require considerable work restructuring the way the twix files are read in to separate out the subfiles and extract the different parameters from each one, in order to change specific values. In addition because the name appears with different tags in different areas it would have to be changed everywhere separately. Do you want to take a look at it?

@josephmje
Copy link
Contributor Author

Thanks for the info Ben. Yes, I tried digging around for more on how the twix files are organized but info is pretty scarce. I came across this thesis: http://repositorio.ucp.pt/bitstream/10400.14/14819/1/Master%20Thesis%20Jos%C3%A9%20Santos.pdf

Perhaps, looking more at the how Gannet reads the file will help.
I'd be interested in looking more into it when I get some time. But as of now, I think masking the PHI instead of inserting strings is good enough for most purposes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants