Trouble decoding and encoding twix VB format during anonymization #52
Comments
twix files have a header encoded in latin-1; previous versions of suspect incorrectly used windows-1252 in the anonymize_twix() function.
Hi Joseph, |
Hi Ben, Thank you for your reply. This is my first time working with twix files so your code was very informative. I was working on adapting your anonymize_twix() function for our purposes.
Also, when anonymizing MR scans, our site typically replaces the PatientName and PatientID with unique codes. I tried doing this on the twix files and modified the initial 4-byte integer to point to the new starting point of the ADC data. However, the twix files could not be processed after this. Does the ADC data also contain starting locations that would need to be modified? Thanks,
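A minimal sketch of the offset update described in this comment. It assumes the first 4 bytes of the file are a little-endian unsigned 32-bit integer giving the position, counted from the start of the file, at which the ADC data begins; the function name `replace_header_text` and the sample bytes are hypothetical and not part of suspect. It does not touch any length fields stored inside the header itself.

```python
import struct


def replace_header_text(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace text in the header and rewrite the leading 4-byte offset.

    Assumption: bytes 0-3 hold a little-endian uint32 with the file
    position where the measurement (ADC) data starts.
    """
    (data_start,) = struct.unpack("<I", data[:4])
    header = data[4:data_start]
    body = data[data_start:]

    new_header = header.replace(old, new)
    # The data now starts right after the 4-byte integer and the new header.
    new_start = 4 + len(new_header)
    return struct.pack("<I", new_start) + new_header + body


# Made-up example file: 4-byte offset, a short text header, then 3 data bytes.
sample_header = b"name=Doe^John;"
sample_body = b"\x01\x02\x03"
sample = struct.pack("<I", 4 + len(sample_header)) + sample_header + sample_body

anonymized = replace_header_text(sample, b"Doe^John", b"ID0001")
```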
Hi Michael, First, sorry for calling you Joseph above.
That looks like a nice collection of regex you have assembled there. Mine has definitely been an evolving collection and is ripe for some reorganisation. I knew that the patient name strings were all significantly different, which is why I extracted the actual name string and replaced that, to be sure that I had not missed any of the different tags. This is ok for names, but in principle numerical strings such as weight/age/height could be a perfect match for some other scan parameters (particularly if set to 0) and so cannot be blindly replaced that way; for those we have to rely on having considered all the tags. I will incorporate these into the anonymise function.
Regarding length, you are correct that I took the easy option of replacing characters one by one to maintain the length of the overall file, even though this does slightly weaken the anonymisation. As far as I can tell, this is not actually necessary to preserve the file structure, as everything is done by offsets subsequently. I have tested it using a fixed string of characters instead, and suspect is able to correctly re-load the anonymised file even though the header string is a different length.
What exactly do you mean by "could not be processed"? In suspect or with some other software? Although suspect is fine with the modified file, part of the reason I kept a fixed length was in case it was important to some other software, though it is probably not necessary. I will work on modifying the
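The length-preserving approach mentioned in this comment can be sketched roughly as follows. This is an illustration only, not the code in suspect: the helper name `mask_preserving_length`, the tag names and the header text are all made up.

```python
def mask_preserving_length(header: str, name: str, fill: str = "x") -> str:
    """Overwrite every occurrence of `name` with `fill` characters.

    Each character of the name is replaced by one fill character, so the
    header keeps exactly the same length.
    """
    if not name:
        return header
    return header.replace(name, fill * len(name))


# Made-up header text in which the same name appears under two different tags.
example_header = 'TagA { "Doe^John" }\nTagB { "Doe^John" }\n'
masked = mask_preserving_length(example_header, "Doe^John")
```

Replacing the extracted name string itself catches every tag it appears under, which is the reason given above for not matching on tags. The same trick is unsafe for short numeric values, which can coincide with unrelated parameters.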
I created a pull request with some of my changes. I didn't try processing the anonymized file with suspect but I should (d'oh!). I was using Gannet and that failed. I call the function from within Python. |
Ok, I will have a look. It might take me a little while to dig out a VB MEGAPRESS file though. Note that the current version of suspect assumes the header string stays a fixed length when writing out the answer, so it is not correct when the length changes. Once your PR is in I will look at successfully changing the length of the files and using custom anonymous names/ids.
A little bit more info - the headers in the twix file are actually a collection of separate files on the scanner which get lumped together to make up the overall header. Each of these sections begins with:
Thus changing the length of the text has to be corrected in two places, at the very beginning and also at the start of the subfile. Suspect doesn't treat the components separately and so is not affected. Gannet, by contrast, parses each file component separately and extracts whatever parameters it comes across, rather than looking for specific ones. If the length of one subfile changes then it gets confused about where to look for the name of the next one, and this is what causes the error.
In order to fix this, the lengths of the individual subfiles need to be monitored and correctly modified. However, I don't have any knowledge of the structure of these header subfiles, such as whether the set of files is always constant or changes. This would require considerable work restructuring the way the twix files are read in, to separate out the subfiles and extract the different parameters from each one, in order to change specific values. In addition, because the name appears with different tags in different areas, it would have to be changed everywhere separately. Do you want to take a look at it?
Thanks for the info, Ben. Yes, I tried digging around for more on how the twix files are organized but info is pretty scarce. I came across this thesis: http://repositorio.ucp.pt/bitstream/10400.14/14819/1/Master%20Thesis%20Jos%C3%A9%20Santos.pdf Perhaps looking more at how Gannet reads the file will help.
Hi,
I'm having some trouble with the anonymize_twix() function. My file is read as the VB format.
Decoding to windows-1252 produces the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 418277: character maps to <undefined>
Decoding using latin-1 works (as shown in the load_twix_vb() function). However, I can't write the anonymized header to a new file. Using windows-1252, there is again an issue with the 0x8f character.
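The difference between the two codecs can be reproduced without a twix file. In Python's windows-1252 (cp1252) codec the byte 0x8f has no character assigned, so decoding fails, whereas latin-1 maps every byte value from 0 to 255 to a code point and therefore round-trips any byte string. The bytes below are made up for illustration and are not taken from a real header.

```python
# Made-up bytes containing 0x8f, the byte reported in the error above.
raw = b"header text \x8f more text"

# windows-1252 leaves 0x8f undefined, so decoding raises UnicodeDecodeError.
try:
    raw.decode("windows-1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# latin-1 decodes every possible byte, and encoding gives back the same bytes.
text = raw.decode("latin-1")
round_trip = text.encode("latin-1")
```

As long as the replacement text contains only characters that latin-1 can represent, the edited header can be encoded back to bytes with the same codec.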