Questions regarding date shifting and the removal of personal datetime data #1500
-
Perhaps this will help clarify:
The deidentified

This is kind of a weird process, but we had the following constraint: tell researchers the year that a patient was admitted and preserve the chronology of the data. Imagine a patient has an admission in December 2014 and another admission in January 2015. If we told you the real year for every hospitalization, then you'd know: (1) hospitalization A occurred in 2014, (2) hospitalization B occurred in 2015, and (3) hospitalization B occurred 20 days after hospitalization A. With this information, you can pretty trivially narrow down the possible dates of admission to a very small window.

To avoid this issue, we only tell you the real year once. For which date then? Well, we arbitrarily chose the

So, back to your questions:
This query will get you a random number, because we randomly shifted all dates into the years 2110 to 2208. It only makes sense to analyze
I am not sure if there are public proposals, but we discussed this at length internally. Most deidentification approaches are based on what we do, so I'm not sure it will help to look at other datasets either! Very loosely, there are three approaches: (1) give a single year and offsets (MIMIC-IV, eICU-CRD), (2) give the day of the week / season but no year information (MIMIC-III), (3) don't give any real-world date information at all.
Not sure I follow this exactly, but what would make sense is: (1) join the patients table to admissions, (2) calculate the real admission year by taking roughly
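The two steps in the reply above can be sketched roughly as follows. The reply is cut off before it says what to take, so this sketch assumes the midpoint of anchor_year_group plus the admission's year offset from anchor_year, which fits the "-1/0/+1 year" fuzziness described in the follow-up. The `Patient` class, the `subject_id` field, and the helper names are hypothetical, and plain Python objects stand in for the joined patients and admissions rows.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    subject_id: int          # hypothetical identifier for the joined row
    anchor_year: int         # shifted (deidentified) year
    anchor_year_group: str   # real three-year range, e.g. "2011 - 2013"


def group_midpoint(anchor_year_group: str) -> int:
    """Return the middle year of a range such as '2011 - 2013'."""
    start, end = (int(part) for part in anchor_year_group.split("-"))
    return (start + end) // 2


def estimate_real_year(patient: Patient, shifted_admit_year: int) -> int:
    """Approximate the real admission year.

    Assumption: real year ~= midpoint of anchor_year_group
    + (shifted admission year - anchor_year).
    """
    offset = shifted_admit_year - patient.anchor_year
    return group_midpoint(patient.anchor_year_group) + offset


# One patient with three admissions, all expressed in shifted years.
patient = Patient(subject_id=1, anchor_year=2150, anchor_year_group="2011 - 2013")
shifted_admit_years = [2150, 2151, 2153]
estimates = [estimate_real_year(patient, y) for y in shifted_admit_years]
print(estimates)
```

Under this assumption, every admission that falls in the anchor year itself maps to the middle year of its group, so any estimate is only good to about one year in either direction.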
-
Thank you very much for your detailed response and for taking the time!
Is the real
If I understand correctly, this approach would match the real admission year in roughly 1/3 of the admissions with a general "fuzziness" in the dates (-1/0/+1 year) for all following admissions (by design, due to the deidentification)? I tried with the following query (Postgres):
Which yields:
i.e. a disproportionate number of admissions for the "middle years". For example, if I wanted to approximate the number of patients with an appendicitis diagnosis per year, I wonder if there's an approach that would make sense and come close to what the real numbers might be?
-
anchor_year. I don't understand whether the anchor_year is actually based on a real value that gets shifted or if it's a somewhat "random number" between 2110 and 2208 (select min(anchor_year), max(anchor_year) from patients;). What's the underlying heuristic? Did I miss something in the docs?
anchor_year_group (e.g. 2011 - 2013). Is this something that has been explored?