Questions regarding date shifting and the removal of personal datetime data #1500

schu · 2023-03-15T11:38:29Z

1. Every patient gets assigned an anchor_year. I don't understand whether the anchor_year is actually based on a real value that gets shifted or if it's a somewhat "random number" between 2110 and 2208 (select min(anchor_year), max(anchor_year) from patients;). What's the underlying heuristic? Did I miss something in the docs?
2. Did you consider other approaches to de-identify patient/datetime info for the MIMIC project and is there any public info (for example design proposals) on the discussed options and pros & cons?
3. I wonder if it would make sense to approximate statistics for a single year (for example 2011) by choosing a random third of the patients within the corresponding anchor_year_group (e.g. 2011 - 2013). Is this something that has been explored?

alistairewj (Maintainer) · 2023-03-28T17:50:57Z

Perhaps this will help clarify:

1. First, for a given patient, we pick an anchor_year based on a real date. Let's say it's 2015.
2. Then, we deidentify this anchor_year. So now it's something deidentified, let's say it's 2110.
3. Finally, we provide the anchor_year_group. Because the original anchor_year was 2015, we'd assign it to the 2014 - 2016 group. (Note: if the anchor_year was 2016, we would also assign it to the 2014 - 2016 anchor_year_group).

The deidentified anchor_year from step 2 does still have meaning, because the anchor_year has been shifted in the same way as the other dates for the patient. So if a patient has a hospitalization in 2111, we can calculate this as 2111 - 2110 = 1 year after the anchor_year, which means that in real dates, that hospitalization occurred 1 year after [2014 - 2016], i.e. sometime between 2015 and 2017.
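A minimal sketch of that arithmetic, assuming anchor_year_group is a string of the form "2014 - 2016". The function name real_year_range is hypothetical and not part of MIMIC; it simply carries the offset from the anchor_year over to both ends of the group.

```python
def real_year_range(event_year: int, anchor_year: int, anchor_year_group: str) -> tuple[int, int]:
    """Return the (earliest, latest) possible real year for a shifted event year."""
    # Assumed format of anchor_year_group: "<first year> - <last year>"
    first, last = (int(part) for part in anchor_year_group.split("-"))
    # Offset of the event from the anchor, in whole years (same in shifted and real time)
    offset = event_year - anchor_year
    return first + offset, last + offset


# The example from the text: hospitalization in 2111, anchor_year 2110
print(real_year_range(2111, 2110, "2014 - 2016"))  # (2015, 2017)
```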

This is kind of a weird process, but we had the following constraint: tell researchers the year that a patient was admitted and preserve the chronology of the data. Imagine if a patient has an admission in December 2014, and another admission in January 2015. If we told you the real year for every hospitalization, then you'd know: (1) hospitalization A occurred in 2014, (2) hospitalization B occurred in 2015, and (3) hospitalization B occurred 20 days after hospitalization A. With this information, you can pretty trivially narrow down the possible dates of admission to a very small window. To avoid this issue, we only tell you the real year once. For which date then? Well, we arbitrarily chose the anchor_year... but it could well be anything. We just have to only tell you it once.
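To make the narrowing concrete, here is a small sketch that enumerates which dates for hospitalization A are consistent with the three facts above. The function candidate_dates is hypothetical and only illustrates the reasoning; it is not how MIMIC was processed.

```python
from datetime import date, timedelta


def candidate_dates(year_a: int, year_b: int, gap_days: int) -> list[date]:
    """Dates in year_a on which A could start so that A + gap_days falls in year_b."""
    gap = timedelta(days=gap_days)
    day = date(year_a, 1, 1)
    matches = []
    while day.year == year_a:
        if (day + gap).year == year_b:
            matches.append(day)
        day += timedelta(days=1)
    return matches


# A in 2014, B in 2015, B is 20 days after A
window = candidate_dates(2014, 2015, 20)
print(len(window), window[0], window[-1])  # 20 2014-12-12 2014-12-31
```

Out of 365 possible days in 2014, only 20 remain, which is why the real year is revealed for a single date only.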

So, back to your questions:

1. Every patient gets assigned an anchor_year. I don't understand whether the anchor_year is actually based on a real value that gets shifted or if it's a somewhat "random number" between 2110 and 2208 (select min(anchor_year), max(anchor_year) from patients;). What's the underlying heuristic? Did I miss something in the docs?

The values this query returns are effectively random, because we randomly shifted every patient's dates into the 2110 - 2208 range. It only makes sense to analyze anchor_year together with anchor_year_group.

2. Did you consider other approaches to de-identify patient/datetime info for the MIMIC project and is there any public info (for example design proposals) on the discussed options and pros & cons?

I am not sure if there are public proposals, but we discussed this at length internally. Most deidentification approaches are based on what we do, so not sure it will help to look at other datasets either! Very loosely there are three approaches: (1) give a single year and offsets (MIMIC-IV, eICU-CRD), (2) give the day of the week / season, do not give any year information (MIMIC-III), (3) don't give any real-world date information at all.

3. I wonder if it would make sense to approximate statistics for a single year (for example 2011) by choosing a random third of the patients within the corresponding anchor_year_group (e.g. 2011 - 2013). Is this something that has been explored?

Not sure I follow this exactly, but what would make sense is: (1) join the patients table to admissions, (2) calculate the approximate real admission year as roughly YEAR(admittime) - anchor_year + MIDDLE_YEAR(anchor_year_group) (not real syntax, just pseudocode), and (3) do a histogram of real years of hospital admission. That would give you roughly how many admissions occur each year in MIMIC-IV.
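The pseudocode above could be sketched in Python roughly as follows. The rows are made-up illustrations rather than MIMIC data, approx_real_year is a hypothetical helper, and anchor_year_group is assumed to look like "2014 - 2016" so that the middle year is the first year plus one.

```python
from collections import Counter


def approx_real_year(admit_year: int, anchor_year: int, anchor_year_group: str) -> int:
    """Estimate the real admission year using the middle year of the group."""
    first_year = int(anchor_year_group.split("-")[0])
    middle_year = first_year + 1  # groups span three years
    return admit_year - anchor_year + middle_year


# Hypothetical joined rows: (shifted admission year, anchor_year, anchor_year_group)
rows = [
    (2110, 2110, "2014 - 2016"),
    (2111, 2110, "2014 - 2016"),
    (2150, 2149, "2014 - 2016"),
    (2180, 2180, "2008 - 2010"),
]

# Step (3): histogram of estimated real admission years
histogram = Counter(approx_real_year(*row) for row in rows)
for year in sorted(histogram):
    print(year, histogram[year])
```

Each estimate can be off by one year in either direction, since the true anchor year may be any of the three years in the group.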


schu (Author) · 2023-04-15T11:14:43Z

Thank you very much for your detailed response and for taking the time!

First, for a given patient, we pick an anchor_year based on a real date. Let's say it's 2015.

Then, we deidentify this anchor_year. So now it's something deidentified, let's say it's 2110.

The deidentified anchor_year from step 2 does still have meaning, because the anchor_year has been shifted in the same way as the other dates for the patient. So if a patient has a hospitalization in 2111, we can calculate this as 2111 - 2110 = 1 year after the anchor_year, which means that in real dates, that hospitalization occurred 1 year after [2014 - 2016], aka sometime between 2015 - 2017.

Is the real anchor_year the first year they appear in the EHR system? How is the deidentification ("shifting") done? (If this is something that you can share publicly.)

I wonder if it would make sense to approximate statistics for a single year (for example 2011) by choosing a random third of the patients within the corresponding anchor_year_group (e.g. 2011 - 2013). Is this something that has been explored?

Not sure I follow this exactly, but what would make sense is: (1) join patients table to admissions, (2) calculate the real admission year by taking roughly admittime - YEAR(anchor_year) + MIDDLE_YEAR(anchor_year_group) (not real syntax, just pseudocode), and (3) do a histogram of real years of hospital admission. That would give you roughly how many admissions occur each year in MIMIC-IV.

If I understand correctly, this approach would match the real admission year in roughly 1/3 of the admissions with a general "fuzziness" in the dates (-1/0/+1 year) for all following admissions (by design, due to the deidentification)?

I tried with the following query (Postgres):

SELECT
  (DATE_PART('year', a.admittime) - p.anchor_year + SUBSTRING(p.anchor_year_group, '([0-9]{4}).*')::int + 1) AS year,
  COUNT(a.hadm_id)
FROM admissions AS a
INNER JOIN patients AS p ON p.subject_id = a.subject_id
GROUP BY year
ORDER BY year;

Which yields:

 year | count
------+-------
 2006 |    10
 2007 |    19
 2008 |    49
 2009 | 73501
 2010 | 24853
 2011 | 18163
 2012 | 67511
 2013 | 25310
 2014 | 19394
 2015 | 66040
 2016 | 26051
 2017 | 20557
 2018 | 61557
 2019 | 18477
 2020 |  8253
 2021 |  1486

i.e. a disproportionate number of admissions for the "middle years".

For example, if I wanted to approximate the number of patients with an appendicitis diagnosis per year, I wonder if there's an approach that would make sense and come close to what the real numbers might be?


alistairewj (Maintainer) · Apr 15, 2023

Is the real anchor_year the first year they appear in the EHR system? How is the deidentification ("shifting") done? (If this is something that you can share publicly.)

Yes, basically. The shift is a random number of days per patient.
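As an illustration only, a per-patient shift of this kind might look like the sketch below. The function shift_patient_dates, the range of the offset, and the example timestamps are all assumptions made for this example; the actual MIMIC procedure is not described beyond "a random number of days per patient".

```python
import random
from datetime import datetime, timedelta


def shift_patient_dates(dates, rng, max_shift_days=36500):
    """Shift all of one patient's timestamps by a single random number of days.

    max_shift_days is a hypothetical bound, not the value used for MIMIC.
    """
    offset = timedelta(days=rng.randint(1, max_shift_days))
    # The same offset is applied to every date, so intervals are unchanged
    return [d + offset for d in dates]


rng = random.Random(0)
# Hypothetical real timestamps for one patient
real = [datetime(2014, 12, 20, 8, 30), datetime(2015, 1, 9, 14, 0)]
shifted = shift_patient_dates(real, rng)

# Chronology and time of day are preserved; only the calendar date moves
print(shifted[1] - shifted[0] == real[1] - real[0])
```

Because the offset is a whole number of days and constant within a patient, differences such as "1 year after the anchor_year" survive the shift.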

If I understand correctly, this approach would match the real admission year in roughly 1/3 of the admissions with a general "fuzziness" in the dates (-1/0/+1 year) for all following admissions (by design, due to the deidentification)?

Yep, basically.

For example, if I wanted to approximate the number of patients with an appendicitis diagnosis per year, I wonder if there's an approach that would make sense and come close to what the real numbers might be?

That looks about right. There will be edge effects, and MIMIC isn't really a good population sample, but if you average the rate of appendicitis in those groups then you'll get the rate. It probably makes sense to apply a 3-year rolling average since we've deidentified in 3 year blocks as well, and chop off the years before 2009 and after 2020.
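A rough sketch of that smoothing, applied to the admission counts posted earlier in the thread. It assumes the years are trimmed to 2009 - 2020 first and that the rolling window is centered, so the two edge years average over two values instead of three; the function rolling_mean_3yr is hypothetical, and the same idea would apply to per-year appendicitis counts.

```python
# Admission counts per estimated year, as posted in the thread
counts = {
    2006: 10, 2007: 19, 2008: 49, 2009: 73501, 2010: 24853, 2011: 18163,
    2012: 67511, 2013: 25310, 2014: 19394, 2015: 66040, 2016: 26051,
    2017: 20557, 2018: 61557, 2019: 18477, 2020: 8253, 2021: 1486,
}


def rolling_mean_3yr(counts, first=2009, last=2020):
    """Centered 3-year rolling mean over the years first..last (inclusive)."""
    # Chop off the sparse edge years before smoothing (an assumption about the order)
    kept = {y: c for y, c in counts.items() if first <= y <= last}
    smoothed = {}
    for year in sorted(kept):
        window = [kept[y] for y in (year - 1, year, year + 1) if y in kept]
        smoothed[year] = sum(window) / len(window)
    return smoothed


smoothed = rolling_mean_3yr(counts)
for year, value in smoothed.items():
    print(year, round(value))
```

The spikes in the "middle years" flatten out because every 3-year window contains exactly one middle year.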
