From fad041140b2203894ed3589c86621841228c4ccd Mon Sep 17 00:00:00 2001
From: seabbs
Date: Wed, 16 Oct 2019 15:35:31 +0000
Subject: [PATCH] complete paper+analysis scaffold

---
 README.Rmd                                    | 21 +++++-
 README.md                                     | 30 +++++++-
 docs/articles/paper.html                      | 66 +++++++++---------
 docs/index.html                               | 22 +++++-
 .../reference/account_for_nested_missing.html |  2 +-
 docs/reference/make_results_folder.html       |  2 +-
 .../reference/model_variable_missingness.html |  2 +-
 docs/reference/nested_missing_table.html      |  2 +-
 docs/reference/plot_nested_missing.html       |  2 +-
 docs/reference/pull_results.html              |  2 +-
 docs/reference/read_data.html                 |  2 +-
 docs/reference/save_data.html                 |  2 +-
 docs/reference/save_figure.html               |  2 +-
 docs/reference/show_figure.html               |  2 +-
 docs/reference/summarise_missingness.html     |  2 +-
 vignettes/drafts/paper/paper.docx             | Bin 230754 -> 230626 bytes
 vignettes/drafts/paper/paper.html             | 64 +++++++++--------
 vignettes/paper.Rmd                           | 50 ++++++-------
 18 files changed, 171 insertions(+), 104 deletions(-)

diff --git a/README.Rmd b/README.Rmd
index caaa940..467b8b8 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -20,11 +20,28 @@ knitr::opts_chunk$set(
 ## Background
+The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system - with a similar structure to other such systems - that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often has a large amount of missing data which may not be fully accounted for when used in analyses. This study explores the evidence for associations between missingness in several key outcomes and demographic variables. Any such associations may introduce bias if not accounted for.
+
 ## Methods
-## Results
-## Conclusions
+* Introduce ETS
+* Data extraction and management
+* Structure of the ETS
+* Data completeness
+* Drivers of variable completeness (regression)
+
+## Results *Copy from bottom*
+
+* Missing structure
+* Drivers of variable completeness
+
+## Conclusions
+
+* Surveillance data is likely to have a high degree of missing data. In the ETS, missingness for key outcomes is associated with demographic factors such as....
+* To avoid biasing analyses, studies should make use of imputed data - rather than complete case analysis - and extend their imputation models to other demographic variables that may not be included in the analysis model.
+* This analysis should be repeated in other datasets - for this reason the code is available as an R package.
+
 ## Reproducibility
diff --git a/README.md b/README.md
index bd1a4ca..40d32bc 100644
--- a/README.md
+++ b/README.md
@@ -10,12 +10,40 @@ Brooks-Pollock
 ## Background
+The Enhanced Tuberculosis Surveillance (ETS) system is a routine
+surveillance system - with a similar structure to other such systems -
+that collects data on all notified tuberculosis (TB) cases in England.
+It is routinely used to study the epidemiology of TB. Routine data often
+has a large amount of missing data which may not be fully accounted for
+when used in analyses. This study explores the evidence for associations
+between missingness in several key outcomes and demographic variables.
+Any such associations may introduce bias if not accounted for.
+
 ## Methods
-## Results
+  - Introduce ETS
+  - Data extraction and management
+  - Structure of the ETS
+  - Data completeness
+  - Drivers of variable completeness (regression)
+
+## Results *Copy from bottom*
+
+  - Missing structure
+  - Drivers of variable completeness
 ## Conclusions
+  - Surveillance data is likely to have a high degree of missing data.
+    In the ETS, missingness for key outcomes is associated with demographic
+    factors such as….
+  - To avoid biasing analyses, studies should make use of imputed data -
+    rather than complete case analysis - and extend their imputation
+    models to other demographic variables that may not be included in
+    the analysis model.
+  - This analysis should be repeated in other datasets - for this reason
+    the code is available as an R package.
+
 ## Reproducibility
 ### Repository structure
diff --git a/docs/articles/paper.html b/docs/articles/paper.html
index 44e7258..8b5aebb 100644
--- a/docs/articles/paper.html
+++ b/docs/articles/paper.html
@@ -100,16 +100,12 @@

Methods

-

We obtained all TB notifications for 2000-2015 in England from the ETS. We gave an overview of the structure of the ETS and the steps taken to clean the data for analysis.

-

We considered five outcomes: All-cause mortality, death due to TB (in those who died), recurrent TB, pulmonary disease, and sputum smear status. We used logistic regression, with complete case analysis, to investigate each outcome with BCG vaccination, years since vaccination and age at vaccination, adjusting for potential confounders. All analyses were repeated using multiply imputed data.

  • Introduce ETS
  • Data extraction and management
  • -
  • Structure of the ETS (results)
  • -
  • Data completeness (motivation)
  • -
  • Data completeness method
  • -
  • Structure of missingness in the ETS
  • -
  • Variables not completed pre and post 2008
  • +
  • Structure of the ETS
  • +
  • Data completeness
  • +
  • Drivers of variable completeness (regression)
@@ -117,24 +113,29 @@

Results Copy from bottom

    -
  • Table 1
  • Missing structure
  • -
  • Associations of missingness
  • +
  • Drivers of variable completeness

Conclusions

+
    +
  • Surveillance data is likely to have a high degree of missing data. In the ETS, missingness for key outcomes is associated with demographic factors such as….
  • +
  • To avoid biasing analyses, studies should make use of imputed data - rather than complete case analysis - and extend their imputation models to other demographic variables that may not be included in the analysis model.
  • +
  • This analysis should be repeated in other datasets - for this reason the code is available as an R package.
  • +

Introduction

Background

-

The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system - with a similar structure to other such systems - that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often has a large amount of missing data which may not be fully accounted for when used in analyses. This study explores the evidence for associations between missingness in several key outcomes and demographic variables. Any such associations may introduce bias if not accounted for.

+

The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system - with a similar structure to other such systems - that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often has a large amount of missing data which may not be fully accounted for when used in analyses.

Detail

-

Missing data can take several forms, data that are missing completely at random (MCAR), data that are missing at random (MAR) and data that are missing not at random (MNAR).[1] Data that are MAR are missing with a mechanism that is conditional on observed variables, whilst MNAR are missing with a mechanism that is conditional on variables that are not observed. Data that is MAR, and MNAR may lead to biases when analysing the data, however it is not possible to deduce from the observed data what the mechanism driving missing data is. Therefore, it is necessary to account for these potential biases during the analysis stage. This is possible using a variety of methods such as scenario analysis accounting for the ‘best’ and ‘worst’ case scenarios, and multiple imputation of missing data using additional variables in the dataset to inform the imputation model.[1]

+

Missing data can take several forms: data that are missing completely at random (MCAR), data that are missing at random (MAR) and data that are missing not at random (MNAR).[1] Data that are MAR are missing with a mechanism that is conditional on observed variables, whilst data that are MNAR are missing with a mechanism that is conditional on variables that are not observed. Data that are MAR or MNAR may lead to biases when analysing the data; however, it is not possible to deduce from the observed data what the mechanism driving missing data is. Therefore, it is necessary to account for these potential biases during the analysis stage. This is possible using a variety of methods such as scenario analysis accounting for the ‘best’ and ‘worst’ case scenarios, and multiple imputation of missing data using additional variables in the dataset to inform the imputation model.[1] Common practice is to include all variables used in the analysis in the imputation model; these variables may or may not be those at most risk of introducing bias due to an MAR mechanism.

Aim

+

This study aims to explore the evidence for associations between missingness in several key outcomes and demographic variables. Any such associations may introduce bias if not accounted for.

@@ -156,17 +157,17 @@

Data completeness

-

As the ETS is aggregated across England, from a variety of sources, some level of missing data are inevitable. This takes two forms: under-reporting of notified cases, of which there is some evidence in the literature,[4] and data missing for a notified case. The former is particularly problematic as apart from using comparative studies the characteristics of those that are not notified is unknown. For variables that are missing data within the dataset it is possible to calculate the proportion of missing data but care must be taken to account for nested variables such as date of death and year of BCG vaccination. This can be done by assuming that the nested variables takes the value of the top level variable when it is known that the variable is not truly missing. An example of this is using overall outcome for date of death when notifications are known to have not died.

+

As the ETS is aggregated across England, from a variety of sources, missing data are inevitable. This takes two forms: under-reporting of notified cases, of which there is some evidence in the literature,[4] and data missing for a notified case. The former is particularly problematic as, apart from using comparative studies, the characteristics of those that are not notified are unknown. For variables that are missing data within the dataset it is possible to calculate the proportion of missing data, but care must be taken to account for nested variables such as date of death and year of BCG vaccination. To do this we have assumed that nested variables take the value of the top level variable when it is known that the variable is not truly missing. An example of this is using overall outcome for date of death when notifications are known to have not died. This then allows us to estimate the proportion of these variables that are truly missing.

+

For nested variables with rare outcomes, assuming the top level variable value can mask the underlying amount of missing data. We implemented an alternative approach which filtered the data for the top level variable required for the nested variable to be defined and then computed the proportion of these notifications that were missing data for the outcome of interest.
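
The difference between the two approaches can be sketched with a toy example. This is a minimal illustration in base R on made-up data, not the ETSMissing implementation; the column names `overalloutcome` and `dateofdeath` follow the variable names listed in the Figure 1 caption, while `toy`, `died` and all values are hypothetical.

```r
# Toy notifications (made-up values): three deaths, one of which lacks a date of death.
toy <- data.frame(
  overalloutcome = c("Died", "Died", "Died",
                     "Treatment completed", "Treatment completed", "Treatment completed"),
  dateofdeath = c("2012-03-01", NA, "2013-07-15", NA, NA, NA),
  stringsAsFactors = FALSE
)

died <- toy$overalloutcome %in% "Died"

# Replacement: notifications known not to have died get an explicit "N/A" entry,
# so only truly missing dates of death remain NA.
replaced <- ifelse(died, toy$dateofdeath, "N/A")
prop_missing_replacement <- mean(is.na(replaced))

# Filtering: restrict to notifications for which date of death is defined
# (those who died), then compute the proportion missing among them.
prop_missing_filtered <- mean(is.na(toy$dateofdeath[died]))
```

With these made-up values the replacement approach reports 1 of 6 notifications as missing, while filtering reports 1 of 3 deaths as missing a date, which is how a rare top level outcome can mask missingness in the nested variable.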

Drivers of Variable completeness

-

Missing data may be MAR or MNAR, which may introduce biases into any analyses based on these data. Unfortunately MNAR data cannot be detected, so bias from this source cannot be discounted. However, it is possible to detect potential MAR mechanisms from observed variables that would not necessarily be included in a model used for analysis. Although these associations may themselves be caused by an external factor. In the following section we explore variables associated with data being missing for several key variables including: BCG status, year of BCG vaccination, date of death, cause of death, date of symptom onset, date of diagnosis, date of starting treatment and date of ending treatment.

-

In order to explore the drivers of missing data we reformulated the problem as a logistic regression for each variable of interest, with the outcome being data completeness (complete/missing). This allows variables that are hypothesised to be related to missing data to be adjusted for and their independent impact on data completeness to be estimated. Unlike classic approaches to missing data, such as multiple imputation by chained regression (MICE),[5] this is not an imputation.

-
-
-

-Statistical analysis

+

Missing data may be MAR or MNAR, which may introduce biases into any analyses based on these data. Unfortunately MNAR data cannot be detected, so bias from this source cannot be discounted. However, it is possible to detect potential MAR mechanisms from observed variables that would not necessarily be included in a model used for analysis. Here we develop a method for this and apply it to several key outcomes including: BCG status, year of BCG vaccination, date of death, cause of death, date of symptom onset, date of diagnosis, date of starting treatment and date of ending treatment.

+

We reformulated the problem as a logistic regression for each variable of interest, with the outcome being data completeness (complete/missing). This allows variables that are hypothesised to be related to missing data to be adjusted for and their independent impact on data completeness to be estimated. This approach does not account for missingness within explanatory variables.

+
+

+Method

In order to reformulate missing data as a logistic regression we took the following steps:

  1. For the variable of interest create a new temporary binary variable, called data status, that is “Missing” when the variable of interest is missing and “Complete” when it is not. Specify “Complete” as the baseline.

  2. @@ -175,11 +176,12 @@

  3. Fit a logistic regression model with the temporary data status variable as the outcome, adjusting for the hypothesised drivers of missingness.

  4. Exponentiate the returned coefficients, and confidence intervals so that they represent Odds Ratios (ORs).

  5. Refit the model, dropping each variable in turn and then comparing the updated model with the full model using a likelihood ratio test.

  6. -
  7. Interpret the results, using the estimated size of the effect, the width of the confidence intervals and the size of the Wald and likelihood ratio test p values to determine which variables are related to missingness for the variable of interest. Evidence should be interpreted on a spectrum, rather than using arbitrary significance cut-offs.[6] To avoid issues of multiple testing the level of evidence should be weighted based on the number of variables adjusted for and the number of outcomes explored.

  8. +
  9. Interpret the results, using the estimated size of the effect, the width of the confidence intervals and the size of the Wald and likelihood ratio test p values to determine which variables are related to missingness for the variable of interest. Evidence should be interpreted on a spectrum, rather than using arbitrary significance cut-offs.[5] To avoid issues of multiple testing the level of evidence should be weighted based on the number of variables adjusted for and the number of outcomes explored.

For all outcomes considered we adjusted for the same set of demographic variables that were highly complete, plausibly linked to missingness for all outcomes considered, and likely to be present in other comparable surveillance datasets. These were: year, sex, age (grouped as 0-14 year olds, 15-65 year olds and 65+), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic status, 1 indicates the most deprived quintile. Complete case analysis was used, with the dataset limited to notifications from 2010 onwards as socio-economic status was not collected prior to this.
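
The steps listed above can be sketched in base R as follows. This is a minimal sketch on simulated data, assuming a MAR mechanism in which missingness depends on UK birth status; it is not the package's own modelling function. The column names (`sex`, `ukborn`, `year`, `bcgvacc`) echo the variable names in the Figure 1 caption, and `sim`, `data_status`, `or_table` and `lrt` are hypothetical names. Only the steps visible in this excerpt are covered.

```r
# Simulated notifications (hypothetical values, not the ETS extract).
set.seed(2019)
n <- 2000
sim <- data.frame(
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  ukborn = factor(sample(c("Non-UK born", "UK born"), n, replace = TRUE)),
  year = sample(2010:2015, n, replace = TRUE)
)

# Assumed MAR mechanism: BCG status is missing more often for UK born cases.
p_missing <- ifelse(sim$ukborn == "UK born", 0.4, 0.2)
sim$bcgvacc <- sample(c("Yes", "No"), n, replace = TRUE)
sim$bcgvacc[runif(n) < p_missing] <- NA

# Create the binary data status variable with "Complete" as the baseline.
sim$data_status <- factor(
  ifelse(is.na(sim$bcgvacc), "Missing", "Complete"),
  levels = c("Complete", "Missing")
)

# Fit a logistic regression with data status as the outcome.
fit <- glm(data_status ~ sex + ukborn + year, family = binomial(), data = sim)

# Exponentiate coefficients and Wald confidence intervals to obtain odds ratios.
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))

# Drop each variable in turn and compare with the full model
# using a likelihood ratio test.
lrt <- drop1(fit, test = "LRT")
```

Because the first factor level is treated as the reference outcome by `glm()`, the odds ratios in `or_table` describe the odds of the variable being missing rather than complete.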

+

Patient and public involvement

@@ -192,7 +194,7 @@

Data completeness

-

Doing this shows high completeness for common demographic variables such as sex, age, ethnic group and UK birth status (Figure 1, Table 1). More problematically, BCG status and year of BCG status have a high percentage missing, even after accounting for the introduction of national collection of these variables in 2008. Socio-economic status (as national quintiles) was not collected until 2010 but after this point is highly complete. Comparing pre 2009 and post 2008 in Table 1 (and by inspecting Figure 1) there are also issues of changing completeness over time,[2,7] if this is not accounted for than it may lead to spurious trends. Figure 1 also indicates that there are multiple groups of variables that share a common pattern of missing data.

+

We found high completeness for common demographic variables such as sex, age, ethnic group and UK birth status (Figure 1, Table 1). More problematically, BCG status and year of BCG vaccination had a high percentage missing, even after accounting for the introduction of national collection of these variables in 2008.[2] Socio-economic status (as national quintiles) was not collected until 2010 but after this point is highly complete.[2] Comparing pre 2009 and post 2008 in Table 1 (and Figure 1), completeness changes over time,[2,6] which may lead to spurious trends if not adjusted for. Figure 1 also indicates that there are multiple groups of variables that share a correlated pattern of missing data.


Figure 1: Summary plot of missing data in the extract of the ETS data used in this thesis. Due to the large size of the dataset, the data has been sub-sampled with only 20% of the data shown in this figure. Notifications have been ordered by date of notification from left to right. The following subset of variables are shown: year (year), sex (sex), age (age), PHE Centre (phec), Occupation (occat), Ethnic group (ethgrp), UK birth status (ukborn), Time since entry (timesinceent), date of symptom onset (symptonset), date of diagnosis (datediag), started treatment (startedtreat), date of starting treatment (starttreatdate), treatment end date (txenddate), pulmonary or extra-pulmonary TB (pulmextrapulm), culture (culture), sputum smear status (sputsmear), drug resistance (anyres), previous diagnosis (prevdiag), BCG status(bcgvacc), Year of BCG vaccination (bcgvaccyr), overall outcome (overalloutcome), cause of death (tomdeathrelate), socio-economic status quintiles (natquintile), and date of death (dateofdeath). Nested variables have been accounted for (i.e date of death has had an entry added for cases that are known to have not died), so that true missingness for all variables is estimated. @@ -385,7 +387,7 @@

-

For nested variables with rare outcomes assuming the top level variable value can mask the underlying amount of missing data. An alternative approach is to filter the data for the top level variable required for the nested variable to be defined and to then compute the proportion of these notifications that are missing data for the outcome of interest. For the date of starting treatment this approach leads to an estimate of 5.9% (6434/108410) being missing, which is more complete than previously estimated. For cases that are known to have completed treatment 16.5% (13804/83891) are missing a date for the end of treatment. In notifications that are known to have died, 26.6% (1592/5976) were missing the date of death and 44.9% (2686/5976) were missing the cause of death. In any analysis where these variables are used the missing data for these variables will need to be carefully adjusted for. In particular, if cause of death is used it must be clearly stated that it is highly missing and results based on this variable should be properly caveated.

+

By filtering nested variables - rather than by using replacement - we found the date of starting treatment was 5.9% (6434/108410) missing, which is more complete than previously estimated. For cases that were known to have completed treatment 16.5% (13804/83891) were missing a date for the end of treatment. In notifications that were known to have died, 26.6% (1592/5976) were missing the date of death and 44.9% (2686/5976) were missing the cause of death.

@@ -2567,15 +2569,15 @@

Statement of primary findings

In the ETS system we found a high degree of missing data for several important variables. We also found that there is likely to be a strong missing at random (MAR) mechanism underlying this missing data for multiple variables. Several factors are strongly associated with data being missing for many variables, including UK birth status, ethnic group, socio-economic status and year. These MAR mechanisms must be adjusted for in studies using this data to avoid introducing bias. We found that date variables in particular suffered from changing data completeness over time, which may introduce spurious temporal trends if not fully understood.

-
    -
  • The following analysis is not currently in the paper but it was in the chapter - is there a case for including?*
  • -
+

The following analysis is not currently in the paper but it was in the chapter - is there a case for including?

We also found that for several variables, including the date of symptom onset, there was a large degree of recall bias when aggregating by day or month. Several variables, including date of notification and date of starting treatment, showed a seasonal trend with a maximum in the summer months. The date of ending treatment showed less evidence of a seasonal trend.

Strengths and limitations of the study

-

Routine observational datasets are subject to numerous potential biases, such as selection bias, recall bias, measurement bias, and unmeasured confounding.[8] Additionally, as the data has not been collected with a specific analysis in mind there maybe issues with the specificity of variables. The ETS system is likely to suffer from all of the above biases to some extent, which must be accounted for as far as possbile, and explicitly stated at every level of analysis. The most important consideration is that the ETS system is unlikely to be representative of the general population as it contains only notified TB cases that occurred in England during the study period, research questions must therefore be either limited to active TB patients, or when extended to the general population the differing population demographics must be accounted for. If this is not done then any results may be due to selection bias. Additionally, multiple variables may suffer from misclassification bias, including BCG status which can be assessed via vaccination record, the presence of a scar, or case recall: this may lead to spurious associations.[9] Validation studies would be required to account for this.

+

Work in progress - copied from chapter text

+

Routine observational datasets are subject to numerous potential biases, such as selection bias, recall bias, measurement bias, and unmeasured confounding.[7] Additionally, as the data has not been collected with a specific analysis in mind there may be issues with the specificity of variables. The ETS system is likely to suffer from all of the above biases to some extent, which must be accounted for as far as possible, and explicitly stated at every level of analysis. The most important consideration is that the ETS system is unlikely to be representative of the general population as it contains only notified TB cases that occurred in England during the study period; research questions must therefore be either limited to active TB patients, or when extended to the general population the differing population demographics must be accounted for. If this is not done then any results may be due to selection bias. Additionally, multiple variables may suffer from misclassification bias, including BCG status which can be assessed via vaccination record, the presence of a scar, or case recall: this may lead to spurious associations.[8] Validation studies would be required to account for this.

+

Unlike classic approaches to missing data, such as multiple imputation by chained equations (MICE),[9] this is not an imputation.

@@ -2616,20 +2618,20 @@

4 Pillaye J, Clarke A. An evaluation of completeness of tuberculosis notification in the United Kingdom. BMC Public Health 2003;3:31.

-
-

5 Groothuis-oudshoorn K. Journal of Statistical Software MICE : Multivariate Imputation by Chained.;VV.

-
-

6 Sterne JA, Davey Smith G. Sifting the evidence-what’s wrong with significance tests? Bmj 2001;322:226–31.

+

5 Sterne JA, Davey Smith G. Sifting the evidence-what’s wrong with significance tests? Bmj 2001;322:226–31.

-

7 PHE. Tuberculosis in England 2016 Report (presenting data to end of 2015). 2016.

+

6 PHE. Tuberculosis in England 2016 Report (presenting data to end of 2015). 2016.

-

8 Benchimol EI, Smeeth L, Guttmann A et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. The American Statistician 2016;115-116:1–22.

+

7 Benchimol EI, Smeeth L, Guttmann A et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. The American Statistician 2016;115-116:1–22.

-

9 Fewell Z, Davey Smith G, Sterne JAC. The impact of residual and unmeasured confounding in epidemiologic studies: A simulation study. American Journal of Epidemiology 2007;166:646–55.

+

8 Fewell Z, Davey Smith G, Sterne JAC. The impact of residual and unmeasured confounding in epidemiologic studies: A simulation study. American Journal of Epidemiology 2007;166:646–55.

+
+
+

9 van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 2011;45:1–67.

diff --git a/docs/index.html b/docs/index.html
index 7b3f6d9..98e5ff4 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -79,18 +79,36 @@

Background

+

The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system - with a similar structure to other such systems - that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often has a large amount of missing data which may not be fully accounted for when used in analyses. This study explores the evidence for associations between missingness in several key outcomes and demographic variables. Any such associations may introduce bias if not accounted for.

Methods

+
    +
  • Introduce ETS
  • +
  • Data extraction and management
  • +
  • Structure of the ETS
  • +
  • Data completeness
  • +
  • Drivers of variable completeness (regression)
  • +
-
+

-Results

+Results Copy from bottom +

+
    +
  • Missing structure
  • +
  • Drivers of variable completeness
  • +

Conclusions

+
    +
  • Surveillance data is likely to have a high degree of missing data. In the ETS, missingness for key outcomes is associated with demographic factors such as….
  • +
  • To avoid biasing analyses, studies should make use of imputed data - rather than complete case analysis - and extend their imputation models to other demographic variables that may not be included in the analysis model.
  • +
  • This analysis should be repeated in other datasets - for this reason the code is available as an R package.
  • +

diff --git a/docs/reference/account_for_nested_missing.html b/docs/reference/account_for_nested_missing.html
index eb9d33d..b431f4b 100644
--- a/docs/reference/account_for_nested_missing.html
+++ b/docs/reference/account_for_nested_missing.html
@@ -187,7 +187,7 @@ Examp
 #> "Died", dateofdeath, "N/A") %>% as.factor) %>% mutate(timesinceent = ifelse(ukborn %in%
 #> "UK born", "N/A", timesinceent))
 #> }
-#> <bytecode: 0x561f5bad3d18>
+#> <bytecode: 0x55982c2bca98>
 #> <environment: namespace:ETSMissing>

Methods

-

We obtained all TB notifications for 2000-2015 in England from the ETS. We gave an overview of the structure of the ETS and the steps taken to clean the data for analysis.

-

We considered five outcomes: All-cause mortality, death due to TB (in those who died), recurrent TB, pulmonary disease, and sputum smear status. We used logistic regression, with complete case analysis, to investigate each outcome with BCG vaccination, years since vaccination and age at vaccination, adjusting for potential confounders. All analyses were repeated using multiply imputed data.

  • Introduce ETS
  • Data extraction and management
  • -
  • Structure of the ETS (results)
  • -
  • Data completeness (motivation)
  • -
  • Data completeness method
  • -
  • Structure of missingness in the ETS
  • -
  • Variables not completed pre and post 2008
  • +
  • Structure of the ETS
  • +
  • Data completeness
  • +
  • Drivers of variable completeness (regression)

Results Copy from bottom

    -
  • Table 1
  • Missing structure
  • -
  • Associations of missingness
  • +
  • Drivers of variable completeness

Conclusions

+
    +
  • Surveillance data is likely to have a high degree of missing data. In the ETS, missingness for key outcomes is associated with demographic factors such as….
  • +
  • To avoid biasing analyses, studies should make use of imputed data - rather than complete case analysis - and extend their imputation models to other demographic variables that may not be included in the analysis model.
  • +
  • This analysis should be repeated in other datasets - for this reason the code is available as an R package.
  • +

Introduction

Background

-

The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system - with a similar structure to other such systems - that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often has a large amount of missing data which may not be fully accounted for when used in analyses. This study explores the evidence for associations between missingness in several key outcomes and demographic variables. Any such associations may introduce bias if not accounted for.

+

The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system - with a similar structure to other such systems - that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often has a large amount of missing data which may not be fully accounted for when used in analyses.

Detail

-

Missing data can take several forms, data that are missing completely at random (MCAR), data that are missing at random (MAR) and data that are missing not at random (MNAR).[1] Data that are MAR are missing with a mechanism that is conditional on observed variables, whilst MNAR are missing with a mechanism that is conditional on variables that are not observed. Data that is MAR, and MNAR may lead to biases when analysing the data, however it is not possible to deduce from the observed data what the mechanism driving missing data is. Therefore, it is necessary to account for these potential biases during the analysis stage. This is possible using a variety of methods such as scenario analysis accounting for the ‘best’ and ‘worst’ case scenarios, and multiple imputation of missing data using additional variables in the dataset to inform the imputation model.[1]

+

Missing data can take several forms: data that are missing completely at random (MCAR), data that are missing at random (MAR) and data that are missing not at random (MNAR).[1] Data that are MAR are missing with a mechanism that is conditional on observed variables, whilst data that are MNAR are missing with a mechanism that is conditional on variables that are not observed. Data that are MAR or MNAR may lead to bias when analysed; however, it is not possible to deduce the mechanism driving missing data from the observed data alone. Therefore, it is necessary to account for these potential biases during the analysis stage. This is possible using a variety of methods such as scenario analysis accounting for the ‘best’ and ‘worst’ case scenarios, and multiple imputation of missing data using additional variables in the dataset to inform the imputation model.[1] Common practice is to include all analysis variables in the imputation model; these variables may or may not be those most at risk of introducing bias through a MAR mechanism.
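These three mechanisms are often written in the following notation. This is a sketch added for orientation rather than notation taken from this paper: $R$ is assumed to be the indicator of which values are missing, and $Y_{\text{obs}}$ and $Y_{\text{mis}}$ the observed and unobserved parts of the data.

$$
\begin{aligned}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ depends on } Y_{\text{mis}} \text{ even after conditioning on } Y_{\text{obs}}
\end{aligned}
$$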

Aim

+

This study aims to explore the evidence for associations between demographic variables and missingness in several key outcomes. Any such associations may introduce bias if not accounted for.

Methods

@@ -429,15 +430,15 @@

Structure of the ETS

Data completeness

-

As the ETS is aggregated across England, from a variety of sources, some level of missing data are inevitable. This takes two forms: under-reporting of notified cases, of which there is some evidence in the literature,[4] and data missing for a notified case. The former is particularly problematic as apart from using comparative studies the characteristics of those that are not notified is unknown. For variables that are missing data within the dataset it is possible to calculate the proportion of missing data but care must be taken to account for nested variables such as date of death and year of BCG vaccination. This can be done by assuming that the nested variables takes the value of the top level variable when it is known that the variable is not truly missing. An example of this is using overall outcome for date of death when notifications are known to have not died.

+

As the ETS is aggregated across England, from a variety of sources, missing data are inevitable. This takes two forms: under-reporting of notified cases, of which there is some evidence in the literature,[4] and data missing for a notified case. The former is particularly problematic as, apart from through comparative studies, the characteristics of those that are not notified are unknown. For variables with missing data within the dataset it is possible to calculate the proportion missing, but care must be taken to account for nested variables such as date of death and year of BCG vaccination. To do this we assumed that a nested variable takes the value of its top level variable when it is known that the nested variable is not truly missing. An example of this is using overall outcome for date of death when notifications are known to have not died. This then allows us to estimate the proportion of these variables that are truly missing.

+

For nested variables with rare outcomes, assuming the top level variable value can mask the underlying amount of missing data. We implemented an alternative approach which filtered the data for the top level variable required for the nested variable to be defined and then computed the proportion of these notifications that were missing data for the outcome of interest.
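The two ways of handling nested variables can be sketched as follows. This is a minimal illustration on made-up data, not the implementation used in the package: the function names `prop_missing_replacement()` and `prop_missing_filtered()`, the toy values and the outcome labels ("Died", "Treatment completed") are all hypothetical. Only the variable names `overalloutcome` and `dateofdeath` come from the ETS extract described here.

```r
# Toy notifications (hypothetical values): date of death is only defined
# for notifications whose overall outcome is "Died".
notifications <- data.frame(
  overalloutcome = c(rep("Died", 3), rep("Treatment completed", 6), NA),
  dateofdeath = as.Date(c("2012-03-01", NA, "2013-07-15", rep(NA, 7)))
)

# Replacement approach: a nested value is not counted as missing when the
# top level variable shows that it could never have been recorded.
prop_missing_replacement <- function(data, nested, top, defined_when) {
  not_applicable <- !is.na(data[[top]]) & data[[top]] != defined_when
  truly_missing <- is.na(data[[nested]]) & !not_applicable
  mean(truly_missing)
}

# Filtering approach: keep only notifications for which the nested variable
# is defined, then compute the proportion missing among those.
prop_missing_filtered <- function(data, nested, top, defined_when) {
  applicable <- !is.na(data[[top]]) & data[[top]] == defined_when
  mean(is.na(data[[nested]][applicable]))
}

replacement <- prop_missing_replacement(
  notifications, "dateofdeath", "overalloutcome", "Died"
)
filtered <- prop_missing_filtered(
  notifications, "dateofdeath", "overalloutcome", "Died"
)
```

On this toy data the replacement approach gives a smaller proportion missing than the filtering approach, because notifications that did not die dominate the denominator. This is the masking effect described for nested variables with rare outcomes.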

Drivers of Variable completeness

-

Missing data may be MAR or MNAR, which may introduce biases into any analyses based on these data. Unfortunately MNAR data cannot be detected, so bias from this source cannot be discounted. However, it is possible to detect potential MAR mechanisms from observed variables that would not necessarily be included in a model used for analysis. Although these associations may themselves be caused by an external factor. In the following section we explore variables associated with data being missing for several key variables including: BCG status, year of BCG vaccination, date of death, cause of death, date of symptom onset, date of diagnosis, date of starting treatment and date of ending treatment.

-

In order to explore the drivers of missing data we reformulated the problem as a logistic regression for each variable of interest, with the outcome being data completeness (complete/missing). This allows variables that are hypothesised to be related to missing data to be adjusted for and their independent impact on data completeness to be estimated. Unlike classic approaches to missing data, such as multiple imputation by chained regression (MICE),[5] this is not an imputation.

-
-
-

Statistical analysis

+

Missing data may be MAR or MNAR, which may introduce biases into any analyses based on these data. Unfortunately MNAR data cannot be detected, so bias from this source cannot be discounted. However, it is possible to detect potential MAR mechanisms from observed variables that would not necessarily be included in a model used for analysis. Here we develop a method for this and apply it to several key outcomes including: BCG status, year of BCG vaccination, date of death, cause of death, date of symptom onset, date of diagnosis, date of starting treatment and date of ending treatment.

+

We reformulated the problem as a logistic regression for each variable of interest, with the outcome being data completeness (complete/missing). This allows variables that are hypothesised to be related to missing data to be adjusted for and their independent impact on data completeness to be estimated. This approach does not account for missingness within explanatory variables.

+
+

Method

In order to reformulate missing data as a logistic regression we took the following steps:

  1. For the variable of interest create a new temporary binary variable, called data status, that is “Missing” when the variable of interest is missing and “Complete” when it is not. Specify “Complete” as the baseline.

  2. @@ -446,11 +447,12 @@

    Statistical analysis

  3. Fit a logistic regression model with the temporary data status variable as the outcome, adjusting for the hypothesised drivers of missingness.

  4. Exponentiate the returned coefficients and confidence intervals so that they represent odds ratios (ORs).

  5. Refit the model, dropping each variable in turn and then comparing the updated model with the full model using a likelihood ratio test.

  6. -
  7. Interpret the results, using the estimated size of the effect, the width of the confidence intervals and the size of the Wald and likelihood ratio test p values to determine which variables are related to missingness for the variable of interest. Evidence should be interpreted on a spectrum, rather than using arbitrary significance cut-offs.[6] To avoid issues of multiple testing the level of evidence should be weighted based on the number of variables adjusted for and the number of outcomes explored.

  8. +
  9. Interpret the results, using the estimated size of the effect, the width of the confidence intervals and the size of the Wald and likelihood ratio test p values to determine which variables are related to missingness for the variable of interest. Evidence should be interpreted on a spectrum, rather than using arbitrary significance cut-offs.[5] To avoid issues of multiple testing the level of evidence should be weighted based on the number of variables adjusted for and the number of outcomes explored.
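The modelling steps listed above can be sketched in base R as below. This is an illustration under stated assumptions, not the package's own code: the data are simulated, the variable values are made up, only two explanatory variables are used, and Wald intervals from `confint.default()` are an assumed choice of confidence interval. The variable names `bcgvacc`, `ukborn` and `sex` follow the ETS extract; everything else is hypothetical.

```r
set.seed(42)
n <- 2000

# Simulated notifications (hypothetical values).
cases <- data.frame(
  ukborn = factor(
    sample(c("UK born", "Non-UK born"), n, replace = TRUE),
    levels = c("UK born", "Non-UK born")
  ),
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE))
)

# Simulate a MAR mechanism: BCG status is more often missing for
# non-UK born cases (assumed probabilities, for illustration only).
p_missing <- ifelse(cases$ukborn == "Non-UK born", 0.4, 0.2)
cases$bcgvacc <- ifelse(runif(n) < p_missing, NA, "Yes")

# Create the binary data status variable with "Complete" as the baseline.
cases$data_status <- factor(
  ifelse(is.na(cases$bcgvacc), "Missing", "Complete"),
  levels = c("Complete", "Missing")
)

# Fit a logistic regression with data status as the outcome.
# With a factor outcome, glm() models the probability of the second level.
fit <- glm(data_status ~ ukborn + sex, family = binomial(), data = cases)

# Exponentiate coefficients and Wald confidence intervals to get ORs.
odds_ratios <- exp(cbind(OR = coef(fit), confint.default(fit)))

# Drop each variable in turn and compare with the full model using a
# likelihood ratio test.
lrt <- drop1(fit, test = "Chisq")
```

`odds_ratios` then holds one row per model term, and `lrt` one row per dropped variable, which together supply the effect sizes, interval widths and p values used in the interpretation step.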

For all outcomes considered we adjusted for the same set of demographic variables, chosen because they were highly complete, plausibly linked to missingness for all outcomes considered, and likely to be present in other comparable surveillance datasets. These were: year, sex, age (grouped as 0-14 year olds, 15-65 year olds and 65+), ethnic group, UK birth status and socio-economic status (national quintiles). For socio-economic status, 1 indicates the most deprived quintile. Complete case analysis was used, with the dataset limited to notifications from 2010 onwards as socio-economic status was not collected prior to this.

+

Patient and public involvement

We did not involve patients or the public in the design or planning of this study.

@@ -460,7 +462,7 @@

Patient and public involvement

Results

Data completeness

-

Doing this shows high completeness for common demographic variables such as sex, age, ethnic group and UK birth status (Figure 1, Table 1). More problematically, BCG status and year of BCG status have a high percentage missing, even after accounting for the introduction of national collection of these variables in 2008. Socio-economic status (as national quintiles) was not collected until 2010 but after this point is highly complete. Comparing pre 2009 and post 2008 in Table 1 (and by inspecting Figure 1) there are also issues of changing completeness over time,[2,7] if this is not accounted for than it may lead to spurious trends. Figure 1 also indicates that there are multiple groups of variables that share a common pattern of missing data.

+

We found high completeness for common demographic variables such as sex, age, ethnic group and UK birth status (Figure 1, Table 1). More problematically, BCG status and year of BCG vaccination had a high percentage missing, even after accounting for the introduction of national collection of these variables in 2008.[2] Socio-economic status (as national quintiles) was not collected until 2010 but was highly complete after this point.[2] Comparing pre 2009 and post 2008 in Table 1 (and Figure 1), we see that completeness changes over time,[2,6] which may lead to spurious trends if not adjusted for. Figure 1 also indicates that there are multiple groups of variables that share a correlated pattern of missing data.

Figure 1: Summary plot of missing data in the extract of the ETS data used in this thesis. Due to the large size of the dataset, the data has been sub-sampled with only 20% of the data shown in this figure. Notifications have been ordered by date of notification from left to right. The following subset of variables are shown: year (year), sex (sex), age (age), PHE Centre (phec), Occupation (occat), Ethnic group (ethgrp), UK birth status (ukborn), Time since entry (timesinceent), date of symptom onset (symptonset), date of diagnosis (datediag), started treatment (startedtreat), date of starting treatment (starttreatdate), treatment end date (txenddate), pulmonary or extra-pulmonary TB (pulmextrapulm), culture (culture), sputum smear status (sputsmear), drug resistance (anyres), previous diagnosis (prevdiag), BCG status (bcgvacc), Year of BCG vaccination (bcgvaccyr), overall outcome (overalloutcome), cause of death (tomdeathrelate), socio-economic status quintiles (natquintile), and date of death (dateofdeath). Nested variables have been accounted for (i.e. date of death has had an entry added for cases that are known to have not died), so that true missingness for all variables is estimated.

@@ -656,7 +658,7 @@

Data completeness

-

For nested variables with rare outcomes assuming the top level variable value can mask the underlying amount of missing data. An alternative approach is to filter the data for the top level variable required for the nested variable to be defined and to then compute the proportion of these notifications that are missing data for the outcome of interest. For the date of starting treatment this approach leads to an estimate of 5.9% (6434/108410) being missing, which is more complete than previously estimated. For cases that are known to have completed treatment 16.5% (13804/83891) are missing a date for the end of treatment. In notifications that are known to have died, 26.6% (1592/5976) were missing the date of death and 44.9% (2686/5976) were missing the cause of death. In any analysis where these variables are used the missing data for these variables will need to be carefully adjusted for. In particular, if cause of death is used it must be clearly stated that it is highly missing and results based on this variable should be properly caveated.

+

By filtering nested variables - rather than by using replacement - we found the date of starting treatment was 5.9% (6434/108410) missing, which is more complete than previously estimated. For cases that were known to have completed treatment 16.5% (13804/83891) were missing a date for the end of treatment. In notifications that were known to have died, 26.6% (1592/5976) were missing the date of death and 44.9% (2686/5976) were missing the cause of death.

Drivers of Variable completeness

@@ -2843,14 +2845,14 @@

Discussion

Statement of primary findings

In the ETS system we found a high degree of missing data for several important variables. We also found that there is likely to be a strong missing at random (MAR) mechanism underlying this missing data for multiple variables. Several factors are strongly associated with data being missing for many variables, including UK birth status, ethnic group, socio-economic status and year. These MAR mechanisms must be adjusted for in studies using this data to avoid introducing bias. We found that date variables in particular suffered from changing data completeness over time, which may introduce spurious temporal trends if not fully understood.

-
    -
  • The following analysis is not currently in the paper but it was in the chapter - is there a case for including?*
  • -
+

The following analysis is not currently in the paper but it was in the chapter - is there a case for including?

We also found that for several variables, including the date of symptom onset, there was a large degree of recall bias when aggregating by day or month. Several variables, including date of notification and date of starting treatment, showed a seasonal trend with a maximum in the summer months. The date of ending treatment showed less evidence of a seasonal trend.

Strengths and limitations of the study

-

Routine observational datasets are subject to numerous potential biases, such as selection bias, recall bias, measurement bias, and unmeasured confounding.[8] Additionally, as the data has not been collected with a specific analysis in mind there maybe issues with the specificity of variables. The ETS system is likely to suffer from all of the above biases to some extent, which must be accounted for as far as possbile, and explicitly stated at every level of analysis. The most important consideration is that the ETS system is unlikely to be representative of the general population as it contains only notified TB cases that occurred in England during the study period, research questions must therefore be either limited to active TB patients, or when extended to the general population the differing population demographics must be accounted for. If this is not done then any results may be due to selection bias. Additionally, multiple variables may suffer from misclassification bias, including BCG status which can be assessed via vaccination record, the presence of a scar, or case recall: this may lead to spurious associations.[9] Validation studies would be required to account for this.

+

Work in progress - copied from chapter text

+

Routine observational datasets are subject to numerous potential biases, such as selection bias, recall bias, measurement bias, and unmeasured confounding.[7] Additionally, as the data has not been collected with a specific analysis in mind there may be issues with the specificity of variables. The ETS system is likely to suffer from all of the above biases to some extent, which must be accounted for as far as possible, and explicitly stated at every level of analysis. The most important consideration is that the ETS system is unlikely to be representative of the general population as it contains only notified TB cases that occurred in England during the study period. Research questions must therefore either be limited to active TB patients or, when extended to the general population, account for the differing population demographics. If this is not done then any results may be due to selection bias. Additionally, multiple variables may suffer from misclassification bias, including BCG status which can be assessed via vaccination record, the presence of a scar, or case recall: this may lead to spurious associations.[8] Validation studies would be required to account for this.

+

Unlike classic approaches to missing data, such as multiple imputation by chained equations (MICE),[9] this approach does not impute missing values.

Strengths and limitations in comparison to the literature

@@ -2887,20 +2889,20 @@

References

4 Pillaye J, Clarke A. An evaluation of completeness of tuberculosis notification in the United Kingdom. BMC Public Health 2003;3:31.

-
-

5 Groothuis-oudshoorn K. Journal of Statistical Software MICE : Multivariate Imputation by Chained.;VV.

-
-

6 Sterne JA, Davey Smith G. Sifting the evidence-what’s wrong with significance tests? Bmj 2001;322:226–31.

+

5 Sterne JA, Davey Smith G. Sifting the evidence - what’s wrong with significance tests? BMJ 2001;322:226–31.

-

7 PHE. Tuberculosis in England 2016 Report (presenting data to end of 2015). 2016.

+

6 PHE. Tuberculosis in England 2016 Report (presenting data to end of 2015). 2016.

-

8 Benchimol EI, Smeeth L, Guttmann A et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. The American Statistician 2016;115-116:1–22.

+

7 Benchimol EI, Smeeth L, Guttmann A et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. The American Statistician 2016;115-116:1–22.

-

9 Fewell Z, Davey Smith G, Sterne JAC. The impact of residual and unmeasured confounding in epidemiologic studies: A simulation study. American Journal of Epidemiology 2007;166:646–55.

+

8 Fewell Z, Davey Smith G, Sterne JAC. The impact of residual and unmeasured confounding in epidemiologic studies: A simulation study. American Journal of Epidemiology 2007;166:646–55.

+
+
+

9 Groothuis-Oudshoorn K. MICE: Multivariate Imputation by Chained Equations. Journal of Statistical Software;VV.

diff --git a/vignettes/paper.Rmd b/vignettes/paper.Rmd index 379753f..4a65a54 100644 --- a/vignettes/paper.Rmd +++ b/vignettes/paper.Rmd @@ -75,42 +75,37 @@ The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance sy ## Methods -We obtained all TB notifications for 2000-2015 in England from the ETS. We gave an overview of the structure of the ETS and the steps taken to clean the data for analysis. - -We considered five outcomes: All-cause mortality, death due to TB (in those who died), recurrent TB, pulmonary disease, and sputum smear status. We used logistic regression, with complete case analysis, to investigate each outcome with BCG vaccination, years since vaccination and age at vaccination, adjusting for potential confounders. All analyses were repeated using multiply imputed data. - * Introduce ETS * Data extraction and management -* Structure of the ETS (results) -* Data completeness (motivation) -* Data completeness method -* Structure of missingness in the ETS -* Variables not completed pre and post 2008 +* Structure of the ETS +* Data completeness +* Drivers of variable completeness (regression) ## Results *Copy from bottom* - -* Table 1 * Missing structure -* Associations of missingness +* Drivers of variable completeness ## Conclusions - +* Surveillance data is likely to have a high degree of misising data. In the ETS missing for key outcomes is associated with demographic factors such as.... +* To avoid biasing analysis studies should make use of imputed data - rather than complete case analysis - and extend their imputation models to other demographic variables that may not be included in the analysis model. +* This analysis should be repeated in other datasets - for this reason the code is available as an R package. 
# Introduction *Background* -The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system - with a similar structure to other such systems - that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often has a large amount of missing data which may not be fully accounted for when used in analyses. This study explores the evidence for associations between missingness in several key outcomes and demographic variables. Any such associations may introduce bias if not accounted for. +The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system - with a similar structure to other such systems - that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often has a large amount of missing data which may not be fully accounted for when used in analyses. *Detail* -Missing data can take several forms, data that are missing completely at random (MCAR), data that are missing at random (MAR) and data that are missing not at random (MNAR).[@Sterne2009a] Data that are MAR are missing with a mechanism that is conditional on observed variables, whilst MNAR are missing with a mechanism that is conditional on variables that are not observed. Data that is MAR, and MNAR may lead to biases when analysing the data, however it is not possible to deduce from the observed data what the mechanism driving missing data is. Therefore, it is necessary to account for these potential biases during the analysis stage. 
This is possible using a variety of methods such as scenario analysis accounting for the 'best' and 'worst' case scenarios, and multiple imputation of missing data using additional variables in the dataset to inform the imputation model.[@Sterne2009a] +Missing data can take several forms, data that are missing completely at random (MCAR), data that are missing at random (MAR) and data that are missing not at random (MNAR).[@Sterne2009a] Data that are MAR are missing with a mechanism that is conditional on observed variables, whilst MNAR are missing with a mechanism that is conditional on variables that are not observed. Data that is MAR, and MNAR may lead to biases when analysing the data, however it is not possible to deduce from the observed data what the mechanism driving missing data is. Therefore, it is necessary to account for these potential biases during the analysis stage. This is possible using a variety of methods such as scenario analysis accounting for the 'best' and 'worst' case scenarios, and multiple imputation of missing data using additional variables in the dataset to inform the imputation model.[@Sterne2009a] Common practise is to include all variables included in the analyses in the imputation model, these variables may or may not be those at most risk of introducing bias due to an MAR mechanism. *Aim* +This study aims to explore the evidence for associations between missingness in several key outcomes and demographic variables. Any such associations may introduce bias if not accounted for. # Methods @@ -138,6 +133,10 @@ The ETS is in a wide format with each notification having a single row, and with ### Data completeness +As the ETS is aggregated across England, from a variety of sources, missing data are inevitable. This takes two forms: under-reporting of notified cases, of which there is some evidence in the literature,[@Pillaye2003] and data missing for a notified case. 
The former is particularly problematic as apart from using comparative studies the characteristics of those that are not notified is unknown. For variables that are missing data within the dataset it is possible to calculate the proportion of missing data but care must be taken to account for nested variables such as date of death and year of BCG vaccination. To do this we have assumed that nested variables takes the value of the top level variable when it is known that the variable is not truly missing. An example of this is using overall outcome for date of death when notifications are known to have not died. This allows then allows us to estimate the proportion of these variables that are truly missing. + +For nested variables with rare outcomes assuming the top level variable value can mask the underlying amount of missing data. We implemented an alternative approach which filtered the data for the top level variable required for the nested variable to be defined and then computed the proportion of these notifications that were missing data for the outcome of interest. + ```{r miss-dat-munge, include = FALSE, eval = regen_results} if (regen_results) { ## See package docs (?ETSMissing::function) for further implementation details @@ -173,15 +172,13 @@ if (regen_results) { ``` -As the ETS is aggregated across England, from a variety of sources, some level of missing data are inevitable. This takes two forms: under-reporting of notified cases, of which there is some evidence in the literature,[@Pillaye2003] and data missing for a notified case. The former is particularly problematic as apart from using comparative studies the characteristics of those that are not notified is unknown. For variables that are missing data within the dataset it is possible to calculate the proportion of missing data but care must be taken to account for nested variables such as date of death and year of BCG vaccination. 
This can be done by assuming that the nested variables takes the value of the top level variable when it is known that the variable is not truly missing. An example of this is using overall outcome for date of death when notifications are known to have not died. - ### Drivers of Variable completeness -Missing data may be MAR or MNAR, which may introduce biases into any analyses based on these data. Unfortunately MNAR data cannot be detected, so bias from this source cannot be discounted. However, it is possible to detect potential MAR mechanisms from observed variables that would not necessarily be included in a model used for analysis. Although these associations may themselves be caused by an external factor. In the following section we explore variables associated with data being missing for several key variables including: BCG status, year of BCG vaccination, date of death, cause of death, date of symptom onset, date of diagnosis, date of starting treatment and date of ending treatment. +Missing data may be MAR or MNAR, which may introduce biases into any analyses based on these data. Unfortunately MNAR data cannot be detected, so bias from this source cannot be discounted. However, it is possible to detect potential MAR mechanisms from observed variables that would not necessarily be included in a model used for analysis. Here we develop a method for this and apply it to several key outcomes including: BCG status, year of BCG vaccination, date of death, cause of death, date of symptom onset, date of diagnosis, date of starting treatment and date of ending treatment. -In order to explore the drivers of missing data we reformulated the problem as a logistic regression for each variable of interest, with the outcome being data completeness (complete/missing). This allows variables that are hypothesised to be related to missing data to be adjusted for and their independent impact on data completeness to be estimated. 
Unlike classic approaches to missing data, such as multiple imputation by chained regression (MICE),[@Groothuis-oudshoorn] this is not an imputation. +We reformulated the problem as a logistic regression for each variable of interest, with the outcome being data completeness (complete/missing). This allows variables that are hypothesised to be related to missing data to be adjusted for and their independent impact on data completeness to be estimated. This approach does not account for missingness within exploratory variables. -### Statistical analysis +#### Method In order to reformulate missing data as a logistic regression we took the following steps: @@ -271,7 +268,7 @@ We did not involve patients or the public in the design or planning of this stud ## Data completeness -Doing this shows high completeness for common demographic variables such as sex, age, ethnic group and UK birth status (`r pretty_figref("plot-missing-struct")`, `r pretty_tabref("missing-var-tabs")`). More problematically, BCG status and year of BCG status have a high percentage missing, even after accounting for the introduction of national collection of these variables in 2008. Socio-economic status (as national quintiles) was not collected until 2010 but after this point is highly complete. Comparing pre 2009 and post 2008 in `r pretty_tabref("missing-var-tabs")` (and by inspecting `r pretty_figref("plot-missing-struct")`) there are also issues of changing completeness over time,[@PHE2016; @PHE2017] if this is not accounted for than it may lead to spurious trends. `r pretty_figref("plot-missing-struct")` also indicates that there are multiple groups of variables that share a common pattern of missing data. +We found high completeness for common demographic variables such as sex, age, ethnic group and UK birth status (`r pretty_figref("plot-missing-struct")`, `r pretty_tabref("missing-var-tabs")`). 
More problematically, BCG status and year of BCG status had a high percentage missing, even after accounting for the introduction of national collection of these variables in 2008.[@PHE2017] Socio-economic status (as national quintiles) was not collected until 2010 but after this point is highly complete.[@PHE2017] Comparing pre 2009 and post 2008 in `r pretty_tabref("missing-var-tabs")` (`r pretty_figref("plot-missing-struct")`) we see completeness changes over time,[@PHE2016; @PHE2017] this may lead to spurious trends if not adjusted for. `r pretty_figref("plot-missing-struct")` also indicates that there are multiple groups of variables that share a correlated pattern of missing data. ```{r plot-missing-struct, fig.cap = pretty_figref("plot-missing-struct", "Summary plot of missing data in the extract of the ETS data used in this thesis. Due to the large size of the dataset, the data has been sub-sampled with only 20\\% of the data shown in this figure. Notifications have been ordered by date of notification from left to right. The following subset of variables are shown: year (year), sex (sex), age (age), PHE Centre (phec), Occupation (occat), Ethnic group (ethgrp), UK birth status (ukborn), Time since entry (timesinceent), date of symptom onset (symptonset), date of diagnosis (datediag), started treatment (startedtreat), date of starting treatment (starttreatdate), treatment end date (txenddate), pulmonary or extra-pulmonary TB (pulmextrapulm), culture (culture), sputum smear status (sputsmear), drug resistance (anyres), previous diagnosis (prevdiag), BCG status(bcgvacc), Year of BCG vaccination (bcgvaccyr), overall outcome (overalloutcome), cause of death (tomdeathrelate), socio-economic status quintiles (natquintile), and date of death (dateofdeath). 
Nested variables have been accounted for (i.e date of death has had an entry added for cases that are known to have not died), so that true missingness for all variables is estimated."), out.width = "60%", out.extra = ""} show_figure("plot-missing-struct") @@ -295,8 +292,7 @@ knitr::kable(miss_ets, caption = pretty_tabref("missing-var-tabs", "Breakdown of missing data from the ETS prior to the web based system (pre 2009) and post (post 2008) by variable, ordered by the percentage missing for a subset of variables. The following subset of variables are shown year (year), sex (sex), age (age), PHE Centre (phec), Occupation (occat), Ethnic group (ethgrp), UK birth status (ukborn), Time since entry (timesinceent), date of symptom onset (symptonset), date of diagnosis (datediag), started treatment (startedtreat), date of starting treatment (starttreatdate), treatment end date (txenddate), pulmonary or extra-pulmonary TB (pulmextrapulm), culture (culture), sputum smear status (sputsmear), drug resistance (anyres), previous diagnosis (prevdiag), BCG status(bcgvacc), Year of BCG vaccination (bcgvaccyr), overall outcome (overalloutcome), cause of death (tomdeathrelate), socio-economic status quintiles (natquintile), and date of death (dateofdeath). Nested variables have been accounted for (i.e data of death has had an entry added for cases that are known to have not died), so that true missingness for all variables is estimated.")) ``` -For nested variables with rare outcomes assuming the top level variable value can mask the underlying amount of missing data. An alternative approach is to filter the data for the top level variable required for the nested variable to be defined and to then compute the proportion of these notifications that are missing data for the outcome of interest. For the date of starting treatment this approach leads to an estimate of `r missing_stats[["date_treat"]]` being missing, which is more complete than previously estimated. 
For cases that are known to have completed treatment `r missing_stats[["date_treat_end"]]` are missing a date for the end of treatment. In notifications that are known to have died, `r missing_stats[["date_death"]]` were missing the date of death and `r missing_stats[["cause_death"]]` were missing the cause of death. In any analysis where these variables are used the missing data for these variables will need to be carefully adjusted for. In particular, if cause of death is used it must be clearly stated that it is highly missing and results based on this variable should be properly caveated. - +By filtering nested variables - rather than by using replacement - we found the date of starting treatment was `r missing_stats[["date_treat"]]` missing, which is more complete than previously estimated. For cases that were known to have completed treatment `r missing_stats[["date_treat_end"]]` were missing a date for the end of treatment. In notifications that were known to have died, `r missing_stats[["date_death"]]` were missing the date of death and `r missing_stats[["cause_death"]]` were missing the cause of death. ## Drivers of Variable completeness @@ -419,14 +415,19 @@ ETSMissing::pull_results(results, "txenddate") %>% In the ETS system we found a high degree of missing data for several important variables. We also found that there is likely to be a strong missing at random (MAR) mechanism underlying this missing data for multiple variables. Several factors are strongly associated with data being missing for many variables, including UK birth status, ethnic group, socio-economic status and year. These MAR mechanisms must be adjusted for in studies using this data to avoid introducing bias. We found that date variables in particular suffered from changing data completeness over time, which may introduce spurious temporal trends if not fully understood. 
-* The following analysis is not currently in the paper but it was in the chapter - is there a case for including?* +*The following analysis is not currently in the paper but it was in the chapter - is there a case for including?* We also found that for several variables, including the date of symptom onset, there was a large degree of recall bias when aggregating by day or month. Several variables, including date of notification and date of starting treatment, showed a seasonal trend with a maximum in the summer months. The date of ending treatment showed less evidence of a seasonal trend. ### Strengths and limitations of the study +*Work in progress - copied from chapter text* + Routine observational datasets are subject to numerous potential biases, such as selection bias, recall bias, measurement bias, and unmeasured confounding.[@Benchimol2016a] Additionally, as the data has not been collected with a specific analysis in mind, there may be issues with the specificity of variables. The ETS system is likely to suffer from all of the above biases to some extent, which must be accounted for as far as possible, and explicitly stated at every level of analysis. The most important consideration is that the ETS system is unlikely to be representative of the general population as it contains only notified TB cases that occurred in England during the study period; research questions must therefore be either limited to active TB patients, or when extended to the general population the differing population demographics must be accounted for. If this is not done then any results may be due to selection bias. Additionally, multiple variables may suffer from misclassification bias, including BCG status which can be assessed via vaccination record, the presence of a scar, or case recall: this may lead to spurious associations.[@Fewell2007] Validation studies would be required to account for this. 
+ +Unlike classic approaches to missing data, such as multiple imputation by chained regression (MICE),[@Groothuis-oudshoorn] this is not an imputation + ### Strengths and limitations in comparison to the literature ### Meaning of the study @@ -434,7 +435,6 @@ Routine observational datasets are subject to numerous potential biases, such as ### Unanswered questions and future research - **Acknowledgements** The authors thank the TB section at Public Health England (PHE) for maintaining the Enhanced Tuberculosis Surveillance (ETS) system; all the healthcare workers involved in data collection for the ETS.