COVID-19 integrated surveillance data provided by the Italian Institute of Health and processed via UnrollingAverages.jl to deconvolve the weekly moving averages.

InPhyT/COVID19-Italy-Integrated-Surveillance-Data

COVID-19 Integrated Surveillance Data in Italy

License: CC BY-SA 4.0

Every week the National Institute for Nuclear Physics (INFN) imports an anonymous individual-level dataset from the Italian National Institute of Health (ISS) and converts it into incidence time series organized by date of event and disaggregated by sex, age and administrative level, with a consolidation period of approximately two weeks. The information available to the INFN is summarised in the following meta-table:

| Variable Name | Description | Code / Format | Missing |
|---|---|---|---|
| REGIONEDIAGNOSI | Region of diagnosis | ISTAT regional code | No |
| ETA | Age of the patient in years at the date of symptoms onset or diagnosis | | Yes |
| SESSO | Sex | F = female; M = male; U = unknown | No |
| NAZIONALITA | Nationality | ISO 3166-1 | Yes |
| PROVINCIADOMICILIORESIDENZA | Province of domicile (or of residence, if domicile is missing) | ISTAT provincial code | Yes |
| OPERATORESANITARIO | Healthcare worker | Y = yes; N = no; U = unknown | No |
| DATAPRELIEVO | Date of sample collection | dd/mm/yyyy | Yes |
| DATADIAGNOSI | Date of diagnosis | dd/mm/yyyy | Yes |
| SINTOMATICO | Presence of symptoms | Y = yes; N = no; U = unknown | No |
| DATAINIZIOSINTOMI | Date of symptoms onset | dd/mm/yyyy | Yes |
| RICOVERO | Hospitalization | Y = yes; N = no; U = unknown | No |
| DATARICOVERO | Date of admission to hospital | dd/mm/yyyy | Yes |
| TERAPIAINTENSIVA | Intensive care unit | Y = yes; N = no; U = unknown | No |
| DATATERAPIAINTENSIVA | Date of admission to intensive care unit | dd/mm/yyyy | Yes |
| DECEDUTO | Deceased with COVID-19 | Y = yes; N = no; U = unknown | No |
| DATADECESSO | Date of death | dd/mm/yyyy | Yes |
| CASOIMPORTATO | Imported case from abroad | Y = yes; N = no; U = unknown | No |
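
As an illustration of the conversion from individual-level records to incidence time series, the following Python sketch counts records by region and date of event. It is not the INFN pipeline: the function name `daily_incidence`, the in-memory list of dictionaries and the sample values are assumptions made for the example. Only the field names and the dd/mm/yyyy date format come from the meta-table.

```python
from collections import Counter
from datetime import date, datetime


def daily_incidence(records, date_field, region_field="REGIONEDIAGNOSI"):
    """Count records per (region, day), skipping records whose date is missing."""
    counts = Counter()
    for record in records:
        raw_date = record.get(date_field)
        if not raw_date:
            continue  # date fields are listed as possibly missing in the meta-table
        day = datetime.strptime(raw_date, "%d/%m/%Y").date()
        counts[(record[region_field], day)] += 1
    return counts


# Made-up records for illustration only; the region codes are placeholders.
records = [
    {"REGIONEDIAGNOSI": "03", "DATADIAGNOSI": "01/03/2021"},
    {"REGIONEDIAGNOSI": "03", "DATADIAGNOSI": "01/03/2021"},
    {"REGIONEDIAGNOSI": "03", "DATADIAGNOSI": ""},
    {"REGIONEDIAGNOSI": "12", "DATADIAGNOSI": "02/03/2021"},
]
confirmed = daily_incidence(records, "DATADIAGNOSI")
print(confirmed[("03", date(2021, 3, 1))])  # 2
```

Passing a different date field (for example `DATADECESSO`) would give the series by another date of event under the same assumptions.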

Data

Archive

The original data has been stored here, reorganised here and its contents are summarised in the following table:

Table columns: Collection; Symptomatic cases; Confirmed cases; Ordinary hospital admissions; Intensive hospital admissions; Deceased cases; National level; Regional level; Provincial level; Age stratification; Sex stratification; Raw time series; Averaged time series.

Collections:

  • Daily incidences at national and regional level
  • Daily incidences at provincial level
  • Daily incidences of healthcare workers
  • Daily incidences of over-80
  • Daily incidences by sex and age
  • Daily incidences ratios
  • Daily Rₜ
  • Absolute overall prevalences
  • Relative overall prevalences
  • Daily age distribution
  • Daily incidences percentages by age
  • Daily incidences percentages by outcome
  • Distribution of time delay from hospitalization to death
  • Temporal distribution of time delay from hospitalization to death

Input

The input data are stored here and contain the following information:

  • Aggregated data in the daily_incidences_by_region folder:
    • Weekly moving average and daily time series of confirmed cases by date of diagnosis at the regional level;
    • Weekly moving average and daily time series of ordinary hospital admissions by date of admission at the regional level;
    • Weekly moving average and daily time series of intensive hospital admissions by date of admission at the regional level;
    • Weekly moving average and daily time series of deceased cases by date of death at the regional level.
  • Disaggregated data in the daily_incidences_by_region_sex_age folder:
    • Weekly moving average time series of symptomatic cases by date of symptoms onset stratified by sex and age at the regional level;
    • Weekly moving average time series of confirmed cases by date of diagnosis stratified by sex and age at the regional level;
    • Weekly moving average time series of ordinary hospital admissions by date of admission stratified by sex and age at the regional level;
    • Weekly moving average time series of intensive hospital admissions by date of admission stratified by sex and age at the regional level;
    • Weekly moving average time series of deceased cases by date of death stratified by sex and age at the regional level.

Output

The output data are stored here and contain the following information:

  • Reconstructed daily time series of confirmed cases by date of diagnosis stratified by sex and age at the regional level;

    [Figure: lombardy-confirmed]

  • Reconstructed daily time series of symptomatic cases by date of symptoms onset stratified by sex and age at the regional level;

    [Figure: lombardy-symptomatic]

  • Reconstructed daily time series of ordinary hospital admissions by date of admission stratified by sex and age at the regional level;

    [Figure: lombardy-hospitalized]

  • Reconstructed daily time series of intensive hospital admissions by date of admission stratified by sex and age at the regional level;

    [Figure: lombardy-icu]

  • Reconstructed daily time series of deceased cases by date of death stratified by sex and age at the regional level.

    [Figure: lombardy-deceased]

Methodology

Data Organization

Raw data are downloaded from the INFN (direct download here), decompressed and stored in the 0_archive folder, then organized into the 1_structured_archive folder by running the data_organization.jl script.

Data Processing

In general, given the moving average (or rolling mean) of a time series, it is not possible to recover the original series unless n original points are known, where n is the width of the moving-average window. However, epidemiological surveillance incidence series consist strictly of natural numbers. We can leverage this property to generate a finite number of candidate original series and then prune them down to as few as possible, ideally a single recovered series.
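
A minimal Python sketch of this idea follows. It is not the algorithm implemented in UnrollingAverages.jl: it assumes a forward-looking window (each average covers the current day and the n - 1 following days), exact unrounded averages, and a brute-force search that is only practical for tiny examples. All function names are hypothetical.

```python
from itertools import product


def moving_average(series, window):
    """Forward-looking moving average: one value per full window."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def unroll_from_initial(averages, window, initial):
    """Rebuild the series from its averages and its first `window` values.

    Uses x[t + window] = x[t] + window * (m[t + 1] - m[t]).
    """
    series = list(initial)
    for t in range(len(averages) - 1):
        step = round(window * (averages[t + 1] - averages[t]))
        series.append(series[t] + step)
    return series


def natural_candidates(averages, window):
    """Enumerate natural-number first windows, keep series that stay >= 0."""
    first_sum = round(window * averages[0])
    candidates = []
    for head in product(range(first_sum + 1), repeat=window - 1):
        last = first_sum - sum(head)
        if last < 0:
            continue  # the first window must sum to window * averages[0]
        series = unroll_from_initial(averages, window, [*head, last])
        if min(series) >= 0:  # prune candidates containing negative counts
            candidates.append(series)
    return candidates


# Toy series (made up): the non-negativity constraint leaves one candidate.
original = [0, 2, 0, 1, 0, 0, 3, 0]
averages = moving_average(original, 3)
print(natural_candidates(averages, 3))  # [[0, 2, 0, 1, 0, 0, 3, 0]]
```

In this toy case the pruning is enough to single out the original series. With larger counts several candidates typically survive, which is why additional constraints are needed.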

The whole procedure is performed by running the main.jl script; the related technical details can be found in the documentation of the UnrollingAverages.jl package.

The averaged time series to be unrolled (i.e. recovered, reconstructed or de-averaged) are those stored in the 2_input/daily_incidences_by_region_sex_age folder. They are organized in .csv files, each of which reports the 10 age-specific time series of a particular incidence in a particular region. Each dataset has two counterparts that are further stratified by sex.

Since UnrollingAverages.jl seems to perform better the smaller the numbers involved, we opted to unroll the sex-stratified series first and aggregate them afterwards. Not all the age- and sex-stratified averaged series allow UnrollingAverages.jl to find a unique original series, and the INFN provides no further sex-stratified information. In those cases we attempted to unroll the sex-aggregated time series directly, since for these CovidStat provides additional information in the form of age-aggregated original time series. We used that information to select the combination of age-disaggregated series proposed by UnrollingAverages.jl that sums to the age-aggregated original time series provided by the INFN. The age- and sex-aggregated series used may be found in the 2_input/daily_incidences_by_region folder. We refer to this selection algorithm as the cross-sectional consistency constraint.
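
The cross-sectional consistency constraint can be sketched in Python as an exhaustive search over one candidate per age class, keeping the combinations whose day-by-day sum equals the age-aggregated series. This is an illustrative assumption about how such a selection could be written, not the code in main.jl: the function name `select_consistent`, the age labels and the numbers are all made up.

```python
from itertools import product


def select_consistent(candidates_by_age, aggregate):
    """Return every choice of one candidate per age class summing to `aggregate`."""
    ages = list(candidates_by_age)
    matches = []
    for combo in product(*(candidates_by_age[age] for age in ages)):
        # Sum the chosen candidates day by day across age classes.
        total = [sum(values) for values in zip(*combo)]
        if total == list(aggregate):
            matches.append(dict(zip(ages, combo)))
    return matches


# Hypothetical candidates for two age classes over three days.
candidates_by_age = {
    "0-9": [[1, 0, 2], [0, 1, 2]],
    "10-19": [[0, 3, 1], [2, 2, 1]],
}
aggregate = [1, 3, 3]  # made-up age-aggregated daily series

selected = select_consistent(candidates_by_age, aggregate)
print(selected)  # [{'0-9': [1, 0, 2], '10-19': [0, 3, 1]}]
```

The search grows with the product of the candidate counts, so it is only a reasonable choice when each age class has few surviving candidates. If more than one combination matches, the constraint alone does not identify a unique reconstruction.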

The successfully reconstructed time series are then saved in the 3_output/data folder (both aggregated and disaggregated by sex), while the visualisations of those that are age-stratified and sex-aggregated may be found in 3_output/figures.

Future Developments

We may improve the cross-sectional consistency constraint in one of the following ways:

How to Contribute

If you wish to change or add some functionality, please file an issue. Some suggestions may be found in the Future Developments section.

How to Cite

If you use this data in your work, please cite this repository using the following metadata:

@dataset{Monticone_Moroni_COVID-19_Integrated_Surveillance_Data_Italy_2021,
         abstract     = {COVID-19 integrated surveillance data provided by the Italian Institute of Health and processed via UnrollingAverages.jl to remove the weekly moving averages.},
         author       = {Monticone, Pietro and Moroni, Claudio},
         doi          = {10.5281/zenodo.5748142},
         keywords     = {Data, Data Analysis, Statistics, Time Series, Time Series Analysis, Epidemiological Data, Surveillance, Surveillance Data, Incidence Data, Open Data, Epidemiology, Mathematical Epidemiology, Computational Epidemiology, COVID-19, SARS-CoV-2, Italy, COVID-19 Data, SARS-CoV-2 Data},
         license      = {CC BY-SA 4.0},
         organization = {Interdisciplinary Physics Team (InPhyT)},
         title        = {COVID-19 Integrated Surveillance Data in Italy},
         url          = {https://doi.org/10.5281/zenodo.5748142},
         year         = {2021}
         }

References

Data

Istituto Superiore di Sanità. COVID-19 Integrated Surveillance Data in Italy.

Software

  1. Pietro Monticone, Claudio Moroni, UnrollingAverages.jl (2021) https://doi.org/10.5281/zenodo.5725301.
  2. Tom Breloff, Plots.jl (2021) https://doi.org/10.5281/zenodo.5747251.

Scientific Literature