Italian COVID-19 Integrated Surveillance Data #463
Comments
Thanks for opening an issue! We'll try and get back to you shortly. If you've identified an issue and would like to fix it please see our contribution guidelines. |
I'll hop in quickly with a question about how the data you've put together (which is impressive) compares with the aggregated data which the package currently draws from the Department of Civil Protection (https://github.com/pcm-dpc/COVID-19/blob/master/README_EN.md). The level of disaggregation (gender, age cohort) that you have is more fine-grained than most of the data coming out of
We recently moved from one Swiss data source to another. We have not [yet] put in a standard way to let users choose between two different datasets (though I think this is sort of possible within the UK data). |
Hi @RichardMN,
The main difference between the integrated surveillance data from the Italian National Institute of Health that we update on a weekly basis here and the surveillance data from the Italian Department of Civil Protection that they update on a daily basis here is that the former contains incidences organised by date of key event while the latter by date of notification (affected by the typical problem of time-varying reporting delays). For more details you might want to take a look at Del Manso et al. (2020) where the two data streams are described and compared. CC: @ClaudMor, @pitmonticone |
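To make the distinction between the two date conventions concrete, here is a minimal sketch in R. It assumes a small fictional line list with one row per case; the column names `event_date` and `notification_date` are illustrative assumptions and are not taken from either dataset.

```r
library(dplyr)

# Fictional line list, for illustration only: each case has the date of its
# key event and the (later) date on which it was notified.
line_list <- tibble(
  case_id = 1:4,
  event_date = as.Date(c("2021-01-04", "2021-01-04", "2021-01-05", "2021-01-05")),
  notification_date = as.Date(c("2021-01-06", "2021-01-09", "2021-01-06", "2021-01-12"))
)

# Incidence organised by date of key event
by_event <- count(line_list, date = event_date, name = "cases")

# Incidence organised by date of notification: the same four cases are
# spread over different days because reporting delays vary between cases
by_notification <- count(line_list, date = notification_date, name = "cases")
```

The same cases produce two different time series, which is why the two data streams cannot simply be swapped for one another.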
Hello, Would you have any update on this? |
Things have been a bit more hectic for the past couple of weeks and I haven't decided to spend an evening writing this code yet. It's going to be a bit picky sorting out how to switch between two data sources (I suppose I'll probably look at what is done for the UK example) and this is probably why I've not written a drop-in replacement yet. I think that other contributors have also been focussed on other projects related to now- and forecasting. |
Hi @RichardMN, thanks for your reply. We're certainly willing to help you with the logistics if needed: if you tell us the proper format we could make an additional folder in our repository with the data in the requested format. |
So here am I with a suggestion, having had a bit of a look at the data. It would be a lot simpler if the data were in 'tidy' format. Roughly, this might look like:
[fictional data - I haven't checked what the real numbers would be] Whether you prefer column names (and region names) in Italian or English, all lower case or not, can all be worked around. This will make for one very long (as opposed to wide) CSV, but one that is much easier to filter and much easier for our code to aggregate. (And it means not writing code to download 20 x (4 or 5) separate CSV files, then glue them together, then flatten them, ... which I can do but I'm not looking forward to.)
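As a rough sketch of the long layout described above, the snippet below builds a tiny tidy table and aggregates it. The column names follow the column specification reported later in this thread (`date`, `region`, `gender`, `age_cohort`, `indicator`, `count`); every value, including the gender and age cohort labels, is fictional and only an assumption about how the levels might be coded.

```r
library(dplyr)

# Fictional rows, for illustration only: one row per
# date x region x gender x age cohort x indicator.
tidy_example <- tibble(
  date = as.Date(rep("2021-01-04", 4)),
  region = c("Piemonte", "Piemonte", "Lombardia", "Lombardia"),
  gender = c("F", "M", "F", "M"),      # assumed coding
  age_cohort = rep("0-9", 4),          # assumed coding
  indicator = rep("confirmed", 4),
  count = c(3, 5, 7, 11)
)

# With one long table, filtering and aggregating become a single pipeline
by_region <- tidy_example %>%
  filter(indicator == "confirmed") %>%
  group_by(date, region, indicator) %>%
  summarise(count = sum(count), .groups = "drop")
```

Here `by_region` collapses gender and age cohort, leaving one row per date, region and indicator.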
Edits:
|
Hi @RichardMN, here is the tidy version of our dataset following your suggestion. Could you tell us if you believe it might be fine? If so, we will notify you here when we'll merge in the main branch. |
Looks good. Below is a quick reprex for pulling it into R, aggregating it (as we will inside the package) and plotting it. You have saved me at least an hour of painful URL-hackery. I've not started doing logical tests against it, but in terms of making something which is going to be straightforward to pull into
library(vroom)
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
it_inphyt_data <- vroom::vroom("https://github.com/InPhyT/COVID19-Italy-Integrated-Surveillance-Data/raw/use_initial_conditions/epiforecasts_covidregionaldata/COVID19-Italy-Integrated-Surveillance-Data.csv")
#> Rows: 674503 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): region, gender, age_cohort, indicator
#> dbl (1): count
#> date (1): date
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
it_agg_data <- it_inphyt_data %>%
  group_by(date, region, indicator) %>%
  summarise(across(where(is.double), sum), .groups = "drop")
it_agg_data %>%
  filter(indicator == "confirmed") %>%
  ggplot(aes(x = date, y = count, colour = region)) +
  geom_line() +
  theme_minimal()
Created on 2022-03-14 by the reprex package (v2.0.1) |
Back with more questions, some of which may take a bit of digging.
What is the
What is your preferred count between
Any ideas why we have this in our existing code? (This squashes the two together, so that Trento and Bolzano are both listed as Trentino-Alto Adige. I don't know enough Italian geography or regional/local government to know why this makes sense or doesn't.) It'll have to be amended to match what your region identifiers are but I'm not that familiar with our Italy code so don't quite know why we do this. I wonder if it may be that the two regions share an ISO-3166 code and so they get merged together because in many of our other usages we depend on the ISO-3166 being a unique identifier for regions.
mutate(level_1_region = recode(.data$level_1_region,
  "P.A. Trento" = "Trentino-Alto Adige",
  "P.A. Bolzano" = "Trentino-Alto Adige"
)) %>%
For now, #464 is a first write-through of an alternate implementation of the Italy code which uses the InPhyT data. I'll make a PR here and would welcome someone else poking it a bit. Later this week I may try putting in:
|
Fix indicators as requested by epiforecasts/covidregionaldata#463 (comment) Co-Authored-By: Pietro Monticone <38562595+pitmonticone@users.noreply.github.com> Co-Authored-By: Claudio Moroni <43729990+ClaudMor@users.noreply.github.com>
Hi @RichardMN, thanks for your feedback and your questions.
We've renamed
We have no preferred count between
mutate(level_1_region = recode(.data$level_1_region,
  "P.A. Trento" = "Trentino-Alto Adige",
  "P.A. Bolzano" = "Trentino-Alto Adige"
)) %>%
Yes, this aggregation makes perfect sense since Trentino-Alto Adige is the Italian region made up of the two self-governing provinces of Trento and Bolzano. Please tell us if any further changes are needed. |
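For reference, here is a minimal sketch of what that recoding does once the counts are summed afterwards. The `recode()` call is the one quoted above; the table, the `count` column and the grouping step are fictional assumptions added for illustration, not the package's actual pipeline.

```r
library(dplyr)

# Fictional counts, for illustration only
regional <- tibble(
  date = as.Date(rep("2022-03-01", 3)),
  level_1_region = c("P.A. Trento", "P.A. Bolzano", "Lombardia"),
  count = c(10, 20, 100)
)

merged <- regional %>%
  # Relabel the two autonomous provinces with the name of their region
  mutate(level_1_region = recode(.data$level_1_region,
    "P.A. Trento" = "Trentino-Alto Adige",
    "P.A. Bolzano" = "Trentino-Alto Adige"
  )) %>%
  # Sum the two provinces so the region appears once per date
  group_by(date, level_1_region) %>%
  summarise(count = sum(count), .groups = "drop")
```

Without the summing step the recode alone would leave two rows with the same region name per date, so the relabelling and the aggregation need to go together.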
Hi @RichardMN, Today we've successfully updated our repository merging the new folder epiforecasts_covidregionaldata. Please don't hesitate to let us know if any further changes are needed. |
I've adjusted the download URL (twice - I got it wrong the first time). Checks appear to be failing in the GitHub workflow but I think that may be because there's a problem with the French data right now. |
Hello @RichardMN, Is there anything else we can do on our side to facilitate the transition? Thanks. |
Hi @RichardMN, We've recently solved a few issues and added one age class so that now we provide the following age classification:
Here is the updated data. Thanks. |
Hi @InterdisciplinaryPhysicsTeam - thank you for the various updates. There are two slightly interrelated issues.
I am not a maintainer of this package and so I cannot apply changes. The package appears to be moving towards senescence - many of the upstream sources have stopped updating or moved to frequencies which are no longer useful for the epidemiological work which people want to do with data from
On a slightly related point, France has changed their data format (three weeks ago) #469 which means that just to get to the point where the patches I made will pass checks and could be applied, I need to go and look at the France code (or someone else does) and get those fixed and applied.
Returning to point 1, I need a sense from @seabbs or @kathsherratt or others whether we're going to try to modularize the package better (so that single country failures don't bork everything else) or just accept that it was very useful for a time but no longer appears to have utility or a market. This is a bit of a bigger question than belongs in this issue but this appears to be where the conversation might take place. |
Hi all, and thanks @RichardMN for bringing up this topic. As mentioned in #459, we are unsure if this package is still used by / useful to anyone. Because of this, most of the contributors have moved on (except @RichardMN, whose heroic efforts to keep this package running need to be highlighted!). I can help in getting outstanding PRs merged though if someone feels that something needs updating / fixing. Two comments:
If necessary, feel free to ping me. I cannot promise I'll always be responsive but I'll try. |
Hi @RichardMN @Bisaloo @pitmonticone @ClaudMor, Thank you @Bisaloo for your reply.
It very much depends on which variables you're interested in and would like to make use of. The main differences between the integrated surveillance data from the Italian National Institute of Health that we update on a weekly basis here and the surveillance data from the Italian Department of Civil Protection that they update on a daily basis here are the following:
For more details you might want to take a look at Del Manso et al. (2020) where the two data streams are described and compared. |
Okay, I'm quite convinced we need to keep both data sources, with the ability for the user to switch from one to the other. @RichardMN, are you interested in implementing this or would you like me to do it? No pressure either way. |
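One possible shape for such a switch is sketched below. Everything in it is hypothetical: the function name `get_italy_data()`, the source labels `"dpc"` (Department of Civil Protection) and `"iss"` (Italian National Institute of Health), and the `loaders` argument are assumptions made for illustration, not the package's API. Stub loaders stand in for the real downloads.

```r
# Hypothetical dispatcher: the function name, the source labels and the
# loaders list are illustrative assumptions, not the package's API.
get_italy_data <- function(source = c("dpc", "iss"), loaders) {
  # match.arg() picks the first choice by default and rejects unknown labels
  source <- match.arg(source)
  loader <- loaders[[source]]
  if (!is.function(loader)) {
    stop("No loader registered for source '", source, "'")
  }
  loader()
}

# Stub loaders returning fictional rows instead of downloading anything
stub_loaders <- list(
  dpc = function() data.frame(source = "dpc", date_type = "notification"),
  iss = function() data.frame(source = "iss", date_type = "key event")
)

default_data <- get_italy_data(loaders = stub_loaders)
iss_data <- get_italy_data("iss", loaders = stub_loaders)
```

Keeping the existing source as the default would leave current users unaffected, while the second source stays one argument away.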
Hi all,
First of all thank you very much for the development and maintenance of this very useful global national and sub-national level COVID-19 incidence data package.
We have read the Development section where you write:
then, exploring the Wiki, we have read the following recommendation:
Therefore we have opened this preliminary issue to ask if you believe it could be helpful to include the Italian COVID-19 integrated surveillance data we've recently obtained the authorisation to publish containing:
Contacts