Open questions around multidimensional indicators #3635
Comments
Thanks for writing this up, Pablo!
General thoughts
I personally think that there is a fundamental distinction between what "indicator" actually means in ETL and in DB. I think this causes some confusion when working with metadata in ETL and aligning "what we see" in DB vs in ETL. In ETL, an indicator is represented by a In DB, however, indicators are slightly different. They don't retain dimensions, and we show one per dimension. For instance, in this dataset, we have several "DB indicators." But in ETL, these are just one single "ETL indicator," with a few dimensions.
My preferred way forward
I think that something more like your solution 2 would be beneficial in the long term and would align the ETL & DB worlds. When it comes to data, I'm in favor of "bringing DB closer to ETL", even if it is more work, rather than "bringing ETL closer to DB". ETL was designed later on, and I think that the DB has some legacy decisions that would be nice not to bring to ETL.
Opinion on 1
On Solution 1, I think its variants are easier to implement than Solution 2.
For 1.1, just to comment on "We can manually edit metadata of a specific flattened indicator (e.g. population_sex_male_age_0_to_4).": I think this is great, but we should be able to do this without needing to flatten, using Jinja. I think that Jinja is not ideal, and we have work to do there to make it easier. But I wouldn't force everyone to flatten so they can edit the metadata. Not that you are implying this, but I wanted to clarify it.
If going for something more like 1, I wouldn't flatten tables in Garden and end up with two versions of the same table. The reasons are memory/redundancy, less clarity on tables, and that sometimes we may not need to flatten at all. Instead, I'd flatten things after
When showing metadata in ETL, I agree that it's sub-optimal that we get the Jinja stuff without it being rendered. Maybe there is something we could do so we get it rendered by dimension. E.g., |
Thank you @lucasrodes. To address some of your points:
(I haven't thought much about this, not sure if my proposal makes sense). |
Mmmh, I'd try to clarify this a bit more. I know it might seem unnecessary, but I think it'd be nice if we made it clearer what it means. I think it is the source of confusion of various things when trying to map ETL<->DB. I would try to choose one of the following options:
or
|
I'm not sure I follow your proposal with Say we have the following table
If I wanted to get the metadata for |
I suppose that, for any slice of data, we could have a corresponding slice of metadata, e.g., if you did
But again, I haven't really thought much about it, this was just a half-baked idea. |
I agree with some points, but I will comment only on mdims. Currently, the only thing mdim steps do is create a YAML config, which is then uploaded to the DB. They operate solely on the DB and don’t use anything else from ETL. What do we gain from closer integration with ETL steps? I’ve only created two mdim configs, but I don’t see how that would make my life easier. Could you be more specific about your pain points and how this would help? Regarding flattening, I’m happy to "isolate" it from the |
Hi @Marigold, thanks for your inputs. Let me address your points:
I think that's already a problem. If we could keep all that information on the ETL, we would not need any extra dependency on DB. Currently, the dependencies added to the DAG for mdim steps are a bit arbitrary (it's really up to you what you write there in the DAG, but no data is actually loaded). If the data changes after an update, you won't have an easy way to know that. The mdim explorer will simply show wrong, or no data, I suppose. This is an issue we used to have with explorers, and we fixed it recently by letting ETL create explorers out of actual data (see e.g. this example, where the explorer config is drawn from the content of the data; but also note that this is done in a very suboptimal way, based on column names, ideally, we would explode dimensions with dedicated functions).
My proposal is to keep both the original (mdim) and the flattened data in ETL. In other words: All data comes from ETL and can be found in the catalog (which has always been the ideal goal, moving away from scattered data between ETL and DB). Then you can work with whatever data structure you feel more comfortable with. But we definitely need flattened data for grapher, so it makes more sense to create it explicitly in ETL, rather than on the fly (which is then not accessible via ETL or catalog). |
I agree on this. I generally prefer to work with long tables over wide tables. I don't see any benefit from using flat tables tbh, since one can already edit metadata for all dimensions using Jinja. It was actually a huge improvement when long->wide was automated. I understand that it kinda aligns ETL with DB a bit more. But I think that's not the ideal route that we want to take. If anything, I'd bring 'db closer to etl', whatever this means (can think more about this). In addition, IMHO having two versions of the same table can have some downsides:
I do agree that having MDIMs depend on DB is a bit confusing, in terms of the DAG. And probably would be nice to depend on If so, I'd use unflattened tables though, so that MDIMs can exploit the fact that the tables already have the dimension information. |
Long vs wide is an interesting (and nuanced) discussion. You now seem to have a strong preference for long formats. But let's remember that there are many operations where wide formats are clearly more convenient. One of them is propagating metadata. We have no idea how to do that properly with long format tables. We need metadata propagation every time we combine datasets (e.g. to properly track origins). Jinja templates are already quite a messy business, and I'm not sure we would be able to improve our tooling to a point where everything can be done with long tables only. On top of all that, it seems reasonable to expect that grapher is not going to change drastically for the time being. We'll (always?) need to have a dataset with an entity and a time, and no other dimensions. This constraint limits our solution space significantly. So, ideally, in ETL we should be able to easily move from one format to the other without losing information (dimensions or other metadata). @lucasrodes proposed something interesting above, which is to have, e.g. |
We are already propagating metadata for long formats as of today. The thing is that this works fine within the same dataset, where we can use Jinja. It is true that, once we start concatenating different datasets that have different metadata fields, there is a need to differentiate metadata at dimension level, and we can't really do that with our current tools. I'd invest time in trying to have metadata at dimension level, to solve these edge cases. Personally, when I have to append timeseries that have substantially different metadata, I go with two columns instead, and assume they are sort of "two indicators".
I think this is worth exploring. We currently use Jinja, and while I think it's messy sometimes, it's also convenient and helpful. I think we can also re-define a bit how the metadata YAML file is structured so that we can account for dimensions somehow, maybe without the need to have Jinja text. |
We already have a prototype data page to display multidimensional indicators (although with some open issues, but probably fixable). However, from an architectural perspective, after several conversations, it seems clear to me that there are some very important open questions that need to be answered before we move on to using them frequently (and eventually replace explorers).
So far, given our current technical limitations, we have been forced to flatten mdim indicators (either in ETL or in grapher). My initial understanding of an mdim project would be to tackle those limitations and have tools that let us work with true mdim indicators. However, looking at the current implementation of mdims, it seems that we are still flattening mdim indicators (and entity-date are still the only true dimensions of all indicators).
Solution 1: Flattening mdim indicators
We currently have two ways to flatten mdim indicators:
Solution 1.1: Flattening in ETL
We do that explicitly in the `data://garden` or `data://grapher` steps. This means doing a `pivot` operation, and creating metadata for each of the flattened indicators. An example implementation of this is to have, in the garden step, two output tables:
- `population`, with indexes `country`, `year`, `sex`, `age`, and column `population` (or `value`).
- `population_flat`, with indexes `country`, `year` and columns `population_sex_male_age_0_to_4`, `population_sex_female_age_0_to_4`, etc.
Pros
- We can manually edit metadata of a specific flattened indicator (e.g. `population_sex_male_age_0_to_4`).
- `data://grapher` steps, instead of needing to read from DB.
Cons
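For reference, the flattening described in Solution 1.1 can be sketched in plain Python (kept dependency-free on purpose; a real step would presumably use a pivot on the table object instead). The sample values, the `flatten` helper, the `INDEX`/`DIMENSIONS` constants and the column-naming rule are all assumptions made for illustration, not the actual ETL implementation. The age values are written as `0_to_4` so that the naming rule reproduces the column names mentioned above.

```python
from collections import defaultdict

# Long ("mdim") table: one row per (country, year, sex, age).
# All values are invented for illustration.
population = [
    {"country": "France", "year": 2020, "sex": "male", "age": "0_to_4", "population": 10},
    {"country": "France", "year": 2020, "sex": "female", "age": "0_to_4", "population": 11},
    {"country": "Spain", "year": 2020, "sex": "male", "age": "0_to_4", "population": 7},
]

INDEX = ("country", "year")
DIMENSIONS = ("sex", "age")


def flatten(rows, value_column):
    """Pivot long rows into one record per (country, year).

    Returns the flat table plus, for each flat column, the dimension
    values it came from, so no dimension information is lost.
    """
    flat = defaultdict(dict)
    column_dimensions = {}
    for row in rows:
        dims = {dim: row[dim] for dim in DIMENSIONS}
        # Hypothetical naming rule: <value>_<dim>_<dim value>_...
        parts = [value_column]
        for dim in DIMENSIONS:
            parts += [dim, row[dim]]
        column = "_".join(parts)
        flat[tuple(row[key] for key in INDEX)][column] = row[value_column]
        column_dimensions[column] = dims
    return dict(flat), column_dimensions


population_flat, column_dimensions = flatten(population, "population")
print(population_flat[("France", 2020)])
print(column_dimensions["population_sex_male_age_0_to_4"])
```

Keeping `column_dimensions` next to the flat table is one possible way to move between the two formats without losing the dimension information.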
Solution 1.2: Flattening in DB
We do the flattening in the `grapher://grapher` step. The output is also a flat table (actually, a grapher dataset) with indexes `country`, `year` and columns `population_sex_male_age_0_to_4`, `population_sex_female_age_0_to_4`, etc. And we store an additional column `dimensions` in the `variables` DB table, e.g. `{"filters": [{"name": "sex", "value": "male"}, {"name": "age", "value": "0-4"}], "originalName": "Population", "originalShortName": "population"}`.
Pros
Cons
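To illustrate how the `dimensions` column from Solution 1.2 could be used to recover dimension information from flat variables, here is a minimal sketch. It assumes the JSON shape shown in the example above; the `variables` mapping, the second (female) entry and the `find_variables` helper are hypothetical and only stand in for rows of the `variables` DB table.

```python
import json

# Hypothetical stand-in for the `variables` DB table:
# flat short name -> content of its `dimensions` column (JSON text).
variables = {
    "population_sex_male_age_0_to_4": json.dumps({
        "filters": [{"name": "sex", "value": "male"}, {"name": "age", "value": "0-4"}],
        "originalName": "Population",
        "originalShortName": "population",
    }),
    "population_sex_female_age_0_to_4": json.dumps({
        "filters": [{"name": "sex", "value": "female"}, {"name": "age", "value": "0-4"}],
        "originalName": "Population",
        "originalShortName": "population",
    }),
}


def find_variables(variables, original_short_name, **wanted):
    """Return flat variables of one original indicator matching the given dimension values."""
    matches = []
    for short_name, raw in variables.items():
        payload = json.loads(raw)
        if payload["originalShortName"] != original_short_name:
            continue
        # Turn the list of filters into a plain {dimension: value} mapping.
        filters = {f["name"]: f["value"] for f in payload["filters"]}
        if all(filters.get(name) == value for name, value in wanted.items()):
            matches.append(short_name)
    return matches


print(find_variables(variables, "population", sex="male"))
print(find_variables(variables, "population", age="0-4"))
```

In this sketch the flat variables remain the only stored data, and the dimensional structure is rebuilt on demand from the `dimensions` payload.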
Solution 2: Working with truly mdim indicators
This solution implies two big changes:
Mdims in ETL/owid-catalog
Currently, an `owid-catalog` `Table` carries just one set of indicator metadata for each column. For mdims with jinja-templated metadata, the result is only machine-readable, full of `<< something >>` content.
Ideally, each cell in a table would have its own metadata.
We could achieve that by having a table for data, and another, with the same structure, for metadata. We could still use jinja templates, but the metadata would be materialized when running the garden step.
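A rough sketch of that idea, under several assumptions: the data is a plain dictionary keyed by `(country, year, sex, age)`, the metadata table mirrors the same keys, and a small regex substitution stands in for real Jinja rendering of `<< name >>` placeholders (actual Jinja templates support far more than this). The `render` helper, the `title_template` text and the sample values are hypothetical.

```python
import re

# Stand-in for Jinja: replaces `<< name >>` placeholders only.
PLACEHOLDER = re.compile(r"<<\s*(\w+)\s*>>")


def render(template, dims):
    """Fill `<< name >>` placeholders with the given dimension values."""
    return PLACEHOLDER.sub(lambda match: str(dims[match.group(1)]), template)


DIMENSION_NAMES = ("sex", "age")
title_template = "Population (<< sex >>, ages << age >>)"

# Data table: (country, year, sex, age) -> value. Values are invented.
data = {
    ("France", 2020, "male", "0-4"): 10,
    ("France", 2020, "female", "0-4"): 11,
}

# Metadata table with the same structure as the data table,
# materialized once (e.g. when the garden step runs).
metadata = {}
for key in data:
    dims = dict(zip(DIMENSION_NAMES, key[2:]))
    metadata[key] = {"title": render(title_template, dims)}

print(metadata[("France", 2020, "male", "0-4")]["title"])
```

Because the two tables share keys, any slice of the data has a matching slice of already-rendered, human-readable metadata.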
But this would imply massive changes to `owid-catalog`, e.g. regarding metadata propagation and other operations.
Mdims in owid-grapher
I'm sure this would be another huge can of worms.
Other solutions
I suppose we could have other hybrid solutions, e.g. mdims in ETL, flattening in the grapher step (almost automatically) and keeping owid-grapher in the same way. But I suppose those would all be intermediate solutions.