
Open questions around multidimensional indicators #3635

Open · pabloarosado opened this issue Nov 27, 2024 · 10 comments

@pabloarosado (Contributor)

We already have a prototype data page to display multidimensional indicators (still with some open issues, but probably fixable). However, from an architectural perspective, after several conversations, it seems clear to me that there are some very important open questions that need to be answered before we move on to using them frequently (and eventually replace explorers with them).

So far, given our current technical limitations, we have been forced to flatten mdim indicators (either in ETL or in grapher). My initial understanding was that an mdim project would tackle those limitations and give us tools to work with true mdim indicators. However, looking at the current implementation of mdims, it seems that we are still flattening mdim indicators (and entity-date are still the only true dimensions of all indicators).

Solution 1: Flattening mdim indicators

We currently have two ways to flatten mdim indicators:

Solution 1.1: Flattening in ETL

We do the flattening explicitly in the data://garden or data://grapher steps. This means doing a pivot operation, and creating metadata for each of the flattened indicators (see the sketch after the list below). An example implementation of this is to have, in the garden step, two output tables:

  • A long-format table, e.g. population, with indexes country,year,sex,age, and column population (or value).
  • A wide-format table, e.g. population_flat, with indexes country,year and columns population_sex_male_age_0_to_4, population_sex_female_age_0_to_4, etc.
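A minimal sketch of that pivot, with illustrative data (the real garden step would operate on owid-catalog Tables and would also build metadata for each flattened column):

```python
import pandas as pd

# Long-format table indexed by (country, year, sex, age); values are made up.
long_tb = pd.DataFrame({
    "country": ["France", "France"],
    "year": [2023, 2023],
    "sex": ["male", "female"],
    "age": ["0-4", "0-4"],
    "population": [1_800_000, 1_700_000],
}).set_index(["country", "year", "sex", "age"])

# Pivot the extra dimensions into columns, then collapse the resulting
# MultiIndex into flat names like population_sex_male_age_0_to_4.
wide_tb = long_tb["population"].unstack(["sex", "age"])
wide_tb.columns = [
    f"population_sex_{sex}_age_{age.replace('-', '_to_')}"
    for sex, age in wide_tb.columns
]
```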

Pros

  • ETL is still our source of data. All data used in charts can be found in the catalog.
  • We can manually edit metadata of a specific flattened indicator (e.g. population_sex_male_age_0_to_4).
  • Mdim steps rely on data://grapher steps, instead of needing to read from DB.

Cons

  • We need to deal (in ETL and DB) with indicators with long, ugly names, instead of accepting the truly multidimensional nature of those indicators.
  • Some ETL steps take more time and storage, since we would be flattening and storing locally, before pushing to grapher.

Solution 1.2: Flattening in DB

We do the flattening in the grapher://grapher step. The output is also a flat table (actually, a grapher dataset) with indexes country,year and columns population_sex_male_age_0_to_4, population_sex_female_age_0_to_4, etc. And we store an additional column dimensions in the variables DB table, e.g. `{"filters": [{"name": "sex", "value": "male"}, {"name": "age", "value": "0-4"}], "originalName": "Population", "originalShortName": "population"}`.

Pros

  • Some ETL steps take less time and storage.

Cons

  • We need to deal (in DB only) with indicators with long, ugly names, instead of accepting the truly multidimensional nature of those indicators.
  • The ETL is no longer our source of data. We can't find the data for flattened mdim indicators in the owid-catalog.
  • We can edit the garden jinja template, but we can't manually edit the metadata of specific flattened indicators (I think).
    • For example: What would happen if you edit, e.g. this indicator from the admin? If it tries to create an override yaml file in the grapher step, I suppose this yaml file would serve no purpose or break the ETL, since the output table of the grapher step is not yet flattened. But I haven't checked, so I'm not sure if there are already mechanisms in place for this case.
  • When creating mdim steps, we now rely on content that only exists in DB (dimensions). So we come back to the original issue of creating circular dependencies between ETL and DB.

Solution 2: Working with truly mdim indicators

This solution implies two big changes:

Mdims in ETL/owid-catalog

Currently, an owid-catalog Table carries a single set of indicator metadata for each column. For mdims with jinja-templated metadata, the result is only machine-readable, full of `<< something >>` content.

Ideally, each cell in a table would have its own metadata.

We could achieve that by having a table for data, and another, with the same structure, for metadata. We could still use jinja templates, but the metadata would be materialized when running the garden step.
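A minimal sketch of the idea, with hypothetical structures (not the owid-catalog API):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("France", 2023, "male"), ("France", 2023, "female")],
    names=["country", "year", "sex"],
)

# Table of data.
data = pd.DataFrame({"population": [1_800_000, 1_700_000]}, index=index)

# Sidetable of metadata with the same structure: each cell holds the metadata
# already materialized for that combination of dimensions.
metadata = pd.DataFrame(
    {
        "population": [
            {"description_short": "Male population of France in 2023."},
            {"description_short": "Female population of France in 2023."},
        ]
    },
    index=index,
)

# Looking up the metadata for a given cell is then an aligned lookup:
meta = metadata.loc[("France", 2023, "male"), "population"]
```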

But this would imply massive changes to owid-catalog, e.g. regarding metadata propagation and other operations.

Mdims in owid-grapher

I'm sure this would be another huge can of worms.

Other solutions

I suppose we could have other hybrid solutions, e.g. mdims in ETL, flattening in the grapher step (almost automatically), and keeping owid-grapher as it is. But I suppose those would all be intermediate solutions.

@lucasrodes (Member)

Thanks for writing this up, Pablo!

General thoughts

I personally think that there is a fundamental distinction between what "indicator" actually means in ETL and in DB. I think this brings some confusion when working with metadata in ETL, and when aligning "what we see" in DB vs in ETL.

In ETL, an indicator is represented by a catalog.Variable object, which is similar to a pandas Series and usually has (country, year) as the index and a value column. Additionally, the index can have more dimensions, so that we end up with (country, year, age), for instance.

In DB, however, indicators are slightly different: they don't retain dimensions, and we store one indicator per combination of dimension values.

For instance, in this dataset, we have several "DB indicators." But in ETL, these are just a single "ETL indicator" with a few dimensions.

My preferred way forward

I think that something more like your solution 2 would be beneficial in the long term and would align the ETL & DB worlds. When it comes to data, I'm in favor of "bringing DB closer to ETL", even if it is more work, over "bringing ETL closer to DB". ETL was designed later on, and I think the DB has some legacy decisions that would be nice not to carry over into ETL.

Opinion on 1

On Solution 1, I think both variants are considerably easier to implement than Solution 2. For 1.1, just to comment on "We can manually edit metadata of a specific flattened indicator (e.g. population_sex_male_age_0_to_4).": I think this is great, but we should be able to do this without the need for flattening, using Jinja. I think that Jinja is not ideal, and we have work to do there to make it easier. But I wouldn't force everyone to flatten just so they can edit the metadata. Not that you are implying this, but I wanted to clarify.

If going for something more like 1, I wouldn't flatten tables in Garden to end up with two versions of the same table. The reasons are memory/redundancy, less clarity on tables, and the fact that sometimes we may not need to flatten at all. Instead, I'd flatten things after data://grapher, so we are aligned with DB before pushing things. If we really wanted to see tables flattened in ETL, we could also have an option like ds.read(..., flat=True) when reading tables.

When showing metadata in ETL, I agree that it's sub-optimal that we get the Jinja stuff without it being rendered. Maybe there is something we could do so we get it rendered by dimension. E.g., tb[col] returns something other than a Variable, say a VariableDim, that handles and renders dimensions better? I'm not being very explicit here, just saying we should do something in owid.catalog to improve rendering.

@pabloarosado (Contributor, Author)

Thank you @lucasrodes. To address some of your points:

  • I think an indicator can be both things. If we want to distinguish them, maybe we can speak of an "mdim indicator" or a "flattened indicator".
  • Currently, in ETL we have Variable which, in terms of data, can be an mdim indicator. However, in terms of metadata, Variable is limited to having only one set of metadata, regardless of the values of the dimensions. This could indeed be improved: Variable metadata could be accessed in a way analogous to how we access data. So, instead of doing, e.g. tb.loc[row1, col1] to access the data of row1-col1, we could do, e.g. tb.mloc[row1, col1] to access the metadata of that specific cell. To achieve this:
    • In the most general case, we would need to have a table of data, and a sidetable of metadata (where each cell is the already rendered metadata for that specific set of dimensions and column).
    • In the simplest case, we could keep Table pretty much the same (with jinja nomenclature) and create helper methods to properly render mdim metadata when accessing specific rows of a table.

(I haven't thought much about this, not sure if my proposal makes sense).

@lucasrodes (Member)

> I think an indicator can be both things. If we want to distinguish them, maybe we can speak of an "mdim indicator" or a "flattened indicator".

Mmmh, I'd try to clarify this a bit more. I know it might seem unnecessary, but I think it'd be nice if we made it clearer what it means. I think it is the source of confusion in various things when trying to map ETL<->DB.

I would try to choose one of the following options:

  • (1) An indicator is a time series (or a collection of time series) for a given (country, year)-pair. It can have dimensions, so that's why it could be a collection. E.g., Total Population is an indicator, and can have dimensions age=10, age=20, etc. Total Population - age: 10 would be something different. Or it could be a dimensionless indicator, if there are no dimensions.
    • This is what we are already using, often, in ETL, where we define metadata for Total Population, and clarify things for dimensions using Jinja. But Total Population - age: 10 does not exist unless you explicitly flatten the table, which I don't think we should.

or

  • (2) An indicator is a time series for a given (country, year)-pair or (country, year, dimension)-pair if there are dimensions available. In this case, Total Population would not be an indicator, but Total Population - age: 10.
    • This is what we currently have in the database.

@lucasrodes (Member) commented Nov 27, 2024

I'm not sure I follow your proposal with `tb.mloc[row1, col1]`.

Say we have the following table tb (country, year, age as index):

| country | year | age | some_indicator |
| --- | --- | --- | --- |
| United States | 2023 | 20 | 331,000,000 |
| United States | 2023 | 30 | 1,400,000,000 |
| United States | 2023 | 40 | 125,000,000 |
| Nigeria | 2023 | 20 | 216,000,000 |
| Nigeria | 2023 | 30 | 214,000,000 |
| Nigeria | 2023 | 40 | 83,000,000 |

If I wanted to get the metadata for some_indicator when age=20, how would we use .mloc? In any case, I agree there is something to do in that space. We can discuss more in detail when the time comes.

@pabloarosado (Contributor, Author) commented Nov 27, 2024

I suppose that, for any slice of data, we could have a corresponding slice of metadata. E.g., if you did `tb.mloc[row_filter, some_indicator].description_short`, you would get:

| country | year | age | some_indicator |
| --- | --- | --- | --- |
| United States | 2023 | 20 | In the US in 2023, blah |
| United States | 2023 | 30 | In the US in 2023, blah |
| United States | 2023 | 40 | In the US in 2023, blah |
| Nigeria | 2023 | 20 | In Nigeria in 2023, blah |
| Nigeria | 2023 | 30 | In Nigeria in 2023, blah |
| Nigeria | 2023 | 40 | In Nigeria in 2023, blah |
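A minimal sketch of how such an accessor could render a jinja-templated field per row (hypothetical helper, not the owid-catalog API):

```python
import pandas as pd
from jinja2 import Template

def render_field(tb: pd.DataFrame, template: str) -> pd.Series:
    # Render the template once per row, using the index levels
    # (country, year, age, ...) as the template context.
    rows = tb.index.to_frame(index=False).to_dict("records")
    return pd.Series(
        [Template(template).render(**row) for row in rows],
        index=tb.index,
    )

# e.g. render_field(tb, "In {{ country }} in {{ year }}, blah")
```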

But again, I haven't really thought much about it; this was just a half-baked idea.

@Marigold (Collaborator)

I agree with some points, but I will comment only on mdims. Currently, the only thing mdim steps do is create a YAML config, which is then uploaded to the DB. They operate solely on the DB and don’t use anything else from ETL. What do we gain from closer integration with ETL steps? I’ve only created two mdim configs, but I don’t see how that would make my life easier. Could you be more specific about your pain points and how this would help?

Regarding flattening, I’m happy to "isolate" it from the grapher://grapher step to make it possible to explicitly flatten them in the data://grapher step for anyone who wants to. However, personally, when I was prototyping mdim configs, I found it much easier to work with causes of death, which have dimensions, than with energy, which has a kind of flattened structure.

@pabloarosado (Contributor, Author)

Hi @Marigold, thanks for your inputs. Let me address your points:

> Currently, the only thing mdim steps do is create a YAML config, which is then uploaded to the DB. They operate solely on the DB and don’t use anything else from ETL. What do we gain from closer integration with ETL steps?

I think that's already a problem. If we could keep all that information in the ETL, we would not need any extra dependency on DB. Currently, the dependencies added to the DAG for mdim steps are a bit arbitrary (it's really up to you what you write there in the DAG, but no data is actually loaded). If the data changes after an update, you won't have an easy way to know that. The mdim explorer will simply show wrong data, or no data at all, I suppose. This is an issue we used to have with explorers, and we fixed it recently by letting ETL create explorers out of actual data (see e.g. this example, where the explorer config is drawn from the content of the data; but note that this is done in a very suboptimal way, based on column names; ideally, we would explode dimensions with dedicated functions).
Another issue is having hardcoded paths in multiple places, but I suppose we could figure out a way to avoid this with the current implementation.
So, a more robust implementation (update-safe) would be that the mdim step loads the dependencies directly within ETL, and it creates (and sanity-checks) the config automatically. If anything has changed in the data, the step will fail, and you'll know straight away.
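As a hedged sketch of what "creating and sanity-checking the config from the data" could look like (all names here are hypothetical, not the real ETL API):

```python
import pandas as pd

def build_mdim_config(tb: pd.DataFrame, dimensions: list[str]) -> dict:
    # Fail loudly if a declared dimension no longer exists in the data,
    # e.g. after a dataset update.
    missing = [dim for dim in dimensions if dim not in tb.index.names]
    if missing:
        raise ValueError(f"Dimensions missing from table: {missing}")
    # Enumerate the available choices for each dimension from the data itself,
    # instead of hardcoding them in the config.
    return {
        dim: sorted(tb.index.get_level_values(dim).unique().tolist())
        for dim in dimensions
    }
```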

> Regarding flattening, I’m happy to "isolate" it from the grapher://grapher step to make it possible to explicitly flatten them in the data://grapher step for anyone who wants to. However, personally, when I was prototyping mdim configs, I found it much easier to work with causes of death, which have dimensions, than with energy, which has a kind of flattened structure.

My proposal is to keep both the original (mdim) and the flattened data in ETL. In other words: all data comes from ETL and can be found in the catalog (which has always been the ideal goal, moving away from data scattered between ETL and DB). Then you can work with whatever data structure you feel more comfortable with. But we definitely need flattened data for grapher, so it makes more sense to create it explicitly in ETL, rather than on the fly (which is then not accessible via ETL or the catalog).
We may need to add a new field in the Table metadata to keep track of dimensions, and then you won't need to fetch it from DB.

@lucasrodes (Member) commented Nov 28, 2024

> Regarding flattening, I’m happy to "isolate" it from the grapher://grapher step to make it possible to explicitly flatten them in the data://grapher step for anyone who wants to. However, personally, when I was prototyping mdim configs, I found it much easier to work with causes of death, which have dimensions, than with energy, which has a kind of flattened structure.

I agree on this. I generally prefer to work with long tables over wide tables. I don't see any benefit from using flat tables, tbh, since one can already edit metadata for all dimensions using Jinja. It was actually a huge improvement when long->wide was automated.

I understand that it kinda aligns ETL with DB a bit more. But I think that's not the ideal route that we want to take. If anything, I'd bring 'db closer to etl', whatever this means (can think more about this).

In addition, IMHO having two versions of the same table can have some downsides:

  • More memory use and redundancy: I know we tend to neglect this, but I think it is meaningful here.
  • Less clarity on tables: it adds an extra layer for anyone (even us) consuming our data; there would be more tables, and maybe it wouldn't be clear which one to work with.
  • Flattening is not always required
  • More maintenance: We have to develop tools to accommodate for more types of tables.
  • We lose the information on dimensions: Flattened tables lack that information unless we add it somehow. Unflattened ones have it naturally.
  • Metadata is already editable in unflattened tables (via Jinja): I think this is the way, though we could try to improve this overall.

> So, a more robust implementation (update-safe) would be that the mdim step loads the dependencies directly within ETL, and it creates (and sanity-checks) the config automatically. If anything has changed in the data, the step will fail, and you'll know straight away.

I do agree that having MDIMs depend on DB is a bit confusing in terms of the DAG, and it would probably be nice to depend on data://grapher instead. We could definitely benefit from this in updates.

If so, I'd use unflattened tables though, so that MDIMs can exploit the fact that the tables already have the dimension information.

@pabloarosado (Contributor, Author)

Long vs wide is an interesting (and nuanced) discussion. You now seem to have a strong preference for long formats. But let's remember that there are many operations where wide formats are clearly more convenient. One of them is propagating metadata: we have no idea how to do that properly with long-format tables, and we need metadata propagation every time we combine datasets (e.g. to properly track origins). Jinja templates are already quite a messy business; I'm not sure we would be able to improve our tooling to a point where everything can be done with long tables only.

On top of all that, it seems reasonable to expect that grapher is not going to change drastically for the time being. We'll (always?) need to have a dataset with an entity and a time, and no other dimensions. This constraint limits our solution space significantly.

So, ideally, in ETL we should be able to easily move from one format to the other without losing information (dimensions or other metadata). @lucasrodes proposed something interesting above, which is to have, e.g. ds.read(table_name, flat=True). Maybe we could invest more time in having such methods to quickly switch from one format to the other (e.g. smarter melt and pivot operations, as sketched below). I'm not sure if achieving such things would be feasible in general. My intuition is that, sooner or later, we'll need to have two separate objects.
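A minimal sketch of what a lossless round trip could look like (hypothetical helpers; the real thing would also have to carry metadata along):

```python
import pandas as pd

def to_wide(tb: pd.DataFrame, dims: list[str], value: str) -> pd.DataFrame:
    # Pivot the dimension levels into columns and remember them in .attrs,
    # so the operation can be reversed without losing the dimension names.
    wide = tb[value].unstack(dims)
    wide.attrs.update({"dims": dims, "value": value})
    return wide

def to_long(wide: pd.DataFrame) -> pd.DataFrame:
    # Reverse of to_wide: stack the remembered dimension levels back in.
    return wide.stack(wide.attrs["dims"]).rename(wide.attrs["value"]).to_frame()
```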

@lucasrodes (Member) commented Nov 28, 2024

> One of them is propagating metadata: we have no idea how to do that properly with long-format tables.

We are already propagating metadata for long formats as of today. The thing is that this works fine within the same dataset, where we can use Jinja. It is true that, once we start concatenating different datasets that have different metadata fields, there is a need to differentiate metadata at the dimension level. And we can't really do that with our current tools.

I'd invest time in trying to have metadata at dimension level, to solve these edge cases.

Personally, when I have to append time series that have substantially different metadata, I go with two columns instead, and assume they are, in effect, "two indicators".

> Jinja templates are already quite a messy business; I'm not sure we would be able to improve our tooling to a point where everything can be done with long tables only.

I think this is worth exploring. We currently use Jinja, and while I think it's messy sometimes, it's also convenient and helpful. I think we could also redefine a bit how the metadata YAML file is structured so that it can accommodate dimensions somehow, maybe without the need for Jinja text (see the sketch below).
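As a very rough sketch of that direction (expressed as a Python dict; the actual YAML schema is exactly the open question here), dimension-aware metadata could look something like:

```python
# Hypothetical structure: per-indicator metadata with per-dimension overrides,
# instead of a single jinja-templated block.
metadata = {
    "population": {
        "title": "Total population",
        "description_short": "Total population.",  # default, no dimensions
        "dimensions": {
            "age": {
                "10": {"description_short": "Population aged 10."},
                "20": {"description_short": "Population aged 20."},
            },
        },
    },
}
```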
