Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide dimension lookup in a star/snowflake schema style #464

Open
1 of 3 tasks
monfera opened this issue Dec 2, 2019 · 4 comments
Open
1 of 3 tasks

Provide dimension lookup in a star/snowflake schema style #464

monfera opened this issue Dec 2, 2019 · 4 comments
Labels
:data Data/series/scales related issue discuss To be discussed enhancement New feature or request

Comments

@monfera
Copy link
Contributor

monfera commented Dec 2, 2019

On this ticket, we could discuss the need and feasibility for dimensional lookup, to aid i18n, responsive annotation text lengths and assisting efficient, ad-hoc data exploration.

image (image: Wikipedia)

Currently, textual data for display as legends, categorical axis tick labels etc. are expected to be in the data array, ie. currently we rely on denormalization. This is a good approach with Cartesian charts, especially when elastic-charts can expect a newly built denormalized array on every operation. But these items might be motivation for allowing a dimensional lookup:

  1. the user, or their clients or audience needs internationalization, and texts for dimension codes (eg. 'United Kingdom' for 'UK') are represented in several languates, uploaded into an index
  2. the report/visualization generally has constraints for text sizes; many systems solve it by using progressively shorter text for table headers, pie/treemap labels, categorical tick labels (also depending on horizontal vs vertical orientation) etc., for example, 'United Kingdom of Great Britain and Northern Ireland' -> 'United Kingdom' -> UK - storing just 5 languages and 3 text lengths would require 15 new fields on the document, and the fields would need to have naming conventions to encode the language/text length etc. so it'd get tedious very fast, and any change (eg. to 'Swaziland' -> 'eSwatini') would require reindexing
  3. There may be other types of lookup, eg. deriving the country from the city code, or deriving the region from the country code; if there's a maximum of one hop, it's a star schema, if it's more, it's a snowflake schema
  4. While denormalization works in theory, it fails once we consider ad-hoc reporting, eg. the user has a sudden interest in slicing/dicing, aggregating, annotating etc. on something that was not designed to be in the document

The discussion is needed, because

  • such master data joins may eventually be needed or implemented for underlying ES query languages such as ESSQL, maybe lessening the need to do so
  • even if a SQL JOIN were provided, there'd be technical merit to avoid re-querying some large dataset just because the user wants to now switch to another language, or responsive resizing leads to cramped space and there's less room for the table headers, tick labels or other annotations

Describe a solution
A star or snowflake schema for elastic-charts: besides accepting a facts table (documents/aggregates as now) it could accept an arbitrary, typically low number of dimension tables, while also providing the means for reaching into those dimension tables via (most importantly) equality checking of dimension values against document/aggregate values (star schema) or even allow multi-hop, eg. hierarchical lookup as in the case of eg. a product nomenclature or any other breakdown often needed for treemaps, sunburst charts, table etc. visualizations

While technically, denormalization works (with potentially large to infeasible runtime cost), it's at odds with our evolving visualization algebra, because it hides and obscures how facts (dimensions or granular aggregates) relate to dimensional data. But this very information is key for the user to navigate the chart design space.

Alternatives considered

  1. Continuing as now: denormalized everything
  2. ES schema is not denormalized, but ESSQL obtains a JOIN: elastic-charts wouldn't need to change, but basic things eg. responsive resizing would need that ES is re-queried
  3. Handling dimensional lookup inside elastic-charts (this ticket is biased toward this option)
  4. A comprehensive approach in which Kibana expressions and relational joins (including client-side joins where they're preferable, and server-side joins, once ES or ESSQL supports them) are considered at the Kibana level (ie. so it's equally available to Lens, Vega, 3rd party charts etc.
  5. Combining either or both of the latter two with crossfiltering, which also benefits from the efficiencies of not having to denormalize data

There's no rush on it as option 1 works right now, but it's good to foresee this as it often comes up in tools but it's often hard to refactor for such a general approach later, because by that time a solution could've evolved or conserved, maybe overly specialized, one-off solutions (eg. "let's add a PR for text in multiple languages", then "let's add a PR for differing text lengths" etc.). Even if we don't do anything now, we can keep the approach in mind or refer to it in other discussions.

Checklist

  • this request is checked against already exist requests
  • every related Kibana issue is listed under Kibana Cross Issues list
  • kibana cross issue tag is associated to the issue if any kibana cross issue is present
@monfera monfera added enhancement New feature or request discuss To be discussed labels Dec 2, 2019
@markov00
Copy link
Member

markov00 commented Dec 2, 2019

I totally agree on that Robert, adding a star schema for lookups is a good starting point in my opinion. I don't know how this will be handled within Kibana and ES because I don't actually see a way to query your aggregated data and have some dimensions reported into a lookup table.
But I can see some use cases like: formatting/editing in place a specific text field in a visualization.

@monfera
Copy link
Contributor Author

monfera commented Dec 3, 2019

Yes, in-place editing is a good example too, I didn't think of that. For elastic-charts the fact that Kibana can't (yet?) serve a set of arrays in a snowflake relation needs not be a constraint. In fact, we're already supporting snowflake-like inputs, eg. an explicit set of colors is joined inside elastic-charts with the also user-supported data for visualization. It's not a big jump (conceptually) to go from coloring—a visual channel dimension—to eg. the name of a dimension value in language X.

Also, specific parts of Kibana, eg. Canvas, may well supply multiple inputs, Canvas, via its expression language, has capable means to do so.

@monfera
Copy link
Contributor Author

monfera commented Dec 3, 2019

I think @dsmith001's comments here are relevant too as there are examples for reusable, shareable, themeable, modifiable dimensional mappings (colors, text elements etc). These exceed what's possible with a denormalized data structure where the original data and the related data are blended together.

@monfera
Copy link
Contributor Author

monfera commented Dec 10, 2019

Star schema, snowflake schema, facts, dimensions right from here in Andy Pavlo's lecture

@markov00 markov00 added the :data Data/series/scales related issue label Mar 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:data Data/series/scales related issue discuss To be discussed enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants