This issue was moved to a discussion.

Spec for model/cube #854

Closed
pwalsh opened this issue Dec 30, 2015 · 11 comments

@pwalsh
Member

pwalsh commented Dec 30, 2015

Fiscal Data Package has a mapping object. This is very handy for building a logical model out of the physical data sources when appropriate. This logical model can in turn be used to automate visualisations and data loaders, for example.

Actually, there is nothing particularly "Fiscal" about this mapping: it is simply an OLAP cube implementation with measures and dimensions. I think we could extract out the generic pattern and expose it as a spec for declaring a model/cube mapping for any tabular data package.
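
As a rough illustration of the generic pattern, the sketch below declares measures and dimensions over the physical columns of a tabular resource and checks that every mapped source column exists. The `mapping` layout, the `source` and `attributes` keys, the column names, and both helpers are assumptions loosely modeled on the Fiscal Data Package idea, not an existing spec.

```python
# Hypothetical logical model: all names and the layout are illustrative only.
fields = ["year", "dept_code", "dept_name", "amount"]  # physical columns

mapping = {
    "measures": {
        "spending": {"source": "amount"},
    },
    "dimensions": {
        "time": {"attributes": {"year": {"source": "year"}}},
        "department": {
            "attributes": {
                "code": {"source": "dept_code"},
                "label": {"source": "dept_name"},
            }
        },
    },
}


def mapped_sources(mapping):
    """Return every physical column referenced by the logical model."""
    sources = [m["source"] for m in mapping["measures"].values()]
    for dim in mapping["dimensions"].values():
        sources.extend(a["source"] for a in dim["attributes"].values())
    return sources


def check_mapping(mapping, fields):
    """Return the mapped columns that are missing from the physical data."""
    return [s for s in mapped_sources(mapping) if s not in fields]


missing = check_mapping(mapping, fields)  # empty when the model is consistent
```

Nothing in this sketch is specific to fiscal data; the same structure would apply to any tabular resource with numeric measures and categorical dimensions.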

@danfowler
Contributor

+1 I was thinking in a similar direction (that the mapping/model of FDP should be made generic) when I posted this comment.

@s-celles

@pwalsh
Member Author

pwalsh commented Jul 12, 2016

@rgrp and all

I'm into this idea of course, but I'd rather see how this plays out with a views spec. Happy to leave this open for a while though while views gets worked on.

@rufuspollock
Contributor

WONTFIX. I'm going to close as wontfix for now and we can re-open if there is interest / need.

@danfowler
Contributor

I know this is closed, but I just came across this which seems relevant:

https://json-stat.org/

The JSON-stat format is a simple lightweight JSON format for data dissemination. It is based in a cube model that arises from the evidence that the most common form of data dissemination is the tabular form. In this cube model, datasets are organized in dimensions. Dimensions are organized in categories.

@rufuspollock
Contributor

@danfowler thanks - and I am aware of them (I think this started out as a simple version of SDMX).

@rufuspollock
Contributor

Re-opening. @pwalsh and I have discussed this recently; there is clear interest here, and we'd like to start something in the nearish future.

/cc @ericbusboom

@ericbusboom

I've been going through the JSON-stat website, and so far, I'm pretty sure that I don't understand it at all, and that none of my analysts would be able to create a JSON-stat file by hand. I can tell that the format depends on having several array properties all have the same length, basically breaking a conceptual object into separate fields, which seems like a maintenance nightmare. There is plenty to learn from here, but I don't think it is a good model for a design.
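
To illustrate the parallel-array concern, here is a minimal sketch that uses a simplified layout, not the actual JSON-stat schema. The `cube` dict, its keys, the sample values, and the `lookup` helper are all hypothetical; the point is only that several lists must be kept in step by hand.

```python
# Simplified cube: one conceptual "dimension" is split across parallel lists.
cube = {
    "ids": ["sex", "year"],                # dimension names
    "sizes": [2, 3],                       # number of categories per dimension
    "categories": [["F", "M"], [2014, 2015, 2016]],
    "values": [10, 11, 12, 20, 21, 22],    # row-major, last dimension fastest
}


def lookup(cube, **coords):
    """Find the value for one category per dimension (row-major indexing)."""
    if not (len(cube["ids"]) == len(cube["sizes"]) == len(cube["categories"])):
        raise ValueError("parallel lists have drifted out of step")
    index = 0
    for name, size, cats in zip(cube["ids"], cube["sizes"], cube["categories"]):
        index = index * size + cats.index(coords[name])
    return cube["values"][index]
```

Adding or removing a dimension means editing every list consistently, which is the maintenance burden described above.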

For my users, my top requirement is that it is easy to create and read the specifications. I want data creators to be able to annotate measures and dimensions from memory, with very little training. Data users must be able to understand the annotations with no training.

I have a strong preference for embedding the measure and dimension classifications into the schema, because it's easier to create and read. This can be as simple as:

  1. Defining names in a taxonomy for the types of measures and dimensions
  2. Attaching the names to columns in the existing schema

I imagine the names being mostly common terms like "dollars" or "yen" or "weight" or "sex".
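
A minimal sketch of those two steps, assuming a hypothetical `TAXONOMY` lookup and a hypothetical `category` property on each field (neither is part of Table Schema today):

```python
# Step 1: a tiny, made-up taxonomy that says whether a name is a measure
# or a dimension.
TAXONOMY = {
    "dollars": "measure",
    "yen": "measure",
    "weight": "measure",
    "sex": "dimension",
}

# Step 2: attach taxonomy names to columns in an otherwise ordinary schema.
schema = {
    "fields": [
        {"name": "gender", "type": "string", "category": "sex"},
        {"name": "salary", "type": "number", "category": "dollars"},
        {"name": "notes", "type": "string"},  # unannotated columns are fine
    ]
}


def classify(schema, taxonomy):
    """Group column names by measure/dimension; skip unannotated columns."""
    groups = {"measure": [], "dimension": []}
    for field in schema["fields"]:
        kind = taxonomy.get(field.get("category"))
        if kind:
            groups[kind].append(field["name"])
    return groups
```

Here the only thing a data creator has to remember is the short name to put in `category`.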

I'd further propose that the names have a hierarchical structure to them, to allow for specification and extension. For instance 'weight/lbs' vs 'weight/kg' to distinguish units, or 'race/omb' vs 'race/census' to distinguish between different systems of standards for race.

But, it should also be possible for the user to annotate a column with just "weight." That's not ideal, but I've learned that getting 20% is better than getting 0%.
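
One way the hierarchical names could be handled is sketched below, assuming a single `/` separator and treating a bare name such as "weight" as valid but unqualified. The `parse_name` and `compatible` helpers are hypothetical.

```python
def parse_name(name):
    """Split 'weight/lbs' into ('weight', 'lbs'); a bare 'weight' has no qualifier."""
    base, _, qualifier = name.partition("/")
    return base, (qualifier or None)


def compatible(a, b):
    """Two annotations describe the same kind of thing if their base names match."""
    return parse_name(a)[0] == parse_name(b)[0]
```

With this, `parse_name("weight")` still yields a usable base name, so a partial annotation carries some information even without a unit or standard.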

I'd further propose that the names be linked to JSON definitions that can be inlined or well-known. So "race/omb" may have an associated JSON file, possibly similar to the existing JSON-stat or Fiscal Data Package forms. Then, perhaps, users could also define their own term 'race/orgname' and include their own definitions in the package.
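
A rough sketch of how such a lookup might work, assuming a hypothetical `definitions` property in the package and a hypothetical `WELL_KNOWN` registry. The definition contents are invented placeholders.

```python
# Made-up "well-known" definitions that a shared registry might publish.
WELL_KNOWN = {
    "race/omb": {"kind": "dimension", "title": "Race (OMB categories)"},
    "weight/kg": {"kind": "measure", "unit": "kg"},
}

# A package can inline its own terms next to the schema.
package = {
    "definitions": {
        "race/orgname": {"kind": "dimension", "title": "Race (in-house codes)"},
    }
}


def resolve(name, package, well_known=WELL_KNOWN):
    """Prefer a definition inlined in the package, then fall back to well-known ones."""
    inline = package.get("definitions", {})
    if name in inline:
        return inline[name]
    return well_known.get(name)  # None when the name has no definition yet
```

Returning `None` for an undefined name keeps a bare annotation usable even before anyone has written a definition for it.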

I don't (currently) have strong opinions about the structure of the definitions for the names -- the Fiscal Data Package definitions seem suitably extensible and generalizable -- since the definitions would mostly be created by experts.

However, I am strongly opinionated that the typical user should be able to annotate the dataset with nothing more than applying a measure/dimension name to a column in the existing schema, and those names should be familiar and easy to memorize.

For reference, here are the inputs and outputs of the annotation system I'd produced before. This one has a rich datatype field (rather than a separate field for the measure/dimension annotation), and a parent connection to link columns. The measure/dimension classification is inherent in the rich datatype; "count" is always a measure, "raceeth" is always a dimension. Here is a schema file:

http://test.docker1.civicknowledge.com/bundles/d04p006/file/schema.csv

And here is what the file looks like when rendered for the web:

http://test.docker1.civicknowledge.com/partitions/p04p00f006

As with Tableau, dimensions are green and measures are blue. Errors and uncertainties are grey. Indentation represents parent/child relationships.
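
The rich-datatype approach with parent links can be sketched roughly as follows. Only the "count" and "raceeth" datatypes come from the description above; the "margin" datatype, the column names, and the `render` helper are assumptions made for illustration and do not reflect the linked schema file.

```python
# Hypothetical rich datatypes: the role is implied by the datatype itself.
ROLE_BY_DATATYPE = {"count": "measure", "raceeth": "dimension", "margin": "error"}

columns = [
    {"name": "race", "datatype": "raceeth", "parent": None},
    {"name": "population", "datatype": "count", "parent": None},
    {"name": "population_moe", "datatype": "margin", "parent": "population"},
]


def render(columns, parent=None, depth=0):
    """Return indented 'name (role)' lines, children nested under their parent."""
    lines = []
    for col in columns:
        if col["parent"] == parent:
            role = ROLE_BY_DATATYPE[col["datatype"]]
            lines.append("  " * depth + f"{col['name']} ({role})")
            lines.extend(render(columns, col["name"], depth + 1))
    return lines
```

Joining the result of `render(columns)` with newlines gives a text outline in which the error column sits indented under the measure it qualifies.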

@pwalsh
Member Author

pwalsh commented Feb 13, 2017

so, now actually reopening, and also ref. frictionlessdata/datapackage#343

pwalsh reopened this Feb 13, 2017
pwalsh self-assigned this Feb 13, 2017
@pwalsh
Member Author

pwalsh commented Feb 15, 2017

@ericbusboom

I am quite sure I completely get what you want here, and it is very much in line with where I think we need to go to generalise this out of our previous work on FDP.

One question: you say

But, it should also be possible for the user to annotate a column with just "weight." That's not ideal, but I've learned that getting 20% is better than getting 0%.

Which user? Someone who edits a descriptor file directly, so, someone comfortable with text editing a JSON file?

I ask because I want to distinguish between a canonical representation of something on the descriptor, and an ideal user experience for "end users" who might generate a descriptor via a series of actions.

OpenSpending currently supports a customisation to FDP (unspecified as yet) which does such annotations per field.

@ericbusboom

Ah, good question. I half-thought "user" was the wrong word when I wrote that. I should have said Creator and Wrangler, as described in this analysis model. So, it's the people who are creating the dataset and the data dictionary, not the people who are defining what "weight" means.

I ask because I want to distinguish between a canonical representation of something on the descriptor, and an ideal user experience for "end users" who might generate a descriptor via a series of actions.

Yes, absolutely. The definition of what "weight" is could be (probably should be) JSON.

I've updated one of my older specifications into a proposal for a semantic datatype category taxonomy. This is basically the system I've linked to previously, used in Ambry.

roll transferred this issue from frictionlessdata/datapackage Jan 3, 2024
frictionlessdata locked and limited conversation to collaborators Jan 3, 2024
roll converted this issue into discussion #855 Jan 3, 2024
