Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional layer metadata at instantiation #952

Merged
merged 37 commits into from
May 21, 2018
Merged

Add optional layer metadata at instantiation #952

merged 37 commits into from
May 21, 2018

Conversation

jgoizueta
Copy link
Contributor

This is experimental to be used by CartoVL. It shouldn't be public/documented at the moment.

Fixes #948

@jgoizueta jgoizueta self-assigned this May 8, 2018
@jgoizueta
Copy link
Contributor Author

Notes:

This allows obtaining additional metadata in map instantiation. When aggregation occurs, the metadata is about the original, unaggregated data source.
This is so to fullfill current Carto VL needs, but we can discuss about having aggregation metadata too in the future.

The metadata can be requested adding a metadata entry to the layer options, which can have this properties:

  • featureCount: (true/false) actual row count (as opposed to the default estimated row count)
  • geometryType (true/false) type of the geometry column
  • columns: (true/false) name and types of functions
  • columnStats: (true/false/object) same as columns and also some basic stats: max, min, avg, sum for numbers, max, min for dates, cand top categories for text columns. The number of top categories is 1024 by default and can be controlled by passing an object like { topCategories: 100 } here.
  • sample (integer) returns a sample of the table of the approximate size (number of rows).

The requested metadata is returned in the response for each layer (metadata.layers) in meta.stats, in addition to estimatedFeatureCount which we already had.

I've preserved the existing behaviour for estimatedFeatureCount, which takes the value -1 if any SQL error occurs when computing it. I think this is questionable, as may be hiding unexpected problems. For the new metadata I'm not filtering SQL errors.

@jgoizueta jgoizueta requested review from dgaubert and rochoa May 9, 2018 13:53
@jgoizueta
Copy link
Contributor Author

jgoizueta commented May 10, 2018

I'm reviewing this because of a major problem: at the point where metadata is being computed now only the aggregated query (in case of aggregation) is available.
We could extract the original query since it appears as ( ... ) _cdb_query but a less hacky approach would be desirable.

Also the previously existing stat estimatedFeatureCount refers to the aggregated results in case of aggregation. So, to be consistent we should be explicit in the metadata parameters about pre/post aggregation metadata and support both in some cases (like count).

Tests for metadata in the presence of aggregation must be added.

@jgoizueta
Copy link
Contributor Author

Well, regarding the existing stat estimatedFeatureCount it never really worked with aggregation because it executes the aggregation query without performing Mapnik token substitution (scale_denominator, etc.) This wasn't noticed because as stated in my previous comment we were silently returning -1 in case of error.

So, now we should be free to change the behaviour of this for aggregation and return the estimate count for the pre-aggregation query in this default stat.

All stats are computed now pre-aggregation
Code to help compute post-aggregation stats remains for testing
jgoizueta added 4 commits May 18, 2018 15:29
Remove usage of PhasedExecution
This achives better query execution granularity and
removes questionable usage of shared results object.
It introduces a couple of behavior changes:
* estimatedFeatureCount desn't ignore errors now
* sample always uses estimatedFeatureCount,even if the actual count is also computed.
@jgoizueta
Copy link
Contributor Author

Hey @dgaubert can you review this again?

There was an interesting problem with the tests (well, "interesting" wasn't the word I had in mind while I was pulling my hair out trying to figure it out).

The test "layergroup can hold substitution tokens" was failing but only on travis. (not even using the docker tests locally). I finally was able to reproduce it locally by setting global.environment.enabledFeatures.layerStats = true; in the test.
So we have some test-order issue with some test leaving an indeterminate state in layerStats.

Now, the problem, which has existed for a long time, and has been reveled now, is this: we always compute the row count estimate stat. But this has been failing if the sql query contains Mapnik tokens (because we make no substitution before executing it).
But since we were ignoring any SQL errors for that query (setting row count to -1) we didn't notice.

I've fixed the substitution problem, but I haven't look at the layerStats configuration problem

Keep current production behavior of ignoreing errors when computing this stat and returning -1.
This is done as to no introduce any instability in production at the moment.
@jgoizueta
Copy link
Contributor Author

I've reverted the behaviour in case of error when computing estimatedFeatureCount (computed in each instantiation) to ignore errors and return -1, so we can safely deploy this in production without introducing new risks.

@oleurud do you think worth to make that conditional on the environment (so that in development, staging, etc the errors aren't ignored)? (if (process.env.NODE_ENV === 'production') ...)

@simon-contreras-deel
Copy link
Contributor

Maybe an environment configuration parameter will be the best option (easy and fast to enable/disable)

The sampling probability is now being computed using an estimate of the table row count
This could led to too high probabilities (to large samples) if the estimate is not accurate.
To avoid potential problems with large samples we've added a LIMIT to the sampling queries.
Copy link
Contributor

@simon-contreras-deel simon-contreras-deel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but let me check again this afternoon (i need to read some things)

if (field.type === 'number') {
return ['min', 'max', 'avg', 'sum'];
}
if (field.type === 'date') { // TODO other types too?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be a else if?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I think I've omitted lately quite a few elses because of returns inside conditions.
🤔 do you think the explicit else is preferable for clarity?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forget if, you are right.


// columns are returned as an object { columnName1: { type1: ...}, ..}
// for consistency with SQL API
function formatResultFields(dbConnection, flds) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be prettier (no required, but it will be easy to understand):

function formatResultFields(dbConnection, fields = []) {
    let nfields = {};
    for (let field in fields) {
      const cname = dbConnection.typeName(field.dataTypeID);
      let tname;
      if ( ! cname ) {
        tname = 'unknown(' + field.dataTypeID + ')';
      } else {
        tname = fieldType(cname);
      }
      nfields[field.name] = { type: tname };
    }
    return nfields;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You catched me copy-pasting from SQL API!! 😊


// TODO: compute _sample with _featureCount when available

Promise.all([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good :)

({ estimatedFeatureCount }) => _sample(ctx, estimatedFeatureCount)
.then(s => mergeResults([s, { estimatedFeatureCount }]))
),
_featureCount(ctx),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question to understand it: If some of _featureCount, _aggrFeatureCount, _geometryType or _columns fails, we will return an error as a response. I know that the user must request for it expressly, but I am not sure if metadata should do to fail a request. What is your point of view?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a reason to stop the PR and also, I am not saying to change it. Only it raises doubts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, If the user requests for any specific metadata and an error happens, we should return that error because we weren't able to process the request completely.

Copy link
Contributor

@simon-contreras-deel simon-contreras-deel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have my blessings ;)

Thanks for a very nice issue (and the comments on the PR) allowing everyone to understand very well the use case, the solution, and the caveats

@jgoizueta jgoizueta merged commit 4711b28 into master May 21, 2018
@jgoizueta jgoizueta deleted the metadata branch May 21, 2018 14:05
threshold: 1
},
metadata: {
aggrFeatureCount: 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 is the zoom value??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly

}

testClient.getLayergroup(function(err, layergroup) {
assert.ok(!err);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert.ifError(). It gives a proper feedback of the actual error. However assert.ok() wraps the error behind the scenes.

});

testClient.getLayergroup(function(err, layergroup) {
assert.ok(!err);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert.ifError()

});

testClient.getLayergroup(function(err, layergroup) {
assert.ok(!err);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert.ifError()

});

testClient.getLayergroup(function(err, layergroup) {
assert.ok(!err);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert.ifError()

});

testClient.getLayergroup(function(err, layergroup) {
assert.ok(!err);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert.ifError()

});

testClient.getLayergroup(function(err, layergroup) {
assert.ok(!err);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert.ifError()

Promise.all([
_estimatedFeatureCount(ctx).then(
({ estimatedFeatureCount }) => _sample(ctx, estimatedFeatureCount)
.then(s => mergeResults([s, { estimatedFeatureCount }]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏 rename s by sample

@@ -11,17 +34,278 @@ MapnikLayerStats.prototype.is = function (type) {
return this._types[type] ? this._types[type] : false;
};

function queryPromise(dbConnection, query, adaptResults, errorHandler) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it a little bit of complication here?

We can simply return a promise:

function queryPromise(dbConnection, query) {
  return new Promise((resolve, reject) => { 
    dbConnection.query(query, (err, res) => err ? reject(err) : resolve(res))   
  })
}

And then, for each specific metadata function:

function _estimatedFeatureCount (ctx) {
  return queryPromise(ctx.dbConnection, _getSQL(ctx, queryUtils.getQueryRowEstimation))
    .then(res => ({ estimatedFeatureCount: res.rows[0].rows }))
    .catch(() => ({ estimatedFeatureCount: -1 }))
}

Note: Remember, .then and .catch return a promise as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, we can avoid the ctx object and pass only the arguments that each metadata function needs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have fixed the other issues you mentioned (except the queryPromise refactor), as they were very simple.

Regarding queryPromise, I think you're absolutely right, but I don't want to delay the deploy today, so I'll open a separate PR to do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough!

Please, don't miss that refactor; you'll end up loving promises and their composability!

💋

Copy link
Contributor

@dgaubert dgaubert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave some comments that I would like you to take into account.

@jgoizueta jgoizueta mentioned this pull request May 21, 2018
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants