
standard bigquery v2, support nested schema in query result #1593

Closed
c0b opened this issue Sep 14, 2016 · 18 comments
Labels: api: bigquery (Issues related to the BigQuery API)


c0b commented Sep 14, 2016

Please support nested schema field names in complex queries like SELECT ARRAY[STRUCT(1 AS a, 2 AS b, 3 AS c)] AS d

const BigQuery = require('@google-cloud/bigquery');
> BigQuery().query({ query: 'SELECT 1 AS cnt', useLegacySql: false }, function (err, data) { if (err) return console.error(err); console.dir(data, { depth: null, colors: true }); });
> [ { cnt: 1 } ]

This is correct: the field name cnt is used as expected.

> BigQuery().query({ query: 'SELECT ARRAY[STRUCT(1 AS a, 2 AS b, 3 AS c)] AS d', useLegacySql: false }, function (err, data) { if (err) return console.error(err); console.dir(data, { depth: null, colors: true }); });
> [ { d: [ { v: { f: [ { v: '1' }, { v: '2' }, { v: '3' } ] } } ] } ] 
  • only the outer-level field name d is used; the inner names a, b, c are not,
  • everything inside gets the default f and v keys, and the value types are lost: the 1, 2, 3 here should be integers

I have even more complex queries than this, and they all behave the same way: only the outermost level of schema field names and value types gets populated.
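For a single-level struct, the expected unwrapping can be sketched by zipping the schema's field list with the row's f/v entries (unwrapOneLevel is a hypothetical helper for illustration, not part of any of these libraries):

```javascript
// Hypothetical helper: zip one level of schema fields with a raw BigQuery row.
// Only the scalar INTEGER case needed for this example is handled.
function unwrapOneLevel(schemaFields, rawRow) {
  var out = {};
  rawRow.f.forEach(function (field, i) {
    var sf = schemaFields[i];
    out[sf.name] = sf.type === 'INTEGER' ? parseInt(field.v, 10) : field.v;
  });
  return out;
}

// The inner struct from the query above, as returned by the API:
var raw = { f: [{ v: '1' }, { v: '2' }, { v: '3' }] };
var fields = [
  { name: 'a', type: 'INTEGER' },
  { name: 'b', type: 'INTEGER' },
  { name: 'c', type: 'INTEGER' }
];

console.log(unwrapOneLevel(fields, raw)); // { a: 1, b: 2, c: 3 }
```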

Here is a raw query over HTTP. The http tool is from httpie, an advanced alternative to curl; after getting an access token with the same tool, I can run queries like this on the command line. The server response includes the correct schema with nested fields, so technically it should be possible to populate all inner-level field names and value types correctly:

➸ time http --verbose --pretty=colors POST https://www.googleapis.com/bigquery/v2/projects/XXXXXXX/queries 'Authorization: Bearer XXXXXXXXXXXX' useLegacySql=false useQueryCache=false query='SELECT ARRAY[STRUCT(1 AS a, 2 AS b, 3 AS c)] AS d'
POST /bigquery/v2/projects/XXXXXXXXXXXXXX/queries HTTP/1.1
Content-Length: 113
Accept-Encoding: gzip, deflate
Host: www.googleapis.com
Accept: application/json, */*
User-Agent: HTTPie/0.9.6
Connection: keep-alive
Content-Type: application/json
Authorization: Bearer XXXXXXXXXXXXXXXXXXXXXX

{"useLegacySql": "false", "useQueryCache": "false", "query": "SELECT ARRAY[STRUCT(1 AS a, 2 AS b, 3 AS c)] AS d"}

HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Wed, 14 Sep 2016 17:37:03 GMT
ETag: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
Vary: Origin
Vary: X-Origin
Content-Type: application/json; charset=UTF-8
Content-Encoding: gzip
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Alt-Svc: quic=":443"; ma=2592000; v="36,35,34,33,32"
Transfer-Encoding: chunked

{
 "kind": "bigquery#queryResponse",
 "schema": {
  "fields": [
   {
    "name": "d",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
     {
      "name": "a",
      "type": "INTEGER",
      "mode": "NULLABLE"
     },
     {
      "name": "b",
      "type": "INTEGER",
      "mode": "NULLABLE"
     },
     {
      "name": "c",
      "type": "INTEGER",
      "mode": "NULLABLE"
     }
    ]
   }
  ]
 },
 "jobReference": {
  "projectId": "XXXXXXXXXXXXX",
  "jobId": "job_XXXXXXXXXXXXXXXXXXXXXXXXXXX"
 },
 "totalRows": "1",
 "rows": [
  {
   "f": [
    {
     "v": [
      {
       "v": {
        "f": [
         {
          "v": "1"
         },
         {
          "v": "2"
         },
         {
          "v": "3"
         }
        ]
       }
      }
     ]
    }
   ]
  }
 ],
 "totalBytesProcessed": "0",
 "jobComplete": true,
 "cacheHit": false
}

real    0m3.040s
user    0m0.364s
sys 0m0.020s

Environment details

  • OS: Linux
  • Node.js version: v6

I have searched other issues here about nested schemas (#1564, #1135), but they are for a different purpose; I believe this ticket is distinct.

@stephenplusplus stephenplusplus added the api: bigquery Issues related to the BigQuery API. label Sep 14, 2016
stephenplusplus commented:

Thanks for the report! We'll definitely dig into this. In the meantime, you should be able to access the same data returned from httpie via the 3rd parameter to the query callback:

BigQuery().query('...', {...}, function(err, data, apiResponse) {
  // apiResponse === http response body
})


c0b commented Sep 14, 2016

Another thing that confused me: there seem to be multiple BigQuery Node.js bindings. The code above uses const BigQuery = require('@google-cloud/bigquery'); var bigquery = BigQuery(); bigquery.query(...). Other sample code uses require('google-cloud').bigquery('v2') and gets a readable-stream API that needs to listen for 'data' events and process each row. There is even a third one, the googleapis npm package, which uses var google = require('googleapis'); var bigquery = google.bigquery('v2'); and bigquery.jobs.query(...). All of these, as far as I have tried, have the same problem of inner fields not being populated.

Could you clarify which library is officially supported by Google, or which one is the latest? And could you mark the others as deprecated?

  1. https://googlecloudplatform.github.io/google-cloud-node/#/docs/google-cloud/0.40.0/bigquery?method=query
  2. https://www.npmjs.com/package/googleapis#request-level-options
> const gcloud = require('google-cloud');
> var bigquery = gcloud.bigquery('v2');
undefined
> bigquery.query({ query: 'SELECT STRUCT(1 AS a, 2 AS b, 3 AS c) AS d', useLegacySql: false }).on('data', rows => { console.dir(rows, { depth: null, colors: true }); } );
DestroyableTransform {
  _readableState: 
   ReadableState {
     objectMode: true,
     highWaterMark: 16,
     buffer: [],
   ...
> { d: { f: [ { v: '1' }, { v: '2' }, { v: '3' } ] } }

stephenplusplus commented:
https://googlecloudplatform.github.io/google-cloud-node/#/:

What is the relationship between google-cloud and the Google APIs Node.js Client?

The Google APIs Node.js Client is a client library for using the broad set of Google APIs. google-cloud is built specifically for the Google Cloud Platform and is the recommended way to integrate Google Cloud APIs into your Node.js applications. If your application requires both Google Cloud Platform and other Google APIs, the 2 libraries may be used by your application.

So, this is the handmade API; googleapis is auto-generated. We have two different ways you can use BigQuery: the all-in-one module, google-cloud, which ships with every API we support in this library, or, if you don't want the extra APIs you aren't using, the standalone package: @google-cloud/bigquery.

require('google-cloud').bigquery('v2')

We don't allow specifying a version in this library; we always use the latest. The v2 parameter is ignored.


c0b commented Sep 14, 2016

// apiResponse === http response body

That would be a workaround until this issue is resolved, but it means parsing the schema myself; I would prefer to wait for this issue to be fixed.

stephenplusplus commented:

Would you mind getting the JSON output from a complicated query and running it through here:

BigQuery().query('...', {}, function(err, data, apiResponse) {
  console.log(JSON.stringify(test(apiResponse.schema, apiResponse.rows), null, 2))
});

// Note: relies on the "is" npm package (is.nil) and the library's internal
// Table.mergeSchemaWithRows_ helper.
function test(schema, rows) {
  return rows.map(mergeSchema).map(flattenRows);

  function mergeSchema(row) {
    return row.f.map(function(field, index) {
      var schemaField = schema.fields[index];
      var value = field.v;

      switch (schemaField.type) {
        case 'BOOLEAN': {
          value = value === 'true';
          break;
        }
        case 'FLOAT': {
          if (!is.nil(value)) {
            value = parseFloat(value);
          }
          break;
        }
        case 'INTEGER': {
          if (!is.nil(value)) {
            value = parseInt(value, 10);
          }
          break;
        }
        case 'RECORD': {
          value = Table.mergeSchemaWithRows_(schemaField, [value[0].v]);
          break;
        }
        case 'TIMESTAMP': {
          value = new Date(value * 1000);
          break;
        }
      }

      var fieldObject = {};
      fieldObject[schemaField.name] = value;
      return fieldObject;
    });
  }

  function flattenRows(rows) {
    return rows.reduce(function(acc, row) {
      var key = Object.keys(row)[0];
      acc[key] = row[key];
      return acc;
    }, {});
  }
}

If that works as expected, I'll send a PR.

stephenplusplus commented:

I'm pretty sure it's going to need tweaking, so if you could provide the raw JSON response body from what you think is the most complicated query, I'll be in a better place to test. Thanks!


c0b commented Sep 16, 2016

I am testing the Python API as well; it is able to parse some of the inner-level structure, but the value types are still not right.

When I submit invalid SQL, it is able to infer that the inner struct is STRUCT<STRING, INT64, STRING>:

$ gcloud-sdk/.../path/to/bq.py --format=prettyjson query --nouse_legacy_sql 'SELECT ARRAY[STRUCT("zx83" AS id, 1 AS a, 2 AS b), ("f8f7", 4, "7a")] d'
Error in query string: Error processing job 'XXXXXXXX': Array elements of types
{STRUCT<STRING, INT64, STRING>, STRUCT<id STRING, a INT64, b INT64>} do not have a common supertype at [1:8]

When I submit correct SQL, it gives this output:

bq.py --format=prettyjson query --nouse_legacy_sql 'SELECT ARRAY[STRUCT("zx83" AS id, 1 AS a, 2 AS b), ("f8f7", 4, 7)] d'
[
  {
    "d": [
      {
        "a": "1", 
        "b": "2", 
        "id": "zx83"
      }, 
      {
        "a": "4", 
        "b": "7", 
        "id": "f8f7"
      }
    ]
  }
]

The fields a and b are already known to be INTEGER, so they should be numbers in the JSON output; also, please keep the field order. I know the ordering is a known problem with Python's default JSON handling: I guess it internally calls json.loads to parse objects into Python dicts and outputs them with json.dumps, and those dicts are unordered.

JavaScript's JSON.parse and JSON.stringify don't have that problem; the fields come out in the right order. I am still expecting the right value types, though.

// expected output
[
  {
    "d": [
      {
        "id": "zx83",
        "a": 1,
        "b": 2
      },
      {
        "id": "f8f7",
        "a": 4,
        "b": 7
      }
    ]
  }
]
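The ordering point is easy to verify in Node: string-keyed object properties survive a JSON.parse/JSON.stringify round trip in their original order (a quick illustration):

```javascript
// String-keyed object properties keep insertion order through a
// parse/stringify round trip, so the API's field order is preserved.
var s = '{"id":"zx83","a":1,"b":2}';
console.log(JSON.stringify(JSON.parse(s)) === s); // true
```

(Integer-like keys such as "0" would be reordered ahead of string keys, but BigQuery field names are ordinary identifiers.)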

stephenplusplus commented:

Interesting. If something unexpected is happening with that library, I'll have to ask someone where we can report the bug. Off-hand, I'm not sure.

As far as ordering of the results, we will keep it in the order the HTTP JSON API returns it to us. The only extra step we do is combining the data and transforming it back to the native type of the result.

To make sure my attempted solution above can handle even the most complex queries, would you mind running one that you think might break it, then sending over the JSON response body from the API? I appreciate your help debugging so far; sorry for asking for more!


c0b commented Sep 23, 2016

I've just got some cycles to test this. Are you sure the 3rd argument apiResponse exists?

I'm following the examples in this doc, which say nextQuery is the 3rd argument and apiResponse the 4th:

https://googlecloudplatform.github.io/google-cloud-node/#/docs/google-cloud/0.40.0/bigquery?method=query

const bigquery = BigQuery();

bigquery.query({
  query: 'SELECT ARRAY[STRUCT("fa23" AS id, 1 AS a, 2 AS b, 3.4 AS c), ("bd23", 2, 3, 3.14)] AS d',
  useLegacySql: false
}, (err, rows, nextQuery, apiResponse) => {
  if (err) {
    return console.error(err);
  }

  console.log(arguments.length); // arrow functions have no own 'arguments'; in a file this resolves to the module wrapper's 5 arguments
  console.dir(rows, { depth: null, colors: true });
  console.dir({ err, rows, nextQuery, apiResponse },
              { depth: null, colors: true });
});

but the code gets this output; I don't see either nextQuery or apiResponse. Does it require an option to be set?

5
[ { d: 
     [ { v: { f: [ { v: 'fa23' }, { v: '1' }, { v: '2' }, { v: '3.4' } ] } },
       { v: { f: [ { v: 'bd23' }, { v: '2' }, { v: '3' }, { v: '3.14' } ] } } ] } ]
{ err: null,
  rows: 
   [ { d: 
        [ { v: { f: [ { v: 'fa23' }, { v: '1' }, { v: '2' }, { v: '3.4' } ] } },
          { v: { f: [ { v: 'bd23' }, { v: '2' }, { v: '3' }, { v: '3.14' } ] } } ] } ],
  nextQuery: undefined,
  apiResponse: undefined }

I've tried your way of passing the query string as the 1st parameter and options as the 2nd, but it seems to ignore useLegacySql and always reports invalid SQL (the query is parsed as legacy SQL):

BigQuery().query('SELECT STRUCT(...)', { useLegacySql: false }, function(err, data, apiResponse) {
  console.log(JSON.stringify(test(apiResponse.schema, apiResponse.rows), null, 2))
});

Does the library support a NODE_DEBUG=... environment variable to dump API server interaction details?

stephenplusplus commented:

Sorry, my earlier example was wrong about apiResponse existing by default. The JSON APIs return responses to us in a paginated format. By default, our library makes as many API requests as necessary to get all of the results before executing your callback. To receive an apiResponse argument, you must turn this feature off with autoPaginate: false.

Here's another version that runs the query with pagination disabled and passes the apiResponse to your callback. Please give this a shot:

BigQuery().query('SELECT STRUCT(...)', {
  useLegacySql: false,
  autoPaginate: false
}, function(err, rows, nextQuery, apiResponse) {
  if (err) throw err
  console.log(JSON.stringify(test(apiResponse.schema, apiResponse.rows), null, 2))
})

// Note: relies on the "is" npm package (is.nil) and the library's internal
// Table.mergeSchemaWithRows_ helper.
function test(schema, rows) {
  return rows.map(mergeSchema).map(flattenRows);

  function mergeSchema(row) {
    return row.f.map(function(field, index) {
      var schemaField = schema.fields[index];
      var value = field.v;

      switch (schemaField.type) {
        case 'BOOLEAN': {
          value = value === 'true';
          break;
        }
        case 'FLOAT': {
          if (!is.nil(value)) {
            value = parseFloat(value);
          }
          break;
        }
        case 'INTEGER': {
          if (!is.nil(value)) {
            value = parseInt(value, 10);
          }
          break;
        }
        case 'RECORD': {
          value = Table.mergeSchemaWithRows_(schemaField, [value[0].v]);
          break;
        }
        case 'TIMESTAMP': {
          value = new Date(value * 1000);
          break;
        }
      }

      var fieldObject = {};
      fieldObject[schemaField.name] = value;
      return fieldObject;
    });
  }

  function flattenRows(rows) {
    return rows.reduce(function(acc, row) {
      var key = Object.keys(row)[0];
      acc[key] = row[key];
      return acc;
    }, {});
  }
}


c0b commented Sep 24, 2016

This code works for a simple struct, but I believe a recursive call is needed somewhere for deeply nested structures, like:

  apiResponse: 
   { kind: 'bigquery#queryResponse',
     schema: 
      { fields: 
         [ { name: 'd',
             type: 'RECORD',
             mode: 'REPEATED',
             fields: 
              [ { name: 'id', type: 'STRING', mode: 'NULLABLE' },
                { name: 'a', type: 'INTEGER', mode: 'NULLABLE' },
                { name: 'b', type: 'INTEGER', mode: 'NULLABLE' },
                { name: 'c', type: 'FLOAT', mode: 'NULLABLE' },
                { name: 'f',
                  type: 'RECORD',
                  mode: 'NULLABLE',
                  fields: [ { name: 'a', type: 'INTEGER', mode: 'NULLABLE' } ] } ] } ] },
     rows: 
      [ { f: 
           [ { v: 
                [ { v: 
                     { f: 
                        [ { v: 'fa23' },
                          { v: '1' },
                          { v: '2' },
                          { v: '3.4' },
                          { v: { f: [ { v: '1' } ] } } ] } },
                  { v: 
                     { f: 
                        [ { v: 'bd23' },
                          { v: '2' },
                          { v: '3' },
                          { v: '3.14' },
                          { v: { f: [ { v: '3' } ] } } ] } } ] } ] } ],
     totalBytesProcessed: '0',
     jobComplete: true,
     cacheHit: false } }

[ { d: [ { id: 'fa23', a: 1, b: 2, c: 3.4, f: { f: [ { v: '1' } ] } } ] } ]          // this is what is currently printed; it seems only the first element of the array is shown

This is from the Python bq tool:

[
  {
    "d": [
      {
        "a": "1", 
        "b": "2", 
        "c": "3.4", 
        "f": {
          "a": "1"
        }, 
        "id": "fa23"
      }, 
      {
        "a": "2", 
        "b": "3", 
        "c": "3.14", 
        "f": {
          "a": "3"
        }, 
        "id": "bd23"
      }
    ]
  }
]

Expected is:

[
  {
    "d": [
      {
        "id": "fa23",
        "a": 1,
        "b": 2,
        "c": 3.4,
        "f": {
          "a": 1
        }
      },
      {
        "id": "bd23",
        "a": 2,
        "b": 3,
        "c": 3.14,
        "f": {
          "a": 3
        }
      }
    ]
  }
]

stephenplusplus commented:

Thanks for the snippet, I'll get to work on a recursive solution.


c0b commented Sep 24, 2016

the query is bq --format=prettyjson query --use_legacy_sql=false 'SELECT ARRAY[STRUCT("fa23" AS id, 1 AS a, 2 AS b, 3.4 AS c, STRUCT(1 AS a) AS f), ("bd23", 2, 3, 3.14, STRUCT(3))] AS d'

stephenplusplus commented:

Okay, please try:

console.log(JSON.stringify(test(apiResponse.schema, apiResponse.rows), null, 2))

function test(schema, rows) {
  return (Array.isArray(rows) ? rows : [rows]).map(mergeSchema).map(flattenRows);

  function mergeSchema(row) {
    return row.f.map(function(field, index) {
      var schemaField = schema.fields[index];
      var value = field.v;

      switch (schemaField.type) {
        case 'BOOLEAN': {
          value = value === 'true';
          break;
        }
        case 'FLOAT': {
          value = parseFloat(value);
          break;
        }
        case 'INTEGER': {
          value = parseInt(value, 10);
          break;
        }
        case 'RECORD': {
          if (schemaField.mode === 'REPEATED') {
            value = value.map(function(val) {
              return test(schemaField, val.v).pop();
            });
          } else {
            value = test(schemaField, value).pop();
          }

          break;
        }
        case 'TIMESTAMP': {
          value = new Date(value * 1000);
          break;
        }
      }

      var fieldObject = {};
      fieldObject[schemaField.name] = value;
      return fieldObject;
    });
  }

  function flattenRows(rows) {
    return rows.reduce(function(acc, row) {
      var key = Object.keys(row)[0];
      acc[key] = row[key];
      return acc;
    }, {});
  }
}

Feel free to run other responses against it as well.
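As a sanity check on the recursive approach, here is a compact, self-contained variant of the merge logic above, run against the doubly nested sample from this thread (an illustrative sketch: merge is a stand-in name, and NULL handling is omitted):

```javascript
// Compact variant of the recursive schema merge, for a quick sanity check.
function merge(schema, row) {
  var out = {};
  row.f.forEach(function (field, i) {
    var sf = schema.fields[i];
    var v = field.v;
    if (sf.type === 'INTEGER') v = parseInt(v, 10);
    else if (sf.type === 'FLOAT') v = parseFloat(v);
    else if (sf.type === 'RECORD') {
      // REPEATED records arrive as an array of { v: {...} } wrappers.
      v = sf.mode === 'REPEATED'
        ? v.map(function (el) { return merge(sf, el.v); })
        : merge(sf, v);
    }
    out[sf.name] = v;
  });
  return out;
}

// Schema and one raw row from the nested-STRUCT query in this thread:
var schema = {
  fields: [{
    name: 'd', type: 'RECORD', mode: 'REPEATED',
    fields: [
      { name: 'id', type: 'STRING' },
      { name: 'a', type: 'INTEGER' },
      { name: 'b', type: 'INTEGER' },
      { name: 'c', type: 'FLOAT' },
      { name: 'f', type: 'RECORD', mode: 'NULLABLE',
        fields: [{ name: 'a', type: 'INTEGER' }] }
    ]
  }]
};
var row = {
  f: [{
    v: [
      { v: { f: [{ v: 'fa23' }, { v: '1' }, { v: '2' }, { v: '3.4' },
                 { v: { f: [{ v: '1' }] } }] } },
      { v: { f: [{ v: 'bd23' }, { v: '2' }, { v: '3' }, { v: '3.14' },
                 { v: { f: [{ v: '3' }] } }] } }
    ]
  }]
};

console.log(JSON.stringify(merge(schema, row)));
// {"d":[{"id":"fa23","a":1,"b":2,"c":3.4,"f":{"a":1}},{"id":"bd23","a":2,"b":3,"c":3.14,"f":{"a":3}}]}
```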


c0b commented Sep 28, 2016

I'm reusing this code with table.getRows; if a table is defined with nested fields, that function also returns objects with nested f and v, so this recursive decoding is needed there too.

https://googlecloudplatform.github.io/google-cloud-node/#/docs/bigquery/0.3.0/bigquery%2Ftable?method=getRows

I found 2 problems:

  1. table.getRows's apiResponse doesn't include the schema, so I have to call table.getMetadata first. Can getRows do this automatically, or take a parameter that makes it return the schema in apiResponse?
  2. Fields are NULLABLE by default, and a RECORD can also be NULL. I made a small change to handle this:
function test(schema, rows) {
  return (Array.isArray(rows) ? rows : [rows]).map(mergeSchema).map(flattenRows);

  function mergeSchema(row) {
    return row.f.map(function(field, index) {
      var schemaField = schema.fields[index];
      var value = field.v;

      var fieldObject = {};

      // skip any schema decoding if it's null
      if (value === null) {
        fieldObject[schemaField.name] = null;
        return fieldObject;
      }

      switch (schemaField.type) {
        case 'BOOLEAN': {
          value = value === 'true';
          break;
        }
        case 'FLOAT': {
          value = parseFloat(value);
          break;
        }
        case 'INTEGER': {
          value = parseInt(value, 10);
          break;
        }
        case 'RECORD': {
          if (schemaField.mode === 'REPEATED') {
            value = value.map(function(val) {
              return test(schemaField, val.v).pop();
            });
          } else {
            value = test(schemaField, value).pop();
          }

          break;
        }
        case 'TIMESTAMP': {
          value = new Date(value * 1000);
          break;
        }
      }

      fieldObject[schemaField.name] = value;
      return fieldObject;
    });
  }

  function flattenRows(rows) {
    return rows.reduce(function(acc, row) {
      var key = Object.keys(row)[0];
      acc[key] = row[key];
      return acc;
    }, {});
  }
}

stephenplusplus commented:

Thanks! I'll put that into a PR and ping you for another look when it's ready.

stephenplusplus commented:

@c0b - please take a look at #1648.

table.getRows's apiResponse doesn't have schema, so I have to call a table.getMetadata first, can this be called by table.getRows automatically? or have a parameter to let it return schema in apiResponse?

We actually do that automatically within getRows(), and the schema goes onto the table's instance property, table.metadata.schema. To test from your callback to getRows, you can try:

table.getRows(function(err, rows, apiResponse) {
  console.log(JSON.stringify(test(table.metadata.schema, apiResponse.rows), null, 2))
});
