Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

google_bigquery_job: Add support for importing GeoJSON #12423

Labels
Milestone

Comments

@linus
Copy link

linus commented Aug 30, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment. If the issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If the issue is assigned to a user, that user is claiming responsibility for the issue. If the issue is assigned to "hashibot", a community member has claimed the issue already.

Description

When loading GeoJSON with the bq command line tool, it is possible to autodetect a schema and create a table from newline delimited GeoJSON as described here, like so:

bq load \
 --source_format=NEWLINE_DELIMITED_JSON \
 --json_extension=GEOJSON \
 --autodetect \
 DATASET.TABLE \
 FILE_PATH_OR_URI

I am currently trying to accomplish the same with Terraform, with something like this:

resource "google_bigquery_job" "load_regso_data" {
  project = google_project.project.project_id
  job_id = "JOB_ID"

  load {
    source_uris = [
      "gs://BUCKET/FILENAME.json"
    ]
    source_format = "NEWLINE_DELIMITED_JSON"
    autodetect = true

    destination_table {
      project_id = google_bigquery_table.PROJECT.project
      dataset_id = google_bigquery_table.PROJECT.dataset_id
      table_id = google_bigquery_table.PROJECT.table_id
    }

    write_disposition = "WRITE_TRUNCATE"
  }
}

Resulting in an error:

Error while reading data, error message: JSON parsing error in row starting at position 0: Nested arrays not allowed.

Trying to specify a schema for the table, and remove autodetect, the error is instead:

Error while reading data, error message: JSON parsing error in row starting at position 0: No such field: type.

Either way, importing the GeoJSON with a google_bigquery_job seems either impossible or at least not as simple as bq does it. If it would be possible to support something like bq:s json_extension=GEOJSON it would be helpful.

New or Affected Resource(s)

  • google_bigquery_job

Potential Terraform Configuration

resource "google_bigquery_job" "load_regso_data" {
  project = google_project.project.project_id
  job_id = "JOB_ID"

  load {
    source_uris = [
      "gs://BUCKET/FILENAME.json"
    ]
    source_format = "NEWLINE_DELIMITED_JSON"
    json_extension = "GEOJSON"
    autodetect = true

    destination_table {
      project_id = google_bigquery_table.PROJECT.project
      dataset_id = google_bigquery_table.PROJECT.dataset_id
      table_id = google_bigquery_table.PROJECT.table_id
    }

    write_disposition = "WRITE_TRUNCATE"
  }
}

References

@linus
Copy link
Author

linus commented Aug 30, 2022

After running a test with both bq and terraform, and fetching the jobs with bq show --job=true, these are the job configurations.

For bq:

{
  "configuration": {
    "jobType": "LOAD",
    "load": {
      "autodetect": true,
      "destinationTable": {
        "datasetId": "DATASET",
        "projectId": "PROJECT",
        "tableId": "TABLE"
      },
      "jsonExtension": "GEOJSON",
      "maxBadRecords": 0,
      "sourceFormat": "NEWLINE_DELIMITED_JSON",
      "sourceUris": [
        "gs://BUCKET/FILE.json"
      ]
    }
  }
}

For terraform with autodetect:

{
  "configuration": {
    "jobType": "LOAD",
    "load": {
      "autodetect": true,
      "createDisposition": "CREATE_IF_NEEDED",
      "destinationTable": {
        "datasetId": "DATASET",
        "projectId": "PROJECT",
        "tableId": "TABLE"
      },
      "encoding": "UTF-8",
      "sourceFormat": "NEWLINE_DELIMITED_JSON",
      "sourceUris": [
        "gs://BUCKET/FILE.json"
      ],
      "writeDisposition": "WRITE_TRUNCATE"
    }
  }
}

It seems as if a json_extension attribute which gets forwarded to the job configuration would be enough to mimick the bq load feature?

@linus
Copy link
Author

linus commented Aug 30, 2022

Looking at the API reference documentation for bigquery JobConfigurationLoad, it seems like the jsonExtension attribute is not documented. I'm not sure what the take is on this but since the bq tool uses it, it seems safe enough.

Edit: Looking at the REST API schema, that contains jsonExtension, which makes me believe that the documentation omission may be a bug:

  "jsonExtension": {
    "description": "[Optional] If sourceFormat is set to newline-delimited JSON, indicates whether it should be processed as a JSON variant such as GeoJSON. For a sourceFormat other than JSON, omit this field. If the sourceFormat is newline-delimited JSON: - for newline-delimited GeoJSON: set to GEOJSON.",
    "type": "string"
  },

@linus
Copy link
Author

linus commented Aug 30, 2022

I'm barely superficially familiar with Go so take it with a pinch of salt, but I think something like this might do the trick:

index 3aaca78a2..bd358dbd3 100644
--- a/google/resource_bigquery_job.go
+++ b/google/resource_bigquery_job.go
@@ -451,6 +451,12 @@ CSV: Trailing columns
 JSON: Named values that don't match any column names`,
                                                        Default: false,
                                                },
+                                               "json_extension": {
+                                                       Type:     schema.TypeString,
+                                                       Optional: true,
+                                                       ForceNew: true,
+                                                       Description: `If sourceFormat is set to newline-delimited JSON, indicates whether it should be processed as a JSON variant such as GeoJSON. For a sourceFormat other than JSON, omit this field. If the sourceFormat is newline-delimited JSON: - for newline-delimited GeoJSON: set to GEOJSON.`,
+                                               },
                                                "max_bad_records": {
                                                        Type:     schema.TypeInt,
                                                        Optional: true,
@@ -1446,6 +1452,8 @@ func flattenBigQueryJobConfigurationLoad(v interface{}, d *schema.ResourceData,
                flattenBigQueryJobConfigurationLoadAllowQuotedNewlines(original["allowQuotedNewlines"], d, config)
        transformed["source_format"] =
                flattenBigQueryJobConfigurationLoadSourceFormat(original["sourceFormat"], d, config)
+       transformed["json_extension"] =
+               flattenBigQueryJobConfigurationLoadJsonExtension(original["jsonExtension"], d, config)
        transformed["allow_jagged_rows"] =
                flattenBigQueryJobConfigurationLoadAllowJaggedRows(original["allowJaggedRows"], d, config)
        transformed["ignore_unknown_values"] =
@@ -1552,6 +1560,10 @@ func flattenBigQueryJobConfigurationLoadSourceFormat(v interface{}, d *schema.Re
        return v
 }
 
+func flattenBigQueryJobConfigurationLoadJsonExtension(v interface{}, d *schema.ResourceData, config *Config) interface{} {
+       return v
+}
+
 func flattenBigQueryJobConfigurationLoadAllowJaggedRows(v interface{}, d *schema.ResourceData, config *Config) interface{} {
        return v
 }
@@ -2454,6 +2466,13 @@ func expandBigQueryJobConfigurationLoad(v interface{}, d TerraformResourceData,
                transformed["sourceFormat"] = transformedSourceFormat
        }
 
+       transformedJsonExtension, err := expandBigQueryJobConfigurationLoadJsonExtension(original["json_extension"], d, config)
+       if err != nil {
+               return nil, err
+       } else if val := reflect.ValueOf(transformedSourceFormat); val.IsValid() && !isEmptyValue(val) {
+               transformed["jsonExtension"] = transformedJsonExtension
+       }
+
        transformedAllowJaggedRows, err := expandBigQueryJobConfigurationLoadAllowJaggedRows(original["allow_jagged_rows"], d, config)
        if err != nil {
                return nil, err
@@ -2583,6 +2602,10 @@ func expandBigQueryJobConfigurationLoadSourceFormat(v interface{}, d TerraformRe
        return v, nil
 }
 
+func expandBigQueryJobConfigurationLoadJsonExtension(v interface{}, d TerraformResourceData, config *Config) (interface{}, error) {
+       return v, nil
+}
+
 func expandBigQueryJobConfigurationLoadAllowJaggedRows(v interface{}, d TerraformResourceData, config *Config) (interface{}, error) {
        return v, nil
 }

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 20, 2022
@github-actions github-actions bot added service/bigquery forward/review In review; remove label to forward labels Jan 14, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.