Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BigQuery Struct destination nulls out fields in repeated structs #4487

Closed
zestyping opened this issue Jul 2, 2021 · 6 comments · Fixed by #5261
Closed

BigQuery Struct destination nulls out fields in repeated structs #4487

zestyping opened this issue Jul 2, 2021 · 6 comments · Fixed by #5261

Comments

@zestyping
Copy link
Contributor

zestyping commented Jul 2, 2021

Environment

  • Airbyte version: 0.27.0-alpha
  • OS Version / Instance: AWS EC2
  • Deployment: Docker
  • Source Connector and version: Facebook Marketing 0.2.13
  • Destination Connector and version: BigQuery Struct 0.1.0
  • Severity: High (data loss)
  • Step where error happened: Sync job

Current Behavior

In the Facebook Marketing data, there's a record that contains a field named targeting, whose value is a structure containing a field named exclusions, whose value is a structure containing a field named interests, whose value is an array of {"id": ..., "name": ...} structures.

I can confirm the data is present because it shows up when I use the original, "flattened" BigQuery destination. When I transfer the data using the flattened BigQuery destination, and then query for the record in BigQuery like this:

SELECT * FROM `facebook_raw.ads` WHERE id = '23847599448580278' ORDER BY _airbyte_emitted_at DESC LIMIT 1;

...BigQuery gives me this JSON result:

[
  {
    "id": "23847599448580278",
    "name": "Image4",
    "status": "PAUSED",
    "adlabels": [],
    "adset_id": "23847599448570278",
    "bid_info": null,
    "bid_type": "ABSOLUTE_OCPM",
    "creative": "{\"id\":\"23847605777030278\"}",
    "targeting": "{\"age_max\":35,\"age_min\":18,\"custom_audiences\":[{\"id\":\"23847589512650278\",\"name\":\"Lookalike (10%) - IG Profile Engagers\"},{\"id\":\"23847589512660278\",\"name\":\"Lookalike (10%) - FB Page Visitors\"}],\"exclusions\":{\"interests\":[{\"id\":\"6003012399881\",\"name\":\"Mark Levin\"},{\"id\":\"6003060009815\",\"name\":\"NRA's American Rifleman (Official)\"},{\"id\":\"6003068592093\",\"name\":\"Fox News Radio\"},{\"id\":\"6003089254597\",\"name\":\"The Mark Levin Show\"},{\"id\":\"6003108316584\",\"name\":\"Fox News Channel\"},{\"id\":\"6003265249086\",\"name\":\"Bill O'Reilly (political commentator)\"},{\"id\":\"6003270662648\",\"name\":\"Eric Trump\"},{\"id\":\"6003284135676\",\"name\":\"Rush Limbaugh\"},{\"id\":\"6003327648848\",\"name\":\"The Sean Hannity Show\"},{\"id\":\"6003331802697\",\"name\":\"Fox News Opinion\"},{\"id\":\"6003332337177\",\"name\":\"Fox News Sunday\"},{\"id\":\"6003354144727\",\"name\":\"Judicial Watch\"},{\"id\":\"6003393663563\",\"name\":\"Ivanka Trump\"},{\"id\":\"6003421411431\",\"name\":\"Tea party\"}]},\"geo_locations\":{\"countries\":[\"US\"],\"location_types\":[\"home\",\"recent\"]},\"locales\":[6],\"targeting_optimization\":\"none\",\"publisher_platforms\":[\"facebook\",\"instagram\"],\"facebook_positions\":[\"feed\",\"story\"],\"instagram_positions\":[\"stream\",\"story\"],\"device_platforms\":[\"mobile\",\"desktop\"]}",
    "account_id": "1211960785929066",
    "bid_amount": null,
    "campaign_id": "23847599448600278",
    "created_time": "2021-05-12T11:15:24-0400",
    "source_ad_id": "23847599387070278",
    "updated_time": "2021-06-16T17:49:15-0400",
    "tracking_specs": [
      "{\"action.type\":[\"post_engagement\"],\"page\":[\"107364421446709\"],\"post\":[\"131862565663561\"]}",
      "{\"action.type\":[\"offsite_conversion\"],\"fb_pixel\":[\"380363403227494\"]}"
    ],
    "recommendations": [
      "{\"title\":\"Ad isn't optimised for conversions\",\"message\":\"Your tracking pixel is active, but your ad isn't optimised for off-site conversions. You may get better results if you choose off-site conversions as your optimisation goal.\",\"code\":1942006,\"importance\":\"HIGH\",\"confidence\":\"HIGH\",\"blame_field\":\"tracking_specs\"}"
    ],
    "conversion_specs": [
      "{\"action.type\":[\"link_click\"],\"post\":[\"131862565663561\"],\"post.wall\":[\"107364421446709\"]}"
    ],
    "effective_status": "PAUSED",
    "last_updated_by_app_id": "119211728144504",
    "_airbyte_emitted_at": "2021-06-24 21:54:52 UTC",
    "_airbyte_ads_hashid": "e10832663f3bda00d7486722a78b3e01"
  }
]

Above you can see the contents of the targeting field represented as a JSON string, as you would expect from the normalized BigQuery destination.

I've repeated the transfer using the new BigQuery Struct destination, putting the structured data in a dataset named facebook_struct_raw. When I look at the results, the targeting.exclusion.interests array still contains many elements with the {"id": ..., "name": ...} shape, but they are full of nulls. If I do a similar query on the structured dataset:

SELECT * FROM `facebook_struct_raw.ads` WHERE id = '23847599448580278' ORDER BY _airbyte_emitted_at DESC LIMIT 1;

...BigQuery gives me this:

[
  {
    "id": "23847599448580278",
    "name": "Image4",
    "status": "PAUSED",
    "adlabels": null,
    "adset_id": "23847599448570278",
    "bid_info": null,
    "bid_type": "ABSOLUTE_OCPM",
    "creative": {
      "id": "23847605777030278",
      "creative_id": null
    },
    "targeting": {
      "moms": null,
      "income": null,
      "age_max": "35",
      "age_min": "18",
      "genders": null,
      "locales": {
        "value": [
          "6"
        ]
      },
      "user_os": null,
      "politics": null,
      "behaviors": null,
      "home_type": null,
      "interests": null,
      "net_worth": null,
      "exclusions": {
        "moms": null,
        "income": null,
        "politics": null,
        "behaviors": null,
        "home_type": null,
        "interests": {
          "value": [
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            },
            {
              "id": null,
              "name": null
            }
          ]
        },
        "net_worth": null,
        "generation": null,
        "industries": null,
        "connections": null,
        "life_events": null,
        "home_ownership": null,
        "work_employers": null,
        "work_positions": null,
        "family_statuses": null,
        "user_adclusters": null,
        "custom_audiences": null,
        "education_majors": null,
        "excluded_connections": null,
        "household_composition": null,
        "relationship_statuses": null,
        "friends_of_connections": null,
        "excluded_custom_audiences": null
      },
      "generation": null,
      "industries": null,
      "connections": null,
      "office_type": null,
      "user_device": null,
      "flexible_spec": null,
      "geo_locations": {
        "zips": null,
        "cities": null,
        "regions": null,
        "countries": {
          "value": [
            "US"
          ]
        },
        "geo_markets": null,
        "country_groups": null,
        "location_types": {
          "value": [
            "home",
            "recent"
          ]
        },
        "custom_locations": null
      },
      "interested_in": null,
      "home_ownership": null,
      "work_positions": null,
      "family_statuses": null,
      "user_adclusters": null,
      "custom_audiences": {
        "value": [
          {
            "id": null,
            "name": null
          },
          {
            "id": null,
            "name": null
          }
        ]
      },
      "device_platforms": {
        "value": [
          "mobile",
          "desktop"
        ]
      },
      "app_install_state": null,
      "education_statuses": null,
      "facebook_positions": {
        "value": [
          "feed",
          "story"
        ]
      },
      "instagram_positions": {
        "value": [
          "stream",
          "story"
        ]
      },
      "messenger_positions": null,
      "publisher_platforms": {
        "value": [
          "facebook",
          "instagram"
        ]
      },
      "excluded_connections": null,
      "excluded_user_device": null,
      "relationship_statuses": null,
      "excluded_geo_locations": null,
      "friends_of_connections": null,
      "targeting_optimization": "none",
      "excluded_custom_audiences": null,
      "audience_network_positions": null,
      "excluded_publisher_categories": null
    },
    "account_id": "1211960785929066",
    "bid_amount": null,
    "campaign_id": "23847599448600278",
    "created_time": "2021-05-12T11:15:24-0400",
    "source_ad_id": "23847599387070278",
    "updated_time": "2021-06-16T17:49:15-0400",
    "tracking_specs": {
      "value": [
        {
          "page": null,
          "post": null,
          "event": null,
          "offer": null,
          "object": null,
          "dataset": null,
          "leadgen": null,
          "subtype": null,
          "creative": null,
          "fb_pixel": null,
          "question": null,
          "response": null,
          "post_wall": null,
          "event_type": null,
          "action_type": null,
          "application": null,
          "page_parent": null,
          "post_object": null,
          "conversion_id": null,
          "event_creator": null,
          "object_domain": null,
          "offer_creator": null,
          "offsite_pixel": null,
          "fb_pixel_event": null,
          "post_object_wall": null,
          "question_creator": null
        },
        {
          "page": null,
          "post": null,
          "event": null,
          "offer": null,
          "object": null,
          "dataset": null,
          "leadgen": null,
          "subtype": null,
          "creative": null,
          "fb_pixel": null,
          "question": null,
          "response": null,
          "post_wall": null,
          "event_type": null,
          "action_type": null,
          "application": null,
          "page_parent": null,
          "post_object": null,
          "conversion_id": null,
          "event_creator": null,
          "object_domain": null,
          "offer_creator": null,
          "offsite_pixel": null,
          "fb_pixel_event": null,
          "post_object_wall": null,
          "question_creator": null
        }
      ]
    },
    "recommendations": {
      "value": [
        {
          "code": null,
          "title": null,
          "message": null,
          "confidence": null,
          "importance": null,
          "blame_field": null
        }
      ]
    },
    "conversion_specs": {
      "value": [
        {
          "page": null,
          "post": null,
          "event": null,
          "offer": null,
          "object": null,
          "dataset": null,
          "leadgen": null,
          "subtype": null,
          "creative": null,
          "fb_pixel": null,
          "question": null,
          "response": null,
          "post_wall": null,
          "event_type": null,
          "action_type": null,
          "application": null,
          "page_parent": null,
          "post_object": null,
          "conversion_id": null,
          "event_creator": null,
          "object_domain": null,
          "offer_creator": null,
          "offsite_pixel": null,
          "fb_pixel_event": null,
          "post_object_wall": null,
          "question_creator": null
        }
      ]
    },
    "effective_status": "PAUSED",
    "last_updated_by_app_id": "119211728144504",
    "_airbyte_ab_id": "14fc0b01-9e47-4d96-b147-e4b9e15e8ea7",
    "_airbyte_emitted_at": "2021-07-01 18:57:20 UTC"
  }
]

So it looks like the structure is accurately preserved, but only the top-level fields contain values, and many (but not all!) of the rest contain nulls.

Expected Behavior

The data values should be preserved in the destination along with the original structure. In other words, the output from the above two queries should represent the same information.

If it helps to specify the correctness criterion in a precise way, I'd say:

  • Given a source connected to both:
    • A flattened BigQuery destination that writes to table FT
    • A structured BigQuery destination that writes to table ST
  • For every flat record FR in FT, there exists a corresponding structured record SR in ST such that:
    • For every column C in FT, FR[C] is either equal to SR[C] or equal to the JSON serialization of SR[C]

Logs

logs-486-0.txt

Steps to Reproduce

  1. Set up a Facebook Marketing source.
  2. Create a connection from the source to a BigQuery flattened destination, and run a sync.
  3. Create a connection from the same source to a BigQuery structured destination, and run a sync.
  4. Compare the results by querying an individual record from each table in the BigQuery console (see example above) and selecting the JSON version of the output for easier comparison.

Are you willing to submit a PR?

I'm willing to help look into it, but I don't know where to start yet.

@zestyping zestyping added the type/bug Something isn't working label Jul 2, 2021
@ChristopheDuong ChristopheDuong added the area/connectors Connector related issues label Jul 6, 2021
@sherifnada sherifnada added this to the Core - 2021-07-14 milestone Jul 7, 2021
@sherifnada
Copy link
Contributor

@ChristopheDuong to add any necessary descriptions/acceptance criteria then triage

@ChristopheDuong
Copy link
Contributor

the correctness criterion as described by @zestyping:

  • Given a source connected to both:
    • destination-bigquery that writes to table FT
    • destination-bigquery-denormalized that writes to table ST
  • For every flat record FR in FT, there exists a corresponding structured record SR in ST such that:
    • For every column C in FT, FR[C] is either equal to SR[C] or equal to the JSON serialization of SR[C]

From @zestyping's observations, it seems the conversion of the catalog is properly done and tables are well created by:

However, populating data into the table may not be done properly, so it should be verified if the formatting of the record data is properly done in:

@DoNotPanicUA
Copy link
Contributor

Update.
The current implementation of the BigQuery De-normalized destination doesn't cover complex objects properly. The main problem is that table creation can't properly identify Record structure.
You can find warnings in the log like that:
2021-07-02 00:57:21 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-07-02 00:57:21 �[33mWARN�[m i.a.i.d.b.BigQueryDenormalizedDestination(getTypes):157 - {} - Field home_type has no type defined, defaulting to STRING
After wrongly created table, processing ignores all incoming values out of the created structure. Errors like that:
2021-07-02 00:57:31 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-07-02 00:57:31 �[33mWARN�[m i.a.i.d.b.BigQueryDenormalizedRecordConsumer(lambda$formatData$0):92 - {} - Ignoring field action_type as it is not defined in catalog
As summary, the destination requires significant rework. I will keep you posted about this process.

@DoNotPanicUA
Copy link
Contributor

Update.
There is a general issue with a destination that stores nested types. There are only a few sources that provide enough metadata for storing that data right. This problem will be fixed out of this issue scope. (Discussed with @sherifnada)

In the case of processing data from Facebook to BigQuery, we have a gap when we process Array of Objects.

@sherifnada
Copy link
Contributor

@zestyping give the new version a shot and let us know how it goes!

@zestyping
Copy link
Contributor Author

@sherifnada Awesome news. Thanks, will do!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants