
[Bug]: ReadFromSpanner fails with Dataflow when using apache_beam.io.gcp.experimental.spannerio #35850

@Andrei-Strenkovskii

Description


What happened?

I am building an Apache Beam application in Python that runs on Google Cloud Dataflow, using the ReadFromSpanner transform from apache_beam.io.gcp.experimental.spannerio.
When the ReadFromSpanner step executes, I get this error:

Python sdk harness failed: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/apache_beam/metrics/monitoring_infos.py", line 366, in create_monitoring_info
    return metrics_pb2.MonitoringInfo(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen _collections_abc>", line 949, in update
TypeError: bad argument type for built-in operation

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 212, in main
    sdk_harness.run()
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker.py", line 283, in run
    getattr(self, SdkHarness.REQUEST_METHOD_PREFIX + request_type)(
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker.py", line 361, in _request_harness_monitoring_infos
    ).to_runner_api_monitoring_infos(None).values()
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "apache_beam/metrics/execution.py", line 334, in apache_beam.metrics.execution.MetricsContainer.to_runner_api_monitoring_infos
  File "apache_beam/metrics/cells.py", line 76, in apache_beam.metrics.cells.MetricCell.to_runner_api_monitoring_info
  File "apache_beam/metrics/cells.py", line 158, in apache_beam.metrics.cells.CounterCell.to_runner_api_monitoring_info_impl
  File "/usr/local/lib/python3.11/site-packages/apache_beam/metrics/monitoring_infos.py", line 233, in int64_counter
    return create_monitoring_info(urn, SUM_INT64_TYPE, metric, labels)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/apache_beam/metrics/monitoring_infos.py", line 369, in create_monitoring_info
    raise RuntimeError(
RuntimeError: Failed to create MonitoringInfo for urn beam:metric:io:api_request_count:v1 type <class 'type'> labels {labels} and payload {payload}

    return metrics_pb2.MonitoringInfo(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen _collections_abc>", line 949, in update
TypeError: bad argument type for built-in operation

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 367, in <module>
    main(sys.argv)
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 212, in main
    sdk_harness.run()
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker.py", line 283, in run
    getattr(self, SdkHarness.REQUEST_METHOD_PREFIX + request_type)(
  File "/usr/local/lib/python3.11/site-packages/apache_beam/runners/worker/sdk_worker.py", line 361, in _request_harness_monitoring_infos
    ).to_runner_api_monitoring_infos(None).values()
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "apache_beam/metrics/execution.py", line 334, in apache_beam.metrics.execution.MetricsContainer.to_runner_api_monitoring_infos
  File "apache_beam/metrics/cells.py", line 76, in apache_beam.metrics.cells.MetricCell.to_runner_api_monitoring_info
  File "apache_beam/metrics/cells.py", line 158, in apache_beam.metrics.cells.CounterCell.to_runner_api_monitoring_info_impl
  File "/usr/local/lib/python3.11/site-packages/apache_beam/metrics/monitoring_infos.py", line 233, in int64_counter
    return create_monitoring_info(urn, SUM_INT64_TYPE, metric, labels)

I get this error ONLY with tables that have more than a few hundred rows; on small tables the error does not occur.
I tried increasing the number of workers and changing the machine type, but neither helped.
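My guess at the failure mode (an assumption, not confirmed from the Beam source): the connector attaches per-request metric labels when reporting beam:metric:io:api_request_count, and one of the label values is not a string, so the protobuf map&lt;string, string&gt; field behind MonitoringInfo.labels rejects it inside MutableMapping.update() — the frozen _collections_abc frame in the traceback. A stdlib-only mimic of that container (StrStrMap is hypothetical, written only to illustrate the error shape):

```python
from collections.abc import MutableMapping


class StrStrMap(MutableMapping):
    """Hypothetical stand-in for a protobuf map<string, string> container:
    it accepts only str keys and str values, like MonitoringInfo.labels."""

    def __init__(self):
        self._data = {}

    def __setitem__(self, key, value):
        # A protobuf scalar map raises TypeError for non-string entries;
        # MutableMapping.update() propagates that TypeError unchanged.
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError("bad argument type for built-in operation")
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


labels = StrStrMap()
labels.update({"TABLE_ID": "users"})   # str -> str: accepted
try:
    labels.update({"TABLE_ID": None})  # non-string value: rejected
except TypeError as exc:
    print(exc)                         # bad argument type for built-in operation
```

If that guess is right, it would explain why only some tables trigger it: the bad label would only be emitted on code paths taken for larger reads (e.g. batched/partitioned requests).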

Example of my pipeline:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.io.gcp.experimental.spannerio import ReadFromSpanner, ReadOperation

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=project',
        '--region=us-east4',
        '--temp_location=gs-path',
        '--staging_location=gs-path',
        '--experiments=shuffle_mode=service',
        '--job_name=beam-test',
    ])

    read_operations = [
        ReadOperation.table(table=table, columns=[column]),
    ]

    with beam.Pipeline(options=options) as pipeline:
        all_users = pipeline | ReadFromSpanner(
            'project', 'instance', 'database',
            read_operations=read_operations,
        )
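As a possible workaround (untested on my side), Beam also ships a second Spanner source, apache_beam.io.gcp.spanner.ReadFromSpanner, which is a cross-language transform backed by the Java SpannerIO rather than the experimental pure-Python module. A sketch under those assumptions — UserRow, the users table, and the user_id column are placeholders for the real schema, and options is the PipelineOptions object from the example above:

```python
from typing import NamedTuple

import apache_beam as beam
from apache_beam import coders
# Cross-language Spanner connector (Java SpannerIO under the hood),
# distinct from apache_beam.io.gcp.experimental.spannerio.
from apache_beam.io.gcp.spanner import ReadFromSpanner


class UserRow(NamedTuple):  # placeholder schema for the rows being read
    user_id: str


# The cross-language transform returns schema'd rows, so the row type
# needs a registered RowCoder.
coders.registry.register_coder(UserRow, coders.RowCoder)

with beam.Pipeline(options=options) as pipeline:  # options as defined above
    all_users = pipeline | ReadFromSpanner(
        project_id='project',
        instance_id='instance',
        database_id='database',
        row_type=UserRow,
        sql='SELECT user_id FROM users',
    )
```

This sidesteps the Python connector's metrics path entirely, at the cost of requiring a Java expansion service at job submission time.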

Versions:
apache_beam[gcp]==2.66.0
Python 3.11.8

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
