
fix(bigquerystorage): resume reader connection on EOS internal error #9994

Merged · 4 commits · Jan 15, 2020

Conversation

@tswast (Contributor) commented Dec 18, 2019

It's infeasible for the backend to change the status code of `EOS on DATA`
internal errors, so instead we check the error message to see whether it's
an error that is resumable. We don't want to try to resume on *all*
internal errors, so inspecting the message is the best we can do.

This fixes the same error as https://issuetracker.google.com/143292803, but for Python.
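
As a rough sketch of that approach (the helper name `_is_resumable` and the message list are illustrative, not necessarily the exact identifiers used in this PR): UNAVAILABLE is always safe to retry, while INTERNAL is retried only when the message matches a known-transient pattern.

```python
# Sketch only: decide whether a failed ReadRows stream may be reopened.
from google.api_core import exceptions

_RESUMABLE_INTERNAL_MESSAGES = (
    "Received unexpected EOS on DATA frame from server",
    "RST_STREAM",
)


def _is_resumable(exc):
    """Return True if the stream should be reopened after ``exc``."""
    if isinstance(exc, exceptions.ServiceUnavailable):
        return True
    if isinstance(exc, exceptions.InternalServerError):
        # Not all INTERNAL errors are transient, so inspect the message.
        message = exc.message or ""
        return any(marker in message for marker in _RESUMABLE_INTERNAL_MESSAGES)
    return False
```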

@tswast tswast requested a review from a team December 18, 2019 15:48
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Dec 18, 2019
@tswast tswast requested a review from shollyman December 18, 2019 15:48
@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Dec 18, 2019
@tswast (author) commented Jan 3, 2020

Closing this PR per "Custom retry logic in GAPIC-generated clients" email thread.

@tswast tswast closed this Jan 3, 2020
@tswast tswast deleted the b143292803-eos-bqstorage branch January 3, 2020 15:59
@tswast tswast restored the b143292803-eos-bqstorage branch January 8, 2020 19:21
@tswast (author) commented Jan 8, 2020

We were able to reproduce this error in Python with the long-running integration tests. It surfaces as:

```
E                   grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
E                   	status = StatusCode.INTERNAL
E                   	details = "Received RST_STREAM with error code 2"
E                   	debug_error_string = "{"created":"@1578389183.706753048","description":"Error received from peer ipv4:172.217.219.95:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Received RST_STREAM with error code 2","grpc_status":13}"
E                   >
```

The message appears to originate from this gRPC core transport code, matching the failure mode addressed by the related Java fix:

https://github.com/grpc/grpc/blob/9dd05c4abc1741223ded8566bc143c927d9cba79/src/core/ext/transport/chttp2/transport/frame_rst_stream.cc#L110
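
For reference, a minimal sketch of recognizing this specific failure from the raw grpc error object (the helper name is hypothetical; `code()` and `details()` are the standard `grpc.Call` accessors):

```python
import grpc


def _is_transient_rst_stream(exc):
    """Check for the INTERNAL 'Received RST_STREAM' failure shown above."""
    return (
        isinstance(exc, grpc.RpcError)
        and exc.code() == grpc.StatusCode.INTERNAL
        and "RST_STREAM" in (exc.details() or "")
    )
```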

@tswast tswast reopened this Jan 8, 2020
tswast added 2 commits January 8, 2020 14:45
It's infeasible for the backend to change the status of `EOS on DATA`
internal errors, so instead we check the error message to see if it's
an error that is resumable. We don't want to try to resume on *all*
internal errors, so inspecting the message is the best we can do.
@tswast tswast force-pushed the b143292803-eos-bqstorage branch from d893035 to 50ecc98 on January 8, 2020 20:53
@tswast tswast removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Jan 8, 2020
@tswast tswast requested a review from plamut January 13, 2020 16:21
@plamut (Contributor) left a comment


I have one minor refactoring suggestion; otherwise it generally LGTM. We still need to resolve a merge conflict, though.

BTW, what is the best way of reproducing this locally? Is it even feasible?

@tswast tswast requested a review from plamut January 15, 2020 21:59
@tswast (author) commented Jan 15, 2020

I haven't figured out a way to reproduce it locally, at least not without a lot of time and network bandwidth. We get the error pretty consistently with the following sample code in our internal nightly CI:

```python
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import concurrent.futures

from google.cloud import bigquery_storage_v1beta1
import pytest


@pytest.mark.parametrize(
    "format_",
    (
        bigquery_storage_v1beta1.enums.DataFormat.AVRO,
        bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    ),
)
def test_long_running_read(bqstorage_client, project_id, format_):
    table_ref = bigquery_storage_v1beta1.types.TableReference()
    table_ref.project_id = "bigquery-public-data"
    table_ref.dataset_id = "samples"
    table_ref.table_id = "wikipedia"

    session = bqstorage_client.create_read_session(
        table_ref,
        "projects/{}".format(project_id),
        requested_streams=5,
        format_=format_
    )
    assert len(session.streams) == 5

    def count_rows(stream):
        read_position = bigquery_storage_v1beta1.types.StreamPosition(
            stream=stream
        )
        reader = bqstorage_client.read_rows(read_position)
        row_count = 0
        for page in reader.rows(session).pages:
            row_count += page.num_items
        return row_count

    with concurrent.futures.ThreadPoolExecutor() as pool:
        row_count = sum(pool.map(count_rows, session.streams))

    assert row_count == 313797035
```
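
With this fix, `reader.rows(session).pages` is expected to reconnect transparently when the stream drops. The resume loop has roughly this shape (a sketch, not the reader's exact internals; it assumes the hypothetical `_is_resumable` predicate sketched in the PR description above):

```python
from google.cloud import bigquery_storage_v1beta1


def _read_rows_with_resume(gapic_client, read_position):
    """Yield ReadRowsResponse messages, reopening the stream at the
    last received offset whenever a resumable error occurs."""
    offset = read_position.offset
    while True:
        position = bigquery_storage_v1beta1.types.StreamPosition(
            stream=read_position.stream, offset=offset
        )
        try:
            for response in gapic_client.read_rows(position):
                # Track progress so a reconnect skips already-read rows.
                offset += response.row_count
                yield response
            return
        except Exception as exc:
            if not _is_resumable(exc):
                raise
```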

@plamut (Contributor) commented Jan 15, 2020

OK, good. I'll try to reproduce it locally tomorrow, but if that doesn't work, I'll approve and rely on the next nightly run (the code itself LGTM now).

Edit: Since the test takes 90 minutes on a GCE instance, and the error is reproduced only about half of the time, it would be inefficient to test the fix in action locally (it would require multiple runs). Since the fix is not going to production immediately, we can still revert it should the nightly runs reveal that the issue persists.

@plamut (Contributor) left a comment


Approving under the assumption that the fix is not being shipped to production immediately, and that we can revert it in case the error occurs again in one of the nightly runs.

This was referenced Jan 29, 2020