Bigtable python raises InvalidChunk: possible loss of microsecond timestamp precision? #2397

Closed
destijl opened this issue Sep 23, 2016 · 10 comments


destijl commented Sep 23, 2016

The smallest reproducible test case I could get it down to is here (a small diff off the hello-world sample):
destijl/python-docs-samples@cc074c1

I write a new value to the same row with an older timestamp, 200µs after the epoch, and after that I can't read the row any more. My current best guess is that we end up storing a timestamp of 0 because the microseconds are lost somewhere, and there is client code that treats not chunk.timestamp_micros as an error, as you can see in the backtrace.
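
If that guess is right, the failure can be sketched without touching the service at all. A minimal illustration of the suspected truncation (my sketch, not the library's code):

# Suspected failure mode: a timestamp 200us after the epoch, truncated to
# millisecond granularity, becomes 0, and a falsy timestamp_micros is then
# indistinguishable from a missing one.
written_micros = 200
stored_micros = (written_micros // 1000) * 1000   # -> 0 after truncation

if not stored_micros:  # same shape as _raise_if(not chunk.timestamp_micros ...)
    print 'looks like a missing timestamp -> InvalidChunk'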

The only reason I came across this is that I'm implementing a Bigtable data store, and our unit tests do weird things like this to exercise timestamp handling across the various databases.

backtrace:

$ python main.py grr-test-demo bigtabletesting
Creating the Hello-Bigtable table.
Writing some greetings to the table.
Getting a single greeting by row key.
Traceback (most recent call last):
  File "main.py", line 136, in <module>
    main(args.project_id, args.instance_id, args.table)
  File "main.py", line 98, in main
    row = table.read_row(key.encode('utf-8'), filter_=row_filter)
  File "/usr/local/google/home/gcastle/VE/release/lib/python2.7/site-packages/gcloud/bigtable/table.py", line 236, in read_row
    rows_data.consume_all()
  File "/usr/local/google/home/gcastle/VE/release/lib/python2.7/site-packages/gcloud/bigtable/row_data.py", line 324, in consume_all
    self.consume_next()
  File "/usr/local/google/home/gcastle/VE/release/lib/python2.7/site-packages/gcloud/bigtable/row_data.py", line 276, in consume_next
    self._validate_chunk(chunk)
  File "/usr/local/google/home/gcastle/VE/release/lib/python2.7/site-packages/gcloud/bigtable/row_data.py", line 391, in _validate_chunk
    self._validate_chunk_row_in_progress(chunk)
  File "/usr/local/google/home/gcastle/VE/release/lib/python2.7/site-packages/gcloud/bigtable/row_data.py", line 371, in _validate_chunk_row_in_progress
    _raise_if(not chunk.timestamp_micros or not chunk.value)
  File "/usr/local/google/home/gcastle/VE/release/lib/python2.7/site-packages/gcloud/bigtable/row_data.py", line 442, in _raise_if
    raise InvalidChunk(*args)
dhermes added the type: bug and api: bigtable labels Sep 23, 2016
dhermes self-assigned this Sep 23, 2016
dhermes (Contributor) commented Sep 23, 2016

I tried to make this a little more minimal, but can't get this to break:

import datetime

from gcloud import bigtable
from gcloud.bigtable import row_filters


instance_id = 'bigtabletesting'
table_id = 'Hello-Bigtable'

client = bigtable.Client(admin=True)
instance = client.instance(instance_id, 'us-central1-c')
table = instance.table(table_id)

column_family_id = 'cf1'
cf1 = table.column_family(column_family_id)
table.create(column_families=[cf1])

timestamp = datetime.datetime(1970, 1, 1, microsecond=200)
row = table.row(b'test')
row.set_cell(
    column_family_id,
    b'greeting', b'Hello World!',
    timestamp=timestamp)
row.commit()

column_id = b'greeting'  # the qualifier written by set_cell above
col_filter = row_filters.ColumnQualifierRegexFilter(column_id)
family_filter = row_filters.FamilyNameRegexFilter(column_family_id)
row_filter = row_filters.RowFilterUnion(
    filters=[col_filter, family_filter])
# BEGIN: Thing that breaks
partial_data = table.read_row(row._row_key, filter_=row_filter)
#   END: Thing that breaks

Now I will run the entire script you linked.

dhermes (Contributor) commented Sep 23, 2016

OK, I was able to reproduce. Digging in now to see why this occurs.

dhermes (Contributor) commented Sep 23, 2016

The culprit is a chunk with no timestamp. This is the offending response:

from google.cloud.bigtable._generated import bigtable_pb2

chunk0 = bigtable_pb2.ReadRowsResponse.CellChunk(
  row_key='greeting0',
  timestamp_micros=1474593939415000,
  value='Hello World!',
)
chunk0.family_name.value = 'cf1'
chunk0.qualifier.value = 'greeting'
chunk1 = bigtable_pb2.ReadRowsResponse.CellChunk(
  timestamp_micros=1474593939415000,
  value='Hello World!',
)
chunk2 = bigtable_pb2.ReadRowsResponse.CellChunk(
  value='Hello World!',
)
chunk3 = bigtable_pb2.ReadRowsResponse.CellChunk(
  value='Hello World!',
  commit_row=True,
)
response = bigtable_pb2.ReadRowsResponse(
    chunks=[chunk0, chunk1, chunk2, chunk3])
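
For reference, here's a rough way to replay that response against the chunk parser (a sketch assuming the PartialRowsData(response_iterator) constructor from this release; both it and InvalidChunk live in row_data, per the traceback in the report):

from google.cloud.bigtable.row_data import InvalidChunk, PartialRowsData

# The chunk carrying a value but no timestamp_micros should trip the same
# validation that raised in the original traceback.
rows_data = PartialRowsData(iter([response]))
try:
    rows_data.consume_all()
except InvalidChunk:
    print 'InvalidChunk raised, matching the report'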

destijl (Author) commented Oct 19, 2016

There's definitely a loss-of-precision problem here: the timestamp I set is not the timestamp I get back on read.

$ python ~/temp.py
Read: 1970-01-01T00:00:00+00:00
Set: 1970-01-01T00:00:00.000200+00:00

import datetime
import pytz

from gcloud import bigtable
instance_id = 'bigtabletesting'
table_id = 'Hello-Bigtable'

client = bigtable.Client(admin=True)
instance = client.instance(instance_id, 'us-central1-c')
table = instance.table(table_id)
client.start()

column_family_id = 'cf1'
cf1 = table.column_family(column_family_id)
table.create(column_families=[cf1])

timestamp = datetime.datetime(1970, 1, 1, microsecond=200, tzinfo=pytz.utc)
row = table.row(b'test')
row.set_cell(
    column_family_id,
    b'greeting', b'Hello World!',
    timestamp=timestamp)
row.commit()

partial_data = table.read_row(row._row_key)
assert(len(partial_data.cells["cf1"]["greeting"]) == 1)
# These should be equal, but they are not because microseconds are lost.
print "Read: %s" % partial_data.cells["cf1"]["greeting"][0].timestamp.isoformat()
print "Set: %s" % timestamp.isoformat()

destijl (Author) commented Oct 19, 2016

I wondered if this was just a problem around the epoch, but it's not: December 1 behaves the same as January 1.

$ python ~/temp.py
Read: 1970-12-01T00:00:00+00:00
Set: 1970-12-01T00:00:00.000200+00:00

Bigtable does support microsecond precision, right?

sduskis (Contributor) commented Oct 26, 2016

The Cloud Bigtable API does not support microseconds at this point.
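
Since the read-back values above suggest the service keeps millisecond granularity, one hedged client-side workaround (a sketch, not an official recommendation; floor_to_millis is a hypothetical helper) is to round timestamps down before writing, so the value read back compares equal to the value written:

import datetime
import pytz

def floor_to_millis(ts):
    # Drop sub-millisecond precision so the stored timestamp round-trips intact.
    return ts.replace(microsecond=(ts.microsecond // 1000) * 1000)

ts = datetime.datetime(1970, 1, 1, microsecond=200, tzinfo=pytz.utc)
print floor_to_millis(ts).isoformat()  # -> 1970-01-01T00:00:00+00:00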

destijl (Author) commented Oct 26, 2016

Yep, I eventually realized; see #2569. There's still a problem with timestamps here that's independent of that.

destijl (Author) commented Oct 26, 2016

Also filed a feature request for microsecond granularity here:
#2626

destijl (Author) commented Feb 13, 2017

Just FYI, I upgraded to google-cloud-bigtable==0.22.0 and verified the error is still there. I have a few integration tests that I have to skip at the moment because of this bug.

lukesneeringer added the priority: p2 label Apr 19, 2017
lukesneeringer (Contributor) commented

Hello,
One of the challenges of maintaining a large open source project is that sometimes, you can bite off more than you can chew. As the lead maintainer of google-cloud-python, I can definitely say that I have let the issues here pile up.

As part of trying to get things under control (as well as to empower us to provide better customer service in the future), I am declaring a "bankruptcy" of sorts on many of the old issues, especially those likely to have been addressed or made obsolete by more recent updates.

My goal is to close stale issues whose relevance or solution is no longer immediately evident, and which appear to be of lower importance. I believe in good faith that this is one of those issues, but I am scanning quickly and may occasionally be wrong. If this is an issue of high importance, please comment here and we will reconsider. If this is an issue whose solution is trivial, please consider providing a pull request.

Thank you!
