
Conversation

@deven-amd

No description provided.

Revan Sopher and others added 19 commits July 7, 2018 21:34
PiperOrigin-RevId: 203557079
When storing images in Cloud Bigtable, the resulting gRPC messages are often larger than the default receive message max size value. This change makes the maximum receive message sizes configurable, and sets a more reasonable default for general TensorFlow use.

PiperOrigin-RevId: 203569796
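The commit above makes the gRPC receive-message cap configurable. As a hypothetical sketch (the endpoint and size below are made-up illustration values, though the `grpc.max_receive_message_length` / `grpc.max_send_message_length` channel-argument keys are real gRPC channel arguments), the relevant knob looks like this on the Python client side:

```python
# Hypothetical sketch: raising gRPC's per-message cap above its 4 MiB default,
# e.g. for large image cells read from Cloud Bigtable. Endpoint and size are
# illustrative, not values from this PR.
MAX_MESSAGE_BYTES = 256 * 1024 * 1024  # 256 MiB

channel_options = [
    ("grpc.max_receive_message_length", MAX_MESSAGE_BYTES),
    ("grpc.max_send_message_length", MAX_MESSAGE_BYTES),
]

# With grpcio installed, these options would be passed when opening a channel:
#   channel = grpc.insecure_channel("bigtable.example:443", options=channel_options)
```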
Restructured sharding passes to propagate sharding through pass-through instructions, which the placer no longer assigns (GTEs, tuples, bitcasts, parameters, ...).

PiperOrigin-RevId: 203591020
configure.py respects CUDA_TOOLKIT_PATH instead of CUDA_INSTALL_PATH

PiperOrigin-RevId: 203591214
…ite Object Detection app.

PiperOrigin-RevId: 203689941
Instead of having one stream for compute, host-to-device and device-to-host transfers, switch to having separate streams, just like the GPU does.
Add a se::Event field to XlaTensor to allow accurate inter-stream dependencies to be created.

As part of this:
 - Fix TransferManager::TransferLiteralFrom/ToDevice to correctly make generated substreams wait on their master stream.
 - Fix Stream::BlockHostUntilDone() to not block on or return substreams. This behavior is completely broken and not only nondeterministically returns substreams to the pool but causes indefinite hangs with the HostStream.

PiperOrigin-RevId: 203726543
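The stream/event pattern the commit above describes can be sketched abstractly: a transfer stream records an event after enqueueing a copy, and the compute stream waits on that event before touching the tensor. The `Stream`/`Event` classes below are made-up stand-ins for illustration, not the real StreamExecutor API:

```python
# Hypothetical sketch of inter-stream dependencies via events.
# These classes are illustrative stand-ins, not se::Stream / se::Event.
class Event:
    def __init__(self):
        self.recorded = False
        self.stream = None

class Stream:
    def __init__(self, name):
        self.name = name
        self.waits_on = []

    def record_event(self, event):
        # Mark that all work enqueued on this stream so far must finish
        # before the event is considered "done".
        event.recorded = True
        event.stream = self

    def wait_for(self, event):
        # Work enqueued on this stream after this call will not start
        # until `event` has completed on its recording stream.
        self.waits_on.append(event)

# Separate streams for transfers and compute, as in the commit above.
host_to_device = Stream("h2d")
compute = Stream("compute")

transfer_done = Event()
host_to_device.record_event(transfer_done)  # after enqueueing the copy
compute.wait_for(transfer_done)             # before the kernel uses the tensor
```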
When doing multi-output fusion with sibling fusion, it can happen that we
don't need to clone the 'instruction_to_fuse' argument. Right now, we clone,
then delete the clone again, and at the end of the function try to print
the debug string of the clone (which then crashes).
Instead, we can simply not generate the clone if it is not needed, and catch
this case before printing the debug string.

PiperOrigin-RevId: 203733796
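The lazy-clone fix described above can be sketched as follows; `HloInstr`, `fuse_sibling`, and their methods are hypothetical stand-ins for illustration, not the real XLA classes:

```python
# Hypothetical sketch: create the clone only when fusion needs it, so we
# never delete it and later dereference it when printing debug output.
class HloInstr:
    def __init__(self, name):
        self.name = name

    def clone(self):
        return HloInstr(self.name + ".clone")

    def debug_string(self):
        return f"HloInstr({self.name})"

def fuse_sibling(instruction_to_fuse, needs_clone):
    # Lazily create the clone instead of clone-then-delete.
    clone = instruction_to_fuse.clone() if needs_clone else None
    fused = clone if clone is not None else instruction_to_fuse
    # Guard the debug print: in the no-clone case, fall back to the original.
    return fused, fused.debug_string()
```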
If a domain becomes empty because the various optimizations removed all
instructions from it, then we have to re-add some instructions to make sure
the user-supplied sharding is still respected.

This is especially important for the root instruction, as the user will
expect the data to be available on the device on which they requested it.
Before this CL we failed to insert the tuple->gte sequence into the empty
domain due to a bug where we only considered cases where we have an exit
domain, which is not the case for the root instruction.

PiperOrigin-RevId: 203744534
Benchmark should emit info even if extras is None.

PiperOrigin-RevId: 203762356
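The benchmark fix above amounts to tolerating `extras=None` rather than crashing. A minimal sketch, assuming a made-up `report_benchmark` helper (not the real TensorFlow benchmark API):

```python
def report_benchmark(name, wall_time, extras=None):
    """Hypothetical sketch: emit a benchmark entry even when `extras` is None."""
    entry = {"name": name, "wall_time": wall_time}
    entry.update(extras or {})  # treat extras=None as "no extras", don't crash
    return entry
```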
PiperOrigin-RevId: 203769116
@whchung
Collaborator

whchung commented Jul 10, 2018

@deven-amd please help review those failures and update whitelist if necessary.

@deven-amd
Author

@whchung

only two tests are failing.

  1. //tensorflow/python/estimator:dnn_linear_combined_test
  2. //tensorflow/python/ops/parallel_for:control_ops_test

I cannot reproduce the failure for #1 locally (3 consecutive passes for me).

Failure #2 is reproducible (crash with stack dump while running), but I currently do not know the cause of the failure/crash.

@whchung
Collaborator

whchung commented Jul 10, 2018

a) let's put them into the whitelist, and update the PR
b) please also help update the spreadsheet.
c) are they newly introduced tests? or regressions? please help put such information in the spreadsheet.

@deven-amd
Author

let me put the two tests on the whitelist and update the PR
will update the spreadsheet as well

The dnn_linear_combined_test is a regression (though I cannot reproduce the failure locally) ... it started failing recently (either due to disabling sharding or the "FloorDiv on GPU" fix)

The parallel_for/control_ops_test is a new test!

//tensorflow/python/estimator:dnn_linear_combined_test
  is a regression. However I cannot reproduce the failure locally

//tensorflow/python/ops/parallel_for:control_flow_ops_test
  is a new test. There is also another "control_flow_ops" test on the whitelist...perhaps a common cause of failure.
@deven-amd deven-amd merged commit 9fc24be into develop-upstream Jul 10, 2018
@deven-amd deven-amd deleted the develop-upstream-sync-180709 branch July 23, 2018 12:42
deven-amd pushed a commit that referenced this pull request Aug 2, 2019
Closes #62

COPYBARA_INTEGRATE_REVIEW=tensorflow/mlir#62 from schweitzpgi:register-fir d122eae9c2cdf21581f48412551a93b8b4e640a6
PiperOrigin-RevId: 261187850
jerryyin pushed a commit that referenced this pull request Oct 3, 2019
ekuznetsov139 pushed a commit that referenced this pull request May 31, 2022
PiperOrigin-RevId: 449102807