Callback #1
Conversation
PiperOrigin-RevId: 261242655
…ariable mismatch. PiperOrigin-RevId: 261245922
PiperOrigin-RevId: 261246913
PiperOrigin-RevId: 261246974
RELNOTES=In TensorFlow 2, layers now default to float32 and automatically cast their inputs to the layer's dtype. If you have a model that used float64, it will probably silently use float32 in TensorFlow 2, and a warning will be issued that starts with "Layer <layer-name> is casting an input tensor from dtype float64 to the layer's dtype of float32". To fix this, either set the default dtype to float64 with `tf.keras.backend.set_floatx('float64')`, or pass `dtype='float64'` to each of the Layer constructors. See `tf.keras.layers.Layer` for more information. PiperOrigin-RevId: 261249415
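As a quick illustration of the two fixes named above (the model shape and layer sizes here are invented for the example; either fix alone is sufficient, both are shown for completeness):

```python
import tensorflow as tf

# Option 1: change the global default dtype back to float64.
tf.keras.backend.set_floatx('float64')

# Option 2: pass dtype='float64' to each Layer constructor explicitly.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', dtype='float64'),
    tf.keras.layers.Dense(1, dtype='float64'),
])

# float64 inputs are now kept in float64 instead of being cast to float32.
x = tf.random.normal([8, 16], dtype=tf.float64)
print(model(x).dtype)  # <dtype: 'float64'>
```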
…response, get the status from the response and clean up all remaining calls in the queue. PiperOrigin-RevId: 261252523
For the fully_connected layer, we have seen FakeQuant* ops with 16 bits used in training models, so we should add this to the op type constraint in order to quantize these models. PiperOrigin-RevId: 261253563
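For reference, a minimal sketch of a 16-bit fake-quant op at the TensorFlow Python level (the tensor values and min/max bounds are made up for the demo; `tf.quantization.fake_quant_with_min_max_args` accepts num_bits from 2 to 16):

```python
import tensorflow as tf

# Example weights feeding a fully-connected layer (values are arbitrary).
weights = tf.constant([[-1.5, 0.25], [0.75, 1.5]])

# FakeQuant with num_bits=16 simulates 16-bit quantization during training;
# ops like this are what the op type constraint above needs to accept.
fq = tf.quantization.fake_quant_with_min_max_args(
    weights, min=-1.5, max=1.5, num_bits=16)
print(fq)
```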
PiperOrigin-RevId: 261267068
…rite call. In a recent change, this rewrite pattern was moved to the second greedy pattern rewrite call, but the TF FakeQuant ops are constant-folded in the first greedy pattern rewrite call, so the narrow_range and bit_width attributes from the TF FakeQuant ops were missing. We failed to test this case because the default values in the following passes match the TOCO requirement. This patch restores the original behavior and uses the TFL quantize and dequantize ops to preserve this information. This patch also fixes a related bug where the generated patterns were not applied in the second greedy pattern rewrite call, so the TF transpose/reshape ops were not lifted to enable constant folding; the generated patterns are now added to the second greedy pattern rewrite call. At the same time, two lifting rules were added so the TFL quantize and dequantize ops are handled. This patch also improves the implementation of the TFL QDQ-inserting pattern. Some related tests are simplified to only check the necessary invariants. PiperOrigin-RevId: 261267567
PiperOrigin-RevId: 261268533
This CL sets narrow_range to true to avoid the value -128 in int8 quantization, so the weight values range only over [-127, 127]. This enables faster runtime arithmetic kernels on ARM NEON. For uint8 quantization, 128 is subtracted from the quantized values and zero points, after which the int8 kernels can be used, so narrow_range for weights is set to true there as well. Note that the FakeQuant* ops for "weights" inserted in all the existing models already have narrow_range set to true, so this CL just makes it consistent for all the weights in the model. TOCO implements the same logic in its ensure_uint8_weights_safe_for_fast_int8_kernels pass. This optimization is very specific to the ARM architecture, so a TODO is added to make it configurable. Activations shouldn't use narrow_range; they should use the full range instead. PiperOrigin-RevId: 261269551
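A minimal numeric sketch of the narrow-range scheme described above, in NumPy with made-up weight values (not the actual TOCO/TFLite code path):

```python
import numpy as np

weights = np.array([-0.5, 0.0, 0.25, 0.5], dtype=np.float32)

# Symmetric int8 quantization with narrow_range=True: quantized values
# are restricted to [-127, 127]; the value -128 is never produced.
scale = np.abs(weights).max() / 127.0
q_int8 = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# For uint8 quantization, subtracting 128 from the quantized values and
# the zero point maps them onto int8, so the fast int8 kernels can be
# reused; the narrow range keeps the shifted values inside int8 bounds.
zero_point = 128
q_uint8 = (q_int8.astype(np.int16) + zero_point).astype(np.uint8)
q_roundtrip = (q_uint8.astype(np.int16) - 128).astype(np.int8)

assert (q_roundtrip == q_int8).all()
print(q_int8, q_uint8)
```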
PiperOrigin-RevId: 261269625
…rands with types matching the output type. This will allow instances not following these requirements to be printed so that they can be read back correctly. Also, updated the parser to parse the long form as well as the short form. A similar pattern of marking all but the first two operands as control inputs exists for other ops like NextIterationSinkOp, SwitchOp, and SwitchNOp, but those ops expect only two data operands, so no changes are required for them. PiperOrigin-RevId: 261272248
…MergeOp verifier. Variant types may have opaque subtype info that needs to match. Also, added a constraint that all data operands and the output of the MergeOp are of tensor type. PiperOrigin-RevId: 261277322
PiperOrigin-RevId: 261284396
PiperOrigin-RevId: 261284400
PiperOrigin-RevId: 261289882
Note that the cache key contains PyObject* and is therefore not easily reusable from other languages.

CPU

| Benchmark                       | Before (calls/sec) | After (calls/sec) |
|---------------------------------|--------------------|-------------------|
| benchmark_add_float_scalars     | 96697.1650772      | 122549.093512     |
| benchmark_add_int_scalars       | 100551.000642      | 124905.320251     |
| benchmark_create_float_constant | 269135.927106      | 368643.600035     |
| benchmark_create_int32_constant | 250023.088998      | 347383.13732      |

GPU

| Benchmark                       | Before (calls/sec) | After (calls/sec) |
|---------------------------------|--------------------|-------------------|
| benchmark_add_float_scalars     | 9478.74450315      | 17181.8063021     |
| benchmark_add_int_scalars       | 99584.0439651      | 117965.869066     |
| benchmark_create_float_constant | 275277.007219      | 381577.874818     |

Notes:
* The CPU and GPU timings are not comparable because they were measured on different hardware.
* I suspect that benchmark_add_int_scalars on GPU does the addition on CPU and copies to GPU afterwards, hence the gap between *_add_float_* and *_add_int_*.

PiperOrigin-RevId: 261293772
PiperOrigin-RevId: 261294904
PiperOrigin-RevId: 261296066
PiperOrigin-RevId: 261306157
and wrong opt set (RUY_OPT_INTRINSICS, not RUY_OPT_ASM, there is no asm here). PiperOrigin-RevId: 261314107
PiperOrigin-RevId: 261318771
Don't use nullness of local_packed or packing_status array pointers to determine whether a side is pre-packed: use params->is_prepacked for that. Make local_packed a member of TrMulTask so we don't need to pass it around explicitly. PiperOrigin-RevId: 261319667
in the single-thread case. PiperOrigin-RevId: 261320407
As the signed index is verified to be >= 0 at the point of comparison with the unsigned size, we can make the comparison explicitly unsigned by casting the index. This also avoids a -Wsign-compare warning where enabled. PiperOrigin-RevId: 261321178
…ounters. Saves a store-release and a load-acquire (total ~100 cycles) per matmul. PiperOrigin-RevId: 261321407
The AffineDataCopyGeneration pass relied on command-line flags for internal logic in several places, which made it unusable in a library context (i.e. outside a standalone mlir-opt binary that does the command-line parsing). Define the configuration flags in the constructor instead, and initialize them to the command-line-based defaults to maintain the original behavior. PiperOrigin-RevId: 261322364
PiperOrigin-RevId: 261806721
…tly pass it through in convolution kernel. PiperOrigin-RevId: 261808345
PiperOrigin-RevId: 261809474
PiperOrigin-RevId: 261816030
… directive. This allows for proper forward declaration, as opposed to leaking the internal implementation via a using directive. This also allows for all pattern building to go through 'insert' methods on the OwningRewritePatternList, replacing uses of 'push_back' and 'RewriteListBuilder'. PiperOrigin-RevId: 261816316
PiperOrigin-RevId: 261816763
PiperOrigin-RevId: 261816972
PiperOrigin-RevId: 261820730
The input_length arg is passed as the maximum_iterations arg to tf.while_loop, which adds a LogicalAnd to the loop condition; this is slow on GPU. PiperOrigin-RevId: 261822039
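For context, a hedged sketch of the `tf.while_loop` API involved (the loop body and bounds here are invented for the example):

```python
import tensorflow as tf

i0 = tf.constant(0)

# Plain loop: the condition is just `i < 10`.
r_plain = tf.while_loop(lambda i: i < 10, lambda i: i + 1, [i0])

# With maximum_iterations, TensorFlow also tracks an iteration counter and
# ANDs its bound into the loop condition (the LogicalAnd mentioned above),
# which is what was observed to be slow on GPU.
r_capped = tf.while_loop(lambda i: i < 10, lambda i: i + 1, [i0],
                         maximum_iterations=5)
```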
PiperOrigin-RevId: 261823955
PiperOrigin-RevId: 261828878
PiperOrigin-RevId: 261840270
Many LLVM transformations benefit from knowing the target. This enables optimizations, especially in a JIT context where the target is (generally) well known. Closes tensorflow/mlir#49 COPYBARA_INTEGRATE_REVIEW=tensorflow/mlir#49 from dcaballe:dcaballe/tti ab02f72eb326f660945696e5dadeeb983cf263b3 PiperOrigin-RevId: 261840617
It is now tensorflow/core/platform:platform PiperOrigin-RevId: 261843350
No functionality changes. Lessons learned:
1. Some protobuf messages are large, e.g. FunctionDef.
   - Solution: allocate them on the heap instead.
2. Sometimes the compiler inlines functions, so an inlined function's stack frame gets merged into the caller's stack frame, and we end up with a caller function with a large stack frame. This is caught by inspecting the assembly code.
   - Solution: add TF_ATTRIBUTE_NOINLINE to those inlined functions.
PiperOrigin-RevId: 261851076
PiperOrigin-RevId: 2618565
PiperOrigin-RevId: 261857381
…rnal to Google. Most tests were already being run with XLA. This primarily ensures that any new tests will also be run with XLA in the future. Some contrib/ tests are disabled for XLA because there are no guarantees on contrib being supported. PiperOrigin-RevId: 261861931
PiperOrigin-RevId: 261867752
PiperOrigin-RevId: 261867753
PiperOrigin-RevId: 261887312
This CL modifies the LowerLinalgToLoopsPass to use RewritePattern. This will make it easier to inline Linalg generic functions and regions when emitting to loops in a subsequent CL. PiperOrigin-RevId: 261894120
This CL extends the Linalg GenericOp with an alternative way of specifying the body of the computation based on a single block region. The "fun" attribute becomes optional. Either a SymbolRef "fun" attribute or a single block region must be specified to describe the side-effect-free computation. Upon lowering to loops, the new region body is inlined in the innermost loop. The parser, verifier and pretty printer are extended. Appropriate roundtrip, negative and lowering to loop tests are added. PiperOrigin-RevId: 261895568
fsx950223 pushed a commit that referenced this pull request on May 8, 2021:
Prototype showed significant dispatch performance improvements from the new backend. This is the first of a series of commits to add a new PJRT backend. The intention is to eventually replace the existing StreamExecutor-based CPU backend. PiperOrigin-RevId: 367514967 Change-Id: I16c9523b604445015125ad2e42fd8822ec0c38c5
fsx950223 pushed a commit that referenced this pull request on Nov 28, 2023:
On some CI nodes (typically those with higher CPU core counts, 128/256), the `//tensorflow/c/eager:c_api_distributed_test_gpu` test fails on an intermittent basis. When it does fail, the failure manifests as a segfault at the end of the test, with the stack dump shown at the end of this commit message. The stack dump points to a routine within the MKLDNN implementation. This is further confirmed by the observation that disabling the MKLDNN-based Eigen contraction kernels (for ROCm) seems to make the crash go away.

related JIRA ticket - https://ontrack-internal.amd.com/browse/SWDEV-313684

A previous commit disabled the `//tensorflow/c/eager:c_api_distributed_test` unit-test only in the CPU unit-tests CI job (for the same reason). That commit cannot be reverted, because this commit disables the MKLDNN-based Eigen contraction kernels *only* for the ROCm build.

```
Thread 191 "c_api_distribut" received signal SIGSEGV, Segmentation fault.
[Switching to thread 191 (Thread 0x7ffc777fe700 (LWP 159004))]
0x00007fff54530000 in ?? ()
(gdb) where
#0  0x00007fff54530000 in ?? ()
#1  0x00007fffd5d15ae4 in dnnl::impl::cpu::x64::avx_gemm_f32::sgemm_nocopy_driver(char const*, char const*, long, long, long, float const*, float const*, long, float const*, long, float const*, float*, long, float const*, float*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#2  0x00007fffd5d166e1 in dnnl::impl::cpu::x64::jit_avx_gemm_f32(int, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#3  0x00007fffd5e277ed in dnnl_status_t dnnl::impl::cpu::x64::gemm_driver<float, float, float>(char const*, char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, float const*, long const*, float const*, float const*, float*, long const*, float const*, bool, dnnl::impl::cpu::x64::pack_type, dnnl::impl::cpu::x64::gemm_pack_storage_t*, bool) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#4  0x00007fffd5665056 in dnnl::impl::cpu::extended_sgemm(char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*, bool) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#5  0x00007fffd52fe983 in dnnl_sgemm () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/libexternal_Smkl_Udnn_Uv1_Slibmkl_Udnn.so
#6  0x0000555557187b0b in Eigen::internal::TensorContractionKernel<float, float, float, long, Eigen::internal::blas_data_mapper<float, long, 0, 0, 1>, Eigen::internal::TensorContractionInputMapper<float, long, 1, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer>, Eigen::internal::TensorContractionInputMapper<float, long, 0, Eigen::TensorEvaluator<Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::ThreadPoolDevice>, Eigen::array<long, 1ul>, Eigen::array<long, 1ul>, 4, true, false, 0, Eigen::MakePointer> >::invoke(Eigen::internal::blas_data_mapper<float, long, 0, 0, 1> const&, Eigen::internal::ColMajorBlock<float, long> const&, Eigen::internal::ColMajorBlock<float, long> const&, long, long, long, float, float) ()
#7  0x000055555718dc76 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::kernel(long, long, long, bool) ()
#8  0x000055555718f327 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::signal_kernel(long, long, long, bool, bool) ()
#9  0x00005555571904cb in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::pack_rhs(long, long) ()
#10 0x000055555718fd69 in Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::EvalParallelContext<Eigen::TensorEvaluator<Eigen::TensorContractionOp<Eigen::array<Eigen::IndexPair<long>, 1ul> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::TensorMap<Eigen::Tensor<float const, 2, 1, long>, 16, Eigen::MakePointer> const, Eigen::NoOpOutputKernel const> const, Eigen::ThreadPoolDevice>::NoCallback, true, true, false, 0>::enqueue_packing_helper(long, long, long, bool) ()
#11 0x00007ffff6b607a1 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#12 0x00007ffff6b5de93 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#13 0x00007ffff6b40107 in tensorflow::(anonymous namespace)::PThread::ThreadFn(void*) () from /root/.cache/bazel/_bazel_root/efb88f6336d9c4a18216fb94287b8d97/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/c/eager/../../../_solib_local/_U_S_Stensorflow_Sc_Seager_Cc_Uapi_Udistributed_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.2
#14 0x00007fffd1ca86db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fffd00b471f in clone () from /lib/x86_64-linux-gnu/libc.so.6
```