[DataType] Add bfloat16 #5601
Conversation
cc @gussmith23 might be related to BYOD
Thanks @Menooker for the great work! The proposed changes mostly look good. I left a few comments.
src/target/llvm/codegen_llvm.cc
Outdated
@@ -309,6 +309,9 @@ llvm::Type* CodeGenLLVM::DTypeToLLVMType(const DataType& dtype) const {
      default:
        LOG(FATAL) << "do not support " << dtype;
    }
  } else if (dtype.is_bfloat()) {
    CHECK_EQ(dtype.bits(), 16);
Since bfloat is assumed to be 16-bit, can we keep the terminology more consistent? The data type is referred to as bf, bf16, bfloat16, and bfloat in the proposed change. Or are we going to support more data types like bfloat18 and bfloat20 in the future?
Sorry for the lack of clarity. I think in bfloat[X], only X=16 makes sense. But TVM's type system allows specifying the bits of a type, so this check makes sure it is bf16.
Good question. Will we treat TensorFloat-32 as bfloat20? If so, then bits is useful to distinguish them.
if __name__ == "__main__":
    test_promote()
    test_eliminate()
    test_legalize()
Please leave a newline at EOF, even if this is a test script :)
def np_bf162np_float(arr):
    '''Convert a numpy array of bf16 (uint16) to a numpy array
    of float'''
    u32 = np.left_shift(arr.astype('uint32'), 16)
Are we going to introduce a potential endianness problem here?
In my understanding, fp32 => bf16 casting preserves the higher-order bits (bits 31-16). We don't need to know whether the higher-order bits are stored at a larger or smaller address (which is what endianness determines); we just need to extract the bits by shifting, which is well defined - shifting alone is enough.
Reference: wiki for fp32 bit order
I am not 100% sure about this. I have tested the code on x86, not (yet) on other arch.
Can we reuse the following code snippet, which includes endianness checks? It also has wrapper functions below.
If my understanding is correct, we don't need to care about endianness. BF16 conversion only involves taking the higher-order bits, and the operation to take the higher-order bits in C++/NumPy is well defined.
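For illustration, here is a minimal NumPy sketch of the truncation-based conversions being discussed (np_bf162np_float mirrors the snippet above; the reverse helper's name here is made up for the example and the exact helpers in the PR may differ):
import numpy as np

def np_bf162np_float(arr):
    """Reinterpret bf16 bit patterns (stored as uint16) as float32 values."""
    u32 = np.left_shift(arr.astype('uint32'), 16)  # put the kept bits back at positions 31..16
    return u32.view('float32')                     # reinterpret the same 32 bits as float32

def np_float2np_bf16_trunc(arr):
    """Truncate float32 values to bf16 bit patterns (uint16) by dropping bits 15..0."""
    u32 = np.ascontiguousarray(arr, dtype='float32').view('uint32')
    return np.right_shift(u32, 16).astype('uint16')
Both functions only shift within integer values and reinterpret same-sized buffers, so the result does not depend on the machine's byte order.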
src/target/llvm/codegen_llvm.cc
Outdated
@@ -906,7 +954,7 @@ DEFINE_CODEGEN_BINARY_OP(Mul);
  llvm::Value* CodeGenLLVM::Create##Op(DataType t, llvm::Value* a, llvm::Value* b) { \
    if (t.is_int()) { \
      return builder_->CreateICmpS##Op(a, b); \
    } else if (t.is_uint()) { \
    } else if (t.is_uint() || t.is_bfloat()) { \
Isn't comparing bfloat16 this way risky?
FP32/FP64 comparisons are also bit-wise in my understanding.
src/target/llvm/codegen_llvm.cc
Outdated
          ? static_cast<llvm::Type*>(builder_->getInt32Ty())
          : llvm::VectorType::get(builder_->getInt32Ty(), from.lanes());
  auto v = builder_->CreateZExt(value, extended_type);
  v = builder_->CreateShl(v, 16);
Potential endianness problem here?
include/tvm/runtime/c_runtime_api.h
Outdated
@@ -114,6 +114,7 @@ typedef enum {
  kTVMNNVMLast = 20U,
  // The following section of code is used for non-reserved types.
  kTVMExtReserveEnd = 64U,
  kTVMBFloat = 65U,
We do not want BFloat to be passed as a PackedFunc argument; most floating-point PackedFunc arguments should always be passed as double.
I suppose TVM should support kernel generation, e.g. generating a fused "conv+bn+relu", rather than only generating an end-to-end model, which is the usual case. In this scenario, we might select some intermediate layers of the model and let TVM generate just those layers. The layers may require bf16 as the dtype, as they sit in the middle of the model.
What I want to say is that we sometimes need bf16 as the input dtype. In our use case at Intel, we need to generate a bf16 kernel (e.g. conv+bn+relu).
Such a dtype is covered by allocating a DLTensor whose type_code equals kBFloat; it does not need a patch to the code here (which is only needed for passing arguments through PackedFunc).
This particular code is used when we directly pass a constant into a PackedFunc, e.g. f(1.0, some_float_value). In these cases double can be used.
If we remove this type from the TVM runtime, we cannot pass a bf16 array to TVM via Python, and users can only pass bf16 buffers via the C runtime (or construct a bf16 DLTensor via Python in some awkward way). Currently, with kTVMBFloat defined, we can:
A = te.placeholder((32, ), dtype='bfloat16')
B = te.placeholder((32, ), dtype='bfloat16')
d = te.compute((32, ), lambda x: A[x] + B[x])
sch = te.create_schedule(d.op)
module = tvm.build(sch, [A, B, d])
npa = np.random.rand(32).astype('float32')
npb = np.random.rand(32).astype('float32')
a_ = np_float2tvm_bf16(npa)
b_ = np_float2tvm_bf16(npb)
c_ = tvm.nd.empty((32,), 'bfloat16')
module(a_, b_, c_)
This is useful for testing and prototyping.
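For completeness, a plausible sketch of the np_float2tvm_bf16 helper used above (the helper name comes from the example; backing the NDArray with uint16 storage is an assumption about how the test wires things up, and the exact code in the PR may differ):
import numpy as np
import tvm

def np_float2np_bf16(arr):
    # Keep only the upper 16 bits of each float32 value (plain truncation).
    u32 = np.ascontiguousarray(arr, dtype='float32').view('uint32')
    return np.right_shift(u32, 16).astype('uint16')

def np_float2tvm_bf16(arr):
    # Copy the bf16 bit patterns into a TVM NDArray that the built module can consume.
    nparr = np_float2np_bf16(arr)
    return tvm.nd.empty(nparr.shape, 'uint16').copyfrom(nparr)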
I don't think you will need kTVMBFloat to support this feature. The DataType::kDLBfloat flag in runtime::DataType should be sufficient for NDArray contents (because runtime::DataType's type code in the NDArray contents diverges from the TVM type code above the OpaqueHandle).
ok I understand. will change that
include/tvm/runtime/data_type.h
Outdated
@@ -81,6 +82,10 @@ class DataType {
  bool is_float() const { return code() == DataType::kFloat; }
  /*! \return whether type is a float16 type. */
  bool is_float16() const { return is_float() && bits() == 16; }
  /*! \return whether type is a bfloat type. */
  bool is_bfloat() const { return code() == DataType::kBFloat; }
given that only bfloat16 is defined, is_bf16 is a good enough function
ok, changed
include/tvm/runtime/data_type.h
Outdated
@@ -297,6 +302,8 @@ inline const char* TypeCode2Str(int type_code) {
    return "Object";
  case kTVMObjectRValueRefArg:
    return "ObjectRValueRefArg";
  case kTVMBFloat:
    return "bf";
bfloat
ok, changed
src/target/llvm/codegen_llvm.cc
Outdated
// cast operator
llvm::Value* CodeGenLLVM::CreateCast(DataType from, DataType to, llvm::Value* value) {
  llvm::Type* target = DTypeToLLVMType(to);
  if (value->getType() == target) return value;
  if (to.is_handle()) {
    return builder_->CreateBitCast(value, target);
  } else if (to.is_float() && from.is_bfloat()) {
    CHECK_EQ(from.bits(), 16);
If LLVM does not support bfloat, then perhaps we should do the legalization as a TIR => TIR pass as opposed to doing it in LLVM.
We are actually doing a TIR => TIR legalization pass in TVM. See src/tir/transforms/bf16_legalize.cc.
Then we should directly change the type to i16 during legalization and remove the special handling code for bfloat16.
Then we cannot tell whether it is a float32 => i16 or a float32 => bfloat16 cast.
There are 2 kinds of legalization:
- TIR -> TIR. After this PR, TIR has the full ability to describe any bfloat16 operation. This legalization is introduced only because of a hardware limitation: current hardware provides very few bfloat16 operations. One day, when hardware fully supports bfloat16 instructions, this legalization can ideally be skipped, so it is a target-dependent pass (see the sketch after this list).
- TIR -> LLVM IR. I guess this is the legalization that @tqchen mentions. Because LLVM IR doesn't natively support bfloat16, i16 is used to replace bfloat16. In this PR, I guess this is done within codegen_llvm, not by a separate pass.
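A small sketch of how the TIR -> TIR stage might be invoked from Python (assuming the pass in src/tir/transforms/bf16_legalize.cc is exposed as tvm.tir.transform.BF16Legalize; the exact API may differ):
import tvm
from tvm import te

# Build a tiny bf16 workload and lower it to TIR.
A = te.placeholder((32,), dtype="bfloat16", name="A")
B = te.placeholder((32,), dtype="bfloat16", name="B")
C = te.compute((32,), lambda i: A[i] + B[i], name="C")
sch = te.create_schedule(C.op)
mod = tvm.lower(sch, [A, B, C])

# TIR -> TIR: promote bf16 arithmetic to fp32 and remove redundant cast pairs.
mod = tvm.tir.transform.BF16Legalize()(mod)
print(mod)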
Then we should legalize the cast as well in TIR, to introduce the actual implementation of the cast functions in TIR. Please also refer to https://tvm.apache.org/2020/05/20/bring-your-own-datatypes for a related implementation.
Just 2 small questions.
Did you mean totally eliminating the bf16 dtype in the legalization pass? That would bring much more complexity into the BF16Legalize pass, because we would need to check every TIR node to replace bf16 with int16. In contrast, the current implementation only changes computation TIR nodes. And in the codegen, the bf16 handling is quite simple: just adding another 'else if' in the cast node and in the TVM-dtype-to-LLVM-type converter.
And I think the way of processing "custom data types" that you mentioned does not fit this PR well. Actually, I had already noticed this feature before I wrote this bf16 feature, but it needs function calls to do the lowering, which is not friendly to the codegen backend for auto-vectorization and so on. Of course, we could implement this cast function as an intrinsic, but that brings more complexity.
I think letting the bf16 dtype live until codegen is a good idea; it makes legalization and the implementation of casting easier.
Given that this is a new feature that will affect quite a few people, please open a new RFC thread in the discuss forum to describe the motivation and the high-level design. Thank you!
@tqchen Thanks for the clarification. I have changed kTVMNullPtr back to 4.
@vinx13 @ZihengJiang @liangfu it would be great if you can take another look. Thanks @Menooker for continuing to improve the PR.
src/tir/transforms/bf16_legalize.cc
Outdated
// implementation from
// https://github.com/pytorch/pytorch/blob/master/c10/util/BFloat16.h
inline uint16_t round_to_nearest_even(float src) {
Per Google C style, we cannot directly copy code from another codebase into the mainline; we would need to either put it in 3rdparty or implement it independently.
changed
@junrushao1994 can you also take a quick look at this PR? Thank you!
def orig1(a,b):
    return lambda i: a[i]+b[i]+a[99-i]+b[99-i]
def after1(a,b):
    return lambda i: to16(to32(a[i])+to32(b[i])+to32(a[99-i])+to32(b[99-i]))
def orig2(a,b):
    return lambda i: a[i]*b[i]+a[99-i]*b[99-i]+a[i]
def after2(a,b):
    return lambda i: to16(to32(a[i])*to32(b[i])+to32(a[99-i])*to32(b[99-i])+to32(a[i]))
I am not so sure why the coding style here can pass pylint... Mind sending a simple fix?
Sorry for that. I have now formatted the file and manually run pylint on it. BTW, the test Python files are never checked by TVM CI's pylint :)
Oops, I see. That makes sense then :-)
@vinx13 @ZihengJiang @liangfu it would be great if you can take another look and approve or request changes explicitly: https://tvm.apache.org/docs/contribute/code_review.html#approve-and-request-changes-explicitly
@@ -72,6 +73,9 @@ class DataType {
  data_.code = static_cast<uint8_t>(code);
  data_.bits = static_cast<uint8_t>(bits);
  data_.lanes = static_cast<uint16_t>(lanes);
  if (code == kBFloat) {
    CHECK_EQ(bits, 16);
It is understandable that right now we only support bf16, but my concern is: should we put the check here?
I understand your concern. Any suggestions for the location where we put this check? Thanks.
@tqchen This is just a nitpick. What do you think?
let us leave it as it is for now, we can come back to it later
include/tvm/runtime/data_type.h
Outdated
@@ -372,7 +372,7 @@ inline DLDataType String2DLDataType(std::string s) {
    t.lanes = 1;
    return t;
  } else if (s.substr(0, 6) == "bfloat") {
    t.code = kTVMBFloat;
    t.code = kDLBfloat;
I agree with tq
python/tvm/_ffi/_cython/base.pxi
Outdated
@@ -27,7 +27,7 @@ cdef enum TVMTypeCode:
    kUInt = 1
    kFloat = 2
    kTVMOpaqueHandle = 3
    kTVMNullptr = 4
    kBFloat = 4
shall we remove this?
python/tvm/_ffi/runtime_ctypes.py
Outdated
@@ -96,6 +98,9 @@ def __init__(self, type_str):
            self.type_code = DataTypeCode.HANDLE
            bits = 64
            head = ""
        elif head.startswith("bfloat"):
            self.type_code = 4
not sure if it is good to hard code here
Changed to DataTypeCode. TVM refactors a lot (which is good), and when this PR was raised, all the type codes here were hard-coded.
The other two issues you raised were also changed as requested.
LGTM :-)
Thanks @Menooker for being patient and continuing to improve the PR to maintain a high quality standard! Thanks @ZhennanQin @junrushao1994 @liangfu for the helpful reviews!
We add bfloat16 as a new type named "bf16" in the frontend and complete the LLVM backend for generating bf16 code.
Details on legalization
Since most hardware has no native support for computation on bf16, we add a BF16Legalization pass that uses fp32 to compute on bf16 data. It adds cast_to_fp32() before each Op involving bf16 operands and uses the fp32 version of the Op to compute; finally, it adds a cast_to_bf16() after each Op that is altered, e.g. add(a, b) => cast16(add(cast32(a), cast32(b))). We call this phase "BF16Promotion". It is a sub-pass of the BF16Legalization pass.
We note that this adds redundant casting, e.g. add(a, neg(b)) => cast16(add(cast32(a), cast32(cast16(neg(cast32(b)))))). The pattern cast32(cast16(some_fp32_value)) can be simplified to some_fp32_value. Thus, we add an optimization pass after "BF16Promotion" in the BF16Legalization pass, which eliminates these redundant casts.
After the BF16Legalization pass, there is no bf16-related computation left in the AST, except casting between fp32 and bf16, bf16 value comparison, and assignment.
Casting between fp32 and bf16
We follow PyTorch's bf16 casting implementation.
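For reference, a NumPy sketch of fp32 -> bf16 rounding to nearest even in the spirit of PyTorch's BFloat16.h (an independent illustration under that assumption, not the code added by this PR; NaN handling is omitted):
import numpy as np

def np_float2np_bf16_rne(arr):
    """Round float32 values to the nearest bf16 bit pattern, ties to even."""
    u32 = np.ascontiguousarray(arr, dtype='float32').view('uint32')
    # Adding 0x7FFF plus the lowest kept bit, then truncating, rounds to
    # nearest and breaks ties toward an even bf16 mantissa.
    rounding_bias = np.uint32(0x7FFF) + ((u32 >> 16) & np.uint32(1))
    return ((u32 + rounding_bias) >> 16).astype('uint16')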