JSON import/export: added base64 support for [ubyte] array and nested FBS #4585

vglavnyy · 2018-01-05T18:41:03Z

Issue #4393
Supported strict RFC4648 Base64 and Base64 with URL and Filename Safe Alphabet.
Added two new attributes (base64) and (base64url).
Added new option IDLOptions.output_base64_cancel_padding.
The Monster schema extended and regenerated for testing purposes.

Problem:
I don't know what to do with Base64SelftTest function.
https://github.com/vglavnyy/flatbuffers/blob/base64-PR/src/util_base64.cpp#L337-L448

This is the test of internal implementations of base64 encoder and decoder.
I can remove it from PR.

@mikkelfj can you review this PR for compatibility with flatcc?

mikkelfj · 2018-01-05T19:21:31Z

I can't confirm by reading the code, but I added the test case for input and wrote the expected output in padded strict JSON. This passes the flatcc json test suite including verifying the nested buffer (which is kind of lucky since the monster_test.fbs are not exactly identical).

dvidelabs/flatcc@9ed2733

Note that the test fails if flatcc compiles with UQ off (unquoted), meaning the test case fails when it expects strict json, but that is to be expected.

I have released my implementation in v0.5.0 earlier today so the above is not in the release.

mikkelfj · 2018-01-05T19:23:06Z

Note that I did not add runtime flags to control printing with or without padding - mostly because it was messy in the input, as you said. But it could be added later. I think what we have is fine for interop.

mikkelfj · 2018-01-05T19:25:49Z

BTW: the TEST macro in my test takes two string arguments, the first is the input of a monster_test.fbs json file and the second argument is the expected json printed output without spaces if the parse is successful.

mikkelfj · 2018-01-05T19:45:26Z

I tried to test the schema, but some semantics have diverged a bit on namespaces and language specific attributes. Removing these from the schema and the schema parses cleanly with base64, base64url attributes.

aardappel · 2018-01-08T16:59:59Z

include/flatbuffers/util_base64.h

+  return static_cast<Base64Mode>(static_cast<int>(a) | static_cast<int>(b));
+}
+
+inline bool Base64ModeIsStrict(Base64Mode m) { return (m & kBase64Strict); }


remove ()

aardappel · 2018-01-08T17:00:35Z

include/flatbuffers/util_base64.h

+
+// upper bound estimate of a size of the base64 decoded sequence
+static inline size_t Base64EstimateDecodedSize(size_t src_size) {
+  return (src_size / 4) * 3 + (src_size % 4);


some comments as to where these constants come from, and/or make them into named constants.

aardappel · 2018-01-08T17:01:31Z

samples/monster_generated.h

@@ -603,120 +603,6 @@ inline void EquipmentUnion::Reset() {
  type = Equipment_NONE;
 }



please run sh generate_code.sh, this will ensure the generated code in this commit will have the same settings as before.

aardappel · 2018-01-08T17:02:43Z

src/idl_gen_text.cpp

@@ -46,9 +48,12 @@ void OutputIdentifier(const std::string &name, const IDLOptions &opts,
 // Print (and its template specialization below for pointers) generate text
 // for a single FlatBuffer value into JSON format.
 // The general case for scalars:
-template<typename T>


where does this formatting come from? unless done by clang-format, please don't reformat things that are unrelated to your PR.

aardappel · 2018-01-08T17:03:32Z

src/idl_gen_text.cpp

@@ -69,10 +74,61 @@ bool Print(T val, Type type, int /*indent*/, Type * /*union_type*/,
  return true;
 }

+// Print a vector as bas64-encoded string


missing e

aardappel · 2018-01-08T17:09:05Z

src/idl_gen_text.cpp

+  // reserve text-space before temporary allocation
+  text.reserve(text.size() + req_size + 64);
+  std::vector<char> buf(req_size);  // temporary buffer
+  size_t b64_len = Base64Encode(b64mode, src, src_size, flatbuffers::vector_data(buf), buf.size());


why does this function encode into a vector<char> and not a string? If it took a string, we could pass it text, and it could append straight where it is needed. See similarly how EscapeString is called.

aardappel · 2018-01-08T17:33:13Z

src/util_base64.cpp

+                                const size_t dst_size,
+                                size_t *const error_position = nullptr) {
+  // strict base64 table (RFC 4648)
+  B64_DATA_ALIGNAS(64) static const uint8_t strict_table[256] = {


why is this alignment needed? if this is to align it with cache lines for performance, please remove. prefer simple code over optimizations that have not been shown to be a clear bottleneck.

aardappel · 2018-01-08T17:51:41Z

src/util_base64.cpp

+  size_t N4 = src_size / 4;
+  size_t R4 = src_size % 4;
+  // deferr last char[4] block to remained block
+  if ((R4 == 0) && (N4 > 0)) {


more comments / better identifier names please. This code is pretty cryptic.

aardappel · 2018-01-08T17:54:11Z

src/util_base64.cpp

+  }
+}
+
+#define TEST_EQ(exp, val) TestEq(exp, val, #exp, __FILE__, __LINE__)


move testing code to test.cpp

aardappel · 2018-01-09T00:27:30Z

tests/monster_test.fbs

@@ -44,6 +44,13 @@ table Stat {
  count:ushort;
 }

+table TestBase64
+{


can you in-line this table? reduces the amount of new files generated.

vglavnyy · 2018-01-17T08:29:36Z

At commit 79b80f8 the method ReserveElements was removed from FlatBufferBuilder class.
ReserveElements
My code uses this public method for memory preallocation.
How to preallocate a big Vector without using temporary objects like std::vector<>?

mikkelfj · 2018-01-17T09:57:46Z

@aardappel reservation in some form is necessary, otherwise the base64 decoder becomes very complex. The output size can be estimated conservatively within a small margin allowing the decoder to run without having to check output boundaries and manage flushing.

vglavnyy · 2018-01-17T18:14:31Z

Resolved.
Problem with ReserveElements can be solved with method:
uoffset_t CreateUninitializedVector(size_t len, size_t elemsize, uint8_t **buf)

mikkelfj · 2018-01-19T23:03:52Z

I just realized that this also needs to be tested with union strings and string vectors, and union vectors with strings.

vglavnyy · 2018-01-20T01:29:40Z

Can you present examples of these cases?
I'm finishing requested changes within a few days.
At the first stage (this PR), only [ubyte] arrays are supported.
At the second, nested buffers.

mikkelfj · 2018-01-20T06:57:49Z

I got confused - I forgot it was only ubyte and [ubyte] cannot be part of unions.

I was thinking of a use case like loading multiple PEM certificates where each cert type would have different union type so the following could be handled

http://how2ssl.com/articles/working_with_pem_files/

This would currently require each cert ot be loaded into a separate table, but may that is not a bad thing.

mikkelfj · 2018-01-23T19:21:23Z

I\m not sure exactly what the latest changes do, but it appears to enforcing padding.

flatcc isn't very configurable right now, but I believe it accepts unpadded input and in the future will be allowed to generate unpadded.

There are some JSON base64 formats floating around that define base64 to be unpadded, so I don't think it is wise to reject it, given that the string is fully delimited.

mikkelfj · 2018-01-23T19:32:31Z

This is protocol buffers base64: "JSON value will be the data encoded as a string using standard base64 encoding with paddings. Either standard or URL-safe base64 encoding with/without paddings are accepted."
https://developers.google.com/protocol-buffers/docs/proto3#json

vglavnyy · 2018-01-24T06:33:00Z

@mikkelfj,
Thank you for review and you are right.
This commit with a default base64 decode/encode behaviour was not good. This was experiment with code. As we discussed earlier encoder/decoder behavior must satisfy:
base64/base64url decoder: padding is optional,
base64 output: with padding,
base64url output: with padding and have IDL-option JsonBase64CancelPadding.

mikkelfj · 2018-01-24T10:21:44Z

@vglavnyy note there is one new suggestion: that base64 decodes also base64url and that base64url also decodes base64 unless explicitly required not to by some optional configuration

Currently I'm not sure if flatcc does this, but I know that the implementation is prepared for it and I'd be happy to make this update later.

vglavnyy · 2018-01-24T10:57:27Z

The (base64url) decoder implicitly can decode sequence form (base64) encoder, but versa does not.
I can add compatibility (base64url)=>(base64).
But this will be further away from the RFC specification.
Even (base64)=>(base64url) compatibility violates RFC.

vglavnyy · 2018-01-24T11:18:56Z

@aardappel ,
I done almost all requested changes and did rebase to master.
Now I have two unresolved problem.
First:
To the file "util_base64.h" aded templated function field_base64_mode.
It used in two places now. But if add nested buffers support it will be used in additional two places.
Should I delete this function or move it to another file?

Second:

Change requets for implementation of PrintVectorBase64 function:
why does this function encode into a vector and not a string? If it took a string, we could pass it text, and it could append straight where it is needed. See similarly how EscapeString is called.

PrintVectorBase64

Use appeding of characters to a string (like EscapeString) is impossible for current implementation of base64 encoder. Base64 encoder use direct pointer for writing to destination memory.
The vector copy can be avoided if use direct write to the string internal memory after extending it append(0, req_len). Address of memory can be obtained using &test[x] operator.
C++11 gives guarantee that internal memory of a std::string use a contiguous block of memory. But I'm not sure with correctness of this solution.

vglavnyy · 2018-02-13T03:22:05Z

@aardappel , could you review latest changes in the base-PR code?

KageKirin · 2018-02-13T10:42:27Z

Any ETA on when this PR is finalized and merged? I'd like to use it for some projects of mine (among which, https://github.com/KageKirin/flatGLTF)

aardappel · 2018-02-15T23:14:17Z

include/flatbuffers/util_base64.h

+// [BASE64]{https://tools.ietf.org/html/rfc4648}
+
+// a Field doesn't have base64/base64url attribute
+static const int kBase64Undefined = 0;


change these into an enum?

aardappel · 2018-02-15T23:16:55Z

include/flatbuffers/util_base64.h

+static const int kBase64CancelPadding = 4;
+
+// Helper: FDT is <idl.h>::FieldDef type
+template<typename FDT> inline int field_base64_mode(const FDT *fd) {


Just make it rely on FieldDef directly rather than this template trick. You can include idl.h since this this file is included in places where idl.h was already included anyway.

aardappel · 2018-02-15T23:17:44Z

include/flatbuffers/util_base64.h

+  return mode;
+}
+
+// calculate number of bytes required to save Decoded result


Comments are sentences: start with a capital and end with a .

aardappel · 2018-02-15T23:19:43Z

src/idl_gen_text.cpp

+  (void)type;
+  (void)indent;
+  auto b64mode = field_base64_mode(fd);
+  // leave if the Filed doesn't have base64 attribute


aardappel · 2018-02-15T23:26:34Z

src/idl_gen_text.cpp

+  // type T can by any, not only uint8_t
+  const auto *src = reinterpret_cast<const uint8_t *>(v.data());
+  const auto src_size = v.size();
+  const auto req_size = Base64EncodedSize(b64mode, src, src_size);


please refactor these functions. Base64EncodedSize is not needed, since the destination is a vector (or string), so we can simply push_back. There's no need for an intermediate vector either. Just pass _text to Base64Encode.

I'm going to stop my review here since it is clear my comments from last time haven't been addressed. Please confirm you have addressed all my comments (from both reviews) thruout the code before asking me to review again.

vglavnyy · 2018-02-20T03:52:31Z

@aardappel, Thank you for your review.
You are completely right about the implementation of the PrintVectorBase64 function and I fixed it.
Can we split review of this PR into two parts?

First part is an interface between the Flatbuffers and a base64 routines. The current interface declared in the “base64_utils.h” and used in the "idl_gen_text.cpp", "idl_parse.cpp" and "test.cpp". The enum “Base64Mode” define working modes for decode/encode routines.
The encoder has a simple interface. The decoder has two routines: Base64DecodedSize and Base64Decode. The current implementation of the decoder can return the position of a first invalid symbol inside a base64 string. This restricts a possible implementation of a base64 decoder. It can be removed from the interface and from the tests.

The second part is an implementation of base64 routines. There are a lot of implementations of base64 encode/decode routines and anyone can find or write own.
The proposed implementation located in the base64_utils.cpp. But for the first commit, it can be replaced by stub which does not violate current behavior of the Flatbuffers.

aardappel · 2018-02-22T17:38:32Z

I'm not sure what all this splitting up buys us. My exact problem with the current PR is that there's an mismatch: the core routines have C-style interfaces that require copying by the user. I proposed the opposite: that the core routines directly work with the types already used by the parser/generator, such that there is less translation steps. So if anything, less splitting up, more integration.

vglavnyy · 2018-02-23T10:33:22Z

Ok. I understand your proposal.
And I propose closing this PR and return to the issue stage.
We can take the already defined attributes "base64" and "base64url" as a basis for the issue (with or without the IDL option "base64_cancel_padding").
Then once again discussing the behavior of the encoder and decoder for these attributes if that behavior is not clear. After that, someone can propose a new solution for this issue.

aardappel · 2018-02-23T18:30:40Z

I don't think I understand. Why not just fix this PR with my suggestions?

vglavnyy · 2018-02-26T05:13:11Z

I don't know how to implement your suggestions and I need advice.
Let's start from simple case of printing.

Current implementation

template<typename T>
bool PrintVector(const Vector<T> &v, Type type, int indent,
                 const IDLOptions &opts, std::string *_text,
                 const FieldDef *fd = nullptr) {
  std::string &text = *_text;

  // Try to print UCHAR array as base64 string.
  if (type.base_type == BASE_TYPE_UCHAR) {
    auto b64mode = FieldGetBase64Mode(fd);
    if (b64mode) {
      text += "\"";
      Base64Encode(
          opts.base64_cancel_padding ? Base64CancelPadding(b64mode) : b64mode,
          reinterpret_cast<const uint8_t *>(v.data()), v.size(), &text);
      text += "\"";
      return true;
    }
  }
  .....
}

Proposed:

// Interface declaration from future version of the util_base64.h
// Try to print the vector of UCHAR as base64 string.
// Returns false if the field fd doesn't have base64 attribute.
bool PrintBase64Vector(const Vector<UCHAR> &v,  // data source
                       Type type, int indent, const IDLOptions &opts,
                       std::string *_text,  // destination
                       const FieldDef *fd);

Future implementation of PrintVector from idl_gen_text.cpp

template<typename T>
bool PrintVector(const Vector<T> &v, Type type, int indent,
                 const IDLOptions &opts, std::string *_text,
                 const FieldDef *fd = nullptr) {
  std::string &text = *_text;
  if (type.base_type == BASE_TYPE_UCHAR) {
    // very ugly reinterpret cast
    PrintBase64Vector(*(reinterpret_cast<const Vector<UCHAR> *>(&v)), type,
                      ident, opts, _text, fd)
  }
  ....
  ....
}

How to add other types not only UCHAR?
Templated version of the PrintBase64Vector will break encapsulation and will lead to the current version of PrintVector.

aardappel · 2018-02-26T16:52:05Z

Yes, you'll likely need templates. Break what encapsulation? There's no need for encapsulation of anything here.

Supported strict RFC4648 Base64 and Base64 with URL and Filename Safe Alphabet. Added two new attributes (base64) and (base64url). Added new option IDLOptions.output_base64_cancel_padding.

…r empty vector).

vglavnyy · 2018-03-12T11:54:04Z

@aardappel, could you review this PR again?
I rewrote interface to the base64 routines and removed low-level core routines from the header file.
Is this the correct way?

aardappel · 2018-03-15T21:21:03Z

include/flatbuffers/util_base64.h

+// Takes into account the "opts.base64_cancel_padding" option if a field has
+// (base64url) attribute and cancel trailing padding for the output string.
+template<typename T>
+static inline bool PrintBase64Vector(const Vector<T> &v, Type type, int indent,


just make this an if in the caller rather than template specialization.

aardappel · 2018-03-15T21:21:19Z

include/flatbuffers/util_base64.h

+                       const FieldDef *fd);
+
+// Internal tests for decode/encode routines.
+int UtilBase64InerTest();


move to test.cpp

aardappel · 2018-03-15T21:22:12Z

src/idl_gen_text.cpp

+                 const IDLOptions &opts, std::string *_text,
+                 const FieldDef *fd = nullptr) {
+  // Try to print an array as base64 string.
+  if (PrintBase64Vector(v, type, indent, opts, _text, fd)) return true;


don't pass it opts etc, only call this function if base64 is requested.

aardappel · 2018-03-15T21:23:23Z

src/idl_parser.cpp

+        val.constant = NumToString(off);
+      } else if ((token_ == kTokenStringConstant) && !attribute_.empty()) {
+        uoffset_t off;
+        auto decoded = ParseBase64Vector(val.type.VectorType(), opts,


check here if base64 is requested, not inside the function

aardappel · 2018-03-15T21:31:02Z

src/util_base64.cpp

+// Encodes the input byte sequence into the base64 encoded string.
+static void Base64Encode(Base64Mode base64_mode, const uint8_t *src,
+                         size_t src_size, std::string *_text) {
+  const auto req_size = B64EncodeImpl(base64_mode, src, src_size, nullptr, 0);


all the complexity in this function is unnecessary. a std::string can have variable amount of characters added. So remove this code for computing the size, and inline this function with B64EncodeImpl and simplify. Same for decode. This is the third time I am asking. I will stop my review here. Please do not ask me to review without addressing comments.

vglavnyy · 2018-03-29T17:58:48Z

I apologize for wasting your time.
This problem is harder for me than I thought.
I hope this issue will be solved by someone with more acceptable way.
If a one needs a temporary solution, I have made it as the addon with source code under CC0.

aardappel · 2018-04-02T17:17:59Z

Why don't you simply address my code review suggestions? From your other code you are clearly capable of doing so. The biggest waste of time is creating a feature and then not merging it because you don't want to make changes to your code.

vglavnyy · 2018-04-04T17:25:37Z

The problem was with me and my hidden goal.
My project requires high-performance base64 encoder and decoder with AVX2 and NEON optimizations.
Suggested approach with the symbol by symbol appending to the string will block the optimization.

Could we once again return to the beginning of PR?
Here is the updated proposal of the base64 interface base64-PR2.
This interface has only two unimplemented routines:

void util_base64_encode(int b64mode, const uint8_t *src, size_t src_size, std::string *_text);
size_t util_base64_decode(int b64mode, const std::string &src, uint8_t *dst, size_t dst_size);

I believe that these routines should be implemented inside util_base64.cpp without dependencies from flatbuffers. In this case, one can easily replace the encoder/decoder if it is needed.
Is this interface acceptable?

mikkelfj · 2018-04-05T06:10:04Z

So, I know Wouters position is different here. But in my view, the "reserve" facility in STL i exactly designed for the case where you feed in a known quantity of data or need to process that data in-memory. So while I can that in most cases messing with reserve is just code obfuscation, I think that does not apply here. Especially considering that a base64 string could be in the size of many megabytes.

vglavnyy · 2018-04-05T13:54:37Z

Now I agree with Wouter. The code should be simple and clean and optimization doesn't help to make the code simple.
If util_base64_encode and util_base64_decode are pure functions this will keep simplicity and flexibility.

In the proposed solution, only CreateUninitializedVector can't be avoided.
Memory for decoded array should be reserved before decode.
Calculation of required size might be very simple as R=text.size()*3/4 if to trim preallocated vector to written size after decoding.
This "*3/4" guess will differ from the real written number no more than 2 bytes. It depends on trailing padding in the source base64 string.

Update:
I rewrote the ParseBase64Vector: base64-PR2.

aardappel · 2018-04-06T16:24:57Z

All my suggestions are in my past code-review comments.

vglavnyy force-pushed the base64-PR branch 7 times, most recently from ab24e2d to 9c10590 Compare January 6, 2018 09:23

aardappel requested changes Jan 9, 2018

View reviewed changes

vglavnyy force-pushed the base64-PR branch from 9c10590 to 04bcca0 Compare January 10, 2018 02:31

vglavnyy force-pushed the base64-PR branch from a40e6a2 to 4563ad2 Compare January 18, 2018 03:05

vglavnyy force-pushed the base64-PR branch from 32e41b3 to 23a0871 Compare January 23, 2018 19:32

vglavnyy force-pushed the base64-PR branch 2 times, most recently from d1de374 to 1491ffd Compare January 24, 2018 10:08

vglavnyy force-pushed the base64-PR branch from 1491ffd to bc20a80 Compare February 9, 2018 04:32

aardappel requested changes Feb 15, 2018

View reviewed changes

vglavnyy force-pushed the base64-PR branch 2 times, most recently from 33a95b8 to a1bca88 Compare February 20, 2018 03:39

vglavnyy force-pushed the base64-PR branch 2 times, most recently from 25fd95a to 1307c99 Compare March 12, 2018 09:19

vglavnyy added 4 commits March 12, 2018 16:22

JSON import/export: added base64 support for [ubyte] array.

e43aaa9

Supported strict RFC4648 Base64 and Base64 with URL and Filename Safe Alphabet. Added two new attributes (base64) and (base64url). Added new option IDLOptions.output_base64_cancel_padding.

The base64 padding behavior restored.

8445540

return back to flatbuffers::vector_data for tests (after debug fix fo…

d021338

…r empty vector).

Base64Encode interface refactoring.

1389630

vglavnyy force-pushed the base64-PR branch from 1307c99 to 42d532a Compare March 12, 2018 09:28

Changes in the public interface of util_base64.

d6f8b1b

vglavnyy force-pushed the base64-PR branch from 42d532a to d6f8b1b Compare March 12, 2018 09:42

aardappel requested changes Mar 15, 2018

View reviewed changes

vglavnyy closed this Mar 29, 2018

KageKirin mentioned this pull request May 30, 2018

[C++] "missing" support for base64 attributes as used by flatcc #4762

Closed

		@@ -603,120 +603,6 @@ inline void EquipmentUnion::Reset() {
		type = Equipment_NONE;
		}

JSON import/export: added base64 support for [ubyte] array and nested FBS #4585

JSON import/export: added base64 support for [ubyte] array and nested FBS #4585

Conversation

vglavnyy commented Jan 5, 2018

mikkelfj commented Jan 5, 2018

mikkelfj commented Jan 5, 2018

mikkelfj commented Jan 5, 2018 • edited Loading

mikkelfj commented Jan 5, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vglavnyy commented Jan 17, 2018

mikkelfj commented Jan 17, 2018

vglavnyy commented Jan 17, 2018

mikkelfj commented Jan 19, 2018

vglavnyy commented Jan 20, 2018

mikkelfj commented Jan 20, 2018

mikkelfj commented Jan 23, 2018

mikkelfj commented Jan 23, 2018

vglavnyy commented Jan 24, 2018

mikkelfj commented Jan 24, 2018

vglavnyy commented Jan 24, 2018

vglavnyy commented Jan 24, 2018

vglavnyy commented Feb 13, 2018

KageKirin commented Feb 13, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vglavnyy commented Feb 20, 2018

aardappel commented Feb 22, 2018

vglavnyy commented Feb 23, 2018

aardappel commented Feb 23, 2018

vglavnyy commented Feb 26, 2018

aardappel commented Feb 26, 2018

vglavnyy commented Mar 12, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vglavnyy commented Mar 29, 2018

aardappel commented Apr 2, 2018

vglavnyy commented Apr 4, 2018 • edited Loading

mikkelfj commented Apr 5, 2018

vglavnyy commented Apr 5, 2018 • edited Loading

aardappel commented Apr 6, 2018

mikkelfj commented Jan 5, 2018 •

edited

Loading

vglavnyy commented Apr 4, 2018 •

edited

Loading

vglavnyy commented Apr 5, 2018 •

edited

Loading