Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON import/export: added base64 support for [ubyte] array and nested FBS #4585

Closed
wants to merge 5 commits into from

Conversation

vglavnyy
Copy link
Contributor

@vglavnyy vglavnyy commented Jan 5, 2018

Issue #4393
Supported strict RFC4648 Base64 and Base64 with URL and Filename Safe Alphabet.
Added two new attributes (base64) and (base64url).
Added new option IDLOptions.output_base64_cancel_padding.
The Monster schema extended and regenerated for testing purposes.

Problem:
I don't know what to do with Base64SelftTest function.
https://github.com/vglavnyy/flatbuffers/blob/base64-PR/src/util_base64.cpp#L337-L448

This is the test of internal implementations of base64 encoder and decoder.
I can remove it from PR.

@mikkelfj can you review this PR for compatibility with flatcc?

@mikkelfj
Copy link
Contributor

mikkelfj commented Jan 5, 2018

I can't confirm by reading the code, but I added the test case for input and wrote the expected output in padded strict JSON. This passes the flatcc json test suite including verifying the nested buffer (which is kind of lucky since the monster_test.fbs are not exactly identical).

dvidelabs/flatcc@9ed2733

Note that the test fails if flatcc compiles with UQ off (unquoted), meaning the test case fails when it expects strict json, but that is to be expected.

I have released my implementation in v0.5.0 earlier today so the above is not in the release.

@mikkelfj
Copy link
Contributor

mikkelfj commented Jan 5, 2018

Note that I did not add runtime flags to control printing with or without padding - mostly because it was messy in the input, as you said. But it could be added later. I think what we have is fine for interop.

@mikkelfj
Copy link
Contributor

mikkelfj commented Jan 5, 2018

BTW: the TEST macro in my test takes two string arguments, the first is the input of a monster_test.fbs json file and the second argument is the expected json printed output without spaces if the parse is successful.

@mikkelfj
Copy link
Contributor

mikkelfj commented Jan 5, 2018

I tried to test the schema, but some semantics have diverged a bit on namespaces and language specific attributes. Removing these from the schema and the schema parses cleanly with base64, base64url attributes.

@vglavnyy vglavnyy force-pushed the base64-PR branch 7 times, most recently from ab24e2d to 9c10590 Compare January 6, 2018 09:23
return static_cast<Base64Mode>(static_cast<int>(a) | static_cast<int>(b));
}

inline bool Base64ModeIsStrict(Base64Mode m) { return (m & kBase64Strict); }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove ()


// upper bound estimate of a size of the base64 decoded sequence
static inline size_t Base64EstimateDecodedSize(size_t src_size) {
return (src_size / 4) * 3 + (src_size % 4);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments as to where these constants come from, and/or make them into named constants.

@@ -603,120 +603,6 @@ inline void EquipmentUnion::Reset() {
type = Equipment_NONE;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please run sh generate_code.sh, this will ensure the generated code in this commit will have the same settings as before.

@@ -46,9 +48,12 @@ void OutputIdentifier(const std::string &name, const IDLOptions &opts,
// Print (and its template specialization below for pointers) generate text
// for a single FlatBuffer value into JSON format.
// The general case for scalars:
template<typename T>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does this formatting come from? unless done by clang-format, please don't reformat things that are unrelated to your PR.

@@ -69,10 +74,61 @@ bool Print(T val, Type type, int /*indent*/, Type * /*union_type*/,
return true;
}

// Print a vector as bas64-encoded string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing e

// reserve text-space before temporary allocation
text.reserve(text.size() + req_size + 64);
std::vector<char> buf(req_size); // temporary buffer
size_t b64_len = Base64Encode(b64mode, src, src_size, flatbuffers::vector_data(buf), buf.size());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this function encode into a vector<char> and not a string? If it took a string, we could pass it text, and it could append straight where it is needed. See similarly how EscapeString is called.

const size_t dst_size,
size_t *const error_position = nullptr) {
// strict base64 table (RFC 4648)
B64_DATA_ALIGNAS(64) static const uint8_t strict_table[256] = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this alignment needed? if this is to align it with cache lines for performance, please remove. prefer simple code over optimizations that have not been shown to be a clear bottleneck.

size_t N4 = src_size / 4;
size_t R4 = src_size % 4;
// deferr last char[4] block to remained block
if ((R4 == 0) && (N4 > 0)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more comments / better identifier names please. This code is pretty cryptic.

}
}

#define TEST_EQ(exp, val) TestEq(exp, val, #exp, __FILE__, __LINE__)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move testing code to test.cpp

@@ -44,6 +44,13 @@ table Stat {
count:ushort;
}

table TestBase64
{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you in-line this table? reduces the amount of new files generated.

@vglavnyy
Copy link
Contributor Author

At commit 79b80f8 the method ReserveElements was removed from FlatBufferBuilder class.
ReserveElements
My code uses this public method for memory preallocation.
How to preallocate a big Vector without using temporary objects like std::vector<>?

@mikkelfj
Copy link
Contributor

@aardappel reservation in some form is necessary, otherwise the base64 decoder becomes very complex. The output size can be estimated conservatively within a small margin allowing the decoder to run without having to check output boundaries and manage flushing.

@vglavnyy
Copy link
Contributor Author

Resolved.
Problem with ReserveElements can be solved with method:
uoffset_t CreateUninitializedVector(size_t len, size_t elemsize, uint8_t **buf)

@mikkelfj
Copy link
Contributor

I just realized that this also needs to be tested with union strings and string vectors, and union vectors with strings.

@vglavnyy
Copy link
Contributor Author

Can you present examples of these cases?
I'm finishing requested changes within a few days.
At the first stage (this PR), only [ubyte] arrays are supported.
At the second, nested buffers.

@mikkelfj
Copy link
Contributor

I got confused - I forgot it was only ubyte and [ubyte] cannot be part of unions.

I was thinking of a use case like loading multiple PEM certificates where each cert type would have different union type so the following could be handled

http://how2ssl.com/articles/working_with_pem_files/

This would currently require each cert ot be loaded into a separate table, but may that is not a bad thing.

@mikkelfj
Copy link
Contributor

I\m not sure exactly what the latest changes do, but it appears to enforcing padding.

flatcc isn't very configurable right now, but I believe it accepts unpadded input and in the future will be allowed to generate unpadded.

There are some JSON base64 formats floating around that define base64 to be unpadded, so I don't think it is wise to reject it, given that the string is fully delimited.

@mikkelfj
Copy link
Contributor

This is protocol buffers base64: "JSON value will be the data encoded as a string using standard base64 encoding with paddings. Either standard or URL-safe base64 encoding with/without paddings are accepted."
https://developers.google.com/protocol-buffers/docs/proto3#json

@vglavnyy
Copy link
Contributor Author

@mikkelfj,
Thank you for review and you are right.
This commit with a default base64 decode/encode behaviour was not good. This was experiment with code. As we discussed earlier encoder/decoder behavior must satisfy:
base64/base64url decoder: padding is optional,
base64 output: with padding,
base64url output: with padding and have IDL-option JsonBase64CancelPadding.

@vglavnyy vglavnyy force-pushed the base64-PR branch 2 times, most recently from d1de374 to 1491ffd Compare January 24, 2018 10:08
@mikkelfj
Copy link
Contributor

@vglavnyy note there is one new suggestion: that base64 decodes also base64url and that base64url also decodes base64 unless explicitly required not to by some optional configuration

Currently I'm not sure if flatcc does this, but I know that the implementation is prepared for it and I'd be happy to make this update later.

@vglavnyy
Copy link
Contributor Author

The (base64url) decoder implicitly can decode sequence form (base64) encoder, but versa does not.
I can add compatibility (base64url)=>(base64).
But this will be further away from the RFC specification.
Even (base64)=>(base64url) compatibility violates RFC.

@vglavnyy
Copy link
Contributor Author

@aardappel ,
I done almost all requested changes and did rebase to master.
Now I have two unresolved problem.
First:
To the file "util_base64.h" aded templated function field_base64_mode.
It used in two places now. But if add nested buffers support it will be used in additional two places.
Should I delete this function or move it to another file?

Second:

Change requets for implementation of PrintVectorBase64 function:
why does this function encode into a vector and not a string? If it took a string, we could pass it text, and it could append straight where it is needed. See similarly how EscapeString is called.

PrintVectorBase64

Use appeding of characters to a string (like EscapeString) is impossible for current implementation of base64 encoder. Base64 encoder use direct pointer for writing to destination memory.
The vector copy can be avoided if use direct write to the string internal memory after extending it append(0, req_len). Address of memory can be obtained using &test[x] operator.
C++11 gives guarantee that internal memory of a std::string use a contiguous block of memory. But I'm not sure with correctness of this solution.

@vglavnyy
Copy link
Contributor Author

@aardappel , could you review latest changes in the base-PR code?

@KageKirin
Copy link
Contributor

Any ETA on when this PR is finalized and merged? I'd like to use it for some projects of mine (among which, https://github.com/KageKirin/flatGLTF)

// [BASE64]{https://tools.ietf.org/html/rfc4648}

// a Field doesn't have base64/base64url attribute
static const int kBase64Undefined = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change these into an enum?

static const int kBase64CancelPadding = 4;

// Helper: FDT is <idl.h>::FieldDef type
template<typename FDT> inline int field_base64_mode(const FDT *fd) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just make it rely on FieldDef directly rather than this template trick. You can include idl.h since this this file is included in places where idl.h was already included anyway.

return mode;
}

// calculate number of bytes required to save Decoded result
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments are sentences: start with a capital and end with a .

(void)type;
(void)indent;
auto b64mode = field_base64_mode(fd);
// leave if the Filed doesn't have base64 attribute
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

field

// type T can by any, not only uint8_t
const auto *src = reinterpret_cast<const uint8_t *>(v.data());
const auto src_size = v.size();
const auto req_size = Base64EncodedSize(b64mode, src, src_size);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please refactor these functions. Base64EncodedSize is not needed, since the destination is a vector (or string), so we can simply push_back. There's no need for an intermediate vector either. Just pass _text to Base64Encode.

I'm going to stop my review here since it is clear my comments from last time haven't been addressed. Please confirm you have addressed all my comments (from both reviews) thruout the code before asking me to review again.

@vglavnyy vglavnyy force-pushed the base64-PR branch 2 times, most recently from 33a95b8 to a1bca88 Compare February 20, 2018 03:39
@vglavnyy
Copy link
Contributor Author

@aardappel, Thank you for your review.
You are completely right about the implementation of the PrintVectorBase64 function and I fixed it.
Can we split review of this PR into two parts?

First part is an interface between the Flatbuffers and a base64 routines. The current interface declared in the “base64_utils.h” and used in the "idl_gen_text.cpp", "idl_parse.cpp" and "test.cpp". The enum “Base64Mode” define working modes for decode/encode routines.
The encoder has a simple interface. The decoder has two routines: Base64DecodedSize and Base64Decode. The current implementation of the decoder can return the position of a first invalid symbol inside a base64 string. This restricts a possible implementation of a base64 decoder. It can be removed from the interface and from the tests.

The second part is an implementation of base64 routines. There are a lot of implementations of base64 encode/decode routines and anyone can find or write own.
The proposed implementation located in the base64_utils.cpp. But for the first commit, it can be replaced by stub which does not violate current behavior of the Flatbuffers.

@aardappel
Copy link
Collaborator

I'm not sure what all this splitting up buys us. My exact problem with the current PR is that there's an mismatch: the core routines have C-style interfaces that require copying by the user. I proposed the opposite: that the core routines directly work with the types already used by the parser/generator, such that there is less translation steps. So if anything, less splitting up, more integration.

@vglavnyy
Copy link
Contributor Author

Ok. I understand your proposal.
And I propose closing this PR and return to the issue stage.
We can take the already defined attributes "base64" and "base64url" as a basis for the issue (with or without the IDL option "base64_cancel_padding").
Then once again discussing the behavior of the encoder and decoder for these attributes if that behavior is not clear. After that, someone can propose a new solution for this issue.

@aardappel
Copy link
Collaborator

I don't think I understand. Why not just fix this PR with my suggestions?

@vglavnyy
Copy link
Contributor Author

I don't know how to implement your suggestions and I need advice.
Let's start from simple case of printing.

Current implementation

template<typename T>
bool PrintVector(const Vector<T> &v, Type type, int indent,
                 const IDLOptions &opts, std::string *_text,
                 const FieldDef *fd = nullptr) {
  std::string &text = *_text;

  // Try to print UCHAR array as base64 string.
  if (type.base_type == BASE_TYPE_UCHAR) {
    auto b64mode = FieldGetBase64Mode(fd);
    if (b64mode) {
      text += "\"";
      Base64Encode(
          opts.base64_cancel_padding ? Base64CancelPadding(b64mode) : b64mode,
          reinterpret_cast<const uint8_t *>(v.data()), v.size(), &text);
      text += "\"";
      return true;
    }
  }
  .....
}

Proposed:

// Interface declaration from future version of the util_base64.h
// Try to print the vector of UCHAR as base64 string.
// Returns false if the field fd doesn't have base64 attribute.
bool PrintBase64Vector(const Vector<UCHAR> &v,  // data source
                       Type type, int indent, const IDLOptions &opts,
                       std::string *_text,  // destination
                       const FieldDef *fd);

Future implementation of PrintVector from idl_gen_text.cpp

template<typename T>
bool PrintVector(const Vector<T> &v, Type type, int indent,
                 const IDLOptions &opts, std::string *_text,
                 const FieldDef *fd = nullptr) {
  std::string &text = *_text;
  if (type.base_type == BASE_TYPE_UCHAR) {
    // very ugly reinterpret cast
    PrintBase64Vector(*(reinterpret_cast<const Vector<UCHAR> *>(&v)), type,
                      ident, opts, _text, fd)
  }
  ....
  ....
}

How to add other types not only UCHAR?
Templated version of the PrintBase64Vector will break encapsulation and will lead to the current version of PrintVector.

@aardappel
Copy link
Collaborator

Yes, you'll likely need templates. Break what encapsulation? There's no need for encapsulation of anything here.

@vglavnyy vglavnyy force-pushed the base64-PR branch 2 times, most recently from 25fd95a to 1307c99 Compare March 12, 2018 09:19
Supported strict RFC4648 Base64 and Base64 with URL and Filename Safe Alphabet.
Added two new attributes (base64) and (base64url).
Added new option IDLOptions.output_base64_cancel_padding.
@vglavnyy
Copy link
Contributor Author

@aardappel, could you review this PR again?
I rewrote interface to the base64 routines and removed low-level core routines from the header file.
Is this the correct way?

// Takes into account the "opts.base64_cancel_padding" option if a field has
// (base64url) attribute and cancel trailing padding for the output string.
template<typename T>
static inline bool PrintBase64Vector(const Vector<T> &v, Type type, int indent,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just make this an if in the caller rather than template specialization.

const FieldDef *fd);

// Internal tests for decode/encode routines.
int UtilBase64InerTest();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to test.cpp

const IDLOptions &opts, std::string *_text,
const FieldDef *fd = nullptr) {
// Try to print an array as base64 string.
if (PrintBase64Vector(v, type, indent, opts, _text, fd)) return true;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't pass it opts etc, only call this function if base64 is requested.

val.constant = NumToString(off);
} else if ((token_ == kTokenStringConstant) && !attribute_.empty()) {
uoffset_t off;
auto decoded = ParseBase64Vector(val.type.VectorType(), opts,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check here if base64 is requested, not inside the function

// Encodes the input byte sequence into the base64 encoded string.
static void Base64Encode(Base64Mode base64_mode, const uint8_t *src,
size_t src_size, std::string *_text) {
const auto req_size = B64EncodeImpl(base64_mode, src, src_size, nullptr, 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the complexity in this function is unnecessary. a std::string can have variable amount of characters added. So remove this code for computing the size, and inline this function with B64EncodeImpl and simplify. Same for decode. This is the third time I am asking. I will stop my review here. Please do not ask me to review without addressing comments.

@vglavnyy
Copy link
Contributor Author

I apologize for wasting your time.
This problem is harder for me than I thought.
I hope this issue will be solved by someone with more acceptable way.
If a one needs a temporary solution, I have made it as the addon with source code under CC0.

@vglavnyy vglavnyy closed this Mar 29, 2018
@aardappel
Copy link
Collaborator

Why don't you simply address my code review suggestions? From your other code you are clearly capable of doing so. The biggest waste of time is creating a feature and then not merging it because you don't want to make changes to your code.

@vglavnyy
Copy link
Contributor Author

vglavnyy commented Apr 4, 2018

The problem was with me and my hidden goal.
My project requires high-performance base64 encoder and decoder with AVX2 and NEON optimizations.
Suggested approach with the symbol by symbol appending to the string will block the optimization.

Could we once again return to the beginning of PR?
Here is the updated proposal of the base64 interface base64-PR2.
This interface has only two unimplemented routines:

void util_base64_encode(int b64mode, const uint8_t *src, size_t src_size, std::string *_text);
size_t util_base64_decode(int b64mode, const std::string &src, uint8_t *dst, size_t dst_size);

I believe that these routines should be implemented inside util_base64.cpp without dependencies from flatbuffers. In this case, one can easily replace the encoder/decoder if it is needed.
Is this interface acceptable?

@mikkelfj
Copy link
Contributor

mikkelfj commented Apr 5, 2018

So, I know Wouters position is different here. But in my view, the "reserve" facility in STL i exactly designed for the case where you feed in a known quantity of data or need to process that data in-memory. So while I can that in most cases messing with reserve is just code obfuscation, I think that does not apply here. Especially considering that a base64 string could be in the size of many megabytes.

@vglavnyy
Copy link
Contributor Author

vglavnyy commented Apr 5, 2018

Now I agree with Wouter. The code should be simple and clean and optimization doesn't help to make the code simple.
If util_base64_encode and util_base64_decode are pure functions this will keep simplicity and flexibility.

In the proposed solution, only CreateUninitializedVector can't be avoided.
Memory for decoded array should be reserved before decode.
Calculation of required size might be very simple as R=text.size()*3/4 if to trim preallocated vector to written size after decoding.
This "*3/4" guess will differ from the real written number no more than 2 bytes. It depends on trailing padding in the source base64 string.

Update:
I rewrote the ParseBase64Vector: base64-PR2.

@aardappel
Copy link
Collaborator

All my suggestions are in my past code-review comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants