
Allow the schema to be an associative array #389

Merged: 4 commits merged into jsonrainbow:6.0.0-dev from erayd:array-schemas on Mar 21, 2017

Conversation

@erayd (Contributor) commented Mar 16, 2017

What

If the schema supplied to Validator::validate is an array, convert it to an object prior to using it for validation.

Why

Because it was requested in #388, and because it will improve the user-friendliness of the library interface, particularly given that we already allow the data being validated to be an associative array.
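
For illustration, this is the kind of call the change enables (a sketch based on the library's documented usage; exact signatures may vary between versions):

<?php

use JsonSchema\Validator;

$data = json_decode('{"name": "Alice"}');

// After this change, the schema may be a plain associative array
// rather than a decoded object tree:
$schema = [
    'type' => 'object',
    'properties' => [
        'name' => ['type' => 'string']
    ],
    'required' => ['name']
];

$validator = new Validator();
$validator->validate($data, $schema);

var_dump($validator->isValid()); // bool(true)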

@erayd commented Mar 16, 2017

Grrr.... didn't think this one through properly and screwed it up. Please don't merge or waste time with code reviews until I've fixed it (will be tomorrow sometime; it's bedtime here now).

@shmax (Collaborator) commented Mar 16, 2017

Okie-doke. Quick comment, though--I guess I was hoping to remove object casting altogether. Is that outside the realm of possibility?

@erayd commented Mar 16, 2017

@shmax Do you mean get rid of it in your own code, or get rid of it within the library code?

@erayd commented Mar 17, 2017

Right, now that my idiotic mistake has been corrected.... @shmax, any chance you'd be willing to review this?

Also, what do you mean by "hoping to remove object casting altogether" - can you expand on that a bit?

@erayd commented Mar 17, 2017

For the record, my idiotic mistake was forgetting to consider that some arrays are supposed to be arrays, and should remain that way in order to process properly.
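
To make that concrete: list-like arrays within a schema (e.g. enum values) must survive as arrays, and only associative arrays should become objects. A small illustration (not from the PR itself):

<?php

$schema = [
    'type' => 'string',
    'enum' => ['red', 'green', 'blue'] // must remain a JSON array
];

// A blind recursive (object) cast would turn the enum into an object
// with the keys "0", "1" and "2"; encoding shows the distinction that
// needs to be preserved:
echo json_encode($schema);
// {"type":"string","enum":["red","green","blue"]}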

Review thread on the new arrayToObjectRecursive() helper:

 * @return object
 */
public static function arrayToObjectRecursive($array)
{

@shmax (Collaborator) commented on the diff:
is this any faster than just doing this?

return json_decode(json_encode($array), false);

@erayd replied:

I haven't benchmarked it. Do you have an opinion on which is better?

@shmax replied:

I would think my version is faster, because you're letting all the work happen in C, and you don't have the overhead of the recursive PHP function calls. Time it for yourself though and see what you think.
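
For context, a plain-PHP recursive conversion along these lines (a sketch, not necessarily the exact code in this PR) pays one userland function call per nested array, whereas the json round-trip does the whole traversal in C:

<?php

function arrayToObjectRecursive(array $array)
{
    foreach ($array as $key => $value) {
        if (is_array($value)) {
            $array[$key] = arrayToObjectRecursive($value);
        }
    }

    // Keep sequential (list-like) arrays as arrays; cast only
    // associative arrays to objects, mirroring JSON semantics.
    if ($array === array() || array_keys($array) === range(0, count($array) - 1)) {
        return $array;
    }

    return (object) $array;
}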

@erayd replied:

You make a good point 👍

A collaborator added:

I think it is worth considering the performance in this case. Perhaps a continuous means of measuring performance could be added in another PR, to facilitate ongoing improvement and scrutiny?

@shmax commented Mar 17, 2017

Do you mean get rid of it in your own code, or get rid of within the library code?

Both. I'm wondering if there's any real reason to use an object for the schema internally, at all. For people like me who use native PHP arrays it adds overhead. People who use real JSON files have to do a conversion whether we use arrays or objects, so it makes no difference to them. There's probably something I'm not considering, so it would be nice to talk through this more. I imagine the original authors would have some insight.

@erayd commented Mar 17, 2017

I'm wondering if there's any real reason to use an object for the schema internally, at all.

It's better performance-wise to use one or the other internally, rather than allowing both and dealing with it on-the-fly (e.g. via the LooseTypeCheck::property* methods).
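
As a sketch of what that on-the-fly handling costs (illustrative only; the real LooseTypeCheck API may differ), every property access has to branch on the container type instead of committing to one representation up front:

<?php

function propertyGet($container, $property)
{
    // One type check per access, for every constraint that reads
    // the schema - this adds up across a large validation run.
    if (is_object($container)) {
        return $container->$property;
    }

    return $container[$property];
}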

That said, it doesn't particularly matter which one is chosen - the library uses objects at the moment, but could just as easily use arrays. I must admit I prefer object syntax, but that's a personal preference only.

@shmax commented Mar 17, 2017

That said, it doesn't particularly matter which one is chosen - the library uses objects at the moment, but could just as easily use arrays. I must admit I prefer object syntax, but that's a personal preference only.

Sure, to most people it just looks a little sexier:

$v = $foo->bar;

vs

$v = $foo['bar'];

I guess I'd like to know exactly how much overhead the conversion to object typically takes before I start making too much fuss over this. I can do a little more research this weekend.

@erayd commented Mar 17, 2017

I guess I'd like to know exactly how much overhead the conversion to object typically takes before I start making too much fuss over this. I can do a little more research this weekend.

My instinct would be that it's trivial, and not enough to matter. There's probably a small difference (I suspect the arrays are slightly faster for internal use, and yes, the object cast will incur a little overhead), but I'm not particularly keen to refactor the entire library to use arrays unless it's significant.

If you're keen to benchmark it though, I'm curious - will wait to see what you come up with :-).

@erayd commented Mar 17, 2017

Sure, to most people it just looks a little sexier

Mmm, I like the object notation mainly because it's nicer to type, and eliminates a set of brackets ;-).

@shmax commented Mar 17, 2017

If you're keen to benchmark it though, I'm curious - will wait to see what you come up with :-).

Will try to get some numbers this weekend.

@shmax commented Mar 18, 2017

I used a tool I found online to benchmark object casting vs not:

with casting: https://3v4l.org/0V0iK/perf#output
without: https://3v4l.org/S0Hg3/perf#output

Nothing too shocking about the results--looks like you save less than a tenth of a second by not casting, so I'm not going to sweat it.

@shmax commented Mar 18, 2017

I also benchmarked json_decode(json_encode($val), false), and this one's also pretty tough to call:

@erayd's method: https://3v4l.org/Xe9Mh/perf#output
@shmax's method: https://3v4l.org/OVaiT/perf#output

It looks like yours has a faster "System time", and mine has a faster "User time". I have no idea what the difference is.

@erayd commented Mar 19, 2017

@shmax Looks too close to call really - the difference isn't enough to tell them apart. Do you have a preference for which approach you'd like this PR to use going forward?

@shmax commented Mar 19, 2017

If it were me, and all other things being equal, I'd go for the version that is less code; that said, I wish I knew what is meant by "user time" and "system time".

@erayd commented Mar 19, 2017

The system time is the amount of time the process spends in system calls to the kernel. The user time is the amount of time the process spends in userspace.
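
For reference, PHP exposes the same two counters for the current process via getrusage() (a quick illustration, not part of this PR):

<?php

// ru_utime = user time, ru_stime = system time, per the standard
// Unix rusage counters.
$usage = getrusage();

$user   = $usage['ru_utime.tv_sec'] + $usage['ru_utime.tv_usec'] / 1e6;
$system = $usage['ru_stime.tv_sec'] + $usage['ru_stime.tv_usec'] / 1e6;

printf("user: %.6fs, system: %.6fs\n", $user, $system);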

Re code amounts, they're about equal, assuming the return status of json_encode is properly handled (i.e. check json_last_error() === \JSON_ERROR_NONE, and do something with any error that occurs).
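
A sketch of that checking (the exception type here is illustrative, not necessarily what this PR uses):

<?php

function arrayToObject(array $array)
{
    $json = json_encode($array);

    // json_encode() returns false and sets an error code when the
    // array contains something that JSON cannot represent (e.g. a
    // resource); don't silently pass NULL on to the validator.
    // Note: json_last_error_msg() requires PHP >= 5.5.
    if ($json === false || json_last_error() !== \JSON_ERROR_NONE) {
        throw new \InvalidArgumentException(
            'Could not encode schema array as JSON: ' . json_last_error_msg()
        );
    }

    return json_decode($json, false);
}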

One other consideration is that json_encode will error if the schema contains e.g. a resource (because resources are an abstract type which cannot be exported), whereas the current arrayToObjectRecursive will not care and will work properly regardless (because it's not trying to export anything). I think I would prefer to keep the current code for this reason, but it's a mild preference - the overwhelming majority of people will not have a resource in their schema arrays!

Thoughts?

@shmax commented Mar 19, 2017

Re code amounts, they're about equal, assuming the return status of json_encode is properly handled (i.e. check json_last_error() === \JSON_ERROR_NONE, and do something with any error that occurs).

I don't think you need to worry about that. Presumably PHP will not produce syntax errors in its own JSON output.

One other consideration is that json_encode will error if the schema contains e.g. a resource

I guess I'd have to see a real use case, but a JSON schema is supposed to be composed of JSON, isn't it?

My vote is for json_decode. If "user space" is the one we care about then it's faster, it's more compact, and it's how I've always seen this kind of thing done. But you don't have to listen to me. We can see if anyone else has an opinion.

@erayd commented Mar 19, 2017

I don't think you need to worry about that. Presumably PHP will not produce syntax errors in its own JSON output. [snip] I guess I'd have to see a real use case, but a JSON schema is supposed to be composed of JSON, isn't it?

It's not syntax errors I was concerned about - it's the potential to end up with a null schema if encoding fails - which is why I said error checking of json_encode, but not json_decode. For example:

<?php

$fp = fopen('php://stdout', 'w');
$schema = [
    'myProperty' => 'myValue',
    'myResource' => $fp // resources cannot be represented in JSON
];

// json_encode() fails and returns false, so json_decode() yields NULL
$schema = json_decode(json_encode($schema));

var_dump($schema); // NULL

A JSON schema is supposed to be JSON, yes. However, PHP arrays are not JSON, and may contain things which are not valid in the context of JSON (like resources), most likely because the user is generating the array on-the-fly using something else that may output resources. If we're going to cast them, we need to ensure that this is done without silently dropping important data on the floor - which means we have to have error checking.

My vote is for json_decode.

OK - I will change it to use this.

If "user space" is the one we care about then it's faster...

I think you misunderstand - what we care about is the total time, regardless of where that time is spent. Using less user time at the expense of more kernel time doesn't mean it's faster, it just means that the execution is occurring in a different place. The correct metric to compare them on is SUM(user, system). By that metric, they're essentially identical.

@erayd commented Mar 19, 2017

How's that?

@shmax commented Mar 19, 2017

Looking good. Regarding system vs user time, I received a response from the maintainer of that benchmarking tool about it, and he says:

The reason is that I use standard Linux measuring tools which return these two times. An extensive explanation can be found for example on http://stackoverflow.com/a/556411/3749523; the short version is you can probably ignore the system time; as it mostly represents the 'overhead' required to start the php binary (which is also why it gets higher for newer versions, they are simply bigger in size).

The user-time is the best measurement since you cannot influence the system-time. I've long considered adding them together (or hide the system-time) but haven't decided if that's the best way to go. I'll think about it some more ;)

@erayd commented Mar 19, 2017

If the system time for our case is mostly PHP's own startup overhead, then that means the difference in system time is jitter in his benchmark platform, and probably means that the margin of error is larger than I had thought - running the tests in a loop with a high iteration count would provide better insight. Simply throwing the system time out is not a good idea - it's still a relevant part of the measurement.

System time can be influenced by making fewer / less expensive system calls - however for our particular use-case, he's right that there's not much we can do about it. How much scope is available for changing that kind of thing depends very much on what you're actually trying to achieve.
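
For what it's worth, a looped comparison could look something like this (illustrative only; assumes an arrayToObjectRecursive() along the lines already discussed):

<?php

$schema = [
    'type' => 'object',
    'properties' => ['name' => ['type' => 'string']]
];

$iterations = 100000;

// Method 1: json round-trip
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $result = json_decode(json_encode($schema), false);
}
$roundTrip = microtime(true) - $start;

// Method 2: recursive userland cast
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $result = arrayToObjectRecursive($schema);
}
$recursive = microtime(true) - $start;

printf("round-trip: %.3fs, recursive: %.3fs\n", $roundTrip, $recursive);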

@shmax commented Mar 19, 2017

FFS, then benchmark it yourself.

@erayd commented Mar 19, 2017

FFS, then benchmark it yourself.

I have no interest in benchmarking it again. The benchmarks you've already run show that the two cases are close enough that IMO it's not worth fussing about which one is used from a performance perspective, so I'm happy with either on that score. My discussion of the benchmarking metrics used was purely because you mentioned that you did not understand them, so I figured it might be useful to explain some about how they actually work.

You've mentioned a preference for json_decode(json_encode()), so that is what I have done - I'm not trying to convince you to change that, and I'm quite happy with it.

@shmax commented Mar 19, 2017

so I figured it might be useful to explain some about how they actually work.

Yes, and you were wrong. When I provided evidence instead of just moving on you began this elaborate explanation about how the tests can't be trusted, and I officially ran out of patience. Do what you like.

@erayd commented Mar 19, 2017

Yes, and you were wrong. When I provided evidence instead of just moving on you began this elaborate explanation about how the tests can't be trusted, and I officially ran out of patience. Do what you like.

I don't think I was wrong (and I don't think the quote you provided was wrong either - it certainly doesn't disagree with what I was saying) - however this seems to be turning a bit toxic, so let's just leave things where they are, agree to disagree, and move on.

I was not trying to say the tests can't be trusted, and if I gave that impression then I apologise, and should probably have worded things a bit better than I did.

@shmax commented Mar 19, 2017

You said:

I think you misunderstand - what we care about is the total time, regardless of where that time is spent. Using less user time at the expense of more kernel time doesn't mean it's faster, it just means that the execution is occurring in a different place. The correct metric to compare them on is SUM(user, system). By that metric, they're essentially identical.

Meanwhile, the author of the tool himself explains that system is concerned with PHP startup time, and can essentially be discounted. It is NOT the sum of the two values we are talking about, and it has nothing to do with work moving to the kernel in my version.

I was not trying to say the tests can't be trusted

Er, yes, that is exactly what you did:

that means the difference in system time is jitter in his benchmark platform, and probably means that the margin of error is larger than I had thought ... running the tests in a loop with a high iteration count would provide better insight

Sorry for being irritable, but I suggested json_decode(json_encode($var), false) on the grounds that it is 1. faster and 2. less code. For some reason you have been doing a lot of jedi hand-waving to prove that both are false, and now you are retroactively denying everything even when we have it down in black and white. Ordinarily I love talking about coding, but I find the 'alternative facts' going on here exhausting.

@erayd commented Mar 19, 2017

Based on that, it sounds like you want to debate this further, rather than simply disagree and move on... I only started discussing the metrics in the first place because you expressed a desire to know what they meant! So I will try to clarify what I was trying to say earlier.

Please bear in mind that I'm not trying for 'alternative facts' here - I'm just trying to clarify my position, as I feel you have badly misunderstood my intentions. I'm not saying that the situation is your fault - it could well be mine, and perhaps I could have worded some of my earlier replies more clearly.

So:

  1. System time is time spent in system calls, user time is time spent in userspace. The total time used by the process is the sum of those two metrics. This is corroborated by the benchmark author's assertion that he is using the standard Linux metrics, by the SO post that he linked to in your quote, and there is plenty of other documentation available on this should you wish to look further. This is what I was talking about when I said "...and I don't think the quote you provided was wrong either - it certainly doesn't disagree with what I was saying."
  2. The benchmark author says that, for the benchmark, the system time is mostly taken up by PHP's startup. I'm not disputing that, and it makes sense, as PHP startup will likely make a large number of system calls.
  3. The benchmark author says that because of (2), we can ignore the system time for the purposes of figuring out how long our code took to run. I'm not disputing that either.
  4. If the system time is PHP's startup, then it should be identical for the same version across both tests, because the 'PHP startup' task is identical. The fact that it's not means there is some kind of variance on the benchmark platform (aka jitter), probably due to varying CPU load at the time the benchmarks were run.
  5. That jitter can be directly observed by re-running the same test through the benchmark multiple times. Note that there is some kind of caching mechanism in effect, so you will need to change the input each time (e.g. by inserting or changing a comment).
  6. Higher jitter means higher margin of error, which needs to be controlled for. This does not mean that the test is untrustworthy, it just means that we need to bear it in mind when comparing very similar results.
  7. I suggested a looped test, because that will give an average time for the test to complete, and will control for jitter by running the same test over a long duration.
  8. I am disputing that system time can be thrown out completely, because it provides an important insight into the benchmark.

Sorry for being irritable, but I suggested json_decode(json_encode($var), false) on the grounds that it is 1. faster and 2. less code. For some reason you have been doing a lot of jedi hand-waving to prove that both are false, and now you are retroactively denying everything even when we have it down in black and white.

For the record, I have never tried to claim that either approach was faster. I did claim that they are essentially identical in speed, given the numbers available, and provided additional explanation of the benchmark metrics (per above) to clarify why I believe that to be the case. I said I was happy to go with your preference, and implemented it. And I suggested a way of controlling for the jitter (looping) in case you wished to obtain a more accurate benchmark result, as it seems you care about micro-optimising this case (which I have no problem with; I also tend to micro-optimise when I have the time).

It's worth bearing in mind here that we are arguing about a difference of single-digit milliseconds, based on a single run of only one test case, for a runs-only-once piece of code! This is not IMO worth anything like the amount of angst in this thread.

As far as less code is concerned... it's objectively not. The only way you get less code is by ignoring error cases. Given that the result of an error here is fairly serious (i.e. feeding a NULL schema to the validator instead of the schema the user provided), ignoring the chance of an error doesn't feel reasonable.

That said, while I would prefer to do proper error checking, my preference is mild - as I said above, somebody feeding us a schema with resources in it is fairly unlikely. If you'd rather have no error checking for the sake of doing all this on a single line, then I'm happy to do that.

@shmax commented Mar 20, 2017

it sounds like you want to debate this further

No, I do not.

code LGTM

@erayd commented Mar 20, 2017

No, I do not.

Then I'll leave the subject alone. 🤝

code LGTM

Thanks for reviewing :-).

@bighappyface (Collaborator) left a review:

+1

@bighappyface merged commit 5bc0d34 into jsonrainbow:6.0.0-dev on Mar 21, 2017
@erayd mentioned this pull request on Mar 21, 2017
erayd added a commit to erayd/json-schema that referenced this pull request on Mar 21, 2017:
* Allow the schema to be an associative array

Implements jsonrainbow#388.

* Use json_decode(json_encode()) for array -> object cast

* Skip exception check on PHP versions < 5.5.0

* Skip test on HHVM, as it's happy to encode resources
@erayd deleted the array-schemas branch on March 22, 2017 at 21:36
bighappyface pushed a commit that referenced this pull request on Mar 22, 2017:
* Add URI translation for retrieval & add local copies of spec schema

* Add use line for InvalidArgumentException & adjust scope (#372)

Fixes issue #371

* add quiet option (#382)

* add quiet option

* use verbose instead of quiet

* add quiet option

* always output dump-schema

* always output dump-schema-url

* fix typo and ws

* [BUGFIX] Add provided schema under a dummy / internal URI (fixes #376) (#378)

* Add provided schema under a dummy / internal URI (fixes #376)

In order to resolve internal $ref references within a user-provided
schema, SchemaStorage needs to know about the schema. As user-supplied
schemas do not have an associated URI, use a dummy / internal one instead.

* Remove dangling use

* Change URI to class constant on SchemaStorage

* Add option to disable validation of "format" constraint (#383)

* Add more unit tests (#366)

* Add test coverage for coercion API

* Complete test coverage for SchemaStorage

* Add test coverage for ObjectIterator

* Add exception test for JsonPointer

* MabeEnum\Enum appears to use singletons - add testing const

* Don't check this line for coverage

mbstring is on all test platforms, so this line will never be reached.

* Add test for TypeConstraint::validateTypeNameWording()

* Add test for exception on TypeConstraint::validateType()

* PHPunit doesn't like an explanation with its @codeCoverageIgnore...

* Add various tests for UriRetriever

* Add tests for FileGetContents

* Add tests for JsonSchema\Uri\Retrievers\Curl

* Add missing bad-syntax test file

* Restrict ignore to the exception line only

* Fix exception scope

* Allow the schema to be an associative array (#389)

* Allow the schema to be an associative array

Implements #388.

* Use json_decode(json_encode()) for array -> object cast

* Skip exception check on PHP versions < 5.5.0

* Skip test on HHVM, as it's happy to encode resources

* Enable FILTER_FLAG_EMAIL_UNICODE for email format if present (#398)

* Don't throw exceptions until after checking anyOf / oneOf (#394)

Fixes #393

* Fix infinite recursion on some schemas when setting defaults (#359) (#365)

* Don't try to fetch files that don't exist

Throws an exception when the ref can't be resolved to a useful file URI,
rather than waiting for something further down the line to fail after
the fact.

* Refactor defaults code to use LooseTypeCheck where appropriate

* Test for not treating non-containers like arrays

* Update comments

* Rename variable for clarity

* Add CHECK_MODE_ONLY_REQUIRED_DEFAULTS

If CHECK_MODE_ONLY_REQUIRED_DEFAULTS is set, then only apply defaults
if they are marked as required.

* Workaround for $this scope issue on PHP-5.3

* Fix infinite recursion via $ref when applying defaults

* Add missing second test for array case

* Add test for setting a default value for null

* Also fix infinite recursion via $ref for array defaults

* Move nested closure into separate method

* $parentSchema will always be set when $name is, so don't check it

* Handle nulls properly - fixes issue #377

* Add option to also validate the schema (#357)

* Remove stale files from #357 (obviated by #362) (#400)

* Stop #386 sneaking in alongside another PR backport
erayd added a commit to erayd/json-schema that referenced this pull request on Mar 22, 2017:
* Allow the schema to be an associative array

Implements jsonrainbow#388.

* Use json_decode(json_encode()) for array -> object cast

* Skip exception check on PHP versions < 5.5.0

* Skip test on HHVM, as it's happy to encode resources