-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25605][TESTS] Run cast string to timestamp tests for a subset of timezones #22631
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @gatorsmile |
|
Line 506 in 1007cae
I guess number of time zones can be reduced in |
|
Test build #96945 has finished for PR 22631 at commit
|
gatorsmile
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks! Merged to master.
|
|
||
| test("cast string to timestamp") { | ||
| for (tz <- ALL_TIMEZONES) { | ||
| for (tz <- Random.shuffle(ALL_TIMEZONES).take(50)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 @gatorsmile surely we shouldn't do this. We're introducing nondeterminism into the test, and also no longer testing things we were testing before. If there's a lot of timezones we don't need to test, then, don't test them. Right? I didn't see this PR but would have objected to this change if I had seen it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srowen I think the point here is to test that this works with different timezones. In DateExpressionsSuite, for instance, we test only 3-4 timezones. I don't think it makes sense to test every of the 650 possible timezones: if it works with many of them, then it means that the code is generic and respects timezones. We can also define a fixed subset of timezones, but IMHO taking randomly some of them provides the additional safety that if there is a specific single timezone creating problem, we are able to identify it on several subsequent runs.
We have many place where we generate data randomly in the test, so we already have randomness in the tests. I think the rationale behind them is the same: if the function works with some different data, then it generalize properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests should be deterministic, ideally; any sources of randomness should be seeded. Do you see one that isn't?
I think this is like deciding we'll run just 90% of all test suites every time randomly, to save time. I think it's just well against good practice.
There are other solutions:
- pick a subset of timezones that we're confident do exercise the code and just explicitly test those
- parallelize these tests within the test suite
The latter should be trivial in this case: ALL_TIMEZONES.par.foreach { tz => instead. It's the same amount of work but 8x, 16x faster by wall clock time, depending on how many cores are available. What about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are many tests where data is randomly generated. And they are not seeded of course.
As I said, I think the goal here is to test that the function works well with different timezones: then picking a subset of timezones would be fine too, but I prefer taking them randomly among all because if there is a single timezone creating issues (very unlikely IMHO), we would discover it anyway (not on the single run though).
Moreover, it would be great then to be consistent among all the codebase on what we test. In DateExpressionsSuite we test only 3 timezones and here we test all 650: it is a weird, isn't it? We should probably define which is the right thing to do when timezones are involved and test always the same. Otherwise, testing 650 timezones on a single specific function and 3 on the most of the others is a nonsense IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely not by design? tests need to be deterministic, or else what's the value? failures can't be reproduced. (I know that in practice many things are hard to make deterministic.)
Of course, if you're worried that we might not be testing an important case, we have to test it. We can't just not test it sometimes to make some tests run a little faster.
Testing just 3 timezones might be fine too; I don't know. Testing 50 randomly seems suboptimal in all cases.
I'll open a PR to try simply testing in parallel instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think tests need to be deterministic in general as well.
In this particular case ideally, we should categorize timezones, pick up some timezones representing them and test fixed set. For instance, timezone with DST, without DST, and some exceptions such as, for instance, see this particular case which Python 3.6 addressed lately (https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Lib/datetime.py#L1572-L1574), IMHO.
Of course, this approach requires a lot of investigations and overheads. So, as an alternative, I would incline to go for Sean's approach (https://github.com/apache/spark/pull/22631/files#r223224573) for this particular case.
For randomness, I think primarily we should have first deterministic set of tests. Maybe we could additionally have some set of randomized input to cover some cases we haven't foreseen but that's secondary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean some tests like with randomized input, let's say, integer range input are fine in common sense but this case is different, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that adding parallelism is a good way for improve test time. The amount of resources used for testing is anyway limited. I think the goal here is not (only) reduce the overall time of the test but also reduce the amount of resources needed to test.
Problems with a specific timezone like you mentioned, @HyukjinKwon, are exactly the reason why I am proposing this randomized approach, rather than picking 3 timezones and always use them as done in DateExpressionsSuite: if there is a problem with a specific timezone, in this way, we will be able to catch it. With a fixed subset of them (even though not on the single run), we are not.
The only safe deterministic way would be to run against all of them, as it was done before, but then I'd argue that we should do the same everywhere we have different timezones involved in tests (why are we testing all timezones only for casting to timestamp and not for all other functions involving dates and times, if it is so important to check all timezones?). But then the amount of time needed to run all the tests would be crazy, so it is not doable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should categorize timezones, pick up some timezones representing them and test fixed set
That would be the best, but we need some deep understanding of timezone, to make sure the test coverage is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't blocked on CPU time or resources, no. The tests are mostly single-threaded, and the big test machines mostly idle throughout the runs. I've opened #22672 to evaluate that.
I feel strongly that we can't do this kind of thing and am opening a thread on dev@ for wider discussion.
…ts for a subset of timezones ## What changes were proposed in this pull request? Try testing timezones in parallel instead in CastSuite, instead of random sampling. See also #22631 ## How was this patch tested? Existing test. Closes #22672 from srowen/SPARK-25605.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
…of timezones ## What changes were proposed in this pull request? The test `cast string to timestamp` used to run for all time zones. So it run for more than 600 times. Running the tests for a significant subset of time zones is probably good enough and doing this in a randomized manner enforces anyway that we are going to test all time zones in different runs. ## How was this patch tested? the test time reduces to 11 seconds from more than 2 minutes Closes apache#22631 from mgaido91/SPARK-25605. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…ts for a subset of timezones ## What changes were proposed in this pull request? Try testing timezones in parallel instead in CastSuite, instead of random sampling. See also apache#22631 ## How was this patch tested? Existing test. Closes apache#22672 from srowen/SPARK-25605.2. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?
The test
cast string to timestampused to run for all time zones. So it run for more than 600 times. Running the tests for a significant subset of time zones is probably good enough and doing this in a randomized manner enforces anyway that we are going to test all time zones in different runs.How was this patch tested?
the test time reduces to 11 seconds from more than 2 minutes