
MP DataLoader Improvements #742

Merged: 25 commits into awslabs:master on Apr 17, 2020

Conversation

@AaronSpieler (Contributor) commented on Apr 3, 2020

Description of changes:

  • dataset iteration is now cache aligned: each worker is assigned a contiguous subset of the dataset instead of a modulo-based (strided) selection (illustrated in the sketch below)
  • switched to the default Pool start method (currently fork on Linux), which significantly improves initialisation times on Linux systems
  • added caching support for FileDataset
  • added support for num_batches_for_shuffling, with a default of 8
  • added much more typing and documentation
  • some refactoring

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
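As a rough illustration of the first point (function names are illustrative, not the PR's actual code): with the modulo-based scheme each worker picks every num_workers-th entry, while with the new scheme each worker reads one contiguous chunk, which keeps reads sequential and cache friendly.

from typing import List, Sequence, TypeVar

T = TypeVar("T")


def modulo_assignment(dataset: Sequence[T], worker_id: int, num_workers: int) -> List[T]:
    # old scheme: every num_workers-th entry, i.e. strided access over the whole dataset
    return [entry for i, entry in enumerate(dataset) if i % num_workers == worker_id]


def chunk_assignment(dataset: Sequence[T], worker_id: int, num_workers: int) -> List[T]:
    # new scheme: one contiguous subset per worker (sequential, read/cache friendly)
    chunk_size = -(-len(dataset) // num_workers)  # ceiling division
    start = worker_id * chunk_size
    return list(dataset[start : start + chunk_size])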

@AaronSpieler (Contributor, Author) commented:

@vafl maybe you could check whether you run into problems, or see improvements?

@AaronSpieler AaronSpieler changed the title Improvements MP DataLoader Improvements Apr 3, 2020
@AaronSpieler AaronSpieler requested a review from vafl April 3, 2020 19:55
@codecov-io commented:

Codecov Report

Merging #742 into master will increase coverage by 1.05%.
The diff coverage is 91.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #742      +/-   ##
==========================================
+ Coverage   84.45%   85.51%   +1.05%     
==========================================
  Files         171      171              
  Lines       10817    10862      +45     
==========================================
+ Hits         9136     9289     +153     
+ Misses       1681     1573     -108     
Impacted Files Coverage Δ
src/gluonts/dataset/jsonl.py 84.61% <82.60%> (-0.39%) ⬇️
src/gluonts/dataset/parallelized_loader.py 88.72% <92.53%> (-0.03%) ⬇️
src/gluonts/dataset/common.py 92.78% <100.00%> (+9.45%) ⬆️
src/gluonts/dataset/loader.py 100.00% <100.00%> (ø)
src/gluonts/support/util.py 94.23% <0.00%> (+3.20%) ⬆️
src/gluonts/dataset/artificial/_base.py 67.25% <0.00%> (+26.02%) ⬆️

@lostella lostella added this to the v0.5 milestone Apr 7, 2020
@AaronSpieler (Contributor, Author) commented on Apr 7, 2020

[Outdated]

So for WaveNetEstimator I get 4x performance with num_batches_for_shuffling=1, and 3x with num_batches_for_shuffling=8.

The question is whether to keep the default at 8 or change it to 1. So far this argument's default was 8; on the other hand, there was a bug in the previous data loader code (before multiprocessing), namely:

    @property
    def stream(self) -> Iterable:
        s = self.transform(self.dataset, is_train=self.is_train)
        if self.shuffle_for_training:
            # NOTE: num_batches_for_shuffling is passed as the shuffle window
            # size here, which is the bug described below
            return shuffler(s, self.num_batches_for_shuffling)
        return s


def shuffler(stream: Iterable[T], batch_size: int) -> Iterator[T]:
    """Modifies a stream by shuffling items in windows.

    It continuously takes `batch_size` elements from the stream and yields
    the elements of each batch in random order."""
    for batch in batcher(stream, batch_size):
        random.shuffle(batch)
        yield from batch

which means that only num_batches_for_shuffling samples were shuffled at a time, not batch_size * num_batches_for_shuffling, which is most likely even worse than having num_batches_for_shuffling=1.

And model training seems to have worked fine that way too. However, I can imagine that fixing this bug could now yield better performance in some cases. A sketch of the intended behaviour is below.
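For reference, a minimal sketch of that intended window shuffling (illustrative only, not the PR's code; batcher here is just a helper that groups an iterator into fixed-size lists):

import random
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batcher(stream: Iterable[T], size: int) -> Iterator[List[T]]:
    # group the stream into lists of at most `size` elements
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def window_shuffler(
    stream: Iterable[T], batch_size: int, num_batches_for_shuffling: int
) -> Iterator[T]:
    # shuffle within windows of batch_size * num_batches_for_shuffling samples
    for window in batcher(stream, batch_size * num_batches_for_shuffling):
        random.shuffle(window)
        yield from window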

Any preference @vafl ?

@AaronSpieler (Contributor, Author) commented on Apr 7, 2020

[Outdated]

Regarding read speeds: if I use m4_yearly and run the following code (not an exact measurement, but stark differences would show up):

import timeit
# `dataset` is assumed to be an already loaded GluonTS dataset (here: m4_yearly)
num_iter = 100
print(f"Runtime: {timeit.timeit(lambda: list(dataset.train), number=num_iter) / num_iter}")

I consistently get the following results:

Chunk: 0.04972371574986027
Modulo: 0.05123165169003187

However, while the difference in speed is measurable, it's also negligible.

@AaronSpieler (Contributor, Author) commented on Apr 7, 2020

[Outdated]

And setting cache=True in the FileDataset yields a significant boost in performance (on WaveNet). I get essentially the same performance as if the dataset had been converted to a list beforehand, except that there is no significant performance hit at the beginning of training (caused by the dataset having to be copied to all the workers), and overall less data is cached in memory than with a pre-built list (only 1/num_workers as much is cached in total, since each worker only caches its own subset).

For example, for this configuration on the electricity dataset:

# imports as of the GluonTS version at the time of this PR (paths assumed);
# `meta` and `train_01` come from the surrounding electricity-dataset setup
from gluonts.model.wavenet import WaveNetEstimator
from gluonts.trainer import Trainer

estim = WaveNetEstimator(
    freq=meta.freq,
    prediction_length=24,
    seasonality=48,
    trainer=Trainer(
        batch_size=32,
        epochs=7,
        hybridize=True,
        learning_rate=0.01,
        num_batches_per_epoch=400,
    ),
)
estim.train(train_01, num_workers=8, num_batches_for_shuffling=1, shuffle_for_training=True)

I get roughly 17 it/sec uncached vs. 24 it/sec cached.
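As a rough sketch of enabling this (the path and frequency are placeholders, and cache is assumed to be the keyword this PR adds to FileDataset):

from pathlib import Path

from gluonts.dataset.common import FileDataset

# placeholder path/freq; cache=True keeps decoded entries in memory
# after the first pass over the files
dataset = FileDataset(path=Path("datasets/electricity/train"), freq="H", cache=True)

first_pass = list(dataset)   # reads and decodes from disk
second_pass = list(dataset)  # served from the in-memory cache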

@AaronSpieler (Contributor, Author) commented on Apr 7, 2020

With the last commit I changed the functionality of num_batches_for_shuffling to probabilistically picking every num_batches_for_shuffling-th entry (see the sketch below), which:

  • massively simplifies the code
  • improves performance in the case of num_batches_for_shuffling=8: in my testing it is basically as performant as num_batches_for_shuffling=1, a 50% improvement over the previous implementation
  • leaves the caching mechanism unaffected
  • "drawback": we can't guarantee (in fact it's unlikely) that every data entry is processed exactly once within num_batches_for_shuffling passes. (But the InstanceSplitter is also probabilistic for training at the moment, so that would be the case anyway.)

In order to achieve this, I had to switch back to the cache-aligned data iteration, because for the non-cached FileDataset the difference between the techniques is clearly noticeable (since num_batches_for_shuffling times as much data now has to be read from the dataset, i.e. from disk).
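A minimal sketch of the probabilistic selection idea (illustrative only, not the PR's actual code): each entry is emitted with probability 1/num_batches_for_shuffling, so on average one pass over num_batches_for_shuffling times as much data yields a pseudo-shuffled stream of the same size.

import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def probabilistic_picker(stream: Iterable[T], num_batches_for_shuffling: int) -> Iterator[T]:
    # emit each entry with probability 1 / num_batches_for_shuffling;
    # no guarantee that every entry appears exactly once within
    # num_batches_for_shuffling passes over the data
    for entry in stream:
        if random.random() < 1.0 / num_batches_for_shuffling:
            yield entry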

@AaronSpieler (Contributor, Author) commented:

With this I think every feature that was missing (that I am aware of) has been added, and I have tested every design choice that I thought needed validation.

@lostella (Contributor) commented:

I have some minor comments; see inline.

Inline review comments on: src/gluonts/dataset/loader.py, src/gluonts/dataset/common.py, src/gluonts/dataset/jsonl.py, src/gluonts/dataset/parallelized_loader.py
@vafl (Contributor) left a comment:

Looks good, thanks!

Inline review comments on: src/gluonts/dataset/common.py, src/gluonts/dataset/jsonl.py
@vafl previously approved these changes Apr 16, 2020
Inline review comment on: src/gluonts/dataset/jsonl.py
@lostella (Contributor) commented:

I merged upstream changes to let the tests run again; there was a timeout in one test which I want to make sure is not occurring consistently. @AaronSpieler any idea where these timeouts come from?

@AaronSpieler (Contributor, Author) commented:

The Windows timeout error? I don't know; it should not be multiprocessing-related, since on Windows we warn: "You have set num_workers to a non zero value, however, currently multiprocessing is not supported on windows and therefore num_workers will be set to 0."

@AaronSpieler (Contributor, Author) commented:

Should be fixed now. The error came from multiprocessing evaluation being enabled on Windows. I think it was shadowed by the recurring Windows errors we have been getting from other commits.
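For context, a minimal sketch of the kind of platform guard involved (illustrative, not the exact code in the PR):

import logging
import sys

logger = logging.getLogger(__name__)


def effective_num_workers(num_workers: int) -> int:
    # multiprocessing data loading is not supported on Windows,
    # so fall back to the single-process path there
    if num_workers > 0 and sys.platform == "win32":
        logger.warning(
            "You have set num_workers to a non zero value, however, currently "
            "multiprocessing is not supported on windows and therefore "
            "num_workers will be set to 0."
        )
        return 0
    return num_workers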

@AaronSpieler (Contributor, Author) commented:

So Windows is still failing; however, it has nothing to do with multiprocessing, so I think we could at some point re-enable multiprocessing evaluation for Windows. In any case, the error is now more clearly visible because it no longer happens inside the worker processes:

File "D:\a\gluon-ts\gluon-ts\test\model\conftest.py", line 100, in test_accuracy
    evaluator=Evaluator(calculate_owa=statsmodels is not None),
  File "d:\a\gluon-ts\gluon-ts\src\gluonts\evaluation\backtest.py", line 224, in backtest_metrics
    ts_it, forecast_it, num_series=maybe_len(test_dataset)
  File "d:\a\gluon-ts\gluon-ts\src\gluonts\evaluation\_base.py", line 178, in __call__
    for ts, forecast in it:
  File "c:\hostedtoolcache\windows\python\3.6.8\x64\lib\site-packages\tqdm\std.py", line 1127, in __iter__
    for obj in iterable:
  File "d:\a\gluon-ts\gluon-ts\src\gluonts\model\predictor.py", line 317, in predict
    num_samples=num_samples,
  File "d:\a\gluon-ts\gluon-ts\src\gluonts\model\forecast_generator.py", line 204, in __call__
    outputs = prediction_net(*inputs).asnumpy()
  File "c:\hostedtoolcache\windows\python\3.6.8\x64\lib\site-packages\mxnet\ndarray\ndarray.py", line 2535, in asnumpy
    ctypes.c_size_t(data.size)))

@lostella (Contributor) left a comment:

Looks good! Thanks!

@lostella lostella merged commit 0143054 into awslabs:master Apr 17, 2020
@AaronSpieler AaronSpieler deleted the mp_data_loader_updates_V2 branch July 16, 2020 14:34