For simplicity, we originally provided code to produce a single train.bin, val.bin and test.bin file for the data. However, in our experiments for the paper we split the data into several chunks, each containing 1000 examples (i.e. train_000.bin, train_001.bin, ..., train_287.bin). In the interest of reproducibility, make_datafiles.py has now been updated to also produce chunked data, saved in finished_files/chunked, and the README for the Tensorflow code now gives instructions for using chunked data. If you've already run make_datafiles.py to obtain train.bin/val.bin/test.bin files, then just run
```python
import make_datafiles
make_datafiles.chunk_all()
```
in Python, from the cnn-dailymail directory, to get the chunked files (it takes a few seconds).
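The chunking step itself is simple: it reads the length-prefixed serialized tf.Example records out of each .bin file and rewrites them in groups of 1000. Here's a minimal sketch of that logic, reconstructed from memory rather than copied verbatim from make_datafiles.py, and assuming the record format the script writes (an 8-byte length followed by the serialized example):

```python
import os
import struct

CHUNK_SIZE = 1000  # number of examples per chunk, as used for the paper

def chunk_file(in_file, chunks_dir, set_name):
    """Split one .bin file of length-prefixed tf.Example records into
    chunks named e.g. train_000.bin, train_001.bin, ..."""
    chunk_num = 0
    finished = False
    with open(in_file, 'rb') as reader:
        while not finished:
            chunk_fname = os.path.join(chunks_dir, '%s_%03d.bin' % (set_name, chunk_num))
            with open(chunk_fname, 'wb') as writer:
                for _ in range(CHUNK_SIZE):
                    len_bytes = reader.read(8)
                    if not len_bytes:  # reached the end of the input file
                        finished = True
                        break
                    str_len = struct.unpack('q', len_bytes)[0]
                    example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
                    # write the record back out unchanged: 8-byte length, then the bytes
                    writer.write(struct.pack('q', str_len))
                    writer.write(struct.pack('%ds' % str_len, example_str))
            chunk_num += 1
```

Because the records are just copied through unchanged, the chunked files contain exactly the same examples as the single file, only split across many smaller files.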
To use your chunked datafiles with the Tensorflow code, set e.g.
```
--data_path=/path/to/chunked/train_*
```
You don't have to restart training from the beginning to switch to the chunked datafiles.
Why does it matter?
The multi-threaded batcher code is originally from the TextSum project. The idea is that each input thread calls example_generator, which randomizes the order of the chunks and then reads through them in that order. Thus 16 threads concurrently fill the input queue with examples drawn from different, randomly-chosen chunks. If your data is in a single file, however, the multi-threaded batcher will result in 16 threads concurrently filling the input queue with examples drawn in order from the same single .bin file. Firstly, this might produce batches containing more duplicate examples than we want. Secondly, reading through the dataset in order may produce different training results from reading through it in randomized chunks.
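Concretely, the generator logic looks roughly like this (a sketch from memory of example_generator in data.py, not a verbatim copy, again assuming 8-byte-length-prefixed serialized tf.Example records):

```python
import glob
import random
import struct

from tensorflow.core.example import example_pb2

def example_generator(data_path, single_pass=False):
    """Yield tf.Example protos from the files matching data_path.
    With chunked data, each pass shuffles the chunk order, so concurrent
    threads read from different, randomly-chosen chunks."""
    while True:
        filelist = glob.glob(data_path)  # e.g. /path/to/chunked/train_*
        assert filelist, 'Error: empty filelist at %s' % data_path
        if single_pass:
            filelist = sorted(filelist)
        else:
            random.shuffle(filelist)  # the chunk-level randomization happens here
        for f in filelist:
            with open(f, 'rb') as reader:
                while True:
                    len_bytes = reader.read(8)
                    if not len_bytes:
                        break  # finished this chunk
                    str_len = struct.unpack('q', len_bytes)[0]
                    example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
                    yield example_pb2.Example.FromString(example_str)
        if single_pass:
            break
```

With a single train.bin, filelist has length 1, so the shuffle is a no-op and every thread reads the same file front to back.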
If you're concerned about duplicate examples in batches, either chunk your data or switch the batcher to single-threaded by setting
```python
self._num_example_q_threads = 1 # num threads to fill example queue
self._num_batch_q_threads = 1 # num threads to fill batch queue
```
(From a speed point of view, the multi-threaded batcher is probably unnecessary for many systems anyway.)
If you're concerned about reproducibility and the effect on training of reading the data in randomized chunks vs. from a single file, then chunk your data.