
Fix for low accuracy in Tensorflow training #167

Merged (3 commits) on Sep 14, 2018

Conversation

@pribalta (Contributor) commented Sep 13, 2018

Indexed_file_loader.h performs a sharded read of the index files supplied alongside the tfrecords for training. Thus, ReadSample() is called from multiple threads with different shard_id_. When all indexes have been read, we reset current_index_ to start again at the beginning of the shard.

This commit fixes a defect in the detection that a shard has reached its end: the code compared the value of current_index_ against the size of all the indices. The fix consists of comparing against the end of the shard instead of the end of the indices.

Before the fix, the code worked correctly for 1 GPU, where there is only 1 shard, but with >1 GPUs it resulted in a loss of accuracy, most likely due to redundant reads.

Signed-off-by: Pablo Ribalta Lorenzo <pribalta@nvidia.com>

@pribalta pribalta requested review from ptrendx and JanuszL September 13, 2018 07:51
@@ -94,6 +94,9 @@ class IndexedFileLoader : public Loader<CPUBackend> {
ReadIndexFile(index_uris);
size_t num_indices = indices_.size();
current_index_ = num_indices/num_shards_ * shard_id_;
max_index_ = (shard_id_ != num_shards_ - 1)
? num_indices/num_shards_ * (shard_id_ + 1)
Reviewer (Contributor):
How about just:

max_index_ = num_indices * (shard_id_ + 1) / num_shards_ 

@pribalta (Contributor, Author):

accepted

Signed-off-by: Pablo Ribalta Lorenzo <pribalta@nvidia.com>
@pribalta pribalta force-pushed the pribalta_fix_sharding_indexed_reader branch from 6875724 to 0e27633 Compare September 13, 2018 09:42
@pribalta (Contributor, Author)

Waiting till another run finishes and reports good results and I'll be merging this guy

@ptrendx (Member) commented Sep 13, 2018

I agree that it is a bug (good job finding it :-) ), but I do not fully agree with the fix, because it is not consistent with the rest of the readers. You should instead change the reset function to set the current index to 0.

Another way to fix this issue is to make the different threads wrap around the indices, only initializing them to their corresponding shard offset.

This solution is consistent with the other readers.

Signed-off-by: Pablo Ribalta Lorenzo <pribalta@nvidia.com>
@pribalta pribalta merged commit 9579eba into master Sep 14, 2018
@pribalta pribalta deleted the pribalta_fix_sharding_indexed_reader branch September 14, 2018 17:59
JanuszL pushed a commit that referenced this pull request Sep 17, 2018
* Fix for low accuracy in Tensorflow training

Indexed_file_loader.h performs a sharded read of the index files supplied alongside the tfrecords for training. Thus, ReadSample() is called from multiple threads with different shard_id_. When all indexes have been read, we reset current_index_ to start again at the beginning of the shard.

This commit fixes a defect in the detection that a shard has reached its end: the code compared the value of current_index_ against the size of all the indices. The fix consists of letting each shard wrap around the indices by making reset set current_index_ to zero.

Before the fix, the code worked correctly for 1 GPU, where there is only 1 shard, but with >1 GPUs it resulted in a loss of accuracy due to shards unevenly loading the dataset.

Signed-off-by: Pablo Ribalta Lorenzo <pribalta@nvidia.com>
pribalta added a commit to pribalta/DALI that referenced this pull request Oct 1, 2018
3 participants