Faster data fraction #1069

Closed
HaokunLiu wants to merge 12 commits

Conversation

HaokunLiu
Member

@HaokunLiu commented Apr 16, 2020

#180
This adds a one-time cost before training, but the code should be more efficient during training.

There isn't a test case for read_record, but I wrote some temporary code to compare the results from the previous code and the new code. You can check it in the file changes. After the PR is approved, I will remove that temporary code.

Ideally, if we can get this through soon, we will be able to save some time and cost in the upcoming taskmaster experiments.
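
For a rough idea, the check is of this shape (illustrative sketch only, not the actual temporary code; the old reader is passed in as a callable here rather than given a name):

# Regression-style check: the new fraction-aware reader should return the
# same examples as the previous rejection-based implementation.
from jiant.utils import serialize

def compare_fraction_readers(read_old, record_file, fraction):
    old_examples = list(read_old(record_file, fraction))  # previous code path
    new_examples = list(serialize.read_records(record_file, fraction=fraction))  # this PR
    assert len(old_examples) == len(new_examples)
    assert all(o == n for o, n in zip(old_examples, new_examples))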

@sleepinyourhat
Contributor

Leaving this to @pyeres, but a couple of high-level comments:

  • Check that logging is correct.
  • When you write permanent tests, one of them should check situations where you use a limited data fraction in one stage of training, but the full dataset in the other stage of training. IIRC, this was a source of some extra complexity previously, and I believe it has come up as something we've found useful.

@pyeres
Contributor

pyeres commented Apr 16, 2020

@HaokunLiu, thanks for the regression tests. Can you give an estimate of time savings w/ these changes? (useful for prioritizing review and any additional work that might be necessary for this PR)

return RepeatableIterator(_iter_fn) if repeatable else _iter_fn()


# temporary backup code, remove before merging
Contributor

Please remove.

Member Author

In case other reviewer would like to check this, I will remove it once the PR is approved

# Clear any previously preprocessed record files before re-indexing: the glob
# pattern matches both the normally preprocessed record file and the additional
# fraction files derived from it, so all of them get removed together.
all_record_files = "%s*" % record_file
for one_record_file in glob.glob(all_record_files):
    if os.path.exists(one_record_file) and os.path.islink(one_record_file):
        os.remove(one_record_file)

_index_split(
    task, split, indexers, vocab, record_file, model_preprocessing_interface
Contributor

Can you put a comment about what is happening here?

@pruksmhc
Contributor

I'm pretty sure I understand what's going on here, but not 100% sure - basically, you're adding functionality of loading the vocabulary for all the tasks before running each experiment for the one-time cost. A few requests.

  1. Please document more clearly what's going on, and flesh out what the "1-time cost" is so a stranger can understand what this PR is for.
  2. Please move the test to tests folder and make it a unit test.

Contributor

@pruksmhc left a comment

Left some comments

@HaokunLiu
Member Author

> I'm pretty sure I understand what's going on here, but not 100% sure - basically, you're adding functionality of loading the vocabulary for all the tasks before running each experiment for the one-time cost. A few requests.
>
>   1. Please document more clearly what's going on, and flesh out what the "1-time cost" is so a stranger can understand what this PR is for.
>   2. Please move the test to tests folder and make it a unit test.

Errrr, this PR is not about that thing... This PR is meant to address low GPU usage when data_fraction is very low, like 1%. Instead of iterating over all the instances and rejecting most of them, this creates an additional preprocessed file that contains all the instances that will actually be used.

This is why, when reload_index is True, we need to use glob to find both the normally preprocessed file and our additional preprocessed files, and delete them all.
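
Schematically, the new read path does something like this (sketch only, not the exact PR code; _hash_float is a stand-in for the existing hash-based selection, and the cache-file naming follows the "__fraction_" suffix used elsewhere in this PR):

import hashlib
import os

from jiant.utils import serialize  # write_records / read_records from this PR


def _hash_float(example):
    # Stand-in: map each example to a deterministic float in [0, 1).
    digest = hashlib.md5(repr(example).encode("utf-8")).hexdigest()
    return int(digest, 16) / float(16 ** len(digest))


def read_record_fraction(record_file, fraction):
    # On the first fractional read, materialize the kept examples into a
    # companion file; later reads stream that file directly instead of
    # re-scanning the full record file and rejecting most of it.
    cache_file = record_file + "__fraction_" + str(fraction)
    if not os.path.exists(cache_file):
        kept = [ex for ex in serialize.read_records(record_file) if _hash_float(ex) <= fraction]
        serialize.write_records(kept, cache_file)
    return serialize.read_records(cache_file)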

@HaokunLiu
Member Author

> Leaving this to @pyeres, but a couple of high-level comments:
>
>   • Check that logging is correct.
>   • When you write permanent tests, one of them should check situations where you use a limited data fraction in one stage of training, but the full dataset in the other stage of training. IIRC, this was a source of some extra complexity previously, and I believe it has come up as something we've found useful.

The issue you mentioned will be addressed in #1070. This PR is orthogonal to it.

Contributor

@pyeres left a comment

@HaokunLiu — thanks for the PR, looks good. Please see comments for two small change requests.

if hash_float > fraction:
    continue
example = pkl.loads(blob)
examples.append(example)
Contributor

Please use generator/yield to avoid loading examples into memory here (before they're written out again).
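
Something along these lines (sketch; assumes the surrounding loop yields (hash_float, blob) pairs):

import pickle as pkl


def _iter_kept_examples(records, fraction):
    # Yield each kept example as it is decoded, instead of appending to a list,
    # so the selected fraction never has to sit in memory before being written
    # back out.
    for hash_float, blob in records:
        if hash_float > fraction:
            continue
        yield pkl.loads(blob)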

jiant/utils/serialize.py (resolved)
@@ -0,0 +1,57 @@
from jiant.utils import serialize
Contributor

Can you please make this into a unit test?

Member Author

I think this is more like a regression test. I don't see an easy way to convert it into a unit test.

Contributor

Hey @HaokunLiu — I tried to write something similar to the tests you wrote, but structured them more like unit tests. They still end up involving write_records, but they'll add some protection against changes going forward:

import os
import tempfile
import unittest

from jiant.utils import serialize


class TestReadRecords(unittest.TestCase):

    def test_read_records_without_data_fraction(self):
        """write records then read records (with no data fraction arg), check that records match"""
        data_file_name = "data.pb64"
        fake_example_count = 100
        fake_examples = [i for i in range(fake_example_count)]
        with tempfile.TemporaryDirectory() as tmp_dir_path:
            fake_filepath = os.path.join(tmp_dir_path, data_file_name)
            serialize.write_records(fake_examples, fake_filepath)
            fake_examples_read = serialize.read_records(fake_filepath)
            self.assertCountEqual(fake_examples, fake_examples_read)

    def test_read_records_with_data_fraction(self):
        """write examples, read without fraction, read with fraction, check expected files exist"""
        filename = "data.pb64"
        frac = 0.1
        fake_example_count = 100
        fake_examples = [i for i in range(fake_example_count)]
        with tempfile.TemporaryDirectory() as tmp_dir_path:
            fake_filepath = os.path.join(tmp_dir_path, filename)
            serialize.write_records(fake_examples, fake_filepath)
            fake_examples_read = serialize.read_records(fake_filepath)
            fake_examples_read_frac = serialize.read_records(fake_filepath, fraction=frac)
            file_list = os.listdir(tmp_dir_path)
            self.assertLess(len(list(fake_examples_read_frac)), len(list(fake_examples_read)))
            self.assertTrue(set(fake_examples_read_frac).issubset(set(fake_examples_read)))
            self.assertCountEqual(file_list, [filename, filename + "__fraction_" + str(frac)])

    def test_read_records_with_data_fraction_without_prior_full_data_read(self):
        """write examples, then read with fraction (no full read before fractional read)"""
        filename = "data.pb64"
        frac = 0.1
        fake_example_count = 100
        fake_examples = [i for i in range(fake_example_count)]
        with tempfile.TemporaryDirectory() as tmp_dir_path:
            fake_filepath = os.path.join(tmp_dir_path, filename)
            serialize.write_records(fake_examples, fake_filepath)
            fake_examples_frac = serialize.read_records(fake_filepath, fraction=frac)
            self.assertLess(len(list(fake_examples_frac)), len(list(fake_examples)))
            self.assertTrue(set(fake_examples_frac).issubset(set(fake_examples)))

    def test_read_records_with_data_fraction_from_cache(self):
        """write examples, read with fraction, then read with fraction again (from cache)"""
        filename = "data.pb64"
        frac = 0.1
        fake_example_count = 100
        fake_examples = [i for i in range(fake_example_count)]
        with tempfile.TemporaryDirectory() as tmp_dir_path:
            fake_filepath = os.path.join(tmp_dir_path, filename)
            serialize.write_records(fake_examples, fake_filepath)
            fake_examples_frac = serialize.read_records(fake_filepath, fraction=frac)
            os.remove(fake_filepath)  # remove full file to make sure we get frac examples from cache
            fake_examples_frac_cache = serialize.read_records(fake_filepath, fraction=frac)
            self.assertCountEqual(list(fake_examples_frac), list(fake_examples_frac_cache))

Contributor

@pyeres left a comment

Hey @HaokunLiu — after you remove your regression tests and add some version of the unit tests posted here, I think you're ready to merge. Please ping me in this thread for a final look / approval.

@pyeres
Contributor

pyeres commented May 1, 2020

Based on logs from some recent experiments, it looks like these changes speed up data reading; however, there's a bottleneck elsewhere (with these changes alone, the speedup is <10%).

Planning to close this PR. Issue #180 ("training_data_fraction is slow") remains open to track the issue. If it becomes a priority again, the changes in this PR should be considered as part of a broader fix.

@pyeres pyeres closed this May 1, 2020
@HaokunLiu HaokunLiu deleted the efficient_data_frac branch May 4, 2020 17:44
@jeswan jeswan added the jiant-v1-legacy Relevant to versions <= v1.3.2 label Sep 17, 2020