Changed datatype to be np.uint8 universally in the call #61
Conversation
In general the changes look good. I found a few things we can improve in the code:
- The generate method can be made generic in data_generator.py, and we can add a save_file method that takes the records and stores them in a file (see the sketch after this list).
- The data generator and its subclasses copy config variables; this can be avoided to improve the library's memory footprint.
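A minimal sketch of what I have in mind; the class layout and names such as save_file and self._args.num_samples_per_file are illustrative, not existing DLIO code:

from abc import ABC, abstractmethod
import numpy as np

class DataGenerator(ABC):
    def __init__(self, args, file_list):
        self._args = args            # single shared config reference, no per-field copies
        self._file_list = file_list

    @abstractmethod
    def save_file(self, file_name, records):
        """Format-specific: persist one file's worth of records."""

    def generate(self):
        # Generic part: build the records once, then delegate storage to the subclass.
        np.random.seed(10)
        dim1 = dim2 = self._args.dimension
        record = np.random.randint(255, size=dim1 * dim2, dtype=np.uint8)
        records = [record] * self._args.num_samples_per_file
        for file_name in self._file_list:
            self.save_file(file_name, records)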
src/data_generator/data_generator.py
Outdated
self._file_list = None
self.num_subfolders_train = self._args.num_subfolders_train
self.num_subfolders_eval = self._args.num_subfolders_eval
self.format = self._args.format
self.storage = StorageFactory().get_storage(self._args.storage_type, self._args.storage_root,
                                            self._args.framework)

self._dimension = int(math.sqrt(self.record_size))
We have these variables within the configuration. We should use self._args.dimension.
src/data_generator/data_generator.py
Outdated
self._file_list = None
self.num_subfolders_train = self._args.num_subfolders_train
self.num_subfolders_eval = self._args.num_subfolders_eval
self.format = self._args.format
self.storage = StorageFactory().get_storage(self._args.storage_type, self._args.storage_root,
                                            self._args.framework)

self._dimension = int(math.sqrt(self.record_size))
self._dimension_stdev = self.record_size_stdev/2.0/math.sqrt(self.record_size)
We have these variables within the configuration. We should use self._args.dimension_stdev.
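For reference, my reading of the formula above, assuming record_size ≈ dimension²: since dimension = sqrt(record_size), first-order error propagation gives d(dimension)/d(record_size) = 1/(2*sqrt(record_size)), so dimension_stdev ≈ record_size_stdev / (2*sqrt(record_size)), which is exactly what that line computes.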
This is not initialized at this point; it ends up with self._args.dimension = 1.
Oh, I meant we need config methods like derive_configurations_basic and derive_configurations_dataset.
This will split the configurations we set into dataset-dependent and dataset-independent ones.
@dlp.log
def derive_configurations_basic(self):
    self.dimension = int(math.sqrt(self.record_length))
    self.dimension_stdev = self.record_length_stdev/2.0/math.sqrt(self.record_length)
    self.max_dimension = self.dimension
    if self.record_length_resize > 0:
        self.max_dimension = int(math.sqrt(self.record_length_resize))
    self.resized_image = np.random.randint(255, size=(self.max_dimension, self.max_dimension), dtype=np.uint8)
    self.required_samples = self.comm_size * self.batch_size
    if self.read_threads > 0:
        self.required_samples *= self.read_threads

@dlp.log
def derive_configurations_dataset(self, file_list_train, file_list_eval):
    self.file_list_train = file_list_train
    self.file_list_eval = file_list_eval
    self.num_files_eval = len(file_list_eval)
    self.num_files_train = len(file_list_train)
    self.total_samples_train = self.num_samples_per_file * len(self.file_list_train)
    self.total_samples_eval = self.num_samples_per_file * len(self.file_list_eval)
    self.training_steps = int(math.ceil(self.total_samples_train / self.batch_size / self.comm_size))
    self.eval_steps = int(math.ceil(self.total_samples_eval / self.batch_size_eval / self.comm_size))
Then, in the main dlio_benchmark.py, first call derive_configurations_basic(), and after you generate the data, call derive_configurations_dataset().
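Roughly, the wiring could look like the sketch below; generate_data and the function name are placeholders, only the two derive_configurations_* methods come from the snippet above:

def initialize(config, generate_data):
    # Placeholder wiring: generate_data is whatever produces the dataset files.
    config.derive_configurations_basic()                      # dataset-independent settings
    file_list_train, file_list_eval = generate_data(config)   # create the dataset
    config.derive_configurations_dataset(file_list_train, file_list_eval)
    return config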
@@ -44,14 +44,14 @@ def __init__(self):
self.compression = self._args.compression
self.compression_level = self._args.compression_level
self._file_prefix = None
self._dimension = None
self._file_list = None
self.num_subfolders_train = self._args.num_subfolders_train
We don't need to copy these variables; just use args directly.
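A minimal sketch of the idea; the field names mirror the diff above and the accessor method is only illustrative:

class DataGenerator:
    def __init__(self, args):
        self._args = args   # keep one reference to the shared config, no per-field copies

    def num_subfolders(self, is_train):
        # Read config values directly from self._args at the point of use.
        return (self._args.num_subfolders_train if is_train
                else self._args.num_subfolders_eval)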
else:
    dim1 = dim2 = self._dimension
-   record = random.random(dim1*dim2)
+   record = np.random.randint(255, size=dim1*dim2, dtype=np.uint8)
    records = [record]*self.num_samples
    df = pd.DataFrame(data=records)
The part where we calculate the dimensions per sample and build a record could be a common function in data_generator.py. That way we make it more modular.
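Something like this in data_generator.py could then be shared by all format generators; the name and signature are my suggestion, not existing code:

import numpy as np

def make_record(dimension, dimension_stdev=0.0):
    # Draw one square sample; jitter the edge length when a stdev is configured.
    if dimension_stdev > 0:
        dim1 = dim2 = max(1, int(np.random.normal(dimension, dimension_stdev)))
    else:
        dim1 = dim2 = int(dimension)
    return np.random.randint(255, size=dim1 * dim2, dtype=np.uint8)

Each format-specific generator would then only reshape or serialize the returned array as needed.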
src/data_generator/hdf5_generator.py
Outdated
@@ -42,10 +42,10 @@ def generate(self):
    Generate hdf5 data for training. It generates a 3d dataset and writes it to file.
    """
    super().generate()
-   random.seed(10)
+   np.random.seed(10)
    samples_per_iter=1024
    dim1 = dim2 = self._dimension
Are we missing variable dimensions here? Also, this creation of the dims and the record is done per file in the other generators.
This whole method can be abstracted into the parent class, with subclasses only overriding how we store the files.
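For illustration, with a generic generate() in the parent class (like the sketch earlier in this thread), the HDF5 generator could shrink to just the storage step; the dataset name and the h5py options here are assumptions:

import h5py
import numpy as np

class HDF5Generator(DataGenerator):
    def save_file(self, file_name, records):
        samples = np.stack(records)          # shape: (num_samples, dim1*dim2)
        with h5py.File(file_name, "w") as f:
            f.create_dataset("records", data=samples)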
This PR changed the datatype from np.float64 to np.uint8 throughout. It also fixed the issue with dimension_stdev.