
Changed datatype to be np.uint8 universally in the call #61

Merged 15 commits into argonne-lcf:main on Apr 24, 2023

Conversation

zhenghh04 (Member):

This PR changes the datatype from np.float64 to np.uint8. It also fixes an issue with dimension_stdev.

@hariharan-devarajan (Collaborator) left a comment

In general, the changes look good. I found a few things we can improve in the code.

  1. The generate method can be made generic in data_generator.py, and we can add a save_file method that takes the records and stores them in a file (a sketch of this split follows below).
  2. The data generator and its subclasses copy configuration variables; avoiding this would improve the memory footprint of the library.
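
A minimal sketch of what point 1 could look like, assuming a DataGenerator base class in data_generator.py; the names used here (save_file, record_length, num_samples_per_file, file_list_train) are illustrative assumptions, not the actual DLIO API:

    # Hedged sketch: a generic generate() builds the records once and
    # delegates persistence to a per-format save_file() hook.
    from abc import ABC, abstractmethod
    import math
    import numpy as np

    class DataGenerator(ABC):
        def __init__(self, args):
            self._args = args

        def generate(self):
            # Common part shared by all formats: derive the per-sample
            # dimensions and build uint8 records.
            dim1 = dim2 = int(math.sqrt(self._args.record_length))
            for filename in self._args.file_list_train:
                record = np.random.randint(255, size=dim1 * dim2, dtype=np.uint8)
                records = [record] * self._args.num_samples_per_file
                self.save_file(records, filename)

        @abstractmethod
        def save_file(self, records, filename):
            """Format-specific persistence (npz, hdf5, csv, ...)."""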

self._file_list = None
self.num_subfolders_train = self._args.num_subfolders_train
self.num_subfolders_eval = self._args.num_subfolders_eval
self.format = self._args.format
self.storage = StorageFactory().get_storage(self._args.storage_type, self._args.storage_root,
                                            self._args.framework)

self._dimension = int(math.sqrt(self.record_size))
hariharan-devarajan (Collaborator):

We have these variables within the configuration already. We should use self._args.dimension.

self._file_list = None
self.num_subfolders_train = self._args.num_subfolders_train
self.num_subfolders_eval = self._args.num_subfolders_eval
self.format = self._args.format
self.storage = StorageFactory().get_storage(self._args.storage_type, self._args.storage_root,
                                            self._args.framework)

self._dimension = int(math.sqrt(self.record_size))
self._dimension_stdev = self.record_size_stdev/2.0/math.sqrt(self.record_size)
hariharan-devarajan (Collaborator):

We have these variables within the configuration already. We should use self._args.dimension_stdev.

zhenghh04 (Member, Author):

This is not initialized at this point; self._args.dimension is still 1 here.

hariharan-devarajan (Collaborator):

Oh, I meant we need config methods like derive_configurations_basic and derive_configurations_dataset.

This will split the configuration we are setting into dataset-dependent and dataset-independent parts.

    @dlp.log
    def derive_configurations_basic(self):
        self.dimension = int(math.sqrt(self.record_length))
        self.dimension_stdev = self.record_length_stdev / 2.0 / math.sqrt(self.record_length)
        self.max_dimension = self.dimension
        if self.record_length_resize > 0:
            self.max_dimension = int(math.sqrt(self.record_length_resize))
        self.resized_image = np.random.randint(255, size=(self.max_dimension, self.max_dimension), dtype=np.uint8)
        self.required_samples = self.comm_size * self.batch_size
        if self.read_threads > 0:
            self.required_samples *= self.read_threads

    @dlp.log
    def derive_configurations_dataset(self, file_list_train, file_list_eval):
        self.file_list_train = file_list_train
        self.file_list_eval = file_list_eval
        self.num_files_eval = len(file_list_eval)
        self.num_files_train = len(file_list_train)
        self.total_samples_train = self.num_samples_per_file * len(self.file_list_train)
        self.total_samples_eval = self.num_samples_per_file * len(self.file_list_eval)
        self.training_steps = int(math.ceil(self.total_samples_train / self.batch_size / self.comm_size))
        self.eval_steps = int(math.ceil(self.total_samples_eval / self.batch_size_eval / self.comm_size))

Then, in the main dlio_benchmark.py, first call derive_configurations_basic(), and after you generate the data call derive_configurations_dataset().
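
As a rough sketch of that ordering (only the two derive_* calls come from this thread; the surrounding function and helper names are assumptions):

    def run_benchmark(config, data_generator, list_generated_files):
        # 1. Dataset-independent settings can be derived immediately.
        config.derive_configurations_basic()
        # 2. Generate the dataset files on disk.
        data_generator.generate()
        # 3. Once the concrete file lists exist, derive the dataset-dependent
        #    settings (total samples, training/eval steps, ...).
        file_list_train, file_list_eval = list_generated_files()
        config.derive_configurations_dataset(file_list_train, file_list_eval)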

@@ -44,14 +44,14 @@ def __init__(self):
self.compression = self._args.compression
self.compression_level = self._args.compression_level
self._file_prefix = None
self._dimension = None
self._file_list = None
self.num_subfolders_train = self._args.num_subfolders_train
hariharan-devarajan (Collaborator):

We don't need to copy these variables; just use the args directly.
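
For illustration, a generator could read values straight off the shared args object at the point of use instead of mirroring them in __init__ (building on the hypothetical DataGenerator sketch above; the class and attribute names simply mirror the quoted snippet):

    class CSVGenerator(DataGenerator):
        def __init__(self, args):
            # Keep only the shared configuration reference; no per-generator
            # copies such as self.num_subfolders_train or self.format.
            super().__init__(args)

        def num_subfolders(self, is_train):
            # Read directly from the configuration where it is needed.
            return (self._args.num_subfolders_train if is_train
                    else self._args.num_subfolders_eval)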

else:
dim1 = dim2 = self._dimension
record = random.random(dim1*dim2)
record = np.random.randint(255, size=dim1*dim2, dtype=np.uint8)
records = [record]*self.num_samples
df = pd.DataFrame(data=records)
hariharan-devarajan (Collaborator):

The part where we calculate the dims per sample and build a record can be a common function in data_generator.py. This way we can make it more modular.
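
One possible shape for that shared helper in data_generator.py (the name and signature are assumptions); it keeps the dimension/stdev math in one place and always produces uint8 data:

    import math
    import numpy as np

    def make_record(record_length, record_length_stdev=0.0):
        # Per-sample dimension derived as in derive_configurations_basic().
        dim = int(math.sqrt(record_length))
        if record_length_stdev > 0:
            stdev = record_length_stdev / 2.0 / math.sqrt(record_length)
            # Jitter the dimension so record sizes follow the configured spread.
            dim = max(1, int(np.random.normal(dim, stdev)))
        return np.random.randint(255, size=dim * dim, dtype=np.uint8)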

@@ -42,10 +42,10 @@ def generate(self):
Generate hdf5 data for training. It generates a 3d dataset and writes it to file.
"""
super().generate()
random.seed(10)
np.random.seed(10)
samples_per_iter=1024
dim1 = dim2 = self._dimension
hariharan-devarajan (Collaborator):

Are we missing variable dimensions here? Also, this creation of dims and the record is done per file in the other generators.
This whole method can be abstracted into the parent class, with subclasses only overriding how we store the files.
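
Following that suggestion, the HDF5 generator would then only override the storage step. A rough sketch building on the earlier hypothetical DataGenerator example (the save_file hook name and this h5py usage are assumptions, not the merged code):

    import h5py
    import numpy as np

    class HDF5Generator(DataGenerator):
        def save_file(self, records, filename):
            # Stack the shared flat uint8 records into a
            # (num_samples, record_length) dataset; compression is taken
            # straight from the configuration.
            data = np.stack(records)
            with h5py.File(filename, "w") as hf:
                hf.create_dataset("records", data=data,
                                  compression=self._args.compression or None)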

zhenghh04 merged commit 88613b5 into argonne-lcf:main on Apr 24, 2023.