Changed datatype to be np.uint8 universally in the call #61
Conversation
In general the changes look good. I found a few things we can improve in the code:
- The generate method can be made generic in data_generator.py, and we can add a save_file method that takes the records and stores them in a file (see the sketch after this list).
- The data generator and its subclasses copy config variables; this can be avoided to improve the library's memory footprint.
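A minimal sketch of what I have in mind; the class layout and names such as save_file and self._args.num_samples_per_file are illustrative, not existing DLIO code:

from abc import ABC, abstractmethod
import numpy as np

class DataGenerator(ABC):
    def __init__(self, args, file_list):
        self._args = args            # single shared config reference, no per-field copies
        self._file_list = file_list

    @abstractmethod
    def save_file(self, file_name, records):
        """Format-specific: persist one file's worth of records."""

    def generate(self):
        # Generic part: build the records once, then delegate storage to the subclass.
        np.random.seed(10)
        dim1 = dim2 = self._args.dimension
        record = np.random.randint(255, size=dim1 * dim2, dtype=np.uint8)
        records = [record] * self._args.num_samples_per_file
        for file_name in self._file_list:
            self.save_file(file_name, records)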
src/data_generator/data_generator.py
Outdated
self._file_list = None
self.num_subfolders_train = self._args.num_subfolders_train
self.num_subfolders_eval = self._args.num_subfolders_eval
self.format = self._args.format
self.storage = StorageFactory().get_storage(self._args.storage_type, self._args.storage_root,
                                            self._args.framework)

self._dimension = int(math.sqrt(self.record_size))
We have these variables within the configuration. We should use self._args.dimension.
src/data_generator/data_generator.py
Outdated
self._file_list = None
self.num_subfolders_train = self._args.num_subfolders_train
self.num_subfolders_eval = self._args.num_subfolders_eval
self.format = self._args.format
self.storage = StorageFactory().get_storage(self._args.storage_type, self._args.storage_root,
                                            self._args.framework)

self._dimension = int(math.sqrt(self.record_size))
self._dimension_stdev = self.record_size_stdev/2.0/math.sqrt(self.record_size)
We have these variables within the configuration. We should use self._args.dimension_stdev.
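For reference, my reading of the formula above, assuming record_size ≈ dimension²: since dimension = sqrt(record_size), first-order error propagation gives d(dimension)/d(record_size) = 1/(2*sqrt(record_size)), so dimension_stdev ≈ record_size_stdev / (2*sqrt(record_size)), which is exactly what that line computes.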
This is not initialized at this point; it ends up with self._args.dimension = 1.
Oh, I meant we need config methods like derive_configurations_basic and derive_configurations_dataset.
This will split the configurations we set into dataset-dependent and dataset-independent ones.
@dlp.log
def derive_configurations_basic(self):
    self.dimension = int(math.sqrt(self.record_length))
    self.dimension_stdev = self.record_length_stdev/2.0/math.sqrt(self.record_length)
    self.max_dimension = self.dimension
    if self.record_length_resize > 0:
        self.max_dimension = int(math.sqrt(self.record_length_resize))
    self.resized_image = np.random.randint(255, size=(self.max_dimension, self.max_dimension), dtype=np.uint8)
    self.required_samples = self.comm_size * self.batch_size
    if self.read_threads > 0:
        self.required_samples *= self.read_threads

@dlp.log
def derive_configurations_dataset(self, file_list_train, file_list_eval):
    self.file_list_train = file_list_train
    self.file_list_eval = file_list_eval
    self.num_files_eval = len(file_list_eval)
    self.num_files_train = len(file_list_train)
    self.total_samples_train = self.num_samples_per_file * len(self.file_list_train)
    self.total_samples_eval = self.num_samples_per_file * len(self.file_list_eval)
    self.training_steps = int(math.ceil(self.total_samples_train / self.batch_size / self.comm_size))
    self.eval_steps = int(math.ceil(self.total_samples_eval / self.batch_size_eval / self.comm_size))
Then, in the main dlio_benchmark.py, first call derive_configurations_basic(), and after you generate the data, call derive_configurations_dataset().
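Roughly, the wiring could look like the sketch below; generate_data and the function name are placeholders, only the two derive_configurations_* methods come from the snippet above:

def initialize(config, generate_data):
    # Placeholder wiring: generate_data is whatever produces the dataset files.
    config.derive_configurations_basic()                      # dataset-independent settings
    file_list_train, file_list_eval = generate_data(config)   # create the dataset
    config.derive_configurations_dataset(file_list_train, file_list_eval)
    return config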
@@ -44,14 +44,14 @@ def __init__(self):
self.compression = self._args.compression
self.compression_level = self._args.compression_level
self._file_prefix = None
self._dimension = None
self._file_list = None
self.num_subfolders_train = self._args.num_subfolders_train
We don't need to copy these variables; just use args directly.
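A minimal sketch of the idea; the field names mirror the diff above and the accessor method is only illustrative:

class DataGenerator:
    def __init__(self, args):
        self._args = args   # keep one reference to the shared config, no per-field copies

    def num_subfolders(self, is_train):
        # Read config values directly from self._args at the point of use.
        return (self._args.num_subfolders_train if is_train
                else self._args.num_subfolders_eval)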
else:
    dim1 = dim2 = self._dimension
-   record = random.random(dim1*dim2)
+   record = np.random.randint(255, size=dim1*dim2, dtype=np.uint8)
    records = [record]*self.num_samples
    df = pd.DataFrame(data=records)
The part where we calculate the dimensions per sample and build a record could be a common function in data_generator.py. That way we make it more modular.
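Something like this in data_generator.py could then be shared by all format generators; the name and signature are my suggestion, not existing code:

import numpy as np

def make_record(dimension, dimension_stdev=0.0):
    # Draw one square sample; jitter the edge length when a stdev is configured.
    if dimension_stdev > 0:
        dim1 = dim2 = max(1, int(np.random.normal(dimension, dimension_stdev)))
    else:
        dim1 = dim2 = int(dimension)
    return np.random.randint(255, size=dim1 * dim2, dtype=np.uint8)

Each format-specific generator would then only reshape or serialize the returned array as needed.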
src/data_generator/hdf5_generator.py
Outdated
@@ -42,10 +42,10 @@ def generate(self):
    Generate hdf5 data for training. It generates a 3d dataset and writes it to file.
    """
    super().generate()
-   random.seed(10)
+   np.random.seed(10)
    samples_per_iter=1024
    dim1 = dim2 = self._dimension
Are we missing variable dimensions here? Also, this creation of the dims and the record is done per file in the other generators.
This whole method can be abstracted into the parent class, with subclasses only overriding how we store the files.
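For illustration, with a generic generate() in the parent class (like the sketch earlier in this thread), the HDF5 generator could shrink to just the storage step; the dataset name and the h5py options here are assumptions:

import h5py
import numpy as np

class HDF5Generator(DataGenerator):
    def save_file(self, file_name, records):
        samples = np.stack(records)          # shape: (num_samples, dim1*dim2)
        with h5py.File(file_name, "w") as f:
            f.create_dataset("records", data=samples)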
This PR changed the datatype from np.float64 to np.uint8 throughout. It also fixed the issue with dimension_stdev.