After PR #358, Evaluator classes are used to calculate correlation functions. However, these classes cannot be pickled because they lack a __reduce__ method. This makes it impossible to parallelize MC over separate processes with multiprocessing or Joblib, since the workers in these libraries require all inputs and outputs to be picklable.
Note that process-wise parallelization is still desirable even with parallel evaluation of correlations implemented, because users will frequently want to parallelize over temperatures, chemical potentials, etc.
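For context, process pools serialize every task argument with pickle before shipping it to a worker, so an unpicklable evaluator fails before any MC step runs. A minimal sketch of the failure mode, using a toy stand-in (ToyEvaluator is hypothetical, not smol's class) whose C-level state is simulated with a locally defined lambda:

```python
import pickle


class ToyEvaluator:
    """Toy stand-in for an extension type; not smol's actual evaluator."""

    def __init__(self, n_corr):
        self.n_corr = n_corr
        # Simulates unpicklable C-level state: plain pickle cannot serialize
        # a locally defined lambda, just as it cannot serialize a Cython
        # type that lacks __reduce__.
        self._c_state = lambda occu: occu * n_corr


picklable = True
try:
    # This is exactly what multiprocessing/joblib do with task arguments.
    pickle.dumps(ToyEvaluator(4))
except Exception as err:
    picklable = False
    print(f"cannot ship to a worker: {err}")
```

Any object reaching a worker, including anything held inside an ensemble or sampler, must survive this round trip.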
Current Behavior
Here is the error reported by pytest:
The cython class ClusterSpaceEvaluator does not have a __reduce__ method.
Possible Solution
Implement a __reduce__ method for all of the added cython classes in smol.util, then test process-wise parallelization again.
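The fix can be sketched in plain Python (the same __reduce__ body works in a cdef class; the class and attribute names below are hypothetical, not smol's actual API). __reduce__ returns a callable and an argument tuple that pickle uses to rebuild the object, so the unpicklable C-level internals are reconstructed in the worker instead of being serialized:

```python
import pickle


class EvaluatorSketch:
    """Hypothetical sketch of the pattern; not smol's ClusterSpaceEvaluator."""

    def __init__(self, orbit_data, num_corr_functions):
        self.orbit_data = list(orbit_data)
        self.num_corr_functions = num_corr_functions
        # ... a Cython class would allocate C-level buffers here ...

    def __reduce__(self):
        # Rebuild from constructor arguments on unpickling; pickle never
        # touches the C-level state.
        return (self.__class__, (self.orbit_data, self.num_corr_functions))


ev = EvaluatorSketch([0, 1, 2], 3)
clone = pickle.loads(pickle.dumps(ev))
```

Rebuilding from constructor arguments is usually the simplest choice here, since the C-level buffers are fully determined by those arguments.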
Steps to Reproduce
The problematic code:
# Imports implied by the snippet (project-internal helpers such as
# get_num_structs_to_sample, select_initial_rows, select_added_rows and
# _sample_single_generator are omitted here).
from multiprocessing import cpu_count
from warnings import warn

import numpy as np
from joblib import Parallel, delayed


def generate_training_structures(
    ce,
    enumerated_matrices,
    enumerated_counts,
    previous_sampled_structures=None,
    previous_feature_matrix=None,
    keep_ground_states=True,
    num_structs=60,
    mc_generator_kwargs=None,
    n_parallel=None,
    duplicacy_criteria="correlations",
    **kwargs,
):
    """Generate training structures at the first iteration.

    Args:
        ce(ClusterExpansion):
            ClusterExpansion object initialized as null. If charge decorated,
            will contain an ewald contribution at 100%.
        enumerated_matrices(list[3*3 ArrayLike[int]]):
            Previously enumerated supercell matrices. Must be the same
            super-cell size.
        enumerated_counts(list[1D ArrayLike]):
            Previously enumerated compositions in "counts" format. Must fit
            in the super-cell size.
            Note: different super-cell sizes are not supported!
        previous_sampled_structures(list[Structure]): optional
            Sample structures already calculated in past iterations. If
            given, structures will be added to an existing training set.
        previous_feature_matrix(list[list[float]]): optional
            Correlation vectors of structures already calculated in past
            iterations.
        keep_ground_states(bool): optional
            Whether to always include the electrostatic ground states.
            Default to True.
        num_structs(int): optional
            Number of training structures to add at the iteration. At least
            2~3 structures should be enumerated for each composition, and it
            is recommended that
            num_structs_init * 10 > 2 * len(supercell_and_counts).
            Default is 60.
        mc_generator_kwargs(dict): optional
            Keyword arguments for McSampleGenerator, except num_samples.
            Note: currently only Canonical is supported.
        n_parallel(int): optional
            Number of generators to run in parallel. Default is to use a
            quarter of the cpu count.
        duplicacy_criteria(str):
            The criteria for when to consider two structures the same, and
            only add one of them to the candidate training set. Default is
            "correlations", which asserts duplication if two structures have
            the same correlation vectors, while "structure" requires that
            two structures be symmetrically equivalent after being reduced.
            No other option is allowed.
            Note that option "structure" might be significantly slower since
            it has to attempt reducing every structure to its primitive cell
            before matching. It should be used with caution.
        kwargs:
            Keyword arguments for utils.selection.select_initial_rows.

    Returns:
        list[Structure], list[3*3 list[list[int]]], list[list[float]]:
            Initial training structures, super-cell matrices, and normalized
            correlation vectors.
    """
    mc_generator_args = mc_generator_kwargs or {}
    n_parallel = n_parallel or min(cpu_count() // 4, len(enumerated_counts))
    if n_parallel == 0:
        if cpu_count() // 4 == 0:
            warn(
                f"Number of CPUs found on the executing environment: {cpu_count()} might"
                f" not be enough for parallelization! Setting parallel processes to 1."
            )
        n_parallel = 1

    previous_sampled_structures = previous_sampled_structures or []
    previous_feature_matrix = np.array(previous_feature_matrix).tolist() or []
    if len(previous_feature_matrix) != len(previous_sampled_structures):
        raise ValueError(
            "Must provide a feature vector for each structure passed in!"
        )

    # Scale the number of structures to select for each composition.
    num_samples = get_num_structs_to_sample(
        [counts for _ in enumerated_matrices for counts in enumerated_counts],
        num_structs,
    )

    with Parallel(n_jobs=n_parallel) as par:
        gs_id = 0
        keeps = []
        structures = []
        femat = []
        sc_matrices = []
        sc_matrix_indices = []
        for mid, sc_matrix in enumerate(enumerated_matrices):
            # This should work on pytest.
            results = par(
                delayed(_sample_single_generator)(
                    ce,
                    previous_sampled_structures + structures,
                    previous_feature_matrix + femat,
                    mc_generator_args,
                    sc_matrix,
                    counts,
                    num_sample,
                    duplicacy_criteria=duplicacy_criteria,
                )
                for counts, num_sample in zip(
                    enumerated_counts,
                    num_samples[
                        mid * len(enumerated_counts) : (mid + 1)
                        * len(enumerated_counts)
                    ],
                )
            )

            for (
                gs_struct,
                gs_occu,
                gs_feat,
                samples,
                samples_occu,
                samples_feat,
                gs_dupe,
            ) in results:
                if gs_dupe:
                    structures.extend(samples)
                    femat.extend(samples_feat)
                    sc_matrices.extend([sc_matrix for _ in samples])
                    sc_matrix_indices.extend([mid for _ in samples])
                    gs_id += len(samples)
                else:
                    structures.extend([gs_struct] + samples)
                    femat.extend([gs_feat] + samples_feat)
                    sc_matrices.extend(
                        [sc_matrix for _ in range(len(samples) + 1)]
                    )
                    sc_matrix_indices.extend(
                        [mid for _ in range(len(samples) + 1)]
                    )
                    if keep_ground_states:
                        keeps.append(gs_id)
                    gs_id += len(samples) + 1

    femat = np.array(femat)
    # External terms such as the ewald term should not be taken into
    # comparison when selecting structures.
    num_external_terms = len(ce.cluster_subspace.external_terms)

    if len(previous_sampled_structures) == 0:
        # Start from scratch.
        selected_row_ids = select_initial_rows(
            femat,
            n_select=num_structs,
            keep_indices=keeps,
            num_external_terms=num_external_terms,
            **kwargs,
        )
    else:
        # Add to existing.
        selected_row_ids = select_added_rows(
            femat,
            np.array(previous_feature_matrix),
            n_select=num_structs,
            keep_indices=keeps,
            num_external_terms=num_external_terms,
            **kwargs,
        )

    # Must sort to ensure the same ordering between feature rows and structures.
    selected_row_ids = sorted(selected_row_ids)
    selected_structures = [
        s for i, s in enumerate(structures) if i in selected_row_ids
    ]
    selected_matrices = [
        m for i, m in enumerate(sc_matrices) if i in selected_row_ids
    ]
    selected_femat = femat[selected_row_ids, :].tolist()
    if len(selected_row_ids) < num_structs:
        warn(
            f"Expected to add {num_structs} new structures,"
            f" but only {len(selected_row_ids)}"
            f" non-duplicate structures could be added."
        )
    return selected_structures, selected_matrices, selected_femat
In this function, I tried to initialize an MC sampler in every parallel process.
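The process-level parallelization this issue asks for can be sketched with only the standard library (the worker below is a hypothetical stand-in; a real worker would rebuild a sampler from picklable inputs such as the expansion and a supercell matrix). pool.map pickles the arguments for each call, which is exactly why every object reaching the worker must support pickling:

```python
import multiprocessing as mp


def run_mc_at_temperature(temperature):
    # Hypothetical worker: a real one would construct a sampler here and
    # run MC at this temperature, returning sampled energies.
    return temperature * 2


# "fork" keeps this example self-contained on POSIX; under the "spawn"
# start method the same code must live below an
# `if __name__ == "__main__":` guard.
ctx = mp.get_context("fork")
with ctx.Pool(processes=2) as pool:
    # Arguments and return values cross process boundaries via pickle.
    energies = pool.map(run_mc_at_temperature, [300, 600, 900])
print(energies)
```

This is the pattern that breaks today when the arguments transitively contain a ClusterSpaceEvaluator, and that would work once __reduce__ is implemented.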