Improve ClusterBasedNormalizer performance #336

Open
fealho opened this issue Dec 16, 2021 · 1 comment
Labels
feature request Request for a new feature

Comments

@fealho
Member

fealho commented Dec 16, 2021

We should experiment with the BayesGMMTransformer to improve its performance. In particular, the current parameters (the weight_threshold and the default values passed to the BayesianGaussianMixture) should be tuned and new default values chosen.

The code can also be sped up. The reverse_transform is already much quicker than the other two methods, and fit takes almost all of its time fitting the BayesianGaussianMixture, which is unavoidable. Instead, the biggest gains can be achieved by improving the transform method, specifically the following lines:

    selected_component = np.zeros(len(data), dtype='int')
    for i in range(len(data)):
        component_prob_t = component_probs[i] + 1e-6
        component_prob_t = component_prob_t / component_prob_t.sum()
        selected_component[i] = np.random.choice(
            np.arange(self._valid_component_indicator.sum()),
            p=component_prob_t
        )

These lines take the majority of the transform runtime, so any improvement here would significantly speed up the whole process.
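One possible direction, as a rough sketch rather than a tested patch: the per-row np.random.choice call could be replaced by vectorized inverse-CDF sampling, drawing one uniform number per row and comparing it against that row's cumulative probabilities. The helper name sample_components and its rng argument are hypothetical and not part of RDT. The sketch also assumes component_probs is a 2-D array with one column per valid component, so the column count equals self._valid_component_indicator.sum().

```python
import numpy as np


def sample_components(component_probs, rng=None):
    """Draw one component index per row (hypothetical helper, not RDT API)."""
    rng = np.random.default_rng() if rng is None else rng

    # Same smoothing and renormalization as the loop, applied to all rows at once.
    probs = np.asarray(component_probs, dtype=float) + 1e-6
    probs = probs / probs.sum(axis=1, keepdims=True)

    # Inverse-CDF sampling: one uniform draw per row, compared against the
    # row's cumulative probabilities. The number of cumulative values that
    # are <= the draw is the index of the sampled component.
    cumulative = probs.cumsum(axis=1)
    draws = rng.random(len(probs))[:, np.newaxis]
    selected = (cumulative <= draws).sum(axis=1)

    # Guard against a final cumulative value that lands marginally below 1.0.
    return np.minimum(selected, probs.shape[1] - 1)
```

In transform, the loop would then reduce to something like selected_component = sample_components(component_probs). Note that this consumes random numbers differently from np.random.choice, so outputs under a fixed seed would not match the current implementation row for row, even though the sampling distribution is the same.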

@fealho fealho added the internal The issue doesn't change the API or functionality label Dec 16, 2021
@npatki npatki added feature request Request for a new feature and removed internal The issue doesn't change the API or functionality labels Jun 10, 2022
@npatki
Contributor

npatki commented Jun 10, 2022

The old BayesGMMTransformer has now been renamed to ClusterBasedNormalizer in RDT 1.0. Changing the title to reflect this.

@npatki npatki changed the title Improve BayesGMMTransformer performance Improve ClusterBasedNormalizer performance Jun 10, 2022