Issue with Permutation Importance - "ValueError: assignment destination is read-only" #5

Open
mycho830 opened this issue Jan 17, 2024 · 3 comments

Comments

@mycho830

Hello,

Thank you for providing such a great tool. It has been incredibly helpful in my research. However, I recently encountered an issue after downloading the latest version.

While running the analysis, I encountered the following error during phase 5 (modeling): "ValueError: assignment destination is read-only."

I suspected a parallelization issue and modified the code by setting run_parallel=False, but the problem persists. Could you please provide any assistance or insights into resolving this issue?

Here's the code snippet I used:
`from streamline.runners.model_runner import ModelExperimentRunner

model_exp = ModelExperimentRunner(
    output_path, experiment_name, algorithms=algorithms,
    exclude=exclude, class_label=class_label,
    instance_label=instance_label, scoring_metric=primary_metric,
    metric_direction=metric_direction,
    training_subsample=training_subsample,
    use_uniform_fi=use_uniform_FI, n_trials=n_trials,
    timeout=timeout, save_plots=False,
    do_lcs_sweep=do_lcs_sweep, lcs_nu=lcs_nu, lcs_n=lcs_N,
    lcs_iterations=lcs_iterations,
    lcs_timeout=lcs_timeout, resubmit=False)

model_exp.run(run_parallel=False)`

The error details are as follows:
`--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [24], in <cell line: 13>()
1 from streamline.runners.model_runner import ModelExperimentRunner
2 model_exp = ModelExperimentRunner(
3 output_path, experiment_name, algorithms=algorithms,
4 exclude=exclude, class_label=class_label,
(...)
11 lcs_iterations=lcs_iterations,
12 lcs_timeout=lcs_timeout, resubmit=False)
---> 13 model_exp.run(run_parallel=False)

File /N/slate/minycho/tools/python/STREAMLINE/streamline/runners/model_runner.py:238, in ModelExperimentRunner.run(self, run_parallel)
236 job_list.append((job_obj, copy.deepcopy(model)))
237 else:
--> 238 job_obj.run(model)
239 if run_parallel and run_parallel != "False" and not self.run_cluster:
240 # run_jobs(job_list)
241 Parallel(n_jobs=num_cores)(
242 delayed(model_runner_fn)(job_obj, model
243 ) for job_obj, model in tqdm(job_list))

File /N/slate/minycho/tools/python/STREAMLINE/streamline/modeling/modeljob.py:83, in ModelJob.run(self, model)
81 self.algorithm = model.small_name
82 logging.info('Running ' + str(self.algorithm) + ' on ' + str(self.train_file_path))
---> 83 ret = self.run_model(model)
85 # Pickle all evaluation metrics for ML model training and evaluation
86 pickle.dump(ret, open(self.full_path
87 + '/model_evaluation/pickled_metrics/'
88 + self.algorithm + 'CV' + str(self.cv_count) + "_metrics.pickle", 'wb'))

File /N/slate/minycho/tools/python/STREAMLINE/streamline/modeling/modeljob.py:149, in ModelJob.run_model(self, model)
144 self.export_best_params(self.full_path + '/models/' + self.algorithm +
145 '_usedparams' + str(self.cv_count) + '.csv',
146 model.params)
148 if self.uniform_fi:
--> 149 results = permutation_importance(model.model, x_train, y_train, n_repeats=10, random_state=self.random_state,
150 scoring=self.scoring_metric)
151 self.feature_importance = results.importances_mean
152 else:

File ~/.local/lib/python3.10/site-packages/sklearn/inspection/_permutation_importance.py:258, in permutation_importance(estimator, X, y, scoring, n_repeats, n_jobs, random_state, sample_weight, max_samples)
254 scorer = _MultimetricScorer(scorers=scorers_dict)
256 baseline_score = _weights_scorer(scorer, estimator, X, y, sample_weight)
--> 258 scores = Parallel(n_jobs=n_jobs)(
259 delayed(_calculate_permutation_scores)(
260 estimator,
261 X,
262 y,
263 sample_weight,
264 col_idx,
265 random_seed,
266 n_repeats,
267 scorer,
268 max_samples,
269 )
270 for col_idx in range(X.shape[1])
271 )
273 if isinstance(baseline_score, dict):
274 return {
275 name: _create_importances_bunch(
276 baseline_score[name],
(...)
280 for name in baseline_score
281 }

File ~/.local/lib/python3.10/site-packages/sklearn/utils/parallel.py:63, in Parallel.__call__(self, iterable)
58 config = get_config()
59 iterable_with_config = (
60 (_with_config(delayed_func, config), args, kwargs)
61 for delayed_func, args, kwargs in iterable
62 )
---> 63 return super().__call__(iterable_with_config)

File ~/.local/lib/python3.10/site-packages/joblib/parallel.py:1863, in Parallel.__call__(self, iterable)
1861 output = self._get_sequential_output(iterable)
1862 next(output)
-> 1863 return output if self.return_generator else list(output)
1865 # Let's create an ID that uniquely identifies the current call. If the
1866 # call is interrupted early and that the same instance is immediately
1867 # re-used, this id will be used to prevent workers that were
1868 # concurrently finalizing a task from the previous call to run the
1869 # callback.
1870 with self._lock:

File ~/.local/lib/python3.10/site-packages/joblib/parallel.py:1792, in Parallel._get_sequential_output(self, iterable)
1790 self.n_dispatched_batches += 1
1791 self.n_dispatched_tasks += 1
-> 1792 res = func(*args, **kwargs)
1793 self.n_completed_tasks += 1
1794 self.print_progress()

File ~/.local/lib/python3.10/site-packages/sklearn/utils/parallel.py:123, in _FuncWrapper.__call__(self, *args, **kwargs)
121 config = {}
122 with config_context(**config):
--> 123 return self.function(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/sklearn/inspection/_permutation_importance.py:62, in _calculate_permutation_scores(estimator, X, y, sample_weight, col_idx, random_state, n_repeats, scorer, max_samples)
60 X_permuted[X_permuted.columns[col_idx]] = col
61 else:
---> 62 X_permuted[:, col_idx] = X_permuted[shuffling_idx, col_idx]
63 scores.append(_weights_scorer(scorer, estimator, X_permuted, y, sample_weight))
65 if isinstance(scores[0], dict):

ValueError: assignment destination is read-only`

Your help on this matter would be greatly appreciated.

Thank you,
Min

@ryanurbs
Member

Hi Min,
Thanks for informing us of your issue. We'll attempt to track down the problem and get back to you. We may reach out to try and get more information about your run.

@raptor419
Member

Hi Min,

Thank you so much for informing us of your issue. This seems to be a highly specific issue within scikit-learn, and we will do our best to replicate and correct it so that it doesn't affect any future analysis. It would help to know the versions of the packages and the size of the datasets you're using. For now, would it be fair to assume you have the latest versions of all packages and a fairly large dataset?
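
For gathering the version information requested above, a small sketch like the following could be used. The package list is an assumption about what is relevant here (it is not an official STREAMLINE requirement list), and `package_versions` is a hypothetical helper name.

```python
import importlib.metadata as md


def package_versions(names):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = md.version(name)
        except md.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return versions


# Assumed list of packages that appear in the traceback or are likely relevant.
versions = package_versions(["scikit-learn", "joblib", "numpy", "pandas"])
for name, version in versions.items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Pasting the printed lines into the issue, together with the number of rows and columns in each dataset, should be enough to attempt a reproduction.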

Your localization of the issue to parallelization is very helpful and seems to be a step in the right direction. It appears to be an issue with internal parallelization within sklearn. While I did not find this exact issue reported for the permutation_importance function in the sklearn GitHub repositories, I found a very similar issue dealing with the same error: scikit-learn/scikit-learn#5956
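
To illustrate what the error message itself means, here is a minimal NumPy-only sketch. It assumes nothing about STREAMLINE: it simply marks a toy array as read-only and repeats the kind of in-place column shuffle shown on the last traceback frame. One known way for an array to become read-only is joblib memory-mapping large inputs for process-based backends, but whether that is what happened in this run has not been established.

```python
import numpy as np

# Toy data standing in for the training matrix (assumption: a plain 2-D ndarray).
X = np.arange(12, dtype=float).reshape(4, 3)
X.setflags(write=False)  # simulate a read-only buffer

shuffling_idx = np.array([3, 2, 1, 0])

# In-place column assignment on a read-only array raises ValueError.
try:
    X[:, 0] = X[shuffling_idx, 0]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# A fresh copy owns its own writable memory, so the same assignment succeeds.
X_writable = X.copy()
X_writable[:, 0] = X_writable[shuffling_idx, 0]
print(X_writable[:, 0])
```

Checking `X.flags.writeable` on the array passed to permutation_importance would be a quick way to test whether this applies to a given run.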

As a step-wise solution I would try the following:

  1. Add n_jobs=1 to the permutation_importance call on line 149 of modeljob.py, instead of relying on the default None (which should already result in 1, but it is worth trying anyway).
  2. Update the joblib library through conda/pip.
  3. Try the solution mentioned in "ValueError: assignment destination is read-only, when paralleling with n_jobs > 1" (scikit-learn/scikit-learn#5956): scikit-learn/scikit-learn#5956 (comment)
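
Step 1 can be sketched as follows. This is a standalone illustration, not the actual modeljob.py code: the synthetic dataset, the LogisticRegression model, and the "balanced_accuracy" scorer are stand-ins chosen for the example, while the permutation_importance arguments mirror the call shown in the traceback with n_jobs=1 added.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for x_train / y_train (assumption, not STREAMLINE data).
x_train, y_train = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)

# Stand-in for model.model; any fitted scikit-learn estimator works here.
model = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# Same call shape as line 149 of modeljob.py, with n_jobs set explicitly.
results = permutation_importance(
    model, x_train, y_train,
    n_repeats=10, random_state=0,
    scoring="balanced_accuracy",
    n_jobs=1,
)
feature_importance = results.importances_mean
print(feature_importance)
```

In modeljob.py itself, the only change needed for step 1 is adding the n_jobs=1 keyword argument to the existing call.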

Do let me know whether my assumptions are fair, where I might be wrong, and whether any of the steps above helped.
Feel free to reach out if you have any more questions.

Thanks and regards,
Harsh

@mycho830
Author

Hi Harsh,

Thank you for your response. Your assumptions are accurate – we are using the latest versions, and we use 5 input files from 4.7k to 55k.

I followed your step-wise solution, and adding n_jobs = 1 to the permutation_importance function on line 149 of modeljob.py did the trick! The issue seems to be resolved, and the analysis is running smoothly without encountering the parallelization error.

I sincerely appreciate your assistance and quick resolution.

Thanks once again, and have a great day!

Best regards,
Min
