Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added the Boss-SP algorithm and its example. Tests are still missing. #156

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

SvenBarray
Copy link
Contributor

No description provided.

@SvenBarray
Copy link
Contributor Author

Not a pull request meant to be accepted, but a pull request meant to share code and get feedback

Comment on lines 85 to 86
level : integer (default = 3)
Number of times the series is being divided. Maximum of 3.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of the parameters is relevant: the parameters are usually sorted based on their relevance / importance. level is a key element of BOSSSP (compared to BOSS), so it should be higher on the list (above window_step I would say).

anova=False, drop_sum=False, norm_mean=False, norm_std=False,
strategy='quantile', alphabet=None,
numerosity_reduction=True, use_idf=True, smooth_idf=False,
sublinear_tf=True, level=3):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise

self.use_idf = use_idf
self.smooth_idf = smooth_idf
self.sublinear_tf = sublinear_tf
self.level = level
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise

y[:length//2])
words_3 = self._boss_word_extractor(X[length//2:],
y[:length//2:])
words = words + words_2 + words_3
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are concatenated in BOSSSP are the histograms (and not the list of words). For instance, if level=3, the final histogram is the concatenation of 7 histograms. You need to first compute the histogram for each words_i list, then concatenate the histogram, instead of concatenating first the words_i lists and then computing the histogram.

else:
X_bow = np.asarray([' '.join(X_word[i]) for i in range(n_samples)])

words_list = np.asarray(X_bow[0].split(' ')).tolist()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You only get the transform for the first time series (X_bow[0]).

Comment on lines 198 to 270
def _check_params(self, n_timestamps):
if not isinstance(self.word_size, (int, np.integer)):
raise TypeError("'word_size' must be an integer.")
if not self.word_size >= 1:
raise ValueError("'word_size' must be a positive integer.")

if not isinstance(self.window_size,
(int, np.integer, float, np.floating)):
raise TypeError("'window_size' must be an integer or a float.")
if isinstance(self.window_size, (int, np.integer)):
if self.drop_sum:
if not 1 <= self.window_size <= (n_timestamps - 1):
raise ValueError(
"If 'window_size' is an integer, it must be greater "
"than or equal to 1 and lower than or equal to "
"(n_timestamps - 1) if 'drop_sum=True'."
)
else:
if not 1 <= self.window_size <= n_timestamps:
raise ValueError(
"If 'window_size' is an integer, it must be greater "
"than or equal to 1 and lower than or equal to "
"n_timestamps if 'drop_sum=False'."
)
window_size = self.window_size
else:
if not 0 < self.window_size <= 1:
raise ValueError(
"If 'window_size' is a float, it must be greater "
"than 0 and lower than or equal to 1."
)
window_size = ceil(self.window_size * n_timestamps)

if not isinstance(self.window_step,
(int, np.integer, float, np.floating)):
raise TypeError("'window_step' must be an integer or a float.")
if isinstance(self.window_step, (int, np.integer)):
if not 1 <= self.window_step <= n_timestamps:
raise ValueError(
"If 'window_step' is an integer, it must be greater "
"than or equal to 1 and lower than or equal to "
"n_timestamps."
)
window_step = self.window_step
else:
if not 0 < self.window_step <= 1:
raise ValueError(
"If 'window_step' is a float, it must be greater "
"than 0 and lower than or equal to 1."
)
window_step = ceil(self.window_step * n_timestamps)

if self.drop_sum:
if not self.word_size <= (window_size - 1):
raise ValueError(
"'word_size' must be lower than or equal to "
"(window_size - 1) if 'drop_sum=True'."
)
else:
if not self.word_size <= window_size:
raise ValueError(
"'word_size' must be lower than or equal to "
"window_size if 'drop_sum=False'."
)

if not isinstance(self.level, (int, np.integer)):
raise TypeError("'level' must be an integer.")
if not self.level >= 1:
raise ValueError("'level' must be a positive integer.")
if self.level >= 4:
raise ValueError("'level' must not exceed 3.")

return window_size, window_step
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, it might be easier to use the corresponding of method of BOSS if the code is the same. You could just add the checks for level.

Comment on lines 135 to 138
words_2 = self._boss_word_extractor(X[:length//2],
y[:length//2])
words_3 = self._boss_word_extractor(X[length//2:],
y[:length//2:])
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You split your dataset in two, but on the first axis (the axis of samples) instead of the second axis (the time axis).

Comment on lines 141 to 149
words_4 = self._boss_word_extractor(X[:length//4],
y[:length//4])
words_5 = self._boss_word_extractor(X[length//4:length//2],
y[length//4:length//2])
words_6 = self._boss_word_extractor(X[length//2:length*3//4],
y[length//2:length*3//4])
words_7 = self._boss_word_extractor(X[length*3//4:],
y[length*3//4:])
words = words + words_4 + words_5 + words_6 + words_7
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise.

Comment on lines +204 to +248
if not isinstance(self.window_size,
(int, np.integer, float, np.floating)):
raise TypeError("'window_size' must be an integer or a float.")
if isinstance(self.window_size, (int, np.integer)):
if self.drop_sum:
if not 1 <= self.window_size <= (n_timestamps - 1):
raise ValueError(
"If 'window_size' is an integer, it must be greater "
"than or equal to 1 and lower than or equal to "
"(n_timestamps - 1) if 'drop_sum=True'."
)
else:
if not 1 <= self.window_size <= n_timestamps:
raise ValueError(
"If 'window_size' is an integer, it must be greater "
"than or equal to 1 and lower than or equal to "
"n_timestamps if 'drop_sum=False'."
)
window_size = self.window_size
else:
if not 0 < self.window_size <= 1:
raise ValueError(
"If 'window_size' is a float, it must be greater "
"than 0 and lower than or equal to 1."
)
window_size = ceil(self.window_size * n_timestamps)

if not isinstance(self.window_step,
(int, np.integer, float, np.floating)):
raise TypeError("'window_step' must be an integer or a float.")
if isinstance(self.window_step, (int, np.integer)):
if not 1 <= self.window_step <= n_timestamps:
raise ValueError(
"If 'window_step' is an integer, it must be greater "
"than or equal to 1 and lower than or equal to "
"n_timestamps."
)
window_step = self.window_step
else:
if not 0 < self.window_step <= 1:
raise ValueError(
"If 'window_step' is a float, it must be greater "
"than 0 and lower than or equal to 1."
)
window_step = ceil(self.window_step * n_timestamps)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this method is run in the _boss_word_extractor method, if the values of window_size and window_step are floats, the proportions are computed on the split datasets, which means that the values of window_size and window_step would be different for different levels. It might not be an issue in itself, but in this case it should probably be mentioned in the documentation.

plt.ylabel("Occurences", fontsize=14)
plt.title("Number of occurence of each word after BOSS-SP transformation",
fontsize=15)
plt.legend(loc='best')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have no label in your plot, so there is no legend to display.

…t for the ones regarding code similar from other python modules as discussed orally
@SvenBarray
Copy link
Contributor Author

Corrected the issues mentioned in the pull request feedback, except for the ones regarding code similar from other python modules as discussed orally

@codecov
Copy link

codecov bot commented Sep 11, 2023

Codecov Report

Patch coverage is 14.87% of modified lines.

Files Changed Coverage
pyts/classification/bosssp.py 13.44%
pyts/classification/__init__.py 100.00%

📢 Thoughts on this report? Let us know!.

@fkiraly
Copy link

fkiraly commented Apr 12, 2024

Not a pull request meant to be accepted, but a pull request meant to share code and get feedback

@SvenBarray, would you like to add this to sktime if it is out of scope for pyts?
We follow a "collection" approach, and algorithms can be owned by individuals.

Not many changes have to be made:

  • using the transformation extension template
  • fit should be changed to _fit
  • tags need to be set - this is a series-to-primitives transformer, if I understand correctly
  • you need to add a _transform method, that should return the word features in one of the permissible data formats (e.g., data frame)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants