
Possible data process in dataset before cut by length limit #10

Closed
xpzhang opened this issue Jan 11, 2024 · 8 comments

Comments

xpzhang commented Jan 11, 2024

Hi, Huilin

To my understanding, the number of data points fed to training is limited by pf_points.length or pf_features.length, and any points beyond these limits are discarded. But in some applications the original data are prepared in a specific order that is not suitable for a direct cut, so some preprocessing may be desirable to avoid introducing a bias. Two possible approaches are sketched below (a short usage example follows the two snippets):

  1. Shuffle the order of the data points (coordinates, features, masks, etc.):
import torch

def shuffle_data(data):
    # data has shape (batch, features, points); permute the points dimension
    random_indices = torch.randperm(data.size(2))
    return data[:, :, random_indices]
  2. Sort the data based on the value of a particular feature variable (ascending or descending):
def sort_by_feature(data, i, descending=False):
    # i is the index of the feature used for sorting; data has shape (batch, features, points)
    sorted_indices = torch.argsort(data[:, i, :], dim=1, descending=descending)
    # gather along the points dimension (dim=2), broadcasting the indices over all features
    return torch.gather(data, 2, sorted_indices.unsqueeze(1).expand(-1, data.size(1), -1))
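
For illustration, the two helpers above could be applied to a dummy batch tensor like this (the shapes and values here are made up for the example):

data = torch.randn(4, 3, 20)  # hypothetical batch: 4 jets, 3 features, 20 points each

shuffled = shuffle_data(data)                        # points in a random order
ordered = sort_by_feature(data, 0, descending=True)  # points ordered by feature 0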

It would be nice if new options for this kind of processing could be added to the data config file. I'm not sure whether I've made myself clear or whether this feature would be easy to implement. Thanks a lot.

hqucms (Owner) commented Jan 13, 2024

Hi @xpzhang ,

If I understand correctly, both are actually already supported in weaver (though with some limitations):

  1. random shuffle (via the _batch_permute_indices function)
new_variables:
  # random permutation and then keep the leading 100 elements
  random_indices: _batch_permute_indices(-energy, 100)
  var1_permuted: _batch_gather(var1, random_indices)
  var2_permuted: _batch_gather(var2, random_indices)
  # ...
  2. sort by a particular feature (via the _batch_argsort function)
new_variables:
  # sort (by descending energy here because of the minus sign) and then keep the leading 100 elements
  sorted_indices: _batch_argsort(-energy, 100)
  var1_permuted: _batch_gather(var1, sorted_indices)
  var2_permuted: _batch_gather(var2, sorted_indices)
  # ...

Let me know if these two meet your needs or not :)

xpzhang (Author) commented Jan 14, 2024

Hi @hqucms , thanks for the reply. These functions are exactly what I need!
But when I was testing them, an error was raised:

File ~/miniforge3/envs/weaver/lib/python3.10/site-packages/weaver/utils/data/tools.py:101, in _batch_gather(array, indices)
    100 def _batch_gather(array, indices):
--> 101     out = array.zeros_like()
    102     for i, (a, idx) in enumerate(zip(array, indices)):

File ~/miniforge3/envs/weaver/lib/python3.10/site-packages/awkward/highlevel.py:1236, in Array.__getattr__(self, where)
   1231         raise AttributeError(
   1232             f"while trying to get field {where!r}, an exception "
   1233             f"occurred:\n{type(err)}: {err!s}"
   1234         ) from err
   1235 else:
-> 1236     raise AttributeError(f"no field named {where!r}")

AttributeError: no field named 'zeros_like'

Then I tried to modify

out = array.zeros_like()

to

out = ak.zeros_like(array)

but I got another error:

File ~/miniforge3/envs/weaver/lib/python3.10/site-packages/weaver/utils/data/tools.py:105, in _batch_gather(array, indices)
    103 for i, (a, idx) in enumerate(zip(array, indices)):
    104     maxlen = min(len(a), len(idx))
--> 105     out[i][:maxlen] = a[idx[:maxlen]]
    106 return out

File ~/miniforge3/envs/weaver/lib/python3.10/site-packages/awkward/highlevel.py:1131, in Array.__setitem__(self, where, what)
   1076 def __setitem__(self, where, what):
   1077     """
   1078     Args:
   1079         where (str or tuple of str): Field name to add to records in the array.
   (...)
   1129     is not a factor in choosing one over the other.)
   1130     """
-> 1131     with ak._errors.OperationErrorContext(
   1132         "ak.Array.__setitem__",
   1133         (self,),
   1134         {"where": where, "what": what},
   1135     ):
   1136         if not (
   1137             isinstance(where, str)
   1138             or (isinstance(where, tuple) and all(isinstance(x, str) for x in where))
   1139         ):
   1140             raise TypeError("only fields may be assigned in-place (by field name)")

File ~/miniforge3/envs/weaver/lib/python3.10/site-packages/awkward/_errors.py:85, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
     78 try:
     79     # Handle caught exception
     80     if (
     81         exception_type is not None
     82         and issubclass(exception_type, Exception)
     83         and self.primary() is self
     84     ):
---> 85         self.handle_exception(exception_type, exception_value)
     86 finally:
     87     # Step out of the way so that another ErrorContext can become primary.
     88     if self.primary() is self:

File ~/miniforge3/envs/weaver/lib/python3.10/site-packages/awkward/_errors.py:95, in ErrorContext.handle_exception(self, cls, exception)
     93     self.decorate_exception(cls, exception)
     94 else:
---> 95     raise self.decorate_exception(cls, exception)

TypeError: only fields may be assigned in-place (by field name)

This error occurred while calling

    ak.Array.__setitem__(
        <Array [0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0] type='696 * float32'>
        where = slice-instance
        what = <Array [227, 122, 290, ..., 362, 209, 409] type='696 * float32'>
    )

hqucms (Owner) commented Jan 25, 2024

Hi @xpzhang -- sorry for the slow response, it slipped my attention somehow...
It turns out that with the new version of awkward-array these functions can be implemented in a much simpler and faster way, so I updated them in 7be2906 and made a new release (v0.4.10). Let me know if the new implementation works.
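
(For context: with recent awkward-array versions, per-event sorting and gathering can be expressed with jagged indexing directly. The snippet below is only a conceptual sketch of that idea on made-up data, not the actual code of the commit.)

import awkward as ak

# hypothetical jagged batch: 2 events with different numbers of particles
energy = ak.Array([[10.0, 50.0, 30.0], [5.0, 80.0]])
pt = ak.Array([[1.0, 5.0, 3.0], [0.5, 8.0]])

# per-event argsort by descending energy, truncated to the leading 2 elements
idx = ak.argsort(energy, ascending=False)[:, :2]

# jagged integer indexing gathers within each event, with no explicit Python loop
pt_sorted = pt[idx]  # [[5.0, 3.0], [8.0, 0.5]]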

xpzhang (Author) commented Jan 26, 2024

Hi @hqucms , the new implementation works well, thanks a lot!
There is another small issue, though. It seems that variables composed from other new_variables entries cannot use the preprocess.method=auto option.
For example:

new_variables:
  # random permutation
  random_indices: _batch_permute_indices(x)
  x_square: np.square(x)
  x_permuted: _batch_gather(x, random_indices)
  x_square_negtive: -x_square


pf_features:
  length: 100
  vars: 
    - [x, auto]  # This works
    - [x_permuted, auto] # This will raise an error (no field 'random_indices' in record)
    - [x_square_negtive, auto] # This also raises an error (no field 'x_square' in record)

This can be avoided by not defining the variables in two steps, like this:

new_variables:
  # random permutation
  x_permuted: _batch_gather(x, _batch_permute_indices(x))
  x_square_negtive: -np.square(x)

pf_features:
  length: 100
  vars: 
    - [x, auto] 
    - [x_permuted, auto] # This works
    - [x_square_negtive, auto] # This works too

but in cases where a variable like random_indices is used many times, I wonder whether this workaround might cause performance issues.

hqucms (Owner) commented Jan 26, 2024

Hi @xpzhang -- indeed, that is something I have been meaning to improve for a while. It should work with pull request #11, so you can give it a try. I'd like to run a few more tests to make sure it does not break other things before merging it.

However, just to add a note about auto: it is mainly intended for quick R&D. For "production" model training I would recommend fixing all the preprocessing parameters manually. Some manual inspection of the input distributions is often needed to find the most suitable transformations -- esp. if some features have long tails or very asymmetric shapes, etc.
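
As an illustration of that kind of inspection, a center and scale for a long-tailed feature could be derived by hand from a sample of values (a generic numpy sketch on made-up data, not a weaver utility):

import numpy as np

# hypothetical: values of one input feature, collected over a sample of events
x = np.random.lognormal(mean=2.0, sigma=1.0, size=100_000)

# a long-tailed feature is often easier to standardize after a log transform
x_log = np.log(x)

center = np.median(x_log)
# robust scale estimate based on the interquartile range
scale = 1.0 / (np.percentile(x_log, 75) - np.percentile(x_log, 25))
print(f"center to subtract: {center:.3f}, scale to multiply by: {scale:.3f}")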

xpzhang (Author) commented Jan 27, 2024

Hi @hqucms , I've tested this PR and it works as expected. I really appreciate all your hard work on this. However, it seems that the custom type of labels cannot be used (name not found); I guess that could be easily fixed.

hqucms (Owner) commented Jan 27, 2024

Hi @xpzhang -- Nice catch! I just updated the pull request to fix that :)

xpzhang (Author) commented Jan 28, 2024

Thanks for the quick fix! :)

xpzhang closed this as completed Jan 28, 2024