Representing time series (timestamp-value pairs) #289
Replies: 2 comments
-
**Suggestion**

How about the following?

```python
import awkward as ak

dataset = [
    {"patient_name": "Bob",
     "room_number": 104,
     "data": [{"time": 1591360416, "heart_rate": 60, "temperature": 101},
              {"time": 1591360476, "heart_rate": 70, "temperature": 102},
              {"time": 1591360536, "heart_rate": 50, "temperature": 99},
              {"time": 1591360716, "heart_rate": 60, "temperature": 98},
              {"time": 1591360776, "heart_rate": 60, "temperature": 99}]},
    {"patient_name": "Sally",
     "room_number": 102,
     "data": [{"time": 1591359404, "heart_rate": 130, "temperature": 98},
              {"time": 1591359464, "heart_rate": 120, "temperature": 98},
              {"time": 1591359524, "heart_rate": 110, "temperature": 99},
              {"time": 1591359644, "heart_rate": 90, "temperature": 98}]}]

array = ak.Array(dataset)
# <Array [{patient_name: 'Bob', ... ] type='2 * {"patient_name": string, "room_num...'>
```

I've assumed that the timestamps start their lives as the number of seconds since 1970. If most of your queries are going to be relative to the first time, you could preserve the absolute offset yet make the times relative to the starting time with

```python
array["data", "first_time"] = array["data", "time", :, 0]
array["data", "time"] = array["data", "time"] - array["data", "first_time"]
```

(I have assumed that the times within each subarray are sorted, or at the very least that index 0 is the earliest.)

Switching to attribute access because it's more convenient (square-bracket access is necessary for assignment; see #273), we can select times relative to the first with NumPy-style selectors. If we didn't have relative times, this would just be a little more complicated (we'd have to subtract each patient's first time before comparing).

```python
array.data.heart_rate[array.data.time < 150]
# <Array [[60, 70, 50], [130, 120, 110]] type='2 * var * int64'>
```
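The same relative-time masking can be sketched without Awkward at all: for a single patient, two parallel NumPy arrays behave the same way. This is a minimal illustration, using Bob's measurements from the example dataset above:

```python
import numpy as np

# One patient's measurements as two parallel arrays (values from the example).
time = np.array([1591360416, 1591360476, 1591360536, 1591360716, 1591360776])
heart_rate = np.array([60, 70, 50, 60, 60])

# Make times relative to the first measurement, then select with a boolean mask.
rel_time = time - time[0]          # [0, 60, 120, 300, 360]
early = heart_rate[rel_time < 150]
print(early.tolist())              # [60, 70, 50]
```

Awkward generalizes this pattern to many patients at once by letting the mask be jagged, one boolean subarray per patient.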
```python
array.data.temperature[array.data.time < 150]
# <Array [[101, 102, 99], [98, 98, 99]] type='2 * var * int64'>
```

These expressions have lost metadata (patient name, room number). You could keep the metadata by doing all of the operations in place, like the absolute → relative time conversion, but it's probably safer bookkeeping to create new arrays with the metadata "zipped" in.

```python
ak.zip({"room_number": array.room_number,
        "heart_rate": array.data.heart_rate[array.data.time < 150]})
# <Array [[{room_number: 104, ... ] type='2 * var * {"room_number": int64, "heart_...'>
```

I see that you're thinking of time as part of the index. That is a Pandas way of thinking—I get it—but that's going against the grain in NumPy/Awkward. In Pandas terms, we've made the time a column. Whether that matters for performance depends on what you're doing with it. As an index, searching through a huge number of times can take O(log(N)) time, rather than O(N) time, but that's assuming you're looking for one time or one time interval, not one time per patient. How much this matters depends on the relative scales of the number of patients and the number of measurements. As a column, you have some more flexibility: inequalities like `array.data.time < 150` can select a different window for each patient, and reductions give per-patient summaries:

```python
ak.min(array.data.time, axis=1)
# <Array [1591360416, 1591359404] type='2 * ?int64'>
```

SparseArray from Awkward 0 approximated the access-via-index way of thinking from Pandas, but it was not carried over into Awkward 1.
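To make the index-versus-column tradeoff concrete: if the time column is kept sorted, a binary search recovers the O(log(N)) interval lookup even without a Pandas index. A sketch with `np.searchsorted`, using Bob's relative times from above:

```python
import numpy as np

time = np.array([0, 60, 120, 300, 360])        # sorted relative times (seconds)
temperature = np.array([101, 102, 99, 98, 99])  # measurements at those times

# Binary search (O(log N)) for the half-open interval [0, 150).
lo = np.searchsorted(time, 0, side="left")
hi = np.searchsorted(time, 150, side="left")
print(temperature[lo:hi].tolist())              # [101, 102, 99]
```

This only answers one interval query at a time, though; the boolean-mask approach above answers one query per patient in a single vectorized step.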
If you didn't like seeing the zeros between time points with measurements, then SparseArray wasn't the right data structure for you. Passing around two arrays of the same length, one with time values, the other with measurements at those times, is a very common way to work in the NumPy world, and it's probably not uncommon to have "time" as a column in a Pandas DataFrame.

**Getting data into this (or a similar) form**

Depending on how large your dataset is, turning it into JSON or Python objects and passing them into the `ak.Array` constructor might be prohibitive. (The scale where that might start to matter is 10's to 100's of GB.) When you've decided what form you want your data to have, by playing around with small samples, I can help you with a large-scale conversion if the scale is large enough to matter and if I know what form it's starting in.

**Involving Pandas**

I'm not sure how much this would involve Pandas. Awkward 1 is Pandasable by default because it seemed like that would be a good idea, though I find it hard to see how the suite of operations Pandas provides mixes well with Awkward's view of the world. You can put these things into Pandas with

```python
pd.DataFrame({"everything": array})
#                                           everything
# 0  ... temperature: 98}, {time: 1591360776, heart...
# 1  ... temperature: 99}, {time: 1591359644, heart...
```

but I don't see what can be done with it in such an opaque form. Maybe this?

```python
pd.DataFrame({"patient_name": array.patient_name,
              "room_number": array.room_number,
              "heart_rate": array.data.heart_rate,
              "temperature": array.data.temperature})
#   patient_name  room_number            heart_rate             temperature
# 0          Bob          104  [60, 70, 50, 60, 60]  [101, 102, 99, 98, 99]
# 1        Sally          102   [130, 120, 110, 90]        [98, 98, 99, 98]
```

We'd like to do `df.temperature[:, 0]` but Pandas complains because it believes the data in each column is a scalar. ("Can only tuple-index with a MultiIndex.") We could go "full Pandas" with something like

```python
ak.pandas.df(array)
#                 patient_name  room_number        data  data  data
# entry subentry
# 0     0                   66          104  1591360416    60   101
#       1                  111          104  1591360476    70   102
#       2                   98          104  1591360536    50    99
# 1     0                   83          102  1591359404   130    98
#       1                   97          102  1591359464   120    98
#       2                  108          102  1591359524   110    99
#       3                  108          102  1591359644    90    98
```

(Those last three columns are wrong: looks like a bug. They ought to be MultiIndex `("data", "time")`, `("data", "heart_rate")`, `("data", "temperature")`.) Then you'd be able to get the first time for each patient with

```python
df.xs(0, level=1)
#        patient_name  room_number        data  data  data
# entry
# 0                66          104  1591360416    60   101
# 1                83          102  1591359404   130    98
```

and then get relative times with

```python
df.iloc[:, 2] - df.iloc[:, 2].xs(0, level=1)
# entry  subentry
# 0      0             0
#        1            60
#        2           120
# 1      0             0
#        1            60
#        2           120
#        3           240
# Name: (data,), dtype: int64
```

which you can use for the same kind of time-slicing I did above with Awkward Arrays. You might be able to do your whole analysis in Pandas. The thing is that Pandas only recognizes its own structures: there are operations for dealing with jagged arrays as a MultiIndex, but not when they're in a column. (When I talk about Awkward vs Pandas to physicists, I point out the fact that physics datasets have a lot of different nested jagged arrays, but a Pandas DataFrame can have only one MultiIndex, so physicists would be forced to use multiple DataFrames with frequent JOINs. It looks to me like your dataset has only one jagged array, so I think MultiIndex is an option for you.)

@martindurant might have other suggestions on using Awkward and Pandas together. I thought it was important to take the initial step of making Awkward Arrays a column type, but I don't know where to go from there—I don't know what would be the most useful way to use them in Pandas.

**Closing this issue**

I'm open to continuing conversation! I'm closing it now because I think it's done and I want to avoid a situation like that in the previous repo where issues remain open because they might not be done. It's for bookkeeping.
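As a hedged illustration of the "two parallel arrays" form at large scale: flat value arrays plus an offsets array (essentially the columnar layout Awkward uses internally) support fast per-patient operations with no Python-level loop over measurements. The variable names below are my own, with relative times from the example dataset:

```python
import numpy as np

# All measurements for all patients, flattened into parallel arrays.
times       = np.array([0, 60, 120, 300, 360, 0, 60, 120, 240])
heart_rates = np.array([60, 70, 50, 60, 60, 130, 120, 110, 90])

# offsets[i]:offsets[i+1] is patient i's slice of the flat arrays.
offsets = np.array([0, 5, 9])

# Per-patient minimum heart rate, computed in one vectorized call.
mins = np.minimum.reduceat(heart_rates, offsets[:-1])
print(mins.tolist())  # [50, 90]
```

Converting a nested JSON dataset into this form once, then working with the flat arrays, is one way to sidestep the per-object construction cost mentioned above.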
-
This is amazing, thanks for the incredibly thorough answer. I will go through it in detail over the weekend and have a look at how much this already covers my usual use cases.
-
**Problem**

I am looking into efficient ways to represent multi-subject, multivariate time series of arbitrary length and sampling frequency. For example, I have data on 100 patients in hospital (=subjects) for whom I collect data on heart rate and body temperature (multivariate). The measurements are taken at random times (i.e. whenever the nurse happens to check on one patient or the other), and the length of observation for each patient depends on their length of stay.

In a perfect world, I would represent such a structure as a 3D pandas dataframe, where rows correspond to patients, columns to variables, and the 3rd dimension is made up of timestamp-value pairs. Ideally this structure would also be performant (it can currently be approximated by having a "nested" dataframe where each cell is again a pandas series, but that can be fairly awkward - pun not intended - and slow).
**Necessary operators**

The most important operation on such a structure would be efficient slicing in time. For example, I often want to select all measurements taken within the first hour after admission to hospital. In a highly regular case where each measurement is taken, say, every minute for 24 hours, the above example could be represented as a 3D numpy array with dimensions 100 patients × 2 variables (heart rate and temperature) × 1440 minutes, and slicing for the first hour could be performed as

```python
arr[:, :, :60]
```
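For the regular case, this slice works exactly as described; a small runnable sketch with synthetic data:

```python
import numpy as np

# Hypothetical regular grid: 100 patients x 2 variables x 1440 minutes.
rng = np.random.default_rng(0)
arr = rng.normal(size=(100, 2, 1440))

first_hour = arr[:, :, :60]   # all patients, all variables, minutes 0-59
print(first_hour.shape)       # (100, 2, 60)
```

The difficulty is that irregular sampling breaks the fixed third dimension, which is where jagged arrays come in.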
**Question**
Can such a data structure be represented in Awkward 1? An Awkward Array seems like an ideal choice to provide a performant solution to this, and it can now also be embedded in pandas dataframes.

In the pre-1.0 awkward implementation, something like the above could be approximated by nesting `SparseArray`s within two `JaggedArray`s (one for each variable) and treating the indexes of the `SparseArray` as the timestamp (e.g. seconds), although printing wasn't ideal since all intermediate values (which for the most part are missing) were also printed.