Representing time series (timestamp-value pairs) #289
Replies: 2 comments
-
**Suggestion**

How about the following?

```python
import awkward as ak

dataset = [
    {"patient_name": "Bob",
     "room_number": 104,
     "data": [{"time": 1591360416, "heart_rate": 60, "temperature": 101},
              {"time": 1591360476, "heart_rate": 70, "temperature": 102},
              {"time": 1591360536, "heart_rate": 50, "temperature": 99},
              {"time": 1591360716, "heart_rate": 60, "temperature": 98},
              {"time": 1591360776, "heart_rate": 60, "temperature": 99}]},
    {"patient_name": "Sally",
     "room_number": 102,
     "data": [{"time": 1591359404, "heart_rate": 130, "temperature": 98},
              {"time": 1591359464, "heart_rate": 120, "temperature": 98},
              {"time": 1591359524, "heart_rate": 110, "temperature": 99},
              {"time": 1591359644, "heart_rate": 90, "temperature": 98}]}]

array = ak.Array(dataset)
# <Array [{patient_name: 'Bob', ... ] type='2 * {"patient_name": string, "room_num...'>
```

I've assumed that the timestamps start their lives as the number of seconds since 1970. If most of your queries are going to be relative to the first time, you could preserve the absolute offset yet make the times relative to the starting time with

```python
array["data", "first_time"] = array["data", "time", :, 0]
array["data", "time"] = array["data", "time"] - array["data", "first_time"]
```

(I have assumed that the times within each subarray are sorted, or at the very least that index 0 is the earliest.)

Switching to attribute access because it's more convenient (square-bracket access is necessary for assignment; see #273), we can select times relative to the first with NumPy-style selectors. If we didn't have relative times, this would just be a little more complicated (we'd have to subtract each patient's first time before comparing).

```python
array.data.heart_rate[array.data.time < 150]
# <Array [[60, 70, 50], [130, 120, 110]] type='2 * var * int64'>
```
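The same relative-time masking can be sketched without Awkward at all: for a single patient, two parallel NumPy arrays behave the same way. This is a minimal illustration, using Bob's measurements from the example dataset above:

```python
import numpy as np

# One patient's measurements as two parallel arrays (values from the example).
time = np.array([1591360416, 1591360476, 1591360536, 1591360716, 1591360776])
heart_rate = np.array([60, 70, 50, 60, 60])

# Make times relative to the first measurement, then select with a boolean mask.
rel_time = time - time[0]          # [0, 60, 120, 300, 360]
early = heart_rate[rel_time < 150]
print(early.tolist())              # [60, 70, 50]
```

Awkward generalizes this pattern to many patients at once by letting the mask be jagged, one boolean subarray per patient.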
```python
array.data.temperature[array.data.time < 150]
# <Array [[101, 102, 99], [98, 98, 99]] type='2 * var * int64'>
```

These expressions have lost metadata (patient name, room number). You could keep the metadata by doing all of the operations in place, like the absolute → relative time conversion, but it's probably safer bookkeeping to create new arrays with the metadata "zipped" in.

```python
ak.zip({"room_number": array.room_number,
        "heart_rate": array.data.heart_rate[array.data.time < 150]})
# <Array [[{room_number: 104, ... ] type='2 * var * {"room_number": int64, "heart_...'>
```

I see that you're thinking of time as part of the index. That is a Pandas way of thinking—I get it—but that's going against the grain in NumPy/Awkward. In Pandas terms, we've made the time a column. Whether that matters for performance depends on what you're doing with it. As an index, searching through a huge number of times can take O(log(N)) time, rather than O(N) time, but that's assuming you're looking for one time or one time interval, not one time per patient. How much this matters depends on the relative scales of the number of patients and the number of measurements. As a column, you have some more flexibility: inequalities like `array.data.time < 150` can select a different window for each patient, and reductions give per-patient summaries:

```python
ak.min(array.data.time, axis=1)
# <Array [1591360416, 1591359404] type='2 * ?int64'>
```

SparseArray from Awkward 0 approximated the access-via-index way of thinking from Pandas, but it was not carried over into Awkward 1.
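To make the index-versus-column tradeoff concrete: if the time column is kept sorted, a binary search recovers the O(log(N)) interval lookup even without a Pandas index. A sketch with `np.searchsorted`, using Bob's relative times from above:

```python
import numpy as np

time = np.array([0, 60, 120, 300, 360])        # sorted relative times (seconds)
temperature = np.array([101, 102, 99, 98, 99])  # measurements at those times

# Binary search (O(log N)) for the half-open interval [0, 150).
lo = np.searchsorted(time, 0, side="left")
hi = np.searchsorted(time, 150, side="left")
print(temperature[lo:hi].tolist())              # [101, 102, 99]
```

This only answers one interval query at a time, though; the boolean-mask approach above answers one query per patient in a single vectorized step.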
If you didn't like seeing the zeros between time points with measurements, then SparseArray wasn't the right data structure for you. Passing around two arrays of the same length, one with time values, the other with measurements at those times, is a very common way to work in the NumPy world, and it's probably not uncommon to have "time" as a column in a Pandas DataFrame.

**Getting data into this (or a similar) form**

Depending on how large your dataset is, turning it into JSON or Python objects and passing them into the `ak.Array` constructor might be prohibitive. (The scale where that might start to matter is 10's to 100's of GB.) When you've decided what form you want your data to have, by playing around with small samples, I can help you with a large-scale conversion if the scale is large enough to matter and if I know what form it's starting in.

**Involving Pandas**

I'm not sure how much this would involve Pandas. Awkward 1 is Pandasable by default because it seemed like that would be a good idea, though I find it hard to see how the suite of operations Pandas provides mixes well with Awkward's view of the world. You can put these things into Pandas with

```python
pd.DataFrame({"everything": array})
#                                           everything
# 0  ... temperature: 98}, {time: 1591360776, heart...
# 1  ... temperature: 99}, {time: 1591359644, heart...
```

but I don't see what can be done with it in such an opaque form. Maybe this?

```python
pd.DataFrame({"patient_name": array.patient_name,
              "room_number": array.room_number,
              "heart_rate": array.data.heart_rate,
              "temperature": array.data.temperature})
#   patient_name  room_number            heart_rate             temperature
# 0          Bob          104  [60, 70, 50, 60, 60]  [101, 102, 99, 98, 99]
# 1        Sally          102   [130, 120, 110, 90]        [98, 98, 99, 98]
```

We'd like to do `df.temperature[:, 0]` but Pandas complains because it believes the data in each column is a scalar. ("Can only tuple-index with a MultiIndex.") We could go "full Pandas" with something like

```python
ak.pandas.df(array)
#                 patient_name  room_number        data  data  data
# entry subentry
# 0     0                   66          104  1591360416    60   101
#       1                  111          104  1591360476    70   102
#       2                   98          104  1591360536    50    99
# 1     0                   83          102  1591359404   130    98
#       1                   97          102  1591359464   120    98
#       2                  108          102  1591359524   110    99
#       3                  108          102  1591359644    90    98
```

(Those last three columns are wrong: looks like a bug. They ought to be MultiIndex `("data", "time")`, `("data", "heart_rate")`, `("data", "temperature")`.) Then you'd be able to get the first time for each patient with

```python
df.xs(0, level=1)
#        patient_name  room_number        data  data  data
# entry
# 0                66          104  1591360416    60   101
# 1                83          102  1591359404   130    98
```

and then get relative times with

```python
df.iloc[:, 2] - df.iloc[:, 2].xs(0, level=1)
# entry  subentry
# 0      0             0
#        1            60
#        2           120
# 1      0             0
#        1            60
#        2           120
#        3           240
# Name: (data,), dtype: int64
```

which you can use for the same kind of time-slicing I did above with Awkward Arrays. You might be able to do your whole analysis in Pandas. The thing is that Pandas only recognizes its own structures: there are operations for dealing with jagged arrays as a MultiIndex, but not when they're in a column. (When I talk about Awkward vs Pandas to physicists, I point out the fact that physics datasets have a lot of different nested jagged arrays, but a Pandas DataFrame can have only one MultiIndex, so physicists would be forced to use multiple DataFrames with frequent JOINs. It looks to me like your dataset has only one jagged array, so I think MultiIndex is an option for you.)

@martindurant might have other suggestions on using Awkward and Pandas together. I thought it was important to take the initial step of making Awkward Arrays a column type, but I don't know where to go from there—I don't know what would be the most useful way to use them in Pandas.

**Closing this issue**

I'm open to continuing conversation! I'm closing it now because I think it's done and I want to avoid a situation like that in the previous repo where issues remain open because they might not be done. It's for bookkeeping.
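As a hedged illustration of the "two parallel arrays" form at large scale: flat value arrays plus an offsets array (essentially the columnar layout Awkward uses internally) support fast per-patient operations with no Python-level loop over measurements. The variable names below are my own, with relative times from the example dataset:

```python
import numpy as np

# All measurements for all patients, flattened into parallel arrays.
times       = np.array([0, 60, 120, 300, 360, 0, 60, 120, 240])
heart_rates = np.array([60, 70, 50, 60, 60, 130, 120, 110, 90])

# offsets[i]:offsets[i+1] is patient i's slice of the flat arrays.
offsets = np.array([0, 5, 9])

# Per-patient minimum heart rate, computed in one vectorized call.
mins = np.minimum.reduceat(heart_rates, offsets[:-1])
print(mins.tolist())  # [50, 90]
```

Converting a nested JSON dataset into this form once, then working with the flat arrays, is one way to sidestep the per-object construction cost mentioned above.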
-
This is amazing, thanks for the incredibly thorough answer. I will go through it in detail over the weekend and have a look at how much this already covers my usual use cases.
-
**Problem**

I am looking into efficient ways to represent multi-subject, multivariate time series of arbitrary length and sampling frequency. For example, I have data on 100 patients in hospital (=subjects) for whom I collect data on heart rate and body temperature (multivariate). The measurements are taken at random times (i.e. whenever the nurse happens to check on one patient or the other), and the length of observation for each patient depends on their length of stay.

In a perfect world, I would represent such a structure as a 3D pandas dataframe, where rows correspond to patients, columns to variables, and the 3rd dimension is made up of timestamp-value pairs. Ideally this structure would also be performant (it can currently be approximated by having a "nested" dataframe where each cell is again a pandas series, but that can be fairly awkward - pun not intended - and slow).
**Necessary operators**

The most important operation on such a structure would be efficient slicing in time. For example, I often want to select all measurements taken within the first hour after admission to hospital. In a highly regular case where each measurement is taken, say, every minute for 24 hours, the above example could be represented as a 3D numpy array with dimensions 100 patients × 2 variables (heart rate and temperature) × 1440 minutes, and slicing for the first hour could be performed as

```python
arr[:, :, :60]
```
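For the regular case, this slice works exactly as described; a small runnable sketch with synthetic data:

```python
import numpy as np

# Hypothetical regular grid: 100 patients x 2 variables x 1440 minutes.
rng = np.random.default_rng(0)
arr = rng.normal(size=(100, 2, 1440))

first_hour = arr[:, :, :60]   # all patients, all variables, minutes 0-59
print(first_hour.shape)       # (100, 2, 60)
```

The difficulty is that irregular sampling breaks the fixed third dimension, which is where jagged arrays come in.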
**Question**
Can such a data structure be represented in Awkward 1? An Awkward Array seems like an ideal choice to provide a performant solution to this, and it can now also be embedded in pandas dataframes.

In the pre-1.0 awkward implementation, something like the above could be approximated by nesting `SparseArray`s within two `JaggedArray`s (one for each variable) and treating the indexes of the `SparseArray` as the timestamp (e.g. seconds), although printing wasn't ideal since all intermediate values (which for the most part are missing) were also printed.