[Experimental] An AbstractDataFrame that's a composite type with columns as type members #471

tshort · 2014-01-11T16:03:03Z

This is related to #451. CDataFrame is an AbstractDataFrame backed by a composite type that is generated on the fly, with the columns stored directly as type members. This has some advantages:

You can directly access columns with df.colA.
Column access is quite fast.

It has some disadvantages:

You cannot do df["newcol"] = something. We would need an API that treats DataFrames as immutable. For example, to add a column, I used this: newdf = cdataframe(olddf, newcol = something).
It might eat up memory with all the type creation.
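
For illustration only, the trade-off described above can be sketched with a NamedTuple in current Julia. This is an assumption-laden stand-in, not the CDataFrame code from this PR: the NamedTuple plays the role of the generated composite type, and merge plays the role of the column-adding constructor.

```julia
# Illustration only: a NamedTuple stands in for a composite type generated
# on the fly. This is not the CDataFrame implementation from this PR.
olddf = (a = collect(1:10), b = collect(11:20), c = collect(21:30))

# Columns are fields of the value, so column access is plain field access.
col_a = olddf.a

# olddf.x = ... would throw, because a NamedTuple cannot be modified.
# "Adding" a column instead builds a new value whose type has one more field.
newdf = merge(olddf, (x = olddf.a .* olddf.c,))

println(propertynames(olddf))
println(propertynames(newdf))
```

Each distinct set of column names and types is a distinct type here, which mirrors the concern about memory use from creating many types.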

Here are the results of some tests in test/cdataframe.jl:

julia> include("cdataframe.jl")
# standard DataFrame indexing:
elapsed time: 3.495338932 seconds (560157268 bytes allocated)  
# standard DataFrame indexing with a CDataFrame:
elapsed time: 3.249076907 seconds (176956 bytes allocated)  
# composite-style indexing:
elapsed time: 0.020856282 seconds (97116 bytes allocated)
# straight-vector indexing:
elapsed time: 0.018481679 seconds (94796 bytes allocated)

The composite-style indexing is nearly as fast as indexing with the raw vectors.
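
A rough sketch of why the two access styles can differ so much, written for current Julia. It is not the contents of test/cdataframe.jl; the Dict with untyped values is only an assumed stand-in for string-keyed column lookup, and the function names are made up for this example.

```julia
# Illustration only; not the benchmark from test/cdataframe.jl.
# A Dict with untyped values stands in for string-keyed column lookup.
bykey   = Dict{String,Any}("a" => rand(1000), "b" => rand(1000))
byfield = (a = bykey["a"], b = bykey["b"])

function sum_bykey(d, n)
    s = 0.0
    for i in 1:n
        s += d["a"][i] * d["b"][i]   # column type is unknown to the compiler
    end
    return s
end

function sum_byfield(d, n)
    s = 0.0
    for i in 1:n
        s += d.a[i] * d.b[i]         # column type is part of typeof(d)
    end
    return s
end

# Run once so that compilation is not included in the timings.
sum_bykey(bykey, 1000); sum_byfield(byfield, 1000)

@time sum_bykey(bykey, 1000)
@time sum_byfield(byfield, 1000)
```

Both functions compute the same sum; only the way the columns are reached differs.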

Note that I didn't implement everything needed for it to be an AbstractDataFrame. Here are some things that do work:

julia> d = cdataframe(DataFrame(a = 1:10, b = 11:20, c = 21:30))
10x3 CDataFrame63741
|-------|----|----|----|
| Row # | a  | b  | c  |
| 1     | 1  | 11 | 21 |
| 2     | 2  | 12 | 22 |
| 3     | 3  | 13 | 23 |
| 4     | 4  | 14 | 24 |
| 5     | 5  | 15 | 25 |
| 6     | 6  | 16 | 26 |
| 7     | 7  | 17 | 27 |
| 8     | 8  | 18 | 28 |
| 9     | 9  | 19 | 29 |
| 10    | 10 | 20 | 30 |

julia> d.a
10-element DataArray{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> d["a"]
10-element DataArray{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> d[:,["a","c"]]
10x2 CDataFrame36884
|-------|----|----|
| Row # | a  | c  |
| 1     | 1  | 21 |
| 2     | 2  | 22 |
| 3     | 3  | 23 |
| 4     | 4  | 24 |
| 5     | 5  | 25 |
| 6     | 6  | 26 |
| 7     | 7  | 27 |
| 8     | 8  | 28 |
| 9     | 9  | 29 |
| 10    | 10 | 30 |

julia> d[1:2,["a","c"]]
2x2 CDataFrame1687
|-------|---|----|
| Row # | a | c  |
| 1     | 1 | 21 |
| 2     | 2 | 22 |

julia> d[1:2,"a"]
2-element DataArray{Int64,1}:
 1
 2

julia> cdataframe(d, x = d.a .* d.c)
WARNING: cbind is deprecated, use hcat instead.
 in cbind at deprecated.jl:8
 in cdataframe at /home/tshort/.julia/DataFrames/src/cdataframe.jl:41
10x4 CDataFrame62646
|-------|----|----|----|-----|
| Row # | a  | b  | c  | x   |
| 1     | 1  | 11 | 21 | 21  |
| 2     | 2  | 12 | 22 | 44  |
| 3     | 3  | 13 | 23 | 69  |
| 4     | 4  | 14 | 24 | 96  |
| 5     | 5  | 15 | 25 | 125 |
| 6     | 6  | 16 | 26 | 156 |
| 7     | 7  | 17 | 27 | 189 |
| 8     | 8  | 18 | 28 | 224 |
| 9     | 9  | 19 | 29 | 261 |
| 10    | 10 | 20 | 30 | 300 |

I'm not sure this is a good idea, but we do need some way to get these speed advantages and also to get df.colA.

johnmyleswhite · 2014-01-11T16:12:05Z

This seems really cool. Thanks so much for working on it, @tshort.

I fully agree that we need more performance tuning of DataFrames. I'm not sure that I agree that we need df.colA indexing. I really want us to have exactly one way to index into a DataFrame. Having this alternative mechanism for indexing, which is faster than the other one, could become that one way, but then I'd want to remove df["colA"] completely.

What I want to avoid at all costs is a system that allows two completely different ways to do things, only one of which is high performance. That's just a trap for people who aren't immersed enough in the internals to know that, even though we support X, one should never do X.

kmsquire · 2014-01-13T19:44:59Z

+1 for the idea of being able to do df.colA to access a column, and requiring df[:,"colA"] to access the column using indexing.

As an aside, JuliaLang/julia#1974 is getting some push, which affects this discussion somewhat.

simonster · 2014-01-13T21:18:53Z

Unfortunately this is not particularly constructive, since I don't have a solution that doesn't involve type inference/compiler changes, but even with CDataFrame, we still can't get good type information if the DataFrame is constructed in the same function as it's used, since the CDataFrame type won't even exist when the function is inferred. This is something that we could think about revisiting with staged functions, although even then handling cases where the field names and types cannot be known at compile time (e.g. readtable and data access in the same function) is kind of a nightmare.
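
To make the inference problem concrete, here is a small sketch in current Julia. It again uses a NamedTuple as an assumed stand-in for a generated composite type; build, kernel, and build_and_use are hypothetical names. It shows the usual function-barrier workaround, which sidesteps the problem rather than solving it.

```julia
# Illustration only: column names are known only at run time, as they would
# be after reading a file.
function build(colnames::Vector{String}, cols::Vector)
    # The result type depends on run-time values, so it cannot be inferred.
    return NamedTuple{Tuple(Symbol.(colnames))}(Tuple(cols))
end

# The kernel is compiled separately for each concrete table type it receives.
kernel(t) = sum(t.a .* t.c)

function build_and_use(colnames, cols)
    t = build(colnames, cols)   # the compiler does not know the fields of t here
    return kernel(t)            # one dynamic dispatch, then specialized code
end

println(build_and_use(["a", "c"], [collect(1:10), collect(21:30)]))
```

Any code that touches the columns directly inside build_and_use would still run without useful type information, which is the case described above.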

Add an AbstractDataFrame that is a composite type with columns as type members.

dff0bc6

tshort closed this Jan 20, 2014

tshort mentioned this pull request Feb 2, 2014

Generic performance problems #523

Closed
