[Experimental] An AbstractDataFrame that's a composite type with columns as type members #471

Closed
wants to merge 1 commit into from

@tshort tshort commented Jan 11, 2014

This is related to #451. CDataFrame is an AbstractDataFrame backed by composite types that are generated on the fly, with the columns stored directly as type members. This has some advantages:

  • You can directly access columns with df.colA.
  • Column access is quite fast.

It has some disadvantages:

  • You cannot do df["newcol"] = something. We would need an API that treats DataFrames as immutable. For example, to add a column, I used this: newdf = cdataframe(olddf, newcol = something).
  • It might eat up memory with all the type creation.
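
The trade-off in the two lists above (field-style column access, but no in-place column assignment) can be sketched in current Julia. This is only an illustration under stated assumptions, not the code in this pull request: a NamedTuple stands in for the generated composite type (NamedTuple did not exist when this was written), and `withcolumn` is a hypothetical helper name.

```julia
# A NamedTuple of vectors stands in for the generated composite type:
# each column is a field, and the field types are part of the table's type.
df = (a = collect(1:10), b = collect(11:20), c = collect(21:30))

# Field-style column access, analogous to df.colA.
df.a

# The container is immutable, so "adding" a column builds a new table.
# `withcolumn` is a hypothetical name used only for this sketch.
withcolumn(tbl::NamedTuple; cols...) = merge(tbl, (; cols...))

newdf = withcolumn(df, x = df.a .* df.c)

# The original table is unchanged; the new one has a different type.
keys(df)     # (:a, :b, :c)
keys(newdf)  # (:a, :b, :c, :x)
```

Every distinct set of column names and types yields a distinct concrete type here, which mirrors the memory concern about type creation noted above.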

Here are the results of some tests in test/cdataframe.jl:

julia> include("cdataframe.jl")
# standard DataFrame indexing:
elapsed time: 3.495338932 seconds (560157268 bytes allocated)  
# standard DataFrame indexing with a CDataFrame:
elapsed time: 3.249076907 seconds (176956 bytes allocated)  
# composite-style indexing:
elapsed time: 0.020856282 seconds (97116 bytes allocated)
# straight-vector indexing:
elapsed time: 0.018481679 seconds (94796 bytes allocated)

The composite-style indexing is nearly as fast as indexing with the raw vectors.

Note that I didn't implement everything needed for it to be an AbstractDataFrame. Here are some things that do work:

julia> d = cdataframe(DataFrame(a = 1:10, b = 11:20, c = 21:30))
10x3 CDataFrame63741
|-------|----|----|----|
| Row # | a  | b  | c  |
| 1     | 1  | 11 | 21 |
| 2     | 2  | 12 | 22 |
| 3     | 3  | 13 | 23 |
| 4     | 4  | 14 | 24 |
| 5     | 5  | 15 | 25 |
| 6     | 6  | 16 | 26 |
| 7     | 7  | 17 | 27 |
| 8     | 8  | 18 | 28 |
| 9     | 9  | 19 | 29 |
| 10    | 10 | 20 | 30 |

julia> d.a
10-element DataArray{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> d["a"]
10-element DataArray{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

julia> d[:,["a","c"]]
10x2 CDataFrame36884
|-------|----|----|
| Row # | a  | c  |
| 1     | 1  | 21 |
| 2     | 2  | 22 |
| 3     | 3  | 23 |
| 4     | 4  | 24 |
| 5     | 5  | 25 |
| 6     | 6  | 26 |
| 7     | 7  | 27 |
| 8     | 8  | 28 |
| 9     | 9  | 29 |
| 10    | 10 | 30 |

julia> d[1:2,["a","c"]]
2x2 CDataFrame1687
|-------|---|----|
| Row # | a | c  |
| 1     | 1 | 21 |
| 2     | 2 | 22 |

julia> d[1:2,"a"]
2-element DataArray{Int64,1}:
 1
 2

julia> cdataframe(d, x = d.a .* d.c)
WARNING: cbind is deprecated, use hcat instead.
 in cbind at deprecated.jl:8
 in cdataframe at /home/tshort/.julia/DataFrames/src/cdataframe.jl:41
10x4 CDataFrame62646
|-------|----|----|----|-----|
| Row # | a  | b  | c  | x   |
| 1     | 1  | 11 | 21 | 21  |
| 2     | 2  | 12 | 22 | 44  |
| 3     | 3  | 13 | 23 | 69  |
| 4     | 4  | 14 | 24 | 96  |
| 5     | 5  | 15 | 25 | 125 |
| 6     | 6  | 16 | 26 | 156 |
| 7     | 7  | 17 | 27 | 189 |
| 8     | 8  | 18 | 28 | 224 |
| 9     | 9  | 19 | 29 | 261 |
| 10    | 10 | 20 | 30 | 300 |

I'm not sure this is a good idea, but we do need some way to get these speed advantages and also to get df.colA.

@johnmyleswhite

This seems really cool. Thanks so much for working on it, @tshort.

I fully agree that we need more performance tuning of DataFrames. I'm not sure that I agree that we need df.colA indexing. I really want us to have exactly one way to index into a DataFrame. Having this alternative mechanism for indexing, which is faster than the other one, could become that one way, but then I'd want to remove df["colA"] completely.

The thing I want to do everything to avoid is a system in which we allow the existence of two completely different ways to do things, but only one of which is high performance. That's just a trap for people who aren't immersed enough in the internals to know that, even though we support X, one should never do X.

@kmsquire

+1 for the idea of being able to do df.colA to access a column, and requiring df[:,"colA"] to access the column using indexing.

As an aside, JuliaLang/julia#1974 is getting some push, which affects this discussion somewhat.

@simonster

Unfortunately this is not particularly constructive, since I don't have a solution that doesn't involve type inference or compiler changes. Even with CDataFrame, we still can't get good type information if the DataFrame is constructed in the same function where it's used, since the CDataFrame type won't even exist when the function is inferred. This is something we could think about revisiting with staged functions, although even then, handling cases where the field names and types cannot be known at compile time (e.g. readtable and data access in the same function) is kind of a nightmare.
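
The inference problem described in this comment can be illustrated with a rough sketch in current Julia. It rests on several assumptions: a NamedTuple again stands in for the generated CDataFrame type, and `build_and_sum` and `sum_first` are hypothetical names. The usual workaround shown here is a function barrier, which moves the typed work into a second function. It does not remove the underlying limitation.

```julia
# Column names arrive at run time (as they would from a file reader), so the
# concrete type of `tbl` depends on values the compiler cannot see while it
# compiles `build_and_sum`.
function build_and_sum(names::Vector{Symbol}, cols::Vector{Vector{Int}})
    tbl = NamedTuple{Tuple(names)}(Tuple(cols))
    # Function barrier: this call dispatches on the run-time type of `tbl`.
    return sum_first(tbl)
end

# Specialized separately for each concrete table type it is called with, so
# column access inside this function is fully typed.
sum_first(tbl::NamedTuple) = sum(tbl[1])

result = build_and_sum([:a, :b], [[1, 2, 3], [4, 5, 6]])
```

Any code that constructs the table and reads its columns in the same function body would still see an unknown type at the point of access, which is the case this comment calls out.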

@tshort tshort closed this Jan 20, 2014