Dframe proposal #19
Just general comments.
// located at the provided source.
//
// Possible drivers: hdf5, npyio, csv, json, hdfs, spark, sql, ...
func Open(drv, src string) (*Frame, error) { ... }
Would it make sense to have a version that accepts a Reader instance here? Or, to simply make the src type more liberal (an empty interface under the hood I suppose...)?
I am thinking about having a dfcsv package where some appropriate function would take the io.Reader:
package dfcsv // import "gonum.org/v1/exp/dframe/dfcsv"
func Read(r io.Reader, opts ...Option) (*Frame, error) { ... } // or name it Open?
and some database/sql-like driver registration code so one could write:
df, err := dframe.Open("csv", "file.csv")
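For illustration, a minimal sketch of what database/sql-like driver registration could look like. The `Driver` interface, the `Register` function, the toy `csvDriver` and the `Source` field are all assumptions made up for this example; none of them are part of the proposal.

```go
package main

import (
	"fmt"
	"sync"
)

// Frame is a stand-in for dframe.Frame; it only remembers its source here.
type Frame struct {
	Source string
}

// Driver is a hypothetical interface that a format package (csv, hdf5, ...)
// would implement and register.
type Driver interface {
	Open(src string) (*Frame, error)
}

var (
	mu      sync.RWMutex
	drivers = make(map[string]Driver)
)

// Register makes a driver available under the given name.
// Like database/sql.Register, it panics on a nil driver or a duplicate name.
func Register(name string, drv Driver) {
	mu.Lock()
	defer mu.Unlock()
	if drv == nil {
		panic("dframe: Register driver is nil")
	}
	if _, dup := drivers[name]; dup {
		panic("dframe: Register called twice for driver " + name)
	}
	drivers[name] = drv
}

// Open looks up the named driver and delegates to it.
func Open(drv, src string) (*Frame, error) {
	mu.RLock()
	d, ok := drivers[drv]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("dframe: unknown driver %q (forgotten import?)", drv)
	}
	return d.Open(src)
}

// csvDriver is a toy driver; a real one would actually parse the file.
type csvDriver struct{}

func (csvDriver) Open(src string) (*Frame, error) { return &Frame{Source: src}, nil }

// A real dfcsv package would do this in its own init function,
// so that a blank import is enough to enable the driver.
func init() { Register("csv", csvDriver{}) }

func main() {
	df, err := Open("csv", "file.csv")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("opened", df.Source)
}
```

With this layout the core package never imports the format packages, which keeps its dependency set small; the cost is that a missing import only shows up at run time as an "unknown driver" error.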
Dataframes are clearly important for the Go ecosystem. However, I do wonder if Gonum is really the right place for this package. Gonum is nice in that it is self-contained, relying only on the standard library and In a different repository there's a lot more freedom, even if it's many of the same people working to develop the codebase. I think the small set of dependencies is important to Gonum, and there are many API consistency considerations, especially involving memory allocation and the ability to make code run in parallel. A package outside of Gonum can easily break these constraints where reasonable to satisfy that package's goals. One idea for a name would be
keeping the amount of dependencies to a minimum is good practice. see:
$> function show-deps() {
    for pkg in $(go list ./...); do
        go list -f "{{range .Deps}}
{{.}}
{{- end}}" $pkg | grep "\." | grep -v "gonum.org";
    done | sort | uniq
}
$> cd $GOPATH/src/gonum.org/v1/gonum
$> show-deps
golang.org/x/exp/rand
golang.org/x/tools/container/intsets
$> cd ../plot
$> show-deps
github.com/ajstarks/svgo
github.com/golang/freetype
github.com/golang/freetype/raster
github.com/golang/freetype/truetype
github.com/jung-kurt/gofpdf
github.com/llgcode/draw2d
github.com/llgcode/draw2d/draw2dbase
github.com/llgcode/draw2d/draw2dimg
golang.org/x/image/draw
golang.org/x/image/font
golang.org/x/image/math/f64
golang.org/x/image/math/fixed
golang.org/x/image/tiff
golang.org/x/image/tiff/lzw
rsc.io/pdf
$> cd ../exp/dframe
$> show-deps
github.com/apache/arrow/go/arrow
github.com/apache/arrow/go/arrow/array
github.com/apache/arrow/go/arrow/internal/bitutil
github.com/apache/arrow/go/arrow/internal/cpu
github.com/apache/arrow/go/arrow/internal/debug
github.com/apache/arrow/go/arrow/memory
github.com/pkg/errors
my reasoning for trying to have
right now, under Gonum, we have the following active repos:
hardly a super crowded place :) that said, if many Gonum devs feel like
func (df *Frame) Exec(f func(tx *Tx) error) error { ... }

func example(df *dframe.Frame) {
	err := df.Exec(func(tx *dframe.Tx) error {
The need for reference counting in Arrow will probably incur some mental overhead and be a source of bugs since it's different from how memory is handled in Go normally. This would probably be a bigger issue in the case of an immutable dataframe than a mutable but it will always be present (right?).
I like the idea of a transaction, chained calls or not, immutable or not. Your suggestion would probably give the greatest flexibility, at the cost of some noise.
This is an alternative which removes some noise (function declaration and return) in client code at the cost of slightly less flexibility.
// Client code
df.Exec(dframe.Slice(0, 10), dframe.Select("col1", "col2"), dframe.Apply("col1 + col2"))

// In package dframe
type ExecFunc func(df *Frame)

func (df *Frame) Exec(ff ...ExecFunc) {
	/* apply functions, manage reference counts, etc */
}

func (df *Frame) doSelect(cols ...string) {
	// Select columns
}

func Select(cols ...string) ExecFunc {
	return func(df *Frame) {
		// Either manipulate df directly, or, as below, call package private method on df to do the real work.
		df.doSelect(cols...)
	}
}

// Slice takes a row range, matching the dframe.Slice(0, 10) call in the client code.
func Slice(beg, end int) ExecFunc {
	// Similar to above
}
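A runnable toy version of the variadic `ExecFunc` pattern may help to compare the two styles. The `Frame` layout (a list of column names plus a row range) is purely an assumption for this sketch, and reference counting is left out entirely.

```go
package main

import "fmt"

// Frame is a toy stand-in: column names plus a row range [beg, end).
type Frame struct {
	cols     []string
	beg, end int
}

// ExecFunc is one operation applied to a Frame.
type ExecFunc func(df *Frame)

// Exec applies the operations in order. A real implementation would
// also manage reference counts here.
func (df *Frame) Exec(ff ...ExecFunc) {
	for _, f := range ff {
		f(df)
	}
}

// Select keeps only the named columns (no validation in this sketch).
func Select(cols ...string) ExecFunc {
	return func(df *Frame) {
		df.cols = append([]string(nil), cols...)
	}
}

// Slice narrows the row range to [beg, end), relative to the current range.
func Slice(beg, end int) ExecFunc {
	return func(df *Frame) {
		df.beg, df.end = df.beg+beg, df.beg+end
	}
}

func main() {
	df := &Frame{cols: []string{"col1", "col2", "col3"}, beg: 0, end: 100}
	df.Exec(Slice(0, 10), Select("col1", "col2"))
	fmt.Println(df.cols, df.beg, df.end)
}
```

Note that `ExecFunc` as written returns nothing, so an operation has no way to report a failure or abort the sequence; that is one concrete form of the reduced flexibility compared to a closure that returns an error.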
> The need for reference counting in Arrow will probably incur some mental overhead and be a source of bugs
yeah... I tried at some point to get rid of it in Go Arrow, but I didn't find a way to support GPGPU (or any alternate memory backing store) and drop this ref-counting thing.
hiding the ref-counting behind the dframe.Tx transaction was my way to handle this (so one gets only 1 dframe.Frame whose Release() needs to be deferred, like the usual os.File.Close. This greatly reduces the number of possible code paths that can go wrong.)
> This is an alternative which removes some noise (function declaration and return) in client code at the cost of slightly less flexibility.
that's an interesting alternative. thanks for bringing it up!
it indeed reduces the amount of user-code "noise".
it creates and returns one closure/function per operation, compared to one big closure/function.
I must say I don't know whether that really matters performance wise (CPU, Mem) though...
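To make the idea of hiding the ref-counting behind a transaction concrete, here is a toy sketch. It does not use Arrow; the `buffer` type, its `Retain`/`Release` methods, the `temps` bookkeeping and `errAbort` are all assumptions invented for this example, and the real dframe.Tx may work quite differently.

```go
package main

import (
	"errors"
	"fmt"
)

// buffer is a toy ref-counted resource standing in for Arrow memory.
type buffer struct {
	refs int
	data []float64
}

func (b *buffer) Retain()  { b.refs++ }
func (b *buffer) Release() { b.refs-- }

// Frame owns exactly one reference to its current buffer.
type Frame struct{ buf *buffer }

func New(data []float64) *Frame { return &Frame{buf: &buffer{refs: 1, data: data}} }

// Release is the only call user code has to remember.
func (df *Frame) Release() { df.buf.Release() }

// Tx tracks every intermediate buffer so Exec can release them in one place.
type Tx struct {
	cur   *buffer
	temps []*buffer
}

// Slice creates a new intermediate buffer viewing rows [beg, end).
func (tx *Tx) Slice(beg, end int) {
	b := &buffer{refs: 1, data: tx.cur.data[beg:end]}
	tx.temps = append(tx.temps, b)
	tx.cur = b
}

// Exec runs f; on success the frame adopts the final buffer,
// on error the frame is left untouched. Intermediates are always released.
func (df *Frame) Exec(f func(tx *Tx) error) error {
	tx := &Tx{cur: df.buf}
	err := f(tx)
	if err == nil && tx.cur != df.buf {
		tx.cur.Retain()  // the frame takes its own reference
		df.buf.Release() // and drops the one to the old buffer
		df.buf = tx.cur
	}
	for _, b := range tx.temps {
		b.Release()
	}
	return err
}

var errAbort = errors.New("aborted")

func main() {
	df := New([]float64{1, 2, 3, 4, 5})
	defer df.Release()

	err := df.Exec(func(tx *Tx) error {
		tx.Slice(1, 4)
		tx.Slice(0, 2)
		return nil
	})
	fmt.Println(err, df.buf.data, df.buf.refs)

	err = df.Exec(func(tx *Tx) error {
		tx.Slice(0, 1)
		return errAbort // rolls back: df keeps its previous buffer
	})
	fmt.Println(err, df.buf.data, df.buf.refs)
}
```

The user-facing surface is a single `defer df.Release()`; every other retain and release happens inside `Exec`, which is the property argued for above.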
I would assume it should not matter performance wise for any but the smallest dataframes. But one should not assume things. :-)
Wow this looks incredibly promising so far!
}

// Column returns the i-th column of this Frame.
func (df *Frame) Column(i int) *array.Column {
One thing to consider here, is the signatures:
func (df *Frame) NumCols() int64
func (df *Frame) Column(i int)
func (df *Frame) Name(i int)
If someone calls NumCols() to get the number of columns and then uses that int64 to iterate over the columns by way of Column(i int), they are going to have a problem because it accepts an int.
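A small sketch of what this mismatch means for calling code. The toy `Frame` and the `string` return type of `Name` are assumptions for the example; only the `int64` and `int` parameter and result types come from the signatures above.

```go
package main

import "fmt"

// Frame is a toy stand-in that only stores column names.
type Frame struct{ names []string }

func (df *Frame) NumCols() int64    { return int64(len(df.names)) }
func (df *Frame) Name(i int) string { return df.names[i] }

// columnNames iterates with the int64 returned by NumCols.
func columnNames(df *Frame) []string {
	var out []string
	for i := int64(0); i < df.NumCols(); i++ {
		// df.Name(i) does not compile: i is an int64 and Name takes an int.
		// Go has no implicit numeric conversions, so the caller must convert.
		out = append(out, df.Name(int(i)))
	}
	return out
}

func main() {
	df := &Frame{names: []string{"col1", "col2"}}
	fmt.Println(columnNames(df))
}
```

Starting the loop with `i := 0` does not help either, because the comparison `i < df.NumCols()` then mixes `int` and `int64` and is also rejected by the compiler.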
you're right.
I've always been a bit split b/w using "natural" primitive type (such as int) and also being able to use and follow the arrow standard (ie: int64).
I put Column(int) because I'd somehow expected the columns would be stored as []someType...
perhaps one should just have NumCols() int.
On Sat, Apr 13, 2019 at 7:54 PM Nick Poorman ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In dframe/README.md
<#19 (comment)>:
> + return nil
+ })
+ if err != nil {
+ log.Fatal(err)
+ }
+}
+```
+
+Or, without a "chained methods" API:
+
+```go
+func example(df *dframe.Frame) {
+ err := df.Exec(func(tx *dframe.Tx) error {
+ tx.Slice(0, 10)
+ tx.Select("col1", "col2")
+ tx.Apply("col1 + col2")
I'm curious as to your thoughts on how the implementation for this should
be approached.
my hand-waving reply would be: an SQL-like engine to interpret the expression
and build some kind of AST that would be executed.
but that's as fuzzy as it sounds :)
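One way to make that less fuzzy, as a sketch only: an expression such as "col1 + col2" happens to be valid Go expression syntax, so the standard go/parser package can build the AST and a small tree walker can execute it. This swaps the SQL-like engine for Go's own expression grammar, which is an assumption of this example and not something the proposal commits to. The columnar `map[string][]float64` layout and the `Apply` signature are likewise made up for illustration.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// eval evaluates expr for row i, resolving identifiers as column names.
// Only identifiers, parentheses and + - * / are supported in this sketch.
func eval(expr ast.Expr, cols map[string][]float64, i int) (float64, error) {
	switch e := expr.(type) {
	case *ast.Ident:
		col, ok := cols[e.Name]
		if !ok {
			return 0, fmt.Errorf("unknown column %q", e.Name)
		}
		return col[i], nil
	case *ast.ParenExpr:
		return eval(e.X, cols, i)
	case *ast.BinaryExpr:
		x, err := eval(e.X, cols, i)
		if err != nil {
			return 0, err
		}
		y, err := eval(e.Y, cols, i)
		if err != nil {
			return 0, err
		}
		switch e.Op {
		case token.ADD:
			return x + y, nil
		case token.SUB:
			return x - y, nil
		case token.MUL:
			return x * y, nil
		case token.QUO:
			return x / y, nil
		}
		return 0, fmt.Errorf("unsupported operator %s", e.Op)
	}
	return 0, fmt.Errorf("unsupported expression %T", expr)
}

// Apply parses the expression once and evaluates it for each of the n rows.
func Apply(expr string, cols map[string][]float64, n int) ([]float64, error) {
	tree, err := parser.ParseExpr(expr)
	if err != nil {
		return nil, err
	}
	out := make([]float64, n)
	for i := range out {
		if out[i], err = eval(tree, cols, i); err != nil {
			return nil, err
		}
	}
	return out, nil
}

func main() {
	cols := map[string][]float64{
		"col1": {1, 2, 3},
		"col2": {10, 20, 30},
	}
	out, err := Apply("col1 + col2", cols, 3)
	fmt.Println(out, err)
}
```

Walking the tree once per row is simple but slow; an engine meant for large frames would more likely compile the AST into per-column operations, which is outside the scope of this sketch.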
…-s