Dframe proposal #19

sbinet · 2019-01-11T17:12:02Z

No description provided.

dframe/README.md

kortschak

Just general comments.

dframe/README.md

dframe/dfmat/dfmat.go

dframe/dframe.go

dframe/dframe_example_test.go

dframe/dframe.go

dframe/dfmat/dfmat.go

tobgu · 2019-01-12T11:25:25Z

dframe/README.md

+// located at the provided source.
+//
+// Possible drivers: hdf5, npyio, csv, json, hdfs, spark, sql, ...
+func Open(drv, src string) (*Frame, error) { ... }


Would it make sense to have a version that accepts a Reader instance here? Or, to simply make the src type more liberal (an empty interface under the hood I suppose...)?

sbinet · 2019-01-14

I am thinking about having a dfcsv package where some appropriate function would take the io.Reader:

package dfcsv // import "gonum.org/v1/exp/dframe/dfcsv"

func Read(r io.Reader, opts ...Option) (*Frame, error) { ... } // or name it Open?

and some database/sql-like driver registration code so one could write:

df, err := dframe.Open("csv", "file.csv")
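The database/sql-like registration mentioned above could look roughly like the following. This is only a minimal sketch: the `Driver` interface, the `Register` function, the toy `csvDriver`, and the stand-in `Frame` type are all assumptions made for illustration and are not part of the proposal.

```go
package main

import (
	"fmt"
	"sync"
)

// Frame is a stand-in for dframe.Frame (hypothetical; the real type wraps Arrow data).
type Frame struct {
	Source string
}

// Driver is a hypothetical interface that a format package such as dfcsv would implement.
type Driver interface {
	Open(src string) (*Frame, error)
}

var (
	mu      sync.RWMutex
	drivers = make(map[string]Driver)
)

// Register makes a driver available under the given name.
// Like database/sql.Register, it panics on a nil or duplicate driver.
func Register(name string, drv Driver) {
	mu.Lock()
	defer mu.Unlock()
	if drv == nil {
		panic("dframe: Register driver is nil")
	}
	if _, dup := drivers[name]; dup {
		panic("dframe: Register called twice for driver " + name)
	}
	drivers[name] = drv
}

// Open looks up the named driver and delegates the actual work to it.
func Open(drv, src string) (*Frame, error) {
	mu.RLock()
	d, ok := drivers[drv]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("dframe: unknown driver %q (forgotten import?)", drv)
	}
	return d.Open(src)
}

// csvDriver is a toy driver; a real one would parse the file located at src.
type csvDriver struct{}

func (csvDriver) Open(src string) (*Frame, error) {
	return &Frame{Source: src}, nil
}

// In a real layout this init would live in the dfcsv package.
func init() { Register("csv", csvDriver{}) }

func main() {
	df, err := Open("csv", "file.csv")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("opened", df.Source)
}
```

With this layout a format package only needs a blank import (`import _ ".../dfcsv"`) to become reachable through `Open`, while an `io.Reader`-based `Read` function can still be exposed by the format package itself.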

btracey · 2019-01-12T12:12:35Z

Dataframes are clearly important for the Go ecosystem. However, I do wonder if Gonum is really the right place for this package. Gonum is nice in that it is self-contained, relying only on the standard library and exp/rand. This builds off several core pieces outside of what Go and Gonum provide, and thus would complicate the story of what Gonum "is". Beyond that, it seems to me the lessons of history point to keeping things contained and independent. The Go team has lamented that the standard library is as big as it is. The Python ecosystem is just that, an ecosystem: SciPy, scikit-learn, and pandas are all independent packages. Plotting in Python has seen a number of different attempts by different groups, which makes the story exciting. We even see this story within the scientific Go ecosystem. You cite two previous versions of dataframe packages, and Gorgonia has been very successful; while it is outside Gonum, it can still work with Gonum.

In a different repository there's a lot more freedom, even if it's many of the same people working to develop the codebase. I think the small set of dependencies is important to Gonum, and there are many API consistency considerations, especially involving memory allocation and the ability to make code run in parallel. A package outside of Gonum can easily break these constraints where reasonable to satisfy that package's goals. One idea for a name would be "kugos", which is the word for "bear" in Cebuano (and has an open GitHub address); another idea would be "makwa" (though not open), which means "bear" in Algonquin.

sbinet · 2019-01-14T09:44:05Z

@btracey

> I think the small set of dependencies is important to Gonum

keeping the number of dependencies to a minimum is good practice,
but I wouldn't use it to prevent dframe from finding a home in Gonum.

see:

$> function show-deps() {
  for pkg in $(go list ./...); do
    go list -f "{{range .Deps}}
{{.}}
{{- end}}" $pkg | grep "\." | grep -v "gonum.org"
  done | sort | uniq
}

$> cd $GOPATH/src/gonum.org/v1/gonum
$> show-deps
golang.org/x/exp/rand
golang.org/x/tools/container/intsets

$> cd ../plot
$> show-deps
github.com/ajstarks/svgo
github.com/golang/freetype
github.com/golang/freetype/raster
github.com/golang/freetype/truetype
github.com/jung-kurt/gofpdf
github.com/llgcode/draw2d
github.com/llgcode/draw2d/draw2dbase
github.com/llgcode/draw2d/draw2dimg
golang.org/x/image/draw
golang.org/x/image/font
golang.org/x/image/math/f64
golang.org/x/image/math/fixed
golang.org/x/image/tiff
golang.org/x/image/tiff/lzw
rsc.io/pdf

$> cd ../exp/dframe
$> show-deps
github.com/apache/arrow/go/arrow
github.com/apache/arrow/go/arrow/array
github.com/apache/arrow/go/arrow/internal/bitutil
github.com/apache/arrow/go/arrow/internal/cpu
github.com/apache/arrow/go/arrow/internal/debug
github.com/apache/arrow/go/arrow/memory
github.com/pkg/errors

my reasoning for trying to have dframe under the Gonum umbrella is:

- to make it adhere to the high standards of Gonum
- to make it an entry point for people, who will hopefully then diffuse into other Gonum repos
- to somewhat gather a critical mass of Gonum developers.

right now, under Gonum, we have the following active repos:

gonum.org/v1/exp
gonum.org/v1/gonum
gonum.org/v1/hdf5
gonum.org/v1/netlib
gonum.org/v1/plot
gonum.org/v1/tools

hardly a super crowded place :)

that said, if many Gonum devs feel like dframe has no business living under gonum.org, well, I'll try to find another home for it.

tobgu · 2019-01-14T21:34:12Z

dframe/README.md

+func (df *Frame) Exec(f func(tx *Tx) error) error { ... }
+
+func example(df *dframe.Frame) {
+	err := df.Exec(func(tx *dframe.Tx) error {


The need for reference counting in Arrow will probably incur some mental overhead and be a source of bugs, since it's different from how memory is normally handled in Go. This would probably be a bigger issue for an immutable dataframe than for a mutable one, but it will always be present (right?).

I like the idea of a transaction, chained calls or not, immutable or not. Your suggestion would probably give the greatest flexibility, at the cost of some noise.

This is an alternative which removes some noise (function declaration and return) in client code at the cost of slightly less flexibility.

// Client code
df.Exec(dframe.Slice(0, 10), dframe.Select("col1", "col2"), dframe.Apply("col1 + col2"))

// In package dframe
type ExecFunc func(df *Frame)

func (df *Frame) Exec(ff ...ExecFunc) { /* apply functions, manage reference counts, etc */ }

func (df *Frame) doSelect(cols ...string) {
	// Select columns
}

func Select(cols ...string) ExecFunc {
	return func(df *Frame) {
		// Either manipulate df directly, or, as below, call package private method on df to do the real work.
		df.doSelect(cols...)
	}
}

func Slice(beg, end int) ExecFunc {
	// Similar to above
}

sbinet · 2019-01-15

> The need for reference counting in Arrow will probably incur some mental overhead and be a source of bugs

yeah... I tried at some point to get rid of it in Go Arrow, but I didn't find a way to support GPGPU (or any alternate memory backing store) and drop this ref-counting thing.

hiding the ref-counting behind the dframe.Tx transaction was my way to handle this: one gets only one dframe.Frame whose Release() needs to be deferred, like the usual os.File.Close. This greatly reduces the number of possible code paths that can go wrong.
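To make the idea concrete, here is a toy sketch of reference counts hidden behind a transaction. It is not the proposal's implementation: `column` is a hypothetical stand-in for a ref-counted Arrow array (which does expose `Retain` and `Release`), and `Frame`, `Tx`, and `Select` are simplified for illustration.

```go
package main

import "fmt"

// column is a hypothetical stand-in for a reference-counted Arrow array.
type column struct {
	name string
	refs int
}

func (c *column) Retain()  { c.refs++ }
func (c *column) Release() { c.refs-- }

// Frame owns exactly one reference to each of its columns.
type Frame struct {
	cols []*column
}

// Tx tracks intermediate references so that Exec can release them in one place.
type Tx struct {
	cols  []*column
	temps []*column
}

// Select keeps only the named columns, retaining each one as an intermediate.
func (tx *Tx) Select(names ...string) {
	var out []*column
	for _, n := range names {
		for _, c := range tx.cols {
			if c.name == n {
				c.Retain()
				tx.temps = append(tx.temps, c)
				out = append(out, c)
			}
		}
	}
	tx.cols = out
}

// Exec runs f inside a transaction. On success the frame takes its own
// references to the result and drops the old ones; on error the frame is
// left untouched. Intermediates are released in both cases.
func (df *Frame) Exec(f func(tx *Tx) error) error {
	tx := &Tx{cols: df.cols}
	defer func() {
		for _, c := range tx.temps {
			c.Release()
		}
	}()
	if err := f(tx); err != nil {
		return err
	}
	for _, c := range tx.cols {
		c.Retain()
	}
	for _, c := range df.cols {
		c.Release()
	}
	df.cols = tx.cols
	return nil
}

// Release drops the frame's references; it is the only call user code defers.
func (df *Frame) Release() {
	for _, c := range df.cols {
		c.Release()
	}
	df.cols = nil
}

func main() {
	a := &column{name: "a", refs: 1}
	b := &column{name: "b", refs: 1}
	c := &column{name: "c", refs: 1}
	df := &Frame{cols: []*column{a, b, c}}
	defer df.Release()

	err := df.Exec(func(tx *Tx) error {
		tx.Select("a", "b")
		return nil
	})
	fmt.Println(err, a.refs, b.refs, c.refs)
}
```

User code never calls `Retain` or `Release` on a column directly; the only obligation left is the single deferred `df.Release()`.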

> This is an alternative which removes some noise (function declaration and return) in client code at the cost of slightly less flexibility.

that's an interesting alternative. thanks for bringing it up!
it indeed reduces the amount of user-code "noise".
it creates and returns one closure/function per operation, compared to one big closure/function.
I must say I don't know whether that really matters performance wise (CPU, Mem) though...

tobgu · 2019-01-15

I would assume it should not matter performance-wise for any but the smallest dataframes. But one should not assume things. :-)

Fixes gonum#18.

nickpoorman · 2019-03-23T23:08:02Z

Wow this looks incredibly promising so far!

nickpoorman · 2019-04-11T22:14:41Z

dframe/dframe.go

+}
+
+// Column returns the i-th column of this Frame.
+func (df *Frame) Column(i int) *array.Column {


One thing to consider here is the signatures:

func (df *Frame) NumCols() int64
func (df *Frame) Column(i int)
func (df *Frame) Name(i int)

If someone calls NumCols() to get the number of columns and then uses that int64 to iterate over the columns by way of Column(i int), they are going to have a problem because it accepts an int.

sbinet · 2019-04-16

you're right.

I've always been a bit split between using a "natural" primitive type (such as int) and being able to follow the Arrow standard (i.e. int64).

I put Column(int) because I'd somehow expected the columns would be stored as []someType...

perhaps one should just have NumCols() int.
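The mismatch discussed above can be illustrated with a small stand-alone sketch. The `Frame` type here is a hypothetical stand-in that only stores column names, and `NumCols64` is a name invented to show the Arrow-style `int64` variant next to the suggested `int` one.

```go
package main

import "fmt"

// Frame is a hypothetical stand-in that only stores column names.
type Frame struct {
	names []string
}

// NumCols64 mirrors the Arrow-style signature that returns an int64.
func (df *Frame) NumCols64() int64 { return int64(len(df.names)) }

// NumCols is the suggested variant that returns a plain int.
func (df *Frame) NumCols() int { return len(df.names) }

// Name returns the name of the i-th column.
func (df *Frame) Name(i int) string { return df.names[i] }

// namesVia64 iterates with an int64 counter, so every call to Name
// needs an explicit conversion; df.Name(i) would not compile.
func namesVia64(df *Frame) []string {
	var out []string
	for i := int64(0); i < df.NumCols64(); i++ {
		out = append(out, df.Name(int(i)))
	}
	return out
}

// namesViaInt needs no conversion because all signatures agree on int.
func namesViaInt(df *Frame) []string {
	out := make([]string, 0, df.NumCols())
	for i := 0; i < df.NumCols(); i++ {
		out = append(out, df.Name(i))
	}
	return out
}

func main() {
	df := &Frame{names: []string{"col1", "col2", "col3"}}
	fmt.Println(namesVia64(df))
	fmt.Println(namesViaInt(df))
}
```

Both loops produce the same names; the difference is only the conversion noise at each call site, which is the argument for keeping the count and the index types aligned.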

sbinet · 2019-04-16T07:35:52Z

On Sat, Apr 13, 2019 at 7:54 PM Nick Poorman ***@***.***> wrote:

> In dframe/README.md:
>
> > +		return nil
> > +	})
> > +	if err != nil {
> > +		log.Fatal(err)
> > +	}
> > +}
> > +```
> > +
> > +Or, without a "chained methods" API:
> > +
> > +```go
> > +func example(df *dframe.Frame) {
> > +	err := df.Exec(func(tx *dframe.Tx) error {
> > +		tx.Slice(0, 10)
> > +		tx.Select("col1", "col2")
> > +		tx.Apply("col1 + col2")
>
> I'm curious as to your thoughts on how the implementation for this should be approached.

my hand-waving reply would be: an SQL-like engine that interprets the expression and builds some kind of AST, which would then be executed. but that's as fuzzy as it sounds :)

…

-s
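As a rough illustration of the "parse the expression, build an AST, execute it" idea from the reply above, the sketch below borrows Go's own `go/parser` in place of a dedicated SQL-like parser. That substitution is an assumption made for brevity, as are the `Apply` and `eval` helpers and the use of plain `[]float64` slices instead of Arrow columns.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// eval evaluates the expression tree n element-wise over the named columns.
func eval(n ast.Expr, cols map[string][]float64) ([]float64, error) {
	switch n := n.(type) {
	case *ast.Ident:
		col, ok := cols[n.Name]
		if !ok {
			return nil, fmt.Errorf("unknown column %q", n.Name)
		}
		return col, nil
	case *ast.ParenExpr:
		return eval(n.X, cols)
	case *ast.BinaryExpr:
		x, err := eval(n.X, cols)
		if err != nil {
			return nil, err
		}
		y, err := eval(n.Y, cols)
		if err != nil {
			return nil, err
		}
		if len(x) != len(y) {
			return nil, fmt.Errorf("column length mismatch: %d vs %d", len(x), len(y))
		}
		var op func(a, b float64) float64
		switch n.Op {
		case token.ADD:
			op = func(a, b float64) float64 { return a + b }
		case token.SUB:
			op = func(a, b float64) float64 { return a - b }
		case token.MUL:
			op = func(a, b float64) float64 { return a * b }
		case token.QUO:
			op = func(a, b float64) float64 { return a / b }
		default:
			return nil, fmt.Errorf("unsupported operator %s", n.Op)
		}
		out := make([]float64, len(x))
		for i := range x {
			out[i] = op(x[i], y[i])
		}
		return out, nil
	}
	return nil, fmt.Errorf("unsupported expression %T", n)
}

// Apply parses expr into an AST and evaluates it against cols.
func Apply(expr string, cols map[string][]float64) ([]float64, error) {
	root, err := parser.ParseExpr(expr)
	if err != nil {
		return nil, err
	}
	return eval(root, cols)
}

func main() {
	cols := map[string][]float64{
		"col1": {1, 2, 3},
		"col2": {10, 20, 30},
	}
	sum, err := Apply("col1 + col2", cols)
	fmt.Println(sum, err)
}
```

A real engine would need its own grammar, type checking across Arrow data types, and null handling; the sketch only shows the overall parse-then-walk shape.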

sbinet requested a review from kortschak January 11, 2019 17:12

sbinet force-pushed the dframe-proposal branch from a8f00f7 to bab2444 Compare January 11, 2019 17:39

Soypete reviewed Jan 11, 2019

kortschak reviewed Jan 12, 2019

tobgu reviewed Jan 12, 2019

sbinet force-pushed the dframe-proposal branch from 3cd6367 to b25ef51 Compare January 14, 2019 09:12

sbinet force-pushed the dframe-proposal branch 2 times, most recently from f593cdb to 0838c52 Compare January 14, 2019 17:19

tobgu reviewed Jan 14, 2019

sbinet force-pushed the dframe-proposal branch from fc4fe4b to 64ebeea Compare January 15, 2019 09:10

sbinet added 3 commits January 15, 2019 10:15

ci: update Travis CI test suite to gonum/gonum standards

a6a8dea

Fixes gonum#18.

ci: reduce size of CI matrix

157435f

linsolve: fix copyrights

0bc07f5

sbinet force-pushed the dframe-proposal branch from 64ebeea to 912cc70 Compare January 15, 2019 09:26

sbinet added 14 commits January 15, 2019 11:09

exp: add godoc, coverage badges

c2bbe04

dframe: first stab to dframe proposal

539d5cb

dframe: flesh out proposal

4b2f72b

cleanup

affe29c

dframe: add example

b369015

dframe: cosmetics

6de2ece

dframe: add FromArrays, FromCols

aedf124

dframe: add FromFrame

99a7d50

dframe: make sure *Frame implements array.Table

964a9ce

dframe: add FromMem

3ec4386

dframe: rename Map to Dict

3ffc95a

dframe: add support for []int and []uint in FromMem

974c4db

dframe: introduce read/only and read/write transactions

67124b5

dframe: add import clause

8b840c3

sbinet added 8 commits January 15, 2019 11:10

dframe/dfmat: first import

a2278f6

all: apply gofmt simplify

40e9cec

dframe: explain the reason for dframe.Tx (ie: ref-counting)

ea4d8e7

dframe: cosmetics

76f4b62

PS 2

853ae0a

PS 3

23a82f4

PS 4

3d4f163

ci: drop 1.9.x because of Apache Arrow

15d72fa

sbinet force-pushed the dframe-proposal branch from 912cc70 to 15d72fa Compare January 15, 2019 10:10

dframe: create stable order of columns for FromMem

7741eff

nickpoorman suggested changes Apr 12, 2019
