[Code wide] Fix pending todo + schema + several small improvements #141

Merged: achoum merged 20 commits into main from gbm on Jun 6, 2023

@achoum achoum commented May 29, 2023

  • Add an EventSet constructor that does not rely on Pandas. Add an argument to create multiple Nodes or EventSets with the same sampling.
  • Add a "source_node()" function to event sets.
  • Rename "tp.input_node" to "tp.source_node".
  • Definition and use of "schema" in the Node and EventSet.
  • Add utility functions to handle Nodes and EventSets.
  • Update of all ops and tests with new schema system.
  • Various code optimizations e.g. removed unnecessary sorts, list creation, dictionary query, shuffle for-loops, etc.
  • Various code cleaning e.g. simplify unit tests.
  • Various code improvements e.g. more checks, fix annotations, remove redundant code, de-duplicate default arguments, error messages, unit tests utility, solve some memory handling errors, remove private attribute usage, simplify code.
  • Remove argument type overloading in all operator / operator implementation constructors.
  • set_index now keeps the removed indexes, i.e., it converts them into features.
  • Rename sample operator to resample.
  • Remove @property on all non-O(1) time-cost functions e.g. node.schema.feature_names() instead of node.schema.feature_names.
  • Normalize data dtypes across all operators / utilities.
  • Update pandas dataframe importer to use event set user constructor.
  • Support for List[DTYPE] in operator attributes.
  • Remove support for list of durations in lag and leak operators.
  • Split lag and leak into two separate operators.
  • drop_index adds the new feature at the end.
  • add_index adds the new index at the end.
  • Remove indexing implementation in Pandas importer. Instead, importing an EventSet (directly or with a pd dataframe) uses the add_index operator internally.
  • The evaluation algorithm no longer relies on the nodes' string representation to index the nodes.
  • Remove support for named nodes in evaluation. To be revisited.
  • The Graph object supports named and unnamed nodes. The algorithm to handle node names is separated from the graph inference algorithm. Named nodes are only required for serialization (so far).
  • Add support for "evaluate" on event set directly i.e. the source node is extracted automatically.
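
The @property change listed above can be illustrated with a small sketch. `ToySchema` and its `num_features` property are hypothetical stand-ins invented for this example, not Temporian's actual classes; only the `feature_names()` call style comes from the description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToySchema:
    """Hypothetical stand-in for a schema; not Temporian's actual class."""

    # Ordered (name, dtype) pairs.
    features: Tuple[Tuple[str, str], ...]

    def feature_names(self) -> List[str]:
        # Builds a new list on every call, so the cost grows with the
        # number of features. A method call (with parentheses) makes
        # that cost visible at the call site.
        return [name for name, _ in self.features]

    @property
    def num_features(self) -> int:
        # len() on a tuple is O(1), so a property is still reasonable.
        return len(self.features)


schema = ToySchema(features=(("price", "float64"), ("volume", "int64")))
names = schema.feature_names()
count = schema.num_features
```

The idea is that attribute-style access reads as free, so it is reserved for constant-time lookups, while anything that allocates or iterates is written as a call.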

Benchmark

Main observations.

  • No change in most ops.
  • from_dataframe: 17x speed-up
  • add_index / set_index: ranges from a 1.8x slow-down to a 2x speed-up depending on the experiment. A TODO was added in the code for further improvement.

AFTER

================================================================
Name                              Wall time (s)    CPU time (s)
================================================================
from_dataframe:s:10_000_numidx:0_numidxval:20_idx:int       0.00030       0.00030
from_dataframe:s:10_000_numidx:0_numidxval:20_idx:str       0.00028       0.00028
from_dataframe:s:10_000_numidx:1_numidxval:20_idx:int       0.00280       0.00280
from_dataframe:s:10_000_numidx:1_numidxval:20_idx:str       0.00577       0.00578
from_dataframe:s:10_000_numidx:3_numidxval:20_idx:int       0.03032       0.03031
from_dataframe:s:10_000_numidx:3_numidxval:20_idx:str       0.03728       0.03728
from_dataframe:s:10_000_numidx:5_numidxval:20_idx:int       0.04636       0.04634
from_dataframe:s:10_000_numidx:5_numidxval:20_idx:str       0.06261       0.06258
----------------------------------------------------------------
simple_moving_average:100            0.00059       0.00059
simple_moving_average:10_000         0.00091       0.00091
simple_moving_average:1_000_000       0.00597       0.00597
----------------------------------------------------------------
select_and_glue:100                  0.00043       0.00043
select_and_glue:10_000               0.00292       0.00292
select_and_glue:1_000_000            0.00063       0.00063
----------------------------------------------------------------
calendar_day_of_month:100            0.00009       0.00009
calendar_day_of_month:10_000         0.00487       0.00487
calendar_day_of_month:1_000_000       0.50778       0.50776
----------------------------------------------------------------
sample:e100_s100                     0.00065       0.00065
sample:e100_s10_000                  0.00282       0.00282
sample:e100_s1_000_000               0.00356       0.00356
sample:e10_000_s100                  0.00065       0.00065
sample:e10_000_s10_000               0.00096       0.00096
sample:e10_000_s1_000_000            0.00460       0.00460
sample:e1_000_000_s100               0.00086       0.00086
sample:e1_000_000_s10_000            0.00137       0.00137
sample:e1_000_000_s1_000_000         0.00685       0.00685
----------------------------------------------------------------
propagate:100                        0.00034       0.00034
propagate:10_000                     0.00048       0.00048
propagate:1_000_000                  0.00048       0.00048
----------------------------------------------------------------
cast(check=False):100                0.00024       0.00024
cast(check=True):100                 0.00091       0.00091
cast(check=False):1000000            0.00163       0.00163
cast(check=True):1000000             0.00315       0.00315
----------------------------------------------------------------
unique_timestamps:100                0.00047       0.00047
unique_timestamps:10000              0.00072       0.00072
unique_timestamps:1000000            0.00672       0.00672
----------------------------------------------------------------
add_index:s:10_000:num_idx:1         0.00447       0.00447
add_index:s:10_000:num_idx:2         0.01682       0.01682
add_index:s:10_000:num_idx:3         0.04388       0.04388
add_index:s:10_000:num_idx:4         0.04627       0.04627
add_index:s:10_000:num_idx:5         0.04057       0.04056
add_index:s:100_000:num_idx:1        0.03086       0.03085
add_index:s:100_000:num_idx:2        0.04657       0.04657
add_index:s:100_000:num_idx:3        0.17663       0.17663
add_index:s:100_000:num_idx:4        0.47043       0.47041
add_index:s:100_000:num_idx:5        0.46969       0.46964
add_index:s:1_000_000:num_idx:1       0.30103       0.30101
add_index:s:1_000_000:num_idx:2       0.34583       0.34582
add_index:s:1_000_000:num_idx:3       0.51490       0.51486
add_index:s:1_000_000:num_idx:4       1.89817       1.89801
add_index:s:1_000_000:num_idx:5       4.92727       4.92624
================================================================

BEFORE

================================================================
Name                              Wall time (s)    CPU time (s)
================================================================
from_dataframe:s:10_000_numidx:0_numidxval:20_idx:int       0.00041       0.00041
from_dataframe:s:10_000_numidx:0_numidxval:20_idx:str       0.00040       0.00040
from_dataframe:s:10_000_numidx:1_numidxval:20_idx:int       0.00188       0.00188
from_dataframe:s:10_000_numidx:1_numidxval:20_idx:str       0.00489       0.00489
from_dataframe:s:10_000_numidx:3_numidxval:20_idx:int       0.33466       0.33460
from_dataframe:s:10_000_numidx:3_numidxval:20_idx:str       0.54433       0.54430
from_dataframe:s:10_000_numidx:5_numidxval:20_idx:int       0.59611       0.59608
from_dataframe:s:10_000_numidx:5_numidxval:20_idx:str       1.11714       1.11645
----------------------------------------------------------------
simple_moving_average:100            0.00055       0.00055
simple_moving_average:10_000         0.00085       0.00085
simple_moving_average:1_000_000       0.00752       0.00752
----------------------------------------------------------------
select_and_glue:100                  0.00034       0.00034
select_and_glue:10_000               0.00047       0.00047
select_and_glue:1_000_000            0.00045       0.00045
----------------------------------------------------------------
calendar_day_of_month:100            0.00008       0.00008
calendar_day_of_month:10_000         0.00479       0.00479
calendar_day_of_month:1_000_000       0.47936       0.47933
----------------------------------------------------------------
sample:e100_s100                     0.00061       0.00061
sample:e100_s10_000                  0.00079       0.00079
sample:e100_s1_000_000               0.00351       0.00351
sample:e10_000_s100                  0.00059       0.00059
sample:e10_000_s10_000               0.00095       0.00095
sample:e10_000_s1_000_000            0.00469       0.00469
sample:e1_000_000_s100               0.00087       0.00087
sample:e1_000_000_s10_000            0.00140       0.00140
sample:e1_000_000_s1_000_000         0.00718       0.00718
----------------------------------------------------------------
propagate:100                        0.00033       0.00033
propagate:10_000                     0.00047       0.00047
propagate:1_000_000                  0.00045       0.00045
----------------------------------------------------------------
cast(check=False):100                0.00021       0.00021
cast(check=True):100                 0.00100       0.00099
cast(check=False):1000000            0.00153       0.00153
cast(check=True):1000000             0.00325       0.00325
----------------------------------------------------------------
unique_timestamps:100                0.00052       0.00051
unique_timestamps:10000              0.00077       0.00077
unique_timestamps:1000000            0.00725       0.00725
----------------------------------------------------------------
set_index:s:10_000:num_idx:1:append:False       0.01329       0.01329
set_index:s:10_000:num_idx:2:append:False       0.01483       0.01483
set_index:s:10_000:num_idx:3:append:False       0.02704       0.02704
set_index:s:10_000:num_idx:4:append:False       0.07546       0.07546
set_index:s:10_000:num_idx:5:append:False       0.07996       0.07995
set_index:s:100_000:num_idx:1:append:False       0.16087       0.16086
set_index:s:100_000:num_idx:2:append:False       0.16262       0.16262
set_index:s:100_000:num_idx:3:append:False       0.15592       0.15592
set_index:s:100_000:num_idx:4:append:False       0.24693       0.24692
set_index:s:100_000:num_idx:5:append:False       0.93581       0.93576
set_index:s:1_000_000:num_idx:1:append:False       2.38795       2.38752
set_index:s:1_000_000:num_idx:2:append:False       2.45258       2.44883
set_index:s:1_000_000:num_idx:3:append:False       1.52917       1.52897
set_index:s:1_000_000:num_idx:4:append:False       1.54799       1.54789
set_index:s:1_000_000:num_idx:5:append:False       2.73235       2.73212
================================================================
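
The ratios quoted in the main observations can be recomputed from the two tables. The sketch below copies three wall times from each table; pairing the set_index rows (before) with the add_index rows (after) is an assumption based on the "add_index / set_index" line, and the dictionary keys are labels made up for this example.

```python
# Wall times in seconds, copied from the BEFORE and AFTER tables above.
# Pairing set_index (before) with add_index (after) rows is an assumption
# based on the "add_index / set_index" line in the main observations.
BEFORE = {
    "from_dataframe s:10_000 numidx:5 idx:str": 1.11714,
    "index s:1_000_000 num_idx:1": 2.38795,
    "index s:1_000_000 num_idx:5": 2.73235,
}
AFTER = {
    "from_dataframe s:10_000 numidx:5 idx:str": 0.06261,
    "index s:1_000_000 num_idx:1": 0.30103,
    "index s:1_000_000 num_idx:5": 4.92727,
}


def describe(before_s: float, after_s: float) -> str:
    """Formats the change as an "Nx speed-up" or "Nx slow-down"."""
    if after_s <= before_s:
        return f"{before_s / after_s:.1f}x speed-up"
    return f"{after_s / before_s:.1f}x slow-down"


summary = {name: describe(BEFORE[name], AFTER[name]) for name in BEFORE}
for name, text in summary.items():
    print(f"{name}: {text}")
```

For these three rows the result is about 17.8x faster for from_dataframe, about 7.9x faster for the one-index case, and about 1.8x slower for the five-index case at one million events. Other rows give different ratios.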

@achoum achoum changed the title from "Code wide fix pending TODOs and improvements" to "[Code wide] Fix pending todo + schema + several small improvements" May 29, 2023
@achoum achoum marked this pull request as ready for review May 30, 2023 08:56

@ianspektor ianspektor left a comment

Need to leave for today, will finish my review tomorrow, but here is part of it :) Tons of comments, but mostly nitty stuff.

Truly, truly impressive job. 100% sure working with our codebase will be more pleasant once we merge this. Thank you! 👏🏼

Some more important and more general comments/questions:

  • There are lots of unused imports in several files (see temporian/core/operators/lag.py as an example). Tools like Pylint and Pylance both show warnings to help avoid these.
  • "Remove support for list of durations in lag and leak operators."
    • Why? Creating several lags with different durations is a very common use case, the shorthand was valuable
  • "Remove support for named nodes in evaluation. To be thought again."
    • Removing a feature we decided was worth implementing isn't great (even knowing the implementation wasn't great either). A TODO or a proposal on how to redo it would be good replacements :)
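
If the list shorthand stays removed, callers could rebuild it on their side with a comprehension over the single-duration operator. The sketch below uses `toy_lag`, a stand-in written for this example that assumes a lag shifts each timestamp forward by the duration; it is not Temporian's operator or API.

```python
from typing import Dict, List, Sequence, Tuple

Event = Tuple[float, float]  # (timestamp, value)


def toy_lag(events: Sequence[Event], duration: float) -> List[Event]:
    """Toy stand-in: shifts every timestamp forward by `duration`."""
    if duration <= 0:
        raise ValueError("duration must be strictly positive")
    return [(timestamp + duration, value) for timestamp, value in events]


def toy_lags(
    events: Sequence[Event], durations: Sequence[float]
) -> Dict[float, List[Event]]:
    """Applies the single-duration lag once per requested duration."""
    return {duration: toy_lag(events, duration) for duration in durations}


events = [(0.0, 10.0), (1.0, 11.0)]
lagged = toy_lags(events, [1.0, 7.0])
```

Keeping the operator single-valued and moving the loop into a helper keeps the operator's signature simple, at the cost of one extra line for callers who want several lags.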

Resolved review threads:
  • temporian/__init__.py (outdated)
  • benchmark/benchmark_time.py
  • temporian/core/data/dtype.py (outdated)
  • temporian/core/data/duration.py (outdated)
  • temporian/core/data/duration.py (outdated)
  • temporian/core/evaluation.py
  • temporian/core/graph.py (outdated)
  • temporian/core/graph.py (outdated)
  • temporian/core/operators/set_index.py (outdated)
  • temporian/core/operators/binary/base.py (outdated)

achoum commented May 31, 2023

Thanks for the initial review. Let me address and solve the nits.

For the larger questions, I've added some material to our next meeting agenda. We can go over it and validate / invalidate those changes.

@DonBraulio DonBraulio mentioned this pull request May 31, 2023

@ianspektor ianspektor left a comment

A couple more comments 😅 again, lots of nits, docs and typos, few big ones.

Thanks again for this huge amount of work 👏🏼

Resolved review threads:
  • temporian/core/operators/calendar/base.py (outdated)
  • temporian/core/operators/cast.py (outdated)
  • temporian/core/operators/cast.py (outdated)
  • temporian/core/operators/lag.py (outdated)
  • temporian/core/operators/leak.py (outdated)
  • temporian/implementation/numpy/operators/drop_index.py (outdated)
  • temporian/implementation/numpy/operators/glue.py (outdated)
  • temporian/implementation/numpy/operators/test/test_util.py (outdated)
  • temporian/proto/core.proto
  • tools/create_operator.py (outdated)

@achoum achoum requested a review from rstz June 1, 2023 06:59

Resolved review threads:
  • benchmark/benchmark_time.py
  • temporian/core/data/duration.py (outdated)
  • temporian/core/data/node.py (outdated)
  • temporian/core/data/node.py
  • temporian/core/graph.py (outdated)
  • temporian/core/operators/leak.py
  • temporian/core/serialize.py (outdated)

ianspektor commented

One last comment for thoroughness on docstring formatting: running the docs (mkdocs serve -f docs/mkdocs.yml) prints out a list of indentation errors and parameter description mismatches :) (and these latter should definitely be fixed)

WARNING  -  griffe: temporian/core/operators/propagate.py:120: Confusing indentation for continuation line 27 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/core/data/node.py:522: Confusing indentation for continuation line 23 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/core/data/node.py:524: Confusing indentation for continuation line 25 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/core/data/node.py:526: Confusing indentation for continuation line 27 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/core/data/node.py:527: Confusing indentation for continuation line 28 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/core/data/node.py:528: Confusing indentation for continuation line 29 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/implementation/numpy/data/io.py:49: Confusing indentation for continuation line 21 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/implementation/numpy/data/io.py:52: Confusing indentation for continuation line 24 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/implementation/numpy/data/io.py:53: Confusing indentation for continuation line 25 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/implementation/numpy/data/io.py:54: Confusing indentation for continuation line 26 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/implementation/numpy/data/io.py:56: Confusing indentation for continuation line 28 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/implementation/numpy/data/io.py:57: Confusing indentation for continuation line 29 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/implementation/numpy/data/io.py:166: Confusing indentation for continuation line 15 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/implementation/numpy/data/io.py:162: No type or annotation for parameter 'is_sorted'
WARNING  -  griffe: temporian/implementation/numpy/data/io.py:162: Parameter 'is_sorted' does not appear in the function signature
WARNING  -  griffe: temporian/implementation/numpy/data/event_set.py:392: Confusing indentation for continuation line 6 in docstring, should be 4 * 2 = 8 spaces, not 6
WARNING  -  griffe: temporian/implementation/numpy/data/event_set.py:391: No type or annotation for parameter 'force_new_node'
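
For reference, here is a minimal sketch of a docstring laid out the way the warnings above ask for: wrapped parameter descriptions are indented 8 spaces relative to the docstring, every documented parameter appears in the signature, and each one has a type annotation. `sort_rows` is a hypothetical function written for this example, not code from the repository.

```python
import inspect
from typing import Iterable, List, Tuple


def sort_rows(
    rows: Iterable[Tuple[float, str]], is_sorted: bool = False
) -> List[Tuple[float, str]]:
    """Returns the rows ordered by timestamp.

    Args:
        rows: Pairs of (timestamp, value). A description that wraps onto a
            second line is indented 8 spaces relative to the docstring.
        is_sorted: If True, `rows` is assumed to be ordered already and
            the sort is skipped.

    Returns:
        The rows as a list, ordered by timestamp.
    """
    rows = list(rows)
    return rows if is_sorted else sorted(rows, key=lambda row: row[0])


# inspect.cleandoc removes the indentation shared by the docstring lines,
# so the widths measured here are relative to the docstring itself.
doc_lines = inspect.cleandoc(sort_rows.__doc__).splitlines()
continuation_lines = [line for line in doc_lines if line.startswith(" " * 8)]
```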

ianspektor commented

And a similar one for unused imports with flake8 --select F401 --per-file-ignores="__init__.py:F401 all_operators.py:F401" temporian:

temporian/core/test/registered_operators_test.py:20:1: F401 'temporian.core.operators.all_operators as _op' imported but unused
temporian/core/data/node.py:28:5: F401 'temporian.implementation.numpy.data.event_set.EventSet' imported but unused
temporian/core/data/schema.py:25:5: F401 'temporian.core.operators.base.Operator' imported but unused
temporian/test/user_guide_test.py:3:1: F401 'temporian as tp' imported but unused
temporian/implementation/numpy/evaluation.py:26:1: F401 'temporian.implementation.numpy.operators.all_operators as _impls' imported but unused
temporian/implementation/numpy/test/registered_operators_test.py:18:1: F401 'temporian.implementation.numpy.operators.all_operators as imps' imported but unused
temporian/implementation/numpy/operators/test/since_last_test.py:21:1: F401 'temporian.implementation.numpy.operators.since_last.operators_cc' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:25:5: F401 'bokeh.plotting.figure' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:27:5: F401 'bokeh.layouts.column' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:28:5: F401 'bokeh.models.ColumnDataSource' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:28:5: F401 'bokeh.models.CategoricalColorMapper' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:28:5: F401 'bokeh.models.HoverTool' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:196:5: F401 'bokeh.io.output_notebook' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:196:5: F401 'bokeh.io.show' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:197:5: F401 'bokeh.layouts.gridplot' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:197:5: F401 'bokeh.layouts.column' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:198:5: F401 'bokeh.models.CategoricalColorMapper' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:198:5: F401 'bokeh.models.HoverTool' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:199:5: F401 'bokeh.models.ColumnDataSource' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:199:5: F401 'bokeh.models.CustomJS' imported but unused
temporian/implementation/numpy/data/plotter_bokeh.py:200:5: F401 'bokeh.palettes.Dark2_5 as colors' imported but unused
temporian/implementation/numpy/data/test/plotter_test.py:18:13: F401 'IPython.display' imported but unused
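
To show roughly what an F401 report means, the sketch below finds imported names that are never referenced, using the standard `ast` module. It is a simplified illustration and not flake8's or pyflakes' implementation: it ignores scopes, `__all__`, and names used only inside strings. The `sample` source is made up for the example.

```python
import ast
from typing import Dict, List


def unused_imports(source: str) -> List[str]:
    """Returns names bound by import statements but never referenced."""
    tree = ast.parse(source)
    imported: Dict[str, int] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds `a`; `import a.b as c` binds `c`.
                name = alias.asname or alias.name.split(".")[0]
                imported[name] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = node.lineno
    # Attribute access such as `os.getcwd` still contains a Name node
    # for `os`, so collecting Name ids covers that case too.
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted(name for name in imported if name not in used)


sample = """
import os
import numpy as np
from bokeh.plotting import figure

print(os.getcwd())
"""
result = unused_imports(sample)
```

The source is only parsed, never executed, so the modules named in `sample` do not need to be installed. When an import is kept on purpose for its side effects, flake8 accepts a `# noqa: F401` comment on that line instead of a per-file ignore.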


@rstz rstz left a comment

LGTM, thank you!


@ianspektor ianspektor left a comment

Thanks so much for addressing all comments! LGTM after we resolve pending conversations in today's meeting :)

achoum commented Jun 6, 2023

Thanks a lot.

I'll address the extra changes (e.g., "evset", API functions) in separate PRs.

@achoum achoum merged commit 3772e34 into main Jun 6, 2023
@achoum achoum deleted the gbm branch June 6, 2023 14:44