Replace LAMA with Dask: grouping methods to migrate #295

Closed
sadielbartholomew opened this issue Jan 6, 2022 · 7 comments
Labels
dask Relating to the use of Dask high priority performance Relating to speed and memory performance

@sadielbartholomew
Member

sadielbartholomew commented Jan 6, 2022

Table for #182

(See this comment's last-edited timestamp for the author and date of the latest table update.)

Note: there are now 315 methods in total in the table (previously 306).

Data method name Done? (Bool as emoji, PR tag) Group with (if at all) Notes 1 Notes 2 Notes 3
HDF_chunks ✔️ #338 . . Deprecated API change
Units ✔️ #405 . . . .
_HDF_chunks ✔️ #338 . . Deprecated API change
_Units ✔️ . . . .
_YMDhms ✔️ #322 . . . .
__abs__ ✔️, #409 UA . . .
__add__ ✔️, #409 A . . .
__and__ ✔️, #409 L . . .
__array__ ✔️ CFDM . . .
__bool__ ✔️ #319 . . . .
__contains__ ✔️ #320 . . . .
__data__ ✔️ . . . .
__deepcopy__ ✔️ CFDM . . .
__div__ ✔️, #409 A . . .
__doc_template__ ✔️ CFDM . . .
__docstring_package_depth__ ✔️ CFDM . . .
__docstring_substitutions__ ✔️ CFDM . . .
__eq__ ✔️, #409 C . . .
__float__ ✔️ #337 . . . .
__floordiv__ ✔️, #409 A . . .
__ge__ ✔️, #409 C . . .
__getitem__ ✔️ #257 . . . .
__gt__ ✔️, #409 C . . .
__hash__ ✔️ #366 . . Deprecated API change
__iadd__ ✔️, #409 . . . .
__iand__ ✔️, #409 . . . .
__idiv__ ✔️, #409 . . . .
__ifloordiv__ ✔️, #409 . . . .
__ilshift__ ✔️, #409 . . . .
__imod__ ✔️, #409 . . . .
__imul__ ✔️, #409 . . . .
__init__ ✔️ . . . API change
__int__ ✔️ #337 . . . .
__invert__ ✔️, #409 UA . . .
__ior__ ✔️, #409 . . . .
__ipow__ ✔️, #409 . . . .
__irshift__ ✔️, #409 . . . .
__isub__ ✔️, #409 . . . .
__iter__ ✔️ #321 . . . .
__itruediv__ ✔️, #409 . . . .
__ixor__ ✔️, #409 RL . . .
__le__ ✔️, #409 C . . .
__len__ ✔️ #321 . . . .
__lshift__ ✔️, #409 . . . .
__lt__ ✔️, #409 C . . .
__mod__ ✔️, #409 A . . .
__module__ ✔️ CFDM . . .
__mul__ ✔️, #409 A . . .
__ne__ ✔️, #409 C . . .
__neg__ ✔️, #409 UA . . .
__new__ ✔️ . . . .
__or__ ✔️, #409 L . . .
__pos__ ✔️, #409 UA . . .
__pow__ ✔️, #409 A . . .
__query_set__ ✔️ #368 . . . .
__query_wi__ ✔️ #368 . . . .
__query_wo__ ✔️ #368 . . . .
__radd__ ✔️, #409 RA . . .
__rand__ ✔️, #409 RL . . .
__rdiv__ ✔️, #409 RA . . .
__reduce__ ✔️ . . . .
__reduce_ex__ ✔️ . . . .
__rfloordiv__ ✔️, #409 RA . . .
__rlshift__ ✔️, #409 . . . .
__rmod__ ✔️, #409 RA . . .
__rmul__ ✔️, #409 RA . . .
__ror__ ✔️, #409 RL . . .
__round__ ✔️ #370 . . . .
__rpow__ ✔️, #409 . . . .
__rrshift__ ✔️, #409 . . . .
__rshift__ ✔️, #409 . . . .
__rsub__ ✔️, #409 RA . . .
__rtruediv__ ✔️, #409 . . . .
__rxor__ ✔️, #409 RL . . .
__setitem__ ✔️ #257 . . . .
__sub__ ✔️, #409 A . . .
__truediv__ ✔️, #409 . . . .
__xor__ ✔️, #409 RL . . .
_all_axes ✔️ . . . .
_all_axis_names ✔️ . . . .
_asdatetime ✔️ #322 . . . .
_asreftime ✔️ #322 . . . .
_atol ✔️ #393 TOL . . .
_auxiliary_mask ✔️ . . . .
_auxiliary_mask_add_component ✔️ . . . .
_auxiliary_mask_from_1d_indices ✔️ . . . .
_auxiliary_mask_return ✔️ . . . .
_auxiliary_mask_subspace ✔️ . . . .
_auxiliary_mask_tidy ✔️ . . . .
_axes ✔️ . . . .
_binary_operation ✔️, #409 A . . .
_change_axis_names ✔️ . . . .
_chunk_add_partitions ✔️ . . . .
_collapse ✔️ #356 COL Needs dask >= 2022.03.0 . .
_collapse_create_weights ✔️ #356 COL . . .
_collapse_finalise ✔️ #356 COL . . .
_collapse_mask ✔️ #356 COL . . .
_collapse_optimize_weights ✔️ #356 COL . . .
_collapse_subspace ✔️ #356 COL . . .
_combined_units ✔️, #436 A . . .
_create_auxiliary_mask_component ✔️ . . . .
_custom ✔️ CFDM . . .
_cyclic ✔️ . . . .
_default ✔️ CFDM . . .
_del_Array ✔️ CFDM . . .
_del_component ✔️ CFDM . . .
_dtype ✔️ . . . .
_equals ✔️ CFDM . . .
_equals_preprocess ✔️ CFDM . . .
_flag_partitions_for_processing ✔️ . . . .
_flip ✔️ . . . .
_get_Array ✔️ CFDM . . .
_get_component ✔️ CFDM . . .
_has_component ✔️ CFDM . . .
_initialise_netcdf ✔️ CFDM . . .
_is_abstract_Array_subclass ✔️ . . . .
_isdatetime ✔️ #322 . . . .
_item ✔️ CFDM . . .
_move_flip_to_partitions ✔️ . . . .
_ndim ✔️ . . . .
_new_axis_identifier ✔️ . . . .
_package ✔️ CFDM . . .
_parse_axes ✔️ CFDM . . .
_parse_indices ✔️ #369 . . . .
_pmaxes ✔️ . . . .
_pmndim ✔️ . . . .
_pmshape ✔️ . . . .
_pmsize ✔️ . . . .
_rtol ✔️ #393 TOL . . .
_set_Array ✔️ CFDM . . .
_set_CompressedArray ✔️ CFDM . . .
_set_component ✔️ CFDM . . .
_set_partition_matrix ✔️ . . . .
_set_subspace ✔️ #371 . . . .
_shape ✔️ . . . .
_share_lock_files ✔️ . . . .
_share_partitions ✔️ . . . .
_size ✔️ . . . .
_unary_operation ✔️, pre-table & tested by #409 A . . .
add_partitions ✔️ . . Deprecated API change
all ✔️ #373 . Needs dask >= 2022.6.0. . .
allclose ✔️ #413 . . . .
any ✔️ #373 . Needs dask >= 2022.6.0. . .
apply_masking ✔️ #374 . . . .
arccos ✔️ #300 & #309 T . .
arccosh ✔️ #300 & #309 H . .
arcsin ✔️ #300 & #309 T . .
arcsinh ✔️ #300 & #309 H . .
arctan ✔️ #300 & #309 T . .
arctanh ✔️ #300 & #309 H . .
argmax ✔️ #339 . . . .
array ✔️ #353 . Needs cftime >= 1.6.0 . .
asdata ✔️ #334 . . . .
binary_mask ✔️ #372 . . . .
ceil ✔️ #300 & #308 . . .
change_calendar ✔️ #335 . . . .
chunks ✔️ #355 MEM . . .
clip ✔️ #375 . . . .
close ✔️ #382 . . Deprecated API change
compressed ✔️ #383 . . . .
compressed_array ✔️ #400 . . . .
compute ✔️ MEM . . .
concatenate ✔️, #425 . . . .
concatenate_data ✔️, #426 . . . .
convolution_filter ✔️ #294 . . . .
copy ✔️ CFDM . . .
cos ✔️ #300 & #309 T . .
cosh ✔️ #300 & #309 H . .
count ✔️ #414 . Needs dask >= 2022.03.0. . API change
count_masked ✔️ #414 . Needs dask >= 2022.03.0. . .
creation_commands ✔️ CFDM . . .
cumsum ✔️ #343 . . . API change
cyclic ✔️ #344 . . . .
data ✔️ #376 . . . .
datetime_array ✔️ #353 . Needs cftime >= 1.6.0 . .
datetime_as_string ✔️ CFDM . . .
datum ✔️ #332 . . . .
day ✔️ #322 . . . .
del_calendar ✔️ #357 . . . .
del_fill_value ✔️ CFDM . . .
del_units ✔️ #357 . . . .
diff ✔️ #350 . . . .
digitize ✔️ #312 . . . .
dtarray ✔️ . . . .
dtype ✔️ . . . .
dump ✔️ #377 . . . .
dumpd ✔️ #392 . . Deprecated API change
dumps ✔️ #392 . . Deprecated API change
empty ✔️ #315 . . . API change
equals ✔️ #254 #330 . Required for testing hence early migration, but rather complicated. . .
exp ✔️ #300 & #308 . . .
files ✔️ . . Deprecated API change
fill_value ✔️ #386 . . . .
filled ✔️ #340 . . . .
first_element ✔️ . . . .
fits_in_memory ✔️ #401 MEM . . API change
fits_in_one_chunk_in_memory ✔️ #359 MEM . Deprecated API change
flat ✔️ #379 . . . .
flatten ✔️ #333 . . . .
flip ✔️ #329 . . . .
floor ✔️ #300 & #308 . . .
full ✔️ #315 . . . API change
func ✔️ . . . API change
get_calendar ✔️ #357 . . . .
get_compressed_axes ✔️ #403 CFDM . . .
get_compressed_dimension ✔️ #403 CFDM . . .
get_compression_type ✔️ #403 CFDM . . .
get_count ✔️ #406 CFDM . . .
get_data ✔️ #404 . . . .
get_filenames ✔️ #367, #408 . Note: this method on other classes (Field, DimensionCoordinate, etc) will need separate treatment. Deprecated API change
get_fill_value ✔️ CFDM . . .
get_index ✔️ #406 CFDM . . .
get_list ✔️ #406 CFDM . . .
get_units ✔️ #357 . . . .
halo ✔️ #331 . Related to convolution_filter. . API change
hardmask ✔️ #399 . . . .
has_calendar ✔️ #357 . . . .
has_fill_value ✔️ CFDM . . .
has_units ✔️ #357 . . . .
hour ✔️ #322 . . . .
in_memory ✔️ #387 MEM . Deprecated API change
insert_dimension ✔️ . . . .
inspect ✔️ #394 . . . .
integral ✔️ #356 COL . . .
isclose ✔️ #411 . . . .
is_masked ✔️ . . . .
ispartitioned ✔️ . . Deprecated API change
isscalar ✔️ #395 . . Deprecated API change
last_element ✔️ . . . .
loadd ✔️ #392 . . Deprecated API change
loads ✔️ #392 . . Deprecated API change
log ✔️ . . . .
mask ✔️ #301 . . . .
mask_fpe ✔️ #380 . . Deprecated API change
mask_invalid ✔️ #390 . . . API change
masked_all ✔️ #396 . . . .
masked_invalid ✔️ #390 . . . API change
max ✔️ #356 COL . . .
maximum ✔️ #356 COL . . .
maximum_absolute_value ✔️ #356 COL . . .
mean ✔️ #356 COL . . .
mean_absolute_value ✔️ #356 COL . . .
mean_of_upper_decile ✔️ #412 COL . . .
median ✔️ #313 COL . . .
mid_range ✔️ #356 COL . . .
min ✔️ #356 COL . . .
minimum ✔️ #356 COL . . .
minimum_absolute_value ✔️ #356 COL . . .
minute ✔️ #322 . . . .
month ✔️ #322 . . . .
nbytes ✔️ . . . .
nc_clear_hdf5_chunksizes ✔️ NCC, CFDM . . .
nc_hdf5_chunksizes ✔️ NCC, CFDM . . .
nc_set_hdf5_chunksizes ✔️ NCC, CFDM . . .
ndim ✔️ . . . .
ndindex ✔️ #351 . . . .
ones ✔️ #315 . . . API change
outerproduct ✔️ #398 . . . .
override_calendar ✔️ #389 . . . .
override_units ✔️ #389 . . . .
partition_boundaries ✔️ . . Deprecated API change
partition_configuration ✔️ . . . .
partitions ✔️ . . . .
percentile ✔️ #313 . . . API change
persist ✔️ MEM . . .
range ✔️ #356 COL . . .
rechunk ✔️ #355 . . . .
reconstruct_sectioned_data ✔️ #407 . . Deprecated API change
reshape ✔️ #356 . . . .
rint ✔️ #300 #308 . . .
roll ✔️ #352 . . . .
root_mean_square ✔️ #356 COL . . .
round ✔️ #300 #308 . . .
sample_size ✔️ #356 COL . . .
save_to_disk ✔️ #387 MEM . Deprecated API change
sd ✔️ #356 COL . . .
second ✔️ #322 . . . .
second_element ✔️ . . . .
section ✔️ #359 . . . API change
set_calendar ✔️ #357 . . . .
set_fill_value ✔️ CFDM . . .
set_units ✔️ #357 . . . .
seterr ✔️ #384 . . Deprecated API change
shape ✔️ . . . .
sin ✔️ #300 & #309 T . .
sinh ✔️ #300 & #309 H . .
size ✔️ . . . .
source ✔️ CFDM . . .
square ✔️ #356 . . . .
sqrt ✔️ #356 . . . .
squeeze ✔️ . . . .
standard_deviation ✔️ #356 COL . . .
std ✔️ #356 COL . . .
stats ✔️ #432 COL . . .
sum ✔️ #356 COL . . .
sum_of_squares ✔️ #356 COL . . .
sum_of_weights ✔️ #356 COL . . .
sum_of_weights2 ✔️ #356 COL . . .
swapaxes ✔️ #361 . . . .
tan ✔️ #300 & #309 T . .
tanh ✔️ #300 & #309 H . .
to_dask_array ✔️ #388, #399 . . . .
to_disk ✔️ #387 MEM . Deprecated API change
to_memory ✔️ #387 MEM . . .
tolist ✔️ #360 . . . .
transpose ✔️ #247 . . . .
trunc ✔️ #300 #308 . . .
uncompress ✔️ #385 . . . .
unique ✔️ #391 . Needs dask >= 2022.6.0. . .
var ✔️ #356 COL . . .
variance ✔️ #356 COL . . .
varray ✔️ #397 . . Deprecated API change
where ✔️ #260 . . . .
year ✔️ #322 . . . .
zeros ✔️ #315 . . . .
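
Several rows above note minimum versions (dask >= 2022.03.0 or 2022.6.0, cftime >= 1.6.0). A rough way to check an environment against those minimums is sketched below. It uses only the standard library, the helper names are invented for this example, and the naive integer parsing assumes plain dotted-number version strings with no pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version


def as_tuple(version_string):
    """Convert a dotted version such as '2022.03.0' to a tuple of ints."""
    return tuple(int(part) for part in version_string.split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return as_tuple(installed) >= as_tuple(minimum)


# Highest minimum versions mentioned in the table notes.
minimums = {"dask": "2022.6.0", "cftime": "1.6.0"}

for package, minimum in minimums.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed")
        continue
    try:
        ok = meets_minimum(installed, minimum)
    except ValueError:
        # Version strings with non-numeric parts are beyond this sketch.
        print(f"{package}: cannot parse version {installed!r}")
        continue
    print(f"{package} {installed}: {'OK' if ok else 'too old'} (needs >= {minimum})")
```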

Code to re-generate table

(In case a different form proves useful. Uses the library python-tabulate for ease.)

# Setup
from tabulate import tabulate
complete = {
# ... Python dict of all Data methods, as in original issue
}

# Processing
complete_convert_to_emoji = {}
for k, v in complete.items():
    if v:
        complete_convert_to_emoji[k] = ":heavy_check_mark:"
    else:
        complete_convert_to_emoji[k] = ":heavy_multiplication_x:"

blank_methods_list = [[f"`{m}`", v, ".", ".", ".", "."] for m, v in complete_convert_to_emoji.items()]
print(tabulate(
    blank_methods_list, tablefmt="github",
    headers=[
        "`Data` method name", "Done? (Bool as emoji, PR tag)",
        "Group with (if at all)", "Notes 1", "Notes 2", "Notes 3"
    ]
))
# Copy and paste output of above call into comment.
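
For anyone without python-tabulate installed, a similar pipe table can be approximated with the standard library alone. The sketch below is only an illustration: the helper name `render_pipe_table` and the three sample entries are made up for the example, and the output is not padded or aligned the way tabulate's `github` format is.

```python
def render_pipe_table(headers, rows):
    """Render headers and rows as a simple GitHub-style pipe table string."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join(" --- " for _ in headers) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Hypothetical stand-in for the real `complete` dict (method name -> done?).
sample_complete = {"sin": True, "cos": True, "concatenate": False}

headers = [
    "`Data` method name", "Done? (Bool as emoji, PR tag)",
    "Group with (if at all)", "Notes 1", "Notes 2", "Notes 3",
]
rows = [
    [
        f"`{name}`",
        ":heavy_check_mark:" if done else ":heavy_multiplication_x:",
        ".", ".", ".", ".",
    ]
    for name, done in sample_complete.items()
]

table_text = render_pipe_table(headers, rows)
print(table_text)
```

As with the tabulate version, the result has two heading lines (header row and separator row) followed by one line per method.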

List of groupings of methods referenced in above table

See the 'Group with (if at all)' column in the table, which denotes methods that are linked in some useful way (say, similar enough to be migrated in a set of related PRs). Any groups referenced there should be added here.

Example: name group A for all of the trig. methods, and list 'A' in the 'Group with' column for all such methods, as well as noting A with all methods in it here.

| Group name | Description; Done, Y/N (N implied if no Y given) | Members | Notes (optional) |
|------------|--------------------------------------------------|---------|------------------|
| T | trigonometric, Y | `sin`, `cos`, `tan`, `arcsin`, `arccos`, `arctan` | Note `arctan2` is special (as it takes two args) and not yet implemented, but now can be. |
| H | hyperbolic, Y | T method names with 'h' appended | |
| A | binary arithmetic | `__add__`, `__sub__`, `__mul__`, `__div__`, `__floordiv__`, `__mod__`, etc. | |
| RA | reverse binary arithmetic | A methods with 'r' prepended | |
| UA | unary arithmetic/bitwise | `__abs__`, `__pos__`, `__neg__`, `__invert__`, etc. | |
| C | comparison | `__eq__`, `__ne__`, `__gt__`, `__lt__`, `__ge__`, `__le__`, etc. | |
| L | logical | `__and__`, `__or__`, `__xor__`, etc. | |
| RL | reverse logical | L methods with 'r' prepended | |
| TOL | tolerance | `_rtol`, `_atol` | |
| NCC | netCDF HDF5 chunksize related | `nc_clear_hdf5_chunksizes`, `nc_hdf5_chunksizes`, `nc_set_hdf5_chunksizes` | |
| B | bitwise shifts, including binary and augmented | `__lshift__`, `__ilshift__`, `__rlshift__`, `__rshift__`, `__irshift__`, `__rrshift__` | |
| CFDM | Inherited unchanged from cfdm | | |
| COL | Collapse functions | E.g. `max`, `mean`, `_collapse` | |
| MEM | Related to memory management | E.g. `to_memory` | |

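
The 'appended' and 'prepended' shorthand in the table above can be spelled out in a few lines of Python. This is only a sketch to make the naming rules concrete: the variable names and the `reflected` helper are illustrative, and the A list includes only the members named explicitly in the table (the 'etc.' is left unexpanded).

```python
# Members of groups T and A exactly as listed in the table above.
T = ["sin", "cos", "tan", "arcsin", "arccos", "arctan"]
A = ["__add__", "__sub__", "__mul__", "__div__", "__floordiv__", "__mod__"]


def reflected(dunder_name):
    """Insert 'r' after the leading double underscore, e.g. __add__ -> __radd__."""
    return dunder_name[:2] + "r" + dunder_name[2:]


# H: the T method names with 'h' appended.
H = [name + "h" for name in T]

# RA: the A method names with 'r' prepended to the operator name.
RA = [reflected(name) for name in A]

print(H)
print(RA)
```
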
@davidhassell
Collaborator

Hi Sadie - this is great, just what we need. Thanks.

(Is there a line 8 missing from the Python code?)

@sadielbartholomew
Member Author

Thanks @davidhassell, after our conversation I thought it was needed to allow us to organise the work. I have started to add in some basic groupings that may be helpful when thinking about methods with similar requirements and ones to tackle as a group, etc. I will add the properties in as a group too, though I don't think there are that many sadly (since as you mentioned they count for three in the table).

Please edit the table as and when you wish to. There is plenty of info and groupings that could be added, but I was going to chip away at it as useful rather than doing it all at once. I will make sure to keep it up to date in terms of which methods are in a PR that is open or merged.

> (Is there a line 8 missing from the Python code?)

Ah yes, good spot, I must have missed it when copying lines from the interactive session. I will add that back in now.

@sadielbartholomew
Member Author

I should add, I am aware you aren't a fan of emojis but the green and red colouring here makes it much easier to process at a glance whether a method is done or not than Booleans or Y/N...

@davidhassell
Collaborator

colon smiley colon

@sadielbartholomew
Member Author

sadielbartholomew commented Apr 26, 2022

Hi @davidhassell, here's the code I have quickly written to grab the total count of completed methods:

# TODO: copy table as string here, heeding warning below.
# Make sure table is copied such that the first line of table starts as the
# first character, with no newline, and the string ends at end of final line.
# Basically as the line below indicates. (Code relies on that format.)
table = """<insert the current table here!>"""

def get_count_of_completed_methods(table_string):
    """Get count (and total) of (un)daskified methods from table of PR 295."""
    lines = table_string.split("\n")
    total_number_methods = len(lines) - 2  # subtract two heading lines

    count = 0
    validate = 0
    for line in lines[2:]:  # skip 2 header lines
        if ":heavy_check_mark:" in line:
            count += 1
        elif ":heavy_multiplication_x:" in line:
            validate += 1
        else:
            print(f"WARNING: POSSIBLY DODGY LINE? CHECK:'{line}'")

    # Check that nothing dodgy has happened, that all lines/methods got covered
    missing_methods = total_number_methods - count - validate
    print("NUMBER OF METHODS UNACCOUNTED FOR IS", missing_methods)

    return count, total_number_methods  # include total for extra info


results = get_count_of_completed_methods(table)
print(
    "COUNT OF DASKIFIED METHODS IS {} FROM TOTAL {}.".format(*results)
)
print(f"THAT'S A COMPLETION PERCENTAGE OF {results[0]/results[1]*100:.1f}")

Right now (see the comment timestamp if needed as a reference) the output is:

$ python daskification-table.py 
NUMBER OF METHODS UNACCOUNTED FOR IS 0
COUNT OF DASKIFIED METHODS IS 228 FROM TOTAL 315.
THAT'S A COMPLETION PERCENTAGE OF 72.4
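
The script above depends on the pasted table having no leading or trailing blank lines. A slightly more forgiving variant, sketched here with an invented function name and a three-row sample table rather than the real one, strips surrounding whitespace and skips empty lines before counting:

```python
def count_completed(table_string):
    """Return (done, not_done, unrecognised) counts for a pipe-table string."""
    # Drop blank lines so stray newlines from copy-and-paste are harmless.
    lines = [ln for ln in table_string.strip().split("\n") if ln.strip()]
    done = not_done = unrecognised = 0
    for line in lines[2:]:  # skip the header row and the separator row
        if ":heavy_check_mark:" in line:
            done += 1
        elif ":heavy_multiplication_x:" in line:
            not_done += 1
        else:
            unrecognised += 1
    return done, not_done, unrecognised


# Made-up sample with a leading and a trailing newline, as often happens
# when pasting into a triple-quoted string.
sample_table = """
| `Data` method name | Done? |
|--------------------|-------|
| `sin`              | :heavy_check_mark: |
| `cos`              | :heavy_check_mark: |
| `concatenate`      | :heavy_multiplication_x: |
"""

done, not_done, unrecognised = count_completed(sample_table)
total = done + not_done + unrecognised
print(f"{done} of {total} done ({done / total * 100:.1f}%)")
```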

As you predicted, once I put my _binary_arithmetic and concatenate PRs in, and those get merged, we'll be closer to 90%, since (228+51+1)/315 ≈ 0.889.
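
The projection can be checked directly. The figures below are the ones quoted in this comment (228 done, 51 + 1 pending, 315 in total); the variable names are just for illustration.

```python
total_methods = 315
done_now = 228
pending_in_prs = 51 + 1  # methods covered by the pending PRs, as quoted above

# Percentage complete now, and once the pending PRs are merged.
current_pct = done_now / total_methods * 100
projected_pct = (done_now + pending_in_prs) / total_methods * 100

print(f"current:   {current_pct:.1f}%")
print(f"projected: {projected_pct:.1f}%")
```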

@sadielbartholomew
Member Author

@davidhassell I've just updated the table to account for #409 and the scores on the doors are now:

$ python daskification-table.py 
NUMBER OF METHODS UNACCOUNTED FOR IS 0
COUNT OF DASKIFIED METHODS IS 306 FROM TOTAL 315.
THAT'S A COMPLETION PERCENTAGE OF 97.1

So we're very close now, as the green to red ratio indicates just from scrolling down the table! Just getting up PRs for _combined_units and concatenate to get us ever nearer 100%.

@davidhassell
Collaborator

Data test suite now passes!
