-
Notifications
You must be signed in to change notification settings - Fork 248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update MatrixTable.show #14250
base: main
Are you sure you want to change the base?
Update MatrixTable.show #14250
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR scales much better in columns shown than current main. It now seems we need to get to
about 100,000 columns before we overcome the basic overhead of Spark & Hail. Fantastic.
A recent main commit:
In [24]: %%time
...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 10, handler=lambda x: isinstance(str(x), str))
CPU times: user 29.9 ms, sys: 2.84 ms, total: 32.7 ms
Wall time: 289 ms
Out[24]: True
In [25]: %%time
...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 100, handler=lambda x: isinstance(str(x), str))
CPU times: user 171 ms, sys: 9.37 ms, total: 180 ms
Wall time: 775 ms
Out[25]: True
In [26]: %%time
...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 1000, handler=lambda x: isinstance(str(x), str))
CPU times: user 978 ms, sys: 64.7 ms, total: 1.04 s
Wall time: 10.7 s
Out[26]: True
This PR:
In [4]: %%time
...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 10, handler=lambda x: isinstance(str(x), str))
CPU times: user 19.1 ms, sys: 2.76 ms, total: 21.8 ms
Wall time: 283 ms
In [5]: %%time
...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 10, handler=lambda x: isinstance(str(x), str))
CPU times: user 67.3 ms, sys: 7.44 ms, total: 74.8 ms
Wall time: 293 ms
In [6]: %%time
...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 100, handler=lambda x: isinstance(str(x), str))
CPU times: user 18.8 ms, sys: 2.46 ms, total: 21.3 ms
Wall time: 318 ms
In [7]: %%time
...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 1000, handler=lambda x: isinstance(str(x), str))
CPU times: user 24.6 ms, sys: 1.98 ms, total: 26.6 ms
Wall time: 315 ms
In [8]: %%time
...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 10000, handler=lambda x: isinstance(str(x), str))
CPU times: user 75.2 ms, sys: 3.65 ms, total: 78.9 ms
Wall time: 351 ms
In [9]: %%time
...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 100000, handler=lambda x: isinstance(str(x), str))
CPU times: user 740 ms, sys: 16 ms, total: 756 ms
Wall time: 1.17 s
With the repr change this all seems copacetic to me.
In [2]: hl.utils.range_matrix_table(10,1000).annotate_entries(a='a\n', b=3, c=4.0, d=hl.dict([(3, "a")])).show(5, 10)
+---------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+
| row_idx | 0.a | 0.b | 0.c | 0.d | 1.a | 1.b | 1.c | 1.d | 2.a | 2.b | 2.c | 2.d | 3.a | 3.b | 3.c | 3.d | 4.a | 4.b | 4.c | 4.d | 5.a | 5.b | 5.c |
+---------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+
| int32 | str | int32 | float64 | dict<int32, str> | str | int32 | float64 | dict<int32, str> | str | int32 | float64 | dict<int32, str> | str | int32 | float64 | dict<int32, str> | str | int32 | float64 | dict<int32, str> | str | int32 | float64 |
+---------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+
| 0 | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 |
| 1 | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 |
| 2 | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 |
| 3 | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 |
| 4 | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 |
+---------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+
+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+
| 5.d | 6.a | 6.b | 6.c | 6.d | 7.a | 7.b | 7.c | 7.d | 8.a | 8.b | 8.c | 8.d | 9.a | 9.b | 9.c | 9.d |
+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+
| dict<int32, str> | str | int32 | float64 | dict<int32, str> | str | int32 | float64 | dict<int32, str> | str | int32 | float64 | dict<int32, str> | str | int32 | float64 | dict<int32, str> |
+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+
| {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} |
| {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} |
| {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} |
| {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} |
| {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} | 'a\n' | 3 | 4.0 | {3: 'a'} |
+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+
showing top 5 rows
showing the first 10 of 1000 columns
hail/python/hail/matrixtable.py
Outdated
truncate_limit, | ||
types, | ||
) | ||
handler(repr(show)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you don't want the repr
here. Due to repr
, the IPython handler see the output as a string, which it prints as its repr. e.g.
In [5]: from IPython.display import display; display('a\nb')
'a\nb'
Instead, what we want is to send the show object itself. When IPython gets an object with a __repr__
it calls that and then actually `prints that string (not the repr of that string).
In [8]: class Foo:
...: def __repr__(self):
...: return 'a\nb'
In [9]: from IPython.display import display; display(Foo())
a
b
if include_row_fields: | ||
characters -= estimate_size(self.row_value) | ||
characters = max(characters, 0) | ||
n_cols = characters // (estimate_size(self.entry) + 4) # 4 for the column index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, one thing that has long confused users is that a Matrix Table with enough entry fields on a skinny enough terminal will show with zero columns. We do print the message about the number of columns, but users still get confused. Since you're changing this code anyway, can we add a n_cols = max(n_cols, 1)
to ensure we always print at least one column?
hail/python/hail/matrixtable.py
Outdated
width: Optional[int] = None, | ||
truncate: Optional[int] = None, | ||
types: bool = True, | ||
handler: Optional[Callable[[str], Any]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not actually sure how to type this correctly. Even PyRight can't seem to figure it out:
(base) dking@wm28c-761 hail % cat bar.py
from IPython.display import display
reveal_type(display)
(base) dking@wm28c-761 hail % pyright bar.py
WARNING: there is a new pyright version available (v1.1.349 -> v1.1.350).
Please install the new version or set PYRIGHT_PYTHON_FORCE_VERSION to `latest`
/Users/dking/projects/hail/bar.py
/Users/dking/projects/hail/bar.py:2:13 - information: Type of "display" is "(*objs: Unknown, include: Unknown | None = None, exclude: Unknown | None = None, metadata: Unknown | None = None, transient: Unknown | None = None, display_id: Unknown | None = None, raw: bool = False, clear: bool = False, **kwargs: Unknown) -> (DisplayHandle | None)"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, in the next set of changes, I make the type Optional[Callable]
.
from typing import Callable
from IPython.display import display
type(display)
Out[4]: function
isinstance(display, Callable)
Out[5]: True
hail/python/hail/matrixtable.py
Outdated
|
||
def __repr__(self): | ||
return self.__str__() | ||
|
||
def _repr_html_(self): | ||
s = self.table_show._repr_html_() | ||
if self.displayed_n_cols != self.actual_n_cols: | ||
# This method is not thoroughly tested and should be used carefully. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We prefer to not include comments in the source code unless they're linking to a document that describes some jarringly abnormal behavior (like a known WONTFIX bug in a third-party library).
The HTML show methods have never been well-tested and that's OK for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove most of the comments in the next set of changes, but I would like to leave this one in.
# Here we must distinguish between matrix table columns
# and the columns that are used to display this matrix table.
# Hail displays both the row fields and the matrix table's columns
# and entry fields as columns in a tabular format.
# We refer to the columns used to display the matrix table as "display columns".
I think it clarifies concepts in a way that would be difficult to understand from the code alone.
s += '<p style="background: #fdd; padding: 0.4em;">' | ||
s += f"showing the first { self.displayed_n_cols } of { self.actual_n_cols } columns" | ||
s += f"showing the first { self.n_cols } of { total_n_cols } columns" | ||
s += '</p>\n' | ||
return s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary for this PR (which hugely improves the ASCII table!), but I think it would be quite simple to extend your approach for an ASCII table to producing an HTML table. Easier, even, because you don't need to worry about cooking up the padding of the ASCII cells. You just need to put the HTML tags in the right places. We could probably even delegate truncation to HTML/CSS.
hail/python/hail/matrixtable.py
Outdated
ascii_str += format_line(row[block_slice], block_display_column_widths, block_right_align) | ||
ascii_str += hline | ||
|
||
if n_rows < matrix_table.count_rows(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Counting the number of rows, in general, requires visiting each row (consider mt.explode(mt.some_array
). Instead of counting the rows, use the has_more returned value from _take_n
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
has_more
is always False
in my current implementation because of table = matrix_table.head(n_rows, n_cols).localize_entries('entries', 'cols')
. I propose just printing the number of rows that are displayed regardless of the total row count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I have changed my implementation to use the has_more
value now.
hail/python/hail/matrixtable.py
Outdated
else: | ||
return value + ' ' * extra_count | ||
|
||
def format_line(values: List[str], widths: List[int], right_align: List[bool]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer including return types on all functions, this one looks like -> str
to me
hail/python/hail/matrixtable.py
Outdated
row_fields = list(row_dtype) if include_row_fields else list(table_key_dtype) | ||
row_fields = [row_field for row_field in row_fields if row_field != 'entries'] | ||
truncated_row_fields = [truncate(field) for field in row_fields] | ||
truncated_rows = [[truncate(str(row[field])) for field in row_fields] for row in rows] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of str I think we should use repr
. In particular, adversarial inputs produce break the output:
In [6]: hl.utils.range_matrix_table(10,1000).annotate_entries(a='a\n').show(5, 2)
+---------+-----+-----+
| row_idx | 0.a | 1.a |
+---------+-----+-----+
| int32 | str | str |
+---------+-----+-----+
| 0 | a
| a
|
| 1 | a
| a
|
| 2 | a
| a
|
| 3 | a
| a
|
| 4 | a
| a
|
+---------+-----+-----+
showing top 5 rows
showing the first 2 of 1000 columns
hail/python/hail/matrixtable.py
Outdated
entry_fields = list(entry_dtype) | ||
# entries has shape (n_rows, n_cols, len(entry_fields)) | ||
entries = [[[entry[field] for field in entry_fields] for entry in row['entries']] for row in rows] | ||
truncated_entries = [[[truncate(str(entry)) for entry in column] for column in row] for row in entries] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
truncated_row_fields = [truncate(field) for field in row_fields] | ||
truncated_rows = [[truncate(str(row[field])) for field in row_fields] for row in rows] | ||
truncated_row_field_types = ( | ||
[truncate(str(row_dtype[field])) for field in row_fields] if should_show_types else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Types, on the other hand, should use str, so this is correct.
513cf79
to
9d8f693
Compare
The CI checks likely failed because of failing unit tests. I just fixed the unit tests. Please rerun the CI checks when you get a chance. |
I'll rerun CI now. In case it's helpful, here are the failures from the last run, all in the doctests:
|
Thanks for sharing the doctest failures. Those were indeed issues, and I think I have fixed them now. |
Description
In this pull request, I change the implementation of the
MatrixTable.show
method. Previously,show
would create a table and then show the table by reusing the table implementation ofshow
.The problem with the existing implementation is that it creates row fields in the table for all of the entries in the matrix table.
This new implementation directly shows the matrix table by creating a table and displaying the information in the table.
I make some refactoring changes in this pull request as well that helped me understand the code and will hopefully help others as well.
Testing
I add some unit tests.