Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update MatrixTable.show #14250

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

Will-Tyler
Copy link
Contributor

Description

In this pull request, I change the implementation of the MatrixTable.show method. Previously, show would create a table and then show the table by reusing the table implementation of show.

The problem with the existing implementation is that it creates row fields in the table for all of the entries in the matrix table.

This new implementation directly shows the matrix table by creating a table and displaying the information in the table.

I make some refactoring changes in this pull request as well that helped me understand the code and will hopefully help others as well.

Testing

I add some unit tests.

@patrick-schultz patrick-schultz self-assigned this Feb 5, 2024
Copy link
Contributor

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR scales much better in columns shown than current main. It now seems we need to get to
about 100,000 columns before we overcome the basic overhead of Spark & Hail. Fantastic.

A recent main commit:

In [24]: %%time
    ...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 10, handler=lambda x: isinstance(str(x), str))
CPU times: user 29.9 ms, sys: 2.84 ms, total: 32.7 ms
Wall time: 289 ms
Out[24]: True

In [25]: %%time
    ...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 100, handler=lambda x: isinstance(str(x), str))
CPU times: user 171 ms, sys: 9.37 ms, total: 180 ms
Wall time: 775 ms
Out[25]: True

In [26]: %%time
    ...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 1000, handler=lambda x: isinstance(str(x), str))
CPU times: user 978 ms, sys: 64.7 ms, total: 1.04 s
Wall time: 10.7 s
Out[26]: True

This PR:

In [4]: %%time
   ...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 10, handler=lambda x: isinstance(str(x), str))
CPU times: user 19.1 ms, sys: 2.76 ms, total: 21.8 ms
Wall time: 283 ms

In [5]: %%time
   ...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 10, handler=lambda x: isinstance(str(x), str))
CPU times: user 67.3 ms, sys: 7.44 ms, total: 74.8 ms
Wall time: 293 ms

In [6]: %%time
   ...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 100, handler=lambda x: isinstance(str(x), str))
CPU times: user 18.8 ms, sys: 2.46 ms, total: 21.3 ms
Wall time: 318 ms

In [7]: %%time
   ...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 1000, handler=lambda x: isinstance(str(x), str))
CPU times: user 24.6 ms, sys: 1.98 ms, total: 26.6 ms
Wall time: 315 ms

In [8]: %%time
   ...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 10000, handler=lambda x: isinstance(str(x), str))
CPU times: user 75.2 ms, sys: 3.65 ms, total: 78.9 ms
Wall time: 351 ms

In [9]: %%time
   ...: hl.utils.range_matrix_table(1, 1_000_000).annotate_entries(x=1).show(1, 100000, handler=lambda x: isinstance(str(x), str))
CPU times: user 740 ms, sys: 16 ms, total: 756 ms
Wall time: 1.17 s

With the repr change this all seems copacetic to me.

In [2]: hl.utils.range_matrix_table(10,1000).annotate_entries(a='a\n', b=3, c=4.0, d=hl.dict([(3, "a")])).show(5, 10)
+---------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+
| row_idx | 0.a   |   0.b |     0.c | 0.d              | 1.a   |   1.b |     1.c | 1.d              | 2.a   |   2.b |     2.c | 2.d              | 3.a   |   3.b |     3.c | 3.d              | 4.a   |   4.b |     4.c | 4.d              | 5.a   |   5.b |     5.c |
+---------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+
|   int32 | str   | int32 | float64 | dict<int32, str> | str   | int32 | float64 | dict<int32, str> | str   | int32 | float64 | dict<int32, str> | str   | int32 | float64 | dict<int32, str> | str   | int32 | float64 | dict<int32, str> | str   | int32 | float64 |
+---------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+
|       0 | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 |
|       1 | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 |
|       2 | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 |
|       3 | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 |
|       4 | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 |
+---------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+

+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+
| 5.d              | 6.a   |   6.b |     6.c | 6.d              | 7.a   |   7.b |     7.c | 7.d              | 8.a   |   8.b |     8.c | 8.d              | 9.a   |   9.b |     9.c | 9.d              |
+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+
| dict<int32, str> | str   | int32 | float64 | dict<int32, str> | str   | int32 | float64 | dict<int32, str> | str   | int32 | float64 | dict<int32, str> | str   | int32 | float64 | dict<int32, str> |
+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+
| {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         |
| {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         |
| {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         |
| {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         |
| {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         | 'a\n' |     3 |     4.0 | {3: 'a'}         |
+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+-------+-------+---------+------------------+
showing top 5 rows
showing the first 10 of 1000 columns

truncate_limit,
types,
)
handler(repr(show))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you don't want the repr here. Due to repr, the IPython handler see the output as a string, which it prints as its repr. e.g.

In [5]: from IPython.display import display; display('a\nb')
'a\nb'

Instead, what we want is to send the show object itself. When IPython gets an object with a __repr__ it calls that and then actually `prints that string (not the repr of that string).

In [8]: class Foo:
   ...:     def __repr__(self):
   ...:         return 'a\nb'

In [9]: from IPython.display import display; display(Foo())
a
b

if include_row_fields:
characters -= estimate_size(self.row_value)
characters = max(characters, 0)
n_cols = characters // (estimate_size(self.entry) + 4) # 4 for the column index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, one thing that has long confused users is that a Matrix Table with enough entry fields on a skinny enough terminal will show with zero columns. We do print the message about the number of columns, but users still get confused. Since you're changing this code anyway, can we add a n_cols = max(n_cols, 1) to ensure we always print at least one column?

width: Optional[int] = None,
truncate: Optional[int] = None,
types: bool = True,
handler: Optional[Callable[[str], Any]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure how to type this correctly. Even PyRight can't seem to figure it out:

(base) dking@wm28c-761 hail % cat bar.py    
from IPython.display import display
reveal_type(display)

(base) dking@wm28c-761 hail % pyright bar.py
WARNING: there is a new pyright version available (v1.1.349 -> v1.1.350).
Please install the new version or set PYRIGHT_PYTHON_FORCE_VERSION to `latest`

/Users/dking/projects/hail/bar.py
  /Users/dking/projects/hail/bar.py:2:13 - information: Type of "display" is "(*objs: Unknown, include: Unknown | None = None, exclude: Unknown | None = None, metadata: Unknown | None = None, transient: Unknown | None = None, display_id: Unknown | None = None, raw: bool = False, clear: bool = False, **kwargs: Unknown) -> (DisplayHandle | None)"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, in the next set of changes, I make the type Optional[Callable].

from typing import Callable
from IPython.display import display
type(display)
Out[4]: function
isinstance(display, Callable)
Out[5]: True


def __repr__(self):
return self.__str__()

def _repr_html_(self):
s = self.table_show._repr_html_()
if self.displayed_n_cols != self.actual_n_cols:
# This method is not thoroughly tested and should be used carefully.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We prefer to not include comments in the source code unless they're linking to a document that describes some jarringly abnormal behavior (like a known WONTFIX bug in a third-party library).

The HTML show methods have never been well-tested and that's OK for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove most of the comments in the next set of changes, but I would like to leave this one in.

# Here we must distinguish between matrix table columns
# and the columns that are used to display this matrix table.
 # Hail displays both the row fields and the matrix table's columns
# and entry fields as columns in a tabular format.
# We refer to the columns used to display the matrix table as "display columns".

I think it clarifies concepts in a way that would be difficult to understand from the code alone.

s += '<p style="background: #fdd; padding: 0.4em;">'
s += f"showing the first { self.displayed_n_cols } of { self.actual_n_cols } columns"
s += f"showing the first { self.n_cols } of { total_n_cols } columns"
s += '</p>\n'
return s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary for this PR (which hugely improves the ASCII table!), but I think it would be quite simple to extend your approach for an ASCII table to producing an HTML table. Easier, even, because you don't need to worry about cooking up the padding of the ASCII cells. You just need to put the HTML tags in the right places. We could probably even delegate truncation to HTML/CSS.

ascii_str += format_line(row[block_slice], block_display_column_widths, block_right_align)
ascii_str += hline

if n_rows < matrix_table.count_rows():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counting the number of rows, in general, requires visiting each row (consider mt.explode(mt.some_array). Instead of counting the rows, use the has_more returned value from _take_n.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has_more is always False in my current implementation because of table = matrix_table.head(n_rows, n_cols).localize_entries('entries', 'cols'). I propose just printing the number of rows that are displayed regardless of the total row count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I have changed my implementation to use the has_more value now.

else:
return value + ' ' * extra_count

def format_line(values: List[str], widths: List[int], right_align: List[bool]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer including return types on all functions, this one looks like -> str to me

row_fields = list(row_dtype) if include_row_fields else list(table_key_dtype)
row_fields = [row_field for row_field in row_fields if row_field != 'entries']
truncated_row_fields = [truncate(field) for field in row_fields]
truncated_rows = [[truncate(str(row[field])) for field in row_fields] for row in rows]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of str I think we should use repr. In particular, adversarial inputs produce break the output:

In [6]: hl.utils.range_matrix_table(10,1000).annotate_entries(a='a\n').show(5, 2)
+---------+-----+-----+
| row_idx | 0.a | 1.a |
+---------+-----+-----+
|   int32 | str | str |
+---------+-----+-----+
|       0 | a
  | a
  |
|       1 | a
  | a
  |
|       2 | a
  | a
  |
|       3 | a
  | a
  |
|       4 | a
  | a
  |
+---------+-----+-----+
showing top 5 rows
showing the first 2 of 1000 columns

entry_fields = list(entry_dtype)
# entries has shape (n_rows, n_cols, len(entry_fields))
entries = [[[entry[field] for field in entry_fields] for entry in row['entries']] for row in rows]
truncated_entries = [[[truncate(str(entry)) for entry in column] for column in row] for row in entries]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

truncated_row_fields = [truncate(field) for field in row_fields]
truncated_rows = [[truncate(str(row[field])) for field in row_fields] for row in rows]
truncated_row_field_types = (
[truncate(str(row_dtype[field])) for field in row_fields] if should_show_types else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Types, on the other hand, should use str, so this is correct.

@Will-Tyler Will-Tyler requested a review from danking March 7, 2024 21:23
@Will-Tyler Will-Tyler force-pushed the update-matrix-table-show branch from 513cf79 to 9d8f693 Compare March 9, 2024 02:28
@Will-Tyler
Copy link
Contributor Author

The CI checks likely failed because of failing unit tests. I just fixed the unit tests. Please rerun the CI checks when you get a chance.

@patrick-schultz
Copy link
Collaborator

I'll rerun CI now. In case it's helpful, here are the failures from the last run, all in the doctests:

FAILED usr/local/lib/python3.9/dist-packages/hail/matrixtable.py::hail.matrixtable.MatrixTable.from_parts
FAILED usr/local/lib/python3.9/dist-packages/hail/matrixtable.py::hail.matrixtable.MatrixTable.localize_entries
FAILED usr/local/lib/python3.9/dist-packages/hail/table.py::hail.table.Table.to_matrix_table
FAILED usr/local/lib/python3.9/dist-packages/hail/table.py::hail.table.Table.to_matrix_table_row_major
FAILED usr/local/lib/python3.9/dist-packages/hail/experimental/full_outer_join_mt.py::hail.experimental.full_outer_join_mt.full_outer_join_mt
FAILED usr/local/lib/python3.9/dist-packages/hail/methods/impex.py::hail.methods.impex.import_vcf
FAILED usr/local/lib/python3.9/dist-packages/hail/methods/statgen.py::hail.methods.statgen.balding_nichols_model

@Will-Tyler
Copy link
Contributor Author

Thanks for sharing the doctest failures. Those were indeed issues, and I think I have fixed them now.

@daniel-goldstein daniel-goldstein requested review from patrick-schultz and removed request for danking April 11, 2024 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants