Feat: limit / specify number of rows to display #253

Closed
1 task done
pep-sanwer opened this issue Mar 11, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@pep-sanwer

Checks

  • I have checked that this enhancement has not already been requested

How would you categorize this request. You can select multiple if not sure

Display (is this related to visual display of a value)

Enhancement Description

As far as I understand, currently Data Frames are displayed in their entirety up to 10k rows, after which they are sampled to 10k rows and displayed.

This request asks for an argument to DFViewer, or wherever makes the most sense, that limits the number of rows displayed to some n, where 10k > n > 1.

While I understand that it's possible to call DFViewer(df.head(10)) to display only 10 rows, that also computes the summary stats over only those 10 rows. This request is looking for behavior like the following:

DFViewer(df, max_rows=10)  # only displays 10 rows, show summary stats over entire / sampled df
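
To make the difference concrete, here is a minimal pandas-only sketch. It does not use buckaroo, and the 300-row frame is made up for illustration. It only shows why summarizing `df.head(10)` differs from summarizing the full frame.

```python
import pandas as pd

# Illustrative data only: a single integer column with values 0..299.
df = pd.DataFrame({"a": range(300)})

# Stats over the first 10 rows: what a viewer given df.head(10) would summarize.
head_mean = df.head(10)["a"].mean()

# Stats over the whole frame: what the request wants next to a 10-row display.
full_mean = df["a"].mean()

print(head_mean, full_mean)
```

The two means differ (4.5 versus 149.5 here), which is why truncating the frame before passing it to the viewer is not a substitute for a display-only row limit.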

If this is already possible my apologies.

Appreciative of this great tool!

Pseudo Code Implementation

NA

Prior Art

NA

@pep-sanwer pep-sanwer added the enhancement New feature or request label Mar 11, 2024
@paddymul
Owner

Thanks for the interest. BuckarooWidget and PolarsBuckarooWidget have a facility for changing sampling behavior through inheritance. Sampling occurs before summary stats and before serialization. The Python side of serialization is very slow. In the following code I modified DFViewer to accept a widget_klass. I also made an implementation of BuckarooWidget that uses a severely restrictive sampling_klass.

Try this code snippet out.

I will definitely modify the DFViewer function to accept a widget_klass in an upcoming release.

I could add an option for configuring sampling behavior, but for now I'd like to wait. You can write your own utility function to build a sampling_klass and assemble a DFViewer as you see fit. What do you think about the ergonomics of one way vs. the other?

import pandas as pd  # needed for the example DataFrame below

from buckaroo.buckaroo_widget import RawDFViewerWidget, BuckarooWidget
from buckaroo.dataflow.widget_extension_utils import configure_buckaroo
from buckaroo.dataflow.dataflow_extras import Sampling

def DFViewer(df,
             column_config_overrides=None,
             extra_pinned_rows=None, pinned_rows=None,
             extra_analysis_klasses=None, analysis_klasses=None,
             widget_klass=BuckarooWidget):
    """
    Display a DataFrame with buckaroo styling and analysis, no extra UI pieces

    column_config_overrides allows targeted, specific overriding of styling

    extra_pinned_rows adds pinned_rows of summary stats
    pinned_rows replaces the default pinned rows

    extra_analysis_klasses adds an analysis_klass
    analysis_klasses replaces default analysis_klass
    """
    BuckarooKls = configure_buckaroo(
        widget_klass,
        extra_pinned_rows=extra_pinned_rows, pinned_rows=pinned_rows,
        extra_analysis_klasses=extra_analysis_klasses, analysis_klasses=analysis_klasses)

    bw = BuckarooKls(df, column_config_overrides=column_config_overrides)
    dfv_config = bw.df_display_args['dfviewer_special']['df_viewer_config']
    df_data = bw.df_data_dict['main']
    summary_stats_data = bw.df_data_dict['all_stats']
    return RawDFViewerWidget(
        df_data=df_data, df_viewer_config=dfv_config, summary_stats_data=summary_stats_data)

df = pd.DataFrame({'a': [10, 20, 339, 887], 'b': ['foo', 'bar', None, 'baz']})
# DFViewer(df)  # default widget and sampling

class TwoSample(Sampling):
    # Severely restrictive limits, for demonstration only
    pre_limit = 5
    max_columns = 1
    serialize_limit = 2

class TwoBuckaroo(BuckarooWidget):
    sampling_klass = TwoSample

DFViewer(df, widget_klass=TwoBuckaroo)
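
The "write your own utility function to build a sampling_klass" idea could be sketched roughly as follows. This is only an illustration: `make_sampling_klass` is a hypothetical helper, not part of buckaroo, and the `Sampling` class here is an empty stand-in so the sketch runs without buckaroo installed. The three attribute names are the ones used in the snippet above.

```python
class Sampling:
    """Empty stand-in for buckaroo.dataflow.dataflow_extras.Sampling (assumption)."""


# Attribute names taken from the TwoSample example in this thread.
ALLOWED_LIMITS = {"pre_limit", "max_columns", "serialize_limit"}


def make_sampling_klass(base, name="CustomSample", **limits):
    """Hypothetical helper: build a subclass of `base` with the given limits."""
    unknown = set(limits) - ALLOWED_LIMITS
    if unknown:
        raise ValueError(f"unknown sampling attributes: {sorted(unknown)}")
    # type(name, bases, namespace) creates the subclass dynamically.
    return type(name, (base,), dict(limits))


TwoSample = make_sampling_klass(
    Sampling, "TwoSample", pre_limit=5, max_columns=1, serialize_limit=2
)
print(TwoSample.__name__, TwoSample.serialize_limit)
```

With buckaroo installed, the real Sampling class would presumably be passed as `base`, and the result assigned to `sampling_klass` on a BuckarooWidget subclass, as TwoBuckaroo does above.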

@pep-sanwer
Author

pep-sanwer commented Mar 12, 2024

Appreciate the speedy response!

I tried the code snippet you shared. It looked promising, but I wasn't able to produce the behavior I was looking for. Playing with pre_limit and serialize_limit did limit the number of displayed rows, but it also altered the sampling behavior. In my test case, I have a dataframe with 300 rows, and what I'd like to see is summary stats across the entire dataframe while showing only the top 5 and bottom 5 rows (by index), akin to default pandas behavior.
Just to clarify, I love the current logic of the default dataframe view after import buckaroo. What I'm looking for is to keep that logic but display fewer rows, or a configurable number of rows, something akin to pandas's pd.options.display.max_rows.

Ex:

import pandas as pd

df = pd.DataFrame({"a": range(300), "b": ["c" * i for i in range(300)]})
df

shows
image

import polars as pl

pl.from_dataframe(df)

shows
image

import buckaroo

df

shows all 300 rows, with summary stats over all 300 rows.
The desired behavior is to show only the top 5 and bottom 5 rows, with summary stats over all 300 rows.
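
As a rough pandas-only illustration of that desired split (not buckaroo code; `head_tail` is a hypothetical helper name), the rows to display and the stats can be computed from different inputs:

```python
import pandas as pd


def head_tail(df, n=5):
    """Return the first n and last n rows; small frames are returned whole."""
    if len(df) <= 2 * n:
        return df
    return pd.concat([df.head(n), df.tail(n)])


df = pd.DataFrame({"a": range(300), "b": ["c" * i for i in range(300)]})

display_rows = head_tail(df, n=5)  # the 10 rows that would be shown
full_stats = df.describe()         # stats still computed over all 300 rows

print(len(display_rows), full_stats.loc["count", "a"])
```

The point is only that the displayed subset and the summarized frame need not be the same object. How this would map onto buckaroo's sampling hooks is not something this sketch addresses.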

@paddymul
Owner

Other than the ellipsis row this should do what you want. I'd need to think a bit about how to accommodate an ellipsis row. You could just do values, but really you want a row with different styling, which requires a separate release for frontend mods.

Screenshot 2024-03-12 at 10 40 04 AM

@paddymul
Owner

As far as customizing the default display behavior goes, I love that you want to do this. It's exactly how I want people to use Buckaroo: customize it with their own opinions, and make it do what they want by default.

There are a couple of ways to get the behavior that you want, all of which will require some dev work on my end.

  1. Customize the implementation of buckaroo.widget_utils.enable. This should accept tuples of (BuckarooKls, dataframeType). Then you could have a one-liner that calls enable with your own customized widget. That will work for pandas; it's harder for polars and geopandas, since I have done a bunch of work to keep those dependencies optional.
  2. Use some type of customization framework so you could have a .buckaroo config file.

Why don't you work on some of the customizations available now, and we'll look at these options in future releases.

BTW, if you're up for it, I'd love to talk to you about how you're using Buckaroo. Contact me offline; my info is available in my GitHub profile.

@pep-sanwer
Author

Thank you so much! I'll definitely follow up with you on this!

@paddymul
Owner

paddymul commented Apr 2, 2024

How has this solution been working for you?

@paddymul
Owner

Closing because there have been no further comments.
