add truncate option to show and update default width of output #116

Merged
merged 1 commit into from
Jul 22, 2024

mattseddon
Member

@mattseddon mattseddon commented Jul 22, 2024

This PR introduces a truncate flag into DataChain.show and updates the output's default width to match the terminal's (if available). The default behaviour of DataChain.show remains the same. Users can now force the command to expand all column output using truncate=False (seems a bit late to introduce pagination before the release).
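
As a rough illustration of the behaviour described above, the sketch below builds the flat name/value list that pandas.option_context accepts. It is not the code from this PR: the names display_options and show_frame are hypothetical, and mapping truncate=False to display.max_colwidth=None is an assumption about how full cell contents might be shown.

```python
def display_options(truncate: bool = True, columns: int = 0) -> list:
    """Build a flat [name, value, ...] list for pandas.option_context.

    Hypothetical helper; the option choices are assumptions, not the PR's code.
    """
    # Never hide columns behind "..." (the thread mentions this setting).
    options = ["display.max_columns", None]
    if not truncate:
        # Assumption: None disables per-cell truncation of long values.
        options.extend(["display.max_colwidth", None])
    if columns > 0:
        # Only override the width when a usable terminal width is known.
        options.extend(["display.width", columns])
    return options


def show_frame(df, truncate: bool = True, columns: int = 0) -> None:
    """Print a DataFrame with the options applied only for this call."""
    import pandas as pd  # imported lazily; assumes pandas is installed

    with pd.option_context(*display_options(truncate, columns)):
        print(df)


print(display_options())
print(display_options(truncate=False, columns=120))
```

Using option_context keeps the settings local to one call, so the caller's global pandas configuration is left untouched. Passing columns=0 leaves display.width at whatever pandas already uses.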

Demo

Screen.Recording.2024-07-22.at.10.14.33.AM.mov

Here is a script that I used to test the code along with the output of that script:

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"


if __name__ == "__main__":
    print("** HuggingFace pipeline helper model zoo demo **")
    print("\nZero-shot object detection and classification:")
    (
        DataChain.from_storage(
            image_source,
            anon=True,
            type="image",
        )
        .filter(C("name").glob("*.jpg"))
        .limit(1)
        .map(
            Helper(
                model="google/owlv2-base-patch16",
                device=device,
                candidate_labels=["cat", "dog", "squirrel", "unknown"],
            ),
            params=["file"],
            output={"model_output": dict, "error": str},
        )
        .save("zero-shot")
    )

    print("Show default output:")
    (
        DataChain.from_dataset("zero-shot")
        .select("file.source", "file.parent", "file.name", "model_output", "error")
        .show()
    )
    print("Show expanded output:")
    (
        DataChain.from_dataset("zero-shot")
        .select("file.source", "file.parent", "file.name", "model_output", "error")
        .show(truncate=False)
    )
Output:

** HuggingFace pipeline helper model zoo demo **

Zero-shot object detection and classification:
Processed: 400 rows [00:00, 22001.17 rows/s]
Download: 16.5kB [00:14, 1.14kB/s]
Processed: 1 rows [00:00, 305.73 rows/s]
Show default output:
                  file           file       file                                       model_output error
                source         parent       name                                                         
0  gs://datachain-demo  dogs-and-cats  cat.1.jpg  [{"score": 0.6445202827453613, "label": "cat",...      
Show expanded output:
                  file           file       file                                                                                                model_output error
                source         parent       name                                                                                                                  
0  gs://datachain-demo  dogs-and-cats  cat.1.jpg  [{"score": 0.6445202827453613, "label": "cat", "box": {"xmin": 34, "ymin": 34, "xmax": 299, "ymax": 281}}]   


try:
    columns = os.get_terminal_size().columns
except OSError:

codecov bot commented Jul 22, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 85.68%. Comparing base (c40d468) to head (25f175d).

Files                     Patch %   Lines
src/datachain/lib/dc.py   80.00%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
+ Coverage   85.55%   85.68%   +0.12%     
==========================================
  Files          93       93              
  Lines        9469     9477       +8     
  Branches     1889     1891       +2     
==========================================
+ Hits         8101     8120      +19     
+ Misses       1039     1025      -14     
- Partials      329      332       +3     
Flag        Coverage Δ
datachain   85.61% <80.00%> (+0.12%) ⬆️

cloudflare-workers-and-pages bot commented Jul 22, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: 25f175d
Status: ✅  Deploy successful!
Preview URL: https://468f98cf.datachain-documentation.pages.dev
Branch Preview URL: https://add-show-option.datachain-documentation.pages.dev

@mattseddon mattseddon self-assigned this Jul 22, 2024
@mattseddon mattseddon marked this pull request as ready for review July 22, 2024 01:37
@mattseddon mattseddon requested review from dmpetrov, dberenbaum and a team July 22, 2024 01:37
Comment on lines 1024 to 1023
if columns > 0:
    options.extend(["display.width", columns])
Member

display.width : int
Width of the display in characters. In case python/IPython is running in
a terminal this can be set to None and pandas will correctly auto-detect
the width.
Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a
terminal and hence it is not possible to correctly detect the width.
[default: 80] [currently: 80]

Pandas can auto-detect width. So I don't think you need to set them yourself, or use os.get_terminal_size().

However, you should probably move (display.max_columns, None) into the if not truncate condition.
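
The quoted pandas documentation notes that width detection only works when Python runs in a terminal. A minimal sketch of an explicit lookup with a fallback is shown here, assuming a hypothetical helper name detect_width and using 80 (the default quoted above) when no terminal is attached. It uses only the standard library and is not taken from the PR.

```python
import os


def detect_width(fallback: int = 80) -> int:
    """Return the terminal width in characters, or `fallback` if unknown."""
    try:
        columns = os.get_terminal_size().columns
    except OSError:
        # Raised when stdout is not a terminal (pipes, CI logs, notebooks).
        return fallback
    # Some environments report 0 columns; treat that as "unknown" too.
    return columns if columns > 0 else fallback


print(detect_width())
```

Treating a reported width of 0 as unknown mirrors the `if columns > 0` guard in the snippet under review.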

Member Author

@mattseddon mattseddon Jul 22, 2024

Pandas can auto-detect width. So I don't think you need to set them yourself, or use os.get_terminal_size().

If you watch the demo, you can see that this behaviour is not working for me.

Running the examples/get_started/udfs/parallel.py script from #111 without the change I get:

Listing gs://datachain-demo: 400 objects [00:01, 339.57 objects/s]
Processed: 400 rows [00:00, 16351.90 rows/s]
Processed: 400 rows [00:42,  9.40 rows/s]
                   file           file           file   file  \
                 source         parent           name   size   
0   gs://datachain-demo  dogs-and-cats      cat.1.jpg  16880   
1   gs://datachain-demo  dogs-and-cats     cat.1.json     99   
2   gs://datachain-demo  dogs-and-cats     cat.10.jpg  34315   
3   gs://datachain-demo  dogs-and-cats    cat.10.json    100   
4   gs://datachain-demo  dogs-and-cats    cat.100.jpg  28377   
5   gs://datachain-demo  dogs-and-cats   cat.100.json    101   
6   gs://datachain-demo  dogs-and-cats   cat.1000.jpg   5944   
7   gs://datachain-demo  dogs-and-cats  cat.1000.json    102   
8   gs://datachain-demo  dogs-and-cats   cat.1001.jpg  23099   
9   gs://datachain-demo  dogs-and-cats  cat.1001.json    102   
10  gs://datachain-demo  dogs-and-cats   cat.1002.jpg  16999   
11  gs://datachain-demo  dogs-and-cats  cat.1002.json    102   
12  gs://datachain-demo  dogs-and-cats   cat.1003.jpg  13996   
13  gs://datachain-demo  dogs-and-cats  cat.1003.json    102   
14  gs://datachain-demo  dogs-and-cats   cat.1004.jpg  41052   
15  gs://datachain-demo  dogs-and-cats  cat.1004.json    102   
16  gs://datachain-demo  dogs-and-cats   cat.1005.jpg  33372   
17  gs://datachain-demo  dogs-and-cats  cat.1005.json    102   
18  gs://datachain-demo  dogs-and-cats   cat.1006.jpg  23571   
19  gs://datachain-demo  dogs-and-cats  cat.1006.json    102   

                file              file      file                      file  \
             version              etag is_latest             last_modified   
0   1721494538128219  CNuWtvOKtocDEAE=         1 1970-01-01 00:00:00+00:00   
1   1721494541157069  CM2F7/SKtocDEAE=         1 1970-01-01 00:00:00+00:00   
2   1721494540482739  CLPxxfSKtocDEAE=         1 1970-01-01 00:00:00+00:00   
3   1721494537938657  COHNqvOKtocDEAE=         1 1970-01-01 00:00:00+00:00   
4   1721494542320150  CJaEtvWKtocDEAE=         1 1970-01-01 00:00:00+00:00   
5   1721494541917698  CIK8nfWKtocDEAE=         1 1970-01-01 00:00:00+00:00   
6   1721494540506694  CMasx/SKtocDEAE=         1 1970-01-01 00:00:00+00:00   
7   1721494542191364  CISWrvWKtocDEAE=         1 1970-01-01 00:00:00+00:00   
8   1721494538757000  CIjH3POKtocDEAE=         1 1970-01-01 00:00:00+00:00   
9   1721494539671362  CMKulPSKtocDEAE=         1 1970-01-01 00:00:00+00:00   
10  1721494539926153  CIn1o/SKtocDEAE=         1 1970-01-01 00:00:00+00:00   
11  1721494538606028  CMyr0/OKtocDEAE=         1 1970-01-01 00:00:00+00:00   
12  1721494540292260  CKShuvSKtocDEAE=         1 1970-01-01 00:00:00+00:00   
13  1721494538083602  CJK6s/OKtocDEAE=         1 1970-01-01 00:00:00+00:00   
14  1721494539902670  CM69ovSKtocDEAE=         1 1970-01-01 00:00:00+00:00   
15  1721494541290725  COWZ9/SKtocDEAE=         1 1970-01-01 00:00:00+00:00   
16  1721494538590041  CNmu0vOKtocDEAE=         1 1970-01-01 00:00:00+00:00   
17  1721494540901128  CIi23/SKtocDEAE=         1 1970-01-01 00:00:00+00:00   
18  1721494539610043  CLvPkPSKtocDEAE=         1 1970-01-01 00:00:00+00:00   
19  1721494540202751  CP/ltPSKtocDEAE=         1 1970-01-01 00:00:00+00:00   

       file  file path_len  
   location vtype           
0      None              9  
1      None             -1  
2      None             10  
3      None             -1  
4      None             11  
5      None             -1  
6      None             12  
7      None             -1  
8      None             12  
9      None             -1  
10     None             12  
11     None             -1  
12     None             12  
13     None             -1  
14     None             12  
15     None             -1  
16     None             12  
17     None             -1  
18     None             12  
19     None             -1  

[Limited by 20 rows]

and then after the change I get the output spread across my terminal:

Processed: 400 rows [00:00, 23722.10 rows/s]
Processed: 400 rows [00:43,  9.28 rows/s]
                   file           file           file   file              file              file      file                      file     file  file path_len
                 source         parent           name   size           version              etag is_latest             last_modified location vtype         
0   gs://datachain-demo  dogs-and-cats      cat.1.jpg  16880  1721494538128219  CNuWtvOKtocDEAE=         1 1970-01-01 00:00:00+00:00     None              9
1   gs://datachain-demo  dogs-and-cats     cat.1.json     99  1721494541157069  CM2F7/SKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             -1
2   gs://datachain-demo  dogs-and-cats     cat.10.jpg  34315  1721494540482739  CLPxxfSKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             10
3   gs://datachain-demo  dogs-and-cats    cat.10.json    100  1721494537938657  COHNqvOKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             -1
4   gs://datachain-demo  dogs-and-cats    cat.100.jpg  28377  1721494542320150  CJaEtvWKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             11
5   gs://datachain-demo  dogs-and-cats   cat.100.json    101  1721494541917698  CIK8nfWKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             -1
6   gs://datachain-demo  dogs-and-cats   cat.1000.jpg   5944  1721494540506694  CMasx/SKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             12
7   gs://datachain-demo  dogs-and-cats  cat.1000.json    102  1721494542191364  CISWrvWKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             -1
8   gs://datachain-demo  dogs-and-cats   cat.1001.jpg  23099  1721494538757000  CIjH3POKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             12
9   gs://datachain-demo  dogs-and-cats  cat.1001.json    102  1721494539671362  CMKulPSKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             -1
10  gs://datachain-demo  dogs-and-cats   cat.1002.jpg  16999  1721494539926153  CIn1o/SKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             12
11  gs://datachain-demo  dogs-and-cats  cat.1002.json    102  1721494538606028  CMyr0/OKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             -1
12  gs://datachain-demo  dogs-and-cats   cat.1003.jpg  13996  1721494540292260  CKShuvSKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             12
13  gs://datachain-demo  dogs-and-cats  cat.1003.json    102  1721494538083602  CJK6s/OKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             -1
14  gs://datachain-demo  dogs-and-cats   cat.1004.jpg  41052  1721494539902670  CM69ovSKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             12
15  gs://datachain-demo  dogs-and-cats  cat.1004.json    102  1721494541290725  COWZ9/SKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             -1
16  gs://datachain-demo  dogs-and-cats   cat.1005.jpg  33372  1721494538590041  CNmu0vOKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             12
17  gs://datachain-demo  dogs-and-cats  cat.1005.json    102  1721494540901128  CIi23/SKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             -1
18  gs://datachain-demo  dogs-and-cats   cat.1006.jpg  23571  1721494539610043  CLvPkPSKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             12
19  gs://datachain-demo  dogs-and-cats  cat.1006.json    102  1721494540202751  CP/ltPSKtocDEAE=         1 1970-01-01 00:00:00+00:00     None             -1

[Limited by 20 rows]
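
The difference between the two outputs above comes down to how many columns fit within the configured width. The sketch below is a simplified model of that wrapping, not pandas' actual algorithm: the function name wrap_columns, the greedy strategy, the two-character separator, and the column widths are all illustrative assumptions.

```python
def wrap_columns(widths, max_width, sep=2):
    """Greedily group column widths into blocks that each fit in max_width."""
    chunks, current, used = [], [], 0
    for width in widths:
        # Every column after the first in a block also needs a separator.
        needed = width if not current else width + sep
        if current and used + needed > max_width:
            # The block is full: start a new one with this column.
            chunks.append(current)
            current, used = [width], width
        else:
            current.append(width)
            used += needed
    if current:
        chunks.append(current)
    return chunks


# Hypothetical column widths, loosely modelled on the table above.
widths = [20, 15, 15, 6, 18, 18, 10, 26, 9, 6, 9]
print(len(wrap_columns(widths, 80)))   # 3 blocks at the default width of 80
print(len(wrap_columns(widths, 200)))  # 1 block on a wide terminal
```

With a width of 80 the columns split into three blocks, much like the first output; with a wider limit they fit on a single line, like the second.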

Member Author

Is the width correctly set on your machine?

Member

@skshetry skshetry Jul 22, 2024

No, I get the same output as you (the first one). I think it's due to (max_columns, None), which sets it to be in unlimited mode.

The downside of removing that is that it's going to collapse columns, which is what's going to happen with your PR too.

Member Author

It seems as though setting the width to None gives no-wrap behaviour, whereas using os.get_terminal_size().columns gives the desired result:

Screen.Recording.2024-07-22.at.2.14.56.PM.mov

I am going to revert to the original implementation.

Member

If you don't set the ("display.width", None), you will get the same behaviour as os.get_terminal_size(), no?

Member Author

The width defaults to 80.

Member

Ahh, right. It's the other way around. None does the same thing as os.get_terminal_size(), so you are right.

@mattseddon mattseddon force-pushed the add-show-option branch 3 times, most recently from 4bbd514 to eae98dc Compare July 22, 2024 04:27
Member

@dmpetrov dmpetrov left a comment

Looks great

@mattseddon mattseddon merged commit c3ea4b3 into main Jul 22, 2024
18 of 19 checks passed
@mattseddon mattseddon deleted the add-show-option branch July 22, 2024 05:30