Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dataset/pipeline columns to the table #1657

Closed
maxagin opened this issue May 5, 2022 · 31 comments
Closed

Add dataset/pipeline columns to the table #1657

maxagin opened this issue May 5, 2022 · 31 comments
Assignees
Labels
🎨 design Needs design input or is being actively worked on

Comments

@maxagin
Copy link
Contributor

maxagin commented May 5, 2022

Description: If in files some changes were made, we want to see a sort of alert in the table view. For now we will add only dataset/pipeline columns, but maybe after there will be more to add.

@maxagin maxagin added the 🎨 design Needs design input or is being actively worked on label May 5, 2022
@maxagin maxagin self-assigned this May 5, 2022
@maxagin
Copy link
Contributor Author

maxagin commented May 5, 2022

 
Changes were made in Dataset file/s, but not in the Pipeline file/s.

Screen Shot 2022-05-04 at 9 28 42 PM

 

No changes were made to Dataset and Pipeline file/s.

Screen Shot 2022-05-04 at 9 28 52 PM

 

On click showing info about that was changed in the Dataset file/s. Textual content is just for example.

Screen Shot 2022-05-04 at 9 29 04 PM

 

On click showing that no changes were made in Pipeline file/s.

Screen Shot 2022-05-04 at 9 29 13 PM

 

Icon concept

If no changes were made, it means the situation is same as before. In other words it's equal.
The opposite - unequal (different than before).

Screen Shot 2022-05-04 at 9 29 23 PM

 

Notes

  1. We need to be able to hide and show these columns.
    Probably can be added to the left panel with METRICS & PARAMS, but the section name needs to be changed to Columns (like in Studio).

  2. Dataset and Pipeline names are just an example. However I see two options:
    a. Use Dataset and Pipeline as names, and on click display information about changes in multiple files.
    b. Have more than two columns, which is making everything not compact as the first option.

@maxagin
Copy link
Contributor Author

maxagin commented May 5, 2022

@shcheklein Following our discussion I have designed a first draft of possible solution. Please let me know what do you think.

@shcheklein
Copy link
Member

@maxagin looks good, a few comments:

  • minor: it's a duplicate ticket (here is the discussion Include deps columns in experiment table (dataset columns) #1183 (comment))
  • datasets - as we discussed we need one column per dataset (with it's name, etc)
  • color should be probably similar to Studio and CLI.
  • in Studio we are using a yellow dot symbol to notify about changes in the value - probably we should be using something similar here
  • for datasets ppl should be able to show hash, number of files, etc if they want it makes more sense for them (+ that symbol)
  • Pipeline and Datasets should have different colors probably (?) - not sure if that would be too much

@mattseddon
Copy link
Member

in Studio we are using a yellow dot symbol to notify about changes in the value - probably we should be using something similar here

We use a yellow color to highlight changes wrt the workspace record:

Loss and acc changed:

image

Loss, acc and lr changed:

image

@maxagin
Copy link
Contributor Author

maxagin commented May 5, 2022

minor: it's a duplicate ticket (here is the discussion Include deps columns in experiment table (dataset columns) #1183 (comment))

Would you like to move information from this task to one you mentioned?
Comment: Change title to “Include deps columns in experiment table (dataset and pipeline columns)”

color should be probably similar to Studio and CLI.

We are crafting information design. The styles will be outlined at the later stages. I have used VS Code in browser styles simply because we have discussed VS Code.

in Studio we are using a yellow dot symbol to notify about changes in the value - probably we should be using something similar here

The VS Code has very specific GUI requirements. It is why it looks very different from the Studio. However this can be discussed when we work on the styles.

for datasets ppl should be able to show hash, number of files, etc if they want it makes more sense for them (+ that symbol)

At this stage I have not focused on the naming and textual content (as I have mentioned in the previous comment). In the coming iterations I will include all the ideas related to the modals content on a very basic level, so we will have something to start with for the further discussions.

Pipeline and Datasets should have different colors probably (?) - not sure if that would be too much

I have in mind a few examples of how to handle this. We will talk about it later.

Thank you for your comments @shcheklein

@maxagin
Copy link
Contributor Author

maxagin commented May 5, 2022

datasets - as we discussed we need one column per dataset (with it's name, etc)

I am working on it. I can see a few possible solutions. Will share all when ready

@shcheklein

@maxagin
Copy link
Contributor Author

maxagin commented May 6, 2022

@shcheklein here is an update. Please share your comments


  1. Iteration. Table sections
  2. Dataset and Pipeline under FILES (name “files”, used just for now)

Screen Shot 2022-05-05 at 9 12 11 PM

 

Useful only if no BG, just lines. I Like this idea very much 🥊

Screen Shot 2022-05-05 at 9 12 18 PM

 

  1. Dataset and Pipeline are parent section
  2. With sections, table has better readability

Screen Shot 2022-05-05 at 9 12 25 PM

 

Showing all dataset files

Screen Shot 2022-05-05 at 9 12 39 PM

 



 

🥁 More elegant solution and my favorite

Dataset and Pipeline modals have the same structure, meaning more user friendly.

Modal info hierarchy:

  1. Just to let user know , changes were made
  2. Show were the changes were made
  3. May be able to click and see the changes (split window).

Screen Shot 2022-05-05 at 9 12 58 PM

 

Dataset and Pipeline modals structure

Screen Shot 2022-05-05 at 9 32 13 PM

@shcheklein
Copy link
Member

Just to clarify - how is the favorite one different from the initial one? (I just can't get the difference, am I missing something?)

I'm lost in all the other options a bit. I see the one Showing all dataset files that looks different - and I would say that what we probably need. Except the Pipelines column still needs some clarification there (also name should be Sources or Source code, not pipeline).

@maxagin
Copy link
Contributor Author

maxagin commented May 6, 2022

Just to clarify - how is the favorite one different from the initial one?

Same table, but with the modal. In fact there are only 2 options, all the rest intended to explain the benefits.

I do not understand why to show all Dataset files (say 10 or more) and not just one Dataset cell - I thought that we would like to show users that Dataset was changed, and after (optional) show the details - where exactly changes were made.

Also if we display in the table all Dataset files, why don't we display all Source code files? This is why I have proposed only 2 cells (Dataset and Source code)

@shcheklein
Copy link
Member

I do not understand why to show all Dataset files (say 10 or more) and not just one Dataset cell

Primarily because usually there will be less than 10 datasets, but more than 10 source code files.

I thought that we would like to show users that Dataset was changed, and after (optional) show the details - where exactly changes were made.

we want to show that dataset(s) have changed, yes. How it's done - one column, multiple columns, special mark nearby an experiment name, etc - that's what we need to decide on.

One column for example "hides" a lot of information. For example, in some case data file size is a good signal to see.


To summarize - I would probably initially go with a simple approach - one column per item for everything (data, source code, etc). We can just decide to toggle off certain columns (e.g. source code). We can even show some message how many columns got hidden by default.

Do we need that aggregated signal? It sounds good but can we make w/o allocating the whole column (it takes space)?

@maxagin
Copy link
Contributor Author

maxagin commented May 6, 2022

Primarily because usually there will be less than 10 datasets, but more than 10 source code files.

So the requirements for the dataset amount is 1 - unlimited. Probably the same for the source code files (not sure).
I can see two options:

  1. We display all in the table, with the ability to hide/show columns in the left panel. Changing the section name from METRICS, PARAMS, DATASET/S & SOURCE CODE to Columns
  2. and I see this as a better solution in general. Only 2 columns Dataset and Source Code

One column for example "hides" a lot of information. For example, in some case data file size is a good signal to see.

The easiest way to see that changes were made is to show DATASET+Changes Made indicator.
If, for example, we will use data file size as a signal, to find difference in the table becomes a not easy task aspacialy in a situation when in the table we display multiple DATASET files.

To summarize - I would probably initially go with a simple approach - one column per item for everything (data, source code, etc). We can just decide to toggle off certain columns

From my comment above:
…with the ability to hide/show columns in the left panel. Changing the section name from METRICS, PARAMS, DATASET/S & SOURCE CODE to Columns

We can even show some message how many columns got hidden by default.

This is why I propose “Only 2 columns Dataset and Source Code” this way on click we can show all files and some signals about the changes in every file !! The space is unlimited (scroll, unfolding and more). Meaning we can cover all the cases, using just 2 columns.

It sounds good but can we make w/o allocating the whole column (it takes space)?

Commented just above + At this point we can not know what information we want to pull. But we know for sure that the easiest way to see that changes were made is to show: DATASET+Changes Made indicator.

@shcheklein
Copy link
Member

So the requirements for the dataset amount is 1 - unlimited

Domain specific details matter here. They are no unlimited. And we can expect a certain distributions - e.g. in 80% of the project it's a single dataset and 5 source files.

Only 2 columns Dataset and Source Code

then why not make it even step further and not combine metrics and params? let's reflect on what is the difference between all of those.

If, for example, we will use data file size as a signal

In Studio we show size + a yellow dot, or size number can colored as yellow. That's what you mean?

Also, if you have sizes you can sort and filter by this field which also gives you a tool to navigate.

This is why I propose “Only 2 columns Dataset and Source Code” this way on click we can show all files and some signals

This sounds like an optimization and an additional (show all of this type columns additional element still). We can do this in one way or another, but the first step is still to show all the columns. wdyt?

DATASET+Changes Made indicator.

I'm not sure I understand this.

@maxagin
Copy link
Contributor Author

maxagin commented May 7, 2022

Domain specific details matter here. They are not unlimited. And we can expect a certain distribution - e.g. in 80% of the project it's a single dataset and 5 source files.

Yes they matter, but if we possibly can have more then 5 single dataset or source files for 20% of the users, it means a lot.

then why not make it even step further and not combine metrics and params? let's reflect on what is the difference between all of those.

Good point, but I think that

  1. if we have about 5 files in the pipeline (more than 1), this is already a good reason to have 2 columns.
  2. This is giving a lot more flexibility in the feature also. Think of: you have made changes in 2 xml and 3 py files | we want to signal the user (in modal) - this is going to be a bit too complex

But the main reason as I see it is that Dataset and Pipeline are 2 different entities.

Also I have a feeling that while experimenting it could be useful to see 2 signals in the table.
Possible scenario: User added dataset info | User runs exp | User want to try changing the Source file now | User changed the source and ran exp.
Now in the table we have a reflection of the changes on the basic level (before-dataset, now-source), at this point the user can better associate the table info with his actions.


Do not you think that in our specific domain, users will be very familiar with the Pipeline term? I feel it better describes the nature of related files then Source Code, which can be any files in the project.

In Studio we show size + a yellow dot, or size number can colored as yellow. That's what you mean? Also, if you have sizes you can sort and filter by this field which also gives you a tool to navigate.

I am not sure that the size is the best signal in this situation,

  1. because we may want to pull other information in the detail modal and having size instead of general indicator will limit as.
  2. If the size changed in multiple files of Dataset (or Pipeline)
  3. Maybe more important is the fact that in a complex table we have only one symbol per cell that tells users, something was changed.

This sounds like an optimization and an additional (show all of this type columns additional element still). We can do this in one way or another, but the first step is still to show all the columns. wdyt?

  1. we do not exactly know what info would be best to pull to cell

  2. The size is not clearly (visually) shows the difference !important; I underwood that the goal as
    a. notify user about the fact that fileS were changed
    b. provide option to see the changes without leaving the table.

  3. Having many columns is unnecessary complication of the system.

Another argument:
In the table four columns of the Dataset files || User hided one of the columns (data1.xml) using the filter panel || a few hours later user made changes in the data1.xml - user won’t see changes in the table because the column is filtered and not shown.

“DATASET+Changes Made indicator.” - I'm not sure I understand this.

The easiest way to see that changes were made is to show a cell with a symbol.

If, for example, we will use data file size as a signal, to find difference in the table becomes a not easy task especially in a situation when in the table we display multiple DATASET files and PIPELINE files.
6 additional columns (2 dataset + 4 pipeline files for example) with almost the same numbers (2.7, 2.8) is not the best way to show changes I think.

And if we want to show a symbol with size, I do not see any benefits of showing size, as at this point we only alert the user about the fact that the file changes and editions information beside the symbol (size) make it harder to work with table.
From what I understand the priority is to display in the table only relevant information. If the file size is the only and most important info about the file changes, then yes, it needs to be a part of the table.

But from what I understand, there are a lot of things you can do inside the files. Today you do A, tomorrow B. The symbol can be good for describing any possible change. The file size is may be misleading

@shcheklein
Copy link
Member

@maxagin we are making too many assumptions (size is misleading or not, 5 datasets or less, etc, etc) too early. Actual life is more complicated, DVC is not opinionated tool. Thus simple general, customizable solutions are better than complex opinionated ones. In this case the simples is to show everything and let people decide what to do.

I still consider introduction of these compound columns (or in some other way showing aggregated changes) as an optimization and convenience rather than solution. We need to have columns per dataset, per source file and ability to show different signals in them + highlight if there were changes. Everything else comes secondary to that. And probably better not taking the whole column for that (they are expensive).

@maxagin
Copy link
Contributor Author

maxagin commented May 8, 2022

@shcheklein here are two examples. Please let me know if this is what you had in mind.

Screen Shot 2022-05-08 at 12 46 38 AM

@shcheklein
Copy link
Member

@maxagin yes, right, something like this. Datasets and Source code will be changed to actual file path though.

file size will be only one of the options to see, another options might be - commit hash, file hash, number of files is dataset is a directory, etc.

But we can start here I think.

I like the idea of being able to collapse columns of each type into a single "signal" in some way, but would do this as a next step.

@maxagin
Copy link
Contributor Author

maxagin commented May 10, 2022

@shcheklein

Datasets and Source code will be changed to actual file path though.

Please share example

@shcheklein
Copy link
Member

yep:

Screen Shot 2022-05-10 at 10 21 23 AM

data/data.xml

@maxagin
Copy link
Contributor Author

maxagin commented May 18, 2022

@shcheklein if this is something we want to prioritize over other "improvements" tickets?

@mattseddon
Copy link
Member

@maxagin we will be implementing this as per #1183 (comment), updates to the current design can go into the spec that you create for #1562.

@mattseddon
Copy link
Member

Needs to take into account #1536.

@mattseddon
Copy link
Member

mattseddon commented Jun 6, 2022

#1830 will close #1183. Take this comment into account when starting with follow-up under this ticket.

@mattseddon
Copy link
Member

Next step

Highlight the text inside of each of the data/pipelines column cells that have changed wrt the last commit. This should match the way that we highlight changes within the workspace record:

image

@wolmir
Copy link
Contributor

wolmir commented Jul 12, 2022

If I understood correctly this would be the end result, right?
Screen Shot 2022-07-12 at 09 30 03

@wolmir
Copy link
Contributor

wolmir commented Jul 12, 2022

@shcheklein @maxagin Can you confirm/correct the above please?

@maxagin
Copy link
Contributor Author

maxagin commented Jul 12, 2022

@wolmir I think this is good. Just a question, for how long will we display the yellow color (changes made)? How should I know what has been changed if I will change train.py again? Thanks you

@wolmir
Copy link
Contributor

wolmir commented Jul 12, 2022

Thanks @maxagin. I believe there was an agreement to first just show that there was a change because the commit number of the dependency is different.
I would like to know if the screenshot above makes sense to a DVC user.

@mattseddon
Copy link
Member

Wolmir, that is what we agreed for the next step.

@wolmir
Copy link
Contributor

wolmir commented Jul 14, 2022

I opened a draft PR here: #2029

I just have to add unit tests for the modifications, but it should all be working.

@maxagin
Copy link
Contributor Author

maxagin commented Jul 25, 2022

@wolmir can we close this one as done? Thanks

@wolmir
Copy link
Contributor

wolmir commented Jul 26, 2022

I believe so, thanks @maxagin !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🎨 design Needs design input or is being actively worked on
Projects
None yet
Development

No branches or pull requests

4 participants