Add dataset/pipeline columns to the table #1657

maxagin · 2022-05-05T00:49:11Z

Description: If changes were made to files, we want to see some kind of alert in the table view. For now we will add only dataset/pipeline columns, but there may be more to add later.

maxagin · 2022-05-05T01:45:34Z

Changes were made in Dataset file/s, but not in the Pipeline file/s.

No changes were made to Dataset and Pipeline file/s.

On click, show info about what was changed in the Dataset file/s. The textual content is just an example.

On click, show that no changes were made in Pipeline file/s.

Icon concept

If no changes were made, it means the situation is the same as before. In other words, it's equal.
The opposite is unequal (different than before).

Notes

We need to be able to hide and show these columns.
These can probably be added to the left panel with METRICS & PARAMS, but the section name needs to be changed to Columns (like in Studio).
Dataset and Pipeline names are just an example. However, I see two options:
a. Use Dataset and Pipeline as names, and on click display information about changes in multiple files.
b. Have more than two columns, which makes everything less compact than the first option.

maxagin · 2022-05-05T01:49:48Z

@shcheklein Following our discussion I have designed a first draft of a possible solution. Please let me know what you think.

shcheklein · 2022-05-05T02:00:03Z

@maxagin looks good, a few comments:

minor: it's a duplicate ticket (here is the discussion Include deps columns in experiment table (dataset columns) #1183 (comment))
datasets - as we discussed we need one column per dataset (with its name, etc)
color should be probably similar to Studio and CLI.
in Studio we are using a yellow dot symbol to notify about changes in the value - probably we should be using something similar here
for datasets ppl should be able to show hash, number of files, etc if they want and it makes more sense for them (+ that symbol)
Pipeline and Datasets should have different colors probably (?) - not sure if that would be too much

mattseddon · 2022-05-05T02:42:40Z

in Studio we are using a yellow dot symbol to notify about changes in the value - probably we should be using something similar here

We use a yellow color to highlight changes wrt the workspace record:

Loss and acc changed:

Loss, acc and lr changed:

maxagin · 2022-05-05T17:15:48Z

minor: it's a duplicate ticket (here is the discussion Include deps columns in experiment table (dataset columns) #1183 (comment))

Would you like to move the information from this task to the one you mentioned?
Comment: Change title to “Include deps columns in experiment table (dataset and pipeline columns)”

color should be probably similar to Studio and CLI.

We are crafting the information design. The styles will be outlined at later stages. I have used VS Code in browser styles simply because we have discussed VS Code.

in Studio we are using a yellow dot symbol to notify about changes in the value - probably we should be using something similar here

VS Code has very specific GUI requirements. That is why it looks very different from Studio. However, this can be discussed when we work on the styles.

for datasets ppl should be able to show hash, number of files, etc if they want and it makes more sense for them (+ that symbol)

At this stage I have not focused on the naming and textual content (as I have mentioned in the previous comment). In the coming iterations I will include all the ideas related to the modals content on a very basic level, so we will have something to start with for the further discussions.

Pipeline and Datasets should have different colors probably (?) - not sure if that would be too much

I have in mind a few examples of how to handle this. We will talk about it later.

Thank you for your comments @shcheklein

maxagin · 2022-05-05T23:42:37Z

datasets - as we discussed we need one column per dataset (with its name, etc)

I am working on it. I can see a few possible solutions. Will share all when ready

@shcheklein

maxagin · 2022-05-06T01:42:18Z

@shcheklein here is an update. Please share your comments

Iteration. Table sections
Dataset and Pipeline under FILES (name “files”, used just for now)

Useful only if no BG, just lines. I like this idea very much 🥊

Dataset and Pipeline are parent sections
With sections, the table has better readability

Showing all dataset files

🥁 More elegant solution and my favorite

Dataset and Pipeline modals have the same structure, meaning more user friendly.

Modal info hierarchy:

Just to let the user know that changes were made
Show where the changes were made
May be able to click and see the changes (split window).

Dataset and Pipeline modals structure

shcheklein · 2022-05-06T01:55:25Z

Just to clarify - how is the favorite one different from the initial one? (I just can't get the difference, am I missing something?)

I'm lost in all the other options a bit. I see the one Showing all dataset files that looks different - and I would say that's what we probably need. Except the Pipelines column still needs some clarification there (also the name should be Sources or Source code, not pipeline).

maxagin · 2022-05-06T02:48:45Z

Just to clarify - how is the favorite one different from the initial one?

Same table, but with the modal. In fact there are only 2 options, all the rest are intended to explain the benefits.

I do not understand why to show all Dataset files (say 10 or more) and not just one Dataset cell - I thought that we would like to show users that Dataset was changed, and after (optional) show the details - where exactly changes were made.

Also if we display in the table all Dataset files, why don't we display all Source code files? This is why I have proposed only 2 cells (Dataset and Source code)

shcheklein · 2022-05-06T03:51:54Z

I do not understand why to show all Dataset files (say 10 or more) and not just one Dataset cell

Primarily because usually there will be less than 10 datasets, but more than 10 source code files.

I thought that we would like to show users that Dataset was changed, and after (optional) show the details - where exactly changes were made.

We want to show that dataset(s) have changed, yes. How it's done - one column, multiple columns, special mark near an experiment name, etc - that's what we need to decide on.

One column for example "hides" a lot of information. For example, in some cases data file size is a good signal to see.

To summarize - I would probably initially go with a simple approach - one column per item for everything (data, source code, etc). We can just decide to toggle off certain columns (e.g. source code). We can even show some message how many columns got hidden by default.
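
A rough sketch of the "one column per item, some toggled off by default" idea. The `Column` shape, the `isSourceFile` rule (source files start hidden) and the message wording are all assumptions made for illustration, not something taken from the extension:

```typescript
// Hypothetical column model; names are illustrative only.
type ColumnType = 'metrics' | 'params' | 'deps'

interface Column {
  path: string // e.g. "data/data.xml" or "src/train.py"
  type: ColumnType
  selected: boolean // whether the column is currently shown in the table
}

// Assumption: source code files start hidden, data files start visible.
const isSourceFile = (path: string): boolean => /\.(py|sh|ipynb)$/.test(path)

// One column per dependency path, with a default visibility.
const buildDepColumns = (paths: string[]): Column[] =>
  paths.map(
    (path): Column => ({ path, selected: !isSourceFile(path), type: 'deps' })
  )

// Message telling the user how many columns were hidden by default.
const hiddenMessage = (columns: Column[]): string => {
  const hidden = columns.filter(column => !column.selected).length
  return hidden === 0 ? '' : `${hidden} column(s) hidden by default`
}

const columns = buildDepColumns([
  'data/data.xml',
  'src/train.py',
  'src/evaluate.py'
])
console.log(hiddenMessage(columns))
```

The default rule is the only opinionated part here; everything stays available as a column and the user decides what to show.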

Do we need that aggregated signal? It sounds good but can we make it work w/o allocating the whole column (it takes space)?

maxagin · 2022-05-06T16:19:13Z

Primarily because usually there will be less than 10 datasets, but more than 10 source code files.

So the requirement for the number of datasets is 1 - unlimited. Probably the same for the source code files (not sure).
I can see two options:

a. We display all in the table, with the ability to hide/show columns in the left panel. Changing the section name from METRICS, PARAMS, DATASET/S & SOURCE CODE to Columns
b. Only 2 columns Dataset and Source Code, and I see this as a better solution in general.

One column for example "hides" a lot of information. For example, in some cases data file size is a good signal to see.

The easiest way to see that changes were made is to show DATASET+Changes Made indicator.
If, for example, we will use data file size as a signal, finding the difference in the table becomes difficult, especially when the table displays multiple DATASET files.

To summarize - I would probably initially go with a simple approach - one column per item for everything (data, source code, etc). We can just decide to toggle off certain columns

From my comment above:
…with the ability to hide/show columns in the left panel. Changing the section name from METRICS, PARAMS, DATASET/S & SOURCE CODE to Columns

We can even show some message how many columns got hidden by default.

This is why I propose “Only 2 columns Dataset and Source Code” this way on click we can show all files and some signals about the changes in every file !! The space is unlimited (scroll, unfolding and more). Meaning we can cover all the cases, using just 2 columns.

It sounds good but can we make it work w/o allocating the whole column (it takes space)?

Commented just above + At this point we can not know what information we want to pull. But we know for sure that the easiest way to see that changes were made is to show: DATASET+Changes Made indicator.

shcheklein · 2022-05-07T00:09:00Z

So the requirement for the number of datasets is 1 - unlimited

Domain specific details matter here. They are not unlimited. And we can expect a certain distribution - e.g. in 80% of the project it's a single dataset and 5 source files.

Only 2 columns Dataset and Source Code

then why not go even a step further and combine metrics and params? Let's reflect on what the difference is between all of those.

If, for example, we will use data file size as a signal

In Studio we show size + a yellow dot, or the size number can be colored yellow. Is that what you mean?

Also, if you have sizes you can sort and filter by this field which also gives you a tool to navigate.

This is why I propose “Only 2 columns Dataset and Source Code” this way on click we can show all files and some signals

This sounds like an optimization and an addition (we would still need an additional element to show all columns of this type). We can do this in one way or another, but the first step is still to show all the columns. wdyt?

DATASET+Changes Made indicator.

I'm not sure I understand this.

maxagin · 2022-05-07T17:12:26Z

Domain specific details matter here. They are not unlimited. And we can expect a certain distribution - e.g. in 80% of the project it's a single dataset and 5 source files.

Yes they matter, but if we can possibly have more than 5 dataset or source files for 20% of the users, it means a lot.

then why not go even a step further and combine metrics and params? Let's reflect on what the difference is between all of those.

Good point, but I think that

if we have about 5 files in the pipeline (more than 1), this is already a good reason to have 2 columns.
This also gives a lot more flexibility in the future. Think of: you have made changes in 2 xml and 3 py files | we want to signal the user (in modal) - this is going to be a bit too complex

But the main reason as I see it is that Dataset and Pipeline are 2 different entities.

Also I have a feeling that while experimenting it could be useful to see 2 signals in the table.
Possible scenario: User added dataset info | User runs exp | User wants to try changing the Source file now | User changed the source and ran exp.
Now in the table we have a reflection of the changes on the basic level (before-dataset, now-source), at this point the user can better associate the table info with their actions.

Don't you think that in our specific domain, users will be very familiar with the Pipeline term? I feel it better describes the nature of the related files than Source Code, which can be any files in the project.

In Studio we show size + a yellow dot, or the size number can be colored yellow. Is that what you mean? Also, if you have sizes you can sort and filter by this field which also gives you a tool to navigate.

I am not sure that the size is the best signal in this situation,

because we may want to pull other information in the detail modal and having size instead of a general indicator will limit us.
If the size changed in multiple files of Dataset (or Pipeline)
Maybe more important is the fact that in a complex table we have only one symbol per cell that tells users, something was changed.

This sounds like an optimization and an additional (show all of this type columns additional element still). We can do this in one way or another, but the first step is still to show all the columns. wdyt?

we do not exactly know what info would be best to pull into the cell
The size does not clearly (visually) show the difference !important; I understood the goal as
a. notify the user about the fact that fileS were changed
b. provide an option to see the changes without leaving the table.
Having many columns is an unnecessary complication of the system.

Another argument:
In the table there are four columns of Dataset files || User hid one of the columns (data1.xml) using the filter panel || a few hours later the user made changes in data1.xml - the user won't see changes in the table because the column is filtered out and not shown.

“DATASET+Changes Made indicator.” - I'm not sure I understand this.

The easiest way to see that changes were made is to show a cell with a symbol.

If, for example, we will use data file size as a signal, finding the difference in the table becomes difficult, especially when the table displays multiple DATASET files and PIPELINE files.
6 additional columns (2 dataset + 4 pipeline files, for example) with almost the same numbers (2.7, 2.8) are not the best way to show changes, I think.

And if we want to show a symbol with size, I do not see any benefit in showing size, as at this point we only alert the user to the fact that the file changed, and additional information beside the symbol (size) makes the table harder to work with.
From what I understand the priority is to display in the table only relevant information. If the file size is the only and most important info about the file changes, then yes, it needs to be a part of the table.

But from what I understand, there are a lot of things you can do inside the files. Today you do A, tomorrow B. The symbol can be good for describing any possible change. The file size may be misleading.

shcheklein · 2022-05-07T22:33:14Z

@maxagin we are making too many assumptions (size is misleading or not, 5 datasets or less, etc, etc) too early. Actual life is more complicated, DVC is not an opinionated tool. Thus simple, general, customizable solutions are better than complex opinionated ones. In this case the simplest is to show everything and let people decide what to do.

I still consider the introduction of these compound columns (or in some other way showing aggregated changes) as an optimization and convenience rather than a solution. We need to have columns per dataset, per source file and the ability to show different signals in them + highlight if there were changes. Everything else comes secondary to that. And probably better not taking the whole column for that (they are expensive).

maxagin · 2022-05-08T04:49:33Z

@shcheklein here are two examples. Please let me know if this is what you had in mind.

shcheklein · 2022-05-08T21:16:49Z

@maxagin yes, right, something like this. Datasets and Source code will be changed to actual file path though.

File size will be only one of the options to see, other options might be - commit hash, file hash, number of files if the dataset is a directory, etc.

But we can start here I think.

I like the idea of being able to collapse columns of each type into a single "signal" in some way, but would do this as a next step.
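
One possible shape for that next step, as a sketch only: the per-file cells stay the source of truth and the collapsed "signal" is derived from them. `DepCell`, its `group` and `selected` fields, and `collapse` are hypothetical names, not existing API:

```typescript
// Hypothetical shape: one entry per dependency column in a table row.
interface DepCell {
  path: string
  group: 'data' | 'source' // assumed grouping, not something DVC provides
  changed: boolean // differs from the baseline
  selected: boolean // whether the per-file column is currently shown
}

interface GroupSignal {
  changed: boolean
  changedPaths: string[]
}

// Derive one signal per group from all cells, including hidden ones.
const collapse = (cells: DepCell[]): Record<DepCell['group'], GroupSignal> => {
  const result: Record<DepCell['group'], GroupSignal> = {
    data: { changed: false, changedPaths: [] },
    source: { changed: false, changedPaths: [] }
  }
  for (const cell of cells) {
    if (cell.changed) {
      result[cell.group].changed = true
      result[cell.group].changedPaths.push(cell.path)
    }
  }
  return result
}

const rowSignals = collapse([
  { path: 'data/data.xml', group: 'data', changed: false, selected: true },
  { path: 'src/train.py', group: 'source', changed: true, selected: false }
])
console.log(rowSignals.source.changedPaths)
```

Since `selected` is ignored when deriving the signal, a change in a hidden per-file column would still surface in the collapsed view.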

maxagin · 2022-05-10T17:00:38Z

@shcheklein

Datasets and Source code will be changed to actual file path though.

Please share an example

shcheklein · 2022-05-10T17:21:38Z

yep:

data/data.xml

maxagin · 2022-05-18T15:45:21Z

@shcheklein is this something we want to prioritize over other "improvements" tickets?

mattseddon · 2022-05-18T23:10:45Z

@maxagin we will be implementing this as per #1183 (comment), updates to the current design can go into the spec that you create for #1562.

mattseddon · 2022-05-24T07:14:49Z

Needs to take into account #1536.

mattseddon · 2022-06-06T01:35:29Z

#1830 will close #1183. Take this comment into account when starting with follow-up under this ticket.

mattseddon · 2022-07-06T23:00:17Z

Next step

Highlight the text inside of each of the data/pipelines column cells that have changed wrt the last commit. This should match the way that we highlight changes within the workspace record:

wolmir · 2022-07-12T12:32:11Z

If I understood correctly this would be the end result, right?

wolmir · 2022-07-12T17:08:42Z

@shcheklein @maxagin Can you confirm/correct the above please?

maxagin · 2022-07-12T20:08:34Z

@wolmir I think this is good. Just a question: for how long will we display the yellow color (changes made)? How should I know what has been changed if I change train.py again? Thank you

wolmir · 2022-07-12T20:11:54Z

Thanks @maxagin. I believe there was an agreement to first just show that there was a change because the commit number of the dependency is different.
I would like to know if the screenshot above makes sense to a DVC user.
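
A minimal sketch of that first step, assuming each row exposes a map from dependency path to some hash. The `Deps` type and `changedDeps` helper are hypothetical and only illustrate the comparison, not the actual implementation:

```typescript
// Hypothetical record: dependency path -> hash for one table row.
type Deps = Record<string, string>

// Paths whose hash differs from the baseline (e.g. the last commit).
// A path missing from the baseline is treated as changed.
const changedDeps = (baseline: Deps, row: Deps): string[] => {
  const changed: string[] = []
  for (const path of Object.keys(row)) {
    if (baseline[path] !== row[path]) {
      changed.push(path)
    }
  }
  return changed
}

// Made-up hashes, for illustration only.
const lastCommit: Deps = { 'data/data.xml': 'aaa111', 'src/train.py': 'bbb222' }
const workspace: Deps = { 'data/data.xml': 'aaa111', 'src/train.py': 'ccc333' }

// Cells for these paths would get the "changed" highlight.
console.log(changedDeps(lastCommit, workspace))
```

This only answers "did it change relative to the baseline"; it says nothing about what changed inside the file, which matches the scope agreed for this step.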

mattseddon · 2022-07-12T23:48:27Z

Wolmir, that is what we agreed for the next step.

wolmir · 2022-07-14T20:59:45Z

I opened a draft PR here: #2029

I just have to add unit tests for the modifications, but it should all be working.

maxagin · 2022-07-25T17:39:51Z

@wolmir can we close this one as done? Thanks

wolmir · 2022-07-26T15:51:57Z

I believe so, thanks @maxagin !

maxagin added the 🎨 design Needs design input or is being actively worked on label May 5, 2022

maxagin self-assigned this May 5, 2022

maxagin mentioned this issue May 19, 2022

Improve the table of experiments UI #1562

Closed

6 tasks

mattseddon mentioned this issue Jun 5, 2022

Add deps to experiment table and columns tree #1830

Merged

This was referenced Jul 22, 2022

Add more info to experiment table cell tooltips #1383

Closed

Only show relevant columns in experiments table #1994

Closed

shcheklein closed this as completed Jul 26, 2022
