Support strings in metrics #7960

Closed
shortcipher3 opened this issue Jul 1, 2022 · 8 comments · Fixed by #9387
Labels
A: metrics Related to dvc metrics feature request Requesting a new feature good first issue

Comments

@shortcipher3

dvc metrics represent scalar numbers

This is nice for finding the difference in a metric between two models. However, a couple of metrics I'm interested in would be more human-readable with units attached. Specifically:

  • model file size (kilobytes, megabytes, and gigabytes)
  • inference latency (milliseconds, seconds)

There are other reasons beyond units to support strings. For example, we use vertex.ai's training service, which can and does change without warning, so storing the date of the training run would be useful. I would also like to store the model SHA, so that given a model binary I can quickly verify which row corresponds to it.

@daavoo daavoo added A: metrics Related to dvc metrics feature request Requesting a new feature good first issue labels Jul 1, 2022
@shortcipher3
Author

I have a couple more thoughts on this.

If I commit a file to the repo with git, then in a PR/MR I can see the text differences between branches in the GitHub/GitLab changes view. Maybe text data should just be left to git.

It would be helpful to be able to diff files from different branches stored in dvc. Is there a way to do that? If I had the two files locally I could run diff file1 file2, and if they were stored in git I could run git diff branch1:file1 branch2:file2. Is there an equivalent way to run a diff in dvc?

I thought that dvc diff branch1:file1 branch2:file2 might do what I want, but it only reports that the files are different.

@pmrowla
Contributor

pmrowla commented Aug 1, 2022

You can already use git to version text-based metrics files: set cache: false to tell DVC not to handle the file as a DVC-tracked file, and then track it in git yourself. (You would still be affected by the existing DVC limitation that metrics files must contain only numeric values.)

Regarding diffing, DVC does not know anything about the type of file that it is tracking - everything is treated as arbitrary binary data, so we don't provide any kind of contextual diffing (which depends on handling specific file types).

There are some existing feature requests regarding diff behavior (like #7657), but essentially you would need to implement something yourself that wraps the dvc diff [--json] output and then passes the relevant cache paths into a separate diff tool (as described in #7657).
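For text files, one way to get a contextual diff without resolving cache paths is sketched below. Note that this is a different approach from wrapping dvc diff --json: it assumes DVC's Python API function dvc.api.read (with path, repo, and rev parameters) is available to read the file at each revision, and it hands the two texts to Python's difflib. The helper names read_tracked_text and diff_revisions are made up for illustration and are not part of DVC.

```python
import difflib


def read_tracked_text(path, rev, repo=None):
    """Read a tracked text file at a given Git revision (assumes dvc.api.read)."""
    # Imported lazily so the diff helper below also works without DVC installed.
    import dvc.api

    return dvc.api.read(path, repo=repo, rev=rev)


def diff_revisions(path, rev_a, rev_b, reader=read_tracked_text):
    """Return a unified diff of `path` between two revisions as one string.

    `reader` is any callable taking (path, rev) and returning the file's text;
    it is a hypothetical hook that makes the helper testable without a repo.
    """
    old = reader(path, rev_a).splitlines(keepends=True)
    new = reader(path, rev_b).splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old,
            new,
            fromfile=f"{rev_a}:{path}",
            tofile=f"{rev_b}:{path}",
        )
    )
```

Calling diff_revisions("file1", "branch1", "branch2") inside a DVC repository would then behave roughly like git diff branch1:file1 branch2:file1, under the assumptions above. It is only meaningful for text files; binary data would still need a type-specific tool.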

@kshitiz305

Hi @pmrowla @shortcipher3 @daavoo @codito @tizoc, I would like to join the Iterative team and contribute to the development of the project. Please let me know where I can start with open source contributions until I join the team. I am a Python developer with 4 years of experience.

@kshitiz305

Hi @shortcipher3, could you provide an example of what the current version does and what the requirements are, so that I can make the changes accordingly?

@shortcipher3
Author

shortcipher3 commented Apr 21, 2023

I created an example repo here

Essentially I have two branches with a metrics file for a tiny model:

{
  "size": "1 MB",
  "latency": "20 ms",
  "mAP": 0.7,
  "precision": 0.6,
  "recall": 0.8,
  "model": "tiny"
}

and a large model:

{
  "size": "1 GB",
  "latency": "2 s",
  "mAP": 0.8,
  "precision": 0.7,
  "recall": 0.9,
  "model": "Gigantamax"
}

When I run a diff I get the following:

# dvc metrics diff --target metrics.json -- main tiny
Path          Metric     main    tiny    Change
metrics.json  mAP        0.8     0.7     -0.1
metrics.json  precision  0.7     0.6     -0.1
metrics.json  recall     0.9     0.8     -0.1

I would love to get something more like:

# dvc metrics diff --target metrics.json -- main tiny
Path          Metric     main        tiny   Change
metrics.json  mAP        0.8         0.7    -0.1
metrics.json  precision  0.7         0.6    -0.1
metrics.json  recall     0.9         0.8    -0.1
metrics.json  size       1 GB        1 MB   ---
metrics.json  latency    2 s         20 ms  ---
metrics.json  model      Gigantamax  tiny   ---

That way I get a nice table of results and can easily compare metrics that are on completely different scales (GB/MB and seconds/milliseconds). It would be hard to read if I converted GB and MB to bytes, because I would have to slow down and count digits.

I can also add meaningful data to help the reader understand the difference.
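The behaviour requested above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how DVC implements dvc metrics diff: the function name diff_metrics is hypothetical, it handles flat dictionaries only, and it uses "---" as the change for any pair of values that are not both numbers.

```python
from numbers import Number


def _is_number(value):
    # bool is a Number subclass in Python, so exclude it explicitly.
    return isinstance(value, Number) and not isinstance(value, bool)


def diff_metrics(old, new):
    """Return (metric, old, new, change) rows for two flat metrics dicts.

    Numeric pairs get an arithmetic change; everything else gets "---".
    Unchanged metrics are omitted, and numeric rows are listed first.
    """
    rows = []
    for key in dict.fromkeys([*old, *new]):  # keeps first-seen key order
        a, b = old.get(key), new.get(key)
        if a == b:
            continue
        if _is_number(a) and _is_number(b):
            change = round(b - a, 6)  # rounding hides float noise such as -0.0999...
        else:
            change = "---"
        rows.append((key, a, b, change))
    # Stable sort: rows with a numeric change first, string rows after.
    rows.sort(key=lambda row: row[3] == "---")
    return rows


main = {"size": "1 GB", "latency": "2 s", "mAP": 0.8,
        "precision": 0.7, "recall": 0.9, "model": "Gigantamax"}
tiny = {"size": "1 MB", "latency": "20 ms", "mAP": 0.7,
        "precision": 0.6, "recall": 0.8, "model": "tiny"}

for metric, a, b, change in diff_metrics(main, tiny):
    print(f"{metric:<10} {a!s:<11} {b!s:<6} {change}")
```

With the two example files, the numeric metrics come out with a change of -0.1 and the string metrics (size, latency, model) are listed with "---", matching the table requested above.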

@shortcipher3
Author

As for being able to do a local diff: a lot of state-of-the-art research produces a family of models rather than a single model. I would love to have a metrics file for each model and be able to diff them. An example is DINO v2.

They actually have a table comparing the models on a few metrics, one of which is a string that includes units.

Some other models with multiple sizes are:

I think we could automatically generate useful tables for understanding some of these parameters, making it easier for data scientists to make decisions.

paulourbano added a commit to paulourbano/dvc that referenced this issue Apr 29, 2023
@paulourbano
Contributor

Hello @shortcipher3 and @daavoo ,

I had a look into this issue and I might have a suitable solution.

I am new to the community, so I am not sure of the best way to proceed. Should the issue be assigned to me before I open a pull request?

Thanks.

@daavoo
Contributor

daavoo commented May 1, 2023

Hi @paulourbano! Feel free to open the PR.

@daavoo daavoo linked a pull request May 2, 2023 that will close this issue
2 tasks
daavoo pushed a commit to paulourbano/dvc that referenced this issue May 4, 2023
daavoo pushed a commit to paulourbano/dvc that referenced this issue May 5, 2023
daavoo pushed a commit that referenced this issue May 5, 2023