Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: dag import --stats (#8237) #8237

Merged
merged 10 commits into from
Sep 23, 2021
Merged

Conversation

rvagg
Copy link
Member

@rvagg rvagg commented Jul 1, 2021

I'm just taking a stab at this and am unsure if I'm following the right pattern here so feedback would be appreciated.

Currently dag import only reports pinned CIDs and an error on pinning. If you pin-roots=false then you get nothing. I'm working on the JS version of this, including the HTTP client, and I'd really like to get a bit more information out of this process. The minimum I think is simply a report of how many blocks were imported (although in the current form it's not counting unique blocks, just total processed blocks).

So we end up with this slightly awkward event: {BlockCount:uint64, Root:*RootMeta} - where the first one is expected to be Root==nil and BlockCount telling you how many blocks, and any remaining should have Root!=nil and reporting pin status if you asked for pinning.

Is there a better pattern to follow than this? Can I make two separate event types? Can I clean up the struct a bit somehow or maybe use two separate structs rather than overloading one?

@rvagg rvagg force-pushed the rvagg/import-block-count-output branch from ae4855c to 44433f3 Compare July 1, 2021 12:25
core/commands/dag/dag.go Outdated Show resolved Hide resolved
@aschmahmann
Copy link
Contributor

@rvagg is the idea here basically to implement --progress like we have in ipfs pin add and ipfs add? If so we should also hide this behind a flag such that the output is unchanged by default

@ribasushi
Copy link
Contributor

is the idea here basically to implement --progress

No. This implements a single "end-summary" output as the very very last printout, no intermediate progress updates.

Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old response looks like this:

> ipfs dag import foo.car
Pinned root	QmbWqxBEKC3P8tqsKc98xmWNzrzDtRLMiMPL8wBuTGsMnR	success

behind the scenes, JSON (HTTP API at /api/vo/dag/import) looks like this:

> ipfs dag import foo.car --enc=json | jq
{
  "Root": {
    "Cid": {
      "/": "QmbWqxBEKC3P8tqsKc98xmWNzrzDtRLMiMPL8wBuTGsMnR"
    },
    "PinErrorMsg": ""
  }
}

While it is pretty safe to add new fields to JSON, changing the very first line printed as the text output could break people's scripts 🙈 👀

TODO before merge

We discussed this during triage today and it looks good, but we need to tweak it a bit:

  • we want Stats struct with Stats.BlockCount next to the Root so we can add other metrics in the future, eg:
  • print Stats only when --stats is passed (making this a backward-compatible opt-in feature)

@aschmahmann aschmahmann added the need/author-input Needs input from the original author label Aug 6, 2021
@BigLep
Copy link
Contributor

BigLep commented Aug 20, 2021

@rvagg : are you able to do the open TODOs?

@rvagg
Copy link
Member Author

rvagg commented Aug 23, 2021

@BigLep I'll put it on my personal backlog. Having it behind a flag makes me less enthusiastic, though I understand the reasoning.

@BigLep BigLep assigned rvagg and gammazero and unassigned rvagg Sep 3, 2021
@BigLep
Copy link
Contributor

BigLep commented Sep 3, 2021

@gammazero : given your tribute next week, can you please incorporate @lidel's comments and get this over the line?

@BigLep BigLep removed the need/author-input Needs input from the original author label Sep 3, 2021
rvagg and others added 3 commits September 7, 2021 14:45
This applies to both text and json output encoings.

- Stats data is now contained within a Stats datastructure
- Stats are printed after root so that first line of output is the same as previously, even when stats are output using --stats
@gammazero gammazero force-pushed the rvagg/import-block-count-output branch from edbbb78 to 875c3f0 Compare September 7, 2021 21:46
@gammazero
Copy link
Contributor

Done:

  • Only print stats when --stats flag is passed. This applies to both text and json output encodings.
  • Stats data is now contained within a Stats data structure
  • Stats are printed after root so that the first line of output is the same as previously, even when stats are output
  • Updated sharness test to use new stats output and --stats flag

@rvagg
Copy link
Member Author

rvagg commented Sep 8, 2021

would Approve if this wasn't my PR, but 👌 this is sweet, thanks so much @gammazero

@gammazero
Copy link
Contributor

gammazero commented Sep 8, 2021

I added PayloadBytesCount to stats. Please check that nd.Size(), done here, is the correct way to get the payload size. Was not sure if something from nd.Stats() was needed instead.

I did not add an elapsed time stat because I do not know what is the desired format for that duration value (seconds, milliseconds, microseconds, time.Duration string, something else). If the format is known, it will be trivial to add that stat, but the sharness tests will be more complicated because they will have to ignore the changing elapsed time value. Unless elapsed time is really useful in the stats, I would suggest omitting it and letting that be determined by external means.

@gammazero gammazero requested a review from lidel September 8, 2021 09:54
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, perfectly fine to keep stats limited to info around imported blocks.
See comment below.

@@ -53,9 +54,15 @@ type ResolveOutput struct {
RemPath string
}

type CarImportStats struct {
BlockCount uint64
PayloadBytesCount uint64
Copy link
Member

@lidel lidel Sep 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Payload can be confusing here: is this raw blocks (CAR-metadata) or actual data in each block (CAR-metadata-dagmetadata)?

Perhaps renaming it to BlockBytesCount and setting this to sum of nd.Stat().BlockSize() is a way to remove confusion while keeping this useful no matter what codecs are used inside of blocks?

Copy link
Contributor

@gammazero gammazero Sep 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed PayloadBytesCount to BlockBytesCount

It happens that nd.Size() and nd.Stat().BlockSize return the same value, so I think it is better to use nd.Size(), given the comment here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm.. if they are the same, it makes sense, but something feels off when I compare Size imported from CAR with value reported by ipfs dag stat:

$ ipfs dag stat bafybeihcyruaeza7uyjd6ugicbcrqumejf6uf353e5etdkhotqffwtguva                                        
Size: 27676801, NumBlocks: 383

$ ipfs dag export bafybeihcyruaeza7uyjd6ugicbcrqumejf6uf353e5etdkhotqffwtguva > test.car                           
 0s  26.41 MiB / ? [--------------------------------------------------------------------------------=-----------------------] 390.25 MiB/s 0s
 
$ ipfs dag import --stats test.car
Pinned root	bafybeihcyruaeza7uyjd6ugicbcrqumejf6uf353e5etdkhotqffwtguva	success
Imported 383 blocks (125832269 bytes)

125832269 bytes is ~125 MB which is way more than 26MB

Copy link
Contributor

@gammazero gammazero Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidel What I did previously (using nd.Size()) worked for the car files in sharness/t0054-dag-car-import-export-data/ but does not work for your example. Using nd.Stat().DataSize + nd.Stat().LinksSize works for your example, but not for the test cars/dags. The test cars/dags have stats with all zeros, for almost all blocks.

It appears that nd.Size() returns nd.Stat().CumulativeSize if a block has stat values. Othersize, nd.Size() is set to len(nd.RawData()). This makes both nd.Size() and nd.Stats() completely unreliable across different dags.

Apparently, the way to get a reliable size is to always use len(nd.RawData()). So, that is what the latest change does.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataSize and LinksSize sounds like a dag-pb thing, which probably won't be represented in the fixture files.

What stat are we actually after here, @ribasushi what's your expectation of what this sizing is going to report? I would think that it's the size of the output, which includes CAR header, CID lengths and even varint section size prefixes. But I could find that by measuring the size of the output myself, so the utility doesn't seem great. But what is the utility of reporting just the block sizes? What is the useful for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the ipfs dag stat number. It is useful in terms of "this is the amount of IPLD-data these blocks hold"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @ribasushi, len(nd.RawData()) is probably the one then, but in this case maybe just use len(block.RawData()) then you get to use the block as it comes out of the CAR rather than whatever happens to it through a Decode cycle (probably the same, but seems safer to use the one closer to the original)

@gammazero gammazero requested a review from lidel September 9, 2021 16:34
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remaining thing: #8237 (comment)

@rvagg
Copy link
Member Author

rvagg commented Sep 13, 2021

I hope you don't mind @gammazero but pushed the len(block.RawSize()) change. With or without that change the import size is good for @lidel's example data so I think we're good to go on this PR.

Pinned root     bafybeihcyruaeza7uyjd6ugicbcrqumejf6uf353e5etdkhotqffwtguva     success
Imported 383 blocks (27676801 bytes)

@BigLep BigLep added this to the go-ipfs 0.11 milestone Sep 21, 2021
@lidel lidel assigned lidel and unassigned gammazero Sep 21, 2021
basic regression tests for the default output (text and json)
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

We tested everything with --stats so I've added basic regression tests for dag import without --stats (default output).
This should be ready for merge.

@lidel lidel changed the title feat: report block count on dag import feat: dag import --stats (#8237) Sep 23, 2021
@lidel lidel merged commit 0057199 into master Sep 23, 2021
@lidel lidel deleted the rvagg/import-block-count-output branch September 23, 2021 14:23
@aschmahmann aschmahmann mentioned this pull request Sep 27, 2021
5 tasks
aschmahmann pushed a commit that referenced this pull request Sep 27, 2021
* feat: report block count on `dag import`
* fix: clean-up dag import message format
* Only print stats when --stats flag is passed

This applies to both text and json output encoding.

- Stats data is now contained within a Stats datastructure
- Stats are printed after root so that first line of output is the same as previously, even when stats are output using --stats

* fix sharness test

* Add PayloadBytesCount to stats

* Attempt to stabilize flaky tests

* Rename PayloadBytesCount to BlockBytesCount

* Correctly calculate size or imported dag

* Use RawSize of original block for import bytes calc

* test: dag import without --stats

basic regression tests for the default output (text and json)

Co-authored-by: gammazero <gammazero@users.noreply.github.com>
Co-authored-by: Marcin Rataj <lidel@lidel.org>
(cherry picked from commit 0057199)
aschmahmann pushed a commit that referenced this pull request Sep 27, 2021
* feat: report block count on `dag import`
* fix: clean-up dag import message format
* Only print stats when --stats flag is passed

This applies to both text and json output encoding.

- Stats data is now contained within a Stats datastructure
- Stats are printed after root so that first line of output is the same as previously, even when stats are output using --stats

* fix sharness test

* Add PayloadBytesCount to stats

* Attempt to stabilize flaky tests

* Rename PayloadBytesCount to BlockBytesCount

* Correctly calculate size or imported dag

* Use RawSize of original block for import bytes calc

* test: dag import without --stats

basic regression tests for the default output (text and json)

Co-authored-by: gammazero <gammazero@users.noreply.github.com>
Co-authored-by: Marcin Rataj <lidel@lidel.org>
(cherry picked from commit 0057199)
@aschmahmann aschmahmann mentioned this pull request Sep 30, 2021
62 tasks
@guseggert guseggert mentioned this pull request Nov 23, 2021
80 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants