PBENCH-1014 Using Tarball.extract in Inventory API for extracting files from tarball #3105
Conversation
```python
raise APIAbort(
    HTTPStatus.UNSUPPORTED_MEDIA_TYPE,
    "The specified path does not refer to a regular file",
)
```
I am a bit behind on this one, but why are we removing `UNSUPPORTED_MEDIA_TYPE` and returning every error as 404 now?
The `Tarball.extract` method was already handling all the errors around this, so I didn't see the need to include these lines.
The cache manager raises an internal error, but we don't want that to fall through to the generic (internal error) `except Exception` in `_dispatch`. That's why we have `APIAbort`, which is treated specially. We want to make sure that the referenced file path exists in the tarball and is a regular file. I'm not sure exactly what `tarfile.extractfile` will return/raise in this case, but we need to specifically catch that case and raise `APIAbort` with `UNSUPPORTED_MEDIA_TYPE`.
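For what it's worth, a minimal sketch of the check being asked for (the `open_member` helper name is hypothetical, and `APIAbort` is assumed importable as in the diff above): in the stdlib, `tarfile.extractfile()` returns `None` for existing members that aren't regular files and raises `KeyError` for names not in the archive, so both cases can be mapped to explicit `APIAbort`s:

```python
import tarfile
from http import HTTPStatus

# `APIAbort` is assumed to come from pbench's API machinery, per the diff.

def open_member(tar: tarfile.TarFile, path: str):
    """Return a read stream for a regular file inside the tarball."""
    try:
        info = tar.getmember(path)  # KeyError if `path` isn't in the archive
    except KeyError:
        raise APIAbort(HTTPStatus.NOT_FOUND, f"{path!r} not found in tarball")
    if not info.isfile():
        # Directories, symlinks, devices: extractfile() would return None
        raise APIAbort(
            HTTPStatus.UNSUPPORTED_MEDIA_TYPE,
            "The specified path does not refer to a regular file",
        )
    return tar.extractfile(info)  # an io.BufferedReader
```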
Good start, Riya, but sadly I think this needs some major refactoring to actually work. Unit testing is going to be awkward with the mocks currently in use, and it might be easier to pull a container pod to do live experiments.
(In fact, I encourage you to write a functional API test module for `/api/v1/inventory`!)
And, Black is unhappy with you (and I bet that isort will be, too).
I'm going to hold off on further review until you have addressed Dave's concerns.
Glad to see progress on this because we need to get the `inventory` API running again!

I have some concerns about the complications of `filestream` vs `filecontents` vs `extract`, and I think this needs to be streamlined. It also needs to be fixed, as you've broken a functional test for a tarball without `metadata.log`... though that fix could end up being in `intake_base.py` rather than in the cache manager. (Specifically, I think you're now failing with a `MetadataError` exception on the cache manager `create` that wasn't firing before, and maybe should be refactored.)
```
___________________________ TestPut.test_no_metadata ___________________________
Traceback (most recent call last):
  File "/var/tmp/jenkins/tox/py39/lib/python3.9/site-packages/pbench/test/functional/server/test_put.py", line 168, in test_no_metadata
    assert (
AssertionError: upload nometadata returned unexpected status 400, {"message": "Tarball 'nometadata' is invalid or missing required metadata.log: A problem occurred processing metadata.log from /srv/pbench/archive/fs-version-001/UPLOAD/5f90cc513efab48adc492834b35a1fa0/nometadata.tar.xz: 'A problem occurred processing \'nometadata/metadata.log\' from /srv/pbench/archive/fs-version-001/UPLOAD/5f90cc513efab48adc492834b35a1fa0/nometadata.tar.xz: "filename \'nometadata/metadata.log\' not found"'"}
```
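The underlying stdlib behavior is easy to reproduce in isolation. This hedged sketch (not pbench code) shows `tarfile` raising the same `KeyError` text seen wrapped inside the error message above when a member is missing:

```python
import io
import tarfile

# Build a tiny in-memory .tar.xz with no metadata.log, like "nometadata"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:xz") as tar:
    data = b"some result data"
    info = tarfile.TarInfo("nometadata/result.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:xz") as tar:
    try:
        tar.extractfile("nometadata/metadata.log")
    except KeyError as exc:
        print(exc)  # "filename 'nometadata/metadata.log' not found"
```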
Below are a few comments on things I noted; I didn't get a chance to review the test code.
Here are some comments on the changes to the test code. However, the first one might become moot if you take Dave's suggestion to make the return from `extract()` uniform/consistent.
Looks generally good, but it needs some polish, and I have a couple of specific concerns (see the sketch after this list):

- In `Tarball.filestream()`, if `path` is a directory, we return `None` for the `"stream"` key, but there is a special case for a path of `"."` which I think will open the directory file and return a stream for it. Is this intended behavior? (Or, is it supposed to be returning a stream for the tarball itself? Is `"."` supposed to match on the root of the results tree?...because I don't think it will....)
- There are a couple of places where the mocks are returning strings where they should be returning streams.
- There is a misplaced docstring.
- I think the `filestream()` function(s) should be renamed...but perhaps that's just me. 🙂
```diff
@@ -61,7 +61,7 @@ def _get(
     cache_m = CacheManager(self.config, current_app.logger)
     try:
-        file_info, file_stream = cache_m.extract(dataset, target)
+        file_info = cache_m.filestream(dataset, target)
```
Having a function named `filestream` return a "file information" dictionary doesn't seem like the best interface. Maybe something somewhat more generic, like `get_file()`, would be better?
Yeah, reality hasn't kept up with the implementation, and at some point we should consider whipping it into shape.
It is supposed to be returning the stream of the tarball itself.
I have improved the test and verified the call to `close()` using an assertion check.
Using `Tarball.extract` in the Inventory API to extract file content from the tarball.
@webbnh maybe we can try this in the next iteration.
Looks good 👍
Writing my scrum status this morning, I realized that a "glitch" I encountered late last week working on the `contents` API is going to apply here as well. Specifically, while in theory this all looks fine and is OK for unit tests, it's sadly useless in "real life" because it relies on the cache map ... which is at this point mostly theoretical.

That is, if we were to try a `GET /datasets/<id>/inventory/metadata.log` in the functional test right now, it'd fail with `CacheMapMissing` ... because we only unpack our tarballs in the `pbench-index` process, so only code within that process will ever have access to a cache map. And only for the specific tarballs it indexes in that cycle. 😦

Going forward, we need to make the cache map persistent, either via Redis or SQL. For now, however, if we actually want to be able to extract inventory files from a tarball, this could be modified to avoid the cache map (and `get_info`) and simply rely on `extract` complaining about a bad path.
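A rough sketch of that workaround, assuming the endpoint can resolve the tarball's archive path (the `get_inventory` name and the `APIAbort` usage here are illustrative, not the actual pbench code):

```python
import tarfile
from http import HTTPStatus
from pathlib import Path

from flask import send_file

def get_inventory(tarball_path: Path, target: str):
    """Stream a member straight from the archived tarball: no cache map."""
    tar = tarfile.open(tarball_path, "r:*")
    try:
        stream = tar.extractfile(target)  # KeyError for a bad path
    except KeyError:
        raise APIAbort(HTTPStatus.NOT_FOUND, f"{target!r} not found")
    if stream is None:
        # The path exists but isn't a regular file (e.g., a directory)
        raise APIAbort(
            HTTPStatus.UNSUPPORTED_MEDIA_TYPE,
            "The specified path does not refer to a regular file",
        )
    return send_file(stream, as_attachment=True,
                     download_name=Path(target).name)
```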
I don't really want to redirect this PR now; however, I want to point out that:

- It's not really doing anything for us. (Sigh.)
- It's really important to write functional tests for new features, not just unit tests, because (as we've just inadvertently demonstrated) we can easily write functioning unit tests for units that don't function. 😆
I don't think that that is quite right. The code comment says:

```python
# The dataset isn't already known; so search for it in the ARCHIVE tree
# and (if found) discover the controller containing that dataset.
```

Now, granted, that depends on having a populated ARCHIVE tree.
We certainly could use Redis or SQL; and, I imagine that there are other options as well. However, that's all predicated on wanting to share the map across processes, which isn't an obvious requirement to me. Once we're using an object store, we'll need an accessible "index" to what's in it, and that, presumably, will be the persistent cache map, and then perhaps all the questions will have been answered.
No; I think this is valuable work, and I'd like to get it merged so that we can build on it later.
Yes, having functional tests for units which are supposed to be functional would be good. But, good unit tests are important too, and, when done well, they enable us to write good stuff before it is supposed to be functional. 🙂
Let's go!
No, you're misunderstanding. We have the cache map code, but the map is built only when a tarball is unpacked, and right now that happens only inside the `pbench-index` process. One of the things I'd really love to take on in this sprint is fixing that, so that Riya's code works and I can do the same for `contents`.
Yes, I see that now. It looks like it would be straightforward (currently) for the API to build a cache map on demand.

However, this leads to a more interesting question: if we already have the tarball unpacked, why are we extracting the requested file from it again instead of fetching the file from the unpack area? I assume that the answer has something to do with our expected conversion to using an object store, but I'm not convinced. On the flip side, if we're not going to fetch the file from the unpack area, what do we need the cache map for? (Can't we just look the file up in the tarball directly?)
The tarball is temporarily unpacked for indexing, and then deleted. The uncached copy is owned by the indexer, because there's no cache management context. This was originally part of the hybrid design where we were going to put the passthrough and "archive" servers on the same file system, so that the passthrough would manage the unpacked artifacts for 0.69 compatibility while the archive server would manage archiving and indexing, and could stop worrying about the unpacked data.

But, finally: yes, one possible workaround to restore the `inventory` API would be to drop the cache map dependency and simply rely on `extract`.
Oh, yeah. (Huh...a bunch of this stuff is starting to feel silly, at least in its current incarnation....)
I'm good with dropping the use of Elasticsearch from this. However, if we're going to extract files from the tarball, perhaps it wouldn't be hard to retool the existing cache-builder code to build the cache (on demand) directly from the tarball (e.g., by parsing the output of a `tar` listing).
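For instance (a sketch assuming nothing beyond the stdlib), Python's `tarfile` can enumerate members without unpacking anything, which yields the same information as parsing a `tar` listing:

```python
import tarfile
from pathlib import PurePath

def build_cache_map(tarball_path: str) -> dict:
    """Build an on-demand map of member path -> metadata, no unpacking."""
    cache_map = {}
    with tarfile.open(tarball_path, "r:*") as tar:
        for member in tar.getmembers():
            cache_map[PurePath(member.name)] = {
                "type": "DIRECTORY" if member.isdir() else "FILE",
                "size": member.size,
            }
    return cache_map
```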
Running a `tar` listing on every request isn't free, though. A more strategic question would be how to quickly build a persistent shared cache that enables an efficient `contents` API.
I'll leave that up to the people who prove things. The problem is that, for each element returned, we need to parse, analyze, and act, which might make the difference much smaller (i.e., if we use …).
I'm afraid that that is just "the cost of doing business" unless we want to generate the cache map beforehand (which results in other costs)...but it might be "fast enough", and we could cache it in memory, at least for a while.
Yeah...it has to be the case that accessing the persisted cache is substantially cheaper (by whatever metric) than regenerating it...otherwise, regenerating it is pretty simple and easy to manage (and we don't have to worry about consistency problems and disaster recovery, etc.).

As for SQL vs. Redis, I don't have (much of) a preference, so long as it meets our needs. I'm biased against Redis if we can make SQL suitably multi-process safe, just because SQL strikes me as more stable and resilient, and possibly easier to manage. (Or, maybe it's just the hammer you know vs. the screwdriver you don't...as you said.)
First off, a real cache (which we don't have yet) is about accessing the unpacked artifacts, not just a map giving their names. SQL is probably fine, and certainly convenient (at least since we're already using it) for the map, but not ideal for managing the files. (Not that we can't store random large blobs, but it's not the best use of SQL.) With Redis we could manage the cache as objects along with the map, and it's built for fast cache access. Ultimately, maybe it's unrealistic to try to move the cached file data into Redis.
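For comparison, a hypothetical sketch of that Redis shape (the key names and TTL are invented for illustration), where the map and the cached bytes live side by side:

```python
import redis

r = redis.Redis()

def cache_member(dataset: str, member_path: str, data: bytes) -> None:
    # Store the object bytes and record the path in the dataset's map.
    r.set(f"cache:{dataset}:{member_path}", data, ex=3600)  # 1-hour TTL
    r.sadd(f"cachemap:{dataset}", member_path)

def fetch_member(dataset: str, member_path: str):
    # Returns the cached bytes, or None on a cache miss.
    return r.get(f"cache:{dataset}:{member_path}")
```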
Agreed. But from an operational perspective, the implementation of the cache has to be able to translate a key (e.g., a path within the tarball) to an object (which might be stored in a filesystem or it might be stored in an object store) and then fetch that object for the requestor. So, I expect that SQL would be fine for holding and implementing the key translation; where and how we store the objects could be implemented separately (i.e., using a suitably accessible filesystem for now, with the intention of shifting to an S3 service in the future). I don't (yet) see the benefits of Redis here, but that's probably just because I don't know much about Redis's capabilities in terms of serving large blobs of data.
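A small sketch of that split, using SQLite purely for illustration (table and function names are hypothetical): SQL answers the key-translation question, while the location it returns can point anywhere (a filesystem path today, an S3 URL later):

```python
import sqlite3

conn = sqlite3.connect("cachemap.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS cache_map (
           dataset TEXT,
           member_path TEXT,
           location TEXT,
           PRIMARY KEY (dataset, member_path))"""
)

def resolve(dataset: str, member_path: str) -> str:
    """Translate a (dataset, path-within-tarball) key to an object location."""
    row = conn.execute(
        "SELECT location FROM cache_map WHERE dataset = ? AND member_path = ?",
        (dataset, member_path),
    ).fetchone()
    if row is None:
        raise KeyError(member_path)
    return row[0]  # e.g., a filesystem path now, an S3 URL in the future
```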
Seems reasonable, so long as you package it appropriately with an eye toward the hoped-for future.
Using `Tarball.extract` in the Inventory API for extracting files from tarballs.
Fixing tests for the same.