
PBENCH-1014 Using Tarball.extract in Inventory API for extracting files from tarball #3105

Merged: 10 commits into distributed-system-analysis:main, Jun 26, 2023

Conversation

@riya-17 (Member) commented Dec 6, 2022

Using Tarball.extract in the Inventory API for extracting files from tarballs.

Fixing the tests for the same.

@riya-17 riya-17 self-assigned this Dec 6, 2022
@riya-17 riya-17 added the enhancement, Server, and API labels Dec 6, 2022
@lgtm-com commented (this comment was marked as outdated)

Comment on lines -77 to -95
    raise APIAbort(
        HTTPStatus.UNSUPPORTED_MEDIA_TYPE,
        "The specified path does not refer to a regular file",
Member:

I am a bit behind on this one, but why are we removing UNSUPPORTED_MEDIA_TYPE and returning every error as 404 now?

Member Author:

The Tarball.extract method was handling all errors around this, so I didn't see the need to include these lines.

Member:

The cache manager raises an internal error, but we don't want that to fall through to the generic (internal error) except Exception in _dispatch. That's why we have APIAbort, which is treated specially. We want to make sure that the referenced file path exists in the tarball and is a regular file. I'm not sure exactly what tarfile.extractfile will return/raise in this case, but we need to specifically catch that case and raise APIAbort with UNSUPPORTED_MEDIA_TYPE.
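For reference, a sketch of how that case might be caught, based on documented tarfile behavior: TarFile.extractfile() raises KeyError for a name that isn't in the archive, and returns None for an existing member that isn't a regular file (or link). The helper below is illustrative only, not the PR's code; APIAbort is assumed to be the project's exception class.

    import tarfile
    from http import HTTPStatus

    def stream_member(tarball_path: str, member: str):
        """Sketch: map tarfile outcomes onto APIAbort statuses."""
        tar = tarfile.open(tarball_path)
        try:
            stream = tar.extractfile(member)  # KeyError if the name is absent
        except KeyError:
            tar.close()
            raise APIAbort(HTTPStatus.NOT_FOUND, "The specified path was not found")
        if stream is None:  # an existing member that isn't a regular file
            tar.close()
            raise APIAbort(
                HTTPStatus.UNSUPPORTED_MEDIA_TYPE,
                "The specified path does not refer to a regular file",
            )
        return stream  # the caller must close both the stream and the tarfile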

@dbutenhof dbutenhof left a comment (Member)

Good start, Riya, but sadly I think this needs some major refactoring to actually work. Unit testing is going to be awkward with the mocks currently in use, and it might be easier to pull a container pod to do live experiments.

(In fact, I encourage you to write a functional API test module for /api/v1/inventory!)
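Such a module might start as small as the sketch below; the fixture names and auth header are assumptions, with only the endpoint shape taken from this discussion.

    from http import HTTPStatus

    import requests

    def test_get_inventory(server_url: str, dataset_id: str, token: str):
        """Hypothetical functional test: GET one file from a dataset's inventory."""
        response = requests.get(
            f"{server_url}/api/v1/datasets/{dataset_id}/inventory/metadata.log",
            headers={"Authorization": f"Bearer {token}"},
        )
        assert response.status_code == HTTPStatus.OK
        assert len(response.content) > 0  # expect the file's bytes back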

lib/pbench/server/cache_manager.py (review thread: outdated, resolved)
lib/pbench/server/api/resources/datasets_inventory.py (review thread: outdated, resolved)
@webbnh (Member) commented Dec 7, 2022

And, Black is unhappy with you (and I bet that isort will be, too).

@webbnh webbnh left a comment (Member)

I'm going to hold off on further review until you have addressed Dave's concerns.

lib/pbench/server/api/resources/datasets_inventory.py (4 review threads: outdated, resolved)
@portante portante added this to the v0.72 milestone Feb 7, 2023
@portante portante modified the milestones: v0.72, v0.73 Mar 14, 2023
@riya-17 riya-17 marked this pull request as draft May 25, 2023 12:38
@riya-17 riya-17 marked this pull request as ready for review June 16, 2023 08:45
@dbutenhof dbutenhof left a comment (Member)

Glad to see progress on this because we need to get the inventory API running again!

I have some concerns about the complications of filestream vs filecontents vs extract, and I think this needs to be streamlined. It also needs to be fixed, as you've broken a functional test for a tarball without metadata.log ... though that fix could end up being in intake_base.py rather than in the cache manager. (Specifically, I think you're now failing with a MetadataError exception on the cache manager create that wasn't firing before, and maybe should be refactored.)

___________________________ TestPut.test_no_metadata ___________________________
Traceback (most recent call last):
File "/var/tmp/jenkins/tox/py39/lib/python3.9/site-packages/pbench/test/functional/server/test_put.py", line 168, in test_no_metadata
assert (
AssertionError: upload nometadata returned unexpected status 400, {"message": "Tarball 'nometadata' is invalid or missing required metadata.log: A problem occurred processing metadata.log from /srv/pbench/archive/fs-version-001/UPLOAD/5f90cc513efab48adc492834b35a1fa0/nometadata.tar.xz: 'A problem occurred processing \'nometadata/metadata.log\' from /srv/pbench/archive/fs-version-001/UPLOAD/5f90cc513efab48adc492834b35a1fa0/nometadata.tar.xz: "filename \'nometadata/metadata.log\' not found"'"}

lib/pbench/server/cache_manager.py (2 review threads: outdated, resolved)
@webbnh webbnh left a comment (Member)

Below are a few comments on things I noted; I didn't get a chance to review the test code.

lib/pbench/server/api/resources/datasets_inventory.py (review thread: outdated, resolved)
lib/pbench/server/cache_manager.py (4 review threads: outdated, resolved)
lib/pbench/test/unit/server/test_cache_manager.py (review thread: outdated, resolved)
@webbnh webbnh left a comment (Member)

Here are some comments on the changes to the test code. However, the first one might become moot if you take Dave's suggestion to make the return from extract() uniform/consistent.

lib/pbench/test/unit/server/test_cache_manager.py (review thread: outdated, resolved)
lib/pbench/test/unit/server/test_datasets_inventory.py (2 review threads: outdated, resolved)
@webbnh webbnh left a comment (Member)

Looks generally good, but it needs some polish, and I have a couple of specific concerns.

  • In Tarball.filestream() if path is a directory, we return None for the "stream" key, but there is a special case for a path of "." which I think will open the directory file and return a stream for it. Is this intended behavior? (Or, is it supposed to be returning a stream for the tarball itself? Is "." supposed to match on the root of the results tree?...because I don't think it will....)
  • There are a couple of places where the mocks are returning strings where they should be returning streams.
  • There is a misplaced doc string.
  • I think the filestream() function(s) should be renamed...but perhaps that's just me. 🙂

@@ -61,7 +61,7 @@ def _get(
         cache_m = CacheManager(self.config, current_app.logger)
         try:
-            file_info, file_stream = cache_m.extract(dataset, target)
+            file_info = cache_m.filestream(dataset, target)
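For orientation, the shape of the dictionary being discussed, as far as this thread reveals it (the key names are assumptions, not the actual schema):

    import io

    # Hypothetical illustration of a filestream() result: a byte stream is
    # attached only when the target is a regular file; directories get None.
    file_info = {
        "name": "metadata.log",            # requested path within the tarball
        "type": "FILE",                    # or "DIRECTORY", which has no stream
        "stream": io.BytesIO(b"example"),  # stand-in for the real file stream
    }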
Member:

Having a function named "filestream" return a "file information" dictionary doesn't seem like the best interface. Maybe something somewhat more generic, like "get_file()", would be better?

Member:

Yeah, reality hasn't kept up with the implementation, and at some point we should consider whipping it into shape.

lib/pbench/server/api/resources/datasets_inventory.py (review thread: outdated, resolved)
lib/pbench/server/cache_manager.py (4 review threads: outdated, resolved)
lib/pbench/test/unit/server/test_cache_manager.py (2 review threads: outdated, resolved)
lib/pbench/test/unit/server/test_datasets_inventory.py (2 review threads: outdated, resolved)
@riya-17 (Member Author) commented Jun 21, 2023

  • In Tarball.filestream() if path is a directory, we return None for the "stream" key, but there is a special case for a path of "." which I think will open the directory file and return a stream for it. Is this intended behavior? (Or, is it supposed to be returning a stream for the tarball itself? Is "." supposed to match on the root of the results tree?...because I don't think it will....)

It is supposed to return the stream of the tarball itself.

@riya-17 (Member Author) commented Jun 24, 2023

So, if you're going to keep MockBytesIO, the close() method should do something useful (like set a flag indicating that it got called, which the test checks with an assertion at the end, and/or a check that self is filestream_dict["stream"]).

I have improved the test and verified the call to close() using an assertion check.

@riya-17 (Member Author) commented Jun 24, 2023

I would encourage you to rework the mocking in test_dataset_in_given_path() so that we're testing only DatasetsInventory._get() and not CacheManager.filestream() as well

@webbnh maybe we can try this in the next iteration.

@riya-17 riya-17 requested a review from webbnh June 24, 2023 12:33
@siddardh-ra siddardh-ra left a comment (Member)

Looks good 👍

@dbutenhof dbutenhof left a comment (Member)

Writing my scrum status this morning, I realized that a "glitch" I encountered late last week working on the contents API is going to apply here as well. Specifically, while in theory this all looks fine and is OK for unit tests, it's sadly useless in "real life" because it relies on the cache map ... which is at this point mostly theoretical.

That is, if we were to try a GET /datasets/<id>/inventory/metadata.log in the functional test right now, it'd fail with CacheMapMissing ... because we only unpack our tarballs in the pbench-index process, so only code within that process will ever have access to a cache map. And only for the specific tarballs it indexes in that cycle. 😦

Going forward, we need to make the cache map persistent, either via Redis or SQL. For now, however, if we actually want to be able to extract inventory files from a tarball, this could be modified to avoid the cache map (and get_info) and simply rely on extract complaining about a bad path.

I don't really want to redirect this PR now, however I want to point out that:

  1. It's not really doing anything for us. (Sigh.)
  2. It's really important to write functional tests for new features, not just unit tests, because (as we've just inadvertently demonstrated) we can easily write functioning unit tests for units that don't function. 😆

@webbnh (Member) commented Jun 26, 2023

if we were to try a GET /datasets/<id>/inventory/metadata.log in the functional test right now, it'd fail with CacheMapMissing ... because we only unpack our tarballs in the pbench-index process, so only code within that process will ever have access to a cache map. And only for the specific tarballs it indexes in that cycle.

I don't think that that is quite right. The CacheManager.find_dataset() function will build a (partial) cache map on the fly:

        # The dataset isn't already known; so search for it in the ARCHIVE tree
        # and (if found) discover the controller containing that dataset.

Now, granted, that depends on having a populated ARCHIVE tree, which presumably requires the pbench-index process, and I'm not sure whether we're actually running that or whether it's populating the ARCHIVE tree....

Going forward, we need to make the cache map persistent, either via Redis or SQL. For now, however, if we actually want to be able to extract inventory files from a tarball, this could be modified to avoid the cache map (and get_info) and simply rely on extract complaining about a bad path.

We certainly could use Redis or SQL; and, I imagine that there are other options as well. However, that's all predicated on wanting to share the map across processes, which isn't an obvious requirement to me. Once we're using an object store, we'll need an accessible "index" to what's in it, and that, presumably, will be the persistent cache map, and then perhaps all the questions will have been answered.

I don't really want to redirect this PR now

No; I think this is valuable work, and I'd like to get it merged so that we can build on it later.

It's really important to write functional tests for new features, not just unit tests, because (as we've just inadvertently demonstrated) we can easily write functioning unit tests for units that don't function. 😆

Yes, having functional tests for units which are supposed to be functional would be good. But, good unit tests are important too, and, when done well, they enable us to write good stuff before it is supposed to be functional. 🙂

@webbnh webbnh left a comment (Member)

Let's go!

@dbutenhof (Member):

I don't think that that is quite right. The CacheManager.find_dataset() function will build a (partial) cache map on the fly:

        # The dataset isn't already known; so search for it in the ARCHIVE tree
        # and (if found) discover the controller containing that dataset.

Now, granted, that depends on having a populated ARCHIVE tree, which presumably requires the pbench-index process, and I'm not sure whether we're actually running that or that it's populating the ARCHIVE tree....

No, you're misunderstanding. We have the CacheManager, which is the high-level structure mapping out the controllers and the datasets within the controllers, and we have the CacheMap, which maps out the results tree within the tarballs. Our PUT places the tarballs into the ARCHIVE tree, and the appropriate controller will be fully discovered by CacheManager.find_dataset().

But get_info needs the CacheMap in Tarball.cachemap. The CacheMap is only built when we unpack, and we only unpack for indexing. Because the CacheMap that we build when doing that is transient within the process, it goes away when pbench-index terminates, and was never shared with any other process, including the gunicorn server processes. Tarball.cachemap will always be None in the server.

One of the things I'd really love to take on in this sprint is fixing that, so that Riya's code works and I can do the same for contents. It wasn't at the top of the backlog, and maybe isn't critical for 1.0 (if we don't care about pulling inventory), but I'd really like to have it if possible: it wasn't until this morning that I realized the current inventory just won't/can't work without it. I'm tempted to start on that while I figure out how to hook the agent into an Arca plugin, given that you and Nick already seem to have work assigned while my swimlane is nearly empty. 😆

@dbutenhof dbutenhof merged commit 999f797 into distributed-system-analysis:main Jun 26, 2023
@webbnh (Member) commented Jun 26, 2023

But get_info needs the CacheMap in Tarball.cachemap. The CacheMap is only built when we unpack, and we only unpack for indexing.

Yes, I see that now. It looks like it would be straightforward (currently) for the API to build a cache map on demand in find_dataset(), assuming that the tarball has been unpacked, by using the same approach that the indexer uses, only skipping the part where we unpack the tarball.

However, this leads to a more interesting question: if we already have the tarball unpacked, why are we extracting the requested file from it again instead of fetching the file from the unpack area? I assume that the answer has something to do with our expected conversion to using an object store, but I'm not convinced. On the flip side, if we're not going to fetch the file from the unpack area, what do we need the cache map for? (Can't we just look the file up in the tarball directly?)

@dbutenhof (Member):

But get_info needs the CacheMap in Tarball.cachemap. The CacheMap is only built when we unpack, and we only unpack for indexing.

Yes, I see that now. It looks like it would be straightforward (currently) for the API to build a cache map on demand in find_dataset(), assuming that the tarball has been unpacked, by using the same approach that the indexer uses, only skipping the part where we unpack the tarball.

However, this leads to a more interesting question: if we already have the tarball unpacked, why are we extracting the requested file from it again instead of fetching the file from the unpack area? I assume that the answer has something to do with our expected conversion to using an object store, but I'm not convinced. On the flip side, if we're not going to fetch the file from the unpack area, what do we need the cache map for? (Can't we just look the file up in the tarball directly?)

The tarball is temporarily unpacked for indexing, and then deleted. The uncached copy is owned by the indexer, because there's no cache management context. This was originally part of the hybrid design where we were going to put the passthrough and "archive" servers on the same file system so that the passthrough would manage the unpacked artifacts for 0.69 compatibility while the archive server would manage archiving and indexing, and could stop worrying about the INCOMING and RESULTS trees to move forward with cache management.

But, finally: yes, one possible workaround to restore inventory quickly would be to drop the get_info call (as I mentioned in my first comment) and rely on extract to diagnose the file path for the API. I'd rather move forward with a persistent cache map (even if it's simplistic and static) because that allows dropping the Elasticsearch query (and its odd and awkward schema) from the contents API.

@webbnh (Member) commented Jun 26, 2023

The tarball is temporarily unpacked for indexing, and then deleted.

Oh, yeah. (Huh...a bunch of this stuff is starting to feel silly, at least in its current incarnation....)

But, finally: yes, one possible workaround to restore inventory quickly would be to drop the get_info call (as I mentioned in my first comment) and rely on extract to diagnose the file path for the API. I'd rather move forward with a persistent cache map (even if it's simplistic and static) because that allows dropping the Elasticsearch query (and its odd and awkward schema) from the contents API.

I'm good with dropping the use of Elasticsearch from this. However, if we're going to extract files from the tarball, perhaps it wouldn't be hard to retool the existing cache-builder code to build the cache (on demand) directly from the tarball (e.g., by parsing the output of tar tvf or by iterating over the Python TarFile object which returns TarInfo objects which have things like the file type in it) instead of doing it from the unpacked directory tree and then persisting it.
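For example, a minimal sketch of the TarFile route; the map's shape here is invented for illustration and is not the cache manager's actual structure.

    import tarfile

    def build_map_from_tarball(tarball_path: str) -> dict:
        """Build a path -> metadata map straight from the tarball, no unpacking."""
        cache_map = {}
        with tarfile.open(tarball_path, "r:*") as tar:
            for info in tar:  # iterating a TarFile yields TarInfo objects lazily
                cache_map[info.name] = {
                    "type": "dir" if info.isdir() else "file" if info.isfile() else "other",
                    "size": info.size,
                }
        return cache_map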

@dbutenhof (Member):

I'm good with dropping the use of Elasticsearch from this. However, if we're going to extract files from the tarball, perhaps it wouldn't be hard to retool the existing cache-builder code to build the cache (on demand) directly from the tarball (e.g., by parsing the output of tar tvf or by iterating over the Python TarFile object which returns TarInfo objects which have things like the file type in it) instead of doing it from the unpacked directory tree and then persisting it.

Running a tar tvf is a lot faster than iterating through tarfile.getmembers(), but it's still not "fast" on an API scale. I wouldn't want to build this dynamically on each call. Even a hack to "persist" as a global within the process means we need to do it for each of the gunicorn processes an API happens to hit, which is ugly.

A more strategic question would be how to quickly build a persistent shared cache that enables an efficient contents and inventory, ideally even for datasets we don't index. I wouldn't mind the overhead of processing a tar tvf if we only have to do it once for each dataset, and maybe that's the quickest path ... but the persistence mechanism is more interesting. It might be easier to stash it into a SQL table, at least for now, than to go through the whole process of deploying and configuring Redis in our CI and staging environments. (And SQL is a hammer I know well, even though Redis might be a better screwdriver; so, given that this isn't necessarily high-priority Panda work but I'd really like to have it soonish, I'm somewhat inclined to attack the screw with my hammer in the short term and figure out the screwdriver later. 😆)
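As a sketch of the SQL option, assuming the server's existing SQLAlchemy setup; the table and columns here are invented for illustration.

    from sqlalchemy import BigInteger, Column, Integer, String
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class CacheMapEntry(Base):
        """One row per tarball member, populated once per dataset."""

        __tablename__ = "cachemap"
        id = Column(Integer, primary_key=True, autoincrement=True)
        dataset = Column(String, index=True)  # dataset resource ID
        path = Column(String)                 # member path within the tarball
        type = Column(String)                 # "file", "dir", ...
        size = Column(BigInteger)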

@webbnh (Member) commented Jun 26, 2023

Running a tar tvf is a lot faster than iterating through tarfile.getmembers()

I'll leave that up to the people who prove things. The problem is, for each element returned, we need to parse, analyze, and act, which might make the difference much smaller (i.e., if we use TarFile, all the parsing is done for us; we just look at fields in the returned TarInfo object).

it's still not "fast" on an API scale.

I'm afraid that that is just "the cost of doing business" unless we want to generate the cachemap beforehand (which results in other costs)...but it might be "fast enough", and we could cache it in memory, at least for a while.

the persistence mechanism is more interesting.

Yeah...it has to be the case that accessing the persisted cache is substantially cheaper (by whatever metric) than regenerating it...otherwise, regenerating it is pretty simple and easy to manage (and we don't have to worry about consistency problems and disaster recovery, etc.).

As for SQL vs. Redis, I don't have (much of) a preference, so long as it meets our needs. I'm biased against Redis if we can make SQL suitably multi-process safe, just because SQL strikes me as more stable and resilient, and possibly easier to manage. (Or, maybe it's just the hammer you know vs the screwdriver you don't...as you said.)

@dbutenhof (Member):

First off, a real cache (which we don't have yet) is about accessing the unpacked artifacts, not just a map giving their names. SQL is probably fine, and certainly convenient (at least since we're already using it) for the map, but not ideal for managing the files. (Not that we can't store random large blobs, but it's not the best use of SQL.) With Redis we could manage the cache as objects along with the map, and it's built for fast cache access.

Ultimately maybe it's unrealistic to try to move contents to the cache map before we have a full cache, though, and the quickest way to get from Riya's refactoring here to making inventory actually usable is to get rid of the get_info call and rely on extract. So maybe I'll just do that...

@webbnh (Member) commented Jun 27, 2023

a real cache (which we don't have yet) is about accessing the unpacked artifacts, not just a map giving their names

Agreed. But from an operational perspective, the implementation of the cache has to be able to translate a key (e.g., a path within the tarball) to an object (which might be stored in a filesystem or in an object store) and then fetch that object for the requestor. So, I expect that SQL would be fine for holding and implementing the key translation; where and how we store the objects could be implemented separately (i.e., using a suitably accessible filesystem for now, with the intention of shifting to an S3 service in the future). I don't (yet) see the benefits of Redis here, but that's probably just because I don't know much about Redis's capabilities in terms of serving large blobs of data.
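A minimal sketch of that translation step, with every name below invented for illustration:

    from pathlib import Path

    def lookup_location(dataset_id: str, member_path: str) -> str:
        """Assumed SQL-backed key translation, stubbed here: resolve a
        (dataset, path) key to wherever the object actually lives."""
        return f"file:///srv/pbench/cache/{dataset_id}/{member_path}"

    def fetch(dataset_id: str, member_path: str) -> bytes:
        location = lookup_location(dataset_id, member_path)
        if location.startswith("file://"):
            return Path(location[len("file://"):]).read_bytes()
        raise NotImplementedError("an S3 (or other object store) fetch goes here")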

the quickest way to get from Riya's refactoring here to making inventory actually usable is to get rid of the get_info call and rely on extract.

Seems reasonable, so long as you package it appropriately with an eye toward the hoped-for future.

Labels: API, enhancement, Server
Projects: Status: Done
6 participants