PBENCH-1014 Using Tarball.extract in Inventory API for extracting files from tarball #3105
Conversation
```python
raise APIAbort(
    HTTPStatus.UNSUPPORTED_MEDIA_TYPE,
    "The specified path does not refer to a regular file",
)
```
I am a bit behind on this one, but why are we removing `UNSUPPORTED_MEDIA_TYPE` and returning every error as 404 now?
The `Tarball.extract` method was already handling all the errors around this, so I didn't see the need to include these lines.
The cache manager raises an internal error, but we don't want that to fall through to the generic (internal error) `except Exception` in `_dispatch`. That's why we have `APIAbort`, which is treated specially. We want to make sure that the referenced file path exists in the tarball and is a regular file. I'm not sure exactly what `tarfile.extractfile` will return/raise in this case, but we need to specifically catch that case and raise `APIAbort` with `UNSUPPORTED_MEDIA_TYPE`.
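For what it's worth, a minimal sketch of the check being asked for (the `open_member` helper name is hypothetical, and `APIAbort` is assumed importable as in the diff above): in the stdlib, `tarfile.extractfile()` returns `None` for existing members that aren't regular files and raises `KeyError` for names not in the archive, so both cases can be mapped to explicit `APIAbort`s:

```python
import tarfile
from http import HTTPStatus

# `APIAbort` is assumed to come from pbench's API machinery, per the diff.

def open_member(tar: tarfile.TarFile, path: str):
    """Return a read stream for a regular file inside the tarball."""
    try:
        info = tar.getmember(path)  # KeyError if `path` isn't in the archive
    except KeyError:
        raise APIAbort(HTTPStatus.NOT_FOUND, f"{path!r} not found in tarball")
    if not info.isfile():
        # Directories, symlinks, devices: extractfile() would return None
        raise APIAbort(
            HTTPStatus.UNSUPPORTED_MEDIA_TYPE,
            "The specified path does not refer to a regular file",
        )
    return tar.extractfile(info)  # an io.BufferedReader
```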
Good start, Riya, but sadly I think this needs some major refactoring to actually work. Unit testing is going to be awkward with the mocks currently in use, and it might be easier to pull a container pod to do live experiments.
(In fact, I encourage you to write a functional API test module for `/api/v1/inventory`!)
And, Black is unhappy with you (and I bet that isort will be, too).
I'm going to hold off on further review until you have addressed Dave's concerns.
Glad to see progress on this because we need to get the `inventory` API running again!

I have some concerns about the complications of `filestream` vs `filecontents` vs `extract`, and I think this needs to be streamlined. It also needs to be fixed, as you've broken a functional test for a tarball without `metadata.log`... though that fix could end up being in `intake_base.py` rather than in the cache manager. (Specifically, I think you're now failing with a `MetadataError` exception on the cache manager `create` that wasn't firing before, and maybe should be refactored.)
```
___________________________ TestPut.test_no_metadata ___________________________
Traceback (most recent call last):
  File "/var/tmp/jenkins/tox/py39/lib/python3.9/site-packages/pbench/test/functional/server/test_put.py", line 168, in test_no_metadata
    assert (
AssertionError: upload nometadata returned unexpected status 400, {"message": "Tarball 'nometadata' is invalid or missing required metadata.log: A problem occurred processing metadata.log from /srv/pbench/archive/fs-version-001/UPLOAD/5f90cc513efab48adc492834b35a1fa0/nometadata.tar.xz: 'A problem occurred processing \'nometadata/metadata.log\' from /srv/pbench/archive/fs-version-001/UPLOAD/5f90cc513efab48adc492834b35a1fa0/nometadata.tar.xz: "filename \'nometadata/metadata.log\' not found"'"}
```
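The underlying stdlib behavior is easy to reproduce in isolation. This hedged sketch (not pbench code) shows `tarfile` raising the same `KeyError` text seen wrapped inside the error message above when a member is missing:

```python
import io
import tarfile

# Build a tiny in-memory .tar.xz with no metadata.log, like "nometadata"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:xz") as tar:
    data = b"some result data"
    info = tarfile.TarInfo("nometadata/result.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:xz") as tar:
    try:
        tar.extractfile("nometadata/metadata.log")
    except KeyError as exc:
        print(exc)  # "filename 'nometadata/metadata.log' not found"
```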
Below are a few comments on things I noted; I didn't get a chance to review the test code.
Here are some comments on the changes to the test code. However, the first one might become moot if you take Dave's suggestion to make the return from `extract()` uniform/consistent.
Looks generally good, but it needs some polish, and I have a couple of specific concerns (see the sketch after this list):

- In `Tarball.filestream()`, if `path` is a directory, we return `None` for the `"stream"` key, but there is a special case for a path of `"."` which I think will open the directory file and return a stream for it. Is this intended behavior? (Or, is it supposed to be returning a stream for the tarball itself? Is `"."` supposed to match on the root of the results tree?...because I don't think it will....)
- There are a couple of places where the mocks are returning strings where they should be returning streams.
- There is a misplaced docstring.
- I think the `filestream()` function(s) should be renamed...but perhaps that's just me. 🙂
```diff
@@ -61,7 +61,7 @@ def _get(
     cache_m = CacheManager(self.config, current_app.logger)
     try:
-        file_info, file_stream = cache_m.extract(dataset, target)
+        file_info = cache_m.filestream(dataset, target)
```
Having a function named `filestream` return a "file information" dictionary doesn't seem like the best interface. Maybe something somewhat more generic, like `get_file()`, would be better?
Yeah, reality hasn't kept up with the implementation, and at some point we should consider whipping it into shape.
It is supposed to be returning the stream of the tarball itself.
I have improved the test and verified the call to `close()` using an assertion check.
Using `Tarball.extract` in the Inventory API to extract file content from the tarball.
@webbnh maybe we can try this in the next iteration.
Looks good 👍
Writing my scrum status this morning, I realized that a "glitch" I encountered late last week working on the `contents` API is going to apply here as well. Specifically, while in theory this all looks fine and is OK for unit tests, it's sadly useless in "real life" because it relies on the cache map ... which is at this point mostly theoretical.

That is, if we were to try a `GET /datasets/<id>/inventory/metadata.log` in the functional test right now, it'd fail with `CacheMapMissing` ... because we only unpack our tarballs in the `pbench-index` process, so only code within that process will ever have access to a cache map. And only for the specific tarballs it indexes in that cycle. 😦

Going forward, we need to make the cache map persistent, either via Redis or SQL. For now, however, if we actually want to be able to extract inventory files from a tarball, this could be modified to avoid the cache map (and `get_info`) and simply rely on `extract` complaining about a bad path.
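A rough sketch of that workaround, assuming the endpoint can resolve the tarball's archive path (the `get_inventory` name and the `APIAbort` usage here are illustrative, not the actual pbench code):

```python
import tarfile
from http import HTTPStatus
from pathlib import Path

from flask import send_file

def get_inventory(tarball_path: Path, target: str):
    """Stream a member straight from the archived tarball: no cache map."""
    tar = tarfile.open(tarball_path, "r:*")
    try:
        stream = tar.extractfile(target)  # KeyError for a bad path
    except KeyError:
        raise APIAbort(HTTPStatus.NOT_FOUND, f"{target!r} not found")
    if stream is None:
        # The path exists but isn't a regular file (e.g., a directory)
        raise APIAbort(
            HTTPStatus.UNSUPPORTED_MEDIA_TYPE,
            "The specified path does not refer to a regular file",
        )
    return send_file(stream, as_attachment=True,
                     download_name=Path(target).name)
```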
I don't really want to redirect this PR now; however, I want to point out that:

- It's not really doing anything for us. (Sigh.)
- It's really important to write functional tests for new features, not just unit tests, because (as we've just inadvertently demonstrated) we can easily write functioning unit tests for units that don't function. 😆
I don't think that that is quite right. The code comment says:

```python
# The dataset isn't already known; so search for it in the ARCHIVE tree
# and (if found) discover the controller containing that dataset.
```

Now, granted, that depends on having a populated ARCHIVE tree.
We certainly could use Redis or SQL; and, I imagine that there are other options as well. However, that's all predicated on wanting to share the map across processes, which isn't an obvious requirement to me. Once we're using an object store, we'll need an accessible "index" to what's in it, and that, presumably, will be the persistent cache map, and then perhaps all the questions will have been answered.
No; I think this is valuable work, and I'd like to get it merged so that we can build on it later.
Yes, having functional tests for units which are supposed to be functional would be good. But, good unit tests are important too, and, when done well, they enable us to write good stuff before it is supposed to be functional. 🙂
Let's go!
No, you're misunderstanding. We have the cache map code, but the map is built only when a tarball is unpacked, and right now that happens only inside the `pbench-index` process. One of the things I'd really love to take on in this sprint is fixing that, so that Riya's code works and I can do the same for `contents`.
Yes, I see that now. It looks like it would be straightforward (currently) for the API to build a cache map on demand.

However, this leads to a more interesting question: if we already have the tarball unpacked, why are we extracting the requested file from it again instead of fetching the file from the unpack area? I assume that the answer has something to do with our expected conversion to using an object store, but I'm not convinced. On the flip side, if we're not going to fetch the file from the unpack area, what do we need the cache map for? (Can't we just look the file up in the tarball directly?)
The tarball is temporarily unpacked for indexing, and then deleted. The uncached copy is owned by the indexer, because there's no cache management context. This was originally part of the hybrid design where we were going to put the passthrough and "archive" servers on the same file system, so that the passthrough would manage the unpacked artifacts for 0.69 compatibility while the archive server would manage archiving and indexing, and could stop worrying about the unpacked data.

But, finally: yes, one possible workaround to restore the `inventory` API would be to drop the cache map dependency and simply rely on `extract`.
Oh, yeah. (Huh...a bunch of this stuff is starting to feel silly, at least in its current incarnation....)
I'm good with dropping the use of Elasticsearch from this. However, if we're going to extract files from the tarball, perhaps it wouldn't be hard to retool the existing cache-builder code to build the cache (on demand) directly from the tarball (e.g., by parsing the output of a `tar` listing).
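For instance (a sketch assuming nothing beyond the stdlib), Python's `tarfile` can enumerate members without unpacking anything, which yields the same information as parsing a `tar` listing:

```python
import tarfile
from pathlib import PurePath

def build_cache_map(tarball_path: str) -> dict:
    """Build an on-demand map of member path -> metadata, no unpacking."""
    cache_map = {}
    with tarfile.open(tarball_path, "r:*") as tar:
        for member in tar.getmembers():
            cache_map[PurePath(member.name)] = {
                "type": "DIRECTORY" if member.isdir() else "FILE",
                "size": member.size,
            }
    return cache_map
```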
Running a `tar` listing on every request isn't free, though. A more strategic question would be how to quickly build a persistent shared cache that enables an efficient `contents` API.
I'll leave that up to the people who prove things. The problem is that, for each element returned, we need to parse, analyze, and act, which might make the difference much smaller (i.e., if we use …).
I'm afraid that that is just "the cost of doing business" unless we want to generate the cache map beforehand (which results in other costs)...but it might be "fast enough", and we could cache it in memory, at least for a while.
Yeah...it has to be the case that accessing the persisted cache is substantially cheaper (by whatever metric) than regenerating it...otherwise, regenerating it is pretty simple and easy to manage (and we don't have to worry about consistency problems and disaster recovery, etc.).

As for SQL vs. Redis, I don't have (much of) a preference, so long as it meets our needs. I'm biased against Redis if we can make SQL suitably multi-process safe, just because SQL strikes me as more stable and resilient, and possibly easier to manage. (Or, maybe it's just the hammer you know vs. the screwdriver you don't...as you said.)
First off, a real cache (which we don't have yet) is about accessing the unpacked artifacts, not just a map giving their names. SQL is probably fine, and certainly convenient (at least since we're already using it) for the map, but not ideal for managing the files. (Not that we can't store random large blobs, but it's not the best use of SQL.) With Redis we could manage the cache as objects along with the map, and it's built for fast cache access. Ultimately, maybe it's unrealistic to try to move the cached file data into Redis.
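For comparison, a hypothetical sketch of that Redis shape (the key names and TTL are invented for illustration), where the map and the cached bytes live side by side:

```python
import redis

r = redis.Redis()

def cache_member(dataset: str, member_path: str, data: bytes) -> None:
    # Store the object bytes and record the path in the dataset's map.
    r.set(f"cache:{dataset}:{member_path}", data, ex=3600)  # 1-hour TTL
    r.sadd(f"cachemap:{dataset}", member_path)

def fetch_member(dataset: str, member_path: str):
    # Returns the cached bytes, or None on a cache miss.
    return r.get(f"cache:{dataset}:{member_path}")
```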
Agreed. But from an operational perspective, the implementation of the cache has to be able to translate a key (e.g., a path within the tarball) to an object (which might be stored in a filesystem or it might be stored in an object store) and then fetch that object for the requestor. So, I expect that SQL would be fine for holding and implementing the key translation; where and how we store the objects could be implemented separately (i.e., using a suitably accessible filesystem for now, with the intention of shifting to an S3 service in the future). I don't (yet) see the benefits of Redis here, but that's probably just because I don't know much about Redis's capabilities in terms of serving large blobs of data.
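A small sketch of that split, using SQLite purely for illustration (table and function names are hypothetical): SQL answers the key-translation question, while the location it returns can point anywhere (a filesystem path today, an S3 URL later):

```python
import sqlite3

conn = sqlite3.connect("cachemap.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS cache_map (
           dataset TEXT,
           member_path TEXT,
           location TEXT,
           PRIMARY KEY (dataset, member_path))"""
)

def resolve(dataset: str, member_path: str) -> str:
    """Translate a (dataset, path-within-tarball) key to an object location."""
    row = conn.execute(
        "SELECT location FROM cache_map WHERE dataset = ? AND member_path = ?",
        (dataset, member_path),
    ).fetchone()
    if row is None:
        raise KeyError(member_path)
    return row[0]  # e.g., a filesystem path now, an S3 URL in the future
```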
Seems reasonable, so long as you package it appropriately with an eye toward the hoped-for future.
Using `Tarball.extract` in the Inventory API for extracting files from tarballs.
Fixing tests for the same.