-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: allow download of exported parquet files #459
Conversation
try: | ||
path.mkdir(parents=True, exist_ok=True) | ||
except OSError as e: | ||
if e.errno == errno.EEXIST: | ||
pass | ||
else: | ||
raise | ||
else: | ||
path.chmod(0o777) | ||
path.mkdir(parents=True, exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙏
return [ | ||
ExportedFile( | ||
file_name=path.stem, | ||
directory=str(EXPORT_DIR), | ||
) | ||
for path in await loop.run_in_executor( | ||
None, | ||
get_exported_files, | ||
n_latest, | ||
) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the async list comprehension do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just running the I/O operation (listing files) in a separate thread so it's not blocking the event loop. The comprehension itself is not async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the idea is that the call to get_exported_files
does not block in case there are a large number of exported files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. probably doesn't matter in reality. i think it's just good practice (for I/O operations in general)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I'm curious if there's a way of accomplishing this without explicitly invoking the event loop. It would be possible to make get_exported_files
a co-routine, for example. Might be cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but not in 3.8 haha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought about making get_exported_files
a coroutine too but then i realized i need to call it in Jupyter notebook for session
so it would be inconvenient that way
def get_exported_files( | ||
n_latest: int = 5, | ||
directory: Path = EXPORT_DIR, | ||
extension: str = "parquet", | ||
) -> List[Path]: | ||
""" | ||
Yields n most recently exported files by descending modification time. | ||
|
||
Parameters | ||
---------- | ||
n_latest: int, optional, default=5 | ||
Specifies the number of the most recent exported files to return. If | ||
there are fewer than n exported files then fewer than n files will | ||
be returned. | ||
|
||
Returns | ||
------- | ||
list: List[Path] | ||
List of paths of the n most recent exported files. | ||
""" | ||
return nlargest( | ||
n_latest, | ||
directory.glob("*." + extension), | ||
lambda p: p.stat().st_mtime, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I would expect this function to live with the config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true. i didn't find a better home for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could make sense to put it in src/phoenix/server/api/types/Model.py
for now if that is the only place it is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea that's where i had put it in the first place, but then i realize i may want to use this function in other places too, e.g. in session
, so i need to put it somewhere higher up, like a utils folder
class Download(HTTPEndpoint): | ||
async def get(self, request: Request) -> FileResponse: | ||
params = QueryParams(request.query_params) | ||
file = EXPORT_DIR / (params.get("filename", "") + ".parquet") | ||
if not file.is_file(): | ||
raise HTTPException(status_code=404) | ||
return FileResponse( | ||
path=file, | ||
filename=file.name, | ||
media_type="application/x-octet-stream", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use Static
- that way you don't have to do anything - just host the files and let the file MIME types self-describe themselves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't figure out how to get Static
to work, because it always gets routed to the /static
subfolder (even though I had initialized a separate instance). I took this get
snippet from the Starlette manual and it worked out of the box.
list: List[Path] | ||
List of paths of the n most recent exported files. | ||
""" | ||
return nlargest( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return nlargest( | |
return nlatest( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a python built-in: heapq.nlargest
""" | ||
Returns n most recent exported Parquet files sorted by descending modification time. | ||
""" | ||
exportedFiles(nLatest: Int! = 5): [ExportedFile!]! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking I'm not sure you're going to be able to execute graphQL queries from the python runtime for colab- at least I've failed to do so so far.
It might be simplest to just read from the directory? No need for networkIO? This is nice for the UI so non-blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's correct. i am working on the next PR for the console version. this is just for the GUI (e.g. in a modal)
resolves #433
for GUI only, console environment will be tackled in next PR