Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow download of exported parquet files #459

Merged
merged 3 commits into from
Mar 29, 2023
Merged

Conversation

RogerHYang
Copy link
Contributor

@RogerHYang RogerHYang commented Mar 28, 2023

resolves #433

for GUI only, console environment will be tackled in next PR

Comment on lines -17 to +18
try:
path.mkdir(parents=True, exist_ok=True)
except OSError as e:
if e.errno == errno.EEXIST:
pass
else:
raise
else:
path.chmod(0o777)
path.mkdir(parents=True, exist_ok=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏

Comment on lines +102 to +112
return [
ExportedFile(
file_name=path.stem,
directory=str(EXPORT_DIR),
)
for path in await loop.run_in_executor(
None,
get_exported_files,
n_latest,
)
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the async list comprehension do here?

Copy link
Contributor Author

@RogerHYang RogerHYang Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just running the I/O operation (listing files) in a separate thread so it's not blocking the event loop. The comprehension itself is not async.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the idea is that the call to get_exported_files does not block in case there are a large number of exported files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. probably doesn't matter in reality. i think it's just good practice (for I/O operations in general)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I'm curious if there's a way of accomplishing this without explicitly invoking the event loop. It would be possible to make get_exported_files a co-routine, for example. Might be cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but not in 3.8 haha

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@RogerHYang RogerHYang Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought about making get_exported_files a coroutine too but then i realized i need to call it in Jupyter notebook for session so it would be inconvenient that way

Comment on lines +36 to +60
def get_exported_files(
n_latest: int = 5,
directory: Path = EXPORT_DIR,
extension: str = "parquet",
) -> List[Path]:
"""
Yields n most recently exported files by descending modification time.

Parameters
----------
n_latest: int, optional, default=5
Specifies the number of the most recent exported files to return. If
there are fewer than n exported files then fewer than n files will
be returned.

Returns
-------
list: List[Path]
List of paths of the n most recent exported files.
"""
return nlargest(
n_latest,
directory.glob("*." + extension),
lambda p: p.stat().st_mtime,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I would expect this function to live with the config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true. i didn't find a better home for it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make sense to put it in src/phoenix/server/api/types/Model.py for now if that is the only place it is used.

Copy link
Contributor Author

@RogerHYang RogerHYang Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea that's where i had put it in the first place, but then i realize i may want to use this function in other places too, e.g. in session, so i need to put it somewhere higher up, like a utils folder

@RogerHYang RogerHYang merged commit 7d3d8ee into main Mar 29, 2023
@RogerHYang RogerHYang deleted the download-parquet branch March 29, 2023 00:25
Comment on lines +86 to +96
class Download(HTTPEndpoint):
async def get(self, request: Request) -> FileResponse:
params = QueryParams(request.query_params)
file = EXPORT_DIR / (params.get("filename", "") + ".parquet")
if not file.is_file():
raise HTTPException(status_code=404)
return FileResponse(
path=file,
filename=file.name,
media_type="application/x-octet-stream",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use Static - that way you don't have to do anything - just host the files and let the file MIME types self-describe themselves?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out how to get Static to work, because it always gets routed to the /static subfolder (even though I had initialized a separate instance). I took this get snippet from the Starlette manual and it worked out of the box.

list: List[Path]
List of paths of the n most recent exported files.
"""
return nlargest(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nlargest(
return nlatest(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a python built-in: heapq.nlargest

"""
Returns n most recent exported Parquet files sorted by descending modification time.
"""
exportedFiles(nLatest: Int! = 5): [ExportedFile!]!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking I'm not sure you're going to be able to execute graphQL queries from the python runtime for colab- at least I've failed to do so so far.

It might be simplest to just read from the directory? No need for networkIO? This is nice for the UI so non-blocking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's correct. i am working on the next PR for the console version. this is just for the GUI (e.g. in a modal)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Ability to list cluster exports for download
3 participants