Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate duckdb-wasm to query parquet/csv files on the UI (experimental) #4821

Merged
merged 7 commits into from
Dec 15, 2022

Conversation

ozkatz
Copy link
Collaborator

@ozkatz ozkatz commented Dec 14, 2022

closes #4820

@ozkatz ozkatz added area/UI Improvements or additions to UI include-changelog PR description should be included in next release changelog labels Dec 14, 2022
@ozkatz ozkatz marked this pull request as ready for review December 14, 2022 19:53
@ozkatz
Copy link
Collaborator Author

ozkatz commented Dec 14, 2022

Preview: @AdiPolak

From @AdiPolak's tweet :)

@ozkatz ozkatz changed the title Integrate duckdb-wasm to query parquet/csv files on the UI Integrate duckdb-wasm to query parquet/csv files on the UI (experimental) Dec 14, 2022
Copy link
Contributor

@eladlachmi eladlachmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome solution!
Have a few comments/suggestions and I think you can delete /components/repository/dataTable.jsx, since it's empty

package.json Outdated
@@ -0,0 +1 @@
{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove this and the package-lock.json from the top level

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

ctx := r.Context()
c.LogAction(ctx, "head_object", r, repository, ref, "")

_, err := c.Catalog.GetRepository(ctx, repository)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks to me like GetEntry already handles this error state. Not sure this is needed here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

throw new Error(await extractError(response));
}

return response.text()
}

async getWithHeaders(repoId, ref, path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getWithHeaders is still used in ReadmeContainer. Needs to be replaced there too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -28,7 +28,7 @@ import {RefTypeBranch, RefTypeCommit} from "../../../constants";
import {copyTextToClipboard} from "../controls";
import Modal from "react-bootstrap/Modal";

const PREVIEW_SIZE_LIMIT = 4194304; // 4MB in bytes
const PREVIEW_SIZE_LIMIT = 5.243e+8; // 4MB in bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const PREVIEW_SIZE_LIMIT = 5.243e+8; // 4MB in bytes
const PREVIEW_SIZE_LIMIT = 5.243e+8; // 524.3MB in bytes

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, changed :)

onReady: (text: string) => JSX.Element;
}

export const DATA_FORMATS = ["parquet", "csv", "tsv"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're moving to TS, might as well make the most of it. I'd suggest replacing this with an enum.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! didn't know ts has enums. Changed.

return <ObjectTooLarge {...props}/>
}
// is this markdown?
if (contentType === "text/markdown" || fileExtension === 'md') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is overly verbose vs. the previous implementation (look up a map of renders). With the map it's also much quicker and easier to understand what's supported or not.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I tried fitting the new logic into a map resolution interface and it ended up being complex. I took another stab at it. The difference being:

  1. We want duckdb to render data files even if they are above the size limit (duckdb does HTTP byte range requests, so it doesn't really pull all the data)
  2. We now render code and configuration files, there are A LOT of types and we need to somehow normalize the different content types and extensions to fit the list of table names.

Anyway - took another stab at it, I think it's more robust now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much more readable now IMHO
If we ever need more complex logic for some keys, we can change the values from FC<RendererComponent> to a function or HoC

/>
);

export const DuckDBRenderer: FC<RendererComponentWithDataFormat> = (props) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move DuckDBRenderer to another file along with DataRow, as it's getting a bit long. I should probably refactor to move all renderers out of this file. I should have done it this way to begin with.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

value: any;
}

const DataRow: FC<RowProps> = ({ value }) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since RowProps is a very minimal interface and it's probably not going to be reused, you can inline the type def.

Suggested change
const DataRow: FC<RowProps> = ({ value }) => {
const DataRow: FC<{ value: any }> = ({ value }) => {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :)

export const guessLanguage = (extension: string | null, contentType: string | null) => {
if (extension && SyntaxHighlighter.supportedLanguages.indexOf(extension) !== -1) {
if (extension && SyntaxHighlighter.supportedLanguages.indexOf(extension) !== -1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use Array.includes() instead of indexOf
It's the more modern approach.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Done.

} else {
content = React.createElement(renderer, { content: rawContent, contentBlob: blobContent, }, []);
}

const repo = {
id: repoId,
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you changed the URL, I suggest adding the filename value to the download prop of the download button, so you get the correct filename in the download

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Done

@ozkatz ozkatz requested a review from eladlachmi December 15, 2022 15:18
@ozkatz
Copy link
Collaborator Author

ozkatz commented Dec 15, 2022

Thanks for the great review @eladlachmi - PTAL :)

Copy link
Contributor

@eladlachmi eladlachmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

api/swagger.yml Outdated
Comment on lines 2845 to 2850
default:
description: internal server error
410:
description: object expired
416:
description: Requested Range Not Satisfiable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to keep them ordered

Suggested change
default:
description: internal server error
410:
description: object expired
416:
description: Requested Range Not Satisfiable
410:
description: object expired
416:
description: Requested Range Not Satisfiable
default:
description: internal server error

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -2778,7 +2775,67 @@ paths:
ETag:
schema:
type: string
Content-Disposition:
401:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes reflected in the diff of the docs swagger doesn't match the main api/swagger - run: make docs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -0,0 +1 @@
{"name":"sbt","version":"1.8.0","bspVersion":"2.1.0-M1","languages":["scala"],"argv":["/opt/homebrew/Cellar/openjdk/19.0.1/libexec/openjdk.jdk/Contents/Home/bin/java","-Xms100m","-Xmx100m","-classpath","/opt/homebrew/Cellar/sbt/1.8.0/libexec/bin/sbt-launch.jar","-Dsbt.script=/opt/homebrew/Cellar/sbt/1.8.0/libexec/bin/sbt","xsbt.boot.Boot","-bsp"]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how did we got this one? I think it can be safely removed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Comment on lines 2863 to 2866
rr := httptest.NewRecorder()
if c.handleAPIError(ctx, w, err) {
w.WriteHeader(rr.Code)
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract the api error handle so you can capture the code and have the same result without using a test library to record the response write generate.

func (c *Controller) handleAPIErrorCallback(ctx context.Context, w http.ResponseWriter, err error, cb func(w http.ResponseWriter, code int, v interface{})) bool {
	switch {
	case errors.Is(err, catalog.ErrNotFound),
		errors.Is(err, graveler.ErrNotFound),
		errors.Is(err, actions.ErrNotFound),
		errors.Is(err, auth.ErrNotFound),
		errors.Is(err, kv.ErrNotFound):
		cb(w, http.StatusNotFound, err)

	case errors.Is(err, graveler.ErrDirtyBranch),
		errors.Is(err, graveler.ErrCommitMetaRangeDirtyBranch),
		errors.Is(err, catalog.ErrNoDifferenceWasFound),
		errors.Is(err, graveler.ErrNoChanges),
		errors.Is(err, permissions.ErrInvalidServiceName),
		errors.Is(err, permissions.ErrInvalidAction),
		errors.Is(err, model.ErrValidationError),
		errors.Is(err, graveler.ErrInvalidRef),
		errors.Is(err, graveler.ErrInvalidValue),
		errors.Is(err, actions.ErrParamConflict):
		cb(w, http.StatusBadRequest, err)

	case errors.Is(err, graveler.ErrNotUnique),
		errors.Is(err, graveler.ErrConflictFound),
		errors.Is(err, graveler.ErrRevertMergeNoParent):
		cb(w, http.StatusConflict, err)

	case errors.Is(err, catalog.ErrFeatureNotSupported):
		cb(w, http.StatusNotImplemented, err)

	case errors.Is(err, graveler.ErrLockNotAcquired):
		cb(w, http.StatusInternalServerError, "branch is currently locked, try again later")

	case errors.Is(err, adapter.ErrDataNotFound):
		cb(w, http.StatusGone, "No data")

	case errors.Is(err, auth.ErrAlreadyExists):
		cb(w, http.StatusConflict, "Already exists")

	case errors.Is(err, graveler.ErrTooManyTries):
		cb(w, http.StatusLocked, "Too many attempts, try again later")

	case err != nil:
		c.Logger.WithContext(ctx).WithError(err).Error("API call returned status internal server error")
		cb(w, http.StatusInternalServerError, err)

	default:
		return false
	}

	return true
}

func (c *Controller) handleAPIError(ctx context.Context, w http.ResponseWriter, err error) bool {
	return c.handleAPIErrorCallback(ctx, w, err, writeError)
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! did the same for c.authorize.

w.WriteHeader(rr.Code)
return
}
c.Logger.Tracef("get repo %s ref %s path %s: %+v", repository, ref, params.Path, entry)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tracing middleware should give you the same information. In case it doesn't - please use the logger's WithFields to capture the structured data.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

c.LogAction(ctx, "head_object", r, repository, ref, "")

// read the FS entry
entry, err := c.Catalog.GetEntry(ctx, repository, ref, params.Path, catalog.GetEntryParams{ReturnExpired: true})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, not in this pr - but I don't think we currently supports ReturnExpired: true

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Let's open an issue?

@ozkatz ozkatz requested a review from nopcoder December 15, 2022 17:31
Copy link
Contributor

@nopcoder nopcoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks amazing

@nopcoder nopcoder merged commit 292440a into master Dec 15, 2022
@nopcoder nopcoder deleted the experiment/duckdb-wasm branch December 15, 2022 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/UI Improvements or additions to UI include-changelog PR description should be included in next release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UI: Allow an easy way to view and query data objects
3 participants