Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw/s3select: json output format #548

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

albin-antony
Copy link
Contributor

No description provided.

@albin-antony albin-antony force-pushed the json_format branch 5 times, most recently from 9d6af2f to 2bb3332 Compare February 13, 2024 06:56

@pytest.mark.s3select
def test_json_column_sum_min_max():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you make this conversion (remove JSON)?

@albin-antony albin-antony force-pushed the json_format branch 3 times, most recently from 26c4133 to c9f0e0a Compare June 14, 2024 16:19
s3select_assert_result( res_s3select_like, res_s3select )

@pytest.mark.s3select
def test_json_like_expressions():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albin
test_json_like_expressions is duplicated name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Updated the name.

@galsalomon66
Copy link
Contributor

the following functions run_s3select run_s3select_output run_s3select_json run_s3select_csv_json_format

they have much in common, it is possible to combine them into one function.

list_int = create_list_of_int( 1 , csv_obj )
print(list_int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove the debug printout

res_target = min( list_int )
print("target")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same

@galsalomon66
Copy link
Contributor

@albin-antony, please review comments
the goal of s3-tests (in the context of s3select) is to check the integration with RGW

in the case of Parquet, it needs to make sure that RGW is correctly built with arrow-library
the s3select-unit-tests does an automatic conversion (CSV->parquet), processes the same SQL statement for each data source, and compares results.

In the JSON output format, it needs to ensure that output-serialization parameters pass through the RGW into the s3select engine.
the JSON output format functionalities are tested in s3select-unit-tests.

@albin-antony
Copy link
Contributor Author

@galsalomon66 Thanks for the review. Addressed the comments. Please take a look

@@ -235,30 +239,45 @@ def create_random_json_object(rows,columns,col_delim=",",record_delim="\n",csv_s

return result

def csv_to_json(obj, field_split=",",row_split="\n",csv_schema=""):
def create_parquet_object(parquet_size):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albin-antony
please add other types such as string and float.

output_serialization = {"CSV": {}}
if input == "JSON":
input_serialization = {"JSON": {"Type": "DOCUMENT"}}
output_serialization = {"JSON": {}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_s3select should support output=JSON or CSV

elif(input == "PARQUET"):
input_serialization = {'Parquet': {}}
output_serialization = {"JSON": {}}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same
run_s3select should support output=JSON or CSV

res = remove_xml_tags_from_result( run_s3select(bucket_name,csv_obj_name,"select count(0) from s3object;") ).replace(",","")

res = remove_xml_tags_from_result( run_s3select(bucket_name,csv_obj_name,"select count(0) from s3object;","CSV") ).replace(",","")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the "CSV" parameter is default.
no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, default is CSV. Don't need to override the same. Will remove it.

@galsalomon66
Copy link
Contributor

@albin-antony
please rebase

Signed-off-by: Albin Antony <albinantony20@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants