Profiling tool can miss datasources when they are GPU reads #4804

Merged: 17 commits merged into NVIDIA:branch-22.04 on Feb 17, 2022

@tgravescs tgravescs commented Feb 16, 2022

fixes #4759

This fixes it so we properly report GPU-based datasources (CSV, Parquet, JSON, ORC) when the profiling tool looks at the event logs from a run with the RAPIDS plugin enabled. Tested with both DSv1 and DSv2 versions. This also changes JDBC to report the other fields, like format, location, etc.

It also fixes a bug with CSV files where a field could contain commas even though our delimiter is a comma. As a result, if you read the CSV file back into Spark, that field gets truncated. Specifically, this happens with the file schema, where the format is: name:Type,name2:Type2,... To handle this, we took the same logic used by the qualification tool and just replace the comma in any strings.
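To illustrate the comma problem described above, here is a minimal Python sketch. It is not the tool's actual code: the helper names `sanitize_field` and `write_row_unquoted` are hypothetical, and the choice of `;` as the replacement character is an assumption made for the example. It shows how an unquoted schema string splits into extra columns when read back, and how replacing the commas keeps it in one column.

```python
import csv
import io


def sanitize_field(value, delimiter=",", replacement=";"):
    """Replace the delimiter inside a field (replacement char is an assumption)."""
    return value.replace(delimiter, replacement)


def write_row_unquoted(fields, delimiter=","):
    """Join fields with the delimiter and no quoting (hypothetical naive writer)."""
    return delimiter.join(fields)


# Schema string in the name:Type,name2:Type2,... form from the example output.
schema = "_c0:string,_c1:string,_c2:string"
row = ["1", "1", "gpucsv(GPU)", schema]

raw_line = write_row_unquoted(row)
safe_line = write_row_unquoted([sanitize_field(f) for f in row])

# Read both lines back with a comma delimiter, as a CSV reader would.
raw_cols = next(csv.reader(io.StringIO(raw_line)))
safe_cols = next(csv.reader(io.StringIO(safe_line)))

print(len(raw_cols))   # more columns than were written: the schema was split
print(len(safe_cols))  # same number of columns as were written
print(safe_cols[3])
```

With a fixed four-column header, a reader would keep only the first piece of the split schema, which matches the truncation described above.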

example output:

Data Source Information:
+--------+-----+---------------+--------------------------------------------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------+
|appIndex|sqlID|format         |location                                                                                                            |pushedFilters|schema                                                                                       |
+--------+-----+---------------+--------------------------------------------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------+
|1       |0    |Text           |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/src/test/resources/people.csv]|[]           |value:string                                                                                 |
|1       |1    |gpucsv(GPU)    |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/src/test/re...                |unknown      |_c0:string,_c1:string,_c2:string                                                             |
|1       |2    |gpujson(GPU)   |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/target/test...                |unknown      |number:double                                                                                |
|1       |3    |gpuparquet(GPU)|InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/integration_tests/target/test...                |[]           |loan_id:bigint,orig_channel:string,seller_name:string,orig_interest_rate:double,orig_upb:i...|
|1       |4    |gpuorc(GPU)    |InMemoryFileIndex[file:/home/tgraves/workspace/spark-rapids-another/tests/src/test/resources/file...                |[]           |loan_id:bigint,orig_channel:int,orig_interest_rate:double,orig_upb:int,orig_loan_term:int,...|
|2       |0    |Text           |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/src/test/resources/people.csv]  |[]           |value:string                                                                                 |
|2       |1    |CSV(GPU)       |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/src/test/re...                  |[]           |_c0:string,_c1:string,_c2:string                                                             |
|2       |2    |JSON(GPU)      |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/target/test...                  |[]           |number:double                                                                                |
|2       |3    |ORC(GPU)       |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/tests/src/test/resources/file...                  |[]           |loan_id:bigint,orig_channel:int,orig_interest_rate:double,orig_upb:int,orig_loan_term:int,...|
|2       |4    |Parquet(GPU)   |InMemoryFileIndex[file:/home/user1/workspace/spark-rapids-another/integration_tests/target/test...                  |[]           |loan_id:bigint,monthly_reporting_period:string,servicer:string,interest_rate:double,curren...|
+--------+-----+---------------+--------------------------------------------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------+
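As a rough sketch of how output like the table above could be post-processed, the Python snippet below parses a table of that shape and selects the rows whose format carries the `(GPU)` suffix. The `parse_table` helper is hypothetical, and the sample is an abbreviated copy of the output above with only the first three columns kept.

```python
def parse_table(text):
    """Parse a pipe-delimited ASCII table into a list of dicts keyed by header."""
    rows = [line.strip() for line in text.splitlines() if line.strip().startswith("|")]
    # Drop the outer pipes, then split on the inner ones and trim padding.
    cells = [[c.strip() for c in line[1:-1].split("|")] for line in rows]
    header, body = cells[0], cells[1:]
    return [dict(zip(header, r)) for r in body]


# Abbreviated sample: a few rows and the first three columns of the output above.
sample = """
+--------+-----+---------------+
|appIndex|sqlID|format         |
+--------+-----+---------------+
|1       |0    |Text           |
|1       |1    |gpucsv(GPU)    |
|2       |1    |CSV(GPU)       |
+--------+-----+---------------+
"""

records = parse_table(sample)
gpu_reads = [r for r in records if r["format"].endswith("(GPU)")]

for r in gpu_reads:
    print(r["appIndex"], r["sqlID"], r["format"])
```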

@tgravescs tgravescs added this to the Feb 14 - Feb 25 milestone Feb 16, 2022
@tgravescs tgravescs self-assigned this Feb 16, 2022
@tgravescs
Collaborator Author

Actually, it looks like it missed removing the Location and PushedFilters tags for some GPU reads; I'll fix that.

@tgravescs tgravescs marked this pull request as draft February 16, 2022 16:43
@tgravescs tgravescs marked this pull request as ready for review February 16, 2022 18:28
@tgravescs
Collaborator Author

build

@nartal1
Collaborator

nartal1 commented Feb 16, 2022

Just a question on method name. Rest all LGTM.

@tgravescs
Collaborator Author

build

@tgravescs tgravescs merged commit 6eae4c1 into NVIDIA:branch-22.04 Feb 17, 2022
@tgravescs tgravescs deleted the profileFixGpuRead branch February 17, 2022 14:23
Successfully merging this pull request may close these issues.

[BUG] Profiling tool can miss datasources when they are GPU reads