Profiling tool: Update comparison mode output format and add error handling #2762

Merged: 9 commits, Jun 23, 2021

nartal1 (Collaborator) commented Jun 21, 2021

This PR addresses several feature requests and bugs:

  1. Added appName to profiling output, fixes [FEA] include App Name from profiling tool output #2742
  2. Added error handling for comparison functions, fixes [BUG] Profiling tool: Add error handling for comparison functions #2692
  3. Updated output for comparison mode functions, fixes [FEA] Profiling compare mode for table SQL Duration and Executor CPU Time Percent #2685

Updated the tests and README to reflect these changes.
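The description does not show how the error handling for comparison functions is implemented. One common pattern, assumed here purely for illustration, is to run each comparison independently so that a failure in one is reported without aborting the others. The names `run_comparisons` and `failing_comparison` are hypothetical and are not taken from the tool:

```python
from typing import Callable, Dict, List


def run_comparisons(
    comparisons: Dict[str, Callable[[], List[dict]]]
) -> Dict[str, List[dict]]:
    """Run each comparison; on failure, log it and return an empty result."""
    results: Dict[str, List[dict]] = {}
    for name, func in comparisons.items():
        try:
            results[name] = func()
        except Exception as exc:  # assumption: any failure is non-fatal
            print(f"Skipping comparison '{name}': {exc}")
            results[name] = []
    return results


def failing_comparison() -> List[dict]:
    # Stand-in for a comparison that hits malformed or missing data.
    raise ValueError("missing executor data")


results = run_comparisons({
    "executors": lambda: [{"appIndex": 1, "numExecutors": 2}],
    "broken": failing_comparison,
})
print(results)
```

With this shape, a problem in one application's event log degrades a single table instead of the whole comparison run.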

With appName added, the output looks like this:

+--------+-----------+------------------------------+-------------+-------------+--------+-----------+------------+-------+
|appIndex|appName    |appId                         |startTime    |endTime      |duration|durationStr|sparkVersion|gpuMode|
+--------+-----------+------------------------------+-------------+-------------+--------+-----------+------------+-------+
|1       |Spark shell|application_1616728744956_0028|1616988233052|1616988390501|157449  |2.6 min    |3.0.1       |true   |
|2       |Spark shell|application_1616728744956_0009|1616740267325|1616740893089|625764  |10 min     |3.0.1       |true   |
|3       |Spark shell|application_1616746343401_0025|1617764543255|1617764863521|320266  |5.3 min    |3.0.1       |true   |
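In the rows above, the duration column is consistent with endTime minus startTime, both in epoch milliseconds. A small check using only the values shown in the table (the function and variable names are illustrative, not part of the tool):

```python
# (appIndex, startTime, endTime, duration) copied from the table above.
app_rows = [
    (1, 1616988233052, 1616988390501, 157449),
    (2, 1616740267325, 1616740893089, 625764),
    (3, 1617764543255, 1617764863521, 320266),
]


def duration_ms(start_ms: int, end_ms: int) -> int:
    """Elapsed milliseconds between two epoch-millisecond timestamps."""
    return end_ms - start_ms


def to_minutes(ms: int) -> float:
    """Convert milliseconds to minutes."""
    return ms / 60_000


for app_index, start, end, reported in app_rows:
    computed = duration_ms(start, end)
    print(app_index, computed, computed == reported,
          f"{to_minutes(computed):.2f} min")
```

The computed minutes (2.62, 10.43, 5.34) agree with the durationStr values up to rounding; the exact rounding rule used for durationStr is not shown in this PR.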

In comparison mode, the output is updated for the following functions:
a. Executor Information:

+--------+------------+----------------+-----------+------------+-------------+--------+--------+--------+------------+--------+--------+
|appIndex|numExecutors|coresPerExecutor|maxMemory  |maxOnHeapMem|maxOffHeapMem|exec_cpu|exec_mem|exec_gpu|exec_offheap|task_cpu|task_gpu|
+--------+------------+----------------+-----------+------------+-------------+--------+--------+--------+------------+--------+--------+
|1       |2           |8               |47055896576|25581060096 |21474836480  |null    |null    |null    |null        |null    |null    |
|2       |2           |8               |55645831168|12696158208 |42949672960  |null    |null    |null    |null        |null    |null    |
|3       |24          |4               |42760929280|25581060096 |17179869184  |null    |null    |null    |null        |null    |null    |
|4       |24          |16              |42760929280|25581060096 |17179869184  |null    |null    |null    |null        |null    |null    |
+--------+------------+----------------+-----------+------------+-------------+--------+--------+--------+------------+--------+--------+
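In the sample rows above, maxMemory equals maxOnHeapMem plus maxOffHeapMem, with all values in bytes. This is an observation about this data rather than a documented guarantee. A short check, with a helper (`to_gib`, a name chosen for this sketch) to make the byte counts easier to read:

```python
GIB = 1024 ** 3

# (appIndex, maxMemory, maxOnHeapMem, maxOffHeapMem) in bytes, from the table above.
executor_rows = [
    (1, 47055896576, 25581060096, 21474836480),
    (2, 55645831168, 12696158208, 42949672960),
    (3, 42760929280, 25581060096, 17179869184),
    (4, 42760929280, 25581060096, 17179869184),
]


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes."""
    return num_bytes / GIB


for app_index, max_mem, on_heap, off_heap in executor_rows:
    print(app_index,
          on_heap + off_heap == max_mem,
          f"offHeap={to_gib(off_heap):.0f} GiB",
          f"max={to_gib(max_mem):.2f} GiB")
```

The off-heap values work out to exactly 20, 40, 16, and 16 GiB.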

b. RAPIDS properties that are explicitly set:

+------------------------------------------------------------+----------+----------+-------------+-------------+
|propertyName                                                |appIndex_1|appIndex_2|appIndex_3   |appIndex_4   |
+------------------------------------------------------------+----------+----------+-------------+-------------+
|spark.rapids.memory.gpu.debug                               |STDERR    |STDERR    |STDERR       |STDERR       |
|spark.rapids.memory.gpu.pooling.enabled                     |true      |true      |true         |true         |
|spark.rapids.memory.pinnedPool.size                         |16G       |16G       |16G          |16G          |
|spark.rapids.sql.batchSizeBytes                             |1073741824|1073741824|50331648     |null         |
|spark.rapids.sql.castFloatToString.enabled                  |true      |true      |true         |true         |
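The layout above has one row per property and one appIndex_N column per application, with null where an application did not set the property. A minimal sketch of that pivot, assuming each application's properties are available as a plain dictionary (`compare_properties` is a hypothetical name, not the tool's API):

```python
from typing import Dict, List, Optional


def compare_properties(
    apps: Dict[int, Dict[str, str]]
) -> List[Dict[str, Optional[str]]]:
    """Pivot {appIndex: {propertyName: value}} into one row per property.

    A property that an application did not set is reported as None,
    which corresponds to null in the table.
    """
    app_indexes = sorted(apps)
    names = sorted({name for props in apps.values() for name in props})
    rows: List[Dict[str, Optional[str]]] = []
    for name in names:
        row: Dict[str, Optional[str]] = {"propertyName": name}
        for idx in app_indexes:
            row[f"appIndex_{idx}"] = apps[idx].get(name)
        rows.append(row)
    return rows


# Two properties taken from the table above.
sample_apps = {
    1: {"spark.rapids.sql.batchSizeBytes": "1073741824",
        "spark.rapids.memory.pinnedPool.size": "16G"},
    2: {"spark.rapids.sql.batchSizeBytes": "1073741824",
        "spark.rapids.memory.pinnedPool.size": "16G"},
    3: {"spark.rapids.sql.batchSizeBytes": "50331648",
        "spark.rapids.memory.pinnedPool.size": "16G"},
    4: {"spark.rapids.memory.pinnedPool.size": "16G"},
}

property_rows = compare_properties(sample_apps)
for row in property_rows:
    print(row)
```

Putting the applications side by side makes a differing or missing setting, such as batchSizeBytes for appIndex 3 and 4, visible in a single row.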

c. SQL Duration and Executor CPU Time Percent (sorted by appIndex):

+--------+------------------------------+-----+------------+-------------------+------------+------------------+-------------------------+
|appIndex|App ID                        |sqlID|SQL Duration|Contains Dataset Op|App Duration|Potential Problems|Executor CPU Time Percent|
+--------+------------------------------+-----+------------+-------------------+------------+------------------+-------------------------+
|1       |application_1616728744956_0028|1    |18926       |false              |157449      |null              |34.15                    |
|1       |application_1616728744956_0028|0    |57854       |false              |157449      |null              |37.3                     |
|1       |application_1616728744956_0028|2    |14034       |false              |157449      |null              |38.53                    |
|2       |application_1616728744956_0009|0    |62433       |false              |625764      |null              |43.04                    |
|2       |application_1616728744956_0009|2    |15844       |false              |625764      |null              |23.32                    |
|2       |application_1616728744956_0009|1    |24855       |false              |625764      |null              |20.02                    |
|3       |application_1616746343401_0025|0    |259949      |false              |320266      |null              |54.21                    |
|4       |application_1616746343401_0028|2    |8           |false              |1559906     |null              |null                     |
|4       |application_1616746343401_0028|1    |1           |false              |1559906     |null              |null                     |
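The table above is ordered by appIndex only; within an application the sqlID values are not sorted. A sketch of that ordering using a stable sort, which keeps each application's rows in their incoming order. The input order below is made up for illustration, and only a few columns are carried:

```python
# A few rows from the table above, deliberately shuffled.
sql_rows = [
    {"appIndex": 2, "sqlID": 0, "sqlDuration": 62433},
    {"appIndex": 1, "sqlID": 1, "sqlDuration": 18926},
    {"appIndex": 4, "sqlID": 2, "sqlDuration": 8},
    {"appIndex": 1, "sqlID": 0, "sqlDuration": 57854},
    {"appIndex": 3, "sqlID": 0, "sqlDuration": 259949},
]

# sorted() is stable: rows with the same appIndex keep their relative order.
sorted_sql_rows = sorted(sql_rows, key=lambda r: r["appIndex"])

for row in sorted_sql_rows:
    print(row["appIndex"], row["sqlID"], row["sqlDuration"])
```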

d. Failed tasks, jobs, and stages now include appIndex for easier mapping to the application:

+--------+-------+--------------+------+-------+----------------------------------------------------------------------------------------------------+
|appIndex|stageId|stageAttemptId|taskId|attempt|failureReason                                                                                       |
+--------+-------+--------------+------+-------+----------------------------------------------------------------------------------------------------+
|3       |4      |0             |2842  |0      |ExceptionFailure(ai.rapids.cudf.CudfException,cuDF failure at: /home/jenkins/agent/workspace/jenkins|
|3       |4      |0             |2858  |0      |TaskKilled(another attempt succeeded,List(AccumulableInfo(453,None,Some(22000),None,false,true,None)|
|3       |4      |0             |2884  |0      |TaskKilled(another attempt succeeded,List(AccumulableInfo(453,None,Some(21148),None,false,true,None)|
|3       |4      |0             |2908  |0      |TaskKilled(another attempt succeeded,List(AccumulableInfo(453,None,Some(20420),None,false,true,None)|

@nartal1 nartal1 self-assigned this Jun 21, 2021
@nartal1 nartal1 added the feature request New feature or request label Jun 21, 2021
@nartal1 nartal1 added this to the June 21 - July 2 milestone Jun 21, 2021
@nartal1 nartal1 requested a review from tgravescs June 21, 2021 19:30
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
nartal1 (Collaborator, Author) commented Jun 22, 2021

build

tools/README.md Outdated
|1 |0 |4 |13984648396|13984648396 |0 |null |null |null |null |null |null |
|1 |1 |4 |13984648396|13984648396 |0 |null |null |null |null |null |null |
+--------+-----------------+------------+----------------+-----------+------------+-------------+-------------+--------------+------------------+---------------+-------+-------+
|appIndex|resourceProfileId|numExecutors|coresPerExecutor|maxMem |maxOnHeapMem|maxOffHeapMem|executorCores|executorMemory|numGpusPerExecutor|executorOffHeap|taskCpu|taskGpu|
Collaborator

What is the difference between executor cores and cores per executor?

Collaborator Author

These are the same. Removed the duplicate field. PTAL.

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
nartal1 (Collaborator, Author) commented Jun 23, 2021

build
