Skip to content

Conversation

@leleamol
Copy link
Contributor

Description of changes:

This is work in progress script to read the M and X type of trace events from tensorboard.

The changes include the tensorfboard file that can be used to test the script.
The script can:

  1. read the X and M events from trace.
  2. store the timestamps in nanoseconds
  3. Return the events sorted either on start time or end time.
  4. Return the events that are in progress at the given timestamp.
  5. Return the events that are started and finished within the given range.

Style and formatting:

I have run pre-commit install to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@codecov-io
Copy link

codecov-io commented May 11, 2020

Codecov Report

Merging #246 into master will decrease coverage by 24.77%.
The diff coverage is 88.43%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #246       +/-   ##
===========================================
- Coverage   85.77%   60.99%   -24.78%     
===========================================
  Files          78       81        +3     
  Lines        5799     5946      +147     
===========================================
- Hits         4974     3627     -1347     
- Misses        825     2319     +1494     
Impacted Files Coverage Δ
smdebug/profiler/tf_profiler_parser.py 81.03% <81.03%> (ø)
smdebug/profiler/trace_event_file_parser.py 93.10% <93.10%> (ø)
smdebug/profiler/__init__.py 100.00% <100.00%> (ø)
smdebug/mxnet/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
smdebug/tensorflow/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
smdebug/mxnet/singleton_utils.py 0.00% <0.00%> (-100.00%) ⬇️
smdebug/tensorflow/singleton_utils.py 0.00% <0.00%> (-100.00%) ⬇️
smdebug/tensorflow/collection.py 0.00% <0.00%> (-96.00%) ⬇️
smdebug/tensorflow/callable_cache.py 0.00% <0.00%> (-95.66%) ⬇️
smdebug/tensorflow/keras.py 1.67% <0.00%> (-92.37%) ⬇️
... and 42 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 80153e7...fd75207. Read the comment docs.

@leleamol leleamol requested a review from Vikas-kum May 11, 2020 21:58
self._start_time_known = True

if phase_type == "X":
start_time = (event["ts"] * 1000) - self._start_time_stamp # In nano seconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look for unit key, if not there assume certain unit.
It is assuming that unit is always us,

Introduced base event parser and the child classes TFEventProfiler
and SMTFEventProfiler
@Vikas-kum Vikas-kum changed the title [WIP] Aadding script to read the events from tensorboard trace. [WIP] Adding script to read the events from trace file May 12, 2020
Copy link
Contributor

@Vikas-kum Vikas-kum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you write test SMTFProfiler classs and please make sure the code coverage is close to 100.


if phase_type == "X":
# In nano seconds
start_time = (event["ts"] - self._start_timestamp) * self._timescale_multiplier_for_ns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be +

print(f"Number of events read = {num_trace_events}")
assert num_trace_events == 1663

event_list = t_events.get_events_at(15013686) # nanoseconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

15013686 what is the unit of this ?

# First Party
from smdebug.core.logger import get_logger

TIMESCALE_MULTIPLIER = {"ns": 1000, "us": 1, "ms": 0.001, "s": 0.000001}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if tensorflow says displayTimeUnit="ns" , does the time reported there are in ns or some other unit?

I was under impression that displayTimeUnit tells what unit is for time in trace.json file. Individual reader can have its own _timescale_multiplier_for_ns which defines what to multiply to convert into ns time format. Example SMTFProfiler knows events are in us, so multiplier would be 1000 to convert to ns.
NS is time interval that reader normalizes all events to. This normalization will give us unique timeline view if there are timeline files from multiple sources.

Thinking this I kept
{"ns": 1, "us": 1000, "ms": 1000000, "s": 1000000000}

That means that if displayTimeUnit is "us", we will multiply by 1000 and so on.
Can you decribe and write mroe comments about displayTimeUnit and timescale multiplier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Vikas-kum As per the Trace event format https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/edit The 'ts' is always in uSecond granularity

"ts: The tracing clock timestamp of the event. The timestamps are provided at microsecond granularity.
"

In case of Tensorboard generated trace file, the 'displayTimeUnit' is specified to be 'ns'. This is a hint to display that for higher precision show the events on 'ns' scale. Therefore, we would have to multiply it by 1000, if we want to keep the same precision.

The short answer, as per the standard, the 'ts' and 'durations' will always be in uSeconds.
It will be up to us whether we want to convert into ns scale for precise calculations. IMO, we should always convert it to ns scale.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah in that case we should ignore displayTimeUnit, and always convert from us and default TIMESCLAE_MULTIPLIER to 1000 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah in that case we should ignore displayTimeUnit, and always convert from us and default TIMESCLAE_MULTIPLIER to 1000 .

@Vikas-kum
Ok sounds good. I will remove the code for reading the displayTimeUnit and will always convert the timestamp values (the 'ts' field and start_time_since_epoch_in_micros) from microSeconds to nanoSeconds.
I will also cut down the sample json files for testing.

Copy link
Contributor

@Vikas-kum Vikas-kum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also cut down ip-..json file size to some thing below 100.
We don't need 20K lines big file.

@Vikas-kum Vikas-kum changed the title [WIP] Adding script to read the events from trace file Adding script to read the events from trace file May 13, 2020
@Vikas-kum Vikas-kum merged commit 4cac480 into master May 13, 2020
@Vikas-kum Vikas-kum deleted the trace-events-reader branch May 13, 2020 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants