Timing framework improvement #305
Conversation
Test PASSed.

    debug_hang(0);

    if( fname != NULL ){
Minor style issue: we prefer constants to be on the left of ==. I know it feels unnatural, but it's defensive programming -- the compiler will refuse to compile if you make a typo like "if (NULL = fname) ...".
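The defensive value of this style can be shown with a small sketch; the helper function below is hypothetical and only illustrates the `NULL == fname` idiom from the comment above:

```c
#include <string.h>

/* Hypothetical helper illustrating constant-on-the-left comparisons.
 * With "NULL == fname", mistyping "==" as "=" becomes "NULL = fname",
 * which fails to compile instead of silently assigning. */
static const char *pick_name(const char *fname)
{
    if (NULL == fname) {   /* preferred: constant on the left */
        return "default.out";
    }
    return fname;
}
```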
Yeah! This was developed before I read the wiki carefully. I will fix that!
Looks fine. I tried it :) and I have only one comment, related to the output format. Since it is possible to have absolute time and relative time at the same moment, let's create a common format for them. For example, for process [[37590,1],2] I have the following lines for the same procedure:

    0.943561 0.005865 "MPI_Init: Start barrier" [mir13, [[37590,1],2] ompi_mpi_init.c:785:ompi_mpi_init]

See what I mean? We can use something like this:
    rc = OPAL_ERR_OUT_OF_RESOURCE;
    goto err_exit;
    }

    if( buf_used + strlen(line) > OPAL_TIMING_OUTBUF_SIZE ){
I'm a little confused by this logic -- in the above case, you error out if it's too long. But in this case, you just realloc.
There's also another dichotomy: you use asprintf() for allocation up above, but then seem to be enforcing some kind of (soft?) alloc limit here.
Is there a reason for these differences?
The idea was that a single line shouldn't exceed the limit. However, you are right. I'll rewrite this.
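One way to resolve the dichotomy the reviewer points out, sketched under the assumption that growing the buffer is acceptable everywhere: drop the hard limit and realloc on demand. The `out_buf_t` type and `outbuf_append()` below are illustrative names, not the actual OPAL code:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative grow-on-demand output buffer (not the OPAL implementation).
 * Instead of failing when a line exceeds a fixed OPAL_TIMING_OUTBUF_SIZE,
 * the buffer is doubled until the new line fits. */
typedef struct {
    char   *buf;
    size_t  used;
    size_t  size;
} out_buf_t;

static int outbuf_append(out_buf_t *ob, const char *line)
{
    size_t need = ob->used + strlen(line) + 1;   /* +1 for the NUL */

    if (need > ob->size) {
        size_t newsize = (0 == ob->size) ? 256 : ob->size;
        while (newsize < need) {
            newsize *= 2;
        }
        char *tmp = realloc(ob->buf, newsize);
        if (NULL == tmp) {
            return -1;   /* caller would map this to OPAL_ERR_OUT_OF_RESOURCE */
        }
        ob->buf  = tmp;
        ob->size = newsize;
    }
    memcpy(ob->buf + ob->used, line, strlen(line) + 1);
    ob->used += strlen(line);
    return 0;
}
```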
Thank you for your comments. I'll address them tomorrow.
@artpol84 If you don't want to combine absolute and relative time in one line, you can create a macro for the common format to have something like this, for example:
retest this please
@nkogteva
Test PASSed.
@artpol84 Sure, it is enough for my current task. But I think it would not be enough for the general problem of performance measurement. It would be nice to have both absolute time and relative time in the same format to make output file parsing easier (when both ompi_timing and mpi_timing_ext are specified).
@artpol84 One more question. For relative time you calculate the time difference (the second column for relative time) between the current time and the previous event. So now if I want to get the time of modex, I just take the appropriate column (because the end time follows the start time). That's OK. But in the general case, if some time events are located between the start of a procedure and its end, I will need to sum two intervals to get the procedure time. That's not very good. For example, I want to write a script that will measure performance in the same manner on different versions of the code without additional tuning. Wouldn't it be better to have specific names for specific procedures instead of time relative to the previous event, like it was done previously?
a. you initialize an overhead event like this:

    id = OPAL_TIMING_OVH_INIT(&tm_handler, "some description");

b. next you update it where you want using the obtained id:

    OPAL_TIMING_OVH_START(&tm_handler, id);
    .... processing ....
    OPAL_TIMING_OVH_END(&tm_handler, id);

Would it fit your needs?
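Since the `OPAL_TIMING_OVH_*` macros are only a proposal at this point, here is a hypothetical mock-up of the init/start/end accumulation pattern being suggested (the names and layout are illustrative, not the real framework):

```c
#include <string.h>
#include <sys/time.h>

#define MAX_OVH_EVENTS 16

/* Illustrative overhead-event accumulator: each event is registered once,
 * then repeatedly started/ended; elapsed time is summed per event id. */
typedef struct {
    const char *descr[MAX_OVH_EVENTS];
    double      total[MAX_OVH_EVENTS];
    double      started[MAX_OVH_EVENTS];
    int         nevents;
} tm_handler_t;

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

static int ovh_init(tm_handler_t *h, const char *descr)
{
    int id = h->nevents++;
    h->descr[id] = descr;
    h->total[id] = 0.0;
    return id;                       /* caller keeps the id */
}

static void ovh_start(tm_handler_t *h, int id)
{
    h->started[id] = now_sec();
}

static void ovh_end(tm_handler_t *h, int id)
{
    h->total[id] += now_sec() - h->started[id];
}
```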
I will try to summarize. It would be nice to have:
They are not exclusive, but you might want to see intervals and also have the data for postprocessing. You'll get the output in the log in human-readable format, and (if you use mpirun_prof) you'll get the postprocessed data in the file that you point out using export OMPI_MCA_opal_timing_output="test". So you shouldn't enable _ext if you don't want to postprocess.

IMHO absolute values are all you need for the postprocessing. You don't need intervals since you can compute them at that stage. And I keep only absolute values because you may want to merge events from different nodes and analyse the data after that; in this case intervals wouldn't be correct after the merge.

Do you already have any exact thoughts/intentions on parsing of those data? If so, we can do some steps in that direction. I was thinking about postprocessing to OTFStat format in the future, but I don't have the time for that now.
I will update the framework with the new functionality that we discussed. Can you tell me what intervals you need to measure? Do you need it for ompi_mpi_init, or are you measuring other code?
@artpol84 Thank you! Let's start with the intervals which were implemented in the previous version, in order to align with them:
Then everyone can use your framework for their own purposes. I think you should not worry about other cases except mpi_init.
(my bias: I actually don't have any religious opinion that opal_output() must be used; I just want to make sure you're not re-inventing the wheel when some other part of OMPI infrastructure would work ok... with that in mind, let's look at each of your analysis points...) Ok. This means that you send all the data to a single process and have that one process output the file, right? (i.e., there's no other way to ensure that all data from all processes ends up in a single file without losing any of the data) In that case, opal_output() should still be fine.
Sure, but keep in mind that you do not have to opal_output(0, ...). You can open your own stream (which won't be 0) and then use that. Hence, it doesn't have to go to stdout. Or it could go to stdout and a file. Or ......
Hmm. This sounds like you're relying on the filesystem to combine the data from all processes into a single file. I think that this is a losing proposition -- you're going to find cases where multiple processes write to the (network-filesystem hosted) file simultaneously, and therefore the data from some process(es) get lost/overwritten. I don't honestly remember if opal_output allows you to open a file in arbitrary locations, or whether they're always under the session directory... But even if they're in the session directory, is that a problem? (especially if you take the approach that all processes send data to a single process and a single process writes out the data -- then a process-wide filesystem location is irrelevant)
Yes, I think you're going to need this -- regardless of whether you use opal_output or not.
Mmm. Yes. This is a problem (you're down in OPAL and don't have communication facilities available). I think there are 3 general options:
Keep in mind that all 3 options have ordering issues -- none of them will guarantee to output data from each process in the correct order. I'd prefer something that would be close to fopen(fname, "a"). I think only option 2 is close to that.
Keep in mind that no servers are ever in perfect sync. Hence, the absolute times that you get for each event may not interleave/merge properly into a single, consolidated timeline that accurately reflects the order in which events occurred.

Out-of-the-box thinking/suggestion: the MPI tools community has spent a LOT of time/effort on exactly this kind of issue (and others). Making an accurate distributed collection agent is complicated -- perhaps we don't need to re-invent it inside OMPI. Have you considered using an external tool for this kind of collection? I was just talking to the VampirTrace guys the other day about having SCORE-P tie into the MPI_T performance variable system of OMPI. That is, the MCA var system (which is the OMPI infrastructure behind MPI_T) would basically directly feed into SCORE-P, which then outputs data files that can then be used by multiple different analysis tools. Could something like this be used for what you're trying to do?

@bertwesarg Can you comment on this idea? (I don't know Andreas' github ID offhand to @ mention him...) The idea is that we have timing data for arbitrary OPAL/ORTE/OMPI events down in the OPAL layer. Could we output these via our MPI_T infrastructure and have them reported via SCORE-P?
@jsquyres I am aware that time on different servers is not synchronous. Currently we have an external tool, mpisync, that can synchronize servers with microsecond precision. It uses well-known techniques and is derived from an existing project. I call it to measure the time offsets and save them to an input file. This file is then read by the MPI program.
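For context, the core of this kind of offset measurement (a Cristian-style round-trip estimate, which tools like mpisync build on) can be sketched as below. The function name is illustrative, and the key assumption is a symmetric network delay:

```c
/* Estimate a peer's clock offset from one request/reply exchange.
 * t_send / t_recv: local clock when the probe left / when the reply
 * arrived.  t_remote: remote clock when the peer answered.
 * Assuming the delay is symmetric, the reply was generated at local
 * time t_send + rtt/2, so the offset is t_remote minus that. */
static double estimate_offset(double t_send, double t_remote, double t_recv)
{
    double rtt = t_recv - t_send;
    return t_remote - (t_send + rtt / 2.0);
}
```

In practice several probes are taken and the one with the smallest round-trip time is kept, since it bounds the asymmetry error most tightly.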
@jsquyres
Honestly, I didn't think about data collection using standard tools. Thanks for pointing me to this.
mpitimer seems to do things at the MPI layer (e.g., I see MPI_Comm as arguments to its functions). I thought you were trying to time ORTE events...? Yes, I agree that something like NTP or PTP will be necessary to get servers "close" in sync -- but these kinds of systems will never be perfect enough to generate perfect timelines based on absolute local timestamps. I agree: these kinds of things are best left outside of OMPI. Output format: I don't know what the output format of SCORE-P is -- I don't know if it's OTF or something else. FWIW: The way the SCORE-P guys described it to me earlier this week: SCORE-P outputs a file format that can be used by a variety of different analysis tools (I don't know if that's OTF or not). Random other thought: are you using the OPAL timer framework to get high resolution timestamps? George just put in some improvements to use clock_gettime() and/or RDTSC recently. |
"...mpitimer seems to do things at the MPI layer (e.g., I see MPI_Comm as arguments to its functions). I thought you were trying to time ORTE events...?"

Yes, I am trying to measure ORTE. But OMPI BTLs usually provide access to networks with lower latency, so I run mpisync BEFORE I run the program that I measure. Check mpirun_prof in ompi/tools/ompi_info/ to see how it works.

"...Yes, I agree that something like NTP or PTP will be necessary to get servers "close" in sync -- but these kinds of systems will never be perfect enough to generate perfect timelines based on absolute local timestamps. I agree: these kinds of things are best left outside of OMPI..."

According to Wikipedia, NTP precision is 10 ms over the Internet and up to 0.2 ms in a LAN. My evaluations confirm that: I tested mpisync on clusters with NTP and the offsets were in the 0.2 - 1 ms range. About precision: the synchronization should be able to rearrange events on different nodes. mpisync performs synchronization using the existing network, so we can achieve quality that is enough to evaluate communication overhead. The precision may be even better in the case where we use the InfiniBand BTL to measure, and then evaluate OOB, which is currently TCP-based. Shortly speaking, the events of message send and receive would be represented with enough precision to estimate the message's delivery time.

"...Output format: I don't know what the output format of SCORE-P is -- I don't know if it's OTF or something else. FWIW: The way the SCORE-P guys described it to me earlier this week: SCORE-P outputs a file format that can be used by a variety of different analysis tools (I don't know if that's OTF or not)..."

I checked their website. They have OTF support on board.

"...Random other thought: are you using the OPAL timer framework to get high resolution timestamps? George just put in some improvements to use clock_gettime() and/or RDTSC recently..."

I was using pure gettimeofday but I planned to switch to more precise ones.
I'll do that along with improvements for @nkogteva and will update the PR. |
mpisync: oh, right -- I remember our discussion about this now. It basically determines/records the skew between multiple servers, and then you use that skew for post-processing of the timestamps that you collect, right? Precision: yes, 200 microseconds is good precision, but that's still a lot of clock cycles. :-) Your ORTE measurements may not care about that level of precision, but if you want to have a guaranteed reproduction of the absolute ordering of events, then you need quite sophisticated event-merging capabilities (I think that modern tools use more than just a single skew measurement at the beginning, but I could be wrong here). ...All this being said, if you don't mind a few events being out of order / don't need an absolute guarantee of overall ordering, then I should just shut up. :-) Timer: cool. Check out opal/mca/timer. It should do the Right Things regardless of what platform you're on. I'd still like to hear what @bertwesarg and the other tools people have to say -- perhaps we're trying to re-invent the wheel here, and we shouldn't do that (that's really my only point for this whole discussion)...
Please don't get hung up on the term "human readable format". It should work in all cases: human reading, parsing with scripts, or simple utilities like grep, sed, and so on. a. add host name and process name to each line
On Monday, December 8, 2014, Nadezhda Kogteva wrote:
Best regards, Artem Polyakov
@artpol84 I mean that previously it was possible to specify the output file name using an MCA parameter. I think file names, function names, and line numbers are actually not needed now. We have all the necessary information in the name of the time interval.
2014-12-08 20:48 GMT+06:00 Nadezhda Kogteva notifications@github.com:
Best regards, Artem Polyakov
Test FAILed. Build Log
Test FAILed.
Test PASSed.
Test PASSed.
@nkogteva I think it's much closer to what you need than before. Can you check it again?
Test FAILed. Build Log
Test FAILed.
Test PASSed.
@artpol84 That's fine! Thank you. The format is OK. But I have one more problem. I applied the patch from the PR, and from time to time I get the following segfault: ==== backtrace ==== I didn't look at this problem carefully. Could you please recheck it?
@nkogteva Thank you. Should be fixed now.
Test PASSed.
@nkogteva Does the latest commit fix the problem you observed? Can this PR be merged?
@artpol84 That's OK with me. Thanks. If no one else has comments, you can proceed with merging.
Indeed, looks good - please bring it in! Thanks Artem! |
I have one last minor request: could you add some overview documentation to the top of timings.h? I.e., give a 50,000 foot description of the timing system, and maybe a short example or two of how it is supposed to be used? Thanks! |
Jeff, I will provide the description asap. Will push it without pull request. |
@nkogteva please review.