Timing framework improvement #305
Conversation
Test PASSed.

    debug_hang(0);

    if( fname != NULL ){
Minor style issue: we prefer constants to be on the left of ==. I know it feels unnatural, but it's defensive programming -- the compiler will refuse to compile if you make a typo like "if (NULL = fname) ...".
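The defensive value of this style can be shown with a small sketch; the helper function below is hypothetical and only illustrates the `NULL == fname` idiom from the comment above:

```c
#include <string.h>

/* Hypothetical helper illustrating constant-on-the-left comparisons.
 * With "NULL == fname", mistyping "==" as "=" becomes "NULL = fname",
 * which fails to compile instead of silently assigning. */
static const char *pick_name(const char *fname)
{
    if (NULL == fname) {   /* preferred: constant on the left */
        return "default.out";
    }
    return fname;
}
```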
Yeah! This was developed before I read the wiki carefully. I will fix that!
Looks fine. I tried it :) and I have only one comment, related to the output format. Since it is possible to have absolute time and relative time at the same moment, let's create a common format for them. For example, for process [[37590,1],2] I have the following lines for the same procedure:

    0.943561 0.005865 "MPI_Init: Start barrier" [mir13, [[37590,1],2] ompi_mpi_init.c:785:ompi_mpi_init]

See what I mean? We can use something like this:
    rc = OPAL_ERR_OUT_OF_RESOURCE;
    goto err_exit;
    }

    if( buf_used + strlen(line) > OPAL_TIMING_OUTBUF_SIZE ){
I'm a little confused by this logic -- in the above case, you error out if it's too long. But in this case, you just realloc.
There's also another dichotomy: you use asprintf() for allocation up above, but then seem to be enforcing some kind of (soft?) alloc limit here.
Is there a reason for these differences?
The idea was that a single line shouldn't exceed the limit. However, you are right. I'll rewrite this.
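One way to resolve the dichotomy the reviewer points out, sketched under the assumption that growing the buffer is acceptable everywhere: drop the hard limit and realloc on demand. The `out_buf_t` type and `outbuf_append()` below are illustrative names, not the actual OPAL code:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative grow-on-demand output buffer (not the OPAL implementation).
 * Instead of failing when a line exceeds a fixed OPAL_TIMING_OUTBUF_SIZE,
 * the buffer is doubled until the new line fits. */
typedef struct {
    char   *buf;
    size_t  used;
    size_t  size;
} out_buf_t;

static int outbuf_append(out_buf_t *ob, const char *line)
{
    size_t need = ob->used + strlen(line) + 1;   /* +1 for the NUL */

    if (need > ob->size) {
        size_t newsize = (0 == ob->size) ? 256 : ob->size;
        while (newsize < need) {
            newsize *= 2;
        }
        char *tmp = realloc(ob->buf, newsize);
        if (NULL == tmp) {
            return -1;   /* caller would map this to OPAL_ERR_OUT_OF_RESOURCE */
        }
        ob->buf  = tmp;
        ob->size = newsize;
    }
    memcpy(ob->buf + ob->used, line, strlen(line) + 1);
    ob->used += strlen(line);
    return 0;
}
```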
Thank you for your comments. I'll address them tomorrow.
@artpol84 If you don't want to combine absolute and relative time in one line, you can create a macro for the common format to have something like this, for example:
retest this please
@nkogteva
Test PASSed.
@artpol84 Sure, it is enough for my current task. But I think it would not be enough for the general problem of performance measurement. It would be nice to have both absolute time and relative time in the same format to make output file parsing easier (when both ompi_timing and mpi_timing_ext are specified).
@artpol84 One more question. For relative time you calculate the time difference (the second column for relative time) between the current time and the previous event. So now if I want to get the time of modex, I just take the appropriate column (because the end time follows the start time). That's OK. But in the general case, if some time events are located between the start of a procedure and its end, I will need to sum two intervals to get the procedure time. That's not very good. For example, I want to write a script that will measure performance in the same manner on different versions of the code without additional tuning. Wouldn't it be better to have specific names for specific procedures instead of time relative to the previous event, like it was done previously?
a. you initialize an overhead event like this:

    id = OPAL_TIMING_OVH_INIT(&tm_handler, "some description");

b. next you update it where you want using the obtained id:

    OPAL_TIMING_OVH_START(&tm_handler, id);
    .... processing ....
    OPAL_TIMING_OVH_END(&tm_handler, id);

Would it fit your needs?
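Since the `OPAL_TIMING_OVH_*` macros are only a proposal at this point, here is a hypothetical mock-up of the init/start/end accumulation pattern being suggested (the names and layout are illustrative, not the real framework):

```c
#include <string.h>
#include <sys/time.h>

#define MAX_OVH_EVENTS 16

/* Illustrative overhead-event accumulator: each event is registered once,
 * then repeatedly started/ended; elapsed time is summed per event id. */
typedef struct {
    const char *descr[MAX_OVH_EVENTS];
    double      total[MAX_OVH_EVENTS];
    double      started[MAX_OVH_EVENTS];
    int         nevents;
} tm_handler_t;

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

static int ovh_init(tm_handler_t *h, const char *descr)
{
    int id = h->nevents++;
    h->descr[id] = descr;
    h->total[id] = 0.0;
    return id;                       /* caller keeps the id */
}

static void ovh_start(tm_handler_t *h, int id)
{
    h->started[id] = now_sec();
}

static void ovh_end(tm_handler_t *h, int id)
{
    h->total[id] += now_sec() - h->started[id];
}
```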
I will try to summarize. It would be nice to have:
They are not exclusive, but you might want to see intervals and also have the data for postprocessing. You'll get the output in the log in human-readable format, and (if you use mpirun_prof) you'll get the postprocessed data in the file that you point out using export OMPI_MCA_opal_timing_output="test". So you shouldn't enable _ext if you don't want to postprocess.

IMHO absolute values are all you need for the postprocessing. You don't need intervals since you can compute them at that stage. And I keep only absolute values because you may want to merge events from different nodes and analyse the data after that; in this case intervals wouldn't be correct after the merge.

Do you already have any exact thoughts/intentions on parsing of those data? If so, we can do some steps in that direction. I was thinking about postprocessing to OTFStat format in the future, but I don't have the time for that now.
I will update the framework with the new functionality that we discussed. Can you tell me what intervals you need to measure? Do you need it for ompi_mpi_init, or are you measuring other code?
@artpol84 Thank you! Let's start with the intervals which were implemented in the previous version, in order to align with them:
Then everyone can use your framework for their own purposes. I think you should not worry about other cases except mpi_init.
(my bias: I actually don't have any religious opinion that opal_output() must be used; I just want to make sure you're not re-inventing the wheel when some other part of OMPI infrastructure would work ok... with that in mind, let's look at each of your analysis points...) Ok. This means that you send all the data to a single process and have that one process output the file, right? (i.e., there's no other way to ensure that all data from all processes ends up in a single file without losing any of the data) In that case, opal_output() should still be fine.
Sure, but keep in mind that you do not have to opal_output(0, ...). You can open your own stream (which won't be 0) and then use that. Hence, it doesn't have to go to stdout. Or it could go to stdout and a file. Or ......
Hmm. This sounds like you're relying on the filesystem to combine the data from all processes into a single file. I think that this is a losing proposition -- you're going to find cases where multiple processes write to the (network-filesystem hosted) file simultaneously, and therefore the data from some process(es) get lost/overwritten. I don't honestly remember if opal_output allows you to open a file in arbitrary locations, or whether they're always under the session directory... But even if they're in the session directory, is that a problem? (especially if you take the approach that all processes send data to a single process and a single process writes out the data -- then a process-wide filesystem location is irrelevant)
Yes, I think you're going to need this -- regardless of whether you use opal_output or not.
Mmm. Yes. This is a problem (you're down in OPAL and don't have communication facilities available). I think there are 3 general options:
Keep in mind that all 3 options have ordering issues -- none of them will guarantee to output data from each process in the correct order. I'd prefer something that would be close to fopen(fname, "a"). I think only option 2 is close to that.
Keep in mind that no servers are ever in perfect sync. Hence, the absolute times that you get for each event may not interleave/merge properly into a single, consolidated timeline that accurately reflects the order in which events occurred.

Out-of-the-box thinking/suggestion: the MPI tools community has spent a LOT of time/effort on exactly this kind of issue (and others). Making an accurate distributed collection agent is complicated -- perhaps we don't need to re-invent it inside OMPI. Have you considered using an external tool for this kind of collection? I was just talking to the VampirTrace guys the other day about having SCORE-P tie into the MPI_T performance variable system of OMPI. That is, the MCA var system (which is the OMPI infrastructure behind MPI_T) would basically directly feed into SCORE-P, which then outputs data files that can then be used by multiple different analysis tools. Could something like this be used for what you're trying to do?

@bertwesarg Can you comment on this idea? (I don't know Andreas' github ID offhand to @ mention him...) The idea is that we have timing data for arbitrary OPAL/ORTE/OMPI events down in the OPAL layer. Could we output these via our MPI_T infrastructure and have them reported via SCORE-P?
@jsquyres I am aware that time on different servers is not synchronous. Currently we have an external tool, mpisync, that can synchronize servers with microsecond precision. It uses well-known techniques and is derived from an existing project. I call it to measure the time offsets and save them to an input file. This file is then read by the MPI program.
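For context, the core of this kind of offset measurement (a Cristian-style round-trip estimate, which tools like mpisync build on) can be sketched as below. The function name is illustrative, and the key assumption is a symmetric network delay:

```c
/* Estimate a peer's clock offset from one request/reply exchange.
 * t_send / t_recv: local clock when the probe left / when the reply
 * arrived.  t_remote: remote clock when the peer answered.
 * Assuming the delay is symmetric, the reply was generated at local
 * time t_send + rtt/2, so the offset is t_remote minus that. */
static double estimate_offset(double t_send, double t_remote, double t_recv)
{
    double rtt = t_recv - t_send;
    return t_remote - (t_send + rtt / 2.0);
}
```

In practice several probes are taken and the one with the smallest round-trip time is kept, since it bounds the asymmetry error most tightly.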
@jsquyres
Honestly, I didn't think about data collection using standard tools. Thanks for pointing me to this.
mpitimer seems to do things at the MPI layer (e.g., I see MPI_Comm as arguments to its functions). I thought you were trying to time ORTE events...? Yes, I agree that something like NTP or PTP will be necessary to get servers "close" in sync -- but these kinds of systems will never be perfect enough to generate perfect timelines based on absolute local timestamps. I agree: these kinds of things are best left outside of OMPI. Output format: I don't know what the output format of SCORE-P is -- I don't know if it's OTF or something else. FWIW: The way the SCORE-P guys described it to me earlier this week: SCORE-P outputs a file format that can be used by a variety of different analysis tools (I don't know if that's OTF or not). Random other thought: are you using the OPAL timer framework to get high resolution timestamps? George just put in some improvements to use clock_gettime() and/or RDTSC recently. |
"...mpitimer seems to do things at the MPI layer (e.g., I see MPI_Comm as arguments to its functions). I thought you were trying to time ORTE events...?"

Yes, I am trying to measure ORTE. But OMPI BTLs usually provide access to networks with lower latency, so I run mpisync BEFORE I run the program that I measure. Check mpirun_prof in ompi/tools/ompi_info/ to see how it works.

"...Yes, I agree that something like NTP or PTP will be necessary to get servers "close" in sync -- but these kinds of systems will never be perfect enough to generate perfect timelines based on absolute local timestamps. I agree: these kinds of things are best left outside of OMPI..."

According to Wikipedia, NTP precision is 10 ms over the Internet and up to 0.2 ms in a LAN. My evaluations confirm that: I tested mpisync on clusters with NTP and the offsets were in the 0.2 - 1 ms range. About precision: the synchronization should be able to rearrange events on different nodes. mpisync performs synchronization using the existing network, so we can achieve quality that is enough to evaluate communication overhead. The precision may be even better in the case where we use the InfiniBand BTL to measure, and then evaluate OOB, which is currently TCP-based. Shortly speaking, the events of message send and receive would be represented with enough precision to estimate the message's delivery time.

"...Output format: I don't know what the output format of SCORE-P is -- I don't know if it's OTF or something else. FWIW: The way the SCORE-P guys described it to me earlier this week: SCORE-P outputs a file format that can be used by a variety of different analysis tools (I don't know if that's OTF or not)..."

I checked their website. They have OTF support on board.

"...Random other thought: are you using the OPAL timer framework to get high resolution timestamps? George just put in some improvements to use clock_gettime() and/or RDTSC recently..."

I was using pure gettimeofday but I planned to switch to more precise ones.
I'll do that along with improvements for @nkogteva and will update the PR. |
mpisync: oh, right -- I remember our discussion about this now. It basically determines/records the skew between multiple servers, and then you use that skew for post-processing of the timestamps that you collect, right? Precision: yes, 200 microseconds is good precision, but that's still a lot of clock cycles. :-) Your ORTE measurements may not care about that level of precision, but if you want to have a guaranteed reproduction of the absolute ordering of events, then you need quite sophisticated event-merging capabilities (I think that modern tools use more than just a single skew measurement at the beginning, but I could be wrong here). ...All this being said, if you don't mind a few events being out of order / don't need an absolute guarantee of overall ordering, then I should just shut up. :-) Timer: cool. Check out opal/mca/timer. It should do the Right Things regardless of what platform you're on. I'd still like to hear what @bertwesarg and the other tools people have to say -- perhaps we're trying to re-invent the wheel here, and we shouldn't do that (that's really my only point for this whole discussion)...
Please don't get hung up on the term "human readable format". It should work in all cases: human reading, parsing with scripts, or simple utilities like grep, sed, and so on. a. add host name and process name to each line
On Monday, December 8, 2014, Nadezhda Kogteva wrote:
Best regards, Artem Polyakov
@artpol84 I mean that previously it was possible to specify the output file name using an MCA parameter. I think file names, function names, and line numbers are actually not needed now. We have all the necessary information in the name of the time interval.
2014-12-08 20:48 GMT+06:00 Nadezhda Kogteva notifications@github.com:
Best regards, Artem Polyakov
Test FAILed. Build Log
Test FAILed.
Test PASSed.
Test PASSed.
@nkogteva I think it's much closer to what you need than before. Can you check it again?
Test FAILed. Build Log
Test FAILed.
Test PASSed.
@artpol84 That's fine! Thank you. The format is OK. But I have one more problem. I applied the patch from the PR, and from time to time I get the following segfault: ==== backtrace ==== I didn't look at this problem carefully. Could you please recheck it?
@nkogteva Thank you. Should be fixed now.
Test PASSed.
@nkogteva Does the latest commit fix the problem you observed? Can this PR be merged?
@artpol84 That's OK with me. Thanks. If no one else has comments, you can proceed with merging.
Indeed, looks good - please bring it in! Thanks Artem! |
I have one last minor request: could you add some overview documentation to the top of timings.h? I.e., give a 50,000 foot description of the timing system, and maybe a short example or two of how it is supposed to be used? Thanks! |
Jeff, I will provide the description asap. Will push it without pull request. |
@nkogteva please review.