Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Update network-mpi.c #133

Open
wants to merge 24 commits into
base: master
Choose a base branch
from
Open

WIP: Update network-mpi.c #133

wants to merge 24 commits into from

Conversation

caitlinross
Copy link
Member

Dan's Version of network-mpi.c with queue

Dan, I'm creating a pull request so I can add in comments/feedback to specific lines of your code.


If this merge represents a feature addition to ROSS, the following items must be completed before the branch will be merged:

  • Document the feature on the blog (See the Contributing guide in the gh-pages branch).
    Include a link to your blog post in the Pull Request.
  • One or more TravisCI tests should be created (and they should pass)
  • Through the TravisCI tests, coverage should increase
  • Test with CODES to ensure everything continues to work

Dan's Version of network-mpi.c with queue
@caitlinross caitlinross self-assigned this Jul 19, 2018
Copy link
Member Author

@caitlinross caitlinross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've caught a few things that I think should fix some issues.

@@ -786,13 +860,31 @@ tw_net_statistics(tw_pe * me, tw_statistics * s)

if(MPI_Reduce(&(s->s_net_events),
&me->stats.s_net_events,
17,
16,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to your current changes, but at some point, you'll need to update your branch to pull back in the original code. I think you were working from an old version of this file, and tw_net_statistics has recently changed, so you'll need to make sure that it's up to date with ROSS master. I don't think you'll need to change this at all for what you're working on, so it should just look exactly like what ROSS master currently looks like.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, this comment means the full tw_net_statistics function needs to be reverted back, not just this one line.


if(!(e = tw_event_grab(me)))
{
if(tw_gvt_inprogress(me))
tw_error(TW_LOC, "Out of events in GVT! Consider increasing --extramem");
tw_error(TW_LOC, "out of events in GVT!");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here as with the comment on tw_net_statistics below. This was a recent change, so you'll have to update your branch so that way we don't eventually undo recent changes to ROSS master.

@@ -460,9 +533,7 @@ recv_finish(tw_pe *me, tw_event *e, char * buffer)
/* Fast case, we are sending to our own PE and
* there is no rollback caused by this send.
*/
start = tw_clock_read();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line and lines 351 and 465 needs to be added back in as well (same reason as tw_net_statistics).

unsigned int cur;
int front;//add, front of queue
int coda;//add, back of queue but back is already a variable somewhere
int sizeOfQ;//add, size of queue array
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent, we use snake_case, not camelCase.

int front;//add, front of queue
int coda;//add, back of queue but back is already a variable somewhere
int sizeOfQ;//add, size of queue array
int numInQ;//add, number of elements in queue
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change this to something like num_in_free_q or something? I originally thought it was supposed to be the number of active operations you have in the queue based on the name, but eventually I realized it's supposed to be the number of free elements in the queue. Being a little clearer in the name will probably help in the future if someone else needs to make changes here.

{
unsigned id = posted_recvs.cur;

int id = fr_q_dq(&posted_recvs);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be moved until after trying to pull the event using tw_event_grab. As it is currently, if there is no event to grab, it will mess up your buffer's free list.

tw_node *dest_node = NULL;

unsigned id = posted_sends.cur;
int id = fr_q_dq(&posted_sends);// fixed, grabs from front of queue, moves front up one element
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will need to be moved to just after if (!e) break; because if the event is null, it's going to mess up your free element list for posted_sends.

@@ -207,7 +274,7 @@ tw_net_minimum(tw_pe *me)
e = e->next;
}

for (i = 0; i < posted_sends.cur; i++) {
for (i = 0; i < posted_sends.cur; i++) { //fix this line (?)
e = posted_sends.event_list[i];
if (m > e->recv_ts)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're also going to need to add a check to see if e is NULL.

@@ -101,7 +156,19 @@ init_q(struct act_q *q, const char *name)
q->event_list = (tw_event **) tw_calloc(TW_LOC, name, sizeof(*q->event_list), n);
q->req_list = (MPI_Request *) tw_calloc(TW_LOC, name, sizeof(*q->req_list), n);
q->idx_list = (int *) tw_calloc(TW_LOC, name, sizeof(*q->idx_list), n);
q->status_list = (MPI_Status *) tw_calloc(TW_LOC, name, sizeof(*q->status_list), n);
q->free_idx_list = (int *) tw_calloc(TW_LOC, name, sizeof(*q->idx_list), n);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you already caught that you have an issue with reading/writing out of bounds of this list.

yaciud added 15 commits July 21, 2018 18:35
Changed camelCase to snake_case
variable names changed, id assigned after events are checked for null in sends, and e null check added in tw_net_minimum
added NULL check before return
Commented out line 257
Added line tw_clock start;,  start = tw_clock_read();, and dest_pe->stats.s_pq += tw_clock_read() - start;. Changed error message to be more explicit, like in the master. Edited tw_net_statistics added to match master.
Exchanged incrementing in the queueing function for an addition outside of the loops (wherever the function was used).
added delete line in insert function
commented out entire recv_finish cancel if statement, within which is an else statement attached to the if(cancel!=null) for insertion
Filter anti-messages version
rolled back to original copy of all_tree.c
updated to segfaulting exception code
removed error for nonexistent events.
Exception code written and implemented, minor errors with all tree size with read_buffer size of 5000. Code is functional, but could use some optimization tweaking.
adds one AVL tree element after out of order event is detected to offset accessing tw_hash_remove() twice. Removed various debug statements in commented code.
late_recv_finish(me, e, NULL, q, n);
#endif
late_recv_finish(me, e, NULL, q, n);
//might need an augmented version for ROSS_MEMORY?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No don't worry about ROSS_MEMORY. It's being removed in another branch.

yaciud added 2 commits March 29, 2019 17:27
Added reset queue capability and additional new reset queue capability
@gonsie
Copy link
Member

gonsie commented Sep 30, 2019

@yaciud is this still a work in progress? Or, is this branch in some final state that works for you?

@yaciud
Copy link

yaciud commented Sep 30, 2019

I consider it a work in progress, but progress on it has stalled. In quick succession the May update to ROSS happened and the clusters I had access to got wiped or had a malfunction, so my code became outdated and it became difficult to test. I believe I have versions that work with the current version of ROSS, but there are still remnants from the outdated MPI layer in there that I'd like to get rid of before considering the work in its final stage.

Also by versions I mean that I have a 3 different ways of solving the problem written and working. I'd like to optimize 2 of them a little bit more, but they all seem to be improvements over the original network layer. It would probably take less than 2 days to get everything ready if needed.

This is the network file that uses both the methods of filling holes in the MPI_requests by moving events on the far side of the array inward (send event) and keeping track of events that have arrived and overwriting them(recv event). File still has ROSS MEMORY and debugging statements unfortunately.
So, it says queue but I switched implementations to a stack. Still messy, should be cleaned up. Small performance issue because cur variable does not shrink, causing MPI and some loops dependent on cur to check more elements than it needs to. I have a fix for it that is not currently implemented. At the end of test_q, just iterate from cur backwards to the next cell of the array that is not null, then we would need to check if we place events farther in the array then cur and either throw larger value out and continue with the next value or set cur to that new value.
This is the version that fills holes in the MPI request array by taking events from the back of the array and places them in spots where the request have been satisfied.
Merged current files with my files.
Updated old code to more easily integrate with new code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants