
Add queue sizes to endpoint attribute #10

Closed
shefty opened this issue Sep 3, 2014 · 9 comments
shefty commented Sep 3, 2014

The endpoint attribute structure should be expanded to expose the sizes of the underlying queues. Now that the EP attributes exist, we can simplify things for the user and avoid needing to use control interfaces to override the default values. But default values should still be available to the user, with the actual values returned when an endpoint is created.

@shefty shefty added this to the alpha release milestone Sep 3, 2014
@dledford

@shefty For this, are you referring to adding fields to struct fi_info (which is used when creating an endpoint and therefore could be preset to requested values versus doing a modify sequence after the endpoint is created)? And if so, are you wanting to get as detailed as IB gets here, with options for send queue size, receive queue size, send and recv queue maximum SG entries, and possibly a request for maximum inline data too? Or do you think that's getting too fabric specific and defeating the purpose of abstracting the fabric out?


shefty commented Sep 16, 2014

I was thinking of adding the fields to fi_ep_attr, but I don't know what fields to add, if any. I was thinking along the lines of send/recv queue size and SGL sizes. But if we expand the endpoint to include the concept of sessions for multi-threaded purposes, then there may be multiple sizes, corresponding to different HW work queues. So a single send queue size value may not work. Personally, I like the idea of trying to keep things abstract and using return codes to keep the user from overrunning any lower level queues. I'm not sure that works for all apps though. And I'm not sure what to do with SGL limits. SGL limits seem easier to expose through fi_ep_attr.
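As a sketch of the kind of fields under discussion (the struct and field names below are hypothetical illustrations, not the API that was eventually adopted):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue-size fields for fi_ep_attr; illustrative only,
 * not the libfabric API. Zero means "use the provider default"; the
 * actual values would be filled in when the endpoint is created. */
struct ep_queue_attr {
	size_t tx_queue_size;	/* send work queue depth */
	size_t rx_queue_size;	/* receive work queue depth */
	size_t tx_iov_limit;	/* max SGL entries per send */
	size_t rx_iov_limit;	/* max SGL entries per receive */
};
```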

@dledford

On 09/16/2014 01:58 PM, Sean Hefty wrote:

I was thinking of adding the fields to fi_ep_attr, but I don't know what
fields to add, if any. I was thinking along the lines of send/recv queue
size and SGL sizes.

OK.

But if we expand the endpoint to include the concept
of sessions

Definition please. By sessions do you mean multiple connections between
two hosts not following the same path (like on over ib0 and one over
ib1, or say one over eth1 and one over ib0 where eth1 is RoCE enabled
and ib1 is InfiniBand)?

for multi-threaded purposes, then there may be multiple
sizes, corresponding to different HW work queues.

This will probably go beyond the scope of libfabrics. Or at least I
would think beyond the scope of the bottom layer of libfabrics. We've
talked multiple times about the difference between a libfabrics that
MPIs or other apps that want really low level, "get out of my way" type
access to the underlying fabric want, and then there are apps that want
"abstract away all that fabric stuff and give me a simple, but
performant, interface". The overhead associated with sessions seems
like it would pre-emptively force support for sessions up to that higher
layer abstraction. As such, I'm not sure you want to build that into
the lower layer data structures versus handling it entirely at a higher
layer.

At a minimum though, I can see that if you are going to support the
notion of sessions, then not only would the queue size and other
parameters need to be in fi_ep_attr, but I think you would need to move
the src_addr and dst_addr from fi_info to fi_ep_attr as well since the
addresses of each session would likely be unique.

So a single send queue
size value may not work. Personally, I like the idea of trying to keep
things abstract and using return codes to keep the user from overrunning
any lower level queues.

I had thought about that. But that is decidedly performance unfriendly in
the IB case. And it would prevent the app from implementing any sort of
credit mechanism themselves. But, for some providers, there is no
concept of a queue depth (sockets provider immediately comes to mind).
So I was thinking to add it, but define it in the API such that a user
can specify a requested queue depth in the fi_ep_attr struct, and
depending on the provider the endpoint is created on, the following
matrix of values will be placed in the fi_ep_attr struct on return:

User fills in   Provider has notion       Provider is
queue size      of queue size             queue deficient

Yes             return min(max queue      return -1
                depth, requested
                queue depth)
No              return default queue      return -1
                depth
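One reading of the matrix above, sketched as a helper (the function name and the "-1 means queue deficient" convention are illustrative, not a proposed API):

```c
#include <assert.h>

/* Illustrative model of the return-value matrix above.
 * requested == 0 means the user left the field unset;
 * provider_max < 0 means the provider has no notion of queue depth. */
static int negotiate_queue_depth(int requested, int provider_max,
				 int provider_default)
{
	if (provider_max < 0)
		return -1;	/* queue-deficient provider (e.g. sockets) */
	if (requested > 0)
		return requested < provider_max ? requested : provider_max;
	return provider_default;
}
```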

I'm not sure that works for all apps though. And
I'm not sure what to do with SGL limits. SGL limits seem easier to
expose through fi_ep_attr.

I would agree with that. I think there are a number of things that are
in fi_info that can go to fi_ep_attr if you are going to support
sessions, and a number of things that can come back if you aren't.
However, since there's no harm in them being in fi_ep_attr, we can put
stuff there and plan for the possible future that way.


shefty commented Sep 16, 2014

For 'sessions', what I mean are multiple HW command queues mapped to the application. The command queues have the same transport and network level address. If the queues can receive data, they may have a different session level address, which ideally would be exposed to the app as an index. A very simple use case would be an app using different sessions to communicate with different sets of remote processes. (I haven't thought through this concept, so my ideas are just up in the air at the moment.)

I agree that we will need to expose a size for application credit schemes. Maybe the answer is in the definition. (Note that I'm lousy coming up with names.)

min_outstanding_send - The minimum number of data transfers that a provider will queue to an endpoint.

This still allows for returning EBUSY. A provider may be able to queue more requests.

I also want to consider software providers that enhance the capabilities of a HW provider. E.g. there could be a provider that supports transfers larger than 4 GB, by breaking up a large request into multiple smaller requests. I don't think this causes any issues to a reported queue size, but I haven't thought through it.
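The chunking such a layered provider would do can be sketched as follows (the 4 GB limit and the names are assumptions for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: a layered software provider splitting a request
 * larger than the underlying hardware limit into smaller chunks. Each
 * chunk consumes one underlying queue entry, so the layered provider
 * must account for that (or queue internally) when reporting its own
 * queue size. */
#define HW_MAX_XFER ((uint64_t)1 << 32)	/* assumed 4 GB hardware limit */

static unsigned chunks_needed(uint64_t len)
{
	return (unsigned)((len + HW_MAX_XFER - 1) / HW_MAX_XFER);
}
```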

Btw, it's kind of arbitrary which fields go into fi_info versus fi_ep_attr. I wanted to keep all mandatory fields in fi_info, and only require those apps that want to deal at the lower level fill out fi_ep_attr.

@dledford

On 09/16/2014 02:45 PM, Sean Hefty wrote:

For 'sessions', what I mean are multiple HW command queues mapped to the
application. The command queues have the same transport and network
level address. If the queues can receive data, they may have a different
session level address, which ideally would be exposed to the app as an
index. A very simple use case would be an app using different sessions
to communicate with different sets of remote processes. (I haven't
thought through this concept, so my ideas are just up in the air at the
moment.)

I think I get what you mean (but I doubt that sets of different
processes is reasonable, you will likely need a whole new EP for each
different process you talk to due to the requirement of having to
listen/connect to different ports/services). However, an example that
does make sense to me, and something I've been looking at doing as an
optimization to conserve memory use in IB communications, is the idea of
having multiple queue pairs between two apps where the queue pairs
utilized different maximum message sizes and queue depths in order to
allow you to send lots of small messages without wasting huge amounts of
space. Such as a queue pair with a max message size of 256 bytes,
another at 1k, another at 4k, another at 16k, and one at 64k, with each
queue pair having progressively fewer entries as the size got larger.
For apps that send lots of small messages with some medium and large
size messages mixed in, this would make a lot of sense (ordering issues
not being considered here, the app would either need to take care of
that or there would need to be a layered ordering provider on top of
this scheme).

I agree that we will need to expose a size for application credit
schemes. Maybe the answer is in the definition. (Note that I'm lousy
coming up with names.)

min_outstanding_send - The minimum number of data transfers that a
provider will queue to an endpoint.

Except that most credit schemes are based on the opposite of this: a
maximum that the app knows and can plan for minus the currently
in-flight number.

This still allows for returning EBUSY.

When it comes to applications that want to manage their credits, if we
ever return EBUSY, we've failed.

A provider may be able to queue
more requests.

I think it's fair to say that, if an app wants to manage its own
in-flight counts and credits, that the maximum queue depth plus sends
sent minus completions received should allow them to do so
deterministically, and that only applications that don't bother to
track queue state should ever hit EBUSY, but for them it should exist
and the tracking of queue depths versus sent versus completed should be
an optional optimization left up to the application. Fair enough?
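The deterministic accounting described above (known maximum queue depth, plus sends posted, minus completions reaped) can be sketched as an app-side tracker; all names here are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical app-side credit tracker: with a known maximum queue
 * depth, counting posts and completions is enough for the app to
 * avoid ever seeing EBUSY. */
struct credits {
	unsigned max_depth;	/* depth reported at endpoint creation */
	unsigned in_flight;	/* sends posted minus completions seen */
};

static bool credit_available(const struct credits *c)
{
	return c->in_flight < c->max_depth;
}

static void credit_post(struct credits *c)     { c->in_flight++; }
static void credit_complete(struct credits *c) { c->in_flight--; }
```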

I also want to consider software providers that enhance the capabilities
of a HW provider. E.g. there could be a provider that supports transfers
larger than 4 GB, by breaking up a large request into multiple smaller
requests. I don't think this causes any issues to a reported queue size,
but I haven't thought through it.

It shouldn't, but it would mean that the software provider will have to
provide a minimal queue of their own to compensate for split packets.
But that's OK, they have to split and recombine packets, a small queue
is nothing major to add to that.

Btw, it's kind of arbitrary which fields go into fi_info versus
fi_ep_attr. I wanted to keep all mandatory fields in fi_info, and only
require those apps that want to deal at the lower level fill out fi_ep_attr.

OK, I can understand that. I'll make a note to that effect in the
header file ;-)


shefty commented Sep 16, 2014

I think I get what you mean (but I doubt that sets of different
processes is reasonable, you will likely need a whole new EP for each
different process you talk to due to the requirement of having to
listen/connect to different ports/services).

Ah - I was thinking more of unconnected endpoints. HPC apps in general want reliable unconnected endpoints. There are at least a couple of vendors that support this (including Intel). The Mellanox XRC and dynamic connection features are steps in this direction.

does make sense to me, and something I've been looking at doing as an
optimization to conserve memory use in IB communications, is the idea of
having multiple queue pairs between two apps where the queue pairs
utilized different maximum message sizes and queue depths in order to

The libfabric feature to do this is the FI_MULTI_RECV flag, which is support for 'slab based' memory buffering. I.e. the user posts a single large buffer, and multiple receives simply fill in the buffer. This would be more for future HW or non-offload HW. IB could simulate this by using RDMA writes with immediate in place of sending messages.
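The slab consumption can be modeled like this (a toy model, not provider code; the struct, names, and release-threshold behavior are assumptions sketched from the description above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of FI_MULTI_RECV slab consumption. Each arriving message
 * lands at the current offset in the single posted buffer; once the
 * remaining free space drops below a threshold, the buffer is released
 * back to the app, which posts a fresh slab. */
struct slab {
	size_t size;		/* bytes in the posted buffer */
	size_t used;		/* bytes consumed so far */
	size_t min_free;	/* release threshold */
};

/* Returns true when the slab is exhausted and must be replaced. */
static bool slab_deliver(struct slab *s, size_t msg_len)
{
	s->used += msg_len;
	return s->size - s->used < s->min_free;
}
```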

I agree that we will need to expose a size for application credit
schemes. Maybe the answer is in the definition. (Note that I'm lousy
coming up with names.)

min_outstanding_send - The minimum number of data transfers that a
provider will queue to an endpoint.

Except that most credit schemes are based on the opposite of this: a
maximum that the app knows and can plan for minus the currently
in-flight number.

The app can set its starting max_credits to the min_outstanding. I used min instead of max, since the app may be able to post more. E.g. for iWarp to support RDMA write with immediate, it would consume 2 queue entries (RDMA write + send message). So it would set min_outstanding = 1/2 (queue size). If the app posts nothing but writes with immediate, it will block at min_outstanding. But if it only does sends, it can queue twice that amount.

I think this meets the intent that you want. The only issue is really the name.
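The iWARP arithmetic above reduces to a simple division (an illustrative helper, not an API):

```c
#include <assert.h>

/* Illustrative: if the worst-case operation consumes k hardware queue
 * entries (k = 2 for an emulated RDMA write with immediate on iWARP),
 * the provider can safely advertise min_outstanding as: */
static unsigned min_outstanding(unsigned hw_queue_size,
				unsigned worst_case_entries)
{
	return hw_queue_size / worst_case_entries;
}
```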

@shefty shefty self-assigned this Oct 1, 2014

shefty commented Oct 1, 2014

A general proposal to expose this is described here:

http://lists.openfabrics.org/pipermail/ofiwg/2014-September/000354.html

I will post a patch for this idea for further discussion.

@shefty shefty closed this as completed Oct 2, 2014
@shefty shefty reopened this Oct 2, 2014

shefty commented Oct 2, 2014

A patch has been developed, but has not been committed.


shefty commented Oct 10, 2014

An initial patch for this was committed as 5cb07ab. Discussions are continuing on the mailing list to enhance this, but I'm closing this issue, since the queue sizes are now available.

@shefty shefty closed this as completed Oct 10, 2014
hppritcha added a commit to hppritcha/libfabric that referenced this issue Feb 10, 2015
Honggang-LI added a commit to Honggang-LI/libfabric that referenced this issue Dec 17, 2020
ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fff4c61e7e0 at pc 0x14f2cb7ae0b9 bp 0x7fff4c61e650 sp 0x7fff4c61ddd8
WRITE of size 17 at 0x7fff4c61e7e0 thread T0
    #0 0x14f2cb7ae0b8  (/lib64/libasan.so.5+0xb40b8)
    #1 0x14f2cb7aedd2 in vsscanf (/lib64/libasan.so.5+0xb4dd2)
    #2 0x14f2cb7aeede in __interceptor_sscanf (/lib64/libasan.so.5+0xb4ede)
    #3 0x14f2cb230766 in ofi_addr_format src/common.c:401
    #4 0x14f2cb233238 in ofi_str_toaddr src/common.c:780
    #5 0x14f2cb314332 in vrb_handle_ib_ud_addr prov/verbs/src/verbs_info.c:1670
    #6 0x14f2cb314332 in vrb_get_match_infos prov/verbs/src/verbs_info.c:1787
    #7 0x14f2cb314332 in vrb_getinfo prov/verbs/src/verbs_info.c:1841
    #8 0x14f2cb21fc28 in fi_getinfo_ src/fabric.c:1010
    #9 0x14f2cb25fcc0 in ofi_get_core_info prov/util/src/util_attr.c:298
    #10 0x14f2cb269b20 in ofix_getinfo prov/util/src/util_attr.c:321
    #11 0x14f2cb3e29fd in rxd_getinfo prov/rxd/src/rxd_init.c:122
    #12 0x14f2cb21fc28 in fi_getinfo_ src/fabric.c:1010
    #13 0x407150 in ft_getinfo common/shared.c:794
    #14 0x414917 in ft_init_fabric common/shared.c:1042
    #15 0x402f40 in run functional/bw.c:155
    #16 0x402f40 in main functional/bw.c:252
    #17 0x14f2ca1b28e2 in __libc_start_main (/lib64/libc.so.6+0x238e2)
    #18 0x401d1d in _start (/root/libfabric/fabtests/functional/fi_bw+0x401d1d)

Address 0x7fff4c61e7e0 is located in stack of thread T0 at offset 48 in frame
    #0 0x14f2cb2306f3 in ofi_addr_format src/common.c:397

  This frame has 1 object(s):
    [32, 48) 'fmt' <== Memory access at offset 48 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism or swapcontext
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/lib64/libasan.so.5+0xb40b8)
Shadow bytes around the buggy address:
  0x1000698bbca0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbcb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbcc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbcd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbce0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x1000698bbcf0: 00 00 00 00 00 00 f1 f1 f1 f1 00 00[f2]f2 f3 f3
  0x1000698bbd00: f3 f3 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
  0x1000698bbd10: f1 f1 00 f2 f2 f2 f2 f2 f2 f2 00 f2 f2 f2 f2 f2
  0x1000698bbd20: f2 f2 00 f2 f2 f2 f2 f2 f2 f2 00 f2 f2 f2 f2 f2
  0x1000698bbd30: f2 f2 00 00 00 00 00 06 f2 f2 f2 f2 f2 f2 00 00
  0x1000698bbd40: 00 00 00 06 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb

Fixes: 5d31276 ("common: Redo address string conversions")
Signed-off-by: Honggang Li <honli@redhat.com>
shefty pushed a commit that referenced this issue Dec 19, 2020
ooststep pushed a commit to ooststep/libfabric that referenced this issue Feb 10, 2023
If a posted receive matches with a saved receive, we may need to
increment the rx counter.  Set the rx counter increment callback
to match that of the posted receive.  This fixes an assert in
xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0  0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1  0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2  0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4  0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5  0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6  0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7  0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8  0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9  0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
ooststep pushed a commit to ooststep/libfabric that referenced this issue Feb 10, 2023
shefty added a commit that referenced this issue Feb 10, 2023