Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[improve] Use single buffer for metrics when noUnsafe use #23612

Merged
merged 6 commits into from
Nov 27, 2024

Conversation

zymap
Copy link
Member

@zymap zymap commented Nov 19, 2024


Motivation

We have met some core dumps when running pulsar service. Then saw Netty suggested using io.netty.noUnsafe to see if it works. netty/netty#2950. After adding that property, the core dump is gone, but the metrics failed to report because of the UnsupportedOpeartion exception.

Netty has a property io.netty.noUnsafe which allows disabling use the unsafe api. When this is enabled, the metrics won't collect because of the Unspport operation exception.
The root cause is it calls the internalNioBuffer which only allow to call when the buffer count is 1. But when using the composite buffer the buffer count may more than 1.
So use the buffer directly when the flag is enabled.

The exception:
Caused by: java.lang.UnsupportedOperationException
at io.netty.buffer.CompositeByteBuf.internalNioBuffer(CompositeByteBuf.java:1657) ~[io.netty-netty-buffer-4.1.113.Final.jar:4.1.113.Final]
at io.netty.buffer.ByteBufUtil.writeUtf8(ByteBufUtil.java:924) ~[io.netty-netty-buffer-4.1.113.Final.jar:4.1.113.Final]
at io.netty.buffer.ByteBufUtil.writeUtf8(ByteBufUtil.java:900) ~[io.netty-netty-buffer-4.1.113.Final.jar:4.1.113.Final] at io.netty.buffer.AbstractByteBuf.setCharSequence0(AbstractByteBuf.java:707) ~[io.netty-netty-buffer-4.1.113.Final.jar:4.1.113.Final]

Fixes #xyz

Main Issue: #xyz

PIP: #xyz

Motivation

Modifications

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

---

### Motivation

Netty has a property io.netty.noUnsafe which allows to force don't use
unsafe api. When this is enabled, the metrics won't collect because
of the Unspport operation exception.
The root cause is it calls the internalNioBuffer which only allow to
call when the buffer count is 1. But when using the composite buffer
the buffer count may more than 1.
So use the buffer directly when the flag is enabled.

The exception:
                                                                                                                                                                                                                                                             Caused by: java.lang.UnsupportedOperationException
    at io.netty.buffer.CompositeByteBuf.internalNioBuffer(CompositeByteBuf.java:1657) ~[io.netty-netty-buffer-4.1.113.Final.jar:4.1.113.Final]
    at io.netty.buffer.ByteBufUtil.writeUtf8(ByteBufUtil.java:924) ~[io.netty-netty-buffer-4.1.113.Final.jar:4.1.113.Final]
     at io.netty.buffer.ByteBufUtil.writeUtf8(ByteBufUtil.java:900) ~[io.netty-netty-buffer-4.1.113.Final.jar:4.1.113.Final]                                                                                                                                                                                                                 at io.netty.buffer.AbstractByteBuf.setCharSequence0(AbstractByteBuf.java:707) ~[io.netty-netty-buffer-4.1.113.Final.jar:4.1.113.Final]
Copy link

@zymap Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@lhotari
Copy link
Member

lhotari commented Nov 19, 2024

Netty has a property io.netty.noUnsafe which allows disabling use the unsafe api. When this is enabled, the metrics won't collect because of the Unspport operation exception.

The PR itself makes sense. However, there shouldn't be a need to ever disable the unsafe API when running Pulsar. It would cause a lot of performance issues if that's used. What would be the motivation for disabling the unsafe API?

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the comment on the test.

@codelipenghui
Copy link
Contributor

We hit an issue with following core dump

---------------  T H R E A D  ---------------

Current thread (0x00007f334cfa2780):  JavaThread "broker-topic-workers-OrderedExecutor-16-0"        [_thread_in_Java, id=200, stack(0x00007f2fab00d000,0x00007f2fab10da90) (1026K)]

Stack: [0x00007f2fab00d000,0x00007f2fab10da90],  sp=0x00007f2fab10cad0,  free space=1022k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
v  ~StubRoutines::jbyte_disjoint_arraycopy_avx3 0x00007f3767573140

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x00007f2f2d883000

Registers:
RAX=0x000000000000001e, RBX=0x00007f33576fc2c0, RCX=0x000000000039211e, RDX=0x00000000003d0715
RSP=0x00007f2fab10cad0, RBP=0x00007f2fab10cad0, RSI=0x00007f2f2d0efee2, RDI=0x00007f2f2d4f0ecb
R8 =0x000000000003e537, R9 =0x00000000802ae9a0, R10=0x00007f376757b1a0, R11=0x000000000000001e
R12=0x00000000003d0715, R13=0x0000040359706768, R14=0x00000000802ce3f0, R15=0x00007f334cfa2780
RIP=0x00007f3767573140, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
  TRAPNO=0x000000000000000e


Register to memory mapping:

RAX=0x000000000000001e is an unknown value
RBX={method} {0x00007f33576fc2c0} 'memoryAddress' '()J' in 'io/netty/buffer/UnpooledUnsafeDirectByteBuf'
RCX=0x000000000039211e is an unknown value
RDX=0x00000000003d0715 is an unknown value
RSP=0x00007f2fab10cad0 is pointing into the stack for thread: 0x00007f334cfa2780
RBP=0x00007f2fab10cad0 is pointing into the stack for thread: 0x00007f334cfa2780
RSI=0x00007f2f2d0efee2 points into unknown readable memory: 00 00 00 00 00 00
RDI=0x00007f2f2d4f0ecb is an unknown value
R8 =0x000000000003e537 is an unknown value
R9 =0x00000000802ae9a0 is a pointer to class: 
io.netty.buffer.UnpooledUnsafeDirectByteBuf {0x00000000802ae9a0}
 - instance size:     10
 - klass size:        309
 - access:            public synchronized 
 - flags:             rewritten has_nonstatic_fields should_verify_class has_localvariable_table has_miranda_methods has_final_method 
 - state:             fully_initialized
 - name:              'io/netty/buffer/UnpooledUnsafeDirectByteBuf'
 - super:             'io/netty/buffer/UnpooledDirectByteBuf'
 - sub:               'io/netty/buffer/ByteBufUtil$ThreadLocalUnsafeDirectByteBuf'   'io/netty/buffer/WrappedUnpooledUnsafeDirectByteBuf'   'io/netty/buffer/UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeDirectByteBuf'   'io/netty/buffer/UnpooledUnsafeNoCleanerDirectByteBuf'   (0 more klasses...)
 - arrays:            null
 - methods:           Array<T>(0x00007f33576fbce8)
 - method ordering:   Array<T>(0x00007f3356000018)
 - default_methods:   Array<T>(0x0000000000000000)
 - local interfaces:  Array<T>(0x00007f3356000058)
 - trans. interfaces: Array<T>(0x00007f3357685620)
 - constants:         constant pool [246] {0x00007f33576fb4b8} for 'io/netty/buffer/UnpooledUnsafeDirectByteBuf' cache=0x00007f33595640d8
 - class loader data:  loader data: 0x00007f3355d6a670 for instance a 'jdk/internal/loader/ClassLoaders$AppClassLoader'{0x00000406bc200868}
 - source file:       'UnpooledUnsafeDirectByteBuf.java'
 - class annotations:       Array<T>(0x0000000000000000)
 - class type annotations:  Array<T>(0x0000000000000000)
 - field annotations:       Array<T>(0x0000000000000000)
 - field type annotations:  Array<T>(0x0000000000000000)
 - inner classes:     Array<T>(0x00007f3356000028)
 - nest members:     Array<T>(0x00007f3356000028)
 - permitted subclasses:     Array<T>(0x00007f3356000028)
 - java mirror:       a 'java/lang/Class'{0x000004000e274580} = 'io/netty/buffer/UnpooledUnsafeDirectByteBuf'
 - vtable length      234  (start addr: 0x00000000802aeb68)
 - itable length      17 (start addr: 0x00000000802af2b8)
 - ---- static fields (0 words):
 - ---- non-static fields (8 words):
 - 'readerIndex' 'I' @12 
 - 'writerIndex' 'I' @16 
 - private 'markedReaderIndex' 'I' @20 
 - private 'markedWriterIndex' 'I' @24 
 - private 'maxCapacity' 'I' @28 
 - private volatile 'refCnt' 'I' @32 
 - private 'capacity' 'I' @36 
 - private 'doNotFree' 'Z' @40 
 - private final 'alloc' 'Lio/netty/buffer/ByteBufAllocator;' @48 
 - 'buffer' 'Ljava/nio/ByteBuffer;' @56 
 - private 'tmpNioBuf' 'Ljava/nio/ByteBuffer;' @64 
 - 'memoryAddress' 'J' @72 
 - non-static oop maps: 48-64 
R10=0x00007f376757b1a0 is at begin+0 in a stub
StubRoutines::unsafe_arraycopy [0x00007f376757b1a0, 0x00007f376757b1d7] (55 bytes)
R11=0x000000000000001e is an unknown value
R12=0x00000000003d0715 is an unknown value
R13=0x0000040359706768 is a zaddress: org.apache.pulsar.common.api.proto.MessageMetadata 
{0x0000040359706768} - klass: 'org/apache/pulsar/common/api/proto/MessageMetadata'
 - ---- fields (total size 36 words):
 - private '_replicatedFromBufferLen' 'I' @12  -1 (0xffffffff)
 - private 'sequenceId' 'J' @16  -1 (0xffffffffffffffff)
 - private 'highestSequenceId' 'J' @24  -1 (0xffffffffffffffff)
 - private 'txnidLeastBits' 'J' @32  0 (0x0000000000000000)
 - private 'publishTime' 'J' @40  1729041383477 (0x0000019292e5c035)
 - private 'txnidMostBits' 'J' @48  0 (0x0000000000000000)
 - private 'deliverAtTime' 'J' @56  0 (0x0000000000000000)
 - private 'eventTime' 'J' @64  0 (0x0000000000000000)
 - private '_replicatedFromBufferIdx' 'I' @72  -1 (0xffffffff)
 - private '_partitionKeyBufferIdx' 'I' @76  -1 (0xffffffff)
 - private 'uncompressedSize' 'I' @80  0 (0x00000000)
 - private '_replicateTosCount' 'I' @84  0 (0x00000000)
 - private 'numMessagesInBatch' 'I' @88  3868 (0x00000f1c)
 - private '_encryptionKeysCount' 'I' @92  0 (0x00000000)
 - private '_producerNameBufferLen' 'I' @96  20 (0x00000014)
 - private '_producerNameBufferIdx' 'I' @100  -1 (0xffffffff)
 - private '_encryptionParamIdx' 'I' @104  -1 (0xffffffff)
 - private '_encryptionAlgoBufferLen' 'I' @108  -1 (0xffffffff)
 - private '_schemaVersionIdx' 'I' @112  -1 (0xffffffff)
 - private '_encryptionParamLen' 'I' @116  -1 (0xffffffff)
 - private '_schemaVersionLen' 'I' @120  -1 (0xffffffff)
 - private '_orderingKeyIdx' 'I' @124  -1 (0xffffffff)
 - private '_orderingKeyLen' 'I' @128  -1 (0xffffffff)
 - private 'markerType' 'I' @132  0 (0x00000000)
 - private '_propertiesCount' 'I' @136  1 (0x00000001)
 - private '_partitionKeyBufferLen' 'I' @140  -1 (0xffffffff)
 - private '_encryptionAlgoBufferIdx' 'I' @144  -1 (0xffffffff)
 - private '_cachedSize' 'I' @148  78 (0x0000004e)
 - private '_uuidBufferIdx' 'I' @152  -1 (0xffffffff)
 - private 'chunkId' 'I' @156  0 (0x00000000)
 - private '_uuidBufferLen' 'I' @160  -1 (0xffffffff)
 - private 'numChunksFromMsg' 'I' @164  0 (0x00000000)
 - private 'totalChunkMsgSize' 'I' @168  0 (0x00000000)
 - private '_bitField0' 'I' @172  2097671 (0x00200207)
 - private 'partitionKeyB64Encoded' 'Z' @176  false (0x00)
 - private 'nullValue' 'Z' @177  false (0x00)
 - private 'nullPartitionKey' 'Z' @178  false (0x00)
 - private 'producerName' 'Ljava/lang/String;' @184  "KOP-PID-PREFIX--1--1"{0x0000040359706500} (0x00806b2e0ca01510)
 - private 'properties' 'Ljava/util/List;' @192  a 'java/util/ArrayList'{0x0000040359706888} (0x00806b2e0d111510)
 - private 'replicatedFrom' 'Ljava/lang/String;' @200  null (0x0000000000000000)
 - private 'partitionKey' 'Ljava/lang/String;' @208  null (0x0000000000000000)
 - private 'replicateTos' 'Ljava/util/List;' @216  null (0x0000000000000000)
 - private 'compression' 'Lorg/apache/pulsar/common/api/proto/CompressionType;' @224  a 'org/apache/pulsar/common/api/proto/CompressionType'{0x00000400946c6ea8} (0x0080128d8dd51510)
 - private 'encryptionKeys' 'Ljava/util/List;' @232  null (0x0000000000000000)
 - private 'encryptionAlgo' 'Ljava/lang/String;' @240  null (0x0000000000000000)
 - private 'encryptionParam' 'Lio/netty/buffer/ByteBuf;' @248  null (0x0000000000000000)
 - private 'schemaVersion' 'Lio/netty/buffer/ByteBuf;' @256  null (0x0000000000000000)
 - private 'orderingKey' 'Lio/netty/buffer/ByteBuf;' @264  null (0x0000000000000000)
 - private 'uuid' 'Ljava/lang/String;' @272  null (0x0000000000000000)
 - private '_parsedBuffer' 'Lio/netty/buffer/ByteBuf;' @280  null (0x0000000000000000)
R14=0x00000000802ce3f0 is a pointer to class: 
io.netty.buffer.PooledUnsafeDirectByteBuf {0x00000000802ce3f0}
 - instance size:     14
 - klass size:        304
 - access:            final synchronized 
 - flags:             rewritten has_nonstatic_fields should_verify_class has_localvariable_table has_miranda_methods 
 - state:             fully_initialized
 - name:              'io/netty/buffer/PooledUnsafeDirectByteBuf'
 - super:             'io/netty/buffer/PooledByteBuf'
 - sub:               
 - arrays:            null
 - methods:           Array<T>(0x00007f33577af718)
 - method ordering:   Array<T>(0x00007f3356000018)
 - default_methods:   Array<T>(0x0000000000000000)
 - local interfaces:  Array<T>(0x00007f3356000058)
 - trans. interfaces: Array<T>(0x00007f3357685620)
 - constants:         constant pool [301] {0x00007f33577b0000} for 'io/netty/buffer/PooledUnsafeDirectByteBuf' cache=0x00007f2fa5585e58
 - class loader data:  loader data: 0x00007f3355d6a670 for instance a 'jdk/internal/loader/ClassLoaders$AppClassLoader'{0x00000406bc200868}
 - source file:       'PooledUnsafeDirectByteBuf.java'
 - class annotations:       Array<T>(0x0000000000000000)
 - class type annotations:  Array<T>(0x0000000000000000)
 - field annotations:       Array<T>(0x0000000000000000)
 - field type annotations:  Array<T>(0x0000000000000000)
 - generic signature: 'Lio/netty/buffer/PooledByteBuf<Ljava/nio/ByteBuffer;>;'
 - inner classes:     Array<T>(0x00007f33577b2da0)
 - nest members:     Array<T>(0x00007f3356000028)
 - permitted subclasses:     Array<T>(0x00007f3356000028)
 - java mirror:       a 'java/lang/Class'{0x000004000cdb1158} = 'io/netty/buffer/PooledUnsafeDirectByteBuf'
 - vtable length      229  (start addr: 0x00000000802ce5b8)
 - itable length      17 (start addr: 0x00000000802cece0)
 - ---- static fields (1 words):
 - private static final 'RECYCLER' 'Lio/netty/util/internal/ObjectPool;' @176 
 - ---- non-static fields (12 words):
 - 'readerIndex' 'I' @12 
 - 'writerIndex' 'I' @16 
 - private 'markedReaderIndex' 'I' @20 
 - private 'markedWriterIndex' 'I' @24 
 - private 'maxCapacity' 'I' @28 
 - private volatile 'refCnt' 'I' @32 
 - protected 'offset' 'I' @36 
 - protected 'handle' 'J' @40 
 - protected 'length' 'I' @48 
 - 'maxLength' 'I' @52 
 - private final 'recyclerHandle' 'Lio/netty/util/Recycler$EnhancedHandle;' @56 
 - protected 'chunk' 'Lio/netty/buffer/PoolChunk;' @64 
 - protected 'memory' 'Ljava/lang/Object;' @72 
 - 'cache' 'Lio/netty/buffer/PoolThreadCache;' @80 
 - 'tmpNioBuf' 'Ljava/nio/ByteBuffer;' @88 
 - private 'allocator' 'Lio/netty/buffer/ByteBufAllocator;' @96 
 - private 'memoryAddress' 'J' @104 
 - non-static oop maps: 56-96 
R15=0x00007f334cfa2780 is a thread

And set io.netty.noUnsafe=true can resolve the crash issue as a short term solution.

@codecov-commenter
Copy link

codecov-commenter commented Nov 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.34%. Comparing base (bbc6224) to head (cf31ebd).
Report is 756 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #23612      +/-   ##
============================================
+ Coverage     73.57%   74.34%   +0.77%     
- Complexity    32624    34463    +1839     
============================================
  Files          1877     1944      +67     
  Lines        139502   147200    +7698     
  Branches      15299    16240     +941     
============================================
+ Hits         102638   109441    +6803     
- Misses        28908    29314     +406     
- Partials       7956     8445     +489     
Flag Coverage Δ
inttests 27.30% <0.00%> (+2.72%) ⬆️
systests 24.36% <0.00%> (+0.04%) ⬆️
unittests 73.73% <100.00%> (+0.89%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...r/stats/prometheus/PrometheusMetricsGenerator.java 78.46% <100.00%> (-7.05%) ⬇️

... and 665 files with indirect coverage changes

@lhotari lhotari merged commit 9b79d70 into apache:master Nov 27, 2024
53 checks passed
lhotari pushed a commit that referenced this pull request Nov 27, 2024
Co-authored-by: Lari Hotari <lhotari@apache.org>
(cherry picked from commit 9b79d70)
lhotari pushed a commit that referenced this pull request Nov 27, 2024
Co-authored-by: Lari Hotari <lhotari@apache.org>
(cherry picked from commit 9b79d70)
lhotari pushed a commit that referenced this pull request Nov 27, 2024
Co-authored-by: Lari Hotari <lhotari@apache.org>
(cherry picked from commit 9b79d70)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Nov 28, 2024
Co-authored-by: Lari Hotari <lhotari@apache.org>
(cherry picked from commit 9b79d70)
(cherry picked from commit aee413b)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Nov 28, 2024
Co-authored-by: Lari Hotari <lhotari@apache.org>
(cherry picked from commit 9b79d70)
(cherry picked from commit aee413b)
@zymap
Copy link
Member Author

zymap commented Dec 2, 2024

@lhotari Thanks for fixing it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants