Implemented #320. #337

Closed
wants to merge 23 commits into from
redboltz
Owner

Added a `buffer` class that provides `mqtt::string_view`-compatible
behavior and an optional lifetime-keeping mechanism.

User callback functions hold the receive buffer directly via
`buffer`.

Removed the `*_ref` properties. Whether a property is a reference or not is now hidden by `buffer`.
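
To make the idea concrete, here is a minimal sketch of a view type that behaves like a string view and can optionally keep its underlying storage alive. It is an illustration only, not the class added by this PR: the `sketch` namespace, `make_owned`, `has_life`, and the choice of `std::shared_ptr<const std::string>` as the lifetime holder are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <utility>

namespace sketch {

// Hypothetical illustration only; not the class added by this PR.
class buffer : public std::string_view {
public:
    // Non-owning: behaves like a plain string_view. The caller must keep
    // the underlying bytes alive.
    explicit buffer(std::string_view sv = std::string_view())
        : std::string_view(sv) {}

    // Owning: the view points into *life, and the shared_ptr keeps that
    // storage alive for as long as any copy of this buffer exists.
    buffer(std::string_view sv, std::shared_ptr<const std::string> life)
        : std::string_view(sv), life_(std::move(life)) {}

    // Sub-views share the same lifetime holder, so slicing copies no bytes.
    buffer substr(std::size_t pos, std::size_t count = npos) const {
        return buffer(std::string_view::substr(pos, count), life_);
    }

    // True when this buffer holds a share of the underlying storage.
    bool has_life() const { return static_cast<bool>(life_); }

private:
    std::shared_ptr<const std::string> life_;
};

// Moves received bytes into shared storage and returns a buffer that
// holds a share of it (assumed helper for this sketch).
inline buffer make_owned(std::string bytes) {
    auto life = std::make_shared<const std::string>(std::move(bytes));
    std::string_view sv(*life);  // take the view before moving the pointer
    return buffer(sv, std::move(life));
}

} // namespace sketch
```

With a type like this, a callback could slice a topic out of a received packet with `packet.substr(0, 6)` and store the result past the end of the callback: the slice stays valid because it shares ownership of the packet storage. A buffer built from a bare `std::string_view` carries no such guarantee, which is the sense in which reference versus non-reference can be hidden behind a single type.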
@redboltz
Owner Author

Status

  • Implemented a `buffer`-based zero-copy mechanism.
  • All tests pass in my local environment (Linux only).

Problem

  • Compile time is longer.
    • The full time reports are attached below.
    • Notable points:

master

===-------------------------------------------------------------------------===
                         Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   3.2157 ( 57.9%)   0.0995 (  9.4%)   3.3152 ( 50.1%)   3.3390 ( 50.3%)  LLVM IR Generation Time
   2.3385 ( 42.1%)   0.9591 ( 90.6%)   3.2976 ( 49.9%)   3.3003 ( 49.7%)  Code Generation Time
   5.5542 (100.0%)   1.0586 (100.0%)   6.6128 (100.0%)   6.6392 (100.0%)  Total
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 22.5844 seconds (22.6065 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  20.9391 (100.0%)   1.6453 (100.0%)  22.5844 (100.0%)  22.6065 (100.0%)  Clang front-end timer
  20.9391 (100.0%)   1.6453 (100.0%)  22.5844 (100.0%)  22.6065 (100.0%)  Total

impl_320

===-------------------------------------------------------------------------===
                         Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  18.4311 ( 81.9%)   0.5441 ( 25.3%)  18.9753 ( 76.9%)  19.1197 ( 77.0%)  LLVM IR Generation Time
   4.0834 ( 18.1%)   1.6083 ( 74.7%)   5.6918 ( 23.1%)   5.6976 ( 23.0%)  Code Generation Time
  22.5146 (100.0%)   2.1525 (100.0%)  24.6670 (100.0%)  24.8173 (100.0%)  Total
===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 95.5239 seconds (95.6318 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  91.3182 (100.0%)   4.2056 (100.0%)  95.5239 (100.0%)  95.6318 (100.0%)  Clang front-end timer
  91.3182 (100.0%)   4.2056 (100.0%)  95.5239 (100.0%)  95.6318 (100.0%)  Total
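
As a rough reading of the two excerpts above, the slowdown can be expressed as the ratio of the `impl_320` time to the `master` time for each timer. The sketch below uses the User+System columns copied from those reports; the function and variable names are made up for this example.

```cpp
#include <cassert>
#include <cstdio>

// Ratio of the impl_320 time to the master time for one timer.
constexpr double slowdown(double master_sec, double impl_320_sec) {
    return impl_320_sec / master_sec;
}

// User+System seconds copied from the two reports above.
constexpr double front_end = slowdown(22.5844, 95.5239);  // Clang front-end timer
constexpr double ir_gen    = slowdown(3.3152, 18.9753);   // LLVM IR Generation Time
constexpr double code_gen  = slowdown(3.2976, 5.6918);    // Code Generation Time

// Prints the three ratios with two decimal places.
inline void print_slowdowns() {
    std::printf("front end:       %.2fx\n", front_end);
    std::printf("IR generation:   %.2fx\n", ir_gen);
    std::printf("code generation: %.2fx\n", code_gen);
}
```

By this measure the front end takes roughly 4.2 times as long on `impl_320`, LLVM IR generation roughly 5.7 times, and code generation roughly 1.7 times, so most of the added time falls in the front end and IR generation rather than in the back end.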

Time report

Measured using the `-ftime-report` option.

master

757cba1

[ 50%] Building CXX object example/CMakeFiles/no_tls_both.dir/no_tls_both.cpp.o
cd /home/kondo/work/tmp/mqtt_cpp/build/example && /usr/bin/clang++   -I/home/kondo/work/tmp/mqtt_cpp/include  -DMQTT_STD_STRING_VIEW -DMQTT_USE_STR_CHECK -DMQTT_USE_WS -ftime-report -std=c++17 -g -ggdb3 -Wall -Wextra -Werror -Wno-ignored-qualifiers -Wconversion -g   -pthread -std=gnu++17 -o CMakeFiles/no_tls_both.dir/no_tls_both.cpp.o -c /home/kondo/work/tmp/mqtt_cpp/example/no_tls_both.cpp
===-------------------------------------------------------------------------===
                         Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   3.2157 ( 57.9%)   0.0995 (  9.4%)   3.3152 ( 50.1%)   3.3390 ( 50.3%)  LLVM IR Generation Time
   2.3385 ( 42.1%)   0.9591 ( 90.6%)   3.2976 ( 49.9%)   3.3003 ( 49.7%)  Code Generation Time
   5.5542 (100.0%)   1.0586 (100.0%)   6.6128 (100.0%)   6.6392 (100.0%)  Total

===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 0.2686 seconds (0.2690 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0451 ( 22.7%)   0.0161 ( 22.9%)   0.0612 ( 22.8%)   0.0610 ( 22.7%)  Instruction Selection
   0.0425 ( 21.4%)   0.0149 ( 21.3%)   0.0574 ( 21.4%)   0.0573 ( 21.3%)  Instruction Scheduling
   0.0333 ( 16.8%)   0.0118 ( 16.8%)   0.0451 ( 16.8%)   0.0464 ( 17.2%)  DAG Combining 1
   0.0263 ( 13.3%)   0.0091 ( 12.9%)   0.0354 ( 13.2%)   0.0353 ( 13.1%)  Instruction Creation
   0.0220 ( 11.1%)   0.0078 ( 11.2%)   0.0298 ( 11.1%)   0.0298 ( 11.1%)  DAG Combining 2
   0.0122 (  6.2%)   0.0044 (  6.3%)   0.0167 (  6.2%)   0.0166 (  6.2%)  DAG Legalization
   0.0091 (  4.6%)   0.0032 (  4.6%)   0.0123 (  4.6%)   0.0123 (  4.6%)  Type Legalization
   0.0038 (  1.9%)   0.0014 (  1.9%)   0.0052 (  1.9%)   0.0052 (  1.9%)  Instruction Scheduling Cleanup
   0.0034 (  1.7%)   0.0012 (  1.7%)   0.0046 (  1.7%)   0.0045 (  1.7%)  Vector Legalization
   0.0007 (  0.3%)   0.0002 (  0.3%)   0.0009 (  0.3%)   0.0009 (  0.3%)  DAG Combining after legalize types
   0.1985 (100.0%)   0.0701 (100.0%)   0.2686 (100.0%)   0.2690 (100.0%)  Total

===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 0.5862 seconds (0.5885 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.3044 ( 71.5%)   0.1070 ( 66.7%)   0.4114 ( 70.2%)   0.4129 ( 70.2%)  Debug Info Emission
   0.1114 ( 26.2%)   0.0501 ( 31.3%)   0.1615 ( 27.6%)   0.1623 ( 27.6%)  DWARF Exception Writer
   0.0100 (  2.4%)   0.0032 (  2.0%)   0.0133 (  2.3%)   0.0133 (  2.3%)  DWARF Debug Writer
   0.4259 (100.0%)   0.1603 (100.0%)   0.5862 (100.0%)   0.5885 (100.0%)  Total

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 2.5206 seconds (2.5192 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.7188 ( 41.3%)   0.3213 ( 41.2%)   1.0402 ( 41.3%)   1.0411 ( 41.3%)  X86 Assembly Printer
   0.4960 ( 28.5%)   0.1934 ( 24.8%)   0.6894 ( 27.4%)   0.6884 ( 27.3%)  X86 DAG->DAG Instruction Selection
   0.0997 (  5.7%)   0.0422 (  5.4%)   0.1419 (  5.6%)   0.1416 (  5.6%)  Prologue/Epilogue Insertion & Frame Finalization
   0.0493 (  2.8%)   0.0209 (  2.7%)   0.0702 (  2.8%)   0.0698 (  2.8%)  Fast Register Allocator
   0.0378 (  2.2%)   0.0160 (  2.1%)   0.0538 (  2.1%)   0.0534 (  2.1%)  Live DEBUG_VALUE analysis
   0.0320 (  1.8%)   0.0151 (  1.9%)   0.0471 (  1.9%)   0.0469 (  1.9%)  Insert stack protectors
   0.0247 (  1.4%)   0.0107 (  1.4%)   0.0354 (  1.4%)   0.0351 (  1.4%)  Two-Address instruction pass
   0.0184 (  1.1%)   0.0087 (  1.1%)   0.0270 (  1.1%)   0.0269 (  1.1%)  Dominator Tree Construction #2
   0.0194 (  1.1%)   0.0074 (  0.9%)   0.0268 (  1.1%)   0.0268 (  1.1%)  Inliner for always_inline functions
   0.0096 (  0.6%)   0.0122 (  1.6%)   0.0219 (  0.9%)   0.0218 (  0.9%)  Expand Atomic instructions
   0.0148 (  0.8%)   0.0063 (  0.8%)   0.0211 (  0.8%)   0.0210 (  0.8%)  MachineDominator Tree Construction
   0.0127 (  0.7%)   0.0061 (  0.8%)   0.0189 (  0.7%)   0.0188 (  0.7%)  Exception handling preparation
   0.0127 (  0.7%)   0.0058 (  0.7%)   0.0185 (  0.7%)   0.0184 (  0.7%)  X86 EFLAGS copy lowering
   0.0124 (  0.7%)   0.0061 (  0.8%)   0.0185 (  0.7%)   0.0184 (  0.7%)  Free MachineFunction
   0.0163 (  0.9%)   0.0002 (  0.0%)   0.0165 (  0.7%)   0.0165 (  0.7%)  CallGraph Construction
   0.0054 (  0.3%)   0.0074 (  0.9%)   0.0128 (  0.5%)   0.0128 (  0.5%)  Dominator Tree Construction
   0.0050 (  0.3%)   0.0066 (  0.8%)   0.0116 (  0.5%)   0.0116 (  0.5%)  Scalarize Masked Memory Intrinsics
   0.0049 (  0.3%)   0.0064 (  0.8%)   0.0114 (  0.5%)   0.0114 (  0.5%)  Expand reduction intrinsics
   0.0077 (  0.4%)   0.0035 (  0.4%)   0.0112 (  0.4%)   0.0112 (  0.4%)  Post-RA pseudo instruction expansion pass
   0.0072 (  0.4%)   0.0032 (  0.4%)   0.0104 (  0.4%)   0.0104 (  0.4%)  Check CFA info and insert CFI instructions if needed
   0.0068 (  0.4%)   0.0030 (  0.4%)   0.0098 (  0.4%)   0.0098 (  0.4%)  X86 pseudo instruction expansion pass
   0.0042 (  0.2%)   0.0056 (  0.7%)   0.0098 (  0.4%)   0.0097 (  0.4%)  Expand indirectbr instructions
   0.0064 (  0.4%)   0.0030 (  0.4%)   0.0094 (  0.4%)   0.0095 (  0.4%)  X86 Indirect Branch Tracking
   0.0044 (  0.3%)   0.0049 (  0.6%)   0.0092 (  0.4%)   0.0092 (  0.4%)  Basic Alias Analysis (stateless AA impl)
   0.0058 (  0.3%)   0.0028 (  0.4%)   0.0086 (  0.3%)   0.0087 (  0.3%)  Eliminate PHI nodes for register allocation
   0.0051 (  0.3%)   0.0024 (  0.3%)   0.0074 (  0.3%)   0.0074 (  0.3%)  Insert fentry calls
   0.0050 (  0.3%)   0.0023 (  0.3%)   0.0073 (  0.3%)   0.0074 (  0.3%)  Insert XRay ops
   0.0049 (  0.3%)   0.0023 (  0.3%)   0.0072 (  0.3%)   0.0072 (  0.3%)  Expand ISel Pseudo-instructions
   0.0047 (  0.3%)   0.0022 (  0.3%)   0.0069 (  0.3%)   0.0069 (  0.3%)  Bundle Machine CFG Edges
   0.0046 (  0.3%)   0.0022 (  0.3%)   0.0069 (  0.3%)   0.0068 (  0.3%)  Implement the 'patchable-function' attribute
   0.0022 (  0.1%)   0.0043 (  0.6%)   0.0065 (  0.3%)   0.0067 (  0.3%)  Instrument function entry/exit with calls to e.g. mcount() (pre inlining)
   0.0029 (  0.2%)   0.0038 (  0.5%)   0.0067 (  0.3%)   0.0067 (  0.3%)  Instrument function entry/exit with calls to e.g. mcount() (post inlining)
   0.0043 (  0.2%)   0.0019 (  0.2%)   0.0063 (  0.2%)   0.0063 (  0.3%)  Machine Optimization Remark Emitter
   0.0028 (  0.2%)   0.0036 (  0.5%)   0.0063 (  0.3%)   0.0063 (  0.3%)  Remove unreachable blocks from the CFG
   0.0043 (  0.2%)   0.0020 (  0.3%)   0.0062 (  0.2%)   0.0062 (  0.2%)  X86 PIC Global Base Reg Initialization
   0.0040 (  0.2%)   0.0019 (  0.2%)   0.0059 (  0.2%)   0.0060 (  0.2%)  Machine Optimization Remark Emitter #2
   0.0040 (  0.2%)   0.0018 (  0.2%)   0.0058 (  0.2%)   0.0059 (  0.2%)  X86 Retpoline Thunks
   0.0038 (  0.2%)   0.0018 (  0.2%)   0.0056 (  0.2%)   0.0058 (  0.2%)  X86 Insert Cache Prefetches
   0.0039 (  0.2%)   0.0018 (  0.2%)   0.0057 (  0.2%)   0.0058 (  0.2%)  Contiguously Lay Out Funclets
   0.0038 (  0.2%)   0.0019 (  0.2%)   0.0057 (  0.2%)   0.0057 (  0.2%)  X86 speculative load hardening
   0.0038 (  0.2%)   0.0019 (  0.2%)   0.0057 (  0.2%)   0.0057 (  0.2%)  X86 FP Stackifier
   0.0038 (  0.2%)   0.0018 (  0.2%)   0.0056 (  0.2%)   0.0056 (  0.2%)  Lazy Machine Block Frequency Analysis
   0.0037 (  0.2%)   0.0018 (  0.2%)   0.0055 (  0.2%)   0.0055 (  0.2%)  StackMap Liveness Analysis
   0.0038 (  0.2%)   0.0018 (  0.2%)   0.0056 (  0.2%)   0.0055 (  0.2%)  Lazy Machine Block Frequency Analysis #2
   0.0038 (  0.2%)   0.0018 (  0.2%)   0.0055 (  0.2%)   0.0055 (  0.2%)  Shadow Call Stack
   0.0038 (  0.2%)   0.0016 (  0.2%)   0.0054 (  0.2%)   0.0055 (  0.2%)  Local Stack Slot Allocation
   0.0037 (  0.2%)   0.0017 (  0.2%)   0.0054 (  0.2%)   0.0055 (  0.2%)  X86 vzeroupper inserter
   0.0036 (  0.2%)   0.0017 (  0.2%)   0.0053 (  0.2%)   0.0054 (  0.2%)  X86 Discriminate Memory Operands
   0.0036 (  0.2%)   0.0017 (  0.2%)   0.0053 (  0.2%)   0.0054 (  0.2%)  X86 WinAlloca Expander
   0.0036 (  0.2%)   0.0017 (  0.2%)   0.0052 (  0.2%)   0.0053 (  0.2%)  Analyze Machine Code For Garbage Collection
   0.0035 (  0.2%)   0.0017 (  0.2%)   0.0052 (  0.2%)   0.0053 (  0.2%)  Safe Stack instrumentation pass
   0.0021 (  0.1%)   0.0028 (  0.4%)   0.0049 (  0.2%)   0.0049 (  0.2%)  Lower Garbage Collection Instructions
   0.0021 (  0.1%)   0.0028 (  0.4%)   0.0049 (  0.2%)   0.0049 (  0.2%)  Shadow Stack GC Lowering
   0.0025 (  0.1%)   0.0000 (  0.0%)   0.0025 (  0.1%)   0.0025 (  0.1%)  Assumption Cache Tracker #2
   0.0007 (  0.0%)   0.0000 (  0.0%)   0.0008 (  0.0%)   0.0008 (  0.0%)  Pre-ISel Intrinsic Lowering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Assumption Cache Tracker
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Profile summary info
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Rewrite Symbols
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Force set function attributes
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Pass Configuration
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information #2
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Module Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Branch Probability Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Create Garbage Collector Module Metadata
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Transform Information
   1.7410 (100.0%)   0.7796 (100.0%)   2.5206 (100.0%)   2.5192 (100.0%)  Total

===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 22.5844 seconds (22.6065 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  20.9391 (100.0%)   1.6453 (100.0%)  22.5844 (100.0%)  22.6065 (100.0%)  Clang front-end timer
  20.9391 (100.0%)   1.6453 (100.0%)  22.5844 (100.0%)  22.6065 (100.0%)  Total

===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 0.5862 seconds (0.5885 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.3044 ( 71.5%)   0.1070 ( 66.7%)   0.4114 ( 70.2%)   0.4129 ( 70.2%)  Debug Info Emission
   0.1114 ( 26.2%)   0.0501 ( 31.3%)   0.1615 ( 27.6%)   0.1623 ( 27.6%)  DWARF Exception Writer
   0.0100 (  2.4%)   0.0032 (  2.0%)   0.0133 (  2.3%)   0.0133 (  2.3%)  DWARF Debug Writer
   0.4259 (100.0%)   0.1603 (100.0%)   0.5862 (100.0%)   0.5885 (100.0%)  Total

===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 0.2686 seconds (0.2690 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0451 ( 22.7%)   0.0161 ( 22.9%)   0.0612 ( 22.8%)   0.0610 ( 22.7%)  Instruction Selection
   0.0425 ( 21.4%)   0.0149 ( 21.3%)   0.0574 ( 21.4%)   0.0573 ( 21.3%)  Instruction Scheduling
   0.0333 ( 16.8%)   0.0118 ( 16.8%)   0.0451 ( 16.8%)   0.0464 ( 17.2%)  DAG Combining 1
   0.0263 ( 13.3%)   0.0091 ( 12.9%)   0.0354 ( 13.2%)   0.0353 ( 13.1%)  Instruction Creation
   0.0220 ( 11.1%)   0.0078 ( 11.2%)   0.0298 ( 11.1%)   0.0298 ( 11.1%)  DAG Combining 2
   0.0122 (  6.2%)   0.0044 (  6.3%)   0.0167 (  6.2%)   0.0166 (  6.2%)  DAG Legalization
   0.0091 (  4.6%)   0.0032 (  4.6%)   0.0123 (  4.6%)   0.0123 (  4.6%)  Type Legalization
   0.0038 (  1.9%)   0.0014 (  1.9%)   0.0052 (  1.9%)   0.0052 (  1.9%)  Instruction Scheduling Cleanup
   0.0034 (  1.7%)   0.0012 (  1.7%)   0.0046 (  1.7%)   0.0045 (  1.7%)  Vector Legalization
   0.0007 (  0.3%)   0.0002 (  0.3%)   0.0009 (  0.3%)   0.0009 (  0.3%)  DAG Combining after legalize types
   0.1985 (100.0%)   0.0701 (100.0%)   0.2686 (100.0%)   0.2690 (100.0%)  Total

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 2.5206 seconds (2.5192 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.7188 ( 41.3%)   0.3213 ( 41.2%)   1.0402 ( 41.3%)   1.0411 ( 41.3%)  X86 Assembly Printer
   0.4960 ( 28.5%)   0.1934 ( 24.8%)   0.6894 ( 27.4%)   0.6884 ( 27.3%)  X86 DAG->DAG Instruction Selection
   0.0997 (  5.7%)   0.0422 (  5.4%)   0.1419 (  5.6%)   0.1416 (  5.6%)  Prologue/Epilogue Insertion & Frame Finalization
   0.0493 (  2.8%)   0.0209 (  2.7%)   0.0702 (  2.8%)   0.0698 (  2.8%)  Fast Register Allocator
   0.0378 (  2.2%)   0.0160 (  2.1%)   0.0538 (  2.1%)   0.0534 (  2.1%)  Live DEBUG_VALUE analysis
   0.0320 (  1.8%)   0.0151 (  1.9%)   0.0471 (  1.9%)   0.0469 (  1.9%)  Insert stack protectors
   0.0247 (  1.4%)   0.0107 (  1.4%)   0.0354 (  1.4%)   0.0351 (  1.4%)  Two-Address instruction pass
   0.0184 (  1.1%)   0.0087 (  1.1%)   0.0270 (  1.1%)   0.0269 (  1.1%)  Dominator Tree Construction #2
   0.0194 (  1.1%)   0.0074 (  0.9%)   0.0268 (  1.1%)   0.0268 (  1.1%)  Inliner for always_inline functions
   0.0096 (  0.6%)   0.0122 (  1.6%)   0.0219 (  0.9%)   0.0218 (  0.9%)  Expand Atomic instructions
   0.0148 (  0.8%)   0.0063 (  0.8%)   0.0211 (  0.8%)   0.0210 (  0.8%)  MachineDominator Tree Construction
   0.0127 (  0.7%)   0.0061 (  0.8%)   0.0189 (  0.7%)   0.0188 (  0.7%)  Exception handling preparation
   0.0127 (  0.7%)   0.0058 (  0.7%)   0.0185 (  0.7%)   0.0184 (  0.7%)  X86 EFLAGS copy lowering
   0.0124 (  0.7%)   0.0061 (  0.8%)   0.0185 (  0.7%)   0.0184 (  0.7%)  Free MachineFunction
   0.0163 (  0.9%)   0.0002 (  0.0%)   0.0165 (  0.7%)   0.0165 (  0.7%)  CallGraph Construction
   0.0054 (  0.3%)   0.0074 (  0.9%)   0.0128 (  0.5%)   0.0128 (  0.5%)  Dominator Tree Construction
   0.0050 (  0.3%)   0.0066 (  0.8%)   0.0116 (  0.5%)   0.0116 (  0.5%)  Scalarize Masked Memory Intrinsics
   0.0049 (  0.3%)   0.0064 (  0.8%)   0.0114 (  0.5%)   0.0114 (  0.5%)  Expand reduction intrinsics
   0.0077 (  0.4%)   0.0035 (  0.4%)   0.0112 (  0.4%)   0.0112 (  0.4%)  Post-RA pseudo instruction expansion pass
   0.0072 (  0.4%)   0.0032 (  0.4%)   0.0104 (  0.4%)   0.0104 (  0.4%)  Check CFA info and insert CFI instructions if needed
   0.0068 (  0.4%)   0.0030 (  0.4%)   0.0098 (  0.4%)   0.0098 (  0.4%)  X86 pseudo instruction expansion pass
   0.0042 (  0.2%)   0.0056 (  0.7%)   0.0098 (  0.4%)   0.0097 (  0.4%)  Expand indirectbr instructions
   0.0064 (  0.4%)   0.0030 (  0.4%)   0.0094 (  0.4%)   0.0095 (  0.4%)  X86 Indirect Branch Tracking
   0.0044 (  0.3%)   0.0049 (  0.6%)   0.0092 (  0.4%)   0.0092 (  0.4%)  Basic Alias Analysis (stateless AA impl)
   0.0058 (  0.3%)   0.0028 (  0.4%)   0.0086 (  0.3%)   0.0087 (  0.3%)  Eliminate PHI nodes for register allocation
   0.0051 (  0.3%)   0.0024 (  0.3%)   0.0074 (  0.3%)   0.0074 (  0.3%)  Insert fentry calls
   0.0050 (  0.3%)   0.0023 (  0.3%)   0.0073 (  0.3%)   0.0074 (  0.3%)  Insert XRay ops
   0.0049 (  0.3%)   0.0023 (  0.3%)   0.0072 (  0.3%)   0.0072 (  0.3%)  Expand ISel Pseudo-instructions
   0.0047 (  0.3%)   0.0022 (  0.3%)   0.0069 (  0.3%)   0.0069 (  0.3%)  Bundle Machine CFG Edges
   0.0046 (  0.3%)   0.0022 (  0.3%)   0.0069 (  0.3%)   0.0068 (  0.3%)  Implement the 'patchable-function' attribute
   0.0022 (  0.1%)   0.0043 (  0.6%)   0.0065 (  0.3%)   0.0067 (  0.3%)  Instrument function entry/exit with calls to e.g. mcount() (pre inlining)
   0.0029 (  0.2%)   0.0038 (  0.5%)   0.0067 (  0.3%)   0.0067 (  0.3%)  Instrument function entry/exit with calls to e.g. mcount() (post inlining)
   0.0043 (  0.2%)   0.0019 (  0.2%)   0.0063 (  0.2%)   0.0063 (  0.3%)  Machine Optimization Remark Emitter
   0.0028 (  0.2%)   0.0036 (  0.5%)   0.0063 (  0.3%)   0.0063 (  0.3%)  Remove unreachable blocks from the CFG
   0.0043 (  0.2%)   0.0020 (  0.3%)   0.0062 (  0.2%)   0.0062 (  0.2%)  X86 PIC Global Base Reg Initialization
   0.0040 (  0.2%)   0.0019 (  0.2%)   0.0059 (  0.2%)   0.0060 (  0.2%)  Machine Optimization Remark Emitter #2
   0.0040 (  0.2%)   0.0018 (  0.2%)   0.0058 (  0.2%)   0.0059 (  0.2%)  X86 Retpoline Thunks
   0.0038 (  0.2%)   0.0018 (  0.2%)   0.0056 (  0.2%)   0.0058 (  0.2%)  X86 Insert Cache Prefetches
   0.0039 (  0.2%)   0.0018 (  0.2%)   0.0057 (  0.2%)   0.0058 (  0.2%)  Contiguously Lay Out Funclets
   0.0038 (  0.2%)   0.0019 (  0.2%)   0.0057 (  0.2%)   0.0057 (  0.2%)  X86 speculative load hardening
   0.0038 (  0.2%)   0.0019 (  0.2%)   0.0057 (  0.2%)   0.0057 (  0.2%)  X86 FP Stackifier
   0.0038 (  0.2%)   0.0018 (  0.2%)   0.0056 (  0.2%)   0.0056 (  0.2%)  Lazy Machine Block Frequency Analysis
   0.0037 (  0.2%)   0.0018 (  0.2%)   0.0055 (  0.2%)   0.0055 (  0.2%)  StackMap Liveness Analysis
   0.0038 (  0.2%)   0.0018 (  0.2%)   0.0056 (  0.2%)   0.0055 (  0.2%)  Lazy Machine Block Frequency Analysis #2
   0.0038 (  0.2%)   0.0018 (  0.2%)   0.0055 (  0.2%)   0.0055 (  0.2%)  Shadow Call Stack
   0.0038 (  0.2%)   0.0016 (  0.2%)   0.0054 (  0.2%)   0.0055 (  0.2%)  Local Stack Slot Allocation
   0.0037 (  0.2%)   0.0017 (  0.2%)   0.0054 (  0.2%)   0.0055 (  0.2%)  X86 vzeroupper inserter
   0.0036 (  0.2%)   0.0017 (  0.2%)   0.0053 (  0.2%)   0.0054 (  0.2%)  X86 Discriminate Memory Operands
   0.0036 (  0.2%)   0.0017 (  0.2%)   0.0053 (  0.2%)   0.0054 (  0.2%)  X86 WinAlloca Expander
   0.0036 (  0.2%)   0.0017 (  0.2%)   0.0052 (  0.2%)   0.0053 (  0.2%)  Analyze Machine Code For Garbage Collection
   0.0035 (  0.2%)   0.0017 (  0.2%)   0.0052 (  0.2%)   0.0053 (  0.2%)  Safe Stack instrumentation pass
   0.0021 (  0.1%)   0.0028 (  0.4%)   0.0049 (  0.2%)   0.0049 (  0.2%)  Lower Garbage Collection Instructions
   0.0021 (  0.1%)   0.0028 (  0.4%)   0.0049 (  0.2%)   0.0049 (  0.2%)  Shadow Stack GC Lowering
   0.0025 (  0.1%)   0.0000 (  0.0%)   0.0025 (  0.1%)   0.0025 (  0.1%)  Assumption Cache Tracker #2
   0.0007 (  0.0%)   0.0000 (  0.0%)   0.0008 (  0.0%)   0.0008 (  0.0%)  Pre-ISel Intrinsic Lowering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Assumption Cache Tracker
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Rewrite Symbols
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Profile summary info
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Force set function attributes
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Transform Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Module Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Pass Configuration
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information #2
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Create Garbage Collector Module Metadata
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Branch Probability Analysis
   1.7410 (100.0%)   0.7796 (100.0%)   2.5206 (100.0%)   2.5192 (100.0%)  Total

[100%] Linking CXX executable no_tls_both
cd /home/kondo/work/tmp/mqtt_cpp/build/example && /usr/bin/cmake -E cmake_link_script CMakeFiles/no_tls_both.dir/link.txt --verbose=1
/usr/bin/clang++  -DMQTT_STD_STRING_VIEW -DMQTT_USE_STR_CHECK -DMQTT_USE_WS -ftime-report -std=c++17 -g -ggdb3 -Wall -Wextra -Werror -Wno-ignored-qualifiers -Wconversion -g   CMakeFiles/no_tls_both.dir/no_tls_both.cpp.o  -o no_tls_both /usr/lib/libboost_system.so -lpthread /usr/lib/libssl.so /usr/lib/libcrypto.so -ldl 

impl_320

[ 50%] Building CXX object example/CMakeFiles/no_tls_both.dir/no_tls_both.cpp.o
cd /home/kondo/work/mqtt_cpp/build/example && /usr/bin/clang++   -I/home/kondo/work/mqtt_cpp/include  -DMQTT_STD_STRING_VIEW -DMQTT_USE_STR_CHECK -DMQTT_USE_WS -ftime-report -std=c++17 -g -ggdb3 -Wall -Wextra -Werror -Wno-ignored-qualifiers -Wconversion -g   -pthread -std=gnu++17 -o CMakeFiles/no_tls_both.dir/no_tls_both.cpp.o -c /home/kondo/work/mqtt_cpp/example/no_tls_both.cpp
===-------------------------------------------------------------------------===
                         Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  18.4311 ( 81.9%)   0.5441 ( 25.3%)  18.9753 ( 76.9%)  19.1197 ( 77.0%)  LLVM IR Generation Time
   4.0834 ( 18.1%)   1.6083 ( 74.7%)   5.6918 ( 23.1%)   5.6976 ( 23.0%)  Code Generation Time
  22.5146 (100.0%)   2.1525 (100.0%)  24.6670 (100.0%)  24.8173 (100.0%)  Total

===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 0.5013 seconds (0.5020 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0821 ( 23.3%)   0.0350 ( 23.5%)   0.1170 ( 23.3%)   0.1165 ( 23.2%)  Instruction Selection
   0.0749 ( 21.3%)   0.0317 ( 21.3%)   0.1066 ( 21.3%)   0.1062 ( 21.1%)  Instruction Scheduling
   0.0586 ( 16.6%)   0.0250 ( 16.8%)   0.0836 ( 16.7%)   0.0858 ( 17.1%)  DAG Combining 1
   0.0473 ( 13.4%)   0.0196 ( 13.2%)   0.0669 ( 13.4%)   0.0667 ( 13.3%)  Instruction Creation
   0.0394 ( 11.2%)   0.0168 ( 11.3%)   0.0562 ( 11.2%)   0.0561 ( 11.2%)  DAG Combining 2
   0.0215 (  6.1%)   0.0091 (  6.1%)   0.0305 (  6.1%)   0.0305 (  6.1%)  DAG Legalization
   0.0158 (  4.5%)   0.0066 (  4.4%)   0.0224 (  4.5%)   0.0224 (  4.5%)  Type Legalization
   0.0066 (  1.9%)   0.0027 (  1.8%)   0.0093 (  1.9%)   0.0092 (  1.8%)  Instruction Scheduling Cleanup
   0.0054 (  1.5%)   0.0022 (  1.5%)   0.0075 (  1.5%)   0.0075 (  1.5%)  Vector Legalization
   0.0008 (  0.2%)   0.0003 (  0.2%)   0.0012 (  0.2%)   0.0011 (  0.2%)  DAG Combining after legalize types
   0.3523 (100.0%)   0.1490 (100.0%)   0.5013 (100.0%)   0.5020 (100.0%)  Total

===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 0.9956 seconds (0.9986 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.5140 ( 70.9%)   0.1836 ( 67.9%)   0.6977 ( 70.1%)   0.6999 ( 70.1%)  Debug Info Emission
   0.1966 ( 27.1%)   0.0854 ( 31.6%)   0.2821 ( 28.3%)   0.2828 ( 28.3%)  DWARF Exception Writer
   0.0146 (  2.0%)   0.0013 (  0.5%)   0.0159 (  1.6%)   0.0159 (  1.6%)  DWARF Debug Writer
   0.7253 (100.0%)   0.2703 (100.0%)   0.9956 (100.0%)   0.9986 (100.0%)  Total

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 4.3819 seconds (4.3816 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.2539 ( 41.1%)   0.5457 ( 41.0%)   1.7996 ( 41.1%)   1.8016 ( 41.1%)  X86 Assembly Printer
   0.8667 ( 28.4%)   0.3714 ( 27.9%)   1.2382 ( 28.3%)   1.2370 ( 28.2%)  X86 DAG->DAG Instruction Selection
   0.1707 (  5.6%)   0.0721 (  5.4%)   0.2427 (  5.5%)   0.2420 (  5.5%)  Prologue/Epilogue Insertion & Frame Finalization
   0.0847 (  2.8%)   0.0358 (  2.7%)   0.1204 (  2.7%)   0.1200 (  2.7%)  Fast Register Allocator
   0.0670 (  2.2%)   0.0272 (  2.0%)   0.0942 (  2.1%)   0.0937 (  2.1%)  Live DEBUG_VALUE analysis
   0.0569 (  1.9%)   0.0259 (  1.9%)   0.0828 (  1.9%)   0.0823 (  1.9%)  Insert stack protectors
   0.0427 (  1.4%)   0.0184 (  1.4%)   0.0611 (  1.4%)   0.0606 (  1.4%)  Two-Address instruction pass
   0.0380 (  1.2%)   0.0073 (  0.6%)   0.0453 (  1.0%)   0.0452 (  1.0%)  Inliner for always_inline functions
   0.0308 (  1.0%)   0.0139 (  1.0%)   0.0447 (  1.0%)   0.0446 (  1.0%)  Dominator Tree Construction #2
   0.0208 (  0.7%)   0.0153 (  1.1%)   0.0361 (  0.8%)   0.0361 (  0.8%)  Expand Atomic instructions
   0.0233 (  0.8%)   0.0101 (  0.8%)   0.0334 (  0.8%)   0.0332 (  0.8%)  MachineDominator Tree Construction
   0.0225 (  0.7%)   0.0104 (  0.8%)   0.0329 (  0.8%)   0.0329 (  0.8%)  Exception handling preparation
   0.0218 (  0.7%)   0.0095 (  0.7%)   0.0313 (  0.7%)   0.0313 (  0.7%)  X86 EFLAGS copy lowering
   0.0210 (  0.7%)   0.0097 (  0.7%)   0.0308 (  0.7%)   0.0308 (  0.7%)  Free MachineFunction
   0.0247 (  0.8%)   0.0025 (  0.2%)   0.0272 (  0.6%)   0.0272 (  0.6%)  CallGraph Construction
   0.0117 (  0.4%)   0.0087 (  0.7%)   0.0204 (  0.5%)   0.0204 (  0.5%)  Dominator Tree Construction
   0.0112 (  0.4%)   0.0083 (  0.6%)   0.0195 (  0.4%)   0.0195 (  0.4%)  Expand reduction intrinsics
   0.0135 (  0.4%)   0.0057 (  0.4%)   0.0193 (  0.4%)   0.0192 (  0.4%)  Post-RA pseudo instruction expansion pass
   0.0110 (  0.4%)   0.0081 (  0.6%)   0.0191 (  0.4%)   0.0191 (  0.4%)  Scalarize Masked Memory Intrinsics
   0.0126 (  0.4%)   0.0053 (  0.4%)   0.0179 (  0.4%)   0.0180 (  0.4%)  Check CFA info and insert CFI instructions if needed
   0.0114 (  0.4%)   0.0050 (  0.4%)   0.0164 (  0.4%)   0.0165 (  0.4%)  X86 pseudo instruction expansion pass
   0.0094 (  0.3%)   0.0070 (  0.5%)   0.0164 (  0.4%)   0.0164 (  0.4%)  Expand indirectbr instructions
   0.0111 (  0.4%)   0.0050 (  0.4%)   0.0160 (  0.4%)   0.0162 (  0.4%)  X86 Indirect Branch Tracking
   0.0096 (  0.3%)   0.0059 (  0.4%)   0.0155 (  0.4%)   0.0157 (  0.4%)  Basic Alias Analysis (stateless AA impl)
   0.0106 (  0.3%)   0.0045 (  0.3%)   0.0151 (  0.3%)   0.0151 (  0.3%)  Eliminate PHI nodes for register allocation
   0.0087 (  0.3%)   0.0038 (  0.3%)   0.0125 (  0.3%)   0.0126 (  0.3%)  Insert fentry calls
   0.0086 (  0.3%)   0.0038 (  0.3%)   0.0124 (  0.3%)   0.0124 (  0.3%)  Insert XRay ops
   0.0084 (  0.3%)   0.0038 (  0.3%)   0.0122 (  0.3%)   0.0123 (  0.3%)  Expand ISel Pseudo-instructions
   0.0082 (  0.3%)   0.0037 (  0.3%)   0.0119 (  0.3%)   0.0119 (  0.3%)  Implement the 'patchable-function' attribute
   0.0080 (  0.3%)   0.0035 (  0.3%)   0.0115 (  0.3%)   0.0116 (  0.3%)  Bundle Machine CFG Edges
   0.0065 (  0.2%)   0.0048 (  0.4%)   0.0113 (  0.3%)   0.0115 (  0.3%)  Instrument function entry/exit with calls to e.g. mcount() (post inlining)
   0.0062 (  0.2%)   0.0048 (  0.4%)   0.0110 (  0.3%)   0.0111 (  0.3%)  Instrument function entry/exit with calls to e.g. mcount() (pre inlining)
   0.0074 (  0.2%)   0.0033 (  0.2%)   0.0107 (  0.2%)   0.0108 (  0.2%)  X86 PIC Global Base Reg Initialization
   0.0073 (  0.2%)   0.0033 (  0.2%)   0.0106 (  0.2%)   0.0105 (  0.2%)  Machine Optimization Remark Emitter
   0.0059 (  0.2%)   0.0044 (  0.3%)   0.0104 (  0.2%)   0.0104 (  0.2%)  Remove unreachable blocks from the CFG
   0.0069 (  0.2%)   0.0031 (  0.2%)   0.0099 (  0.2%)   0.0101 (  0.2%)  X86 Retpoline Thunks
   0.0067 (  0.2%)   0.0031 (  0.2%)   0.0098 (  0.2%)   0.0100 (  0.2%)  Contiguously Lay Out Funclets
   0.0066 (  0.2%)   0.0030 (  0.2%)   0.0096 (  0.2%)   0.0097 (  0.2%)  Machine Optimization Remark Emitter #2
   0.0067 (  0.2%)   0.0030 (  0.2%)   0.0097 (  0.2%)   0.0097 (  0.2%)  X86 FP Stackifier
   0.0067 (  0.2%)   0.0029 (  0.2%)   0.0096 (  0.2%)   0.0097 (  0.2%)  Lazy Machine Block Frequency Analysis
   0.0066 (  0.2%)   0.0030 (  0.2%)   0.0096 (  0.2%)   0.0096 (  0.2%)  X86 speculative load hardening
   0.0067 (  0.2%)   0.0029 (  0.2%)   0.0096 (  0.2%)   0.0096 (  0.2%)  X86 Insert Cache Prefetches
   0.0066 (  0.2%)   0.0029 (  0.2%)   0.0096 (  0.2%)   0.0095 (  0.2%)  Lazy Machine Block Frequency Analysis #2
   0.0066 (  0.2%)   0.0029 (  0.2%)   0.0095 (  0.2%)   0.0095 (  0.2%)  StackMap Liveness Analysis
   0.0066 (  0.2%)   0.0029 (  0.2%)   0.0095 (  0.2%)   0.0094 (  0.2%)  Local Stack Slot Allocation
   0.0065 (  0.2%)   0.0029 (  0.2%)   0.0094 (  0.2%)   0.0094 (  0.2%)  X86 WinAlloca Expander
   0.0062 (  0.2%)   0.0028 (  0.2%)   0.0091 (  0.2%)   0.0092 (  0.2%)  Shadow Call Stack
   0.0064 (  0.2%)   0.0028 (  0.2%)   0.0092 (  0.2%)   0.0092 (  0.2%)  X86 vzeroupper inserter
   0.0060 (  0.2%)   0.0027 (  0.2%)   0.0087 (  0.2%)   0.0090 (  0.2%)  X86 Discriminate Memory Operands
   0.0062 (  0.2%)   0.0028 (  0.2%)   0.0090 (  0.2%)   0.0090 (  0.2%)  Analyze Machine Code For Garbage Collection
   0.0060 (  0.2%)   0.0027 (  0.2%)   0.0087 (  0.2%)   0.0089 (  0.2%)  Safe Stack instrumentation pass
   0.0047 (  0.2%)   0.0036 (  0.3%)   0.0083 (  0.2%)   0.0083 (  0.2%)  Shadow Stack GC Lowering
   0.0046 (  0.2%)   0.0034 (  0.3%)   0.0081 (  0.2%)   0.0083 (  0.2%)  Lower Garbage Collection Instructions
   0.0031 (  0.1%)   0.0000 (  0.0%)   0.0031 (  0.1%)   0.0031 (  0.1%)  Assumption Cache Tracker #2
   0.0013 (  0.0%)   0.0000 (  0.0%)   0.0013 (  0.0%)   0.0013 (  0.0%)  Pre-ISel Intrinsic Lowering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Assumption Cache Tracker
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Force set function attributes
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information #2
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Transform Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Rewrite Symbols
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Create Garbage Collector Module Metadata
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Profile summary info
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Branch Probability Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Pass Configuration
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Module Information
   3.0504 (100.0%)   1.3315 (100.0%)   4.3819 (100.0%)   4.3816 (100.0%)  Total

===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 95.5239 seconds (95.6318 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  91.3182 (100.0%)   4.2056 (100.0%)  95.5239 (100.0%)  95.6318 (100.0%)  Clang front-end timer
  91.3182 (100.0%)   4.2056 (100.0%)  95.5239 (100.0%)  95.6318 (100.0%)  Total

===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 0.9956 seconds (0.9986 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.5140 ( 70.9%)   0.1836 ( 67.9%)   0.6977 ( 70.1%)   0.6999 ( 70.1%)  Debug Info Emission
   0.1966 ( 27.1%)   0.0854 ( 31.6%)   0.2821 ( 28.3%)   0.2828 ( 28.3%)  DWARF Exception Writer
   0.0146 (  2.0%)   0.0013 (  0.5%)   0.0159 (  1.6%)   0.0159 (  1.6%)  DWARF Debug Writer
   0.7253 (100.0%)   0.2703 (100.0%)   0.9956 (100.0%)   0.9986 (100.0%)  Total

===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 0.5013 seconds (0.5020 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.0821 ( 23.3%)   0.0350 ( 23.5%)   0.1170 ( 23.3%)   0.1165 ( 23.2%)  Instruction Selection
   0.0749 ( 21.3%)   0.0317 ( 21.3%)   0.1066 ( 21.3%)   0.1062 ( 21.1%)  Instruction Scheduling
   0.0586 ( 16.6%)   0.0250 ( 16.8%)   0.0836 ( 16.7%)   0.0858 ( 17.1%)  DAG Combining 1
   0.0473 ( 13.4%)   0.0196 ( 13.2%)   0.0669 ( 13.4%)   0.0667 ( 13.3%)  Instruction Creation
   0.0394 ( 11.2%)   0.0168 ( 11.3%)   0.0562 ( 11.2%)   0.0561 ( 11.2%)  DAG Combining 2
   0.0215 (  6.1%)   0.0091 (  6.1%)   0.0305 (  6.1%)   0.0305 (  6.1%)  DAG Legalization
   0.0158 (  4.5%)   0.0066 (  4.4%)   0.0224 (  4.5%)   0.0224 (  4.5%)  Type Legalization
   0.0066 (  1.9%)   0.0027 (  1.8%)   0.0093 (  1.9%)   0.0092 (  1.8%)  Instruction Scheduling Cleanup
   0.0054 (  1.5%)   0.0022 (  1.5%)   0.0075 (  1.5%)   0.0075 (  1.5%)  Vector Legalization
   0.0008 (  0.2%)   0.0003 (  0.2%)   0.0012 (  0.2%)   0.0011 (  0.2%)  DAG Combining after legalize types
   0.3523 (100.0%)   0.1490 (100.0%)   0.5013 (100.0%)   0.5020 (100.0%)  Total

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 4.3819 seconds (4.3816 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.2539 ( 41.1%)   0.5457 ( 41.0%)   1.7996 ( 41.1%)   1.8016 ( 41.1%)  X86 Assembly Printer
   0.8667 ( 28.4%)   0.3714 ( 27.9%)   1.2382 ( 28.3%)   1.2370 ( 28.2%)  X86 DAG->DAG Instruction Selection
   0.1707 (  5.6%)   0.0721 (  5.4%)   0.2427 (  5.5%)   0.2420 (  5.5%)  Prologue/Epilogue Insertion & Frame Finalization
   0.0847 (  2.8%)   0.0358 (  2.7%)   0.1204 (  2.7%)   0.1200 (  2.7%)  Fast Register Allocator
   0.0670 (  2.2%)   0.0272 (  2.0%)   0.0942 (  2.1%)   0.0937 (  2.1%)  Live DEBUG_VALUE analysis
   0.0569 (  1.9%)   0.0259 (  1.9%)   0.0828 (  1.9%)   0.0823 (  1.9%)  Insert stack protectors
   0.0427 (  1.4%)   0.0184 (  1.4%)   0.0611 (  1.4%)   0.0606 (  1.4%)  Two-Address instruction pass
   0.0380 (  1.2%)   0.0073 (  0.6%)   0.0453 (  1.0%)   0.0452 (  1.0%)  Inliner for always_inline functions
   0.0308 (  1.0%)   0.0139 (  1.0%)   0.0447 (  1.0%)   0.0446 (  1.0%)  Dominator Tree Construction #2
   0.0208 (  0.7%)   0.0153 (  1.1%)   0.0361 (  0.8%)   0.0361 (  0.8%)  Expand Atomic instructions
   0.0233 (  0.8%)   0.0101 (  0.8%)   0.0334 (  0.8%)   0.0332 (  0.8%)  MachineDominator Tree Construction
   0.0225 (  0.7%)   0.0104 (  0.8%)   0.0329 (  0.8%)   0.0329 (  0.8%)  Exception handling preparation
   0.0218 (  0.7%)   0.0095 (  0.7%)   0.0313 (  0.7%)   0.0313 (  0.7%)  X86 EFLAGS copy lowering
   0.0210 (  0.7%)   0.0097 (  0.7%)   0.0308 (  0.7%)   0.0308 (  0.7%)  Free MachineFunction
   0.0247 (  0.8%)   0.0025 (  0.2%)   0.0272 (  0.6%)   0.0272 (  0.6%)  CallGraph Construction
   0.0117 (  0.4%)   0.0087 (  0.7%)   0.0204 (  0.5%)   0.0204 (  0.5%)  Dominator Tree Construction
   0.0112 (  0.4%)   0.0083 (  0.6%)   0.0195 (  0.4%)   0.0195 (  0.4%)  Expand reduction intrinsics
   0.0135 (  0.4%)   0.0057 (  0.4%)   0.0193 (  0.4%)   0.0192 (  0.4%)  Post-RA pseudo instruction expansion pass
   0.0110 (  0.4%)   0.0081 (  0.6%)   0.0191 (  0.4%)   0.0191 (  0.4%)  Scalarize Masked Memory Intrinsics
   0.0126 (  0.4%)   0.0053 (  0.4%)   0.0179 (  0.4%)   0.0180 (  0.4%)  Check CFA info and insert CFI instructions if needed
   0.0114 (  0.4%)   0.0050 (  0.4%)   0.0164 (  0.4%)   0.0165 (  0.4%)  X86 pseudo instruction expansion pass
   0.0094 (  0.3%)   0.0070 (  0.5%)   0.0164 (  0.4%)   0.0164 (  0.4%)  Expand indirectbr instructions
   0.0111 (  0.4%)   0.0050 (  0.4%)   0.0160 (  0.4%)   0.0162 (  0.4%)  X86 Indirect Branch Tracking
   0.0096 (  0.3%)   0.0059 (  0.4%)   0.0155 (  0.4%)   0.0157 (  0.4%)  Basic Alias Analysis (stateless AA impl)
   0.0106 (  0.3%)   0.0045 (  0.3%)   0.0151 (  0.3%)   0.0151 (  0.3%)  Eliminate PHI nodes for register allocation
   0.0087 (  0.3%)   0.0038 (  0.3%)   0.0125 (  0.3%)   0.0126 (  0.3%)  Insert fentry calls
   0.0086 (  0.3%)   0.0038 (  0.3%)   0.0124 (  0.3%)   0.0124 (  0.3%)  Insert XRay ops
   0.0084 (  0.3%)   0.0038 (  0.3%)   0.0122 (  0.3%)   0.0123 (  0.3%)  Expand ISel Pseudo-instructions
   0.0082 (  0.3%)   0.0037 (  0.3%)   0.0119 (  0.3%)   0.0119 (  0.3%)  Implement the 'patchable-function' attribute
   0.0080 (  0.3%)   0.0035 (  0.3%)   0.0115 (  0.3%)   0.0116 (  0.3%)  Bundle Machine CFG Edges
   0.0065 (  0.2%)   0.0048 (  0.4%)   0.0113 (  0.3%)   0.0115 (  0.3%)  Instrument function entry/exit with calls to e.g. mcount() (post inlining)
   0.0062 (  0.2%)   0.0048 (  0.4%)   0.0110 (  0.3%)   0.0111 (  0.3%)  Instrument function entry/exit with calls to e.g. mcount() (pre inlining)
   0.0074 (  0.2%)   0.0033 (  0.2%)   0.0107 (  0.2%)   0.0108 (  0.2%)  X86 PIC Global Base Reg Initialization
   0.0073 (  0.2%)   0.0033 (  0.2%)   0.0106 (  0.2%)   0.0105 (  0.2%)  Machine Optimization Remark Emitter
   0.0059 (  0.2%)   0.0044 (  0.3%)   0.0104 (  0.2%)   0.0104 (  0.2%)  Remove unreachable blocks from the CFG
   0.0069 (  0.2%)   0.0031 (  0.2%)   0.0099 (  0.2%)   0.0101 (  0.2%)  X86 Retpoline Thunks
   0.0067 (  0.2%)   0.0031 (  0.2%)   0.0098 (  0.2%)   0.0100 (  0.2%)  Contiguously Lay Out Funclets
   0.0066 (  0.2%)   0.0030 (  0.2%)   0.0096 (  0.2%)   0.0097 (  0.2%)  Machine Optimization Remark Emitter #2
   0.0067 (  0.2%)   0.0030 (  0.2%)   0.0097 (  0.2%)   0.0097 (  0.2%)  X86 FP Stackifier
   0.0067 (  0.2%)   0.0029 (  0.2%)   0.0096 (  0.2%)   0.0097 (  0.2%)  Lazy Machine Block Frequency Analysis
   0.0066 (  0.2%)   0.0030 (  0.2%)   0.0096 (  0.2%)   0.0096 (  0.2%)  X86 speculative load hardening
   0.0067 (  0.2%)   0.0029 (  0.2%)   0.0096 (  0.2%)   0.0096 (  0.2%)  X86 Insert Cache Prefetches
   0.0066 (  0.2%)   0.0029 (  0.2%)   0.0096 (  0.2%)   0.0095 (  0.2%)  Lazy Machine Block Frequency Analysis #2
   0.0066 (  0.2%)   0.0029 (  0.2%)   0.0095 (  0.2%)   0.0095 (  0.2%)  StackMap Liveness Analysis
   0.0066 (  0.2%)   0.0029 (  0.2%)   0.0095 (  0.2%)   0.0094 (  0.2%)  Local Stack Slot Allocation
   0.0065 (  0.2%)   0.0029 (  0.2%)   0.0094 (  0.2%)   0.0094 (  0.2%)  X86 WinAlloca Expander
   0.0062 (  0.2%)   0.0028 (  0.2%)   0.0091 (  0.2%)   0.0092 (  0.2%)  Shadow Call Stack
   0.0064 (  0.2%)   0.0028 (  0.2%)   0.0092 (  0.2%)   0.0092 (  0.2%)  X86 vzeroupper inserter
   0.0060 (  0.2%)   0.0027 (  0.2%)   0.0087 (  0.2%)   0.0090 (  0.2%)  X86 Discriminate Memory Operands
   0.0062 (  0.2%)   0.0028 (  0.2%)   0.0090 (  0.2%)   0.0090 (  0.2%)  Analyze Machine Code For Garbage Collection
   0.0060 (  0.2%)   0.0027 (  0.2%)   0.0087 (  0.2%)   0.0089 (  0.2%)  Safe Stack instrumentation pass
   0.0047 (  0.2%)   0.0036 (  0.3%)   0.0083 (  0.2%)   0.0083 (  0.2%)  Shadow Stack GC Lowering
   0.0046 (  0.2%)   0.0034 (  0.3%)   0.0081 (  0.2%)   0.0083 (  0.2%)  Lower Garbage Collection Instructions
   0.0031 (  0.1%)   0.0000 (  0.0%)   0.0031 (  0.1%)   0.0031 (  0.1%)  Assumption Cache Tracker #2
   0.0013 (  0.0%)   0.0000 (  0.0%)   0.0013 (  0.0%)   0.0013 (  0.0%)  Pre-ISel Intrinsic Lowering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Assumption Cache Tracker
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Force set function attributes
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Transform Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Rewrite Symbols
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information #2
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Create Garbage Collector Module Metadata
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Pass Configuration
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Branch Probability Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Module Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Profile summary info
   3.0504 (100.0%)   1.3315 (100.0%)   4.3819 (100.0%)   4.3816 (100.0%)  Total

[100%] Linking CXX executable no_tls_both
cd /home/kondo/work/mqtt_cpp/build/example && /usr/bin/cmake -E cmake_link_script CMakeFiles/no_tls_both.dir/link.txt --verbose=1
/usr/bin/clang++  -DMQTT_STD_STRING_VIEW -DMQTT_USE_STR_CHECK -DMQTT_USE_WS -ftime-report -std=c++17 -g -ggdb3 -Wall -Wextra -Werror -Wno-ignored-qualifiers -Wconversion -g   CMakeFiles/no_tls_both.dir/no_tls_both.cpp.o  -o no_tls_both /usr/lib/libboost_system.so -lpthread /usr/lib/libssl.so /usr/lib/libcrypto.so -ldl 

I want to improve the compile time, but I haven't come up with any good ideas so far.

@redboltz redboltz mentioned this pull request Aug 14, 2019
Contributor

@jonesmz jonesmz left a comment

I have not yet reviewed endpoint.hpp, but I will once you've had a chance to take a look at my commentary.

template <typename Iterator>
basic_pubrel_message(Iterator b, Iterator e)
: base(b, e)
basic_pubrel_message(buffer buf)
Contributor

const&

Owner Author

Ah, yes. MQTT v3.1.1's pubrel message is a fixed-length message. I will fix it.

Owner Author

I changed it as follows for consistency.

basic_pubrel_message(mqtt::string_view)

include/mqtt/message.hpp (outdated, resolved)
include/mqtt/message.hpp (outdated, resolved)
include/mqtt/message.hpp (resolved)
auto qos = publish::get_qos(fixed_header_);
++b;
buf = buf.substr(1);
Contributor

You're not having the basic_publish_message object actually take ownership of the data that it's referring to.

So unless you plan to have basic_publish_message hold onto an instance of the passed in mqtt::buffer, this function should instead take an mqtt::string_view (by value), to avoid all of the lifetime management activity that's going on here.
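To illustrate the point about lifetime management, here is a minimal sketch, assuming `std::string_view` stands in for `mqtt::string_view`. The names `header_info` and `parse_fixed_header` are hypothetical and are not part of mqtt_cpp. Copying a view copies only a pointer and a size, so no reference count is touched while parsing.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical result type, for illustration only.
struct header_info {
    std::uint8_t fixed_header;
    std::string_view rest; // non-owning view of the remaining bytes
};

// Takes the view by value: no shared ownership is involved, so the caller
// stays responsible for keeping the underlying bytes alive.
inline header_info parse_fixed_header(std::string_view buf) {
    assert(!buf.empty());
    auto fixed_header = static_cast<std::uint8_t>(buf.front());
    buf.remove_prefix(1); // advance past the byte just read
    return { fixed_header, buf };
}
```

The trade-off is that the returned view is only valid while the caller keeps the original storage alive.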

Owner Author

boost::container::static_vector<char, 2> len;
as::const_buffer str;
};

struct len_str {
Contributor

I don't understand what the purpose of this struct is.

boost::container::static_vector<char, 2> len; value can be calculated as needed, on the fly. No need to store it.

The logic from

if (!already_checked) {
    auto r = utf8string::validate_contents(buf);
    if (r != utf8string::validation::well_formed) throw utf8string_contents_error(r);
}

looks like it could be trivially merged into the UserProperty constructor, couldn't it?

Contributor

Even if we can't eliminate the "len_str" class entirely,

I think it should be moved into the private: section of the UserProperty class.

Owner Author

I will fix it.

Owner Author

len_str is moved.

inline
mqtt::optional<property_variant> parse_one(It& begin, It end) {
if (begin == end) return mqtt::nullopt;
std::pair<mqtt::optional<property_variant>, buffer> parse_one(buffer buf) {
Contributor

I don't see why this function should return an std::pair.

The properties, if any are successfully parsed, will automatically preserve the lifetime of the mqtt::buffer object.

No need to explicitly return the buffer object.

Further, I don't think this function should have that responsibility. Callers of it should be aware that if no properties are returned, then there may be no references to buf. So they would need to ensure they manage the lifetime the way they want to.

Also, for functions like this one where the number of "assignments" of the parameter is unknown, I think it should be a const& parameter, instead of by-value.

Owner Author

The reason I return buf as the second member of the pair is to advance the buffer position to the start of the next property.

Alternatively, we can pass an iterator along with buf. See the following code:

inline
std::pair<std::vector<property_variant>, buffer> parse(buffer buf) {
    std::vector<property_variant> props;
    auto it = buf.cbegin();
    while (true) {
        auto ret = parse_one(buf, it); // buf is passed by const reference, 
                                       // it is passed by non const reference 
        // it is updated to the next property start position
        if (ret.first) {
            props.push_back(std::move(ret.first.value()));
            buf = std::move(ret.second);
        }
        else {
            break;
        }
    }
    return { props, std::move(buf) };
}

Or we can use the approach from #337 (comment).

In addition, parse doesn't need to return the buffer. The caller knows the total size of the properties from the Property Length field.

Any ideas?

Contributor

The property length field is always exactly the same, regardless of which type of property.

I would parse the property length outside of the parse_one function, and then create an MQTT::string_view with the whole data for the property. Then pass the MQTT::string_view by value, and buf by const&.

If the parse_one function returns a property that needs to own a reference, then it will make a new MQTT::buffer using the lifetime from the const& buf, and the data from the MQTT::string_view.

Owner Author

The Property Length field doesn't appear in each property.

I mean the data structure is as follows:

+-----------------+-----------+-----------+-----+-----------+
| Property Length | Property1 | Property2 | ... | PropertyN |
+-----------------+-----------+-----------+-----+-----------+

We cannot know the size of each property or the number of properties beforehand.

See https://docs.oasis-open.org/mqtt/mqtt/v5.0/os/mqtt-v5.0-os.html#_Toc3901027
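The layout above can be walked with a plain view once the Property Length has been read. The following is only a sketch under stated assumptions: `std::string_view` stands in for `mqtt::string_view`, the names `raw_property`, `value_size`, and `parse_all` are hypothetical, the identifier is treated as a single byte, and only four identifiers from the MQTT v5 specification are handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// One parsed property: its identifier and a view of its raw value bytes.
struct raw_property {
    std::uint8_t id;
    std::string_view value;
};

// Returns the encoded size of the value that follows the identifier, or
// std::nullopt if the identifier is not handled by this sketch or the
// length prefix is truncated.
inline std::optional<std::size_t> value_size(std::uint8_t id, std::string_view rest) {
    switch (id) {
    case 0x01: return 1; // Payload Format Indicator: Byte
    case 0x23: return 2; // Topic Alias: Two Byte Integer
    case 0x02: return 4; // Message Expiry Interval: Four Byte Integer
    case 0x03: {         // Content Type: UTF-8 string with 2-byte length prefix
        if (rest.size() < 2) return std::nullopt;
        auto hi = static_cast<unsigned char>(rest[0]);
        auto lo = static_cast<unsigned char>(rest[1]);
        return std::size_t{2} + ((std::size_t{hi} << 8) | lo);
    }
    default: return std::nullopt;
    }
}

// `props` must be exactly the bytes covered by the Property Length field.
inline std::optional<std::vector<raw_property>> parse_all(std::string_view props) {
    std::vector<raw_property> out;
    while (!props.empty()) {
        auto id = static_cast<std::uint8_t>(props.front());
        props.remove_prefix(1);
        auto n = value_size(id, props);
        if (!n || *n > props.size()) return std::nullopt; // unknown or truncated
        out.push_back({ id, props.substr(0, *n) });
        props.remove_prefix(*n);
    }
    return out;
}
```

Each property's size is only known after its identifier has been inspected, which is why the loop has to advance one property at a time.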

auto it = begin;
switch (static_cast<property::id>(*it++)) {
auto id = static_cast<property::id>(buf.front());
buf = buf.substr(1);
Contributor

Careful with calls to substr, this causes an atomic increment in the shared pointer, and then an atomic decrement.

Owner Author

Yes, I should avoid the unnecessary atomic increments and decrements caused by shared_ptr_array's copy constructor.

I came up with https://wandbox.org/permlink/UramwEowhKy1oRFw. This could be another option.

I haven't checked all the places where I call substr() yet. Where mqtt::string_view is obviously simpler and clearer, I will use it.

Contributor

Using rvalue references is a good way to avoid the problem in the generic sense.

But I think that there is still room for improvement in the places where MQTT::buffer is used, as well.

Owner Author

As a first step, every buf = buf.substr(N) (where buf is a buffer) has been replaced with buf = std::move(buf).substr(N).
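For readers following along, the idea of an rvalue version of substr() can be sketched as below. This is not the actual mqtt::buffer: it is a simplified stand-in that keeps a `std::shared_ptr<std::string>` as the lifetime holder, and `use_count()` exists only to make the effect observable.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <utility>

// Simplified stand-in for mqtt::buffer: a view plus a shared lifetime holder.
class buffer {
public:
    explicit buffer(std::string s)
        : life_(std::make_shared<std::string>(std::move(s))), view_(*life_) {}

    // Lvalue overload: the source stays usable, so the lifetime holder is
    // copied and the reference count is incremented.
    buffer substr(std::size_t pos, std::size_t n = std::string_view::npos) const& {
        return buffer(view_.substr(pos, n), life_);
    }

    // Rvalue overload: the source is expiring, so the lifetime holder is
    // moved into the result and the reference count is left untouched.
    buffer substr(std::size_t pos, std::size_t n = std::string_view::npos) && {
        return buffer(view_.substr(pos, n), std::move(life_));
    }

    std::string_view view() const { return view_; }
    long use_count() const { return life_.use_count(); }

private:
    buffer(std::string_view v, std::shared_ptr<std::string> life)
        : life_(std::move(life)), view_(v) {}

    std::shared_ptr<std::string> life_; // declared first so view_ can refer to it
    std::string_view view_;
};
```

With this shape, `buf = std::move(buf).substr(1)` selects the rvalue overload and only moves the holder, while `other = buf.substr(1)` still shares ownership.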

include/mqtt/will.hpp (outdated, resolved)
@@ -714,7 +714,7 @@ class test_broker {
bool unsubscribe_handler(
Endpoint& ep,
typename Endpoint::packet_id_t packet_id,
std::vector<mqtt::string_view> topics,
std::vector<mqtt::buffer> topics,
Contributor

@jonesmz jonesmz Aug 14, 2019

There are several other places that need attention in this code as well.

E.g. see the "do_publish" function.

void do_publish(
        std::shared_ptr<std::string> const& topic,
        std::shared_ptr<std::string> const& contents,
        std::uint8_t qos,
        bool is_retain,
        std::vector<mqtt::v5::property_variant> props) {

Owner Author

I believe I have fixed all parts of test_broker to use mqtt::buffer.
Please double-check.

@redboltz
Owner Author

Thank you @jonesmz, your comments are very valuable. I will take a look at them later.

@jonesmz
Contributor

jonesmz commented Aug 14, 2019

To test the compile time increase, you could try an experiment.

If you made a new branch off of master and then applied the MQTT::buffer changes to only the callbacks that user code gives us (e.g. Connect_handler), you could see if compile times changed at all.

If they have not, then modify the various property types to use MQTT::buffer, and also MQTT::will, and see if compile times change.

If those two parts of the code don't change the compile times, then we can assume it's the finite state machine.

@jonesmz
Contributor

jonesmz commented Aug 14, 2019

I'm on my phone so I can't say this as a comment on the line of code directly, sorry.

I noticed in the various packet parser functions that you use "shared_from_this" for each lambda you create.

I think it would be better, and use fewer atomic operations, if you instead had a "lifetime" parameter to the various state machine functions, which could be std::move'd as needed.

Also, I noticed at the beginning of class endpoint that you can use "this_type" as part of the using statement for "using shared_from_this" to reduce copy-paste.

@jonesmz
Contributor

jonesmz commented Aug 14, 2019

Also, in the various packet parsing state machine functions:

I noticed two things:

  1. You pass the async-handler func as a parameter to the lambda functions a lot. Why not capture func directly instead?
  2. You use a lot of std::function, which is very very heavy on compile times.

Instead, each function in the packet-parsing state machine functions can be a template function, and take the lambda function type directly, instead of needing to use std::function. I suspect that will help improve compile times.
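As a small illustration of the two styles being compared, here is a sketch with hypothetical names (`run_erased` and `run_direct` are not mqtt_cpp functions). Whether the template form actually shortens compile times depends on how many distinct callable types end up instantiating it, so it would need to be measured rather than assumed.

```cpp
#include <functional>
#include <utility>

// Type-erased version: each call site converts its lambda to std::function,
// which instantiates the type-erasure machinery for that lambda type.
inline int run_erased(int value, std::function<int(int)> const& handler) {
    return handler(value);
}

// Template version: the callable's own type is deduced and invoked directly,
// at the cost of one instantiation of this function per callable type.
template <typename Handler>
int run_direct(int value, Handler&& handler) {
    return std::forward<Handler>(handler)(value);
}
```

Both forms accept the same lambda at the call site, so switching between them for an experiment is mostly a matter of changing the function signatures.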

Replaced template property and will constructors with `mqtt::buffer`
ones.
Added `""_mb` user defined literals.
Moved `len_str` to user_property's private member.
Added rvalue version of `buffer::substr()`.
Added `buffer::view()` to get `mqtt::string_view`.
@redboltz
Owner Author

redboltz commented Aug 14, 2019

settings

  • target file
    • no_tls_both (in example)
  • compiler
    • clang version 8.0.1 (tags/RELEASE_801/final)
    • Target: x86_64-pc-linux-gnu
    • Thread model: posix
    • InstalledDir: /usr/bin
  • compiler options
    • -DMQTT_STD_STRING_VIEW
    • -DMQTT_STD_VARIANT
    • -DMQTT_USE_STR_CHECK
    • -DMQTT_USE_WS
    • -std=gnu++17
  • build command
    • time make VERBOSE=1 no_tls_both

log

[ 50%] Building CXX object example/CMakeFiles/no_tls_both.dir/no_tls_both.cpp.o
cd /home/kondo/work/tmp/mqtt/impl_320/mqtt_cpp/build/example && /usr/bin/clang++   -I/home/kondo/work/tmp/mqtt/impl_320/mqtt_cpp/include  -DMQTT_STD_STRING_VIEW -DMQTT_STD_VARIANT -DMQTT_USE_STR_CHECK -DMQTT_USE_WS -O0 -g   -pthread -std=gnu++17 -o CMakeFiles/no_tls_both.dir/no_tls_both.cpp.o -c /home/kondo/work/tmp/mqtt/impl_320/mqtt_cpp/example/no_tls_both.cpp
[100%] Linking CXX executable no_tls_both
cd /home/kondo/work/tmp/mqtt/impl_320/mqtt_cpp/build/example && /usr/bin/cmake -E cmake_link_script CMakeFiles/no_tls_both.dir/link.txt --verbose=1
/usr/bin/clang++  -DMQTT_STD_STRING_VIEW -DMQTT_STD_VARIANT -DMQTT_USE_STR_CHECK -DMQTT_USE_WS -O0 -g   CMakeFiles/no_tls_both.dir/no_tls_both.cpp.o  -o no_tls_both /usr/lib/libboost_system.so -lpthread /usr/lib/libssl.so /usr/lib/libcrypto.so -l

results

branch                        note                                 time
master                        original                             25.80s user 1.01s system 97% cpu 27.501 total
impl_320                      use SFINAE on props                  106.40s user 2.75s system 99% cpu 1:49.33 total
impl_320_work                 buffer on props                      108.95s user 2.81s system 99% cpu 1:51.99 total
impl_320_delete_std_function  replace std::function with template  over 20 minutes

master is the fastest. impl_320 and impl_320_work take about the same time, roughly 4 times longer than master. impl_320_delete_std_function is very slow; I stopped the build with Ctrl-C.
I guess I did something wrong.

All branches are on GitHub. You can double-check them in your environment.

@redboltz
Owner Author

The class template endpoint has 4 different kinds of specialization. I compared them.
The program is no_tls_both.

NOTE: The machine spec is different from #337 (comment).

MQTT_NO_TLS  MQTT_USE_WS  master (sec)  impl_320_work (sec)
ON           OFF          11.1          16.3
OFF          OFF          12.1          21.4
ON           ON           15.1          41.6
OFF          ON           22.5          96.6

I guess that the nested classes in endpoint might be the cause of the long compile times. Some of them don't need to be defined in endpoint. I defined them there to limit their scope. Instead, I can define them out of endpoint, in the namespace detail. I will try it.

In addition, I will wrap suspicious places in #if 0 ... #endif and check the compile times.

@redboltz
Owner Author

redboltz commented Aug 15, 2019

I can define them out of endpoint, in the namespace detail. I will try it.

I did it, but the compile times didn't improve.

In addition, I will wrap suspicious places in #if 0 ... #endif and check the compile times.

I added #if 0 as follows:

    void process_payload(async_handler_t func) {
#if 0
        auto control_packet_type = get_control_packet_type(fixed_header_);
        switch (control_packet_type) {
        case control_packet_type::connect:
        ...
        default:
            break;
        }
#endif
    }

Then the compile time dropped from 96.6 sec to 15.9 sec. So I guess that the packet processing functions are the dominant factor.

@redboltz
Owner Author

    void process_properties(
        async_handler_t func,
        buffer buf,
        std::function<void(std::vector<v5::property_variant>, buffer, async_handler_t)> handler) {
#if 0
        ....
#endif
    }

When I add #if 0 to process_properties(), the compile time becomes 86.1 sec. That is 10 seconds shorter than impl_320_work, but it is not the dominant factor.

@redboltz
Owner Author

On the branch impl_320_struct_outside:

command       time
connect       34
+connack      40
+publish      47
+puback       51
+pubrec       58
+pubrel       62
+pubcomp      65
+subscribe    70
+suback       75
+unsubscribe  78
+unsuback     83
+pingreq      87
+pingresp     86
+disconnect   89
+auth         93

@redboltz
Owner Author

Let's focus on publish command.

The target function is process_publish().

function         time
empty            16
+process_header  20
+topic_name      24
+packet_id       26
+properties      35
+payload(all)    38

Hmm, it seems that properties is dominant.

@@ -69,6 +71,7 @@ template <typename Socket, typename Mutex = std::mutex, template<typename...> cl
class endpoint : public std::enable_shared_from_this<endpoint<Socket, Mutex, LockGuard, PacketIdBytes>> {
using this_type = endpoint<Socket, Mutex, LockGuard, PacketIdBytes>;
public:
using std::enable_shared_from_this<endpoint<Socket, Mutex, LockGuard, PacketIdBytes>>::shared_from_this;
Contributor

I ran into a problem earlier today.

The way enable_shared_from_this is used in the mqtt_cpp codebase makes it so that if someone inherits from mqtt::endpoint, then

Derived * p;
p->shared_from_this();

Returns an instance of std::shared_ptr<mqtt::endpoint> and not std::shared_ptr<Derived>

One way to solve that is to use the "Curiously Recurring Template Pattern", and have the endpoint<> class take as a template parameter the type of the most-derived type of the inheritance graph, and then pass that type to std::enable_shared_from_this<>.
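A minimal sketch of that CRTP idea follows, using a heavily simplified, hypothetical `endpoint_base` in place of the real endpoint template (which has several more template parameters). The `static_assert` checks at compile time that `shared_from_this()` yields the derived type.

```cpp
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical simplified base: the most-derived type is passed in as a
// template parameter and forwarded to std::enable_shared_from_this.
template <typename Derived>
class endpoint_base : public std::enable_shared_from_this<Derived> {
};

// User-defined type inheriting from the base, CRTP style.
class my_client : public endpoint_base<my_client> {
public:
    int id = 42;
};

static_assert(
    std::is_same<
        decltype(std::declval<my_client&>().shared_from_this()),
        std::shared_ptr<my_client>
    >::value,
    "shared_from_this() yields the derived type");
```

This sketch only covers the case where a derived class exists; using the base itself as the final type needs additional handling.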

Owner Author

Hmm... CRTP works well with Derived. But I couldn't find a good way to support endpoint as a leaf class.
https://wandbox.org/permlink/2jSaOy3xkNgO9sAu
Any ideas?

Owner Author

I introduced base_selector meta function. See https://wandbox.org/permlink/Mm74LrjbYOMq4AZo
I think that it works well.

Owner Author

I think that the following approach is better:

https://wandbox.org/permlink/7r1s3XgsIxNpxtNB

No meta function required.

Owner Author

BTW, this issue is not related to this PR, so it should be a separate issue.

auto sp = std::make_shared<std::string>(b, e);
restore_serialized_message(basic_publish_message<PacketIdBytes>(sp->begin(), sp->end()), sp);
auto size = static_cast<std::size_t>(std::distance(b, e));
shared_ptr_array spa { new char[size] };
Contributor

In C++20, you can do instead:

std::make_shared<char[]>(size);

Which avoids a secondary allocation behind the scenes.

For now, using boost::make_shared<char[]>(size) will avoid that problem.

Perhaps introduce a new CMake option?

MQTT_USE_BOOST_SHARED_PTR

If it's enabled, then mqtt_cpp uses boost::shared_ptr, and can do this with one allocation instead of two.

If it's disabled, then do an #if cpp20 check, and if compiling with cpp 20, use std::make_shared<char[]>(size);

And lastly, use the current code if neither of the above are true.

Owner Author

@redboltz redboltz Aug 18, 2019

I think that the following code is good enough:

#if !defined(MQTT_SHARED_PTR_ARRAY_HPP)
#define MQTT_SHARED_PTR_ARRAY_HPP

#ifdef MQTT_STD_SHARED_PTR_ARRAY

#include <memory>

namespace mqtt {

using shared_ptr_array = std::shared_ptr<char []>;
using shared_ptr_const_array = std::shared_ptr<char const []>;

inline shared_ptr_array make_shared_ptr_array(std::size_t size) {
#if __cplusplus > 201703L // C++20 date is not determined yet
    return std::make_shared<char[]>(size);
#else  // __cplusplus > 201703L
    return std::shared_ptr<char[]>(new char[size]);
#endif // __cplusplus > 201703L
}

} // namespace mqtt

#else  // MQTT_STD_SHARED_PTR_ARRAY

#include <boost/shared_ptr.hpp>
#include <boost/smart_ptr/make_shared.hpp>

namespace mqtt {

using shared_ptr_array = boost::shared_ptr<char []>;
using shared_ptr_const_array = boost::shared_ptr<char const []>;

inline shared_ptr_array make_shared_ptr_array(std::size_t size) {
    return boost::make_shared<char[]>(size);
}

} // namespace mqtt

#endif // MQTT_STD_SHARED_PTR_ARRAY

#endif // MQTT_SHARED_PTR_ARRAY_HPP

Owner Author

Implemented. boost::shared_ptr and boost::make_shared are used by default.

include/mqtt/endpoint.hpp Outdated
include/mqtt/endpoint.hpp Outdated
`mqtt::socket` is a type-erased socket.
It covers tcp, tls, ws, and wss.

Compile times become shorter.
boost type_erasure is based on vtable emulation, so member functions
of `mqtt::socket` are called indirectly.

The cost of the indirect call is negligible for the `mqtt_cpp` use case.

I've compared the call cost of the following three approaches.

1. Traditional object-oriented style virtual function call.
2. variant based visit call.
3. boost type_erasure.

There isn't any significant difference.

Approach 1 requires inheritance and (I guess) doesn't work well
with templates.

Approach 2 needs all actual types to be listed in the variant definition.

So I chose approach 3.
Used boost asio buffer() overload for will.
@@ -1,6 +1,6 @@
sudo: false
language: cpp
dist: trusty
Contributor

Why the change?

Did we change the minimum C++ standard version or something?

Contributor

FYI, I'm using mqtt_cpp on OpenWRT, which only has GCC 7.3, so I wouldn't be able to use the latest release if we need GCC 9 now.

Owner Author

g++ reports a compile error due to a g++ bug, and I need to avoid it.
I don't know how to install g++ 9 on trusty, so I updated the dist.

Contributor

What was the compiler error? That's really weird, I didn't see any code that looked problematic.

Owner Author

The compile error happens on the following code:

[
    self = shared_from_this()
]

https://travis-ci.org/redboltz/mqtt_cpp/jobs/572260184#L1050

/home/travis/build/redboltz/mqtt_cpp/include/mqtt/endpoint.hpp:9949:52: error: cannot call member function ‘std::shared_ptr<_Tp> std::enable_shared_from_this<_Tp>::shared_from_this() [with _Tp = mqtt::endpoint<std::mutex, std::lock_guard, 2>]’ without object
                             self = shared_from_this(),
                                    ~~~~~~~~~~~~~~~~^~

I can avoid it by adding a redundant this-> as follows:

[
    self = this->shared_from_this()
]

However, if I do that, then MSVC reports the following compile error:

https://redboltz.visualstudio.com/redboltz/_build/results?buildId=469

  C:\Program Files\Boost\1.69.0\include\boost-1_69\boost/pending/integer_log2.hpp(7): note: This header is deprecated. Use <boost/integer/integer_log2.hpp> instead.
  Please define _WIN32_WINNT or _WIN32_WINDOWS appropriately. For example:
  - add -D_WIN32_WINNT=0x0501 to the compiler command line; or
  - add _WIN32_WINNT=0x0501 to your project's Preprocessor Definitions.
  Assuming _WIN32_WINNT=0x0501 (i.e. Windows XP target).
D:\a\1\s\include\mqtt/endpoint.hpp(9982): error C2039: 'shared_from_this': is not a member of 'mqtt::endpoint<Socket,std::mutex,std::lock_guard,2>::process_connect_impl::<lambda_f447512e4e7450f0685e2f392f3b8c35>' [D:\a\1\s\build\test\as_buffer_async_pubsub_1.vcxproj]
          with
          [
              Socket=mqtt::tcp_endpoint<boost::asio::ssl::stream<boost::asio::ip::tcp::socket>,boost::asio::io_context::strand>
          ]
  D:\a\1\s\include\mqtt/endpoint.hpp(9983): note: see declaration of 'mqtt::endpoint<Socket,std::mutex,std::lock_guard,2>::process_connect_impl::<lambda_f447512e4e7450f0685e2f392f3b8c35>'
          with
          [
              Socket=mqtt::tcp_endpoint<boost::asio::ssl::stream<boost::asio::ip::tcp::socket>,boost::asio::io_context::strand>
          ]
  D:\a\1\s\include\mqtt/endpoint.hpp(9828): note: while compiling class template member function 'void mqtt::endpoint<Socket,std::mutex,std::lock_guard,2>::process_connect_impl(mqtt::endpoint<Socket,std::mutex,std::lock_guard,2>::async_handler_t,mqtt::buffer,mqtt::endpoint<Socket,std::mutex,std::lock_guard,2>::connect_phase,mqtt::endpoint<Socket,std::mutex,std::lock_guard,2>::connect_info &&)'
          with
          [
              Socket=mqtt::tcp_endpoint<boost::asio::ssl::stream<boost::asio::ip::tcp::socket>,boost::asio::io_context::strand>
          ]
  D:\a\1\s\include\mqtt/endpoint.hpp(9813): note: see reference to function template instantiation 'void mqtt::endpoint<Socket,std::mutex,std::lock_guard,2>::process_connect_impl(mqtt::endpoint<Socket,std::mutex,std::lock_guard,2>::async_handler_t,mqtt::buffer,mqtt::endpoint<Socket,std::mutex,std::lock_guard,2>::connect_phase,mqtt::endpoint<Socket,std::mutex,std::lock_guard,2>::connect_info &&)' being compiled
          with
          [
              Socket=mqtt::tcp_endpoint<boost::asio::ssl::stream<boost::asio::ip::tcp::socket>,boost::asio::io_context::strand>
          ]
  D:\a\1\s\include\mqtt/client.hpp(40): note: see reference to class template instantiation 'mqtt::endpoint<Socket,std::mutex,std::lock_guard,2>' being compiled
          with
          [
              Socket=mqtt::tcp_endpoint<boost::asio::ssl::stream<boost::asio::ip::tcp::socket>,boost::asio::io_context::strand>
          ]

It seems that MSVC mistakenly treats the this pointer as the internally created lambda closure object.

To work around both compiler problems, I removed the redundant this-> and updated g++.

Contributor

I normally want to use the latest compiler version, but I'm currently limited to GCC 7.3.0 (I am, of course, always working on ways to upgrade to newer compilers, but am currently not able to), so I hope we can find a different workaround.

Why not do:

auto self = shared_from_this();

before creating the lambda?

I also pointed out that there's a lot of places where we're calling shared_from_this() repeatedly, and could instead pass the shared_ptr into the function that then creates the lambda, avoiding the need for repeated calls to shared_from_this().
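A minimal sketch of the suggestion above, assuming a hypothetical session class rather than the real mqtt::endpoint. The shared_ptr is obtained before the lambda is created and then captured by value, so no shared_from_this() call appears inside the capture list and the code does not rely on the compiler-specific behavior discussed earlier in this thread.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical class, not the real mqtt::endpoint.
class session : public std::enable_shared_from_this<session> {
public:
    // Builds a callback that keeps *this alive for as long as the callback exists.
    std::function<int()> make_handler() {
        auto self = shared_from_this();   // taken once, outside the lambda
        return [self] { return self->value_; };
    }

private:
    int value_ = 7;
};

// Usage helper: the handler still works after the original owner is released.
inline int run_handler_after_owner_reset() {
    auto s = std::make_shared<session>();
    auto handler = s->make_handler();
    s.reset();            // the copy captured by the handler is now the only owner
    return handler();     // safe: the session object is still alive
}
```

The same self can be passed on to helper functions that create further lambdas, which avoids repeated shared_from_this() calls.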

Owner Author

Moved the shared_from_this() calls outside of the lambda expressions.
In addition, minimized the number of shared_from_this() calls.

Owner Author

I downgraded g++ to version 7.4, not only to test the workaround but also for codecov support, because I couldn't install gcov-9.

include/mqtt/buffer.hpp Outdated
@redboltz
Owner Author

I've checked compile times using #if 0 ... #endif. In addition, I've checked the binary size and its contents. I found that the dominant cause is the many different instantiations of the class template endpoint.

So I removed the Socket template parameter from the class template endpoint. Instead, endpoint holds a type-erased mqtt::socket.

As a result, compile times decreased.

| approach | pros | cons |
|---|---|---|
| template | faster runtime speed | longer compile times |
| virtual function | natural member function call | inheritance required. not template friendly |
| variant | standard supported (C++17) | visit required |
| boost type_erasure | natural member function call | a little hacky |

The indirect function call speeds of the three non-template approaches are not very different.
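To make the trade-offs concrete, here is a rough sketch of the virtual function and variant approaches using only the standard library. All type names are hypothetical stand-ins for the real tcp, tls, ws, and wss sockets; boost type_erasure is not shown.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <variant>

// Virtual function approach: every socket type must inherit from a common base.
struct socket_iface {
    virtual ~socket_iface() = default;
    virtual std::string name() const = 0;
};
struct tcp_virtual final : socket_iface {
    std::string name() const override { return "tcp"; }
};
struct tls_virtual final : socket_iface {
    std::string name() const override { return "tls"; }
};

// Natural member function call through the base class.
inline std::string name_via_virtual(socket_iface const& s) {
    return s.name();
}

// Variant approach: no inheritance, but every concrete type has to be
// listed in the variant definition and calls go through std::visit.
struct tcp_plain { std::string name() const { return "tcp"; } };
struct tls_plain { std::string name() const { return "tls"; } };
using socket_variant = std::variant<tcp_plain, tls_plain>;

inline std::string name_via_visit(socket_variant const& s) {
    return std::visit([](auto const& impl) { return impl.name(); }, s);
}

// Small usage helper exercising both approaches.
inline void check_both_approaches() {
    std::unique_ptr<socket_iface> v = std::make_unique<tls_virtual>();
    assert(name_via_virtual(*v) == "tls");
    assert(name_via_visit(socket_variant{tcp_plain{}}) == "tcp");
}
```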

`mqtt::buffer` can be converted to `mqtt::string_view` automatically.
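One way such an automatic conversion can be obtained is by deriving from the view type. The sketch below is an assumption about the general technique, not the actual mqtt::buffer implementation: buffer_sketch, make_owning_buffer, and length_of are hypothetical names, and std::string_view stands in for mqtt::string_view.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string_view>
#include <utility>

// Hypothetical, simplified stand-in for mqtt::buffer.
class buffer_sketch : public std::string_view {
public:
    // View only: the caller guarantees the lifetime of the characters.
    explicit buffer_sketch(std::string_view sv) : std::string_view(sv) {}

    // View plus an optional lifetime keeper for the viewed storage.
    buffer_sketch(std::string_view sv, std::shared_ptr<char[]> keeper)
        : std::string_view(sv), keeper_(std::move(keeper)) {}

private:
    std::shared_ptr<char[]> keeper_;
};

// Copies src into shared storage and returns a buffer that keeps it alive.
inline buffer_sketch make_owning_buffer(std::string_view src) {
    std::shared_ptr<char[]> storage(new char[src.size()]);
    char* p = storage.get();              // read the pointer before moving storage
    std::copy(src.begin(), src.end(), p);
    return buffer_sketch(std::string_view(p, src.size()), std::move(storage));
}

// A function that only knows about string_view accepts a buffer_sketch as is.
inline std::size_t length_of(std::string_view sv) {
    return sv.size();
}
```

Copies of the buffer share the storage, so a callback can hold on to received data after the function that produced it returns. A plain string_view obtained from the buffer does not extend the lifetime.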
include/mqtt/buffer.hpp Outdated
@@ -7758,7 +7758,10 @@ class endpoint : public std::enable_shared_from_this<endpoint<Socket, Mutex, Loc
if (connected_) {
connected_ = false;
mqtt_connected_ = false;
shutdown_from_server(*socket_);
{
boost::system::error_code ec;
Contributor

Is there any meaningful check that can be made on the error_code?

I can't think of any. Some kind of log message perhaps?

Owner Author

I want to avoid throwing an exception. I think it should be logged at warning level in the future.

@@ -31,6 +31,11 @@ using boost::optional;
using nullopt_t = boost::none_t;
static const auto nullopt = boost::none;

template <typename T, typename U>
Contributor

How does this get used?

Owner Author

I forgot to erase it. At first, I thought that boost::optional required an in-place factory, but then I noticed that boost::optional also has emplace().

I will remove it.

Owner Author

Removed.

>
>
binary_property(property::id id, Buffer&& buf)
binary_property(property::id id, buffer buf)
Contributor

Did this change help at all with the compile times?

Owner Author

No, compile times are not affected by this change. The change is there to avoid implicit conversion: users need to pass mqtt::buffer or "abc"_mb explicitly.

See #337 (comment)

Am I understanding correctly?

Contributor

I try to avoid using universal / forwarding references when the thing being "forwarded" will be stored into a member variable.

I like pushing the responsibility of providing a fully constructed object onto the caller of the function. It's much less typing, fewer template instantiations, and causes far fewer surprises than using forwarding references.
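A small sketch of the by-value "sink parameter" style described above. property_sketch is a hypothetical class and std::string stands in for mqtt::buffer; this is not the real binary_property.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical property type; std::string stands in for mqtt::buffer.
class property_sketch {
public:
    // Sink parameter: taken by value, then moved into the member.
    // The constructor is not a template, so there is exactly one instantiation,
    // and the caller has to hand over a fully constructed value.
    explicit property_sketch(std::string value) : value_(std::move(value)) {}

    std::string const& value() const { return value_; }

private:
    std::string value_;
};

// Usage helper: an rvalue argument is moved in, an lvalue argument is copied.
inline std::size_t total_size_after_construction() {
    std::string payload(64, 'x');
    property_sketch from_lvalue(payload);              // copies payload
    property_sketch from_rvalue(std::move(payload));   // moves payload
    assert(from_lvalue.value() == from_rvalue.value());
    return from_lvalue.value().size() + from_rvalue.value().size();
}
```

Compared with a forwarding-reference constructor, this costs at most one extra move per call.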

Owner Author

Ok. I understand that you think this change is valuable even if compile times do not change.

Contributor

Right.

I expected a small improvement to compile times, but mostly I thought it was an improvement to the API regardless of whether compile times improved.

This video explains a bit: https://www.youtube.com/watch?v=PNRju6_yn3o

Note that using universal / forwarding references is technically better in terms of (completely unoptimized) runtime performance because it makes slightly fewer calls to the move constructor, but that's not relevant for header-only code. The compiler will almost certainly inline the call to the function, and therefore we should already have very efficient outcomes.

Owner Author

Implemented.

include/mqtt/property.hpp
struct len_str {
explicit len_str(buffer b, bool already_checked = false)
: buf(std::move(b)),
len{ num_to_2bytes(static_cast<std::uint16_t>(buf.size())) }
Contributor

boost::numeric_cast?

That way, if the current value of the buffer.size() is larger than max uint16_t, an exception is thrown.
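For illustration, the difference between an unchecked static_cast and a checked narrowing conversion can be sketched with the standard library alone. to_u16_checked is a hypothetical helper that only imitates the idea of boost::numeric_cast<std::uint16_t>; it is not Boost's implementation and throws a different exception type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper: throws instead of silently wrapping around.
inline std::uint16_t to_u16_checked(std::size_t n) {
    if (n > static_cast<std::size_t>(std::numeric_limits<std::uint16_t>::max())) {
        throw std::overflow_error("length does not fit in 16 bits");
    }
    return static_cast<std::uint16_t>(n);
}

// Usage helper: reports whether the checked conversion rejects n.
inline bool throws_on_overflow(std::size_t n) {
    try {
        to_u16_checked(n);
        return false;
    } catch (std::overflow_error const&) {
        return true;
    }
}
```

With a plain static_cast, a buffer of 65536 bytes would be encoded with a length field of 0.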

Owner Author

Fixed.

namespace mpl = boost::mpl;

template <typename Concept>
class shared_any : public boost::type_erasure::any<Concept, boost::type_erasure::_self&> {
Contributor

I'm not sure I understand the goal here.

Can't std::any / boost::any hold a shared_ptr ?

Owner Author

This is very difficult to explain. The goal is type erasure.
boost::type_erasure::any can call member functions without a cast. See type_erased_socket.hpp.

It is a form of runtime type erasure.

Owner Author

The documentation is at https://www.boost.org/doc/libs/1_70_0/doc/html/boost_typeerasure.html.

Boost type erasure doesn't support shared_ptr directly, so I need to adapt the pointee object held by the shared_ptr to the concept.

The 2nd template argument of boost::type_erasure::any is there to support reference semantics.
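The general idea of a type-erased wrapper that shares ownership of the pointee and forwards member calls without a cast can be sketched by hand. This is deliberately not Boost.TypeErasure and not the PR's shared_any; any_socket_sketch, tcp_sock, and ws_sock are hypothetical names used only to show the technique.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical concrete socket types with no common base class.
struct tcp_sock { std::string kind() const { return "tcp"; } };
struct ws_sock  { std::string kind() const { return "ws"; } };

// Hand-written type erasure over a shared_ptr to any type providing kind().
class any_socket_sketch {
public:
    template <typename T>
    explicit any_socket_sketch(std::shared_ptr<T> p)
        : self_(std::make_shared<model<T>>(std::move(p))) {}

    // Callers use an ordinary member function; the concrete type is hidden.
    std::string kind() const { return self_->kind(); }

private:
    // Internal interface that plays the role of the "concept".
    struct concept_t {
        virtual ~concept_t() = default;
        virtual std::string kind() const = 0;
    };

    // Adapts the pointee held by the shared_ptr to the internal interface.
    template <typename T>
    struct model final : concept_t {
        explicit model(std::shared_ptr<T> p) : ptr(std::move(p)) {}
        std::string kind() const override { return ptr->kind(); }
        std::shared_ptr<T> ptr;
    };

    std::shared_ptr<concept_t const> self_;
};

// Usage helper: the wrapper keeps the socket alive after the caller lets go.
inline std::string kind_after_release() {
    auto raw = std::make_shared<ws_sock>();
    any_socket_sketch s(raw);
    raw.reset();
    return s.kind();
}
```

Because the wrapper is not a template on the socket type, code that uses it is compiled once rather than once per socket type.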

test/property.cpp
@redboltz
Owner Author

redboltz commented Aug 17, 2019

Here are the current results. Compile times are good.

| MQTT_NO_TLS | MQTT_USE_WS | master | impl_320_work | impl_320(type erased) |
|---|---|---|---|---|
| ON | OFF | 11.1 | 16.3 | 13.4 |
| OFF | OFF | 12.1 | 21.4 | 13.8 |
| ON | ON | 15.1 | 41.6 | 14.1 |
| OFF | ON | 22.5 | 96.6 | 14.5 |

@codecov-io

codecov-io commented Aug 18, 2019

Codecov Report

Merging #337 into master will increase coverage by 0.56%.
The diff coverage is 73.51%.

@@            Coverage Diff             @@
##           master     #337      +/-   ##
==========================================
+ Coverage   85.53%   86.09%   +0.56%     
==========================================
  Files          34       37       +3     
  Lines        4889     6237    +1348     
==========================================
+ Hits         4182     5370    +1188     
- Misses        707      867     +160

Replaced static_cast with boost::numeric_cast if overflow could happen.
Removed redundant static_cast.
If boost is used (default), then `boost::make_shared<char[]>(size)` is called.
If the user defines MQTT_STD_SHARED_PTR_ARRAY,
    if __cplusplus is greater than 201703L (means C++20 or later),
        then `std::make_shared<char[]>(size)`,
        otherwise `std::shared_ptr<char[]>(new char[size])`
    is called.
@redboltz
Owner Author

I believe that the PR is ready to merge.

@jonesmz
Contributor

jonesmz commented Aug 18, 2019

Would you consider closing this PR and making a new one? Then I could review everything thoroughly without being confused by old comments.

@redboltz
Owner Author

Yes, I just created #339.

@redboltz redboltz closed this Aug 18, 2019
@redboltz redboltz deleted the impl_320 branch September 3, 2019 04:02