by Eoin Cunning and Nick McDaid
- Wrote a testing framework to simulate a realtime kdb market data system and measure latency within the system.
- Benchmarked DRAM vs Optane memory when processing US equity data for one of the busiest days in 2020
- Wrote a utility script to easily move in-memory data between DRAM and Optane memory
- Found that:
- Optane is a viable solution to augment the realtime capacity of a kdb system
- Optane allows better utilisation of existing hardware where DRAM was previously a limiting factor
- Background
- Filesystem backed memory
- Testing Framework
- Findings
- Discussion
- Test 1 - Optane RDBs can run using significantly less memory with correct settings
- Test 2 - We can run many more appDirect RDBs on a single server than DRAM RDBs
- Queries take longer so make sure extra services are worth it
- File system backed memory seems to be better suited to less volatile data storage
- Read versus write
- Middle ground
- Steps for further investigation or post could involve
- Conclusion
- About the authors
- Appendix
While there has been research published in the kdb community showing the effectiveness of Optane as an extremely fast disk, as yet there has been no published research on using Optane as a volatile memory source.
Traditionally the choke point of a market data platform has been the amount of DRAM available on a server, which determines how many RDBs can be run to serve queries for today's data. This blog:
- Looks at whether Optane mounted in AppDirect mode can be used to help solve this problem
- Provides useful utilities to move data in and out of Optane memory
- Documents the observed performance
It is assumed the reader already has some knowledge of kdb and Optane. This paper contains some details on how to use Optane but a lot more detail is available here on the Kx wiki. A good tech talk is also available from AquaQ.
This paper was produced in partnership with Intel, so focuses on Optane memory, but the concept is interchangeable with other filesystem backed memory sources.
Optane memory is accessible in kdb by use of the -m command line option and the .m namespace. To move data into Optane simply define it in the .m namespace. No other code changes are required. This can be accomplished by using the following two functions from the mutil.q script.
// ex .mutil.colsToDotM[`trade;`price`size]
.mutil.colsToDotM:{[t;cs]
/ ensure cs is a list
cs,:();
/ pull out the columns to move and assign them in the .m namespace
(` sv `.m,t) set ?[t;();0b;cs!cs];
/ replace columns in table
t set get[t],'.m[t];
}
.mutil.tblToDotM:.mutil.colsToDotM[;()]
The only difference between running a DRAM and Optane RDB is starting it with the -m option pointing to an Optane mount and running the following command after loading the schema:
.mutil.tblToDotM each tables[];
Using .mutil.colsToDotM rather than .mutil.tblToDotM moves only a subset of columns into Optane, giving a hybrid model in which the most accessed columns stay in DRAM and the less accessed columns sit in Optane.
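For example, an RDB started against an Optane mount can keep its hot columns in DRAM and push the rest into Optane. A minimal sketch, assuming a standard quote schema; the column names here are illustrative:

/ RDB started as e.g.: q tick/r.q -m /mnt/pmem0
.mutil.tblToDotM`trade                          / move the whole trade table
.mutil.colsToDotM[`quote;`bsize`asize`mode`ex]  / move less-accessed columns; bid/ask stay in DRAM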
The framework was designed to mimic a typical kdb market data capture platform and to observe and record system latency. A command line flag tells the framework whether to start its RDB in DRAM or Optane. The capture section of the framework is largely based on the standard tick architecture.
The framework simulates the volume and velocity of market data processed on March 9th 2020. It prestresses the RDB with 65 million records, and then sends 48,000 updates per second for the next 30 minutes, split between trades and quotes in a 1:10 ratio.
A tickerplant (tp) consumed this feed and on a timer distributed the messages to an aggregation engine, which aggregated stats at a minute and daily level before publishing them back to the tickerplant. The RDB consumes both the raw and the aggregated data. On a separate configurable timer, the monitor process queried the RDB and measured all of the below, considered to be a "typical kdb query" workload:
- Max trade / quote latency (time between the tick being generated in the feed process and being accessible via a query to the RDB)
- Max aggregated trade / quote stats latency (as above, but also including the time for the aggregation engine to perform its calculations and publish the results)
- Number of records in each table
- Time taken to get the prevailing quote information for every trade for a single ticker (see the sketch after this list)
- Memory utilisation stats
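For illustration, the prevailing-quote measurement above is essentially an as-of join. A minimal sketch, assuming standard trade/quote schemas and an illustrative ticker:

/ time taken to get the prevailing quote for every trade in one ticker
\ts aj[`sym`time;select time,sym,price,size from trade where sym=`AAPL;select time,sym,bid,ask from quote]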
The above was considered a "stack". The testing consisted of running multiple stacks with their RDBs hosted in either DRAM or Optane, varying the tickerplant publish and query request frequencies to determine the maximum number of "typical kdb queries" the system could serve and the max latency in each setup.
A more detailed description of the architecture can be found in the appendix.
Test 1 runs two DRAM RDBs and two Optane RDBs. The volume of data being published is constant, but the TP and query frequencies are altered to compare latency and check whether the RDBs can respond to queries at the same rate they are received.
Tickerplant publish frequency: 50ms
Monitor query frequency: 300ms
Stat | DRAM | Optane | Comparison* |
---|---|---|---|
Max Latency | 0D00:00:00.036520548 | 0D00:00:00.059900059 | 0.6097 |
Bytes in Memory | 171893717664 | 5765504 | 29814 |
Average Query Time | 0D00:00:00.047099680 | 0D00:00:00.067810036 | 0.6946 |
Queries per minute | 200 | 200 | 1 |
Tickerplant publish frequency: 50ms
Monitor query frequency: 50ms
Stat | DRAM | Optane | Comparison* |
---|---|---|---|
Max Latency | 0D00:00:00.125182124 | 0D00:04:45.203363517 | 7.320505e-06 |
Bytes in Memory | 171893717648 | 5765488 | 29814 |
Average Query Time | 0D00:00:00.050620035 | 0D00:00:00.069780700 | 0.725416 |
Queries per minute | 880 | 600 | 0.6818182 |
Tickerplant publish frequency: 1000ms
Monitor query frequency: 50ms
Stat | DRAM | Optane | Comparison* |
---|---|---|---|
Max Latency | 0D00:00:01.054148932 | 0D00:00:01.037875019 | 0.984562 |
Bytes in Memory | 171893717392 | 5765488 | 29814.25 |
Average Query Time | 0D00:00:00.047581782 | 0D00:00:00.067816369 | 0.7016268 |
Queries per minute | 1000 | 1000 | 1 |
Tickerplant publish frequency: 50ms
Monitor query frequency: 20ms
Stat | DRAM | Optane | Comparison* |
---|---|---|---|
Max Latency | 0D00:01:17.585672136 | 0D00:14:29.525561499 | 0.08861793 |
Bytes in Memory | 171888597648 | 5765488 | 29813 |
Average Query Time | 0D00:00:00.050031983 | 0D00:00:00.070396229 | 0.7107196 |
Queries per minute | 1000 | 700 | 0.7 |
*Comparison is a factor where higher is better: a factor of 1 means identical performance; a factor of 2 means twice as fast, or half the memory used.
- Max latency is, as expected, consistently higher in Optane RDBs
- As query frequency increases, Optane and DRAM both see latency increase but Optane struggles more.
- By increasing the TP timer and allowing larger writes, the impact of slower reads/writes in Optane can be reduced. (Note the 1 second max latency is due to data being published on a 1s timer)
- The number of queries per minute supported in Optane was significantly lower than in DRAM in a number of situations, so Optane isn't a like-for-like replacement, but it does offer great potential to augment DRAM capacity
- Where queries per minute differ between DRAM and Optane, the Optane resource is saturated but the DRAM is not necessarily saturated, so these figures shouldn't be used as the upper limits of DRAM capacity
The trade-off in latency and max queries per minute is the compromise for a massive reduction in memory usage. This is expected given the IO overhead of Optane vs DRAM.
DRAM will excel when compared like for like with Optane; however, Optane is better suited to horizontal scaling. A typical Optane server could host up to 10 Optane RDBs, versus 2 standard RDBs on a typical DRAM-only server. Test 2 therefore compares two DRAM RDBs against 10 Optane RDBs, run with a one second TP timer.
Tickerplant publish frequency: 1000ms
Monitor query frequency: 50ms
Stat | 2 DRAM RDBs | 10 Optane RDBs | Comparison* |
---|---|---|---|
Max Latency | 00:01.042379141 | 00:01.040085588 | 1.000038 |
% Memory of DRAM used on server | 63.2989 | 2.25653 | 28.05143 |
Average Query Time | 00:00.046716868 | 00:00.067810036 | 0.6889374 |
Max queries per minute | 2314 | 7735 | 3.432697 |
*Comparison is a factor where higher is better: a factor of 1 means identical performance; a factor of 2 means twice as fast, or half the memory used.
Latency of the two systems is comparable. Percentage memory usage is much smaller for 10 appDirect RDBs than for 2 DRAM RDBs. Average query time is slightly slower in appDirect, but the stack can serve more queries in total because of the extra services.
We were able to run 5x more RDB services on the exact same hardware just by adding the Optane mounts (the mounts were 85% full at the end of the testing period), while using less than 3% of the memory of the box, increasing our capacity to serve RDB queries fivefold. We do, however, have to be wary of trying to access too much of the data in Optane at the same time: just as in a standard setup, we need to ensure there is enough memory available for calculations. Code that needs more memory than is available in DRAM can also use appDirect memory; see this section in the appendix for more details.
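Per-domain usage can be checked with \w, which (as noted in the release-notes excerpt in the appendix) reports only the current domain:

q)\w        / domain 0: DRAM heap stats
q)\d .m
q.m)\w      / domain 1: filesystem-backed heap stats
q.m)\d .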
Further exploration of the query performance of Optane RDBs is warranted: a raw pull from PMem can be 2-5x slower than from DRAM, so it is worth measuring how this would affect a real api, and whether the slowdown is fully offset by the ability to run more RDBs.
As well as storing our RDB data in Optane, we also tried moving the tables in the aggEngine into Optane. We found the aggEngine then started to struggle to keep up and we saw latency grow. We believe this is because the aggEngine code constantly reads from and writes to its cache of data; this constant looking up and re-writing of data does not appear to be an optimal use case for the technology, and it does not give a big memory saving either.
The general consensus is that PMem reads are much closer to DRAM performance than writes; however, we ran into more issues with reads when querying the RDB than with writing the data. We believe this is because a typical market data system writes small inserts to the Optane mounts on each update, whereas a query can attempt to read all of the data stored there, e.g. exec max time from quote. This is no different to how kdb behaves with DRAM; it is just more noticeable because queries are slightly slower.
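A minimal sketch of this asymmetry, with an illustrative schema for a quote table held in the .m namespace:

\d .m
quote:([]time:`timespan$();sym:`$();bid:`float$();ask:`float$())
\d .
/ each update writes only a handful of new rows to the Optane-backed columns
`.m.quote insert (10#.z.n;10?`4;10?100f;10?100f)
/ ...whereas a naive query reads back every row stored there
exec max time from .m.quote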
- enhancing the query testing: put a load balancer in front of the RDBs and compare replacing 1 DRAM RDB with 3 Optane RDBs; ensure latency is not an issue due to write contention, and confirm that the per-query slowdown is mitigated by the extra services available to run the queries
- using Optane for large static tables that take up a lot of room in memory but are queried too often to be written out to disk, e.g. flat-file tables loaded into HDBs or reference data tables held in gateways (see the sketch after this list)
- replay performance: we have seen write contention, so it would be interesting to explore how tickerplant log replay performs for PMem RDBs. This would stress test the write performance of the PMem mounts; updates may need to be batched and written to DRAM as an intermediary
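For the reference-data idea above, loading a static table while the current domain is .m is enough to place it in filesystem-backed memory. A hypothetical sketch; the file path and column types are illustrative:

\d .m
refdata:("SSFJ";enlist csv) 0: `:/data/refdata.csv  / columns land in domain 1
\d .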
- It is possible to run RDBs with data stored in Optane instead of DRAM. The code required for the change is outlined above.
- There is a performance trade-off when querying filesystem backed memory: a raw data pull can be 2-5x slower, but as you move to more aggregated, CPU-bound queries the difference becomes almost negligible.
- Simply converting a working RDB to Optane means the slower read speed will cause queries to take longer, which could result in the RDB delaying the processing of new messages from the tp.
- If an RDB is already seeing very high query volumes, it may not be advisable to move everything to Optane memory.
- A more realistic use case: if you have columns that are seldom queried, that subset of columns can be moved to the .m namespace, freeing up RAM for more RDBs.
- Not only is the same amount of memory cheaper in Optane than in DRAM, but the maximum size of an Optane module is significantly larger than that of a DRAM module, so the memory limit of a single server is significantly increased.
- While it isn't primarily marketed as a DRAM replacement technology, we found it was a very helpful addition in augmenting the volatile memory capacity of a server hosting realtime data.
- Reduced infrastructure cost: fewer servers and less DRAM are needed to support data processing and analytic workloads
Eoin Cunning is a kdb consultant, based in London, who has worked on several KX solutions projects and currently works on a market data system for a leading global market-making firm.
Nick McDaid is a kdb developer, based in London, working for a leading global market-making firm.
The notes below are from the kdb+ changelog entry of 2019.10.22, which introduced filesystem backed memory:
memory can be backed by a filesystem, allowing use of dax-enabled filesystems (e.g. AppDirect) as a non-persistent memory extension for kdb.
cmd line option -m path to use the filesystem path specified as a separate memory domain. This splits every thread's heap into two:
domain 0: regular anonymous memory (active and used for all allocs by default)
domain 1: filesystem-backed memory
.m namespace is reserved for objects in domain 1, however names from other namespaces can reference them too, e.g. a:.m.a:1 2 3
\d .m changes current domain to 1, causing it to be used by all further allocs. \d .anyotherns sets it back to 0
.m.x:x ensures the entirety of .m.x is in the domain 1, performing a deep copy of x as needed (objects of types 100-103h,112h are not copied and remain in domain 0)
lambdas defined in .m set current domain to 1 during execution. This will nest since other lambdas don't change domains:
q)\d .myns
q)g:{til x}
q)\d .m
q)w:{system"w"};f:{.myns.g x}
q)\d .
q)x:.m.f 1000000;.m.w` / x allocated in domain 1
-120!x returns x's domain (currently 0 or 1), e.g. 0 1~-120!'(1 2 3;.m.x:1 2 3)
\w returns memory info for the current domain only:
q)value each ("\\d .m";"\\w";"\\d .";"\\w")
-w limit (M1/m2) is no longer thread-local, but domain-local; cmdline -w, \w set limit for domain 0
mapped is a single global counter, same in every thread's \w
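For example, a process can be started with a tight DRAM cap while the .m domain draws on the full Optane mount, whose size bounds domain 1. An illustrative invocation:

# cap the domain 0 (DRAM) heap at ~8GB; .m allocations come from the mount
q -m /mnt/pmem0 -w 8000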
[root@clx4 ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 1
Core(s) per socket: 28
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Genuine Intel(R) CPU 0000%@
Stepping: 6
CPU MHz: 2538.054
CPU max MHz: 4000.0000
CPU min MHz: 1000.0000
BogoMIPS: 4400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 39424K
NUMA node0 CPU(s): 0-27
NUMA node1 CPU(s): 28-55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
[root@clx4 ~]# free -h
total used free shared buff/cache available
Mem: 374Gi 5.0Gi 352Gi 29Mi 17Gi 367Gi
Swap: 4.0Gi 0.0Ki 4.0Gi
[root@clx4 ~]#
[root@clx4 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 188G 0 188G 0% /dev
tmpfs 188G 0 188G 0% /dev/shm
tmpfs 188G 27M 188G 1% /run
tmpfs 188G 0 188G 0% /sys/fs/cgroup
/dev/mapper/cl-root 50G 8.0G 43G 16% /
/dev/mapper/cl-home 690G 84G 607G 13% /home
/dev/sda1 976M 257M 653M 29% /boot
tmpfs 38G 20K 38G 1% /run/user/42
tmpfs 38G 0 38G 0% /run/user/1000
/dev/pmem0 732G 73M 695G 1% /mnt/pmem0
/dev/pmem1 732G 73M 695G 1% /mnt/pmem1
[root@clx4 ~]#
[root@clx4 ~]# ipmctl show -topology
DimmID | MemoryType | Capacity | PhysicalID| DeviceLocator
================================================================================
0x0001 | Logical Non-Volatile Device | 126.375 GiB | 0x0026 | CPU1_DIMM_A2
0x0011 | Logical Non-Volatile Device | 126.375 GiB | 0x0028 | CPU1_DIMM_B2
0x0021 | Logical Non-Volatile Device | 126.375 GiB | 0x002a | CPU1_DIMM_C2
0x0101 | Logical Non-Volatile Device | 126.375 GiB | 0x002c | CPU1_DIMM_D2
0x0111 | Logical Non-Volatile Device | 126.375 GiB | 0x002e | CPU1_DIMM_E2
0x0121 | Logical Non-Volatile Device | 126.375 GiB | 0x0030 | CPU1_DIMM_F2
0x1001 | Logical Non-Volatile Device | 126.375 GiB | 0x0032 | CPU2_DIMM_A2
0x1011 | Logical Non-Volatile Device | 126.375 GiB | 0x0034 | CPU2_DIMM_B2
0x1021 | Logical Non-Volatile Device | 126.375 GiB | 0x0036 | CPU2_DIMM_C2
0x1101 | Logical Non-Volatile Device | 126.375 GiB | 0x0038 | CPU2_DIMM_D2
0x1111 | Logical Non-Volatile Device | 126.375 GiB | 0x003a | CPU2_DIMM_E2
0x1121 | Logical Non-Volatile Device | 126.375 GiB | 0x003c | CPU2_DIMM_F2
N/A | DDR4 | 32.000 GiB | 0x0025 | CPU1_DIMM_A1
N/A | DDR4 | 32.000 GiB | 0x0027 | CPU1_DIMM_B1
N/A | DDR4 | 32.000 GiB | 0x0029 | CPU1_DIMM_C1
N/A | DDR4 | 32.000 GiB | 0x002b | CPU1_DIMM_D1
N/A | DDR4 | 32.000 GiB | 0x002d | CPU1_DIMM_E1
N/A | DDR4 | 32.000 GiB | 0x002f | CPU1_DIMM_F1
N/A | DDR4 | 32.000 GiB | 0x0031 | CPU2_DIMM_A1
N/A | DDR4 | 32.000 GiB | 0x0033 | CPU2_DIMM_B1
N/A | DDR4 | 32.000 GiB | 0x0035 | CPU2_DIMM_C1
N/A | DDR4 | 32.000 GiB | 0x0037 | CPU2_DIMM_D1
N/A | DDR4 | 32.000 GiB | 0x0039 | CPU2_DIMM_E1
N/A | DDR4 | 32.000 GiB | 0x003b | CPU2_DIMM_F1
[root@clx4 ~]#
[root@clx4 ~]# ipmctl show -memoryresources
MemoryType | DDR | PMemModule | Total
==========================================================
Volatile | 381.500 GiB | 0.000 GiB | 381.500 GiB
AppDirect | - | 1512.000 GiB | 1512.000 GiB
Cache | 0.000 GiB | - | 0.000 GiB
Inaccessible | 2.500 GiB | 5.066 GiB | 7.566 GiB
Physical | 384.000 GiB | 1517.066 GiB | 1901.066 GiB
# set all Optane to appDirect either via the command below or BIOS settings (requires a reboot)
ipmctl create -goal persistentmemorytype=appdirect
# create fsdax namespaces from the persistent memory regions (2M alignment suits the page size)
ndctl create-namespace --mode=fsdax --region=0 --align=2M
ndctl create-namespace --mode=fsdax --region=1 --align=2M
# make a file system
mkfs -t xfs /dev/pmem0
mkfs -t xfs /dev/pmem1
# create mount points
mkdir /mnt/pmem0
mkdir /mnt/pmem1
# mount in direct access mode
mount -o dax /dev/pmem0 /mnt/pmem0/
mount -o dax /dev/pmem1 /mnt/pmem1/
# make the mounts writeable
chmod 777 /mnt/pmem0
chmod 777 /mnt/pmem1
The standard recommendation when using NUMA is to interleave across all nodes:
numactl --interleave=all q
but we found slightly better performance in aligning the NUMA nodes with the persistent memory namespaces:
numactl -N 0 -m 0 q -q -m /mnt/pmem0/
numactl -N 1 -m 1 q -q -m /mnt/pmem1/
Data arrives from a feed. Normally this would be a feed handler publishing data from exchanges or vendors; for consistent testing we simulated the feed with another q process, which generates random data and publishes it downstream. For the rate of data to send we looked at the largest day of market activity in 2020, whose last half hour of trading before the close consisted of 80,000,000 quote msgs and 15,000,000 trades. Code can be viewed here
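A minimal sketch of such a feed, assuming the standard tick .u.upd interface; the schemas, port and batch sizes here are illustrative rather than the framework's actual code:

h:hopen `::5010                       / tickerplant handle
syms:100?`4                           / random symbol universe
pub:{[n]
  / n quotes and n div 10 trades per batch, roughly the 1:10 ratio
  neg[h](".u.upd";`quote;(n#.z.n;n?syms;n?100f;n?100f;n?1000;n?1000));
  m:n div 10;
  neg[h](".u.upd";`trade;(m#.z.n;m?syms;m?100f;m?1000));
 }
.z.ts:{pub 2200}                      / ~48,000 updates per second on a 50ms timer
\t 50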
Standard kdb tp running in batch mode. Code can be viewed here
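An illustrative invocation: with a non-zero timer, tick.q buffers updates and flushes them on each tick rather than publishing immediately:

# standard tick.q flushing buffered updates every 50ms
q tick.q sym . -p 5010 -t 50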
This process is the main departure from the standard tick setup. It subscribes to the standard trade and quote tables and calculates running daily and minute-level stats for all symbols. These aggregated tables are then published to the RDB, from which they can be queried. The process was added in order to include a more complex event processor alongside the standard RDB: it constantly has to read from and write to memory, whereas an RDB generally only writes, appending data, and only reads for queries. Code can be viewed here
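A minimal sketch of that read-modify-write cycle for minute-level trade stats; the schema and aggregation logic are illustrative, not the project's actual code:

tradeStatsMin:([sym:`$();minute:`minute$()] high:`float$();low:`float$();vol:`long$())
updTrade:{[x]
  / aggregate the incoming batch of trades
  new:select high:max price,low:min price,vol:sum size by sym,minute:`minute$time from x;
  / read the existing cache, merge, and write back: it is this constant
  / re-reading and re-writing that strained the Optane-backed version
  `tradeStatsMin set select max high,min low,sum vol by sym,minute from (0!tradeStatsMin),0!new;
 }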
A standard RDB subscribes to the tables from the tp. We also added an option to prestress the memory before our half hour of testing: again looking at the market data of 2020.03.09, there were 650,000,000 quote msgs and 85,000,000 trades by 15:30, so we insert these volumes into the RDB at start up. This aims to ensure that the RAM is already somewhat saturated. Full code available here
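An illustrative prestress, inserting rows in chunks at start up (schema and chunk size assumed; the framework's exact code is linked above):

quote:([]time:`timespan$();sym:`$();bid:`float$();ask:`float$();bsize:`long$();asize:`long$())
mkQuotes:{[n] (n#.z.n;n?`4;n?100f;n?100f;n?1000;n?1000)}
do[650;`quote insert mkQuotes 1000000]    / ~650m quote rows before the test window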
The monitor process connects to the RDB and collects performance stats on a timer. The main measurements are the latency of the quote table, which tracks whether messages are queuing at the tp; the latency of the quote stats table, which falling behind indicates an issue in the aggEngine; and the query time, which measures how long typical RDB queries, e.g. an aj, take to run. On start up this process also kicks off the feed, once it has successfully connected to the RDB, to begin the testing run. Once endTime has been reached, the collected stats are aggregated and written to a csv file. Again, full code is available here
We also need to be aware that it is not just data stored in tables that uses memory. The queries and functions we run also allocate memory, and which domain they use depends on where the function was defined.
Here, for example, we compare the same functionality defined in both the root and the .m namespace.
q)n:1000000
q)tab:([]t:n?.z.n;s:n?`4;n?100f;n?100)
q)f:{`s`t xasc x}
q)\d .m
q.m)f:{`s`t xasc x}
q.m)\d .
q)\ts f tab
133 41944096
q)\ts .m.f tab
351 512
q)f[tab]~.m.f[tab]
1b
A full copy of the data is required for the sort listed above. When the function defined in the .m namespace is run, that copy is allocated in the filesystem backed memory. While this results in a drop-off in performance, it provides a huge saving of DRAM. This could be very useful where users need to run multiple concurrent back-loaders, or are struggling to perform end-of-day sorts with all the data in memory.
We did explore this functionality, and some utility functions are available in the mutil.q script to redefine functions and namespaces so that they allocate from filesystem backed memory instead of DRAM, but exploring this further was deemed out of scope for this post.
We do, however, have to be aware of it: if we are not defining code in the .m namespace, we are constrained by the amount of DRAM we have. If more data is stored in appDirect than there is DRAM available, and a query copies a large portion of that data, we could run out of DRAM.
So, if existing api/query code is left as is, there will be an initial overhead for pulling the data from appDirect instead of DRAM. After that, once kdb assigns data to new memory, it is allocated in DRAM, and everything from that point on should be as fast as it was in a DRAM RDB.
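A small sketch of this behaviour, assuming a quote table whose columns live in .m (schema illustrative):

n:1000000
\d .m
quote:([]time:n#.z.n;sym:n?`4;bid:n?100f;ask:n?100f)
\d .
r:select sym,bid from .m.quote where bid>50    / query executed in domain 0
-120!r`bid                                     / 0: the result lives in DRAM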