
Kernel Panic when using OpenCAS on CentOS 7.8 #454

Closed
darktorana opened this issue Jun 30, 2020 · 7 comments · Fixed by #519
@darktorana

Environment:

OS: CentOS 7.8
Kernel: 3.10.0-1127.8.2.el7.x86_64
OpenCAS Version: 20.06.00.00000603

Issue:

After operating normally for almost two weeks (11 days to be precise), a server running OpenCAS hit a kernel panic, and the crash logs point towards OpenCAS.
We have another server configured the same way that has been running fine for over a month, so we don't believe this is a common issue.

Crash Information:

The end of the crash log shows the following:

crash> log
--------Lines Removed--------
[949957.219356] ------------[ cut here ]------------
[949957.219387] kernel BUG at /opt/open-cas-linux/modules/cas_cache/src/ocf/utils/utils_cache_line.c:29!
[949957.219424] invalid opcode: 0000 [#1] SMP
[949957.219446] Modules linked in: fuse dm_mod snapapi26(POE) ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_filter nf_log_ipv4 nf_log_common xt_set ip_set_hash_net ip_set nfnetlink nf_nat_ftp xt_REDIRECT nf_nat_redirect xt_conntrack nf_conntrack_ftp xt_LOG xt_limit xt_multiport iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 ip6table_mangle ip6table_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6_tables iptable_mangle iptable_raw ipt_REJECT nf_reject_ipv4 iptable_filter ext4 mbcache jbd2 vfat fat loop dell_smbios dell_wmi_descriptor dcdbas skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel ipmi_ssif kvm cas_cache(OE) irqbypass crc32_pclmul ghash_clmulni_intel cas_disk(OE) aesni_intel lrw gf128mul glue_helper ablk_helper cryptd wdat_wdt pcspkr
[949957.219835]  sg joydev lpc_ich i2c_i801 mei_me mei wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci i40e drm crct10dif_pclmul crct10dif_common crc32c_intel libahci megaraid_sas igb libata dca i2c_algo_bit ptp pps_core drm_panel_orientation_quirks nfit libnvdimm
[949957.220050] CPU: 16 PID: 1189 Comm: cas_io_cache1_1 Kdump: loaded Tainted: P           OEL ------------   3.10.0-1127.8.2.el7.x86_64 #1
[949957.220097] Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.5.4 01/14/2020
[949957.220128] task: ffff9758b2773150 ti: ffff97389e340000 task.ti: ffff97389e340000
[949957.220158] RIP: 0010:[<ffffffffc0d25abc>]  [<ffffffffc0d25abc>] __set_cache_line_invalid+0x15c/0x160 [cas_cache]
[949957.220221] RSP: 0018:ffff97389e343ce0  EFLAGS: 00010212
[949957.220243] RAX: 00000000000000ff RBX: ffffb32d9b7f1000 RCX: 000000000699dc9f
[949957.220272] RDX: 0000000000000007 RSI: 0000000000000000 RDI: ffffb32d9b7f1000
[949957.220301] RBP: ffff97389e343d18 R08: 0000000000001000 R09: 00000000000000ff
[949957.220329] R10: 0000000000000007 R11: 00000000054b64ba R12: 000000000699dc9f
[949957.220358] R13: 0000000000001000 R14: 0000000000000007 R15: ffffb32d9b7f1000
[949957.220387] FS:  0000000000000000(0000) GS:ffff9738bf000000(0000) knlGS:0000000000000000
[949957.220420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[949957.220444] CR2: 00007f038824c7cc CR3: 000000205653c000 CR4: 00000000007607e0
[949957.220479] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[949957.220508] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[949957.220538] PKRU: 00000000
[949957.220551] Call Trace:
[949957.220584]  [<ffffffffc0d166c4>] ? ocf_metadata_hash_get_core_and_part_id+0x64/0xd0 [cas_cache]
[949957.220631]  [<ffffffffc0d25c13>] set_cache_line_invalid_no_flush+0x63/0x90 [cas_cache]
[949957.220674]  [<ffffffffc0d0ed4f>] ocf_engine_map+0x2df/0x340 [cas_cache]
[949957.220712]  [<ffffffffc0d0f298>] ocf_engine_prepare_clines+0x168/0x170 [cas_cache]
[949957.220753]  [<ffffffffc0d1127e>] ocf_read_generic+0x5e/0x110 [cas_cache]
[949957.220793]  [<ffffffffc0d1ef89>] ocf_io_handle+0x29/0x50 [cas_cache]
[949957.220831]  [<ffffffffc0d1eff5>] ocf_queue_run_single+0x45/0x50 [cas_cache]
[949957.220869]  [<ffffffffc0d1f028>] ocf_queue_run+0x28/0x50 [cas_cache]
[949957.220902]  [<ffffffffc0cfa81a>] _cas_io_queue_thread+0xfa/0x150 [cas_cache]
[949957.220935]  [<ffffffff8dac7780>] ? wake_up_atomic_t+0x30/0x30
[949957.222102]  [<ffffffffc0cfa720>] ? cas_blk_identify_type_atomic+0x10/0x10 [cas_cache]
[949957.223270]  [<ffffffff8dac6691>] kthread+0xd1/0xe0
[949957.224430]  [<ffffffff8dac65c0>] ? insert_kthread_work+0x40/0x40
[949957.225595]  [<ffffffff8e192d37>] ret_from_fork_nospec_begin+0x21/0x21
[949957.226776]  [<ffffffff8dac65c0>] ? insert_kthread_work+0x40/0x40
[949957.227957] Code: 01 00 e8 f8 09 07 cd 4c 89 ef c6 07 00 0f 1f 40 00 44 89 e6 48 89 df e8 73 70 ff ff 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b 0f 1f 44 00 00 55 48 89 e5 41 57 45 89 c7 41 56 41
[949957.230464] RIP  [<ffffffffc0d25abc>] __set_cache_line_invalid+0x15c/0x160 [cas_cache]
[949957.231678]  RSP <ffff97389e343ce0>

Output of bt:

crash> bt
PID: 1189   TASK: ffff9758b2773150  CPU: 16  COMMAND: "cas_io_cache1_1"
 #0 [ffff97389e343990] machine_kexec at ffffffff8da66044
 #1 [ffff97389e3439f0] __crash_kexec at ffffffff8db22ea2
 #2 [ffff97389e343ac0] crash_kexec at ffffffff8db22f90
 #3 [ffff97389e343ad8] oops_end at ffffffff8e18a798
 #4 [ffff97389e343b00] die at ffffffff8da30a7b
 #5 [ffff97389e343b30] do_trap at ffffffff8e189ee0
 #6 [ffff97389e343b80] do_invalid_op at ffffffff8da2d2a4
 #7 [ffff97389e343c30] invalid_op at ffffffff8e19622e
    [exception RIP: __set_cache_line_invalid+348]
    RIP: ffffffffc0d25abc  RSP: ffff97389e343ce0  RFLAGS: 00010212
    RAX: 00000000000000ff  RBX: ffffb32d9b7f1000  RCX: 000000000699dc9f
    RDX: 0000000000000007  RSI: 0000000000000000  RDI: ffffb32d9b7f1000
    RBP: ffff97389e343d18   R8: 0000000000001000   R9: 00000000000000ff
    R10: 0000000000000007  R11: 00000000054b64ba  R12: 000000000699dc9f
    R13: 0000000000001000  R14: 0000000000000007  R15: ffffb32d9b7f1000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff97389e343ce0] ocf_metadata_hash_get_core_and_part_id at ffffffffc0d166c4 [cas_cache]
 #9 [ffff97389e343d20] set_cache_line_invalid_no_flush at ffffffffc0d25c13 [cas_cache]
#10 [ffff97389e343d60] ocf_engine_map at ffffffffc0d0ed4f [cas_cache]
#11 [ffff97389e343dc0] ocf_engine_prepare_clines at ffffffffc0d0f298 [cas_cache]
#12 [ffff97389e343df0] ocf_read_generic at ffffffffc0d1127e [cas_cache]
#13 [ffff97389e343e10] ocf_io_handle at ffffffffc0d1ef89 [cas_cache]
#14 [ffff97389e343e20] ocf_queue_run_single at ffffffffc0d1eff5 [cas_cache]
#15 [ffff97389e343e30] ocf_queue_run at ffffffffc0d1f028 [cas_cache]
#16 [ffff97389e343e50] _cas_io_queue_thread at ffffffffc0cfa81a [cas_cache]
#17 [ffff97389e343ec8] kthread at ffffffff8dac6691
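
For reference, the log and bt output above were collected with the crash utility against the kdump vmcore, along these lines (the vmlinux and vmcore paths are illustrative, not the real ones):

crash /usr/lib/debug/lib/modules/3.10.0-1127.8.2.el7.x86_64/vmlinux \
      /var/crash/<host>-<timestamp>/vmcore
crash> log    # kernel ring buffer captured in the dump (shown above)
crash> bt     # backtrace of the panicking task (shown above)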

Reproduce Steps:

Unfortunately this has not happened again, so we cannot provide more information; it might just be a one-off, but we were hoping to at least find out why it occurred.

Extra Information:

If there is anything else I can get you, or any information that might shed light on why this crashed, please don't hesitate to ask. We are using this in a production environment (after testing, of course) and are keen to find out what caused the panic and whether something can be done to prevent it.

@mmichal10
Contributor

mmichal10 commented Jul 1, 2020

Hi @darktorana,

Thank you for the report. Could you please provide more information about the CAS configuration you are using (one way to read these settings back with casadm is sketched after the list)? We are interested in:

  • cache mode,
  • cache line size,
  • cleaning policy,
  • sequential cutoff policy,
  • promotion policy
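
Something along these lines should print them, assuming cache id 1 and core id 1 (the exact --get-param parameter names may differ between versions; casadm -H lists them):

casadm -P -i 1                     # cache statistics, including cache mode and cache line size
casadm -G -n seq-cutoff -i 1 -j 1  # sequential cutoff settings for core 1
casadm -G -n cleaning -i 1         # current cleaning policy
casadm -G -n promotion -i 1        # current promotion policy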

@darktorana
Author

Hey @mmichal10,

To answer your questions:
Cache Mode: wb
Cache Line Size: 4 kiB
Cleaning Policy: alru
Sequential Cutoff Policy: I'm not sure where to find this, but I haven't changed it, so it should be the default ('ocf_seq_cutoff_policy_default')
Promotion Policy: always

I'm not sure if it will help, but the 2 TB cache sits in front of about 5 TB of data, although the RAID itself is 10 TB (to be expanded to 20 TB).
The server has 256 GB of RAM and 64 cores.

If there's anything else you need please just ask.
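
For completeness, a cache matching the settings above would be started roughly like this (device paths and cache id are placeholders, not the real ones):

casadm -S -i 1 -d /dev/sdc -c wb -x 4   # start cache 1 in write-back mode with 4 KiB cache lines
casadm -A -i 1 -d /dev/sdb              # attach the backing RAID volume as the core device
# alru cleaning and the "always" promotion policy are the defaults, so no extra
# parameter changes are needed to match the configuration reported here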

@darktorana
Author

Hi guys,

Is there anything else I can provide to help track down the root cause of this issue?

@mmichal10
Contributor

@darktorana

Could you tell us something about the workload? We will try to reproduce it in our test environment.

Regarding using CAS in a production environment: running master in production is not recommended, since it is not validated as thoroughly as a release version. CAS v20.3 is the most stable, so please consider using it.
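
A rough sketch of moving to the release build (steps follow the project README; double-check the exact tag name for the v20.3 release):

git clone https://github.com/Open-CAS/open-cas-linux
cd open-cas-linux
git checkout v20.3            # release tag name assumed; `git tag` lists the real ones
git submodule update --init
./configure
make
sudo make install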

@darktorana
Author

darktorana commented Jul 17, 2020

The server is a mail server with about 5,000 accounts on it. It is fairly high IO, though OpenCAS has massively taken the load off the HDDs underneath.

The OpenCAS cache is a 1 TB SSD RAID1 sitting above a 22 TB RAID10 HDD partition consisting of six 8 TB HDDs.

The casadm -P output was attached as a screenshot (not reproduced here). The last box (the error counters) showed all zeros and was cut off from the screenshot.

Some iostat output, limited to vdb (the 22 TB core disk) and vdc (the 1 TB cache disk), was also attached as a screenshot (not reproduced here).
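
For reference, numbers like the ones in that screenshot can be reproduced with plain iostat (the 5-second interval is just an example):

iostat -xm vdb vdc 5   # extended stats in MB/s for the core and cache disks, refreshed every 5 seconds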

If there is anything else I can provide, any other information or logs/stats, please let me know!

@mmichal10
Contributor

Which mail server is it? A workload as similar as possible to the one your server is handling could help us reproduce your issue. If it is free software, we could try to set it up on our machine.

@darktorana
Author

The software is called Axigen and it does have a free version. You can download it here:
https://www.axigen.com/mail-server/download/

If you need any help setting it up let me know.
