setfacl on 0.6.5.8 causes hang/load spike #5340

Closed
Lalufu opened this issue Oct 26, 2016 · 27 comments
Labels
Type: Performance (Performance improvement or performance problem)

Comments

@Lalufu
Contributor

Lalufu commented Oct 26, 2016

I have a script that runs every minute, running this code (paths/usernames changed to protect the innocent):

for i in /tank/a; do
    find "${i}" -type d -user user1 -print0 | xargs -r -0 setfacl \
        -m 'user::rwx' \
        -m 'user:user1:rwx' \
        -m 'user:user2:rwx' \
        -m 'group::rx' \
        -m 'mask::rwx' \
        -m 'other::rx' \
        -m 'default:user::rwx' \
        -m 'default:user:user1:rwx' \
        -m 'default:user:user2:rwx' \
        -m 'default:group::rx' \
        -m 'default:mask::rwx' \
        -m 'default:other::rx'
    find "${i}" -type f -user user1 -print0 | xargs -r -0 setfacl \
        -m 'user::rw' \
        -m 'user:user1:rw' \
        -m 'user:user2:rw' \
        -m 'group::r' \
        -m 'mask::rw' \
        -m 'other::r'

    /usr/sbin/restorecon -Fr "${i}"
done

This finds around 4800 directories and 63000 files; /tank/a is on a ZFS pool.

On 0.6.5.7/4.6.5-200.fc23.x86_64 this caused no issues.
On 0.6.5.8/4.7.9-100.fc23.x86_64 running this script causes the machine to become unresponsive for a few seconds, and the load to spike to around 20 or 30, from a base load of <1.

File system properties:

NAME                    PROPERTY               VALUE                    SOURCE
tank/a  type                   filesystem               -
tank/a  creation               Thu Oct  3 12:02 2013    -
tank/a  used                   368G                     -
tank/a  available              1.25T                    -
tank/a  referenced             368G                     -
tank/a  compressratio          1.00x                    -
tank/a  mounted                yes                      -
tank/a  quota                  none                     default
tank/a  reservation            none                     default
tank/a  recordsize             128K                     default
tank/a  mountpoint             /tank/a  inherited from tank
tank/a  sharenfs               rw=10.200.254.1,ro       inherited from tank
tank/a  checksum               on                       default
tank/a  compression            off                      inherited from tank
tank/a  atime                  off                      inherited from tank
tank/a  devices                off                      inherited from tank
tank/a  exec                   on                       default
tank/a  setuid                 off                      inherited from tank
tank/a  readonly               off                      default
tank/a  zoned                  off                      default
tank/a  snapdir                hidden                   default
tank/a  aclinherit             restricted               default
tank/a  canmount               on                       default
tank/a  xattr                  sa                       inherited from tank
tank/a  copies                 1                        default
tank/a  version                5                        -
tank/a  utf8only               off                      -
tank/a  normalization          none                     -
tank/a  casesensitivity        sensitive                -
tank/a  vscan                  off                      default
tank/a  nbmand                 off                      default
tank/a  sharesmb               off                      default
tank/a  refquota               none                     default
tank/a  refreservation         none                     default
tank/a  primarycache           all                      default
tank/a  secondarycache         metadata                 inherited from tank
tank/a  usedbysnapshots        368K                     -
tank/a  usedbydataset          368G                     -
tank/a  usedbychildren         0                        -
tank/a  usedbyrefreservation   0                        -
tank/a  logbias                latency                  default
tank/a  dedup                  off                      default
tank/a  mlslabel               none                     default
tank/a  sync                   standard                 default
tank/a  refcompressratio       1.00x                    -
tank/a  written                0                        -
tank/a  logicalused            365G                     -
tank/a  logicalreferenced      365G                     -
tank/a  filesystem_limit       none                     default
tank/a  snapshot_limit         none                     default
tank/a  filesystem_count       none                     default
tank/a  snapshot_count         none                     default
tank/a  snapdev                hidden                   default
tank/a  acltype                posixacl                 local
tank/a  context                none                     default
tank/a  fscontext              none                     default
tank/a  defcontext             none                     default
tank/a  rootcontext            none                     default
tank/a  relatime               off                      default
tank/a  redundant_metadata     all                      default
tank/a  overlay                off                      default
tank/a  com.sun:auto-snapshot  true                     inherited from tank

Pool properties:

NAME  PROPERTY                    VALUE                       SOURCE
tank  size                        16.2T                       -
tank  capacity                    83%                         -
tank  altroot                     -                           default
tank  health                      ONLINE                      -
tank  guid                        13759943618182453813        default
tank  version                     -                           default
tank  bootfs                      -                           default
tank  delegation                  on                          default
tank  autoreplace                 off                         default
tank  cachefile                   -                           default
tank  failmode                    wait                        default
tank  listsnapshots               off                         default
tank  autoexpand                  off                         default
tank  dedupditto                  0                           default
tank  dedupratio                  1.00x                       -
tank  free                        2.65T                       -
tank  allocated                   13.6T                       -
tank  readonly                    off                         -
tank  ashift                      12                          local
tank  comment                     -                           default
tank  expandsize                  -                           -
tank  freeing                     0                           default
tank  fragmentation               -                           -
tank  leaked                      0                           default
tank  feature@async_destroy       enabled                     local
tank  feature@empty_bpobj         active                      local
tank  feature@lz4_compress        active                      local
tank  feature@spacemap_histogram  disabled                    local
tank  feature@enabled_txg         disabled                    local
tank  feature@hole_birth          disabled                    local
tank  feature@extensible_dataset  disabled                    local
tank  feature@embedded_data       disabled                    local
tank  feature@bookmarks           disabled                    local
tank  feature@filesystem_limits   disabled                    local
tank  feature@large_blocks        disabled                    local
@behlendorf added the "Type: Performance" and "Type: Regression" labels on Oct 26, 2016
@behlendorf
Contributor

The only ACL-related change in 0.6.5.8 looks like #4922. If possible, could you revert to the previous kernel while keeping 0.6.5.8 to try and narrow down where the problem was introduced?

@dweeezil
Contributor

I did run a quick test of this with a stock upstream stable 4.7.9 kernel and a stock 0.6.5.8 build of SPL and ZFS on a filesystem with 1111 directories and 4.8M files, and there were no load problems. @Lalufu You should at least run ps, top, or perf to see where the excessive load might be coming from. There's a slim chance that db3f5ed might be coming into play, particularly on a system without a lot of memory.

NOTE: My "quick test" only did the "setfacl"s and not the "restorecon" (no SELinux configured).

@tuxoko
Contributor

tuxoko commented Oct 26, 2016

This is already fixed in #4946

Oh, never mind, the patch is in 0.6.5.8.
@Lalufu
Does dmesg show anything?

@Lalufu
Contributor Author

Lalufu commented Oct 27, 2016

There's nothing in dmesg except these:

[ 4218.122953] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[ 6581.710623] perf: interrupt took too long (3132 > 3127), lowering kernel.perf_event_max_sample_rate to 63000
[31979.651510] perf: interrupt took too long (3948 > 3915), lowering kernel.perf_event_max_sample_rate to 50000

Now I do remember getting those before as well, so it's not totally new.

Some further testing shows that just running the setfacl is not sufficient to trigger this; it seems some other writes/reads are necessary as well (cache flushes? No idea).

I've tried running a top -d 1 -b in the background to capture the state of the machine. The output of that command also stops during the pauses, but maybe it's useful anyway (I've removed some processes from the output list that are not relevant to this).
top.txt.zip

The first load jump is between 08:14:41 and 08:14:54. The script was started at :00 by cron.

@Lalufu
Contributor Author

Lalufu commented Oct 27, 2016

On Wed, Oct 26, 2016 at 11:48:05AM -0700, Tim Chase wrote:

I did run a quick test of this with a stock upstream stable 4.7.9
kernel and a stock 0.6.5.8 build of SPL and ZFS on a filesystem with
1111 directories and 4.8M files and there were no load problems.
@Lalufu You should at least run ps, top or perf to see where the
excessive load might be coming from. There's a slim chance that
db3f5ed might be coming into play; particularly on a system without a
lot of memory.

If you can suggest a useful command to run in the background I can
certainly try to capture the output. It'll have to be this way
as I cannot do anything on the box during the few seconds where it
locks up.

The machine in question has 16GB of RAM, with no special settings to
ZFS variables.

@dweeezil
Contributor

@Lalufu The %si in top spiked along with the load average. I suppose you could sample /proc/softirqs while this is happening and look for anything abnormal.
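
For reference, a minimal way to capture those samples (a sketch; the one-second interval and the log file name are arbitrary):

    # Sample /proc/softirqs once per second with a timestamp so the
    # per-CPU counters can be diffed around an incident.
    while sleep 1; do
        date '+%F %T'
        cat /proc/softirqs
    done >> softirqs.log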

@Lalufu
Contributor Author

Lalufu commented Oct 28, 2016

I've captured top and the softirqs during an incident:

top2.zip

I've attached a modified version of the softirqs log as well, which contains deltas to the previous iteration:
softirqs-delta.txt

According to this, fewer softirqs happened in the 15-second "window" than normally happen in a one-second window. I don't know what to make of this.

@Lalufu
Contributor Author

Lalufu commented Oct 28, 2016

I've been looking at perf top, and two things happen during/after a system pause:

a) I'm getting a warning window from perf top telling me that events are being lost
b) These two entries appear at the top of the list:

19.92% [kernel] [k] task_expire
13.26% [kernel] [k] queued_spin_lock_slowpath

I've been trying to figure out how to make perf write a log of events that can be inspected later, but I'm not having much luck so far. All the examples I'm finding for this assume one wants to perf a single command, and not the system in general.
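
For reference, perf can record system-wide rather than a single command; a sketch (the duration and output file name are arbitrary):

    # Record all CPUs with call graphs for 30 seconds, then inspect the
    # saved profile afterwards.
    perf record -a -g -o perf.data sleep 30
    perf report -i perf.data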

@dweeezil
Contributor

@Lalufu I think you've got enough information. The problem seems to be a stampede of POSIX ACL expirations. I'm not sure whether posix_acl_release() (actually, the underlying RCU function) is available to non-GPL modules under a 4.6.5 kernel, so I suppose that could be the difference. I'm thinking we ought to add some randomization to the 60-second interval in zpl_posix_acl_release(). Also, the use of the system taskq, which has at least 64 threads, probably isn't helping matters much, either.
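
As a side note, if the SPL build exposes the taskq kstat, the taskq thread counts and any pending/delayed entries can be inspected directly (a sketch; whether /proc/spl/taskq exists depends on the SPL version):

    # List taskqs with their thread counts and queued/delayed tasks,
    # assuming the loaded SPL module provides this kstat file.
    cat /proc/spl/taskq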

@Lalufu
Contributor Author

Lalufu commented Oct 28, 2016

Thanks for the update.

Is there a way to reduce the number of threads?

@behlendorf
Contributor

@dweeezil adding a small random factor is a nice, easy improvement. If there is contention here it's going to be on the taskq spinlock, so it's tempting to fan that out. One way to do this would be to dispatch the delayed frees to the spa's ZIO_TYPE_FREE/ISSUE issue taskq, which is ZTI_P(12, 8).

@dweeezil
Contributor

@behlendorf I'll work up a patch which does just that or something similar. It sure seems we don't need yet another taskq.

@tuxoko
Contributor

tuxoko commented Oct 28, 2016

I think we should batch the frees to reduce both the number of timers and the contention from tasks firing.

@dweeezil
Contributor

@tuxoko That's probably a better idea.

@behlendorf
Contributor

Somehow batching them locklessly would be ideal.

@tuxoko
Contributor

tuxoko commented Oct 28, 2016

I hacked up a quick patch doing the batch free. I haven't tested it, though.
tuxoko@02b7f81

Making it lockless should be possible, but I'll need to think about it.

@behlendorf
Contributor

Unless we can do this without a lock it's not clear that this will be a significant win. Batching things will reduce the number of tasks dispatched, but we may just have shifted the contention point from the taskq spinlock to the one protecting the batch list.

@tuxoko
Contributor

tuxoko commented Oct 28, 2016

Here's a patch to make it lockless. I haven't tested it either.
tuxoko@5cb4231

@tuxoko
Contributor

tuxoko commented Oct 28, 2016

@behlendorf
Without either of my patches I can easily cause a lockup. The problem is that when you hog the taskq with a lot of delayed tasks, you generate so many timers that the timer interrupts alone will hang your system.

@behlendorf
Contributor

That makes sense. When you're happy with the lockless implementation, can you open a new PR?

@tuxoko
Contributor

tuxoko commented Oct 31, 2016

#5353

@behlendorf
Contributor

@Lalufu could you apply the proposed patches in #5353 and confirm they resolve the issue?

@behlendorf removed the "Type: Regression" label on Nov 2, 2016
@Lalufu
Contributor Author

Lalufu commented Nov 3, 2016

On Wed, Nov 02, 2016 at 03:55:39PM -0700, Brian Behlendorf wrote:

@Lalufu could you apply the proposed patches in #5353 and confirm they resolve the issue.

Does this apply on top of 0.6.5.8? The machine is kinda in production,
so I'd rather not go to HEAD on it.

@tuxoko
Contributor

tuxoko commented Nov 3, 2016

@Lalufu
Yes, it applies on 0.6.5.8.
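
For reference, one way to apply the PR on top of the release tag (a sketch; the tag name and the GitHub .patch URL convention are assumptions):

    # Check out the 0.6.5.8 release tag and apply the PR as patches.
    git clone https://github.com/zfsonlinux/zfs.git
    cd zfs
    git checkout zfs-0.6.5.8
    curl -L https://github.com/zfsonlinux/zfs/pull/5353.patch | git am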

@Lalufu
Contributor Author

Lalufu commented Nov 5, 2016

On Thu, Nov 03, 2016 at 04:00:02PM -0700, tuxoko wrote:

@Lalufu
Yes, it does applies on 0.6.5.8.

So far this seems to fix the issue; I can no longer reproduce the load
spikes.

@Lalufu
Contributor Author

Lalufu commented Nov 16, 2016

On Thu, Nov 03, 2016 at 04:00:02PM -0700, tuxoko wrote:

@Lalufu
Yes, it does applies on 0.6.5.8.

Still hasn't blown up, so I'll declare this good for my purpose.

@behlendorf
Contributor

@Lalufu thanks for confirming the fix. The patches from #5353 were merged to master and added to the 0.6.5.9 project tracker for possible inclusion there.
