How to compile wlan.ko? #6

Open
mrafra opened this issue Feb 5, 2015 · 1 comment

Comments


mrafra commented Feb 5, 2015

After compiling and booting, wifi doesn't work...

liuguo09 pushed a commit that referenced this issue Mar 31, 2015
Setting an empty security context (length=0) on a file will
lead to incorrectly dereferencing the type and other fields
of the security context structure, yielding a kernel BUG.
As a zero-length security context is never valid, just reject
all such security contexts whether coming from userspace
via setxattr or coming from the filesystem upon a getxattr
request by SELinux.
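
A minimal userspace sketch of the check described above. The function name `context_to_sid()` and the stubbed SID value are hypothetical stand-ins for illustration, not the kernel's actual code; the point is only that the length is validated before anything in the context is touched.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the routine that turns a context string
 * into a security identifier (SID). */
static int context_to_sid(const char *scontext, size_t scontext_len,
                          unsigned int *sid)
{
    /* An empty security context is never valid: reject it up front,
     * before any field of the parsed context is dereferenced. */
    if (scontext == NULL || scontext_len == 0)
        return -EINVAL;

    /* Placeholder: real code would parse the string and look it up. */
    *sid = 1;
    return 0;
}
```

With this shape, both callers mentioned above (the setxattr path and the getxattr path) would see the same `-EINVAL` for a zero-length value.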

Setting a security context value (empty or otherwise) unknown to
SELinux in the first place is only possible for a root process
(CAP_MAC_ADMIN), and, if running SELinux in enforcing mode, only
if the corresponding SELinux mac_admin permission is also granted
to the domain by policy.  In Fedora policies, this is only allowed for
specific domains such as livecd for setting down security contexts
that are not defined in the build host policy.

[On Android, this can only be set by root/CAP_MAC_ADMIN processes,
and if running SELinux in enforcing mode, only if mac_admin permission
is granted in policy.  In Android 4.4, this would only be allowed for
root/CAP_MAC_ADMIN processes that are also in unconfined domains. In current
AOSP master, mac_admin is not allowed for any domains except the recovery
console which has a legitimate need for it.  The other potential vector
is mounting a maliciously crafted filesystem for which SELinux fetches
xattrs (e.g. an ext4 filesystem on a SDcard).  However, the end result is
only a local denial-of-service (DoS) due to a kernel BUG.  This fix is
queued for 3.14.]

Reproducer:
su
setenforce 0
touch foo
setfattr -n security.selinux foo

Caveat:
Relabeling or removing foo after doing the above may not be possible
without booting with SELinux disabled.  Any subsequent access to foo
after doing the above will also trigger the BUG.

BUG output from Matthew Thode:
[  473.893141] ------------[ cut here ]------------
[  473.962110] kernel BUG at security/selinux/ss/services.c:654!
[  473.995314] invalid opcode: 0000 [#6] SMP
[  474.027196] Modules linked in:
[  474.058118] CPU: 0 PID: 8138 Comm: ls Tainted: G      D   I
3.13.0-grsec #1
[  474.116637] Hardware name: Supermicro X8ST3/X8ST3, BIOS 2.0
07/29/10
[  474.149768] task: ffff8805f50cd010 ti: ffff8805f50cd488 task.ti:
ffff8805f50cd488
[  474.183707] RIP: 0010:[<ffffffff814681c7>]  [<ffffffff814681c7>]
context_struct_compute_av+0xce/0x308
[  474.219954] RSP: 0018:ffff8805c0ac3c38  EFLAGS: 00010246
[  474.252253] RAX: 0000000000000000 RBX: ffff8805c0ac3d94 RCX:
0000000000000100
[  474.287018] RDX: ffff8805e8aac000 RSI: 00000000ffffffff RDI:
ffff8805e8aaa000
[  474.321199] RBP: ffff8805c0ac3cb8 R08: 0000000000000010 R09:
0000000000000006
[  474.357446] R10: 0000000000000000 R11: ffff8805c567a000 R12:
0000000000000006
[  474.419191] R13: ffff8805c2b74e88 R14: 00000000000001da R15:
0000000000000000
[  474.453816] FS:  00007f2e75220800(0000) GS:ffff88061fc00000(0000)
knlGS:0000000000000000
[  474.489254] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  474.522215] CR2: 00007f2e74716090 CR3: 00000005c085e000 CR4:
00000000000207f0
[  474.556058] Stack:
[  474.584325]  ffff8805c0ac3c98 ffffffff811b549b ffff8805c0ac3c98
ffff8805f1190a40
[  474.618913]  ffff8805a6202f08 ffff8805c2b74e88 00068800d0464990
ffff8805e8aac860
[  474.653955]  ffff8805c0ac3cb8 000700068113833a ffff880606c75060
ffff8805c0ac3d94
[  474.690461] Call Trace:
[  474.723779]  [<ffffffff811b549b>] ? lookup_fast+0x1cd/0x22a
[  474.778049]  [<ffffffff81468824>] security_compute_av+0xf4/0x20b
[  474.811398]  [<ffffffff8196f419>] avc_compute_av+0x2a/0x179
[  474.843813]  [<ffffffff8145727b>] avc_has_perm+0x45/0xf4
[  474.875694]  [<ffffffff81457d0e>] inode_has_perm+0x2a/0x31
[  474.907370]  [<ffffffff81457e76>] selinux_inode_getattr+0x3c/0x3e
[  474.938726]  [<ffffffff81455cf6>] security_inode_getattr+0x1b/0x22
[  474.970036]  [<ffffffff811b057d>] vfs_getattr+0x19/0x2d
[  475.000618]  [<ffffffff811b05e5>] vfs_fstatat+0x54/0x91
[  475.030402]  [<ffffffff811b063b>] vfs_lstat+0x19/0x1b
[  475.061097]  [<ffffffff811b077e>] SyS_newlstat+0x15/0x30
[  475.094595]  [<ffffffff8113c5c1>] ? __audit_syscall_entry+0xa1/0xc3
[  475.148405]  [<ffffffff8197791e>] system_call_fastpath+0x16/0x1b
[  475.179201] Code: 00 48 85 c0 48 89 45 b8 75 02 0f 0b 48 8b 45 a0 48
8b 3d 45 d0 b6 00 8b 40 08 89 c6 ff ce e8 d1 b0 06 00 48 85 c0 49 89 c7
75 02 <0f> 0b 48 8b 45 b8 4c 8b 28 eb 1e 49 8d 7d 08 be 80 01 00 00 e8
[  475.255884] RIP  [<ffffffff814681c7>]
context_struct_compute_av+0xce/0x308
[  475.296120]  RSP <ffff8805c0ac3c38>
[  475.328734] ---[ end trace f076482e9d754adc ]---

[sds:  commit message edited to note Android implications and
to generate a unique Change-Id for gerrit]

Change-Id: I4d5389f0cfa72b5f59dada45081fa47e03805413
Reported-by:  Matthew Thode <mthode@mthode.org>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Sivasri Kumar Vanka <sivasri@codeaurora.org>
@liuguo09

Please refer to this FAQ

rooque pushed a commit to rooque/android_kernel_xiaomi_cancro that referenced this issue May 29, 2015
workqueue: change BUG_ON() to WARN_ON()

This BUG_ON() can be triggered if you call schedule_work() before
calling INIT_WORK().  It is a bug definitely, but it's nicer to just
print a stack trace and return.
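
A userspace sketch of the warn-and-return pattern, assuming a simplified work item whose list entry points at itself once initialized. The types and the helper `queue_work_checked()` are hypothetical; only the idea (report the caller's bug, then bail out instead of halting) comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct list_node { struct list_node *next, *prev; };

struct work_item {
    struct list_node entry;
    void (*func)(struct work_item *);
};

/* An initialized, unqueued item has an entry that points at itself. */
static void init_work(struct work_item *w, void (*func)(struct work_item *))
{
    w->entry.next = w->entry.prev = &w->entry;
    w->func = func;
}

static bool list_empty(const struct list_node *n)
{
    return n->next == n;
}

static bool queue_work_checked(struct work_item *w)
{
    /* A fatal assertion here would take the whole system down.
     * Warn and return instead, so the misuse is still reported. */
    if (!list_empty(&w->entry)) {
        fprintf(stderr, "WARNING: work item %p is not initialized\n",
                (void *)w);
        return false;
    }
    return true; /* placeholder for the real queueing step */
}
```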

Reported-by: Matt Renzelmann <mjr@cs.wisc.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: Catch more locking problems with flush_work()

If a workqueue is flushed with flush_work() lockdep checking can
be circumvented. For example:

 static DEFINE_MUTEX(mutex);

 static void my_work(struct work_struct *w)
 {
         mutex_lock(&mutex);
         mutex_unlock(&mutex);
 }

 static DECLARE_WORK(work, my_work);

 static int __init start_test_module(void)
 {
         schedule_work(&work);
         return 0;
 }
 module_init(start_test_module);

 static void __exit stop_test_module(void)
 {
         mutex_lock(&mutex);
         flush_work(&work);
         mutex_unlock(&mutex);
 }
 module_exit(stop_test_module);

would not always print a warning when flush_work() was called.
In this trivial example nothing could go wrong since we are
guaranteed module_init() and module_exit() don't run concurrently,
but if the work item is scheduled asynchronously we could have a
scenario where the work item is running just at the time flush_work()
is called resulting in a classic ABBA locking problem.

Add a lockdep hint by acquiring and releasing the work item
lockdep_map in flush_work() so that we always catch this
potential deadlock scenario.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

lockdep: fix oops in processing workqueue

Under memory load, on x86_64, with lockdep enabled, the workqueue's
process_one_work() has been seen to oops in __lock_acquire(), barfing
on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[].

Because it's permissible to free a work_struct from its callout function,
the map used is an onstack copy of the map given in the work_struct: and
that copy is made without any locking.

Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than
"rep movsq" for that structure copy: which might race with a workqueue
user's wait_on_work() doing lock_map_acquire() on the source of the
copy, putting a pointer into the class_cache[], but only in time for
the top half of that pointer to be copied to the destination map.

Boom when process_one_work() subsequently does lock_map_acquire()
on its onstack copy of the lockdep_map.

Fix this, and a similar instance in call_timer_fn(), with a
lockdep_copy_map() function which additionally NULLs the class_cache[].
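
The copy-then-clear idea can be sketched in userspace as follows. The struct layout, the cache size and the name `copy_map_sketch()` are assumptions made for illustration; the real lockdep_map has more fields.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CACHING_CLASSES 2 /* illustrative cache size */

struct lock_class; /* opaque in this sketch */

struct lockdep_map_sketch {
    const char *name;
    struct lock_class *class_cache[NR_CACHING_CLASSES];
};

static void copy_map_sketch(struct lockdep_map_sketch *to,
                            const struct lockdep_map_sketch *from)
{
    *to = *from;

    /* The source may be written concurrently, so a cached pointer
     * could have been copied only in part. The cache is just an
     * optimization: clearing it forces a fresh lookup later and
     * makes a torn pointer harmless. */
    for (size_t i = 0; i < NR_CACHING_CLASSES; i++)
        to->class_cache[i] = NULL;
}
```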

Note: this oops was actually seen on 3.4-next, where flush_work() newly
does the racing lock_map_acquire(); but Tejun points out that 3.4 and
earlier are already vulnerable to the same through wait_on_work().

* Patch originally from Peter.  Hugh modified it a bit and wrote the
  description.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Reported-by: Hugh Dickins <hughd@google.com>
LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: perform cpu down operations from low priority cpu_notifier()

Currently, all workqueue cpu hotplug operations run off
CPU_PRI_WORKQUEUE which is higher than normal notifiers.  This is to
ensure that workqueue is up and running while bringing up a CPU before
other notifiers try to use workqueue on the CPU.

Per-cpu workqueues are supposed to remain working and bound to the CPU
for normal CPU_DOWN_PREPARE notifiers.  This holds mostly true even
with workqueue offlining running with higher priority because
workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which
runs the per-cpu workqueue without concurrency management without
explicitly detaching the existing workers.

However, if the trustee needs to create new workers, it creates
unbound workers which may wander off to other CPUs while
CPU_DOWN_PREPARE notifiers are in progress.  Furthermore, if the CPU
down is cancelled, the per-CPU workqueue may end up with workers which
aren't bound to the CPU.

While reliably reproducible with a convoluted artificial test-case
involving scheduling and flushing CPU burning work items from CPU down
notifiers, this isn't very likely to happen in the wild, and, even
when it happens, the effects are likely to be hidden by the following
successful CPU down.

Fix it by using different priorities for up and down notifiers - high
priority for up operations and low priority for down operations.
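
A small sketch of why two priorities help, assuming a notifier chain that is invoked in descending priority order. The priority values and the `struct notifier` type are illustrative only. The "up" entry is meant to act only on CPU-up events and the "down" entry only on CPU-down events, so workqueue runs before normal notifiers on the way up and after them on the way down.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative priorities; normal notifiers sit at 0. */
enum { PRI_WQ_UP = 5, PRI_NORMAL = 0, PRI_WQ_DOWN = -5 };

struct notifier {
    const char *name;
    int priority;
};

/* qsort comparator: higher priority first. */
static int by_priority_desc(const void *a, const void *b)
{
    const struct notifier *x = a, *y = b;
    return (y->priority > x->priority) - (y->priority < x->priority);
}

static void sort_chain(struct notifier *chain, size_t n)
{
    qsort(chain, n, sizeof(*chain), by_priority_desc);
}
```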

Workqueue cpu hotplug operations will soon go through further cleanup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>

workqueue: drop CPU_DYING notifier operation

Workqueue used CPU_DYING notification to mark GCWQ_DISASSOCIATED.
This was necessary because workqueue's CPU_DOWN_PREPARE happened
before other DOWN_PREPARE notifiers and workqueue needed to stay
associated across the rest of DOWN_PREPARE.

After the previous patch, workqueue's DOWN_PREPARE happens after
others and can set GCWQ_DISASSOCIATED directly.  Drop CPU_DYING and
let the trustee set GCWQ_DISASSOCIATED after disabling concurrency
management.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>

workqueue: ROGUE workers are UNBOUND workers

Currently, WORKER_UNBOUND is used to mark workers for the unbound
global_cwq and WORKER_ROGUE is used to mark workers for disassociated
per-cpu global_cwqs.  Both are used to make the marked worker skip
concurrency management and the only place they make any difference is
in worker_enter_idle() where WORKER_ROGUE is used to skip scheduling
idle timer, which can easily be replaced with trustee state testing.

This patch replaces WORKER_ROGUE with WORKER_UNBOUND and drops
WORKER_ROGUE.  This is to prepare for removing trustee and handling
disassociated global_cwqs as unbound.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>

workqueue: use mutex for global_cwq manager exclusion

POOL_MANAGING_WORKERS is used to ensure that at most one worker takes
the manager role at any given time on a given global_cwq.  Trustee
later hitched on it to assume the manager role, adding a blocking wait for the
bit.  As trustee already needed a custom wait mechanism, waiting for
MANAGING_WORKERS was rolled into the same mechanism.

Trustee is scheduled to be removed.  This patch separates out
MANAGING_WORKERS wait into per-pool mutex.  Workers use
mutex_trylock() to test for manager role and trustee uses mutex_lock()
to claim manager roles.

gcwq_claim/release_management() helpers are added to grab and release
manager roles of all pools on a global_cwq.  gcwq_claim_management()
always grabs pool manager mutexes in ascending pool index order and
uses pool index as lockdep subclass.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>

workqueue: drop @bind from create_worker()

Currently, create_worker()'s callers are responsible for deciding
whether the newly created worker should be bound to the associated CPU
and create_worker() sets WORKER_UNBOUND only for the workers for the
unbound global_cwq.  Creation during normal operation is always via
maybe_create_worker() and @bind is true.  For workers created during
hotplug, @bind is false.

Normal operation path is planned to be used even while the CPU is
going through hotplug operations or offline and this static decision
won't work.

Drop @bind from create_worker() and decide whether to bind by looking
at GCWQ_DISASSOCIATED.  create_worker() will also set WORKER_UNBOUND
automatically if disassociated.  To avoid flipping GCWQ_DISASSOCIATED
while create_worker() is in progress, the flag is now allowed to be
changed only while holding all manager_mutexes on the global_cwq.

This requires that GCWQ_DISASSOCIATED is not cleared behind trustee's
back.  CPU_ONLINE no longer clears DISASSOCIATED before flushing
trustee, which clears DISASSOCIATED before rebinding remaining workers
if asked to release.  For cases where trustee isn't around, CPU_ONLINE
clears DISASSOCIATED after flushing trustee.  Also, now, first_idle
has UNBOUND set on creation which is explicitly cleared by CPU_ONLINE
while binding it.  These convolutions will soon be removed by further
simplification of CPU hotplug path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>

workqueue: reimplement CPU online rebinding to handle idle workers

Currently, if there are workers left when a CPU is being brought back
online, the trustee kills all idle workers and schedules rebind_work
so that they re-bind to the CPU after the currently executing work is
finished.  This works for busy workers because concurrency management
doesn't try to wake them up from scheduler callbacks, which require
the target task to be on the local run queue.  The busy worker bumps
concurrency counter appropriately as it clears WORKER_UNBOUND from the
rebind work item and it's bound to the CPU before returning to the
idle state.

To reduce CPU on/offlining overhead (as many embedded systems use it
for powersaving) and simplify the code path, workqueue is planned to
be modified to retain idle workers across CPU on/offlining.  This
patch reimplements CPU online rebinding such that it can also handle
idle workers.

As noted earlier, due to the local wakeup requirement, rebinding idle
workers is tricky.  All idle workers must be re-bound before scheduler
callbacks are enabled.  This is achieved by interlocking idle
re-binding.  Idle workers are requested to re-bind and then hold until
all idle re-binding is complete so that no bound worker starts
executing work item.  Only after all idle workers are re-bound and
parked, CPU_ONLINE proceeds to release them and queue rebind work item
to busy workers thus guaranteeing scheduler callbacks aren't invoked
until all idle workers are ready.

worker_rebind_fn() is renamed to busy_worker_rebind_fn() and
idle_worker_rebind() for idle workers is added.  Rebinding logic is
moved to rebind_workers() and now called from CPU_ONLINE after
flushing trustee.  While at it, add CPU sanity check in
worker_thread().

Note that now a worker may become idle or the manager between trustee
release and rebinding during CPU_ONLINE.  As the previous patch
updated create_worker() so that it can be used by regular manager
while unbound and this patch implements idle re-binding, this is safe.

This prepares for removal of trustee and keeping idle workers across
CPU hotplugs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>

workqueue: don't butcher idle workers on an offline CPU

Currently, during CPU offlining, after all pending work items are
drained, the trustee butchers all workers.  Also, on CPU onlining
failure, workqueue_cpu_callback() ensures that the first idle worker
is destroyed.  Combined, these guarantee that an offline CPU doesn't
have any worker for it once all the lingering work items are finished.

This guarantee isn't really necessary and makes CPU on/offlining more
expensive than it needs to be, especially for platforms which use CPU
hotplug for powersaving.

This patch removes idle worker butchering from the
trustee and lets a CPU which failed onlining keep the created first
worker.  The first worker is created if the CPU doesn't have any
during CPU_DOWN_PREPARE and started right away.  If onlining succeeds,
the rebind_workers() call in CPU_ONLINE will rebind it like any other
workers.  If onlining fails, the worker is left alone till the next
try.

This makes CPU hotplugs cheaper by allowing global_cwqs to keep
workers across them and simplifies code.

Note that trustee doesn't re-arm idle timer when it's done and thus
the disassociated global_cwq will keep all workers until it comes back
online.  This will be improved by further patches.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>

workqueue: remove CPU offline trustee

With the previous changes, a disassociated global_cwq now can run as
an unbound one on its own - it can create workers as necessary to
drain remaining works after the CPU has been brought down and manage
the number of workers using the usual idle timer mechanism making
trustee completely redundant except for the actual unbinding
operation.

This patch removes the trustee and lets a disassociated global_cwq
manage itself.  Unbinding is moved to a work item (for CPU affinity)
which is scheduled and flushed from CPU_DOWN_PREPARE.

This patch moves nr_running clearing outside gcwq and manager locks to
simplify the code.  As nr_running is unused at the point, this is
safe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>

workqueue: simplify CPU hotplug code

With trustee gone, CPU hotplug code can be simplified.

* gcwq_claim/release_management() now grab and release gcwq lock too
  respectively and gained _and_lock and _and_unlock postfixes.

* All CPU hotplug logic was implemented in workqueue_cpu_callback()
  which was called by workqueue_cpu_up/down_callback() for the correct
  priority.  This was because up and down paths shared a lot of logic,
  which is no longer true.  Remove workqueue_cpu_callback() and move
  all hotplug logic into the two actual callbacks.

This patch doesn't make any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>

workqueue: fix spurious CPU locality WARN from process_one_work()

25511a4776 "workqueue: reimplement CPU online rebinding to handle idle
workers" added CPU locality sanity check in process_one_work().  It
triggers if a worker is executing on a different CPU without UNBOUND
or REBIND set.

This works for all normal workers but rescuers can trigger this
spuriously when they're serving the unbound or a disassociated
global_cwq - rescuers don't have either flag set and thus its
gcwq->cpu can be a different value including %WORK_CPU_UNBOUND.

Fix it by additionally testing %GCWQ_DISASSOCIATED.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
LKML-Reference: <20120721213656.GA7783@linux.vnet.ibm.com>

workqueue: reorder queueing functions so that _on() variants are on top

Currently, queue/schedule[_delayed]_work_on() are located below the
counterpart without the _on postfix even though the latter is usually
implemented using the former.  Swap them.

This is cleanup and doesn't cause any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: make queueing functions return bool

All queueing functions return 1 on success, 0 if the work item was
already pending.  Update them to return bool instead.  This signifies
better that they don't return 0 / -errno.
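
A sketch of the bool-returning convention, using C11 atomics in place of the kernel's bit operations. The `PENDING_BIT` value, the struct and the function name are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define PENDING_BIT 1UL /* illustrative bit position */

struct work_sketch {
    atomic_ulong data;
};

/* Returns true if the item was newly queued, false if it was
 * already pending. There is no error case, hence bool, not int. */
static bool queue_work_sketch(struct work_sketch *w)
{
    unsigned long old = atomic_fetch_or(&w->data, PENDING_BIT);

    if (old & PENDING_BIT)
        return false;

    /* Placeholder: real code would link the item into a queue. */
    return true;
}
```

A caller can then write `if (!queue_work_sketch(&w))` without wondering whether a negative errno is possible.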

This is cleanup and doesn't cause any functional difference.

While at it, fix comment opening for schedule_work_on().

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: add missing smp_wmb() in process_one_work()

WORK_STRUCT_PENDING is used to claim ownership of a work item and
process_one_work() releases it before starting execution.  When
someone else grabs PENDING, all pre-release updates to the work item
should be visible and all updates made by the new owner should happen
afterwards.

Grabbing PENDING uses test_and_set_bit() and thus has a full barrier;
however, clearing doesn't have a matching wmb.  Given the preceding
spin_unlock and use of clear_bit, I don't believe this can be a
problem on an actual machine and there hasn't been any related report
but it still is theoretically possible for clear_pending to permeate
upwards and happen before work->entry update.

Add an explicit smp_wmb() before work_clear_pending().
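
The ordering requirement can be illustrated with C11 atomics. This is an analogy, not the kernel primitive: a release store is used here where the patch uses smp_wmb() before a plain clear, and all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct owned_item {
    int payload;         /* stands in for the work item's fields */
    atomic_bool pending; /* ownership flag */
};

/* Current owner: update the item, then give up ownership. */
static void finish_and_release(struct owned_item *it, int value)
{
    it->payload = value; /* pre-release update */

    /* Release ordering: the payload store above cannot be
     * reordered after the flag is cleared. */
    atomic_store_explicit(&it->pending, false, memory_order_release);
}

/* New owner: claim the flag; on success, earlier updates made by
 * the previous owner are visible. */
static bool try_claim(struct owned_item *it)
{
    bool expected = false;

    return atomic_compare_exchange_strong_explicit(
        &it->pending, &expected, true,
        memory_order_acq_rel, memory_order_acquire);
}
```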

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org

workqueue: disable irq while manipulating PENDING

Queueing operations use WORK_STRUCT_PENDING_BIT to synchronize access
to the target work item.  They first try to claim the bit and proceed
with queueing only after that succeeds and there's a window between
PENDING being set and the actual queueing where the task can be
interrupted or preempted.

There's also a similar window in process_one_work() when clearing
PENDING.  A work item is dequeued, gcwq->lock is released and then
PENDING is cleared and the worker might get interrupted or preempted
between releasing gcwq->lock and clearing PENDING.

cancel[_delayed]_work_sync() tries to claim or steal PENDING.  The
function assumes that a work item with PENDING is either queued or in
the process of being [de]queued.  In the latter case, it busy-loops
until either the work item loses PENDING or is queued.  If canceling
coincides with the above described interrupts or preemptions, the
canceling task will busy-loop while the queueing or executing task is
preempted.

This patch keeps irq disabled across claiming PENDING and actual
queueing and moves PENDING clearing in process_one_work() inside
gcwq->lock so that busy looping from PENDING && !queued doesn't wait
for interrupted/preempted tasks.  Note that, in process_one_work(),
setting last CPU and clearing PENDING got merged into single
operation.

This removes possible long busy-loops and will allow using
try_to_grab_pending() from bh and irq contexts.

v2: __queue_work() was testing preempt_count() to ensure that the
    caller has disabled preemption.  This triggers spuriously if
    !CONFIG_PREEMPT_COUNT.  Use preemptible() instead.  Reported by
    Fengguang Wu.

v3: Disable irq instead of preemption.  IRQ will be disabled while
    grabbing gcwq->lock later anyway and this allows using
    try_to_grab_pending() from bh and irq contexts.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>

workqueue: set delayed_work->timer function on initialization

delayed_work->timer.function is currently initialized during
queue_delayed_work_on().  Export delayed_work_timer_fn() and set
delayed_work timer function during delayed_work initialization
together with other fields.

This ensures the timer function is always valid on an initialized
delayed_work.  This is to help mod_delayed_work() implementation.

To detect delayed_work users which diddle with the internal timer,
trigger WARN if timer function doesn't match on queue.

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: unify local CPU queueing handling

Queueing functions have been using different methods to determine the
local CPU.

* queue_work() superfluously uses get/put_cpu() to acquire and hold the
  local CPU across queue_work_on().

* delayed_work_timer_fn() uses smp_processor_id().

* queue_delayed_work() calls queue_delayed_work_on() with -1 @cpu
  which is interpreted as the local CPU.

* flush_delayed_work[_sync]() were using raw_smp_processor_id().

* __queue_work() interprets %WORK_CPU_UNBOUND as local CPU if the
  target workqueue is bound one but nobody uses this.

This patch converts all functions to uniformly use %WORK_CPU_UNBOUND
to indicate local CPU and use the local binding feature of
__queue_work().  unlikely() is dropped from %WORK_CPU_UNBOUND handling
in __queue_work().
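
A sketch of the unified convention, assuming a single sentinel that means "use the local CPU" when the target workqueue is bound. The sentinel value and the helper `resolve_cpu()` are hypothetical, and the local CPU is passed in as a parameter so the sketch stays free of platform calls.

```c
#include <assert.h>
#include <stdbool.h>

#define CPU_UNBOUND_SKETCH (-1) /* illustrative sentinel value */

/* Decide which CPU a work item should be queued on. */
static int resolve_cpu(int requested, int local_cpu, bool wq_is_bound)
{
    /* One rule for every caller: on a bound workqueue, the
     * sentinel is interpreted as the local CPU. */
    if (requested == CPU_UNBOUND_SKETCH && wq_is_bound)
        return local_cpu;

    return requested;
}
```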

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: fix zero @delay handling of queue_delayed_work_on()

If @delay is zero and the delayed_work is idle, queue_delayed_work()
queues it for immediate execution; however, queue_delayed_work_on()
lacks this logic and always goes through timer regardless of @delay.

This patch moves 0 @delay handling logic from queue_delayed_work() to
queue_delayed_work_on() so that both functions behave the same.
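
A sketch of the resulting structure, with the zero-delay decision living in the `_on` variant so the plain variant inherits it. All names and the `CPU_LOCAL` sentinel are hypothetical; the actual queueing and timer arming are replaced by a recorded path.

```c
#include <assert.h>

#define CPU_LOCAL (-1) /* illustrative "local CPU" sentinel */

enum queue_path { PATH_IMMEDIATE, PATH_TIMER };

struct dwork_sketch {
    int cpu;
    enum queue_path path; /* which path the last queueing took */
};

/* The zero-delay check lives here, in the more general function. */
static void queue_delayed_on(struct dwork_sketch *dw, int cpu,
                             unsigned long delay)
{
    dw->cpu = cpu;
    dw->path = (delay == 0) ? PATH_IMMEDIATE : PATH_TIMER;
}

/* The plain variant is a thin wrapper, so both behave the same. */
static void queue_delayed(struct dwork_sketch *dw, unsigned long delay)
{
    queue_delayed_on(dw, CPU_LOCAL, delay);
}
```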

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: move try_to_grab_pending() upwards

try_to_grab_pending() will be used by to-be-implemented
mod_delayed_work[_on]().  Move try_to_grab_pending() and related
functions above queueing functions.

This patch only moves functions around.

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: introduce WORK_OFFQ_FLAG_*

Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain
WORK_STRUCT_FLAG_* and flush color.  If the work item is queued, the
rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise,
WORK_STRUCT_CWQ is clear and the bits contain the last CPU number -
either a real CPU number or one of WORK_CPU_*.

Scheduled addition of mod_delayed_work[_on]() requires an additional
flag, which is used only while a work item is off queue.  There are
more than enough bits to represent off-queue CPU number on both 32 and
64bits.  This patch introduces WORK_OFFQ_FLAG_* which occupy the lower
part of the @work->data high bits while off queue.  This patch doesn't
define any actual OFFQ flag yet.

Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds
the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to
make room for OFFQ flags.

To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong
cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON()
to check that there are enough bits to accommodate off-queue CPU number
is added.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: factor out __queue_delayed_work() from queue_delayed_work_on()

This is to prepare for mod_delayed_work[_on]() and doesn't cause any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: reorganize try_to_grab_pending() and __cancel_work_timer()

* Use bool @is_dwork instead of @timer and let try_to_grab_pending()
  use to_delayed_work() to determine the delayed_work address.

* Move timer handling from __cancel_work_timer() to
  try_to_grab_pending().

* Make try_to_grab_pending() use -EAGAIN instead of -1 for
  busy-looping and drop the ret local variable.

* Add proper function comment to try_to_grab_pending().

This makes the code a bit easier to understand and will ease further
changes.  This patch doesn't make any functional change.

v2: Use @is_dwork instead of @timer.

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: mark a work item being canceled as such

There can be two reasons try_to_grab_pending() can fail with -EAGAIN.
One is when someone else is queueing or dequeueing the work item.  With
the previous patches, it is guaranteed that PENDING and queued state
will soon agree making it safe to busy-retry in this case.

The other is if multiple __cancel_work_timer() invocations are racing
one another.  __cancel_work_timer() grabs PENDING and then waits for
running instances of the target work item on all CPUs while holding
PENDING and !queued.  try_to_grab_pending() invoked from another task
will keep returning -EAGAIN while the current owner is waiting.

Not distinguishing the two cases is okay because __cancel_work_timer()
is the only user of try_to_grab_pending() and it invokes
wait_on_work() whenever grabbing fails.  For the first case, busy
looping should be fine but wait_on_work() doesn't cause any critical
problem.  For the latter case, the new contender usually waits for the
same condition as the current owner, so no unnecessarily extended
busy-looping happens.  Combined, these make __cancel_work_timer()
technically correct even without irq protection while grabbing PENDING
or distinguishing the two different cases.

While the current code is technically correct, not distinguishing the
two cases makes it difficult to use try_to_grab_pending() for other
purposes than canceling because it's impossible to tell whether it's
safe to busy-retry grabbing.

This patch adds a mechanism to mark a work item being canceled.
try_to_grab_pending() now disables irq on success and returns -EAGAIN
to indicate that grabbing failed but PENDING and queued states are
gonna agree soon and it's safe to busy-loop.  It returns -ENOENT if
the work item is being canceled and it may stay PENDING && !queued for
arbitrary amount of time.

__cancel_work_timer() is modified to mark the work canceling with
WORK_OFFQ_CANCELING after grabbing PENDING, thus making
try_to_grab_pending() fail with -ENOENT instead of -EAGAIN.  Also, it
invokes wait_on_work() iff grabbing failed with -ENOENT.  This isn't
necessary for correctness but makes it consistent with other future
users of try_to_grab_pending().
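
A simplified user-space model of the two failure cases is sketched
below.  struct model_work and model_try_to_grab_pending() are
hypothetical; irq handling and the timer are left out, and the 0 / 1
success values are an assumption of this sketch rather than something
stated in this description.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state try_to_grab_pending() looks at. */
struct model_work {
	bool pending;	/* PENDING bit */
	bool queued;	/* on a worklist */
	bool canceling;	/* stands in for WORK_OFFQ_CANCELING */
};

/*
 * Assumed return values for this sketch:
 *   0        item was idle, caller now owns PENDING
 *   1        item was queued, caller stole it and owns PENDING
 *   -EAGAIN  PENDING && !queued transiently, safe to busy-retry
 *   -ENOENT  someone is canceling, may stay that way for a long time
 */
static int model_try_to_grab_pending(struct model_work *w)
{
	assert(w);

	if (!w->pending) {
		w->pending = true;
		return 0;
	}

	if (w->queued) {
		w->queued = false;	/* dequeue, PENDING stays set for us */
		return 1;
	}

	return w->canceling ? -ENOENT : -EAGAIN;
}
```

A caller would loop on -EAGAIN and fall back to waiting only when it
sees -ENOENT.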

v2: try_to_grab_pending() was testing preempt_count() to ensure that
    the caller has disabled preemption.  This triggers spuriously if
    !CONFIG_PREEMPT_COUNT.  Use preemptible() instead.  Reported by
    Fengguang Wu.

v3: Updated so that try_to_grab_pending() disables irq on success
    rather than requiring preemption disabled by the caller.  This
    makes busy-looping easier and will allow try_to_grab_pending() to
    be used from bh/irq contexts.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>

workqueue: implement mod_delayed_work[_on]()

Workqueue was lacking a mechanism to modify the timeout of an already
pending delayed_work.  delayed_work users have been working around
this using several methods - using an explicit timer + work item,
messing directly with delayed_work->timer, and canceling before
re-queueing, all of which are error-prone and/or ugly.

This patch implements mod_delayed_work[_on]() which behaves similarly
to mod_timer() - if the delayed_work is idle, it's queued with the
given delay; otherwise, its timeout is modified to the new value.
Zero @delay guarantees immediate execution.
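
The semantics described above can be modelled in user space roughly
like this.  All names (struct mdw, model_mod_delayed_work()) are
hypothetical, and returning whether the item was already pending is an
assumption made by analogy with mod_timer().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; not the kernel data structures. */
enum mdw_state { MDW_IDLE, MDW_ON_TIMER, MDW_QUEUED };

struct mdw {
	enum mdw_state state;
	unsigned long expires;	/* only meaningful while MDW_ON_TIMER */
};

/* Returns true if the item was already pending, false if it was idle. */
static bool model_mod_delayed_work(struct mdw *dw, unsigned long now,
				   unsigned long delay)
{
	bool was_pending;

	assert(dw);
	was_pending = dw->state != MDW_IDLE;

	if (!delay) {
		/* zero @delay: immediate execution, timer not involved */
		dw->state = MDW_QUEUED;
		return was_pending;
	}

	/* idle: queue with @delay; pending: move the timeout */
	dw->state = MDW_ON_TIMER;
	dw->expires = now + delay;
	return was_pending;
}
```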

v2: Updated to reflect try_to_grab_pending() changes.  Now safe to be
    called from bh context.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>

workqueue: fix CPU binding of flush_delayed_work[_sync]()

delayed_work encodes the workqueue to use and the last CPU in
delayed_work->work.data while it's on timer.  The target CPU is
implicitly recorded as the CPU the timer is queued on and
delayed_work_timer_fn() queues delayed_work->work to the CPU it is
running on.

Unfortunately, this leaves flush_delayed_work[_sync]() no way to find
out which CPU the delayed_work was queued for when they try to
re-queue after killing the timer.  Currently, it chooses the local CPU
flush is running on.  This can unexpectedly move a delayed_work queued
on a specific CPU to another CPU and lead to subtle errors.

There isn't much point in trying to save several bytes in struct
delayed_work, which is already close to a hundred bytes on 64bit with
all debug options turned off.  This patch adds delayed_work->cpu to
remember the CPU it's queued for.

Note that if the timer is migrated during CPU down, the work item
could be queued to the downed global_cwq after this change.  As a
detached global_cwq behaves like an unbound one, this doesn't change
much for the delayed_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>

workqueue: add missing wmb() in clear_work_data()

Any operation which clears PENDING should be preceded by a wmb to
guarantee that the next PENDING owner sees all the changes made before
PENDING release.

There are only two places where PENDING is cleared -
set_work_cpu_and_clear_pending() and clear_work_data().  The caller of
the former already does smp_wmb() but the latter doesn't have any.

Move the wmb above set_work_cpu_and_clear_pending() into it and add
one to clear_work_data().
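
The ordering requirement can be illustrated with C11 atomics.  Note
that this swaps in a release store / acquire compare-exchange pair in
place of the kernel's smp_wmb() and bitops, so it is an analogy rather
than the kernel code; struct model_item and the model_*() helpers are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_PENDING	1UL	/* stands in for the PENDING bit */

struct model_item {
	int payload;		/* plain data written by the current owner */
	atomic_ulong data;	/* stands in for work->data */
};

/* Owner publishes its changes, then gives up PENDING. */
static void model_release_pending(struct model_item *it, int payload)
{
	assert(it);
	it->payload = payload;
	/* release: the payload store is ordered before PENDING clears */
	atomic_store_explicit(&it->data, 0UL, memory_order_release);
}

/* Next owner grabs PENDING and may then read the payload safely. */
static bool model_try_claim_pending(struct model_item *it)
{
	unsigned long expected = 0UL;

	assert(it);
	return atomic_compare_exchange_strong_explicit(&it->data, &expected,
						       MODEL_PENDING,
						       memory_order_acquire,
						       memory_order_relaxed);
}
```

Without the release ordering, the next owner could observe PENDING
clear and still read a stale payload on weakly ordered machines.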

There hasn't been any report related to this issue, and, given how
clear_work_data() is used, it is extremely unlikely to have caused any
actual problems on any architecture.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>

workqueue: use enum value to set array size of pools in gcwq

Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduced a separate worker_pool
for HIGHPRI.  Although the NR_WORKER_POOLS enum value represents the
number of pools, the definition of the pools array in gcwq doesn't use it.
Using it makes the code more robust and prevents future mistakes,
so change the code to use this enum value.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: correct req_cpu in trace_workqueue_queue_work()

When tracing workqueue_queue_work(), the requested cpu is recorded.
However, if !(@wq->flags & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND,
the requested cpu is changed to the local cpu before it is recorded.
With @wq->flags & WQ_UNBOUND, this change doesn't occur, so it is
reasonable to record the originally requested cpu in both cases.

Use temporary local variable for storing requested cpu.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: change value of lcpu in __queue_delayed_work_on()

We assign the cpu id into the work struct's data field in
__queue_delayed_work_on().  In the current implementation, when a work
item comes in for the first time, the currently running cpu id is assigned.
If we call __queue_delayed_work_on() for CPU A while running on CPU B,
__queue_work() invoked from delayed_work_timer_fn() goes into
the following sub-optimal path in case of WQ_NON_REENTRANT.

	gcwq = get_gcwq(cpu);
	if (wq->flags & WQ_NON_REENTRANT &&
		(last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {

Change lcpu to @cpu, falling back to the local cpu if lcpu is
WORK_CPU_UNBOUND.  This is sufficient to avoid the sub-optimal path.
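
The selection rule boils down to the following sketch.
MODEL_CPU_UNBOUND and model_pick_lcpu() are hypothetical stand-ins
used only for illustration.

```c
#include <assert.h>

#define MODEL_CPU_UNBOUND	(-1)	/* stands in for WORK_CPU_UNBOUND */

/* Pick the CPU to record in the work item's data field. */
static int model_pick_lcpu(int cpu, int local_cpu)
{
	assert(local_cpu >= 0);

	/* prefer the requested CPU, use the local one only if unbound */
	return cpu == MODEL_CPU_UNBOUND ? local_cpu : cpu;
}
```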

tj: Slightly rephrased the comment.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: introduce system_highpri_wq

Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduced a separate worker pool
for HIGHPRI.  When we handle busy workers for a gcwq, each can be a normal
worker or a highpri worker.  However, rebind_workers() doesn't consider this
difference and uses system_wq even for highpri workers.  This creates a
mismatch between cwq->pool and worker->pool.

This doesn't cause an error in the current implementation, but could in the
future.  Introduce system_highpri_wq so that the proper cwq can be used for
highpri workers in rebind_workers().  The following patch fixes this issue
properly.

tj: Even apart from rebinding, having system_highpri_wq generally
    makes sense.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: use system_highpri_wq for highpri workers in rebind_workers()

In rebind_workers(), we insert a work item to rebind busy workers to the cpu.
Currently, only system_wq is used for this.  This can lead to errors
as there is a mismatch between cwq->pool and worker->pool.

To prevent this, use system_highpri_wq for highpri workers
so that these match.  This patch implements it.

tj: Rephrased comment a bit.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: use system_highpri_wq for unbind_work

To speed cpu down processing up, use system_highpri_wq.
As scheduling priority of workers on it is higher than system_wq and
it is not contended by other normal works on this cpu, work on it
is processed faster than system_wq.

tj: CPU up/downs care quite a bit about latency these days.  This
    shouldn't hurt anything and makes sense.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: fix checkpatch issues

Fixed some checkpatch warnings.

tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit.

Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com>

workqueue: make all workqueues non-reentrant

By default, each per-cpu part of a bound workqueue operates separately
and a work item may be executing concurrently on different CPUs.  The
behavior avoids some cross-cpu traffic but leads to subtle weirdities
and not-so-subtle contortions in the API.

* There's no sane usefulness in allowing a single work item to be
  executed concurrently on multiple CPUs.  People just get the
  behavior unintentionally and get surprised after learning about it.
  Most either explicitly synchronize or use non-reentrant/ordered
  workqueue but this is error-prone.

* flush_work() can't wait for multiple instances of the same work item
  on different CPUs.  If a work item is executing on cpu0 and then
  queued on cpu1, flush_work() can only wait for the one on cpu1.

  Unfortunately, work items can easily cross CPU boundaries
  unintentionally when the queueing thread gets migrated.  This means
  that if multiple queuers compete, flush_work() can't even guarantee
  that the instance queued right before it is finished before
  returning.

* flush_work_sync() was added to work around some of the deficiencies
  of flush_work().  In addition to the usual flushing, it ensures that
  all currently executing instances are finished before returning.
  This operation is expensive as it has to walk all CPUs and at the
  same time fails to address competing queuer case.

  Incorrectly using flush_work() when flush_work_sync() is necessary
  is an easy error to make and can lead to bugs which are difficult to
  reproduce.

* Similar problems exist for flush_delayed_work[_sync]().

Other than the cross-cpu access concern, there's no benefit in
allowing parallel execution and it's plain silly to have this level of
contortion for workqueue which is widely used from core code to
extremely obscure drivers.

This patch makes all workqueues non-reentrant.  If a work item is
executing on a different CPU when queueing is requested, it is always
queued to that CPU.  This guarantees that any given work item can be
executing on one CPU at maximum and if a work item is queued and
executing, both are on the same CPU.

The only behavior change which may affect workqueue users negatively
is that non-reentrancy overrides the affinity specified by
queue_work_on().  On a reentrant workqueue, the affinity specified by
queue_work_on() is always followed.  Now, if the work item is
executing on one of the CPUs, the work item will be queued there
regardless of the requested affinity.  I've reviewed all workqueue
users which request explicit affinity, and, fortunately, none seems to
be crazy enough to exploit parallel execution of the same work item.
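
The resulting CPU selection rule can be sketched as below.  This is a
toy model, not the kernel's lookup: struct model_nr_work and
model_pick_cpu() are hypothetical, and the busy_hash lookup is reduced
to a single field.

```c
#include <assert.h>

#define MODEL_NO_CPU	(-1)

/* Hypothetical model: where, if anywhere, the item is executing now. */
struct model_nr_work {
	int running_cpu;	/* executing CPU, or MODEL_NO_CPU */
};

/* Non-reentrant rule: follow the running instance, else the request. */
static int model_pick_cpu(const struct model_nr_work *w, int requested_cpu)
{
	assert(w);
	assert(requested_cpu >= 0);

	if (w->running_cpu != MODEL_NO_CPU)
		return w->running_cpu;	/* overrides requested affinity */

	return requested_cpu;
}
```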

This adds an additional busy_hash lookup if the work item was
previously queued on a different CPU.  This shouldn't be noticeable
under any sane workload.  Work item queueing isn't a very
high-frequency operation and they don't jump across CPUs all the time.
In a micro benchmark to exaggerate this difference - measuring the
time it takes for two work items to repeatedly jump between two CPUs a
number (10M) of times with busy_hash table densely populated, the
difference was around 3%.

While the overhead is measurable, it is only visible in pathological
cases and the difference isn't huge.  This change brings much needed
sanity to workqueue and makes its behavior consistent with timer.  I
think this is the right tradeoff to make.

This enables significant simplification of workqueue API.
Simplification patches will follow.

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: gut flush[_delayed]_work_sync()

Now that all workqueues are non-reentrant, flush[_delayed]_work_sync()
are equivalent to flush[_delayed]_work().  Drop the separate
implementation and make them thin wrappers around
flush[_delayed]_work().

* start_flush_work() no longer takes @wait_executing as the only left
  user - flush_work() - always sets it to %true.

* __cancel_work_timer() uses flush_work() instead of wait_on_work().

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: gut system_nrt[_freezable]_wq()

Now that all workqueues are non-reentrant, system[_freezable]_wq() are
equivalent to system_nrt[_freezable]_wq().  Replace the latter with
wrappers around system[_freezable]_wq().  The wrapping goes through
inline functions so that __deprecated can be added easily.

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: cosmetic whitespace updates for macro definitions

Consistently use the last tab position for '\' line continuation in
complex macro definitions.  This is to help the following patches.

This patch is cosmetic.

Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()

workqueue_cpu_down_callback() is used only if HOTPLUG_CPU=y, so
hotcpu_notifier() fits better than cpu_notifier().

When HOTPLUG_CPU=y, hotcpu_notifier() and cpu_notifier() are the same.

When HOTPLUG_CPU=n, if we use cpu_notifier(),
workqueue_cpu_down_callback() will be called during boot to do
nothing, and the memory of workqueue_cpu_down_callback() and
gcwq_unbind_fn() will be discarded after boot.

If we use hotcpu_notifier(), we can avoid the no-op call of
workqueue_cpu_down_callback() and the memory of
workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discarded at
build time:

$ ls -l kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier
-rw-rw-r-- 1 laijs laijs 484080 Sep 15 11:31 kernel/workqueue.o.cpu_notifier
-rw-rw-r-- 1 laijs laijs 478240 Sep 15 11:31 kernel/workqueue.o.hotcpu_notifier

$ size kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier
   text	   data	    bss	    dec	    hex	filename
  18513	   2387	   1221	  22121	   5669	kernel/workqueue.o.cpu_notifier
  18082	   2355	   1221	  21658	   549a	kernel/workqueue.o.hotcpu_notifier

tj: Updated description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()

cancel_delayed_work() can't be called from IRQ handlers due to its use
of del_timer_sync() and can't cancel work items which are already
transferred from timer to worklist.

Also, unlike other flush and cancel functions, a canceled delayed_work
would still point to the last associated cpu_workqueue.  If the
workqueue is destroyed afterwards and the work item is re-used on a
different workqueue, the queueing code can oops trying to dereference
already freed cpu_workqueue.

This patch reimplements cancel_delayed_work() using
try_to_grab_pending() and set_work_cpu_and_clear_pending().  This
allows the function to be called from IRQ handlers and makes its
behavior consistent with other flush / cancel functions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>

workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic

The compiler may compile the following code into TWO write/modify
instructions.

	worker->flags &= ~WORKER_UNBOUND;
	worker->flags |= WORKER_REBIND;

so the other CPU may temporarily see worker->flags which doesn't have
either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup
prematurely.

Fix it by using single explicit assignment via ACCESS_ONCE().
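
A user-space sketch of the single-store pattern follows.  The flag
values and model_morph_unbound_to_rebind() are hypothetical, and a
volatile access is used here as a stand-in for ACCESS_ONCE(); like the
kernel code, it relies on an aligned word-sized store not being torn.

```c
#include <assert.h>

/* Hypothetical bit values, chosen only for this example. */
#define MODEL_WORKER_UNBOUND	(1u << 0)
#define MODEL_WORKER_REBIND	(1u << 1)
#define MODEL_WORKER_PREP	(1u << 2)	/* unrelated, must be kept */

static void model_morph_unbound_to_rebind(volatile unsigned int *flags)
{
	unsigned int f;

	assert(flags);

	/* compute the new value in a local ... */
	f = *flags;
	f &= ~MODEL_WORKER_UNBOUND;
	f |= MODEL_WORKER_REBIND;

	/* ... and publish it with one store, so no reader ever sees
	 * a value with neither UNBOUND nor REBIND set */
	*flags = f;
}
```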

Because idle workers have another WORKER_NOT_RUNNING flag, this bug
doesn't exist for them; however, update it to use the same pattern for
consistency.

tj: Applied the change to idle workers too and updated comments and
    patch description a bit.

Change-Id: I9b95f51d146c40c31ba028668d6f412bd74c6026
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org

workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function

This doesn't make any functional difference and is purely to help the
next patch to be simpler.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>

workqueue: fix possible deadlock in idle worker rebinding

Currently, rebind_workers() and idle_worker_rebind() are two-way
interlocked.  rebind_workers() waits for idle workers to finish
rebinding and rebound idle workers wait for rebind_workers() to finish
rebinding busy workers before proceeding.

Unfortunately, this isn't enough.  The second wait from idle workers
is implemented as follows.

	wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));

rebind_workers() clears WORKER_REBIND, wakes up the idle workers and
then returns.  If CPU hotplug cycle happens again before one of the
idle workers finishes the above wait_event(), rebind_workers() will
repeat the first part of the handshake - set WORKER_REBIND again and
wait for the idle worker to finish rebinding - and this leads to
deadlock because the idle worker would be waiting for WORKER_REBIND to
clear.

This is fixed by adding another interlocking step at the end -
rebind_workers() now waits for all the idle workers to finish the
above WORKER_REBIND wait before returning.  This ensures that all
rebinding steps are complete on all idle workers before the next
hotplug cycle can happen.

This problem was diagnosed by Lai Jiangshan who also posted a patch to
fix the issue, upon which this patch is based.

This is the minimal fix and further patches are scheduled for the next
merge window to simplify the CPU hotplug path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>

workqueue: restore POOL_MANAGING_WORKERS

This patch restores POOL_MANAGING_WORKERS which was replaced by
pool->manager_mutex by 6037315269 "workqueue: use mutex for global_cwq
manager exclusion".

There's a subtle idle worker depletion bug across CPU hotplug events
and we need to distinguish an actual manager and CPU hotplug
preventing management.  POOL_MANAGING_WORKERS will be used for the
former and manager_mutex the latter.

This patch just lays POOL_MANAGING_WORKERS on top of the existing
manager_mutex and doesn't introduce any synchronization changes.  The
next patch will update it.

Note that this patch fixes a non-critical anomaly where
too_many_workers() may return %true spuriously while CPU hotplug is in
progress.  While the issue could schedule idle timer spuriously, it
didn't trigger any actual misbehavior.

tj: Rewrote patch description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: fix possible idle worker depletion across CPU hotplug

To simplify both normal and CPU hotplug paths, worker management is
prevented while CPU hotplug is in progress.  This is achieved by CPU
hotplug holding the same exclusion mechanism used by workers to ensure
there's only one manager per pool.

If someone else seems to be performing the manager role, workers
proceed to execute work items.  CPU hotplug using the same mechanism
can lead to idle worker depletion because all workers could proceed to
execute work items while CPU hotplug is in progress and CPU hotplug
itself wouldn't actually perform the worker management duty - it
doesn't guarantee that there's an idle worker left when it releases
management.

This idle worker depletion, under extreme circumstances, can break
forward-progress guarantee and thus lead to deadlock.

This patch fixes the bug by using separate mechanisms for manager
exclusion among workers and hotplug exclusion.  For manager exclusion,
POOL_MANAGING_WORKERS which was restored by the previous patch is
used.  pool->manager_mutex is now only used for exclusion between the
elected manager and CPU hotplug.  The elected manager won't proceed
without holding pool->manager_mutex.

This ensures that the worker which won the manager position can't skip
managing while CPU hotplug is in progress.  It will block on
manager_mutex and perform management after CPU hotplug is complete.

Note that hotplug may happen while waiting for manager_mutex.  A
manager isn't either on idle or busy list and thus the hotplug code
can't unbind/rebind it.  Make the manager handle its own un/rebinding.

tj: Updated comment and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()

busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed
(CPU is down again).  This used to be okay because the flag wasn't
used for anything else.

However, after 25511a477 "workqueue: reimplement CPU online rebinding
to handle idle workers", WORKER_REBIND is also used to command idle
workers to rebind.  If not cleared, the worker may confuse the next
CPU_UP cycle by having REBIND spuriously set or oops / get stuck by
prematurely calling idle_worker_rebind().

  WARNING: at /work/os/wq/kernel/workqueue.c:1323 worker_thread+0x4cd/0x500()
  Hardware name: Bochs
  Modules linked in: test_wq(O-)
  Pid: 33, comm: kworker/1:1 Tainted: G           O 3.6.0-rc1-work+ #3
  Call Trace:
   [<ffffffff8109039f>] warn_slowpath_common+0x7f/0xc0
   [<ffffffff810903fa>] warn_slowpath_null+0x1a/0x20
   [<ffffffff810b3f1d>] worker_thread+0x4cd/0x500
   [<ffffffff810bc16e>] kthread+0xbe/0xd0
   [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10
  ---[ end trace e977cf20f4661968 ]---
  BUG: unable to handle kernel NULL pointer dereference at           (null)
  IP: [<ffffffff810b3db0>] worker_thread+0x360/0x500
  PGD 0
  Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Modules linked in: test_wq(O-)
  CPU 0
  Pid: 33, comm: kworker/1:1 Tainted: G        W  O 3.6.0-rc1-work+ #3 Bochs Bochs
  RIP: 0010:[<ffffffff810b3db0>]  [<ffffffff810b3db0>] worker_thread+0x360/0x500
  RSP: 0018:ffff88001e1c9de0  EFLAGS: 00010086
  RAX: 0000000000000000 RBX: ffff88001e633e00 RCX: 0000000000004140
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
  RBP: ffff88001e1c9ea0 R08: 0000000000000000 R09: 0000000000000001
  R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001fc8d580
  R13: ffff88001fc8d590 R14: ffff88001e633e20 R15: ffff88001e1c6900
  FS:  0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 0000000000000000 CR3: 00000000130e8000 CR4: 00000000000006f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
  Process kworker/1:1 (pid: 33, threadinfo ffff88001e1c8000, task ffff88001e1c6900)
  Stack:
   ffff880000000000 ffff88001e1c9e40 0000000000000001 ffff88001e1c8010
   ffff88001e519c78 ffff88001e1c9e58 ffff88001e1c6900 ffff88001e1c6900
   ffff88001e1c6900 ffff88001e1c6900 ffff88001fc8d340 ffff88001fc8d340
  Call Trace:
   [<ffffffff810bc16e>] kthread+0xbe/0xd0
   [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10
  Code: b1 00 f6 43 48 02 0f 85 91 01 00 00 48 8b 43 38 48 89 df 48 8b 00 48 89 45 90 e8 ac f0 ff ff 3c 01 0f 85 60 01 00 00 48 8b 53 50 <8b> 02 83 e8 01 85 c0 89 02 0f 84 3b 01 00 00 48 8b 43 38 48 8b
  RIP  [<ffffffff810b3db0>] worker_thread+0x360/0x500
   RSP <ffff88001e1c9de0>
  CR2: 0000000000000000

There was no reason to keep WORKER_REBIND on failure in the first
place - WORKER_UNBOUND is guaranteed to be set in such cases
preventing incorrectly activating concurrency management.  Always
clear WORKER_REBIND.

tj: Updated comment and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: reimplement idle worker rebinding

Currently rebind_workers() rebinds idle workers synchronously
before proceeding to requesting busy workers to rebind.  This is
necessary because all workers on @worker_pool->idle_list must be bound
before concurrency management local wake-ups from the busy workers
take place.

Unfortunately, the synchronous idle rebinding is quite complicated.
This patch reimplements idle rebinding to simplify the code path.

Rather than trying to make all idle workers bound before rebinding
busy workers, we simply remove all to-be-bound idle workers from the
idle list and let them add themselves back after completing rebinding
(successful or not).

As only workers which finished rebinding can be on the idle worker
list, the idle worker list is guaranteed to have only bound workers
unless CPU went down again and local wake-ups are safe.

After the change, @worker_pool->nr_idle may deviate from the actual
number of idle workers on @worker_pool->idle_list.  More specifically,
nr_idle may be non-zero while ->idle_list is empty.  All users of
->nr_idle and ->idle_list are audited.  The only affected one is
too_many_workers() which is updated to return %false if ->idle_list is
empty regardless of ->nr_idle.
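
The updated check might look roughly like the sketch below.  struct
model_pool, the list length field and MODEL_IDLE_RATIO are
hypothetical, and the ratio-based policy is an illustrative assumption,
not a statement about the actual heuristic.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_IDLE_RATIO	4	/* hypothetical policy knob */

/* Hypothetical model of the counters too_many_workers() consults. */
struct model_pool {
	int nr_idle;		/* may exceed idle_list_len during rebind */
	int idle_list_len;	/* workers actually on ->idle_list */
	int nr_busy;
};

static bool model_too_many_workers(const struct model_pool *p)
{
	assert(p);

	/* nothing on the list to reap, whatever nr_idle claims */
	if (p->idle_list_len == 0)
		return false;

	return p->nr_idle > 2 &&
	       (p->nr_idle - 2) * MODEL_IDLE_RATIO >= p->nr_busy;
}
```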

After this patch, rebind_workers() no longer performs the nasty
idle-rebind retries which require temporary release of gcwq->lock, and
both unbinding and rebinding are atomic w.r.t. global_cwq->lock.

worker->idle_rebind and global_cwq->rebind_hold are now unnecessary
and removed along with the definition of struct idle_rebind.

Changed from V1:
	1) remove unlikely from too_many_workers(), ->idle_list can be empty
	   anytime, even before this patch, no reason to use unlikely.
	2) fix a small rebasing mistake.
	   (which is from rebasing the original fixing patch to for-next)
	3) add a lot of comments.
	4) clear WORKER_REBIND unconditionally in idle_worker_rebind()

tj: Updated comments and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: WORKER_REBIND is no longer necessary for busy rebinding

Because the old unbind/rebinding implementation wasn't atomic w.r.t.
GCWQ_DISASSOCIATED manipulation which is protected by
global_cwq->lock, we had to use two flags, WORKER_UNBOUND and
WORKER_REBIND, to avoid incorrectly losing all NOT_RUNNING bits with
back-to-back CPU hotplug operations; otherwise, completion of
rebinding while another unbinding is in progress could clear UNBOUND
prematurely.

Now that both unbind/rebinding are atomic w.r.t. GCWQ_DISASSOCIATED,
there's no need to use two flags.  Just one is enough.  Don't use
WORKER_REBIND for busy rebinding.

tj: Updated description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: WORKER_REBIND is no longer necessary for idle rebinding

Now both worker destruction and idle rebinding remove the worker from
idle list while it's still idle, so list_empty(&worker->entry) can be
used to test whether either is pending and WORKER_DIE to distinguish
between the two instead, making WORKER_REBIND unnecessary.

Use list_empty(&worker->entry) to determine whether destruction or
rebinding is pending.  This simplifies worker state transitions.
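
In other words, the decision an idle worker makes can be sketched as
follows.  struct model_worker and the enum are hypothetical; the
boolean on_idle_list stands in for !list_empty(&worker->entry).

```c
#include <assert.h>
#include <stdbool.h>

enum model_idle_action { MODEL_NONE, MODEL_DESTROY, MODEL_REBIND };

/* Hypothetical model of the state an idle worker inspects on wakeup. */
struct model_worker {
	bool on_idle_list;	/* stands in for !list_empty(&worker->entry) */
	bool die;		/* stands in for WORKER_DIE */
};

static enum model_idle_action
model_idle_pending(const struct model_worker *w)
{
	assert(w);

	/* still on the idle list: neither operation is pending */
	if (w->on_idle_list)
		return MODEL_NONE;

	/* taken off the list: the DIE flag tells the two cases apart */
	return w->die ? MODEL_DESTROY : MODEL_REBIND;
}
```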

WORKER_REBIND is not needed anymore.  Remove it.

tj: Updated comments and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: rename manager_mutex to assoc_mutex

Now that manager_mutex's role has changed from synchronizing manager
role to excluding hotplug against manager, the name is misleading.

As it is protecting the CPU-association of the gcwq now, rename it to
assoc_mutex.

This patch is pure rename and doesn't introduce any functional change.

tj: Updated comments and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: use __cpuinit instead of __devinit for cpu callbacks

For workqueue hotplug callbacks, it makes less sense to use __devinit
which discards the memory after boot if !HOTPLUG.  __cpuinit, which
discards the memory after boot if !HOTPLUG_CPU fits better.

tj: Updated description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: fix possible stall on try_to_grab_pending() of a delayed work item

Currently, when try_to_grab_pending() grabs a delayed work item, it
leaves its linked work items alone on the delayed_works.  The linked
work items are always NO_COLOR and will cause future
cwq_activate_first_delayed() increase cwq->nr_active incorrectly, and
may cause the whole cwq to stall.  For example,

state: cwq->max_active = 1, cwq->nr_active = 1
       one work in cwq->pool, many in cwq->delayed_works.

step1: try_to_grab_pending() removes a work item from delayed_works
       but leaves its NO_COLOR linked work items on it.

step2: Later on, cwq_activate_first_delayed() activates the linked
       work item increasing ->nr_active.

step3: cwq->nr_active = 1, but all activated work items of the cwq are
       NO_COLOR.  When they finish, cwq->nr_active will not be
       decreased due to NO_COLOR, and no further work items will be
       activated from cwq->delayed_works. the cwq stalls.

Fix it by ensuring the target work item is activated before stealing
PENDING in try_to_grab_pending().  This ensures that all the linked
work items are activated without incorrectly bumping cwq->nr_active.

tj: Updated comment and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org

workqueue: reimplement work_on_cpu() using system_wq

The existing work_on_cpu() implementation is hugely inefficient.  On
each invocation it creates a new kthread, executes that single
function and then lets the kthread die.

Now that system_wq can handle concurrent executions, there's no
advantage in doing this.  Reimplement work_on_cpu() using system_wq,
which makes it simpler and way more efficient.
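
As a rough user-space analogy (not the kernel code), the two strategies look like this. `call_in_new_thread`, `call_on_shared_pool` and `_shared_pool` are hypothetical names, the pool merely stands in for system_wq, and CPU binding is not modelled at all.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def call_in_new_thread(fn, arg):
    """Old strategy: create a thread, run one function, let the thread exit."""
    result = []
    worker = threading.Thread(target=lambda: result.append(fn(arg)))
    worker.start()
    worker.join()
    return result[0]


# Long-lived pool shared by all callers; stand-in for system_wq.
_shared_pool = ThreadPoolExecutor(max_workers=4)


def call_on_shared_pool(fn, arg):
    """New strategy: queue the function on the shared pool and wait for it."""
    return _shared_pool.submit(fn, arg).result()
```

Both helpers block until the function has returned, as work_on_cpu() does; only the second one avoids thread creation and teardown per call.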

stable: While this isn't a fix in itself, it's needed to fix a
        workqueue related bug in cpufreq/powernow-k8.  AFAICS, this
        shouldn't break other existing users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org

workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()

Using a helper instead of open code makes thaw_workqueues() clearer.
The helper will also be used by the next patch.

tj: Slight update to comment and description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()

workqueue_set_max_active() may increase ->max_active without
activating delayed works and may make the activation order differ from
the queueing order.  Neither is strictly a bug, but the resulting
behavior could be a bit odd.

To make things more consistent, use the cwq_set_max_active() helper,
which immediately makes use of the newly increased max_active if there
are delayed work items and also keeps the activation order.
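
A minimal sketch of the ordering issue, assuming a toy queue where an item is activated directly whenever nr_active is below max_active. `ToyCwq` and its methods are hypothetical; only the behavior described above is modelled.

```python
from collections import deque


class ToyCwq:
    def __init__(self, max_active):
        self.max_active = max_active
        self.nr_active = 0
        self.delayed = deque()
        self.activated = []  # activation order, kept for inspection

    def _activate(self, name):
        self.nr_active += 1
        self.activated.append(name)

    def queue(self, name):
        # New items bypass the delayed list when a slot is free.
        if self.nr_active < self.max_active:
            self._activate(name)
        else:
            self.delayed.append(name)

    def set_max_active(self, max_active):
        # Helper: raise the limit and immediately use the new slots,
        # oldest delayed item first.
        self.max_active = max_active
        while self.delayed and self.nr_active < self.max_active:
            self._activate(self.delayed.popleft())


def demo(use_helper):
    cwq = ToyCwq(max_active=1)
    for name in ("a", "b", "c"):
        cwq.queue(name)
    if use_helper:
        cwq.set_max_active(3)
    else:
        cwq.max_active = 3  # open-coded update, delayed items are not touched
    cwq.queue("d")
    return cwq.activated
```

Without the helper, "d" is activated ahead of the older "b" and "c"; with it, activation follows queueing order.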

tj: Slight update to description.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()

e0aecdd874 ("workqueue: use irqsafe timer for delayed_work") made
try_to_grab_pending() safe to use from irq context but forgot to
remove WARN_ON_ONCE(in_irq()).  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>

workqueue: cancel_delayed_work() should return %false if work item is idle

57b30ae77b ("workqueue: reimplement cancel_delayed_work() using
try_to_grab_pending()") made cancel_delayed_work() always return %true
unless someone else is also trying to cancel the work item, which is
broken - if the target work item is idle, the return value should be
%false.

try_to_grab_pending() indicates that the target work item was idle by
returning zero.  Use that as the return value.  Note that this brings
cancel_delayed_work() in line with __cancel_work_timer() in return
value handling.
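
The return-value handling can be sketched as follows. The `grab` callable is a hypothetical stand-in for try_to_grab_pending(), and the return codes used here (1 = was pending, 0 = was idle, -EAGAIN = retry, any other negative value = someone else is cancelling) are my reading of that function, not something stated in this message.

```python
import errno


def cancel_delayed_work(grab, work):
    """Return True only if a pending work item was actually cancelled."""
    while True:
        ret = grab(work)
        if ret != -errno.EAGAIN:  # -EAGAIN means "busy, try again"
            break
    if ret < 0:
        return False  # someone else is cancelling the item
    # The kernel would clear the PENDING state it just claimed here.
    return bool(ret)  # 0 (idle) -> False, 1 (was pending) -> True
```

A stub that always returns 0 models an idle work item, for which the function now reports False.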

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <444a6439-b1a4-4740-9e7e-bc37267cfe73@default>

workqueue: exit rescuer_thread() as TASK_RUNNING

A rescue thread exiting TASK_INTERRUPTIBLE can lead to a task scheduling
off, never to be seen again.  In the case where this occurred, an exiting
thread hit reiserfs homebrew conditional resched while holding a mutex,
bringing the box to its knees.

PID: 18105  TASK: ffff8807fd412180  CPU: 5   COMMAND: "kdmflush"
 #0 [ffff8808157e7670] schedule at ffffffff8143f489
 #1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs]
 #2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14
 #3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs]
 #4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2
 #5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41
 #6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a
 #7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88
 #8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850
 #9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f
    [exception RIP: kernel_thread_helper]
    RIP: ffffffff8144a5c0  RSP: ffff8808157e7f58  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: 0000000000000000  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffffffff8107af60  RDI: ffff8803ee491d18
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018

Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org

workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay

8376fe22c7 ("workqueue: implement mod_delayed_work[_on]()")
implemented mod_delayed_work[_on]() using the improved
try_to_grab_pending().  The function is later used, among others, to
replace [__]cancel_delayed_work() + queue_delayed_work() combinations.

Unfortunately, a delayed_work item w/ zero @delay is handled slightly
differently by mod_delayed_work_on() compared to
queue_delayed_work_on().  The latter skips timer altogether and
directly queues it using queue_work_on() while the former schedules
timer which will expire on the closest tick.  This means, when @delay
is zero, that [__]cancel_delayed_work() + queue_delayed_work_on()
makes the target item immediately executable while
mod_delayed_work_on() may induce a delay of up to a full tick.

This somewhat subtle difference breaks some of the converted users.
e.g. block queue plugging uses delayed_work for deferred processing
and uses mod_delayed_work_on() when the queue needs to be immediately
unplugged.  The above problem manifested as a noticeably higher number
of context switches under certain circumstances.

The difference in behavior was caused by missing special case handling
for 0 delay in mod_delayed_work_on() compared to
queue_delayed_work_on().  Joonsoo Kim posted a patch to add it -
("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1].
The patch was queued for 3.8 but it was described as an optimization and
I missed that it was a correctness issue.

As both queue_delayed_work_on() and mod_delayed_work_on() use
__queue_delayed_work() for queueing, it seems that the better approach
is to move the 0 delay special handling to the function instead of
duplicating it in mod_delayed_work_on().

Fix the problem by moving 0 delay special case handling from
queue_delayed_work_on() to __queue_delayed_work().  This replaces
Joonsoo's patch.
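
A toy model of the resulting structure, assuming a queue with a "runnable now" list and a timer table. `ToyDelayedQueue` and its members are hypothetical names; the point is only that the zero-delay case sits in the shared helper, so both entry points get it.

```python
class ToyDelayedQueue:
    """Toy model: 'runnable' is executable now, 'timers' waits for a tick."""

    def __init__(self):
        self.runnable = []
        self.timers = {}  # work -> delay in ticks

    def _pending(self, work):
        return work in self.runnable or work in self.timers

    def _queue_delayed_work(self, work, delay):
        # Shared helper: the zero-delay special case lives here.
        if delay == 0:
            self.runnable.append(work)
            return
        self.timers[work] = delay

    def queue_delayed_work(self, work, delay):
        if self._pending(work):
            return False  # already queued, leave it alone
        self._queue_delayed_work(work, delay)
        return True

    def mod_delayed_work(self, work, delay):
        was_pending = self._pending(work)
        # Grab the item from wherever it currently sits, then requeue it.
        self.timers.pop(work, None)
        if work in self.runnable:
            self.runnable.remove(work)
        self._queue_delayed_work(work, delay)
        return was_pending
```

Calling `mod_delayed_work(work, 0)` on an item that is waiting on a timer moves it straight to the runnable list instead of arming another timer.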

[1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Anders Kaseorg <andersk@MIT.EDU>
Reported-and-tested-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu>
LKML-Reference: <50A78AA9.5040904@iskon.hr>
Cc: Joonsoo Kim <js1304@gmail.com>

workqueue: trivial fix for return statement in work_busy()

The return type of work_busy() is unsigned int, but one return
statement in it returns the boolean value 'false'.  This is not a
problem, because 'false' is treated as '0', but fixing it makes the
code more robust.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: add WARN_ON_ONCE() on CPU number to wq_worker_waking_up()

Recently, workqueue code has gone through some changes and we found
some bugs related to concurrency management operations happening on
the wrong CPU.  When a worker is concurrency managed
(!WORKER_NOT_RUNNING), it should be bound to its associated cpu and
woken up to that cpu.  Add WARN_ON_ONCE() to verify this.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s

8852aac25e ("workqueue: mod_delayed_work_on() shouldn't queue timer on
0 delay") unexpectedly uncovered a very nasty abuse of delayed_work in
megaraid - it allocated work_struct, cast it to delayed_work and
then passed that into queue_delayed_work().

Previously, this was okay because 0 @delay short-circuited to
queue_work() before doing anything with delayed_work.  8852aac25e
moved 0 @delay test into __queue_delayed_work() after sanity check on
delayed_work making megaraid trigger BUG_ON().

Although megaraid is already fixed by c1d390d8e6 ("megaraid: fix
BUG_ON() from incorrect use of delayed work"), this patch converts
BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that such
abusers, if there are more, trigger a warning but don't crash the
machine.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Xiaotian Feng <xtfeng@gmail.com>

wq

Change-Id: Ia3c507777a995f32bf6b40dc8318203e53134229
Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
fefifofum pushed a commit to armani-dev/android_kernel_xiaomi_armani_kk that referenced this issue Oct 23, 2015
Setting an empty security context (length=0) on a file will
lead to incorrectly dereferencing the type and other fields
of the security context structure, yielding a kernel BUG.
As a zero-length security context is never valid, just reject
all such security contexts whether coming from userspace
via setxattr or coming from the filesystem upon a getxattr
request by SELinux.
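
The idea of the check can be sketched in user space as follows. `parse_security_context` is a hypothetical helper, the "user:role:type[:range]" layout is the usual SELinux label form, and the use of EINVAL as the error code is an assumption; the kernel change itself is not reproduced here.

```python
import errno


def parse_security_context(raw: bytes) -> dict:
    """Parse a 'user:role:type[:range]' label; hypothetical helper."""
    if len(raw) == 0:
        # Reject zero-length contexts up front instead of dereferencing
        # fields that were never filled in.
        raise OSError(errno.EINVAL, "zero-length security context")
    fields = raw.rstrip(b"\0").decode("ascii").split(":", 3)
    if len(fields) < 3 or not all(fields[:3]):
        raise OSError(errno.EINVAL, "malformed security context")
    return {
        "user": fields[0],
        "role": fields[1],
        "type": fields[2],
        "range": fields[3] if len(fields) > 3 else None,
    }
```

An empty value, such as the one written by the reproducer, is refused before any field is looked at.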

Setting a security context value (empty or otherwise) unknown to
SELinux in the first place is only possible for a root process
(CAP_MAC_ADMIN), and, if running SELinux in enforcing mode, only
if the corresponding SELinux mac_admin permission is also granted
to the domain by policy.  In Fedora policies, this is only allowed for
specific domains such as livecd for setting down security contexts
that are not defined in the build host policy.

[On Android, this can only be set by root/CAP_MAC_ADMIN processes,
and if running SELinux in enforcing mode, only if mac_admin permission
is granted in policy.  In Android 4.4, this would only be allowed for
root/CAP_MAC_ADMIN processes that are also in unconfined domains. In current
AOSP master, mac_admin is not allowed for any domains except the recovery
console which has a legitimate need for it.  The other potential vector
is mounting a maliciously crafted filesystem for which SELinux fetches
xattrs (e.g. an ext4 filesystem on a SDcard).  However, the end result is
only a local denial-of-service (DOS) due to kernel BUG.  This fix is
queued for 3.14.]

Reproducer:
su
setenforce 0
touch foo
setfattr -n security.selinux foo

Caveat:
Relabeling or removing foo after doing the above may not be possible
without booting with SELinux disabled.  Any subsequent access to foo
after doing the above will also trigger the BUG.

BUG output from Matthew Thode:
[  473.893141] ------------[ cut here ]------------
[  473.962110] kernel BUG at security/selinux/ss/services.c:654!
[  473.995314] invalid opcode: 0000 [#6] SMP
[  474.027196] Modules linked in:
[  474.058118] CPU: 0 PID: 8138 Comm: ls Tainted: G      D   I
3.13.0-grsec #1
[  474.116637] Hardware name: Supermicro X8ST3/X8ST3, BIOS 2.0
07/29/10
[  474.149768] task: ffff8805f50cd010 ti: ffff8805f50cd488 task.ti:
ffff8805f50cd488
[  474.183707] RIP: 0010:[<ffffffff814681c7>]  [<ffffffff814681c7>]
context_struct_compute_av+0xce/0x308
[  474.219954] RSP: 0018:ffff8805c0ac3c38  EFLAGS: 00010246
[  474.252253] RAX: 0000000000000000 RBX: ffff8805c0ac3d94 RCX:
0000000000000100
[  474.287018] RDX: ffff8805e8aac000 RSI: 00000000ffffffff RDI:
ffff8805e8aaa000
[  474.321199] RBP: ffff8805c0ac3cb8 R08: 0000000000000010 R09:
0000000000000006
[  474.357446] R10: 0000000000000000 R11: ffff8805c567a000 R12:
0000000000000006
[  474.419191] R13: ffff8805c2b74e88 R14: 00000000000001da R15:
0000000000000000
[  474.453816] FS:  00007f2e75220800(0000) GS:ffff88061fc00000(0000)
knlGS:0000000000000000
[  474.489254] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  474.522215] CR2: 00007f2e74716090 CR3: 00000005c085e000 CR4:
00000000000207f0
[  474.556058] Stack:
[  474.584325]  ffff8805c0ac3c98 ffffffff811b549b ffff8805c0ac3c98
ffff8805f1190a40
[  474.618913]  ffff8805a6202f08 ffff8805c2b74e88 00068800d0464990
ffff8805e8aac860
[  474.653955]  ffff8805c0ac3cb8 000700068113833a ffff880606c75060
ffff8805c0ac3d94
[  474.690461] Call Trace:
[  474.723779]  [<ffffffff811b549b>] ? lookup_fast+0x1cd/0x22a
[  474.778049]  [<ffffffff81468824>] security_compute_av+0xf4/0x20b
[  474.811398]  [<ffffffff8196f419>] avc_compute_av+0x2a/0x179
[  474.843813]  [<ffffffff8145727b>] avc_has_perm+0x45/0xf4
[  474.875694]  [<ffffffff81457d0e>] inode_has_perm+0x2a/0x31
[  474.907370]  [<ffffffff81457e76>] selinux_inode_getattr+0x3c/0x3e
[  474.938726]  [<ffffffff81455cf6>] security_inode_getattr+0x1b/0x22
[  474.970036]  [<ffffffff811b057d>] vfs_getattr+0x19/0x2d
[  475.000618]  [<ffffffff811b05e5>] vfs_fstatat+0x54/0x91
[  475.030402]  [<ffffffff811b063b>] vfs_lstat+0x19/0x1b
[  475.061097]  [<ffffffff811b077e>] SyS_newlstat+0x15/0x30
[  475.094595]  [<ffffffff8113c5c1>] ? __audit_syscall_entry+0xa1/0xc3
[  475.148405]  [<ffffffff8197791e>] system_call_fastpath+0x16/0x1b
[  475.179201] Code: 00 48 85 c0 48 89 45 b8 75 02 0f 0b 48 8b 45 a0 48
8b 3d 45 d0 b6 00 8b 40 08 89 c6 ff ce e8 d1 b0 06 00 48 85 c0 49 89 c7
75 02 <0f> 0b 48 8b 45 b8 4c 8b 28 eb 1e 49 8d 7d 08 be 80 01 00 00 e8
[  475.255884] RIP  [<ffffffff814681c7>]
context_struct_compute_av+0xce/0x308
[  475.296120]  RSP <ffff8805c0ac3c38>
[  475.328734] ---[ end trace f076482e9d754adc ]---

[sds:  commit message edited to note Android implications and
to generate a unique Change-Id for gerrit]

Change-Id: I4d5389f0cfa72b5f59dada45081fa47e03805413
Reported-by:  Matthew Thode <mthode@mthode.org>
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Sivasri Kumar Vanka <sivasri@codeaurora.org>
fefifofum pushed a commit to armani-dev/android_kernel_xiaomi_armani_kk that referenced this issue Oct 23, 2015
fefifofum pushed a commit to armani-dev/android_kernel_xiaomi_armani_kk that referenced this issue Nov 21, 2015
fefifofum pushed a commit to armani-dev/android_kernel_xiaomi_armani_kk that referenced this issue Nov 21, 2015
corphish pushed a commit to TeamButter/android_kernel_xiaomi_kenzo that referenced this issue Aug 1, 2016
commit 346c09f upstream.

A bug in the workqueue code leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
[  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debugging it became clear that the stalled request is always
inserted into the rq_list via the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, blk_mq_insert_requests() offloads execution of
__blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it calls
kblockd_schedule_delayed_work_on().

That means we race with another CPU, which is about to execute the
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  request A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  <schedule to kblockd workqueue>
                                         request B inserted
                                         STORE hctx->ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result, request B stays pending forever.

This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.
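
The ordering requirement above can be sketched in userspace with C11 atomics. This is only a model under stated assumptions, not the kernel patch: `model_hctx`, `model_init`, `model_insert_request` and `model_run_work` are hypothetical names, and `atomic_thread_fence(memory_order_seq_cst)` stands in for the full barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the work item's PENDING bit and hctx->ctx_map. */
struct model_hctx {
    atomic_bool work_pending;
    atomic_ulong ctx_map;
};

static void model_init(struct model_hctx *h)
{
    atomic_init(&h->work_pending, false);
    atomic_init(&h->ctx_map, 0UL);
}

/*
 * Submitter: mark the software queue busy, then try to queue the work.
 * Returns true if this call queued the work, false if it was already pending.
 */
static bool model_insert_request(struct model_hctx *h, unsigned int ctx)
{
    assert(ctx < sizeof(unsigned long) * 8);
    atomic_fetch_or_explicit(&h->ctx_map, 1UL << ctx, memory_order_relaxed);
    /* Assumption: models the full barrier implied by a test-and-set of PENDING. */
    atomic_thread_fence(memory_order_seq_cst);
    return !atomic_exchange_explicit(&h->work_pending, true,
                                     memory_order_relaxed);
}

/*
 * Worker: clear PENDING, then collect the busy bits. The fence keeps the
 * ctx_map read from being satisfied before the PENDING clear is visible.
 */
static unsigned long model_run_work(struct model_hctx *h)
{
    atomic_store_explicit(&h->work_pending, false, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_exchange_explicit(&h->ctx_map, 0UL, memory_order_relaxed);
}
```

With a fence on both sides, a submitter that finds PENDING still set is guaranteed that the worker's later read of the map sees its bit. Without the worker-side fence, both sides can miss each other, which is the stall shown in the chart.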

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue Oct 8, 2016
[ Upstream commit ecf5fc6 ]

Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at ffffffff815ab152
  #1 schedule at ffffffff815ab76e
  #2 schedule_timeout at ffffffff815ae5e5
  #3 io_schedule_timeout at ffffffff815aad6a
  #4 bit_wait_io at ffffffff815abfc6
  #5 __wait_on_bit at ffffffff815abda5
  #6 wait_on_page_bit at ffffffff8111fd4f
  #7 shrink_page_list at ffffffff81135445
  #8 shrink_inactive_list at ffffffff81135845
  #9 shrink_lruvec at ffffffff81135ead
 #10 shrink_zone at ffffffff811360c3
 #11 shrink_zones at ffffffff81136eff
 #12 do_try_to_free_pages at ffffffff8113712f
 #13 try_to_free_mem_cgroup_pages at ffffffff811372be
 #14 try_charge at ffffffff81189423
 #15 mem_cgroup_try_charge at ffffffff8118c6f5
 #16 __add_to_page_cache_locked at ffffffff8112137d
 #17 add_to_page_cache_lru at ffffffff81121618
 #18 pagecache_get_page at ffffffff8112170b
 #19 grow_dev_page at ffffffff811c8297
 #20 __getblk_slow at ffffffff811c91d6
 #21 __getblk_gfp at ffffffff811c92c1
 #22 ext4_ext_grow_indepth at ffffffff8124565c
 #23 ext4_ext_create_new_leaf at ffffffff81246ca8
 #24 ext4_ext_insert_extent at ffffffff81246f09
 #25 ext4_ext_map_blocks at ffffffff8124a848
 #26 ext4_map_blocks at ffffffff8121a5b7
 #27 mpage_map_one_extent at ffffffff8121b1fa
 #28 mpage_map_and_submit_extent at ffffffff8121f07b
 #29 ext4_writepages at ffffffff8121f6d5
 #30 do_writepages at ffffffff8112c490
 #31 __filemap_fdatawrite_range at ffffffff81120199
 #32 filemap_flush at ffffffff8112041c
 #33 ext4_alloc_da_blocks at ffffffff81219da1
 #34 ext4_rename at ffffffff81229b91
 #35 ext4_rename2 at ffffffff81229e32
 #36 vfs_rename at ffffffff811a08a5
 #37 SYSC_renameat2 at ffffffff811a3ffc
 #38 sys_renameat2 at ffffffff811a408e
 #39 sys_rename at ffffffff8119e51e
 #40 system_call_fastpath at ffffffff815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f4 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.
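
The effect of swapping the __GFP_IO test for may_enter_fs can be sketched as a small decision function. This is a simplified model, not the code in mm/vmscan.c: the flag values and every `model_*` / `MODEL_*` name are hypothetical, the kswapd case is left out, and the may_enter_fs expression is written as I recall it from shrink_page_list() of that era.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real ones live in include/linux/gfp.h. */
#define MODEL_GFP_IO (1u << 0)
#define MODEL_GFP_FS (1u << 1)

enum model_wb_action {
    MODEL_WB_SKIP,  /* mark the page for reclaim and move on */
    MODEL_WB_WAIT   /* wait for writeback on the page to finish */
};

struct model_reclaim {
    unsigned int gfp_mask;
    bool global_reclaim;   /* true for global (non-memcg) reclaim */
    bool page_reclaim;     /* the page already carries the reclaim flag */
    bool page_swap_cache;  /* the page is in the swap cache */
};

/* Assumed shape of may_enter_fs: FS allowed, or swap-cache page with IO allowed. */
static bool model_may_enter_fs(const struct model_reclaim *r)
{
    return (r->gfp_mask & MODEL_GFP_FS) ||
           (r->page_swap_cache && (r->gfp_mask & MODEL_GFP_IO));
}

/* Before the fix: only __GFP_IO decides whether reclaim may wait. */
static enum model_wb_action model_writeback_before(const struct model_reclaim *r)
{
    if (r->global_reclaim || !r->page_reclaim ||
        !(r->gfp_mask & MODEL_GFP_IO))
        return MODEL_WB_SKIP;
    return MODEL_WB_WAIT;
}

/* After the fix: waiting additionally requires may_enter_fs. */
static enum model_wb_action model_writeback_after(const struct model_reclaim *r)
{
    if (r->global_reclaim || !r->page_reclaim || !model_may_enter_fs(r))
        return MODEL_WB_SKIP;
    return MODEL_WB_WAIT;
}
```

For a GFP_NOFS-style memcg charge (IO allowed, FS not) the old check waits on the page, while the new check skips it, which is what avoids the hang in the backtrace.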

As per David Chinner, xfs has been doing a similar thing since 2.6.15,
so ext4 is not the only affected filesystem.  Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: stable@vger.kernel.org # 3.9+
[tytso@mit.edu: corrected the control flow]
Fixes: c3b94f4 ("memcg: further prevent OOM with too many dirty pages")
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue Oct 8, 2016
[ Upstream commit fc5fee8 ]

It turns out that a PV domU also requires the "Xen PV" APIC
driver. Otherwise, the flat driver is used and we get stuck in busy
loops that never exit, such as in this stack trace:

(gdb) target remote localhost:9999
Remote debugging using localhost:9999
__xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56
56              while (native_apic_mem_read(APIC_ICR) & APIC_ICR_BUSY)
(gdb) bt
 #0  __xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56
 #1  __default_send_IPI_shortcut (shortcut=<optimized out>,
dest=<optimized out>, vector=<optimized out>) at
./arch/x86/include/asm/ipi.h:75
 #2  apic_send_IPI_self (vector=246) at arch/x86/kernel/apic/probe_64.c:54
 #3  0xffffffff81011336 in arch_irq_work_raise () at
arch/x86/kernel/irq_work.c:47
 #4  0xffffffff8114990c in irq_work_queue (work=0xffff88000fc0e400) at
kernel/irq_work.c:100
 #5  0xffffffff8110c29d in wake_up_klogd () at kernel/printk/printk.c:2633
 #6  0xffffffff8110ca60 in vprintk_emit (facility=0, level=<optimized
out>, dict=0x0 <irq_stack_union>, dictlen=<optimized out>,
fmt=<optimized out>, args=<optimized out>)
    at kernel/printk/printk.c:1778
 #7  0xffffffff816010c8 in printk (fmt=<optimized out>) at
kernel/printk/printk.c:1868
 #8  0xffffffffc00013ea in ?? ()
 #9  0x0000000000000000 in ?? ()

Mailing-list-thread: https://lkml.org/lkml/2015/8/4/755
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue Oct 8, 2016
[ Upstream commit 1c2cb59 ]

The EPOW interrupt handler uses rtas_get_sensor(), which in turn
uses rtas_busy_delay() to wait for RTAS becoming ready in case it
is necessary. But rtas_busy_delay() is annotated with might_sleep()
and thus may not be used by interrupt handlers like the EPOW handler!
This leads to the following BUG when CONFIG_DEBUG_ATOMIC_SLEEP is
enabled:

 BUG: sleeping function called from invalid context at arch/powerpc/kernel/rtas.c:496
 in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc2-thuth #6
 Call Trace:
 [c00000007ffe7b90] [c000000000807670] dump_stack+0xa0/0xdc (unreliable)
 [c00000007ffe7bc0] [c0000000000e1f14] ___might_sleep+0x134/0x180
 [c00000007ffe7c20] [c00000000002aec0] rtas_busy_delay+0x30/0xd0
 [c00000007ffe7c50] [c00000000002bde4] rtas_get_sensor+0x74/0xe0
 [c00000007ffe7ce0] [c000000000083264] ras_epow_interrupt+0x44/0x450
 [c00000007ffe7d90] [c000000000120260] handle_irq_event_percpu+0xa0/0x300
 [c00000007ffe7e70] [c000000000120524] handle_irq_event+0x64/0xc0
 [c00000007ffe7eb0] [c000000000124dbc] handle_fasteoi_irq+0xec/0x260
 [c00000007ffe7ef0] [c00000000011f4f0] generic_handle_irq+0x50/0x80
 [c00000007ffe7f20] [c000000000010f3c] __do_irq+0x8c/0x200
 [c00000007ffe7f90] [c0000000000236cc] call_do_irq+0x14/0x24
 [c00000007e6f39e0] [c000000000011144] do_IRQ+0x94/0x110
 [c00000007e6f3a30] [c000000000002594] hardware_interrupt_common+0x114/0x180

Fix this issue by introducing a new rtas_get_sensor_fast() function
that does not use rtas_busy_delay() - and thus can only be used for
sensors that do not cause a BUSY condition - known as "fast" sensors.

The EPOW sensor is defined to be "fast" in sPAPR - mpe.
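
The idea of a non-sleeping sensor read can be sketched in plain C. This is a model under stated assumptions, not the kernel's rtas_get_sensor_fast(): the firmware call is passed in as a hypothetical callback, the two `model_fw_*` stubs exist only for illustration, and the busy codes and the `MODEL_EBUSY` return value are illustrative choices.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative status codes; treat the exact values as assumptions. */
#define MODEL_RTAS_BUSY      (-2)
#define MODEL_EXT_DELAY_MIN  9900
#define MODEL_EXT_DELAY_MAX  9905
#define MODEL_EBUSY          (-16)

/* Hypothetical firmware entry point: returns a status, fills *state. */
typedef int (*model_rtas_call_fn)(int sensor, int index, int *state);

static bool model_is_busy(int rc)
{
    return rc == MODEL_RTAS_BUSY ||
           (rc >= MODEL_EXT_DELAY_MIN && rc <= MODEL_EXT_DELAY_MAX);
}

/*
 * Fast variant: exactly one firmware call and no retry loop, so it never
 * sleeps and is usable from interrupt context. A busy status is reported
 * as an error instead of being waited out.
 */
static int model_get_sensor_fast(model_rtas_call_fn call, int sensor,
                                 int index, int *state)
{
    int rc = call(sensor, index, state);

    if (model_is_busy(rc))
        return MODEL_EBUSY;
    return rc;
}

/* Stub firmware that answers immediately, as a "fast" sensor should. */
static int model_fw_ready(int sensor, int index, int *state)
{
    (void)sensor;
    (void)index;
    *state = 3;
    return 0;
}

/* Stub firmware that reports busy, which a "fast" sensor must not do. */
static int model_fw_busy(int sensor, int index, int *state)
{
    (void)sensor;
    (void)index;
    (void)state;
    return MODEL_RTAS_BUSY;
}
```

The design point is that the caller trades the retry loop for a hard error, which is only acceptable for sensors that are specified never to return a busy status.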

Fixes: 587f83e ("powerpc/pseries: Use rtas_get_sensor in RAS code")
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue Oct 8, 2016
[ Upstream commit e81107d ]

My colleague ran into a program stall on an x86_64 server, where
n_tty_read() was waiting for data even though there was data in the
pty buffer.  The kernel stack for the stuck process is shown below.
 #0 [ffff88303d107b58] __schedule at ffffffff815c4b20
 #1 [ffff88303d107bd0] schedule at ffffffff815c513e
 #2 [ffff88303d107bf0] schedule_timeout at ffffffff815c7818
 #3 [ffff88303d107ca0] wait_woken at ffffffff81096bd2
 #4 [ffff88303d107ce0] n_tty_read at ffffffff8136fa23
 #5 [ffff88303d107dd0] tty_read at ffffffff81368013
 #6 [ffff88303d107e20] __vfs_read at ffffffff811a3704
 #7 [ffff88303d107ec0] vfs_read at ffffffff811a3a57
 #8 [ffff88303d107f00] sys_read at ffffffff811a4306
 #9 [ffff88303d107f50] entry_SYSCALL_64_fastpath at ffffffff815c86d7

There seem to be two problems causing this issue.

First, in drivers/tty/n_tty.c, __receive_buf() stores the data and
updates ldata->commit_head using smp_store_release() and then checks
the wait queue using waitqueue_active().  However, since there is no
memory barrier, __receive_buf() could return without calling
wake_up_interactive_poll(), and at the same time, n_tty_read() could
start to wait in wait_woken() as in the following chart.

        __receive_buf()                         n_tty_read()
------------------------------------------------------------------------
if (waitqueue_active(&tty->read_wait))
/* Memory operations issued after the
   RELEASE may be completed before the
   RELEASE operation has completed */
                                        add_wait_queue(&tty->read_wait, &wait);
                                        ...
                                        if (!input_available_p(tty, 0)) {
smp_store_release(&ldata->commit_head,
                  ldata->read_head);
                                        ...
                                        timeout = wait_woken(&wait,
                                          TASK_INTERRUPTIBLE, timeout);
------------------------------------------------------------------------

The second problem is that n_tty_read() also lacks a memory barrier
call and could also cause __receive_buf() to return without calling
wake_up_interactive_poll(), and n_tty_read() to wait in wait_woken()
as in the chart below.

        __receive_buf()                         n_tty_read()
------------------------------------------------------------------------
                                        spin_lock_irqsave(&q->lock, flags);
                                        /* from add_wait_queue() */
                                        ...
                                        if (!input_available_p(tty, 0)) {
                                        /* Memory operations issued after the
                                           RELEASE may be completed before the
                                           RELEASE operation has completed */
smp_store_release(&ldata->commit_head,
                  ldata->read_head);
if (waitqueue_active(&tty->read_wait))
                                        __add_wait_queue(q, wait);
                                        spin_unlock_irqrestore(&q->lock,flags);
                                        /* from add_wait_queue() */
                                        ...
                                        timeout = wait_woken(&wait,
                                          TASK_INTERRUPTIBLE, timeout);
------------------------------------------------------------------------

There are also other places in drivers/tty/n_tty.c which have similar
calls to waitqueue_active(), so instead of adding many memory barrier
calls, this patch simply removes the call to waitqueue_active(),
leaving just wake_up*() behind.

This fixes both problems because, even though the memory access before
or after the spinlocks in both wake_up*() and add_wait_queue() can
sneak into the critical section, it cannot go past it and the critical
section assures that they will be serialized (please see "INTER-CPU
ACQUIRING BARRIER EFFECTS" in Documentation/memory-barriers.txt for a
better explanation).  Moreover, the resulting code is much simpler.
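
The argument above can be replayed deterministically in a single thread. This is only a model of the interleavings, not tty code: `model_tty` and both `model_replay_*` functions are hypothetical, and each assignment stands for the point at which an operation takes effect.

```c
#include <assert.h>
#include <stdbool.h>

struct model_tty {
    bool waiter_queued;  /* tty->read_wait has an entry */
    bool data_visible;   /* the commit_head store is visible to the reader */
};

/*
 * First chart: waitqueue_active() takes effect before the commit_head
 * store is visible. Returns true if the reader sleeps with no wake-up.
 */
static bool model_replay_checked_wake(void)
{
    struct model_tty t = { false, false };

    bool producer_wakes = t.waiter_queued;  /* waitqueue_active(): empty */
    t.waiter_queued = true;                 /* reader: add_wait_queue() */
    bool reader_sleeps = !t.data_visible;   /* reader: no input available */
    t.data_visible = true;                  /* producer: store lands late */
    return reader_sleeps && !producer_wakes;
}

/*
 * After the patch: wake_up*() always runs and takes the queue lock, so its
 * critical section is serialized against the one in add_wait_queue().
 * wake_section_first selects which critical section runs first.
 */
static bool model_replay_unconditional_wake(bool wake_section_first)
{
    struct model_tty t = { false, false };
    bool producer_wakes, reader_sleeps;

    if (wake_section_first) {
        t.data_visible = true;             /* store cannot pass the unlock */
        producer_wakes = t.waiter_queued;  /* queue still empty */
        t.waiter_queued = true;            /* reader: add_wait_queue() */
        reader_sleeps = !t.data_visible;   /* reader sees the data */
    } else {
        t.waiter_queued = true;            /* reader: add_wait_queue() */
        reader_sleeps = !t.data_visible;   /* worst case: no data seen yet */
        t.data_visible = true;             /* producer: store */
        producer_wakes = t.waiter_queued;  /* wake_up*() finds the waiter */
    }
    return reader_sleeps && !producer_wakes;
}
```

In the checked variant the reader ends up asleep with nobody to wake it. In both serializations of the unconditional variant, either the reader sees the data or the producer finds the waiter.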

Latency measurement using a ping-pong test over a pty doesn't show any
visible performance drop.

Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue Oct 8, 2016
[ Upstream commit d144dfe ]

If we use the USB ID pin as a wakeup source and there is a USB block
device on this USB OTG (ID) cable, the system will deadlock after
resume.

The root cause is that the ci_otg workqueue may try to remove the hcd
before driver resume has finished. The hcd then disconnects the device
attached to it, which calls device_release_driver() and takes the
device lock "dev->mutex". That lock is never released, because the
removal waits for the writeback workqueue to flush the block
information, and the writeback workqueue is freezable, so it is not
thawed until driver resume has finished.

When the resume of the driver (device: sd 0:0:0:0:) reaches
dpm_complete(), it tries to take the device lock "dev->mutex" but can
never get it, and the deadlock occurs. The call stacks below show the
situation.

To fix this, make the ci_otg workqueue freezable as well. Its work
items then run only after driver resume, and the workqueue is no longer
blocked forever as in the case above, because the writeback workqueue
has been thawed by then.

Tested at: i.mx6qdl-sabresd and i.mx6sx-sdb.

[  555.178869] kworker/u2:13   D c07de74c     0   826      2 0x00000000
[  555.185310] Workqueue: ci_otg ci_otg_work
[  555.189353] Backtrace:
[  555.191849] [<c07de4fc>] (__schedule) from [<c07dec6c>] (schedule+0x48/0xa0)
[  555.198912]  r10:ee471ba0 r9:00000000 r8:00000000 r7:00000002 r6:ee470000 r5:ee471ba4
[  555.206867]  r4:ee470000
[  555.209453] [<c07dec24>] (schedule) from [<c07e2fc4>] (schedule_timeout+0x15c/0x1e0)
[  555.217212]  r4:7fffffff r3:edc2b000
[  555.220862] [<c07e2e68>] (schedule_timeout) from [<c07df6c8>] (wait_for_common+0x94/0x144)
[  555.229140]  r8:00000000 r7:00000002 r6:ee470000 r5:ee471ba4 r4:7fffffff
[  555.235980] [<c07df634>] (wait_for_common) from [<c07df790>] (wait_for_completion+0x18/0x1c)
[  555.244430]  r10:00000001 r9:c0b5563c r8:c0042e48 r7:ef086000 r6:eea4372c r5:ef131b00
[  555.252383]  r4:00000000
[  555.254970] [<c07df778>] (wait_for_completion) from [<c0043cb8>] (flush_work+0x19c/0x234)
[  555.263177] [<c0043b1c>] (flush_work) from [<c0043fac>] (flush_delayed_work+0x48/0x4c)
[  555.271106]  r8:ed5b5000 r7:c0b38a3c r6:eea439cc r5:eea4372c r4:eea4372c
[  555.277958] [<c0043f64>] (flush_delayed_work) from [<c00eae18>] (bdi_unregister+0x84/0xec)
[  555.286236]  r4:eea43520 r3:20000153
[  555.289885] [<c00ead94>] (bdi_unregister) from [<c02c2154>] (blk_cleanup_queue+0x180/0x29c)
[  555.298250]  r5:eea43808 r4:eea43400
[  555.301909] [<c02c1fd4>] (blk_cleanup_queue) from [<c0417914>] (__scsi_remove_device+0x48/0xb8)
[  555.310623]  r7:00000000 r6:20000153 r5:ededa950 r4:ededa800
[  555.316403] [<c04178cc>] (__scsi_remove_device) from [<c0415e90>] (scsi_forget_host+0x64/0x68)
[  555.325028]  r5:ededa800 r4:ed5b5000
[  555.328689] [<c0415e2c>] (scsi_forget_host) from [<c0409828>] (scsi_remove_host+0x78/0x104)
[  555.337054]  r5:ed5b5068 r4:ed5b5000
[  555.340709] [<c04097b0>] (scsi_remove_host) from [<c04cdfcc>] (usb_stor_disconnect+0x50/0xb4)
[  555.349247]  r6:ed5b56e4 r5:ed5b5818 r4:ed5b5690 r3:00000008
[  555.355025] [<c04cdf7c>] (usb_stor_disconnect) from [<c04b3bc8>] (usb_unbind_interface+0x78/0x25c)
[  555.363997]  r8:c13919b4 r7:edd3c000 r6:edd3c020 r5:ee551c68 r4:ee551c00 r3:c04cdf7c
[  555.371892] [<c04b3b50>] (usb_unbind_interface) from [<c03dc248>] (__device_release_driver+0x8c/0x118)
[  555.381213]  r10:00000001 r9:edd90c00 r8:c13919b4 r7:ee551c68 r6:c0b546e0 r5:c0b5563c
[  555.389167]  r4:edd3c020
[  555.391752] [<c03dc1bc>] (__device_release_driver) from [<c03dc2fc>] (device_release_driver+0x28/0x34)
[  555.401071]  r5:edd3c020 r4:edd3c054
[  555.404721] [<c03dc2d4>] (device_release_driver) from [<c03db304>] (bus_remove_device+0xe0/0x110)
[  555.413607]  r5:edd3c020 r4:ef17f04c
[  555.417253] [<c03db224>] (bus_remove_device) from [<c03d8128>] (device_del+0x114/0x21c)
[  555.425270]  r6:edd3c028 r5:edd3c020 r4:ee551c00 r3:00000000
[  555.431045] [<c03d8014>] (device_del) from [<c04b1560>] (usb_disable_device+0xa4/0x1e8)
[  555.439061]  r8:edd3c000 r7:eded8000 r6:00000000 r5:00000001 r4:ee551c00
[  555.445906] [<c04b14bc>] (usb_disable_device) from [<c04a8e54>] (usb_disconnect+0x74/0x224)
[  555.454271]  r9:edd90c00 r8:ee551000 r7:ee551c68 r6:ee551c9c r5:ee551c00 r4:00000001
[  555.462156] [<c04a8de0>] (usb_disconnect) from [<c04a8fb8>] (usb_disconnect+0x1d8/0x224)
[  555.470259]  r10:00000001 r9:edd90000 r8:ee471e2c r7:ee551468 r6:ee55149c r5:ee551400
[  555.478213]  r4:00000001
[  555.480797] [<c04a8de0>] (usb_disconnect) from [<c04ae5ec>] (usb_remove_hcd+0xa0/0x1ac)
[  555.488813]  r10:00000001 r9:ee471eb0 r8:00000000 r7:ef3d9500 r6:eded810c r5:eded80b0
[  555.496765]  r4:eded8000
[  555.499351] [<c04ae54c>] (usb_remove_hcd) from [<c04d4158>] (host_stop+0x28/0x64)
[  555.506847]  r6:eeb50010 r5:eded8000 r4:eeb51010
[  555.511563] [<c04d4130>] (host_stop) from [<c04d09b8>] (ci_otg_work+0xc4/0x124)
[  555.518885]  r6:00000001 r5:eeb50010 r4:eeb502a0 r3:c04d4130
[  555.524665] [<c04d08f4>] (ci_otg_work) from [<c00454f0>] (process_one_work+0x194/0x420)
[  555.532682]  r6:ef086000 r5:eeb502a0 r4:edc44480
[  555.537393] [<c004535c>] (process_one_work) from [<c00457b0>] (worker_thread+0x34/0x514)
[  555.545496]  r10:edc44480 r9:ef086000 r8:c0b1a100 r7:ef086034 r6:00000088 r5:edc44498
[  555.553450]  r4:ef086000
[  555.556032] [<c004577c>] (worker_thread) from [<c004bab4>] (kthread+0xdc/0xf8)
[  555.563268]  r10:00000000 r9:00000000 r8:00000000 r7:c004577c r6:edc44480 r5:eddc15c0
[  555.571221]  r4:00000000
[  555.573804] [<c004b9d8>] (kthread) from [<c000fef0>] (ret_from_fork+0x14/0x24)
[  555.581040]  r7:00000000 r6:00000000 r5:c004b9d8 r4:eddc15c0

[  553.429383] sh              D c07de74c     0   694    691 0x00000000
[  553.435801] Backtrace:
[  553.438295] [<c07de4fc>] (__schedule) from [<c07dec6c>] (schedule+0x48/0xa0)
[  553.445358]  r10:edd3c054 r9:edd3c078 r8:edddbd50 r7:edcbbc00 r6:c1377c34 r5:60000153
[  553.453313]  r4:eddda000
[  553.455896] [<c07dec24>] (schedule) from [<c07deff8>] (schedule_preempt_disabled+0x10/0x14)
[  553.464261]  r4:edd3c058 r3:0000000a
[  553.467910] [<c07defe8>] (schedule_preempt_disabled) from [<c07e0bbc>] (mutex_lock_nested+0x1a0/0x3e8)
[  553.477254] [<c07e0a1c>] (mutex_lock_nested) from [<c03e927c>] (dpm_complete+0xc0/0x1b0)
[  553.485358]  r10:00561408 r9:edd3c054 r8:c0b4863c r7:edddbd90 r6:c0b485d8 r5:edd3c020
[  553.493313]  r4:edd3c0d0
[  553.495896] [<c03e91bc>] (dpm_complete) from [<c03e9388>] (dpm_resume_end+0x1c/0x20)
[  553.503652]  r9:00000000 r8:c0b1a9d0 r7:c1334ec0 r6:c1334edc r5:00000003 r4:00000010
[  553.511544] [<c03e936c>] (dpm_resume_end) from [<c0079894>] (suspend_devices_and_enter+0x158/0x504)
[  553.520604]  r4:00000000 r3:c1334efc
[  553.524250] [<c007973c>] (suspend_devices_and_enter) from [<c0079e74>] (pm_suspend+0x234/0x2cc)
[  553.532961]  r10:00561408 r9:ed6b7300 r8:00000004 r7:c1334eec r6:00000000 r5:c1334ee8
[  553.540914]  r4:00000003
[  553.543493] [<c0079c40>] (pm_suspend) from [<c0078a6c>] (state_store+0x6c/0xc0)

[  555.703684] 7 locks held by kworker/u2:13/826:
[  555.708140]  #0:  ("%s""ci_otg"){++++.+}, at: [<c0045484>] process_one_work+0x128/0x420
[  555.716277]  #1:  ((&ci->work)){+.+.+.}, at: [<c0045484>] process_one_work+0x128/0x420
[  555.724317]  #2:  (usb_bus_list_lock){+.+.+.}, at: [<c04ae5e4>] usb_remove_hcd+0x98/0x1ac
[  555.732626]  #3:  (&dev->mutex){......}, at: [<c04a8e28>] usb_disconnect+0x48/0x224
[  555.740403]  #4:  (&dev->mutex){......}, at: [<c04a8e28>] usb_disconnect+0x48/0x224
[  555.748179]  #5:  (&dev->mutex){......}, at: [<c03dc2f4>] device_release_driver+0x20/0x34
[  555.756487]  #6:  (&shost->scan_mutex){+.+.+.}, at: [<c04097d0>] scsi_remove_host+0x20/0x104

Cc: <stable@vger.kernel.org> #v3.14+
Cc: Jun Li <jun.li@nxp.com>
Signed-off-by: Peter Chen <peter.chen@nxp.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue Oct 8, 2016
[ Upstream commit 346c09f ]

The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
[  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debug it became clear that stalled request is always inserted in
the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, blk_mq_insert_requests() offloads execution of
__blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it calls
kblockd_schedule_delayed_work_on().

That means we race with another CPU, which is about to execute the
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  request A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  <schedule to kblockd workqueue>
                                         request B inserted
                                         STORE hctx->ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result, request B stays pending forever.

This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
AndropaX pushed a commit to AndropaX/android_kernel_xiaomi_msm8992 that referenced this issue Oct 25, 2016
We will encounter an oops when executing the command below.
getfattr -n system.advise /mnt/f2fs/file
Killed

message log:
BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [<f8b54d69>] f2fs_xattr_advise_get+0x29/0x40 [f2fs]
*pdpt = 00000000319b7001 *pde = 0000000000000000
Oops: 0002 [#1] SMP
Modules linked in: f2fs(O) snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev
snd_seq_device snd_timer bnep snd rfcomm microcode bluetooth soundcore i2c_piix4 mac_hid serio_raw parport_pc ppdev lp parport
binfmt_misc hid_generic psmouse usbhid hid e1000 [last unloaded: f2fs]
CPU: 3 PID: 3134 Comm: getfattr Tainted: G           O    4.0.0-rc1 #6
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: f3a71b60 ti: f19a6000 task.ti: f19a6000
EIP: 0060:[<f8b54d69>] EFLAGS: 00010246 CPU: 3
EIP is at f2fs_xattr_advise_get+0x29/0x40 [f2fs]
EAX: 00000000 EBX: f19a7e71 ECX: 00000000 EDX: f8b5b467
ESI: 00000000 EDI: f2008570 EBP: f19a7e14 ESP: f19a7e08
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: 00000000 CR3: 319b8000 CR4: 000007f0
Stack:
 f8b5a634 c0cbb580 00000000 f19a7e34 c1193850 00000000 00000007 f19a7e71
 f19a7e64 c0cbb580 c1193810 f19a7e50 c1193c00 00000000 00000000 00000000
 c0cbb580 00000000 f19a7f70 c1194097 00000000 00000000 00000000 74737973
Call Trace:
 [<c1193850>] generic_getxattr+0x40/0x50
 [<c1193810>] ? xattr_resolve_name+0x80/0x80
 [<c1193c00>] vfs_getxattr+0x70/0xa0
 [<c1194097>] getxattr+0x87/0x190
 [<c11801d7>] ? path_lookupat+0x57/0x5f0
 [<c11819d2>] ? putname+0x32/0x50
 [<c116653a>] ? kmem_cache_alloc+0x2a/0x130
 [<c11819d2>] ? putname+0x32/0x50
 [<c11819d2>] ? putname+0x32/0x50
 [<c11819d2>] ? putname+0x32/0x50
 [<c11827f9>] ? user_path_at_empty+0x49/0x70
 [<c118283f>] ? user_path_at+0x1f/0x30
 [<c11941e7>] path_getxattr+0x47/0x80
 [<c11948e7>] SyS_getxattr+0x27/0x30
 [<c163f748>] sysenter_do_call+0x12/0x12
Code: 66 90 55 89 e5 57 56 53 66 66 66 66 90 8b 78 20 89 d3 ba 67 b4 b5 f8 89 d8 89 ce e8 42 7c 7b c8 85 c0 75 16 0f b6 87 44 01 00
00 <88> 06 b8 01 00 00 00 5b 5e 5f 5d c3 8d 76 00 b8 ea ff ff ff eb
EIP: [<f8b54d69>] f2fs_xattr_advise_get+0x29/0x40 [f2fs] SS:ESP 0068:f19a7e08
CR2: 0000000000000000
---[ end trace 860260654f1f416a ]---

The reason is that getfattr performs two steps, as the strace output shows:
1) look up the specified xattr and get its size.
2) get the value of the extended attribute.

strace info:
getxattr("/mnt/f2fs/file", "system.advise", 0x0, 0) = 1
getxattr("/mnt/f2fs/file", "system.advise", "\x00", 256) = 1

For the first step, getfattr may pass a NULL pointer in @value and zero in @size
as parameters for ->getxattr, but f2fs_xattr_advise_get accesses the @value
pointer directly without checking whether it is valid, so the oops occurs.

This patch fixes the issue by verifying the @value pointer before using it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
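
The two-step getxattr convention described above (a size probe with a NULL buffer, then the real read) can be sketched in user-space C. This is a minimal sketch, not the f2fs code: `struct fake_inode` and `advise_get` are hypothetical names, and the only point is that the handler checks the buffer before storing to it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the inode state holding the one-byte advise value. */
struct fake_inode {
    unsigned char advise;
};

/*
 * Sketch of a getxattr-style handler.
 * A NULL buffer is a size probe: report how many bytes the value needs
 * without writing anything. A non-NULL buffer receives the value.
 */
long advise_get(const struct fake_inode *inode, void *buffer, size_t size)
{
    assert(inode != NULL);
    (void)size; /* illustrative only; the size is not validated in this sketch */

    if (buffer)
        *(unsigned char *)buffer = inode->advise;

    return (long)sizeof(unsigned char);
}
```

Calling `advise_get(&inode, NULL, 0)` mirrors the first getxattr call in the strace output, and the call with a real buffer mirrors the second one.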
AndropaX pushed a commit to AndropaX/android_kernel_xiaomi_msm8992 that referenced this issue Dec 16, 2016
commit 1c2cb59 upstream.

The EPOW interrupt handler uses rtas_get_sensor(), which in turn
uses rtas_busy_delay() to wait for RTAS to become ready if
necessary. But rtas_busy_delay() is annotated with might_sleep()
and thus may not be used by interrupt handlers like the EPOW handler!
This leads to the following BUG when CONFIG_DEBUG_ATOMIC_SLEEP is
enabled:

 BUG: sleeping function called from invalid context at arch/powerpc/kernel/rtas.c:496
 in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc2-thuth #6
 Call Trace:
 [c00000007ffe7b90] [c000000000807670] dump_stack+0xa0/0xdc (unreliable)
 [c00000007ffe7bc0] [c0000000000e1f14] ___might_sleep+0x134/0x180
 [c00000007ffe7c20] [c00000000002aec0] rtas_busy_delay+0x30/0xd0
 [c00000007ffe7c50] [c00000000002bde4] rtas_get_sensor+0x74/0xe0
 [c00000007ffe7ce0] [c000000000083264] ras_epow_interrupt+0x44/0x450
 [c00000007ffe7d90] [c000000000120260] handle_irq_event_percpu+0xa0/0x300
 [c00000007ffe7e70] [c000000000120524] handle_irq_event+0x64/0xc0
 [c00000007ffe7eb0] [c000000000124dbc] handle_fasteoi_irq+0xec/0x260
 [c00000007ffe7ef0] [c00000000011f4f0] generic_handle_irq+0x50/0x80
 [c00000007ffe7f20] [c000000000010f3c] __do_irq+0x8c/0x200
 [c00000007ffe7f90] [c0000000000236cc] call_do_irq+0x14/0x24
 [c00000007e6f39e0] [c000000000011144] do_IRQ+0x94/0x110
 [c00000007e6f3a30] [c000000000002594] hardware_interrupt_common+0x114/0x180

Fix this issue by introducing a new rtas_get_sensor_fast() function
that does not use rtas_busy_delay() - and thus can only be used for
sensors that do not cause a BUSY condition - known as "fast" sensors.

The EPOW sensor is defined to be "fast" in sPAPR - mpe.

Fixes: 587f83e ("powerpc/pseries: Use rtas_get_sensor in RAS code")
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
AndropaX pushed a commit to AndropaX/android_kernel_xiaomi_msm8992 that referenced this issue Dec 16, 2016
commit 346c09f upstream.

A bug in the workqueue code leads to a stalled IO request in the MQ ctx->rq_list,
with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
[  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debugging it became clear that the stalled request is always inserted in
the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, blk_mq_insert_requests() offloads execution of
__blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it calls
kblockd_schedule_delayed_work_on().

That means that we race with another CPU, which is about to execute the
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  request A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  <schedule to kblockd workqueue>
                                         request B inserted
                                         STORE hctx->ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result, request B stays pending forever.

This behaviour can be explained by a speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with the clear of the PENDING bit and executed
_before_ the actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier <mfence>, which guarantees
that the clear of the PENDING bit is executed before all possible
speculative LOADs or STOREs inside the actual work function.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
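
The ordering requirement in the commit message above can be sketched with C11 atomics in user space. This is only an analogue under stated assumptions, not the kernel's workqueue code: `ctx_map`, `work_pending`, `insert_request` and `run_work` are hypothetical names, and `atomic_thread_fence(memory_order_seq_cst)` stands in for the full barrier. The worker clears the pending flag, issues the fence, and only then reads the map, so either the worker sees a concurrently set bit or the producer sees the flag cleared and schedules the work again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint ctx_map;      /* one bit per software queue holding requests */
static atomic_bool work_pending; /* stand-in for the workqueue PENDING bit */

/*
 * Producer side: mark the queue busy, then try to schedule the work.
 * Returns true if this call scheduled the work, false if it was already
 * pending (the "returns 1" / "returns 0" cases in the trace above).
 */
bool insert_request(unsigned int ctx)
{
    assert(ctx < 32);
    atomic_fetch_or(&ctx_map, 1u << ctx);          /* sequentially consistent */
    return !atomic_exchange(&work_pending, true);  /* sequentially consistent */
}

/*
 * Worker side: clear PENDING, then collect the busy queues.
 * Without the fence, the load of ctx_map could be satisfied before the
 * clear of work_pending becomes visible, which is the lost-request race.
 */
unsigned int run_work(void)
{
    atomic_store_explicit(&work_pending, false, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);     /* the full barrier */
    return atomic_exchange_explicit(&ctx_map, 0u, memory_order_relaxed);
}
```

Run single-threaded, the sequence from the trace (insert A on queue 0, insert B on queue 1, then the work) makes `run_work()` return both bits; the fence matters only when the two sides run on different CPUs.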
AndropaX pushed a commit to AndropaX/android_kernel_xiaomi_msm8992 that referenced this issue Mar 10, 2017
Squashed commit of the following:

commit 869d61cda160f0e824032c84aa5ac041639f5e24
Author: Scott Mertz <scott@cyngn.com>
Date:   Fri Jun 10 10:24:28 2016 -0700

    BACKPORT: f2fs: add a max block check for get_data_block_bmap

    (cherry pick from commit 179448bfe4cd201e98e728391c6b01b25c849fe8)

    This patch adds a max block check for get_data_block_bmap.

    The Trinity test program passes a block number as a parameter to
    ioctl_fibmap, which is then used in get_node_path(). When the block
    number is larger than the f2fs max blocks, it triggers a kernel bug.

    Signed-off-by: Yunlei He <heyunlei@huawei.com>
    Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
    [Jaegeuk Kim: fix missing condition, pointed by Chao Yu]
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

    Change-Id: Ia3d24f23735c73bf1dc2c885512afcc393d2ba25

commit d162ca69fefad82d965f8a9335c1c546a82ff9ea
Author: Keith Mok <kmok@cyngn.com>
Date:   Mon Feb 29 14:54:35 2016 -0800

    f2fs: Use crypto crc32 functions

    The crc function works bit by bit and is
    painfully slow; switch to the crypto
    crc32 function, which is backed by h/w acceleration.

    Change-Id: I653b0d11d06db5aaae181fef15e67840d29edbca

commit 546d80887c268311b75cc5b56d359cb6c9d42fb5
Author: Keith Mok <kmok@cyngn.com>
Date:   Mon Jan 18 14:19:37 2016 -0800

    f2fs: Backport v4.4-rc8

    Fix f2fs to make it build for 3.10

    Change-Id: I38fbd1dfcdfd4293d93ceb54a45ba06a2793c8b9

commit 23348e15b5315a11949f7f95d5cf0bc1c3ea4e54
Author: Keith Mok <kmok@cyngn.com>
Date:   Mon Jan 18 13:39:41 2016 -0800

    f2fs: catch up to v4.4-rc8

    The last patch is:

    commit 5d2eb548b309be34ecf3b91f0b7300a2b9d09b8c
    Merge: 2870f6c 29608d2
    Author: Linus Torvalds <torvalds@linux-foundation.org>
    Date:   Fri Nov 13 18:02:30 2015 -0800

        Merge branch 'for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

        Pull vfs xattr cleanups from Al Viro.

        * 'for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
          f2fs: xattr simplifications
          squashfs: xattr simplifications
          9p: xattr simplifications
          xattr handlers: Pass handler to operations instead of flags
          jffs2: Add missing capability check for listing trusted xattrs
          hfsplus: Remove unused xattr handler list operations
          ubifs: Remove unused security xattr handler
          vfs: Fix the posix_acl_xattr_list return value
          vfs: Check attribute names in posix acl xattr handers

    Change-Id: I91363c68f2d4f1b0a8228bbbc2b8dcf9e2d93137

commit a66078e9dad2c760dfea9232d5a1cb77db0a2065
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu May 28 18:19:17 2015 -0700

    f2fs: fix a deadlock for summary page lock vs. sentry_lock

    In f2fs_gc:                      In f2fs_replace_block:
     - lock_page(sum_page)
      - check_valid_map()            - mutex_lock(sentry_lock)
       - mutex_lock(sentry_lock)     - change_curseg()
                                      - lock_page(sum_page)

    This patch fixes the deadlock condition.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 116ad121a93688ec612ec027ae11c982109c5269
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu May 28 17:06:40 2015 -0700

    f2fs crypto: clean up error handling in f2fs_fname_setup_filename

    Sync with:
      ext4 crypto: clean up error handling in ext4_fname_setup_filename

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit b74cd9ec13d46b69dc133c4c55893e8a4ed17f54
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sat May 30 04:00:15 2015 +0300

    f2fs: support 3.10

    Conflicts:
    	fs/f2fs/data.c
    	fs/f2fs/file.c
    	fs/f2fs/namei.c

    Change-Id: I2540749a14e6c9f1788a09a06afd586b691f0edd
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit bbc554deecde3c026d45b0f1d8b440867f296ec4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed May 27 19:51:42 2015 -0700

    f2fs crypto: avoid f2fs_inherit_context for symlink

    This patch avoids calling f2fs_inherit_context twice for a newly created symlink.
    The original call is made by f2fs_add_link(), which invokes f2fs_setxattr.
    If it is called a second time, f2fs_setxattr is triggered again with the same
    encryption index.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 7b5d4613056bcae3273688c7387e296bc85dfd2f
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed May 20 19:12:30 2015 -0700

    f2fs crypto: introduce a mempool for bounce pages

    If a lot of write streams are triggered, alloc_page and __free_page calls
    become costly, resulting in high memory pressure.

    To avoid that, this patch introduces an additional mempool for writeback pages.
    Note that the existing mempool is used for emergency purposes.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9748515b5cda4e36a5909a220e9b004c98280d6a
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Wed May 27 15:27:49 2015 +0800

    f2fs crypto: do not set encryption policy for non-directory by ioctl

    Encryption policy should only be set on an empty directory through ioctl.
    This patch adds a condition that verifies the type of the target inode
    to avoid incorrectly configuring a non-directory.

    Additionally, remove the unneeded inline data conversion since regular or symlink
    files should not be processed here.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 3d0c5ba34a3b827b1ed6fb7d4df0da6f11bd7251
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon May 25 18:09:03 2015 +0800

    f2fs crypto: allow setting encryption policy once

    This patch adds the XATTR_CREATE flag in setxattr when setting the encryption
    context for an inode. Without this flag the context could be set more than
    once, which should never happen. So, fix it.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 02bfdd3df8279d4a31431cede82c6eb8da852a2c
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon May 25 18:07:02 2015 +0800

    f2fs crypto: check context consistent for rename2

    For an exchange rename, we should check the consistency of the encryption
    context between new_dir and old_inode, or old_dir and new_inode. Otherwise
    inheritance of the parent's encryption context will be broken.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e829d7e3bc102879bec9b755906a60742cf9d930
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon May 25 18:03:38 2015 +0800

    f2fs: avoid duplicated code by reusing f2fs_read_end_io

    This patch cleans up the code: parts of f2fs_read_end_io
    and mpage_end_io are the same, so it's better to merge and reuse them.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit a318ba8662f5a41b95e751fe617a7ae75a92b1d6
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 19 22:26:54 2015 -0700

    f2fs crypto: use per-inode tfm structure

    This patch applies the following ext4 patch:

      ext4 crypto: use per-inode tfm structure

    As suggested by Herbert Xu, we shouldn't allocate a new tfm each time
    we read or write a page.  Instead we can use a single tfm hanging off
    the inode's crypt_info structure for all of our encryption needs for
    that inode, since the tfm can be used by multiple crypto requests in
    parallel.

    Also use cmpxchg() to avoid races that could result in crypt_info
    structure getting doubly allocated or doubly freed.

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
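
The cmpxchg() idea mentioned above (install a per-inode structure exactly once, even if two callers race to allocate it) can be sketched with C11 atomics. This is a user-space analogue under stated assumptions: `struct crypt_info`, `struct inode_like` and `get_crypt_info` are hypothetical names, and `atomic_compare_exchange_strong` stands in for the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical per-inode crypto state. */
struct crypt_info {
    int key_id;
};

/* Hypothetical inode with a lazily installed crypt_info pointer. */
struct inode_like {
    _Atomic(struct crypt_info *) ci;
};

/*
 * Returns the inode's crypt_info, allocating it on first use.
 * If two callers race, exactly one allocation is installed; the loser
 * frees its own copy and returns the winner's, so nothing is allocated
 * twice or freed twice.
 */
struct crypt_info *get_crypt_info(struct inode_like *inode, int key_id)
{
    assert(inode != NULL);

    struct crypt_info *cur = atomic_load(&inode->ci);
    if (cur)
        return cur;

    struct crypt_info *fresh = malloc(sizeof(*fresh));
    if (!fresh)
        return NULL;
    fresh->key_id = key_id;

    struct crypt_info *expected = NULL;
    if (atomic_compare_exchange_strong(&inode->ci, &expected, fresh))
        return fresh;   /* we installed our allocation */

    free(fresh);        /* someone else won the race */
    return expected;    /* holds the pointer that was installed first */
}
```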

commit 11125bb53eb78f8d346e22995d990ab5cccbd9d3
Author: hujianyang <hujianyang@huawei.com>
Date:   Thu May 21 14:42:53 2015 +0800

    f2fs: recovering broken superblock during mount

    This patch recovers a broken superblock with the other valid one.

    Signed-off-by: hujianyang <hujianyang@huawei.com>
    [Jaegeuk Kim: reinitialize local variables in f2fs_fill_super for retrial]
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 0581fe2dd6c59599aaed98f970a649c2c4ee6a2d
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 19 16:11:40 2015 -0700

    f2fs crypto: check encryption for tmpfile

    This patch adds an encryption check for tmpfile at an early stage.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9779b9015d454a8a23f329c315e0582e0b01796c
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sat May 30 03:57:24 2015 +0300

    f2fs: support RENAME_WHITEOUT

    As described in the rename manual page, RENAME_WHITEOUT is a special operation
    that only makes sense for overlay/union type filesystems.

    When performing a rename with RENAME_WHITEOUT, dst will be replaced with src, and
    meanwhile a 'whiteout' will be created with the name of src.

    A "whiteout" is designed to be a char device with a 0,0 device number. It has
    special meaning for stackable filesystems. In these filesystems there are
    multiple layers, and only the top one can be modified. So a whiteout
    in the top layer is used to hide a corresponding file in a lower layer, and
    removing the whiteout makes the file appear again.

    Now in overlayfs, when we rename a file which exists in the lower layer, it
    will be copied up to the upper layer if it is not there yet, and then renamed
    on the upper layer. At the same time the source file is whited out to hide the
    corresponding file in the lower layer.

    So in the upper layer filesystem, an implementation of RENAME_WHITEOUT provides an
    atomic operation for stackable filesystems to support the rename operation.

    There are multiple ways to implement RENAME_WHITEOUT, described in the log of this commit:
    7dcf5c3e4527 ("xfs: add RENAME_WHITEOUT support"), which was pointed out by
    Dave Chinner.

    For now, we just try to follow the way that xfs/ext4 use.

    Conflicts:
    	fs/f2fs/namei.c

    Change-Id: Iac408db95a4133ac04c7d612ae06771afe238349
    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 832a7efd1724c4d896e64ab35df66514c27b7f15
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Tue May 19 17:40:04 2015 +0800

    f2fs: introduce update_meta_page

    Add a helper function update_meta_page() to update a meta page with the specified
    buffer.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit a3c1fc3eba871337cebf74f6ed8d3a087e22e0b3
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon May 18 18:00:06 2015 +0800

    f2fs crypto: zero next free dnode block

    The page cache of the meta inode is now used by garbage collection for encrypted pages,
    so it may contain random data; we should zero it before issuing discard.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 02a47222f7b5ca017e6e5a8728502c06bf3cf42f
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri May 15 15:37:24 2015 -0700

    f2fs crypto: split f2fs_crypto_init/exit with two parts

    This patch splits f2fs_crypto_init/exit into two parts: base initialization and
    memory allocation.

    First, the f2fs module declares the base encryption memory pointers.
    Then, internal memory is allocated at the first encrypted inode access.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 41640159159e81a9549554fc7a4eef1f8b7d793d
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Fri May 15 11:14:34 2015 +0800

    f2fs crypto: fix incorrect release for crypto ctx

    When the encryption feature is enabled, if we rmmod the f2fs module,
    we will encounter a stack backtrace reported in syslog:

    "BUG: Bad page state in process rmmod  pfn:aaf8a
    page:f0f4f148 count:0 mapcount:129 mapping:ee2f4104 index:0x80
    flags: 0xee2830a4(referenced|lru|slab|private_2|writeback|swapbacked|mlocked)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags:
    flags: 0x2030a0(lru|slab|private_2|writeback|mlocked)
    Modules linked in: f2fs(O-) fuse bnep rfcomm bluetooth dm_crypt binfmt_misc snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm
    snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device joydev ppdev mac_hid lp hid_generic i2c_piix4
    parport_pc psmouse snd serio_raw parport soundcore ext4 jbd2 mbcache usbhid hid e1000 [last unloaded: f2fs]
    CPU: 1 PID: 3049 Comm: rmmod Tainted: G    B      O    4.1.0-rc3+ #10
    Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    00000000 00000000 c0021eb4 c15b7518 f0f4f148 c0021ed8 c112e0b7 c1779174
    c9b75674 000aaf8a 01b13ce1 c17791a4 f0f4f148 ee2830a4 c0021ef8 c112e3c3
    00000000 f0f4f148 c0021f34 f0f4f148 ee2830a4 ef9f0000 c0021f20 c112fdf8
    Call Trace:
    [<c15b7518>] dump_stack+0x41/0x52
    [<c112e0b7>] bad_page.part.72+0xa7/0x100
    [<c112e3c3>] free_pages_prepare+0x213/0x220
    [<c112fdf8>] free_hot_cold_page+0x28/0x120
    [<c1073380>] ? try_to_wake_up+0x2b0/0x2b0
    [<c112ff15>] __free_pages+0x25/0x30
    [<c112c4fd>] mempool_free_pages+0xd/0x10
    [<c112c5f1>] mempool_free+0x31/0x90
    [<f0f441cf>] f2fs_exit_crypto+0x6f/0xf0 [f2fs]
    [<f0f456c4>] exit_f2fs_fs+0x23/0x95f [f2fs]
    [<c10c30e0>] SyS_delete_module+0x130/0x180
    [<c11556d6>] ? vm_munmap+0x46/0x60
    [<c15bd888>] sysenter_do_call+0x12/0x12"

    The reason is as follows.

    Since commit 0827e645fd35
    ("f2fs crypto: shrink size of the f2fs_crypto_ctx structure") was merged,
    some fields in the f2fs_crypto_ctx structure have been merged into a union, as they
    are never used simultaneously in the write path, read path or on the free list.

    In f2fs_exit_crypto, we traverse each crypto ctx on the free list. At this
    point only the free_list field in the union is valid, but we still try to
    release the memory pointed to by another, invalid field of the union
    for each ctx.

    That is how the error occurs; let's fix it with this patch.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
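
The union problem described above can be illustrated with a small user-space sketch. The names here (`struct crypto_ctx`, `ctx_release`, the `state` tag) are hypothetical and the explicit tag is an assumption made for illustration, not necessarily how the patch itself resolves the issue. The point is that a union member may only be interpreted in the state where it is valid: while a context sits on the free list, the storage holds list linkage, not a page pointer.

```c
#include <assert.h>
#include <stdlib.h>

enum ctx_state { CTX_ON_FREE_LIST, CTX_IN_WRITE_PATH };

struct crypto_ctx {
    enum ctx_state state;
    union {
        struct crypto_ctx *next_free; /* valid only while on the free list */
        void *bounce_page;            /* valid only in the write path */
    } u;
};

/*
 * Releases whatever the context owns and returns the number of pages freed.
 * On the free list the union holds list linkage, so it must not be
 * treated as a page pointer and freed.
 */
int ctx_release(struct crypto_ctx *ctx)
{
    assert(ctx != NULL);

    if (ctx->state == CTX_IN_WRITE_PATH && ctx->u.bounce_page) {
        free(ctx->u.bounce_page);
        ctx->u.bounce_page = NULL;
        return 1;
    }
    return 0;
}
```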

commit 402d19ed0be8fb13cd82092aa3902a0d27015551
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Wed May 13 18:20:54 2015 +0800

    f2fs crypto: fix to release buffer for fname crypto

    This patch fixes a memory leak in the error path of f2fs_fname_setup_filename().

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit f9f4324ba8e1da3994eb31f44c81e719c4fc49eb
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 12 13:40:20 2015 -0700

    f2fs crypto: shrink size of the f2fs_crypto_ctx structure

    This patch integrates the below patch into f2fs.

    "ext4 crypto: shrink size of the ext4_crypto_ctx structure

    Some fields are only used when the crypto_ctx is being used on the
    read path, some are only used on the write path, and some are only
    used when the structure is on free list.  Optimize memory use by using
    a union."

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 695330214abbf0ea33ac04d1e10df79769871e72
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 12 13:33:00 2015 -0700

    f2fs crypto: get rid of ci_mode from struct f2fs_crypt_info

    This patch integrates the below patch into f2fs.

    "ext4 crypto: get rid of ci_mode from struct ext4_crypt_info

    The ci_mode field was superfluous, and getting rid of it gets rid of
    an unused hole in the structure."

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 8bc541a35a5bf57387007ec523fb4caf8e11eb5f
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 12 13:26:54 2015 -0700

    f2fs crypto: use slab caches

    This patch integrates the below patch into f2fs.

    "ext4 crypto: use slab caches

    Use slab caches for the ext4_crypto_ctx and ext4_crypt_info structures for
    slightly better memory efficiency and debuggability."

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 211ffb46af5bdb0ccd4f7933550f55fa3832d7c7
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed May 13 14:35:14 2015 -0700

    f2fs: truncate data blocks for orphan inode

    As Hu reported, F2FS has a space leak problem, when conducting:

    1) format a 4GB f2fs partition
    2) dd a 3G file,
    3) unlink it.

    So, when doing f2fs_drop_inode(), we need to truncate data blocks
    before skipping it.
    We can also drop unused caches assigned to each inode.

    Reported-by: hujianyang <hujianyang@huawei.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 3bf4f596420b436915247c3ed406e1b56609091c
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Thu May 14 11:52:28 2015 +0300

    f2fs: cleanup a confusing indent

    The return was not indented far enough so it looked like it was supposed
    to go with the other if statement.

    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ad49d0578fd113dcde01f073cae05c5e97690cbd
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Wed May 13 22:49:58 2015 +0200

    f2fs: fix building on 32-bit architectures

    A bug fix to the debug output extended the type of some local
    variables to 64-bit, which now causes the kernel to fail building
    because of missing 64-bit division functions:

    ERROR: "__aeabi_uldivmod" [fs/f2fs/f2fs.ko] undefined!

    In the kernel, we have to use div_u64 or do_div to do this,
    in order to annotate that this is an expensive operation.

    As the function is only called for debug out, we know this
    is not performance critical, so it is safe to use div_u64.

    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Fixes: d1f85bd38db19 ("f2fs: avoid value overflow in showing current status")
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
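
To show the shape of the helper mentioned above, here is a user-space stand-in. In the kernel, div_u64() takes a 64-bit dividend and a 32-bit divisor, which avoids the compiler emitting a call to a 64-bit division routine such as __aeabi_uldivmod on 32-bit targets. The version below is only an illustration that mirrors that signature with a plain division, and `utilization_percent` is a hypothetical debug statistic, not a function from f2fs.

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-in mirroring the kernel helper's shape: u64 / u32 -> u64. */
static inline uint64_t div_u64(uint64_t dividend, uint32_t divisor)
{
    return dividend / divisor;
}

/* Hypothetical debug statistic: valid blocks as a percentage of the total. */
uint64_t utilization_percent(uint64_t valid_blocks, uint32_t total_blocks)
{
    assert(total_blocks != 0);
    return div_u64(valid_blocks * 100, total_blocks);
}
```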

commit 5d390f29c5afd9f78c819da8297d8e4ed8b05a0e
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon May 18 11:45:15 2015 -0700

    f2fs: avoid buggy functions

    This patch avoids using a buggy function for now.
    It needs to be fixed later.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 26320f96a8cc3918d48527b69167d857899e4efa
Author: hujianyang <hujianyang@huawei.com>
Date:   Tue May 12 16:05:57 2015 +0800

    f2fs: add compat_ioctl to provide backward compatability

    compat_ioctl was introduced for regular files, but this
    functionality was not added to f2fs_dir_operations.

    While running a 32-bit busybox, I met an error like this:
    (A is a directory)

    chattr: reading flags on A: Inappropriate ioctl for device

    This patch copies compat_ioctl from f2fs_file_operations and
    fixes this problem.

    Signed-off-by: hujianyang <hujianyang@huawei.com>
    Reviewed-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9c7dc1f03c8b0cc74de9cb73349c340f07ea6be3
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon May 11 20:03:49 2015 -0700

    f2fs: do not issue next dnode discard redundantly

    We have a discard map, so that we can avoid redundant discard issues.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit f62c1b25a0ba7fd699e397b01d117ce4d77c2dbb
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon May 11 20:02:14 2015 -0700

    f2fs: disable the discard option when device does not support

    This patch disables given discard option when device does not support it.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 0092b3dc467ad4e1ecf6b732a0aa9352f94d7600
Author: Yunlei He <heyunlei@huawei.com>
Date:   Thu May 7 18:11:37 2015 +0800

    f2fs: add default mount options to remount

    I use an f2fs filesystem for the /data partition on my Android phone
    with the default mount options. When I remount /data in order to
    add the discard option to run some benchmarks, I find that default
    options such as background_gc, user_xattr and acl are turned off.

    So I introduce a function named default_options in super.c. It applies
    the default settings, and both mount and remount operations
    call this function to set the defaults.

    Signed-off-by: Yunlei He <heyunlei@huawei.com>
    Reviewed-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 16fa79a4f39e3250c1282519ca7f97275081a3a1
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed May 6 18:23:21 2015 -0700

    f2fs crypto: remove checking key context during lookup

    Whether the key is valid or not, readdir shows the dir entries correctly.
    So, lookup should not fail.
    But we expect further accesses to be denied by open, rename, link, and so
    on.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit dc70a8a153167920018e6f603bde29efee225946
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 5 20:20:29 2015 -0700

    f2fs crypto: fix missing key when reading a page

    1. mount $mnt
    2. cp data $mnt/
    3. umount $mnt
    4. log out
    5. log in
    6. cat $mnt/data

    -> panic, due to no i_crypt_info.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e0a1ff90c374c3a19822d84a2888049723b691a6
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 29 15:10:53 2015 -0700

    f2fs crypto: add symlink encryption

    This patch implements encryption support for symlink.

    Signed-off-by: Uday Savagaonkar <savagaon@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e6422e991e35f6db0de71b1ed56156693c954a38
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 29 17:02:18 2015 -0700

    f2fs crypto: add filename encryption for roll-forward recovery

    This patch adds a bit flag to indicate whether or not i_name in the inode
    is encrypted.

    If this name is encrypted, we can't do recover_dentry during roll-forward.
    So, f2fs_sync_file() needs to do checkpoint, if this will be needed in future.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 0c2ecd76ef04b9331b3e1af6a84ea610ff88f395
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 27 17:12:39 2015 -0700

    f2fs crypto: add filename encryption for f2fs_lookup

    This patch implements filename encryption support for f2fs_lookup.

    Note that, f2fs_find_entry should be outside of f2fs_(un)lock_op().

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 0922ccb6bcc70e5499af179c5409f21bc5f8043e
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 27 16:26:24 2015 -0700

    f2fs crypto: add filename encryption for f2fs_readdir

    This patch implements filename encryption support for f2fs_readdir.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 5db97f6f11bf84975a5af993194729315b9773dd
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 27 14:51:02 2015 -0700

    f2fs crypto: add filename encryption for f2fs_add_link

    This patch adds filename encryption support for f2fs_add_link.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 37982bdc2790b8792ea0ecdd4c4e8134a2e0c57b
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 23 12:04:33 2015 -0700

    f2fs crypto: add encryption support in read/write paths

    This patch adds encryption support in read and write paths.

    Note that, in f2fs, we need to consider cleaning operation.
    In cleaning procedure, we must avoid encrypting and decrypting written blocks.
    So, this patch implements move_encrypted_block().

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d1825dd987f1b1271b0de66eac87e4926653fc6d
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sat May 30 03:55:32 2015 +0300

    f2fs crypto: activate encryption support for fs APIs

    This patch activates the following APIs for encryption support.

    The rules quoted by ext4 are:
     - An unencrypted directory may contain encrypted or unencrypted files
       or directories.
     - All files or directories in a directory must be protected using the
       same key as their containing directory.
     - Encrypted inode for regular file should not have inline_data.
     - Encrypted symlink and directory may have inline_data and inline_dentry.

    This patch activates the following APIs.
    1. f2fs_link              : validate context
    2. f2fs_lookup            :      ''
    3. f2fs_rename            :      ''
    4. f2fs_create/f2fs_mkdir : inherit its dir's context
    5. f2fs_direct_IO         : do buffered io for regular files
    6. f2fs_open              : check encryption info
    7. f2fs_file_mmap         :      ''
    8. f2fs_setattr           :      ''
    9. f2fs_file_write_iter   :      ''           (Called by sys_io_submit)
    10. f2fs_fallocate        : do not support fcollapse
    11. f2fs_evict_inode      : free_encryption_info

    Conflicts:
    	fs/f2fs/data.c

    Change-Id: I407f8edbb730451cc82186ddbd8a444acc2d5153
    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit c9057553829b1e4aec898b147c21fcbbdaf42a81
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sun Apr 26 00:12:50 2015 -0700

    f2fs crypto: filename encryption facilities

    This patch adds filename encryption infra.
    Most of the code is copied from the ext4 part, but adapted to the f2fs
    directory structure.

    Signed-off-by: Uday Savagaonkar <savagaon@google.com>
    Signed-off-by: Ildar Muslukhov <ildarm@google.com>
    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e4849088ad5692a4e87f5e1944ee8d95553d6af4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Apr 21 16:23:47 2015 -0700

    f2fs crypto: add encryption key management facilities

    This patch copies from encrypt_key.c in ext4, and modifies for f2fs.

    Use GFP_NOFS, since _f2fs_get_encryption_info is called under f2fs_lock_op.

    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 7a6b88d7b76d0ec33a1a9b3b4652a701e04ab8d7
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 20 19:52:47 2015 -0700

    f2fs crypto: add f2fs encryption facilities

    Most parts were copied from ext4, except:

     - add f2fs_restore_and_release_control_page which returns control page and
       restore control page
     - remove ext4_encrypted_zeroout()
     - remove sbi->s_file_encryption_mode & sbi->s_dir_encryption_mode
     - add f2fs_end_io_crypto_work for mpage_end_io

    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Ildar Muslukhov <ildarm@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e3c6346af07fcac38a55e2bb362df325060dbfa4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sat May 30 03:53:02 2015 +0300

    f2fs crypto: add encryption policy and password salt support

    This patch adds encryption policy and password salt support through ioctl
    implementation.

    It adds three ioctls:
     F2FS_IOC_SET_ENCRYPTION_POLICY,
     F2FS_IOC_GET_ENCRYPTION_POLICY,
     F2FS_IOC_GET_ENCRYPTION_PWSALT, which use xattr operations.

    Note that these definitions and code are taken from ext4 crypto support.
    For f2fs, xattr operations and on-disk flags for superblock and inode were
    changed.

    Conflicts:
    	fs/f2fs/f2fs.h

    Change-Id: Ie26c1e21bd8d9220a470cfe1cad2decb8ef2740e
    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9364af8d6fd9d83c276773fd839ec164854d065e
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri Apr 10 16:43:31 2015 -0700

    f2fs crypto: add encryption xattr support

    This patch adds some definitions for the encryption xattr.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit cfca666a259a7ed2863e4be41f53a71a94e230ed
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri Apr 10 16:28:26 2015 -0700

    f2fs crypto: add f2fs encryption Kconfig

    This patch adds f2fs encryption config.

    This patch integrates:

    "ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is enabled

    On arm64 this is apparently needed for CTS mode to function correctly.
    Otherwise attempts to use CTS return ENOENT."

    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit c2f8df83f2fc990fd80e00b6c2df604a9894d092
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 20 13:57:51 2015 -0700

    f2fs crypto: declare some definitions for f2fs encryption feature

    These definitions will be used by inode and superblock for encryption.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit f1b4fdf0ad9546b4db680b58a97f2e3de3b4784a
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri May 8 19:30:32 2015 -0700

    f2fs: report unwritten area in f2fs_fiemap

    This patch slightly changes f2fs_fiemap function to report unwritten area.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e002552aab14ce70781a772c10d2d3f5b20baaec
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri May 8 16:37:28 2015 -0700

    f2fs: avoid value overflow in showing current status

    This patch fixes an overflow when running cat /sys/kernel/debug/f2fs/status.
    If a section is relatively large, the dist value can overflow.

    Reported-by: Yossi Goldfill <ygoldfill@radianmemory.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
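
    To illustrate the kind of overflow mentioned above, here is a small
    sketch. The function and parameter names are hypothetical and the actual
    formula used for the status file is not shown in this log; the sketch
    only assumes that a squared block-count distance is involved, which wraps
    in 32 bits once the section is large enough.

```c
#include <stdint.h>

/* Squared distance computed in 32 bits: wraps around for large inputs. */
uint32_t squared_dist_u32(uint32_t valid_blocks, uint32_t half_blocks)
{
    uint32_t d = valid_blocks > half_blocks ? valid_blocks - half_blocks
                                            : half_blocks - valid_blocks;
    return d * d;
}

/* Same computation widened to 64 bits before multiplying: no wrap here. */
uint64_t squared_dist_u64(uint32_t valid_blocks, uint32_t half_blocks)
{
    uint64_t d = valid_blocks > half_blocks ? valid_blocks - half_blocks
                                            : half_blocks - valid_blocks;
    return d * d;
}
```

    For a distance of 100000 blocks the 64-bit version yields 10000000000,
    while the 32-bit version silently wraps to a much smaller number.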

commit 298b7e72abe8fd01162966d2a3278fcfa3091abd
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Wed May 6 13:11:13 2015 +0800

    f2fs: support FALLOC_FL_ZERO_RANGE

    Now, FALLOC_FL_ZERO_RANGE flag in ->fallocate is supported in ext4/xfs.

    In commit, the semantics of this flag are described as follows:"
    1) Make sure that both offset and len are block size aligned.
    2) Update the i_size of inode by len bytes.
    3) Compute the file's logical block number against offset. If the computed
       block number is not the starting block of the extent, split the extent
       such that the block number is the starting block of the extent.
    4) Shift all the extents which are lying between
       [offset, last allocated extent] towards right by len bytes. This step
       will make a hole of len bytes at offset."

    This patch implements fallocate's FALLOC_FL_ZERO_RANGE for f2fs.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 69592cd8daa1027741ac86e498ae9e71c8eccf95
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Wed May 6 13:09:46 2015 +0800

    f2fs: support FALLOC_FL_COLLAPSE_RANGE

    Now, FALLOC_FL_COLLAPSE_RANGE flag in ->fallocate is supported in ext4/xfs.

    In commit, the semantics of this flag are described as follows:"
    1) It collapses the range lying between offset and length by removing any
       data blocks which are present in this range and then updates all the
       logical offsets of extents beyond "offset + len" to nullify the hole
       created by removing blocks. In short, it does not leave a hole.
    2) It should be used exclusively. No other fallocate flag in combination.
    3) Offset and length supplied to fallocate should be fs block size aligned
       in case of xfs and ext4.
    4) Collapse range does not work beyond i_size."

    This patch implements fallocate's FALLOC_FL_COLLAPSE_RANGE for f2fs.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
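
    The argument rules quoted above (block-size alignment, and no collapsing
    at or beyond i_size) can be sketched as a plain validation helper. The
    function name and signature are hypothetical and this is not the f2fs
    implementation; it only encodes the quoted preconditions, treating a
    range that reaches end of file as invalid.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when (offset, len) is an acceptable
 * collapse range for a file of i_size bytes on a filesystem with the
 * given block size.
 */
bool collapse_range_args_ok(uint64_t offset, uint64_t len,
                            uint64_t i_size, uint64_t blksize)
{
    if (blksize == 0 || len == 0)
        return false;

    /* Offset and length must both be block-size aligned. */
    if (offset % blksize != 0 || len % blksize != 0)
        return false;

    /* The range must end strictly before i_size (written to avoid overflow). */
    if (offset >= i_size || len >= i_size - offset)
        return false;

    return true;
}
```

    A caller would run a check like this before touching any block
    mappings, and reject the request otherwise.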

commit 1c615b44cea0f68a8880cfdef1d8fdc6c3412249
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Wed May 6 13:08:06 2015 +0800

    f2fs: introduce f2fs_replace_block() for reuse

    Introduce a generic function replace_block based on recover_data_page,
    and export it. So with it we can operate file's meta data which is in
    CP/SSA area when we invoke fallocate with FALLOC_FL_COLLAPSE_RANGE
    flag.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 668cd2607b7ab0a4fc94563d2bfcd3e2460827bb
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Apr 30 18:35:50 2015 +0800

    f2fs: do not re-lookup nat cache with same nid

    In set_node_addr, we try to lookup cached nat entry of inode and then
    set flag in it.

    But previously in this function, we have already grabbed nat entry with
    current node id, if the node id is the same as the one of inode, we
    do not need to lookup it in cache again.

    So this patch adds condition judgment for reducing unneeded lookup.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9a5b896f752fc26046f4c5cc9390fcfa5e9044ee
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Apr 30 18:34:41 2015 +0800

    f2fs: remove unneeded f2fs_make_empty declaration

    Remove the f2fs_make_empty() declaration, since the main body of this function
    was moved into do_make_empty_dir() and the function is obsolete now.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 91a71eae9377fdcddfd526dad85c7868bb17863f
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 30 22:50:06 2015 -0700

    f2fs: issue discard with finally produced len and minlen

    This patch determines to issue discard commands by comparing given minlen and
    the length of produced final candidates.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
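
    A rough sketch of the comparison described above, assuming a simple
    boolean array as a stand-in for the discard candidates (the real data
    structures are not shown in this log, and the function name is
    hypothetical). Runs of adjacent candidates are merged first, and only
    the final run length is compared against minlen.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Count how many discard commands would be issued: one per run of
 * adjacent discardable blocks whose final length is at least minlen.
 */
size_t count_discard_runs(const bool *discardable, size_t nblocks, size_t minlen)
{
    size_t issued = 0;
    size_t run = 0;

    /* Iterate one step past the end so the last run is flushed too. */
    for (size_t i = 0; i <= nblocks; i++) {
        if (i < nblocks && discardable[i]) {
            run++;
            continue;
        }
        if (run > 0 && run >= minlen)
            issued++;
        run = 0;
    }
    return issued;
}
```

    With runs of length 2, 4 and 1, a minlen of 2 issues two discards and
    skips the single-block run.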

commit 74a0bd91eb27bc6bf8b7d5b4d0576ec9ceaa31ad
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 30 22:37:50 2015 -0700

    f2fs: introduce discard_map for f2fs_trim_fs

    This patch adds a bitmap for discard issues from f2fs_trim_fs.
    The rule is to issue discard commands only for blocks invalidated
    after mount.
    Once mount is done, f2fs_trim_fs trims out the whole invalid area.
    After that, it will not issue any discards redundantly.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
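
    The behaviour described above can be pictured with a tiny model. All
    names here are hypothetical and the bit semantics of the real
    discard_map are not shown in this log; the sketch simply remembers which
    invalid blocks were already trimmed so that a second trim issues nothing.

```c
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 8

/* Hypothetical model: per-block validity plus an "already discarded" map. */
struct trim_state {
    bool valid[NBLOCKS];
    bool discarded[NBLOCKS];
};

/* Trim every invalid block that was not discarded yet; return how many. */
size_t trim_fs(struct trim_state *s)
{
    size_t issued = 0;

    for (size_t i = 0; i < NBLOCKS; i++) {
        if (!s->valid[i] && !s->discarded[i]) {
            s->discarded[i] = true;
            issued++;
        }
    }
    return issued;
}

/* A block written with new data is valid again and no longer discarded. */
void write_block(struct trim_state *s, size_t blk)
{
    s->valid[blk] = true;
    s->discarded[blk] = false;
}

/* Invalidating a block makes it a candidate for the next trim. */
void invalidate_block(struct trim_state *s, size_t blk)
{
    s->valid[blk] = false;
    s->discarded[blk] = false;
}
```

    The first trim after mount covers the whole invalid area; later trims
    only touch blocks invalidated since then.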

commit c61181f0e958976a2d4a37dca015419dde22ddc0
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri May 1 11:08:59 2015 -0700

    f2fs: revmove spin_lock for write_orphan_inodes

    This patch removes spin_lock, since this is covered by f2fs_lock_op already.
    And, we should avoid using page operations inside spin_lock.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 5afc4d50c96103050a35d1cea5e87dbc09ae10f8
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 30 17:00:33 2015 -0700

    f2fs: split find_data_page according to specific purposes

    This patch splits find_data_page as follows.

    1. f2fs_gc
     - use get_read_data_page() with read only

    2. find_in_level
     - use find_data_page without locked page

    3. truncate_partial_page
     - In the case cache_only mode, just drop cached page.
     - Otherwise, use get_lock_data_page() and guarantee to truncate

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 87aafb380f1996e4c52598a3efd91d920bcfa7f9
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 30 18:58:22 2015 -0700

    f2fs: fix counting the number of inline_data inodes

    This patch fixes the count by including the missing symlink case.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ffef9b60036ca4538b55a701fe04191b636217ef
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 29 18:31:19 2015 -0700

    f2fs: add need_dentry_mark

    This patch introduces need_dentry_mark() to clean up and avoid redundant
    node locks.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ba47ac654a4c59152f1ce6db8fd05e9383dec1fe
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 29 11:18:42 2015 -0700

    f2fs: fix race on allocating and deallocating a dentry block

    There are two threads:
     f2fs_delete_entry()              get_new_data_page()
                                      f2fs_reserve_block()
                                      dn.blkaddr = XXX
     lock_page(dentry_block)
     truncate_hole()
     dn.blkaddr = NULL
     unlock_page(dentry_block)
                                      lock_page(dentry_block)
                                      fill the block from XXX address
                                      add new dentries
                                      unlock_page(dentry_block)

    Later, f2fs_write_data_page() will truncate the dentry_block, since
    its block address is NULL.

    The reason for this was due to the wrong lock order.
    In this case, we should do f2fs_reserve_block() after locking its dentry block.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit a42b0a273a344637eb5f5471e5d28ac7dda7e8e9
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sun Apr 26 00:15:29 2015 -0700

    f2fs: introduce dot and dotdot name check

    This patch adds an inline function to check dot and dotdot names.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
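
    A user-space sketch of such a check, assuming a name passed as a pointer
    plus length because directory entry names are not necessarily
    NUL-terminated. The function name and signature are hypothetical; the
    helper added by the patch is not shown in this log and would more likely
    be a static inline operating on a kernel name structure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true when the name is exactly "." or "..". */
bool is_dot_or_dotdot(const char *name, size_t len)
{
    if (len == 1 && name[0] == '.')
        return true;

    if (len == 2 && name[0] == '.' && name[1] == '.')
        return true;

    return false;
}
```

    Checking the length first means the function never reads past the end
    of a short or empty name.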

commit 18c184303ab9d3c678e995b5bf7ca5342fbbf6a4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri Apr 24 14:34:30 2015 -0700

    f2fs: move get_page for gc victims

    This patch moves getting victim page into move_data_page.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit b2f7597deea395e7f7d1cd3be8611f32e1e3efaa
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 23 14:38:15 2015 -0700

    f2fs: add sbi and page pointer in f2fs_io_info

    This patch adds f2fs_sb_info and page pointers in f2fs_io_info structure.
    With this change, we can reduce a lot of parameters for IO functions.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d8e89b536fda0e5cfefe39ce08da6f9e0dbe0ebb
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 23 10:27:21 2015 -0700

    f2fs: add f2fs_may_inline_{data, dentry}

    This patch adds f2fs_may_inline_data and f2fs_may_inline_dentry.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit cb883ba942a6ed6d456336fe564c4b44e6198c0d
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 22 11:40:27 2015 -0700

    f2fs: clean up f2fs_lookup

    This patch cleans up to avoid deep indentation.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 01a04b4c504ca53b225f4c5d790fd4401be51093
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sat May 30 03:51:08 2015 +0300

    f2fs: expose f2fs_mpage_readpages

    This patch implements f2fs_mpage_readpages for further optimization on
    encryption support.

    The basic code was taken from fs/mpage.c, and simplified by assuming
    that block_size is equal to page_size in f2fs.

    Conflicts:
    	fs/f2fs/data.c

    Change-Id: Id413493a8fe6c7f6e69093127f8c6e9a7a8ba89d
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 8b2b31668d8f0b71f8b2e6dd9e569aa4f2839802
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 20 18:49:51 2015 -0700

    f2fs: introduce f2fs_commit_super

    This patch introduces f2fs_commit_super to write updated superblock.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d6c2b7a44a3b04d55d19c15a1d9715262dc5f9ca
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 6 19:55:34 2015 -0700

    f2fs: add f2fs_map_blocks

    This patch introduces the f2fs_map_blocks structure, like ext4_map_blocks.
    Now, f2fs uses f2fs_map_blocks when handling get_block.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d7059701be7fdeeeedf4cad7421215304893ec77
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 13 15:10:36 2015 -0700

    f2fs: add feature facility in superblock

    This patch introduces a feature in superblock, which will indicate any new
    features for f2fs.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 0f7e2640a2dcbb9518db29f7f11efa4bbcb4da68
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 20 11:52:23 2015 -0700

    f2fs: add missing version info in superblock

    mkfs.f2fs leaves the kernel version in the superblock, but the f2fs module has not
    done that so far.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit eac4ad9a9caf2487ac4c3f52b2bbbda78aa6abd9
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 20 13:44:41 2015 -0700

    f2fs: move existing definitions into f2fs.h

    This patch moves some inode-related definitions from node.h to f2fs.h to
    add new features.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 631726792831934a9bc52b80774041154a9606c7
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Sat Apr 18 18:06:49 2015 +0800

    f2fs: make has_fsynced_inode static

    has_fsynced_inode() has no callers outside node.c, so make it static.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 5f32f11835a35945f51a4d167ea7584454c0f8de
Author: Taehee Yoo <ap420073@gmail.com>
Date:   Tue Apr 21 15:59:12 2015 +0900

    f2fs: add offset check routine before punch_hole() in f2fs_fallocate()

    In punch_hole(), if the offset is bigger than the inode size, it returns success.
    Then f2fs_fallocate() updates the time and marks the inode dirty.
    In that case, the inode has not actually been modified.
    So I have added an offset check that prevents calling punch_hole().

    Signed-off-by: Taehee Yoo <ap420073@gmail.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
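
    The early return described above can be sketched as follows. The
    structure and function names are hypothetical and the hole punching
    itself is left out; the sketch only shows that an offset at or past
    i_size leaves the inode untouched, so no time update or dirty marking
    happens.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode fields that matter here. */
struct fake_inode {
    int64_t i_size;
    bool dirty;
};

/* Returns 0 on success or a negative errno value for bad arguments. */
int fallocate_punch(struct fake_inode *inode, int64_t offset, int64_t len)
{
    if (offset < 0 || len <= 0)
        return -EINVAL;

    /* Nothing to punch beyond end of file: do not mark the inode dirty. */
    if (offset >= inode->i_size)
        return 0;

    /* ... a real implementation would punch the hole here ... */
    inode->dirty = true;
    return 0;
}
```

    Both cases return success, but only the second one changes the inode
    state.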

commit 540a536766de5ff64402bdb6efc53552d6fe0313
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Sat Apr 18 18:05:36 2015 +0800

    f2fs: use is_valid_blkaddr to verify blkaddr for readability

    Export is_valid_blkaddr() and use it to replace some code for readability.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d8972539d263574eef3555446dd79b9255d1585a
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Sat Apr 18 18:03:58 2015 +0800

    f2fs: make posix_acl_create() safer and cleaner

    Our f2fs_acl_create is copied from posix_acl_create in ./fs/posix_acl.c and
    modified to avoid deadlock bug when inline_dentry feature is enabled.

    Dan Carpenter rewrites posix_acl_create in commit 2799563b281f
    ("fs/posix_acl.c: make posix_acl_create() safer and cleaner") to make this
    function safer, so that we can avoid potential bugs in its callers,
    especially for ocfs2.

    Let's backport the patch to f2fs.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit a9cf8749358973f83ff535d1958b067f8ad0e807
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 22 11:03:48 2015 -0700

    f2fs: fix wrong error hanlder in f2fs_follow_link

    page_follow_link_light returns NULL, and its error pointer is left
    in nd->path.

    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ee92d4b167e63cb6f0fba74a6ada00863057cb67
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Apr 21 10:40:54 2015 -0700

    Revert "f2fs: enhance multi-threads performance"

    Yuanhan Liu reported a performance regression caused by this commit.
    The basic idea was to reduce single-point mutex contention, but it turns out this
    causes another kind of contention, such as context switches.

    https://lkml.org/lkml/2015/4/21/11

    Until finishing the analysis on this issue, I'd like to revert this for a while.

    This reverts commit 78373b7319abdf15050af5b1632c4c8b8b398f33.

commit 7883504e5d988e192ee57e695da9448d821f50e2
Author: doc <doc.divxm@gmail.com>
Date:   Sat May 30 10:58:40 2015 +0300

    Revert "f2fs: support 3.10"

    This reverts commit 89b2b2ae32d324f0dfc0b8898798969134fd5b84.

    Change-Id: I73f831eea4b616c07413c8f3f3f3bac4111c801c

commit c3fab7fea7fea818a1f0af2493d827dd8c1dfb2e
Author: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Date:   Tue Feb 4 14:20:16 2014 +0900

    f2fs: support 3.10

    Change-Id: I9059ac5ed39e25b31be078399452d9625506b780
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 39520b54dab76f5d79248b9ff7584934fb04b2bd
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 9 17:03:53 2015 -0700

    f2fs: pass checkpoint reason on roll-forward recovery

    This patch adds CP_RECOVERY to keep recovery information for checkpoint.
    It also makes sure a checkpoint is written in this case.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 783f95a08679ffbd1e5ae04db571cebfa3b5108d
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 15 13:49:55 2015 -0700

    f2fs: avoid abnormal behavior on broken symlink

    When f2fs_symlink was triggered and checkpoint was done before syncing its
    link path, f2fs can end up with a broken symlink like "xxx -> \0\0\0".
    This causes an abnormal path_walk in the VFS.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 2d428db306541ce5c1b3cb84854df35637b15cc3
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 15 13:37:53 2015 -0700

    f2fs: flush symlink path to avoid broken symlink after POR

    This patch tries to avoid the broken symlink case after POR on a best-effort basis.
    This results in a performance regression.
    But, if f2fs has inline_data and the target path is shorter than 3KB,
    the page would be stored in its inode_block, so that there would be no
    performance regression.

    Note that, if the user wants to keep this file atomically, it needs to trigger
    dir->fsync.
    And, there is still a hole that can produce a broken symlink.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e5ad0b9633c463b93cee9987f6244caa0b8e653a
Author: Taehee Yoo <ap420073@gmail.com>
Date:   Mon Apr 13 21:48:06 2015 +0900

    f2fs: change 0 to false for bool type

    In the f2fs_fill_super function, the variable "retry" is of bool type,
    so I think it should be set to false.

    Signed-off-by: Taehee Yoo <ap420073@gmail.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9484926fdc3cba35d52e67c37136c84cc6a3d936
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 1 19:38:20 2015 -0700

    f2fs: do not recover wrong data index

    During the roll-forward recovery, if we find a new data index written by the
    last fsync, we need to recover the new block address.
    But, if that address was corrupted, we should not recover it.
    Otherwise, f2fs gets kernel panic from:

     In check_index_in_prev_nodes(),

        sentry = get_seg_entry(sbi, segno);
                 --------------------------> out-of-range segno.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
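
    A sketch of the kind of range check that avoids the out-of-range segno
    shown above. The structure, field names and function name are
    hypothetical and the real check in the patch is not shown in this log;
    the sketch assumes the segment number is derived from the block address
    relative to the start of the main area and that blocks_per_seg is
    non-zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the layout fields needed by the check. */
struct fake_sb_info {
    uint32_t main_blkaddr;   /* first block of the main area */
    uint32_t blocks_per_seg; /* must be non-zero */
    uint32_t total_segs;     /* number of segments in the main area */
};

/* Return true only if blkaddr maps to a segment number that exists. */
bool blkaddr_in_main_area(const struct fake_sb_info *sbi, uint32_t blkaddr)
{
    uint32_t segno;

    if (blkaddr < sbi->main_blkaddr)
        return false;

    segno = (blkaddr - sbi->main_blkaddr) / sbi->blocks_per_seg;
    return segno < sbi->total_segs;
}
```

    Recovery code would skip a data index whose address fails this test
    instead of looking up a segment entry for it.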

commit 02dba4ef5e99c6f724e20b6449b64c98ba3a7ed4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Mar 31 18:03:29 2015 -0700

    f2fs: do not increase link count during recovery

    If there are multiple fsynced dnodes having a dent flag, roll-forward routine
    sets FI_INC_LINK for their inode, and recovery_dentry increases its link count
    accordingly.
    That results in a normal file having a link count of 2, so we can't unlink those
    files.

    This was added to handle several inode blocks having same inode number with
    different directory paths.
    But, current f2fs doesn't replay all of the path changes and only recovers its dentry
    for the last fsynced inode block.
    So, there is no reason to do this.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e5bae38d60d93e40dee266a3198be4b1e4337215
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Mar 30 15:23:45 2015 -0700

    f2fs: assign parent's i_mode for empty dir

    When assigning i_mode for dotdot, it needs to assign parent's i_mode.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 945df7d6bd0c818e81da1f583fcb6d8bbe217ed4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Mar 30 15:07:16 2015 -0700

    f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries

    If f2fs was corrupted with missing dot dentries, it needs to recover them after
    fsck.f2fs detection.

    The underlying procedure is:

    1. fsck.f2fs sets the F2FS_INLINE_DOTS flag in the directory inode, if it detects
    missing dot dentries.

    2. When f2fs looks up the corrupted directory, it triggers f2fs_add_link with
    proper inode numbers and their dot and dotdot names.

    3. Once f2fs recovers the directory without errors, it removes F2FS_INLINE_DOTS
    finally.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit feb625fb93d4251aa16a23ccabdd827b223733cb
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Mar 26 18:46:38 2015 -0700

    f2fs: fix mismatching lock and unlock pages for roll-forward recovery

    Previously, the inode page was not correctly locked and unlocked in pairs during
    the roll-forward recovery.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit f87dbe5bf4010a3ac30da36b837380d2f78262e0
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Mar 24 12:04:20 2015 -0700

    f2fs: fix sparse warnings

    This patch fixes the below warning.

    sparse warnings: (new ones prefixed by >>)

    >> fs/f2fs/inode.c:56:23: sparse: restricted __le32 degrades to integer
    >> fs/f2fs/inode.c:56:52: sparse: restricted __le32 degrades to integer

    Reported-by: kbuild test robot <fengguang.wu@intel.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 55872dc313a783379de2385454bf7f4cd1f44fff
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Tue Mar 24 13:08:05 2015 +0800

    f2fs: limit b_size of mapped bh in f2fs_map_bh

    Mapping a bh beyond the maximum size defined by the caller is not needed, so
    limit it in f2fs_map_bh.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 252d8d969f381ff097c746d402523835a30b5418
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 23 10:37:39 2015 +0800

    f2fs: persist system.advise into on-disk inode

    This patch marks the inode dirty so that i_advise in the f2fs inode info is
    persisted into the on-disk inode when the user sets system.advise through
    setxattr. Otherwise the new value will be lost.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ddb5c14b13a9de1196b382d75a103a9cfd4c575e
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 23 10:36:15 2015 +0800

    f2fs: avoid NULL pointer dereference in f2fs_xattr_advise_get

    We encounter an oops when executing the command below.
    getfattr -n system.advise /mnt/f2fs/file
    Killed

    message log:
    BUG: unable to handle kernel NULL pointer dereference at   (null)
    IP: [<f8b54d69>] f2fs_xattr_advise_get+0x29/0x40 [f2fs]
    *pdpt = 00000000319b7001 *pde = 0000000000000000
    Oops: 0002 [#1] SMP
    Modules linked in: f2fs(O) snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev
    snd_seq_device snd_timer bnep snd rfcomm microcode bluetooth soundcore i2c_piix4 mac_hid serio_raw parport_pc ppdev lp parport
    binfmt_misc hid_generic psmouse usbhid hid e1000 [last unloaded: f2fs]
    CPU: 3 PID: 3134 Comm: getfattr Tainted: G           O    4.0.0-rc1 #6
    Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    task: f3a71b60 ti: f19a6000 task.ti: f19a6000
    EIP: 0060:[<f8b54d69>] EFLAGS: 00010246 CPU: 3
    EIP is at f2fs_xattr_advise_get+0x29/0x40 [f2fs]
    EAX: 00000000 EBX: f19a7e71 ECX: 00000000 EDX: f8b5b467
    ESI: 00000000 EDI: f2008570 EBP: f19a7e14 ESP: f19a7e08
     DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    CR0: 80050033 CR2: 00000000 CR3: 319b8000 CR4: 000007f0
    Stack:
     f8b5a634 c0cbb580 00000000 f19a7e34 c1193850 00000000 00000007 f19a7e71
     f19a7e64 c0cbb580 c1193810 f19a7e50 c1193c00 00000000 00000000 00000000
     c0cbb580 00000000 f19a7f70 c1194097 00000000 00000000 00000000 74737973
    Call Trace:
     [<c1193850>] generic_getxattr+0x40/0x50
     [<c1193810>] ? xattr_resolve_name+0x80/0x80
     [<c1193c00>] vfs_getxattr+0x70/0xa0
     [<c1194097>] getxattr+0x87/0x190
     [<c11801d7>] ? path_lookupat+0x57/0x5f0
     [<c11819d2>] ? putname+0x32/0x50
     [<c116653a>] ? kmem_cache_alloc+0x2a/0x130
     [<c11819d2>] ? putname+0x32/0x50
     [<c11819d2>] ? putname+0x32/0x50
     [<c11819d2>] ? putname+0x32/0x50
     [<c11827f9>] ? user_path_at_empty+0x49/0x70
     [<c118283f>] ? user_path_at+0x1f/0x30
     [<c11941e7>] path_getxattr+0x47/0x80
     [<c11948e7>] SyS_getxattr+0x27/0x30
     [<c163f748>] sysenter_do_call+0x12/0x12
    Code: 66 90 55 89 e5 57 56 53 66 66 66 66 90 8b 78 20 89 d3 ba 67 b4 b5 f8 89 d8 89 ce e8 42 7c 7b c8 85 c0 75 16 0f b6 87 44 01 00
    00 <88> 06 b8 01 00 00 00 5b 5e 5f 5d c3 8d 76 00 b8 ea ff ff ff eb
    EIP: [<f8b54d69>] f2fs_xattr_advise_get+0x29/0x40 [f2fs] SS:ESP 0068:f19a7e08
    CR2: 0000000000000000
    ---[ end trace 860260654f1f416a ]---

    The reason is that getfattr works in two steps, as indicated by the strace info:
    1) try to lookup and get size of specified xattr.
    2) get value of the extended attribute.

    strace info:
    getxattr("/mnt/f2fs/file", "system.advise", 0x0, 0) = 1
    getxattr("/mnt/f2fs/file", "system.advise", "\x00", 256) = 1

    For the first step, getfattr may pass a NULL pointer in @value and zero in @size
    as parameters for ->getxattr, but we access this @value pointer directly without
    checking whether the pointer is valid or not in f2fs_xattr_advise_get, so the
    oops occurs.

    This patch fixes this issue by verifying the @value pointer before using it.
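
    The two-step protocol above can be modelled in user space. The following is
    only a sketch, not the f2fs handler itself: demo_inode and demo_advise_get
    are hypothetical names, and the -ERANGE branch is an assumption added for
    completeness. It shows a getter that treats a NULL buffer as a size probe
    instead of writing through it.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for an inode carrying a one-byte advise attribute. */
struct demo_inode {
    unsigned char advise;
};

/*
 * Model of a one-byte xattr getter. A NULL buffer means the caller only
 * wants the size (step 1), so nothing is written through the pointer.
 */
static ssize_t demo_advise_get(const struct demo_inode *inode,
                               void *value, size_t size)
{
    if (value == NULL)
        return sizeof(inode->advise);        /* size probe */
    if (size < sizeof(inode->advise))
        return -ERANGE;                      /* assumption: buffer too small */
    *(unsigned char *)value = inode->advise; /* step 2: copy the value out */
    return sizeof(inode->advise);
}
```

    Calling demo_advise_get(&inode, NULL, 0) mirrors the first getxattr call in
    the strace output, and a call with a real buffer mirrors the second.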

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit b0b3d399c9c7d01a052bb1643c6727bb0380850e
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 23 10:33:37 2015 +0800

    f2fs: preallocate fallocated blocks for direct IO

    Normally, because the DIO_SKIP_HOLES flag is set by default, blockdev_direct_IO
    in f2fs_direct_IO tries to skip DIO in holes when writing inside i_size. This
    makes us fall back to buffered IO, which shows lower performance.

    So in commit 59b802e5a453 ("f2fs: allocate data blocks in advance for
    f2fs_direct_IO"), we improved performance by allocating data blocks in advance
    whenever we meet holes, whether inside i_size or not, since this avoids falling
    back to buffered IO.

    But that commit did not consider unwritten fallocated blocks.
    This patch fixes the fallocate case, which helps to improve
    performance.

    Test result:
    Storage info: sandisk ultra 64G micro sd card.

    touch /mnt/f2fs/file
    truncate -s 67108864 /mnt/f2fs/file
    fallocate -o 0 -l 67108864 /mnt/f2fs/file
    time dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=64 conv=notrunc oflag=direct

    Time before applying the patch:
    67108864 bytes (67 MB) copied, 36.16 s, 1.9 MB/s
    real    0m36.162s
    user    0m0.000s
    sys     0m0.180s

    Time after applying the patch:
    67108864 bytes (67 MB) copied, 27.7776 s, 2.4 MB/s
    real    0m27.780s
    user    0m0.000s
    sys     0m0.036s

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 5b2ffbe519dddc8205a615a97943c978906262eb
Author: Wanpeng Li <wanpeng.li@linux.intel.com>
Date:   Tue Mar 24 10:20:27 2015 +0800

    f2fs: enable inline data by default

    Enable the inline_data feature by default, since it brings better
    performance and space utilization and is now stable.
    Add another option, noinline_data, to disable it at mount time.

    Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
    Suggested-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 5b6c1edfe66a6429acb9d5216aafb340022709bd
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Mar 19 19:27:51 2015 +0800

    f2fs: preserve extent info for extent cache

    This patch tries to preserve last extent info in extent tree cache into on-disk
    inode, so this can help us to reuse the last extent info next time for
    performance.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ef0c384c2b9eeb740ba676a10e5a2b194b23b1b0
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Mar 19 19:26:02 2015 +0800

    f2fs: initialize extent tree with on-disk extent info of inode

    With the normal extent info cache, we record the largest extent mapping between
    logical block and physical block in extent info, and we persist extent info in
    the on-disk inode.

    When the extent tree cache is enabled, if extent info exists in the on-disk
    inode and the extent is not a small fragmented mapping extent, we had better
    load the extent info into the extent tree cache when the inode is loaded. This
    way we have more chances to hit the extent tree cache rather than taking more
    time to read the dnode page for the block address.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 49b71c3af3ee7476f647071eee659998f11678d1
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Mar 19 19:24:59 2015 +0800

    f2fs: introduce __{find,grab}_extent_tree

    This patch introduces __{find,grab}_extent_tree for reuse by the following
    patches.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 841e0eccff53960fc7dcfec4a9f1e09186478205
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Mar 19 19:23:32 2015 +0800

    f2fs: split set_data_blkaddr from f2fs_update_extent_cache

    Split __set_data_blkaddr from f2fs_update_extent_cache for readability.

    Additionally rename __set_data_blkaddr to set_data_blkaddr for exporting.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit f0683a2560052529e3b64180f54f2289463d8d82
Author: Wanpeng Li <wanpeng.li@linux.intel.com>
Date:   Thu Mar 19 13:23:48 2015 +0800

    f2fs: enable fast symlink by utilizing inline data

    Fast symlink can utilize inline data flow to avoid using any
    i_addr region, since we need to handle many cases such as
    truncation, roll-forward recovery, and fsck/dump tools.

    Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 139dd93c93898c3ccf19791d29890fe594332d64
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Mar 17 17:58:08 2015 -0700

    f2fs: add some tracepoints to debug volatile and atomic writes

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 94718185df166d6c87cf856375479ee2c296be2b
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Mar 17 17:16:35 2015 -0700

    f2fs: avoid punch_hole overhead when releasing volatile data

    This patch avoids some punch_hole overhead when releasing volatile data.
    If volatile data has not been written yet, we can just zero the first page.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 3e8a62b8c2858369f7d563525fb528f7a4f1cb30
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Mar 16 16:54:52 2015 -0700

    f2fs: avoid wrong f2fs_bug_on when truncating inline_data

    This patch removes wrong f2fs_bug_on in truncate_inline_inode.

    When there is no space, a corner case can occur where i_size is over
    MAX_INLINE_SIZE while its inode is still inline_data.

    The scenario is
     1. write small data into file #A.
     2. fill the whole partition to 100%.
     3. truncate 4096 on file #A.
     4. write data at 8192 offset.
      --> f2fs_write_begin
        -> -ENOSPC = f2fs_convert_inline_page
        -> f2fs_write_failed
          -> truncate_blocks
            -> truncate_inline_inode
    	  BUG_ON, since i_size is 4096.

    Reviewed-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 91c955cae47d1e00bf6b8db945f603df00b9feac
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri Mar 13 21:44:36 2015 -0700

    f2fs: enhance multi-threads performance

    Previously, f2fs_write_data_pages had a mutex, sbi->writepages, to serialize
    data writes to maximize write bandwidth, while sacrificing multi-threaded
    performance.
    Practically, however, multi-threaded environments are much more important for
    users. So this patch tries to remove the mutex.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 4c02c4dac27ed95f8397c32ecce4a822f531d9f2
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Mar 11 23:27:25 2015 -0400

    f2fs: set buffer_new when new blocks are allocated

    This patch calls set_buffer_new when new blocks are allocated.

    Reviewed-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 6a69e9c5bfef161547fe3efdcf0894faeb45e37b
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 16 21:08:44 2015 +0800

    f2fs: set SBI_NEED_FSCK when encountering exception in recovery

    This patch sets the SBI_NEED_FSCK flag in sbi only when we fail to recover
    in fill_super, so we can skip fscking the image when we fail to fill super for
    other reasons.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 8c91980616b9af9cfeb810d1a7c82ad9323b3e7d
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Mar 11 13:42:48 2015 -0400

    f2fs: fix to cover sentry_lock for block allocation

    In the following call stack, f2fs changes the bitmap for dirty segments and # of
    dirty sentries without grabbing sit_i->sentry_lock.
    This can result in a mismatch between the bitmap and # of dirty sentries if
    there are some direct_io operations.

    In allocate_data_block,
     - __allocate_new_segments
      - mutex_lock(&curseg->curseg_mutex);
      - s_ops->allocate_segment
       - new_curseg/change_curseg
        - reset_curseg
         - __set_sit_entry_type
          - __mark_sit_entry_dirty
           - set_bit(dirty_sentries_bitmap)
           - dirty_sentries++;

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 8189fdf992151e8826c924241674ef0de44fde40
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Mar 12 17:04:24 2015 +0800

    f2fs: fix to check current blkaddr in __allocate_data_blocks

    In __allocate_data_blocks, we should check current blkaddr which is located at
    ofs_in_node of dnode page instead of checking first blkaddr all the time.
    Otherwise we can only allocate one blkaddr in each dnode page. Fix it.
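
    The off-by-index problem can be illustrated with a small user-space model.
    This is a sketch under assumptions, not the f2fs code: demo_allocate_range,
    DEMO_NULL_ADDR and the flat array standing in for a dnode page are all
    hypothetical. The point is that the check reads the slot at the current
    offset on every iteration rather than always reading slot 0.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NULL_ADDR 0u  /* hypothetical marker for "no block allocated" */

/*
 * Walk `count` slots starting at `start` and allocate every empty one.
 * Returns the number of slots that received a new block address.
 */
static size_t demo_allocate_range(uint32_t *addrs, size_t nslots,
                                  size_t start, size_t count,
                                  uint32_t *next_free)
{
    size_t allocated = 0;

    for (size_t i = 0; i < count && start + i < nslots; i++) {
        size_t ofs = start + i;

        /* Check the slot at the current offset, not addrs[0]. */
        if (addrs[ofs] == DEMO_NULL_ADDR) {
            addrs[ofs] = (*next_free)++;
            allocated++;
        }
    }
    return allocated;
}
```

    If the check always read addrs[0], a non-empty first slot would make the
    loop skip every later hole as well.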

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d616037f0c30568df61b77c738a4f6dfdde4b20b
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Tue Mar 10 13:16:25 2015 +0800

    f2fs: fix to truncate inline data past EOF

    Previously, if an inode had inline data, we would try to invalidate part of the
    inline data in page #0 when truncating the inode size in
    truncate_partial_data_page(). We then set page #0 dirty, so that the inode page
    could be synchronized with page #0 at ->writepage().

    But sometimes we fail to operate on page #0 in truncate_partial_data_page()
    for the reasons below:
    a) if offset is zero, we skip setting page #0 dirty.
    b) if page #0 is not uptodate, we fail to update it as it has no mapping
    data.

    So with the following operations, we will read stale data which should have
    been truncated.

    1. write inline data to file
    2. sync first data page to inode page
    3. truncate file size to 0
    4. truncate file size to max_inline_size
    5. echo 1 > /proc/sys/vm/drop_caches
    6. read file --> meet original inline data which remains in the inode page.

    This patch renames truncate_inline_data() to truncate_inline_inode() for code
    readability, then use truncate_inline_inode() to truncate inline data in inode
    page in truncate_blocks() and truncate page #0 in truncate_partial_data_page()
    for fixing.

    v2:
     o truncate partially #0 page in truncate_partial_data_page to avoid keeping
       old data in #0 page.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 74707d08913a157106af3dcacfce3be4f67f2c54
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 9 18:18:19 2015 +0800

    f2fs: fix reference leaks in f2fs_acl_create

    Our f2fs_acl_create is copied and modified from posix_acl_create to avoid
    deadlock bug when inline_dentry feature is enabled.

    Now, we got reference leaks in posix_acl_create, and this has been fixed in
    commit fed0b588be2f ("posix_acl: fix reference leaks in posix_acl_create")
    by Omar Sandoval.
    https://lkml.org/lkml/2015/2/9/5

    Let's fix this issue in f2fs_acl_create too.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Reviewed-by: Changman Lee <cm224.lee@ssamsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit c8539703ed9549d476a7ea4134e83a26104e309d
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 9 17:33:16 2015 +0800

    f2fs: fix to calculate max length of contiguous free slots correctly

    When l…
AndropaX pushed a commit to AndropaX/android_kernel_xiaomi_msm8992 that referenced this issue Apr 1, 2017
Squashed commit of the following:

commit 869d61cda160f0e824032c84aa5ac041639f5e24
Author: Scott Mertz <scott@cyngn.com>
Date:   Fri Jun 10 10:24:28 2016 -0700

    BACKPORT: f2fs: add a max block check for get_data_block_bmap

    (cherry pick from commit 179448bfe4cd201e98e728391c6b01b25c849fe8)

    This patch adds a max block check for get_data_block_bmap.

    The Trinity test program sends a block number as a parameter into
    ioctl_fibmap, which is then used in get_node_path(). When the block
    number is larger than the f2fs max blocks, it triggers a kernel bug.
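
    A bounds check of this kind can be sketched in user space as follows. The
    names demo_block_lookup and the flat lookup table are hypothetical, and
    returning -EFBIG is an assumption made for the illustration. The idea is
    simply to reject an out-of-range block number before it is used as an index.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Look up a logical block in a table of `max_blocks` entries.
 * Out-of-range requests are rejected before any indexing happens.
 */
static int demo_block_lookup(const uint64_t *table, uint64_t max_blocks,
                             uint64_t iblock, uint64_t *out)
{
    if (iblock >= max_blocks)
        return -EFBIG;          /* assumption: error code for "too large" */

    *out = table[iblock];
    return 0;
}
```

    A fuzzer-style input such as UINT64_MAX then fails early and leaves the
    output untouched.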

    Signed-off-by: Yunlei He <heyunlei@huawei.com>
    Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
    [Jaegeuk Kim: fix missing condition, pointed by Chao Yu]
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

    Change-Id: Ia3d24f23735c73bf1dc2c885512afcc393d2ba25

commit d162ca69fefad82d965f8a9335c1c546a82ff9ea
Author: Keith Mok <kmok@cyngn.com>
Date:   Mon Feb 29 14:54:35 2016 -0800

    f2fs: Use crypto crc32 functions

    The crc function is done bit by bit and is
    painfully slow. Switch to the crypto
    crc32 function, which is backed by h/w acceleration.

    Change-Id: I653b0d11d06db5aaae181fef15e67840d29edbca

commit 546d80887c268311b75cc5b56d359cb6c9d42fb5
Author: Keith Mok <kmok@cyngn.com>
Date:   Mon Jan 18 14:19:37 2016 -0800

    f2fs: Backport v4.4-rc8

    Fix f2fs to make it build for 3.10

    Change-Id: I38fbd1dfcdfd4293d93ceb54a45ba06a2793c8b9

commit 23348e15b5315a11949f7f95d5cf0bc1c3ea4e54
Author: Keith Mok <kmok@cyngn.com>
Date:   Mon Jan 18 13:39:41 2016 -0800

    f2fs: catch up to v4.4-rc8

    The last patch is:

    commit 5d2eb548b309be34ecf3b91f0b7300a2b9d09b8c
    Merge: 2870f6c 29608d2
    Author: Linus Torvalds <torvalds@linux-foundation.org>
    Date:   Fri Nov 13 18:02:30 2015 -0800

        Merge branch 'for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

        Pull vfs xattr cleanups from Al Viro.

        * 'for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
          f2fs: xattr simplifications
          squashfs: xattr simplifications
          9p: xattr simplifications
          xattr handlers: Pass handler to operations instead of flags
          jffs2: Add missing capability check for listing trusted xattrs
          hfsplus: Remove unused xattr handler list operations
          ubifs: Remove unused security xattr handler
          vfs: Fix the posix_acl_xattr_list return value
          vfs: Check attribute names in posix acl xattr handers

    Change-Id: I91363c68f2d4f1b0a8228bbbc2b8dcf9e2d93137

commit a66078e9dad2c760dfea9232d5a1cb77db0a2065
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu May 28 18:19:17 2015 -0700

    f2fs: fix a deadlock for summary page lock vs. sentry_lock

    In f2fs_gc:                      In f2fs_replace_block:
     - lock_page(sum_page)
      - check_valid_map()            - mutex_lock(sentry_lock)
       - mutex_lock(sentry_lock)     - change_curseg()
                                      - lock_page(sum_page)

    This patch fixes the deadlock condition.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 116ad121a93688ec612ec027ae11c982109c5269
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu May 28 17:06:40 2015 -0700

    f2fs crypto: clean up error handling in f2fs_fname_setup_filename

    Sync with:
      ext4 crypto: clean up error handling in ext4_fname_setup_filename

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit b74cd9ec13d46b69dc133c4c55893e8a4ed17f54
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sat May 30 04:00:15 2015 +0300

    f2fs: support 3.10

    Conflicts:
    	fs/f2fs/data.c
    	fs/f2fs/file.c
    	fs/f2fs/namei.c

    Change-Id: I2540749a14e6c9f1788a09a06afd586b691f0edd
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit bbc554deecde3c026d45b0f1d8b440867f296ec4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed May 27 19:51:42 2015 -0700

    f2fs crypto: avoid f2fs_inherit_context for symlink

    This patch avoids calling f2fs_inherit_context twice for a newly created symlink.
    The original call is made by f2fs_add_link(), which invokes f2fs_setxattr.
    If the second call is made, f2fs_setxattr is triggered again with the same
    encryption index.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 7b5d4613056bcae3273688c7387e296bc85dfd2f
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed May 20 19:12:30 2015 -0700

    f2fs crypto: introduce a mempool for bounce pages

    If a lot of write streams are triggered, alloc_page and __free_page are
    called frequently and at high cost, resulting in high memory pressure.

    In order to avoid that, this patch introduces an additional mempool for writeback pages.
    Note that the existing mempool is used for emergency purposes.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9748515b5cda4e36a5909a220e9b004c98280d6a
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Wed May 27 15:27:49 2015 +0800

    f2fs crypto: do not set encryption policy for non-directory by ioctl

    An encryption policy should only be set on an empty directory through ioctl.
    This patch adds a condition to verify the type of the target inode,
    to avoid incorrectly configuring a non-directory.

    Additionally, remove the unneeded inline data conversion, since regular files
    and symlinks should not be processed here.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 3d0c5ba34a3b827b1ed6fb7d4df0da6f11bd7251
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon May 25 18:09:03 2015 +0800

    f2fs crypto: allow setting encryption policy once

    This patch adds the XATTR_CREATE flag in setxattr when setting the encryption
    context for an inode. Without this flag the context could be set more than
    once, which should never happen. So, fix it.
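
    The create-only semantics can be modelled with a tiny user-space store.
    This is a sketch, not f2fs_setxattr: demo_ctx_slot, demo_set_context and
    DEMO_XATTR_CREATE are hypothetical names chosen to mirror the behaviour of
    the XATTR_CREATE flag, where a set request fails with EEXIST if the
    attribute is already present.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_XATTR_CREATE 0x1  /* hypothetical flag: fail if already set */

struct demo_ctx_slot {
    int  present;      /* non-zero once a context has been stored */
    char ctx[16];
};

/* Store a context; with DEMO_XATTR_CREATE an existing one is never replaced. */
static int demo_set_context(struct demo_ctx_slot *slot, const char *ctx,
                            size_t len, int flags)
{
    if (len > sizeof(slot->ctx))
        return -ERANGE;
    if ((flags & DEMO_XATTR_CREATE) && slot->present)
        return -EEXIST;

    memcpy(slot->ctx, ctx, len);
    slot->present = 1;
    return 0;
}
```

    Without the flag the second call silently overwrites the first context,
    which is the situation the patch rules out.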

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 02bfdd3df8279d4a31431cede82c6eb8da852a2c
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon May 25 18:07:02 2015 +0800

    f2fs crypto: check context consistent for rename2

    For an exchange rename, we should check that the encryption context is
    consistent between new_dir and old_inode, and between old_dir and new_inode.
    Otherwise inheritance of the parent's encryption context will be broken.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e829d7e3bc102879bec9b755906a60742cf9d930
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon May 25 18:03:38 2015 +0800

    f2fs: avoid duplicated code by reusing f2fs_read_end_io

    This patch cleans up the code: parts of f2fs_read_end_io and mpage_end_io
    are the same, so it is better to merge and reuse them.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit a318ba8662f5a41b95e751fe617a7ae75a92b1d6
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 19 22:26:54 2015 -0700

    f2fs crypto: use per-inode tfm structure

    This patch applies the following ext4 patch:

      ext4 crypto: use per-inode tfm structure

    As suggested by Herbert Xu, we shouldn't allocate a new tfm each time
    we read or write a page.  Instead we can use a single tfm hanging off
    the inode's crypt_info structure for all of our encryption needs for
    that inode, since the tfm can be used by multiple crypto requests in
    parallel.

    Also use cmpxchg() to avoid races that could result in crypt_info
    structure getting doubly allocated or doubly freed.
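
    The install-once pattern described here can be sketched in user space with
    C11 atomics standing in for the kernel's cmpxchg(). All names below
    (demo_inode, demo_crypt_info, demo_get_crypt_info) are hypothetical, and the
    sketch leaves out everything a real crypt_info would carry.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct demo_crypt_info {
    int key_id;
};

struct demo_inode {
    _Atomic(struct demo_crypt_info *) crypt_info;
};

/*
 * Return the inode's crypt_info, allocating it on first use. If two callers
 * race, only one compare-and-swap succeeds; the loser frees its own copy and
 * uses the winner's, so the structure is never installed or freed twice.
 */
static struct demo_crypt_info *demo_get_crypt_info(struct demo_inode *inode,
                                                   int key_id)
{
    struct demo_crypt_info *cur = atomic_load(&inode->crypt_info);
    if (cur)
        return cur;

    struct demo_crypt_info *fresh = malloc(sizeof(*fresh));
    if (!fresh)
        return NULL;
    fresh->key_id = key_id;

    struct demo_crypt_info *expected = NULL;
    if (atomic_compare_exchange_strong(&inode->crypt_info, &expected, fresh))
        return fresh;           /* we installed it */

    free(fresh);                /* someone else won the race */
    return expected;            /* updated to the installed pointer */
}
```

    On a failed exchange, atomic_compare_exchange_strong writes the current
    value into expected, which is why the loser can return it directly.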

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 11125bb53eb78f8d346e22995d990ab5cccbd9d3
Author: hujianyang <hujianyang@huawei.com>
Date:   Thu May 21 14:42:53 2015 +0800

    f2fs: recovering broken superblock during mount

    This patch recovers a broken superblock with the other valid one.

    Signed-off-by: hujianyang <hujianyang@huawei.com>
    [Jaegeuk Kim: reinitialize local variables in f2fs_fill_super for retrial]
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 0581fe2dd6c59599aaed98f970a649c2c4ee6a2d
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 19 16:11:40 2015 -0700

    f2fs crypto: check encryption for tmpfile

    This patch adds an encryption check for tmpfile at an early stage.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9779b9015d454a8a23f329c315e0582e0b01796c
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sat May 30 03:57:24 2015 +0300

    f2fs: support RENAME_WHITEOUT

    As described in the rename manual page, RENAME_WHITEOUT is a special operation
    that only makes sense for overlay/union type filesystems.

    When performing a rename with RENAME_WHITEOUT, dst will be replaced with src,
    and meanwhile a 'whiteout' will be created with the name of src.

    A "whiteout" is designed to be a char device with device number 0,0; it has
    a special meaning for stackable filesystems. In these filesystems, multiple
    layers exist, and only the top one can be modified. So a whiteout in the top
    layer is used to hide a corresponding file in a lower layer, and removing
    the whiteout makes the file appear again.

    Now in overlayfs, when we rename a file which exists in the lower layer, it
    will be copied up to the upper layer if it is not there yet, and then renamed
    on the upper layer; at the same time the source file will be whiteouted to
    hide the corresponding file in the lower layer.

    So in the upper layer filesystem, an implementation of RENAME_WHITEOUT provides
    an atomic operation for stackable filesystems to support the rename operation.

    There are multiple ways to implement RENAME_WHITEOUT, as described in the log
    of commit 7dcf5c3e4527 ("xfs: add RENAME_WHITEOUT support"), which was pointed
    out by Dave Chinner.

    For now, we just try to follow the way that xfs/ext4 use.

    Conflicts:
    	fs/f2fs/namei.c

    Change-Id: Iac408db95a4133ac04c7d612ae06771afe238349
    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 832a7efd1724c4d896e64ab35df66514c27b7f15
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Tue May 19 17:40:04 2015 +0800

    f2fs: introduce update_meta_page

    Add a helper function update_meta_page() to update a meta page with the
    specified buffer.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit a3c1fc3eba871337cebf74f6ed8d3a087e22e0b3
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon May 18 18:00:06 2015 +0800

    f2fs crypto: zero next free dnode block

    Now page cache of meta inode is used by garbage collection for encrypted page,
    it may contain random data, so we should zero it before issuing discard.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 02a47222f7b5ca017e6e5a8728502c06bf3cf42f
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri May 15 15:37:24 2015 -0700

    f2fs crypto: split f2fs_crypto_init/exit with two parts

    This patch splits f2fs_crypto_init/exit with two parts: base initialization and
    memory allocation.

    Firstly, f2fs module declares the base encryption memory pointers.
    Then, allocating internal memories is done at the first encrypted inode access.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 41640159159e81a9549554fc7a4eef1f8b7d793d
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Fri May 15 11:14:34 2015 +0800

    f2fs crypto: fix incorrect release for crypto ctx

    When the encryption feature is enabled, if we rmmod the f2fs module,
    we will encounter a stack backtrace reported in syslog:

    "BUG: Bad page state in process rmmod  pfn:aaf8a
    page:f0f4f148 count:0 mapcount:129 mapping:ee2f4104 index:0x80
    flags: 0xee2830a4(referenced|lru|slab|private_2|writeback|swapbacked|mlocked)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags:
    flags: 0x2030a0(lru|slab|private_2|writeback|mlocked)
    Modules linked in: f2fs(O-) fuse bnep rfcomm bluetooth dm_crypt binfmt_misc snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm
    snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device joydev ppdev mac_hid lp hid_generic i2c_piix4
    parport_pc psmouse snd serio_raw parport soundcore ext4 jbd2 mbcache usbhid hid e1000 [last unloaded: f2fs]
    CPU: 1 PID: 3049 Comm: rmmod Tainted: G    B      O    4.1.0-rc3+ #10
    Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    00000000 00000000 c0021eb4 c15b7518 f0f4f148 c0021ed8 c112e0b7 c1779174
    c9b75674 000aaf8a 01b13ce1 c17791a4 f0f4f148 ee2830a4 c0021ef8 c112e3c3
    00000000 f0f4f148 c0021f34 f0f4f148 ee2830a4 ef9f0000 c0021f20 c112fdf8
    Call Trace:
    [<c15b7518>] dump_stack+0x41/0x52
    [<c112e0b7>] bad_page.part.72+0xa7/0x100
    [<c112e3c3>] free_pages_prepare+0x213/0x220
    [<c112fdf8>] free_hot_cold_page+0x28/0x120
    [<c1073380>] ? try_to_wake_up+0x2b0/0x2b0
    [<c112ff15>] __free_pages+0x25/0x30
    [<c112c4fd>] mempool_free_pages+0xd/0x10
    [<c112c5f1>] mempool_free+0x31/0x90
    [<f0f441cf>] f2fs_exit_crypto+0x6f/0xf0 [f2fs]
    [<f0f456c4>] exit_f2fs_fs+0x23/0x95f [f2fs]
    [<c10c30e0>] SyS_delete_module+0x130/0x180
    [<c11556d6>] ? vm_munmap+0x46/0x60
    [<c15bd888>] sysenter_do_call+0x12/0x12"

    The reason is that:

    since commit 0827e645fd35
    ("f2fs crypto: shrink size of the f2fs_crypto_ctx structure") was merged,
    some fields in the f2fs_crypto_ctx structure are merged into a union, as they
    will never be used simultaneously in the write path, the read path or on the
    free list.

    In f2fs_exit_crypto, we traverse each crypto ctx on the free list. At this
    moment only the free_list field of the union is valid, but we still try to
    release the memory pointed to by the other, invalid fields of the union
    for each ctx.

    That is why the error occurs; let's fix it with this patch.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 402d19ed0be8fb13cd82092aa3902a0d27015551
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Wed May 13 18:20:54 2015 +0800

    f2fs crypto: fix to release buffer for fname crypto

    This patch fixes memory leak issue in error path of f2fs_fname_setup_filename().

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit f9f4324ba8e1da3994eb31f44c81e719c4fc49eb
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 12 13:40:20 2015 -0700

    f2fs crypto: shrink size of the f2fs_crypto_ctx structure

    This patch integrates the below patch into f2fs.

    "ext4 crypto: shrink size of the ext4_crypto_ctx structure

    Some fields are only used when the crypto_ctx is being used on the
    read path, some are only used on the write path, and some are only
    used when the structure is on free list.  Optimize memory use by using
    a union."

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 695330214abbf0ea33ac04d1e10df79769871e72
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 12 13:33:00 2015 -0700

    f2fs crypto: get rid of ci_mode from struct f2fs_crypt_info

    This patch integrates the below patch into f2fs.

    "ext4 crypto: get rid of ci_mode from struct ext4_crypt_info

    The ci_mode field was superfluous, and getting rid of it gets rid of
    an unused hole in the structure."

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 8bc541a35a5bf57387007ec523fb4caf8e11eb5f
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 12 13:26:54 2015 -0700

    f2fs crypto: use slab caches

    This patch integrates the below patch into f2fs.

    "ext4 crypto: use slab caches

    Use slab caches for the ext4_crypto_ctx and ext4_crypt_info structures for
    slightly better memory efficiency and debuggability."

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 211ffb46af5bdb0ccd4f7933550f55fa3832d7c7
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed May 13 14:35:14 2015 -0700

    f2fs: truncate data blocks for orphan inode

    As Hu reported, F2FS has a space leak problem, when conducting:

    1) format a 4GB f2fs partition
    2) dd a 3G file,
    3) unlink it.

    So, when doing f2fs_drop_inode(), we need to truncate data blocks
    before skipping it.
    We can also drop unused caches assigned to each inode.

    Reported-by: hujianyang <hujianyang@huawei.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 3bf4f596420b436915247c3ed406e1b56609091c
Author: Dan Carpenter <dan.carpenter@oracle.com>
Date:   Thu May 14 11:52:28 2015 +0300

    f2fs: cleanup a confusing indent

    The return was not indented far enough so it looked like it was supposed
    to go with the other if statement.

    Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ad49d0578fd113dcde01f073cae05c5e97690cbd
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Wed May 13 22:49:58 2015 +0200

    f2fs: fix building on 32-bit architectures

    A bug fix to the debug output extended the type of some local
    variables to 64-bit, which now causes the kernel to fail building
    because of missing 64-bit division functions:

    ERROR: "__aeabi_uldivmod" [fs/f2fs/f2fs.ko] undefined!

    In the kernel, we have to use div_u64 or do_div to do this,
    in order to annotate that this is an expensive operation.

    As the function is only called for debug out, we know this
    is not performance critical, so it is safe to use div_u64.
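
    In kernel code the change amounts to calling div_u64(dividend, divisor)
    where the debug code had a plain 64-bit "/". The user-space sketch below
    only mirrors that calling shape: demo_div_u64 and demo_percent are
    hypothetical stand-ins, and in user space the compiler is free to emit the
    division directly.

```c
#include <stdint.h>

/* Stand-in with the same shape as div_u64(): 64-bit dividend, 32-bit divisor. */
static uint64_t demo_div_u64(uint64_t dividend, uint32_t divisor)
{
    return dividend / divisor;
}

/*
 * Example debug-style statistic: what percentage of `total` is `part`?
 * Assumes part * 100 does not overflow 64 bits.
 */
static uint64_t demo_percent(uint64_t part, uint32_t total)
{
    if (total == 0)
        return 0;               /* avoid dividing by zero */
    return demo_div_u64(part * 100, total);
}
```

    Routing every 64-bit division through one named helper is what lets a
    32-bit build link without a compiler-provided routine such as
    __aeabi_uldivmod.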

    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Fixes: d1f85bd38db19 ("f2fs: avoid value overflow in showing current status")
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 5d390f29c5afd9f78c819da8297d8e4ed8b05a0e
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon May 18 11:45:15 2015 -0700

    f2fs: avoid buggy functions

    This patch avoids using buggy functions for now.
    They need to be fixed later.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 26320f96a8cc3918d48527b69167d857899e4efa
Author: hujianyang <hujianyang@huawei.com>
Date:   Tue May 12 16:05:57 2015 +0800

    f2fs: add compat_ioctl to provide backward compatability

    introduce compat_ioctl to regular files, but doesn't add this
    functionality to f2fs_dir_operations.

    While running a 32-bit busybox, I met an error like this:
    (A is a directory)

    chattr: reading flags on A: Inappropriate ioctl for device

    This patch copies compat_ioctl from f2fs_file_operations and
    fixes this problem.

    Signed-off-by: hujianyang <hujianyang@huawei.com>
    Reviewed-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9c7dc1f03c8b0cc74de9cb73349c340f07ea6be3
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon May 11 20:03:49 2015 -0700

    f2fs: do not issue next dnode discard redundantly

    We have a discard map, so that we can avoid redundant discard issues.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit f62c1b25a0ba7fd699e397b01d117ce4d77c2dbb
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon May 11 20:02:14 2015 -0700

    f2fs: disable the discard option when device does not support

    This patch disables given discard option when device does not support it.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 0092b3dc467ad4e1ecf6b732a0aa9352f94d7600
Author: Yunlei He <heyunlei@huawei.com>
Date:   Thu May 7 18:11:37 2015 +0800

    f2fs: add default mount options to remount

    I use the f2fs filesystem for the /data partition on my Android phone
    with the default mount options. When I remount /data in order to
    add the discard option to run some benchmarks, I find that default
    options such as background_gc, user_xattr and acl are turned off.

    So I introduce a function named default_options in super.c. It applies
    the default settings, and both the mount and remount operations
    call this function to set the defaults.
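
    As a rough userspace sketch of the idea (not the actual super.c code), the
    defaults can be reset in one helper that both the mount and remount paths
    call before parsing the option string. The option bits, struct sb_opts,
    parse_options and apply_options below are hypothetical names used only for
    illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical option bits; the real f2fs flags are defined in the kernel. */
enum {
    OPT_BG_GC      = 1u << 0,
    OPT_USER_XATTR = 1u << 1,
    OPT_ACL        = 1u << 2,
    OPT_DISCARD    = 1u << 3,
};

struct sb_opts {
    unsigned int mount_opt;
};

/* Reset to the built-in defaults before any option string is parsed. */
static void default_options(struct sb_opts *o)
{
    assert(o != NULL);
    o->mount_opt = OPT_BG_GC | OPT_USER_XATTR | OPT_ACL;
}

/* Toy parser: only looks for the substring "discard"; all else is ignored. */
static void parse_options(struct sb_opts *o, const char *opts)
{
    if (opts && strstr(opts, "discard"))
        o->mount_opt |= OPT_DISCARD;
}

/* Shared by the mount and remount paths, so a remount keeps the defaults. */
static void apply_options(struct sb_opts *o, const char *opts)
{
    default_options(o);
    parse_options(o, opts);
}
```

    With this shape, calling apply_options(&o, "discard") on remount leaves
    the three default bits set and adds the discard bit on top of them.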

    Signed-off-by: Yunlei He <heyunlei@huawei.com>
    Reviewed-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 16fa79a4f39e3250c1282519ca7f97275081a3a1
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed May 6 18:23:21 2015 -0700

    f2fs crypto: remove checking key context during lookup

    Whether or not the key is valid, readdir shows the dir entries correctly.
    So, lookup should not fail.
    But we expect further accesses to be denied by open, rename, link, and so
    on.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit dc70a8a153167920018e6f603bde29efee225946
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue May 5 20:20:29 2015 -0700

    f2fs crypto: fix missing key when reading a page

    1. mount $mnt
    2. cp data $mnt/
    3. umount $mnt
    4. log out
    5. log in
    6. cat $mnt/data

    -> panic, due to no i_crypt_info.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e0a1ff90c374c3a19822d84a2888049723b691a6
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 29 15:10:53 2015 -0700

    f2fs crypto: add symlink encryption

    This patch implements encryption support for symlink.

    Signed-off-by: Uday Savagaonkar <savagaon@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e6422e991e35f6db0de71b1ed56156693c954a38
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 29 17:02:18 2015 -0700

    f2fs crypto: add filename encryption for roll-forward recovery

    This patch adds a bit flag to indicate whether or not i_name in the inode
    is encrypted.

    If this name is encrypted, we can't do recover_dentry during roll-forward.
    So, f2fs_sync_file() needs to do checkpoint, if this will be needed in future.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 0c2ecd76ef04b9331b3e1af6a84ea610ff88f395
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 27 17:12:39 2015 -0700

    f2fs crypto: add filename encryption for f2fs_lookup

    This patch implements filename encryption support for f2fs_lookup.

    Note that, f2fs_find_entry should be outside of f2fs_(un)lock_op().

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 0922ccb6bcc70e5499af179c5409f21bc5f8043e
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 27 16:26:24 2015 -0700

    f2fs crypto: add filename encryption for f2fs_readdir

    This patch implements filename encryption support for f2fs_readdir.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 5db97f6f11bf84975a5af993194729315b9773dd
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 27 14:51:02 2015 -0700

    f2fs crypto: add filename encryption for f2fs_add_link

    This patch adds filename encryption support for f2fs_add_link.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 37982bdc2790b8792ea0ecdd4c4e8134a2e0c57b
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 23 12:04:33 2015 -0700

    f2fs crypto: add encryption support in read/write paths

    This patch adds encryption support in read and write paths.

    Note that, in f2fs, we need to consider cleaning operation.
    In cleaning procedure, we must avoid encrypting and decrypting written blocks.
    So, this patch implements move_encrypted_block().

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d1825dd987f1b1271b0de66eac87e4926653fc6d
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sat May 30 03:55:32 2015 +0300

    f2fs crypto: activate encryption support for fs APIs

    This patch activates the following APIs for encryption support.

    The rules quoted by ext4 are:
     - An unencrypted directory may contain encrypted or unencrypted files
       or directories.
     - All files or directories in a directory must be protected using the
       same key as their containing directory.
     - Encrypted inode for regular file should not have inline_data.
     - Encrypted symlink and directory may have inline_data and inline_dentry.

    This patch activates the following APIs.
    1. f2fs_link              : validate context
    2. f2fs_lookup            :      ''
    3. f2fs_rename            :      ''
    4. f2fs_create/f2fs_mkdir : inherit its dir's context
    5. f2fs_direct_IO         : do buffered io for regular files
    6. f2fs_open              : check encryption info
    7. f2fs_file_mmap         :      ''
    8. f2fs_setattr           :      ''
    9. f2fs_file_write_iter   :      ''           (Called by sys_io_submit)
    10. f2fs_fallocate        : do not support fcollapse
    11. f2fs_evict_inode      : free_encryption_info

    Conflicts:
    	fs/f2fs/data.c

    Change-Id: I407f8edbb730451cc82186ddbd8a444acc2d5153
    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit c9057553829b1e4aec898b147c21fcbbdaf42a81
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sun Apr 26 00:12:50 2015 -0700

    f2fs crypto: filename encryption facilities

    This patch adds filename encryption infra.
    Most of the code is copied from the ext4 part, but adapted to the f2fs
    directory structure.

    Signed-off-by: Uday Savagaonkar <savagaon@google.com>
    Signed-off-by: Ildar Muslukhov <ildarm@google.com>
    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e4849088ad5692a4e87f5e1944ee8d95553d6af4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Apr 21 16:23:47 2015 -0700

    f2fs crypto: add encryption key management facilities

    This patch copies encrypt_key.c from ext4 and modifies it for f2fs.

    Use GFP_NOFS, since _f2fs_get_encryption_info is called under f2fs_lock_op.

    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 7a6b88d7b76d0ec33a1a9b3b4652a701e04ab8d7
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 20 19:52:47 2015 -0700

    f2fs crypto: add f2fs encryption facilities

    Most parts were copied from ext4, except:

     - add f2fs_restore_and_release_control_page which returns control page and
       restore control page
     - remove ext4_encrypted_zeroout()
     - remove sbi->s_file_encryption_mode & sbi->s_dir_encryption_mode
     - add f2fs_end_io_crypto_work for mpage_end_io

    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Ildar Muslukhov <ildarm@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e3c6346af07fcac38a55e2bb362df325060dbfa4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sat May 30 03:53:02 2015 +0300

    f2fs crypto: add encryption policy and password salt support

    This patch adds encryption policy and password salt support through ioctl
    implementation.

    It adds three ioctls:
     F2FS_IOC_SET_ENCRYPTION_POLICY,
     F2FS_IOC_GET_ENCRYPTION_POLICY,
     F2FS_IOC_GET_ENCRYPTION_PWSALT, which use xattr operations.

    Note that these definitions and code are taken from ext4 crypto support.
    For f2fs, xattr operations and on-disk flags for superblock and inode were
    changed.

    Conflicts:
    	fs/f2fs/f2fs.h

    Change-Id: Ie26c1e21bd8d9220a470cfe1cad2decb8ef2740e
    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Ildar Muslukhov <muslukhovi@gmail.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9364af8d6fd9d83c276773fd839ec164854d065e
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri Apr 10 16:43:31 2015 -0700

    f2fs crypto: add encryption xattr support

    This patch adds some definitions for the encryption xattr.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit cfca666a259a7ed2863e4be41f53a71a94e230ed
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri Apr 10 16:28:26 2015 -0700

    f2fs crypto: add f2fs encryption Kconfig

    This patch adds f2fs encryption config.

    This patch integrates:

    "ext4 crypto: require CONFIG_CRYPTO_CTR if ext4 encryption is enabled

    On arm64 this is apparently needed for CTS mode to function correctly.
    Otherwise attempts to use CTS return ENOENT."

    Signed-off-by: Michael Halcrow <mhalcrow@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit c2f8df83f2fc990fd80e00b6c2df604a9894d092
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 20 13:57:51 2015 -0700

    f2fs crypto: declare some definitions for f2fs encryption feature

    These definitions will be used by inode and superblock for encryption.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit f1b4fdf0ad9546b4db680b58a97f2e3de3b4784a
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri May 8 19:30:32 2015 -0700

    f2fs: report unwritten area in f2fs_fiemap

    This patch slightly changes f2fs_fiemap function to report unwritten area.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e002552aab14ce70781a772c10d2d3f5b20baaec
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri May 8 16:37:28 2015 -0700

    f2fs: avoid value overflow in showing current status

    This patch fixes an overflow when running cat /sys/kernel/debug/f2fs/status.
    If a section is relatively large, the dist value can overflow.
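
    A minimal sketch of this kind of overflow, assuming the value is a product
    of 32-bit quantities: the formula and the names dist_u32, dist_u64, secs
    and half_blks_per_sec are illustrative and are not taken from the actual
    debug code.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit arithmetic: the product wraps modulo 2^32 for large sections. */
static uint32_t dist_u32(uint32_t secs, uint32_t half_blks_per_sec)
{
    return secs * half_blks_per_sec * half_blks_per_sec / 100;
}

/* Widen the first operand so the whole product is computed in 64 bits. */
static uint64_t dist_u64(uint32_t secs, uint32_t half_blks_per_sec)
{
    return (uint64_t)secs * half_blks_per_sec * half_blks_per_sec / 100;
}
```

    For secs = 1000 and half_blks_per_sec = 65536 the exact product is
    1000 * 2^32, so the 32-bit version wraps to 0 while the 64-bit version
    yields 10 * 2^32.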

    Reported-by: Yossi Goldfill <ygoldfill@radianmemory.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 298b7e72abe8fd01162966d2a3278fcfa3091abd
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Wed May 6 13:11:13 2015 +0800

    f2fs: support FALLOC_FL_ZERO_RANGE

    Now, FALLOC_FL_ZERO_RANGE flag in ->fallocate is supported in ext4/xfs.

    In the commit, the semantics of this flag are described as follows:"
    1) Make sure that both offset and len are block size aligned.
    2) Update the i_size of inode by len bytes.
    3) Compute the file's logical block number against offset. If the computed
       block number is not the starting block of the extent, split the extent
       such that the block number is the starting block of the extent.
    4) Shift all the extents which are lying between
       [offset, last allocated extent] towards right by len bytes. This step
       will make a hole of len bytes at offset."

    This patch implements fallocate's FALLOC_FL_ZERO_RANGE for f2fs.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 69592cd8daa1027741ac86e498ae9e71c8eccf95
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Wed May 6 13:09:46 2015 +0800

    f2fs: support FALLOC_FL_COLLAPSE_RANGE

    Now, FALLOC_FL_COLLAPSE_RANGE flag in ->fallocate is supported in ext4/xfs.

    In the commit, the semantics of this flag are described as follows:"
    1) It collapses the range lying between offset and length by removing any
       data blocks which are present in this range and then updates all the
       logical offsets of extents beyond "offset + len" to nullify the hole
       created by removing blocks. In short, it does not leave a hole.
    2) It should be used exclusively. No other fallocate flag in combination.
    3) Offset and length supplied to fallocate should be fs block size aligned
       in case of xfs and ext4.
    4) Collapse range does not work beyond i_size."

    This patch implements fallocate's FALLOC_FL_COLLAPSE_RANGE for f2fs.
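
    Rules 3 and 4 quoted above can be sketched as a small precondition check.
    This is an illustration only: collapse_range_ok is a hypothetical helper,
    and rejecting a range that reaches exactly to i_size is an assumption
    based on how ext4 treats the flag, not something stated in this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Check block alignment and that the range ends before i_size. */
static bool collapse_range_ok(uint64_t offset, uint64_t len,
                              uint64_t i_size, uint32_t blk_size)
{
    assert(blk_size != 0);

    if (len == 0)
        return false;
    /* Rule 3: offset and length must be block size aligned. */
    if (offset % blk_size || len % blk_size)
        return false;
    /* Rule 4: written so that offset + len cannot overflow. */
    if (offset >= i_size || len >= i_size - offset)
        return false;
    return true;
}
```

    For a 16 KiB file with 4 KiB blocks, collapsing 4 KiB at offset 4 KiB
    passes the check, while an unaligned length or a range reaching the end
    of the file is rejected.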

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 1c615b44cea0f68a8880cfdef1d8fdc6c3412249
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Wed May 6 13:08:06 2015 +0800

    f2fs: introduce f2fs_replace_block() for reuse

    Introduce a generic function replace_block based on recover_data_page,
    and export it. So with it we can operate file's meta data which is in
    CP/SSA area when we invoke fallocate with FALLOC_FL_COLLAPSE_RANGE
    flag.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 668cd2607b7ab0a4fc94563d2bfcd3e2460827bb
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Apr 30 18:35:50 2015 +0800

    f2fs: do not re-lookup nat cache with same nid

    In set_node_addr, we try to lookup cached nat entry of inode and then
    set flag in it.

    But previously in this function, we have already grabbed nat entry with
    current node id, if the node id is the same as the one of inode, we
    do not need to lookup it in cache again.

    So this patch adds a condition check to avoid the unneeded lookup.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9a5b896f752fc26046f4c5cc9390fcfa5e9044ee
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Apr 30 18:34:41 2015 +0800

    f2fs: remove unneeded f2fs_make_empty declaration

    Remove the f2fs_make_empty() declaration, since the main body of this function
    was moved into do_make_empty_dir() and the function is obsolete now.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 91a71eae9377fdcddfd526dad85c7868bb17863f
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 30 22:50:06 2015 -0700

    f2fs: issue discard with finally produced len and minlen

    This patch determines to issue discard commands by comparing given minlen and
    the length of produced final candidates.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 74a0bd91eb27bc6bf8b7d5b4d0576ec9ceaa31ad
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 30 22:37:50 2015 -0700

    f2fs: introduce discard_map for f2fs_trim_fs

    This patch adds a bitmap for discard issues from f2fs_trim_fs.
    The rule is to issue discard commands only for blocks invalidated
    after mount.
    Once mount is done, f2fs_trim_fs trims out the whole invalid area.
    After then, it will not issue any discards redundantly.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit c61181f0e958976a2d4a37dca015419dde22ddc0
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri May 1 11:08:59 2015 -0700

    f2fs: remove spin_lock for write_orphan_inodes

    This patch removes spin_lock, since this is covered by f2fs_lock_op already.
    And, we should avoid using page operations inside spin_lock.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 5afc4d50c96103050a35d1cea5e87dbc09ae10f8
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 30 17:00:33 2015 -0700

    f2fs: split find_data_page according to specific purposes

    This patch splits find_data_page as follows.

    1. f2fs_gc
     - use get_read_data_page() with read only

    2. find_in_level
     - use find_data_page without locked page

    3. truncate_partial_page
     - In cache_only mode, just drop the cached page.
     - Otherwise, use get_lock_data_page() and guarantee to truncate

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 87aafb380f1996e4c52598a3efd91d920bcfa7f9
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 30 18:58:22 2015 -0700

    f2fs: fix counting the number of inline_data inodes

    This patch fixes the count by including the missing symlink case.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ffef9b60036ca4538b55a701fe04191b636217ef
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 29 18:31:19 2015 -0700

    f2fs: add need_dentry_mark

    This patch introduces need_dentry_mark() to clean up and avoid redundant
    node locks.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ba47ac654a4c59152f1ce6db8fd05e9383dec1fe
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 29 11:18:42 2015 -0700

    f2fs: fix race on allocating and deallocating a dentry block

    There are two threads:
     f2fs_delete_entry()              get_new_data_page()
                                      f2fs_reserve_block()
                                      dn.blkaddr = XXX
     lock_page(dentry_block)
     truncate_hole()
     dn.blkaddr = NULL
     unlock_page(dentry_block)
                                      lock_page(dentry_block)
                                      fill the block from XXX address
                                      add new dentries
                                      unlock_page(dentry_block)

    Later, f2fs_write_data_page() will truncate the dentry_block, since
    its block address is NULL.

    This was caused by the wrong lock order.
    In this case, we should do f2fs_reserve_block() after locking its dentry block.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit a42b0a273a344637eb5f5471e5d28ac7dda7e8e9
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sun Apr 26 00:15:29 2015 -0700

    f2fs: introduce dot and dotdot name check

    This patch adds an inline function to check dot and dotdot names.
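
    A minimal userspace sketch of such a check, assuming the name is passed
    as a pointer plus length rather than as a NUL-terminated string. The
    helper name name_is_dot_or_dotdot and its signature are hypothetical and
    may differ from the function added by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true only for the names "." and ".." of exactly that length. */
static inline bool name_is_dot_or_dotdot(const char *name, size_t len)
{
    assert(name != NULL);

    if (len == 1 && name[0] == '.')
        return true;
    if (len == 2 && name[0] == '.' && name[1] == '.')
        return true;
    return false;
}
```

    Comparing the length first means a name such as "..." or ".a" is never
    mistaken for one of the two special entries.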

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 18c184303ab9d3c678e995b5bf7ca5342fbbf6a4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri Apr 24 14:34:30 2015 -0700

    f2fs: move get_page for gc victims

    This patch moves getting victim page into move_data_page.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit b2f7597deea395e7f7d1cd3be8611f32e1e3efaa
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 23 14:38:15 2015 -0700

    f2fs: add sbi and page pointer in f2fs_io_info

    This patch adds f2fs_sb_info and page pointers in f2fs_io_info structure.
    With this change, we can reduce a lot of parameters for IO functions.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d8e89b536fda0e5cfefe39ce08da6f9e0dbe0ebb
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 23 10:27:21 2015 -0700

    f2fs: add f2fs_may_inline_{data, dentry}

    This patch adds f2fs_may_inline_data and f2fs_may_inline_dentry.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit cb883ba942a6ed6d456336fe564c4b44e6198c0d
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 22 11:40:27 2015 -0700

    f2fs: clean up f2fs_lookup

    This patch cleans up to avoid deep indentation.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 01a04b4c504ca53b225f4c5d790fd4401be51093
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Sat May 30 03:51:08 2015 +0300

    f2fs: expose f2fs_mpage_readpages

    This patch implements f2fs_mpage_readpages for further optimization on
    encryption support.

    The basic code was taken from fs/mpage.c, and changed to be simple by adjusting
    that block_size is equal to page_size in f2fs.

    Conflicts:
    	fs/f2fs/data.c

    Change-Id: Id413493a8fe6c7f6e69093127f8c6e9a7a8ba89d
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 8b2b31668d8f0b71f8b2e6dd9e569aa4f2839802
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 20 18:49:51 2015 -0700

    f2fs: introduce f2fs_commit_super

    This patch introduces f2fs_commit_super to write updated superblock.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d6c2b7a44a3b04d55d19c15a1d9715262dc5f9ca
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 6 19:55:34 2015 -0700

    f2fs: add f2fs_map_blocks

    This patch introduces the f2fs_map_blocks structure, similar to ext4_map_blocks.
    Now, f2fs uses f2fs_map_blocks when handling get_block.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d7059701be7fdeeeedf4cad7421215304893ec77
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 13 15:10:36 2015 -0700

    f2fs: add feature facility in superblock

    This patch introduces a feature in superblock, which will indicate any new
    features for f2fs.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 0f7e2640a2dcbb9518db29f7f11efa4bbcb4da68
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 20 11:52:23 2015 -0700

    f2fs: add missing version info in superblock

    mkfs.f2fs records the kernel version in the superblock, but the f2fs module
    has not done so until now.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit eac4ad9a9caf2487ac4c3f52b2bbbda78aa6abd9
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Apr 20 13:44:41 2015 -0700

    f2fs: move existing definitions into f2fs.h

    This patch moves some inode-related definitions from node.h to f2fs.h to
    add new features.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 631726792831934a9bc52b80774041154a9606c7
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Sat Apr 18 18:06:49 2015 +0800

    f2fs: make has_fsynced_inode static

    has_fsynced_inode() has no other caller out of node.c, make it static.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 5f32f11835a35945f51a4d167ea7584454c0f8de
Author: Taehee Yoo <ap420073@gmail.com>
Date:   Tue Apr 21 15:59:12 2015 +0900

    f2fs: add offset check routine before punch_hole() in f2fs_fallocate()

    In punch_hole(), if the offset is bigger than the inode size, it returns success.
    Then f2fs_fallocate() updates the time and marks the inode dirty,
    although the inode has not actually been modified.
    So I have added an offset check that prevents calling punch_hole() in that case.

    Signed-off-by: Taehee Yoo <ap420073@gmail.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 540a536766de5ff64402bdb6efc53552d6fe0313
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Sat Apr 18 18:05:36 2015 +0800

    f2fs: use is_valid_blkaddr to verify blkaddr for readability

    Export is_valid_blkaddr() and use it to replace some code for readability.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d8972539d263574eef3555446dd79b9255d1585a
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Sat Apr 18 18:03:58 2015 +0800

    f2fs: make posix_acl_create() safer and cleaner

    Our f2fs_acl_create is copied from posix_acl_create in ./fs/posix_acl.c and
    modified to avoid deadlock bug when inline_dentry feature is enabled.

    Dan Carpenter rewrote posix_acl_create in commit 2799563b281f
    ("fs/posix_acl.c: make posix_acl_create() safer and cleaner") to make this
    function safer, so that we can avoid potential bugs in its callers,
    especially for ocfs2.

    Let's back port the patch to f2fs.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit a9cf8749358973f83ff535d1958b067f8ad0e807
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 22 11:03:48 2015 -0700

    f2fs: fix wrong error handler in f2fs_follow_link

    page_follow_link_light returns NULL and its error pointer remained
    in nd->path.

    Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
    Reviewed-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ee92d4b167e63cb6f0fba74a6ada00863057cb67
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Apr 21 10:40:54 2015 -0700

    Revert "f2fs: enhance multi-threads performance"

    Yuanhan Liu reported a performance regression caused by this commit.
    The basic idea was to reduce contention on a single mutex, but it turns out
    this causes another contention like context switches.

    https://lkml.org/lkml/2015/4/21/11

    Until finishing the analysis on this issue, I'd like to revert this for a while.

    This reverts commit 78373b7319abdf15050af5b1632c4c8b8b398f33.

commit 7883504e5d988e192ee57e695da9448d821f50e2
Author: doc <doc.divxm@gmail.com>
Date:   Sat May 30 10:58:40 2015 +0300

    Revert "f2fs: support 3.10"

    This reverts commit 89b2b2ae32d324f0dfc0b8898798969134fd5b84.

    Change-Id: I73f831eea4b616c07413c8f3f3f3bac4111c801c

commit c3fab7fea7fea818a1f0af2493d827dd8c1dfb2e
Author: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Date:   Tue Feb 4 14:20:16 2014 +0900

    f2fs: support 3.10

    Change-Id: I9059ac5ed39e25b31be078399452d9625506b780
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 39520b54dab76f5d79248b9ff7584934fb04b2bd
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Apr 9 17:03:53 2015 -0700

    f2fs: pass checkpoint reason on roll-forward recovery

    This patch adds CP_RECOVERY to record recovery information for checkpoint.
    And, it makes sure a checkpoint is written in this case.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 783f95a08679ffbd1e5ae04db571cebfa3b5108d
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 15 13:49:55 2015 -0700

    f2fs: avoid abnormal behavior on broken symlink

    When f2fs_symlink was triggered and checkpoint was done before syncing its
    link path, f2fs can get broken symlink like "xxx -> \0\0\0".
    This incurs abnormal path_walk by VFS.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 2d428db306541ce5c1b3cb84854df35637b15cc3
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 15 13:37:53 2015 -0700

    f2fs: flush symlink path to avoid broken symlink after POR

    This patch tries to avoid broken symlink case after POR in best effort.
    This results in performance regression.
    But, if f2fs has inline_data and the target path is shorter than 3KB,
    the page would be stored in its inode_block, so that there would be no
    performance regression.

    Note that, if user wants to keep this file atomically, it needs to trigger
    dir->fsync.
    And, there is still a hole to produce broken symlink.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e5ad0b9633c463b93cee9987f6244caa0b8e653a
Author: Taehee Yoo <ap420073@gmail.com>
Date:   Mon Apr 13 21:48:06 2015 +0900

    f2fs: change 0 to false for bool type

    In the f2fs_fill_super function, the variable "retry" is of bool type,
    so I think it should be set to false.

    Signed-off-by: Taehee Yoo <ap420073@gmail.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 9484926fdc3cba35d52e67c37136c84cc6a3d936
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Apr 1 19:38:20 2015 -0700

    f2fs: do not recover wrong data index

    During the roll-forward recovery, if we find a new data index written by the
    last fsync, we need to recover the new block address.
    But, if that address was corrupted, we should not recover that.
    Otherwise, f2fs gets kernel panic from:

     In check_index_in_prev_nodes(),

        sentry = get_seg_entry(sbi, segno);
                 --------------------------> out-of-range segno.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 02dba4ef5e99c6f724e20b6449b64c98ba3a7ed4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Mar 31 18:03:29 2015 -0700

    f2fs: do not increase link count during recovery

    If there are multiple fsynced dnodes having a dent flag, roll-forward routine
    sets FI_INC_LINK for their inode, and recovery_dentry increases its link count
    accordingly.
    That results in a normal file having a link count of 2, so we can't unlink those
    files.

    This was added to handle several inode blocks having same inode number with
    different directory paths.
    But, current f2fs doesn't replay all of path changes and only recover its dentry
    for the last fsynced inode block.
    So, there is no reason to do this.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit e5bae38d60d93e40dee266a3198be4b1e4337215
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Mar 30 15:23:45 2015 -0700

    f2fs: assign parent's i_mode for empty dir

    When assigning i_mode for dotdot, it needs to assign parent's i_mode.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 945df7d6bd0c818e81da1f583fcb6d8bbe217ed4
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Mar 30 15:07:16 2015 -0700

    f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries

    If f2fs was corrupted with missing dot dentries, it needs to recover them after
    fsck.f2fs detection.

    The underlying procedure is:

    1. fsck.f2fs sets the F2FS_INLINE_DOTS flag in the directory inode, if it
    detects missing dot dentries.

    2. When f2fs looks up the corrupted directory, it triggers f2fs_add_link with
    proper inode numbers and their dot and dotdot names.

    3. Once f2fs recovers the directory without errors, it removes F2FS_INLINE_DOTS
    finally.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit feb625fb93d4251aa16a23ccabdd827b223733cb
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Thu Mar 26 18:46:38 2015 -0700

    f2fs: fix mismatching lock and unlock pages for roll-forward recovery

    Previously, inode page is not correctly locked and unlocked in pair during
    the roll-forward recovery.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit f87dbe5bf4010a3ac30da36b837380d2f78262e0
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Mar 24 12:04:20 2015 -0700

    f2fs: fix sparse warnings

    This patch fixes the below warning.

    sparse warnings: (new ones prefixed by >>)

    >> fs/f2fs/inode.c:56:23: sparse: restricted __le32 degrades to integer
    >> fs/f2fs/inode.c:56:52: sparse: restricted __le32 degrades to integer

    Reported-by: kbuild test robot <fengguang.wu@intel.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 55872dc313a783379de2385454bf7f4cd1f44fff
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Tue Mar 24 13:08:05 2015 +0800

    f2fs: limit b_size of mapped bh in f2fs_map_bh

    Mapping a bh beyond the max size defined by the caller is not needed, so
    limit it in f2fs_map_bh.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 252d8d969f381ff097c746d402523835a30b5418
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 23 10:37:39 2015 +0800

    f2fs: persist system.advise into on-disk inode

    This patch marks the inode dirty so that i_advise of the f2fs inode info is
    persisted into the on-disk inode when the user sets system.advise through
    setxattr. Otherwise the new value will be lost.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ddb5c14b13a9de1196b382d75a103a9cfd4c575e
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 23 10:36:15 2015 +0800

    f2fs: avoid NULL pointer dereference in f2fs_xattr_advise_get

    We encounter an oops when executing the command below.
    getfattr -n system.advise /mnt/f2fs/file
    Killed

    message log:
    BUG: unable to handle kernel NULL pointer dereference at   (null)
    IP: [<f8b54d69>] f2fs_xattr_advise_get+0x29/0x40 [f2fs]
    *pdpt = 00000000319b7001 *pde = 0000000000000000
    Oops: 0002 [#1] SMP
    Modules linked in: f2fs(O) snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev
    snd_seq_device snd_timer bnep snd rfcomm microcode bluetooth soundcore i2c_piix4 mac_hid serio_raw parport_pc ppdev lp parport
    binfmt_misc hid_generic psmouse usbhid hid e1000 [last unloaded: f2fs]
    CPU: 3 PID: 3134 Comm: getfattr Tainted: G           O    4.0.0-rc1 #6
    Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    task: f3a71b60 ti: f19a6000 task.ti: f19a6000
    EIP: 0060:[<f8b54d69>] EFLAGS: 00010246 CPU: 3
    EIP is at f2fs_xattr_advise_get+0x29/0x40 [f2fs]
    EAX: 00000000 EBX: f19a7e71 ECX: 00000000 EDX: f8b5b467
    ESI: 00000000 EDI: f2008570 EBP: f19a7e14 ESP: f19a7e08
     DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    CR0: 80050033 CR2: 00000000 CR3: 319b8000 CR4: 000007f0
    Stack:
     f8b5a634 c0cbb580 00000000 f19a7e34 c1193850 00000000 00000007 f19a7e71
     f19a7e64 c0cbb580 c1193810 f19a7e50 c1193c00 00000000 00000000 00000000
     c0cbb580 00000000 f19a7f70 c1194097 00000000 00000000 00000000 74737973
    Call Trace:
     [<c1193850>] generic_getxattr+0x40/0x50
     [<c1193810>] ? xattr_resolve_name+0x80/0x80
     [<c1193c00>] vfs_getxattr+0x70/0xa0
     [<c1194097>] getxattr+0x87/0x190
     [<c11801d7>] ? path_lookupat+0x57/0x5f0
     [<c11819d2>] ? putname+0x32/0x50
     [<c116653a>] ? kmem_cache_alloc+0x2a/0x130
     [<c11819d2>] ? putname+0x32/0x50
     [<c11819d2>] ? putname+0x32/0x50
     [<c11819d2>] ? putname+0x32/0x50
     [<c11827f9>] ? user_path_at_empty+0x49/0x70
     [<c118283f>] ? user_path_at+0x1f/0x30
     [<c11941e7>] path_getxattr+0x47/0x80
     [<c11948e7>] SyS_getxattr+0x27/0x30
     [<c163f748>] sysenter_do_call+0x12/0x12
    Code: 66 90 55 89 e5 57 56 53 66 66 66 66 90 8b 78 20 89 d3 ba 67 b4 b5 f8 89 d8 89 ce e8 42 7c 7b c8 85 c0 75 16 0f b6 87 44 01 00
    00 <88> 06 b8 01 00 00 00 5b 5e 5f 5d c3 8d 76 00 b8 ea ff ff ff eb
    EIP: [<f8b54d69>] f2fs_xattr_advise_get+0x29/0x40 [f2fs] SS:ESP 0068:f19a7e08
    CR2: 0000000000000000
    ---[ end trace 860260654f1f416a ]---

    The reason is that getfattr works in two steps, as the strace info shows:
    1) look up the specified xattr and get its size.
    2) get the value of the extended attribute.

    strace info:
    getxattr("/mnt/f2fs/file", "system.advise", 0x0, 0) = 1
    getxattr("/mnt/f2fs/file", "system.advise", "\x00", 256) = 1

    For the first step, getfattr may pass a NULL pointer in @value and zero in @size
    as parameters for ->getxattr, but we access this @value pointer directly without
    checking whether the pointer is valid or not in f2fs_xattr_advise_get, so the
    oops occurs.

    This patch fixes this issue by verifying the @value pointer before using it.
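
The two-step getxattr pattern described above can be sketched in user space as follows. This is a minimal illustration, not the actual patch: `demo_inode_info` and `demo_xattr_advise_get` are hypothetical names, and the -ERANGE check for a too-small buffer is an assumption added for completeness.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the in-memory inode info; only i_advise matters. */
struct demo_inode_info {
    unsigned char i_advise;
};

/*
 * Mimics the getxattr handler contract: a NULL buffer means "size query
 * only", so the value must not be written in that case.
 */
static long demo_xattr_advise_get(const struct demo_inode_info *fi,
                                  void *buffer, size_t size)
{
    assert(fi != NULL);

    if (buffer == NULL)
        return (long)sizeof(fi->i_advise);   /* step 1: report the size */

    if (size < sizeof(fi->i_advise))
        return -ERANGE;                      /* assumption: reject short buffers */

    *(unsigned char *)buffer = fi->i_advise; /* step 2: copy the value */
    return (long)sizeof(fi->i_advise);
}
```

Calling it first with `(NULL, 0)` and then with a real buffer mirrors the two getxattr calls in the strace output; both return 1 and only the second touches the buffer.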

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit b0b3d399c9c7d01a052bb1643c6727bb0380850e
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 23 10:33:37 2015 +0800

    f2fs: preallocate fallocated blocks for direct IO

    Normally, because the DIO_SKIP_HOLES flag is set by default, blockdev_direct_IO
    in f2fs_direct_IO tries to skip DIO in holes when writing inside i_size. This
    makes us fall back to buffered IO, which shows lower performance.

    So in commit 59b802e5a453 ("f2fs: allocate data blocks in advance for
    f2fs_direct_IO"), we improved performance by allocating data blocks in advance
    if we meet holes, whether inside i_size or not, since this avoids falling
    back to buffered IO.

    But that commit did not consider unwritten fallocated blocks.
    This patch fixes the fallocate case, which helps to improve
    performance.

    Test result:
    Storage info: sandisk ultra 64G micro sd card.

    touch /mnt/f2fs/file
    truncate -s 67108864 /mnt/f2fs/file
    fallocate -o 0 -l 67108864 /mnt/f2fs/file
    time dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=64 conv=notrunc oflag=direct

    Time before applying the patch:
    67108864 bytes (67 MB) copied, 36.16 s, 1.9 MB/s
    real    0m36.162s
    user    0m0.000s
    sys     0m0.180s

    Time after applying the patch:
    67108864 bytes (67 MB) copied, 27.7776 s, 2.4 MB/s
    real    0m27.780s
    user    0m0.000s
    sys     0m0.036s
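
As a quick check of the figures above, the arithmetic can be sketched like this. The helper names are hypothetical, and MB is taken as 10^6 bytes, which is how dd reports it.

```c
#include <assert.h>

/* Throughput in MB/s, with 1 MB = 10^6 bytes as dd reports it. */
static double demo_throughput_mb_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e6;
}

/* Wall-clock time saved, as a percentage of the "before" time. */
static double demo_time_saved_pct(double before_s, double after_s)
{
    assert(before_s > 0.0);
    return (before_s - after_s) / before_s * 100.0;
}
```

With the numbers from the log, 67108864 bytes in 36.16 s is about 1.86 MB/s and in 27.7776 s about 2.42 MB/s, and the real time drops from 36.162 s to 27.780 s, roughly a 23% reduction.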

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 5b2ffbe519dddc8205a615a97943c978906262eb
Author: Wanpeng Li <wanpeng.li@linux.intel.com>
Date:   Tue Mar 24 10:20:27 2015 +0800

    f2fs: enable inline data by default

    Enable the inline_data feature by default since it brings us better
    performance and space utilization and is now stable.
    Add another option, noinline_data, to disable it during mount.

    Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
    Suggested-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 5b6c1edfe66a6429acb9d5216aafb340022709bd
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Mar 19 19:27:51 2015 +0800

    f2fs: preserve extent info for extent cache

    This patch tries to preserve the last extent info of the extent tree cache in the
    on-disk inode, so that we can reuse it next time for better
    performance.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit ef0c384c2b9eeb740ba676a10e5a2b194b23b1b0
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Mar 19 19:26:02 2015 +0800

    f2fs: initialize extent tree with on-disk extent info of inode

    With the normal extent info cache, we record the largest extent mapping between
    logical block and physical block into extent info, and we persist extent info in
    the on-disk inode.

    When we enable the extent tree cache, if extent info exists in the on-disk inode
    and the extent is not a small fragmented mapping extent, we'd better load the
    extent info into the extent tree cache when the inode is loaded. This way we have
    more chance to hit the extent tree cache rather than taking more time to read the
    dnode page for the block address.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 49b71c3af3ee7476f647071eee659998f11678d1
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Mar 19 19:24:59 2015 +0800

    f2fs: introduce __{find,grab}_extent_tree

    This patch introduces __{find,grab}_extent_tree for reuse by the following
    patches.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 841e0eccff53960fc7dcfec4a9f1e09186478205
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Mar 19 19:23:32 2015 +0800

    f2fs: split set_data_blkaddr from f2fs_update_extent_cache

    Split __set_data_blkaddr from f2fs_update_extent_cache for readability.

    Additionally rename __set_data_blkaddr to set_data_blkaddr for exporting.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit f0683a2560052529e3b64180f54f2289463d8d82
Author: Wanpeng Li <wanpeng.li@linux.intel.com>
Date:   Thu Mar 19 13:23:48 2015 +0800

    f2fs: enable fast symlink by utilizing inline data

    Fast symlink can utilize inline data flow to avoid using any
    i_addr region, since we need to handle many cases such as
    truncation, roll-forward recovery, and fsck/dump tools.

    Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 139dd93c93898c3ccf19791d29890fe594332d64
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Mar 17 17:58:08 2015 -0700

    f2fs: add some tracepoints to debug volatile and atomic writes

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 94718185df166d6c87cf856375479ee2c296be2b
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Tue Mar 17 17:16:35 2015 -0700

    f2fs: avoid punch_hole overhead when releasing volatile data

    This patch avoids some punch_hole overhead when releasing volatile data.
    If the volatile data was not written yet, we can just zero the first page.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 3e8a62b8c2858369f7d563525fb528f7a4f1cb30
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Mon Mar 16 16:54:52 2015 -0700

    f2fs: avoid wrong f2fs_bug_on when truncating inline_data

    This patch removes a wrong f2fs_bug_on in truncate_inline_inode.

    When there is no space, a corner case can happen where i_size is over
    MAX_INLINE_SIZE while its inode is still inline_data.

    The scenario is
     1. write small data into file #A.
     2. fill the whole partition to 100%.
     3. truncate 4096 on file #A.
     4. write data at 8192 offset.
      --> f2fs_write_begin
        -> -ENOSPC = f2fs_convert_inline_page
        -> f2fs_write_failed
          -> truncate_blocks
            -> truncate_inline_inode
    	  BUG_ON, since i_size is 4096.

    Reviewed-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 91c955cae47d1e00bf6b8db945f603df00b9feac
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Fri Mar 13 21:44:36 2015 -0700

    f2fs: enhance multi-threads performance

    Previously, f2fs_write_data_pages had a mutex, sbi->writepages, to serialize
    data writes to maximize write bandwidth, while sacrificing multi-threaded
    performance.
    Practically, however, a multi-threaded environment is much more important for
    users. So this patch tries to remove the mutex.

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 4c02c4dac27ed95f8397c32ecce4a822f531d9f2
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Mar 11 23:27:25 2015 -0400

    f2fs: set buffer_new when new blocks are allocated

    This patch calls set_buffer_new if new blocks are allocated.

    Reviewed-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 6a69e9c5bfef161547fe3efdcf0894faeb45e37b
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 16 21:08:44 2015 +0800

    f2fs: set SBI_NEED_FSCK when encountering exception in recovery

    This patch tries to set the SBI_NEED_FSCK flag in sbi only when we fail to recover
    in fill_super, so we can skip fscking the image when we fail to fill super for
    other reasons.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 8c91980616b9af9cfeb810d1a7c82ad9323b3e7d
Author: Jaegeuk Kim <jaegeuk@kernel.org>
Date:   Wed Mar 11 13:42:48 2015 -0400

    f2fs: fix to cover sentry_lock for block allocation

    In the following call stack, f2fs changes the bitmap for dirty segments and the
    number of dirty sentries without grabbing sit_i->sentry_lock.
    This can result in a mismatch between the bitmap and the number of dirty sentries
    if there are some direct_io operations.

    In allocate_data_block,
     - __allocate_new_segments
      - mutex_lock(&curseg->curseg_mutex);
      - s_ops->allocate_segment
       - new_curseg/change_curseg
        - reset_curseg
         - __set_sit_entry_type
          - __mark_sit_entry_dirty
           - set_bit(dirty_sentries_bitmap)
           - dirty_sentries++;
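
The invariant at stake in the call chain above is that the dirty bitmap and the dirty counter change together under one lock. A user-space sketch with a pthread mutex, using hypothetical names and a made-up segment count, could look like this:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define DEMO_NR_SEGMENTS 64u  /* hypothetical segment count */

/* Hypothetical stand-in for the SIT info: a lock, a dirty bitmap, a counter. */
struct demo_sit_info {
    pthread_mutex_t sentry_lock;
    bool dirty_bitmap[DEMO_NR_SEGMENTS];
    unsigned int dirty_sentries;
};

static void demo_sit_init(struct demo_sit_info *sit)
{
    pthread_mutex_init(&sit->sentry_lock, NULL);
    for (unsigned int i = 0; i < DEMO_NR_SEGMENTS; i++)
        sit->dirty_bitmap[i] = false;
    sit->dirty_sentries = 0;
}

/* Caller must hold sentry_lock: the bit and the counter must change together. */
static void demo_mark_sit_entry_dirty_locked(struct demo_sit_info *sit,
                                             unsigned int segno)
{
    if (!sit->dirty_bitmap[segno]) {
        sit->dirty_bitmap[segno] = true;
        sit->dirty_sentries++;
    }
}

/* Allocation path: take the lock around the dirty-state update. */
static void demo_allocate_segment(struct demo_sit_info *sit, unsigned int segno)
{
    assert(segno < DEMO_NR_SEGMENTS);
    pthread_mutex_lock(&sit->sentry_lock);
    demo_mark_sit_entry_dirty_locked(sit, segno);
    pthread_mutex_unlock(&sit->sentry_lock);
}
```

Without the lock, two concurrent callers could both see the bit clear and both increment the counter, which is the kind of bitmap/counter mismatch the commit message describes.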

    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 8189fdf992151e8826c924241674ef0de44fde40
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Thu Mar 12 17:04:24 2015 +0800

    f2fs: fix to check current blkaddr in __allocate_data_blocks

    In __allocate_data_blocks, we should check current blkaddr which is located at
    ofs_in_node of dnode page instead of checking first blkaddr all the time.
    Otherwise we can only allocate one blkaddr in each dnode page. Fix it.
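
The difference between checking the first block address and checking the one at ofs_in_node can be sketched as below. The structure and function are hypothetical simplifications, with a made-up dnode size of 8 slots and 0 standing in for an unallocated address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NULL_ADDR 0u        /* assumed marker for "not allocated" */
#define DEMO_ADDRS_PER_BLOCK 8u  /* made-up number of slots per dnode */

/* Hypothetical dnode: an array of block addresses plus the current offset. */
struct demo_dnode {
    uint32_t addr[DEMO_ADDRS_PER_BLOCK];
    size_t ofs_in_node;
};

/*
 * Allocate every unallocated slot in [start, start + count). The check reads
 * the slot at ofs_in_node on each iteration, not addr[0] every time.
 * Returns the number of newly allocated slots.
 */
static size_t demo_allocate_range(struct demo_dnode *dn, size_t start,
                                  size_t count, uint32_t *next_free)
{
    size_t allocated = 0;

    assert(dn != NULL && next_free != NULL);
    for (dn->ofs_in_node = start;
         dn->ofs_in_node < DEMO_ADDRS_PER_BLOCK &&
         dn->ofs_in_node < start + count;
         dn->ofs_in_node++) {
        uint32_t blkaddr = dn->addr[dn->ofs_in_node]; /* current slot */

        if (blkaddr == DEMO_NULL_ADDR) {
            dn->addr[dn->ofs_in_node] = (*next_free)++;
            allocated++;
        }
    }
    return allocated;
}
```

If the check always read `addr[0]`, it would see an allocated address as soon as the first slot was filled and would skip every later slot, which matches the "only one blkaddr per dnode page" symptom.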

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit d616037f0c30568df61b77c738a4f6dfdde4b20b
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Tue Mar 10 13:16:25 2015 +0800

    f2fs: fix to truncate inline data past EOF

    Previously, if an inode has inline data, we try to invalidate partial inline
    data in page #0 when we truncate the size of the inode in
    truncate_partial_data_page(). Then we set page #0 dirty, after which we can
    synchronize the inode page with page #0 at ->writepage().

    But sometimes we fail to operate on page #0 in truncate_partial_data_page()
    for the reasons below:
    a) if offset is zero, we skip setting page #0 dirty.
    b) if page #0 is not uptodate, we fail to update it as it has no mapping
    data.

    So with the following operations, we will read old data which should have been
    truncated.

    1. write inline data to file
    2. sync first data page to inode page
    3. truncate file size to 0
    4. truncate file size to max_inline_size
    5. echo 1 > /proc/sys/vm/drop_caches
    6. read file --> meet original inline data which remains in the inode page.

    This patch renames truncate_inline_data() to truncate_inline_inode() for code
    readability, then use truncate_inline_inode() to truncate inline data in inode
    page in truncate_blocks() and truncate page #0 in truncate_partial_data_page()
    for fixing.

    v2:
     o truncate partially #0 page in truncate_partial_data_page to avoid keeping
       old data in #0 page.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit 74707d08913a157106af3dcacfce3be4f67f2c54
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 9 18:18:19 2015 +0800

    f2fs: fix reference leaks in f2fs_acl_create

    Our f2fs_acl_create is copied and modified from posix_acl_create to avoid
    deadlock bug when inline_dentry feature is enabled.

    Now, we got reference leaks in posix_acl_create, and this has been fixed in
    commit fed0b588be2f ("posix_acl: fix reference leaks in posix_acl_create")
    by Omar Sandoval.
    https://lkml.org/lkml/2015/2/9/5

    Let's fix this issue in f2fs_acl_create too.

    Signed-off-by: Chao Yu <chao2.yu@samsung.com>
    Reviewed-by: Changman Lee <cm224.lee@ssamsung.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

commit c8539703ed9549d476a7ea4134e83a26104e309d
Author: Chao Yu <chao2.yu@samsung.com>
Date:   Mon Mar 9 17:33:16 2015 +0800

    f2fs: fix to calculate max length of contiguous free slots correctly

    When l…
AndropaX pushed a commit to AndropaX/android_kernel_xiaomi_msm8992 that referenced this issue Apr 13, 2017
We will encounter an oops when executing the command below.
getfattr -n system.advise /mnt/f2fs/file
Killed

message log:
BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: [<f8b54d69>] f2fs_xattr_advise_get+0x29/0x40 [f2fs]
*pdpt = 00000000319b7001 *pde = 0000000000000000
Oops: 0002 [#1] SMP
Modules linked in: f2fs(O) snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev
snd_seq_device snd_timer bnep snd rfcomm microcode bluetooth soundcore i2c_piix4 mac_hid serio_raw parport_pc ppdev lp parport
binfmt_misc hid_generic psmouse usbhid hid e1000 [last unloaded: f2fs]
CPU: 3 PID: 3134 Comm: getfattr Tainted: G           O    4.0.0-rc1 #6
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: f3a71b60 ti: f19a6000 task.ti: f19a6000
EIP: 0060:[<f8b54d69>] EFLAGS: 00010246 CPU: 3
EIP is at f2fs_xattr_advise_get+0x29/0x40 [f2fs]
EAX: 00000000 EBX: f19a7e71 ECX: 00000000 EDX: f8b5b467
ESI: 00000000 EDI: f2008570 EBP: f19a7e14 ESP: f19a7e08
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: 00000000 CR3: 319b8000 CR4: 000007f0
Stack:
 f8b5a634 c0cbb580 00000000 f19a7e34 c1193850 00000000 00000007 f19a7e71
 f19a7e64 c0cbb580 c1193810 f19a7e50 c1193c00 00000000 00000000 00000000
 c0cbb580 00000000 f19a7f70 c1194097 00000000 00000000 00000000 74737973
Call Trace:
 [<c1193850>] generic_getxattr+0x40/0x50
 [<c1193810>] ? xattr_resolve_name+0x80/0x80
 [<c1193c00>] vfs_getxattr+0x70/0xa0
 [<c1194097>] getxattr+0x87/0x190
 [<c11801d7>] ? path_lookupat+0x57/0x5f0
 [<c11819d2>] ? putname+0x32/0x50
 [<c116653a>] ? kmem_cache_alloc+0x2a/0x130
 [<c11819d2>] ? putname+0x32/0x50
 [<c11819d2>] ? putname+0x32/0x50
 [<c11819d2>] ? putname+0x32/0x50
 [<c11827f9>] ? user_path_at_empty+0x49/0x70
 [<c118283f>] ? user_path_at+0x1f/0x30
 [<c11941e7>] path_getxattr+0x47/0x80
 [<c11948e7>] SyS_getxattr+0x27/0x30
 [<c163f748>] sysenter_do_call+0x12/0x12
Code: 66 90 55 89 e5 57 56 53 66 66 66 66 90 8b 78 20 89 d3 ba 67 b4 b5 f8 89 d8 89 ce e8 42 7c 7b c8 85 c0 75 16 0f b6 87 44 01 00
00 <88> 06 b8 01 00 00 00 5b 5e 5f 5d c3 8d 76 00 b8 ea ff ff ff eb
EIP: [<f8b54d69>] f2fs_xattr_advise_get+0x29/0x40 [f2fs] SS:ESP 0068:f19a7e08
CR2: 0000000000000000
---[ end trace 860260654f1f416a ]---

The reason is that getfattr works in two steps, as the strace info shows:
1) look up the specified xattr and get its size.
2) get the value of the extended attribute.

strace info:
getxattr("/mnt/f2fs/file", "system.advise", 0x0, 0) = 1
getxattr("/mnt/f2fs/file", "system.advise", "\x00", 256) = 1

For the first step, getfattr may pass a NULL pointer in @value and zero in @size
as parameters for ->getxattr, but we access this @value pointer directly without
checking whether the pointer is valid or not in f2fs_xattr_advise_get, so the
oops occurs.

This patch fixes this issue by verifying @value pointer before using.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
AndropaX pushed a commit to AndropaX/android_kernel_xiaomi_msm8992 that referenced this issue Apr 25, 2017
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue May 29, 2017
{min,max}_capacity are static variables that are only updated from
__update_min_max_capacity(), but not used anywhere else.

Remove them together with the function updating them. This has also
the nice side effect of fixing a LOCKDEP warning related to locking
all CPUs in update_min_max_capacity(), as reported by Ke Wang:

[    2.853595] c0 =============================================
[    2.859219] c0 [ INFO: possible recursive locking detected ]
[    2.864852] c0 4.4.6+ #5 Tainted: G        W
[    2.869604] c0 ---------------------------------------------
[    2.875230] c0 swapper/0/1 is trying to acquire lock:
[    2.880248]  (&rq->lock){-.-.-.}, at: [<ffffff80081241cc>] cpufreq_notifier_policy+0x2e8/0x37c
[    2.888815] c0
[    2.888815] c0 but task is already holding lock:
[    2.895132]  (&rq->lock){-.-.-.}, at: [<ffffff80081241cc>] cpufreq_notifier_policy+0x2e8/0x37c
[    2.903700] c0
[    2.903700] c0 other info that might help us debug this:
[    2.910710] c0  Possible unsafe locking scenario:
[    2.910710] c0
[    2.917112] c0        CPU0
[    2.919795] c0        ----
[    2.922478]   lock(&rq->lock);
[    2.925507]   lock(&rq->lock);
[    2.928536] c0
[    2.928536] c0  *** DEADLOCK ***
[    2.928536] c0
[    2.935200] c0  May be due to missing lock nesting notation
[    2.935200] c0
[    2.942471] c0 7 locks held by swapper/0/1:
[    2.946623]  #0:  (&dev->mutex){......}, at: [<ffffff800850e118>] __driver_attach+0x64/0xb8
[    2.954931]  #1:  (&dev->mutex){......}, at: [<ffffff800850e128>] __driver_attach+0x74/0xb8
[    2.963239]  #2:  (cpu_hotplug.lock){++++++}, at: [<ffffff80080cb218>] get_online_cpus+0x48/0xa8
[    2.971979]  #3:  (subsys mutex#6){+.+.+.}, at: [<ffffff800850bed4>] subsys_interface_register+0x44/0xc0
[    2.981411]  #4:  (&policy->rwsem){+.+.+.}, at: [<ffffff8008720338>] cpufreq_online+0x330/0x76c
[    2.990065]  #5:  ((cpufreq_policy_notifier_list).rwsem){.+.+..}, at: [<ffffff80080f3418>] blocking_notifier_call_chain+0x38/0xc4
[    3.001661]  #6:  (&rq->lock){-.-.-.}, at: [<ffffff80081241cc>] cpufreq_notifier_policy+0x2e8/0x37c
[    3.010661] c0
[    3.010661] c0 stack backtrace:
[    3.015514] c0 CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W 4.4.6+ #5
[    3.022864] c0 Hardware name: Spreadtrum SP9860g Board (DT)
[    3.028402] c0 Call trace:
[    3.031092] c0 [<ffffff800808b50c>] dump_backtrace+0x0/0x210
[    3.036716] c0 [<ffffff800808b73c>] show_stack+0x20/0x28
[    3.041994] c0 [<ffffff8008433310>] dump_stack+0xa8/0xe0
[    3.047273] c0 [<ffffff80081349e0>] __lock_acquire+0x1e0c/0x2218
[    3.053243] c0 [<ffffff80081353c0>] lock_acquire+0xe0/0x280
[    3.058784] c0 [<ffffff8008abfdfc>] _raw_spin_lock+0x44/0x58
[    3.064407] c0 [<ffffff80081241cc>] cpufreq_notifier_policy+0x2e8/0x37c
[    3.070983] c0 [<ffffff80080f3458>] blocking_notifier_call_chain+0x78/0xc4
[    3.077820] c0 [<ffffff8008720294>] cpufreq_online+0x28c/0x76c
[    3.083618] c0 [<ffffff80087208a4>] cpufreq_add_dev+0x98/0xdc
[    3.089331] c0 [<ffffff800850bf14>] subsys_interface_register+0x84/0xc0
[    3.095907] c0 [<ffffff800871fa0c>] cpufreq_register_driver+0x168/0x28c
[    3.102486] c0 [<ffffff80087272f8>] sprd_cpufreq_probe+0x134/0x19c
[    3.108629] c0 [<ffffff8008510768>] platform_drv_probe+0x58/0xd0
[    3.114599] c0 [<ffffff800850de2c>] driver_probe_device+0x1e8/0x470
[    3.120830] c0 [<ffffff800850e168>] __driver_attach+0xb4/0xb8
[    3.126541] c0 [<ffffff800850b750>] bus_for_each_dev+0x6c/0xac
[    3.132339] c0 [<ffffff800850d6c0>] driver_attach+0x2c/0x34
[    3.137877] c0 [<ffffff800850d234>] bus_add_driver+0x210/0x298
[    3.143676] c0 [<ffffff800850f1f4>] driver_register+0x7c/0x114
[    3.149476] c0 [<ffffff8008510654>] __platform_driver_register+0x60/0x6c
[    3.156139] c0 [<ffffff8008f49f40>] sprd_cpufreq_platdrv_init+0x18/0x20
[    3.162714] c0 [<ffffff8008082a64>] do_one_initcall+0xd0/0x1d8
[    3.168514] c0 [<ffffff8008f0bc58>] kernel_init_freeable+0x1fc/0x29c
[    3.174834] c0 [<ffffff8008ab554c>] kernel_init+0x20/0x12c
[    3.180281] c0 [<ffffff8008086290>] ret_from_fork+0x10/0x40

Change-Id: I5ebc57ea2681350c2f942e7c90078298cf5ec096
Reported-by: Ke Wang <ke.wang@spreadtrum.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue May 30, 2017
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue May 30, 2017
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue May 31, 2017
{min,max}_capacity are static variables that are only updated from
__update_min_max_capacity(), but not used anywhere else.

Remove them together with the function updating them. This has also
the nice side effect of fixing a LOCKDEP warning related to locking
all CPUs in update_min_max_capacity(), as reported by Ke Wang:

[    2.853595] c0 =============================================
[    2.859219] c0 [ INFO: possible recursive locking detected ]
[    2.864852] c0 4.4.6+ #5 Tainted: G        W
[    2.869604] c0 ---------------------------------------------
[    2.875230] c0 swapper/0/1 is trying to acquire lock:
[    2.880248]  (&rq->lock){-.-.-.}, at: [<ffffff80081241cc>] cpufreq_notifier_policy+0x2e8/0x37c
[    2.888815] c0
[    2.888815] c0 but task is already holding lock:
[    2.895132]  (&rq->lock){-.-.-.}, at: [<ffffff80081241cc>] cpufreq_notifier_policy+0x2e8/0x37c
[    2.903700] c0
[    2.903700] c0 other info that might help us debug this:
[    2.910710] c0  Possible unsafe locking scenario:
[    2.910710] c0
[    2.917112] c0        CPU0
[    2.919795] c0        ----
[    2.922478]   lock(&rq->lock);
[    2.925507]   lock(&rq->lock);
[    2.928536] c0
[    2.928536] c0  *** DEADLOCK ***
[    2.928536] c0
[    2.935200] c0  May be due to missing lock nesting notation
[    2.935200] c0
[    2.942471] c0 7 locks held by swapper/0/1:
[    2.946623]  #0:  (&dev->mutex){......}, at: [<ffffff800850e118>] __driver_attach+0x64/0xb8
[    2.954931]  #1:  (&dev->mutex){......}, at: [<ffffff800850e128>] __driver_attach+0x74/0xb8
[    2.963239]  #2:  (cpu_hotplug.lock){++++++}, at: [<ffffff80080cb218>] get_online_cpus+0x48/0xa8
[    2.971979]  #3:  (subsys mutex#6){+.+.+.}, at: [<ffffff800850bed4>] subsys_interface_register+0x44/0xc0
[    2.981411]  #4:  (&policy->rwsem){+.+.+.}, at: [<ffffff8008720338>] cpufreq_online+0x330/0x76c
[    2.990065]  #5:  ((cpufreq_policy_notifier_list).rwsem){.+.+..}, at: [<ffffff80080f3418>] blocking_notifier_call_chain+0x38/0xc4
[    3.001661]  #6:  (&rq->lock){-.-.-.}, at: [<ffffff80081241cc>] cpufreq_notifier_policy+0x2e8/0x37c
[    3.010661] c0
[    3.010661] c0 stack backtrace:
[    3.015514] c0 CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W 4.4.6+ #5
[    3.022864] c0 Hardware name: Spreadtrum SP9860g Board (DT)
[    3.028402] c0 Call trace:
[    3.031092] c0 [<ffffff800808b50c>] dump_backtrace+0x0/0x210
[    3.036716] c0 [<ffffff800808b73c>] show_stack+0x20/0x28
[    3.041994] c0 [<ffffff8008433310>] dump_stack+0xa8/0xe0
[    3.047273] c0 [<ffffff80081349e0>] __lock_acquire+0x1e0c/0x2218
[    3.053243] c0 [<ffffff80081353c0>] lock_acquire+0xe0/0x280
[    3.058784] c0 [<ffffff8008abfdfc>] _raw_spin_lock+0x44/0x58
[    3.064407] c0 [<ffffff80081241cc>] cpufreq_notifier_policy+0x2e8/0x37c
[    3.070983] c0 [<ffffff80080f3458>] blocking_notifier_call_chain+0x78/0xc4
[    3.077820] c0 [<ffffff8008720294>] cpufreq_online+0x28c/0x76c
[    3.083618] c0 [<ffffff80087208a4>] cpufreq_add_dev+0x98/0xdc
[    3.089331] c0 [<ffffff800850bf14>] subsys_interface_register+0x84/0xc0
[    3.095907] c0 [<ffffff800871fa0c>] cpufreq_register_driver+0x168/0x28c
[    3.102486] c0 [<ffffff80087272f8>] sprd_cpufreq_probe+0x134/0x19c
[    3.108629] c0 [<ffffff8008510768>] platform_drv_probe+0x58/0xd0
[    3.114599] c0 [<ffffff800850de2c>] driver_probe_device+0x1e8/0x470
[    3.120830] c0 [<ffffff800850e168>] __driver_attach+0xb4/0xb8
[    3.126541] c0 [<ffffff800850b750>] bus_for_each_dev+0x6c/0xac
[    3.132339] c0 [<ffffff800850d6c0>] driver_attach+0x2c/0x34
[    3.137877] c0 [<ffffff800850d234>] bus_add_driver+0x210/0x298
[    3.143676] c0 [<ffffff800850f1f4>] driver_register+0x7c/0x114
[    3.149476] c0 [<ffffff8008510654>] __platform_driver_register+0x60/0x6c
[    3.156139] c0 [<ffffff8008f49f40>] sprd_cpufreq_platdrv_init+0x18/0x20
[    3.162714] c0 [<ffffff8008082a64>] do_one_initcall+0xd0/0x1d8
[    3.168514] c0 [<ffffff8008f0bc58>] kernel_init_freeable+0x1fc/0x29c
[    3.174834] c0 [<ffffff8008ab554c>] kernel_init+0x20/0x12c
[    3.180281] c0 [<ffffff8008086290>] ret_from_fork+0x10/0x40

Change-Id: I5ebc57ea2681350c2f942e7c90078298cf5ec096
Reported-by: Ke Wang <ke.wang@spreadtrum.com>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
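
The "possible recursive locking detected" report in the commit message above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `ToyLockdep` and `lock_all_cpus` are hypothetical names, and the check (warn when a lock of an already-held class and subclass is taken again) is a simplification of what lockdep does, not the kernel implementation. The `annotate` flag stands in for the "lock nesting notation" that the report mentions.

```python
class ToyLockdep:
    """Toy model of a same-class recursion check; not the kernel's lockdep."""

    def __init__(self):
        self.held = []      # stack of (lock_class, subclass) pairs currently held
        self.warnings = []  # human-readable reports

    def acquire(self, lock_class, subclass=0):
        key = (lock_class, subclass)
        # Taking a second lock of a class/subclass that is already held
        # looks like recursion to the checker, even if the instances differ.
        if key in self.held:
            self.warnings.append(
                f"possible recursive locking detected: {lock_class}")
        self.held.append(key)

    def release(self, lock_class, subclass=0):
        self.held.remove((lock_class, subclass))


def lock_all_cpus(checker, nr_cpus, annotate=False):
    """Take one per-CPU lock of the same class for every CPU, then drop them.

    With annotate=True each acquisition carries its own subclass, which is
    the toy equivalent of supplying nesting notation (an assumption of this
    sketch). Returns the number of warnings the checker recorded.
    """
    for cpu in range(nr_cpus):
        checker.acquire("&rq->lock", subclass=cpu if annotate else 0)
    for cpu in reversed(range(nr_cpus)):
        checker.release("&rq->lock", subclass=cpu if annotate else 0)
    return len(checker.warnings)
```

In this model, locking a single CPU never warns, while locking several CPUs without annotation warns once per extra lock. The commit sidesteps the problem by removing the function that locked all CPUs.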
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue Jun 1, 2017
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue Jun 1, 2017
bgcngm pushed a commit to Mi5Devs/android_kernel_xiaomi_msm8996 that referenced this issue Jun 3, 2017
pix106 pushed a commit to pix106/android_kernel_xiaomi that referenced this issue Jul 2, 2020
[ Upstream commit 9b38cc7 ]

Ziqian reported a lockup when adding a retprobe on _raw_spin_lock_irqsave.
My test was also able to trigger lockdep output:

 ============================================
 WARNING: possible recursive locking detected
 5.6.0-rc6+ #6 Not tainted
 --------------------------------------------
 sched-messaging/2767 is trying to acquire lock:
 ffffffff9a492798 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_hash_lock+0x52/0xa0

 but task is already holding lock:
 ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&(kretprobe_table_locks[i].lock));
   lock(&(kretprobe_table_locks[i].lock));

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 1 lock held by sched-messaging/2767:
  #0: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50

 stack backtrace:
 CPU: 3 PID: 2767 Comm: sched-messaging Not tainted 5.6.0-rc6+ #6
 Call Trace:
  dump_stack+0x96/0xe0
  __lock_acquire.cold.57+0x173/0x2b7
  ? native_queued_spin_lock_slowpath+0x42b/0x9e0
  ? lockdep_hardirqs_on+0x590/0x590
  ? __lock_acquire+0xf63/0x4030
  lock_acquire+0x15a/0x3d0
  ? kretprobe_hash_lock+0x52/0xa0
  _raw_spin_lock_irqsave+0x36/0x70
  ? kretprobe_hash_lock+0x52/0xa0
  kretprobe_hash_lock+0x52/0xa0
  trampoline_handler+0xf8/0x940
  ? kprobe_fault_handler+0x380/0x380
  ? find_held_lock+0x3a/0x1c0
  kretprobe_trampoline+0x25/0x50
  ? lock_acquired+0x392/0xbc0
  ? _raw_spin_lock_irqsave+0x50/0x70
  ? __get_valid_kprobe+0x1f0/0x1f0
  ? _raw_spin_unlock_irqrestore+0x3b/0x40
  ? finish_task_switch+0x4b9/0x6d0
  ? __switch_to_asm+0x34/0x70
  ? __switch_to_asm+0x40/0x70

The code within the kretprobe handler checks for probe reentrancy,
so we won't trigger any _raw_spin_lock_irqsave probe in there.

The problem is outside of it, in kprobe_flush_task, where we call:

  kprobe_flush_task
    kretprobe_table_lock
      raw_spin_lock_irqsave
        _raw_spin_lock_irqsave

where _raw_spin_lock_irqsave triggers the kretprobe and installs
kretprobe_trampoline handler on _raw_spin_lock_irqsave return.

The kretprobe_trampoline handler is then executed with
kretprobe_table_locks already locked, and the first thing it does is
lock kretprobe_table_locks ;-) The whole lockup path looks like:

  kprobe_flush_task
    kretprobe_table_lock
      raw_spin_lock_irqsave
        _raw_spin_lock_irqsave ---> probe triggered, kretprobe_trampoline installed

        ---> kretprobe_table_locks locked

        kretprobe_trampoline
          trampoline_handler
            kretprobe_hash_lock(current, &head, &flags);  <--- deadlock

Add kprobe_busy_begin/end helpers that mark code as having a fake
probe installed, to prevent another kprobe from triggering within
this code.

Use these helpers in kprobe_flush_task, so that the probe recursion
protection check is hit and the probe is never set, which prevents
the lockup above.
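
The effect of the busy helpers can be sketched with a small user-space model. Everything here is an assumption made for illustration: `ToyKprobes`, `flush_task` and the `probe_running` flag are hypothetical stand-ins for the kernel's per-CPU state, and the "deadlock" is only counted rather than actually hanging.

```python
class ToyKprobes:
    """Toy model of the lockup path; names and state are illustrative only."""

    def __init__(self, probe_on_lock):
        self.probe_on_lock = probe_on_lock  # return probe set on the lock function?
        self.probe_running = False          # stands in for "a probe is active here"
        self.table_locked = False           # stands in for kretprobe_table_locks
        self.deadlocks = 0                  # times the lock was taken while held

    def busy_begin(self):
        # Pretend a probe is already running, so no other probe fires.
        self.probe_running = True

    def busy_end(self):
        self.probe_running = False

    def lock_table(self):
        if self.table_locked:
            # A real spinlock would spin forever here; the model just counts.
            self.deadlocks += 1
            return
        self.table_locked = True
        # On return from the lock function the return probe fires, unless
        # the recursion check sees that a probe is already marked as running.
        if self.probe_on_lock and not self.probe_running:
            self.trampoline_handler()

    def unlock_table(self):
        self.table_locked = False

    def trampoline_handler(self):
        self.probe_running = True
        self.lock_table()  # first thing the handler does: take the table lock
        self.probe_running = False


def flush_task(kp, use_busy_guard):
    """Model of the flush path, with or without the busy guard around it."""
    if use_busy_guard:
        kp.busy_begin()
    kp.lock_table()
    kp.unlock_table()
    if use_busy_guard:
        kp.busy_end()
    return kp.deadlocks
```

With a probe on the lock function and no guard, `flush_task` records one deadlock. With the guard, the recursion check suppresses the probe and none is recorded.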

Link: http://lkml.kernel.org/r/158927059835.27680.7011202830041561604.stgit@devnote2

Fixes: ef53d9c ("kprobes: improve kretprobe scalability with hashed locking")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Gustavo A . R . Silva" <gustavoars@kernel.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Reported-by: "Ziqian SUN (Zamir)" <zsun@redhat.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
pix106 pushed a commit to pix106/android_kernel_xiaomi that referenced this issue Jul 31, 2020
[ Upstream commit 8523c00 ]

After entering kdb due to a breakpoint, executing 'ss' or 'go' (which
delays installing breakpoints and does a single-step first) does not
work correctly: kdb is entered again due to an oops.

This happens because the reason obtained in kdb_stub() is not the
expected one; it seems that the ex_vector for single-step should be 0,
as the powerpc, sh and parisc architectures implement it.

Before the patch:
Entering kdb (current=0xffff8000119e2dc0, pid 0) on processor 0 due to Keyboard Entry
[0]kdb> bp printk
Instruction(i) BP #0 at 0xffff8000101486cc (printk)
    is enabled   addr at ffff8000101486cc, hardtype=0 installed=0

[0]kdb> g

/ # echo h > /proc/sysrq-trigger

Entering kdb (current=0xffff0000fa878040, pid 266) on processor 3 due to Breakpoint @ 0xffff8000101486cc
[3]kdb> ss

Entering kdb (current=0xffff0000fa878040, pid 266) on processor 3 Oops: (null)
due to oops @ 0xffff800010082ab8
CPU: 3 PID: 266 Comm: sh Not tainted 5.7.0-rc4-13839-gf0e5ad491718 #6
Hardware name: linux,dummy-virt (DT)
pstate: 00000085 (nzcv daIf -PAN -UAO)
pc : el1_irq+0x78/0x180
lr : __handle_sysrq+0x80/0x190
sp : ffff800015003bf0
x29: ffff800015003d20 x28: ffff0000fa878040
x27: 0000000000000000 x26: ffff80001126b1f0
x25: ffff800011b6a0d8 x24: 0000000000000000
x23: 0000000080200005 x22: ffff8000101486cc
x21: ffff800015003d30 x20: 0000ffffffffffff
x19: ffff8000119f2000 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000
x15: 0000000000000000 x14: 0000000000000000
x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000
x9 : 0000000000000000 x8 : ffff800015003e50
x7 : 0000000000000002 x6 : 00000000380b9990
x5 : ffff8000106e99e8 x4 : ffff0000fadd83c0
x3 : 0000ffffffffffff x2 : ffff800011b6a0d8
x1 : ffff800011b6a000 x0 : ffff80001130c9d8
Call trace:
 el1_irq+0x78/0x180
 printk+0x0/0x84
 write_sysrq_trigger+0xb0/0x118
 proc_reg_write+0xb4/0xe0
 __vfs_write+0x18/0x40
 vfs_write+0xb0/0x1b8
 ksys_write+0x64/0xf0
 __arm64_sys_write+0x14/0x20
 el0_svc_common.constprop.2+0xb0/0x168
 do_el0_svc+0x20/0x98
 el0_sync_handler+0xec/0x1a8
 el0_sync+0x140/0x180

[3]kdb>

After the patch:
Entering kdb (current=0xffff8000119e2dc0, pid 0) on processor 0 due to Keyboard Entry
[0]kdb> bp printk
Instruction(i) BP #0 at 0xffff8000101486cc (printk)
    is enabled   addr at ffff8000101486cc, hardtype=0 installed=0

[0]kdb> g

/ # echo h > /proc/sysrq-trigger

Entering kdb (current=0xffff0000fa852bc0, pid 268) on processor 0 due to Breakpoint @ 0xffff8000101486cc
[0]kdb> g

Entering kdb (current=0xffff0000fa852bc0, pid 268) on processor 0 due to Breakpoint @ 0xffff8000101486cc
[0]kdb> ss

Entering kdb (current=0xffff0000fa852bc0, pid 268) on processor 0 due to SS trap @ 0xffff800010082ab8
[0]kdb>

Fixes: 44679a4 ("arm64: KGDB: Add step debugging support")
Signed-off-by: Wei Li <liwei391@huawei.com>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200509214159.19680-2-liwei391@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
pix106 pushed a commit to pix106/android_kernel_xiaomi that referenced this issue Aug 24, 2020
[ Upstream commit e24c644 ]

I compiled with AddressSanitizer and I had these memory leaks while I
was using the tep_parse_format function:

    Direct leak of 28 byte(s) in 4 object(s) allocated from:
        #0 0x7fb07db49ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
        #1 0x7fb07a724228 in extend_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:985
        #2 0x7fb07a724c21 in __read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1140
        #3 0x7fb07a724f78 in read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1206
        #4 0x7fb07a725191 in __read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1291
        #5 0x7fb07a7251df in read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1299
        #6 0x7fb07a72e6c8 in process_dynamic_array_len /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:2849
        #7 0x7fb07a7304b8 in process_function /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3161
        #8 0x7fb07a730900 in process_arg_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3207
        #9 0x7fb07a727c0b in process_arg /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1786
        #10 0x7fb07a731080 in event_read_print_args /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3285
        #11 0x7fb07a731722 in event_read_print /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3369
        #12 0x7fb07a740054 in __tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6335
        #13 0x7fb07a74047a in __parse_event /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6389
        #14 0x7fb07a740536 in tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6431
        #15 0x7fb07a785acf in parse_event ../../../src/fs-src/fs.c:251
        #16 0x7fb07a785ccd in parse_systems ../../../src/fs-src/fs.c:284
        #17 0x7fb07a786fb3 in read_metadata ../../../src/fs-src/fs.c:593
        #18 0x7fb07a78760e in ftrace_fs_source_init ../../../src/fs-src/fs.c:727
        #19 0x7fb07d90c19c in add_component_with_init_method_data ../../../../src/lib/graph/graph.c:1048
        #20 0x7fb07d90c87b in add_source_component_with_initialize_method_data ../../../../src/lib/graph/graph.c:1127
        #21 0x7fb07d90c92a in bt_graph_add_source_component ../../../../src/lib/graph/graph.c:1152
        #22 0x55db11aa632e in cmd_run_ctx_create_components_from_config_components ../../../src/cli/babeltrace2.c:2252
        #23 0x55db11aa6fda in cmd_run_ctx_create_components ../../../src/cli/babeltrace2.c:2347
        #24 0x55db11aa780c in cmd_run ../../../src/cli/babeltrace2.c:2461
        #25 0x55db11aa8a7d in main ../../../src/cli/babeltrace2.c:2673
        #26 0x7fb07d5460b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

The token variable in the process_dynamic_array_len function is
allocated in the read_expect_type function, but is not freed before
calling the read_token function.

Free the token variable before calling read_token in order to plug the
leak.

Signed-off-by: Philippe Duplessis-Guindon <pduplessis@efficios.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/linux-trace-devel/20200730150236.5392-1-pduplessis@efficios.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
pix106 pushed a commit to pix106/android_kernel_xiaomi that referenced this issue Oct 1, 2020
[ Upstream commit d26383d ]

The following leaks were detected by ASAN:

  Indirect leak of 360 byte(s) in 9 object(s) allocated from:
    #0 0x7fecc305180e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
    #1 0x560578f6dce5 in perf_pmu__new_format util/pmu.c:1333
    #2 0x560578f752fc in perf_pmu_parse util/pmu.y:59
    #3 0x560578f6a8b7 in perf_pmu__format_parse util/pmu.c:73
    #4 0x560578e07045 in test__pmu tests/pmu.c:155
    #5 0x560578de109b in run_test tests/builtin-test.c:410
    #6 0x560578de109b in test_and_print tests/builtin-test.c:440
    #7 0x560578de401a in __cmd_test tests/builtin-test.c:661
    #8 0x560578de401a in cmd_test tests/builtin-test.c:807
    #9 0x560578e49354 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #10 0x560578ce71a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #11 0x560578ce71a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #12 0x560578ce71a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #13 0x7fecc2b7acc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: cff7f95 ("perf tests: Move pmu tests into separate object")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-12-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
evSolod29 pushed a commit to evSolod29/Lotus_MT6765_Kernel that referenced this issue Nov 25, 2020
commit c73322d upstream.

Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
cleanups".

Jia reported a scenario in which the kswapd of a node indefinitely spins
at 100% CPU usage.  We have seen similar cases at Facebook.

The kernel's current method of judging its ability to reclaim a node (or
whether to back off and sleep) is based on the amount of scanned pages
in proportion to the amount of reclaimable pages.  In Jia's and our
scenarios, there are no reclaimable pages in the node, however, and the
condition for backing off is never met.  Kswapd busyloops in an attempt
to restore the watermarks while having nothing to work with.

This series reworks the definition of an unreclaimable node based not on
scanning but on whether kswapd is able to actually reclaim pages in
MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criterion
the page allocator uses for giving up on direct reclaim and invoking the
OOM killer.  If it cannot free any pages, kswapd will go to sleep and
leave further attempts to direct reclaim invocations, which will either
make progress and re-enable kswapd, or invoke the OOM killer.

Patch #1 fixes the immediate problem Jia reported, the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.

Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
and directly related to #5, but in itself not relevant to the series.

If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.

This patch (of 9):

Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance.  Up until
commit 1d82de6 ("mm, vmscan: make kswapd reclaim in terms of
nodes") kswapd had such a mechanism.  It considered zones whose
theoretically reclaimable pages it had reclaimed six times over as
unreclaimable and backed away from them.  This guard was erroneously
removed as the patch changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met.  Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off.  If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node.  This is the same number of shots
direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.
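
The back-off rule described above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct node_state` and the three helper functions are hypothetical names, not the kernel's data structures, and only the limit of 16 comes from the commit message.

```c
#include <stdbool.h>

/* Same value the commit message quotes for MAX_RECLAIM_RETRIES. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical stand-in for per-node state (not the kernel's structure). */
struct node_state {
    int kswapd_failures; /* consecutive kswapd runs that reclaimed nothing */
};

/* Record the outcome of one kswapd run over the node. */
static void kswapd_note_run(struct node_state *node, unsigned long nr_reclaimed)
{
    if (nr_reclaimed > 0)
        node->kswapd_failures = 0;          /* progress resets the count */
    else if (node->kswapd_failures < MAX_RECLAIM_RETRIES)
        node->kswapd_failures++;            /* another fruitless run */
}

/* kswapd stops working on the node once the limit is reached. */
static bool kswapd_should_back_off(const struct node_state *node)
{
    return node->kswapd_failures >= MAX_RECLAIM_RETRIES;
}

/* A direct reclaimer that frees pages proves the node reclaimable again. */
static void direct_reclaim_made_progress(struct node_state *node)
{
    node->kswapd_failures = 0;
}
```

Counting failed runs rather than scanned pages is what lets the rule fire even when there is nothing to scan at all.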

[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
  Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
[shakeelb@google.com: fix condition for throttle_direct_reclaim]
  Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Jia He <hejianet@gmail.com>
Tested-by: Jia He <hejianet@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dmitry Shmidt <dimitrysh@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pix106 pushed a commit to pix106/android_kernel_xiaomi that referenced this issue Dec 5, 2020
[ Upstream commit e773ca7 ]

The actual burst size is '1 << desc->rqcfg.brst_size', so we should use
that value rather than the raw desc->rqcfg.brst_size field.
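
As a rough illustration of the difference between the encoded field and the decoded burst size, here is a small sketch. The helper names are hypothetical and the alignment check is only an example of a place where the decoded value matters; it is not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode the field: a stored value of n means a burst size of 1 << n. */
static uint32_t burst_size(uint32_t brst_size_field)
{
    return UINT32_C(1) << brst_size_field;
}

/* Hypothetical helper: is a length a whole multiple of the decoded burst size? */
static bool aligned_to_burst(uint32_t len, uint32_t brst_size_field)
{
    return (len & (burst_size(brst_size_field) - 1)) == 0;
}
```

With a field value of 3, the decoded burst size is 8, so treating the raw field as the size would understate it.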

dma memcpy performance on Rockchip RV1126
@ 1512MHz A7, 1056MHz LPDDR3, 200MHz DMA:

dmatest:

/# echo dma0chan0 > /sys/module/dmatest/parameters/channel
/# echo 4194304 > /sys/module/dmatest/parameters/test_buf_size
/# echo 8 > /sys/module/dmatest/parameters/iterations
/# echo y > /sys/module/dmatest/parameters/norandom
/# echo y > /sys/module/dmatest/parameters/verbose
/# echo 1 > /sys/module/dmatest/parameters/run

dmatest: dma0chan0-copy0: result #1: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #2: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #3: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #4: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #5: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #6: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #7: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #8: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000

Before:

  dmatest: dma0chan0-copy0: summary 8 tests, 0 failures 48 iops 200338 KB/s (0)

After this patch:

  dmatest: dma0chan0-copy0: summary 8 tests, 0 failures 179 iops 734873 KB/s (0)

After this patch, with the DMA clock increased to 400MHz:

  dmatest: dma0chan0-copy0: summary 8 tests, 0 failures 259 iops 1062929 KB/s (0)

Signed-off-by: Sugar Zhang <sugar.zhang@rock-chips.com>
Link: https://lore.kernel.org/r/1605326106-55681-1-git-send-email-sugar.zhang@rock-chips.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
pix106 pushed a commit to pix106/android_kernel_xiaomi that referenced this issue Dec 30, 2020
[ Upstream commit 4a9d81c ]

If an element is deleted while the list is being iterated, the
iteration falls into an endless loop.

kernel: NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [nfsd:17137]

PID: 17137  TASK: ffff8818d93c0000  CPU: 4   COMMAND: "nfsd"
    [exception RIP: __state_in_grace+76]
    RIP: ffffffffc00e817c  RSP: ffff8818d3aefc98  RFLAGS: 00000246
    RAX: ffff881dc0c38298  RBX: ffffffff81b03580  RCX: ffff881dc02c9f50
    RDX: ffff881e3fce8500  RSI: 0000000000000001  RDI: ffffffff81b03580
    RBP: ffff8818d3aefca0   R8: 0000000000000020   R9: ffff8818d3aefd40
    R10: ffff88017fc03800  R11: ffff8818e83933c0  R12: ffff8818d3aefd40
    R13: 0000000000000000  R14: ffff8818e8391068  R15: ffff8818fa6e4000
    CS: 0010  SS: 0018
 #0 [ffff8818d3aefc98] opens_in_grace at ffffffffc00e81e3 [grace]
 #1 [ffff8818d3aefca8] nfs4_preprocess_stateid_op at ffffffffc02a3e6c [nfsd]
 #2 [ffff8818d3aefd18] nfsd4_write at ffffffffc028ed5b [nfsd]
 #3 [ffff8818d3aefd80] nfsd4_proc_compound at ffffffffc0290a0d [nfsd]
 #4 [ffff8818d3aefdd0] nfsd_dispatch at ffffffffc027b800 [nfsd]
 #5 [ffff8818d3aefe08] svc_process_common at ffffffffc02017f3 [sunrpc]
 #6 [ffff8818d3aefe70] svc_process at ffffffffc0201ce3 [sunrpc]
 #7 [ffff8818d3aefe98] nfsd at ffffffffc027b117 [nfsd]
 #8 [ffff8818d3aefec8] kthread at ffffffff810b88c1
 #9 [ffff8818d3aeff50] ret_from_fork at ffffffff816d1607

The troublemaking element:
crash> lock_manager ffff881dc0c38298
struct lock_manager {
  list = {
    next = 0xffff881dc0c38298,
    prev = 0xffff881dc0c38298
  },
  block_opens = false
}
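
The dump shows an element whose next and prev both point at itself. A minimal user-space sketch of why a walker standing on such an element never terminates follows. The list helpers are local reimplementations written for this example (assumed to behave like a circular doubly linked list with a self-initialising removal), and the step limit exists only so the demonstration itself can return.

```c
#include <stdbool.h>
#include <stddef.h>

struct list_node {
    struct list_node *next, *prev;
};

/* An empty head, or a detached node, points at itself. */
static void list_init(struct list_node *n)
{
    n->next = n->prev = n;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink n and leave it self-pointing, like the dumped element. */
static void list_del_init(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

/*
 * Follow ->next from pos until head is reached. A real iteration has no
 * step limit; max_steps only bounds this demonstration.
 */
static bool walk_reaches_head(const struct list_node *pos,
                              const struct list_node *head,
                              size_t max_steps)
{
    for (size_t i = 0; i < max_steps; i++) {
        if (pos == head)
            return true;
        pos = pos->next;
    }
    return false;
}
```

If a walker is positioned on an element at the moment another context removes it this way, every further `->next` step returns the same element and the termination test against the head never succeeds.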

Fixes: c87fb4a ("lockd: NLM grace period shouldn't block NFSv4 opens")
Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
SakthivelNadar pushed a commit to SakthivelNadar/android_kernel_redmi_mt6768 that referenced this issue Jan 1, 2021
Move the loop-invariant calculation of 'cpu' in do_idle() out of the loop body,
because the current CPU is always constant.

This improves the generated code both on x86-64 and ARM64:

x86-64:

Before patch (execution in loop):
	864:       0f ae e8                lfence
	867:       65 8b 05 c2 38 f1 7e    mov %gs:0x7ef138c2(%rip),%eax
	86e:       89 c0                   mov %eax,%eax
	870:       48 0f a3 05 68 19 08    bt  %rax,0x1081968(%rip)
	877:	   01

After patch (execution in loop):
	872:       0f ae e8                lfence
	875:       4c 0f a3 25 63 19 08    bt  %r12,0x1081963(%rip)
	87c:       01

ARM64:

Before patch (execution in loop):
	c58:       d5033d9f        dsb     ld
	c5c:       d538d080        mrs     x0, tpidr_el1
	c60:       b8606a61        ldr     w1, [x19,x0]
	c64:       1100fc20        add     w0, w1, #0x3f
	c68:       7100003f        cmp     w1, #0x0
	c6c:       1a81b000        csel    w0, w0, w1, lt
	c70:       13067c00        asr     w0, w0, #6
	c74:       93407c00        sxtw    x0, w0
	c78:       f8607a80        ldr     x0, [x20,x0,lsl #3]
	c7c:       9ac12401        lsr     x1, x0, x1
	c80:       36000581        tbz     w1, #0, d30 <do_idle+0x128>

After patch (execution in loop):
	c84:       d5033d9f        dsb     ld
	c88:       f9400260        ldr     x0, [x19]
	c8c:       ea14001f        tst     x0, x20
	c90:       54000580        b.eq    d40 <do_idle+0x138>
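
The transformation behind these listings can be sketched in portable C. This is an illustration only: `current_cpu()` is a hypothetical stand-in for the per-CPU lookup, and the call counter exists just to make the difference observable.

```c
/* Counts how often the (hypothetical) lookup runs. */
static int lookup_calls;

/* Stand-in for "which CPU am I on"; the answer never changes here. */
static unsigned int current_cpu(void)
{
    lookup_calls++;
    return 3;
}

/* Before: the invariant value is recomputed on every iteration. */
static unsigned int idle_loop_before(unsigned long offline_mask,
                                     unsigned int iterations)
{
    unsigned int hits = 0;

    for (unsigned int i = 0; i < iterations; i++) {
        unsigned int cpu = current_cpu();

        if (offline_mask & (1UL << cpu))
            hits++;
    }
    return hits;
}

/* After: the value is computed once, outside the loop. */
static unsigned int idle_loop_after(unsigned long offline_mask,
                                    unsigned int iterations)
{
    unsigned int cpu = current_cpu();
    unsigned int hits = 0;

    for (unsigned int i = 0; i < iterations; i++) {
        if (offline_mask & (1UL << cpu))
            hits++;
    }
    return hits;
}
```

Both versions return the same result; the second simply keeps the loop body free of the repeated lookup, which matches the shorter in-loop sequences shown above.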

Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
[ Rewrote the title and the changelog. ]
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: huawei.libin@huawei.com
Cc: xiexiuqi@huawei.com
Link: http://lkml.kernel.org/r/1508930907-107755-1-git-send-email-cj.chengjian@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: celtare21 <celtare21@gmail.com>
Signed-off-by: sohamxda7 <sensoham135@gmail.com>
mi-code pushed a commit that referenced this issue Mar 3, 2021
https://bugzilla.kernel.org/show_bug.cgi?id=208565

PID: 257    TASK: ecdd0000  CPU: 0   COMMAND: "init"
  #0 [<c0b420ec>] (__schedule) from [<c0b423c8>]
  #1 [<c0b423c8>] (schedule) from [<c0b459d4>]
  #2 [<c0b459d4>] (rwsem_down_read_failed) from [<c0b44fa0>]
  #3 [<c0b44fa0>] (down_read) from [<c044233c>]
  #4 [<c044233c>] (f2fs_truncate_blocks) from [<c0442890>]
  #5 [<c0442890>] (f2fs_truncate) from [<c044d408>]
  #6 [<c044d408>] (f2fs_evict_inode) from [<c030be18>]
  #7 [<c030be18>] (evict) from [<c030a558>]
  #8 [<c030a558>] (iput) from [<c047c600>]
  #9 [<c047c600>] (f2fs_sync_node_pages) from [<c0465414>]
 #10 [<c0465414>] (f2fs_write_checkpoint) from [<c04575f4>]
 #11 [<c04575f4>] (f2fs_sync_fs) from [<c0441918>]
 #12 [<c0441918>] (f2fs_do_sync_file) from [<c0441098>]
 #13 [<c0441098>] (f2fs_sync_file) from [<c0323fa0>]
 #14 [<c0323fa0>] (vfs_fsync_range) from [<c0324294>]
 #15 [<c0324294>] (do_fsync) from [<c0324014>]
 #16 [<c0324014>] (sys_fsync) from [<c0108bc0>]

This can be caused by flush_dirty_inode() in f2fs_sync_node_pages() where
iput() requires f2fs_lock_op() again resulting in livelock.

Change-Id: I5d7ef35a21cdb074e7bf5288371f579bfc0eb19d
Reported-by: Zhiguo Niu <Zhiguo.Niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Git-commit: b0f3b87
Git-repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/
Signed-off-by: Sayali Lokhande <sayalil@codeaurora.org>
regiesoriano pushed a commit to regiesoriano/android_kernel_xiaomi_jasmine_sprout that referenced this issue Apr 10, 2021
[ Upstream commit 829933e ]

For each device, the nosy driver allocates a pcilynx structure.
A use-after-free might happen in the following scenario:

 1. Open nosy device for the first time and call ioctl with command
    NOSY_IOC_START, then a new client A will be malloced and added to
    doubly linked list.
 2. Open nosy device for the second time and call ioctl with command
    NOSY_IOC_START, then a new client B will be malloced and added to
    doubly linked list.
 3. Call ioctl with command NOSY_IOC_START for client A, then client A
    will be readded to the doubly linked list. Now the doubly linked
    list is messed up.
 4. Close the first nosy device and nosy_release will be called. In
    nosy_release, client A will be unlinked and freed.
 5. Close the second nosy device, and client A will be referenced,
    resulting in UAF.

The root cause of this bug is that the element in the doubly linked list
is reentered into the list.

Fix this bug by adding a check before inserting a client.  If a client
is already in the linked list, don't insert it.
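
A minimal sketch of that guard, assuming a circular doubly linked list where a detached node points at itself. The list helpers and `client_start()` are hypothetical names written for this example; in the driver the check would have to sit inside the list lock, which is omitted here.

```c
#include <stdbool.h>

struct list_node {
    struct list_node *next, *prev;
};

static void list_init(struct list_node *n)
{
    n->next = n->prev = n;
}

/* A node that points at itself is not on any list. */
static bool list_empty(const struct list_node *n)
{
    return n->next == n;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

struct client {
    struct list_node link;
};

/* Returns true if the client was inserted, false if it was already linked. */
static bool client_start(struct client *c, struct list_node *client_list)
{
    if (!list_empty(&c->link))
        return false;           /* second START on the same client: skip */
    list_add_tail(&c->link, client_list);
    return true;
}
```

A repeated start request on client A then leaves the list as `head -> A -> B -> head` instead of corrupting the links.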

The following KASAN report reveals it:

   BUG: KASAN: use-after-free in nosy_release+0x1ea/0x210
   Write of size 8 at addr ffff888102ad7360 by task poc
   CPU: 3 PID: 337 Comm: poc Not tainted 5.12.0-rc5+ #6
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
   Call Trace:
     nosy_release+0x1ea/0x210
     __fput+0x1e2/0x840
     task_work_run+0xe8/0x180
     exit_to_user_mode_prepare+0x114/0x120
     syscall_exit_to_user_mode+0x1d/0x40
     entry_SYSCALL_64_after_hwframe+0x44/0xae

   Allocated by task 337:
     nosy_open+0x154/0x4d0
     misc_open+0x2ec/0x410
     chrdev_open+0x20d/0x5a0
     do_dentry_open+0x40f/0xe80
     path_openat+0x1cf9/0x37b0
     do_filp_open+0x16d/0x390
     do_sys_openat2+0x11d/0x360
     __x64_sys_open+0xfd/0x1a0
     do_syscall_64+0x33/0x40
     entry_SYSCALL_64_after_hwframe+0x44/0xae

   Freed by task 337:
     kfree+0x8f/0x210
     nosy_release+0x158/0x210
     __fput+0x1e2/0x840
     task_work_run+0xe8/0x180
     exit_to_user_mode_prepare+0x114/0x120
     syscall_exit_to_user_mode+0x1d/0x40
     entry_SYSCALL_64_after_hwframe+0x44/0xae

   The buggy address belongs to the object at ffff888102ad7300 which belongs to the cache kmalloc-128 of size 128
   The buggy address is located 96 bytes inside of 128-byte region [ffff888102ad7300, ffff888102ad7380)

[ Modified to use 'list_empty()' inside proper lock  - Linus ]

Link: https://lore.kernel.org/lkml/1617433116-5930-1-git-send-email-zheyuma97@gmail.com/
Reported-and-tested-by: 马哲宇 (Zheyu Ma) <zheyuma97@gmail.com>
Signed-off-by: Zheyu Ma <zheyuma97@gmail.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
pascua28 pushed a commit to pascua28/msm-4.14-copy that referenced this issue May 7, 2021
Move the loop-invariant calculation of 'cpu' in do_idle() out of the loop body,
because the current CPU is always constant.

This improves the generated code both on x86-64 and ARM64:

x86-64:

Before patch (execution in loop):
	864:       0f ae e8                lfence
	867:       65 8b 05 c2 38 f1 7e    mov %gs:0x7ef138c2(%rip),%eax
	86e:       89 c0                   mov %eax,%eax
	870:       48 0f a3 05 68 19 08    bt  %rax,0x1081968(%rip)
	877:	   01

After patch (execution in loop):
	872:       0f ae e8                lfence
	875:       4c 0f a3 25 63 19 08    bt  %r12,0x1081963(%rip)
	87c:       01

ARM64:

Before patch (execution in loop):
	c58:       d5033d9f        dsb     ld
	c5c:       d538d080        mrs     x0, tpidr_el1
	c60:       b8606a61        ldr     w1, [x19,x0]
	c64:       1100fc20        add     w0, w1, #0x3f
	c68:       7100003f        cmp     w1, #0x0
	c6c:       1a81b000        csel    w0, w0, w1, lt
	c70:       13067c00        asr     w0, w0, #6
	c74:       93407c00        sxtw    x0, w0
	c78:       f8607a80        ldr     x0, [x20,x0,lsl #3]
	c7c:       9ac12401        lsr     x1, x0, x1
	c80:       36000581        tbz     w1, #0, d30 <do_idle+0x128>

After patch (execution in loop):
	c84:       d5033d9f        dsb     ld
	c88:       f9400260        ldr     x0, [x19]
	c8c:       ea14001f        tst     x0, x20
	c90:       54000580        b.eq    d40 <do_idle+0x138>

Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
[ Rewrote the title and the changelog. ]
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: huawei.libin@huawei.com
Cc: xiexiuqi@huawei.com
Link: http://lkml.kernel.org/r/1508930907-107755-1-git-send-email-cj.chengjian@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Samuel Pascua <pascua.samuel.14@gmail.com>
pascua28 pushed a commit to pascua28/msm-4.14-copy that referenced this issue May 10, 2021
pascua28 pushed a commit to pascua28/msm-4.14-copy that referenced this issue May 27, 2021
SakthivelNadar pushed a commit to SakthivelNadar/android_kernel_redmi_mt6768 that referenced this issue Sep 2, 2021
Upstream commit 0d0c8de.

When option CONFIG_KASAN is enabled together with ftrace, function
ftrace_graph_caller() gets into a recursion, via functions
kasan_check_read() and kasan_check_write().

 Breakpoint 2, ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
 179             mcount_get_pc             x0    //     function's pc
 (gdb) bt
 #0  ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
 #1  0xffffff90101406c8 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151
 #2  0xffffff90106fd084 in kasan_check_write (p=0xffffffc06c170878, size=4) at ../mm/kasan/common.c:105
 #3  0xffffff90104a2464 in atomic_add_return (v=<optimized out>, i=<optimized out>) at ./include/generated/atomic-instrumented.h:71
 #4  atomic_inc_return (v=<optimized out>) at ./include/generated/atomic-fallback.h:284
 #5  trace_graph_entry (trace=0xffffffc03f5ff380) at ../kernel/trace/trace_functions_graph.c:441
 #6  0xffffff9010481774 in trace_graph_entry_watchdog (trace=<optimized out>) at ../kernel/trace/trace_selftest.c:741
 #7  0xffffff90104a185c in function_graph_enter (ret=<optimized out>, func=<optimized out>, frame_pointer=18446743799894897728, retp=<optimized out>) at ../kernel/trace/trace_functions_graph.c:196
 #8  0xffffff9010140628 in prepare_ftrace_return (self_addr=18446743592948977792, parent=0xffffffc03f5ff418, frame_pointer=18446743799894897728) at ../arch/arm64/kernel/ftrace.c:231
 #9  0xffffff90101406f4 in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182
 Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 (gdb)

Rework so that the kasan implementation isn't traced.

Link: http://lkml.kernel.org/r/20181212183447.15890-1-anders.roxell@linaro.org
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Change-Id: Ia8874ccdfcca676f6dc480d6e62f197ee1fc6594
Bug: 128674696
mi-code pushed a commit that referenced this issue Aug 10, 2022
If an atomic update fails, intel_crtc->atomic may still hold values left
over from the last atomic check. One example is atomic->wait_for_vblank,
which results in spurious errors in kms_flip.
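
One way to avoid stale scratch state of this kind is to clear it at the start of every check. The sketch below shows that idea only; the struct layout, field names other than wait_for_vblank, and `crtc_check()` are hypothetical and do not reproduce the i915 code or the actual patch.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical scratch state filled in by a check and consumed by a commit. */
struct commit_scratch {
    bool wait_for_vblank;
    bool update_wm;
};

struct crtc {
    struct commit_scratch atomic;
};

/*
 * Check step: returns 0 on success, -1 on failure. It may set flags
 * before it fails, which is how stale values come about.
 */
static int crtc_check(struct crtc *crtc, bool needs_vblank, bool fail)
{
    /* Start clean so a failed earlier check leaves nothing behind. */
    memset(&crtc->atomic, 0, sizeof(crtc->atomic));

    if (needs_vblank)
        crtc->atomic.wait_for_vblank = true;

    return fail ? -1 : 0;
}
```

With the reset in place, a later successful check that does not need a vblank wait cannot inherit the flag from an earlier failed one.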

[ 1551.892708] ------------[ cut here ]------------
[ 1551.892721] WARNING: CPU: 3 PID: 4179 at ../drivers/gpu/drm/drm_irq.c:1199 drm_wait_one_vblank+0x197/0x1a0 [drm]()
[ 1551.892722] vblank not available on crtc 2, ret=-22
[ 1551.892751] Modules linked in: snd_hda_intel i915 drm_kms_helper drm
intel_gtt i2c_algo_bit cfbfillrect syscopyarea cfbimgblt sysfillrect
sysimgblt fb_sys_fops cfbcopyarea agpgart cfg80211 binfmt_misc
snd_hda_codec_hdmi intel_rapl iosf_mbi x86_pkg_temp_thermal coretemp
kvm_intel snd_hda_codec_realtek kvm snd_hda_codec_generic iTCO_wdt
aesni_intel aes_x86_64 glue_helper lrw snd_hda_codec gf128mul
ablk_helper cryptd snd_hwdep psmouse snd_hda_core pcspkr snd_pcm
snd_timer snd lpc_ich i2c_i801 mfd_core soundcore wmi evdev [last
unloaded: drm]
[ 1551.892753] CPU: 3 PID: 4179 Comm: kms_pipe_crc_ba Tainted: G     U  W       4.3.0-reg+ #6
[ 1551.892754] Hardware name:                  /DZ77BH-55K, BIOS BHZ7710H.86A.0100.2013.0517.0942 05/17/2013
[ 1551.892758]  ffffffffa03128d8 ffff8800cec73890 ffffffff812c0f3c ffff8800cec738d8
[ 1551.892760]  ffff8800cec738c8 ffffffff8104ff36 ffff880116ae2290 0000000000000002
[ 1551.892762]  ffff8800d39fcda0 ffff8800d038b4d0 ffff8800d42b5550 ffff8800cec73928
[ 1551.892763] Call Trace:
[ 1551.892768]  [<ffffffff812c0f3c>] dump_stack+0x4e/0x82
[ 1551.892771]  [<ffffffff8104ff36>] warn_slowpath_common+0x86/0xc0
[ 1551.892773]  [<ffffffff8104ffbc>] warn_slowpath_fmt+0x4c/0x50
[ 1551.892781]  [<ffffffffa02e6708>] ? drm_vblank_get+0x78/0xd0 [drm]
[ 1551.892787]  [<ffffffffa02e6d47>] drm_wait_one_vblank+0x197/0x1a0 [drm]
[ 1551.892813]  [<ffffffffa03d052f>] intel_post_plane_update+0xef/0x120 [i915]
[ 1551.892832]  [<ffffffffa03d11d2>] intel_atomic_commit+0x4c2/0x1600 [i915]
[ 1551.892862]  [<ffffffffa02ff0c7>] ? drm_atomic_check_only+0x147/0x5e0 [drm]
[ 1551.892872]  [<ffffffffa02feeb7>] ? drm_atomic_add_affected_connectors+0x27/0xf0 [drm]
[ 1551.892881]  [<ffffffffa02ff597>] drm_atomic_commit+0x37/0x60 [drm]
[ 1551.892887]  [<ffffffffa034301a>] restore_fbdev_mode+0x28a/0x2c0 [drm_kms_helper]
[ 1551.892895]  [<ffffffffa0345253>] drm_fb_helper_restore_fbdev_mode_unlocked+0x33/0x80 [drm_kms_helper]
[ 1551.892900]  [<ffffffffa03452cd>] drm_fb_helper_set_par+0x2d/0x50 [drm_kms_helper]
[ 1551.892920]  [<ffffffffa03e7a9a>] intel_fbdev_set_par+0x1a/0x60 [i915]
[ 1551.892923]  [<ffffffff8131a5a7>] fb_set_var+0x1a7/0x3f0
[ 1551.892927]  [<ffffffff8109732f>] ? trace_hardirqs_on_caller+0x12f/0x1c0
[ 1551.892931]  [<ffffffff81314f32>] fbcon_blank+0x212/0x2f0
[ 1551.892935]  [<ffffffff81373f4a>] do_unblank_screen+0xba/0x1d0
[ 1551.892937]  [<ffffffff8136b725>] vt_ioctl+0x13d5/0x1450
[ 1551.892940]  [<ffffffff8107cdd1>] ? preempt_count_sub+0x41/0x50
[ 1551.892943]  [<ffffffff8135d8a3>] tty_ioctl+0x423/0xe30
[ 1551.892947]  [<ffffffff8119f721>] do_vfs_ioctl+0x301/0x560
[ 1551.892949]  [<ffffffff8119b1e3>] ? putname+0x53/0x60
[ 1551.892952]  [<ffffffff811ab376>] ? __fget_light+0x66/0x90
[ 1551.892955]  [<ffffffff8119f9f9>] SyS_ioctl+0x79/0x90
[ 1551.892958]  [<ffffffff81552e97>] entry_SYSCALL_64_fastpath+0x12/0x6f
[ 1551.892961] ---[ end trace 3e764d4b6628c91c ]---

Testcase: kms_flip
Reported-and-tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: stable@vger.kernel.org #v4.3
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Reviewed-by: Daniel Stone <daniels@collabora.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/5649C2BA.6080300@mblankhorst.nl
mi-code pushed a commit that referenced this issue Aug 10, 2022
We try to convert the old way of specifying fb tiling (obj->tiling)
into the new fb modifiers. We store the result in the passed in mode_cmd
structure. But that structure comes directly from the addfb2 ioctl, and
gets copied back out to userspace, which means we're clobbering the
modifiers that the user provided (all 0 since the DRM_MODE_FB_MODIFIERS
flag wasn't even set by the user). Hence if the user reuses the struct
for another addfb2, the ioctl will be rejected since it's now asking for
some modifiers w/o the flag set.

Fix the problem by making a copy of the user provided structure. We can
play any games we want with the copy.
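
The copy-before-modify approach can be sketched as follows. Everything here is a simplified assumption: `struct fb_cmd`, the flag value, and the modifier constant are hypothetical stand-ins, not the DRM definitions.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the addfb2 request structure. */
struct fb_cmd {
    uint32_t flags;
    uint64_t modifier[4];
};

#define FB_CMD_HAS_MODIFIERS 0x2u   /* hypothetical flag value */
#define FB_MODIFIER_X_TILED  1u     /* hypothetical modifier value */

/*
 * Returns the modifier the framebuffer ends up with. The caller's
 * request is const and is never written to.
 */
static uint64_t create_fb(const struct fb_cmd *user_cmd, int legacy_x_tiled)
{
    struct fb_cmd cmd = *user_cmd;  /* private copy we are free to edit */

    if (!(cmd.flags & FB_CMD_HAS_MODIFIERS) && legacy_x_tiled) {
        cmd.modifier[0] = FB_MODIFIER_X_TILED;
        cmd.flags |= FB_CMD_HAS_MODIFIERS;
    }
    return cmd.modifier[0];
}
```

Because the structure that is later copied back to userspace is untouched, a caller can reuse the same request for a second call without tripping the modifiers-without-flag check.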

IGT-Version: 1.12-git (x86_64) (Linux: 4.4.0-rc1-stereo+ x86_64)
...
Subtest basic-X-tiled: SUCCESS (0.001s)
Test assertion failure function pitch_tests, file kms_addfb_basic.c:167:
Failed assertion: drmIoctl(fd, DRM_IOCTL_MODE_ADDFB2, &f) == 0
Last errno: 22, Invalid argument
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [pitch_tests+0x619]
  #2 [__real_main426+0x2f]
  #3 [main+0x23]
  #4 [__libc_start_main+0xf0]
  #5 [_start+0x29]
  #6 [<unknown>+0x29]
  Subtest framebuffer-vs-set-tiling failed.
  **** DEBUG ****
  Test assertion failure function pitch_tests, file kms_addfb_basic.c:167:
  Failed assertion: drmIoctl(fd, DRM_IOCTL_MODE_ADDFB2, &f) == 0
  Last errno: 22, Invalid argument
  ****  END  ****
  Subtest framebuffer-vs-set-tiling: FAIL (0.003s)
  ...

IGT-Version: 1.12-git (x86_64) (Linux: 4.4.0-rc1-stereo+ x86_64)
Subtest framebuffer-vs-set-tiling: SUCCESS (0.000s)

Cc: stable@vger.kernel.org # v4.1+
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Fixes: 2a80ead ("drm/i915: Add fb format modifier support")
Testcase: igt/kms_addfb_basic/clobbered-modifier
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1447261890-3960-1-git-send-email-ville.syrjala@linux.intel.com
mi-code pushed a commit that referenced this issue Aug 10, 2022
The machine hangs completely with the following message on the console:

[  487.777538] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[  487.777554] IP: [<ffffffff8158aaee>] _raw_spin_lock+0xe/0x30
[  487.777557] PGD 42e9f7067 PUD 42f2fa067 PMD 0
[  487.777560] Oops: 0002 [#1] SMP
...
[  487.777618] CPU: 21 PID: 3190 Comm: Xorg Tainted: G            E   4.4.0-rc1-3-default+ #6
[  487.777620] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0059.R00.1501081238 01/08/2015
[  487.777621] task: ffff880853ae4680 ti: ffff8808696d4000 task.ti: ffff8808696d4000
[  487.777625] RIP: 0010:[<ffffffff8158aaee>]  [<ffffffff8158aaee>] _raw_spin_lock+0xe/0x30
[  487.777627] RSP: 0018:ffff8808696d79c0  EFLAGS: 00010246
[  487.777628] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  487.777629] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000060
[  487.777630] RBP: ffff8808696d79e0 R08: 0000000000000000 R09: ffff88086924a780
[  487.777631] R10: 000000000001bb40 R11: 0000000000003246 R12: 0000000000000000
[  487.777632] R13: ffff880463a27360 R14: ffff88046ca50218 R15: 0000000000000080
[  487.777634] FS:  00007f3f81c5a8c0(0000) GS:ffff88086f060000(0000) knlGS:0000000000000000
[  487.777635] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  487.777636] CR2: 0000000000000060 CR3: 000000042e678000 CR4: 00000000001406e0
[  487.777638] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  487.777639] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  487.777639] Stack:
[  487.777642]  ffffffffa00eb5fa ffff8808696d7b60 ffff88086b87d800 0000000000000000
[  487.777644]  ffff8808696d7ac8 ffffffffa01694b6 ffff8808696d7ae8 ffffffff8109c8d5
[  487.777647]  ffff880469158740 ffff880463a27000 ffff88086b87d800 ffff88086b87d800
[  487.777647] Call Trace:
[  487.777674]  [<ffffffffa00eb5fa>] ? drm_gem_object_lookup+0x1a/0xa0 [drm]
[  487.777681]  [<ffffffffa01694b6>] mga_crtc_cursor_set+0xc6/0xb60 [mgag200]
[  487.777691]  [<ffffffff8109c8d5>] ? find_busiest_group+0x35/0x4a0
[  487.777696]  [<ffffffff81086294>] ? __might_sleep+0x44/0x80
[  487.777699]  [<ffffffff815888c2>] ? __ww_mutex_lock+0x22/0x9c
[  487.777722]  [<ffffffffa0104f64>] ? drm_modeset_lock+0x34/0xf0 [drm]
[  487.777733]  [<ffffffffa0148d9e>] restore_fbdev_mode+0xee/0x2a0 [drm_kms_helper]
[  487.777742]  [<ffffffffa014afce>] drm_fb_helper_restore_fbdev_mode_unlocked+0x2e/0x70 [drm_kms_helper]
[  487.777748]  [<ffffffffa014b037>] drm_fb_helper_set_par+0x27/0x50 [drm_kms_helper]
[  487.777752]  [<ffffffff8134560c>] fb_set_var+0x18c/0x3f0
[  487.777777]  [<ffffffffa02a9b0a>] ? __ext4_handle_dirty_metadata+0x8a/0x210 [ext4]
[  487.777783]  [<ffffffff8133cb97>] fbcon_blank+0x1b7/0x2b0
[  487.777790]  [<ffffffff813be2a3>] do_unblank_screen+0xb3/0x1c0
[  487.777795]  [<ffffffff813b5aba>] vt_ioctl+0x118a/0x1210
[  487.777801]  [<ffffffff813a8fe0>] tty_ioctl+0x3f0/0xc90
[  487.777808]  [<ffffffff81172018>] ? kzfree+0x28/0x30
[  487.777813]  [<ffffffff811e053f>] ? mntput+0x1f/0x30
[  487.777817]  [<ffffffff811d3f5d>] do_vfs_ioctl+0x30d/0x570
[  487.777822]  [<ffffffff8107ed3a>] ? task_work_run+0x8a/0xa0
[  487.777825]  [<ffffffff811d4234>] SyS_ioctl+0x74/0x80
[  487.777829]  [<ffffffff8158aeae>] entry_SYSCALL_64_fastpath+0x12/0x71
[  487.777851] Code: 65 ff 0d ce 02 a8 7e 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 e8 b0 01 5d c3 0f 1f 00 65 ff 05 b1 02 a8 7e 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 4e f5 b1 ff 5d
[  487.777854] RIP  [<ffffffff8158aaee>] _raw_spin_lock+0xe/0x30
[  487.777855]  RSP <ffff8808696d79c0>
[  487.777856] CR2: 0000000000000060
[  487.777860] ---[ end trace 672a2cd555e0ebd3 ]---

The cursor code may be entered with file_priv == NULL && handle == NULL.
The problem was introduced by:

"bf89209 drm/mga200g: Hold a proper reference for cursor_set"

which calls drm_gem_object_lookup(dev, file_priv...) before the handle is
checked. Previously this wasn't a problem because the handle was checked first.
Moving the check earlier in the function fixes the problem.
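
The ordering described above can be sketched in plain C. This is only an illustration under stated assumptions: `file_ctx`, `gem_object`, `object_lookup()` and `cursor_set()` are hypothetical stand-ins, not the mgag200 or DRM code, and a zero handle is simply treated as "nothing to look up".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the DRM types; not the real kernel structures. */
struct file_ctx { int id; };
struct gem_object { uint32_t handle; };

static int lookup_calls; /* counts lookups so the early return is observable */

/* Stand-in for a lookup that dereferences file_priv, so it must not be NULL. */
static struct gem_object *object_lookup(struct file_ctx *file_priv, uint32_t handle)
{
    static struct gem_object obj;

    lookup_calls++;
    (void)file_priv->id;   /* the dereference that would fault on NULL */
    obj.handle = handle;
    return &obj;
}

/* Returns 0 on success, -1 on error. */
static int cursor_set(struct file_ctx *file_priv, uint32_t handle)
{
    /* Check first: this path may be entered with no handle and no file_priv. */
    if (!handle || !file_priv)
        return 0;

    return object_lookup(file_priv, handle) ? 0 : -1;
}
```

With the check placed before the lookup, `cursor_set(NULL, 0)` returns without ever reaching the code that dereferences `file_priv`.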

Signed-off-by: Rui Wang <rui.y.wang@intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Dave Airlie <airlied@redhat.com>
mi-code pushed a commit that referenced this issue Aug 10, 2022
OMAP CPU hotplug uses cpu1's clocks and power domains to wake CPU1 up
from low power states (or to turn CPU1 on). This code is also
part of system suspend (disable_nonboot_cpus()).
On the other hand, cpu1's clocks and power domains are used by CPUIdle. All of
the above functionality is mutually exclusive and, therefore, the lockless
clkdm/pwrdm API can be used in omap4_boot_secondary().

This fixes the backtrace below on -RT, which is triggered by
pwrdm_lock/unlock():

BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
 in_atomic(): 1, irqs_disabled(): 0, pid: 118, name: sh
 9 locks held by sh/118:
  #0:  (sb_writers#4){.+.+.+}, at: [<c0144a6c>] vfs_write+0x13c/0x164
  #1:  (&of->mutex){+.+.+.}, at: [<c01b4c70>] kernfs_fop_write+0x48/0x19c
  #2:  (s_active#24){.+.+.+}, at: [<c01b4c78>] kernfs_fop_write+0x50/0x19c
  #3:  (device_hotplug_lock){+.+.+.}, at: [<c03cbff0>] lock_device_hotplug_sysfs+0xc/0x4c
  #4:  (&dev->mutex){......}, at: [<c03cd284>] device_online+0x14/0x88
  #5:  (cpu_add_remove_lock){+.+.+.}, at: [<c003af90>] cpu_up+0x50/0x1a0
  #6:  (cpu_hotplug.lock){++++++}, at: [<c003ae48>] cpu_hotplug_begin+0x0/0xc4
  #7:  (cpu_hotplug.lock#2){+.+.+.}, at: [<c003aec0>] cpu_hotplug_begin+0x78/0xc4
  #8:  (boot_lock){+.+...}, at: [<c002b254>] omap4_boot_secondary+0x1c/0x178
 Preemption disabled at:[<  (null)>]   (null)

 CPU: 0 PID: 118 Comm: sh Not tainted 4.1.12-rt11-01998-gb4a62c3-dirty #137
 Hardware name: Generic DRA74X (Flattened Device Tree)
 [<c0017574>] (unwind_backtrace) from [<c0013be8>] (show_stack+0x10/0x14)
 [<c0013be8>] (show_stack) from [<c05a8670>] (dump_stack+0x80/0x94)
 [<c05a8670>] (dump_stack) from [<c05ad158>] (rt_spin_lock+0x24/0x54)
 [<c05ad158>] (rt_spin_lock) from [<c0030dac>] (clkdm_wakeup+0x10/0x2c)
 [<c0030dac>] (clkdm_wakeup) from [<c002b2c0>] (omap4_boot_secondary+0x88/0x178)
 [<c002b2c0>] (omap4_boot_secondary) from [<c0015d00>] (__cpu_up+0xc4/0x164)
 [<c0015d00>] (__cpu_up) from [<c003b09c>] (cpu_up+0x15c/0x1a0)
 [<c003b09c>] (cpu_up) from [<c03cd2d4>] (device_online+0x64/0x88)
 [<c03cd2d4>] (device_online) from [<c03cd360>] (online_store+0x68/0x74)
 [<c03cd360>] (online_store) from [<c01b4ce0>] (kernfs_fop_write+0xb8/0x19c)
 [<c01b4ce0>] (kernfs_fop_write) from [<c0144124>] (__vfs_write+0x20/0xd8)
 [<c0144124>] (__vfs_write) from [<c01449c0>] (vfs_write+0x90/0x164)
 [<c01449c0>] (vfs_write) from [<c01451e4>] (SyS_write+0x44/0x9c)
 [<c01451e4>] (SyS_write) from [<c0010240>] (ret_fast_syscall+0x0/0x54)
 CPU1: smp_ops.cpu_die() returned, trying to resuscitate

Cc: Tero Kristo <t-kristo@ti.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
mi-code pushed a commit that referenced this issue Oct 22, 2022
Or Gerlitz says:

====================
net/mlx5_core: Enhance flow steering support

v0 --> v1 changes:
  - fixed improperly formatted comments.
  - compare value of ib_spec->eth.mask.ether_type in network byte order
     in ('IB/mlx5: Add flow steering utilities').

v1 --> v2 changes:
  - made sure that service functions added in the IB driver are only static-fied
    on the last commit, to make sure bisection with -Werror works fine.

v2 --> v3 changes:
   - squashed patches 11 and 12 into one patch, so that Dave's comment
     about gcc complaints on unused static functions during bisection is
     correctly addressed.

v3 has been generated against net-next commit c9c9931 "Merge tag
'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge"

The series is signed by Matan, who was recently assigned as a maintainer for
the mlx5_core and IB drivers (this is a 4.5-rc1 change to the maintainers file coming
from the rdma tree) -- as such I didn't see a need to add my signature (Or).

This series adds three new functionalities to the driver flow-steering
infrastructure: auto-grouped flow tables, chaining of flow tables and
updates for the root flow table.

1. Auto-grouped flow tables - Flow tables with automatic group management.
When a flow table is created, hints regarding the number of rule types
and the number of rules are given in advance. Thus, a flow table is
divided into #NUM_TYPES+1 groups, each containing
(#NUM_RULES)/(#NUM_TYPES+1) rules. The first #NUM_TYPES parts are groups
which are filled if the added rule matches the group specification or
the group is empty. The last part is filled by rules that can't fit
any of the former groups.

2. Chaining flow tables - Flow tables from different priorities are chained
together: if there is no match in the flow table of priority i, we continue
searching for a match in priority i+1. This holds whether or not priorities
i and i+1 belong to the same namespace.

3. Updating the root flow table - the root flow table is the flow table
with the lowest level. The hardware starts searching for a match in the
root flow table and continues according to the matches it finds along
the way.
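
The group arithmetic in item 1 and the fall-through search in item 2 can be sketched as follows. This is a simplified model, not the mlx5 implementation: `struct flow_table`, `rules_per_group()` and `chained_lookup()` are hypothetical names, and each table here matches only a single integer value.

```c
#include <stddef.h>

/* Rules per group when a table with num_rules entries is split into
 * num_types + 1 groups (the extra group takes rules that fit nowhere else). */
static unsigned int rules_per_group(unsigned int num_rules, unsigned int num_types)
{
    return num_rules / (num_types + 1);
}

/* Hypothetical flow table: matches one value, links to the next priority. */
struct flow_table {
    int match_value;
    int priority;
    struct flow_table *next;
};

/* Start at the root table and fall through to the next priority on a miss.
 * Returns the priority of the first matching table, or -1 if none matches. */
static int chained_lookup(const struct flow_table *root, int value)
{
    const struct flow_table *ft;

    for (ft = root; ft != NULL; ft = ft->next)
        if (ft->match_value == value)
            return ft->priority;
    return -1;
}
```

For example, with hints of 1024 rules and 3 rule types, `rules_per_group(1024, 3)` gives 4 groups of 256 rules each.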

The first usage for the new functionality is flow steering for user-space
ConnectX-4 offloaded HW Eth RX queues done through the mlx5 IB driver.

When the mlx5 core driver is loaded, it opens three flow namespaces:
1. By-pass namespace (used by mlx5 IB driver).
2. Kernel namespace (used in order to get packets to the networking stack
through mlx5 EN driver).
3. Leftovers namespace (used by mlx5 IB and future sniffer)

The series is built as follows:

Patch #1 introduces auto-grouped flow tables support.

Patch #2 adds utility functions for finding the next and the previous
flow tables in different priorities. This is used in order to chain
the flow tables in a downstream patch.

Patch #3 introduces a firmware command for updating the root flow table.

Patch #4 introduces the modify flow table firmware command, which is used
when we want to change the next flow table of an existing flow table.
This is used for chaining flow tables as well.

Patch #5 connects/disconnects flow tables. This is the actual chaining
step, used when we want to link flow tables: if we can't
find a match in the first flow table, we continue in the chained
flow table.

Patch #6 updates the priority attributes that are required for flow table
level allocation. We update both the max_fts (the number of allowed FTs
in the sub-tree of this priority) and the start_level (which is the first
level we'll assign to the flow-tables created inside the priority).

Patch #7 adds checking of required device capabilities. Some namespaces
can only be created if the hardware supports certain attributes.
This is especially true for the Bypass and Leftovers namespaces. This
adds a generic mechanism to check these required attributes.

Patch #8 creates two additional namespaces:
	a. Bypass flow rules (has nine priorities)
	b. Leftovers packets (has one priority) - for unmatched packets.

Patch #9 refactors the ipv4/ipv6 match fields in the mlx5 firmware interface
header to be clearer.

Patch #10 exports the flow steering API for mlx5_ib usage.

Patch #11 implements the required support in mlx5_ib in order
to support the RDMA flow steering verbs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
hectorvax pushed a commit to hectorvax/Xiaomi_Kernel_OpenSource that referenced this issue Dec 31, 2022
[ Upstream commit 9b38cc7 ]

Ziqian reported a lockup when adding a kretprobe on _raw_spin_lock_irqsave.
My test was also able to trigger lockdep output:

 ============================================
 WARNING: possible recursive locking detected
 5.6.0-rc6+ #6 Not tainted
 --------------------------------------------
 sched-messaging/2767 is trying to acquire lock:
 ffffffff9a492798 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_hash_lock+0x52/0xa0

 but task is already holding lock:
 ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&(kretprobe_table_locks[i].lock));
   lock(&(kretprobe_table_locks[i].lock));

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 1 lock held by sched-messaging/2767:
  #0: ffffffff9a491a18 (&(kretprobe_table_locks[i].lock)){-.-.}, at: kretprobe_trampoline+0x0/0x50

 stack backtrace:
 CPU: 3 PID: 2767 Comm: sched-messaging Not tainted 5.6.0-rc6+ #6
 Call Trace:
  dump_stack+0x96/0xe0
  __lock_acquire.cold.57+0x173/0x2b7
  ? native_queued_spin_lock_slowpath+0x42b/0x9e0
  ? lockdep_hardirqs_on+0x590/0x590
  ? __lock_acquire+0xf63/0x4030
  lock_acquire+0x15a/0x3d0
  ? kretprobe_hash_lock+0x52/0xa0
  _raw_spin_lock_irqsave+0x36/0x70
  ? kretprobe_hash_lock+0x52/0xa0
  kretprobe_hash_lock+0x52/0xa0
  trampoline_handler+0xf8/0x940
  ? kprobe_fault_handler+0x380/0x380
  ? find_held_lock+0x3a/0x1c0
  kretprobe_trampoline+0x25/0x50
  ? lock_acquired+0x392/0xbc0
  ? _raw_spin_lock_irqsave+0x50/0x70
  ? __get_valid_kprobe+0x1f0/0x1f0
  ? _raw_spin_unlock_irqrestore+0x3b/0x40
  ? finish_task_switch+0x4b9/0x6d0
  ? __switch_to_asm+0x34/0x70
  ? __switch_to_asm+0x40/0x70

The code within the kretprobe handler checks for probe reentrancy,
so we won't trigger any _raw_spin_lock_irqsave probe in there.

The problem is outside the handler, in kprobe_flush_task, where we call:

  kprobe_flush_task
    kretprobe_table_lock
      raw_spin_lock_irqsave
        _raw_spin_lock_irqsave

where _raw_spin_lock_irqsave triggers the kretprobe and installs the
kretprobe_trampoline handler on the _raw_spin_lock_irqsave return.

The kretprobe_trampoline handler is then executed with kretprobe_table_locks
already locked, and the first thing it does is to
lock kretprobe_table_locks ;-) The whole lockup path looks like:

  kprobe_flush_task
    kretprobe_table_lock
      raw_spin_lock_irqsave
        _raw_spin_lock_irqsave ---> probe triggered, kretprobe_trampoline installed

        ---> kretprobe_table_locks locked

        kretprobe_trampoline
          trampoline_handler
            kretprobe_hash_lock(current, &head, &flags);  <--- deadlock

Add kprobe_busy_begin/end helpers that mark code with a fake
probe installed, to prevent triggering of another kprobe within
this code.

Use these helpers in kprobe_flush_task, so that the probe recursion
protection check is hit and the probe is never set, which prevents the
above lockup.
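
The idea of marking a region as "already inside a probe" can be modelled in user space. This is a rough sketch, not the kprobes code: `probe_running`, `busy_begin()`, `busy_end()`, `probe_hit()` and `flush_task()` are hypothetical names, and a plain global stands in for per-CPU state.

```c
#include <stdbool.h>

/* Hypothetical "a probe is currently running" marker. */
static bool probe_running;
static int probes_fired;

/* Mark the following code as if a (fake) probe were already active. */
static void busy_begin(void) { probe_running = true; }
static void busy_end(void)   { probe_running = false; }

/* Called whenever a probed function is hit. Returns true if the handler ran. */
static bool probe_hit(void)
{
    if (probe_running)
        return false;       /* recursion protection: do not fire a nested probe */

    probe_running = true;
    probes_fired++;         /* the probe handler body would run here */
    probe_running = false;
    return true;
}

/* Stand-in for a flush path that takes a lock through a probed function. */
static bool flush_task(void)
{
    bool fired;

    busy_begin();
    fired = probe_hit();    /* the probed lock function is reached here */
    busy_end();
    return fired;
}
```

Outside the marked region `probe_hit()` fires as usual; inside `flush_task()` it is skipped, so no handler runs while the lock is being taken.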

Link: http://lkml.kernel.org/r/158927059835.27680.7011202830041561604.stgit@devnote2

Fixes: ef53d9c ("kprobes: improve kretprobe scalability with hashed locking")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Gustavo A . R . Silva" <gustavoars@kernel.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Reported-by: "Ziqian SUN (Zamir)" <zsun@redhat.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
hectorvax pushed a commit to hectorvax/Xiaomi_Kernel_OpenSource that referenced this issue Dec 31, 2022
[ Upstream commit 8523c00 ]

After entering kdb due to a breakpoint, when we execute 'ss' or 'go' (which
will delay installing breakpoints and do a single-step first), it won't work
correctly, and it will enter kdb due to an oops.

This is because the reason obtained in kdb_stub() is not as expected, and it
seems that the ex_vector for single-step should be 0, as the powerpc/sh/parisc
architectures have implemented.

Before the patch:
Entering kdb (current=0xffff8000119e2dc0, pid 0) on processor 0 due to Keyboard Entry
[0]kdb> bp printk
Instruction(i) BP #0 at 0xffff8000101486cc (printk)
    is enabled   addr at ffff8000101486cc, hardtype=0 installed=0

[0]kdb> g

/ # echo h > /proc/sysrq-trigger

Entering kdb (current=0xffff0000fa878040, pid 266) on processor 3 due to Breakpoint @ 0xffff8000101486cc
[3]kdb> ss

Entering kdb (current=0xffff0000fa878040, pid 266) on processor 3 Oops: (null)
due to oops @ 0xffff800010082ab8
CPU: 3 PID: 266 Comm: sh Not tainted 5.7.0-rc4-13839-gf0e5ad491718 #6
Hardware name: linux,dummy-virt (DT)
pstate: 00000085 (nzcv daIf -PAN -UAO)
pc : el1_irq+0x78/0x180
lr : __handle_sysrq+0x80/0x190
sp : ffff800015003bf0
x29: ffff800015003d20 x28: ffff0000fa878040
x27: 0000000000000000 x26: ffff80001126b1f0
x25: ffff800011b6a0d8 x24: 0000000000000000
x23: 0000000080200005 x22: ffff8000101486cc
x21: ffff800015003d30 x20: 0000ffffffffffff
x19: ffff8000119f2000 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000
x15: 0000000000000000 x14: 0000000000000000
x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000
x9 : 0000000000000000 x8 : ffff800015003e50
x7 : 0000000000000002 x6 : 00000000380b9990
x5 : ffff8000106e99e8 x4 : ffff0000fadd83c0
x3 : 0000ffffffffffff x2 : ffff800011b6a0d8
x1 : ffff800011b6a000 x0 : ffff80001130c9d8
Call trace:
 el1_irq+0x78/0x180
 printk+0x0/0x84
 write_sysrq_trigger+0xb0/0x118
 proc_reg_write+0xb4/0xe0
 __vfs_write+0x18/0x40
 vfs_write+0xb0/0x1b8
 ksys_write+0x64/0xf0
 __arm64_sys_write+0x14/0x20
 el0_svc_common.constprop.2+0xb0/0x168
 do_el0_svc+0x20/0x98
 el0_sync_handler+0xec/0x1a8
 el0_sync+0x140/0x180

[3]kdb>

After the patch:
Entering kdb (current=0xffff8000119e2dc0, pid 0) on processor 0 due to Keyboard Entry
[0]kdb> bp printk
Instruction(i) BP #0 at 0xffff8000101486cc (printk)
    is enabled   addr at ffff8000101486cc, hardtype=0 installed=0

[0]kdb> g

/ # echo h > /proc/sysrq-trigger

Entering kdb (current=0xffff0000fa852bc0, pid 268) on processor 0 due to Breakpoint @ 0xffff8000101486cc
[0]kdb> g

Entering kdb (current=0xffff0000fa852bc0, pid 268) on processor 0 due to Breakpoint @ 0xffff8000101486cc
[0]kdb> ss

Entering kdb (current=0xffff0000fa852bc0, pid 268) on processor 0 due to SS trap @ 0xffff800010082ab8
[0]kdb>

Fixes: 44679a4 ("arm64: KGDB: Add step debugging support")
Signed-off-by: Wei Li <liwei391@huawei.com>
Tested-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200509214159.19680-2-liwei391@huawei.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
hectorvax pushed a commit to hectorvax/Xiaomi_Kernel_OpenSource that referenced this issue Jan 3, 2023
[ Upstream commit e24c644 ]

I compiled with AddressSanitizer and I had these memory leaks while I
was using the tep_parse_format function:

    Direct leak of 28 byte(s) in 4 object(s) allocated from:
        #0 0x7fb07db49ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
        #1 0x7fb07a724228 in extend_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:985
        #2 0x7fb07a724c21 in __read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1140
        #3 0x7fb07a724f78 in read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1206
        #4 0x7fb07a725191 in __read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1291
        #5 0x7fb07a7251df in read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1299
        #6 0x7fb07a72e6c8 in process_dynamic_array_len /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:2849
        #7 0x7fb07a7304b8 in process_function /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3161
        #8 0x7fb07a730900 in process_arg_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3207
        #9 0x7fb07a727c0b in process_arg /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1786
        #10 0x7fb07a731080 in event_read_print_args /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3285
        #11 0x7fb07a731722 in event_read_print /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3369
        #12 0x7fb07a740054 in __tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6335
        #13 0x7fb07a74047a in __parse_event /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6389
        #14 0x7fb07a740536 in tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6431
        #15 0x7fb07a785acf in parse_event ../../../src/fs-src/fs.c:251
        #16 0x7fb07a785ccd in parse_systems ../../../src/fs-src/fs.c:284
        #17 0x7fb07a786fb3 in read_metadata ../../../src/fs-src/fs.c:593
        #18 0x7fb07a78760e in ftrace_fs_source_init ../../../src/fs-src/fs.c:727
        #19 0x7fb07d90c19c in add_component_with_init_method_data ../../../../src/lib/graph/graph.c:1048
        #20 0x7fb07d90c87b in add_source_component_with_initialize_method_data ../../../../src/lib/graph/graph.c:1127
        #21 0x7fb07d90c92a in bt_graph_add_source_component ../../../../src/lib/graph/graph.c:1152
        #22 0x55db11aa632e in cmd_run_ctx_create_components_from_config_components ../../../src/cli/babeltrace2.c:2252
        #23 0x55db11aa6fda in cmd_run_ctx_create_components ../../../src/cli/babeltrace2.c:2347
        #24 0x55db11aa780c in cmd_run ../../../src/cli/babeltrace2.c:2461
        #25 0x55db11aa8a7d in main ../../../src/cli/babeltrace2.c:2673
        #26 0x7fb07d5460b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

The token variable in the process_dynamic_array_len function is
allocated in the read_expect_type function, but is not freed before
calling the read_token function.

Free the token variable before calling read_token in order to plug the
leak.
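
The leak pattern and its fix can be shown with a small stand-alone sketch. The helpers below (`read_token_sketch()`, `free_token_sketch()`, `read_two_tokens()`) and the `live_tokens` counter are hypothetical and only mimic the shape of the problem; they are not the libtraceevent functions.

```c
#include <stdlib.h>
#include <string.h>

static int live_tokens; /* allocations not yet freed, for illustration only */

/* Hypothetical reader: stores a freshly allocated copy of text in *tok. */
static int read_token_sketch(char **tok, const char *text)
{
    size_t len = strlen(text) + 1;

    *tok = malloc(len);
    if (*tok == NULL)
        return -1;
    memcpy(*tok, text, len);
    live_tokens++;
    return 0;
}

static void free_token_sketch(char *tok)
{
    if (tok != NULL) {
        free(tok);
        live_tokens--;
    }
}

/* Reads two tokens in a row through the same pointer. Returns 0 on success. */
static int read_two_tokens(void)
{
    char *token = NULL;

    if (read_token_sketch(&token, "(") < 0)
        return -1;

    /* Free before reading again: the next read overwrites the only pointer
     * to the first allocation, which would otherwise leak. */
    free_token_sketch(token);
    token = NULL;

    if (read_token_sketch(&token, "len") < 0)
        return -1;
    free_token_sketch(token);
    return 0;
}
```

If the first `free_token_sketch()` call is dropped, `live_tokens` stays at 1 after the function returns, which is the kind of direct leak a sanitizer reports.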

Signed-off-by: Philippe Duplessis-Guindon <pduplessis@efficios.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/linux-trace-devel/20200730150236.5392-1-pduplessis@efficios.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
hectorvax pushed a commit to hectorvax/Xiaomi_Kernel_OpenSource that referenced this issue Jan 3, 2023
[ Upstream commit d26383d ]

The following leaks were detected by ASAN:

  Indirect leak of 360 byte(s) in 9 object(s) allocated from:
    #0 0x7fecc305180e in calloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10780e)
    #1 0x560578f6dce5 in perf_pmu__new_format util/pmu.c:1333
    #2 0x560578f752fc in perf_pmu_parse util/pmu.y:59
    #3 0x560578f6a8b7 in perf_pmu__format_parse util/pmu.c:73
    #4 0x560578e07045 in test__pmu tests/pmu.c:155
    #5 0x560578de109b in run_test tests/builtin-test.c:410
    #6 0x560578de109b in test_and_print tests/builtin-test.c:440
    #7 0x560578de401a in __cmd_test tests/builtin-test.c:661
    #8 0x560578de401a in cmd_test tests/builtin-test.c:807
    #9 0x560578e49354 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:312
    #10 0x560578ce71a8 in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:364
    #11 0x560578ce71a8 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:408
    #12 0x560578ce71a8 in main /home/namhyung/project/linux/tools/perf/perf.c:538
    #13 0x7fecc2b7acc9 in __libc_start_main ../csu/libc-start.c:308

Fixes: cff7f95 ("perf tests: Move pmu tests into separate object")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20200915031819.386559-12-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
YumeMichi pushed a commit to YumeMichi/Xiaomi_Kernel_OpenSource that referenced this issue Jun 24, 2023
ARM64 doesn't implement find_first_{zero}_bit in arch code and doesn't
enable it in a config. This leads to using find_next_bit(), which is less
efficient:

0000000000000000 <find_first_bit>:
   0:	aa0003e4 	mov	x4, x0
   4:	aa0103e0 	mov	x0, x1
   8:	b4000181 	cbz	x1, 38 <find_first_bit+0x38>
   c:	f9400083 	ldr	x3, [x4]
  10:	d2800802 	mov	x2, #0x40                  	// #64
  14:	91002084 	add	x4, x4, #0x8
  18:	b40000c3 	cbz	x3, 30 <find_first_bit+0x30>
  1c:	14000008 	b	3c <find_first_bit+0x3c>
  20:	f8408483 	ldr	x3, [x4], #8
  24:	91010045 	add	x5, x2, #0x40
  28:	b50000c3 	cbnz	x3, 40 <find_first_bit+0x40>
  2c:	aa0503e2 	mov	x2, x5
  30:	eb02001f 	cmp	x0, x2
  34:	54ffff68 	b.hi	20 <find_first_bit+0x20>  // b.pmore
  38:	d65f03c0 	ret
  3c:	d2800002 	mov	x2, #0x0                   	// #0
  40:	dac00063 	rbit	x3, x3
  44:	dac01063 	clz	x3, x3
  48:	8b020062 	add	x2, x3, x2
  4c:	eb02001f 	cmp	x0, x2
  50:	9a829000 	csel	x0, x0, x2, ls  // ls = plast
  54:	d65f03c0 	ret

  ...

0000000000000118 <_find_next_bit.constprop.1>:
 118:	eb02007f 	cmp	x3, x2
 11c:	540002e2 	b.cs	178 <_find_next_bit.constprop.1+0x60>  // b.hs, b.nlast
 120:	d346fc66 	lsr	x6, x3, #6
 124:	f8667805 	ldr	x5, [x0, x6, lsl #3]
 128:	b4000061 	cbz	x1, 134 <_find_next_bit.constprop.1+0x1c>
 12c:	f8667826 	ldr	x6, [x1, x6, lsl #3]
 130:	8a0600a5 	and	x5, x5, x6
 134:	ca0400a6 	eor	x6, x5, x4
 138:	92800005 	mov	x5, #0xffffffffffffffff    	// #-1
 13c:	9ac320a5 	lsl	x5, x5, x3
 140:	927ae463 	and	x3, x3, #0xffffffffffffffc0
 144:	ea0600a5 	ands	x5, x5, x6
 148:	54000120 	b.eq	16c <_find_next_bit.constprop.1+0x54>  // b.none
 14c:	1400000e 	b	184 <_find_next_bit.constprop.1+0x6c>
 150:	d346fc66 	lsr	x6, x3, #6
 154:	f8667805 	ldr	x5, [x0, x6, lsl #3]
 158:	b4000061 	cbz	x1, 164 <_find_next_bit.constprop.1+0x4c>
 15c:	f8667826 	ldr	x6, [x1, x6, lsl #3]
 160:	8a0600a5 	and	x5, x5, x6
 164:	eb05009f 	cmp	x4, x5
 168:	540000c1 	b.ne	180 <_find_next_bit.constprop.1+0x68>  // b.any
 16c:	91010063 	add	x3, x3, #0x40
 170:	eb03005f 	cmp	x2, x3
 174:	54fffee8 	b.hi	150 <_find_next_bit.constprop.1+0x38>  // b.pmore
 178:	aa0203e0 	mov	x0, x2
 17c:	d65f03c0 	ret
 180:	ca050085 	eor	x5, x4, x5
 184:	dac000a5 	rbit	x5, x5
 188:	dac010a5 	clz	x5, x5
 18c:	8b0300a3 	add	x3, x5, x3
 190:	eb03005f 	cmp	x2, x3
 194:	9a839042 	csel	x2, x2, x3, ls  // ls = plast
 198:	aa0203e0 	mov	x0, x2
 19c:	d65f03c0 	ret

 ...

0000000000000238 <find_next_bit>:
 238:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
 23c:	aa0203e3 	mov	x3, x2
 240:	d2800004 	mov	x4, #0x0                   	// #0
 244:	aa0103e2 	mov	x2, x1
 248:	910003fd 	mov	x29, sp
 24c:	d2800001 	mov	x1, #0x0                   	// #0
 250:	97ffffb2 	bl	118 <_find_next_bit.constprop.1>
 254:	a8c17bfd 	ldp	x29, x30, [sp], #16
 258:	d65f03c0 	ret

Enabling find_{first,next}_bit() would also benefit for_each_{set,clear}_bit().
On A-53, find_first_bit() is almost twice as fast as find_next_bit(), according
to lib/find_bit_benchmark (thanks to Alexey for testing):

GENERIC_FIND_FIRST_BIT=n:
[7126084.948181] find_first_bit:               47389224 ns,  16357 iterations
[7126085.032315] find_first_bit:               19048193 ns,    655 iterations

GENERIC_FIND_FIRST_BIT=y:
[   84.158068] find_first_bit:               27193319 ns,  16406 iterations
[   84.233005] find_first_bit:               11082437 ns,    656 iterations

GENERIC_FIND_FIRST_BIT=n also bloats the kernel, even though it disables
generation of find_{first,next}_bit():

        yury:linux$ scripts/bloat-o-meter vmlinux vmlinux.ffb
        add/remove: 4/1 grow/shrink: 19/251 up/down: 564/-1692 (-1128)
        ...

Overall, GENERIC_FIND_FIRST_BIT=n is harmful both in terms of performance and
code size, and it's better to have GENERIC_FIND_FIRST_BIT enabled.
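
For readers unfamiliar with the operation, a dedicated first-bit search can be sketched in portable C as below. This is a minimal illustration, not the kernel's implementation: `find_first_bit_sketch()` and `BITS_PER_LONG_SKETCH` are hypothetical names, and `__builtin_ctzl()` is assumed to be available (it is a GCC/Clang builtin).

```c
#include <limits.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Return the index of the first set bit in the bitmap, or size if none is set.
 * size is the bitmap length in bits. */
static unsigned long find_first_bit_sketch(const unsigned long *addr,
                                           unsigned long size)
{
    unsigned long idx;

    for (idx = 0; idx * BITS_PER_LONG_SKETCH < size; idx++) {
        if (addr[idx] != 0) {
            /* Count trailing zeros to locate the lowest set bit in this word. */
            unsigned long bit = idx * BITS_PER_LONG_SKETCH
                              + (unsigned long)__builtin_ctzl(addr[idx]);
            return bit < size ? bit : size;
        }
    }
    return size;
}
```

Because the search always starts at bit 0, there is no start offset to mask off and no second bitmap to combine, which is the extra work visible in the `_find_next_bit` disassembly above.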

Tested-by: Alexey Klimov <aklimov@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210225135700.1381396-2-yury.norov@gmail.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: atndko <z1281552865@gmail.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Idbea6884a499eb41bec524e583af5fd11c7600d2
YumeMichi pushed a commit to YumeMichi/Xiaomi_Kernel_OpenSource that referenced this issue Jul 12, 2023
ARM64 doesn't implement find_first_{zero}_bit in arch code and doesn't
enable it in a config. This leads to using find_next_bit(), which is less
efficient:

0000000000000000 <find_first_bit>:
   0:	aa0003e4 	mov	x4, x0
   4:	aa0103e0 	mov	x0, x1
   8:	b4000181 	cbz	x1, 38 <find_first_bit+0x38>
   c:	f9400083 	ldr	x3, [x4]
  10:	d2800802 	mov	x2, #0x40                  	// #64
  14:	91002084 	add	x4, x4, #0x8
  18:	b40000c3 	cbz	x3, 30 <find_first_bit+0x30>
  1c:	14000008 	b	3c <find_first_bit+0x3c>
  20:	f8408483 	ldr	x3, [x4], #8
  24:	91010045 	add	x5, x2, #0x40
  28:	b50000c3 	cbnz	x3, 40 <find_first_bit+0x40>
  2c:	aa0503e2 	mov	x2, x5
  30:	eb02001f 	cmp	x0, x2
  34:	54ffff68 	b.hi	20 <find_first_bit+0x20>  // b.pmore
  38:	d65f03c0 	ret
  3c:	d2800002 	mov	x2, #0x0                   	// #0
  40:	dac00063 	rbit	x3, x3
  44:	dac01063 	clz	x3, x3
  48:	8b020062 	add	x2, x3, x2
  4c:	eb02001f 	cmp	x0, x2
  50:	9a829000 	csel	x0, x0, x2, ls  // ls = plast
  54:	d65f03c0 	ret

  ...

0000000000000118 <_find_next_bit.constprop.1>:
 118:	eb02007f 	cmp	x3, x2
 11c:	540002e2 	b.cs	178 <_find_next_bit.constprop.1+0x60>  // b.hs, b.nlast
 120:	d346fc66 	lsr	x6, x3, #6
 124:	f8667805 	ldr	x5, [x0, x6, lsl #3]
 128:	b4000061 	cbz	x1, 134 <_find_next_bit.constprop.1+0x1c>
 12c:	f8667826 	ldr	x6, [x1, x6, lsl #3]
 130:	8a0600a5 	and	x5, x5, x6
 134:	ca0400a6 	eor	x6, x5, x4
 138:	92800005 	mov	x5, #0xffffffffffffffff    	// #-1
 13c:	9ac320a5 	lsl	x5, x5, x3
 140:	927ae463 	and	x3, x3, #0xffffffffffffffc0
 144:	ea0600a5 	ands	x5, x5, x6
 148:	54000120 	b.eq	16c <_find_next_bit.constprop.1+0x54>  // b.none
 14c:	1400000e 	b	184 <_find_next_bit.constprop.1+0x6c>
 150:	d346fc66 	lsr	x6, x3, #6
 154:	f8667805 	ldr	x5, [x0, x6, lsl #3]
 158:	b4000061 	cbz	x1, 164 <_find_next_bit.constprop.1+0x4c>
 15c:	f8667826 	ldr	x6, [x1, x6, lsl #3]
 160:	8a0600a5 	and	x5, x5, x6
 164:	eb05009f 	cmp	x4, x5
 168:	540000c1 	b.ne	180 <_find_next_bit.constprop.1+0x68>  // b.any
 16c:	91010063 	add	x3, x3, #0x40
 170:	eb03005f 	cmp	x2, x3
 174:	54fffee8 	b.hi	150 <_find_next_bit.constprop.1+0x38>  // b.pmore
 178:	aa0203e0 	mov	x0, x2
 17c:	d65f03c0 	ret
 180:	ca050085 	eor	x5, x4, x5
 184:	dac000a5 	rbit	x5, x5
 188:	dac010a5 	clz	x5, x5
 18c:	8b0300a3 	add	x3, x5, x3
 190:	eb03005f 	cmp	x2, x3
 194:	9a839042 	csel	x2, x2, x3, ls  // ls = plast
 198:	aa0203e0 	mov	x0, x2
 19c:	d65f03c0 	ret

 ...

0000000000000238 <find_next_bit>:
 238:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
 23c:	aa0203e3 	mov	x3, x2
 240:	d2800004 	mov	x4, #0x0                   	// #0
 244:	aa0103e2 	mov	x2, x1
 248:	910003fd 	mov	x29, sp
 24c:	d2800001 	mov	x1, #0x0                   	// #0
 250:	97ffffb2 	bl	118 <_find_next_bit.constprop.1>
 254:	a8c17bfd 	ldp	x29, x30, [sp], #16
 258:	d65f03c0 	ret

Enabling find_{first,next}_bit() would also benefit for_each_{set,clear}_bit().
On A-53, find_first_bit() is almost twice as fast as find_next_bit(), according
to lib/find_bit_benchmark (thanks to Alexey for testing):

GENERIC_FIND_FIRST_BIT=n:
[7126084.948181] find_first_bit:               47389224 ns,  16357 iterations
[7126085.032315] find_first_bit:               19048193 ns,    655 iterations

GENERIC_FIND_FIRST_BIT=y:
[   84.158068] find_first_bit:               27193319 ns,  16406 iterations
[   84.233005] find_first_bit:               11082437 ns,    656 iterations

GENERIC_FIND_FIRST_BIT=n also bloats the kernel, even though it disables
generation of find_{first,next}_bit():

        yury:linux$ scripts/bloat-o-meter vmlinux vmlinux.ffb
        add/remove: 4/1 grow/shrink: 19/251 up/down: 564/-1692 (-1128)
        ...

Overall, GENERIC_FIND_FIRST_BIT=n is harmful both in terms of performance and
code size, and it's better to have GENERIC_FIND_FIRST_BIT enabled.

Tested-by: Alexey Klimov <aklimov@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210225135700.1381396-2-yury.norov@gmail.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: atndko <z1281552865@gmail.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Idbea6884a499eb41bec524e583af5fd11c7600d2
YumeMichi pushed a commit to YumeMichi/Xiaomi_Kernel_OpenSource that referenced this issue Jul 12, 2023
ARM64 doesn't implement find_first_{zero}_bit in arch code and doesn't
enable it in a config. It leads to using find_next_bit() which is less
efficient:

0000000000000000 <find_first_bit>:
   0:	aa0003e4 	mov	x4, x0
   4:	aa0103e0 	mov	x0, x1
   8:	b4000181 	cbz	x1, 38 <find_first_bit+0x38>
   c:	f9400083 	ldr	x3, [x4]
  10:	d2800802 	mov	x2, #0x40                  	// MiCode#64
  14:	91002084 	add	x4, x4, #0x8
  18:	b40000c3 	cbz	x3, 30 <find_first_bit+0x30>
  1c:	14000008 	b	3c <find_first_bit+0x3c>
  20:	f8408483 	ldr	x3, [x4], MiCode#8
  24:	91010045 	add	x5, x2, #0x40
  28:	b50000c3 	cbnz	x3, 40 <find_first_bit+0x40>
  2c:	aa0503e2 	mov	x2, x5
  30:	eb02001f 	cmp	x0, x2
  34:	54ffff68 	b.hi	20 <find_first_bit+0x20>  // b.pmore
  38:	d65f03c0 	ret
  3c:	d2800002 	mov	x2, #0x0                   	// #0
  40:	dac00063 	rbit	x3, x3
  44:	dac01063 	clz	x3, x3
  48:	8b020062 	add	x2, x3, x2
  4c:	eb02001f 	cmp	x0, x2
  50:	9a829000 	csel	x0, x0, x2, ls  // ls = plast
  54:	d65f03c0 	ret

  ...

0000000000000118 <_find_next_bit.constprop.1>:
 118:	eb02007f 	cmp	x3, x2
 11c:	540002e2 	b.cs	178 <_find_next_bit.constprop.1+0x60>  // b.hs, b.nlast
 120:	d346fc66 	lsr	x6, x3, MiCode#6
 124:	f8667805 	ldr	x5, [x0, x6, lsl MiCode#3]
 128:	b4000061 	cbz	x1, 134 <_find_next_bit.constprop.1+0x1c>
 12c:	f8667826 	ldr	x6, [x1, x6, lsl MiCode#3]
 130:	8a0600a5 	and	x5, x5, x6
 134:	ca0400a6 	eor	x6, x5, x4
 138:	92800005 	mov	x5, #0xffffffffffffffff    	// #-1
 13c:	9ac320a5 	lsl	x5, x5, x3
 140:	927ae463 	and	x3, x3, #0xffffffffffffffc0
 144:	ea0600a5 	ands	x5, x5, x6
 148:	54000120 	b.eq	16c <_find_next_bit.constprop.1+0x54>  // b.none
 14c:	1400000e 	b	184 <_find_next_bit.constprop.1+0x6c>
 150:	d346fc66 	lsr	x6, x3, MiCode#6
 154:	f8667805 	ldr	x5, [x0, x6, lsl MiCode#3]
 158:	b4000061 	cbz	x1, 164 <_find_next_bit.constprop.1+0x4c>
 15c:	f8667826 	ldr	x6, [x1, x6, lsl MiCode#3]
 160:	8a0600a5 	and	x5, x5, x6
 164:	eb05009f 	cmp	x4, x5
 168:	540000c1 	b.ne	180 <_find_next_bit.constprop.1+0x68>  // b.any
 16c:	91010063 	add	x3, x3, #0x40
 170:	eb03005f 	cmp	x2, x3
 174:	54fffee8 	b.hi	150 <_find_next_bit.constprop.1+0x38>  // b.pmore
 178:	aa0203e0 	mov	x0, x2
 17c:	d65f03c0 	ret
 180:	ca050085 	eor	x5, x4, x5
 184:	dac000a5 	rbit	x5, x5
 188:	dac010a5 	clz	x5, x5
 18c:	8b0300a3 	add	x3, x5, x3
 190:	eb03005f 	cmp	x2, x3
 194:	9a839042 	csel	x2, x2, x3, ls  // ls = plast
 198:	aa0203e0 	mov	x0, x2
 19c:	d65f03c0 	ret

 ...

0000000000000238 <find_next_bit>:
 238:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
 23c:	aa0203e3 	mov	x3, x2
 240:	d2800004 	mov	x4, #0x0                   	// #0
 244:	aa0103e2 	mov	x2, x1
 248:	910003fd 	mov	x29, sp
 24c:	d2800001 	mov	x1, #0x0                   	// #0
 250:	97ffffb2 	bl	118 <_find_next_bit.constprop.1>
 254:	a8c17bfd 	ldp	x29, x30, [sp], #16
 258:	d65f03c0 	ret

Enabling find_{first,next}_bit() would also benefit for_each_{set,clear}_bit().
On A-53, find_first_bit() is almost twice as fast as find_next_bit(), according
to lib/find_bit_benchmark (thanks to Alexey for testing):

GENERIC_FIND_FIRST_BIT=n:
[7126084.948181] find_first_bit:               47389224 ns,  16357 iterations
[7126085.032315] find_first_bit:               19048193 ns,    655 iterations

GENERIC_FIND_FIRST_BIT=y:
[   84.158068] find_first_bit:               27193319 ns,  16406 iterations
[   84.233005] find_first_bit:               11082437 ns,    656 iterations

GENERIC_FIND_FIRST_BIT=n bloats the kernel even though it disables generation
of find_{first,next}_bit():

        yury:linux$ scripts/bloat-o-meter vmlinux vmlinux.ffb
        add/remove: 4/1 grow/shrink: 19/251 up/down: 564/-1692 (-1128)
        ...

Overall, GENERIC_FIND_FIRST_BIT=n is harmful both in terms of performance and
code size, and it's better to have GENERIC_FIND_FIRST_BIT enabled.
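
The difference between the two routines in the disassembly above can be
sketched in portable C. This is a userspace illustration only, not the code
in lib/find_bit.c: the sketch_* names are hypothetical, and it assumes a
GCC- or Clang-compatible compiler for __builtin_ctzl(). The first-bit search
only walks words until it meets a non-zero one, while the next-bit search
must first mask off the bits below the starting offset.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical sketch: index of the first set bit, or size if none. */
static unsigned long sketch_find_first_bit(const unsigned long *addr,
					   unsigned long size)
{
	unsigned long idx;

	assert(addr != NULL);

	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
		if (addr[idx]) {
			/* Count trailing zeros to locate the lowest set bit. */
			unsigned long bit = idx * BITS_PER_LONG +
				(unsigned long)__builtin_ctzl(addr[idx]);
			return bit < size ? bit : size;
		}
	}
	return size;
}

/* Hypothetical sketch: first set bit at or after start, or size if none. */
static unsigned long sketch_find_next_bit(const unsigned long *addr,
					  unsigned long size,
					  unsigned long start)
{
	unsigned long word;

	assert(addr != NULL);

	if (start >= size)
		return size;

	/* Extra work compared to the first-bit case: drop bits below start. */
	word = addr[start / BITS_PER_LONG] & (~0UL << (start % BITS_PER_LONG));
	start -= start % BITS_PER_LONG;

	while (!word) {
		start += BITS_PER_LONG;
		if (start >= size)
			return size;
		word = addr[start / BITS_PER_LONG];
	}

	start += (unsigned long)__builtin_ctzl(word);
	return start < size ? start : size;
}
```

Calling sketch_find_next_bit() with start set to 0 returns the same result as
sketch_find_first_bit(), but only after the masking and offset arithmetic
that the dedicated routine avoids.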

Tested-by: Alexey Klimov <aklimov@redhat.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210225135700.1381396-2-yury.norov@gmail.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: atndko <z1281552865@gmail.com>
Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com>
Change-Id: Idbea6884a499eb41bec524e583af5fd11c7600d2
roniwae pushed a commit to roniwae/xiaomi_kernel that referenced this issue Dec 28, 2023
commit a4b732a upstream.

There is a race between cache device register and cache set unregister.
For an already registered cache device, register_bcache will call
bch_is_open to iterate through all cachesets and check every cache
there. The race occurs if cache_set_free executes at the same time and
clears the caches right before ca is dereferenced in bch_is_open_cache.
To close the race, let's make sure the clean up work is protected by
the bch_register_lock as well.
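
The locking discipline described above can be sketched in userspace C with a
pthread mutex. This is an illustration under stated assumptions, not the
bcache code: every sketch_* name is hypothetical, a single registered set
stands in for the list of cache sets, and the mutex plays the role of
bch_register_lock. The point is that the teardown clears the cache pointer
and unregisters the set within one critical section, so a reader holding
the same lock sees either a fully intact set or no set at all.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a cache device and a cache set. */
struct sketch_cache { int dev_id; };
struct sketch_set { struct sketch_cache *cache; };

/* Plays the role of bch_register_lock in this sketch. */
static pthread_mutex_t sketch_register_lock = PTHREAD_MUTEX_INITIALIZER;

/* The single registered set, or NULL once it has been unregistered. */
static struct sketch_set *sketch_registered;

/* Register path: is dev_id already in use by the registered set? */
static bool sketch_is_open(int dev_id)
{
	bool open = false;

	pthread_mutex_lock(&sketch_register_lock);
	if (sketch_registered)
		open = sketch_registered->cache->dev_id == dev_id;
	pthread_mutex_unlock(&sketch_register_lock);
	return open;
}

/* Unregister path: clear the cache and drop the set under the same lock. */
static void sketch_set_free(struct sketch_set *set)
{
	assert(set != NULL);

	pthread_mutex_lock(&sketch_register_lock);
	set->cache = NULL;
	if (sketch_registered == set)
		sketch_registered = NULL;
	pthread_mutex_unlock(&sketch_register_lock);
}
```

If the cache pointer were cleared outside the lock, sketch_is_open() could
still find the set registered and dereference a NULL cache, which is the
shape of the failure in the oops.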

This issue can be reproduced as follows:
while true; do echo /dev/XXX > /sys/fs/bcache/register; done &
while true; do echo 1 > /sys/block/XXX/bcache/set/unregister; done &

and results in the following oops:

[  +0.000053] BUG: unable to handle kernel NULL pointer dereference at 0000000000000998
[  +0.000457] #PF error: [normal kernel read fault]
[  +0.000464] PGD 800000003ca9d067 P4D 800000003ca9d067 PUD 3ca9c067 PMD 0
[  +0.000388] Oops: 0000 [#1] SMP PTI
[  +0.000269] CPU: 1 PID: 3266 Comm: bash Not tainted 5.0.0+ #6
[  +0.000346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
[  +0.000472] RIP: 0010:register_bcache+0x1829/0x1990 [bcache]
[  +0.000344] Code: b0 48 83 e8 50 48 81 fa e0 e1 10 c0 0f 84 a9 00 00 00 48 89 c6 48 89 ca 0f b7 ba 54 04 00 00 4c 8b 82 60 0c 00 00 85 ff 74 2f <49> 3b a8 98 09 00 00 74 4e 44 8d 47 ff 31 ff 49 c1 e0 03 eb 0d
[  +0.000839] RSP: 0018:ffff92ee804cbd88 EFLAGS: 00010202
[  +0.000328] RAX: ffffffffc010e190 RBX: ffff918b5c6b5000 RCX: ffff918b7d8e0000
[  +0.000399] RDX: ffff918b7d8e0000 RSI: ffffffffc010e190 RDI: 0000000000000001
[  +0.000398] RBP: ffff918b7d318340 R08: 0000000000000000 R09: ffffffffb9bd2d7a
[  +0.000385] R10: ffff918b7eb253c0 R11: ffffb95980f51200 R12: ffffffffc010e1a0
[  +0.000411] R13: fffffffffffffff2 R14: 000000000000000b R15: ffff918b7e232620
[  +0.000384] FS:  00007f955bec2740(0000) GS:ffff918b7eb00000(0000) knlGS:0000000000000000
[  +0.000420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000801] CR2: 0000000000000998 CR3: 000000003cad6000 CR4: 00000000001406e0
[  +0.000837] Call Trace:
[  +0.000682]  ? _cond_resched+0x10/0x20
[  +0.000691]  ? __kmalloc+0x131/0x1b0
[  +0.000710]  kernfs_fop_write+0xfa/0x170
[  +0.000733]  __vfs_write+0x2e/0x190
[  +0.000688]  ? inode_security+0x10/0x30
[  +0.000698]  ? selinux_file_permission+0xd2/0x120
[  +0.000752]  ? security_file_permission+0x2b/0x100
[  +0.000753]  vfs_write+0xa8/0x1a0
[  +0.000676]  ksys_write+0x4d/0xb0
[  +0.000699]  do_syscall_64+0x3a/0xf0
[  +0.000692]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
roniwae pushed a commit to roniwae/xiaomi_kernel that referenced this issue Dec 28, 2023
[ Upstream commit ff612ba ]

We've been seeing the following sporadically throughout our fleet:

panic: kernel BUG at fs/btrfs/relocation.c:4584!
netversion: 5.0-0
Backtrace:
 #0 [ffffc90003adb880] machine_kexec at ffffffff81041da8
 #1 [ffffc90003adb8c8] __crash_kexec at ffffffff8110396c
 #2 [ffffc90003adb988] crash_kexec at ffffffff811048ad
 #3 [ffffc90003adb9a0] oops_end at ffffffff8101c19a
 #4 [ffffc90003adb9c0] do_trap at ffffffff81019114
 #5 [ffffc90003adba00] do_error_trap at ffffffff810195d0
 #6 [ffffc90003adbab0] invalid_op at ffffffff81a00a9b
    [exception RIP: btrfs_reloc_cow_block+692]
    RIP: ffffffff8143b614  RSP: ffffc90003adbb68  RFLAGS: 00010246
    RAX: fffffffffffffff7  RBX: ffff8806b9c32000  RCX: ffff8806aad00690
    RDX: ffff880850b295e0  RSI: ffff8806b9c32000  RDI: ffff88084f205bd0
    RBP: ffff880849415000   R8: ffffc90003adbbe0   R9: ffff88085ac90000
    R10: ffff8805f7369140  R11: 0000000000000000  R12: ffff880850b295e0
    R13: ffff88084f205bd0  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffc90003adbbb0] __btrfs_cow_block at ffffffff813bf1cd
 #8 [ffffc90003adbc28] btrfs_cow_block at ffffffff813bf4b3
 #9 [ffffc90003adbc78] btrfs_search_slot at ffffffff813c2e6c

The way relocation moves data extents is by creating a reloc inode and
preallocating extents in this inode and then copying the data into these
preallocated extents.  Once we've done this for all of our extents,
we'll write out these dirty pages, which marks the extent written, and
goes into btrfs_reloc_cow_block().  From here we get our current
reloc_control, which _should_ match the reloc_control for the current
block group we're relocating.

However, if we get an ENOSPC in this path at some point we'll bail out,
never initiating writeback on this inode.  Not a huge deal, unless we
happen to be doing relocation on a different block group, and this block
group is now rc->stage == UPDATE_DATA_PTRS.  This trips the BUG_ON() in
btrfs_reloc_cow_block(), because we expect to be done modifying the data
inode.  We are in fact done modifying the metadata for the data inode
we're currently using, but not the one from the failed block group, and
thus we BUG_ON().

(This happens when writeback finishes for extents from the previous
group, when we are at btrfs_finish_ordered_io() which updates the data
reloc tree (inode item, drops/adds extent items, etc).)

Fix this by writing out the reloc data inode always, and then breaking
out of the loop after that point to keep from tripping this BUG_ON()
later.
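
The control flow of the fix can be sketched as follows. This is a generic
illustration, not the code in fs/btrfs/relocation.c: the sketch_* names, the
step counter and the fail_at parameter are all hypothetical. An error from
a relocation step is remembered rather than returned immediately, the
writeout runs unconditionally, and only then does the loop break.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the sketch. */
struct sketch_state {
	int flushed;	/* how many times the data inode was written out */
	int steps_done;	/* how many steps completed without error */
};

/* Hypothetical relocation step that fails with ENOSPC at step fail_at. */
static int sketch_relocate_step(int step, int fail_at)
{
	return step == fail_at ? -ENOSPC : 0;
}

/* Hypothetical stand-in for writing out the reloc data inode. */
static void sketch_write_out(struct sketch_state *st)
{
	st->flushed++;
}

static int sketch_relocate(struct sketch_state *st, int nr_steps, int fail_at)
{
	int err = 0;

	assert(st != NULL);

	for (int step = 0; step < nr_steps; step++) {
		int ret = sketch_relocate_step(step, fail_at);

		if (ret < 0)
			err = ret;	/* remember the error, do not bail yet */

		sketch_write_out(st);	/* always write out the data inode */

		if (err)
			break;		/* only now leave the loop */

		st->steps_done++;
	}
	return err;
}
```

With the early return replaced by a deferred break, no dirty data is left
behind for a later pass to trip over.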

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ add note from Filipe ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code pushed a commit that referenced this issue Apr 8, 2024
commit c3ed222 upstream.

Pass the already-allocated fattr along with nfs4_fs_locations, and
drop the memcpy of fattr.  We end up with two more allocations, but this
fixes a crash such as:

PID: 790    TASK: ffff88811b43c000  CPU: 0   COMMAND: "ls"
 #0 [ffffc90000857920] panic at ffffffff81b9bfde
 #1 [ffffc900008579c0] do_trap at ffffffff81023a9b
 #2 [ffffc90000857a10] do_error_trap at ffffffff81023b78
 #3 [ffffc90000857a58] exc_stack_segment at ffffffff81be1f45
 #4 [ffffc90000857a80] asm_exc_stack_segment at ffffffff81c009de
 #5 [ffffc90000857b08] nfs_lookup at ffffffffa0302322 [nfs]
 #6 [ffffc90000857b70] __lookup_slow at ffffffff813a4a5f
 #7 [ffffc90000857c60] walk_component at ffffffff813a86c4
 #8 [ffffc90000857cb8] path_lookupat at ffffffff813a9553
 #9 [ffffc90000857cf0] filename_lookup at ffffffff813ab86b

Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Fixes: 9558a00 ("NFS: Remove the label from the nfs4_lookup_res struct")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mi-code pushed a commit that referenced this issue Apr 8, 2024
commit 4f40a5b upstream.

This was missed in c3ed222 ("NFSv4: Fix free of uninitialized
nfs4_label on referral lookup.") and causes a panic when mounting
with '-o trunkdiscovery':

PID: 1604   TASK: ffff93dac3520000  CPU: 3   COMMAND: "mount.nfs"
 #0 [ffffb79140f738f8] machine_kexec at ffffffffaec64bee
 #1 [ffffb79140f73950] __crash_kexec at ffffffffaeda67fd
 #2 [ffffb79140f73a18] crash_kexec at ffffffffaeda76ed
 #3 [ffffb79140f73a30] oops_end at ffffffffaec2658d
 #4 [ffffb79140f73a50] general_protection at ffffffffaf60111e
    [exception RIP: nfs_fattr_init+0x5]
    RIP: ffffffffc0c18265  RSP: ffffb79140f73b08  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff93dac304a800  RCX: 0000000000000000
    RDX: ffffb79140f73bb0  RSI: ffff93dadc8cbb40  RDI: d03ee11cfaf6bd50
    RBP: ffffb79140f73be8   R8: ffffffffc0691560   R9: 0000000000000006
    R10: ffff93db3ffd3df8  R11: 0000000000000000  R12: ffff93dac4040000
    R13: ffff93dac2848e00  R14: ffffb79140f73b60  R15: ffffb79140f73b30
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb79140f73b08] _nfs41_proc_get_locations at ffffffffc0c73d53 [nfsv4]
 #6 [ffffb79140f73bf0] nfs4_proc_get_locations at ffffffffc0c83e90 [nfsv4]
 #7 [ffffb79140f73c60] nfs4_discover_trunking at ffffffffc0c83fb7 [nfsv4]
 #8 [ffffb79140f73cd8] nfs_probe_fsinfo at ffffffffc0c0f95f [nfs]
 #9 [ffffb79140f73da0] nfs_probe_server at ffffffffc0c1026a [nfs]
    RIP: 00007f6254fce26e  RSP: 00007ffc69496ac8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f6254fce26e
    RDX: 00005600220a82a0  RSI: 00005600220a64d0  RDI: 00005600220a6520
    RBP: 00007ffc69496c50   R8: 00005600220a8710   R9: 003035322e323231
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007ffc69496c50
    R13: 00005600220a8440  R14: 0000000000000010  R15: 0000560020650ef9
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

Fixes: c3ed222 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>