More aggsum optimizations. #12145

Merged 1 commit on Jun 7, 2021
4 changes: 4 additions & 0 deletions include/os/linux/spl/sys/atomic.h
@@ -48,6 +48,8 @@
#define atomic_sub_32_nv(v, i) atomic_sub_return((i), (atomic_t *)(v))
#define atomic_cas_32(v, x, y) atomic_cmpxchg((atomic_t *)(v), x, y)
#define atomic_swap_32(v, x) atomic_xchg((atomic_t *)(v), x)
#define atomic_load_32(v) atomic_read((atomic_t *)(v))
#define atomic_store_32(v, x) atomic_set((atomic_t *)(v), x)
#define atomic_inc_64(v) atomic64_inc((atomic64_t *)(v))
#define atomic_dec_64(v) atomic64_dec((atomic64_t *)(v))
#define atomic_add_64(v, i) atomic64_add((i), (atomic64_t *)(v))
@@ -58,6 +60,8 @@
#define atomic_sub_64_nv(v, i) atomic64_sub_return((i), (atomic64_t *)(v))
#define atomic_cas_64(v, x, y) atomic64_cmpxchg((atomic64_t *)(v), x, y)
#define atomic_swap_64(v, x) atomic64_xchg((atomic64_t *)(v), x)
#define atomic_load_64(v) atomic64_read((atomic64_t *)(v))
#define atomic_store_64(v, x) atomic64_set((atomic64_t *)(v), x)

#ifdef _LP64
static __inline__ void *
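As a side note, a minimal usage sketch (not part of the patch) showing how a caller might use the new 64-bit load/store wrappers for a counter that is read without a lock; the variable and function names are made up for illustration:

/* Illustrative only; assumes the wrappers defined above are in scope. */
static volatile uint64_t example_hits;

static inline void
example_hits_set(uint64_t value)
{
	atomic_store_64(&example_hits, value);
}

static inline uint64_t
example_hits_get(void)
{
	/* A single atomic read, so no torn value even without a lock. */
	return (atomic_load_64(&example_hits));
}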
7 changes: 4 additions & 3 deletions include/sys/aggsum.h
@@ -39,15 +39,16 @@ struct aggsum_bucket {
typedef struct aggsum {
kmutex_t as_lock;
int64_t as_lower_bound;
int64_t as_upper_bound;
uint64_t as_upper_bound;
aggsum_bucket_t *as_buckets ____cacheline_aligned;
uint_t as_numbuckets;
aggsum_bucket_t *as_buckets;
uint_t as_bucketshift;
} aggsum_t;

void aggsum_init(aggsum_t *, uint64_t);
void aggsum_fini(aggsum_t *);
int64_t aggsum_lower_bound(aggsum_t *);
int64_t aggsum_upper_bound(aggsum_t *);
uint64_t aggsum_upper_bound(aggsum_t *);
int aggsum_compare(aggsum_t *, uint64_t);
uint64_t aggsum_value(aggsum_t *);
void aggsum_add(aggsum_t *, int64_t);
13 changes: 13 additions & 0 deletions lib/libspl/atomic.c
@@ -313,6 +313,19 @@ atomic_swap_ptr(volatile void *target, void *bits)
return (__atomic_exchange_n((void **)target, bits, __ATOMIC_SEQ_CST));
}

#ifndef _LP64
uint64_t
atomic_load_64(volatile uint64_t *target)
{
return (__atomic_load_n(target, __ATOMIC_RELAXED));
}

void
atomic_store_64(volatile uint64_t *target, uint64_t bits)
{
return (__atomic_store_n(target, bits, __ATOMIC_RELAXED));
}
#endif

int
atomic_set_long_excl(volatile ulong_t *target, uint_t value)
43 changes: 43 additions & 0 deletions lib/libspl/include/atomic.h
@@ -245,6 +245,49 @@ extern ulong_t atomic_swap_ulong(volatile ulong_t *, ulong_t);
extern uint64_t atomic_swap_64(volatile uint64_t *, uint64_t);
#endif

/*
* Atomically read variable.
*/
#define atomic_load_char(p) (*(volatile uchar_t *)(p))
Contributor:
I know libspl's versions of these functions are okay being a bit loose, but I don't believe that volatile guarantees atomicity; that's just a common implementation side-effect.
How about GCC's __atomic builtins (also provided by Clang)?
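For reference, a minimal sketch of that suggestion, assuming relaxed ordering is sufficient for these wrappers (the names here are illustrative, not the project's):

/*
 * Load/store built on the GCC/Clang __atomic builtins rather than a plain
 * volatile cast, so atomicity is guaranteed by the compiler instead of
 * being an implementation side-effect.
 */
static inline uint32_t
example_load_32(volatile uint32_t *p)
{
	return (__atomic_load_n(p, __ATOMIC_RELAXED));
}

static inline void
example_store_32(volatile uint32_t *p, uint32_t v)
{
	__atomic_store_n(p, v, __ATOMIC_RELAXED);
}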

Member Author:
volatile does not provide atomicity; it only keeps the compiler from optimizing out memory accesses. Atomicity comes from the access being a machine register in size, copied with a single instruction. That is why there are #ifdef _LP64 guards. This is a copy/paste from FreeBSD's atomic_common.h, used on all supported architectures, so I believe it to be correct. I have no opinion about the __atomic builtins. When I wrote this, I hadn't seen наб's recent commit that started using them in lib/libspl/atomic.c. It actually makes me wonder why they were not inlined here, but that is a different topic.

Contributor:
It'll be nice to finally have proper wrappers for atomic loads and stores! This is functionality I know would be helpful elsewhere, for example with some kstats that are a bit dodgy in this regard (though pretty harmless).

Since it sounds like these are all already well tested on FreeBSD, this seems fine to me. Though I wouldn't object to updating them to use the GCC atomic builtins for consistency, in a similar fashion as was done for the other atomics.

Member Author (@amotin, May 30, 2021):
I am not sure what the point is now of having all the atomics in a separate C file rather than inlined in the header. Is it supposed to improve compatibility when the library is built with a compiler that supports the atomics but the header is used by a less capable one? Previously it made sense, since the alternative implementations were written in assembler. But now, with the compiler expected to implement all of that, what's the point? To burden the compiler/CPU with function calls?

PS: On second thought, some operations may require a more complicated implementation where a function call could make sense, but since the compiler hides that from us, we have no way to differentiate. For trivial loads/stores/etc. I'd prefer not to call a function for a single machine instruction.

Contributor:
Well, since this is only for libzpool.so in user space it's probably not a huge deal, as only zdb and ztest would use these (at least until there's a FUSE or BUSE implementation). The only reason I can think of not to move everything to the header (to be inlined) would be if we wanted a single shared user/kernel-space header. But we don't have that today, and I don't think we'd gain much from it. We're already inlining these on the kernel side; doing the same in user space would make sense.
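To make the trade-off being discussed concrete, a hypothetical comparison (not project code): today, on 32-bit targets, the header only declares the fallback and every call crosses into atomic.c,

	extern uint64_t atomic_load_64(volatile uint64_t *);

whereas a header-only variant would let the compiler reduce it to a single load (or an inlined builtin) at the call site:

/* Hypothetical static inline version of the trivial 64-bit wrappers. */
static inline uint64_t
example_inline_load_64(volatile uint64_t *target)
{
	return (__atomic_load_n(target, __ATOMIC_RELAXED));
}

static inline void
example_inline_store_64(volatile uint64_t *target, uint64_t bits)
{
	__atomic_store_n(target, bits, __ATOMIC_RELAXED);
}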

#define atomic_load_short(p) (*(volatile ushort_t *)(p))
#define atomic_load_int(p) (*(volatile uint_t *)(p))
#define atomic_load_long(p) (*(volatile ulong_t *)(p))
#define atomic_load_ptr(p) (*(volatile __typeof(*p) *)(p))
#define atomic_load_8(p) (*(volatile uint8_t *)(p))
#define atomic_load_16(p) (*(volatile uint16_t *)(p))
#define atomic_load_32(p) (*(volatile uint32_t *)(p))
#ifdef _LP64
#define atomic_load_64(p) (*(volatile uint64_t *)(p))
#elif defined(_INT64_TYPE)
extern uint64_t atomic_load_64(volatile uint64_t *);
#endif

/*
* Atomically write variable.
*/
#define atomic_store_char(p, v) \
(*(volatile uchar_t *)(p) = (uchar_t)(v))
#define atomic_store_short(p, v) \
(*(volatile ushort_t *)(p) = (ushort_t)(v))
#define atomic_store_int(p, v) \
(*(volatile uint_t *)(p) = (uint_t)(v))
#define atomic_store_long(p, v) \
(*(volatile ulong_t *)(p) = (ulong_t)(v))
#define atomic_store_ptr(p, v) \
(*(volatile __typeof(*p) *)(p) = (v))
#define atomic_store_8(p, v) \
(*(volatile uint8_t *)(p) = (uint8_t)(v))
#define atomic_store_16(p, v) \
(*(volatile uint16_t *)(p) = (uint16_t)(v))
#define atomic_store_32(p, v) \
(*(volatile uint32_t *)(p) = (uint32_t)(v))
#ifdef _LP64
#define atomic_store_64(p, v) \
(*(volatile uint64_t *)(p) = (uint64_t)(v))
#elif defined(_INT64_TYPE)
extern void atomic_store_64(volatile uint64_t *, uint64_t);
#endif

/*
* Perform an exclusive atomic bit set/clear on a target.
* Returns 0 if bit was successfully set/cleared, or -1
125 changes: 65 additions & 60 deletions module/zfs/aggsum.c
@@ -78,21 +78,26 @@
*/

/*
* We will borrow aggsum_borrow_multiplier times the current request, so we will
* have to get the as_lock approximately every aggsum_borrow_multiplier calls to
* aggsum_delta().
* We will borrow 2^aggsum_borrow_shift times the current request, so we will
* have to get the as_lock approximately every 2^aggsum_borrow_shift calls to
* aggsum_add().
*/
static uint_t aggsum_borrow_multiplier = 10;
static uint_t aggsum_borrow_shift = 4;

void
aggsum_init(aggsum_t *as, uint64_t value)
{
bzero(as, sizeof (*as));
as->as_lower_bound = as->as_upper_bound = value;
mutex_init(&as->as_lock, NULL, MUTEX_DEFAULT, NULL);
as->as_numbuckets = boot_ncpus;
as->as_buckets = kmem_zalloc(boot_ncpus * sizeof (aggsum_bucket_t),
KM_SLEEP);
/*
* Too many buckets may hurt read performance without improving
* write. From 12 CPUs use bucket per 2 CPUs, from 48 per 4, etc.
*/
as->as_bucketshift = highbit64(boot_ncpus / 6) / 2;
as->as_numbuckets = ((boot_ncpus - 1) >> as->as_bucketshift) + 1;
as->as_buckets = kmem_zalloc(as->as_numbuckets *
sizeof (aggsum_bucket_t), KM_SLEEP);
for (int i = 0; i < as->as_numbuckets; i++) {
mutex_init(&as->as_buckets[i].asc_lock,
NULL, MUTEX_DEFAULT, NULL);
@@ -111,59 +116,49 @@ aggsum_fini(aggsum_t *as)
int64_t
aggsum_lower_bound(aggsum_t *as)
{
return (as->as_lower_bound);
return (atomic_load_64((volatile uint64_t *)&as->as_lower_bound));
}

int64_t
uint64_t
aggsum_upper_bound(aggsum_t *as)
{
return (as->as_upper_bound);
}

static void
aggsum_flush_bucket(aggsum_t *as, struct aggsum_bucket *asb)
{
ASSERT(MUTEX_HELD(&as->as_lock));
ASSERT(MUTEX_HELD(&asb->asc_lock));

/*
* We use atomic instructions for this because we read the upper and
* lower bounds without the lock, so we need stores to be atomic.
*/
atomic_add_64((volatile uint64_t *)&as->as_lower_bound,
asb->asc_delta + asb->asc_borrowed);
atomic_add_64((volatile uint64_t *)&as->as_upper_bound,
asb->asc_delta - asb->asc_borrowed);
asb->asc_delta = 0;
asb->asc_borrowed = 0;
return (atomic_load_64(&as->as_upper_bound));
}

uint64_t
aggsum_value(aggsum_t *as)
{
int64_t rv;
int64_t lb;
uint64_t ub;

mutex_enter(&as->as_lock);
if (as->as_lower_bound == as->as_upper_bound) {
rv = as->as_lower_bound;
lb = as->as_lower_bound;
ub = as->as_upper_bound;
if (lb == ub) {
for (int i = 0; i < as->as_numbuckets; i++) {
ASSERT0(as->as_buckets[i].asc_delta);
ASSERT0(as->as_buckets[i].asc_borrowed);
}
mutex_exit(&as->as_lock);
return (rv);
return (lb);
}
for (int i = 0; i < as->as_numbuckets; i++) {
struct aggsum_bucket *asb = &as->as_buckets[i];
if (asb->asc_borrowed == 0)
continue;
mutex_enter(&asb->asc_lock);
aggsum_flush_bucket(as, asb);
lb += asb->asc_delta + asb->asc_borrowed;
ub += asb->asc_delta - asb->asc_borrowed;
asb->asc_delta = 0;
asb->asc_borrowed = 0;
mutex_exit(&asb->asc_lock);
}
VERIFY3U(as->as_lower_bound, ==, as->as_upper_bound);
rv = as->as_lower_bound;
ASSERT3U(lb, ==, ub);
atomic_store_64((volatile uint64_t *)&as->as_lower_bound, lb);
atomic_store_64(&as->as_upper_bound, lb);
mutex_exit(&as->as_lock);

return (rv);
return (lb);
}

void
@@ -172,7 +167,8 @@ aggsum_add(aggsum_t *as, int64_t delta)
struct aggsum_bucket *asb;
int64_t borrow;

asb = &as->as_buckets[CPU_SEQID_UNSTABLE % as->as_numbuckets];
asb = &as->as_buckets[(CPU_SEQID_UNSTABLE >> as->as_bucketshift) %
as->as_numbuckets];

/* Try fast path if we already borrowed enough before. */
mutex_enter(&asb->asc_lock);
@@ -188,21 +184,22 @@
* We haven't borrowed enough. Take the global lock and borrow
* considering what is requested now and what we borrowed before.
*/
borrow = (delta < 0 ? -delta : delta) * aggsum_borrow_multiplier;
borrow = (delta < 0 ? -delta : delta);
borrow <<= aggsum_borrow_shift + as->as_bucketshift;
mutex_enter(&as->as_lock);
mutex_enter(&asb->asc_lock);
delta += asb->asc_delta;
asb->asc_delta = 0;
if (borrow >= asb->asc_borrowed)
borrow -= asb->asc_borrowed;
else
borrow = (borrow - (int64_t)asb->asc_borrowed) / 4;
mutex_enter(&asb->asc_lock);
delta += asb->asc_delta;
asb->asc_delta = 0;
asb->asc_borrowed += borrow;
atomic_add_64((volatile uint64_t *)&as->as_lower_bound,
delta - borrow);
atomic_add_64((volatile uint64_t *)&as->as_upper_bound,
delta + borrow);
mutex_exit(&asb->asc_lock);
atomic_store_64((volatile uint64_t *)&as->as_lower_bound,
as->as_lower_bound + delta - borrow);
atomic_store_64(&as->as_upper_bound,
as->as_upper_bound + delta + borrow);
mutex_exit(&as->as_lock);
}

@@ -214,27 +211,35 @@
int
aggsum_compare(aggsum_t *as, uint64_t target)
{
if (as->as_upper_bound < target)
int64_t lb;
uint64_t ub;
int i;

if (atomic_load_64(&as->as_upper_bound) < target)
return (-1);
if (as->as_lower_bound > target)
lb = atomic_load_64((volatile uint64_t *)&as->as_lower_bound);
if (lb > 0 && (uint64_t)lb > target)
return (1);
mutex_enter(&as->as_lock);
for (int i = 0; i < as->as_numbuckets; i++) {
lb = as->as_lower_bound;
ub = as->as_upper_bound;
for (i = 0; i < as->as_numbuckets; i++) {
struct aggsum_bucket *asb = &as->as_buckets[i];
if (asb->asc_borrowed == 0)
continue;
mutex_enter(&asb->asc_lock);
aggsum_flush_bucket(as, asb);
lb += asb->asc_delta + asb->asc_borrowed;
ub += asb->asc_delta - asb->asc_borrowed;
asb->asc_delta = 0;
asb->asc_borrowed = 0;
mutex_exit(&asb->asc_lock);
if (as->as_upper_bound < target) {
mutex_exit(&as->as_lock);
return (-1);
}
if (as->as_lower_bound > target) {
mutex_exit(&as->as_lock);
return (1);
}
if (ub < target || (lb > 0 && (uint64_t)lb > target))
break;
}
VERIFY3U(as->as_lower_bound, ==, as->as_upper_bound);
ASSERT3U(as->as_lower_bound, ==, target);
if (i >= as->as_numbuckets)
ASSERT3U(lb, ==, ub);
atomic_store_64((volatile uint64_t *)&as->as_lower_bound, lb);
atomic_store_64(&as->as_upper_bound, ub);
mutex_exit(&as->as_lock);
return (0);
return (ub < target ? -1 : (uint64_t)lb > target ? 1 : 0);
}