Skip to content

Commit d93ad50

Browse files
neilbrownmehmetb0
authored andcommitted
NFSv3: handle out-of-order write replies.
BugLink: https://bugs.launchpad.net/bugs/2089533 [ Upstream commit 3db63da ] NFSv3 includes pre/post wcc attributes which allow the client to determine if all changes to the file have been made by the client itself, or if any might have been made by some other client. If there are gaps in the pre/post ctime sequence it must be assumed that some other client changed the file in that gap and the local cache must be suspect. The next time the file is opened the cache should be invalidated. Since Commit 1c341b7 ("NFS: Add deferred cache invalidation for close-to-open consistency violations") in linux 5.3 the Linux client has been triggering this invalidation. The chunk in nfs_update_inode() in particularly triggers. Unfortunately Linux NFS assumes that all replies will be processed in the order sent, and will arrive in the order processed. This is not true in general. Consequently Linux NFS might ignore the wcc info in a WRITE reply because the reply is in response to a WRITE that was sent before some other request for which a reply has already been seen. This is detected by Linux using the gencount tests in nfs_inode_attr_cmp(). Also, when the gencount tests pass it is still possible that the request were processed on the server in a different order, and a gap seen in the ctime sequence might be filled in by a subsequent reply, so gaps should not immediately trigger delayed invalidation. The net result is that writing to a server and then reading the file back can result in going to the server for the read rather than serving it from cache - all because a couple of replies arrived out-of-order. This is a performance regression over kernels before 5.3, though the change in 5.3 is a correctness improvement. This has been seen with Linux writing to a Netapp server which occasionally re-orders requests. In testing the majority of requests were in-order, but a few (maybe 2 or three at a time) could be re-ordered. This patch addresses the problem by recording any gaps seen in the pre/post ctime sequence and not triggering invalidation until either there are too many gaps to fit in the table, or until there are no more active writes and the remaining gaps cannot be resolved. We allocate a table of 16 gaps on demand. If the allocation fails we revert to current behaviour which is of little cost as we are unlikely to be able to cache the writes anyway. In the table we store "start->end" pair when iversion is updated and "end<-start" pairs pre/post pairs reported by the server. Usually these exactly cancel out and so nothing is stored. When there are out-of-order replies we do store gaps and these will eventually be cancelled against later replies when this client is the only writer. If the final write is out-of-order there may be one gap remaining when the file is closed. This will be noticed and if there is precisely on gap and if the iversion can be advanced to match it, then we do so. This patch makes no attempt to handle directories correctly. The same problem potentially exists in the out-of-order replies to create/unlink requests can cause future lookup requires to be sent to the server unnecessarily. A similar scheme using the same primitives could be used to notice and handle out-of-order replies. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Stable-dep-of: 867da60 ("nfs: avoid i_lock contention in nfs_clear_invalid_mapping") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Manuel Diewald <manuel.diewald@canonical.com> Signed-off-by: Mehmet Basaran <mehmet.basaran@canonical.com>
1 parent c3299ae commit d93ad50

File tree

2 files changed

+144
-15
lines changed

2 files changed

+144
-15
lines changed

fs/nfs/inode.c

Lines changed: 97 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -214,11 +214,12 @@ void nfs_set_cache_invalid(struct inode *inode, unsigned long flags)
214214

215215
nfsi->cache_validity |= flags;
216216

217-
if (inode->i_mapping->nrpages == 0)
218-
nfsi->cache_validity &= ~(NFS_INO_INVALID_DATA |
219-
NFS_INO_DATA_INVAL_DEFER);
220-
else if (nfsi->cache_validity & NFS_INO_INVALID_DATA)
221-
nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
217+
if (inode->i_mapping->nrpages == 0) {
218+
nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
219+
nfs_ooo_clear(nfsi);
220+
} else if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
221+
nfs_ooo_clear(nfsi);
222+
}
222223
trace_nfs_set_cache_invalid(inode, 0);
223224
}
224225
EXPORT_SYMBOL_GPL(nfs_set_cache_invalid);
@@ -692,9 +693,10 @@ static int nfs_vmtruncate(struct inode * inode, loff_t offset)
692693

693694
i_size_write(inode, offset);
694695
/* Optimisation */
695-
if (offset == 0)
696-
NFS_I(inode)->cache_validity &= ~(NFS_INO_INVALID_DATA |
697-
NFS_INO_DATA_INVAL_DEFER);
696+
if (offset == 0) {
697+
NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_DATA;
698+
nfs_ooo_clear(NFS_I(inode));
699+
}
698700
NFS_I(inode)->cache_validity &= ~NFS_INO_INVALID_SIZE;
699701

700702
spin_unlock(&inode->i_lock);
@@ -1109,7 +1111,7 @@ void nfs_inode_attach_open_context(struct nfs_open_context *ctx)
11091111

11101112
spin_lock(&inode->i_lock);
11111113
if (list_empty(&nfsi->open_files) &&
1112-
(nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER))
1114+
nfs_ooo_test(nfsi))
11131115
nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA |
11141116
NFS_INO_REVAL_FORCED);
11151117
list_add_tail_rcu(&ctx->list, &nfsi->open_files);
@@ -1361,8 +1363,8 @@ int nfs_clear_invalid_mapping(struct address_space *mapping)
13611363

13621364
set_bit(NFS_INO_INVALIDATING, bitlock);
13631365
smp_wmb();
1364-
nfsi->cache_validity &=
1365-
~(NFS_INO_INVALID_DATA | NFS_INO_DATA_INVAL_DEFER);
1366+
nfsi->cache_validity &= ~NFS_INO_INVALID_DATA;
1367+
nfs_ooo_clear(nfsi);
13661368
spin_unlock(&inode->i_lock);
13671369
trace_nfs_invalidate_mapping_enter(inode);
13681370
ret = nfs_invalidate_mapping(inode, mapping);
@@ -1825,6 +1827,66 @@ static int nfs_inode_finish_partial_attr_update(const struct nfs_fattr *fattr,
18251827
return 0;
18261828
}
18271829

1830+
static void nfs_ooo_merge(struct nfs_inode *nfsi,
1831+
u64 start, u64 end)
1832+
{
1833+
int i, cnt;
1834+
1835+
if (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER)
1836+
/* No point merging anything */
1837+
return;
1838+
1839+
if (!nfsi->ooo) {
1840+
nfsi->ooo = kmalloc(sizeof(*nfsi->ooo), GFP_ATOMIC);
1841+
if (!nfsi->ooo) {
1842+
nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
1843+
return;
1844+
}
1845+
nfsi->ooo->cnt = 0;
1846+
}
1847+
1848+
/* add this range, merging if possible */
1849+
cnt = nfsi->ooo->cnt;
1850+
for (i = 0; i < cnt; i++) {
1851+
if (end == nfsi->ooo->gap[i].start)
1852+
end = nfsi->ooo->gap[i].end;
1853+
else if (start == nfsi->ooo->gap[i].end)
1854+
start = nfsi->ooo->gap[i].start;
1855+
else
1856+
continue;
1857+
/* Remove 'i' from table and loop to insert the new range */
1858+
cnt -= 1;
1859+
nfsi->ooo->gap[i] = nfsi->ooo->gap[cnt];
1860+
i = -1;
1861+
}
1862+
if (start != end) {
1863+
if (cnt >= ARRAY_SIZE(nfsi->ooo->gap)) {
1864+
nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
1865+
kfree(nfsi->ooo);
1866+
nfsi->ooo = NULL;
1867+
return;
1868+
}
1869+
nfsi->ooo->gap[cnt].start = start;
1870+
nfsi->ooo->gap[cnt].end = end;
1871+
cnt += 1;
1872+
}
1873+
nfsi->ooo->cnt = cnt;
1874+
}
1875+
1876+
static void nfs_ooo_record(struct nfs_inode *nfsi,
1877+
struct nfs_fattr *fattr)
1878+
{
1879+
/* This reply was out-of-order, so record in the
1880+
* pre/post change id, possibly cancelling
1881+
* gaps created when iversion was jumpped forward.
1882+
*/
1883+
if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) &&
1884+
(fattr->valid & NFS_ATTR_FATTR_PRECHANGE))
1885+
nfs_ooo_merge(nfsi,
1886+
fattr->change_attr,
1887+
fattr->pre_change_attr);
1888+
}
1889+
18281890
static int nfs_refresh_inode_locked(struct inode *inode,
18291891
struct nfs_fattr *fattr)
18301892
{
@@ -1835,8 +1897,12 @@ static int nfs_refresh_inode_locked(struct inode *inode,
18351897

18361898
if (attr_cmp > 0 || nfs_inode_finish_partial_attr_update(fattr, inode))
18371899
ret = nfs_update_inode(inode, fattr);
1838-
else if (attr_cmp == 0)
1839-
ret = nfs_check_inode_attributes(inode, fattr);
1900+
else {
1901+
nfs_ooo_record(NFS_I(inode), fattr);
1902+
1903+
if (attr_cmp == 0)
1904+
ret = nfs_check_inode_attributes(inode, fattr);
1905+
}
18401906

18411907
trace_nfs_refresh_inode_exit(inode, ret);
18421908
return ret;
@@ -1927,6 +1993,8 @@ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fa
19271993
if (attr_cmp < 0)
19281994
return 0;
19291995
if ((fattr->valid & NFS_ATTR_FATTR) == 0 || !attr_cmp) {
1996+
/* Record the pre/post change info before clearing PRECHANGE */
1997+
nfs_ooo_record(NFS_I(inode), fattr);
19301998
fattr->valid &= ~(NFS_ATTR_FATTR_PRECHANGE
19311999
| NFS_ATTR_FATTR_PRESIZE
19322000
| NFS_ATTR_FATTR_PREMTIME
@@ -2081,6 +2149,15 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
20812149

20822150
/* More cache consistency checks */
20832151
if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
2152+
if (!have_writers && nfsi->ooo && nfsi->ooo->cnt == 1 &&
2153+
nfsi->ooo->gap[0].end == inode_peek_iversion_raw(inode)) {
2154+
/* There is one remaining gap that hasn't been
2155+
* merged into iversion - do that now.
2156+
*/
2157+
inode_set_iversion_raw(inode, nfsi->ooo->gap[0].start);
2158+
kfree(nfsi->ooo);
2159+
nfsi->ooo = NULL;
2160+
}
20842161
if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
20852162
/* Could it be a race with writeback? */
20862163
if (!(have_writers || have_delegation)) {
@@ -2102,8 +2179,11 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
21022179
dprintk("NFS: change_attr change on server for file %s/%ld\n",
21032180
inode->i_sb->s_id,
21042181
inode->i_ino);
2105-
} else if (!have_delegation)
2106-
nfsi->cache_validity |= NFS_INO_DATA_INVAL_DEFER;
2182+
} else if (!have_delegation) {
2183+
nfs_ooo_record(nfsi, fattr);
2184+
nfs_ooo_merge(nfsi, inode_peek_iversion_raw(inode),
2185+
fattr->change_attr);
2186+
}
21072187
inode_set_iversion_raw(inode, fattr->change_attr);
21082188
}
21092189
} else {
@@ -2265,6 +2345,7 @@ struct inode *nfs_alloc_inode(struct super_block *sb)
22652345
return NULL;
22662346
nfsi->flags = 0UL;
22672347
nfsi->cache_validity = 0UL;
2348+
nfsi->ooo = NULL;
22682349
#if IS_ENABLED(CONFIG_NFS_V4)
22692350
nfsi->nfs4_acl = NULL;
22702351
#endif /* CONFIG_NFS_V4 */
@@ -2277,6 +2358,7 @@ EXPORT_SYMBOL_GPL(nfs_alloc_inode);
22772358

22782359
void nfs_free_inode(struct inode *inode)
22792360
{
2361+
kfree(NFS_I(inode)->ooo);
22802362
kmem_cache_free(nfs_inode_cachep, NFS_I(inode));
22812363
}
22822364
EXPORT_SYMBOL_GPL(nfs_free_inode);

include/linux/nfs_fs.h

Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -190,6 +190,39 @@ struct nfs_inode {
190190
/* Open contexts for shared mmap writes */
191191
struct list_head open_files;
192192

193+
/* Keep track of out-of-order replies.
194+
* The ooo array contains start/end pairs of
195+
* numbers from the changeid sequence when
196+
* the inode's iversion has been updated.
197+
* It also contains end/start pair (i.e. reverse order)
198+
* of sections of the changeid sequence that have
199+
* been seen in replies from the server.
200+
* Normally these should match and when both
201+
* A:B and B:A are found in ooo, they are both removed.
202+
* And if a reply with A:B causes an iversion update
203+
* of A:B, then neither are added.
204+
* When a reply has pre_change that doesn't match
205+
* iversion, then the changeid pair and any consequent
206+
* change in iversion ARE added. Later replies
207+
* might fill in the gaps, or possibly a gap is caused
208+
* by a change from another client.
209+
* When a file or directory is opened, if the ooo table
210+
* is not empty, then we assume the gaps were due to
211+
* another client and we invalidate the cached data.
212+
*
213+
* We can only track a limited number of concurrent gaps.
214+
* Currently that limit is 16.
215+
* We allocate the table on demand. If there is insufficient
216+
* memory, then we probably cannot cache the file anyway
217+
* so there is no loss.
218+
*/
219+
struct {
220+
int cnt;
221+
struct {
222+
u64 start, end;
223+
} gap[16];
224+
} *ooo;
225+
193226
#if IS_ENABLED(CONFIG_NFS_V4)
194227
struct nfs4_cached_acl *nfs4_acl;
195228
/* NFSv4 state */
@@ -625,6 +658,20 @@ nfs_fileid_to_ino_t(u64 fileid)
625658
return ino;
626659
}
627660

661+
static inline void nfs_ooo_clear(struct nfs_inode *nfsi)
662+
{
663+
nfsi->cache_validity &= ~NFS_INO_DATA_INVAL_DEFER;
664+
kfree(nfsi->ooo);
665+
nfsi->ooo = NULL;
666+
}
667+
668+
static inline bool nfs_ooo_test(struct nfs_inode *nfsi)
669+
{
670+
return (nfsi->cache_validity & NFS_INO_DATA_INVAL_DEFER) ||
671+
(nfsi->ooo && nfsi->ooo->cnt > 0);
672+
673+
}
674+
628675
#define NFS_JUKEBOX_RETRY_TIME (5 * HZ)
629676

630677

0 commit comments

Comments
 (0)