Merge branch 'hs/fast-commit-v9' into dev
tytso committed Oct 9, 2020
2 parents 9cb3701 + 0955fdb commit ab7b179
Showing 26 changed files with 4,155 additions and 165 deletions.
66 changes: 66 additions & 0 deletions Documentation/filesystems/ext4/journal.rst
@@ -28,6 +28,17 @@ metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, Ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, Ext4 only stores the minimal delta needed to recreate the
affected metadata in fast commit space that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, Ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it and thus empties the fast commit area for further fast
commits. This feature needs to be enabled at compile time.
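
The fallback rule described above can be sketched as a small userspace model. This is only an illustration, not kernel code: ``struct demo_fc_area``, ``demo_commit()`` and the byte-based accounting are hypothetical names and assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the fast commit area; not the kernel's data structure. */
struct demo_fc_area {
	size_t capacity;		/* bytes reserved for fast commits */
	size_t used;			/* bytes used since the last full commit */
	unsigned int fast_commits;	/* fast commits performed so far */
	unsigned int full_commits;	/* full commits performed so far */
};

enum demo_commit_kind { DEMO_FAST_COMMIT, DEMO_FULL_COMMIT };

/* A full commit invalidates earlier fast commits, so the area becomes empty. */
static void demo_full_commit(struct demo_fc_area *area)
{
	area->used = 0;
	area->full_commits++;
}

/*
 * Decide how to commit a delta of delta_len bytes. A full commit is used
 * when the change is not eligible for a fast commit, when the commit timer
 * has expired, or when the delta does not fit in the remaining space.
 */
static enum demo_commit_kind demo_commit(struct demo_fc_area *area,
					 size_t delta_len, bool eligible,
					 bool timer_expired)
{
	assert(area != NULL);

	if (!eligible || timer_expired ||
	    delta_len > area->capacity - area->used) {
		demo_full_commit(area);
		return DEMO_FULL_COMMIT;
	}
	area->used += delta_len;
	area->fast_commits++;
	return DEMO_FAST_COMMIT;
}
```

In this model a delta that no longer fits triggers a full commit, after which the whole area is available again for later fast commits.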

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
@@ -609,3 +620,58 @@ bytes long (but uses a full block):
- h\_commit\_nsec
- Nanoseconds component of the above timestamp.

Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag-length-value (TLV) records.
Each TLV begins with a ``struct ext4_fc_tl``, which stores the tag and the
length of the entire field. It is followed by a variable-length, tag-specific
value. Here is the list of supported tags and their meanings:

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and the extent to be added to this inode.
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offsets from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed.
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file.
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.
   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit and the CRC of the fast commit that this
       tag ends.
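
As a rough illustration of how such a log can be walked, here is a self-contained userspace sketch. It rests on several assumptions that the text above does not confirm: the header is taken to be a 16-bit tag followed by a 16-bit length, both little-endian, and the length is taken to count only the value bytes after the header (if it covers the header as well, the offset arithmetic changes accordingly). The ``DEMO_TAG_*`` numbers are made up for the example; the real ``EXT4_FC_TAG_*`` values are defined in ``fast_commit.h``.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative tag numbers only; not the real EXT4_FC_TAG_* values. */
enum demo_tag {
	DEMO_TAG_HEAD = 1,
	DEMO_TAG_ADD_RANGE,
	DEMO_TAG_PAD,
	DEMO_TAG_TAIL,
};

#define DEMO_TL_SIZE 4	/* assumed header: 16-bit tag + 16-bit length */

/* Read a little-endian 16-bit value without assuming host byte order. */
static uint16_t demo_get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Walk the log in buf and count the records up to and including the first
 * TAIL record. Returns -1 if a record runs past the end of the buffer or
 * if no TAIL record is found.
 */
static int demo_count_records(const uint8_t *buf, size_t len)
{
	size_t off = 0;
	int count = 0;

	assert(buf != NULL);

	while (len - off >= DEMO_TL_SIZE) {
		uint16_t tag = demo_get_le16(buf + off);
		uint16_t val_len = demo_get_le16(buf + off + 2);

		if (val_len > len - off - DEMO_TL_SIZE)
			return -1;	/* value is truncated */
		count++;
		if (tag == DEMO_TAG_TAIL)
			return count;
		off += DEMO_TL_SIZE + val_len;	/* skip header and value */
	}
	return -1;
}
```

Because every record carries its own length, a reader can skip tags it does not understand and still find the tail that closes the fast commit.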

28 changes: 28 additions & 0 deletions Documentation/filesystems/journalling.rst
@@ -132,6 +132,34 @@ The opportunities for abuse and DOS attacks with this should be obvious,
if you allow unprivileged userspace to trigger codepaths containing
these calls.

Fast commits
~~~~~~~~~~~~

JBD2 also allows you to perform file-system specific delta commits known as
fast commits. In order to use fast commits, you first need to call
:c:func:`jbd2_fc_init` and tell it how many blocks at the end of the journal
area should be reserved for fast commits. Along with that, you will also need
to set the following callbacks, which perform the corresponding work:

`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
fast commit.

`journal->j_fc_replay_cb`: Replay function called for replay of fast commit
blocks.

The file system is free to perform fast commits as and when it wants, as long
as it gets permission from JBD2 to do so by calling the function
:c:func:`jbd2_fc_start()`. Once a fast commit is done, the client
file system should tell JBD2 about it by calling :c:func:`jbd2_fc_stop()`.
If the file system wants JBD2 to perform a full commit immediately after
stopping the fast commit, it can do so by calling
:c:func:`jbd2_fc_stop_do_commit()`. This is useful if the fast commit operation
fails for some reason and the only way to guarantee consistency is for JBD2 to
perform the full traditional commit.

JBD2 also provides helper functions to manage fast commit buffers. The file
system can use :c:func:`jbd2_fc_get_buf()` and :c:func:`jbd2_fc_wait_bufs()`
to allocate fast commit buffers and to wait on their IO completion.
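
The start/stop protocol described above can be sketched in userspace with mock stand-ins. None of the ``demo_*`` functions below are the JBD2 API: they are hypothetical placeholders for :c:func:`jbd2_fc_start()`, :c:func:`jbd2_fc_stop()` and :c:func:`jbd2_fc_stop_do_commit()`, with simplified signatures and return values assumed for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the journal; it only records which calls were made. */
struct demo_journal {
	bool fc_active;			/* a fast commit is in progress */
	unsigned int fast_commits;	/* fast commits that completed */
	unsigned int full_commits;	/* fallbacks to a full commit */
};

/* Mock of asking JBD2 for permission to start a fast commit. */
static int demo_fc_start(struct demo_journal *j)
{
	if (j->fc_active)
		return -1;	/* assumed: refuse while one is running */
	j->fc_active = true;
	return 0;
}

/* Mock of telling JBD2 that the fast commit finished. */
static void demo_fc_stop(struct demo_journal *j)
{
	j->fc_active = false;
	j->fast_commits++;
}

/* Mock of stopping the fast commit and requesting a full commit. */
static void demo_fc_stop_do_commit(struct demo_journal *j)
{
	j->fc_active = false;
	j->full_commits++;
}

/* Example callbacks: one that succeeds and one that fails. */
static int demo_write_ok(void *ctx) { (void)ctx; return 0; }
static int demo_write_fail(void *ctx) { (void)ctx; return -5; }

/*
 * One fast commit attempt. write_delta is the file system's own routine
 * that writes the fast commit blocks; it returns 0 on success.
 */
static int demo_fs_fast_commit(struct demo_journal *j,
			       int (*write_delta)(void *ctx), void *ctx)
{
	int err;

	assert(j != NULL && write_delta != NULL);

	err = demo_fc_start(j);
	if (err)
		return err;

	err = write_delta(ctx);
	if (err) {
		/* The fast commit failed: fall back to a full commit. */
		demo_fc_stop_do_commit(j);
		return err;
	}
	demo_fc_stop(j);
	return 0;
}
```

Calling ``demo_fs_fast_commit()`` with ``demo_write_ok`` records a fast commit, while ``demo_write_fail`` takes the fallback path and records a full commit instead.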

Summary
~~~~~~~

2 changes: 1 addition & 1 deletion fs/ext4/Makefile
@@ -10,7 +10,7 @@ ext4-y := balloc.o bitmap.o block_validity.o dir.o ext4_jbd2.o extents.o \
indirect.o inline.o inode.o ioctl.o mballoc.o migrate.o \
mmp.o move_extent.o namei.o page-io.o readpage.o resize.o \
super.o symlink.o sysfs.o xattr.o xattr_hurd.o xattr_trusted.o \
xattr_user.o
xattr_user.o fast_commit.o

ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
2 changes: 2 additions & 0 deletions fs/ext4/acl.c
@@ -242,6 +242,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits);
if (IS_ERR(handle))
return PTR_ERR(handle);
ext4_fc_start_update(inode);

if ((type == ACL_TYPE_ACCESS) && acl) {
error = posix_acl_update_mode(inode, &mode, &acl);
@@ -259,6 +260,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
}
out_stop:
ext4_journal_stop(handle);
ext4_fc_stop_update(inode);
if (error == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
return error;
7 changes: 6 additions & 1 deletion fs/ext4/balloc.c
@@ -368,7 +368,12 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
struct buffer_head *bh)
{
ext4_fsblk_t blk;
struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
struct ext4_group_info *grp;

if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
return 0;

grp = ext4_get_group_info(sb, block_group);

if (buffer_verified(bh))
return 0;
95 changes: 95 additions & 0 deletions fs/ext4/ext4.h
@@ -963,6 +963,7 @@ do { \
#endif /* defined(__KERNEL__) || defined(__linux__) */

#include "extents_status.h"
#include "fast_commit.h"

/*
* Lock subclasses for i_data_sem in the ext4_inode_info structure.
@@ -1020,6 +1021,27 @@ struct ext4_inode_info {

struct list_head i_orphan; /* unlinked but open inodes */

/* Fast commit related info */

struct list_head i_fc_list; /*
* inodes that need fast commit
* protected by sbi->s_fc_lock.
*/

/* Start of lblk range that needs to be committed in this fast commit */
ext4_lblk_t i_fc_lblk_start;

/* Length of lblk range that needs to be committed in this fast commit */
ext4_lblk_t i_fc_lblk_len;

/* Number of ongoing updates on this inode */
atomic_t i_fc_updates;

/* Fast commit wait queue for this inode */
wait_queue_head_t i_fc_wait;

struct mutex i_fc_lock;

/*
* i_disksize keeps track of what the inode size is ON DISK, not
* in memory. During truncate, i_size is set to the new size by
@@ -1140,6 +1162,11 @@ struct ext4_inode_info {
#define EXT4_VALID_FS 0x0001 /* Unmounted cleanly */
#define EXT4_ERROR_FS 0x0002 /* Errors detected */
#define EXT4_ORPHAN_FS 0x0004 /* Orphans being recovered */
#define EXT4_FC_INELIGIBLE 0x0008 /* Fast commit ineligible */
#define EXT4_FC_COMMITTING 0x0010 /* File system undergoing a fast
* commit.
*/
#define EXT4_FC_REPLAY 0x0020 /* Fast commit replay ongoing */

/*
* Misc. filesystem flags
@@ -1213,6 +1240,8 @@ struct ext4_inode_info {
#define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM 0x00000008 /* User explicitly
specified journal checksum */

#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */

#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
#define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
@@ -1610,6 +1639,29 @@ struct ext4_sb_info {
/* Record the errseq of the backing block device */
errseq_t s_bdev_wb_err;
spinlock_t s_bdev_wb_lock;

/* Ext4 fast commit stuff */
atomic_t s_fc_subtid;
atomic_t s_fc_ineligible_updates;
/*
* After commit starts, the main queue gets locked, and the further
* updates get added to the staging queue
*/
#define FC_Q_MAIN 0
#define FC_Q_STAGING 1
struct list_head s_fc_q[2]; /* Inodes staged for fast commit
* that have data changes in them.
*/
struct list_head s_fc_dentry_q[2]; /* directory entry updates */
int s_fc_bytes;
spinlock_t s_fc_lock;
struct buffer_head *s_fc_bh;
struct ext4_fc_stats s_fc_stats;
u64 s_fc_avg_commit_time;
#ifdef CONFIG_EXT4_DEBUG
int s_fc_debug_max_replay;
#endif
struct ext4_fc_replay_state s_fc_replay_state;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1720,6 +1772,7 @@ enum {
EXT4_STATE_EXT_PRECACHED, /* extents have been precached */
EXT4_STATE_LUSTRE_EA_INODE, /* Lustre-style ea_inode */
EXT4_STATE_VERITY_IN_PROGRESS, /* building fs-verity Merkle tree */
EXT4_STATE_FC_COMMITTING, /* Fast commit ongoing */
};

#define EXT4_INODE_BIT_FNS(name, field, offset) \
@@ -1813,6 +1866,7 @@ static inline bool ext4_verity_in_progress(struct inode *inode)
#define EXT4_FEATURE_COMPAT_RESIZE_INODE 0x0010
#define EXT4_FEATURE_COMPAT_DIR_INDEX 0x0020
#define EXT4_FEATURE_COMPAT_SPARSE_SUPER2 0x0200
#define EXT4_FEATURE_COMPAT_FAST_COMMIT 0x0400
#define EXT4_FEATURE_COMPAT_STABLE_INODES 0x0800

#define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
@@ -1915,6 +1969,7 @@ EXT4_FEATURE_COMPAT_FUNCS(xattr, EXT_ATTR)
EXT4_FEATURE_COMPAT_FUNCS(resize_inode, RESIZE_INODE)
EXT4_FEATURE_COMPAT_FUNCS(dir_index, DIR_INDEX)
EXT4_FEATURE_COMPAT_FUNCS(sparse_super2, SPARSE_SUPER2)
EXT4_FEATURE_COMPAT_FUNCS(fast_commit, FAST_COMMIT)
EXT4_FEATURE_COMPAT_FUNCS(stable_inodes, STABLE_INODES)

EXT4_FEATURE_RO_COMPAT_FUNCS(sparse_super, SPARSE_SUPER)
@@ -2649,6 +2704,7 @@ extern int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
struct dx_hash_info *hinfo);

/* ialloc.c */
extern int ext4_mark_inode_used(struct super_block *sb, int ino);
extern struct inode *__ext4_new_inode(handle_t *, struct inode *, umode_t,
const struct qstr *qstr, __u32 goal,
uid_t *owner, __u32 i_flags,
@@ -2674,6 +2730,27 @@ extern int ext4_init_inode_table(struct super_block *sb,
ext4_group_t group, int barrier);
extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);

/* fast_commit.c */
int ext4_fc_info_show(struct seq_file *seq, void *v);
void ext4_fc_init(struct super_block *sb, journal_t *journal);
void ext4_fc_init_inode(struct inode *inode);
void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
ext4_lblk_t end);
void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry);
void ext4_fc_track_link(struct inode *inode, struct dentry *dentry);
void ext4_fc_track_create(struct inode *inode, struct dentry *dentry);
void ext4_fc_track_inode(struct inode *inode);
void ext4_fc_mark_ineligible(struct super_block *sb, int reason);
void ext4_fc_start_ineligible(struct super_block *sb, int reason);
void ext4_fc_stop_ineligible(struct super_block *sb);
void ext4_fc_start_update(struct inode *inode);
void ext4_fc_stop_update(struct inode *inode);
void ext4_fc_del(struct inode *inode);
bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t block);
void ext4_fc_replay_cleanup(struct super_block *sb);
int ext4_fc_commit(journal_t *journal, tid_t commit_tid);
int __init ext4_fc_init_dentry_cache(void);

/* mballoc.c */
extern const struct seq_operations ext4_mb_seq_groups_ops;
extern long ext4_mb_stats;
@@ -2703,8 +2780,12 @@ extern int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
ext4_fsblk_t block, unsigned long count);
extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);
extern void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
int len, int state);

/* inode.c */
void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
struct ext4_inode_info *ei);
int ext4_inode_is_fast_symlink(struct inode *inode);
struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
@@ -2751,6 +2832,8 @@ extern int ext4_sync_inode(handle_t *, struct inode *);
extern void ext4_dirty_inode(struct inode *, int);
extern int ext4_change_inode_journal_flag(struct inode *, int);
extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
struct ext4_iloc *iloc);
extern int ext4_inode_attach_jinode(struct inode *inode);
extern int ext4_can_truncate(struct inode *inode);
extern int ext4_truncate(struct inode *);
@@ -2784,12 +2867,15 @@ extern int ext4_ind_remove_space(handle_t *handle, struct inode *inode,
/* ioctl.c */
extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
extern void ext4_reset_inode_seed(struct inode *inode);

/* migrate.c */
extern int ext4_ext_migrate(struct inode *);
extern int ext4_ind_migrate(struct inode *inode);

/* namei.c */
extern int ext4_init_new_dir(handle_t *handle, struct inode *dir,
struct inode *inode);
extern int ext4_dirblock_csum_verify(struct inode *inode,
struct buffer_head *bh);
extern int ext4_orphan_add(handle_t *, struct inode *);
@@ -3369,6 +3455,10 @@ extern int ext4_handle_dirty_dirblock(handle_t *handle, struct inode *inode,
extern int ext4_ci_compare(const struct inode *parent,
const struct qstr *fname,
const struct qstr *entry, bool quick);
extern int __ext4_unlink(struct inode *dir, const struct qstr *d_name,
struct inode *inode);
extern int __ext4_link(struct inode *dir, struct inode *inode,
struct dentry *dentry);

#define S_SHIFT 12
static const unsigned char ext4_type_by_mode[(S_IFMT >> S_SHIFT) + 1] = {
@@ -3469,6 +3559,11 @@ extern int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu);
extern int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
int check_cred, int restart_cred,
int revoke_cred);
extern void ext4_ext_replay_shrink_inode(struct inode *inode, ext4_lblk_t end);
extern int ext4_ext_replay_set_iblocks(struct inode *inode);
extern int ext4_ext_replay_update_ex(struct inode *inode, ext4_lblk_t start,
int len, int unwritten, ext4_fsblk_t pblk);
extern int ext4_ext_clear_bb(struct inode *inode);


/* move_extent.c */
2 changes: 1 addition & 1 deletion fs/ext4/ext4_jbd2.c
@@ -100,7 +100,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
return ERR_PTR(err);

journal = EXT4_SB(sb)->s_journal;
if (!journal)
if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
return ext4_get_nojournal();
return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds,
GFP_NOFS, type, line);