Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.*: virtual sstable FileMetadata changes #2352

Merged
merged 1 commit into from
Mar 28, 2023
Merged

.*: virtual sstable FileMetadata changes #2352

merged 1 commit into from
Mar 28, 2023

Conversation

bananabrick
Copy link
Contributor

@bananabrick bananabrick commented Feb 21, 2023

This pr introduces the FileMetadata changes which will be required to make further virtual sstable changes.

This pr doesn't deal with persisting of the FileMetadata to disk through the version edits.

RFC: https://github.com/cockroachdb/pebble/blob/master/docs/RFCS/20221122_virtual_sstable.md

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumeerbhola This doesn't have the decoding/encoding/version edit changes. I put out this pr first, because it should be non-controversial, and future changes can be concurrently sequenced on top of it.

Future changes which can be built on top of this include:

  1. version edit changes + decoding/encoding.
  2. Iterator + table cache changes.
  3. asynchronous stats collection changes.

Reviewable status: 0 of 17 files reviewed, 1 unresolved discussion


internal/manifest/version.go line 162 at r1 (raw file):

// sstable.
type VirtualState struct {
	Parent PhysicalFileMeta

@itsbilal Going to assume for now that we're keeping around the parent files metadata here, which would mean that we definitely need the version edit changes to support manifest replay.

Another reason we need the parent file's metadata pointer in the virtual sstable meta is to support refcounting. We need to access the parent file metadata's atomic.refs count, and we also need to access the parent file metadata's atomic.latestRefs. An additional reason would be that we also need to implement a compaction picking heuristic to minimize space amp due to virtual sstables, and that would also need to access the parent file's metadata.

@bananabrick
Copy link
Contributor Author

Just copied code over from #2288. Will take a more careful look to make sure that I didn't miss some change.

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 17 files reviewed, 3 unresolved discussions


checkpoint.go line 260 at r2 (raw file):

	var excludedFiles map[deletedFileEntry]*fileMetadata
	dedup := make(map[base.FileNum]struct{})

This pattern to iterate over the file metadata and deduplicating the parent sstables of a virtual sstable is coming up all over the place, but idk if there's a clean way of avoiding it.


internal/manifest/version.go line 169 at r3 (raw file):

// FileMetadata holds the metadata for an on-disk or virtual sstable.
type FileMetadata struct {

Haven't added the state needed for the space amplification compaction picking heuristic as that would be very hard to test carefully without the rest of the virtual sstable changes.

@bananabrick bananabrick requested review from a team, RaduBerinde and itsbilal February 22, 2023 00:32
Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 18 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @RaduBerinde)


version_set_test.go line 116 at r5 (raw file):

			return d.getInProgressCompactionInfoLocked(nil)
		})
		d.updateReadStateLocked(nil)

I find it odd that we don't call updateReadStateLocked from within logAndApply and instead rely on the caller to make sure that the read state is updated.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take a more careful look to make sure that I didn't miss some change.

is this ready for a look?

Reviewable status: 0 of 18 files reviewed, 4 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)


internal/manifest/version.go line 169 at r3 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Haven't added the state needed for the space amplification compaction picking heuristic as that would be very hard to test carefully without the rest of the virtual sstable changes.

Please add it. It's fine if it can't be tested yet, which can be left as a TODO in a test file. I'd really like to have the full "data-structure" part of virtual ssts in this PR.

@itsbilal
Copy link
Member

I made this point in a meeting with Arjun and I'll repeat it here so @sumeerbhola hears it too, but I'd support a FileMetadata that only has one additional field for virtual sstables, i.e. PhysicalFileNum. We really don't need any other fields from the physical file metadata, and keeping the physical meta around in the manifest (and adding all the logic for "unlinked" files in the version edit, etc., and changing the VersionEdit format) seems pretty unnecessary all for a marginal increase in debuggability. Also increases risk for something that is likely going into 23.1. Unless the manifest has rotated twice since the physical sst was created, we can always look back at the AddedFiles in the version edit for the physical SST creation to get the PhysicalMeta we are interested in.

I also worry about the space-amp increase in storing the smallest/largest point/range keys extra times for a FileMetadata that is really not being used except for the FileNum. Those fields take up much more space than a single uint64 for the physical file num (which will only be present if the virtual sst has a tag for virtual sst), and could only lead to more manifest rotations.

The only thing that becomes slightly harder with only storing a file num, is ref counting, but we can restructure ref counting to operate away from the FileMetadata and in some map[PhysicalFileNum] and that solves that problem pretty easily. It's also very easy to detect obsolete files at Open() time; we can just track which PhysicalFileNums are on disk but not referenced by any virtual sstables. My prototype actually did that change.

One more point is that PhysicalMeta will probably not exist or will be very Provider-managed for virtual sstables backed by a disaggregated shared foreign SST anyway.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 17 files at r1, 1 of 4 files at r5.
Reviewable status: 2 of 18 files reviewed, 15 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)


version_set.go line 44 at r6 (raw file):

// like it sounds: a delta from the previous version. Version edits are logged
// to the MANIFEST file, which is replayed at startup.
type versionSet struct {

where is backingPhysicalTables map[FileNum]physicalMeta that is in the other PR?


internal/manifest/btree.go line 210 at r6 (raw file):

	if contentsToo {
		for _, f := range n.items[:n.count] {
			if f.Unref() == 0 {

I think we need a signature like

// INVARIANT: additionalObsolete != nil => refCount == 0
Unref() (refCount int32, additionalObsolete *FileMetadata)

if we are ref counting virtual ssts too.

If not, we need something like the following where obsolete may be the `FileMetadata we called unref on (for a physical sst) or the underlying physical sst (if we called on a virtual sst).

// INVARIANT: obsolete != nil => refCount == 0
Unref() (refCount int32, obsolete *FileMetadata)

internal/manifest/version.go line 165 at r6 (raw file):

	// ParentSSTNum is the FileNum of VirtualState.Parent. We need to store
	// this separately for manifest replay.
	ParentSSTNum base.FileNum

why duplicating the FileNum here instead of using Parent.FileNum?


internal/manifest/version.go line 168 at r6 (raw file):

}

// FileMetadata holds the metadata for an on-disk or virtual sstable.

nit: worth saying on-disk/physical since we have also defined the term physical sst above.


internal/manifest/version.go line 189 at r6 (raw file):

		// version and decremented when the version is unreferenced. The file is
		// obsolete when the reference count falls to zero. The refs field is
		// only used for physical sstables. Virtual sstables will update their

Why only physical ssts? Don't we need a ref count for virtual ssts given that many versions can have the same virtual sst? I think they can also become zombies. We could track virtual sst zombies separately from physical sst zombies, even though some of the latter zombies will be zombies due to references from the former.

I think what you are doing here is noticing that there is no physical resource to delete when only the refs for a virtual sst drop to zero, hence why bother tracking it. I suppose that will work too, and is fine. Could you add a comment somewhere on this reasoning.


internal/manifest/version.go line 195 at r6 (raw file):

		// Note that we could just turn FileMetadata.atomic.refs to a pointer
		// so that a virtual sstable can directly update the refs field in its
		// own FileMetadata. But then we'll suffer an additional allocation on

I think we should cleanly separate what is only needed for physical ssts and only for virtual ssts by having *VirtualState and *PhysicalState pointers and we should do that now. Allocations can be pooled via sync.Pool in the future since we know when a sst becomes obsolete. And file creation is not happening 1000s of times a second, so we can do that allocation optimization later. I'm wary of continuing down this flattening path since eventually we will unflatten due to #2047, so we may as well not flatten for new stuff we are adding.


internal/manifest/version.go line 210 at r6 (raw file):

		//
		// INVARIANT: latestRefs <= refs.
		latestRefs int32

We should state an invariant somewhere that a physical sst that has references from other virtual ssts no longer exists as a first-class citizen in any level of the latest version. Though it could be a first class citizen of a level in an older version. I was imagining latestRefs to be only for such physical ssts, since for the others it is trivially 0 or 1. Were you planning to use it in all cases?
We need a term for such physical ssts. virtualized-physical sst? I am terrible with names. Maybe backing-physical sst, since you use that term below. Is PhysicalFileMeta for both regular physical ssts and backing-physical ssts?


internal/manifest/version.go line 242 at r6 (raw file):

	// alter point keys.
	// NB: these field should be set using ExtendPointKeyBounds. They are left
	// exported for reads as an optimization.

are all the following InternalKeys precise for virtual ssts?


internal/manifest/version.go line 264 at r6 (raw file):

	// TODO(bananabrick): To support manifest replay for virtual sstables, we
	// probably need to compute virtual sstable stats asynchronously. Otherwise,
	// we'd have to write virtual sstable stats to the version edit.

I don't think we want to write these stats for virtual ssts to the version edit given that we don't write these for physical ssts. Is this TODO for a future PR? Is there some aspect of this stats computation that is tricky?


internal/manifest/version.go line 303 at r6 (raw file):

	// HasPointKeys tracks whether the table contains point keys (including
	// RANGEDELs). If a table contains only range deletions, HasPointsKeys is
	// still true.

are all the following fields precise for virtual ssts?


internal/manifest/version.go line 1285 at r6 (raw file):

				fileNum, fileSize = f.FileNum, f.Size
			}
			path := base.MakeFilepath(fs, dirname, base.FileTypeTable, fileNum)

fileNum can be 0 here if we don't initialize it in the previous if-else block. And then we will fail to stat it.

Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 17 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 2 of 3 files at r4, 1 of 4 files at r5, all commit messages.
Reviewable status: 13 of 18 files reviewed, 17 unresolved discussions (waiting on @bananabrick, @RaduBerinde, and @sumeerbhola)


internal/manifest/version.go line 162 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

@itsbilal Going to assume for now that we're keeping around the parent files metadata here, which would mean that we definitely need the version edit changes to support manifest replay.

Another reason we need the parent file's metadata pointer in the virtual sstable meta is to support refcounting. We need to access the parent file metadata's atomic.refs count, and we also need to access the parent file metadata's atomic.latestRefs. An additional reason would be that we also need to implement a compaction picking heuristic to minimize space amp due to virtual sstables, and that would also need to access the parent file's metadata.

I'm okay with that direction if Sumeer and others are good with it, however it's a pretty critical decision that's going to inform a lot of what comes next. I just want us to carefully make that choice and to consider the alternative as well, so I left a top-level comment making the case against this approach.


internal/manifest/version.go line 195 at r6 (raw file):

Previously, sumeerbhola wrote…

I think we should cleanly separate what is only needed for physical ssts and only for virtual ssts by having *VirtualState and *PhysicalState pointers and we should do that now. Allocations can be pooled via sync.Pool in the future since we know when a sst becomes obsolete. And file creation is not happening 1000s of times a second, so we can do that allocation optimization later. I'm wary of continuing down this flattening path since eventually we will unflatten due to #2047, so we may as well not flatten for new stuff we are adding.

I agree that we should separate VirtualState and PhysicalState, instead of keeping the refs here and not using them in virtual sstable metas. That way VirtualState can just reference a PhysicalState instead of a Physical file metadata in its entirety, and that is also more adaptable for when PhysicalState is some Provider-owned thing for disaggregated storage.


internal/manifest/version.go line 203 at r6 (raw file):

		refs int32

		// latestRefs are the references to a physical sstable in the latest

I think versionRefs or virtualRefs is a far better name for this, seeing as iterators don't increment this but versions do.


internal/manifest/version_edit.go line 113 at r6 (raw file):

// Decode decodes an edit from the specified reader.
// TODO(bananabrick): Support decoding of virtual sstable state.

nit: empty line before todo

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we could move the ref counting logic to a map in the versionSet, and then only store a parent physical sst number in the virtual ssts. I don't think golang maps are safe for concurrent access even to different keys, so we'd probably need to work around that. Imo, the ref counting logic as it is right now is a lot simpler. But if we go for this, I'd want to move the ref counting logic for all cases and not just the virtual sst cases to a separate map.

Unless the manifest has rotated twice since the physical sst was created, we can always look back at the AddedFiles in the version edit for the physical SST creation to get the PhysicalMeta we are interested in.

Why twice? Either way, it seems as if the core questions are:

  1. Whether we want to store some physical state of the parent file meta in virtual ssts.
    This can be avoided by using maps in the versionSet for refcounting, and only store a parent sst number. But the map would need to be safe for concurrent access to different keys, and we might see a perf hit there.

  2. Whether we want the additional debuggability to determine exactly when a physical sst is no longer required by the latest version, and exactly when it transitions to zombie.
    This can be partly determined just using the latest version ref counting, and I suppose if a system restarts, and the latest version has a virtual sst, then we can still determine the state the parent sst is in. We'll lose information about obsolete/zombie tables, which is fine because those would be deleted on restart anyway.

I also worry about the space-amp increase in storing the smallest/largest point/range keys extra times for a FileMetadata that is really not being used except for the FileNum.

I like the idea of only storing PhysicalState in the virtual meta, which would solve this. We'd still need the version edit changes in this case, though.

Overall, I'm good with not having those changes, because that's easier to test, but I've also implemented the changes already, so I'm also okay with keeping them.

Thoughts @sumeerbhola?

Reviewable status: 13 of 18 files reviewed, 17 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 13 of 18 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, and @sumeerbhola)


checkpoint.go line 285 at r4 (raw file):

				}
			} else {
				considerFile = f

I assume that there is an invariant that the same physical SST can't appear both as a physical SST and as a parent to a virtual SST. Are we verifying that somewhere?

In any case, it would be easier here to just use BackingPhysicalFile() and dedup in all cases.


internal/manifest/version.go line 165 at r1 (raw file):

	// ParentSSTNum is the FileNum of VirtualState.Parent. We need to store
	// this separately for manifest replay.
	ParentSSTNum base.FileNum

Could you explain this field a little more? Its existence suggests that there are situations where this field is set but Parent isn't? (that aspect should be documented for Parent).

We should add a NewVirtualState constructor which gets a PhysicalFileMeta and sets both fields.

Is the logic around this field isolated to this package? If so, we should unexport

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 13 of 18 files reviewed, 19 unresolved discussions (waiting on @itsbilal, @RaduBerinde, and @sumeerbhola)


internal/manifest/version.go line 165 at r1 (raw file):

Previously, RaduBerinde wrote…

Could you explain this field a little more? Its existence suggests that there are situations where this field is set but Parent isn't? (that aspect should be documented for Parent).

We should add a NewVirtualState constructor which gets a PhysicalFileMeta and sets both fields.

Is the logic around this field isolated to this package? If so, we should unexport

Yea, this field isn't required. I realized that, but didn't make the change here. Will make it.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think golang maps are safe for concurrent access even to different keys, so we'd probably need to work around that.

We can use a RWLock, we'd only need read access to get to the atomic's address. Or, if we use a regular lock, we don't need the atomics anymore (plus it enables adjusting multiple counts at once).

Reviewable status: 13 of 18 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, and @sumeerbhola)

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 13 of 18 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, and @sumeerbhola)


internal/manifest/version.go line 165 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Yea, this field isn't required. I realized that, but didn't make the change here. Will make it.

If that field isn't required, then we don't need VirtualState (it ends up being just a double pointer to FileMetadata). We'd just need a VirtualParent *FileMetadata

@RaduBerinde
Copy link
Member

internal/manifest/version.go line 165 at r1 (raw file):

Previously, RaduBerinde wrote…

If that field isn't required, then we don't need VirtualState (it ends up being just a double pointer to FileMetadata). We'd just need a VirtualParent *FileMetadata

Or rather VirtualParent PhysicalFileMeta

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this point in a meeting with Arjun and I'll repeat it here so @sumeerbhola hears it too, but I'd support a FileMetadata that only has one additional field for virtual sstables, i.e. PhysicalFileNum. We really don't need any other fields from the physical file metadata, and keeping the physical meta around in the manifest (and adding all the logic for "unlinked" files in the version edit, etc., and changing the VersionEdit format) seems pretty unnecessary all for a marginal increase in debuggability.
...
I also worry about the space-amp increase in storing the smallest/largest point/range keys extra times for a FileMetadata that is really not being used except for the FileNum. Those fields take up much more space than a single uint64 for the physical file num (which will only be present if the virtual sst has a tag for virtual sst), and could only lead to more manifest rotations.

I lean heavily towards debuggability in this case. We only need to store those fields once for each virtualized physical sstable, so I am doubtful that there is enough cost for this to matter (and we have improved our manifest rotation logic). And we should be explicit about when we have virtualized physical ssts in a version and when we delete them -- I am quite unmoved about not changing version edit. One question is what we will store about these virtualized physical ssts in the disagg world where there are also foreign ssts. Maybe we don't want to ship all the Smallest*, Largest* keys around for the underlying physical sst (though one will need those for the virtual sst) -- I am very sympathetic to that position. But we do need the fileNum, fileSize stored for these virtualized physical sst (and the version edit will need to explicitly state when the transition from physical => virtualized-physical is happening). So we have a state machine that looks like
physical-sst (has a level) => virtualized-physical-sst => zombie => obsolete
physical-sst => zombie => obsolete
where both virtualized-physical-sst and physical-sst are start states in the state machine. The virtualized-physical-sst start state will only exist in disagg storage (for foreign ssts). Also, we would need to de-dup in this disagg storage case since we may import the same virtualized-physical sst at different points in time. We need the de-duping for the local ref counting and also to keep how many bytes in the virtualized-physical sst are still relevant (to prioritize compactions to reduce space amp -- yes for disagg storage this space amp is only for a single LSM and not across LSMs but it is still worth tracking and minimizing per-LSM space amp since it will make the cluster wide space amp lower). fileNum, fileSize are also the only things in fileInfo which we use in various places (like obsolete files) so should minimize that part of the plumbing change.

The only thing that becomes slightly harder with only storing a file num, is ref counting, but we can restructure ref counting to operate away from the FileMetadata and in some map[PhysicalFileNum] and that solves that problem pretty easily.

I may not be understanding this suggestion, but I don't think this works. The reason we have a copy-on-write btree is that we can install new versions without making a full copy of the btree, and which is why ref counting works the way it is (a ref may be shared across versions if that part of the btree is shared). A map would require a copy of the map for each new version, which I think is a non starter.
I think we can have FileMetadata struct for both physical ssts and virtual ssts. Both of them have a *physicalState where

type physicalState struct {
   FileNum  FileNum
   Size uint64
   Refs int32
   RefsInLatestVersion int32
}

Reviewable status: 13 of 18 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we would need to de-dup in this disagg storage case since we may import the same virtualized-physical sst at different points in time.

Ignore this. We may give the foreign sst a new local filenum when importing through the provider, and then generate a (second) local filenum for the virtual sst. In which case it becomes the provider's job to also ref count for the foreign ssts (another layer of ref counting much like we have many layers of ref counting now -- nothing to worry about). If we don't do this "new local filenum" we can have complexities where a virtualized-physical sst is in zombie state, and we need to transition it back out since we are importing a new virtual sst that points to it.

Reviewable status: 13 of 18 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)

@itsbilal
Copy link
Member

I think my map suggestion was taken more literally than I meant it to be, probably because I specified it as a go type. I just meant that we'd do refcounting on one int for all virtual sstables that map to one physical sstable. That int can be in a *PhysicalState that the virtual sstable FileMetadatas store a pointer to, like you suggested; the main point being that it doesn't need to be part of the FileMetadata and can be taken outside of it.

The VersionEdit can already store that transition to virtual by marking the physical SST's FileNum as deleted, and have new AddedFiles FileMetadatas in the same VersionEdit that are virtual sstables off of it. We already don't have a clear record of when SSTs go from zombie to obsolete, as that's dependent on iterators closing. But if the purpose is to document that transition to zombie explicitly for debuggability purposes, I would support a VersionEdit change that has one additional field, DeletedBackingFiles []base.FileNum (or maybe map[base.FileNum]struct{} similar to DeletedFiles) or so, that names the physical file num in the same version edit when the last Virtual SST from it was in DeletedFiles. That gets us the observability improvement without keeping around the entire metadata (which to me seems pretty excessive when we only need one or two fields).

As for fileSize, I still see enough ways to be able to implement the partial-zombie-table compaction heuristic without storing the total size of the physical sstable, but if Arjun feels that we should have it around, having it in the PhysicalState is not a bad idea. And I like the definition of PhysicalState that @sumeerbhola proposed.

In short, this is what I'm suggesting:

  1. Have a physical state struct like Sumeer proposed, that stores the refs and physical file num and size. Looks like we have consensus on this.
  2. Add a VersionEdit field, DeletedBackingFiles that specifies the transition from virtualized-physical-sstable to zombie; similar to what Arjun was planning on doing, except we don't store the entire physical FileMetadata there, but just the physical filenum. I know that this isn't the VersionEdit PR but we're basically required to make some decisions about that here.
  3. Don't have virtual file metadata point to a physical file metadata, but just to its physicalState. This way, we're able to maintain the guarantee that if you have a *FileMetadata, you can mint a valid iterator on it.

@bananabrick
Copy link
Contributor Author

bananabrick commented Feb 22, 2023

In short, this is what I'm suggesting:

Have a physical state struct like Sumeer proposed, that stores the refs and physical file num and size. Looks like we have >consensus on this.
Add a VersionEdit field, DeletedBackingFiles that specifies the transition from virtualized-physical-sstable to zombie; >similar to what Arjun was planning on doing, except we don't store the entire physical FileMetadata there, but just the >physical filenum. I know that this isn't the VersionEdit PR but we're basically required to make some decisions about that >here.
Don't have virtual file metadata point to a physical file metadata, but just to its physicalState. This way, we're able to >maintain the guarantee that if you have a *FileMetadata, you can mint a valid iterator on it.

This sounds good to me. I think the version edit will also need an additional field for AddedBackingState which will be written to once the physical sst is first virtualized if we're also storing the size of the physical sst. AddedBackingState would be a map AddedBackingState map[base.FileNum]uint64 // filenum -> size.

There's ways to work around it, of course.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this isn't the VersionEdit PR but we're basically required to make some decisions about that here.

btw, I would prefer to have the VersionEdit changes in this PR. We need the reasoning about the state transitions to know that we have the right state. We can leave the edit encoding/decoding to a later PR if folks prefer.

I think the version edit will also need an additional field for AddedBackingState which will be written to once the physical sst is first virtualized if we're also storing the size of the physical sst.

Making this explicit, as you outlined, would be my preference too.

Reviewable status: 13 of 18 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, and @RaduBerinde)

@itsbilal
Copy link
Member

I think AddedBackingState and DeletedBackingState would do this job well. The filenum it's keyed on could be the physical file num (and providers generate a physical file num for foreign SSTs that can just act as a placeholder, either way disaggregated storage details are not super relevant for this PR). It might make sense to make the value more expandable than just a uint64, maybe a BackingState that's an interface that has an Encode method and a String method (our physicalState struct could implement this interface) , and we have a DecodeBackingState method in internal/manifest that produces a BackingState at restart. objstorage could have a separate BackingState implementation. But the BackingState interface work can wait until after virtual SSTs are done, as long as the on-disk format in the versionedit decode PR supports something other than just the size.

@itsbilal
Copy link
Member

Actually, Nevermind what I said about the BackingState interface and the on disk format being flexible. The provider’s catalog will contain any additional fields anyway.

@bananabrick
Copy link
Contributor Author

This needs a bit more work before I request review again. Haven't completed all of the suggestions with satisfactory quality yet.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for fileSize, I still see enough ways to be able to implement the partial-zombie-table compaction heuristic without storing the total size of the physical sstable

I might be misunderstanding, but I think we we need the total size of the physical sstable if only for disk space usage accounting. Cockroach relies on Pebble maintaining an accounting of all disk space usage under its control for the purpose of reporting in the DB console. This needs to by physical disk space usage, not virtual.

Reviewable status: 6 of 32 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting closer to the finish line!

Reviewed 2 of 17 files at r1, 2 of 3 files at r4, 3 of 24 files at r7, 2 of 6 files at r10, 3 of 4 files at r11, all commit messages.
Reviewable status: 14 of 33 files reviewed, 54 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)


db.go line 1834 at r11 (raw file):

			if opt.withProperties && !m.Virtual {
				// The withProperties option isn't used by Cockroach, so it
				// should be okay to skip virtual sstable support here.

This is hacky, and I think we should support this properly by stating that the properties are for the underlying sstable (we don't need to de-duplicated). Since SSTableInfo is only used in this case we can add fields IsVirtual and PhysicalFileNum to it. I am ok with a TODO here pointing to an issue.


db.go line 1891 at r11 (raw file):

				// sstables backing file here to get the disk usage. But the
				// table cache is currently built to operate on FileMetadata and
				// not BackingState.

This needs a tracking issue too. Maybe create an overarching one with the list of things that we need to address before considering this production worthy.


format_major_version.go line 499 at r11 (raw file):

			for f := iter.First(); f != nil; f = iter.Next() {
				err = tc.withReader(
					f, func(r *sstable.Reader) error {

what's the reason for this formatting change?


version_set.go line 88 at r9 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Or is this the metadata for that last virtual-leveled-sst?

Yes it is. obsoleteFn refers to versionSet.addObsoleteLocked which can handle virtual meta passed.

I think this is confusing and needs comments at least. Otherwise it is easy to imagine that we may be passing a virtual sst here because it is obsolete. Better yet, can we change this to fileInfo? versionSet.addObsoleteLocked does not need anything else from the FileMetadata.


version_set.go line 520 at r9 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

This is the solution I went with:

  • Added a new method called
func (b *BulkVersionEdit) ApplyAndCompleteVE(
        ve *versionEdit, curr *Version,
	cmp Compare,
	formatKey base.FormatKey,
	flushSplitBytes int64,
	readCompactionRate int64,
	backingState map[base.FileNum]*BackingState) (completedEdit *VersionEdit, err error)

Ideally, we'd complete the versionEdit before the apply step, and even before the accumulate step. But to determine if a version edit must be completed, we need to determine if a file is transitioning to a zombie status. To do that, we need to iterate over the files being added and removed in the bulk version edit, and look up/update the refcounts of the files. While it is possible to do this before the accumulate or during the accumulate step, it is error prone and the logic will be just as or more complex than the file refcounting logic we already have in the apply step.

We complete the version edit before the backingState is updated.

  • BVE.Apply handles creation/maintenance of the VersionSet.backingState in the latest version. It is self-contained.

I'll keep thinking about a better solution, but posting this to unblock the rest of the review. Overall, if we decide to refactor the code around logAndApply/load/Apply/Accumulate, I'd rather reason about the code in isolation separate from this pr.

The reason I am quibbling about BVE.ApplyAndCompleteVE is that one applies a BVE and not a VE and this pattern is clean: accumulate VEs into a BVE, then Apply. By saying ApplyAndCompleteVE we are saying there is only 1 VE in this BVE, so why did we even bother staging it in the BVE by calling Accumulate beforehand? We didn't even have a complete VE so what did we Accumulate, and why was that meaningful. It sounds semantically wonky.

I'd be ok with either my preceding suggestion of func (b *BulkVersionEdit) AccumulateFirstEditIncomplete(ve *VersionEdit, latestVersion *Version) (completedEdit *VersionEdit, err error) followed by Apply, or a single step AccumulateIncompleteAndApply which is called on an empty BVE. With the first option, we would update backingStateMap in Apply and AccumulateFirstEditIncomplete would have done the ref-counting calculations to know which backingStates need to be removed.


version_set.go line 103 at r11 (raw file):

	// don't keep track of backing state which supports a virtual sstable
	// which is not in the latest version.
	backingState map[FileNum]*backingState

nit: backingStateMap (harder to write code comments when both the map and the elements are called backingState)


version_set.go line 686 at r11 (raw file):

			if _, ok := dedup[meta.BackingState.FileNum]; !ok {
				dedup[meta.BackingState.FileNum] = struct{}{}
				snapshot.CreateBackingTables = append(

so we are including these for physical ssts too?
Maybe I misunderstood the definition of CreateBackingTables because I was expecting this to be the same as the list in versionSet.backingStateMap.


version_set.go line 782 at r11 (raw file):

	for _, f := range obsolete {
		backingState := f.BackingState
		_, ok := dedup[backingState.FileNum]

why would we have duplicates here if only the last virtual sst holding the reference would be added via addObsoleteLocked?


internal/manifest/version.go line 120 at r11 (raw file):

Previously, RaduBerinde wrote…

[nit] feels like these definitions (including BackingState) should come after the FileMetadata which explains everything.

+1


internal/manifest/version.go line 180 at r11 (raw file):

		latestVersionRefs int32
		// VirtualizedSize is set iff the physical sst associated with the
		// BackingState has been virtualized. VirtualizedSize is the sum of the

VirtualizedSize is set iff the backing sst is referred to only by virtual ssts. ...


internal/manifest/version.go line 195 at r11 (raw file):

Previously, RaduBerinde wrote…

I wonder if this would be a good time to consider coming up with a new type for this. FileNum serves as both a logical number within a manifest and a "handle" on reading/writing the object (the former is what we want in FileMetadata and the latter is what we need here).

If we use the same type, I could see a lot of bugs where we use FileMetadata.FileNum instead of FileMetadata.BackingState.FileNum (and that works just fine for physical so the bugs would only show up for virtual).

+1


internal/manifest/version.go line 202 at r11 (raw file):

Previously, RaduBerinde wrote…

Would it be accurate to say that there is only one FileMetadata for a given FileNum? (that invariant would explain why we need to share the metadata in different versions/levels)

+1 to stating that invariant.


internal/manifest/version.go line 216 at r11 (raw file):

//
// Once a physical sst is no longer needed by any version, i.e. it's obsolete,
// we will no longer maintain the file metadata associated with it. We will

This last para is a bit confusing. We should define "obsolete". In the sense we use for deletion of files, which is the wider use, it only applies to backing-ssts. A leveled-sst FileMetadata goes away when no version has a pointer to it (simply Go GC) -- we should probably not call that "obsolete" to avoid confusion.


internal/manifest/version_edit.go line 777 at r9 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

This is messy since we have created confusion on what this refcount really represents.

This is also a problem for the existing ref count. It's solved implicitly because the btree doesn't decrement the file ref count if the file is not present in the tree.

Here we solve it explicitly by stating the reasoning.

I'm keen on not making the suggested change here, but I do think it's a worthwhile change to make, because I found reasoning about correctness in the Apply step to be tricky because of the inconsistent added/deleted data, but mostly because Apply is called from multiple code paths. Would a todo suffice for now? I'm pretty confident about the correctness as it stands.

Let's make the change since I am finding it hard to reason clearly.
Can you send out a small PR which does this BVE change and we can review and merge it quickly and then rebase on top of that here.

Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the others, this is really taking shape!

Reviewed 1 of 17 files at r1, 4 of 24 files at r7, 5 of 9 files at r8, 11 of 22 files at r12, 1 of 3 files at r13, 3 of 4 files at r14, 1 of 1 files at r15, all commit messages.
Reviewable status: 26 of 34 files reviewed, 61 unresolved discussions (waiting on @bananabrick, @jbowens, @RaduBerinde, and @sumeerbhola)


db.go line 1891 at r11 (raw file):

Previously, sumeerbhola wrote…

This needs a tracking issue too. Maybe create an overarching one with the list of things that we need to address before considering this production worthy.

The table cache should continue to operate on FileMetadata right? That way we can rely on it to create the appropriate bounds-enforcing virtual iterator. I'm good with a TODO/tracking issue here, but I don't see how this is a problem for this function; the iter should already be the constrained one.

The single biggest reason why I wanted us to separate out physical state from FileMetadata was so iterators could continue to be created on FileMetadata, as they everything you could possibly need to know when creating and using iterators.


ingest.go line 1025 at r15 (raw file):

// DB.mu must be locked when calling.
//
// TODO(bananabrick): Make sure that the ingestion step only passes in the

I can see us needing this for virtual SSTs ingested in disaggregated storage, any particular reason not to generally support this case too?


table_stats.go line 410 at r15 (raw file):

		for file := iter.First(); file != nil; file = iter.Next() {
			var err error
			if file.Virtual {

This will never happen because we'd have already hit the panic in PhysicalMeta() right?


internal/manifest/version.go line 188 at r15 (raw file):

	// ingested. For virtual sstables, this corresponds to the wall clock time
	// when the FileMetadata for the virtual sstable was first created.
	// virtual sstables.

nit: probably didn't mean to keep around this "virtual sstables." at the end?


internal/manifest/version.go line 279 at r15 (raw file):

// NB: This type should only be constructed by calling
// FileMetadata.PhysicalMeta.
type PhysicalFileMeta struct {

Is there a good reason for it to be a wrapping struct? How about:

type PhysicalFileMeta *FileMetadata

I also don't see this being used in anything other than stats calculation (unless I missed something) - maybe we should have a TODO to get rid of this once stats work for virtual SSTs. It makes more sense to have code deal with FileMetadata and BackingState only, not PhysicalFileMeta and VirtualFileMeta.


internal/manifest/version.go line 332 at r15 (raw file):

		//
		// INVARIANT: latestVersionRefs <= refs.
		latestVersionRefs int32

Having this and VirtualizedSize be atomics and not just regular ints protected by db.mu or the version list or manifest lock seems unintuitive, and suggests that they are open to more concurrency than they actually are. And the value for VirtualizedSize makes sense in conjunction with latestVersionRefs, not on its own. But it's also hard to have these two actually be protected by those mutexes, since you are updating them in bve.Apply* and not versionSet.logAndApply.

I think the correct way to clean this up is to move the zombie logic to versionSet.logAndApply, so we compute zombie tables after a version has been successfully installed, and where we currently bump the Version ref.


internal/manifest/version_edit.go line 124 at r15 (raw file):

	// The physical sstable associated with the BackingState must also not be
	// present in NewFiles.
	CreateBackingTables []*BackingState

should be CreatedBackingTables as all other verbs are in the past tense (addedfiles, deletedfiles)


internal/manifest/version_edit.go line 141 at r15 (raw file):

	// and RemoveBackingTables. A file must be present in RemoveBackingTables in
	// exactly one version edit.
	RemoveBackingTables []base.FileNum

should be RemovedBackingTables as all other verbs are in the past tense, and the comment should also be in the past tense.


internal/manifest/version_edit.go line 834 at r15 (raw file):

			// We're adding this file to the new version, so increment the
			// latest refs count.
			f.LatestRef()

Another major issue here is that our ref counting will become inconsistent if we error out in this method, eg. below where we InitL0Sublevels; we'd have bumped up this ref here for a version that never gets installed. It's why I think refs should be best read/written to outside of BulkVersionEdit and in VersionSet.logAndApply. Thoughts?

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed comments. PTAL.

Reviewable status: 13 of 34 files reviewed, 50 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


db.go line 1834 at r11 (raw file):

Previously, sumeerbhola wrote…

This is hacky, and I think we should support this properly by stating that the properties are for the underlying sstable (we don't need to de-duplicated). Since SSTableInfo is only used in this case we can add fields IsVirtual and PhysicalFileNum to it. I am ok with a TODO here pointing to an issue.

Added here because I want to avoid too many todos.


db.go line 1891 at r11 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

The table cache should continue to operate on FileMetadata right? That way we can rely on it to create the appropriate bounds-enforcing virtual iterator. I'm good with a TODO/tracking issue here, but I don't see how this is a problem for this function; the iter should already be the constrained one.

The single biggest reason why I wanted us to separate out physical state from FileMetadata was so iterators could continue to be created on FileMetadata, as they everything you could possibly need to know when creating and using iterators.

Got rid of the special casing for now. We would need Reader.EstimatedDiskUsage to work on virtual sstables, which requires further constraining of the bounds start,end to within the virtual sstable bounds, which is what I do in the read path changes.


ingest.go line 1025 at r15 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I can see us needing this for virtual SSTs ingested in disaggregated storage, any particular reason not to generally support this case too?

All this does is open a reader and validate block checksums for each block in the sstable. Can probably pass in the BackingState of the virtual sstables, grab a reader, and validate block checksums on those. That's what the todo is indicating. Makes more sense to do this in the read path pr once the reader changes are in.


table_stats.go line 410 at r15 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This will never happen because we'd have already hit the panic in PhysicalMeta() right?

The input to the function is a physicalMeta but the files we are iterating over here can be virtual.


version_set.go line 88 at r9 (raw file):

Previously, sumeerbhola wrote…

I think this is confusing and needs comments at least. Otherwise it is easy to imagine that we may be passing a virtual sst here because it is obsolete. Better yet, can we change this to fileInfo? versionSet.addObsoleteLocked does not need anything else from the FileMetadata.

We want to pass virtual sstables into that function. The addObsoleteLocked function is called from Version.Unref and Version.UnrefLocked. Initially, I was only passing physical sst meta to addObsoleteLocked, but a suggestion was made during review that the we should let addObsoleteLocked handle virtual ssts being passed.

obsolete []*fileMetadata definition is accurate. It can be the file metadata associated with either a physical or virtual sstable which was recently obsoleted.

I added a comment with the explanation to addObsoleteLocked.


version_set.go line 103 at r9 (raw file):

Previously, sumeerbhola wrote…

We discussed this a week ago on slack, regarding whether this backing map should be in VersionSet or in each Version. I can see the efficiency benefit of putting this in VersionSet since then we don't have to copy the map for each new version, even though one could argue that this is really state that belongs in a Version.

Done.


version_set.go line 520 at r9 (raw file):

or a single step AccumulateIncompleteAndApply which is called on an empty BVE.

Doing this currently. We're still accumulating first within the function, applying, and then completing the incomplete version edit.

If we want, instead of completing an incomplete version edit, we could just write two version edits. One where the last virtual sstable in a version associated with a backing table gets deleted. And a subsequent version edit which has the RemovedBackingTables entry. That way we can get rid of the incomplete version edit mess. But I think overall, even the current solution is pretty easy to understand.


version_set.go line 686 at r11 (raw file):

Previously, sumeerbhola wrote…

so we are including these for physical ssts too?
Maybe I misunderstood the definition of CreateBackingTables because I was expecting this to be the same as the list in versionSet.backingStateMap.

Oversight. The accuracy of CreateBackingTables when we write the manifest snapshot is only tested during manifest replay, and there's no manifest replay testing without ve decoding/encoding.


version_set.go line 782 at r11 (raw file):

Previously, sumeerbhola wrote…

why would we have duplicates here if only the last virtual sst holding the reference would be added via addObsoleteLocked?

We wouldn't. But it was an implicit invariant. I turned the fact that there should be no duplicate backing files passed into addObsoleteLocked into an invariant and I added a panic if there's duplicates and invariants are enabled.

Doesn't seem like a necessary invariant to explicitly write down, but there's no harm in it.


internal/manifest/version.go line 180 at r11 (raw file):

Previously, sumeerbhola wrote…

VirtualizedSize is set iff the backing sst is referred to only by virtual ssts. ...

Done.


internal/manifest/version.go line 202 at r11 (raw file):

Previously, RaduBerinde wrote…

Would it be accurate to say that there is only one FileMetadata for a given FileNum? (that invariant would explain why we need to share the metadata in different versions/levels)

Yes, this will be true. We don't construct more than one file meta.


internal/manifest/version.go line 209 at r11 (raw file):

Previously, RaduBerinde wrote…

[nit] I'd say "Within a version, a backing-sst is .." (makes it a bit more clear that all SSTs talked about are in the same version)

Done.


internal/manifest/version.go line 216 at r11 (raw file):

Previously, sumeerbhola wrote…

This last para is a bit confusing. We should define "obsolete". In the sense we use for deletion of files, which is the wider use, it only applies to backing-ssts. A leveled-sst FileMetadata goes away when no version has a pointer to it (simply Go GC) -- we should probably not call that "obsolete" to avoid confusion.

Done.


internal/manifest/version.go line 249 at r11 (raw file):

Previously, RaduBerinde wrote…

What does it approximate? Is it an approximation of how much data in the physical SST is required to e.g. fully iterate through the virtual SST?

We can approximate based on the block counts. So, if a physical sst is a-e and we virtualize it to create a-b and c-e, then we can count how many blocks cover the keys a-b and compare it with the total number of blocks in the sstable.

We might have to expose the blocks skipped during seeks.


internal/manifest/version.go line 274 at r11 (raw file):

Previously, RaduBerinde wrote…

For virtual SSTs, are these fields used to delineate where the virtual SST's keyspace starts and ends?

Yes.


internal/manifest/version.go line 343 at r11 (raw file):

Previously, RaduBerinde wrote…

It would be good to outline here how the virtual boundaries are determined

I think this would be different for different use cases. For the read path pr, I've added some restrictions to what kind of boundaries can be picked, so I'll probably add those invariants here in that pr.


internal/manifest/version.go line 367 at r11 (raw file):

Previously, RaduBerinde wrote…

What's a "parent" in this latest iteration?

Added a comment here.


internal/manifest/version.go line 279 at r15 (raw file):

Is there a good reason for it to be a wrapping struct?
Yea, I believe type PhysicalFileMeta *FileMetadata creating an alias like this doesn't provide type safety.

I also don't see this being used in anything other than stats calculation

It's used in a couple of other places. While we would like to just work with FileMetadata and not worry whether it's virtual or physical, in reality, we need some special case handling for virtual sstables(like checkpointing). So, it's nice to be able to constrain the inputs to functions, or even slices to state that the filemetadata is physical or virtual.

I was also using this to constrain functions in the read path pr.

If it turns it isn't useful in the future, it's easy enough to remove.


internal/manifest/version.go line 332 at r15 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Having this and VirtualizedSize be atomics and not just regular ints protected by db.mu or the version list or manifest lock seems unintuitive, and suggests that they are open to more concurrency than they actually are. And the value for VirtualizedSize makes sense in conjunction with latestVersionRefs, not on its own. But it's also hard to have these two actually be protected by those mutexes, since you are updating them in bve.Apply* and not versionSet.logAndApply.

I think the correct way to clean this up is to move the zombie logic to versionSet.logAndApply, so we compute zombie tables after a version has been successfully installed, and where we currently bump the Version ref.

We also need this to be updated during manifest replay in versionSet.load. There's no concurrency during replay, so we don't have to worry about the mutex. But it makes sense to keep it in Apply for that reason, imo.


internal/manifest/version_edit.go line 777 at r9 (raw file):

Previously, sumeerbhola wrote…

Let's make the change since I am finding it hard to reason clearly.
Can you send out a small PR which does this BVE change and we can review and merge it quickly and then rebase on top of that here.

Done.


internal/manifest/version_edit.go line 124 at r15 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

should be CreatedBackingTables as all other verbs are in the past tense (addedfiles, deletedfiles)

Done.


internal/manifest/version_edit.go line 141 at r15 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

should be RemovedBackingTables as all other verbs are in the past tense, and the comment should also be in the past tense.

Done.


internal/manifest/version_edit.go line 834 at r15 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Another major issue here is that our ref counting will become inconsistent if we error out in this method, eg. below where we InitL0Sublevels; we'd have bumped up this ref here for a version that never gets installed. It's why I think refs should be best read/written to outside of BulkVersionEdit and in VersionSet.logAndApply. Thoughts?

As discussed on slack, logAndApply, calls FatalF if there's an error in Apply.


format_major_version.go line 499 at r11 (raw file):

Previously, sumeerbhola wrote…

what's the reason for this formatting change?

There was a change here, which I reverted, but the format change lingered.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 17 files at r1, 4 of 22 files at r12, 4 of 14 files at r16, all commit messages.
Reviewable status: 17 of 34 files reviewed, 55 unresolved discussions (waiting on @bananabrick, @itsbilal, @RaduBerinde, and @sumeerbhola)


checkpoint.go line 194 at r17 (raw file):

	manifestSize := d.mu.versions.manifest.Size()
	optionsFileNum := d.optionsFileNum
	virtualBackingFiles := make(map[base.FileNum]struct{})

can we call this versionBackingFiles or allBackingFiles? the virtual here is mildly confusing


checkpoint.go line 378 at r17 (raw file):

					ve.RemovedBackingTables = append(ve.RemovedBackingTables, fileNum)
				}
			}

rather than adding two new arguments just to compute the intersection, should we add a single parameter excludedBackingFiles and put the onus on the caller to compute the intersection?


db.go line 1891 at r11 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Got rid of the special casing for now. We would need Reader.EstimatedDiskUsage to work on virtual sstables, which requires further constraining of the bounds start,end to within the virtual sstable bounds, which is what I do in the read path changes.

I think implementing EstimatedDiskUsage for virtual sstables by constraining the bounds is the best option, but I don't love how it can obscure physical disk space usage outside the virtual SST's bounds that is pinned by the virtual sstable's sole reference. In a sense, a single reference backing sstable is a part of the physical disk space usage of this keyspan.


table_stats.go line 477 at r17 (raw file):

			//
			// TODO(bananabrick): See if we can include virtual sstables here to
			// get a more accurate estimate.

I think we should include virtual sstables here. We're not necessarily reclaiming any disk space by compacting them, but by the same token we may reclaim an outsized portion of the physical file size if we're deleting the last reference (and so the compaction is especially beneficial at reducing space amplification). I think it would be reasonable to do the simple thing and just treat these the same as physical files with narrower bounds would be treated. The two competing over and under estimates may be a wash.


version_set.go line 782 at r11 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

We wouldn't. But it was an implicit invariant. I turned the fact that there should be no duplicate backing files passed into addObsoleteLocked into an invariant and I added a panic if there's duplicates and invariants are enabled.

Doesn't seem like a necessary invariant to explicitly write down, but there's no harm in it.

Since we don't need to de-duplicate, I think we should put the entire de-duplication (eg, in a second for loop) behind the invarinats.Enabled flag.


internal/manifest/version.go line 365 at r17 (raw file):

// after the relevant state has been set in the FileMetadata is not necessary
// in tests which don't rely on BackingState.
func (m *FileMetadata) Init() {

Init functions like this make me nervous because it's so easy to forget to invoke them.


internal/manifest/version.go line 428 at r17 (raw file):

	}

	return atomic.AddInt32(&m.BackingState.Atomic.latestVersionRefs, -1)

worth asserting (behind invariants.Enabled) that the return value AddInt32 in both of the unref calls is ≥ 0

@bananabrick bananabrick requested review from itsbilal, sumeerbhola, RaduBerinde and jbowens and removed request for jbowens March 21, 2023 16:35
Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting very close! I think these comments plus the ones Jackson left are the only suggestions I can think of.

Reviewed 7 of 22 files at r12, 8 of 14 files at r16, 2 of 2 files at r17, all commit messages.
Reviewable status: all files reviewed, 55 unresolved discussions (waiting on @bananabrick, @jbowens, @RaduBerinde, and @sumeerbhola)


checkpoint.go line 194 at r17 (raw file):

Previously, jbowens (Jackson Owens) wrote…

can we call this versionBackingFiles or allBackingFiles? the virtual here is mildly confusing

I second this


version_set.go line 88 at r9 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

We want to pass virtual sstables into that function. The addObsoleteLocked function is called from Version.Unref and Version.UnrefLocked. Initially, I was only passing physical sst meta to addObsoleteLocked, but a suggestion was made during review that the we should let addObsoleteLocked handle virtual ssts being passed.

obsolete []*fileMetadata definition is accurate. It can be the file metadata associated with either a physical or virtual sstable which was recently obsoleted.

I added a comment with the explanation to addObsoleteLocked.

Can you refer to the suggestion that was made to allow virtual ssts to be passed here? the way I see it, this function should only deal with backing states, as it isn't smart enough to know if a virtual sst has a backing state that's still in use.


version_set.go line 786 at r17 (raw file):

//
// DB.mu must be held when addObsoleteLocked is called.
func (vs *versionSet) addObsoleteLocked(obsolete []*fileMetadata) {

This should take in *BackingState not *FileMetadata. Makes it clearer that this function can't be called with a virtual sst just because that virtual sst is no longer used, even if its backing state is still referenced somewhere. And also simplifies the comment above it a lot (eg. "The backing states aassociated with files in obsolete list must not appear more than once" -> "The backing states must not appear more than once")


internal/manifest/version.go line 274 at r11 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Yes.

Yes, and rather it'll be the Smallest/Largest overall bounds where we will constrain iterators, not just the point key bounds. We expect the range key state to be complete within Smallest/Largest, even if there aren't any point keys for large parts of the overall key space.


internal/manifest/version.go line 198 at r17 (raw file):

	// Lower and upper bounds for the smallest and largest sequence numbers in
	// the table, across both point and range keys. For physical sstables, these
	// values are precise.

nit: s/precise/tight bounds? also probably worth saying that for virtual sstables, these values are loose bounds in that no keys with sequence numbers outside these bounds are expected, but there's no guarantee that there will be a key at the largest/smallest seq num.


internal/manifest/version.go line 401 at r17 (raw file):

}

// Ref increments the ref count associated with the backing sstable.

I think it'd be best if this comment called out that this is meant for, say, iterator / read state refs, and the one for LatestRef can say that this is meant for Versions.

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responded to the review. PTAL!

Reviewable status: 27 of 34 files reviewed, 52 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


checkpoint.go line 194 at r17 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I second this

But it's not all backing files or BackingState in the version. It's the BackingState which backs virtual sstables in latest version.


checkpoint.go line 378 at r17 (raw file):

Previously, jbowens (Jackson Owens) wrote…

rather than adding two new arguments just to compute the intersection, should we add a single parameter excludedBackingFiles and put the onus on the caller to compute the intersection?

Yea, done.


db.go line 1891 at r11 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think implementing EstimatedDiskUsage for virtual sstables by constraining the bounds is the best option, but I don't love how it can obscure physical disk space usage outside the virtual SST's bounds that is pinned by the virtual sstable's sole reference. In a sense, a single reference backing sstable is a part of the physical disk space usage of this keyspan.

Technically we can just grab BackingState.Size once for virtual sstables. But it doesn't work for cases where the virtual sstable doesn't overlap with start,end but the physical sstable does.

Left it as is for now.


table_stats.go line 477 at r17 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think we should include virtual sstables here. We're not necessarily reclaiming any disk space by compacting them, but by the same token we may reclaim an outsized portion of the physical file size if we're deleting the last reference (and so the compaction is especially beneficial at reducing space amplification). I think it would be reasonable to do the simple thing and just treat these the same as physical files with narrower bounds would be treated. The two competing over and under estimates may be a wash.

Don't disagree. Turned the continue into a panic. We might have to call Reader.EstimatedDiskUsage which will only work for virtual sstables after the read path pr is up. Leaving the todo for now with updated wording.


version_set.go line 88 at r9 (raw file):
https://reviewable.io/reviews/cockroachdb/pebble/2288#-NOLhoH--L-ZIxwTbViB

as it isn't smart enough to know if a virtual sst has a backing state that's still in use.

None of the backing states of virtual sstables passed in can be in use. They're no longer referenced by any Version. Anyway, I'm passing in BackingState now.


version_set.go line 782 at r11 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Since we don't need to de-duplicate, I think we should put the entire de-duplication (eg, in a second for loop) behind the invarinats.Enabled flag.

Done.


version_set.go line 786 at r17 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This should take in *BackingState not *FileMetadata. Makes it clearer that this function can't be called with a virtual sst just because that virtual sst is no longer used, even if its backing state is still referenced somewhere. And also simplifies the comment above it a lot (eg. "The backing states aassociated with files in obsolete list must not appear more than once" -> "The backing states must not appear more than once")

Done.


internal/manifest/version.go line 195 at r11 (raw file):

Previously, sumeerbhola wrote…

+1

I can see the use for that, but I think we already get some type safety here. If a function is expecting BackingState.FileNum instead of FileMetadata.FileNum, then we can make the input parameter to the function a *BackingState and use the function that way. I plan to do this for some of the table cache functions in the read path.

But I left a TODO here to create a separate type for BackingState.FileNum, which I'll work in the read path pr.


internal/manifest/version.go line 365 at r17 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Init functions like this make me nervous because it's so easy to forget to invoke them.

Same. But I don't know how to avoid it. The good news is that we only create FileMetadata during runCompaction and ingest.


internal/manifest/version.go line 401 at r17 (raw file):
Both BackingState.Atomic.refs and BackingState.atomic.latestVersionRefs have comments explaining what they're tracking.

that this is meant for, say, iterator / read state refs, and the one for LatestRef can say that this is meant for Versions.

Both are tracking references to the FileMetadata by Versions.


internal/manifest/version.go line 428 at r17 (raw file):

Previously, jbowens (Jackson Owens) wrote…

worth asserting (behind invariants.Enabled) that the return value AddInt32 in both of the unref calls is ≥ 0

Done.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 22 of 34 files reviewed, 60 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @sumeerbhola)


internal/manifest/version.go line 325 at r18 (raw file):

//
// See the comment above the FileMetadata type for sstable terminology.
type BackingState struct {

[nit] "state" doesn't feel right, maybe FileBacking ? "FileBacking contains the metadata associated with a physical file; within a version, a FileBacking either backs a single physical sstable, or one or more virtual sstables."


internal/manifest/version.go line 333 at r18 (raw file):

		// backing sst file from disk. The backing file is obsolete when the
		// reference count falls to zero.
		refs int32

I think we can switch these over to atomic.Int32 (added in go 1.19)


internal/manifest/version.go line 341 at r18 (raw file):

		// INVARIANT: latestVersionRefs <= refs.
		latestVersionRefs int32
		// VirtualizedSize is set iff the backing sst is only referred to by

Is this accurate? If this backing is used for a physical sst in the previous version, and a bunch of virtual ones in the latest version, won't it be set? Should it be "VirtualizedSize is set if the backing is used for virtual sstables in the latest version"?


internal/manifest/version.go line 367 at r18 (raw file):

// after the relevant state has been set in the FileMetadata is not necessary
// in tests which don't rely on BackingState.
func (m *FileMetadata) Init() {

[nit] Init is too general, maybeInitPhysical or InitPhysicalBacking


internal/manifest/version.go line 372 at r18 (raw file):

	}
	if m.BackingState == nil {
		m.BackingState = &BackingState{}

[nit] &BackingState{Size: m.Size, FileNum: m.FileNum}


internal/manifest/version.go line 378 at r18 (raw file):

}

// ValidateVirtual should be called once the FileMetadata for a virtual sstable

[supernit] should first say what it does, "ValidateVirtual verifies that the fields of a virtual sstable are sound. It should be calle donce .."


internal/manifest/version.go line 399 at r18 (raw file):

// Refs returns the refcount of backing sstable.
func (m *FileMetadata) Refs() int32 {

[nit] Refs() is very close toRef(), maybe NumRefs(). Or maybe HasRefs() since we only care if it's 0 or not.


internal/manifest/version.go line 409 at r18 (raw file):

// Unref decrements the ref count associated with the backing sstable.
func (m *FileMetadata) Unref() int32 {

[nit] maybe return (hasRemainingRefs bool) instead of the count (same for the Latest version)

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 24 files at r7, 9 of 22 files at r12, 7 of 14 files at r16, 1 of 2 files at r17, 12 of 12 files at r18, all commit messages.
Reviewable status: all files reviewed, 63 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)


checkpoint.go line 279 at r18 (raw file):

			}

			backingState := f.BackingState

physical ssts also have a BackingState. I assume this computation is correct because requiredBackingFiles (which will include backing for physical ssts) is only used for subtracting from virtualBackingFiles and the latter only contains the ones backing virtual ssts (and because a backing sst cannot be backing both a physical and a virtual sst in a version). Is my understanding correct? If yes, this needs a code comment. Or better yet, lets scope requiredBackingFiles only to those needed for virtual ssts.


table_stats.go line 477 at r17 (raw file):

I think it would be reasonable to do the simple thing and just treat these the same as physical files with narrower bounds would be treated.

+1


version_set.go line 520 at r9 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

or a single step AccumulateIncompleteAndApply which is called on an empty BVE.

Doing this currently. We're still accumulating first within the function, applying, and then completing the incomplete version edit.

If we want, instead of completing an incomplete version edit, we could just write two version edits. One where the last virtual sstable in a version associated with a backing table gets deleted. And a subsequent version edit which has the RemovedBackingTables entry. That way we can get rid of the incomplete version edit mess. But I think overall, even the current solution is pretty easy to understand.

Current solution is good


version_set.go line 778 at r18 (raw file):

// addObsoleteLocked will add the fileInfo associated with obsolete backing
// sstables to the obsolete tables list. If virtual sstables are passed into

"If virtual sstables are passed into ..." seems stale since the parameter is now []*backingState.


internal/manifest/version.go line 341 at r18 (raw file):

Should it be "VirtualizedSize is set if the backing is used for virtual sstables in the latest version"?

sounds right


internal/manifest/version_edit.go line 739 at r18 (raw file):

	for _, fileNum := range b.RemovedBackingState {
		delete(backingStateMap, fileNum)

Will b.RemovedBackingState ever be non-empty, given that ve.RemovedBackingTables must be empty?
I think we should check the latter condition and return an error.


version_set_test.go line 116 at r5 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

I find it odd that we don't call updateReadStateLocked from within logAndApply and instead rely on the caller to make sure that the read state is updated.

I think it was a choice based on drawing a code abstraction boundary around the version code, that does not know about memtables.


version_set_test.go line 36 at r9 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

This test is a bit clunky, but it gets the job done and I don't think we need a full data driven test here.

We should consider adding a specialized randomized test that creates and deletes virtual ssts, and relies on strong invariant checking to uncover bugs in our manifest handling code.

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 57 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


checkpoint.go line 279 at r18 (raw file):

Previously, sumeerbhola wrote…

physical ssts also have a BackingState. I assume this computation is correct because requiredBackingFiles (which will include backing for physical ssts) is only used for subtracting from virtualBackingFiles and the latter only contains the ones backing virtual ssts (and because a backing sst cannot be backing both a physical and a virtual sst in a version). Is my understanding correct? If yes, this needs a code comment. Or better yet, lets scope requiredBackingFiles only to those needed for virtual ssts.

Made requiredBackingFiles only track virtual virtual ssts.


table_stats.go line 477 at r17 (raw file):

Previously, sumeerbhola wrote…

I think it would be reasonable to do the simple thing and just treat these the same as physical files with narrower bounds would be treated.

+1

Done.


version_set.go line 520 at r9 (raw file):

Previously, sumeerbhola wrote…

Current solution is good

Done.


version_set.go line 778 at r18 (raw file):

Previously, sumeerbhola wrote…

"If virtual sstables are passed into ..." seems stale since the parameter is now []*backingState.

Done.


internal/manifest/version.go line 325 at r18 (raw file):

Previously, RaduBerinde wrote…

[nit] "state" doesn't feel right, maybe FileBacking ? "FileBacking contains the metadata associated with a physical file; within a version, a FileBacking either backs a single physical sstable, or one or more virtual sstables."

Keeping this as is for now.


internal/manifest/version.go line 333 at r18 (raw file):

Previously, RaduBerinde wrote…

I think we can switch these over to atomic.Int32 (added in go 1.19)

Done.


internal/manifest/version.go line 399 at r18 (raw file):

Previously, RaduBerinde wrote…

[nit] Refs() is very close toRef(), maybe NumRefs(). Or maybe HasRefs() since we only care if it's 0 or not.

Going to keep this as is since we use the same naming scheme for Version.Refs/Ref.


internal/manifest/version.go line 409 at r18 (raw file):

Previously, RaduBerinde wrote…

[nit] maybe return (hasRemainingRefs bool) instead of the count (same for the Latest version)

Going to keep this as is.


internal/manifest/version_edit.go line 739 at r18 (raw file):

Previously, sumeerbhola wrote…

Will b.RemovedBackingState ever be non-empty, given that ve.RemovedBackingTables must be empty?
I think we should check the latter condition and return an error.

Done.


version_set_test.go line 36 at r9 (raw file):

Previously, sumeerbhola wrote…

We should consider adding a specialized randomized test that creates and deletes virtual ssts, and relies on strong invariant checking to uncover bugs in our manifest handling code.

I believe this will be covered by metamorphic tests once Pebble is writing virtual sstables y default. We can be more precise about invariants regarding some of the data structures in the manifest code, but metamorphic tests should also implicitly check those invariants.

Either way, I'd like to merge such a randomized test in a separate pr if necessary.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 17 files at r1, 25 of 26 files at r19, all commit messages.
Reviewable status: 33 of 34 files reviewed, 57 unresolved discussions (waiting on @bananabrick, @itsbilal, @jbowens, and @RaduBerinde)


internal/manifest/version.go line 325 at r18 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Keeping this as is for now.

I like FileBacking, since it is more specific than BackingState


version_set_test.go line 36 at r9 (raw file):

Either way, I'd like to merge such a randomized test in a separate pr if necessary.

That is fine. But let's see if there is a way to have more focused randomized testing of virtual ssts -- the metamorphic test is great but sometimes can take a long time to find bugs.

Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: save for Sumeer's comments

Reviewed 1 of 26 files at r19, all commit messages.
Reviewable status: all files reviewed, 57 unresolved discussions (waiting on @bananabrick, @jbowens, @RaduBerinde, and @sumeerbhola)


internal/manifest/version.go line 379 at r19 (raw file):

// ValidateVirtual should be called once the FileMetadata for a virtual sstable
// is created to verify that the fields of the virtual sstable are sound.
func (m *FileMetadata) ValidateVirtual(createdFrom *FileMetadata) {

I imagine the write path will call into this and that's why it's exported?

Copy link
Contributor Author

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 33 of 34 files reviewed, 56 unresolved discussions (waiting on @itsbilal, @jbowens, @RaduBerinde, and @sumeerbhola)


internal/manifest/version.go line 379 at r19 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I imagine the write path will call into this and that's why it's exported?

Yes.


version_set_test.go line 36 at r9 (raw file):

Previously, sumeerbhola wrote…

Either way, I'd like to merge such a randomized test in a separate pr if necessary.

That is fine. But let's see if there is a way to have more focused randomized testing of virtual ssts -- the metamorphic test is great but sometimes can take a long time to find bugs.

Sounds good.

This pr introduces the FileMetadata changes which will
be required to make further virtual sstable changes.

This pr doesn't deal with persisting of the FileMetadata
to disk through the version edits.

RFC: https://github.com/cockroachdb/pebble/blob/master/docs/RFCS/20221122_virtual_sstable.md
@bananabrick bananabrick merged commit 2e9d3f5 into cockroachdb:master Mar 28, 2023
@RaduBerinde
Copy link
Member

🎉 💯

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants