-
Notifications
You must be signed in to change notification settings - Fork 21
Rewrite batch mode to use temp directory (#136) #142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
* Rewrite batch mode to use temp directory Write batch data to temp directory instead of memory to avoid high RAM usage with large files. Each batch creates its own temp directory, writes blocks directly to disk, and renames all files atomically on commit. * Use slice instead of map for batch puts * Add DiscardableBatch interface with Discard method * add read operations for batch iface * fix: remove unused function and variable to fix CI checks * Removed unused tempFileOnce() method * Replaced unused 'done' variable with '_' in Commit method * remove some local files * perf: optimize batch read operations to skip filesystem checks Only check temp directory for keys that exist in puts slice * feat: implement batch Query with temp directory merging * make batch Put operations async Address PR review feedback * fix Batch init * Update flatfs.go
|
Triage: Open PR in kubo and test on kubo staging. |
Upgrade go-ds-flatfs to version that uses uses temproary files to store items added to batches. See: ipfs/go-ds-flatfs#142
- add godoc for maxConcurrentPuts explaining why 16 was chosen - fix incorrect O(n) claims for Get/Has/GetSize (actually O(1) via putSet map)
lidel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm, as long we address two things:
- Updated godoc (bd468ee) to reflect the latest version, but unsure about one thing – see inline.
- IIUC two concurrent goroutines calling
os.Create()andWrite()on the same file can error or corrupt data – see inline.
Co-authored-by: Marcin Rataj <lidel@lidel.org>
flatfs.go
Outdated
| return | ||
| } | ||
|
|
||
| file, err := os.Create(tempFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit worried about this on Windows: i suspect concurrent async writes (multiple ipfs add) may cause file handle issues, things like golang/go#34681 – flatfs uses https://github.com/alexbrainman/goissue34681 for a reason, we likely need similar wrapper on windows here (old code used tempFileOnce from util_windows.go).
@gammazero too late on my end to write code, but we probably need to create createFile equivalent that does something similar (pseudocode):
util_unix.go:
func createFile(name string) (*os.File, error) {
return os.Create(name)
}util_windows.go:
func createFile(name string) (*os.File, error) {
return goissue34681.OpenFile(name, os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0666)
}Then finally here:
file, err := createFile(tempFile)Windows concurrency is janky, unsure if we need a retry wrapper, maybe can go without it, but in case, it could be something like:
util_windows.go:
func createFile(name string) (*os.File, error) {
var file *os.File
var err error
for i := 0; i < RetryAttempts; i++ {
file, err = createFileOnce(name)
if err == nil || !isTooManyFDError(err) {
break
}
time.Sleep(time.Duration(i+1) * RetryDelay)
}
return file, err
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, without retry wrapper.
|
Triage:
|
Continued from #136
Credit to @requilence for most of this PR
Problem
When uploading large files through UnixFS or other IPLD structures, the current batch implementation keeps everything in memory and writes blocks individually. If the application crashes mid-process, partial blocks remain in FlatFS, making garbage collection extremely difficult since there's no way to identify which blocks belong to incomplete operations.
Solution
This PR rewrites the batch mode to use a temporary directory approach:
BatchReaderinterface implementing Get/Has/GetSize methods. This is essential for IPLD structures which need to verify block links during construction. Without batch reads, you cannot build complex IPLD structures that reference blocks added earlier in the same batch. This follows standard database transaction semantics where reads within a transaction see uncommitted writes.Implementation Details