br: fix lightning split large csv file error and adjust s3 seek result #27769
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review. The full list of commands accepted by this bot can be found here. Reviewers can indicate their review by submitting an approval review.
br/pkg/storage/s3.go
Outdated
}

r.reader = eofReader{}
return r.rangeInfo.Size, nil
r.pos needs to be updated for subsequent seeks. Also, should we return realOffset rather than r.rangeInfo.Size? I didn't find any specification for this behavior, but returning realOffset is more consistent with the Linux filesystem.
Fixed. I think we should return the real position, which is r.rangeInfo.Size, since the behavior isn't the same as a Linux fs: the fs implements ReaderWriter and allows writing to a position larger than the current end, leaving the file with a hole. But here we don't. Though there is not much difference.
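For readers following along, here is a minimal, self-contained Go sketch of the seek-past-end behavior settled on in this thread. The s3ObjectReader below is a simplified stand-in for the one in br/pkg/storage/s3.go (size plays the role of rangeInfo.Size, and reopening the ranged GET is elided), so treat it as an illustration rather than the actual implementation.

package main

import (
	"fmt"
	"io"
)

// eofReader mirrors the helper added in this PR: a ReadCloser whose Read
// always reports io.EOF and whose Close does nothing.
type eofReader struct{}

func (eofReader) Read([]byte) (int, error) { return 0, io.EOF }
func (eofReader) Close() error             { return nil }

// s3ObjectReader is a simplified stand-in for the real struct; size plays the
// role of rangeInfo.Size and the ranged GET request is omitted.
type s3ObjectReader struct {
	reader io.ReadCloser
	pos    int64
	size   int64
}

func (r *s3ObjectReader) Seek(offset int64, whence int) (int64, error) {
	var realOffset int64
	switch whence {
	case io.SeekStart:
		realOffset = offset
	case io.SeekCurrent:
		realOffset = r.pos + offset
	case io.SeekEnd:
		realOffset = r.size + offset
	default:
		return 0, fmt.Errorf("invalid whence %d", whence)
	}
	if realOffset >= r.size {
		// Keep pos in sync so later relative seeks compute the right offset,
		// and install a reader that immediately reports io.EOF. The returned
		// offset is clamped to the object size because, unlike a local file,
		// the object cannot be extended to leave a hole.
		r.pos = r.size
		r.reader = eofReader{}
		return r.size, nil
	}
	r.pos = realOffset
	// A real implementation would reopen the ranged read at realOffset here.
	return realOffset, nil
}

func main() {
	r := &s3ObjectReader{size: 100}
	n, _ := r.Seek(10, io.SeekEnd)
	fmt.Println(n) // 100: clamped to the object size
	_, err := r.reader.Read(make([]byte, 1))
	fmt.Println(err) // EOF
}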
br/pkg/storage/s3.go
Outdated
@@ -648,6 +648,17 @@ func (r *s3ObjectReader) Close() error {
	return r.reader.Close()
}

// eofReader is a io.ReaderClose Reader that always return io.EOF
type eofReader struct{}
(optional) io.NopCloser(bytes.NewReader(nil))
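A tiny sketch (not part of the PR) showing that this optional one-liner behaves the same way as a dedicated eofReader type: every Read reports io.EOF and Close is a no-op.

package main

import (
	"bytes"
	"fmt"
	"io"
)

func main() {
	// The optional alternative from the review: an empty bytes.Reader wrapped
	// in io.NopCloser always returns io.EOF from Read and nil from Close.
	var r io.ReadCloser = io.NopCloser(bytes.NewReader(nil))

	n, err := r.Read(make([]byte, 8))
	fmt.Println(n, err)    // 0 EOF
	fmt.Println(r.Close()) // <nil>
}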
br/pkg/lightning/mydump/region.go
Outdated
@@ -268,7 +268,7 @@ func makeSourceFileRegion(
 	}
 	// If a csv file is overlarge, we need to split it into multiple regions.
 	// Note: We can only split a csv file whose format is strict.
-	if isCsvFile && dataFileSize > int64(cfg.Mydumper.MaxRegionSize) && cfg.Mydumper.StrictFormat {
+	if isCsvFile && cfg.Mydumper.StrictFormat && dataFileSize > int64(cfg.Mydumper.MaxRegionSize)*11/10 {
What happens if the file size is slightly bigger than int64(cfg.Mydumper.MaxRegionSize)*11/10? 🤣
Then the file will be split by cfg.Mydumper.MaxRegionSize, so the second chunk size is about 1/10 * cfg.Mydumper.MaxRegionSize.
If the file size is slightly bigger than int64(cfg.Mydumper.MaxRegionSize)*2, the third chunk size is very small; will this be a problem?
Not a big problem. The common case is that the data export tool (like dumpling or mydumper) sets the exported file size to cfg.Mydumper.MaxRegionSize, but the output file size might be slightly bigger or smaller, so this way we avoid splitting off a lot of small chunks.
make that 11/10 a named constant...
It would make the code a bit ugly because 11/10 is a float. I added a code comment to explain why the threshold needs to be increased 😅
you can make "10" a constant and set the upper limit to MaxRegionSize + MaxRegionSize/10.
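To make the arithmetic in this thread concrete, here is a small, illustrative Go sketch of the threshold written the way the reviewer suggests. The names needSplit and regionSizeSlackDivisor and the 256 MiB region size are assumptions for the example, not code from the PR.

package main

import "fmt"

// regionSizeSlackDivisor names the 10% slack discussed above, so the threshold
// can be written as maxRegionSize + maxRegionSize/regionSizeSlackDivisor
// instead of maxRegionSize*11/10.
const regionSizeSlackDivisor = 10

// needSplit is an illustrative stand-in for the check in
// br/pkg/lightning/mydump/region.go; isCSV and strictFormat mirror the
// isCsvFile and cfg.Mydumper.StrictFormat conditions.
func needSplit(isCSV, strictFormat bool, dataFileSize, maxRegionSize int64) bool {
	threshold := maxRegionSize + maxRegionSize/regionSizeSlackDivisor
	return isCSV && strictFormat && dataFileSize > threshold
}

func main() {
	const maxRegionSize int64 = 256 << 20 // assume a 256 MiB region size for the example

	// A dump file only slightly larger than maxRegionSize (as dumpling or
	// mydumper tend to produce) stays within the 10% slack and is not split,
	// which avoids creating a tiny trailing chunk.
	fmt.Println(needSplit(true, true, maxRegionSize+(1<<20), maxRegionSize)) // false

	// A file well past the slack is still split by maxRegionSize as before.
	fmt.Println(needSplit(true, true, 2*maxRegionSize, maxRegionSize)) // true
}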
/run-check_dev_2
/run-check_dev_2
/merge
This pull request has been accepted and is ready to merge. Commit hash: 69a37a3
/run-check_dev_2
/remove-label needs-cherry-pick-5.2
/label needs-cherry-pick-5.2
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
cherry pick to release-5.2 in PR #33883
/remove-label needs-cherry-pick-5.2
What problem does this PR solve?
Fix the bug that lightning's split of a large csv file may fail if the file size is slightly bigger than region-split-size and the csv has a header.

close #27763
What is changed and how it works?
Make the s3 storage reader return io.EOF when reading after a seek to or past the end of the object.
In makeTableRegions, after preprocessing the header line, if the remaining file is smaller than region-split-size, directly return the result with one region.
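A hypothetical sketch of this early-return, under assumed names (splitCSV, headerSize, regionSplitSize) rather than the real makeTableRegions code: once the header line has been accounted for, a file whose remaining bytes fit in one region is returned as a single region instead of going through the splitter.

package main

import "fmt"

// region is a simplified chunk descriptor for this sketch.
type region struct {
	offset, size int64
}

// splitCSV returns the regions covering the data portion of a csv file.
func splitCSV(fileSize, headerSize, regionSplitSize int64) []region {
	remaining := fileSize - headerSize
	if remaining <= regionSplitSize {
		// The whole data portion fits in one region: return it directly.
		return []region{{offset: headerSize, size: remaining}}
	}
	// Otherwise split the remaining bytes into regionSplitSize-sized chunks.
	var regions []region
	for off := headerSize; off < fileSize; off += regionSplitSize {
		size := regionSplitSize
		if off+size > fileSize {
			size = fileSize - off
		}
		regions = append(regions, region{offset: off, size: size})
	}
	return regions
}

func main() {
	// With a 256 MiB region-split-size: a 260 MiB file minus its 1 KiB header
	// is still over the limit, so it is split into two regions.
	fmt.Println(len(splitCSV(260<<20, 1<<10, 256<<20))) // 2
	// A file just over 256 MiB whose header pushes the remainder under the
	// limit is kept as a single region.
	fmt.Println(len(splitCSV(256<<20+512, 1<<10, 256<<20))) // 1
}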
Check List
Tests
Side effects
Documentation
Release note