-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added block size exception #12
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -390,9 +390,21 @@ public Cursor(ParquetLoader parent, Func<int, bool> predicate, IRandom rand) | |
Columns = _loader._columnsLoaded.Select(i => i.Name).ToArray() | ||
}; | ||
|
||
int numBlocks = (int)Math.Ceiling(((decimal)parent.GetRowCount() / _readerOptions.Count)); | ||
int[] blockOrder = _rand == null ? Utils.GetIdentityPermutation(numBlocks) : Utils.GetRandomPermutation(rand, numBlocks); | ||
_blockEnumerator = blockOrder.GetEnumerator(); | ||
try | ||
{ | ||
int numBlocks = checked((int)Math.Ceiling(((decimal)parent.GetRowCount() / _readerOptions.Count))); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Why not default to an AUTO block size? Can we figure out the right block size ourselves, so that it is not too big to fit in memory and not too small too create too many blocks? Also, if there are two many blocks, can we allocate only a few at a time? In the function below, please just instead create an random Ienumerable of blockorder without having to materialize the actual array. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The main reason to put the total number of blocks into an array is to be able to shuffle the blocks (in addition to shuffling the rows in a block.) In order to achieve this, we not only need both the {blocks} and {rows in blocks} to be < int.MaxValue, but also < the number of elements that can be allocated in an int array. In my own experiments this number turns out to be ~300M. I've set the default block size to be 1M rows. This will cover parquet files that contain 300 trillion rows, including reading and shuffling effectively without throwing the exception, and the user can adjust the blocks size if that's necessary. |
||
int[] blockOrder = _rand == null ? Utils.GetIdentityPermutation(numBlocks) : Utils.GetRandomPermutation(rand, numBlocks); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Make |
||
_blockEnumerator = blockOrder.GetEnumerator(); | ||
} | ||
catch (Exception e) | ||
{ | ||
if (e is OutOfMemoryException || e is OverflowException) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I would treat them as different exceptions. Don't they have exactly oppoisite causes? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe I'd say one is a refinement of the other. The size can be so large that either (1) it can't fit in an int or (2) it does fit in an int but that's moot since we can't allocate an That said I still agree with your suggestion of separating the two out... we know exactly which line would throw which exception, and I'd prefer the slightly move verbose two try statements because we have that expectation, since it clarifies to the reader where we expect the two distinct failure cases of the code above to happen, possibly. Also FYI, just a quick style note for the future: even if we let this code stand as is, we could convert this to In reply to: 186318417 [](ancestors = 186318417) |
||
{ | ||
throw new InvalidDataException("Error due to too many blocks. Try increasing block size.", e); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
use _host.Except when throwing exceptions within our componints There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To reiterate this: this sort of exception marking is how we distinguish exceptions thrown by .NET or underlying exceptions itself, or exceptions thrown by this library. Marked exceptions are intended to be, in some sense, actionable by a user, so we treat them differently from "unmarked" exceptions, so we consider it important to distinguish them. (In some contexts they're treated differently -- it's useful to know when a problem is due to misuse vs. us just having a bug) Plus of course you get marked with the throwing component; this is useful information. In reply to: 186318498 [](ancestors = 186318498) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Duly noted. This should definitely be a marked exception. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Given that we're interpreting the exception pretty exactly, I'm not certain that there is much utility in including that inner exception. My reason is: usually in this API and many APIs I've seen, innermost exceptions are usually the "real" problem and practically the only thing people wind up looking at. :) So it would be nice if your nice friendly error message you have written was that innermost exception, and I think you have a very good excuse to make it the innermost exception. :) |
||
} | ||
|
||
throw; | ||
} | ||
|
||
_dataSetEnumerator = new int[0].GetEnumerator(); // Initialize an empty enumerator to get started | ||
_columnValues = new IList[_actives.Length]; | ||
|
@@ -477,7 +489,7 @@ protected override bool MoveNextCore() | |
} | ||
else if (_blockEnumerator.MoveNext()) | ||
{ | ||
_readerOptions.Offset = (int)_blockEnumerator.Current * _readerOptions.Count; | ||
_readerOptions.Offset = (long)_blockEnumerator.Current * _readerOptions.Count; | ||
|
||
// When current dataset runs out, read the next portion of the parquet file. | ||
DataSet ds; | ||
|
@@ -486,9 +498,21 @@ protected override bool MoveNextCore() | |
ds = ParquetReader.Read(_loader._parquetStream, _loader._parquetOptions, _readerOptions); | ||
} | ||
|
||
int[] dataSetOrder = _rand == null ? Utils.GetIdentityPermutation(ds.RowCount) : Utils.GetRandomPermutation(_rand, ds.RowCount); | ||
_dataSetEnumerator = dataSetOrder.GetEnumerator(); | ||
_curDataSetRow = dataSetOrder[0]; | ||
try | ||
{ | ||
int[] dataSetOrder = _rand == null ? Utils.GetIdentityPermutation(ds.RowCount) : Utils.GetRandomPermutation(_rand, ds.RowCount); | ||
_dataSetEnumerator = dataSetOrder.GetEnumerator(); | ||
_curDataSetRow = dataSetOrder[0]; | ||
} | ||
catch (Exception e) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you can write this more simply as
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
why not just |
||
{ | ||
if (e is OutOfMemoryException) | ||
{ | ||
throw new InvalidDataException("Error caused because block size too big. Try decreasing block size.", e); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit "is too big" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed, thanks. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Just an FYI: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also as @glebuk says, might be best to use |
||
} | ||
|
||
throw; | ||
} | ||
|
||
// Cache list for each active column | ||
for (int i = 0; i < _actives.Length; i++) | ||
|
@@ -671,4 +695,4 @@ private string ConvertListToString(IList list) | |
} | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just allocate numBlocks as long or decimal? Then just see if it is >some reasonable value such as int.Maxvalue/1024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elsewhere, our checks check whether something would result in
Utils.ArrayMaxLength
. Of course those checks also assume that the user is running withgcAllowVeryLargeObjects
in the app config, which may not be the case in API usage... :PIn reply to: 186318365 [](ancestors = 186318365)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to another comment, we wanted the {block size} and {number of blocks} to both be under 300M to fit in an int array so that it can be shuffled. I changed the default block size to be 1M, but the user could change that value, and if either exceeds the upper limit the exception will be thrown :P