Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added block size exception #12

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 32 additions & 8 deletions src/Microsoft.ML.Parquet/ParquetLoader.cs
Original file line number Diff line number Diff line change
Expand Up @@ -390,9 +390,21 @@ public Cursor(ParquetLoader parent, Func<int, bool> predicate, IRandom rand)
Columns = _loader._columnsLoaded.Select(i => i.Name).ToArray()
};

int numBlocks = (int)Math.Ceiling(((decimal)parent.GetRowCount() / _readerOptions.Count));
int[] blockOrder = _rand == null ? Utils.GetIdentityPermutation(numBlocks) : Utils.GetRandomPermutation(rand, numBlocks);
_blockEnumerator = blockOrder.GetEnumerator();
try
{
int numBlocks = checked((int)Math.Ceiling(((decimal)parent.GetRowCount() / _readerOptions.Count)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int [](start = 20, length = 3)

why not just allocate numBlocks as long or decimal? Then just see if it is >some reasonable value such as int.Maxvalue/1024

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Elsewhere, our checks check whether something would result in Utils.ArrayMaxLength. Of course those checks also assume that the user is running with gcAllowVeryLargeObjects in the app config, which may not be the case in API usage... :P


In reply to: 186318365 [](ancestors = 186318365)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to another comment, we wanted the {block size} and {number of blocks} to both be under 300M to fit in an int array so that it can be shuffled. I changed the default block size to be 1M, but the user could change that value, and if either exceeds the upper limit the exception will be thrown :P

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_readerOptions [](start = 95, length = 14)

Why not default to an AUTO block size? Can we figure out the right block size ourselves, so that it is not too big to fit in memory and not too small too create too many blocks? Also, if there are two many blocks, can we allocate only a few at a time? In the function below, please just instead create an random Ienumerable of blockorder without having to materialize the actual array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason to put the total number of blocks into an array is to be able to shuffle the blocks (in addition to shuffling the rows in a block.) In order to achieve this, we not only need both the {blocks} and {rows in blocks} to be < int.MaxValue, but also < the number of elements that can be allocated in an int array. In my own experiments this number turns out to be ~300M.

I've set the default block size to be 1M rows. This will cover parquet files that contain 300 trillion rows, including reading and shuffling effectively without throwing the exception, and the user can adjust the blocks size if that's necessary.

int[] blockOrder = _rand == null ? Utils.GetIdentityPermutation(numBlocks) : Utils.GetRandomPermutation(rand, numBlocks);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Utils.GetIdentityPermutation(numBlocks) [](start = 55, length = 39)

Make blockOrder an IEnumerable<int>, and replace this specifiy "IdentityPermutation" with Enumerable.Range(0, numBlocks). This avoids materializing the trivial case. I don't see a way to avoid materializing the less trivial case, I'm sorry to say. :(

_blockEnumerator = blockOrder.GetEnumerator();
}
catch (Exception e)
{
if (e is OutOfMemoryException || e is OverflowException)
Copy link
Contributor

@glebuk glebuk May 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OutOfMemoryException [](start = 29, length = 20)

I would treat them as different exceptions. Don't they have exactly oppoisite causes?

Copy link
Contributor

@TomFinley TomFinley May 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'd say one is a refinement of the other. The size can be so large that either (1) it can't fit in an int or (2) it does fit in an int but that's moot since we can't allocate an int[] array that large in .NET anyway. So I'd like the error message to be the same, that is, I'm not sure they're opposites.

That said I still agree with your suggestion of separating the two out... we know exactly which line would throw which exception, and I'd prefer the slightly move verbose two try statements because we have that expectation, since it clarifies to the reader where we expect the two distinct failure cases of the code above to happen, possibly.

Also FYI, just a quick style note for the future: even if we let this code stand as is, we could convert this to catch (Exception e) when (e is OutOfMemoryException || e is OverflowException). This simplifies things enormously. (This very useful feature was added in, I think, C# 6 but I wouldn't stake my life on it.)


In reply to: 186318417 [](ancestors = 186318417)

{
throw new InvalidDataException("Error due to too many blocks. Try increasing block size.", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw [](start = 24, length = 5)

use _host.Except when throwing exceptions within our componints

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reiterate this: this sort of exception marking is how we distinguish exceptions thrown by .NET or underlying exceptions itself, or exceptions thrown by this library. Marked exceptions are intended to be, in some sense, actionable by a user, so we treat them differently from "unmarked" exceptions, so we consider it important to distinguish them. (In some contexts they're treated differently -- it's useful to know when a problem is due to misuse vs. us just having a bug) Plus of course you get marked with the throwing component; this is useful information.


In reply to: 186318498 [](ancestors = 186318498)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duly noted. This should definitely be a marked exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, e [](start = 113, length = 3)

Given that we're interpreting the exception pretty exactly, I'm not certain that there is much utility in including that inner exception. My reason is: usually in this API and many APIs I've seen, innermost exceptions are usually the "real" problem and practically the only thing people wind up looking at. :) So it would be nice if your nice friendly error message you have written was that innermost exception, and I think you have a very good excuse to make it the innermost exception. :)

}

throw;
}

_dataSetEnumerator = new int[0].GetEnumerator(); // Initialize an empty enumerator to get started
_columnValues = new IList[_actives.Length];
Expand Down Expand Up @@ -477,7 +489,7 @@ protected override bool MoveNextCore()
}
else if (_blockEnumerator.MoveNext())
{
_readerOptions.Offset = (int)_blockEnumerator.Current * _readerOptions.Count;
_readerOptions.Offset = (long)_blockEnumerator.Current * _readerOptions.Count;

// When current dataset runs out, read the next portion of the parquet file.
DataSet ds;
Expand All @@ -486,9 +498,21 @@ protected override bool MoveNextCore()
ds = ParquetReader.Read(_loader._parquetStream, _loader._parquetOptions, _readerOptions);
}

int[] dataSetOrder = _rand == null ? Utils.GetIdentityPermutation(ds.RowCount) : Utils.GetRandomPermutation(_rand, ds.RowCount);
_dataSetEnumerator = dataSetOrder.GetEnumerator();
_curDataSetRow = dataSetOrder[0];
try
{
int[] dataSetOrder = _rand == null ? Utils.GetIdentityPermutation(ds.RowCount) : Utils.GetRandomPermutation(_rand, ds.RowCount);
_dataSetEnumerator = dataSetOrder.GetEnumerator();
_curDataSetRow = dataSetOrder[0];
}
catch (Exception e)
Copy link
Member

@danmoseley danmoseley May 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can write this more simply as

catch (Exception e) when (e Is OutOfMemoryException)
{
      throw new ... 
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception [](start = 27, length = 9)

why not just
catch (InvalidDataException e)?

{
if (e is OutOfMemoryException)
{
throw new InvalidDataException("Error caused because block size too big. Try decreasing block size.", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit "is too big"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InvalidDataException [](start = 38, length = 20)

Just an FYI: InvalidDataException's 'documentation says "The exception that is thrown when a data stream is in an invalid format." This does not describe the scenario here: your data is not in an invalid format, rather, something is wrong with our implementation to interpret that format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also as @glebuk says, might be best to use _host.Except here... or _host.ExceptParam if you can isolate it to a specific user-provided parameter that was somehow "wrong."

}

throw;
}

// Cache list for each active column
for (int i = 0; i < _actives.Length; i++)
Expand Down Expand Up @@ -671,4 +695,4 @@ private string ConvertListToString(IList list)
}
}
}
}
}