
Support for uint256/int256 #15443

Open · elyase opened this issue Apr 2, 2024 · 28 comments

Labels: A-dtype (Area: data types in general), enhancement (New feature or an improvement of an existing feature)


elyase commented Apr 2, 2024

Description

The most commonly used data types for smart contracts and token math on the blockchain are uint256 and int256. Currently, people are resorting to inefficient methods like converting to float or string because there is no native support for these data types. For example, paradigmxyz/cryo exports duplicate data columns in various formats (float, binary, string) due to the absence of native support.
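
As a quick illustration (not from the original report), the float workaround silently loses precision once values exceed 2**53:

max_u256 = 2**256 - 1
print(float(max_u256))                    # ~1.158e+77, representable but rounded
print(int(float(max_u256)) == max_u256)   # False: float64 carries only ~53 bits of mantissa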

elyase added the enhancement label on Apr 2, 2024

banteg commented Apr 2, 2024

willing to add a bounty of $500 to the implementer of this feature. could be more when the scope of work is more clear. i guess we'll need support in arrow first.

found this library that implements arbitrary-sized uints: https://github.com/recmo/uint

stinodego added the A-dtype label on Apr 2, 2024
@stinodego (Contributor)

I think we want to support a Decimal backed by an Int256 in the future; this is also available in Arrow.

Not sure about a true 256-bit integer. We should probably first debate adding Int128/UInt128, with the 256-bit version as a potential next step. I don't think this will happen any time soon. Just my 2 cents though.


l1t1 commented Apr 2, 2024

Too many things would be affected by adding new types. Ref duckdb/duckdb#8635 (comment)


banteg commented Apr 3, 2024

unfortunately decimal256 in arrow doesn't support the full range of uint256 values, it cuts off at 76 digits, which is approximately 252.4653 bits.

import pyarrow as pa

pa.array([2**256-1], type=pa.decimal256(76, 0))
# ArrowInvalid: Decimal type with precision 78 does not fit into precision inferred from first array element: 76

pa.array([2**256-1], type=pa.decimal256(78, 0))
# ValueError: precision should be between 1 and 76


0xvanbeethoven commented Apr 3, 2024

Willing to add $500 to the bounty @banteg proposed.


gakonst commented Apr 3, 2024

@paradigmxyz (maintainers of Cryo) will chip in an additional $1K, incl. the Arrow integration in scope.

orlp (Collaborator) commented Apr 3, 2024

What sort of operations would one expect to be able to do on this pl.UInt256/pl.Int256, and for what purposes (preferably with concrete examples)?

For the record, this issue is not yet accepted; we still need to discuss whether we want this in Polars. Adding types is a lot of work, slows down future feature development, and increases the binary size of Polars by quite a bit. So implementers beware: even though third parties are offering PR bounties, that does not mean we will merge said PRs, and you might not get paid if we end up deciding this is not in scope for Polars, or that the drawbacks outweigh the positives.


sslivkoff commented Apr 3, 2024

> What sort of operations would one expect to be able to do on this pl.UInt256/pl.Int256, and for what purposes (preferably with concrete examples)?
>
> For the record, this issue is not yet accepted; we still need to discuss whether we want this in Polars. Adding types is a lot of work, slows down future feature development, and increases the binary size of Polars by quite a bit. So implementers beware: even though third parties are offering PR bounties, that does not mean we will merge said PRs, and you might not get paid if we end up deciding this is not in scope for Polars, or that the drawbacks outweigh the positives.

hi

in terms of raw operations I think the most important functionality would be

  • arithmetic ops + - * / % ^
  • comparison == >= > < <=
  • min/max/mean/value_counts
  • cum_sum / diff

these operations would be most commonly used in these contexts

  • pl.col() expressions in select() and with_columns()
  • inside .group_by(non_int_column).agg(int_column)
  • bare Series

concrete example: let's say you have a large dataframe of transactions (>100M rows). each row has a u256 column of tx price and a string/binary column of account id. you want to aggregate the total spend per account. right now the most common approach is to convert the tx price to f64 so that group_by(id).agg(pl.sum(price)) can be used. using f64 sacrifices precision because the data is natively u256 and often uses the full range
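
a minimal sketch of that workaround (illustrative numbers, assuming current polars APIs):

import polars as pl

# hypothetical data: prices are natively u256 on-chain but stored as Float64 here
df = pl.DataFrame({
    "account": ["a", "b", "a"],
    "price": [float(2**200), float(10**18), float(3 * 10**18)],
})
print(df.group_by("account").agg(pl.col("price").sum()))
# Float64 has ~53 bits of mantissa, so values near the top of the u256 range are
# already rounded before the aggregation even starts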


gakonst commented Apr 3, 2024

Totally, @orlp. Definitely no expectation from our side, and we appreciate the clear communication. Makes sense, and we understand the nuances w.r.t. maintenance, as we follow a similar philosophy. Thank you for the swift reply.

@yenicelik

happy to chip in another $100 to this bounty

@IlluvatarEru

This is a blocker for a lot of people and orgs; it would be great to see it prioritised. What can we do to see it done faster?

@mahmudsudo

I would love to take on this bounty.

Proposed workflow:

use uint::Uint;

pub struct UInt256Column {
    // Values stored with recmo's uint crate (256 bits over four 64-bit limbs).
    values: Vec<Uint<256, 4>>,
    // Validity bitmap tracks null values in the column:
    // None means all values are valid (an optimization),
    // Some(bitmap) records which values are valid/null.
    validity: Option<NullBitmap>,
    name: String,
}

impl Column for UInt256Column {
    fn len(&self) -> usize {
        self.values.len()
    }

    fn dtype(&self) -> &DataType {
        &DataType::UInt256
    }

    fn name(&self) -> &str {
        &self.name
    }

    fn validity(&self) -> Option<&NullBitmap> {
        self.validity.as_ref()
    }

    fn set_validity(&mut self, validity: Option<NullBitmap>) {
        self.validity = validity;
    }
}

Alternative:

struct UInt256ChunkedArray {
    high: PrimitiveChunkedArray<UInt128Type>,
    low: PrimitiveChunkedArray<UInt128Type>,
}

This would split the 256-bit number into two 128-bit parts, leveraging existing primitives.
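
For intuition only, a pure-Python sketch of that split (the actual implementation would of course live in Rust):

MASK128 = (1 << 128) - 1

def split_u256(x: int) -> tuple[int, int]:
    return x >> 128, x & MASK128   # (high, low)

def join_u256(high: int, low: int) -> int:
    return (high << 128) | low

v = 2**256 - 1
assert join_u256(*split_u256(v)) == v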

Other alternatives to recmo/uint include https://crates.io/crates/primitive-types

ritchie46 (Member) commented Dec 6, 2024

I still fail to understand why this has to be 256 bits. That's an astronomically large number. Why would you need prices of this magnitude?

If you want to store a hash of 256 bits, we can offer exposing a fixed-size binary type (similar to the Rust array [u8; 32]); this would be just as performant and much more general.

Note that we don't work with bounties. We have to maintain it, so please don't make a PR without consulting with us first.
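
A rough sketch of what that could look like today with the existing variable-width Binary dtype (Polars does not currently expose a dedicated fixed-size binary dtype in Python, so this is an approximation of the suggestion):

import polars as pl

# 32-byte big-endian encoding: byte-wise order matches unsigned numeric order,
# so sorting and equality work without decoding the values.
values = [2**256 - 1, 10**18, 0]
df = pl.DataFrame({"u256": [v.to_bytes(32, "big") for v in values]})
print(df.sort("u256"))
print(df.filter(pl.col("u256") == (10**18).to_bytes(32, "big")))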


jmakov commented Dec 6, 2024

> I still fail to understand why this has to be 256 bits.

Not sure what you mean. There's a whole industry that has this problem and is using Pandas instead only because of this.

@ritchie46 (Member)

> There's a whole industry that has this problem and is using Pandas instead only because of this.

Can you explain the problem to me? What data are you storing? Why does that data require 256 bits, and what does this data represent?

@mahmudsudo

> I still fail to understand why this has to be 256 bits. That's an astronomically large number. Why would you need prices of this magnitude?
>
> If you want to store a hash of 256 bits, we can offer exposing a fixed-size binary type (similar to the Rust array [u8; 32]); this would be just as performant and much more general.
>
> Note that we don't work with bounties. We have to maintain it, so please don't make a PR without consulting with us first.

Thanks for the corrections. I would adjust the implementation accordingly; aside from this correction, what other parts do you want changed?

@MarcoGorelli (Collaborator)

> Not sure what you mean. There's a whole industry that has this problem and is using Pandas instead only because of this.

pandas doesn't support int256 either, could you please clarify how?


jmakov commented Dec 6, 2024

> pandas doesn't support int256 either, could you please clarify how?

Pandas parses to Python's int, which is IIRC an arbitrary-size int type. So in Pandas the column dtype would be (Python) object.
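
For example (illustrative only, not from the original comment):

import pandas as pd

# Python ints are arbitrary precision, so an object-dtype column keeps u256 values exact,
# but every operation falls back to Python-level loops instead of vectorized kernels.
s = pd.Series([2**256 - 1, 10**18, 1], dtype=object)
print(s.sum())   # exact, no overflow
print(s.dtype)   # object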

@ritchie46 (Member)

> Pandas parses to Python's int, which is IIRC an arbitrary-size int type. So in Pandas the column dtype would be (Python) object.

Which means they gave up. You can do the same in Polars, but I would advise against it. But again, if I can understand the use case, I can see if we can come up with a DataType we both believe in.


banteg commented Dec 6, 2024

it's a common roadblock for indexing EVM data. this virtual machine uses an unconventionally large 256-bit word size for its stack, as well as a 256-bit-to-256-bit mapping for storage.

you can see people have to work around this in this popular indexer, sacrificing either precision or speed:

> Large ints such as u256 should allow multiple conversions. A value column of type u256 should allow: value_binary, value_string, value_f32, value_f64, value_u32, value_u64, and value_d128. These types can be specified at runtime using the --u256-types argument.

these are large datasets that could showcase polars well. i personally used polars like this on a 2.1 billion row, 304 GB dataset with great success, but my use case didn't require accounting-level precision since it was just a visualization.

@ritchie46 (Member)

Right, so it is a sort of catch-all type, which can be downcast to a specific value to then work with. Sounds like this is possible with a FixedSizeBinary and then a plugin for downcasting to the specific types. With that dtype and plugins I think you can go wild.
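
A pure-Python stand-in for that idea (a real plugin would be compiled Rust, e.g. via pyo3-polars; map_elements here is only meant to show the decode-on-demand shape):

import polars as pl

# Store u256 values as 32-byte big-endian blobs, then "downcast" per query.
s = pl.Series("u256", [(2**255).to_bytes(32, "big"), (10**18).to_bytes(32, "big")], dtype=pl.Binary)

as_str = s.map_elements(lambda b: str(int.from_bytes(b, "big")), return_dtype=pl.String)
as_f64 = s.map_elements(lambda b: float(int.from_bytes(b, "big")), return_dtype=pl.Float64)
print(as_str)
print(as_f64)  # lossy above 2**53, as discussed earlier in the thread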


sslivkoff commented Dec 6, 2024

the ideal u256 functionality for our use case would be performing these operations inside aggregations:

  • sum, min, max, mean, +, -, *, /, median, quantile
  • perform as df.group_by('xyz').agg(pl.sum('u256_col'))

based on this example it seems like lots of operations can be implemented via plugins. maybe median/quantile would be harder

@sslivkoff

would also want to be able to set custom display functions to render these u256 columns legibly with print(df)

any way to accomplish this?

@ritchie46 (Member)

Wouldn't a fixed-size binary type give you what you need? I fail to understand why you would need to do arithmetic on numbers so large that they require 256 bits.

With a fixed-size binary type you can store any crypto hash efficiently and get comparisons.

> the ideal u256 functionality for our use case would be performing these operations inside aggregations:
>
> * `sum`, `min`, `max`, `mean`, `+`, `-`, `*`, `/`, `median`, `quantile`
>
> * perform as `df.group_by('xyz').agg(pl.sum('u256_col'))`
>
> based on this example it seems like lots of operations can be implemented via plugins. maybe median/quantile would be harder

But why can't you downcast to the size required? Why do prices require numbers with ~78 digits?


sslivkoff commented Dec 6, 2024 via email

@coastalwhite (Collaborator)

Also sounds like you are talking about something related to #19784.


scur-iolus commented Dec 6, 2024

@ritchie46 Here is a concrete use case: I use polars for a financial application that calculates the Net Asset Values of investment funds. The value of one fund share is determined daily, and each day the calculated value is rounded to approximately ten decimal places (it depends on the fund). The value on day N determines the value on day N+1, forming a recurring sequence. If there is a rounding error at step N-100, the difference can become significant by step N, because the error compounds with each recurrence. For context, some investments correspond to several million dollars. If one share of the fund is valued at $1.24, that is not equivalent to $1.23: the division further magnifies the discrepancy (I won't even go into the complexity added by currency conversion).
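
A toy illustration of that compounding effect (made-up numbers, not the actual fund data):

from decimal import Decimal

daily_factor = Decimal("1.0001234567891234")   # hypothetical daily growth factor
precise = rounded = Decimal("1.2400000000")
for _ in range(1000):
    precise = precise * daily_factor
    rounded = (rounded * daily_factor).quantize(Decimal("1.0000000000"))  # round to 10 decimals
print(precise - rounded)   # drift accumulated purely from the daily rounding step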

I have encountered various error messages that led me here. I'm not sure if this is the best place to share them, but here are 3 tests that fail due to overflow issues:

from decimal import Decimal as Dec

import polars as pl
import pytest

def test_df_constructor_with_high_precision_dec():
    """BindingsError: Decimal is too large to fit in a Decimal128."""
    _ = pl.DataFrame(
        [
            {"a": Dec("123_456_789")},
            {"a": Dec("3.141592653589793238462643383279502884197")},
        ],
        schema={"a": pl.Decimal(precision=43, scale=34)},
    )

def test_replace_return_dec():
    """See also issue #15037."""
    col = pl.Series(name="my_data", values=["a", "b", "c"])
    mapping = {"a": Dec("4.0"), "b": Dec("5.0"), "c": Dec("6.0")}
    replaced1 = col.replace(mapping, return_dtype=pl.Decimal(scale=37))
    # v1.0.0: next line raises an InvalidOperationError, conversion failed
    replaced2 = col.replace(mapping, return_dtype=pl.Decimal(scale=38))
    assert tuple(replaced1) == (Dec("4.0"), Dec("5.0"), Dec("6.0"))  # OK
    # for some reason, the following assertion fails because values have become null
    # no error has been raised, it happened silently
    assert tuple(replaced2) == (Dec("4.0"), Dec("5.0"), Dec("6.0"))  # KO with v0.20.15
    # v1.0.0: next line raises an InvalidOperationError, conversion failed
    _ = pl.Series(replaced1, dtype=pl.Decimal(scale=38))

@pytest.mark.parametrize("datatype", [float, Dec])
def test_element_wise_multiplication_n_division(datatype) -> None:
    """Works well with floats, but not with Decimals."""
    df = pl.DataFrame(
        [
            {
                "a": datatype(f"1.{'0' * 20}"),
                "b": datatype(f"1.{'0' * 20}"),
            }
        ]
    )
    df = df.with_columns(c=pl.col("a") * pl.col("b"))
    df = df.with_columns(d=pl.col("a") / pl.col("b"))
    # next line fails: I get a Decimal('0.0131811...') probably due to an overflow?
    assert df[0, "c"] == datatype("1")
    # next line silently fails, the value is null
    assert df[0, "d"] == datatype("1")


gakonst commented Jan 6, 2025

Hi @ritchie46 -- gently following up on the above comment, in case you have any thoughts!
