Accessing data in a DataFrameColumn is insanely slow. #5966

Open
DrDryg opened this issue Oct 10, 2021 · 2 comments
Labels: Microsoft.Data.Analysis (all DataFrame related issues and PRs), perf (performance and benchmarking related)

Comments

DrDryg commented Oct 10, 2021

System Information (please complete the following information):

  • Win 10
  • Microsoft.Data.Analysis 0.18.0
  • .NET Framework 4.7.2

Describe the bug
Accessing data in a PrimitiveDataFrameColumn<> is very very very slow.

To Reproduce
int n = 1_000_000;
PrimitiveDataFrameColumn<int> column = new PrimitiveDataFrameColumn<int>("Name", n);

// Assign every element through the column indexer.
for (int i = 0; i < n; i++)
    column[i] = 1;

Expected behavior
Filling in values in a column should cost a few clock cycles per value, so at least 100 million values per second should be achievable on a normal computer. Instead, 1 million elements take around 0.5 s on a new high-performance laptop.

Is it simply that nullable objects are this slow? If that is the case, why did you go for such a technology for a data processing library where performance is a key factor?

For perspective, writing the data to disk is 10 times faster!
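
For scale, the figures above work out to about 500 ns per element (0.5 s / 1,000,000), or 2 million values per second, which is roughly 50 times below the 100 million per second target. One way to check whether Nullable<T> alone could explain that gap is to time the same loop against plain arrays, with no DataFrame code involved. The following is a minimal sketch that is not taken from the thread: NullableFillSketch and its method names are hypothetical, and Stopwatch timings are only rough (a BenchmarkDotNet run would be more reliable).

```csharp
using System;
using System.Diagnostics;

public static class NullableFillSketch
{
    // Writes 1 into every slot of a plain int[]; returns elapsed milliseconds.
    public static double TimePlainFill(int[] target)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < target.Length; i++)
            target[i] = 1;
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    // Same loop over int?[] (Nullable<int>), so each store also sets HasValue.
    public static double TimeNullableFill(int?[] target)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < target.Length; i++)
            target[i] = 1;
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        const int n = 1_000_000;
        var plain = new int[n];
        var nullable = new int?[n];

        // Warm-up pass so JIT compilation is not counted in the timings.
        TimePlainFill(plain);
        TimeNullableFill(nullable);

        Console.WriteLine($"int[]  fill: {TimePlainFill(plain):F3} ms");
        Console.WriteLine($"int?[] fill: {TimeNullableFill(nullable):F3} ms");
    }
}
```

If both loops finish far faster than the column fill, the nullable wrapper by itself is unlikely to be the main cost, and the indexer path would be the next thing to profile.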

michaelgsharp added the Microsoft.Data.Analysis and perf labels on Oct 14, 2021

pgovind commented Jan 28, 2022

> Is it simply that nullable objects are this slow?

Just my initial impression here: Are you able to test this by doing the following?

int n = 1_000_000;
Int32DataFrameColumn column = new Int32DataFrameColumn("Name", n);

for (int i = 0; i < n; i++)
    column[i] = 1;

My guess is that it will be much faster.
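
The comparison suggested above could be timed along these lines. This is a minimal sketch, assuming the Microsoft.Data.Analysis package is referenced and that the original repro used PrimitiveDataFrameColumn<int>; ColumnFillTiming and its method names are hypothetical, and a single Stopwatch pass gives only a rough number.

```csharp
using System;
using System.Diagnostics;
using Microsoft.Data.Analysis;

public static class ColumnFillTiming
{
    // Fills a generic primitive column through its indexer; returns elapsed milliseconds.
    public static double TimeGenericFill(PrimitiveDataFrameColumn<int> column)
    {
        var sw = Stopwatch.StartNew();
        for (long i = 0; i < column.Length; i++)
            column[i] = 1;
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    // Same loop, with the column declared as Int32DataFrameColumn.
    public static double TimeTypedFill(Int32DataFrameColumn column)
    {
        var sw = Stopwatch.StartNew();
        for (long i = 0; i < column.Length; i++)
            column[i] = 1;
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        const int n = 1_000_000;

        // Small warm-up columns so JIT compilation is not counted below.
        TimeGenericFill(new PrimitiveDataFrameColumn<int>("WarmUp", 1000));
        TimeTypedFill(new Int32DataFrameColumn("WarmUp", 1000));

        var generic = new PrimitiveDataFrameColumn<int>("Generic", n);
        var typed = new Int32DataFrameColumn("Typed", n);

        Console.WriteLine($"PrimitiveDataFrameColumn<int>: {TimeGenericFill(generic):F1} ms");
        Console.WriteLine($"Int32DataFrameColumn:          {TimeTypedFill(typed):F1} ms");
    }
}
```

Printing both timings side by side makes it easy to report back whether the typed column changes anything on the affected machine.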

asmirnov82 (Contributor) commented

Compare indexed reads of a double column against a double array. Reading the array is roughly 40 times faster (10,970.0 us vs 291.5 us in the results that follow the code):

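[GlobalSetup]
public void SetUp()
{
    // ItemsCount and the _doubleColumn / _doubleArr fields are declared
    // elsewhere in the benchmark class (not shown in this comment).
    var values = Enumerable.Range(1, ItemsCount).ToArray();

    _doubleColumn = new DoubleDataFrameColumn("Column2", values.Select(v => (double)v));
    _doubleArr = new double[ItemsCount];
}

[Benchmark]
public void Indexing_Double_Column()
{
    double? a = 0;
    // Each read goes through the column indexer and returns a double?.
    for (int i = 0; i < _doubleColumn.Length; i++)
        a = _doubleColumn[i];
}

[Benchmark]
public void Indexing_Double_Array()
{
    double a = 0;
    // Same iteration count as the column benchmark, reading the raw array.
    for (int i = 0; i < _doubleColumn.Length; i++)
        a = _doubleArr[i];
}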

| Method | Mean | Error | StdDev |
|---|---|---|---|
| Indexing_Double_Column | 10,970.0 us | 190.83 us | 178.50 us |
| Indexing_Double_Array | 291.5 us | 0.62 us | 0.55 us |
