perf, memory: Improve performance and memory use for large datasets #5927

Open
wants to merge 10 commits into base: main

Conversation

@mleibman-db commented Feb 21, 2025

This PR moves all duplicated row instance methods onto a shared row prototype, drastically reducing memory use for large datasets.
Fixes issue #5926.

For a 50,000 × 5 table, this reduces table memory use from 136 MB to 4.8 MB (28x).
The initialization time (accessRows()) is also reduced from 132ms to 18ms (7x).
It also speeds up grouping (measured with a modified getGroupedRowModel.test.ts) from 1169ms to 276ms (4x).
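A minimal sketch of the idea (illustrative only; the actual diff differs in the details): instead of each createRow call closing over and assigning its own copies of the API methods, the methods live on a single per-table prototype and every row is created with that prototype, so 50,000 rows share one copy of each function.

import type { Row, RowData, Table } from '@tanstack/table-core'

// One shared prototype per table instance (names here are illustrative).
const rowProtos = new WeakMap<Table<any>, object>()

function getRowProto<TData extends RowData>(table: Table<TData>): object {
  let proto = rowProtos.get(table)
  if (!proto) {
    proto = {
      getValue(this: Row<TData>, columnId: string) {
        // shared implementation: reads per-row state through `this`
        return this._valuesCache[columnId]
      },
      // ...the other row APIs, assigned once per table instead of once per row
    }
    rowProtos.set(table, proto)
  }
  return proto
}

function createRow<TData extends RowData>(
  table: Table<TData>,
  id: string,
  original: TData,
  rowIndex: number,
  depth: number
): Row<TData> {
  // Object.create links the row to the shared prototype; only plain data lives on the row itself.
  const row = Object.create(getRowProto(table)) as Row<TData>
  row.id = id
  row.original = original
  row.index = rowIndex
  row.depth = depth
  return row
}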

Before
[screenshot: memory profile before the change]

After
[screenshot: memory profile after the change]

Deltas
[screenshot: memory profile deltas]

@mleibman-db mleibman-db changed the title perf, memory: Improve performance and memory use for large datasets (#5926) perf, memory: Improve performance and memory use for large datasets Feb 21, 2025
@mleibman-db mleibman-db changed the title perf, memory: Improve performance and memory use for large datasets perf, memory: Improve performance and memory use for large datasets (WIP) Feb 21, 2025
@KevinVandy
Member

This idea is awesome, and it's something that I've been thinking about for a long time.

However, it's a bigger change and I would personally like to see this PR targeted at alpha so we can test the pattern there and even take this idea further.

V9 probably won't be stable for a few more months, but I hope you would still be able to help implement this there.

We should follow this same pattern for not only rows, but also cells, headers, and header groups. The table object is the only one where I don't see this as necessary since there is only 1 table created.

What do you think?

@mleibman-db
Author

mleibman-db commented Feb 21, 2025

@KevinVandy Thank you for the quick response!

Is there any way we can apply this fix to V8 (pretty please)?

This is a bit of a pressing issue for us. We use the table in Notebooks in our product (Databricks), where it's not unusual to have dozens of tables on a page with up to 10k rows each, which results in the page taking up multiple gigabytes of memory, with over 70% of it coming from the table. We have some ideas for workarounds in our code, but they are somewhat brittle/hacky, and I would much prefer to contribute to improving the table itself.

We could increase the scope of this change and apply the pattern more broadly (I saw a lot of areas where we could improve scalability wrt efficiency), but my goal here was to make a minimally invasive fix w/o changing the API that gets us most of the way there. As is, this PR reduces table memory use by ~28x, which is more than enough to address our current needs.

In terms of applying the same pattern to headers and header groups, yes, that would have a similar impact on tables with a very large number of columns, but that is a lot less common than tables with lots of rows. We're talking about tens of thousands of columns here. It does happen, but is something that can easily be added in a separate PR.

Applying this to cells is not as effective however, since cells are created on-demand as they are rendered, so unlike rows, they don't use up memory unless you render (or scroll through in case of virtualization) tens of thousands of cells.

@mleibman-db mleibman-db changed the title perf, memory: Improve performance and memory use for large datasets (WIP) perf, memory: Improve performance and memory use for large datasets Feb 21, 2025
@KevinVandy
Member

I'm definitely still open to merging this to v8, I'll just need to do some extensive regression testing


nx-cloud bot commented Feb 21, 2025

View your CI Pipeline Execution ↗ for commit 9cf94c1.

Command | Status | Duration | Result
nx affected --targets=test:format,test:sherif,t... | ✅ Succeeded | 2m 22s | View ↗
nx run-many --targets=build --exclude=examples/** | ✅ Succeeded | 34s | View ↗

☁️ Nx Cloud last updated this comment at 2025-02-27 15:41:17 UTC


pkg-pr-new bot commented Feb 21, 2025

Open in Stackblitz

More templates

@tanstack/angular-table

npm i https://pkg.pr.new/@tanstack/angular-table@5927

@tanstack/lit-table

npm i https://pkg.pr.new/@tanstack/lit-table@5927

@tanstack/match-sorter-utils

npm i https://pkg.pr.new/@tanstack/match-sorter-utils@5927

@tanstack/qwik-table

npm i https://pkg.pr.new/@tanstack/qwik-table@5927

@tanstack/react-table

npm i https://pkg.pr.new/@tanstack/react-table@5927

@tanstack/react-table-devtools

npm i https://pkg.pr.new/@tanstack/react-table-devtools@5927

@tanstack/solid-table

npm i https://pkg.pr.new/@tanstack/solid-table@5927

@tanstack/svelte-table

npm i https://pkg.pr.new/@tanstack/svelte-table@5927

@tanstack/vue-table

npm i https://pkg.pr.new/@tanstack/vue-table@5927

@tanstack/table-core

npm i https://pkg.pr.new/@tanstack/table-core@5927

commit: 9cf94c1

@KevinVandy
Member

@mleibman-db You can install npm i https://pkg.pr.new/@tanstack/react-table@5927 right now to try out the preview NPM version in your code.

For the alpha branch, it would need a redo instead of merging up. So your help would be appreciated there if you have time.

And yes, I forgot to include column objects in my original feedback. Those would be 2nd most important. It's interesting that cells don't have this problem as much, but that makes sense.

@mleibman-db
Author

@KevinVandy I don't do much OSS development, so I'm going to need some help / hand-holding here :) Do I need to do a separate PR to apply the changes to the alpha branch? Is that instead of this one, or in addition to? Not sure what I need to do here.

Re: doing the same thing for column objects. As I mentioned, in most cases it wouldn't be as impactful since tables with tens of thousands of columns are much rarer than tables with lots of rows, but I'm happy to make that change as well. I'd probably do that in a separate PR though to limit the scope.

@KevinVandy
Member

The scope of this PR targeting the main v8 branch is fine.

However, the alpha v9 branch has been heavily refactored with new approaches to assigning APIs to these objects. In the v9 alpha branch, I'd hope to find an approach that follows this new strategy for everything as much as possible.

@mleibman-db
Author

Ok, so IIUIC, I'll leave this PR as-is to proceed with code review, testing, and inclusion in v8, and will look at v9 to see how things are different there and what I need to do to re-apply them there.

@KevinVandy
Member

Ok, so IIUIC, I'll leave this PR as-is to proceed with code review, testing, and inclusion in v8, and will look at v9 to see how things are different there and what I need to do to re-apply them there.

That would be awesome. I realize the follow-up for the v9 alpha work is a big extra ask, but hopefully it's a fun and interesting way to help us out. It will need a slightly different approach.

One of the main goals of v9 is to strip down the bundle sizes (and memory usage) of table instances down to just the features that apps are actually using. This PR is very much on theme for that.

@mleibman-db
Author

Will do!

@@ -411,6 +403,23 @@ export const ColumnFiltering: TableFeature = {

return table._getFilteredRowModel()
}

Object.assign(getRowProto(table), {
Member

At one point, we had removed most usages of Object.assign in favor of direct assignment as a performance improvement at scale. Wonder if that's still applicable to consider here.

Author

It wouldn't be an issue here since it's only called once per table anyway. Your question would apply more to createRow() in row.ts, since we call it once per row there, but AFAIK there are no known performance issues with Object.assign(). There were some many years ago, when it had just been introduced and browser support was fresh (plus there were polyfills), but that hasn't been the case in quite some time.

Member

@KevinVandy beat me to it. I like the idea but am not a big fan of typing the prototype as CoreRow, which is not strictly accurate (and requires us to create these dummy values to keep TypeScript happy).

@mleibman-db did you try making the createRow function into a constructor function, adding the methods directly to the prototype? I haven't tried it myself but intuitively it feels like it should work. Would need to always call createRow with the new keyword I think.

Author

  1. Typing the row proto as CoreRow is actually very useful since it provides type safety and makes sure the methods only access defined props there. The use of default unused values there doesn't strike me as concerning, but we could try to replace them with some purely TypeScript type annotations, though IMHO that would be more hacky.

  2. I'm not sure I understand what you're proposing. Could you elaborate?

Member

since it provides type safety

It's the wrong type though, isn't it? The prototype shouldn't have the instance properties on it.

Could you elaborate?

I am imagining something approximately like the below. I haven't tried but think it should work, happy to be corrected. The naming would be a bit weird though. createRow should probably become just Row, but that would be a breaking change - not sure what to do about that.

const createRow = <TData>(
  this: CoreRow<TData>,
  table: Table<TData>,
  id: string,
  original: TData,
  rowIndex: number,
  depth: number,
  subRows?: Row<TData>[],
  parentId?: string
) => {
  this.id = id
  this.original = original
  // etc.
}

createRow.prototype.getValue = (columnId: string) => {
  // ...
  return this._valuesCache[columnId] as any
}

elsewhere:

const row = new createRow(...)

Member

Anywhere we think an alternative would be cleaner, but it's a breaking change, can be reserved for a v9 PR. So far this PR looks mostly good. We don't have to assign dummy vars to the prototype just to satisfy TypeScript. A cast could be acceptable there.

If the Object.assign only gets called once, that is negligible and something we don't need to worry about. Direct assignment was a performance improvement in an earlier PR that sped up rendering when creating 10k+ rows. This PR is solving the memory side of that same issue. In conclusion, I'm not worried about this after you explained more.
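For instance, something along these lines (purely illustrative) avoids seeding dummy instance values while keeping the shared methods typed against the row shape:

import type { CoreRow, RowData } from '@tanstack/table-core'

function buildRowProto<TData extends RowData>() {
  // Cast an empty object once so TypeScript sees the prototype as a CoreRow,
  // instead of populating placeholder instance fields just to satisfy the type.
  const proto = {} as CoreRow<TData>

  Object.assign(proto, {
    getValue(this: CoreRow<TData>, columnId: string) {
      return this._valuesCache[columnId]
    },
    // ...other shared row methods
  })

  return proto
}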

@tombuntus (Member) commented Feb 21, 2025

so you can't just create the proto once at the module level

But I think you can merge the feature.createRow prototypes into the prototype of the object returned by the core createRow function at runtime, when new createRow() is called. In the same loop where we currently call feature.createRow in the core createRow() function body. I haven't tested this though. In this case the prototype's methods would be created at module level on each of the features' createRow functions.

vastly preferring classes

Personally I am not opposed to using a class if it makes typing easier.

Member

(and just for anyone reading this ... the code snippet in this comment should be using function createRow() {}, not an arrow function!)

Member

But happy to change if you feel strongly about it

I was actually agreeing with you that since it's not called many times, it wouldn't be likely to cause issues. I was just trying to explain the likely cause of perf issues - not due to Object.assign() itself, but rather the fact that it is often called like this:

Object.assign(
  targetObject, // <-- existing object
  {
    // new source object which will be garbage collected eventually
  },
)

If it's used this way in a loop with many thousands of iterations, you can run into perf issues due to garbage collection.
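A tiny self-contained illustration of that distinction (not code from this PR):

// Pattern A: a fresh source literal is allocated (and later garbage collected) on every iteration.
// Pattern B: direct assignment writes the property with no intermediate allocation.
const targets = Array.from({ length: 10_000 }, () => ({} as Record<string, unknown>))
const sharedFn = () => 42

for (const t of targets) {
  Object.assign(t, { value: sharedFn }) // Pattern A
}

for (const t of targets) {
  t.value = sharedFn // Pattern B
}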

Author

We don't have to assign dummy vars to the prototype just to satisfy TypeScript. A cast could be acceptable there.

Done.

@@ -362,14 +362,6 @@ export const ColumnFiltering: TableFeature = {
}
},

createRow: <TData extends RowData>(
Member

In the core createRow function, we still call these feature.createRow functions if they exist, passing them the row and table instance. That should prevent breaking changes for existing custom features, but we may want to recommend that custom features take the same approach (i.e. extend the prototype). @KevinVandy what do you think about this?

I haven't thought all the details through, but something like this: retain a createRow function in each feature, and in the core createRow function both call feature.createRow with the row and table instances (to prevent breaking changes for existing custom features) and merge its prototype onto the core createRow prototype.

That way we could also retain the createRow functions in the core features (just move the methods onto the prototype), and I think we wouldn't need the getRowProto and Object.assign() approach.
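A very rough sketch of that shape (hypothetical names such as rowProtoMethods; nothing here reflects a decided API):

const SomeCustomFeature = {
  // still called once per row, so existing custom features keep working
  createRow(row: any, table: any) {
    // per-row initialization, if any
  },

  // shared methods the core could merge onto the per-table row prototype once
  rowProtoMethods: {
    getMyFeatureValue(this: any) {
      return this.original
    },
  },
}

// in the core setup, once per table (hypothetical):
// Object.assign(getRowProto(table), SomeCustomFeature.rowProtoMethods)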

Author

+1 to generally recommending people use the same approach for implementing custom features. I considered making things more explicit by adding methods like initRowProto() to TableFeature interface, but decided against it for simplicity's sake, plus this is more of an internal implementation detail than a public API.

Member

This kind of pattern will be useful to think about in the alpha branch though

@mleibman-db
Author

Friendly ping

The way I was setting up lazy-initialized props wasn't working since Object.assign() was invoking the prop getter, bypassing the lazy initialization.
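(A small standalone illustration of that pitfall: Object.assign reads source properties with [[Get]], so a lazy getter on the source runs immediately and only its computed value is copied.)

const lazySource = {
  get expensive() {
    console.log('computed now') // runs during Object.assign, defeating the lazy initialization
    return 42
  },
}

const copied = Object.assign({}, lazySource) // logs "computed now" and copies the value 42, not the getter

// Copying property descriptors instead preserves the getter:
const preserved = Object.defineProperties(
  {},
  Object.getOwnPropertyDescriptors(lazySource)
)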
@KevinVandy
Member

I'm not going to be able to fully review until at least Friday most likely. @tombuntus @riccardoperra might help more.

As I said earlier, you should be able to install the package preview that was generated in this pr, so hopefully it's not a blocker for you.

@mleibman-db
Author

That's ok. We can upgrade to the next v8 release with this fix.

@mleibman-db
Author

dshaduishdishudis hduish duish sui hs uish dsui hsiu d

@riccardoperra eh?

@riccardoperra
Collaborator

riccardoperra commented Feb 26, 2025

I can definitely look at the implementation better tomorrow.

It's just an opinion from a quick look, but could an implementation with classes leveraging mixins be more readable while achieving the same result? (I don't remember if classes turned out to be more performant for those cases, or maybe they require a lot more work to be used in the current core.)

@riccardoperra
Collaborator

dshaduishdishudis hduish duish sui hs uish dsui hsiu d

@riccardoperra eh?

Just my cat 😶

@mleibman-db
Author

It's just an opinion from a quick look, but could an implementation with classes leveraging mixins be more readable while achieving the same result? (I don't remember if classes turned out to be more performant for those cases.)

I started using that approach initially, but quickly got discouraged by TypeScript semantics, the general verbosity, and a bunch of other issues I can't recall right now. Plus, having multiple different versions of a Row object with different captured values and different implementations (one per table) just hurt my brain, and a more direct/manual approach ended up much more concise and readable IMHO.

Found a couple of issues with using object spreading to clone a row, which doesn't work if the row has a prototype.
@mleibman-db
Author

I found a potentially risky area. Since rows now have prototypes, we can no longer clone them using {...row} or any other method that doesn't copy the prototype. I added a clone() method to the row and fixed the only two places I could find where this was happening, but user code outside of the component may be doing the same. Alas, there is no other way of achieving the improvement we're looking for without changing the public API, so this will have to do. Definitely worth mentioning in the release notes though.
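A small illustration of why the spread breaks once rows carry a prototype, and the kind of clone() that keeps it working (the implementation in this PR may differ):

const proto = {
  getValue(this: any, key: string) {
    return this.values[key]
  },
}
const row = Object.assign(Object.create(proto), { id: '1', values: { a: 1 } })

const spreadCopy = { ...row } // copies own properties only; the prototype (and getValue) is lost
// spreadCopy.getValue('a')   // TypeError: spreadCopy.getValue is not a function

// A clone that preserves the prototype while copying own properties:
function clone<T extends object>(obj: T): T {
  return Object.assign(Object.create(Object.getPrototypeOf(obj)), obj)
}
const properCopy = clone(row)
properCopy.getValue('a') // 1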

@@ -45,7 +45,7 @@ export function getSortedRowModel<TData extends RowData>(): (
const sortData = (rows: Row<TData>[]) => {
// This will also perform a stable sorting using the row index
// if needed.
const sortedData = rows.map(row => ({ ...row }))
const sortedData = rows.map(row => row.clone())
Member

If this is the only potential breaking change, I think we may still be able to ship it. It's on the edge of what's acceptable. We don't have much of a way of knowing how many people would rely on this.

@KevinVandy
Member

If anyone wants to install and test this PR in their own codebase, you can install the preview package down below. Find all of the other adapter preview versions under "Continuous Releases" under the "All checks have passed" dropdown.

@tanstack/react-table@9cf94c1

@mleibman-db Not to beat a dead horse by over-explaining this, but you can install this instead of the actual TanStack Table package in your own codebase right now, so that your team won't be blocked regardless of whether we're able to verify that there are no breaking changes or regressions here for a wide release.

@mleibman-db
Author

mleibman-db commented Feb 27, 2025

@KevinVandy We tried that yesterday, and that's how we found the row cloning issue. With it fixed, all tests pass now.

@mleibman-db
Author

@KevinVandy So all our tests are passing, with the exception of several instances where we essentially run into an issue similar to the one I listed as a risk above.

const { id, getVisibleCells } = row;
const cells = getVisibleCells();  // ERROR: can only call this as a method on a row

@KevinVandy
Member

@KevinVandy So all our tests are passing, with the exception of several instances where we essentially run into an issue similar to the one I listed as a risk above.

const { id, getVisibleCells } = row;
const cells = getVisibleCells();  // ERROR: can only call this as a method on a row

You're saying destructuring doesn't work with prototype methods?

@mleibman-db
Author

mleibman-db commented Feb 27, 2025

You're saying destructuring doesn't work with prototype methods?

No, destructuring works just fine, but you can't invoke an object's method detached from the object, since it loses its this binding.
For example:

const mySet = new Set();

// this works
mySet.has('entry');

// this doesn't
const setHas = mySet.has;
setHas('entry');

// or this
const { has: setHas } = mySet;
setHas('entry');
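(For anyone hitting this: binding the method to the object before detaching it is a common workaround, though it's extra ceremony compared to today's per-row closures.)

// binding restores the receiver, so the detached reference still works
const setHasBound = mySet.has.bind(mySet);
setHasBound('entry'); // works (returns false here, since 'entry' was never added)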

@KevinVandy
Member

That makes sense. But to be clear, you are saying that anyone who is currently destructuring methods from rows and then tries to call them is going to get unexpected results/errors? That would be a pretty large breaking change. Totally acceptable for v9, but that makes me doubt whether we would be able to safely release this for v8.

@mleibman-db
Author

That makes sense. But to be clear, you are saying that anyone who is currently destructuring methods from rows and then tries to call them is going to get unexpected results/errors?

Yes.

That would be a pretty large breaking change. Totally acceptable for v9, but that makes me doubt whether we would be able to safely release this for v8.

The only way around it I can see for v8 is to make it optional/configurable and keep the default behavior backwards compatible.
