Tracking issue: Migrate from re_arrow2 to arrow #3741

Open
1 of 5 tasks
teh-cmc opened this issue Oct 9, 2023 · 4 comments
Labels
🏹 arrow concerning arrow blocked can't make progress right now dependencies concerning crates, pip packages etc 🦀 Rust API Rust logging API

Comments

@teh-cmc
Member

teh-cmc commented Oct 9, 2023

Currently blocked on:


Multiple end-goals:


TODO (split into sub-issues as needed):

On the way there we might hit a few bumps because we have a lot of redundant ad-hoc code that integrates with polars (which is built on top of arrow2).

The solution to this is to make sure we only integrate with polars in one single place: the Data{Cell,Row,Table} layer (#1692).
Once that's done, we can remove all ad-hoc polars code everywhere and just build a Data{Row,Cell,Table} anytime we want a polars::Series/polars::DataFrame (#1759).

Internally, the conversion from DataTable to polars::DataFrame will require a zero-copy tri-stage conversion from arrow1->arrow2->polars.
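To make the zero-copy chain concrete, here is a minimal sketch using stand-in types only. `ArrowBuffer`, `Arrow2Buffer`, `Series` and `to_series` are hypothetical names invented for illustration; they are not the real arrow, arrow2, or polars types. The idea is that each stage hands over a shared pointer to the same allocation, and that the whole chain lives in one function.

```rust
use std::sync::Arc;

// Hypothetical stand-ins: none of these are real arrow, arrow2, or polars types.
#[derive(Debug, Clone)]
pub struct ArrowBuffer(pub Arc<[f64]>);

#[derive(Debug, Clone)]
pub struct Arrow2Buffer(pub Arc<[f64]>);

#[derive(Debug, Clone)]
pub struct Series {
    pub name: String,
    pub values: Arc<[f64]>,
}

impl From<ArrowBuffer> for Arrow2Buffer {
    fn from(buf: ArrowBuffer) -> Self {
        // Moves the shared pointer; the underlying values are not copied.
        Arrow2Buffer(buf.0)
    }
}

impl Series {
    pub fn from_arrow2(name: &str, buf: Arrow2Buffer) -> Self {
        // Again only the pointer moves; the allocation stays where it is.
        Series {
            name: name.to_owned(),
            values: buf.0,
        }
    }
}

/// The single place where the three-stage conversion chain lives.
pub fn to_series(name: &str, buf: ArrowBuffer) -> Series {
    Series::from_arrow2(name, Arrow2Buffer::from(buf))
}
```

Comparing the data pointer before and after `to_series` is one way to check that no stage copied the values.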


@teh-cmc teh-cmc added the 🏹 arrow concerning arrow label Oct 9, 2023
@teh-cmc teh-cmc mentioned this issue Oct 16, 2023
4 tasks
teh-cmc added a commit that referenced this issue Oct 16, 2023
This PR introduces a new crate: `re_types_core`.

`re_types_core` only contains the fundamental traits and types that make
up Rerun's data model.
It is split off from the existing `re_types`.

This makes it possible to work with our data model abstractions without
having to depend on the `re_types` behemoth.
This is more than a DX improvement: since so many things depend directly
or indirectly on `re_types`, it is very easy to end up with unsolvable
dependency cycles. This helps with that in some cases (though certainly
not all).

In particular, `re_tuid` (and by extension `re_format`) are now
completely free of `re_types`.

For convenience, `re_types` reexports all of `re_types_core`, so the
public API looks unchanged.
In a handful of instances (`re_arrow_store`, `re_data_store`,
`re_log_types`, `re_query`), I've gone the extra mile and started
porting these crates towards raw `re_types_core` rather than relying on
the reexports.
The reason is that, upon closer inspection, these crates are very close
to being able to live free of `re_types`. In the future, the custom
crate and custom module attributes coming with #3741 might allow us to
make these independent.

Similarly, the codegen now uses `re_types_core` directly, as that makes
the life of the upcoming "serde-codegen" work much easier.
teh-cmc added a commit that referenced this issue Oct 17, 2023
**Commit by commit**

This is necessary refactoring work for the upcoming
`attr.rust.custom_crate` attribute, itself necessary for the upcoming
serde-codegen support, itself necessary for the upcoming blueprint
experimentations as well as #3741.

### Changes

1. The `CodeGenerator` trait as well as all post-processing helpers
(gitattributes, orphan detection...) are now I/O-free.
  ```rust
  pub type GeneratedFiles =
      std::collections::BTreeMap<camino::Utf8PathBuf, String>;

  pub trait CodeGenerator {
      fn generate(
          &mut self,
          reporter: &crate::Reporter,
          objects: &crate::Objects,
          arrow_registry: &crate::ArrowRegistry,
      ) -> GeneratedFiles;
  }
  ```
2. All post-processing helpers are now agnostic to the output location.
This is very important as it makes it possible to generate e.g. rust
code out of the `re_types` crate without everything crumbling down.
A side-effect is that gitattributes files are now finer-grained.
3. The Rust codegen pass is now crate agnostic: it is driven by the
workspace path rather than a specific crate path.
Necessary for the upcoming `attr.rust.custom_crate`.
4. All codegen passes now follow the exact same 4-step structure:
  ```rust
  // 1. Generate in-memory code files.
  let mut gen = MyGenerator::new();
  let mut files = gen.generate(reporter, objects, arrow_registry);
  // 2. Generate in-memory attribute files.
  generate_gitattributes_for_generated_files(&mut files);
  // 3. Write all in-memory files to disk.
  write_files(&gen.pkg_path, &gen.testing_pkg_path, &files);
  // 4. Remove orphaned files.
  crate::codegen::common::remove_orphaned_files(reporter, &files);
  ```
5. The documentation codegen pass now removes its orphans, which is why
some `md` files were removed in this PR.
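
As a rough illustration of why keeping the generator and the post-processing helpers I/O-free is convenient, here is a self-contained sketch with simplified stand-ins. `ToyGenerator`, `RustStubGenerator`, `add_gitattributes` and `find_orphans` are hypothetical names, not the actual `re_types_builder` API, and the `linguist-generated=true` attribute line is only an assumed example of what a generated `.gitattributes` might contain. Everything operates on an in-memory map, so it can be checked without touching the disk.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::path::PathBuf;

// Simplified stand-in for the `GeneratedFiles` alias, using std paths.
pub type GeneratedFiles = BTreeMap<PathBuf, String>;

/// Hypothetical, simplified generator trait: no reporter, objects, or registry.
pub trait ToyGenerator {
    fn generate(&mut self, names: &[&str]) -> GeneratedFiles;
}

pub struct RustStubGenerator;

impl ToyGenerator for RustStubGenerator {
    // Step 1: produce in-memory code files, one per name.
    fn generate(&mut self, names: &[&str]) -> GeneratedFiles {
        names
            .iter()
            .map(|name| {
                let path = PathBuf::from(format!("src/{name}.rs"));
                let code = format!("pub struct {name};\n");
                (path, code)
            })
            .collect()
    }
}

/// Step 2: add one `.gitattributes` per directory, listing its generated files.
pub fn add_gitattributes(files: &mut GeneratedFiles) {
    let mut per_dir: BTreeMap<PathBuf, Vec<String>> = BTreeMap::new();
    for path in files.keys() {
        if let (Some(dir), Some(file)) = (path.parent(), path.file_name()) {
            per_dir
                .entry(dir.to_path_buf())
                .or_default()
                .push(file.to_string_lossy().into_owned());
        }
    }
    for (dir, names) in per_dir {
        let body: String = names
            .iter()
            .map(|n| format!("{n} linguist-generated=true\n"))
            .collect();
        files.insert(dir.join(".gitattributes"), body);
    }
}

/// Step 4 (pure part): which existing paths are no longer generated?
pub fn find_orphans(existing: &BTreeSet<PathBuf>, files: &GeneratedFiles) -> Vec<PathBuf> {
    existing
        .iter()
        .filter(|p| !files.contains_key(*p))
        .cloned()
        .collect()
}
```

Step 3 (writing to disk) and the actual deletion of orphans are deliberately left out, since those are the only parts that need I/O.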

---

- Unblocks #3741 
- Unblocks #3495
@emilk
Member

emilk commented Jul 8, 2024

re_arrow2 has an arrow feature, with glue for converting data between arrow and re_arrow2: https://docs.rs/re_arrow2/0.17.4/re_arrow2/array/trait.Arrow2Arrow.html

Using that we can start this migration piece-wise. It would double the dependencies for a transitional period, leading to longer compilation times and a bigger .wasm binary, but I think that is an ok tradeoff.
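
A rough sketch of what piece-wise migration through conversion glue could look like, using stand-in types only. The `Glue` trait, `RawData`, `OldArray` and `NewArray` below are hypothetical and do not reproduce the real `Arrow2Arrow` signature. The point is that already-migrated code only sees the new type, while not-yet-migrated callers convert at the boundary.

```rust
use std::sync::Arc;

/// Hypothetical neutral representation shared by both sides of the boundary.
#[derive(Debug, Clone, PartialEq)]
pub struct RawData {
    pub values: Arc<[i64]>,
}

/// Hypothetical glue trait; not the real `Arrow2Arrow` signature.
pub trait Glue: Sized {
    fn to_raw(&self) -> RawData;
    fn from_raw(raw: &RawData) -> Self;
}

/// Stand-in for an array type from the old library.
#[derive(Debug, Clone, PartialEq)]
pub struct OldArray(pub Arc<[i64]>);

/// Stand-in for an array type from the new library.
#[derive(Debug, Clone, PartialEq)]
pub struct NewArray(pub Arc<[i64]>);

impl Glue for OldArray {
    fn to_raw(&self) -> RawData {
        // Cloning an `Arc` bumps a reference count; the values are shared.
        RawData { values: Arc::clone(&self.0) }
    }
    fn from_raw(raw: &RawData) -> Self {
        OldArray(Arc::clone(&raw.values))
    }
}

impl Glue for NewArray {
    fn to_raw(&self) -> RawData {
        RawData { values: Arc::clone(&self.0) }
    }
    fn from_raw(raw: &RawData) -> Self {
        NewArray(Arc::clone(&raw.values))
    }
}

/// Already migrated: only knows about the new array type.
pub fn sum_new(array: &NewArray) -> i64 {
    array.0.iter().sum()
}

/// Not yet migrated: still holds the old type and converts at the call site.
pub fn sum_from_old(array: &OldArray) -> i64 {
    sum_new(&NewArray::from_raw(&array.to_raw()))
}
```

Once every caller of `sum_from_old` has been ported, the shim and the old type can be deleted without touching `sum_new`.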

Potential roadmap:

After de-chunkification:

  • Migrate codegenned deserialization
  • Migrate everything else

As of 2024-07-08, there are only around 300 lines of Rust referencing the string arrow2 directly, when one ignores generated code.

Ignored paths: crates/re_types/**, crates/re_types_core/src/archetypes/**, crates/re_types_core/src/datatypes/**, crates/re_types_core/src/components/**, crates/re_types_blueprint/src/blueprint/components/**, crates/re_types_blueprint/src/blueprint/archetypes/**
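
For reference, a count like this can be reproduced with a small helper along the following lines. This is only a sketch: `is_ignored` and `count_references` are hypothetical names, the globs are approximated as plain path prefixes, and file contents are passed in as strings rather than read from disk.

```rust
/// Path prefixes to skip; the ignore globs above, approximated as prefixes.
const IGNORED_PREFIXES: &[&str] = &[
    "crates/re_types/",
    "crates/re_types_core/src/archetypes/",
    "crates/re_types_core/src/datatypes/",
    "crates/re_types_core/src/components/",
    "crates/re_types_blueprint/src/blueprint/components/",
    "crates/re_types_blueprint/src/blueprint/archetypes/",
];

/// Returns true if `path` falls under one of the generated-code directories.
pub fn is_ignored(path: &str) -> bool {
    IGNORED_PREFIXES
        .iter()
        .any(|prefix| path.starts_with(*prefix))
}

/// Counts lines containing `needle` across all non-ignored `.rs` files.
/// `files` yields `(path, contents)` pairs.
pub fn count_references<'a>(
    files: impl IntoIterator<Item = (&'a str, &'a str)>,
    needle: &str,
) -> usize {
    files
        .into_iter()
        .filter(|(path, _)| path.ends_with(".rs") && !is_ignored(path))
        .map(|(_, contents)| {
            contents
                .lines()
                .filter(|line| line.contains(needle))
                .count()
        })
        .sum()
}
```

Note that the trailing slash in `crates/re_types/` matters: without it, the prefix would also swallow all of `crates/re_types_core/`.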

@emilk emilk self-assigned this Jul 8, 2024
@emilk emilk changed the title Tracking issue: arrow cleanup & migration away from arrow2{-convert} Tracking issue: Migrate from re_arrow2 to arrow Jul 8, 2024
@emilk emilk added dependencies concerning crates, pip packages etc 🦀 Rust API Rust logging API labels Jul 9, 2024
@emilk emilk removed their assignment Jul 9, 2024
@jleibs
Member

jleibs commented Jul 10, 2024

teh-cmc added a commit that referenced this issue Aug 23, 2024
Remove unused old traits.

Part of a lot of cleanup I want to do while we head towards:
* #7245
* #3741
teh-cmc added a commit that referenced this issue Aug 23, 2024
It doesn't make any sense for a `ComponentBatch` to have any say in what
the final `ArrowField` should look like.

An `ArrowField` is a `Chunk`/`RecordBatch`/`Schema`-level concern that
only makes sense during IO/transport/FFI/storage/etc, and which requires
external context that a single `ComponentBatch` on its own has no idea
of.

---

Part of a lot of cleanup I want to do while we head towards:
* #7245
* #3741
@teh-cmc
Member Author

teh-cmc commented Aug 31, 2024

@teh-cmc teh-cmc added the blocked can't make progress right now label Aug 31, 2024
@teh-cmc
Member Author

teh-cmc commented Sep 5, 2024

@emilk emilk mentioned this issue Sep 16, 2024
6 tasks
Wumpf pushed a commit that referenced this issue Sep 16, 2024
### What
* Waiting for a proper fix in
apache/arrow-rs#6401
* Should be fixed before #3741
is considered finished

### Checklist
* [x] I have read and agree to [Contributor
Guide](https://github.com/rerun-io/rerun/blob/main/CONTRIBUTING.md) and
the [Code of
Conduct](https://github.com/rerun-io/rerun/blob/main/CODE_OF_CONDUCT.md)
* [x] I've included a screenshot or gif (if applicable)
* [x] I have tested the web demo (if applicable):
* Using examples from latest `main` build:
[rerun.io/viewer](https://rerun.io/viewer/pr/7426?manifest_url=https://app.rerun.io/version/main/examples_manifest.json)
* Using full set of examples from `nightly` build:
[rerun.io/viewer](https://rerun.io/viewer/pr/7426?manifest_url=https://app.rerun.io/version/nightly/examples_manifest.json)
* [x] The PR title and labels are set such as to maximize their
usefulness for the next release's CHANGELOG
* [x] If applicable, add a new check to the [release
checklist](https://github.com/rerun-io/rerun/blob/main/tests/python/release_checklist)!
* [x] I have noted any breaking changes to the log API in
`CHANGELOG.md` and the migration guide

- [PR Build Summary](https://build.rerun.io/pr/7426)
- [Recent benchmark results](https://build.rerun.io/graphs/crates.html)
- [Wasm size tracking](https://build.rerun.io/graphs/sizes.html)

To run all checks from `main`, comment on the PR with `@rerun-bot
full-check`.