add interlacer support via read_interlaced_resource()
#213
Just put together a quick sketch of what interlacer support might look like for frictionless-r.

Key features that interlacer brings:
- `field.missingValues` #174
- `field.categorical` and `field.categoriesOrdered` #148
- `list` (Support new field type `list` #179)

Example csv in `tests/testthat/data/type_interlaced.csv`:

Example run:
Notes / thoughts
The complexity of `read_resource` is getting pretty out of hand (as you note in Make `read_resource()` more modular #210). I would argue that it's not just that we need to make it more modular; what really hurts us is that we're writing a complex parser in a language that doesn't support static typing. So we can't just validate the JSON against an object schema (as you'd do with pydantic or serde) and then rely on the type signatures thereafter to guarantee valid types; instead, type validation is mixed into all of our transformation logic (look at all the null checks that are necessary, for example). Without static typing, it becomes very difficult to manage the input possibilities and decouple input validation from transformation. As I mentioned in the last frictionless call, in the long term I want to write a library in Rust to help address / standardize / streamline some of this, but in the short term I agree that your ideas in Make `read_resource()` more modular #210 will go a long way.

While working on this I noticed some more (easily fixable) performance issues with `read_interlaced_*()`:

Because interlacer is still in its infancy, I don't think it should be the default reader. So presently I have an
`interlaced` argument in `read_resource`, as well as an alias function `read_interlaced_resource()` that runs with `interlaced = TRUE`.

When `interlaced = FALSE`, it uses `read_delim()` instead of `read_interlaced_delim()`. This has the following effects:

The reason for this is that, even after I fix the performance issues above, we're still looking at a speed compromise if we want to load non-vroom types (interlaced columns, cfactors, `list`, etc.) with frictionless-r. This is because if we touch a column in R after loading it with vroom, the column gets loaded into memory, thereby defeating the lazy ALTREP loading of the column and slowing things down. In the long term I want to work with ALTREP so that it lazy-loads everything just like vroom, but in the short term I'm planning to focus on the larger low-hanging fruit (like the issues I link above).
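
To make the earlier point about decoupling validation from transformation a bit more concrete, here is a rough base-R sketch of what "validate once, then transform" could look like for field descriptors. The helper names (`normalize_field()`, `col_code()`) and the short list of supported types are hypothetical and only for illustration; this is not how `read_resource()` is currently structured.

```r
# Hypothetical helper: validate one field descriptor (already parsed from
# JSON into an R list) and fill in defaults, so that later code can rely on
# a fixed shape instead of repeating NULL checks.
normalize_field <- function(field) {
  stopifnot(is.list(field))

  # Use [[ ]] rather than $ to avoid partial matching on list names.
  name <- field[["name"]]
  if (!is.character(name) || length(name) != 1) {
    stop("field `name` must be a single string", call. = FALSE)
  }

  # Assumption for this sketch: a missing type falls back to "string".
  type <- field[["type"]]
  if (is.null(type)) type <- "string"
  known_types <- c("string", "number", "integer", "boolean", "date", "list")
  if (!is.character(type) || length(type) != 1 || !(type %in% known_types)) {
    stop("unsupported type for field `", name, "`", call. = FALSE)
  }

  # A NULL or list-valued missingValues becomes a plain character vector.
  missing_values <- as.character(unlist(field[["missingValues"]]))

  list(name = name, type = type, missing_values = missing_values)
}

# Transformation step: assumes a normalized field, so no NULL checks here.
# Returns a readr-style compact column code ("list" is read as character
# in this sketch and would be parsed afterwards).
col_code <- function(field) {
  switch(field$type,
    string  = "c",
    number  = "d",
    integer = "i",
    boolean = "l",
    date    = "D",
    list    = "c"
  )
}

fields <- list(
  list(name = "id", type = "integer"),
  list(name = "age", type = "number",
       missingValues = list("REFUSED", "OMITTED")),
  list(name = "notes")
)

normalized <- lapply(fields, normalize_field)
col_codes <- vapply(normalized, col_code, character(1))
names(col_codes) <- vapply(normalized, function(f) f$name, character(1))
```

All the defensive checks live in `normalize_field()`, so `col_code()` and anything downstream can stay short. It doesn't give us what static types would, but it keeps the failure points in one place.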