Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract registering default features from SessionState and into its own function #11320

Closed
alamb opened this issue Jul 7, 2024 · 5 comments · Fixed by #11403
Closed

Extract registering default features from SessionState and into its own function #11320

alamb opened this issue Jul 7, 2024 · 5 comments · Fixed by #11403
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jul 7, 2024

Is your feature request related to a problem or challenge?

Using SessionContext provides a batteries included experience, as it configures and installs many functions, rewrites, data providers, etc.

However, for those people who want to more carefully control how DataFusion works (as we do in InfluxDB and I think delta.rs does with DeltaTables) it is somewhat akward to do so. It also makes it harder to

One of the major challenges is that SessionState's constructor basically installs the "pre-provided" functionality (like data sources, rewrites, udfs, etc)

pub fn new_with_config_rt_and_catalog_list(
config: SessionConfig,
runtime: Arc<RuntimeEnv>,
catalog_list: Arc<dyn CatalogProviderList>,
) -> Self {
let session_id = Uuid::new_v4().to_string();
// Create table_factories for all default formats
let mut table_factories: HashMap<String, Arc<dyn TableProviderFactory>> =
HashMap::new();
#[cfg(feature = "parquet")]
table_factories.insert("PARQUET".into(), Arc::new(DefaultTableFactory::new()));
table_factories.insert("CSV".into(), Arc::new(DefaultTableFactory::new()));
table_factories.insert("JSON".into(), Arc::new(DefaultTableFactory::new()));
table_factories.insert("NDJSON".into(), Arc::new(DefaultTableFactory::new()));
table_factories.insert("AVRO".into(), Arc::new(DefaultTableFactory::new()));
table_factories.insert("ARROW".into(), Arc::new(DefaultTableFactory::new()));
if config.create_default_catalog_and_schema() {
let default_catalog = MemoryCatalogProvider::new();
default_catalog
.register_schema(
&config.options().catalog.default_schema,
Arc::new(MemorySchemaProvider::new()),
)
.expect("memory catalog provider can register schema");
Self::register_default_schema(
&config,
&table_factories,
&runtime,
&default_catalog,
);
catalog_list.register_catalog(
config.options().catalog.default_catalog.clone(),
Arc::new(default_catalog),
);
}
let mut new_self = SessionState {
session_id,
analyzer: Analyzer::new(),
optimizer: Optimizer::new(),
physical_optimizers: PhysicalOptimizer::new(),
query_planner: Arc::new(DefaultQueryPlanner {}),
catalog_list,
table_functions: HashMap::new(),
scalar_functions: HashMap::new(),
aggregate_functions: HashMap::new(),
window_functions: HashMap::new(),
serializer_registry: Arc::new(EmptySerializerRegistry),
file_formats: HashMap::new(),
table_options: TableOptions::default_from_session_config(config.options()),
config,
execution_props: ExecutionProps::new(),
runtime_env: runtime,
table_factories,
function_factory: None,
};
#[cfg(feature = "parquet")]
if let Err(e) =
new_self.register_file_format(Arc::new(ParquetFormatFactory::new()), false)
{
log::info!("Unable to register default ParquetFormat: {e}")
};
if let Err(e) =
new_self.register_file_format(Arc::new(JsonFormatFactory::new()), false)
{
log::info!("Unable to register default JsonFormat: {e}")
};
if let Err(e) =
new_self.register_file_format(Arc::new(CsvFormatFactory::new()), false)
{
log::info!("Unable to register default CsvFormat: {e}")
};
if let Err(e) =
new_self.register_file_format(Arc::new(ArrowFormatFactory::new()), false)
{
log::info!("Unable to register default ArrowFormat: {e}")
};
if let Err(e) =
new_self.register_file_format(Arc::new(AvroFormatFactory::new()), false)
{
log::info!("Unable to register default AvroFormat: {e}")
};
// register built in functions
functions::register_all(&mut new_self)
.expect("can not register built in functions");
// register crate of array expressions (if enabled)
#[cfg(feature = "array_expressions")]
functions_array::register_all(&mut new_self)
.expect("can not register array expressions");
functions_aggregate::register_all(&mut new_self)
.expect("can not register aggregate functions");
new_self

This pre-installation also makes it hard to split SessionState / catalog information out of the core as discussed in #11182

Describe the solution you'd like

I would like to be able to construct SessionState with the minimal built in features, and have a separate function / configuration system

That way it would still be easy to use DataFusion with a minimal SessionState but also easily register all the built in extensions 🤔

Describe alternatives you've considered

Idea 1: function in SessionContext

SessionContext like SessionContext::register_built_ins that would register things like listing table, information schema, etc.

Then SessionContext::new() etc could call this function but there should be a way to create a SessionState without calling it (for minimal use)

Additional context

This came up most recently with @rtyler in #11180 (comment). I am not sure if it is quite the same thing, but I think the symptoms are at least similar

@jayzhan211 @Omega359 and @ahirner and I discussed this in the context of
#11182 -- #11182 (comment)

@jayzhan211
Copy link
Contributor

Should we cut it even more granularly to several of register_* functions or configure them like existing create_default_catalog_and_schema?

@Omega359
Copy link
Contributor

Omega359 commented Jul 8, 2024

I was actually looking at this last week interestingly enough. My thought was to have a parent register_all_defaults which delegates to individual functions for registering defaults for catalogues, table factories, etc. The current flag to create_default_catalog_and_schema would then just be an in statement calling out to the register_all_defaults function.

In doing it that way we could allow for

  1. Existing behaviour would be unchanged
  2. Users could register all defaults or
  3. Users could register defaults for parts of their choosing

@alamb
Copy link
Contributor Author

alamb commented Jul 9, 2024

My thought was to have a parent register_all_defaults which delegates to individual functions for registering defaults for catalogues, table factories, etc.

I think this is a good idea.

If we want htat level of customizablility, perhaps we could make a builder style interface. Something like 🤔

let builder = SessionStateBuilder::new()
  .with_default_functions(false);

let state: SessionState = builder.build()?;

@Omega359
Copy link
Contributor

Would we want to then deprecate the SessionState::new_with_config_rt and SessionState::new_with_config_rt_and_catalog_list as well (basically all the SessionState::new_* functions) leaving the builder as the preferred method to create a SessionState ?

@Omega359
Copy link
Contributor

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants