Kotlin DataFrame compiler plugin #704

Open
koperagen opened this issue May 22, 2024 · 1 comment
Labels: Compiler plugin (anything related to the DataFrame Compiler Plugin), enhancement (new feature or request), research (this requires a deeper dive to gather a better understanding)

@koperagen
Collaborator

koperagen commented May 22, 2024

A place for discussion and questions about the Kotlin DataFrame compiler plugin.

The idea behind it is to make code like the following compile, and to provide coding assistance in project files and, later, in Kotlin Notebooks, on top of the existing code generation that runs between notebook cells:

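import org.jetbrains.kotlinx.dataframe.annotations.DataSchema
import org.jetbrains.kotlinx.dataframe.api.*

@DataSchema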
data class WikiData(val name: String, val paradigms: List<String>)

fun main() {
    val df = dataFrameOf(
        WikiData("Kotlin", listOf("object-oriented", "functional", "imperative")),
        WikiData("Haskell", listOf("Purely functional")),
        WikiData("C", listOf("imperative")),
    )
    val df1 = df.add("size") { 
        paradigms.size // `paradigms` is generated based on WikiData class structure
    }
    // `size` property is generated based on `add` argument
    df1.size.print()
}

Demo project that you can clone and run:
https://github.com/koperagen/df-plugin-demo

Issue that describes the required compiler API and provides some background on the use case:
https://youtrack.jetbrains.com/issue/KT-65859

koperagen self-assigned this on May 22, 2024
Jolanrensen added this to the Backlog milestone on Jun 19, 2024
Jolanrensen added the enhancement and research labels on Jun 19, 2024
@Jolanrensen
Collaborator

We might need to do some additional research with regard to the maintainability of the implementation, mainly the cases where we have to write the same DataFrame logic in two places.

Doing operations on DataFrames with the plugin happens in two places:

  • The library itself
    • This works mostly at runtime
    • Is based on the structure, types, and names of the DataFrame, and also on its data
  • The compiler plugin
    • This works during code analysis
    • Is purely based on structure, types and names of the DataFrame
    • It could carry some information under-the-hood, for example:
      • @Import json data
      • The state of a DSL scope, such as groupBy {}
      • df.transpose() -> The df will have a keys: String column containing the previous column names
      • Etc.
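
To make the distinction above concrete, here is a minimal sketch of what "purely based on structure, types and names" could mean for an operation like `add`. The `ColumnInfo` and `SchemaSketch` types are hypothetical and exist only for this illustration; they are not the plugin's actual classes, and types are represented as plain strings to keep the example self-contained.

```kotlin
// Hypothetical, simplified model of what is known at analysis time:
// column names and types only, never the cell values.
data class ColumnInfo(val name: String, val type: String)

data class SchemaSketch(val columns: List<ColumnInfo>) {
    // Schema-level counterpart of `df.add(name) { ... }`: only the new column's
    // name and the lambda's return type are needed, not the data.
    fun add(name: String, type: String): SchemaSketch =
        SchemaSketch(columns + ColumnInfo(name, type))
}

fun main() {
    val wikiData = SchemaSketch(
        listOf(
            ColumnInfo("name", "String"),
            ColumnInfo("paradigms", "List<String>"),
        )
    )
    val withSize = wikiData.add("size", "Int")
    println(withSize.columns.map { it.name }) // prints [name, paradigms, size]
}
```

The runtime library performs the same structural change but additionally has to compute the values of the new column, which is where the two implementations diverge.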

I believe we should try, wherever we can, to share the logic between these two scopes. This can only be done in places where the logic is exclusively dependent on the structure, types, or names of the DataFrame. Sharing the logic will help us (and future contributors) to a) fix bugs more easily and b) keep ensuring consistency between the plugin and the library.
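
As an illustration of logic that depends only on names, and could therefore live in one place, consider generating column names when two sets of columns are combined. This is a rough sketch: `mergeColumnNames` is a hypothetical helper, and the numeric-suffix rule is an assumption made for the example rather than a statement of how the library resolves clashes.

```kotlin
// Hypothetical shared helper: it depends only on column names, so both the
// runtime library and the compiler plugin could call the same code.
fun mergeColumnNames(left: List<String>, right: List<String>): List<String> {
    val used = left.toMutableSet()
    val result = left.toMutableList()
    for (name in right) {
        var candidate = name
        var suffix = 1
        // Assumed disambiguation rule for this sketch: append an increasing number.
        while (candidate in used) {
            candidate = "$name$suffix"
            suffix++
        }
        used += candidate
        result += candidate
    }
    return result
}

fun main() {
    // prints [id, name, name1, age]
    println(mergeColumnNames(listOf("id", "name"), listOf("name", "age")))
}
```

Because the function never touches cell values, a fix to the naming rule would apply to both the runtime result and the schema the plugin infers.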

I see 3 options for us:

  • Keep the logic separate (as with join, which generates names in two places: the plugin and the library)

    • +This keeps the plugin an add-on to the library without having to modify the library itself
    • +Types work differently in the compiler plugin; keeping the logic separate lets us work in two different worlds without difficult bridges
    • -Maintaining the code and ensuring consistency are difficult
  • Create a new abstract tree structure as a supertype of both DataFrame and the PluginDataFrameSchema (as with insert, which is also called from the plugin)

    • +Allows us to share logic regarding structure/names
    • -Type sharing is difficult because there's no easy ConeKotlinType <-> KType conversion
    • -We'd have to convert each supported API function to run on this generic tree structure instead.
  • Share logic by running the original function on an "empty" DataFrame (as drafted for renameToCamelCase())

    • +Allows us to share logic regarding structure/names
    • +We can call the original function, as seen in the draft, no rewrites of the library needed
    • -We would still need to write custom logic for more difficult operations or large DSLs
    • -Sharing types is still difficult. We'd need to either ignore types and store the TypeApproximation inside the DF, or find a way to create an empty DataFrame with KTypes and figure out how to do the ConeKotlinType <-> KType conversion
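
The second option above (a shared abstract tree) could look roughly like the sketch below. All names here (`SchemaNode`, `ValueNode`, `GroupNode`, and this `insert`) are hypothetical and not taken from the library or the plugin. The type parameter `T` stands in for whichever type representation each side uses (KType at runtime, ConeKotlinType in the plugin), so the structural logic never needs to convert between them; the example uses `String` as a placeholder.

```kotlin
// Hypothetical tree: T stands in for the type representation
// (KType at runtime, ConeKotlinType in the plugin).
sealed interface SchemaNode<T> {
    val name: String
}

data class ValueNode<T>(override val name: String, val type: T) : SchemaNode<T>

data class GroupNode<T>(
    override val name: String,
    val children: List<SchemaNode<T>>,
) : SchemaNode<T>

// Inserts `node` under the column group addressed by `path`.
// An empty path means top level; a path that matches no group leaves the tree unchanged.
fun <T> insert(
    nodes: List<SchemaNode<T>>,
    path: List<String>,
    node: SchemaNode<T>,
): List<SchemaNode<T>> {
    if (path.isEmpty()) return nodes + node
    val head = path.first()
    return nodes.map { child ->
        if (child is GroupNode<T> && child.name == head) {
            child.copy(children = insert(child.children, path.drop(1), node))
        } else {
            child
        }
    }
}

fun main() {
    val schema: List<SchemaNode<String>> = listOf(
        ValueNode("id", "Int"),
        GroupNode<String>("address", listOf(ValueNode("city", "String"))),
    )
    val updated = insert(schema, listOf("address"), ValueNode("zip", "String"))
    println(updated)
}
```

The sketch only covers structure and names. It leaves open the harder question raised in the list, namely how operations that must inspect or produce types would work without a ConeKotlinType <-> KType bridge.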

Feel free to edit this comment to add more pros and cons to each option or to add more options.

These are just my thoughts for now :) I'm curious to see what you think!

Jolanrensen added the Compiler plugin label on Aug 8, 2024