Skip to content

Add convert asColumn operation as compiler plugin friendly variant oа replace with #1143

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 25, 2025

Conversation

koperagen
Copy link
Collaborator

I think it's more or less common to perform column-wide operations in async / parallel context, so having such compiler plugin friendly operation is useful
Another use case, although not as handy, is creating a ColumnGroup like
df.convert { col }.asColumn { dataFrameOf("a" to listOf(123), "b" to listOf(321).asColumnGroup() }

@koperagen koperagen added the enhancement New feature or request label Apr 22, 2025
@koperagen koperagen added this to the 1.0.0-Beta1 (0.16) milestone Apr 22, 2025
@koperagen koperagen self-assigned this Apr 22, 2025

```kotlin
df.convert { name }.asColumn { col ->
col.toList().parallelStream().map { it.toString() }.collect(Collectors.toList()).toColumn()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parallel streams should be avoided in single-threaded libraries because they can introduce race conditions, synchronization issues, and unnecessary overhead.

However, in this case, the use of parallelStream is localized and safe, as it only transforms column values without affecting global state or column names.

But! Even though this use of parallelStream is logically safe, it can unexpectedly increase CPU load, especially on weaker machines. Parallel streams use the shared ForkJoinPool, which may cause performance issues if the system has limited resources or is already running other parallel tasks. This can lead to slowdowns or contention for threads, impacting the overall responsiveness of the application.

When running in Kotlin Notebooks, parallel streams can compete with notebook execution and UI rendering for limited CPU resources. This may cause lags or freezes, especially in constrained environments like containers or shared servers. Therefore, parallelism in notebooks should be used carefully to avoid degrading the interactive experience.

Copy link
Collaborator Author

@koperagen koperagen Apr 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually i wouldn't recommend using parallel streams for any trivial operations, but there can be a situation when it's needed. For example, one time i used a library that performs IO, parses file into kind of AST. So

df.add("data") { 
  Library.parse(file)
}

In my case single threaded execution took literally minutes of real time because CPU was loaded 5%. Opting to use parallel made it 20 times faster. #723. But in any case, we should minimize number of operations that plugin can't understand. replace with is one of them, convert asColumn would be an alternative should users need it

@Jolanrensen
Copy link
Collaborator

How is asColumn different from public fun <T, C> Convert<T, C>.to(columnConverter: DataFrame<T>.(DataColumn<C>) -> AnyBaseCol): DataFrame<T> we already have? Just an extra type R right?

Then I'd either add R to convert.to { } or deprecate to {} in favor of the better named asColumn {}. wdyt?

@Jolanrensen
Copy link
Collaborator

Actually, why don't we just add the type R to ReplaceClause.with if you want compiler plugin support for it?

@koperagen
Copy link
Collaborator Author

koperagen commented Apr 23, 2025

Then I'd either add R to convert.to { } or deprecate to {} in favor of the better named asColumn {}. wdyt?

The major difference is ignoring name changes. So, let's say df.replace { col }.with { produceRandomColumnWithRandomName(it) } is a valid code - but impossible for the plugin to interpret. So convert asColumn renames all new columns to original name

@koperagen
Copy link
Collaborator Author

@Jolanrensen I think because convert { }.to { } works differently, let's postpone decision whether it should be removed or not

@Jolanrensen
Copy link
Collaborator

@koperagen ooh right I see. Actually I don't think many people would mind if we restricted that names cannot change for the entire convert operation. For that they'd need replace. That way we could make convert {}.to {} and convert {}.asColumn {} the same and replace {}.with {} different.

@koperagen koperagen merged commit 483054d into master Apr 25, 2025
4 of 5 checks passed
@koperagen koperagen deleted the convert-as-column branch April 30, 2025 10:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants