Add convert asColumn operation as compiler plugin friendly variant of replace with #1143
```kotlin
df.convert { name }.asColumn { col ->
    col.toList().parallelStream().map { it.toString() }.collect(Collectors.toList()).toColumn()
}
```
Parallel streams should be avoided in single-threaded libraries because they can introduce race conditions, synchronization issues, and unnecessary overhead.
However, in this case, the use of parallelStream is localized and safe, as it only transforms column values without affecting global state or column names.
But! Even though this use of parallelStream is logically safe, it can unexpectedly increase CPU load, especially on weaker machines. Parallel streams use the shared ForkJoinPool, which may cause performance issues if the system has limited resources or is already running other parallel tasks. This can lead to slowdowns or contention for threads, impacting the overall responsiveness of the application.
When running in Kotlin Notebooks, parallel streams can compete with notebook execution and UI rendering for limited CPU resources. This may cause lags or freezes, especially in constrained environments like containers or shared servers. Therefore, parallelism in notebooks should be used carefully to avoid degrading the interactive experience.
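One way to limit the load described above is to keep the parallel work off the shared common pool. The following is only a minimal sketch: `mapInBoundedPool` is a hypothetical helper, not part of the DataFrame API, and it relies on the widely used (but implementation-dependent) behavior that a parallel stream started from inside a `ForkJoinPool` task runs on that pool.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ForkJoinPool
import java.util.stream.Collectors

// Hypothetical helper (an assumption for illustration, not a DataFrame API):
// maps values in parallel on a dedicated pool with a fixed parallelism,
// so the shared common ForkJoinPool is not used for this work.
fun <T, R> mapInBoundedPool(values: List<T>, parallelism: Int, transform: (T) -> R): List<R> {
    val pool = ForkJoinPool(parallelism)
    try {
        // The stream is started from a task running inside `pool`,
        // so its work is scheduled there (implementation-dependent behavior).
        return pool.submit(Callable {
            values.parallelStream().map { transform(it) }.collect(Collectors.toList())
        }).get()
    } finally {
        // Release the pool's threads once the mapping is done.
        pool.shutdown()
    }
}
```

The result keeps the input order, so inside `asColumn { }` the returned list could be passed to `toColumn()` the same way as in the snippet under review.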
Usually I wouldn't recommend using parallel streams for trivial operations, but there are situations where they're needed. For example, I once used a library that performs IO and parses a file into a kind of AST:
df.add("data") {
    Library.parse(file)
}
In my case, single-threaded execution took literally minutes of real time because the CPU was only about 5% loaded. Switching to parallel execution made it 20 times faster (#723). In any case, we should minimize the number of operations that the plugin can't understand. replace with is one of them; convert asColumn would be an alternative should users need it.
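For IO-bound work like the parsing case above, a small fixed thread pool is one possible alternative to parallel streams. This is a sketch under assumptions: `mapBlocking` is a hypothetical helper and not part of the DataFrame API, and `transform` stands in for a blocking call such as the `Library.parse` example.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Hypothetical helper (an assumption for illustration, not a DataFrame API):
// runs a blocking transform for every value on a fixed-size thread pool
// and returns the results in the same order as the input.
fun <T, R> mapBlocking(values: List<T>, threads: Int, transform: (T) -> R): List<R> {
    val executor = Executors.newFixedThreadPool(threads)
    try {
        val tasks = values.map { value -> Callable { transform(value) } }
        // invokeAll blocks until every task has finished and returns
        // the futures in the same order as the submitted tasks.
        return executor.invokeAll(tasks).map { it.get() }
    } finally {
        executor.shutdown()
    }
}
```

Because the thread count is explicit, the caller decides how much of the machine the operation may use, which matters in notebooks and other constrained environments.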
How is Then I'd either add |
Actually, why don't we just add the type |
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/convert.kt
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/api/replace.kt
plugins/kotlin-dataframe/src/org/jetbrains/kotlinx/dataframe/plugin/impl/api/convert.kt
The major difference is ignoring name changes. So, let's say |
8fc1fab to 8756bb9
@Jolanrensen I think because |
@koperagen ooh right I see. Actually I don't think many people would mind if we restricted that names cannot change for the entire |
I think it's more or less common to perform column-wide operations in async / parallel context, so having such compiler plugin friendly operation is useful
Another use case, although not as handy, is creating a ColumnGroup like
df.convert { col }.asColumn { dataFrameOf("a" to listOf(123), "b" to listOf(321)).asColumnGroup() }