Added some name repairing strategies #386

Merged
merged 5 commits into from
Jun 6, 2023
@@ -0,0 +1,16 @@
package org.jetbrains.kotlinx.dataframe.io

/**
* This strategy defines how duplicate column names are handled
* when a new dataframe is created from an IO source.
*/
public enum class NameRepairStrategy {
/** No actions, keep as is. */
DO_NOTHING,

/** Check the uniqueness of the column names without any actions. */
CHECK_UNIQUE,

/** Check the uniqueness of the column names and repair any duplicates. */
MAKE_UNIQUE
}
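To make the three strategies concrete, here is a minimal, library-free sketch of how they could behave on a header row. The enum is re-declared as a stand-in so the snippet compiles on its own; `DuplicateNamesException` and `repairNames` are hypothetical names introduced only for illustration and are not part of this PR.

```kotlin
// Stand-in for the PR's enum so the sketch compiles without the library.
enum class NameRepairStrategy { DO_NOTHING, CHECK_UNIQUE, MAKE_UNIQUE }

// Hypothetical exception type; the PR itself uses DuplicateColumnNamesException.
class DuplicateNamesException(val names: List<String>) :
    IllegalArgumentException("Duplicate column names: $names")

// Hypothetical helper: applies one strategy to a whole header row.
fun repairNames(names: List<String>, strategy: NameRepairStrategy): List<String> {
    val seen = mutableMapOf<String, Int>() // occurrences of each raw name so far
    return names.map { raw ->
        val count = seen[raw] ?: 0
        seen[raw] = count + 1
        when (strategy) {
            // Keep every name exactly as read, duplicates included.
            NameRepairStrategy.DO_NOTHING -> raw
            // Fail on the first repeated name.
            NameRepairStrategy.CHECK_UNIQUE ->
                if (count > 0) throw DuplicateNamesException(names) else raw
            // Append the occurrence counter to repeated names.
            NameRepairStrategy.MAKE_UNIQUE ->
                if (count > 0) "$raw$count" else raw
        }
    }
}
```

Under these assumptions, a header row of `a, b, a, a` stays unchanged with `DO_NOTHING`, throws with `CHECK_UNIQUE`, and becomes `a, b, a1, a2` with `MAKE_UNIQUE`.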
@@ -0,0 +1,16 @@
package org.jetbrains.kotlinx.dataframe.io

/**
* This strategy defines how duplicate column names are handled
* when a new dataframe is created from an IO source.
*/
public enum class NameRepairStrategy {
Collaborator

It's unclear that it targets duplicate names just by looking at the enum class name. Maybe call it DuplicateNameStrategy?

Collaborator Author

Many dataframe frameworks call this name repairing, so it is better to keep that naming to reduce the learning curve.

Collaborator

I think "repairing" insinuates a name is "broken", e.g. unsupported characters etc. This is definitely different IMO. Can you give an example from another library?

Collaborator Author

Collaborator

I see! Well, if in the future we also add "universal", which fixes name syntax too, I'm fine with the name. Then it won't be just about uniqueness

/** No actions, keep as is. */
DO_NOTHING,

/** Check the uniqueness of the column names without any actions. */
Collaborator

I assume this would then throw an exception?

Collaborator Author

The proposed solution doesn't do that itself. The exception is thrown in a completely different place, which is why I can't guarantee that it happens (although for now it does).

CHECK_UNIQUE,

/** Check the uniqueness of the column names and repair any duplicates. */
MAKE_UNIQUE
}
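The review thread mentions a possible future "universal" option that would also fix name syntax, not just uniqueness. The sketch below is purely hypothetical and not part of this PR: `makeSyntactic` and `repairUniversal` are illustrative names, and the rule "letters, digits and underscores only, not starting with a digit" is an assumption about what "syntactic" could mean here.

```kotlin
// Hypothetical: turn an arbitrary header into an identifier-like name.
fun makeSyntactic(name: String): String {
    // Replace every character that is not an ASCII letter, digit, or underscore.
    val cleaned = name.replace(Regex("[^A-Za-z0-9_]"), "_")
    // Prefix an underscore if the result is empty or starts with a digit.
    return if (cleaned.isEmpty() || cleaned.first().isDigit()) "_$cleaned" else cleaned
}

// Hypothetical: fix syntax first, then de-duplicate with a per-name counter.
fun repairUniversal(names: List<String>): List<String> {
    val seen = mutableMapOf<String, Int>() // occurrences of each cleaned name so far
    return names.map { raw ->
        val base = makeSyntactic(raw)
        val count = seen[base] ?: 0
        seen[base] = count + 1
        if (count > 0) "$base$count" else base
    }
}
```

With this rule, `Sepal.Length` and `Sepal Length` both clean to `Sepal_Length`, so the second one would become `Sepal_Length1`. That shows why syntax repair and uniqueness repair would need to run together.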
@@ -26,6 +26,7 @@ import org.jetbrains.kotlinx.dataframe.api.forEach
import org.jetbrains.kotlinx.dataframe.api.select
import org.jetbrains.kotlinx.dataframe.codeGen.AbstractDefaultReadMethod
import org.jetbrains.kotlinx.dataframe.codeGen.DefaultReadDfMethod
import org.jetbrains.kotlinx.dataframe.exceptions.DuplicateColumnNamesException
import java.io.File
import java.io.InputStream
import java.io.OutputStream
@@ -60,96 +61,114 @@ private const val readExcel = "readExcel"
* @param columns comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”)
* @param skipRows number of rows before header
* @param rowsCount number of rows to read.
* @param nameRepairStrategy handling of column names.
* The default behavior is [NameRepairStrategy.CHECK_UNIQUE]
*/
public fun DataFrame.Companion.readExcel(
url: URL,
sheetName: String? = null,
skipRows: Int = 0,
columns: String? = null,
rowsCount: Int? = null,
nameRepairStrategy: NameRepairStrategy = NameRepairStrategy.CHECK_UNIQUE,
): AnyFrame {
val wb = WorkbookFactory.create(url.openStream())
- return wb.use { readExcel(wb, sheetName, skipRows, columns, rowsCount) }
+ return wb.use { readExcel(wb, sheetName, skipRows, columns, rowsCount, nameRepairStrategy) }
}

/**
* @param sheetName sheet to read. By default, first sheet in the document
* @param columns comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”)
* @param skipRows number of rows before header
* @param rowsCount number of rows to read.
* @param nameRepairStrategy handling of column names.
* The default behavior is [NameRepairStrategy.CHECK_UNIQUE]
*/
public fun DataFrame.Companion.readExcel(
file: File,
sheetName: String? = null,
skipRows: Int = 0,
columns: String? = null,
rowsCount: Int? = null,
nameRepairStrategy: NameRepairStrategy = NameRepairStrategy.CHECK_UNIQUE,
): AnyFrame {
val wb = WorkbookFactory.create(file)
- return wb.use { readExcel(it, sheetName, skipRows, columns, rowsCount) }
+ return wb.use { readExcel(it, sheetName, skipRows, columns, rowsCount, nameRepairStrategy) }
}

/**
* @param sheetName sheet to read. By default, first sheet in the document
* @param columns comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”)
* @param skipRows number of rows before header
* @param rowsCount number of rows to read.
* @param nameRepairStrategy handling of column names.
* The default behavior is [NameRepairStrategy.CHECK_UNIQUE]
*/
public fun DataFrame.Companion.readExcel(
fileOrUrl: String,
sheetName: String? = null,
skipRows: Int = 0,
columns: String? = null,
rowsCount: Int? = null,
- ): AnyFrame = readExcel(asURL(fileOrUrl), sheetName, skipRows, columns, rowsCount)
+ nameRepairStrategy: NameRepairStrategy = NameRepairStrategy.CHECK_UNIQUE,
+ ): AnyFrame = readExcel(asURL(fileOrUrl), sheetName, skipRows, columns, rowsCount, nameRepairStrategy)

/**
* @param sheetName sheet to read. By default, first sheet in the document
* @param columns comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”)
* @param skipRows number of rows before header
* @param rowsCount number of rows to read.
* @param nameRepairStrategy handling of column names.
* The default behavior is [NameRepairStrategy.CHECK_UNIQUE]
*/
public fun DataFrame.Companion.readExcel(
inputStream: InputStream,
sheetName: String? = null,
skipRows: Int = 0,
columns: String? = null,
rowsCount: Int? = null,
nameRepairStrategy: NameRepairStrategy = NameRepairStrategy.CHECK_UNIQUE,
): AnyFrame {
val wb = WorkbookFactory.create(inputStream)
- return wb.use { readExcel(it, sheetName, skipRows, columns, rowsCount) }
+ return wb.use { readExcel(it, sheetName, skipRows, columns, rowsCount, nameRepairStrategy) }
}

/**
* @param sheetName sheet to read. By default, first sheet in the document
* @param columns comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”)
* @param skipRows number of rows before header
* @param rowsCount number of rows to read.
* @param nameRepairStrategy handling of column names.
* The default behavior is [NameRepairStrategy.CHECK_UNIQUE]
*/
public fun DataFrame.Companion.readExcel(
wb: Workbook,
sheetName: String? = null,
skipRows: Int = 0,
columns: String? = null,
rowsCount: Int? = null,
nameRepairStrategy: NameRepairStrategy = NameRepairStrategy.CHECK_UNIQUE,
): AnyFrame {
val sheet: Sheet = sheetName
?.let { wb.getSheet(it) ?: error("Sheet with name $sheetName not found") }
?: wb.getSheetAt(0)
- return readExcel(sheet, columns, skipRows, rowsCount)
+ return readExcel(sheet, columns, skipRows, rowsCount, nameRepairStrategy)
}

/**
* @param sheet sheet to read.
* @param columns comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”)
* @param skipRows number of rows before header
* @param rowsCount number of rows to read.
* @param nameRepairStrategy handling of column names.
* The default behavior is [NameRepairStrategy.CHECK_UNIQUE]
*/
public fun DataFrame.Companion.readExcel(
sheet: Sheet,
columns: String? = null,
skipRows: Int = 0,
rowsCount: Int? = null,
nameRepairStrategy: NameRepairStrategy = NameRepairStrategy.CHECK_UNIQUE,
): AnyFrame {
val columnIndexes: Iterable<Int> = if (columns != null) {
columns.split(",").flatMap {
@@ -176,15 +195,19 @@ public fun DataFrame.Companion.readExcel(
val last = rowsCount?.let { first + it - 1 } ?: sheet.lastRowNum
val valueRowsRange = (first..last)

val columnNameCounters = mutableMapOf<String, Int>()
val columns = columnIndexes.map { index ->
val headerCell = headerRow?.getCell(index)
- val name = if (headerCell?.cellType == CellType.NUMERIC) {
+ val nameFromCell = if (headerCell?.cellType == CellType.NUMERIC) {
headerCell.numericCellValue.toString() // Support numeric-named columns
} else {
headerCell?.stringCellValue
?: CellReference.convertNumToColString(index) // Use Excel column names if no data
}

val name = repairNameIfRequired(nameFromCell, columnNameCounters, nameRepairStrategy)
columnNameCounters[nameFromCell] = columnNameCounters.getOrDefault(nameFromCell, 0) + 1 // increase the counter for specific column name

val values: List<Any?> = valueRowsRange.map {
val row: Row? = sheet.getRow(it)
val cell: Cell? = row?.getCell(index)
@@ -195,6 +218,31 @@ public fun DataFrame.Companion.readExcel(
return dataFrameOf(columns)
}
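The `columns` parameter documented above accepts specs such as "A:E" or "A,C,E:F", but the parsing body is collapsed in this diff. As a rough illustration of how such a spec can map to zero-based column indexes, here is a self-contained sketch. `columnLettersToIndex` and `parseColumnSpec` are hypothetical helpers, not the library's implementation.

```kotlin
// Hypothetical: convert Excel column letters to a zero-based index ("A" -> 0, "AA" -> 26).
fun columnLettersToIndex(letters: String): Int {
    require(letters.isNotEmpty() && letters.all { it in 'A'..'Z' }) { "Invalid column: $letters" }
    // Bijective base 26 (A = 1 ... Z = 26), shifted to zero-based at the end.
    return letters.fold(0) { acc, c -> acc * 26 + (c - 'A' + 1) } - 1
}

// Hypothetical: expand a spec like "A,C,E:F" into the list of column indexes it names.
fun parseColumnSpec(spec: String): List<Int> =
    spec.split(",").flatMap { part ->
        val bounds = part.trim().uppercase().split(":")
        when (bounds.size) {
            1 -> listOf(columnLettersToIndex(bounds[0]))
            2 -> (columnLettersToIndex(bounds[0])..columnLettersToIndex(bounds[1])).toList()
            else -> throw IllegalArgumentException("Invalid column range: $part")
        }
    }
```

Under these assumptions, "A:E" expands to indexes 0 through 4 and "A,C,E:F" to 0, 2, 4, 5.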

/**
* This is a universal function for name repairing.
* It should be moved to the API module later,
* once the functionality is enabled for all IO sources.
*
* TODO: https://github.com/Kotlin/dataframe/issues/387
*/
private fun repairNameIfRequired(nameFromCell: String, columnNameCounters: MutableMap<String, Int>, nameRepairStrategy: NameRepairStrategy): String {
return when (nameRepairStrategy) {
NameRepairStrategy.DO_NOTHING -> nameFromCell
NameRepairStrategy.CHECK_UNIQUE -> if (columnNameCounters.contains(nameFromCell)) throw DuplicateColumnNamesException(columnNameCounters.keys.toList()) else nameFromCell
NameRepairStrategy.MAKE_UNIQUE -> if (nameFromCell.isEmpty()) { // probably it's never empty because of filling empty column names earlier
val emptyName = "Unknown column"
if (columnNameCounters.contains(emptyName)) "${emptyName}${columnNameCounters[emptyName]}"
else emptyName
} else {
if (columnNameCounters.contains(nameFromCell)) {
"${nameFromCell}${columnNameCounters[nameFromCell]}"
} else {
nameFromCell
}
}
}
}
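One edge case is worth noting in the counter-based approach above: the counter is keyed by the raw name, so a generated name can collide with a header that already exists. For a header row `Species, Species1, Species`, the third column would be renamed to `Species1`, which is already taken. The sketch below shows one possible way to avoid that by tracking every emitted name. `makeUniqueSafely` is a hypothetical helper, not something proposed in this PR.

```kotlin
// Hypothetical: de-duplicate names while guaranteeing the results never collide.
fun makeUniqueSafely(names: List<String>): List<String> {
    val used = mutableSetOf<String>()             // every name already emitted
    val nextSuffix = mutableMapOf<String, Int>()  // next suffix to try for each raw name
    return names.map { raw ->
        var candidate = raw
        var suffix = nextSuffix[raw] ?: 1
        // Set.add returns false when the name is taken, so keep trying new suffixes.
        while (!used.add(candidate)) {
            candidate = "$raw$suffix"
            suffix++
        }
        nextSuffix[raw] = suffix
        candidate
    }
}
```

For plain repeats this yields the same names as the PR's scheme (`x, x1, x2`), while the colliding row above would come out as `Species, Species1, Species2`.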

private fun Cell?.cellValue(sheetName: String): Any? =
when (this?.cellType) {
CellType._NONE -> error("Cell $address of sheet $sheetName has a CellType that should only be used internally. This is a bug, please report https://github.com/Kotlin/dataframe/issues")
@@ -8,6 +8,7 @@ import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.concat
import org.jetbrains.kotlinx.dataframe.api.dataFrameOf
import org.jetbrains.kotlinx.dataframe.api.toColumn
import org.jetbrains.kotlinx.dataframe.exceptions.DuplicateColumnNamesException
import org.jetbrains.kotlinx.dataframe.impl.DataFrameSize
import org.jetbrains.kotlinx.dataframe.size
import org.junit.Test
@@ -109,4 +110,15 @@ class XlsxTest {
DataFrame.readExcel(testResource("xlsx6.xlsx"), skipRows = 4)
}
}

@Test
fun `read xlsx file with duplicated columns and repair column names`() {
shouldThrow<DuplicateColumnNamesException> {
DataFrame.readExcel(testResource("iris_duplicated_column.xlsx"))
}

val df = DataFrame.readExcel(testResource("iris_duplicated_column.xlsx"), nameRepairStrategy = NameRepairStrategy.MAKE_UNIQUE)
df.columnNames() shouldBe listOf("Sepal.Length", "Sepal.Width", "C",
"Petal.Length", "Petal.Width", "Species", "Other.Width", "Species1", "I", "Other.Width1", "Species2")
}
}
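The expected names in this test can be reproduced by hand. The test file is binary and not shown, so the raw header row below is an assumption inferred from the expected output: the names `C` and `I` match the Excel letters of column indexes 2 and 8, which suggests those header cells are empty and fall back to the column letter. `indexToColumnLetters` and `headerNames` are hypothetical helpers that mimic the fallback and the counter suffix without the library.

```kotlin
// Hypothetical: zero-based index to Excel column letters (0 -> "A", 26 -> "AA").
fun indexToColumnLetters(index: Int): String {
    var n = index + 1
    val sb = StringBuilder()
    while (n > 0) {
        val rem = (n - 1) % 26
        sb.append('A' + rem)
        n = (n - 1) / 26
    }
    return sb.reverse().toString()
}

// Hypothetical: null models an empty header cell; repeats get the occurrence counter appended.
fun headerNames(rawHeaders: List<String?>): List<String> {
    val counters = mutableMapOf<String, Int>()
    return rawHeaders.mapIndexed { index, raw ->
        val base = raw ?: indexToColumnLetters(index) // fall back to the column letter
        val seenBefore = counters[base] ?: 0
        counters[base] = seenBefore + 1
        if (seenBefore > 0) "$base$seenBefore" else base
    }
}

// Assumed raw header row of iris_duplicated_column.xlsx (not confirmed by the source).
val assumedHeaders: List<String?> = listOf(
    "Sepal.Length", "Sepal.Width", null, "Petal.Length", "Petal.Width",
    "Species", "Other.Width", "Species", null, "Other.Width", "Species",
)
```

If that assumption about the file holds, `headerNames(assumedHeaders)` gives the same list the test asserts: the second `Species` becomes `Species1`, the third `Species2`, and the second `Other.Width` becomes `Other.Width1`.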
Binary file not shown.