diff --git a/docs/StardustDocs/topics/read.md b/docs/StardustDocs/topics/read.md
index 8615c70060..d4999bcb4f 100644
--- a/docs/StardustDocs/topics/read.md
+++ b/docs/StardustDocs/topics/read.md
@@ -1,32 +1,41 @@
 [//]: # (title: Read)

-The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, Apache Arrow input formats.
+The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, and Apache Arrow input formats.

-`read` method automatically detects input format based on file extension and content
+The `.read()` function automatically detects the input format based on file extension and content:

 ```kotlin
 DataFrame.read("input.csv")
 ```

-Input string can be a file path or URL.
+The input string can be a file path or URL.

-## Reading CSV
+## Read from CSV

-All these calls are valid:
+To read a CSV file, use the `.readCSV()` function.
+
+To read a CSV file from a file:

 ```kotlin
 import java.io.File
-import java.net.URL

 DataFrame.readCSV("input.csv")
+// Alternatively
 DataFrame.readCSV(File("input.csv"))
+```
+
+To read a CSV file from a URL:
+
+```kotlin
+import java.net.URL
+
 DataFrame.readCSV(URL("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"))
 ```

-All `readCSV` overloads support different options.
-For example, you can specify custom delimiter if it differs from `,`, charset
-and column names if your CSV is missing them
+### Specify delimiter
+
+By default, CSV files are parsed using `,` as the delimiter. To specify a custom delimiter, use the `delimiter` argument:

@@ -41,7 +50,9 @@ val df = DataFrame.readCSV(

-Column types will be inferred from the actual CSV data. Suppose that CSV from the previous
+### Column type inference from CSV
+
+Column types are inferred from the CSV data. Suppose that the CSV from the previous
 example had the following content:

@@ -51,7 +62,7 @@ example had the following content:
 89 | abc | 7.1 | false

-[`DataFrame`](DataFrame.md) schema we get is:
+Then the [`DataFrame`](DataFrame.md) schema we get is:

 ```text
 A: Int
@@ -60,7 +71,7 @@ C: Double
 D: Boolean?
 ```

-[`DataFrame`](DataFrame.md) will try to parse columns as JSON, so when reading following table with JSON object in column D:
+[`DataFrame`](DataFrame.md) tries to parse columns as JSON, so when reading the following table with a JSON object in column D:

 A | D
@@ -77,7 +88,7 @@ D:
     C: Int
 ```

-For column where values are lists of JSON values:
+For a column where values are lists of JSON values:
 A | G
 12 | [{"B":1,"C":2,"D":3},{"B":1,"C":3,"D":2}]
@@ -92,7 +103,7 @@ G: *
     D: Int
 ```

-### Dealing with locale specific numbers
+### Work with locale-specific numbers

 Sometimes columns in your CSV can be interpreted differently depending on your system locale.

@@ -102,8 +113,8 @@ Sometimes columns in your CSV can be interpreted differently depending on your s
 41,111

-Here comma can be decimal or thousands separator, thus different values.
-You can deal with it in two ways
+Here a comma can be a decimal or a thousands separator, so the same text can be read as different values.
+You can deal with it in two ways:

 1) Provide locale as a parser option

@@ -132,20 +143,34 @@ val df = DataFrame.readCSV(

-## Reading JSON
+## Read from JSON
+
+To read a JSON file, use the `.readJson()` function. JSON files can be read from a file or a URL.
+
+Note that after reading a JSON with a complex structure, you can get a hierarchical
+[`DataFrame`](DataFrame.md): [`DataFrame`](DataFrame.md) with `ColumnGroup`s and [`FrameColumn`](DataColumn.md#framecolumn)s.
+
+To read a JSON file from a file:
+
+
+
+```kotlin
+val df = DataFrame.readJson(file)
+```
+
+

-Basics for reading JSONs are the same: you can read from file or from remote URL.
+To read a JSON file from a URL:

 ```kotlin
 DataFrame.readJson("https://covid.ourworldindata.org/data/owid-covid-data.json")
 ```

-Note that after reading a JSON with a complex structure, you can get hierarchical
-[`DataFrame`](DataFrame.md): [`DataFrame`](DataFrame.md) with `ColumnGroup`s and [`FrameColumn`](DataColumn.md#framecolumn)s.
+### Column type inference from JSON

-Also note that type inferring process for JSON is much simpler than for CSV.
-JSON string literals are always supposed to have String type, number literals
-take different `Number` kinds, boolean literals are converted to `Boolean`.
+Type inference for JSON is much simpler than for CSV.
+JSON string literals are always assumed to have `String` type. Number literals
+take different `Number` kinds. Boolean literals are converted to `Boolean`.

 Let's take a look at the following JSON:

@@ -178,17 +203,13 @@ Let's take a look at the following JSON:
 ]
 ```

-We can read it from file
-
-
+We can read it from file:

 ```kotlin
 val df = DataFrame.readJson(file)
 ```

-
-
-Corresponding [`DataFrame`](DataFrame.md) schema will be
+The corresponding [`DataFrame`](DataFrame.md) schema is:

 ```text
 A: String
@@ -200,7 +221,9 @@ D: Boolean?

 Column A has `String` type because all values are string literals, no implicit conversion is performed.
 Column C has `Number` type because it's the least common type for `Int` and `Double`.

-### JSON Reading Options: Type Clash Tactic
+### JSON parsing options
+
+#### Manage type clashes

 By default, if a type clash occurs when reading JSON, a new column group is created consisting of: "value", "array", and any number of object properties:

@@ -251,9 +274,9 @@ For this case, you can set `typeClashTactic = JSON.TypeClashTactic.ANY_COLUMNS`

 This option is also possible to set in the Gradle- and KSP plugin by providing `jsonOptions`.

-### JSON Reading Options: Key/Value Paths
+#### Specify Key/Value Paths

-If you have some JSON looking like
+If you have a JSON like:

 ```json
 {
@@ -280,10 +303,10 @@ If you have some JSON looking like
 }
 ```

-you will get a column for each dog, which becomes an issue when you have a lot of dogs.
-This issue is especially noticeable when generating data schemas from the JSON, as you might even run out of memory
-when doing that due to the sheer number of generated interfaces.\
-Instead, you can use `keyValuePaths` to specify paths to the objects that should be read as key value frame columns.
+You will get a column for each dog, which becomes an issue when you have a lot of dogs.
+This issue is especially noticeable when generating data schemas from JSON, as you might run out of memory
+when doing that due to the sheer number of generated interfaces. Instead, you can use `keyValuePaths` to specify paths
+to the objects that should be read as key value frame columns.

 This can be the difference between:

@@ -342,22 +365,35 @@ Only the bracket notation of json path is supported, as well as just double quot

 For more examples, see the "examples/json" module.

-## Reading Excel
+## Read from Excel

-Add dependency:
+Before you can read data from Excel, add the following dependency:

 ```kotlin
 implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")
 ```

-Right now [`DataFrame`](DataFrame.md) supports reading Excel spreadsheet formats: xls, xlsx.
+To read an Excel spreadsheet, use the `.readExcel()` function. Excel spreadsheets can be read from a file or a URL. Supported
+Excel spreadsheet formats are: xls, xlsx.
+
+To read an Excel spreadsheet from a file:
+
+```kotlin
+val df = DataFrame.readExcel(file)
+```

-You can read from file or URL.
+To read an Excel spreadsheet from a URL:
+
+```kotlin
+DataFrame.readExcel("https://example.com/data.xlsx")
+```
+
+### Cell type inference from Excel

 Cells representing dates will be read as `kotlinx.datetime.LocalDateTime`.
-Cells with number values, including whole numbers such as "100", or calculated formulas will be read as `Double`
+Cells with number values, including whole numbers such as "100", or calculated formulas will be read as `Double`.

-Sometimes cells can have wrong format in Excel file, for example you expect to read column of String:
+Sometimes cells can have the wrong format in an Excel file. For example, you expect to read a column of `String`:

 ```text
 IDS
@@ -367,9 +403,9 @@ B100
 C100
 ```

-You will get column of Serializable instead (common parent for Double & String)
+You will get a column of `Serializable` instead (common parent for `Double` and `String`).

-You can fix it using convert:
+You can fix it using the `.convert()` function:

@@ -387,21 +423,16 @@ df1["IDS"].type() shouldBe typeOf<String>()

-## Reading Apache Arrow formats
+## Read Apache Arrow formats

-Add dependency:
+Before you can read data from Apache Arrow formats, add the following dependency:

 ```kotlin
 implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")
 ```

-
-Make sure to follow [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility) guide when using Java 9+
-
+To read Apache Arrow formats, use the `.readArrowFeather()` function:

-[`DataFrame`](DataFrame.md) supports reading [Arrow interprocess streaming format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format)
-and [Arrow random access format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files)
-from raw Channel (ReadableByteChannel for streaming and SeekableByteChannel for random access), InputStream, File or ByteArray.

 ```kotlin
@@ -409,3 +440,11 @@ val df = DataFrame.readArrowFeather(file)
 ```

+
+[`DataFrame`](DataFrame.md) supports reading [Arrow interprocess streaming format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format)
+and [Arrow random access format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files)
+from raw Channel (ReadableByteChannel for streaming and SeekableByteChannel for random access), InputStream, File or ByteArray.
+
+> If you use Java 9+, follow the [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility) guide.
+>
+{style="note"}