
Commit 4437e44

update: restructure read operations page
1 parent 6bb6cf7 commit 4437e44

1 file changed

+91
-52
lines changed

docs/StardustDocs/topics/read.md

Lines changed: 91 additions & 52 deletions
@@ -1,32 +1,41 @@
11
[//]: # (title: Read)
22
<!---IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Read-->
33

4-
The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, Apache Arrow input formats.
4+
The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, and Apache Arrow input formats.
55

6-
`read` method automatically detects input format based on file extension and content
6+
The `.read()` function automatically detects the input format based on file extension and content:
77

88
```kotlin
99
DataFrame.read("input.csv")
1010
```
1111

12-
Input string can be a file path or URL.
12+
The input string can be a file path or URL.
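As a minimal sketch of the format detection described above, the following writes a small CSV to a temporary file and reads it back through the generic `read` entry point. The file contents and the function name `readGenericSample` are hypothetical and exist only for illustration; the sketch assumes the core `dataframe` artifact is on the classpath.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.*
import java.io.File

// Hypothetical helper: creates a temporary CSV and reads it with the generic reader.
fun readGenericSample(): DataFrame<*> {
    val tmp = File.createTempFile("input", ".csv")
    tmp.deleteOnExit()
    // Made-up sample data, not taken from the documentation
    tmp.writeText("name,stars\ndataframe,700\nkotlin,45000\n")
    // The path is passed as a string; the format is detected from the extension and content
    return DataFrame.read(tmp.absolutePath)
}

fun main() {
    val df = readGenericSample()
    println(df.columnNames())
    println(df.rowsCount())
}
```

Passing a URL string instead of a file path should work the same way, since the input string can be either.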
1313

14-
## Reading CSV
14+
## Read from CSV
1515

16-
All these calls are valid:
16+
To read a CSV file, use the `.readCSV()` function.
17+
18+
To read a CSV from a local file:
1719

1820
```kotlin
1921
import java.io.File
20-
import java.net.URL
2122

2223
DataFrame.readCSV("input.csv")
24+
// Alternatively
2325
DataFrame.readCSV(File("input.csv"))
26+
```
27+
28+
To read a CSV file from a URL:
29+
30+
```kotlin
31+
import java.net.URL
32+
2433
DataFrame.readCSV(URL("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"))
2534
```
2635

27-
All `readCSV` overloads support different options.
28-
For example, you can specify custom delimiter if it differs from `,`, charset
29-
and column names if your CSV is missing them
36+
### Specify delimiter
37+
38+
By default, CSV files are parsed using `,` as the delimiter. To specify a custom delimiter, use the `delimiter` argument:
3039

3140
<!---FUN readCsvCustom-->
3241

@@ -41,7 +50,9 @@ val df = DataFrame.readCSV(
4150

4251
<!---END-->
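A self-contained variant of the delimiter option might look like the sketch below. Only `readCSV` and its `delimiter` argument come from the text above; the semicolon-separated sample data and the function name `readSemicolonSample` are assumptions made for illustration.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.*
import java.io.File

// Hypothetical helper: reads a semicolon-separated file from a temporary location.
fun readSemicolonSample(): DataFrame<*> {
    val tmp = File.createTempFile("semicolon", ".csv")
    tmp.deleteOnExit()
    // Made-up data that uses ';' instead of ',' between fields
    tmp.writeText("id;label\n1;first\n2;second\n")
    // Without the delimiter argument, each line would be parsed as a single field
    return DataFrame.readCSV(tmp, delimiter = ';')
}

fun main() {
    println(readSemicolonSample().columnNames())
}
```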
4352

44-
Column types will be inferred from the actual CSV data. Suppose that CSV from the previous
53+
### Column type inference from CSV
54+
55+
Column types are inferred from the CSV data. Suppose that the CSV from the previous
4556
example had the following content:
4657

4758
<table>
@@ -51,7 +62,7 @@ example had the following content:
5162
<tr><td>89</td><td>abc</td><td>7.1</td><td>false</td></tr>
5263
</table>
5364

54-
[`DataFrame`](DataFrame.md) schema we get is:
65+
Then the [`DataFrame`](DataFrame.md) schema we get is:
5566

5667
```text
5768
A: Int
@@ -60,7 +71,7 @@ C: Double
6071
D: Boolean?
6172
```
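To inspect what the library infers for your own data, you can print the schema after reading. The sketch below uses made-up CSV content (not the table from this page) with one empty cell in the last column; `inferredSchemaSample` is a hypothetical name.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.*
import java.io.File

// Hypothetical helper: reads a small CSV so that the inferred column types can be inspected.
fun inferredSchemaSample(): DataFrame<*> {
    val tmp = File.createTempFile("types", ".csv")
    tmp.deleteOnExit()
    // Made-up data: integers, text, decimals, and a boolean column with one empty cell
    tmp.writeText("A,B,C,D\n12,tuv,0.12,true\n41,xyz,3.6,\n")
    return DataFrame.readCSV(tmp)
}

fun main() {
    // Prints the column names together with their inferred types
    println(inferredSchemaSample().schema())
}
```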
6273

63-
[`DataFrame`](DataFrame.md) will try to parse columns as JSON, so when reading following table with JSON object in column D:
74+
[`DataFrame`](DataFrame.md) tries to parse columns as JSON, so when reading the following table with a JSON object in column D:
6475

6576
<table>
6677
<tr><th>A</th><th>D</th></tr>
@@ -77,7 +88,7 @@ D:
7788
C: Int
7889
```
7990

80-
For column where values are lists of JSON values:
91+
For a column where values are lists of JSON values:
8192
<table>
8293
<tr><th>A</th><th>G</th></tr>
8394
<tr><td>12</td><td>[{"B":1,"C":2,"D":3},{"B":1,"C":3,"D":2}]</td></tr>
@@ -92,7 +103,7 @@ G: *
92103
D: Int
93104
```
94105

95-
### Dealing with locale specific numbers
106+
### Work with locale-specific numbers
96107

97108
Sometimes columns in your CSV can be interpreted differently depending on your system locale.
98109

@@ -102,8 +113,8 @@ Sometimes columns in your CSV can be interpreted differently depending on your s
102113
<tr><td>41,111</td></tr>
103114
</table>
104115

105-
Here comma can be decimal or thousands separator, thus different values.
106-
You can deal with it in two ways
116+
Here, the comma can be read as either a decimal separator or a thousands separator, which produces different values.
117+
You can deal with it in two ways:
107118

108119
1) Provide locale as a parser option
109120

@@ -132,20 +143,34 @@ val df = DataFrame.readCSV(
132143
<!---END-->
133144

134145

135-
## Reading JSON
146+
## Read from JSON
147+
148+
To read a JSON file, use the `.readJson()` function. JSON files can be read from a file or a URL.
149+
150+
Note that after reading a JSON with a complex structure, you can get a hierarchical
151+
[`DataFrame`](DataFrame.md): [`DataFrame`](DataFrame.md) with `ColumnGroup`s and [`FrameColumn`](DataColumn.md#framecolumn)s.
152+
153+
To read JSON from a local file:
154+
155+
<!---FUN readJson-->
156+
157+
```kotlin
158+
val df = DataFrame.readJson(file)
159+
```
160+
161+
<!---END-->
136162

137-
Basics for reading JSONs are the same: you can read from file or from remote URL.
163+
To read a JSON file from a URL:
138164

139165
```kotlin
140166
DataFrame.readJson("https://covid.ourworldindata.org/data/owid-covid-data.json")
141167
```
142168

143-
Note that after reading a JSON with a complex structure, you can get hierarchical
144-
[`DataFrame`](DataFrame.md): [`DataFrame`](DataFrame.md) with `ColumnGroup`s and [`FrameColumn`](DataColumn.md#framecolumn)s.
169+
### Column type inference from JSON
145170

146-
Also note that type inferring process for JSON is much simpler than for CSV.
147-
JSON string literals are always supposed to have String type, number literals
148-
take different `Number` kinds, boolean literals are converted to `Boolean`.
171+
Type inference for JSON is much simpler than for CSV.
172+
JSON string literals are always read as `String`. Number literals
173+
take different `Number` kinds. Boolean literals are converted to `Boolean`.
149174

150175
Let's take a look at the following JSON:
151176

@@ -178,17 +203,13 @@ Let's take a look at the following JSON:
178203
]
179204
```
180205

181-
We can read it from file
182-
183-
<!---FUN readJson-->
206+
We can read it from file:
184207

185208
```kotlin
186209
val df = DataFrame.readJson(file)
187210
```
188211

189-
<!---END-->
190-
191-
Corresponding [`DataFrame`](DataFrame.md) schema will be
212+
The corresponding [`DataFrame`](DataFrame.md) schema is:
192213

193214
```text
194215
A: String
@@ -200,7 +221,9 @@ D: Boolean?
200221
Column A has `String` type because all values are string literals, no implicit conversion is performed. Column C
201222
has `Number` type because it's the nearest common supertype of `Int` and `Double`.
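The same check can be run on a small file of your own. The sketch below uses made-up JSON in which column A holds only string literals and column C mixes whole and fractional numbers; the function name `jsonSchemaSample` is hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.*
import java.io.File

// Hypothetical helper: reads a small JSON array so that the inferred types can be inspected.
fun jsonSchemaSample(): DataFrame<*> {
    val tmp = File.createTempFile("sample", ".json")
    tmp.deleteOnExit()
    // Made-up data: "A" contains string literals, "C" mixes an integer and a decimal
    tmp.writeText("""[{"A": "1", "C": 1}, {"A": "2", "C": 2.5}]""")
    return DataFrame.readJson(tmp)
}

fun main() {
    // Prints the column names together with their inferred types
    println(jsonSchemaSample().schema())
}
```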
202223

203-
### JSON Reading Options: Type Clash Tactic
224+
### JSON parsing options
225+
226+
#### Manage type clashes
204227

205228
By default, if a type clash occurs when reading JSON, a new column group is created consisting of: "value", "array", and
206229
any number of object properties:
@@ -251,9 +274,9 @@ For this case, you can set `typeClashTactic = JSON.TypeClashTactic.ANY_COLUMNS`
251274

252275
You can also set this option in the Gradle and KSP plugins by providing `jsonOptions`.
253276

254-
### JSON Reading Options: Key/Value Paths
277+
#### Specify key/value paths
255278

256-
If you have some JSON looking like
279+
If you have a JSON like:
257280

258281
```json
259282
{
@@ -280,10 +303,10 @@ If you have some JSON looking like
280303
}
281304
```
282305

283-
you will get a column for each dog, which becomes an issue when you have a lot of dogs.
284-
This issue is especially noticeable when generating data schemas from the JSON, as you might even run out of memory
285-
when doing that due to the sheer number of generated interfaces.\
286-
Instead, you can use `keyValuePaths` to specify paths to the objects that should be read as key value frame columns.
306+
You will get a column for each dog, which becomes an issue when you have a lot of dogs.
307+
This issue is especially noticeable when generating data schemas from JSON, as you might run out of memory
308+
when doing that due to the sheer number of generated interfaces. Instead, you can use `keyValuePaths` to specify paths
309+
to the objects that should be read as key/value frame columns.
287310

288311
This can be the difference between:
289312

@@ -342,22 +365,35 @@ Only the bracket notation of json path is supported, as well as just double quot
342365

343366
For more examples, see the "examples/json" module.
344367

345-
## Reading Excel
368+
## Read from Excel
346369

347-
Add dependency:
370+
Before you can read data from Excel, add the following dependency:
348371

349372
```kotlin
350373
implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")
351374
```
352375

353-
Right now [`DataFrame`](DataFrame.md) supports reading Excel spreadsheet formats: xls, xlsx.
376+
To read an Excel spreadsheet, use the `.readExcel()` function. Excel spreadsheets can be read from a file or a URL. Supported
377+
Excel spreadsheet formats are: xls, xlsx.
378+
379+
To read an Excel spreadsheet from a file:
380+
381+
```kotlin
382+
val df = DataFrame.readExcel(file)
383+
```
354384

355-
You can read from file or URL.
385+
To read an Excel spreadsheet from a URL:
386+
387+
```kotlin
388+
DataFrame.readExcel("https://example.com/data.xlsx")
389+
```
390+
391+
### Cell type inference from Excel
356392

357393
Cells representing dates will be read as `kotlinx.datetime.LocalDateTime`.
358-
Cells with number values, including whole numbers such as "100", or calculated formulas will be read as `Double`
394+
Cells with number values, including whole numbers such as "100", or calculated formulas will be read as `Double`.
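If you need integers rather than `Double` after reading, one option is to convert the column explicitly. The sketch below does not read an actual spreadsheet: it builds a stand-in frame with `dataFrameOf` to mimic what `readExcel` would return for whole-number cells. The column name `amount` and the function name `wholeNumbersToInt` are hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import kotlin.reflect.typeOf

// Hypothetical helper: converts a Double column holding whole numbers to Int.
fun wholeNumbersToInt(): DataFrame<*> {
    // Stand-in for the result of readExcel: whole numbers arrive as Double
    val df = dataFrameOf("amount")(100.0, 25.0, 7.0)
    // Convert the column to Int explicitly
    return df.convert("amount").to<Int>()
}

fun main() {
    val converted = wholeNumbersToInt()
    println(converted["amount"].type() == typeOf<Int>())
}
```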
359395

360-
Sometimes cells can have wrong format in Excel file, for example you expect to read column of String:
396+
Sometimes cells can have the wrong format in an Excel file. For example, you expect to read a column of `String`:
361397

362398
```text
363399
IDS
@@ -367,9 +403,9 @@ B100
367403
C100
368404
```
369405

370-
You will get column of Serializable instead (common parent for Double & String)
406+
You will get a column of `Serializable` instead (the common parent of `Double` and `String`).
371407

372-
You can fix it using convert:
408+
You can fix it using the `.convert()` function:
373409

374410
<!---FUN fixMixedColumn-->
375411

@@ -387,25 +423,28 @@ df1["IDS"].type() shouldBe typeOf<String>()
387423

388424
<!---END-->
389425

390-
## Reading Apache Arrow formats
426+
## Read Apache Arrow formats
391427

392-
Add dependency:
428+
Before you can read data from Apache Arrow formats, add the following dependency:
393429

394430
```kotlin
395431
implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")
396432
```
397433

398-
<warning>
399-
Make sure to follow [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility) guide when using Java 9+
400-
</warning>
434+
To read Apache Arrow formats, use the `.readArrowFeather()` function:
401435

402-
[`DataFrame`](DataFrame.md) supports reading [Arrow interprocess streaming format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format)
403-
and [Arrow random access format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files)
404-
from raw Channel (ReadableByteChannel for streaming and SeekableByteChannel for random access), InputStream, File or ByteArray.
405436
<!---FUN readArrowFeather-->
406437

407438
```kotlin
408439
val df = DataFrame.readArrowFeather(file)
409440
```
410441

411442
<!---END-->
443+
444+
[`DataFrame`](DataFrame.md) supports reading [Arrow interprocess streaming format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format)
445+
and [Arrow random access format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files)
446+
from a raw channel (`ReadableByteChannel` for streaming and `SeekableByteChannel` for random access), `InputStream`, `File`, or `ByteArray`.
447+
448+
> If you use Java 9+, follow the [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility) guide.
449+
>
450+
{style="note"}
