1
1
[//]: # (title: Read)
2
2
<!-- -IMPORT org.jetbrains.kotlinx.dataframe.samples.api.Read-->
3
3
4
- The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, Apache Arrow input formats.
4
+ The Kotlin DataFrame library supports CSV, TSV, JSON, XLS and XLSX, and Apache Arrow input formats.
5
5
6
- `read` method automatically detects input format based on file extension and content
6
+ The `.read()` function automatically detects the input format based on file extension and content:
7
7
8
8
``` kotlin
9
9
DataFrame.read("input.csv")
10
10
```
11
11
12
- Input string can be a file path or URL.
12
+ The input string can be a file path or URL.
13
13
14
- ## Reading CSV
14
+ ## Read from CSV
15
15
16
- All these calls are valid:
16
+ To read a CSV file, use the `.readCSV()` function.
17
+
18
+ To read a CSV file from a file:
17
19
18
20
``` kotlin
19
21
import java.io.File
20
- import java.net.URL
21
22
22
23
DataFrame.readCSV("input.csv")
24
+ // Alternatively
23
25
DataFrame.readCSV(File("input.csv"))
26
+ ```
27
+
28
+ To read a CSV file from a URL:
29
+
30
+ ``` kotlin
31
+ import java.net.URL
32
+
24
33
DataFrame.readCSV(URL("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"))
25
34
```
26
35
27
- All ` readCSV ` overloads support different options.
28
- For example, you can specify custom delimiter if it differs from `,`, charset
29
- and column names if your CSV is missing them
36
+ ### Specify delimiter
37
+
38
+ By default, CSV files are parsed using `,` as the delimiter. To specify a custom delimiter, use the `delimiter` argument:
30
39
31
40
<!-- -FUN readCsvCustom-->
32
41
@@ -41,7 +50,9 @@ val df = DataFrame.readCSV(
41
50
42
51
<!-- -END-->
43
52
44
- Column types will be inferred from the actual CSV data. Suppose that CSV from the previous
53
+ ### Column type inference from CSV
54
+
55
+ Column types are inferred from the CSV data. Suppose that the CSV from the previous
45
56
example had the following content:
46
57
47
58
<table >
@@ -51,7 +62,7 @@ example had the following content:
51
62
<tr ><td >89</td ><td >abc</td ><td >7.1</td ><td >false</td ></tr >
52
63
</table >
53
64
54
- [`DataFrame`](DataFrame.md) schema we get is:
65
+ Then the [`DataFrame`](DataFrame.md) schema we get is:
55
66
56
67
``` text
57
68
A: Int
@@ -60,7 +71,7 @@ C: Double
60
71
D: Boolean?
61
72
```
62
73
63
- [`DataFrame`](DataFrame.md) will try to parse columns as JSON, so when reading following table with JSON object in column D:
74
+ [`DataFrame`](DataFrame.md) tries to parse columns as JSON, so when reading the following table with a JSON object in column D:
64
75
65
76
<table >
66
77
<tr ><th >A</th ><th >D</th ></tr >
77
88
C: Int
78
89
```
79
90
80
- For column where values are lists of JSON values:
91
+ For a column where values are lists of JSON values:
81
92
<table >
82
93
<tr ><th >A</th ><th >G</th ></tr >
83
94
<tr ><td >12</td ><td >[{"B":1,"C":2,"D":3},{"B":1,"C":3,"D":2}]</td ></tr >
92
103
D: Int
93
104
```
94
105
95
- ### Dealing with locale specific numbers
106
+ ### Work with locale-specific numbers
96
107
97
108
Sometimes columns in your CSV can be interpreted differently depending on your system locale.
98
109
@@ -102,8 +113,8 @@ Sometimes columns in your CSV can be interpreted differently depending on your s
102
113
<tr ><td >41,111</td ></tr >
103
114
</table >
104
115
105
- Here comma can be decimal or thousands separator, thus different values.
106
- You can deal with it in two ways
116
+ Here a comma can be a decimal or thousands separator, which leads to different values.
117
+ You can deal with it in two ways:
107
118
108
119
1) Provide locale as a parser option
109
120
@@ -132,20 +143,34 @@ val df = DataFrame.readCSV(
132
143
<!-- -END-->
133
144
134
145
135
- ## Reading JSON
146
+ ## Read from JSON
147
+
148
+ To read a JSON file, use the `.readJson()` function. JSON files can be read from a file or a URL.
149
+
150
+ Note that after reading a JSON with a complex structure, you can get hierarchical
151
+ [`DataFrame`](DataFrame.md): [`DataFrame`](DataFrame.md) with `ColumnGroup`s and [`FrameColumn`](DataColumn.md#framecolumn)s.
152
+
153
+ To read a JSON file from a file:
154
+
155
+ <!-- -FUN readJson-->
156
+
157
+ ``` kotlin
158
+ val df = DataFrame.readJson(file)
159
+ ```
160
+
161
+ <!-- -END-->
136
162
137
- Basics for reading JSONs are the same: you can read from file or from remote URL.
163
+ To read a JSON file from a URL:
138
164
139
165
``` kotlin
140
166
DataFrame.readJson("https://covid.ourworldindata.org/data/owid-covid-data.json")
141
167
```
142
168
143
- Note that after reading a JSON with a complex structure, you can get hierarchical
144
- [`DataFrame`](DataFrame.md): [`DataFrame`](DataFrame.md) with `ColumnGroup`s and [`FrameColumn`](DataColumn.md#framecolumn)s.
169
+ ### Column type inference from JSON
145
170
146
- Also note that type inferring process for JSON is much simpler than for CSV.
147
- JSON string literals are always supposed to have String type, number literals
148
- take different `Number` kinds, boolean literals are converted to `Boolean`.
171
+ Type inference for JSON is much simpler than for CSV.
172
+ JSON string literals are always supposed to have String type. Number literals
173
+ take different `Number` kinds. Boolean literals are converted to `Boolean`.
149
174
150
175
Let's take a look at the following JSON:
151
176
@@ -178,17 +203,13 @@ Let's take a look at the following JSON:
178
203
]
179
204
```
180
205
181
- We can read it from file
182
-
183
- <!-- -FUN readJson-->
206
+ We can read it from file:
184
207
185
208
``` kotlin
186
209
val df = DataFrame.readJson(file)
187
210
```
188
211
189
- <!-- -END-->
190
-
191
- Corresponding [`DataFrame`](DataFrame.md) schema will be
212
+ The corresponding [`DataFrame`](DataFrame.md) schema is:
192
213
193
214
``` text
194
215
A: String
@@ -200,7 +221,9 @@ D: Boolean?
200
221
Column A has `String` type because all values are string literals, no implicit conversion is performed. Column C
201
222
has `Number` type because it's the least common type for `Int` and `Double`.
202
223
203
- ### JSON Reading Options: Type Clash Tactic
224
+ ### JSON parsing options
225
+
226
+ #### Manage type clashes
204
227
205
228
By default, if a type clash occurs when reading JSON, a new column group is created consisting of: "value", "array", and
206
229
any number of object properties:
@@ -251,9 +274,9 @@ For this case, you can set `typeClashTactic = JSON.TypeClashTactic.ANY_COLUMNS`
251
274
252
275
This option is also possible to set in the Gradle- and KSP plugin by providing `jsonOptions`.
253
276
254
- ### JSON Reading Options: Key/Value Paths
277
+ #### Specify key/value paths
255
278
256
- If you have some JSON looking like
279
+ If you have a JSON like:
257
280
258
281
``` json
259
282
{
@@ -280,10 +303,10 @@ If you have some JSON looking like
280
303
}
281
304
```
282
305
283
- you will get a column for each dog, which becomes an issue when you have a lot of dogs.
284
- This issue is especially noticeable when generating data schemas from the JSON, as you might even run out of memory
285
- when doing that due to the sheer number of generated interfaces.\
286
- Instead, you can use `keyValuePaths` to specify paths to the objects that should be read as key value frame columns.
306
+ You will get a column for each dog, which becomes an issue when you have a lot of dogs.
307
+ This issue is especially noticeable when generating data schemas from JSON, as you might run out of memory
308
+ when doing that due to the sheer number of generated interfaces. Instead, you can use `keyValuePaths` to specify paths
309
+ to the objects that should be read as key value frame columns.
287
310
288
311
This can be the difference between:
289
312
@@ -342,22 +365,35 @@ Only the bracket notation of json path is supported, as well as just double quot
342
365
343
366
For more examples, see the "examples/json" module.
344
367
345
- ## Reading Excel
368
+ ## Read from Excel
346
369
347
- Add dependency:
370
+ Before you can read data from Excel, add the following dependency:
348
371
349
372
``` kotlin
350
373
implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")
351
374
```
352
375
353
- Right now [`DataFrame`](DataFrame.md) supports reading Excel spreadsheet formats: xls, xlsx.
376
+ To read an Excel spreadsheet, use the `.readExcel()` function. Excel spreadsheets can be read from a file or a URL. Supported
377
+ Excel spreadsheet formats are: xls, xlsx.
378
+
379
+ To read an Excel spreadsheet from a file:
380
+
381
+ ``` kotlin
382
+ val df = DataFrame.readExcel(file)
383
+ ```
354
384
355
- You can read from file or URL.
385
+ To read an Excel spreadsheet from a URL:
386
+
387
+ ``` kotlin
388
+ DataFrame.readExcel("https://example.com/data.xlsx")
389
+ ```
390
+
391
+ ### Cell type inference from Excel
356
392
357
393
Cells representing dates will be read as `kotlinx.datetime.LocalDateTime`.
358
- Cells with number values, including whole numbers such as "100", or calculated formulas will be read as `Double`
394
+ Cells with number values, including whole numbers such as "100", or calculated formulas will be read as `Double`.
359
395
360
- Sometimes cells can have wrong format in Excel file, for example you expect to read column of String:
396
+ Sometimes cells can have the wrong format in an Excel file. For example, you expect to read a column of `String`:
361
397
362
398
``` text
363
399
IDS
367
403
C100
368
404
```
369
405
370
- You will get column of Serializable instead (common parent for Double & String)
406
+ You will get a column of `Serializable` instead (the common parent of `Double` and `String`).
371
407
372
- You can fix it using convert:
408
+ You can fix it using the `.convert()` function:
373
409
374
410
<!-- -FUN fixMixedColumn-->
375
411
@@ -387,25 +423,28 @@ df1["IDS"].type() shouldBe typeOf<String>()
387
423
388
424
<!-- -END-->
389
425
390
- ## Reading Apache Arrow formats
426
+ ## Read Apache Arrow formats
391
427
392
- Add dependency:
428
+ Before you can read data from Apache Arrow format, add the following dependency:
393
429
394
430
``` kotlin
395
431
implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")
396
432
```
397
433
398
- <warning >
399
- Make sure to follow [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility) guide when using Java 9+
400
- </warning >
434
+ To read Apache Arrow formats, use the `.readArrowFeather()` function:
401
435
402
- [`DataFrame`](DataFrame.md) supports reading [Arrow interprocess streaming format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format)
403
- and [Arrow random access format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files)
404
- from raw Channel (ReadableByteChannel for streaming and SeekableByteChannel for random access), InputStream, File or ByteArray.
405
436
<!-- -FUN readArrowFeather-->
406
437
407
438
``` kotlin
408
439
val df = DataFrame.readArrowFeather(file)
409
440
```
410
441
411
442
<!-- -END-->
443
+
444
+ [`DataFrame`](DataFrame.md) supports reading [Arrow interprocess streaming format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-streaming-format)
445
+ and [Arrow random access format](https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files)
446
+ from raw Channel (ReadableByteChannel for streaming and SeekableByteChannel for random access), InputStream, File or ByteArray.
447
+
448
+ > If you use Java 9+, follow the [Apache Arrow Java compatibility](https://arrow.apache.org/docs/java/install.html#java-compatibility) guide.
449
+ >
450
+ {style="note"}