Commit

Merge pull request #2 from gdcc/1-field
stop repeating field over and over #1
pdurbin authored Jun 3, 2024
2 parents dfed3b0 + 52bcb37 commit 1687207
Showing 3 changed files with 15 additions and 69 deletions.
1 change: 0 additions & 1 deletion README.md
@@ -152,7 +152,6 @@ Same as above but use a JVM option in domain.xml such as the example below.
### Differences from Kaggle

- I see an `encodingFormat` of `text/comma-separated-values`. Kind of curious about that since I think `text/csv` is more the MIME type that's on https://www.iana.org/assignments/media-types/media-types.xhtml and https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types . See https://github.com/IQSS/dataverse/issues/4943#issuecomment-2145333830
- One big difference I see is that you have many `recordSets` (and each one containing a single `field`) despite there being only 1 CSV. My understanding was that a `recordSet` maps roughly to a table and a `field` maps roughly to a column. So you'll see that our implementation has only 1 `recordSet` with many `field`s. This might be a good thing to get clarification on.
- Another thing that sticks out is that I see all of the `field`s have a `dataType` of `sc:Integer`. But nearly all of the columns (excluding `quality` and `Id`) are `sc:Float`. On the Kaggle side, we have a column type of "Id" and so if that's set on a column, we set the `dataType` to `sc:Text` since Ids can often be non-numerical. Just a minor difference there, though, so nothing alarming to me personally.

### Differences from pyDataverse
@@ -193,6 +193,8 @@ public void exportDataset(ExportDataProvider dataProvider, OutputStream outputSt
int fileCounter = 0;
for (JsonValue jsonValue : datasetFileDetails) {

+ JsonObjectBuilder recordSetContent = Json.createObjectBuilder();
+ recordSetContent.add("@type", "cr:RecordSet");
JsonObject fileDetails = jsonValue.asJsonObject();
/**
* When there is an originalFileName, it means that the file has gone through ingest
@@ -306,9 +308,9 @@ public void exportDataset(ExportDataProvider dataProvider, OutputStream outputSt
"fileObject",
Json.createObjectBuilder()
.add("@id", fileId))));
- fieldSetObject.add("field", fieldSetArray);
- recordSet.add(fieldSetObject);
}
+ recordSetContent.add("field", fieldSetArray);
+ recordSet.add(recordSetContent);
fileIndex++;
}
fileCounter++;
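
The hunk above appears to move the `cr:RecordSet` object out of the per-column loop, so each file yields one record set holding many fields rather than one record set per field. A minimal sketch of that shape follows, using plain `java.util` collections instead of the exporter's JSON builders. The class name `RecordSetSketch`, the method `buildRecordSet`, and the `source` key are assumptions made for illustration, not code from this repository.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the exporter.
public class RecordSetSketch {

    /** Builds a single cr:RecordSet for one file, with one cr:Field per column. */
    public static Map<String, Object> buildRecordSet(String fileId, List<String> columnNames) {
        // Created once per file, before iterating over the columns.
        Map<String, Object> recordSetContent = new LinkedHashMap<>();
        recordSetContent.put("@type", "cr:RecordSet");

        List<Map<String, Object>> fields = new ArrayList<>();
        for (String columnName : columnNames) {
            Map<String, Object> field = new LinkedHashMap<>();
            field.put("@type", "cr:Field");
            field.put("name", columnName);
            // "source" is an assumed key name; the diff only shows the nested fileObject/@id.
            field.put("source", Map.of("fileObject", Map.of("@id", fileId)));
            fields.add(field);
        }

        // Attached once, after the loop, so all fields share one record set.
        recordSetContent.put("field", fields);
        return recordSetContent;
    }

    public static void main(String[] args) {
        Map<String, Object> recordSet =
                buildRecordSet("data/stata13-auto.dta", List.of("price", "mpg", "rep78"));
        List<?> fields = (List<?>) recordSet.get("field");
        System.out.println(recordSet.get("@type") + " with " + fields.size() + " fields");
    }
}
```

Attaching the `field` array after the loop is what collapses the repeated `cr:RecordSet` wrappers seen in the expected JSON for the test fixture.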
77 changes: 11 additions & 66 deletions src/test/resources/cars/expected/cars-croissant.json
@@ -126,12 +126,7 @@
"@id": "data/stata13-auto.dta"
}
}
- }
- ]
- },
- {
- "@type": "cr:RecordSet",
- "field": [
+ },
{
"@type": "cr:Field",
"name": "price",
@@ -143,12 +138,7 @@
"@id": "data/stata13-auto.dta"
}
}
- }
- ]
- },
- {
- "@type": "cr:RecordSet",
- "field": [
+ },
{
"@type": "cr:Field",
"name": "mpg",
@@ -160,12 +150,7 @@
"@id": "data/stata13-auto.dta"
}
}
- }
- ]
- },
- {
- "@type": "cr:RecordSet",
- "field": [
+ },
{
"@type": "cr:Field",
"name": "rep78",
@@ -177,12 +162,7 @@
"@id": "data/stata13-auto.dta"
}
}
- }
- ]
- },
- {
- "@type": "cr:RecordSet",
- "field": [
+ },
{
"@type": "cr:Field",
"name": "headroom",
@@ -194,12 +174,7 @@
"@id": "data/stata13-auto.dta"
}
}
- }
- ]
- },
- {
- "@type": "cr:RecordSet",
- "field": [
+ },
{
"@type": "cr:Field",
"name": "trunk",
@@ -211,12 +186,7 @@
"@id": "data/stata13-auto.dta"
}
}
- }
- ]
- },
- {
- "@type": "cr:RecordSet",
- "field": [
+ },
{
"@type": "cr:Field",
"name": "weight",
@@ -228,12 +198,7 @@
"@id": "data/stata13-auto.dta"
}
}
- }
- ]
- },
- {
- "@type": "cr:RecordSet",
- "field": [
+ },
{
"@type": "cr:Field",
"name": "length",
@@ -245,12 +210,7 @@
"@id": "data/stata13-auto.dta"
}
}
- }
- ]
- },
- {
- "@type": "cr:RecordSet",
- "field": [
+ },
{
"@type": "cr:Field",
"name": "turn",
@@ -262,12 +222,7 @@
"@id": "data/stata13-auto.dta"
}
}
- }
- ]
- },
- {
- "@type": "cr:RecordSet",
- "field": [
+ },
{
"@type": "cr:Field",
"name": "displacement",
@@ -279,12 +234,7 @@
"@id": "data/stata13-auto.dta"
}
}
- }
- ]
- },
- {
- "@type": "cr:RecordSet",
- "field": [
+ },
{
"@type": "cr:Field",
"name": "gear_ratio",
@@ -296,12 +246,7 @@
"@id": "data/stata13-auto.dta"
}
}
- }
- ]
- },
- {
- "@type": "cr:RecordSet",
- "field": [
+ },
{
"@type": "cr:Field",
"name": "foreign",