-
Notifications
You must be signed in to change notification settings - Fork 32
Data Model
WInte.r does not require any specific data model for its operations. All methods implemented in the MatchingEngine expect objects implementing the Matchable interface, and the DataFusionEngine methods expect objects implementing the Fusable interface. For a quick start, you can use the default model, which has pre-implemented readers of CSV and XML files. The default model provides simple classes to represent Attributes and Records.
However, if you decide to extensively work with data from a specific domain, it is recommended to implement a dedicated data model for this domain. This specific data model equips you with an additional layer of abstraction during the data integration process. For demoing purposes, WInte.r contains example use cases like a movie use case and a city use case including specific data models with readers for well-known CSV and XML files. A record in this user-specific model is represented by a specifc record like a movie record or a city record.
In general, default and user-specific records are loaded into a DataSet, which contains records and their schema, represented by attributes. Two implementations of a dataset are already provided: the HashedDataSet and the ParallelHashDataSet.
The default implementation of a DataSet is the HashedDataSet, which is a ProcessableCollection
and hence uses single-threaded processing. Analogously, the ParallelHashedDataSet is a ParallelProcessableCollection
and uses multi-threaded processing.
Datasets can store records and metadata, i.e., attributes.
To access the metadata, use the getSchema()
function, which returns another dataset with the metadata (you could actually manage another level of metadata by calling getSchema()
on the metadataset).
All collections in WInte.r implement the Processable interface, which enables a replaceable processing archtecture behind the matching and fusion methods. The default implementation of a Processable
is ProcessableCollection, which uses a single-threaded processing model. For multi-threaded processing, load your data into a ParallelProcessableCollection.
To access the elements in a Processable
use the get()
function, which returns a java Collection
.
Processable<String> myProcessable = new ProcessableCollection<>();
myProcessable.add("a");
myProcessable.add("bb");
myProcessable.add("ccc");
for(String element : myProcessable.get()) {
System.out.println(element);
}
Output:
a
bb
ccc
Note: All implementations in WInte.r use the methods defined in Processable
to manipulate data. If you need to optimise for performance, you should consider defining you own Processable
with custom implementations of the processing operations.
Processable<String> myProcessable = new ProcessableCollection<>();
myProcessable.add("a");
myProcessable.add("bb");
myProcessable.add("ccc");
myProcessable.add("a");
myProcessable.add("bb");
myProcessable.add("ccc");
String result = myProcessable
.distinct()
.filter((s)->s.length()>1)
.take(1)
.firstOrNull();
System.out.println(result);
Output:
bb
The Record class represents records in a dataset. It implements both the Matchable
and Fusible
interfaces.
The record class can manage attributes with atomic values and attributes with lists of values.
To access values, you need to define attributes and pass them to the setValue
/getValue
or setList
/getList
methods.
The Attribute class represents attributes in a dataset. It implements the Matchable
interface.
The only data that an attribute has is its name.
Attribute name = new Attribute("name");
Attribute phoneNumbers = new Attribute("phoneNumbers");
Record record = new Record("record1");
record.setValue(name, "the name of the entity");
record.setList(phoneNumbers, Arrays.toList(
new String[] {
"001",
"002",
"003"
}));
logger.info(String.format("Name: %s", record.getValue(name)));
for(String pn : record.getList(phoneNumbers)) {
logger.info(pn);
}
Output:
Name: the name of the entity
001
002
003
WInte.r provides implementations to load data from CSV and XML files using the default data model:
Reading a CSV file: The CSVRecordReader handles the CSV reading for the default data model and assumes that the first row contains the attribute names. You must specify the index of the column that contains the unique identifier in the constructor (if none exists, use -1).
// create a dataset
DataSet<Record, Attribute> data1 = new HashedDataSet<>();
// load data into the dataset
new CSVRecordReader(0).loadFromCSV(new File("usecase/movie/input/scifi1.csv"), data1);
If no header row is available, a mapping can be provided to connect an attribute to a column.
// generate mapping
Map<String, Attribute> columnMappingITunes = new HashMap<>();
columnMappingITunes.put("1", iTunesSong.PAGE);
columnMappingITunes.put("2", iTunesSong.URI0);
columnMappingITunes.put("3", iTunesSong.URI1);
columnMappingITunes.put("4", iTunesSong.URI2);
columnMappingITunes.put("5", iTunesSong.URI3);
columnMappingITunes.put("6", iTunesSong.POSITION);
columnMappingITunes.put("7", iTunesSong.NAME);
columnMappingITunes.put("8", iTunesSong.ARTIST);
columnMappingITunes.put("9", iTunesSong.TIME);
// load data
DataSet<Record, Attribute> dataITunes = new HashedDataSet<>();
new CSVRecordReader(0, columnMappingITunes).loadFromCSV(new File("usecase/itunes/input/itunes.csv"),
dataITunes);
Reading an XML file: The XMLRecordReader handles the XML reading for the default data model and requires a map that assigns an Attribute
to each XML Tag. Additionally, you must specify the XML tag that contains the record ids and an XPath expression the returns all records from the XML file.
<?xml version="1.0" encoding="UTF-8"?>
<movies>
...
<movie>
<id>movie_list_3</id>
<title>Alice in Wonderland</title>
<studio>Disney</studio>
</movie>
...
Map<String, Attribute> nodeMapping = new HashMap<>();
nodeMapping.put("title", new Attribute("title"));
nodeMapping.put("studio", new Attribute("studio"));
new XMLRecordReader("id", nodeMapping).loadFromXML(sourceFile, "/movies/movie", dataset);
The movie record class represents movie records in a dataset. It extends the abstract class AbstractRecord
.
To hold a movie's information, the movie record has attributes like 'title' or 'director'. When implementing these attributes, you should know whether and where they are contained in the input files. You can access these attributes via distinct 'get' and 'set' methods.
public class Movie extends AbstractRecord<Attribute> implements Serializable {
...
public Movie(String identifier, String provenance) {
super(identifier, provenance);
actors = new LinkedList<>();
}
private String title;
private String director;
private LocalDateTime date;
private List<Actor> actors;
private String studio;
private String genre;
private double budget;
...
public String getTitle() {
return title;
}
public void setTitle(String title) {
this.title = title;
}
...
}
WInte.r provides example implementations to load data from CSV and XML files using a user-specific data model:
Reading a CSV file: The CSVCityReader handles the CSV reading for the city data model.
id | countryCode | name | latitude | longitude | officialName | country | population |
---|---|---|---|---|---|---|---|
0 | IL | Ashqelon | 31.66926 | 34.57149 | אשקלון | Israel | 105.995 thousand |
1 | IQ | Ctesiphon | 33.08333 | 44.58333 | Iraq | ||
2 | FI | Espoo | 60.25 | 24.66667 | Espoo | Finland | 256.76 thousand |
... | ... | ... | ... | ... | ... | ... | ... |
// load data
DataSet<City, Attribute> dataCity = new HashedDataSet<>();
new CSVCityReader().loadFromCSV(new File("usecase/city/input/city.csv"), dataCity);
Whenever a new line is read from the target csv file via readLine()
, a new city record is generated and filled as shown below. As the format of the csv file is known upfront, you can directly pass a new value from an input line 'values' to the distinct attribute of the city record. Afterwards, the new city record is added to the data set.
@Override
protected void readLine(File file, int rowNumber, String[] values, DataSet<City, Attribute> dataset) {
if (rowNumber == 0) {
initialiseDataset(dataset);
} else {
// generate new record of type city
City r = new City(values[0], file.getAbsolutePath());
// set values of record
r.setCountryCode(values[1]);
r.setName(values[2]);
...
dataset.add(r);
}
}
Hint: As you can see in the city input table above, the column 'population' contains the unit 'thousand', thus it should be converted before processing it further. Check the wiki page on Data Normalization for more details.
Reading an XML file: The MovieXMLReader handles the XML reading for the movie data model. As the format of the xml is known upfront, no additional details have to be passed to the MovieXMLReader.
<?xml version="1.0" encoding="UTF-8"?>
<movies>
...
<movie>
<id>movie_list_3</id>
<title>Alice in Wonderland</title>
<studio>Disney</studio>
</movie>
...
HashedDataSet<Movie, Attribute> dataAcademyAwards = new HashedDataSet<>();
new MovieXMLReader().loadFromXML(new File("usecase/movie/input/academy_awards.xml"), "/movies/movie", dataAcademyAwards);
The mapping information between attribute and xml tag directly implemented into the MovieXMLReader, which generates the movie records from the xml file.
public class MovieXMLReader extends XMLMatchableReader<Movie, Attribute> implements
FusibleFactory<Movie, Attribute> {
...
@Override
public Movie createModelFromElement(Node node, String provenanceInfo) {
String id = getValueFromChildElement(node, "id");
// create the object with id and provenance information
Movie movie = new Movie(id, provenanceInfo);
// fill the attributes
movie.setTitle(getValueFromChildElement(node, "title"));
...
return movie;
}
}