[DataSource API] Design issues and/or missing documentation #989

Open
sylvlecl opened this issue Oct 24, 2019 · 6 comments

Comments

@sylvlecl (Contributor) commented Oct 24, 2019

  • Do you want to request a feature or report a bug?

Design issues, or missing documentation.

  • What is the current behavior?

The DataSource API, which is used in particular in the core Importer API, is very hard for third-party users to use and understand.

A root issue seems to be that it is used for two different things:

  1. A common abstraction over different storages (folder, zip, AFS, memory ...)
  2. The representation of a Network as a set of binary contents which can be read by an Importer. To support this, it carries the concept of a "base name", which has nothing to do with item 1. Moreover, the actual resource names are not known by the data source itself, only by the Importer which is able to import it.

A consequence of this is that manipulating DataSource objects in the sense of item 2 is not convenient, because a data source is not autonomous.
For example, to simply copy the binary content representing a network from a DataSource to an in-memory Map<String, byte[]>, you need the following (a concrete sketch follows below):

  • An Importer which knows which files in the DataSource are part of the network representation. To find an adequate Importer you can use Importers.findImporter, but you need to provide unexpected parameters, for example a ComputationManager.
  • An in-memory DataSource implementation which exposes the actual map of content it holds (a kind of extended MemDataSource): otherwise you never learn which files were actually copied, since only the Importer knows.

Or you might want to copy the contents to an in-memory zip. Once again you need to provide your own DataSource implementation, which is not obvious since the interface is not documented.
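
To make the pain point concrete, here is a rough sketch of that copy. RecordingDataSource is hypothetical and not part of the framework; the Importers.findImporter and Importer.copy calls mirror the API discussed in this issue, but their exact signatures may differ:

// Hypothetical sketch: copying a network's binary content to a Map<String, byte[]>.
ReadOnlyDataSource source = ...; // the network representation to copy

// An Importer is needed just to know which files belong to the network;
// finding one requires unrelated parameters such as a ComputationManager.
Importer importer = Importers.findImporter(source, computationManager);

// A custom in-memory DataSource recording what gets written to it,
// because nothing else exposes which files end up being copied.
RecordingDataSource target = new RecordingDataSource();
importer.copy(source, target);

// Only now do we know which files were actually copied.
Map<String, byte[]> copied = target.recordedContent();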

  • What is the expected behavior?

The DataSource API is used a lot in the framework; it should be documented, and simple to understand and use for third-party users.

A possible guideline for a clearer API would be to separate the two concerns described above into two different objects or interfaces: one in charge of abstracting away the storage, the other carrying the actual set of binary contents which make up the Network representation.
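
A rough sketch of what that separation could look like; all names below (BlobStorage, NetworkContent, the method names) are illustrative only, not proposals for final API names:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Set;

// Concern 1: abstracting away the storage (folder, zip, AFS, memory ...).
interface BlobStorage {
    Set<String> names() throws IOException;
    InputStream read(String name) throws IOException;
    OutputStream write(String name) throws IOException;
}

// Concern 2: the set of binary contents making up one Network representation.
// It is autonomous: it knows its own file names, no Importer required.
interface NetworkContent {
    Set<String> fileNames();
    InputStream open(String fileName) throws IOException;
}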

  • What is the motivation / use case for changing the behavior?

This central part of the API really feels like a blocker for adoption of the framework by third-party users.

  • Please tell us about your environment:
    • PowSyBl Version: 3.0.0
@sylvlecl (Contributor, Author)

References

  • Issue DataSourceUtil extension handling #121 describes problems with data source extension handling.
  • The cancelled PR Data source rework #471 tried to address some of the problems described here and in DataSourceUtil extension handling #121.
    In particular, that PR removed the extension handling from DataSource, leaving it to the Importer API, but kept an optional "main file name" in place of the "base name" as part of the object.
    It also externalized the handling of compression formats in a dedicated interface.
    Overall it sounds good, except maybe for the main file name, which may look too tightly linked to network file formats.

@sylvlecl (Contributor, Author) commented Oct 28, 2019

Goals and use cases
In order to design the API we need to decide what it should address.

1. Enable reading/writing seamlessly from/to a compressed or uncompressed file

Practically, we want to be able to run:

itools loadflow --case-file network.xiidm

and also

itools loadflow --case-file network.xiidm.gz
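
One way a data source could achieve this transparently is to look at the file suffix when opening the stream. A minimal, self-contained sketch, not the framework's actual implementation:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

final class TransparentRead {
    // Opens the file, unwrapping gzip compression when the name ends in ".gz",
    // so callers read the same XIIDM bytes in both cases.
    static InputStream open(Path file) throws IOException {
        InputStream in = Files.newInputStream(file);
        return file.getFileName().toString().endsWith(".gz")
                ? new GZIPInputStream(in)
                : in;
    }
}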

2. Enable reading/writing files seamlessly from/to a directory, a zip file, an in-memory map, or another storage (AFS node, tarball ...)

In practice, we would like to be able to write something like:

Path path = Paths.get("/path/to/network.xiidm");
Network network = Importers.loadNetwork(DataSources.file(path));

and also

Map<String, byte[]> data = ...;
String name = "network.xiidm";
Network network = Importers.loadNetwork(DataSources.map(data, name));

A question for the command line: if a zip file is given as an argument, since we don't have any hint about the file names inside the zip, how should we resolve them?

itools loadflow --case-file network.zip

Should we assume that the zip contains files named "network.<extension>", or should it also work if the zip contains a single file named "toto.<extension>"? And in that case, should we throw if we don't know which files to choose?
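
For illustration, resolving by inspecting the zip rather than by naming convention could look like the sketch below; IMPORTABLE_EXTENSIONS stands in for whatever the registered Importers actually support:

import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipResolution {
    // Placeholder: in reality this would come from the registered Importers.
    private static final Set<String> IMPORTABLE_EXTENSIONS = Set.of("xiidm", "xml", "uct");

    // Lists entries whose extension matches an importable format,
    // regardless of their base name ("network" or "toto" alike).
    static List<String> candidateEntries(Path zip) throws IOException {
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            return zipFile.stream()
                    .map(ZipEntry::getName)
                    .filter(name -> {
                        int dot = name.lastIndexOf('.');
                        return dot >= 0 && IMPORTABLE_EXTENSIONS.contains(name.substring(dot + 1));
                    })
                    .toList();
        }
    }
    // A caller could then throw if the list is empty or ambiguous, as suggested above.
}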

3. Enable reading/writing Network models from/to multiple files

Some network file formats rely on multiple files:

  • CGMES defines several "profiles" (EQ, TP, SV, ...)
  • XIIDM allows reading/writing extensions from/to separate files

Do we want to be able to define the source for the model through a single name?
It makes things easier for the command line, but it implies that resolving the other files is in the Importer's hands, not the user's (see the sketch below).
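
As an illustration of the trade-off, resolving sibling files from a single name would put logic like the following inside the Importer. The "<baseName>_<PROFILE>.xml" pattern is a simplified assumption, not the real CGMES naming rules:

import java.util.List;
import java.util.stream.Stream;

final class ProfileResolution {
    // Simplified assumption: one file per profile, named "<baseName>_<PROFILE>.xml".
    // This only illustrates where the resolution logic would live
    // if the user provides a single name.
    static List<String> profileFiles(String baseName) {
        return Stream.of("EQ", "TP", "SV")
                .map(profile -> baseName + "_" + profile + ".xml")
                .toList();
    }
}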

@sylvlecl (Contributor, Author) commented Oct 28, 2019

Propositions

1. Simplify implementation of DataSource
The real SPI abstracting away the underlying storage could simply look like this:

interface DataSource {
  List<String> getFileNames() throws IOException;
  InputStream newInputStream(String name) throws IOException;
  OutputStream newOutputStream(String name, boolean append) throws IOException;
}

The other methods look common to all implementations.
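
For instance, a directory-backed implementation of this reduced SPI could be as small as the sketch below (DirectoryDataSource is illustrative, not existing code):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Stream;

class DirectoryDataSource implements DataSource {
    private final Path directory;

    DirectoryDataSource(Path directory) {
        this.directory = directory;
    }

    @Override
    public List<String> getFileNames() throws IOException {
        try (Stream<Path> files = Files.list(directory)) {
            return files.filter(Files::isRegularFile)
                        .map(p -> p.getFileName().toString())
                        .toList();
        }
    }

    @Override
    public InputStream newInputStream(String name) throws IOException {
        return Files.newInputStream(directory.resolve(name));
    }

    @Override
    public OutputStream newOutputStream(String name, boolean append) throws IOException {
        return append
                ? Files.newOutputStream(directory.resolve(name), StandardOpenOption.CREATE, StandardOpenOption.APPEND)
                : Files.newOutputStream(directory.resolve(name));
    }
}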

2. Provide more utility methods to create data sources

The framework could provide methods to build and manipulate data sources more easily, possibly with chained calls:

DataSource source = DataSources.singleFile("/path/to/network.xiidm");
Map<String, byte[]> map = new HashMap<>();
DataSource target = DataSources.mem(map)
                               .gzip()
                               .observe(observer);
// Copies all the content
DataSources.copy(source, target);
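
The gzip() step above could be a simple decorator over the target data source. A possible sketch, assuming the reduced DataSource interface from point 1 (GzipDataSource is illustrative only):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Compresses everything written to the delegate and decompresses everything read from it.
class GzipDataSource implements DataSource {
    private final DataSource delegate;

    GzipDataSource(DataSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<String> getFileNames() throws IOException {
        return delegate.getFileNames();
    }

    @Override
    public InputStream newInputStream(String name) throws IOException {
        return new GZIPInputStream(delegate.newInputStream(name));
    }

    @Override
    public OutputStream newOutputStream(String name, boolean append) throws IOException {
        // Note: appending produces a multi-member gzip stream, which
        // GZIPInputStream reads back as concatenated content.
        return new GZIPOutputStream(delegate.newOutputStream(name, append));
    }
}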

@sylvlecl sylvlecl changed the title DataSource API : design issues and/or missing documentation [DataSource API] Design issues and/or missing documentation Oct 31, 2019
@geofjamg (Member)

I would keep the current DataSource / ReadOnlyDataSource separation, as some use cases require only read access and must forbid write access (like importers). Also, some implementations only make sense for reading, like ResourceDataSource, which gives access to classpath data for unit testing.
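
Applied to the reduced SPI proposed above, that separation could look like this sketch:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

// Read-only view: enough for importers and for classpath-backed test data.
interface ReadOnlyDataSource {
    List<String> getFileNames() throws IOException;
    InputStream newInputStream(String name) throws IOException;
}

// Writable extension, implemented only where write access is legitimate.
interface DataSource extends ReadOnlyDataSource {
    OutputStream newOutputStream(String name, boolean append) throws IOException;
}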

@geofjamg (Member) commented Jan 28, 2020

It is not clear to me how it would work with such a design:

DataSource source = DataSources.singleFile("/path/to/network.xiidm");

In an importer implementation, it would require the importer to have extra information to know which file name to import:

// in an importer code
dataSource.newInputStream(????);

What is dataSource.getFileNames() supposed to return in that case? Just network.xiidm? Or network.xiidm and all its friends from the same directory?

@jonenst (Contributor) commented Jan 29, 2020

/**
 * A ReadOnlyDataStore represents read access to many unrelated blobs, not limited to a single network.
 * Examples: all files in a folder on the local filesystem, everything in a zip, everything in a database...
 */
public interface ReadOnlyDataStore {
    Set<String> getStreamNames() throws IOException;  // list all the things!
    InputStream newInputStream(String name) throws IOException; // open them one by one
    // note: no base name, no network name
}
// Importers can use DataStore to abstract away access to blobs in different storages and autodiscover networks.
// They are created from some kind of URI which can be used to filter what is inspected during the autodiscovery phase.
/**
 * A NetworkDataStore represents access to the data of one network only.
 */
public interface NetworkDataStore extends DataStore {
    String getBaseName();
    Set<String> listStreams() throws IOException; // only the streams of one network
    InputStream newInputStream(String name) throws IOException;
}

Then you can have a function List<NetworkDataStore> Importers.discover(DataStore)

For writes, since we can create any new file, a simple

public interface DataStore extends ReadOnlyDataStore {
    OutputStream newOutputStream(String name) throws IOException;
}

might suffice?
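
A possible usage sketch of this proposal, with Importers.discover being the hypothetical function suggested above:

// Open a storage, auto-discover the networks it contains,
// then read the streams of the first one.
DataStore store = ...; // e.g. backed by a folder, a zip, a database
List<NetworkDataStore> networks = Importers.discover(store);
if (networks.isEmpty()) {
    throw new IllegalStateException("no importable network found");
}
NetworkDataStore first = networks.get(0);
for (String name : first.listStreams()) {
    try (InputStream is = first.newInputStream(name)) {
        // consume the stream ...
    }
}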

@rolnico mentioned this issue Apr 15, 2024