[DataSource API] Design issues and/or missing documentation #989

Open
sylvlecl opened this issue Oct 24, 2019 · 6 comments

Comments

@sylvlecl (Contributor) commented Oct 24, 2019

  • Do you want to request a feature or report a bug?

Design issues, or missing documentation.

  • What is the current behavior?

The DataSource API, which is used in particular in the core Importer API, is very hard for third-party users to use and understand.

A root issue seems to be that it is used for two different things:

  1. A common abstraction over different storages (folder, zip, AFS, memory ...)
  2. The representation of a Network as a set of binary contents which can be read by an Importer. To support this, it carries the concept of a "base name", which has nothing to do with item 1. Moreover, the actual resource names are not known by the data source itself, only by the Importer which is able to import it.

A consequence of this is that manipulating DataSource objects in the sense of item 2 is not convenient, because a data source is not autonomous.
For example, to simply copy the binary content representing a network from a DataSource to an in-memory Map<String, byte[]>, you need the following (a concrete sketch follows below):

  • An Importer which knows which files in the DataSource are part of the network representation. To find an adequate Importer you can use Importers.findImporter, but you need to provide unexpected parameters, for example a ComputationManager.
  • An in-memory DataSource implementation which exposes the actual map of content it holds (a kind of extended MemDataSource): otherwise you never learn which files were actually copied, since only the Importer knows.

Or you might want to copy the contents to an in-memory zip. Once again you need to provide your own DataSource implementation, which is not obvious since the interface is not documented.
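
To make the pain point concrete, here is a rough sketch of that copy. RecordingDataSource is hypothetical and not part of the framework; the Importers.findImporter and Importer.copy calls mirror the API discussed in this issue, but their exact signatures may differ:

// Hypothetical sketch: copying a network's binary content to a Map<String, byte[]>.
ReadOnlyDataSource source = ...; // the network representation to copy

// An Importer is needed just to know which files belong to the network;
// finding one requires unrelated parameters such as a ComputationManager.
Importer importer = Importers.findImporter(source, computationManager);

// A custom in-memory DataSource recording what gets written to it,
// because nothing else exposes which files end up being copied.
RecordingDataSource target = new RecordingDataSource();
importer.copy(source, target);

// Only now do we know which files were actually copied.
Map<String, byte[]> copied = target.recordedContent();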

  • What is the expected behavior?

The DataSource API is used a lot in the framework; it should be documented, and simple to understand and use for third-party users.

A possible guideline for a clearer API would be to separate the two concerns described above into two different objects or interfaces: one in charge of abstracting away the storage, the other carrying the actual set of binary contents which make up the Network representation.
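
A rough sketch of what that separation could look like; all names below (BlobStorage, NetworkContent, the method names) are illustrative only, not proposals for final API names:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Set;

// Concern 1: abstracting away the storage (folder, zip, AFS, memory ...).
interface BlobStorage {
    Set<String> names() throws IOException;
    InputStream read(String name) throws IOException;
    OutputStream write(String name) throws IOException;
}

// Concern 2: the set of binary contents making up one Network representation.
// It is autonomous: it knows its own file names, no Importer required.
interface NetworkContent {
    Set<String> fileNames();
    InputStream open(String fileName) throws IOException;
}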

  • What is the motivation / use case for changing the behavior?

This central part of the API really feels like a blocker for adoption of the framework by third-party users.

  • Please tell us about your environment:
    • PowSyBl Version: 3.0.0
@sylvlecl (Contributor, Author)

References

  • Issue DataSourceUtil extension handling #121 describes problems with data source extension handling.
  • The cancelled PR Data source rework #471 tried to address some of the problems described here and in DataSourceUtil extension handling #121.
    In particular, that PR removed the extension handling from DataSource, leaving it to the Importer API, but kept an optional "main file name" in place of the "base name" as part of the object.
    It also externalized the handling of compression formats in a dedicated interface.
    Overall it sounds good, except maybe for the main file name, which may look too tightly linked to network file formats.

@sylvlecl (Contributor, Author) commented Oct 28, 2019

Goals and use cases
In order to design the API we need to decide what it should address.

1. Enable reading/writing seamlessly from/to a compressed or uncompressed file

Practically, we want to be able to run:

itools loadflow --case-file network.xiidm

and also

itools loadflow --case-file network.xiidm.gz
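
One way a data source could achieve this transparently is to look at the file suffix when opening the stream. A minimal, self-contained sketch, not the framework's actual implementation:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

final class TransparentRead {
    // Opens the file, unwrapping gzip compression when the name ends in ".gz",
    // so callers read the same XIIDM bytes in both cases.
    static InputStream open(Path file) throws IOException {
        InputStream in = Files.newInputStream(file);
        return file.getFileName().toString().endsWith(".gz")
                ? new GZIPInputStream(in)
                : in;
    }
}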

2. Enable reading/writing files seamlessly from/to a directory, a zip file, an in-memory map, or another storage (AFS node, tarball ...)

In practice, we would like to be able to write something like:

Path path = Paths.get("/path/to/network.xiidm");
Network network = Importers.loadNetwork(DataSources.file(path));

and also

Map<String, byte[]> data = ...;
String name = "network.xiidm";
Network network = Importers.loadNetwork(DataSources.map(data, name));

A question for the command line: if a zip file is given as an argument, since we don't have any hint about the file names inside the zip, how should we resolve them?

itools loadflow --case-file network.zip

Should we assume that the zip contains files named "network.<extension>", or should it also work if the zip contains a single file named "toto.<extension>"? And in that case, should we throw if we don't know which files to choose?
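
For illustration, resolving by inspecting the zip rather than by naming convention could look like the sketch below; IMPORTABLE_EXTENSIONS stands in for whatever the registered Importers actually support:

import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipResolution {
    // Placeholder: in reality this would come from the registered Importers.
    private static final Set<String> IMPORTABLE_EXTENSIONS = Set.of("xiidm", "xml", "uct");

    // Lists entries whose extension matches an importable format,
    // regardless of their base name ("network" or "toto" alike).
    static List<String> candidateEntries(Path zip) throws IOException {
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            return zipFile.stream()
                    .map(ZipEntry::getName)
                    .filter(name -> {
                        int dot = name.lastIndexOf('.');
                        return dot >= 0 && IMPORTABLE_EXTENSIONS.contains(name.substring(dot + 1));
                    })
                    .toList();
        }
    }
    // A caller could then throw if the list is empty or ambiguous, as suggested above.
}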

3. Enable reading/writing Network models from/to multiple files

Some network file formats rely on multiple files:

  • CGMES defines several "profiles" (EQ, TP, SV, ...)
  • XIIDM allows reading/writing extensions from/to separate files

Do we want to be able to define the source for the model through a single name?
It makes things easier for the command line, but it implies that resolving the other files is in the Importer's hands, not the user's (see the sketch below).
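
As an illustration of the trade-off, resolving sibling files from a single name would put logic like the following inside the Importer. The "<baseName>_<PROFILE>.xml" pattern is a simplified assumption, not the real CGMES naming rules:

import java.util.List;
import java.util.stream.Stream;

final class ProfileResolution {
    // Simplified assumption: one file per profile, named "<baseName>_<PROFILE>.xml".
    // This only illustrates where the resolution logic would live
    // if the user provides a single name.
    static List<String> profileFiles(String baseName) {
        return Stream.of("EQ", "TP", "SV")
                .map(profile -> baseName + "_" + profile + ".xml")
                .toList();
    }
}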

@sylvlecl (Contributor, Author) commented Oct 28, 2019

Propositions

1. Simplify implementation of DataSource
The real SPI abstracting away the underlying storage could simply look like this:

interface DataSource {
  List<String> getFileNames() throws IOException;
  InputStream newInputStream(String name) throws IOException;
  OutputStream newOutputStream(String name, boolean append) throws IOException;
}

The other methods look common to all implementations.
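
For instance, a directory-backed implementation of this reduced SPI could be as small as the sketch below (DirectoryDataSource is illustrative, not existing code):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Stream;

class DirectoryDataSource implements DataSource {
    private final Path directory;

    DirectoryDataSource(Path directory) {
        this.directory = directory;
    }

    @Override
    public List<String> getFileNames() throws IOException {
        try (Stream<Path> files = Files.list(directory)) {
            return files.filter(Files::isRegularFile)
                        .map(p -> p.getFileName().toString())
                        .toList();
        }
    }

    @Override
    public InputStream newInputStream(String name) throws IOException {
        return Files.newInputStream(directory.resolve(name));
    }

    @Override
    public OutputStream newOutputStream(String name, boolean append) throws IOException {
        return append
                ? Files.newOutputStream(directory.resolve(name), StandardOpenOption.CREATE, StandardOpenOption.APPEND)
                : Files.newOutputStream(directory.resolve(name));
    }
}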

2. Provide more utility methods to create data sources

The framework could provide methods to build and manipulate data sources more easily, possibly with chained calls:

DataSource source = DataSources.singleFile("/path/to/network.xiidm");
Map<String, byte[]> map = new HashMap<>();
DataSource target = DataSources.mem(map)
                               .gzip()
                               .observe(observer);
// Copies all the content
DataSources.copy(source, target);
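
The gzip() step above could be a simple decorator over the target data source. A possible sketch, assuming the reduced DataSource interface from point 1 (GzipDataSource is illustrative only):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Compresses everything written to the delegate and decompresses everything read from it.
class GzipDataSource implements DataSource {
    private final DataSource delegate;

    GzipDataSource(DataSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<String> getFileNames() throws IOException {
        return delegate.getFileNames();
    }

    @Override
    public InputStream newInputStream(String name) throws IOException {
        return new GZIPInputStream(delegate.newInputStream(name));
    }

    @Override
    public OutputStream newOutputStream(String name, boolean append) throws IOException {
        // Note: appending produces a multi-member gzip stream, which
        // GZIPInputStream reads back as concatenated content.
        return new GZIPOutputStream(delegate.newOutputStream(name, append));
    }
}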

@sylvlecl sylvlecl changed the title DataSource API : design issues and/or missing documentation [DataSource API] Design issues and/or missing documentation Oct 31, 2019
@geofjamg (Member)

I would keep the current DataSource / ReadOnlyDataSource separation, as some use cases require only read access and must forbid write access (like importers). Also, some implementations only make sense for reading, like ResourceDataSource, which gives access to classpath data for unit testing.
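
Applied to the reduced SPI proposed above, that separation could look like this sketch:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

// Read-only view: enough for importers and for classpath-backed test data.
interface ReadOnlyDataSource {
    List<String> getFileNames() throws IOException;
    InputStream newInputStream(String name) throws IOException;
}

// Writable extension, implemented only where write access is legitimate.
interface DataSource extends ReadOnlyDataSource {
    OutputStream newOutputStream(String name, boolean append) throws IOException;
}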

@geofjamg (Member) commented Jan 28, 2020

It is not clear to me how it would work with such a design:

DataSource source = DataSources.singleFile("/path/to/network.xiidm");

In an importer implementation, it would require the importer to have extra information to know which file name to import:

// in an importer code
dataSource.newInputStream(????);

What is dataSource.getFileNames() supposed to return in that case? Just network.xiidm? Or network.xiidm and all its friends from the same directory?

@jonenst (Contributor) commented Jan 29, 2020

/**
 * A ReadOnlyDataStore represents read access to many unrelated blobs, not limited to a single network.
 * Examples: all files in a folder on the local filesystem, everything in a zip, everything in a database...
 */
public interface ReadOnlyDataStore {
    Set<String> getStreamNames() throws IOException;  // list all the things!
    InputStream newInputStream(String name) throws IOException; // open them one by one
    // note: no base name, no network name
}
// Importers can use DataStore to abstract away access to blobs in different storages and autodiscover networks.
// They are created from some kind of URI which can be used to filter what is inspected during the autodiscovery phase.
/**
 * A NetworkDataStore represents access to the data of one network only.
 */
public interface NetworkDataStore extends DataStore {
    String getBaseName();
    Set<String> listStreams() throws IOException; // only the streams of one network
    InputStream newInputStream(String name) throws IOException;
}

Then you can have a function List<NetworkDataStore> Importers.discover(DataStore)

For writes, since we can create any new file, a simple

public interface DataStore extends ReadOnlyDataStore {
    OutputStream newOutputStream(String name) throws IOException;
}

might suffice?
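
A possible usage sketch of this proposal, with Importers.discover being the hypothetical function suggested above:

// Open a storage, auto-discover the networks it contains,
// then read the streams of the first one.
DataStore store = ...; // e.g. backed by a folder, a zip, a database
List<NetworkDataStore> networks = Importers.discover(store);
if (networks.isEmpty()) {
    throw new IllegalStateException("no importable network found");
}
NetworkDataStore first = networks.get(0);
for (String name : first.listStreams()) {
    try (InputStream is = first.newInputStream(name)) {
        // consume the stream ...
    }
}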

@rolnico mentioned this issue Apr 15, 2024