Remove inter-table references #7

abyrd · 2016-02-05T14:32:10Z

This would change the API of gtfs-lib and require us to cut a new version, so it's still up for debate.

The tables in GTFS feeds reference each other’s objects via IDs (stop_id, route_id, etc.). The gtfs-lib classes that model GTFS entities resolve these IDs into direct Java object references to other entities. Tables that contain such direct references cannot live in persistent, disk-backed MapDBs, because MapDB treats table contents as pure values: the objects they contain are compared by semantic equality, not identity. Those tables are therefore currently kept in plain in-memory Java hash maps (if we put them in MapDBs, many entities, including the extremely voluminous shapes, would be serialized multiple times in different tables). Because of these in-memory maps, gtfs-lib currently requires you to re-load the source file into a temporary MapDB every time you want to use it.
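To make the value-versus-identity point concrete, here is a minimal sketch. It does not use MapDB at all: plain java.io serialization stands in for any store that serializes each record independently, and Route, Trip and roundTrip are simplified, hypothetical names rather than the actual gtfs-lib classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ValueSemanticsDemo {

    // Hypothetical, simplified entities; not the actual gtfs-lib classes.
    static class Route implements Serializable {
        final String routeId;
        Route(String routeId) { this.routeId = routeId; }
    }

    static class Trip implements Serializable {
        final String tripId;
        final Route route; // direct object reference instead of a route_id string
        Trip(String tripId, Route route) { this.tripId = tripId; this.route = route; }
    }

    // Assumption: stands in for a value-oriented store, where every record
    // is serialized on write and deserialized on read, independently of others.
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Route route = new Route("R1");
        Trip a = new Trip("T1", route);
        Trip b = new Trip("T2", route);
        System.out.println(a.route == b.route); // true: one shared Route in memory

        Trip aStored = roundTrip(a);
        Trip bStored = roundTrip(b);
        // Each stored Trip carried its own serialized copy of the Route.
        System.out.println(aStored.route == bStored.route); // false
        System.out.println(aStored.route.routeId.equals(bStored.route.routeId)); // true
    }
}
```

The shared Route is written once per Trip that points at it, and the copies read back are equal in content but no longer the same object. Storing only the route_id string on Trip avoids both problems.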

This is a tradeoff between syntax and convenience on one hand, and persistence of gtfs-lib tables on the other. If the tables are persistent, the purpose and meaning of gtfs-lib are very clear: it’s a random-access, indexed Java representation of exactly one whole GTFS feed, which can be quickly re-opened once it’s been built, and re-used in place of the original feed.

If not all tables are persistent but we still end up re-opening gtfs-lib files for pragmatic reasons, e.g. to yank out some colors or IDs when rendering itineraries, the purpose and behavior of the library become murky. The data available to the caller changes depending on whether you loaded from a feed or re-opened one that was already loaded.

I think it would be great if both osm-lib and gtfs-lib gave you a Java interface to exactly the contents of the GTFS or OSM file you initially wrapped, plus some convenient functionality for grouping things, seeking within the data store, etc.

Resolving inter-table references to object references already destroys some of the original source data. For example, say we want to do a few more passes of validation on a GTFS feed. If a table contained a bad ID, all you would know is that it was bad (the reference remains unresolved as a null); the ID string itself is gone, so you couldn’t look it up at a later date.
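A small sketch of that loss, with hypothetical ResolvedTrip and RawTrip classes standing in for the two designs (neither is the actual gtfs-lib Trip). The ID "R9" is an invented example of a route_id that appears in trips but not in routes.

```java
import java.util.HashMap;
import java.util.Map;

class UnresolvedReferenceDemo {

    static class Route {
        final String routeId;
        Route(String routeId) { this.routeId = routeId; }
    }

    // Design A (hypothetical): the ID is resolved to an object at load time.
    static class ResolvedTrip {
        final String tripId;
        final Route route; // null when the ID did not match any route
        ResolvedTrip(String tripId, Route route) { this.tripId = tripId; this.route = route; }
    }

    // Design B (hypothetical): the entity keeps the raw ID from the source file.
    static class RawTrip {
        final String tripId;
        final String routeId;
        RawTrip(String tripId, String routeId) { this.tripId = tripId; this.routeId = routeId; }
    }

    public static void main(String[] args) {
        Map<String, Route> routes = new HashMap<>();
        routes.put("R1", new Route("R1"));

        String badId = "R9"; // invented example: present in trips, absent from routes

        ResolvedTrip resolved = new ResolvedTrip("T1", routes.get(badId));
        RawTrip raw = new RawTrip("T1", badId);

        System.out.println(resolved.route);                  // null: the bad ID is lost
        System.out.println(raw.routeId);                     // R9: still available
        System.out.println(routes.containsKey(raw.routeId)); // false: a later pass can report it
    }
}
```

With the raw ID kept, a later validation pass can still say which ID was dangling, not just that something was.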

The more I think about it, the more I think gtfs-lib and osm-lib should exactly reflect the original data source (which also conveniently makes it possible for them to be persistent).

You can always have helper methods like getRoute() on a Trip, for example, even though it internally only stores a String ID. However, if these resolver methods were directly on the entity objects, they’d need either a reference to the containing feed (impossible, since that makes a circular reference for serialization, and wastes a ton of space anyway) or the feed passed in as a parameter. But we also don’t need to be super-OO. Functions like GTFSFeed.routeForTrip(Trip trip) would be fine. They’re all just convenience methods anyway.

This is very easy to read:

Trip trip = feed.getTrip(tripId);
Route route = feed.getRoute(trip.routeId);

How ugly is it to say trip.getRoute(feed) versus feed.routeForTrip(trip)?
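For comparison, a rough sketch of both call styles side by side. Feed here is a hypothetical stand-in for GTFSFeed, and the fields and constructors are assumptions made for illustration only; the point is just that either resolver is a one-line lookup once Trip stores a plain String ID.

```java
import java.util.HashMap;
import java.util.Map;

class ResolverStylesDemo {

    static class Route {
        final String routeId;
        final String shortName;
        Route(String routeId, String shortName) { this.routeId = routeId; this.shortName = shortName; }
    }

    static class Trip {
        final String tripId;
        final String routeId; // plain ID, no object reference
        Trip(String tripId, String routeId) { this.tripId = tripId; this.routeId = routeId; }

        // Style 1: resolver on the entity, with the feed passed in as a parameter.
        Route getRoute(Feed feed) { return feed.routes.get(routeId); }
    }

    // Hypothetical stand-in for GTFSFeed, holding one map per table.
    static class Feed {
        final Map<String, Route> routes = new HashMap<>();
        final Map<String, Trip> trips = new HashMap<>();

        // Style 2: resolver on the feed.
        Route routeForTrip(Trip trip) { return routes.get(trip.routeId); }
    }

    public static void main(String[] args) {
        Feed feed = new Feed();
        feed.routes.put("R1", new Route("R1", "10"));
        feed.trips.put("T1", new Trip("T1", "R1"));

        Trip trip = feed.trips.get("T1");
        System.out.println(trip.getRoute(feed).shortName);     // 10
        System.out.println(feed.routeForTrip(trip).shortName); // 10
    }
}
```

Both return the same Route object; the difference is purely which side of the call the feed sits on, and neither requires Trip to hold a reference back to its feed.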

abyrd mentioned this issue Feb 5, 2016

All tables should be persistent #3
