osmdata_sc function #148

Closed
mpadge opened this issue Oct 25, 2018 · 42 comments

@mpadge
Member

mpadge commented Oct 25, 2018

See this silicate issue for motivation here, and ping @mdsumner

@mpadge mpadge added the must do label Oct 25, 2018
@mpadge
Member Author

mpadge commented Oct 29, 2018

Currently can't handle points because of this issue, but otherwise should be good to go. The plan is to generate silicate structures like this:

$object
# A tibble n x m
       osm_id name    osm_type ... object_
*       <int> <chr>   <chr>    ... <chr>
1   123456789 relOne  relation ... e3b30cf9a2
 ...
123 234567890 wayOne  way      ... 83f9d051b6
 ...
456 345678901 nodeOne node     ... a6058441f3

$object_link_edge
# A tibble p x 2
  edge_      object_
1 e5ab8d129c e3b30cf9a2
2 e3b30cf9a2 e3b30cf9a2
3 ...

$edge
# A tibble: 5 x 3
  .vertex0   .vertex1   edge_     
  <chr>      <chr>      <chr>
1 e5ab8d129c e5ab8d129c bb19059d04
2 8451f0964e 8451f0964e 93e575a887

$vertex
# A tibble: 5 x 3
     x_    y_ vertex_   
  <int> <int> <chr>
1     1     6 e5ab8d129c
2     2     7 8451f0964e

And the workaround for the current lack of POINT objects in silicate (see issue linked above) is to set object_ for rows with osm_type == "node" to the vertex ID, so the link is direct.
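A minimal sketch of that layout, using made-up IDs and base R data frames rather than real osmdata or silicate output. The table and column names follow the structure above; the values and the helper coords_for() are hypothetical and only illustrate how a node's object_ can double as its vertex ID.

```r
# Toy tables following the proposed layout; all IDs and coordinates are made up.
vertex <- data.frame (x_ = c (1, 2, 3),
                      y_ = c (6, 7, 8),
                      vertex_ = c ("v1", "v2", "v3"),
                      stringsAsFactors = FALSE)
edge <- data.frame (.vertex0 = c ("v1", "v2"),
                    .vertex1 = c ("v2", "v3"),
                    edge_ = c ("e1", "e2"),
                    stringsAsFactors = FALSE)
object_link_edge <- data.frame (edge_ = c ("e1", "e2"),
                                object_ = c ("wayOne", "wayOne"),
                                stringsAsFactors = FALSE)
# The workaround: for osm_type == "node", object_ is simply the vertex ID.
object <- data.frame (osm_type = c ("way", "node"),
                      name = c ("wayOne", "nodeOne"),
                      object_ = c ("wayOne", "v3"),
                      stringsAsFactors = FALSE)

# Hypothetical helper: rows of the vertex table belonging to one object
coords_for <- function (id) {
    type <- object$osm_type [object$object_ == id]
    if (type == "node") {
        ids <- id                       # direct link: object_ is a vertex_
    } else {
        e <- object_link_edge$edge_ [object_link_edge$object_ == id]
        ed <- edge [edge$edge_ %in% e, ]
        ids <- unique (c (ed$.vertex0, ed$.vertex1))
    }
    vertex [vertex$vertex_ %in% ids, ]
}

coords_for ("wayOne")   # vertices reached via the edge tables
coords_for ("v3")       # vertex reached directly
```

Ways go through object_link_edge and edge to reach their vertices, while nodes skip both tables.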

ping @mdsumner: This is a nifty solution for OSM structure in general, and requires neither modification to silicate nor immediate solution to that silicate issue. Problem unfortunately remains for the case of sf::MULTIPOINT and the like... Note also that this is going to be a great test case for silicate (in my subjective opinion here), because silicate will enable direct encoding of the fundamental OSM types, which are not really directly SF-compliant at all, so we're using a fundamentally different representational scheme from the outset.

@mdsumner

Sounds great, but is this function only code on your local computer?

Note that I merged the binary branch today, and SC now only has object, edge, vertex tables (no object_link_edge) - there's no "edge_" ID, and "object_" is on edge. Do you think that's a bad idea? I feel like edges don't require an explicit ID, at least in most cases.

@mpadge
Member Author

mpadge commented Oct 29, 2018

If I may disagree - there are many cases where edges would require an explicit ID, plus I think that all levels of all tables should have IDs to enable consistent filter operations, otherwise you've got filter by ID as an option for all cases except edge_, which would then become inconsistent and kinda un-hypertidy, right?

@Robinlovelace
Member

Sorry for intruding on a debate I know very little about, but if object_link_edge and other columns can be created on-the-fly, my (uninformed) guess would be this: it's better to have them, despite the storage overhead, if those extra cols will be used in many/most operations and creating them is costly; they should not be there if they are unlikely to be used regularly and creating them on-the-fly is superfast.

@Robinlovelace
Member

If that makes any sense at all! If not feel free to ignore - excited to see where this newfangled sc thing goes in any case.

@mpadge
Member Author

mpadge commented Oct 29, 2018

@Robinlovelace yeah, I largely agree. I actually suspect that the simplicity of SC alone with all (four) tables and some empty is likely to be ultimately preferable to an SC0 option with fewer tables (and contrary to my comment here).

Ref to core silicate models here

@mdsumner

Well true, I'm enamoured by having topology table, and then a _link table to the objects - but at some point you will want both kinds of primitives (triangles for the surface, edges for just the boundary) and so like a kind of "all tables" from TRI and SC. (Eonfusion had separate tables for each type of primitive, so I knew this was workable - we just don't have a smart link-mechanism other than hefty string IDs).

Dang, I have to revert to the original TRI and SC.

@mdsumner

mdsumner commented Oct 30, 2018

Well, it's not hard to change (I'm finding!). I did have the edge/segment distinction before but wasn't clear. So, edge table gets .vx0, .vx1, edge_. The link table gets object_, edge_, direction_.

Direction is a record of the input orientation. Either .vx0,.vx1 or .vx1,.vx0. So edge is unique, the link table records the instances. Makes sense?

@mdsumner

Topojson stores orientation, I think the direction of an arc around a polygon - presumably osm does the same for ways?

@mpadge
Member Author

mpadge commented Oct 30, 2018

There's no sense of orientation in OSM, everything is simply sequential, and inner polygons are simply tagged as such. But I've another quick Q for you Mike: What is the motivation behind having both $edge and $object_link_edge tables rather than just appending an extra object column onto the $edge table? Would it greatly affect your grand vision if these two were reduced to a single table?

@mdsumner

mdsumner commented Oct 30, 2018

SC()$edge is a record of paired vertices, no matter what order they occur. (This is absolutely critical for ARC, because we have to remove one set of the two sequences between neighbouring polygons). object_link_edge records every instance of that edge, and today it records if the order (.vx0, .vx1) is native_ - i.e. the edge table is now unique after parallel sort of the pairs.

In the initial implementation the edge table included "edge_" and "segment_", because I thought segment was a good name for the instances of an edge. But now I think it was confusing.

  • $edge: .vx0, .vx1, edge_
  • $object_link_edge: object_, edge_, native_

Edges are normalized by parallel sort of .vx0, .vx1 - so we don't know their original orientation - but it is recorded on the link table whether the edge was re-oriented or not - one will be TRUE and one FALSE when the edge is shared.

e.g. in my local branch

x <- SC(minimal_mesh)
purrr::map_int(x, nrow)
          object object_link_edge             edge           vertex             meta
               2               16               15               14                1
print(x)
class       : SC
type        : Primitive
vertices    : 14
primitives  : 15 (1-space)
crs         : NA

14 unique coordinates (two are shared) and 15 edges; one edge is repeated where the features touch.

The sf object has 19 coordinates: 3 are repeats (at the end of each path) and 2 are shared. So 19 - 2 - 3 = 14.

sc_coord(minimal_mesh)
# A tibble: 19 x 2

It could be one edge table, and maybe it should be - but I see this as pretty key. SC0 has a nested edge table with object implicitly, so it's not far away - or you can join back up from the link table.
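The normalization described above (parallel sort of each vertex pair, with the original orientation kept on the link table) can be sketched in base R. This is an illustration on made-up data, not silicate's actual implementation; the input table segs and the generated edge IDs are assumptions.

```r
# Directed segments as they occur in two touching objects (made-up IDs).
# Objects "a" and "b" share the edge between v2 and v3, in opposite directions.
segs <- data.frame (object_ = c ("a", "a", "b", "b"),
                    from = c ("v1", "v2", "v3", "v2"),
                    to   = c ("v2", "v3", "v2", "v4"),
                    stringsAsFactors = FALSE)

# Parallel sort of each pair gives an orientation-free representation
vx0 <- pmin (segs$from, segs$to)
vx1 <- pmax (segs$from, segs$to)
native_ <- segs$from == vx0        # TRUE where the input order was kept
key <- paste (vx0, vx1, sep = "-")

# Unique edges: keep the first occurrence of each sorted pair
first_seen <- !duplicated (key)
edge <- data.frame (.vx0 = vx0 [first_seen],
                    .vx1 = vx1 [first_seen],
                    edge_ = paste0 ("e", seq_len (sum (first_seen))),
                    stringsAsFactors = FALSE)

# Link table: one row per instance of an edge within an object
object_link_edge <- data.frame (object_ = segs$object_,
                                edge_ = edge$edge_ [match (key, key [first_seen])],
                                native_ = native_,
                                stringsAsFactors = FALSE)

edge
object_link_edge
```

The shared edge appears once in edge and twice in object_link_edge, once with native_ TRUE and once FALSE, which is what allows ARC-style operations to drop one of the two sequences between neighbours.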

@mpadge
Member Author

mpadge commented Oct 30, 2018

Oh, now I see - thanks! All good to leave as is. I'm nearly there ...

@mdsumner

Awright, behold the topology branch.

SC0, SC, and TRI all work as intended. PATH and ARC will have to wait.

SC derives from SC0, and SC can ingest TRI (which is cool), but not the other way around (TRI is PATH-based, not edge-based).

SC0 takes points! SC is purely edges.

There are plot and print methods.

@mpadge mpadge closed this as completed in f6f61b9 Oct 30, 2018
@mpadge
Member Author

mpadge commented Oct 30, 2018

Awright, behold osmdata_sc() as of that closing commit. I expect to have to re-open this soon to iron out glitches, but that prototype should get us going. Just osmdata_sc() any query and you should have it. I've been testing with some fiendishly complex OSM topology from

dat <- opq ("london uk") %>%
    add_osm_feature (key = "name", value = "Thames", value_exact = FALSE) %>%
    osmdata_sc (doc = "thames.xml")

It's got all sorts of nested polygons for islands and guff like that.

@mpadge
Member Author

mpadge commented Oct 30, 2018

And somewhat contradicting my comments in this commit, the full benchmarks now have osmdata_sc() only performing a bit over 10% faster than osmdata_sf(). That prior comment was before I had filled out the $edge table by tracing along all of the OSM way entities, and that slows it down to a roughly comparable speed after all. But hey, 10% faster remains 10% faster.

@mdsumner

I folded all new stuff into silicate master branch, just FYI

@mpadge
Member Author

mpadge commented Oct 31, 2018

That's great - that'll make it much easier for me to start delving in. Nice work!

@mdsumner

mdsumner commented Nov 1, 2018

Would it be reasonable to write sc_coord, sc_vertex, sc_path etc. functions for the OSM doc? I had a try with navigating the structure with xml2, but it's not something I'm used to doing. I gather that a lot more complex stuff is going on in the code here, but the silicate API is supposed to work like the internals of SC0 and SC - by writing these verb methods for various formats.

It may not work, but it'd help me to see an attempt to decompose OSM in that way. If you're adept at xml2 (maybe it's slow?) it'd be great to be able to attack the doc that way as well, at least for comparing.

@mpadge
Member Author

mpadge commented Nov 2, 2018

This is the first issue re-opener as anticipated above. osmdata_sc could simply have an extra argument, sc_verb = "SC", which could also be set to "coord", "vertex", "path", or whatever. That would also ease dodgr compliance by being able to bundle junction-vertex finding (maybe in a new verb? We'd have to discuss this). The current dodgr_contract_graph code - which is largely the work of @karpfen - is also the key step required to plug into the ggm::fundCycles() function to enable direct translation between edge and mesh models. Actually, rather than re-open here, I'll open a new issue for explicit discussion / listing of desirable sc verbs.

This was referenced Nov 2, 2018
@mpadge
Member Author

mpadge commented Nov 24, 2018

Reopened to add proper vertex info, extending from this dodgr issue. @mdsumner I agree with your suspicion that vertex info doesn't belong on the vertex table itself, but it ends up actually much better placed on a generic object_link table, which can be either object_link_edge or object_link_vertex, or presumably other linked objects (arc, whatevs). The only substantive divergence from core silicate vision as far as I can tell is the generalisation of the object link table. Otherwise everything fits with, I am pretty sure, no gross distortion of general vision. Once I nut this out I expect I'll open a silicate issue on renaming. Early thoughts from your side?

@mpadge mpadge reopened this Nov 24, 2018
@mdsumner

On where info belongs, I just meant in some circumstances - because, if you measure x, y, z, time, temperature, then all those belong on the vertices, they are uncontroversially measurements in "geometry". It's just that if we normalize these data (make unique in x, y or x, y, z ...) - then whatever wasn't in the unique-ifying set belongs on the instances of vertex table. I find this requirement bites me in different directions and I haven't sorted it all out yet.

The other thing I've been thinking about is the models, PATH, SC etc. - it's starting to seem like the 'object' table should really be the paths or the edges, and higher level stuff exists on more tables. This came to me while thinking about plot methods, it would be nice to be able to pass in n-colours when there are n-paths - rather than always working on the "feature level" grouping. Then your sphier ideas really come into it.

I only think tables should be split when there's been some de-duplication, vertices can store anything, including text properties.

Of course, it's also important that we don't get too carried away - I think the SC, TRI, PATH, and ARC structures as they are (and when they work properly) are pretty right. I get caught by the de-duplication thing when I split a DEL mesh for feature constants (e.g. height as SIDS79, because now unique in x, y, z not x, y) - and then I cannot re-triangulate that with Triangle because of the non-xy uniqueness. I feel like that was leading me down a tangled path.

@mpadge
Member Author

mpadge commented Nov 27, 2018

The last two commits have sorted out the basic table structure. FYI @mdsumner it now delivers perfectly standard tables for vertex, edge, and object_link_edge, with all the info in the object table, which has the four columns of [object_, key, value, obj_type], where obj_type is one of "relation", "way", or "node", and key-value are OSM standard except for members and roles of multipolygon relations, where OSM ("ref", "role") are re-mapped to key = "rel_role_xxx" ("outer", "inner", whatever), value = <osm_id>.
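A small sketch of what that long-form object table might look like, built in base R from made-up input. The column names [object_, key, value, obj_type] and the "rel_role_" prefix come from the description above; the input tables tags and members, and all IDs, are hypothetical.

```r
# Made-up OSM-like input: tags as key-value pairs ...
tags <- data.frame (object_ = c ("r1", "w1", "w1"),
                    key = c ("type", "highway", "name"),
                    value = c ("multipolygon", "path", "wayOne"),
                    obj_type = c ("relation", "way", "way"),
                    stringsAsFactors = FALSE)
# ... and relation members as OSM ("ref", "role") pairs
members <- data.frame (object_ = c ("r1", "r1"),
                       ref = c ("w1", "w2"),
                       role = c ("outer", "inner"),
                       stringsAsFactors = FALSE)

# Re-map ("ref", "role") to key = "rel_role_<role>", value = <osm_id>
member_rows <- data.frame (object_ = members$object_,
                           key = paste0 ("rel_role_", members$role),
                           value = members$ref,
                           obj_type = "relation",
                           stringsAsFactors = FALSE)
object <- rbind (tags, member_rows)

# IDs of ways that are outer members of any relation
outer_ways <- object$value [object$key == "rel_role_outer"]
object
```

Because members and ordinary tags share one key-value layout, a single filter on key is enough to pull out member ways.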

Still, however, no points, and no ways of re-mapping ID values in the object table on to vertices ...

@mdsumner

hey I've so far been assuming that object$object_ is unique and that all object$object_ are in the set of object_link_edge$object_, and neither is true here:

library(osmdata)
x <- opq ("hampi india") %>%
  add_osm_feature (key="historic", value="ruins") %>%
  osmdata_sc ()

library(dplyr)
nrow(distinct(x$object, object_)) == nrow(x$object)


setdiff(x$object$object_, x$object_link_edge$object_)

I'm having a look at multiple object instances (this is where sphier extension is really needed) - and how to handle those with SC0, but I admit I haven't before dealt with names that don't exist in downstream tables. I don't think that should be allowed, but I'm also thinking about ways to deal with it - it'll definitely come up in the longer term and silicate will have to be robust to it.

@mpadge
Member Author

mpadge commented Nov 27, 2018

Oh yes there is:

x <- opq("hobart") %>% # Just returns the inner city, so is small and workable here
   add_osm_feature(key = "highway") %>%
   osmdata_sc ()

library (tidyverse)
# grab the ID of the first way that is the outer member of a multipolygon relation:
way_id <- x$object %>%
    filter (key == "rel_role_outer") %>%
    pull (value) %>%   # pull() extracts the column as a plain vector
    first ()
# find IDs of all edges that are part of that way
edges <- x$object_link_edge %>%
    filter (object_ == way_id) %>%
    pull (edge_)
# then the OSM IDs of all member vertices
verts <- x$edge %>%
    filter (edge_ %in% edges) %>%
    select (.vx0, .vx1) %>%
    gather () %>%
    pull (value) %>%
    unique ()
# and then their coords:
xy <- x$vertex %>%
    filter (vertex_ %in% verts)

And we have:

> print(way_id)
[1] "553706635"
> edges
 [1] "ovnQhMoumE" "nYmKPtIifD" "ZrNjyH3h6u" "8TVgPjKc8B" "TZqJqNjfZb" "VaoNZpNZx6" "iiFXT9SAzz" "0UQliwIrri" "TBhzPlX4WZ" "vq1w23Wkl3"
[11] "dEKX1TRsmc" "8NXwpdkBVS" "5K7nL6LaQR" "z4U6NtQIHI" "5QUz00r8mC" "rIEKvws0Lx" "IRhRTWc2NJ" "t3Tlg7KEKX" "h6ossh8O6p" "1KzJcoSXJo"
[21] "vCubfh7Gj7" "9fhltLFrVQ" "gjnjAmYEU1" "NhokoEttqO" "7xH8lQn9Ij"
> verts
 [1] "5344515615" "5344515600" "5344515559" "5344515601" "5716270591" "5344515603" "5344515598" "5344515604" "5344515605" "5344515606"
[11] "5344515562" "5344515564" "5344515607" "5344515617" "5344515608" "5344515609" "5344515597" "5344515610" "5344515611" "5344515560"
[21] "5344515616" "5344515612" "5344515599" "5344515613" "5344515614"
> xy
# A tibble: 25 x 3
      x_    y_ vertex_   
   <dbl> <dbl> <chr>
 1  147. -42.9 5344515559
 2  147. -42.9 5344515560
 3  147. -42.9 5344515562
 4  147. -42.9 5344515564
 5  147. -42.9 5344515597
 6  147. -42.9 5344515598
 7  147. -42.9 5344515599
 8  147. -42.9 5344515600
 9  147. -42.9 5344515601
10  147. -42.9 5344515603
# ... with 15 more rows

Easy! Then what happens if we want to group the coordinates of all ways? (This will be needed if SC is to be master class from which osmdata_sf() is derived, as per #158.)

get_ways <- function (x) # append OSM Way ID to the edge table
{
    way_ids <- x$object %>%
        filter (obj_type == "way") %>%
        pull (object_) %>%
        unique ()
    # alternative:
    # way_ids <- unique (x$object_link_edge$object_)
    links <- x$object_link_edge %>%
        filter (object_ %in% way_ids) %>%
        select (edge_, object_)
    # keep only edges linked to a way, then attach the way ID
    edges <- x$edge %>%
        filter (edge_ %in% links$edge_) %>%
        left_join (links, by = "edge_")
    return (edges)
}

# save the data as xml to compare the above with `sf::read_sf()`
q <- opq ("hobart tasmania") %>%
    add_osm_feature (key = "highway")
osmdata_xml (q, filename = "hobart.osm")

# wrapper functions for benchmarking:
do_sc <- function (doc){
    x <- osmdata_sc (q, doc = doc)
    suppressMessages (get_ways (x))
}
do_sf <- function (doc){
    sf::st_read (doc, layer = "lines", quiet = TRUE)
}

rbenchmark::benchmark (
                       do_sf("hobart.osm"),
                       do_sc("hobart.osm"),
                       replications = 10
)

                 test replications elapsed relative user.self sys.self user.child sys.child
2 do_sc("hobart.osm")           10   0.299    1.000     0.295    0.004          0         0
1 do_sf("hobart.osm")           10   0.402    1.344     0.400    0.000          0         0

Looks good. Full sf conversion would require a bit more messing around, but that's the essence of it, and performance is strong. Try a much bigger example:

q <- opq ("muenster 48155 germany") %>%
    add_osm_feature(key = "highway")
osmdata_xml(q, filename = "ms.osm")
rbenchmark::benchmark (
                       do_sf("ms.osm"),
                       do_sc("ms.osm"),
                       replications = 10
)
             test replications elapsed relative user.self sys.self user.child sys.child
2 do_sc("ms.xml")           10  20.802    1.419    20.721    0.049          0         0
1 do_sf("ms.xml")           10  14.661    1.000    14.558    0.087          0         0

And osmdata_sc doesn't yet scale so well. That could nevertheless lie in inefficiencies in my code here, which should all be improved with #158. I'm happy to close this now - you too @mdsumner?

@mdsumner

yes, but with caveat above that I posted ~20s before you ;)

It's late so I'm out for today

@mpadge
Member Author

mpadge commented Nov 27, 2018

Ah damn ... I can't get object$object_ to be unique without massively inflating the table. Unique values could only be generated by converting it to wide table which would suffer all the problems inherent in sf (for OSM data, that is) - the actual sf data tables are mostly empty and very wasteful. The current form as I've coded it is much more efficient.

There are also "meta" objects - the OSM "relations" - that do not map directly on to edges. They simply contain other objects which are internally referenced in the object table, and these referents are then the things that map on to the object_link_edge table. This comes back to my question of whether object_link_edge might better be something more general: object_link? I'm not convinced either way, but I don't see any way here for my data to satisfy your current sc principles of setdiff(x$object$object_, x$object_link_edge$object_) being empty.

As you clearly realise, what I've tried to do here is effectively sneak in a bit of the sphier vision, and it actually works pretty well thus far, this little glitch notwithstanding.

@mdsumner

I think it's cool, just needs silicate to be more robust - so we can plot and convert and so on. Print seems fine, and that implies a regime of rules about matching that doesn't assume stuff like this. Aand it's late ;)

@mdsumner

mdsumner commented Nov 27, 2018

I am a bit worried about this, the process so far has been to unjoin tables up the chain, so (in my speak) you would have an "instances of the objects" table by going distinct on object_ and creating another table:

library(osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright
x <- opq("hobart") %>% # Just returns the inner city, so is small and workable here
  add_osm_feature(key = "highway") %>%
  osmdata_sc ()



ujd <- unjoin::unjoin(x$object, object_, key_col = "obj_pk")
x1 <- x
x1$object <- ujd$obj_pk
## this step is weird, because downstream plot etc. is not robust to an object that
## has no edges (but we still have object_instance as a record)
x1$object <- x1$object %>% dplyr::filter(object_ %in% x1$object_link_edge$object_)
x1$object_instance <- ujd$data


library(silicate)
#> 
#> Attaching package: 'silicate'
#> The following object is masked from 'package:stats':
#> 
#>     filter
plot(x1)
plot(SC0(x1))

## even this works, cool but useless
plot(anglr::DEL(x1))
#> dropping untriangulatable objects

Created on 2018-11-27 by the reprex package (v0.2.1)

So, x1 is now SC with primary keys from object down, and we have a dangling "what kind of extension object is this ..." question.

I'm not wedded to this, just airing my thoughts. I'm a little worried about #89 which would otherwise have to do something similar every time it converted or plotted, so it begs the question of how we'd keep those extra data at all.

@mpadge
Member Author

mpadge commented Nov 27, 2018

I think your "robustness" is exactly what should be aimed for here, and this ought not be a real biggy. The extra data don't need to be kept at all, they just need to be presumed to be potentially present. In the slight change I made, the object table on SC0 potentially ends up with a bunch of empty topology_ entries that can simply be filtered away in subsequent operations.

@mdsumner

Cool, you are already onto it - that's what I'm in the middle of explaining in a reprex - I'll still add it above so you can see what I'd do. So, the reprex above is 1) and there's also 2)

  1. normalize object and create another instance table (and new class?)
  2. robustify silicate ops so they filter away stuff that doesn't link (as you said)

My concerns still with 2) are that it provides some usability problems, but you're exactly right that these kinds of things will occur anyway so it has to be dealt with.
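The two options can be contrasted on a toy example. This is a base R sketch on made-up data, not the unjoin-based reprex above and not silicate code; the table contents are assumptions, with "r1" standing in for a relation that has no edges of its own.

```r
# Made-up long-form object table: "r1" is a relation with no edges of its own
object <- data.frame (object_ = c ("r1", "w1", "w1", "w2"),
                      key = c ("type", "highway", "name", "highway"),
                      value = c ("multipolygon", "path", "wayOne", "track"),
                      stringsAsFactors = FALSE)
object_link_edge <- data.frame (edge_ = c ("e1", "e2", "e3"),
                                object_ = c ("w1", "w1", "w2"),
                                stringsAsFactors = FALSE)

# Option 1: normalize. One row per object_ in the object table, with the
# repeated key-value rows moved out to a separate object_instance table.
object_instance <- object
object_pk <- data.frame (object_ = unique (object$object_),
                         stringsAsFactors = FALSE)

# Option 2: robustify. Operations drop anything that does not link to edges.
linked <- object_pk [object_pk$object_ %in% object_link_edge$object_, ,
                     drop = FALSE]

# The objects that option 2 silently ignores
unlinked <- setdiff (object$object_, object_link_edge$object_)
unlinked
```

Option 1 changes what is stored; option 2 leaves storage alone and pushes the filtering step into every downstream operation.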

@mdsumner

One more note, I'm happy for you to close this

anglr::DEL is per object at the moment, so the usage above didn't catch much of the space. The way to do that is to triangulate across the entire set of edges in one go (and later we need tools to re-find those edges to link source to mesh):

library(osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright
x <- opq("hobart") %>% # Just returns the inner city, so is small and workable here
  add_osm_feature(key = "highway") %>%
  osmdata_sc ()



ujd <- unjoin::unjoin(x$object, object_, key_col = "obj_pk")
x1 <- x
x1$object <- ujd$obj_pk
## this step is weird, because downstream plot etc. is not robust to an object that
## has no edges (but we still have object_instance as a record)
x1$object <- x1$object %>% dplyr::filter(object_ %in% x1$object_link_edge$object_)
x1$object_instance <- ujd$data

x1$object <- tibble::tibble(object_ = "I am the one true object")
x1$object_link_edge$object_ <- "I am the one true object"
library(silicate)
#> 
#> Attaching package: 'silicate'
#> The following object is masked from 'package:stats':
#> 
#>     filter

plot(anglr::DEL(x1))
(Plot: https://i.imgur.com/IIVMeJF.png)

Created on 2018-11-27 by the reprex package (v0.2.1)

@mpadge
Member Author

mpadge commented Nov 29, 2018

What are your thoughts now on unjoin-ing object into object (SC pure) and object_instance? I can easily implement that if that would help. You could readily play with osmdata_sc in current state if that would help.

@mdsumner

I'm still worried about this, I haven't been able to focus on it since it came up. Thus far I've been constantly assuming that everything in a model belongs and is connected, and so this idea of unlinked entities is disconcerting. I'm worried that "robustifying" becomes an endless tail chasing, where the simplest form of inner_join doesn't work without clearing unlinked things first - but I haven't been able to get a sense of the issue these last few days, about whether it's a problem or just needless worry.

@mpadge
Member Author

mpadge commented Nov 29, 2018

Okay, then I'll push on with things as they are, which means a mucked-up object table, but can easily unjoin down the line. I'm not sure it becomes endless tail chasing, because at least in my sphier vision it's only the object table itself that might not link up. The three "lower" tables remain sacrosanct and so neither can nor should violate SC assumptions. I've gotta do other stuff for the coming week or so, so dev here will slow down just for a bit anyway.

@mdsumner

Ah ok, I need reassurance like this - we good.

@mdsumner

Do these osm data often have points, lines and polygons? That is a natural split: shared vertices, shared edges, sets of objects with shared attributes - not necessarily hierarchical but can be. The obvious next case is objects with no geometry, so it's perfectly on-vision.

@mpadge
Member Author

mpadge commented Nov 30, 2018

They have just three classes of "nodes" (= points, but can also have attributes / features), "ways" = sequences of points whatever they may be (no necessary distinction between lines and polygons, but that can be made if desired), and "relations" = any and all higher-order relationships between sets of ways, points, or combinations.

So in your terms above, "ways" define the edges, and "ways" = shared vertices, and "ways" can also define shared objects because the vertices, as well as the entire ways themselves, can have and share attributes. "relations" are then strict sets of objects with shared attributes. And importantly here, "relations" themselves have no geometry, rather the geometry merely inheres within the objects (to) which they relate.

The two things that affect the current vision of an SC::object table are:

  1. ways have attributes stored as key-value pairs, and so I've currently got
> x$object
object    key        val        obj_type
0001      highway    bicycle    way
0001      oneway     yes        way
0001      surface    asphalt    way
...

  2. relations: x$object %>% filter(obj_type == "relation") introduces even more complexity, because relations have both key-value pairs and members, which I've stored as

> x1$object %>% filter(obj_type == "relation") %>% filter(grepl("way_", key)) 
# A tibble: 67 x 4
   object_ key       val       obj_type
   <chr>   <chr>     <chr>     <chr>
 1 1876611 way_outer 5880406   relation
 2 1876611 way_inner 139433030 relation
 3 1876611 way_inner 483971823 relation
 4 1876611 way_inner 483971824 relation
 5 1918592 way_inner 141607154 relation
 6 1918592 way_outer 141607140 relation
 7 1918592 way_inner 234098812 relation
 8 1918592 way_inner 234098811 relation
 9 1918592 way_inner 291535582 relation
10 1918592 way_inner 291535581 relation
# ... with 57 more rows

In that case, "key" is serving to define an OSM "role", which is an arbitrary string describing the role of the member, but which is often used to define inner and outer components of multipolygon objects (as in this case), with "val" naming the member way.

@mdsumner

mdsumner commented Nov 30, 2018

Aha, so we definitely need to be careful about edges - relations are completely general links between entity tables. But, from a geospatial perspective edges are a necessary decomposition of structured data. I feel like we are reaching for a "relations" concept that's not supported yet, much more like your sphier ideas. Silicate edges are definitely about linking vertices, I don't care about what the fields/columns/attributes are on the vertices, but the vertex table is what "edge" in silicate is about.

I'm also inspired by models where location is the question, we have streams of data where we use location as the solution/s for wads of measured data - the fact that the animal travelled between those locations in an ordered sequence in time is not controversial - where it actually was geographically at/near the nodes is the crux, the measured data is pretty clean but using them as a proxy for long/lat is messy.

Relations, edges in abstract graphs that describe all these things are much more general than SC.

SC can represent a more general graph, no question - but I'm concerned that we are over-reaching a bit here. The vertex table is not nodes, and so maybe that's the missing entity here to bridge to sphier.

@mpadge
Member Author

mpadge commented Dec 10, 2018

@mdsumner Another status request from my side here. This is the only outstanding issue prior to next release and major upgrade to osmdata v0.1. The osmdata_sc() fn will be included because the code is entirely separate and cannot interfere with any current functionality. But we just need a resolution here. Current status still has the single, massive object table which contains "unlinked" entities - that is, entities with links only within that table, not beyond to any other tables.

Solutions:

  1. Leave as is and allow that. I don't think either of us prefer this option;
  2. Create object_instance table as described above, but this doesn't seem right as the higher-level sphier-type stuff does not actually even link back to the object items;
  3. Reduce / normalize object table only to those objects in object_link_edge, and create an additional table as per your comment above - "normalize object and create another instance table (and new class?)"

The last of these would be my current preference, and I presume yours too. I'll describe normalization below for reference here, and that will leave only two primary questions here for you:

  1. What should the new "instance" table be called here? It's a table of hierarchical inter-relationships between object instances, so call it relations?
  2. Do you think this would or should necessitate a new SC class? My preference here would be to say, "No", and have SC always, or at least by default, simply allowing such extension yet ignoring it at all times. SC has and works with its four primary tables; anyone ought be free to add any other tables as desired, safe in the knowledge that silicate will always simply ignore these. As I'm sure you already realise, this would directly allow my entire sphier vision with no modification at all, which would be great for me!

Normalization

Current osmdata_sc() has

  1. Strict SC::vertex table with (x_, y_, vertex_), with unique instances of OSM nodes (refer above comment that "the vertex table is not nodes", as considered further below).
  2. Strict SC::edge table with all edges simply as (.vx0, .vx1, edge_).
  3. Strict SC::object_link_edge table which has no duplication by virtue of OSM schema, so simply has (edge_, object_, native_), where the following always hold:
nrow (object_link_edge) == nrow (edge)
all (object_link_edge$native_)
  4. An object table that has been "normalized" to contain only those objects in object_link_edge$object_, yet can contain multiple entries for each, so, for example:
object_    key     val
000        name    a
000        type    one

The latter is the only potential conflict with current silicate, yet I see this as a necessity to enable efficient (read "sparse") storage. Any comments, suggestions here would be very helpful!
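The conditions listed above can be written as a small set of checks. This is a sketch only: check_sc_tables() is a hypothetical helper, not part of osmdata or silicate, and the list x is made-up data shaped like the four tables described.

```r
# Made-up data shaped like the four tables described above
x <- list (
    vertex = data.frame (x_ = c (0, 1, 1), y_ = c (0, 0, 1),
                         vertex_ = c ("n1", "n2", "n3"),
                         stringsAsFactors = FALSE),
    edge = data.frame (.vx0 = c ("n1", "n2"), .vx1 = c ("n2", "n3"),
                       edge_ = c ("e1", "e2"),
                       stringsAsFactors = FALSE),
    object_link_edge = data.frame (edge_ = c ("e1", "e2"),
                                   object_ = c ("000", "000"),
                                   native_ = c (TRUE, TRUE),
                                   stringsAsFactors = FALSE),
    object = data.frame (object_ = c ("000", "000"),
                         key = c ("name", "type"),
                         val = c ("a", "one"),
                         stringsAsFactors = FALSE)
)

# Hypothetical helper returning one named logical per condition
check_sc_tables <- function (x) {
    c (edges_have_vertices =
           all (c (x$edge$.vx0, x$edge$.vx1) %in% x$vertex$vertex_),
       one_link_per_edge = nrow (x$object_link_edge) == nrow (x$edge),
       all_native = all (x$object_link_edge$native_),
       objects_all_linked =
           all (x$object$object_ %in% x$object_link_edge$object_),
       # the one condition current silicate expects that this layout breaks
       object_ids_unique = !anyDuplicated (x$object$object_))
}

res <- check_sc_tables (x)
res
```

Only the last check fails for this layout, which matches the single point of conflict with current silicate noted above.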

Sub-issue (1): Vertices are not nodes

In OSM, nodes have properties, yet these cannot, or ought not, be considered or stored as "objects" in the above schema, nor are they vertices. These properties thus have to be stored in an additional table. There could be just one additional table called relation or something, yet nodes wouldn't fit comfortably in there. It might thus be better to include node info in its own additional table, nodes, which simply has (vertex_, key, value), where vertex_ will always map directly on to vertex$vertex_.
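A minimal base-R sketch of such a nodes table, again with invented toy values: only vertices that carry OSM tags get rows in nodes, and coordinates are recovered by joining back on vertex_.

```r
# Hypothetical toy data: three vertices, of which only "v2" has tags.
vertex <- data.frame (
    x_ = c (1, 2, 3),
    y_ = c (6, 7, 8),
    vertex_ = c ("v1", "v2", "v3"),
    stringsAsFactors = FALSE
)
nodes <- data.frame (
    vertex_ = c ("v2", "v2"),
    key = c ("highway", "name"),
    value = c ("crossing", "nodeOne"),
    stringsAsFactors = FALSE
)

# Every nodes$vertex_ must map directly on to vertex$vertex_
stopifnot (all (nodes$vertex_ %in% vertex$vertex_))

# Attach coordinates to the node properties via an ordinary join
tagged <- merge (nodes, vertex, by = "vertex_")
```

Untagged vertices ("v1" and "v3" here) never appear in nodes, so the vertex table itself stays strictly (x_, y_, vertex_).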

Sub-issue (2): Relations table

Putting nodes in their own table would enable a relations table to be added that held strict relations in the sense of both OSM and sphier; that is, relations between any and all kinds of lower or higher-level entities. Linkage beyond this table (that is, non-internal linkage) would be either to the SC objects (in the object and object_link_edge tables), or nodes. Relations themselves have two types of entries:

  1. member-role; and
  2. key-value

member-role entries include, for example, multipolygon inner / outer specifications, and are the fundamental entities that specify the relation, while key-value pairs pertain to the entire relation, and are not associated with the member-role entries at all. (That is, a given key-value specification pertains to the relation and does not specify key-value properties of any member entities.) This all suggests that relations might be best specified by two tables:

  1. relation_members with (relation, member, member_type, role) entries; and
  2. relation_properties with (relation, key, value) entries.

The relation_members$member_type entries could then be either object, in which case relation_members$member would map directly on to object_link_edge$object_, or node, in which case relation_members$member would map on to vertex$vertex_ and potentially to nodes$vertex_ (where such entries existed).
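The two relation tables and the type-dependent mapping could be sketched in base R as follows. The relation, its roles, and all IDs are invented for illustration, and object_ids and vertex_ids are hypothetical stand-ins for object_link_edge$object_ and vertex$vertex_.

```r
# Hypothetical relation "r1" with two way members and one node member.
relation_members <- data.frame (
    relation = c ("r1", "r1", "r1"),
    member = c ("000", "001", "v2"),
    member_type = c ("object", "object", "node"),
    role = c ("outer", "inner", "label"),
    stringsAsFactors = FALSE
)
# Key-value pairs pertain to the whole relation, not to its members.
relation_properties <- data.frame (
    relation = c ("r1", "r1"),
    key = c ("type", "name"),
    value = c ("boundary", "relOne"),
    stringsAsFactors = FALSE
)

object_ids <- c ("000", "001") # stands in for object_link_edge$object_
vertex_ids <- c ("v1", "v2", "v3") # stands in for vertex$vertex_

# Resolve each member against the table implied by its member_type
is_obj <- relation_members$member_type == "object"
is_node <- relation_members$member_type == "node"
stopifnot (all (relation_members$member [is_obj] %in% object_ids))
stopifnot (all (relation_members$member [is_node] %in% vertex_ids))
stopifnot (all (relation_properties$relation %in% relation_members$relation))

# Roles of the object members of "r1"
r1_roles <- relation_members$role [is_obj & relation_members$relation == "r1"]
```

Keeping member_type as an explicit column is what lets a single member column point at two different tables without ambiguity.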

As long as silicate simply ignores these tables, then the remainder would behave entirely as expected within all silicate operations. Thoughts please!

@mdsumner

I like the idea of relations and other extra tables, and I agree they can just be there. I think you are right about multiple instances in object, that also makes sense - and silicate should be robust to that.

I am still inclined to make silicate robust to dangles hanging around elsewhere, i.e. unlinked vertices or edges - I think they will occur, and could even be useful for implicit data or data that exists elsewhere. But, it's not a good starting point to allow anything.

Sub (1)

The nodes thing is tricky, but I think it's fair to assume that normal normalization is planar, and so the common assumption that these are nodes in x/y (even if those are implicit) is fair enough. I've only ever used different assumptions with triangles (break the mesh for discrete polygons that aren't neighbours in z), or for track data (unique in x, y and so other data goes on a link table - time, z, temperature whatever - this is a general concept of normalizing geometry but is rightly a specializing extension).

Sub (2)

I think I understand, I'm happy with your points overall.

Finally, I haven't moved on this just because of other tricky work that's rather pressing but I should be able to do more before xmas.

@mpadge mpadge closed this as completed in 8123abe Dec 13, 2018
@mpadge
Member Author

mpadge commented Dec 13, 2018

@mdsumner In the hope that we can judge how close we're getting by the scope of the questions, this one is tiny: The above commit implements what I described above, and ends with

attr(,"join_ramp")
[1] "nodes"               "relation_members"    "relation_properties"
[4] "object"              "object_link_edge"    "edge"               
[7] "vertex"

I admit to no current understanding at all of what this join_ramp is intended to achieve. Do you see this as acceptable? Advice please.

@mdsumner

Yes all good, join_ramp is best ignored
