Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add clustering functions to C API #1154

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

dbaston
Copy link
Member

@dbaston dbaston commented Aug 31, 2024

This PR resurrects the C API that was removed from #688 and changes it to return an array of cluster IDs instead of a GeometryCollection, as suggested in #823.

This is clearly a more flexible API than the one implemented in e464b79, but it becomes somewhat awkward to construct geometries representing the clusters (see function construct_clusters in the test suite).

I am a bit averse to adding new types to the C API, but I am wondering if something like the following would be an improvement.

struct GEOSClusterInfo;

// return cluster info or NULL on error
GEOSClusterInfo* GEOSClusterDBSCAN(const GEOSGeometry* g, double eps, unsigned minPoints);

unsigned GEOSClusterInfo_getNumClusters(const GEOSClusterInfo* clusters);
unsigned GEOSClusterInfo_getClusterSize(const GEOSClusterInfo* clusters, unsigned i);

// Return an array with the input indices associated with the cluster i
unsigned* GEOSClusterInfo_getIndicesForCluster(const GEOSClusterInfo* clusters, unsigned i);

// Return an array with the cluster id associated with each input index
// Input indices not associated with a cluster will have a cluster id of GEOS_CLUSTER_NONE
unsigned* GEOSClusterInfo_getClusterForIndices(const GEOSClusterInfo*);

void GEOSClusterInfo_destroy(GEOSClusterInfo*);


capi/geos_c.h.in Outdated Show resolved Hide resolved
capi/geos_c.h.in Outdated Show resolved Hide resolved
capi/geos_c.h.in Outdated Show resolved Hide resolved
capi/geos_c.h.in Outdated Show resolved Hide resolved
capi/geos_c.h.in Outdated Show resolved Hide resolved
@dbaston dbaston added Feedback Requested Enhancement New feature or feature improvement. labels Aug 31, 2024
@dr-jts
Copy link
Contributor

dr-jts commented Sep 3, 2024

A suggestion for naming of a couple of the methods, with clarifying comments:

// Return an array with the input indices associated with the cluster i
// Size of array is given by GEOSClusterInfo_getClusterSize
unsigned* GEOSClusterInfo_getClusterElements(const GEOSClusterInfo* clusters, unsigned i);

// Return an array with the cluster id associated with each input index
// Size of array is equal to the number of elements in the input geometry
// Input indices not associated with a cluster have a cluster id of GEOS_CLUSTER_NONE
unsigned* GEOSClusterInfo_getClusterIds(const GEOSClusterInfo* clusters);

@dr-jts
Copy link
Contributor

dr-jts commented Sep 3, 2024

Rather than adding another type to the API, perhaps a function similar to construct_clusters could be provided?

That has the downside that all clusters have to be materialized at once, rather than individually. But perhaps this is fine for the anticipated use cases?

@pramsey
Copy link
Member

pramsey commented Sep 3, 2024

I'm OK with either approach, but I think the extra functionality of the added object is pretty good, and it's supporting 4 different functions, so it's not like it's an extra "junk object" or something.

@dbaston dbaston added this to the 3.14.0 milestone Sep 4, 2024
@dbaston
Copy link
Member Author

dbaston commented Sep 4, 2024

Thanks for the feedback. I've implemented a new version with the custom type, but I don't want to rush it out before Friday. I'm OK leaving this until 3.14.

@dr-jts
Copy link
Contributor

dr-jts commented Sep 14, 2024

There is at least one other non-clustering algorithm that could benefit from this approach - GEOSLIneMerge (and GEOSLineMergeDirected). Returning ids of merged linestrings will support creating a PostGIS ST_LineMergeWin window function, which allows maintaining attribution for merged lines.

Given this, is it worth renaming the API so that it's more general than just clustering? The term Group could be used, giving:

struct GEOSGroupInfo;

// return cluster info or NULL on error
GEOSGroupInfo* GEOSClusterDBSCAN(const GEOSGeometry* g, double eps, unsigned minPoints);

unsigned GEOSGroupInfo_getNumGroups(const GEOSGroupInfo* groups);
unsigned GEOSGroupInfo_getGroupSize(const GEOSGroupInfo* groups, unsigned groupId);

// Return an array with the input indices associated with the group groupId
// Size of array is given by GEOSGroupInfo_getGroupSize
unsigned* GEOSGroupInfo_getGroupItems(const GEOSGroupInfo* groups, unsigned groupId);

// Return an array with the group id associated with each input index
// Size of array is equal to the number of elements in the input geometry
// Input indices not associated with a group have a cluster id of GEOS_GROUP_NONE
unsigned* GEOSGroupInfo_getGroupIds(const GEOSGroupInfo* groups);

void GEOSGroupInfo_destroy(GEOSGroupInfo*);

Or, LineMerge can be though of as a kind of clustering, so an alternative design choice is to use this API and name the GEOS C API function appropriately (say GEOSClusterLineMerge).

@dbaston
Copy link
Member Author

dbaston commented Sep 23, 2024

Since we wouldn't want to change the existing signatures of GEOSLineMerge, I think adding GEOSClusterLineMerge makes sense, but to do so the LineMerger it would need to create a Clusters object that provides the parentage information. This isn't something that LineMerger tracks, but from a quick look it doesn't seem like it should be difficult. The Clusters class would also need a new constructor to take the parentage from the LineMerger. I think it could be done as a separate PR.

@dbaston dbaston force-pushed the cluster-capi-2 branch 3 times, most recently from 4c3109e to a26df82 Compare September 23, 2024 15:14
extern size_t GEOS_DLL* GEOSClusterInfo_getClustersForInputs_r(GEOSContextHandle_t, const GEOSClusterInfo* clusters);

/** \see GEOSClusterInfo_getInputsForClusterN */
extern const size_t GEOS_DLL* GEOSClusterInfo_getInputsForClusterN_r(GEOSContextHandle_t, const GEOSClusterInfo* clusters, size_t i);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we are returning const size_t* here. This avoids a memory allocation and array copy for each cluster, but constrains our implementation of GEOSClusterInfo to actually store these values in a contiguous block somewhere. I think that's a reasonable tradeoff.

@dbaston dbaston changed the title WIP: Add clustering functions to C API Add clustering functions to C API Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement New feature or feature improvement. Feedback Requested
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants