Persist node and edge indexes in partition header #613

oshofmann · 2021-10-11T22:40:52Z

This takes Nirhjar's work to add index persistence to RDGPartHeader and:

Rebase and resolve conflicts on current head
Tweaks for some shorter names, better types (don't return or pass non-const references), and data flow
Adds what I believe are the necessary bits to to_json/from_json to actually persist the index list

tylershunt

LGTM

arekay

LGTM.

arekay · 2021-10-13T18:28:34Z

libgalois/src/PropertyGraph.cpp

  }
+
+  KATANA_CHECKED(property_graph->RecreatePropertyIndexes());


nit: Do we want to put this behind a config option - might make it easier to debug/test.

That sounds like a great idea - what do we use for that kind of config?

Thats a great question - currently we have at least a couple of option classes. But that approach is not scalable as it requires the object to be passed around. I guess we can leave it for now and then revisit once we have better config management in place. @witchel might have better insight into this.

The two main option structures are BlockedReadOptions and RDGLoadOptions. The latter is passed to PropertyGraph for one of the Make routines. While I dream of cleaning up these classes, I have no current plan, so if you want to add a field to RDGLoadOptions that would probably be ok.

I did intend to ask @ddn0 if there is a reason to avoid a per-thread ambient options structure, like we have for the ProgressTracer object. I think Galois used to have this for stats, but something about it was bad.

I actually prefer explicitly passing things around over global variables (or thread-local variables).

(old) Galois stats are bad because it predates and does not interop with tracing (and future metrics).

I think tracing, logging and metrics is a little different than configuration for loading files. In the former, the policy and behavior is supposed to orthogonal to the code being instrumented.

arekay · 2021-10-13T18:29:59Z

libgalois/src/PropertyGraph.cpp

+  for (const std::string& column_name : rdg_.node_property_index_columns()) {
+    KATANA_CHECKED(MakeNodeIndex(column_name));
+  }
+
+  for (const std::string& column_name : rdg_.edge_property_index_columns()) {
+    KATANA_CHECKED(MakeEdgeIndex(column_name));
+  }
+
+  return katana::ResultSuccess();
+}


Do we want to check the property type to ensure that we create indexes for types supported.

I think that MakeNodeIndex should do the right thing and return an error for an unsupported type (if not it's a bug). The larger question is what to do when that happens, or some other error in index-building. Should we try to create all of the indexes possible and proceed, or treat any failure as a failure in Make? At this point I think I'd prefer the former since a problem will manifest as a clear failure to load the graph, rather than just future slowness. Thoughts?

I encountered some issues when i was testing the code. For instance, a column of type int32_t would cause the indexing code to crash - which we don't want. But there might be cases where we would like it to crash - for instance specifying an index for a column that doesn't exist. I don't have any concrete suggestions for here - probably some points to think about.

Crash or return an error? If it's crashing then that's bad, and I'd like to fix that. But returning an error here seems like the right thing to do when asking to index a column for an as-yet-unsupported data type, or a column that does not exist.

Yeah, we can leave it as return an error for now, given that we should be checking for validity when the user creates an index (or we do).

witchel

Looking good.

witchel · 2021-10-13T20:40:23Z

libgalois/include/katana/PropertyGraph.h

@@ -74,6 +74,9 @@ class KATANA_EXPORT PropertyGraph {
  Result<void> WriteView(
      const std::string& uri, const std::string& command_line);

+  // Recreate indexes from json


Indexes or indices?

More substantively, our serialization format for indexes is json??

I've been using indexes, both are grammatically correct. The top search result suggests 'indices' are used in the mathematical sense, while 'indexes' for e.g. written documents, and I think this use is from the idea of a written index of a document. As an added bonus, text search for 'index' will find 'indexes'.

Currently we're not persisting anything about the contents of the index, just the metadata - list of columns indexed. If we have an index which has some representation in storage that helps build it faster than just rebuilding from the property, then yes json would be silly.

I'd maybe say index names from json in the comment.

libgalois/include/katana/PropertyGraph.h

witchel · 2021-10-13T20:46:02Z

libgalois/src/PropertyGraph.cpp

+katana::PropertyGraph::RecreatePropertyIndexes() {
+  for (const std::string& column_name : rdg_.node_property_index_columns()) {
+    KATANA_CHECKED(MakeNodeIndex(column_name));
+  }


Are we creating property indices for all properties, or only properties loaded into memory?

Currently a request to make an index only looks at the PropertyGraph, so will fail unless the property is loaded into memory. I had intended to defer any questions of in/out-of-core properties, however it would be simple here to only try to make the index if the column is loaded (otherwise this will fail the entire graph load) - I'll make that change.

There will still exist the bugaboo which sounds like something you were describing a while ago - we load indexes only for in-core properties, then only persist those, losing indexes for out-of-core properties. That I would like to defer for this change - having a basic index flow in place, as well as me understanding more, will help do that correctly in the future.

Yes, the in-core/out-of-core logic does get tricky.

I like your plan--only make the index if the column is loaded.

For the grand and glorious future, you can look at RDGPartHeader.h for the state machine for properties. We also have full_node|edge_schema and loaded_node|edge_schema, so you do have access to all properties all the time, even if they aren't loaded.

witchel · 2021-10-13T20:47:57Z

libgalois/src/PropertyGraph.cpp

+  for (const std::string& column_name : rdg_.node_property_index_columns()) {
+    KATANA_CHECKED(MakeNodeIndex(column_name));
+  }
+
+  for (const std::string& column_name : rdg_.edge_property_index_columns()) {
+    KATANA_CHECKED(MakeEdgeIndex(column_name));
+  }
+
+  return katana::ResultSuccess();
+}


Yeah, we can leave it as return an error for now, given that we should be checking for validity when the user creates an index (or we do).

witchel · 2021-10-13T21:12:30Z

libgalois/test/property-index.cpp

 template <>
 katana::Result<katana::PropertyIndex<katana::GraphTopology::Node>*>


You need these empty template declarations to make the NodeOrEdge trick work? They can't just be overloads?

I think these are required for template specializations like these, unless I'm mistaken. I think it would be hard to do this with normal overloading, especially since it's.just the return type. I'm totally open to something else, this was just the limit of my imagination.

witchel · 2021-10-13T21:16:11Z

libgalois/test/property-index.cpp

+
+  katana::Result<std::unique_ptr<katana::PropertyGraph>> make_result =
+      katana::PropertyGraph::Make(rdg_dir, tsuba::RDGLoadOptions());
+  boost::filesystem::remove_all(rdg_dir);


I'm not crazy about our dependence on boost::filesystem, but I've used remove_all. So nice.

Fair, I'll also say that in tests I tend to be much more permissive for my own coding style and grabbing arbitrary dependencies to just get the job done.

libtsuba/include/tsuba/RDG.h

witchel · 2021-10-13T21:17:55Z

libtsuba/src/RDGPartHeader.cpp

+const char* kNodePropertyIndexColumnsKey = "kg.v1.node_property_index_columns";
+const char* kEdgePropertyIndexColumnsKey = "kg.v1.edge_property_index_columns";


Hey @tylershunt, given that these key names seem to have no bearing on the RDG version, is it wise to keep calling them v1?

These are versions of the key namespace. I don't see a reason to change them at this time, we're just adding keys.

Everything is an IDspace to you these days. It makes me feel a bit better about it though.

nirjharm and others added 4 commits October 11, 2021 22:34

Nirjhar's index persistence work

44d14d0

Update names and data model for persistence.

28bd5ec

Add index columns to to_json and from_json

c848444

Comment tweaks.

a6fcbe6

oshofmann requested review from arekay and tylershunt October 11, 2021 22:40

tylershunt approved these changes Oct 12, 2021

View reviewed changes

Add a test for writing/loading a graph with indexes

733d8d0

oshofmann marked this pull request as ready for review October 12, 2021 16:55

Use KATANA_CHECKED

d62edf7

arekay approved these changes Oct 13, 2021

View reviewed changes

witchel reviewed Oct 13, 2021

View reviewed changes

Owen Hofmann added 3 commits October 13, 2021 22:02

Slashy slashy

a85f245

Only make indexes for loaded properties. Further comment massage.

ff38a7e

Devalue rvalues

679d473

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Persist node and edge indexes in partition header #613

Persist node and edge indexes in partition header #613

oshofmann commented Oct 11, 2021

tylershunt left a comment

arekay left a comment

arekay Oct 13, 2021

oshofmann Oct 13, 2021

arekay Oct 13, 2021

witchel Oct 13, 2021

ddn0 Oct 13, 2021

arekay Oct 13, 2021

oshofmann Oct 13, 2021

arekay Oct 13, 2021

oshofmann Oct 13, 2021

witchel Oct 13, 2021

witchel left a comment

witchel Oct 13, 2021

oshofmann Oct 13, 2021

witchel Oct 14, 2021

witchel Oct 13, 2021

oshofmann Oct 13, 2021

witchel Oct 14, 2021

witchel Oct 13, 2021

witchel Oct 13, 2021

oshofmann Oct 13, 2021

witchel Oct 13, 2021

oshofmann Oct 13, 2021

witchel Oct 13, 2021

tylershunt Oct 13, 2021

witchel Oct 14, 2021

		template <>
		katana::Result<katana::PropertyIndex<katana::GraphTopology::Node>*>

		const char* kNodePropertyIndexColumnsKey = "kg.v1.node_property_index_columns";
		const char* kEdgePropertyIndexColumnsKey = "kg.v1.edge_property_index_columns";

Persist node and edge indexes in partition header #613

Are you sure you want to change the base?

Persist node and edge indexes in partition header #613

Conversation

oshofmann commented Oct 11, 2021

tylershunt left a comment

Choose a reason for hiding this comment

arekay left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

witchel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment