Persist node and edge indexes in partition header #613

Open
wants to merge 9 commits into
base: master
7 changes: 6 additions & 1 deletion libgalois/include/katana/PropertyGraph.h
@@ -74,6 +74,9 @@ class KATANA_EXPORT PropertyGraph {
Result<void> WriteView(
const std::string& uri, const std::string& command_line);

/// Recreate indexes from column names in RDG metadata.
katana::Result<void> RecreatePropertyIndexes();

tsuba::RDG rdg_;
std::unique_ptr<tsuba::RDGFile> file_;
GraphTopology topology_;
@@ -88,9 +91,11 @@ class KATANA_EXPORT PropertyGraph {
/// The edge EntityTypeID for each edge's most specific type
EntityTypeIDArray edge_entity_type_ids_;

// List of node and edge indexes on this graph.
/// List of node indexes on this graph.
std::vector<std::unique_ptr<PropertyIndex<GraphTopology::Node>>>
node_indexes_;

/// List of edge indexes on this graph.
std::vector<std::unique_ptr<PropertyIndex<GraphTopology::Edge>>>
edge_indexes_;

44 changes: 39 additions & 5 deletions libgalois/src/PropertyGraph.cpp
@@ -219,6 +219,8 @@ katana::PropertyGraph::Make(
katana::GraphTopology topo =
KATANA_CHECKED(MapTopology(rdg.topology_file_storage()));

std::unique_ptr<katana::PropertyGraph> property_graph;

if (rdg.IsEntityTypeIDsOutsideProperties()) {
KATANA_LOG_DEBUG("loading EntityType data from outside properties");

@@ -236,24 +238,26 @@ katana::PropertyGraph::Make(
EntityTypeManager edge_type_manager =
KATANA_CHECKED(rdg.edge_entity_type_manager());

return std::make_unique<PropertyGraph>(
property_graph = std::make_unique<PropertyGraph>(
std::move(rdg_file), std::move(rdg), std::move(topo),
std::move(node_type_ids), std::move(edge_type_ids),
std::move(node_type_manager), std::move(edge_type_manager));

} else {
// we must construct id_arrays and managers from properties

auto pg = std::make_unique<PropertyGraph>(
property_graph = std::make_unique<PropertyGraph>(
std::move(rdg_file), std::move(rdg), std::move(topo),
MakeDefaultEntityTypeIDArray(topo.num_nodes()),
MakeDefaultEntityTypeIDArray(topo.num_edges()), EntityTypeManager{},
EntityTypeManager{});

KATANA_CHECKED(pg->ConstructEntityTypeIDs());

return MakeResult(std::move(pg));
KATANA_CHECKED(property_graph->ConstructEntityTypeIDs());
}

KATANA_CHECKED(property_graph->RecreatePropertyIndexes());
Contributor

nit: Do we want to put this behind a config option - might make it easier to debug/test.

Contributor Author

That sounds like a great idea - what do we use for that kind of config?

Contributor

That's a great question - currently we have at least a couple of option classes. But that approach is not scalable as it requires the object to be passed around. I guess we can leave it for now and then revisit once we have better config management in place. @witchel might have better insight into this.

Contributor

The two main option structures are BlockedReadOptions and RDGLoadOptions. The latter is passed to PropertyGraph for one of the Make routines. While I dream of cleaning up these classes, I have no current plan, so if you want to add a field to RDGLoadOptions that would probably be ok.

I did intend to ask @ddn0 if there is a reason to avoid a per-thread ambient options structure, like we have for the ProgressTracer object. I think Galois used to have this for stats, but something about it was bad.

Contributor

I actually prefer explicitly passing things around over global variables (or thread-local variables).

(old) Galois stats are bad because they predate and do not interoperate with tracing (and future metrics).

I think tracing, logging and metrics are a little different from configuration for loading files. In the former, the policy and behavior are supposed to be orthogonal to the code being instrumented.
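
For reference, the "add a field to RDGLoadOptions" suggestion could look roughly like the sketch below. This is only an illustration under stated assumptions: `LoadOptionsSketch`, `GraphSketch`, `MakeGraphSketch` and the field name `recreate_property_indexes` are hypothetical stand-ins, not names from tsuba or katana. It keeps the option explicitly passed, as preferred above, rather than ambient.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for tsuba::RDGLoadOptions. The real struct has other
// fields; `recreate_property_indexes` is an assumed name, not an existing one.
struct LoadOptionsSketch {
  bool recreate_property_indexes = true;
};

// Hypothetical stand-in for the graph. It only tracks which persisted index
// columns were rebuilt on load.
struct GraphSketch {
  std::vector<std::string> persisted_index_columns;
  std::vector<std::string> rebuilt_indexes;

  void RecreatePropertyIndexes() { rebuilt_indexes = persisted_index_columns; }
};

// Mirrors the tail of PropertyGraph::Make: rebuild indexes only when the
// options passed by the caller ask for it.
GraphSketch
MakeGraphSketch(
    std::vector<std::string> persisted_index_columns,
    const LoadOptionsSketch& opts) {
  GraphSketch g;
  g.persisted_index_columns = std::move(persisted_index_columns);
  if (opts.recreate_property_indexes) {
    g.RecreatePropertyIndexes();
  }
  return g;
}
```

Defaulting the flag to true would keep the behavior of this PR for existing callers, while a test could switch it off to load a graph without rebuilding anything.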


return MakeResult(std::move(property_graph));
}

katana::Result<std::unique_ptr<katana::PropertyGraph>>
@@ -468,6 +472,19 @@ katana::PropertyGraph::DoWrite(
? KATANA_CHECKED(WriteEntityTypeIDsArray(edge_entity_type_ids_))
: nullptr;

// Update lists of node and edge index columns.
std::vector<std::string> node_index_columns(node_indexes_.size());
std::transform(
node_indexes_.begin(), node_indexes_.end(), node_index_columns.begin(),
[](const auto& index) { return index->column_name(); });
rdg_.set_node_property_index_columns(node_index_columns);

std::vector<std::string> edge_index_columns(edge_indexes_.size());
std::transform(
edge_indexes_.begin(), edge_indexes_.end(), edge_index_columns.begin(),
[](const auto& index) { return index->column_name(); });
rdg_.set_edge_property_index_columns(edge_index_columns);

return rdg_.Store(
handle, command_line, versioning_action, std::move(topology_res),
std::move(node_entity_type_id_array_res),
@@ -1289,3 +1306,20 @@ katana::PropertyGraph::GetNodePropertyIndex(
}
return KATANA_ERROR(katana::ErrorCode::NotFound, "node index not found");
}

katana::Result<void>
katana::PropertyGraph::RecreatePropertyIndexes() {
for (const std::string& column_name : rdg_.node_property_index_columns()) {
if (HasNodeProperty(column_name)) {
KATANA_CHECKED(MakeNodeIndex(column_name));
}
}

for (const std::string& column_name : rdg_.edge_property_index_columns()) {
if (HasEdgeProperty(column_name)) {
KATANA_CHECKED(MakeEdgeIndex(column_name));
}
}

return katana::ResultSuccess();
}
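
The write and load halves of this change can be read together as a round trip: DoWrite saves only the indexed column names, and RecreatePropertyIndexes rebuilds an index for each saved name that still has a property column. A minimal sketch of that flow, assuming hypothetical stand-ins (`IndexSketch`, `CollectIndexColumns`, `RecreateIndexes`) in place of the real PropertyIndex and PropertyGraph types:

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for PropertyIndex: only the column name matters here.
struct IndexSketch {
  std::string column_name;
};

// Write side: gather the column name of every live index, as DoWrite does
// before handing the list to the RDG.
std::vector<std::string>
CollectIndexColumns(const std::vector<std::unique_ptr<IndexSketch>>& indexes) {
  std::vector<std::string> columns(indexes.size());
  std::transform(
      indexes.begin(), indexes.end(), columns.begin(),
      [](const auto& index) { return index->column_name; });
  return columns;
}

// Load side: rebuild an index for each persisted name, skipping names whose
// property column is no longer present.
std::vector<std::unique_ptr<IndexSketch>>
RecreateIndexes(
    const std::vector<std::string>& persisted_columns,
    const std::set<std::string>& existing_properties) {
  std::vector<std::unique_ptr<IndexSketch>> indexes;
  for (const std::string& name : persisted_columns) {
    if (existing_properties.count(name) > 0) {
      indexes.push_back(std::make_unique<IndexSketch>(IndexSketch{name}));
    }
  }
  return indexes;
}
```

Skipping a missing column silently matches the HasNodeProperty and HasEdgeProperty checks in the diff; whether a stale name should instead be reported is a separate design question.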
121 changes: 100 additions & 21 deletions libgalois/test/property-index.cpp
@@ -1,6 +1,7 @@
#include <arrow/api.h>
#include <arrow/type.h>
#include <arrow/type_traits.h>
#include <boost/filesystem.hpp>

#include "TestTypedPropertyGraph.h"
#include "katana/Logging.h"
@@ -11,8 +12,12 @@ template <typename node_or_edge>
struct NodeOrEdge {
static katana::Result<katana::PropertyIndex<node_or_edge>*> MakeIndex(
katana::PropertyGraph* pg, const std::string& column_name);
static katana::Result<katana::PropertyIndex<node_or_edge>*> GetIndex(
katana::PropertyGraph* pg, const std::string& column_name);
static katana::Result<void> AddProperties(
katana::PropertyGraph* pg, std::shared_ptr<arrow::Table> properties);
static std::shared_ptr<arrow::Array> GetProperty(
katana::PropertyGraph* pg, const std::string& column_name);
static size_t num_entities(katana::PropertyGraph* pg);
};

@@ -21,12 +26,7 @@ using Edge = NodeOrEdge<katana::GraphTopology::Edge>;

template <>
katana::Result<katana::PropertyIndex<katana::GraphTopology::Node>*>
Comment on lines 27 to 28
Contributor

You need these empty template declarations to make the NodeOrEdge trick work? They can't just be overloads?

Contributor Author

I think these are required for template specializations like these, unless I'm mistaken. I think it would be hard to do this with normal overloading, especially since it's just the return type. I'm totally open to something else, this was just the limit of my imagination.
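
To make the trade-off in this thread concrete, here is a small sketch of the pattern and two possible alternatives. All names (`NodeTag`, `EdgeTag`, `NodeOrEdgeSketch`, `KindViaIfConstexpr`, `KindViaOverload`) are hypothetical and not taken from the test file. Explicit specializations of a static member do need the `template <>` prefix; plain overloads only work if something in the parameter list differs, for example a tag argument.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical tag types standing in for GraphTopology::Node and ::Edge.
struct NodeTag {};
struct EdgeTag {};

// Pattern used in the test: the primary template only declares the static
// member, and each explicit specialization supplies a definition.
template <typename T>
struct NodeOrEdgeSketch {
  static std::string Kind();
};

template <>
std::string
NodeOrEdgeSketch<NodeTag>::Kind() {
  return "node";
}

template <>
std::string
NodeOrEdgeSketch<EdgeTag>::Kind() {
  return "edge";
}

// Alternative 1: a single function template that branches at compile time.
template <typename T>
std::string
KindViaIfConstexpr() {
  if constexpr (std::is_same_v<T, NodeTag>) {
    return "node";
  } else {
    static_assert(std::is_same_v<T, EdgeTag>, "unsupported tag");
    return "edge";
  }
}

// Alternative 2: ordinary overloads, distinguished by a tag parameter rather
// than by return type.
inline std::string
KindViaOverload(NodeTag) {
  return "node";
}

inline std::string
KindViaOverload(EdgeTag) {
  return "edge";
}
```

With the tag-parameter form the call site becomes `KindViaOverload(NodeTag{})`, so the choice is mostly about which call syntax reads better in the templated tests.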

Node::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) {
auto result = pg->MakeNodeIndex(column_name);
if (!result) {
return result.error();
}

Node::GetIndex(katana::PropertyGraph* pg, const std::string& column_name) {
for (const auto& index : pg->node_indexes()) {
if (index->column_name() == column_name) {
return index.get();
@@ -37,13 +37,15 @@ Node::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) {
}

template <>
katana::Result<katana::PropertyIndex<katana::GraphTopology::Edge>*>
Edge::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) {
auto result = pg->MakeEdgeIndex(column_name);
if (!result) {
return result.error();
}
katana::Result<katana::PropertyIndex<katana::GraphTopology::Node>*>
Node::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) {
KATANA_CHECKED(pg->MakeNodeIndex(column_name));
return Node::GetIndex(pg, column_name);
}

template <>
katana::Result<katana::PropertyIndex<katana::GraphTopology::Edge>*>
Edge::GetIndex(katana::PropertyGraph* pg, const std::string& column_name) {
for (const auto& index : pg->edge_indexes()) {
if (index->column_name() == column_name) {
return index.get();
@@ -53,6 +55,13 @@ Edge::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) {
return KATANA_ERROR(katana::ErrorCode::NotFound, "Created index not found");
}

template <>
katana::Result<katana::PropertyIndex<katana::GraphTopology::Edge>*>
Edge::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) {
KATANA_CHECKED(pg->MakeEdgeIndex(column_name));
return Edge::GetIndex(pg, column_name);
}

template <>
size_t
Node::num_entities(katana::PropertyGraph* pg) {
@@ -79,6 +88,22 @@ Edge::AddProperties(
return pg->AddEdgeProperties(properties);
}

template <>
std::shared_ptr<arrow::Array>
Node::GetProperty(katana::PropertyGraph* pg, const std::string& column_name) {
auto prop_result = pg->GetNodeProperty(column_name);
KATANA_LOG_ASSERT(prop_result);
return prop_result.value()->chunk(0);
}

template <>
std::shared_ptr<arrow::Array>
Edge::GetProperty(katana::PropertyGraph* pg, const std::string& column_name) {
auto prop_result = pg->GetEdgeProperty(column_name);
KATANA_LOG_ASSERT(prop_result);
return prop_result.value()->chunk(0);
}

template <typename c_type>
std::shared_ptr<arrow::Table>
CreatePrimitiveProperty(
@@ -200,11 +225,8 @@ TestPrimitiveIndex(size_t num_nodes, size_t line_width) {
}

template <typename node_or_edge>
void
TestStringIndex(size_t num_nodes, size_t line_width) {
using IndexType = katana::StringPropertyIndex<node_or_edge>;
using ArrayType = arrow::LargeStringArray;

std::unique_ptr<katana::PropertyGraph>
MakeStringGraph(size_t num_nodes, size_t line_width) {
LinePolicy policy{line_width};

std::unique_ptr<katana::PropertyGraph> g =
@@ -230,6 +252,32 @@ TestStringIndex(size_t num_nodes, size_t line_width) {
nonuniform_index_result, "Could not create index: {}",
nonuniform_index_result.error());

return g;
}

template <typename node_or_edge>
std::unique_ptr<katana::PropertyGraph>
TestStringIndex(
std::unique_ptr<katana::PropertyGraph> g, size_t num_nodes,
size_t line_width) {
using IndexType = katana::StringPropertyIndex<node_or_edge>;
using ArrayType = arrow::LargeStringArray;

if (!g) {
g = MakeStringGraph<node_or_edge>(num_nodes, line_width);
}

auto uniform_index_result =
NodeOrEdge<node_or_edge>::GetIndex(g.get(), "uniform");
KATANA_LOG_VASSERT(
uniform_index_result, "Could not get index: {}",
uniform_index_result.error());
auto nonuniform_index_result =
NodeOrEdge<node_or_edge>::GetIndex(g.get(), "nonuniform");
KATANA_LOG_VASSERT(
nonuniform_index_result, "Could not get index: {}",
nonuniform_index_result.error());

auto* uniform_index = static_cast<IndexType*>(uniform_index_result.value());
auto* nonuniform_index =
static_cast<IndexType*>(nonuniform_index_result.value());
@@ -253,8 +301,8 @@ TestStringIndex(size_t num_nodes, size_t line_width) {
}

// The non-uniform index starts at "aaaa" and increases by 2.
auto typed_prop =
std::static_pointer_cast<ArrayType>(nonuniform_prop->column(0)->chunk(0));
auto typed_prop = std::static_pointer_cast<ArrayType>(
NodeOrEdge<node_or_edge>::GetProperty(g.get(), "nonuniform"));
it = nonuniform_index->Find("aaaj");
KATANA_LOG_ASSERT(it == nonuniform_index->end());
it = nonuniform_index->LowerBound("aaaj");
@@ -263,6 +311,31 @@ TestStringIndex(size_t num_nodes, size_t line_width) {
it = nonuniform_index->UpperBound("aaak");
KATANA_LOG_ASSERT(it != nonuniform_index->end());
KATANA_LOG_ASSERT(typed_prop->GetView(*it) == "aaam");

return g;
}

std::unique_ptr<katana::PropertyGraph>
ReloadGraph(std::unique_ptr<katana::PropertyGraph> g) {
auto uri_res = katana::Uri::MakeRand("/tmp/propertyfilegraph");
KATANA_LOG_ASSERT(uri_res);
std::string rdg_dir(uri_res.value().path());

auto write_result = g->Write(rdg_dir, "test command line");

if (!write_result) {
boost::filesystem::remove_all(rdg_dir);
KATANA_LOG_FATAL("writing result: {}", write_result.error());
}

katana::Result<std::unique_ptr<katana::PropertyGraph>> make_result =
katana::PropertyGraph::Make(rdg_dir, tsuba::RDGLoadOptions());
boost::filesystem::remove_all(rdg_dir);
Contributor

I'm not crazy about our dependence on boost::filesystem, but I've used remove_all. So nice.

Contributor Author

Fair, I'll also say that in tests I tend to be much more permissive for my own coding style and grabbing arbitrary dependencies to just get the job done.
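
If the boost dependency ever becomes a concern, the same cleanup is available in the standard library, assuming a C++17 toolchain with `<filesystem>`. A minimal sketch follows; `CreateAndRemoveScratchDir` and the file names in it are hypothetical, and nothing here is taken from the katana sources.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Creates a scratch directory containing one nested file, removes the whole
// tree, and reports whether the directory is gone afterwards.
// `dir_name` is a hypothetical name chosen by the caller.
bool
CreateAndRemoveScratchDir(const std::string& dir_name) {
  fs::path dir = fs::temp_directory_path() / dir_name;
  fs::create_directories(dir / "nested");
  {
    std::ofstream out(dir / "nested" / "part_header.json");
    out << "{}";
  }
  // remove_all deletes the directory and everything under it and returns the
  // number of entries it removed.
  std::uintmax_t removed = fs::remove_all(dir);
  return removed > 0 && !fs::exists(dir);
}
```

`std::filesystem::remove_all` has the same recursive semantics as the boost call used in ReloadGraph, so the swap would be mechanical in a test like this one.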

if (!make_result) {
KATANA_LOG_FATAL("making result: {}", make_result.error());
}

return std::move(make_result.value());
}

int
@@ -274,8 +347,14 @@ main() {
TestPrimitiveIndex<katana::GraphTopology::Node, double_t>(10, 3);
TestPrimitiveIndex<katana::GraphTopology::Edge, double_t>(10, 3);

TestStringIndex<katana::GraphTopology::Node>(10, 3);
TestStringIndex<katana::GraphTopology::Edge>(10, 3);
auto node_g = TestStringIndex<katana::GraphTopology::Node>(nullptr, 10, 3);
auto edge_g = TestStringIndex<katana::GraphTopology::Edge>(nullptr, 10, 3);

node_g = ReloadGraph(std::move(node_g));
edge_g = ReloadGraph(std::move(edge_g));

TestStringIndex<katana::GraphTopology::Node>(std::move(node_g), 10, 3);
TestStringIndex<katana::GraphTopology::Edge>(std::move(edge_g), 10, 3);

return 0;
}
11 changes: 11 additions & 0 deletions libtsuba/include/tsuba/RDG.h
@@ -230,6 +230,17 @@ class KATANA_EXPORT RDG {
/// Remove all edge properties
void DropEdgeProperties();

// Set the list of node and edge column names to persist. Consumes the
// provided parameters.
void set_node_property_index_columns(
oshofmann marked this conversation as resolved.
const std::vector<std::string>& node_property_index_columns);
void set_edge_property_index_columns(
const std::vector<std::string>& edge_property_index_columns);

// Return the list of node and edge column names.
const std::vector<std::string>& node_property_index_columns();
const std::vector<std::string>& edge_property_index_columns();

/// Remove topology data
katana::Result<void> DropTopology();

24 changes: 24 additions & 0 deletions libtsuba/src/RDG.cpp
@@ -215,6 +215,30 @@ tsuba::RDG::WritePartArrays(const katana::Uri& dir, tsuba::WriteGroup* desc) {
return next_properties;
}

void
tsuba::RDG::set_node_property_index_columns(
const std::vector<std::string>& node_property_index_columns) {
core_->part_header().set_node_property_index_columns(
node_property_index_columns);
}

void
tsuba::RDG::set_edge_property_index_columns(
const std::vector<std::string>& edge_property_index_columns) {
core_->part_header().set_edge_property_index_columns(
edge_property_index_columns);
}

const std::vector<std::string>&
tsuba::RDG::node_property_index_columns() {
return core_->part_header().node_property_index_columns();
}

const std::vector<std::string>&
tsuba::RDG::edge_property_index_columns() {
return core_->part_header().edge_property_index_columns();
}

katana::Result<void>
tsuba::RDG::DoStoreTopology(
RDGHandle handle, std::unique_ptr<FileFrame> topology_ff,
15 changes: 15 additions & 0 deletions libtsuba/src/RDGPartHeader.cpp
@@ -33,6 +33,9 @@ const char* kEdgeEntityTypeIDDictionaryKey =
const char* kNodeEntityTypeIDNameKey = "kg.v1.node_entity_type_id_name";
// Name maps from Atomic Edge Entity Type ID to set of string names for the Edge Entity Type ID
const char* kEdgeEntityTypeIDNameKey = "kg.v1.edge_entity_type_id_name";
// List of node and edge indexed columns
const char* kNodePropertyIndexColumnsKey = "kg.v1.node_property_index_columns";
const char* kEdgePropertyIndexColumnsKey = "kg.v1.edge_property_index_columns";
Comment on lines +37 to +38
Contributor

Hey @tylershunt, given that these key names seem to have no bearing on the RDG version, is it wise to keep calling them v1?

Contributor

These are versions of the key namespace. I don't see a reason to change them at this time, we're just adding keys.

Contributor

Everything is an IDspace to you these days. It makes me feel a bit better about it though.
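
Adding keys inside the existing `kg.v1` namespace stays compatible only if readers tolerate headers that lack the new keys, which is what the `from_json` change in this PR does with `find`. The sketch below illustrates that lookup-with-default pattern using a `std::map` as a stand-in for the parsed JSON object; `HeaderSketch` and `ReadIndexColumns` are hypothetical names, and only the key string is copied from the diff.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for the parsed JSON part header: key -> list of column names.
// The real code uses a JSON object; a map keeps this sketch self-contained.
using HeaderSketch = std::map<std::string, std::vector<std::string>>;

// Key string copied from the diff.
const char* kNodePropertyIndexColumnsKey = "kg.v1.node_property_index_columns";

// Returns the stored list, or an empty list when the key is absent, which is
// the case for headers written before the key existed.
std::vector<std::string>
ReadIndexColumns(const HeaderSketch& header, const std::string& key) {
  if (auto it = header.find(key); it != header.end()) {
    return it->second;
  }
  return {};
}
```

An old header therefore loads with no indexes rather than failing, which is the behavior the optional lookup in `from_json` appears intended to give.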


//
//constexpr std::string_view mirror_nodes_prop_name = "mirror_nodes";
@@ -288,6 +291,8 @@ tsuba::to_json(json& j, const tsuba::RDGPartHeader& header) {
{kEdgeEntityTypeIDDictionaryKey, header.edge_entity_type_id_dictionary_},
{kNodeEntityTypeIDNameKey, header.node_entity_type_id_name_},
{kEdgeEntityTypeIDNameKey, header.edge_entity_type_id_name_},
{kNodePropertyIndexColumnsKey, header.node_property_index_columns_},
{kEdgePropertyIndexColumnsKey, header.edge_property_index_columns_},
};
}

@@ -319,6 +324,16 @@ tsuba::from_json(const json& j, tsuba::RDGPartHeader& header) {
j.at(kNodeEntityTypeIDNameKey).get_to(header.node_entity_type_id_name_);
j.at(kEdgeEntityTypeIDNameKey).get_to(header.edge_entity_type_id_name_);
}

header.node_property_index_columns_ = {};
if (auto it = j.find(kNodePropertyIndexColumnsKey); it != j.end()) {
it->get_to(header.node_property_index_columns_);
}

header.edge_property_index_columns_ = {};
if (auto it = j.find(kEdgePropertyIndexColumnsKey); it != j.end()) {
it->get_to(header.edge_property_index_columns_);
}
}

void