OOM error raised after subsequent calls to plc.bfs with different SGGraphs #4067

Closed · rlratzel opened this issue Dec 21, 2023 · 3 comments · Fixed by #4077
Labels: bug (Something isn't working)

rlratzel commented Dec 21, 2023

The following code produces the error shown below. The graph is too small to result in a legitimate OOM error, so a bug is likely causing something in libcugraph to allocate or attempt to allocate more memory than is available.

import pylibcugraph as plc
import cupy as cp

edges = [
    (0,2), (0,4), (0,5), (0,6), (0,7), (0,8), (0,10), (0,11), (0,12), (0,13),
    (0,17), (0,19), (0,21), (0,31), (2,7), (2,8), (2,9), (2,13), (2,27), (2,28),
    (2,32), (4,6), (4,10), (5,6), (5,10), (5,16), (6,16), (8,30), (8,32),
    (8,33), (9,33), (13,33), (14,32), (14,33), (15,32), (15,33), (18,32),
    (18,33), (19,33), (20,32), (20,33), (22,32), (22,33), (23,25), (23,27),
    (23,29), (23,32), (23,33), (24,25), (24,27), (24,31), (25,31), (26,29),
    (26,33), (27,33), (28,31), (28,33), (29,32), (29,33), (30,32), (30,33),
    (31,32), (31,33), (32,33),
]
edges += [(target, source) for (source, target) in edges]

source=0
src_indices=cp.array([e[0] for e in edges], dtype="int32")
dst_indices=cp.array([e[1] for e in edges], dtype="int32")
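# Build the first SGGraph from the (non-empty) symmetrized edge list; BFS from vertex 0 follows.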
G=plc.SGGraph(
    resource_handle=plc.ResourceHandle(),
    graph_properties=plc.GraphProperties(
        is_multigraph=False,
        is_symmetric=False,
    ),
    src_or_offset_array=src_indices,
    dst_or_index_array=dst_indices,
    weight_array=None,
    store_transposed=False,
    renumber=False,
    do_expensive_check=False,
)

distances, predecessors, node_ids = plc.bfs(
    handle=plc.ResourceHandle(),
    graph=G,
    sources=cp.array([source], "int32"),
    direction_optimizing=False,
    depth_limit=-1,
    compute_predecessors=False,
    do_expensive_check=False,
)
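# Unreachable vertices are reported with distance 2**31 - 1 (INT32 max); mask them out.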
MAXINT=2147483647
mask = distances != MAXINT
it = zip(node_ids[mask].tolist(), distances[mask].tolist())

del G

source=1
src_indices=cp.array([], dtype="int32")
dst_indices=cp.array([], dtype="int32")
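# Empty edge lists: the second graph has no edges and hence no vertices. The BFS
# below uses source=1, which does not exist in this graph.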

G=plc.SGGraph(
    resource_handle=plc.ResourceHandle(),
    graph_properties=plc.GraphProperties(
        is_multigraph=False,
        is_symmetric=False,
    ),
    src_or_offset_array=src_indices,
    dst_or_index_array=dst_indices,
    weight_array=None,
    store_transposed=False,
    renumber=False,
    do_expensive_check=False,
)

distances, predecessors, node_ids = plc.bfs(
    handle=plc.ResourceHandle(),
    graph=G,
    sources=cp.array([source], "int32"),
    direction_optimizing=False, 
    depth_limit=1,
    compute_predecessors=False,
    do_expensive_check=False,
)
(base) root@98e76e671240:/Projects/cugraph# python oom_repro.py
Traceback (most recent call last):
  File "/Projects/cugraph/oom_repro.py", line 66, in <module>
    distances, predecessors, node_ids = plc.bfs(
  File "bfs.pyx", line 181, in pylibcugraph.bfs.bfs
  File "utils.pyx", line 53, in pylibcugraph.utils.assert_success
RuntimeError: non-success value returned from cugraph_bfs: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp

This was initially discovered in #4029

rlratzel added the bug label Dec 21, 2023
rlratzel self-assigned this Dec 21, 2023
rlratzel changed the title from "OOM error raised when calling NX algos with nx-cugraph installed" to "OOM error raised after subsequent calls to plc.bfs with different SGGraphs" Jan 3, 2024

ChuckHastings commented:

Did some analysis here. Here's what I found:

  1. I believe the fundamental problem is this: for your second graph, you define a graph with no edges (and therefore no vertices), but you request a BFS from vertex id 1. There is no vertex id 1, and we don't check that boundary condition, so we attempt to perform a BFS from a non-existent vertex. This causes us to go out of bounds on an array reference.
  2. In my test runs as you described above, this out-of-bounds access triggers the OOM failure, because we read the out-of-bounds memory and compute a completely invalid amount of memory to allocate.
  3. If I skip the first graph call and just make the second graph call, I get a different error (which is what actually led me to discover the out-of-bounds computation).
  4. If I set do_expensive_check to True in the BFS call, it returns the fundamental error (that we're passing in bad parameters); see the sketch below.

Can we define the desired behavior in this situation? If the user creates a graph (whether empty or not) and executes BFS with a starting vertex that is not actually a vertex in the graph, what is the desired behavior? Return an error indicating bad input parameters? Silently fail, returning the result of running BFS from no vertex?
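
For reference, a minimal sketch of point 4, adapted from the repro above. With do_expensive_check=True the invalid source should be reported as a bad-parameter error instead of surfacing later as an out-of-bounds read; the exact error text depends on the pylibcugraph build.

import cupy as cp
import pylibcugraph as plc

# Empty graph, as in the second half of the repro.
G = plc.SGGraph(
    resource_handle=plc.ResourceHandle(),
    graph_properties=plc.GraphProperties(is_multigraph=False, is_symmetric=False),
    src_or_offset_array=cp.array([], dtype="int32"),
    dst_or_index_array=cp.array([], dtype="int32"),
    weight_array=None,
    store_transposed=False,
    renumber=False,
    do_expensive_check=False,
)

# Vertex 1 does not exist in the empty graph; the expensive check should
# report the bad input instead of letting BFS read out of bounds.
plc.bfs(
    handle=plc.ResourceHandle(),
    graph=G,
    sources=cp.array([1], dtype="int32"),
    direction_optimizing=False,
    depth_limit=1,
    compute_predecessors=False,
    do_expensive_check=True,
)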

rlratzel commented Jan 3, 2024

Thanks, Chuck. Here's what I noticed now:

1 & 2 make sense, but when I run the script with the entire first call and setup code commented out (everything from del G and above), which I think is what you did for 3, the script silently passes and I see no error. However, if I set do_expensive_check as you did, I see an error as you described in 4.

Given that, it seems like there could be a few issues:

  1. If we think the input is invalid (seems like that's how it should be classified), then we should raise/return an error. For reference, NetworkX does this in the same situation:
>>> list(nx.bfs_edges(G,333))
Traceback (most recent call last):
  File "/Projects/networkx/networkx/classes/graph.py", line 1354, in neighbors
    return iter(self._adj[n])
KeyError: 333

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Projects/networkx/networkx/algorithms/traversal/breadth_first_search.py", line 218, in bfs_edges
    yield from generic_bfs_edges(G, source, successors, depth_limit)
  File "/Projects/networkx/networkx/algorithms/traversal/breadth_first_search.py", line 118, in generic_bfs_edges
    next_parents_children = [(source, neighbors(source))]
  File "/Projects/networkx/networkx/classes/graph.py", line 1356, in neighbors
    raise NetworkXError(f"The node {n} is not in the graph.") from err
networkx.exception.NetworkXError: The node 333 is not in the graph.

so raising/returning an error seems reasonable to me.

  2. The out-of-bounds read obviously shouldn't happen. If it's only possible with an invalid source vertex and we're handling that now by checking and raising, then we've covered this one too. However, if there's another situation we can think of, then perhaps there are additional guardrails we should also add?
  3. Any idea why I'm not seeing an error of any kind for your 3rd point above? Is that something else that needs to be looked at?

ChuckHastings commented:

This particular OOB read is because of the invalid source vertex (which we never check). I can't think of other bad input cases that we're not checking... although I am admittedly bad at finding these a priori. I think adding this check should be sufficient. Correct input data should not result in an OOB error like this.

OOB errors are very non-deterministic in their behavior, which is one of the things that makes them a challenge to isolate and debug. For this particular code, which is fairly straightforward, the behavior is probably a function of which GPU model, device driver, and CUDA runtime version you're running. Fundamentally, in my run the out-of-bounds element reads as -1 (whatever was in that memory previously), and accessing whatever_array[-1] then references an illegal location. In your run you may be reading a different value (perhaps not even out of bounds) and then accessing some legal, but wrong, location. Because the graph is empty, it's likely only one OOB memory reference, and the value isn't even used after the small block of code it's in. As long as the subtraction yields a positive value small enough to be allocated in GPU memory, it will not fail, nor will it generate a bad result... because there is no result. If you did the same thing with a real graph, you would probably increase your chances of a bad outcome (a different failure or a bogus result).
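
As a rough illustration of how a garbage value can turn into the OOM above (using CuPy directly, not libcugraph internals), requesting an absurdly large allocation fails with an out-of-memory error rather than a crash at the bad read itself:

import cupy as cp

# Illustration only (not libcugraph code): if an out-of-bounds read yields a
# garbage value, a size derived from it can be enormous, and the allocation
# attempt is what actually fails, analogous to the std::bad_alloc /
# out_of_memory in the traceback above.
bogus_size = 2**40  # pretend this came from uninitialized device memory
try:
    cp.empty(bogus_size, dtype="int32")  # ~4 TiB request; fails on typical GPUs
except cp.cuda.memory.OutOfMemoryError as exc:
    print("allocation failed:", exc)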

I will add the fixes to the C API for BFS to detect this error.

rapids-bot bot pushed a commit that referenced this issue Jan 12, 2024
…valid vertex (#4077)

* Added an error check to be sure that the source vertex is a valid vertex in the graph.
* Updated the `nx_cugraph.Graph` class to create PLC graphs using `vertices_array` so that isolated vertices are included. This is needed because the error check added in this PR would otherwise cause NetworkX tests to fail when isolated vertices are treated as invalid sources (see the sketch after this list).
* This resolves the problem that required the test workarounds in #4029, so those workarounds have been removed in this PR.
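
A rough sketch of the second bullet; the `vertices_array` keyword is taken from the PR description, and the exact SGGraph signature may vary by pylibcugraph version:

import cupy as cp
import pylibcugraph as plc

# Pass the vertex set explicitly so isolated vertices (vertex 3 below appears
# in no edge) are still part of the graph.
G = plc.SGGraph(
    resource_handle=plc.ResourceHandle(),
    graph_properties=plc.GraphProperties(is_multigraph=False, is_symmetric=False),
    src_or_offset_array=cp.array([0, 1], dtype="int32"),
    dst_or_index_array=cp.array([1, 2], dtype="int32"),
    vertices_array=cp.array([0, 1, 2, 3], dtype="int32"),  # includes isolated vertex 3
    weight_array=None,
    store_transposed=False,
    renumber=False,
    do_expensive_check=False,
)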

Closes #4067

Authors:
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Rick Ratzel (https://github.com/rlratzel)

Approvers:
  - Seunghwa Kang (https://github.com/seunghwak)
  - Ray Douglass (https://github.com/raydouglass)
  - Erik Welch (https://github.com/eriknw)

URL: #4077