Skip to content

Commit

Permalink
Add a new topological sort implementation
Browse files Browse the repository at this point in the history
This is the first chunk to break #10547 into smaller PRs suitable for reviewing. It adds a new topological sort implementation.

 #### Background

Topological sort is an algorithm that sorts the vertices of a directed acyclic graph (DAG) in a linear order such that for every directed edge from vertex A to vertex B, vertex A comes before vertex B in the ordering. This ordering is called a topological order.

Ultimately (beyond the scope of this PR), in the ORM we'll need this to find an order in which we can insert new entities into the database. When one entity needs to refer to another one by means of a foreign key, the referred-to entity must be inserted before the referring entity. Deleting entities is similar.

A topological sorting can be obtained by running a depth first search (DFS) on the graph. The order in which the DFS finishes on the vertices is a topological order. The DFS is possible iif there are no cycles in the graph. When there are cycles, the DFS will find them.

For more information about topological sorting, as well as a description of an DFS-based topological sorting algorithm, see https://en.wikipedia.org/wiki/Topological_sorting.

 #### Current situation

There is a DFS-based topological sorting implemented in the `CommitOrderCalculator`. This implementation has two kinks:

1. It does not detect cycles

When there is a cycle in the DAG that cannot be resolved, we need to know about it. Ultimately, this means we will not be able to insert entities into the database in any order that allows all foreign keys constraints to be satisfied.

If you look at `CommitOrderCalculator`, you'll see that there is no code dealing with this situation.

2. It has an obscure concept of edge "weights"

To me, it is not clear how those are supposed to work. The weights are related to whether a foreign key is nullable or not, but can (could) be arbitrary integers. An edge will be ignored if it has a higher (lower) weight than another, already processed edge... 🤷🏻?

 #### Suggested change

In fact, when inserting entities into the database, we have two kinds of foreign key relationships: Those that are `nullable`, and those that are not.

Non-nullable foreign keys are hard requirements: Referred-to entities must be inserted first, no matter what. These are "non-optional" edges in the dependency graph.

Nullable foreign keys can be set to `NULL` when first inserting an entity, and then coming back and updating the foreign key value after the referred-to (related) entity has been inserted into the database. This is already implemented in `\Doctrine\ORM\UnitOfWork::scheduleExtraUpdate`, at the expense of performing one extra `UPDATE` query after all the `INSERT`s have been processed. These edges are "optional".

When finding a cycle that consists of non-optional edges only, treat it as a failure. We won't be able to insert entities with a circular dependency when all foreign keys are non-NULLable.

When a cycle contains at least one optional edge, we can use it to break the cycle: Use backtracking to go back to the point before traversing the last _optional_ edge. This omits the edge from the topological sort order, but will cost one extra UPDATE later on.

To make the transition easier, the new implementation is provided as a separate class, which is marked as `@internal`.

 #### Outlook

Backtracking to the last optional edge is the simplest solution for now. In general, it might be better to find _another_ (optional) edge that would also break the cycle, if that edge is also part of _other_ cycles.

Remember, every optional edge skipped costs an extra UPDATE query later on. The underlying problem is known as the "minimum feedback arc set" problem, and is not easily/efficiently solvable. Thus, for the time being, picking the nearest optional edge seems to be reasonable.
  • Loading branch information
mpdude committed Apr 24, 2023
1 parent 33a19f1 commit a599140
Show file tree
Hide file tree
Showing 3 changed files with 468 additions and 0 deletions.
165 changes: 165 additions & 0 deletions lib/Doctrine/ORM/Internal/TopologicalSort.php
Original file line number Diff line number Diff line change
@@ -0,0 +1,165 @@
<?php

declare(strict_types=1);

namespace Doctrine\ORM\Internal;

use Doctrine\ORM\Internal\TopologicalSort\CycleDetectedException;

use function array_keys;
use function array_reverse;
use function array_unshift;
use function spl_object_id;

/**
* TopologicalSort implements topological sorting, which is an ordering
* algorithm for directed graphs (DG) using a depth-first searching (DFS)
* to traverse the graph built in memory.
* This algorithm has a linear running time based on nodes (V) and edges
* between the nodes (E), resulting in a computational complexity of O(V + E).
*
* @internal
*/
final class TopologicalSort
{
private const NOT_VISITED = 1;
private const IN_PROGRESS = 2;
private const VISITED = 3;

/**
* Array of all nodes, indexed by object ids.
*
* @var array<int, object>
*/
private $nodes = [];

/**
* DFS state for the different nodes, indexed by node object id and using one of
* this class' constants as value.
*
* @var array<int, self::*>
*/
private $states = [];

/**
* Edges between the nodes. The first-level key is the object id of the outgoing
* node; the second array maps the destination node by object id as key. The final
* boolean value indicates whether the edge is optional or not.
*
* @var array<int, array<int, bool>>
*/
private $edges = [];

/**
* Builds up the result during the DFS.
*
* @psalm-var list<object>
*/
private $sortResult = [];

/** @param object $node */
public function addNode($node): void
{
$id = spl_object_id($node);
$this->nodes[$id] = $node;
$this->states[$id] = self::NOT_VISITED;
$this->edges[$id] = [];
}

/** @param object $node */
public function hasNode($node): bool
{
return isset($this->nodes[spl_object_id($node)]);
}

/**
* Adds a new edge between two nodes to the graph
*
* @param object $from
* @param object $to
* @param bool $optional This indicates whether the edge may be ignored during the topological sort if it is necessary to break cycles.
*/
public function addEdge($from, $to, bool $optional): void
{
$fromId = spl_object_id($from);
$toId = spl_object_id($to);

if (isset($this->edges[$fromId][$toId]) && $this->edges[$fromId][$toId] === false) {
return; // we already know about this dependency, and it is not optional
}

$this->edges[$fromId][$toId] = $optional;
}

/**
* Returns a topological sort of all nodes. When we have an edge A->B between two nodes
* A and B, then A will be listed before B in the result.
*
* @psalm-return list<object>
*/
public function sort()
{
/*
* When possible, keep objects in the result in the same order in which they were added as nodes.
* Since nodes are unshifted into $this->>sortResult (see the visit() method), that means we
* need to work them in array_reverse order here.
*/
foreach (array_reverse(array_keys($this->nodes)) as $oid) {
if ($this->states[$oid] === self::NOT_VISITED) {
$this->visit($oid);
}
}

return $this->sortResult;
}

private function visit(int $oid): void
{
if ($this->states[$oid] === self::IN_PROGRESS) {
// This node is already on the current DFS stack. We've found a cycle!
throw new CycleDetectedException($this->nodes[$oid]);
}

if ($this->states[$oid] === self::VISITED) {
// We've reached a node that we've already seen, including all
// other nodes that are reachable from here. We're done here, return.
return;
}

$this->states[$oid] = self::IN_PROGRESS;

// Continue the DFS downwards the edge list
foreach ($this->edges[$oid] as $adjacentId => $optional) {
try {
$this->visit($adjacentId);
} catch (CycleDetectedException $exception) {
if ($exception->isCycleCollected()) {
// There is a complete cycle downstream of the current node. We cannot
// do anything about that anymore.
throw $exception;
}

if ($optional) {
// The current edge is part of a cycle, but it is optional and the closest
// such edge while backtracking. Break the cycle here by skipping the edge
// and continuing with the next one.
continue;
}

// We have found a cycle and cannot break it at $edge. Best we can do
// is to retreat from the current vertex, hoping that somewhere up the
// stack this can be salvaged.
$this->states[$oid] = self::NOT_VISITED;
$exception->addToCycle($this->nodes[$oid]);

throw $exception;
}
}

// We have traversed all edges and visited all other nodes reachable from here.
// So we're done with this vertex as well.

$this->states[$oid] = self::VISITED;
array_unshift($this->sortResult, $this->nodes[$oid]);
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
<?php

declare(strict_types=1);

namespace Doctrine\ORM\Internal\TopologicalSort;

use RuntimeException;

use function array_unshift;

class CycleDetectedException extends RuntimeException
{
/** @var list<object> */
private $cycle;

/** @var object */
private $startNode;

/**
* Do we have the complete cycle collected?
*
* @var bool
*/
private $cycleCollected = false;

/** @param object $startNode */
public function __construct($startNode)
{
parent::__construct('A cycle has been detected, so a topological sort is not possible. The getCycle() method provides the list of nodes that form the cycle.');

$this->startNode = $startNode;
$this->cycle = [$startNode];
}

/** @return list<object> */
public function getCycle(): array
{
return $this->cycle;
}

/** @param object $node */
public function addToCycle($node): void
{
array_unshift($this->cycle, $node);

if ($node === $this->startNode) {
$this->cycleCollected = true;
}
}

public function isCycleCollected(): bool
{
return $this->cycleCollected;
}
}
Loading

0 comments on commit a599140

Please sign in to comment.