Fix topological sort for sync nodes execution order #7431

chunnienc · 2023-02-28T23:31:43Z

This PR rewrites the algorithm to determine execution order of sync graph nodes:

The new algorithm guarantees that OrderParent < OrderChild if parent -> child always holds. The old implementation does not guarantee this when child is predefined (given with an input tensor, is weight, etc), which will lead to some error in execution-order based tensor disposal plan.
Improve the time complexity from O(NE) to O(E).

Verified with sample models (localbenchmark) and diffuser

To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.

This change is

mattsoulanille

Thanks! This looks good. I have a few ideas for how we can further improve this in the future.

A quick note about the PR description. I'm not sure iff in "OrderParent < OrderChild iff parent -> child" is correct. I think if is sufficient. Let me rewrite this with generic node names so I can better explain:

Order(a) < Order(b) iff a -> b

Since this is using iff, i.e. if and only if, it means

(Order(a) < Order(b) if a -> b) and (Order(a) < Order(b) only if a -> b)
(Order(a) < Order(b) if a -> b) and ((not Order(a) < Order(b)) if (not a -> b))
(Order(a) < Order(b) if a -> b) and (Order(a) >= Order(b) if (not a -> b))

This means if b is not a child of a, then a must come after b in the order, but this creates a contradiction if a and b have no parent - child relationship. Each must come after the other (note that two nodes can not have the same order).

mattsoulanille · 2023-03-03T19:36:33Z

tfjs-converter/src/executor/model_analysis.ts

+      usedNodes.has(typeof node === 'string' ? node : node.name);
+
+  function unique(nodes: Node[]): Node[] {
+    return [...new Map(nodes.map((node) => [node.name, node])).values()];


For a given node name, is the Node object corresponding to it always the same reference?

I think yes, but I won't do that because 1) it is not an contract that node objects with the same name in different map/array are always equal in reference. 2) The project-wide convention in tfjs-converter is to lookup nodes by name string instead of object reference, and I don't want to make it wrong by breaking the convention.

tfjs-converter/src/executor/model_analysis.ts

mattsoulanille · 2023-03-03T20:09:34Z

tfjs-converter/src/executor/model_analysis.ts

  while (frontier.length > 0) {
-    const node = frontier.pop();
-    seen.add(node.name);
-    if (!weightMap[node.name]) {
-      orderedNodes.push(node);
+    const nodeName = frontier.pop();
+    const node = nameToNode.get(nodeName)!;
+    for (const child of node.children.filter(isUsed)) {
+      if (--inCounts[child.name] === 0) {
+        orderedNodeNames.push(child.name);
+        frontier.push(child.name);
+      }
    }
-    node.children.forEach(child => {
-      if (!seen.has(child.name) && usedNodes.has(child.name) &&
-          child.inputs.every(input => seen.has(input.name))) {
-        frontier.push(child);
+  }


Do we need to verify that there are no cycles in the graph, or is that checked elsewhere (I realize the original algorithm doesn't do that either, so I won't block the PR on this).

It's checked in GraphExecutor.compile in graph_executor.ts, with the assumption that if a graph does not have dynamic node it's going to be a tree.

mattsoulanille · 2023-03-03T20:19:08Z

tfjs-converter/src/executor/model_analysis.ts

+  //   a --> b --> c --> d
+  // when node `c` is predefined (e.g. given as an input tensor), we can
+  // skip node `a` and `b` since their outputs will never be used.
+  // TODO: Filter out more nodes when >=2 nodes are predefined in a path.


We should remove all nodes from the graph where any one of these conditions holds:

The node can not be reached from any predefined node by following children pointers. During graph execution, this node will never be executed, so it doesn't need to be in the graph.

The node can not be reached from any output node by following parent pointers. In this case, none of the outputs depend on this node, so it does not need to be in the graph.

All other nodes are reachable by forward traversal from a predefined node and by reverse traversal from an output node. Therefore, each of them lies on a path from a predefined node to an output node, and is therefore necessary for computing the value of the output node.

We can compute this set of reachable nodes in O(V + E) by doing a DFS (caching each visited node) from the predefined nodes on the original graph and another DFS from the output nodes on the reversed graph (which we construct by swapping the direction of children pointers to create parent pointers). Then, we take the set intersection of both of these results.

It might be worth abstracting some of the graph logic away from the tfjs-specific logic to make writing and testing these graph operations easier. I'd be happy to do this.

This function intends to make the execution plan, without changing the graph. And its' upstream mark unused nodes by providing a set of names in usedNodes instead of modifying the graph (removing nodes from children or inputs). So I don't want to break the contract, unless there is a plan to refactor everything starting from reading graph from a file.

At the time of making the execution plan, we don't know which nodes are output nodes. Therefore it just assumes all predefined reachable nodes to be output nodes. The output node list are provided by external caller in GraphExecutor.execute through:

Parameter outputs?: string[]

Env flag KEEP_INTERMEDIATE_TENSORS
Therefore, it is not something fixed with the graph. Of course we can make different execution plan for different set of "output node names", but I can see why the original author did not do that: imagine a user is debugging a weird bug by adding intermediate name to param outputs, which can make a different plan and cause the execution behavior to be different.

Besides, I don't think abstracting out graph logic (like topological order solver) is a good idea. It would end up with lots of redundant conversions between tfjs graph and the abstract graph type for the solver input/output, and make it harder to extend/modify for new tfjs graph specific properties and requirements. For now this function is self-contained for the purpose of making the execution plan (execution plan is not limited to topological order, so it may still better to rename this function with its purpose instead of it's implementation), and there are no other functions need the solver. So I think it's okay to keep it like this.

But I think wrapping it in a TFJSGraph class as a method will definitely be better. It will even better if we can separate SyncGraph and AsyncGraph with different checks/planner/executor. Go ahead if you have any good ideas to refactor this part!

That makes sense. I'm happy to leave the graph as-is for this PR. We can talk more offline about how best to represent the graph in the future.

mattsoulanille · 2023-03-03T20:58:41Z

tfjs-converter/src/executor/model_analysis.ts

+  {
+    const nodeNameToOrder = new Map<string, number>(
+        filteredOrderedNodes.map((node, order) => [node.name, order]));
+    const predefinedNodeNames =
+        new Set(predefinedNodes.map((node) => node.name));
+    const isPredefined = (node: Node|string) =>
+        predefinedNodeNames.has(typeof node === 'string' ? node : node.name);
+    const willBeExecutedNodeNames =
+        new Set(filteredOrderedNodes.map((node) => node.name));
+    const willBeExecuted = (node: Node|string) => willBeExecutedNodeNames.has(
+        typeof node === 'string' ? node : node.name);
+
+    for (const node of filteredOrderedNodes) {
+      for (const child of node.children.filter(willBeExecuted)) {
+        if (!nodeNameToOrder.has(child.name)) {
+          throw new Error('TopologicalSortError: Child is unreachable.');
+        }
+        if (nodeNameToOrder.get(node.name) > nodeNameToOrder.get(child.name)) {
+          throw new Error(
+              'TopologicalSortError: Node has greater order than its child.');
+        }
+      }
+      if (!isPredefined(node)) {
+        for (const input of node.inputs) {
+          if (!nodeNameToOrder.has(input.name)) {
+            throw new Error('TopologicalSortError: Input is unreachable.');
+          }
+          if (nodeNameToOrder.get(input.name) >
+              nodeNameToOrder.get(node.name)) {
+            throw new Error(
+                'TopologicalSortError: Node has smaller order than its input.');
+          }
+        }
+      }
+    }


Should we do this validation here, or should this be integrated into a test suite instead? I guess it's better for the user to see one of these errors instead of running an incorrect graph.

Do we have a e2e test suite to test the execution results of models in our model pool?

mattsoulanille · 2023-03-03T21:01:34Z

tfjs-converter/src/executor/model_analysis.ts

+          throw new Error('TopologicalSortError: Child is unreachable.');
+        }
+        if (nodeNameToOrder.get(node.name) > nodeNameToOrder.get(child.name)) {
+          throw new Error(
+              'TopologicalSortError: Node has greater order than its child.');
+        }
+      }
+      if (!isPredefined(node)) {
+        for (const input of node.inputs) {
+          if (!nodeNameToOrder.has(input.name)) {
+            throw new Error('TopologicalSortError: Input is unreachable.');
+          }
+          if (nodeNameToOrder.get(input.name) >
+              nodeNameToOrder.get(node.name)) {
+            throw new Error(
+                'TopologicalSortError: Node has smaller order than its input.');


Nit: Can we log the name of each node or child we reference here to make debugging easier?

mattsoulanille · 2023-03-07T18:45:16Z

@pyu10055 Please take a look when you get a chance. Thanks!

pyu10055

Reviewable status: complete! 2 of 1 approvals obtained (waiting on @chunnienc and @mattsoulanille)

tfjs-converter/src/executor/model_analysis.ts line 103 at r3 (raw file):

 * need to be executed to compute the output.
 */
export function getNodesInTopologicalOrder(

Nit: This method is relatively long, it would be great if you can abstract the key steps into separate function.

chunnienc · 2023-03-08T18:12:08Z

@mattsoulanille Could you review the helper functions and doc comments again? Thanks.

tfjs-converter/src/executor/model_analysis.ts

Co-authored-by: Matthew Soulanille <matthew@soulanille.net>

chunnienc added 5 commits March 1, 2023 16:01

Fix attempt no. 1

5014b29

Fix naming

7dc3776

Fix typo

60e6c7a

Fix topological sort order

c8f43be

Add more check

cbc4d25

chunnienc force-pushed the fix-tensor-disposal branch from 889f679 to cbc4d25 Compare March 2, 2023 00:01

chunnienc added 4 commits March 1, 2023 16:02

Merge branch 'master' into fix-tensor-disposal

d526fa9

Rewrite topological sort algorithm

eb2ef5c

Fix lint

f9c4656

Add todo comment

5d242b7

chunnienc changed the title ~~Fix tensor disposal algorithm~~ Fix topological sort for sync nodes execution order Mar 2, 2023

chunnienc requested review from mattsoulanille and pyu10055 March 2, 2023 19:11

Merge branch 'master' into fix-tensor-disposal

946c193

chunnienc marked this pull request as ready for review March 2, 2023 19:13

Fix accessing children of undefined

1fdd232

mattsoulanille approved these changes Mar 3, 2023

View reviewed changes

Merge branch 'master' into fix-tensor-disposal

dca025c

pyu10055 approved these changes Mar 8, 2023

View reviewed changes

chunnienc mentioned this pull request Mar 8, 2023

Tensor is disposed too early for FaceLandmarkDetection&architecture=attention_mesh #7445

Closed

chunnienc added 3 commits March 8, 2023 09:59

Separate out helper functions

1bf14ad

Merge remote-tracking branch 'upstream/master' into fix-tensor-disposal

80eb8e6

Fix comment

c520db0

chunnienc requested a review from mattsoulanille March 8, 2023 18:11

chunnienc added 2 commits March 8, 2023 10:17

Fix error message

c37c6b0

Fix indent

fd28a63

mattsoulanille approved these changes Mar 8, 2023

View reviewed changes

Update tfjs-converter/src/executor/model_analysis.ts

5fa23f3

Co-authored-by: Matthew Soulanille <matthew@soulanille.net>

chunnienc and others added 3 commits March 8, 2023 15:34

Update tfjs-converter/src/executor/model_analysis.ts

71fd7e7

Co-authored-by: Matthew Soulanille <matthew@soulanille.net>

Update tfjs-converter/src/executor/model_analysis.ts

b4903da

Co-authored-by: Matthew Soulanille <matthew@soulanille.net>

Add class for validation error

f1abb5b

chunnienc merged commit 0f16d2d into tensorflow:master Mar 8, 2023

chunnienc deleted the fix-tensor-disposal branch March 9, 2023 18:26

vladmandic mentioned this pull request Mar 21, 2023

regression: tensor is disposed early #7504

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix topological sort for sync nodes execution order #7431

Fix topological sort for sync nodes execution order #7431

chunnienc commented Feb 28, 2023 •

edited

Loading

mattsoulanille left a comment •

edited

Loading

mattsoulanille Mar 3, 2023

chunnienc Mar 3, 2023

mattsoulanille Mar 3, 2023

chunnienc Mar 3, 2023

mattsoulanille Mar 3, 2023

chunnienc Mar 3, 2023 •

edited

Loading

mattsoulanille Mar 7, 2023

mattsoulanille Mar 3, 2023

chunnienc Mar 3, 2023

mattsoulanille Mar 3, 2023

mattsoulanille commented Mar 7, 2023

pyu10055 left a comment

chunnienc commented Mar 8, 2023

Fix topological sort for sync nodes execution order #7431

Fix topological sort for sync nodes execution order #7431

Conversation

chunnienc commented Feb 28, 2023 • edited Loading

mattsoulanille left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chunnienc Mar 3, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattsoulanille commented Mar 7, 2023

pyu10055 left a comment

Choose a reason for hiding this comment

chunnienc commented Mar 8, 2023

chunnienc commented Feb 28, 2023 •

edited

Loading

mattsoulanille left a comment •

edited

Loading

chunnienc Mar 3, 2023 •

edited

Loading