Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[node] Add SavedModel execution #2336

Merged
merged 22 commits into from
Nov 13, 2019
Merged

[node] Add SavedModel execution #2336

merged 22 commits into from
Nov 13, 2019

Conversation

kangyizhang
Copy link
Contributor

@kangyizhang kangyizhang commented Nov 5, 2019

This is the first implementation of SavedModel execution through TF C API.

Added more SavedModel for testing. These SavedModels are generated from python code.


This change is Reviewable

@kangyizhang kangyizhang changed the title Add SavedModel execution [node] Add SavedModel execution Nov 6, 2019
Copy link
Collaborator

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thanks! couple comments.

Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @kangyizhang, @nsthorat, and @pyu10055)


tfjs-node/binding/tfjs_backend.cc, line 1055 at r1 (raw file):

  auto savedmodel_entry = tf_savedmodel_map_.find(savedmodel_id);
  if (savedmodel_entry == tf_savedmodel_map_.end()) {
    NAPI_THROW_ERROR(env, "SavedModel ID not referenced (savedmodel_id: %d)",

not found? there are couple other similar instances.


tfjs-node/binding/tfjs_backend.cc, line 1076 at r1 (raw file):

num_input_ids

please verify the input ids length is same as input_op_names.


tfjs-node/binding/tfjs_backend.cc, line 1118 at r1 (raw file):

      return nullptr;
    }
    TF_Output in = {input_op, 0};

for input_op and output_op node, name can contain index as following format node_name:index, but if index is 0, it can be omitted.
Code here should handle with or without index cases, not always set to 0.


tfjs-node/binding/tfjs_backend.cc, line 1134 at r1 (raw file):

  }

  std::vector<TF_Tensor *> output_values(num_input_ids, nullptr);

num_output_ids?


tfjs-node/binding/tfjs_backend.cc, line 1147 at r1 (raw file):

output_tensor_infos

if the output tensor info construction is same as eager op execution, maybe we should refactor into a common method.


tfjs-node/binding/tfjs_binding.cc, line 221 at r1 (raw file):

  ENSURE_VALUE_IS_NUMBER_RETVAL(env, args[0], nullptr);
  ENSURE_VALUE_IS_ARRAY_RETVAL(env, args[1], nullptr);
  ENSURE_VALUE_IS_STRING_RETVAL(env, args[2], nullptr);

should the input and output node names be array instead of string? this way, you can do some validity check

                                                                                                                                                                                                                            of parameters in the js.

tfjs-node/src/saved_model.ts, line 213 at r1 (raw file):

    this.inputOpNames = new Set();
    inputNodeNames.map(inputName => {
      this.inputOpNames.add(inputName.split(':')[0]);

same here, the input/output names can contain index, and it should be handled.


tfjs-node/src/saved_model.ts, line 297 at r1 (raw file):

      } else {
        throw new Error(
            'TFSavedModel predict() does not support NamedTensorMap yet.');

namedTensorMap should be allow, since you can convert them into arrays for backend.

Copy link
Contributor Author

@kangyizhang kangyizhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @kangyizhang, @nsthorat, and @pyu10055)


tfjs-node/binding/tfjs_backend.cc, line 1055 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

not found? there are couple other similar instances.

Done.


tfjs-node/binding/tfjs_backend.cc, line 1076 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…
num_input_ids

please verify the input ids length is same as input_op_names.

Done.

Copy link
Contributor Author

@kangyizhang kangyizhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @nsthorat, and @pyu10055)


tfjs-node/binding/tfjs_backend.cc, line 1118 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

for input_op and output_op node, name can contain index as following format node_name:index, but if index is 0, it can be omitted.
Code here should handle with or without index cases, not always set to 0.

Done.


tfjs-node/binding/tfjs_backend.cc, line 1134 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

num_output_ids?

Done.


tfjs-node/binding/tfjs_backend.cc, line 1147 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…
output_tensor_infos

if the output tensor info construction is same as eager op execution, maybe we should refactor into a common method.

Done.


tfjs-node/binding/tfjs_binding.cc, line 221 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

should the input and output node names be array instead of string? this way, you can do some validity check

                                                                                                                                                                                                                            of parameters in the js.

The NAPI_VALUE does not support array of strings. So I am using stringArray.join() in js side and spliting the string with , in C++ side.


tfjs-node/src/saved_model.ts, line 213 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

same here, the input/output names can contain index, and it should be handled.

Done.


tfjs-node/src/saved_model.ts, line 297 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

namedTensorMap should be allow, since you can convert them into arrays for backend.

Done.

Copy link
Contributor

@nsthorat nsthorat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 21 of 25 files at r1, 5 of 5 files at r2.
Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @kangyizhang, @nsthorat, and @pyu10055)


tfjs-node/binding/tfjs_backend.cc, line 1075 at r2 (raw file):

  // Get input/output op names as vector
  std::vector<char *> input_op_name_array = splitStringByComma(input_op_names);

const here and below


tfjs-node/src/saved_model_test.ts, line 331 at r2 (raw file):

  it('execute model float times three', async () => {
    const signature1 = await tf.node.loadSavedModel(

call this model? here and elsewhere


tfjs-node/src/saved_model_test.ts, line 339 at r2 (raw file):

    expect(output.dtype).toBe(input.dtype);
    expect(output.dtype).toBe('float32');
    test_util.expectArraysClose(await output.data(), await input.mul(3).data());

don't use input.mul(3).data() hard-code the array here and below


tfjs-node/test_objects/saved_model/times_two_int/saved_model.pb, line 0 at r2 (raw file):
do you think it would make sense to put these on GCP since these will now be part of the history and always have to be downloaded by anyone cloning the repo?

Copy link
Contributor

@nsthorat nsthorat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @kangyizhang, @nsthorat, and @pyu10055)


tfjs-node/src/saved_model.ts, line 198 at r2 (raw file):

/**
 * A `tf.TFSavedModel` is a signature loaded from a SavedModel
 * metagraph, and allows inference exeuction.

execution

Copy link
Contributor Author

@kangyizhang kangyizhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @nsthorat, and @pyu10055)


tfjs-node/binding/tfjs_backend.cc, line 1075 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

const here and below

Done.


tfjs-node/src/saved_model.ts, line 198 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

execution

Done.


tfjs-node/src/saved_model_test.ts, line 331 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

call this model? here and elsewhere

Done.


tfjs-node/src/saved_model_test.ts, line 339 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

don't use input.mul(3).data() hard-code the array here and below

Done.


tfjs-node/test_objects/saved_model/times_two_int/saved_model.pb, line at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

do you think it would make sense to put these on GCP since these will now be part of the history and always have to be downloaded by anyone cloning the repo?

I would prefer to keep the test_objects in this folder, for two reasons:

  1. the total size of all these test_objects are 77.7 kb, it's not too big and there is not plan to add large size objects in future. Compare with the scripts folder, whose size is 34kb, the size of these test objects should be ok.
  2. GCP does not support downloads folder through code, which means to hosting these objects on GCP, every single file need to be downloaded and managed separately.

Copy link
Collaborator

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @kangyizhang, @nsthorat, and @pyu10055)


tfjs-node/binding/tfjs_backend.cc, line 1147 at r1 (raw file):

Previously, kangyizhang (Kangyi Zhang) wrote…

Done.

I am wondering if the whole output tensor list generation can be refactor into one method?


tfjs-node/binding/tfjs_backend.cc, line 1092 at r3 (raw file):

  if (input_op_name_array.size() != num_input_ids) {
    NAPI_THROW_ERROR(env, "Input op names and input tensors do not match.");

Please clarify that it is the length does match


tfjs-node/binding/tfjs_backend.cc, line 1127 at r3 (raw file):

serving_default_x

do you make sure the op name always have index at the end in javascript, otherwise, name without index is allowed, and it referring to index 0.


tfjs-node/src/saved_model.ts, line 297 at r1 (raw file):

Previously, kangyizhang (Kangyi Zhang) wrote…

Done.

Please take a look at GraphModel, there are code that verify the input and output node name matches the models, in your case the signature. Please add similar verification here.

Copy link
Contributor Author

@kangyizhang kangyizhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @nsthorat, and @pyu10055)


tfjs-node/binding/tfjs_backend.cc, line 1147 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

I am wondering if the whole output tensor list generation can be refactor into one method?

No, the helper function can not process the output tensor list. Op execution returns a list of TFE_TensorHandle. While SavedModel execution (TF_SessionRun) returns a list of TF_Tensor, and when looping into this list, it converts the TF_Tensor into TFE_TensorHandle and generate Tensor info from one TFE_TensorHandle.


tfjs-node/binding/tfjs_backend.cc, line 1092 at r3 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

Please clarify that it is the length does match

Done.


tfjs-node/binding/tfjs_backend.cc, line 1127 at r3 (raw file):

Previously, pyu10055 (Ping Yu) wrote…
serving_default_x

do you make sure the op name always have index at the end in javascript, otherwise, name without index is allowed, and it referring to index 0.

Done. The index will be 0 is the op name does not have index information(:0) ad the end.


tfjs-node/src/saved_model.ts, line 297 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

Please take a look at GraphModel, there are code that verify the input and output node name matches the models, in your case the signature. Please add similar verification here.

Done. It's using a local helper function to check input node names match the signatureDef. Check of output will be added when implementing model.execute() function.

Copy link
Contributor

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Few comments, mostly related to testing mem leaks, and auto-cleaning with finalization. Might be better to do the mem leaks and finalization in a separate PR (LGTM for PR)

Reviewed 20 of 25 files at r1, 1 of 5 files at r2, 3 of 3 files at r4.
Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @dsmilkov, @kangyizhang, @nsthorat, and @pyu10055)


tfjs-node/binding/tfjs_backend.h, line 93 at r4 (raw file):

  TFE_Context *tfe_context_;
  std::map<int32_t, TFE_TensorHandle *> tfe_handle_map_;
  std::map<int32_t, std::pair<TF_Session *, TF_Graph *>> tf_savedmodel_map_;

tip i recently learned: use unordered_map, which is a hashmap and faster than map which is a binary search tree. Use map only when you need to iterate the keys in-order. Here and below.


tfjs-node/binding/tfjs_backend.cc, line 1092 at r3 (raw file):

Previously, kangyizhang (Kangyi Zhang) wrote…

Done.

Also would be good to provide the length of the names and the tensor ids in the error message. Something like:
"Length of input op names (3) does not match the length of tensor ids (4)."


tfjs-node/binding/tfjs_backend.cc, line 752 at r4 (raw file):

    TF_AutoStatus tf_status;
    TF_DeleteSession(kv.second.first, tf_status.status);
    TF_DeleteGraph(kv.second.second);

it would be great to have a unit test in cc that will assert for these mem leaks by checking the value of a counter that counts how many graphs and sessions are in memory? E.g. checking the size of the map. For example see xnn_operator_count in this WASM unit test: /~https://github.com/tensorflow/tfjs/blob/master/tfjs-backend-wasm/src/cc/kernels/Prelu_test.cc#L20 You can also assert this in js test instead of a cc test, by exposing the size of the map via napi.


tfjs-node/binding/tfjs_backend.cc, line 1120 at r4 (raw file):

    if (TF_GetCode(tf_status.status) != TF_OK) {
      NAPI_THROW_ERROR(env,
                       "Faile to get input tensor (tensor_id: %d) for session.",

typo: failed


tfjs-node/src/nodejs_kernel_backend.ts, line 1904 at r4 (raw file):

  runSavedModel(
      id: number, inputs: Tensor[], inputOpNames: string,

make inputOpNames and outputOpNames be string[] in backend.runSavedModel(). The caller of backend.runSavedModel() shouldn't know the internal details and limitations of the binding. You can keep it as string in binding.runSavedModel()


tfjs-node/src/saved_model_test.ts, line 339 at r4 (raw file):

    expect(output.dtype).toBe('float32');
    test_util.expectArraysClose(await output.data(), await input.mul(3).data());
    model.dispose();

since node supports finalization, can we make sure that model.dispose() is called automatically when the variable goes out of scope?


tfjs-node/src/saved_model_test.ts, line 372 at r4 (raw file):

  it('execute model with wrong tensor name', async done => {
    try {

tip: it's a bit cleaner to write tests like this using
expect(() => ...).toThrowError(/error message/)


tfjs-node/test_objects/saved_model/times_two_int/saved_model.pb, line at r2 (raw file):

Previously, kangyizhang (Kangyi Zhang) wrote…

I would prefer to keep the test_objects in this folder, for two reasons:

  1. the total size of all these test_objects are 77.7 kb, it's not too big and there is not plan to add large size objects in future. Compare with the scripts folder, whose size is 34kb, the size of these test objects should be ok.
  2. GCP does not support downloads folder through code, which means to hosting these objects on GCP, every single file need to be downloaded and managed separately.

Given the very small size and simplicity, I think this is ok.

Copy link
Contributor Author

@kangyizhang kangyizhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @dsmilkov, @kangyizhang, @nsthorat, and @pyu10055)


tfjs-node/binding/tfjs_backend.h, line 93 at r4 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

tip i recently learned: use unordered_map, which is a hashmap and faster than map which is a binary search tree. Use map only when you need to iterate the keys in-order. Here and below.

Done. Thanks for the info. tfe_handle_map_ has been changed to unordered_map as well.


tfjs-node/binding/tfjs_backend.cc, line 1092 at r3 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

Also would be good to provide the length of the names and the tensor ids in the error message. Something like:
"Length of input op names (3) does not match the length of tensor ids (4)."

Done.


tfjs-node/binding/tfjs_backend.cc, line 752 at r4 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

it would be great to have a unit test in cc that will assert for these mem leaks by checking the value of a counter that counts how many graphs and sessions are in memory? E.g. checking the size of the map. For example see xnn_operator_count in this WASM unit test: /~https://github.com/tensorflow/tfjs/blob/master/tfjs-backend-wasm/src/cc/kernels/Prelu_test.cc#L20 You can also assert this in js test instead of a cc test, by exposing the size of the map via napi.

Got it, thanks for the advice. I'll add the unit test in another PR so it does not overflow the size of this PR and makes code review easier.


tfjs-node/binding/tfjs_backend.cc, line 1120 at r4 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

typo: failed

Done.


tfjs-node/src/nodejs_kernel_backend.ts, line 1904 at r4 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

make inputOpNames and outputOpNames be string[] in backend.runSavedModel(). The caller of backend.runSavedModel() shouldn't know the internal details and limitations of the binding. You can keep it as string in binding.runSavedModel()

Done. This makes sense.


tfjs-node/src/saved_model_test.ts, line 339 at r4 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

since node supports finalization, can we make sure that model.dispose() is called automatically when the variable goes out of scope?

Can you elaborate a little bit here to help me understand? I think what you mean is what then a TFSavedModel instance is our of scope (no variable is referencing to the model instance), model.dispose() should be called automatically, similar to a deconstructor. Am I understanding correct? I tried to figure out how to achieve this by searching node js finalize, javascript finalize, and node js garbage collection no reference/out of scope, but didn't get any luck.


tfjs-node/src/saved_model_test.ts, line 372 at r4 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

tip: it's a bit cleaner to write tests like this using
expect(() => ...).toThrowError(/error message/)

Thank you for the advice. Based on my research expect().toThrowError() does not for code block with await, unless I wrap all the testing code into a new async function, which makes it less cleaner.

Copy link
Collaborator

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @dsmilkov, @kangyizhang, @nsthorat, and @pyu10055)


tfjs-node/src/nodejs_kernel_backend.ts, line 1906 at r5 (raw file):

      outputOpNames: string[]): Tensor[] {
    const outputMetadata = this.binding.runSavedModel(
        id, this.getInputTensorIds(inputs), inputOpNames.join(),

even though the default value for join is ',', it would be clear to specify that here. join(',')

Copy link
Contributor Author

@kangyizhang kangyizhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @dsmilkov, @kangyizhang, @nsthorat, and @pyu10055)


tfjs-node/src/nodejs_kernel_backend.ts, line 1906 at r5 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

even though the default value for join is ',', it would be clear to specify that here. join(',')

Done.

Copy link
Contributor

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @kangyizhang, @nsthorat, and @pyu10055)


tfjs-node/src/saved_model_test.ts, line 339 at r4 (raw file):

Previously, kangyizhang (Kangyi Zhang) wrote…

Can you elaborate a little bit here to help me understand? I think what you mean is what then a TFSavedModel instance is our of scope (no variable is referencing to the model instance), model.dispose() should be called automatically, similar to a deconstructor. Am I understanding correct? I tried to figure out how to achieve this by searching node js finalize, javascript finalize, and node js garbage collection no reference/out of scope, but didn't get any luck.

Yes, when no js variable points to the model instance, model.dispose() should be called automatically. Few pointers:


tfjs-node/src/saved_model_test.ts, line 372 at r4 (raw file):

Previously, kangyizhang (Kangyi Zhang) wrote…

Thank you for the advice. Based on my research expect().toThrowError() does not for code block with await, unless I wrap all the testing code into a new async function, which makes it less cleaner.

Got it. I missed the await part.

Copy link
Contributor Author

@kangyizhang kangyizhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @dsmilkov, @kangyizhang, @nsthorat, and @pyu10055)


tfjs-node/src/saved_model_test.ts, line 339 at r4 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

Yes, when no js variable points to the model instance, model.dispose() should be called automatically. Few pointers:

Thank you for the resources. It makes a lot of sense. I plan to send another PR for this, so

  1. it doesn't flood this PR
  2. other model types (GraphModel, LayersModel) need to add this functionality as well
  3. I'm wondering if this functionality should be added into tf.tidy()

@kangyizhang kangyizhang merged commit 476fa7e into master Nov 13, 2019
@kangyizhang kangyizhang deleted the node-executemodel branch November 13, 2019 23:58
inputTensors.push(inputs);
return this.backend.runSavedModel(
this.sessionId, inputTensors, Object.values(this.inputNodeNames),
Object.values(this.outputNodeNames))[0];
Copy link

@Pierrci Pierrci Dec 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kangyizhang, is there a reason why when a single input is provided it returns only the first output, even if there are many? I'm currently experimenting with SavedModel support and because of that I have to pass the inputs as an array to be sure to get all the outputs, even if the model has only one input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. This should be fixed. If the output has only one item, it will return one tensor. If the output has multiple nodes, it will return an array. I'll send a PR to fix this and it should be included in the next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants