Design Doc: Session #3993
9de0cb3
to
b411ae0
Compare
doc/design/session.md
Outdated
## Abstract

This design doc proposes to have an object called *Session* which
Did this design come out of #3811 (comment)?
If so, you probably need to add a description like "session is able to distinguish running a graph locally or remotely, using CPU or GPU, using one device or more".
Sorry for the late reply. Great point! Added "The Session is able to distinguish running a graph locally or remotely, using CPU only or using one or more GPUs."
doc/design/session.md
Outdated
## Background

A computation graph is executed in an environment which contains the
[scope](./scope.md) and other states. PaddlePaddle used to only have
Sorry, what do you mean by "other states"?
scope, device, context etc.
I have the same question as @typhoonzero. @jacquesqiao, do you mean the Session contains runtime resources, e.g. already allocated memory in scope, occupied devices, etc.?
Sorry for the late reply. @dzhwinter The environment contains runtime resources. The session is the "owner" of these runtime resources.
doc/design/session.md
Outdated
a = paddle.constant(1.0)
b = paddle.constant(2.0)
c = a + b
sess = paddle.session()
Some thoughts of mine:
- An NN training job must contain: a graph (containing sub-nets like init_net, forward_net, backward_net, opt_net); a scope containing tensors as parameters; hyper-parameters (learning_rate, batch_size, etc.); and settings (cluster or not, devices, quotas, node IP addresses, etc.), so here there may be something like:
    sess = paddle.session(graph, scope_list, settings)
    # or
    sess = paddle.remote_session(graph, scope_list, cluster_settings)
- States can all be stored in some "scope", by creating scopes for storing tensors for forward and backward, and for storing hyper-parameter changing states.
Agree with
sess = paddle.session(graph, scope_list, settings)
# or
sess = paddle.remote_session(graph, scope_list, cluster_settings)
> states can be stored all in some "scope" by creating scope for storing tensor for forward and backward, and for storing hyper parameters changing states.
Yes, the variable states are stored in "scope". One session means one scope (just added into the doc).
doc/design/session.md
Outdated
## Background

A computation graph is executed in an environment which contains the
[scope](./scope.md) and other states. PaddlePaddle used to only have
Do we need to specify that one session contains one scope?
Sorry for the late reply. Added "This indicates different sessions have different scopes.".
Thanks for this PR! But after reading it, I still couldn't tell what must be included in a Session. It seems that this information should appear in the first paragraph of this document.
doc/design/session.md
Outdated
## Background

A computation graph is executed in an environment which contains the
Try not to use passive voice.
A computation graph runs in an environment
Sorry for the late reply! Done.
doc/design/session.md
Outdated
## Background

A computation graph is executed in an environment which contains the
[scope](./scope.md) and other states. PaddlePaddle used to only have
"used to only have" => "used only to have" or "used to have only". Indeed, here we are describing the current status, so it should be
The current design has an implicit session ...
Thanks! Done.
doc/design/session.md
Outdated
## Session

Session is an object that owns all runtime states such as scope,
Session is an object ==> A session is an object
Thanks! Done.
doc/design/session.md
Outdated
[scope](./scope.md) and other states. PaddlePaddle used to only have
an implicit global session on which `paddle.eval()` is executed.

This has the limitation that the user can not create two independent
The logic here is broken. The text claims, but doesn't explain, why users cannot have two environments. The second sentence claims that it is necessary to have two environments.
From the text above, readers would conclude that what we need is two Scope instances, not a new class Session.
Thanks! Explained why the user cannot have two environments, and also changed the wording so that the reader will know we need a new class Session.
doc/design/session.md
Outdated
label = reader.column(1)
fc1 = paddle.op.fc(image, size=256, act="sigmoid")
fc2 = paddle.op.fc(fc1, size=10, act="softmax")
cost = paddle.op.cross_entropy(fc2)
Should this line be cost = paddle.op.cross_entropy(fc2, label)?
Thanks! Done.
It's better to give a C++ or Python class definition of Session.
doc/design/session.md
Outdated
## Abstract

This design doc proposes to have an object called *Session* which
Maybe the Session tries to solve these problems:
- Unify the runtime resource management between a local machine and a distributed environment.
- Replace the global scope with a named concept, which holds resources explicitly.
Is that correct?
Yes.
Updated, please review:)
5960fd2
to
526c9eb
Compare
doc/design/session.md
Outdated
## Session

A session is an object that owns all runtime states such as scope,
Should define what a session owns exactly, like:
- The "graph" to run locally or remotely.
- Exactly one scope, containing all tensor variables for all the subnets.
- Settings and hyper-parameters.
Thanks! Updated (the session owns the scope), please review.
The session owns a single global scope, but a scope can have sub-scopes, so I did not specify one scope.
The session does not own the graph; the graph is what gets evaluated with the session.
The session does not own settings and hyper-parameters; the session could be created from settings and hyper-parameters.
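To make the ownership rules in this thread concrete, here is a minimal Python sketch. It is an illustration only, not the interface from the design doc: `Scope`, `Session`, `new_scope`, `find`, and the `session()` factory are hypothetical names, and the `devices` argument follows the naming adopted elsewhere in this review. It assumes one global scope per session, sub-scopes hanging off it, and devices that are used but not exclusively owned.

```python
class Scope:
    """Hypothetical variable container; a scope can have sub-scopes."""

    def __init__(self, parent=None):
        self.parent = parent
        self.vars = {}

    def new_scope(self):
        # Create a sub-scope whose lookups fall back to this scope.
        return Scope(parent=self)

    def find(self, name):
        # Walk up the parent chain until the variable is found.
        scope = self
        while scope is not None:
            if name in scope.vars:
                return scope.vars[name]
            scope = scope.parent
        raise KeyError(name)


class Session:
    """Hypothetical session: owns exactly one global scope."""

    def __init__(self, devices="cpu"):
        # Accept a single device name or a list of names.
        self.devices = [devices] if isinstance(devices, str) else list(devices)
        self.scope = Scope()  # owned by this session only

    def close(self):
        # Releasing the scope is the only cleanup modeled here.
        self.scope = None


def session(devices="cpu"):
    # Stand-in for a factory such as `paddle.session`; an assumption.
    return Session(devices)


# Two sessions may name the same device but never share a scope.
sess_a = session(devices="gpu:0")
sess_b = session(devices=["gpu:0"])
sess_a.scope.vars["w"] = 1.0
child = sess_a.scope.new_scope()
```

The graph, settings, and hyper-parameters are deliberately absent from the class: per the reply above, the session is created from settings and evaluates a graph, but owns neither.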
Added Python interface, please review.
Evaluates the target Operations or Variables in `targets`.

- *targets*: the evaluation targets. Can be a single Operation or
Should targets be an instance of "Block"(graph)?
Targets will be an OP or the output of an OP (Var). The "Block" (ProgramDesc, to be exact) will be inferred by `eval`.
To make the relationship clearer, I have updated the PR, please take a look.
doc/design/refactor/session.md
Outdated
The computation graph is implicitly inferred from the targets.

- *feed_dict*: a dictionary that contains the tensors which overrides
Is this the input data?
Yes, but not only the input data; it can override any edge as well. E.g.:
a = pd.constant(1.0, name="a")
b = pd.constant(2.0)
c = pd.mul(a,b)
sess.eval(targets=c, feed_dict={"a":3.0}) # returns 6.0
I have added the above example into the design doc.
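For readers who want to see the "override any edge" semantics in action, the following is a small self-contained Python sketch. It does not use PaddlePaddle; `Var`, `constant`, `mul`, and `ToySession` are hypothetical stand-ins, and the recursive evaluator is only one possible way to implement the behavior described above.

```python
class Var:
    """Hypothetical graph node: either a constant or the output of an op."""

    def __init__(self, name=None, value=None, op=None, inputs=()):
        self.name = name
        self.value = value
        self.op = op
        self.inputs = inputs


def constant(value, name=None):
    return Var(name=name, value=value)


def mul(x, y, name=None):
    return Var(name=name, op=lambda a, b: a * b, inputs=(x, y))


class ToySession:
    def eval(self, targets, feed_dict=None):
        feed_dict = feed_dict or {}

        def run(var):
            # A fed value replaces whatever the graph would compute here,
            # whether the edge is an input or an intermediate result.
            if var.name is not None and var.name in feed_dict:
                return feed_dict[var.name]
            if var.op is None:
                return var.value
            return var.op(*(run(i) for i in var.inputs))

        return run(targets)


a = constant(1.0, name="a")
b = constant(2.0)
c = mul(a, b, name="c")
sess = ToySession()
result_default = sess.eval(targets=c)                      # 1.0 * 2.0
result_fed = sess.eval(targets=c, feed_dict={"a": 3.0})    # 3.0 * 2.0
result_edge = sess.eval(targets=c, feed_dict={"c": 10.0})  # edge "c" itself is overridden
```

The last call shows the point made in the reply: feeding is not limited to input data, since an intermediate edge can be replaced as well.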
doc/design/refactor/session.md
Outdated
close()
```

Closes the session. Calling this method releases the scope.
We also need `save()` and `load()` to do checkpointing.
`save` and `load` will be OPs, so the user will need to run something like `sess.eval(targets=save)` or `sess.eval(targets=load)`.
Similar to TF (which treats `save` and `load` as OPs), we can add syntax sugar that wraps saving and loading the model into something like:
tf.reset_default_graph()
# Create some variables.
v1 = tf.get_variable("v1", [3], initializer = tf.zeros_initializer)
v2 = tf.get_variable("v2", [5], initializer = tf.zeros_initializer)
# Add ops to save and restore only `v2` using the name "v2"
saver = tf.train.Saver({"v2": v2})
# Use the saver object normally after that.
with tf.Session() as sess:
    # Initialize v1 since the saver will not.
    v1.initializer.run()
    saver.restore(sess, "/tmp/model.ckpt")
    print("v1 : %s" % v1.eval())
    print("v2 : %s" % v2.eval())
I think the syntax sugar will not be in the scope of this design doc (maybe more suited for Python API design doc).
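As a rough illustration of "save and load are OPs", here is a toy Python sketch in which checkpointing goes through the same `eval` call as any other target. Nothing here is PaddlePaddle or TensorFlow API: `ToySession`, `save_op`, and `load_op` are hypothetical, the scope is modeled as a plain dict, and JSON is used as the checkpoint format purely for convenience.

```python
import json
import os
import tempfile


class ToySession:
    """Hypothetical session whose scope is a plain dict."""

    def __init__(self):
        self.scope = {}

    def eval(self, targets):
        # An op is modeled as a callable that receives the session's scope.
        return targets(self.scope)


def save_op(path):
    def run(scope):
        with open(path, "w") as f:
            json.dump(scope, f)
        return path
    return run


def load_op(path):
    def run(scope):
        with open(path) as f:
            scope.update(json.load(f))
        return scope
    return run


path = os.path.join(tempfile.mkdtemp(), "model.ckpt")

trained = ToySession()
trained.scope["w"] = [0.5, 1.5]
trained.eval(targets=save_op(path))   # checkpoint via an ordinary eval

restored = ToySession()
restored.eval(targets=load_op(path))  # restore into a different session's scope
```

A saver-style wrapper like the TF example above could be layered on top of these two ops without changing the session interface.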
doc/design/refactor/session.md
Outdated
Creates a new session. One session owns one scope, so creating
multiple sessions will create different scopes.

- *gpu_ids*: a single `int` or a list of `int` of the GPU IDs to be
Setting up devices can take advantage of the Paddle v1 design, which by default sets `gpu_ids` to all of the available GPUs.
Can we use a string to specify devices? There may be devices other than GPUs, like FPGAs. What TF does is `/job:worker/task:1/gpu:0`, which could also be `/job:worker/task:1/fpga:0`.
Great idea! Changed `gpu_ids` to `devices`.
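To show why a string-based device name extends naturally to non-GPU hardware, here is a short Python sketch that splits a TF-style device string into its parts. The function name `parse_device` and the returned dict layout are assumptions made for illustration; the design doc does not define a parser.

```python
def parse_device(name):
    """Split a device string like "/job:worker/task:1/gpu:0" into a dict.

    Hypothetical helper: each "key:value" segment becomes one entry,
    and purely numeric values are converted to int.
    """
    spec = {}
    for part in name.strip("/").split("/"):
        key, sep, value = part.partition(":")
        if not sep or not key or not value:
            raise ValueError("malformed device segment: %r" % part)
        spec[key] = int(value) if value.isdigit() else value
    return spec


gpu_spec = parse_device("/job:worker/task:1/gpu:0")
fpga_spec = parse_device("/job:worker/task:1/fpga:0")
```

Because the device kind is just another key, supporting a new kind of hardware requires no change to the string format itself.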
I'm thinking about the difference between Session and Executor. The Executor has a Run interface to execute a ProgramDesc created at compile time. DeviceContext (CPUDeviceContext/CUDADeviceContext) is created and managed by the Executor.
Maybe the Executor is a data member of Session. The Session will get a target and some device ids. The target is parsed to get a ProgramDesc. Then, the ProgramDesc and device ids are passed to the Executor. The Executor will create a DeviceContextManager according to the device ids. At last, the ProgramDesc will be executed on the specific hardware.
@QiJune Great! That is clear logic. We could add one more step, ProgramOptimizer (currently called Converter), between Session and Executor. Please see this graph for more detail: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/refactor/distributed_architecture.md#local-training-architecture (In the graph, "PaddlePaddle runtime" means the Executor. Btw, there are many names for the same thing; we need to decide on the naming.)
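The flow proposed at this point in the thread (Session, then Converter, then Executor) can be sketched in a few lines of Python. This is a toy model under stated assumptions: `ProgramDesc` is reduced to a list of callables, `Converter.convert` is a placeholder that copies the program unchanged, the scope is a dict, and the session receives a ready-made program instead of inferring one from targets. None of these classes reflect real PaddlePaddle signatures.

```python
class ProgramDesc:
    """Hypothetical program: an ordered list of ops (callables on a scope)."""

    def __init__(self, ops):
        self.ops = list(ops)


class Converter:
    """Placeholder for the ProgramOptimizer step; performs no real rewriting."""

    def convert(self, program, devices):
        return ProgramDesc(program.ops)


class Executor:
    """Runs a program; device contexts are only recorded, not created."""

    def __init__(self, devices):
        self.devices = devices

    def run(self, program, scope):
        for op in program.ops:
            op(scope)


class Session:
    def __init__(self, devices):
        self.devices = list(devices)
        self.scope = {}
        self.converter = Converter()
        self.executor = Executor(self.devices)

    def eval(self, program):
        # Session hands the program to the converter, then to the executor.
        optimized = self.converter.convert(program, self.devices)
        self.executor.run(optimized, self.scope)
        return self.scope


def set_x(scope):
    scope["x"] = 1


def set_y(scope):
    scope["y"] = scope["x"] + 1


program = ProgramDesc([set_x, set_y])
sess = Session(devices=["cpu"])
out = sess.eval(program)
```

Whether the Executor should be a member of Session at all is still debated in the comments that follow, so this sketch reflects only the proposal quoted here.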
After the offline discussion with @helinwang, I believe `Session` is
- A manager of the resources, which is a higher-level concept than DeviceContextManager. It owns the resources.
- The job handler.
  - It will construct a new graph according to the `targets` the user gives.
  - It interacts with the remote cluster or local machine: fetch/feed tensors, start/stop running a graph, ...
  - The close interface allows the user to release the resources it owns.
The `Executor` is a runtime concept, which has nothing to do with Session.
> it's will construct a new graph according to the targets user given.
The session will not "construct a new graph"; it will send the graph to the converter, and the converter will construct a new graph.
Do we need to implement this module before the 0.11.0?
doc/design/refactor/session.md
Outdated
a = pd.constant(1.0, name="a")
b = pd.constant(2.0)
c = pd.mul(a,b)
sess.eval(targets=c, feed_dict={"a":3.0}) # returns 6.0
A constant whose value can be changed looks weird. Maybe naming them `variable` is better?
Thanks! Done.
doc/design/refactor/session.md
Outdated
)
```

Creates a new session. One session owns one scope, so creating
This statement is confusing. One session owns at least one scope, namely the global scope of that session, right?
Or do you mean that one session will have exactly one scope?
Yes, the global scope. Done.
doc/design/refactor/session.md
Outdated
Creates a new session. One session owns one scope, so creating
multiple sessions will create different scopes.

- *devices*: a single `string` or a list of `string` of device names,
Maybe we really need a `SessionConfig` here? I mean, if one session cannot fully utilize the GPU resource, then another session may also own the same GPU.
In my view, we need to submit the config for each session run call. If it is a local run call, we provide the local config; otherwise, we provide the cluster config. Moreover, if it is an inference run call, we may provide another config, which is totally different from the ones above.
We can leave these complexities to be solved in the future, but we need to make the concept clear.
> if one session cannot fully utilize the GPU resource, then another session may also own the same GPU.
Devices only means the devices that the session uses; multiple sessions can use the same device. I have added "Multiple sessions can use the same device." to clear up this point.
> If it is a local run call, we provide the local config, vice versa, we provide the cluster config
A local session is created by `paddle.session`; a remote session is created by `paddle.remote_session`.
> if it is an inference run call, we may provide another config
In my view, inference is just the user specifying an inference target, which is no different from training (specifying a training target). They should use the same kind of session. The layers below the session should do the optimization (e.g., based on batch size), transparent to which session is used.
Ok, I see.
We have discussed this module for a long time and have reached an agreement, so it's better to merge it. For anyone who has questions or a different view, we can have an offline meeting. :)
LGTM!
Here is better for review.
Fixes: #4552