
Optimizer C_API #2168

Closed
dzhwinter opened this issue May 16, 2017 · 22 comments
@dzhwinter (Contributor) commented May 16, 2017

As mentioned in the last discussion:

Model Optimization Using Gradients

There are two ways to perform model optimization using gradients:

  • On Client
    The client does multiple steps of forward and backward updates. In each step, the gradients are calculated and a new model is generated. After some steps, the client calculates the difference between the newest model and the old model at step 0. The difference is sent to the parameter servers. Parameter servers simply update parameters using the difference, without any optimization based on gradients (such as Adam or L1 regularization).
  • On Parameter Server
    The client sends accumulated gradients to the parameter servers, and the parameter servers perform the optimization using the gradients.

We plan to support both parameter-update methods. The current v1 only supports method 1 (On Client). Since both methods need the optimizer's update strategy, we decided to package the Optimizer as a library.

The ParameterServer is implemented in Go, so it needs a C interface for the Optimizer, defined as follows:

    // supported data types, matching @helin's client design doc
    typedef enum {
      PADDLE_ELEMENT_TYPE_INT32   = 0,
      PADDLE_ELEMENT_TYPE_UINT32  = 1,
      PADDLE_ELEMENT_TYPE_INT64   = 2,
      PADDLE_ELEMENT_TYPE_UINT64  = 3,
      PADDLE_ELEMENT_TYPE_FLOAT32 = 4,
      PADDLE_ELEMENT_TYPE_FLOAT64 = 5,
    } paddle_element_type;

    /*
    @brief Update interface of the optimizer, used by the Trainer process
    and by the ParameterServer process to support On-Parameter-Server
    optimization.
    @param buffer : array of parameters
    @param datatype : data type of parameters and gradients
    @param optimizer_name : algorithm id, e.g. "SGD", "Adam"
    @param gradient : array of gradients to be applied to the parameters
    */
    void updateParameter(void *buffer, paddle_element_type datatype, const char* optimizer_name, const void* gradient);

1. Can sparse and dense updates (sparseUpdate/denseUpdate) share this single interface?

SparseUpdate is stored as a SparseRowMatrix, so it can reuse this interface.

2. Can the Regularizer be packaged into this library as well?

The On-Client parameter update is already coupled with communication, especially for SparseUpdate: because the update process is lazy and iterates locally many times, the Regularizer has to record the number of rounds computed and trigger an update on some read. I have not found a good way to decouple the communication state.

The Optimizer plans to wrap low-level operations such as applySGD in the math library; see:

/~https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/math/TrainingAlgorithmOp.cu#L25

This part of the code can be migrated once Majel is integrated later.

@dzhwinter (Contributor, Author) commented May 16, 2017

Note that the Regularizer part must still be kept separate. The current code tries to combine the optimizer and the regularizer, and that part appears to be unused.
/~https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/parameter/OptimizerWithRegularizer.h#L63

The desired end state: a unified library takes over the optimization computation for both the trainer and the ParameterServer, wrapping the applySGD-style functions in the math layer.

@helinwang (Contributor):

Wouldn't it be better to change const char* optimizer_name to an enum?

@dzhwinter (Contributor, Author) commented May 16, 2017

👍 Should we put these declarations in a proto file?

    enum OptimizerIdentifier {
      SGD  = 0,
      ASGD = 1,
      ADAM = 2,
      // ...
    };

    /*
    @brief Update interface of the optimizer, used by the Trainer process
    and by the ParameterServer process to support On-Parameter-Server
    optimization.
    @param buffer : array of parameters
    @param gradient : array of gradients to be applied to the parameters
    @param optimizer : algorithm id, e.g. SGD, Adam
    @param datatype : data type of parameters and gradients
    */
    void updateParameter(void *buffer, const void* gradient, OptimizerIdentifier optimizer,
                         paddle_element_type datatype);

This interface is very simple. Two points to confirm: first, that a single interface can really express every optimization function; second, that the regularizer really cannot also be provided as a function interface for the ParameterServer and the Trainer to call separately.

@helinwang (Contributor):

> The Optimizer plans to wrap low-level operations such as applySGD in the math library; see:
/~https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/math/TrainingAlgorithmOp.cu#L25

I see that the referenced code is GPU code. A reminder: the pserver does not need to support GPU for now; it would waste GPU resources, and CPU is not necessarily much slower.

I suggest we write our own Optimizer C++ code first. Tensor is being refactored right now; if needed, we can switch to the shared Tensor library later.

@helinwang (Contributor) commented May 17, 2017

请看/~https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/cluster_train/pserver_client.md 其中的

int paddle_begin_init_params(paddle_pserver_client* client, const char* config_proto);

This requires a proto file defining the pserver configuration. But I suggest the Optimizer not understand proto at all: let the pserver parse the proto, then create the Optimizer according to what the proto asks for.

@helinwang (Contributor) commented May 17, 2017

> This interface is very simple. Two points to confirm: first, that a single interface can really express every optimization function; second, that the regularizer really cannot also be provided as a function interface for the ParameterServer and the Trainer to call separately.

"一个接口确实已经可以完全表达所有优化函数接口。":应该是可以了,可能需要稍作修改:

typedef enum {
  SGD =  0,
  ASGD = 1,
  ADAM = 2,
  // ...
} optimizer_identifier;
typedef struct optimizer optimizer; // forward declaration
optimizer* paddle_create_optimizer(optimizer_identifier identifier);
void paddle_release_optimizer(optimizer* o);
int paddle_update_parameter(optimizer* o, void *buffer, paddle_element_type datatype, const void* gradient, int num_bytes, double learning_rate);

Added num_bytes as the length of buffer and gradient.
Added learning_rate.
I renamed OptimizerIdentifier to optimizer_identifier and updateParameter to paddle_update_parameter; it seems our C code uses the underscore style? (/~https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/capi/matrix.h#L30)
Also, C has no namespaces, so a prefix is commonly used as one; that is why I changed it to paddle_update_parameter.

@dzhwinter (Contributor, Author) commented May 17, 2017

Agreed with the comment; starting to rewrite the optimizer.

@dzhwinter (Contributor, Author):

@helinwang
I found a new problem: how do we support different CPU instruction sets?
For example, I have implemented the GradientDescent optimizer, but I found that old Paddle supports AVX through a different interface. Do we need to add this feature to the new library?

void sgdUpdateAvx(float learningRate,

@helinwang (Contributor) commented May 17, 2017

@dzhwinter AVX is supported by almost all modern CPUs. However, maybe let's do the non-AVX CPU version first (development is faster too, since there is no need to learn the AVX intrinsics) and add AVX only if optimization turns out to be necessary.

@dzhwinter (Contributor, Author):

We already did the SGD algorithm work in v1; no more effort needs to be spent there, since we are in a hurry.
But we need to rethink the create_optimizer interface. Please take a look at this old optimizer configuration:
/~https://github.com/PaddlePaddle/Paddle/blob/develop/proto/TrainerConfig.proto#L21

I found that parsing the configuration is inevitable. For example:

// Adagrad optimizer
  optional double ada_epsilon = 24 [default = 1e-6];
  optional double ada_rou = 26 [default = 0.95];

  // Options Adam Optimizer //
  ////////////////////////////
  optional double adam_beta1 = 33 [default = 0.9];
  optional double adam_beta2 = 34 [default = 0.999];
  optional double adam_epsilon = 35 [default = 1e-8];

How about passing a serialized proto to describe the configuration? We need this config when we create the optimizer; otherwise we would need a standalone pure-C create function for every optimizer method.

@helinwang (Contributor) commented May 19, 2017

@dzhwinter Before, we wanted the optimizer to be super simple, with as little state as possible, and the Go part to handle the messy parts such as managing momentum memory and parsing the config; so Go stores a lot of state (like the momentum memory, since it must parse the config to know whether the optimizer is momentum-based).

If we want the optimizer to parse the config and create the specific optimizer itself, then Go had better not parse it again. All config-related state would then need to be managed by C++, which would complicate the interfaces (e.g., C++ would need to manage momentum memory for each parameter, so the parameter name would need to be exposed through the C++/Go API).

I am a little worried that this complicates things more than "needing a standalone create function in pure C for every optimizer method". What do you think?

@dzhwinter (Contributor, Author):

I have no strong preference on whether the Go part should be responsible for the state and the config parsing.

Actually, I just thought that if we do it the Go-side way, we need a bunch of different create functions and update functions.
By the way, I couldn't agree more that we should only put stateless code into the optimizer and not require the optimizer to understand the config. Firstly, if we treat the optimizer as a library, it should not manage state. Secondly, understanding the config and checking it should happen in a single place, which is the trainer in the Go part. We have suffered from the config-parsing problem for a long time.
I just think maybe there is some way we can make it better?

@dzhwinter (Contributor, Author) commented May 19, 2017

I'll write some pseudo code here to show what happens if we put it in the Go part.

// create-optimizer function signatures
create_SGDOptimizer(learning_rate)
create_Momentum(learning_rate, mu)
create_Adam(learning_rate, ...)

// update function signatures
updateSGD(parameter, gradient, mu, learning_rate);
updateMomentum(parameter, momentum, gradient, mu, learning_rate);
updateAdam(parameter, momentum, gradient, mu, rha, learning_rate);

@helinwang (Contributor) commented May 19, 2017

I have a new idea up for discussion. Sorry that my opinion keeps changing on this one; I had not thought the problem through very clearly.

Before in my mind, the optimizer is used like this (maybe the same in your mind as well):

One optimizer instance for all parameters:

- pserver process
  - optimizer instance
  - parameter map
    - "param_a": memory_buffer
    - "param_b": memory_buffer
    - ...

During an update, the memory_buffer and gradient for param_a are
sent to the optimizer instance, which changes the memory_buffer
for param_a.

The problem with a single optimizer instance is that it cannot save per-parameter state such as momentum. We could work around that by letting Go save the momentum as well:

- pserver process
  - optimizer instance
  - parameter map
    - "param_a": memory_buffer, momentum_buffer
    - "param_b": memory_buffer, momentum_buffer
    - ...

During an update, the memory_buffer, momentum_buffer, and gradient
for param_a are sent to the optimizer instance, which changes the
memory_buffer for param_a.

This workaround has a problem: what if a new kind of per-parameter state appears? Should Go maintain that as well?

Another solution: instead of one optimizer instance for all parameters, use one optimizer instance per parameter:

- pserver process
  - parameter optimizer map
    - "param_a": optimizer instance
    - "param_b": optimizer instance
    - ...

During an update, the gradient for param_a is sent to the optimizer
instance for param_a, which does the update and writes to the
parameter buffer that it owns.

The Go code will create an optimizer with the initial parameter value (an `unsigned char*`),
and the optimizer will own the parameter buffer; Go no longer manages it.

If we use this approach, the config parsing could live in C++, since Go would no longer need to maintain momentum or parameters. The interface could be:

optimizer* paddle_create_optimizer(const unsigned char* config_proto, int config_proto_len, const unsigned char* buffer, int num_bytes);
void paddle_release_optimizer(optimizer* o);
int paddle_update_parameter(optimizer* o, paddle_element_type datatype, const unsigned char* gradient, int num_bytes);
const unsigned char* paddle_optimizer_param(optimizer* o, int* num_bytes);

@dzhwinter (Contributor, Author) commented May 22, 2017

Well, solution 1 seems like the right direction for a stateless library. After diving into all the optimizer algorithms in Paddle and TensorFlow, I find that having Go maintain all the state is achievable.

what if there is a new per parameter state, should Go maintain it as well?

This problem can be solved, since there are only a limited number of kinds of optimizer state. I thought we could write it this way:

struct parameter_map {
  void *parameter;
  void *gradient;
  void *momentum;
  void *l1;
  void *l2;
  /* ... */
  double learning_rate;
};
int paddle_update_parameter(optimizer* o, paddle_element_type datatype, parameter_map* params, int num_bytes);

We can clearly figure out how many such fields there would be by listing the algorithms in PaddlePaddle:
/~https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/training_ops.h
This is solution 1:

- pserver process
  - optimizer instance
  - parameter map
    - "param_a": memory_buffer, momentum_buffer
    - "param_b": memory_buffer, momentum_buffer
    - ...

However, I found that each pserver process assumes all the parameters on that node use the same optimizer algorithm, because they share one optimizer instance. But different parameters often need different optimizers, e.g., the wide-and-deep model (https://arxiv.org/abs/1606.07792): the left part of the network is optimized with AdaGrad and the right part with FTRL (follow-the-regularized-leader). As a result, each parameter must own its optimizer instance, so we have to do the job as in solution 2.

It is Sunday in your time zone, so I will rewrite the optimizer following solution 2 and focus on other cloud development tasks; I hope it is the right choice.

@dzhwinter (Contributor, Author):

By the way, if there is any solution that can keep the parameter state in the Go part, I prefer that way, since we are designing a library. Rewriting this part would not take much time; if an improved approach comes up, I can just rewrite it.

@helinwang (Contributor) commented May 22, 2017

@dzhwinter Writing it as solution 2 sounds great!
Let's reach consensus on the interface. Does this interface look good to you?

optimizer* paddle_create_optimizer(const unsigned char* config_proto, int config_proto_len, const unsigned char* buffer, int num_bytes);
void paddle_release_optimizer(optimizer* o);
int paddle_update_parameter(optimizer* o, paddle_element_type datatype, const unsigned char* gradient, int num_bytes);
const unsigned char* paddle_optimizer_param(optimizer* o);

@dzhwinter (Contributor, Author) commented May 23, 2017

Thanks for the reply!
In detail, only one part is different.

> optimizer* paddle_create_optimizer(const unsigned char* config_proto, int config_proto_len, const unsigned char* buffer, int num_bytes);

Is the buffer argument designed to store some other argument? I am a little confused here.

> int paddle_update_parameter(optimizer* o, paddle_element_type datatype, const unsigned char* gradient, int num_bytes);

The argument list should contain learning_rate, right?

int paddle_update_parameter(optimizer* o, paddle_element_type datatype, const unsigned char* gradient, int num_bytes, double learning_rate);

@helinwang (Contributor) commented May 23, 2017

@dzhwinter Here is what I have in mind; it could be wrong, just for discussion:

The buffer argument passes the initial parameter value to the optimizer; the optimizer then allocates a byte-aligned memory buffer and stores the parameter there for updates. That is why there is a function to read the parameter back from the optimizer: const unsigned char* paddle_optimizer_param(optimizer* o, int* num_bytes);

The learning rate is initialized from config_proto during paddle_create_optimizer, and the optimizer maintains it by itself. Many optimizers have their own schedule for the learning rate:

// learning rate will be scaled according to learning_rate_schedule

@dzhwinter (Contributor, Author) commented May 23, 2017

@helinwang
Oh, it seems I didn't understand the buffer at all.
I thought solution 2 only exposed the parameter name to the optimizer, and let the optimizer library own the parsing code and the parameter memory as well.

So, to double-check: does this function return the parameter name list from the optimizer?

const unsigned char* paddle_optimizer_param(optimizer* o, int* num_bytes);

In addition, there is a small problem with the learning_rate policy if the optimizer library owns it.

virtual real calcLearningRate(int64_t numSamplesProcessed, int64_t pass) = 0;

// e.g.
class PolyLRS : public BaseLRS {
public:
  explicit PolyLRS(const OptimizationConfig& config) : BaseLRS(config) {}
  virtual real calcLearningRate(int64_t numSamplesProcessed, int64_t pass) {
    return learningRate_ * pow(1.0 + a_ * numSamplesProcessed, -b_);
  }
};

numSamplesProcessed would be another piece of state, which goes against our original idea of a stateless optimizer library.

@helinwang (Contributor):

> solution 2 only export the parameter name to the optimizer, and let the optimizer library take the parsing code and parameter memory as well.

Yes, exactly: the optimizer owns the parameter memory. That memory needs to be initialized, which is why the buffer is passed to the optimizer as an argument.

Because the optimizer owns the parameter memory, Go needs an interface to read that memory out; const unsigned char* paddle_optimizer_param(optimizer* o, int* num_bytes); does exactly that.

virtual real calcLearningRate(int64_t numSamplesProcessed, int64_t pass) {
  return learningRate_ * pow(1.0 + a_ * numSamplesProcessed, -b_);
}

The variable numSamplesProcessed is used here, but the optimizer can keep that state itself. My understanding of solution 2 is that the optimizer stores all the state, one optimizer is created per parameter, and the Go part only stores the name-to-optimizer mapping.
