Question: benchmark about tcmalloc and memcpy #28
Comments
Hi @guangqianpeng,
I think the main reason for this result is that the
There is actually a trick in the code like
Even when the compiler provides no such SSE enhancement, this "very-short memcpy inline" using general-purpose registers is still more efficient than calling the libc memcpy directly. That is because, for such short copies, the cost of a function call is no longer small enough to neglect, so there would be some gains anyway. Maybe in the future we should use SSE directly instead of counting on the compiler's behavior ;-) But I'm afraid that plan has to be postponed, since there is a much more important thing to do now, i.e. #22.
All discussions and questions about libaco are always welcome here. Just feel free to open a new issue any time you like :D
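To make the "very-short memcpy inline" idea above concrete, here is a minimal sketch. It is not libaco's actual code: the name `copy_short`, the `SHORT_COPY_MAX` threshold, and the multiple-of-8 condition are all assumptions made for illustration. Short copies go word by word through general-purpose registers, and everything else falls back to the libc memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical threshold: copies at or below this size take the inline path. */
#define SHORT_COPY_MAX 64

/* Hypothetical helper, not part of libaco.
 * Copies n bytes from src to dst. If n is small and a multiple of 8, the
 * data is moved one 64-bit word at a time; otherwise libc memcpy() is used. */
static inline void copy_short(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    if (n <= SHORT_COPY_MAX && n % sizeof(uint64_t) == 0) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        for (size_t i = 0; i < n; i += sizeof(uint64_t)) {
            uint64_t w;
            /* A fixed-size 8-byte memcpy is the aliasing-safe way to spell
             * a word load/store; optimizing compilers typically turn it
             * into a single register move rather than a function call. */
            memcpy(&w, s + i, sizeof w);
            memcpy(d + i, &w, sizeof w);
        }
    } else {
        /* Long or oddly sized copies: let libc choose the strategy. */
        memcpy(dst, src, n);
    }
}
```

With this sketch, `copy_short(dst, src, 56)` takes the inline path (56 is the copy_stack_size from the benchmark), while `copy_short(dst, src, 13)` falls back to memcpy(). Whether the inline path is actually faster depends on the compiler, the optimization level, and the libc in use, so it should be measured rather than assumed.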
I benchmarked this conditional memcpy inlining in the past and did get the result I wanted, but I kept no records. I would like to run another test as soon as I have some spare time.
BTW, the libaco project is great, I am looking forward to your next version (especially the co scheduler).
Thank you very much for your kind encouragement, @guangqianpeng. I will try to finish the next release as soon as possible ;-)
How about the next release? :)
Hi, I am reading the code and have run some benchmarks:
https://github.com/guangqianpeng/libaco/blob/master/bench_result
I have two questions:
1. tcmalloc improves the benchmark results. With aco_amount=1000000 and copy_stack_size=56B, the tcmalloc version achieves 37ns per aco_resume() operation, but the default takes 66ns. Why? In this case, aco_resume() does not allocate memory, which is really confusing...
2. The code uses %xmm registers to optimize small memory copies. But according to my benchmark, this does not make much difference. I guess memcpy() already takes advantage of these registers. Do you have more benchmark results?
I will be very grateful to you for answering my questions :-)