Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: prevent potential thread starvation in SessionPool #85

Merged
merged 3 commits into from
Feb 28, 2020

Conversation

olavloite
Copy link
Collaborator

The SessionPool uses the same executor for both creating sessions and preparing these for read/write transactions. This could in very specific circumstances lead to thread starvation, which again can cause the session pool to stop returning sessions, even when there are sessions available in the pool.

@olavloite olavloite added the api: spanner Issues related to the googleapis/java-spanner API. label Feb 27, 2020
@olavloite olavloite requested a review from skuruppu February 27, 2020 15:25
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Feb 27, 2020
@olavloite olavloite changed the title Prevent potential thread starvation in SessionPool fix: prevent potential thread starvation in SessionPool Feb 27, 2020
@skuruppu skuruppu added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Feb 28, 2020
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Feb 28, 2020
Copy link
Contributor

@skuruppu skuruppu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this fix. Also nice test.

We'll merge this in for now but I'm happy to also consider the inlining of BeginTransaction with the first statement. I think that PR is still in google-cloud-java and I haven't reviewed it.

@skuruppu skuruppu added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Feb 28, 2020
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Feb 28, 2020
@skuruppu skuruppu added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Feb 28, 2020
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Feb 28, 2020
@skuruppu skuruppu merged commit 248048a into master Feb 28, 2020
@skuruppu skuruppu deleted the prevent-thread-starvation branch February 28, 2020 03:28
@hengfengli
Copy link
Contributor

hengfengli commented Feb 28, 2020

@olavloite Thanks for fixing this. I am trying to understand your test and I would like to ask some questions.

Assume that executor.getCorePoolSize() returns 8 according to: /~https://github.com/googleapis/java-core/blob/74f8632732a47e3f68cde9a29b95da9b9547dfbb/google-cloud-core-grpc/src/main/java/com/google/cloud/grpc/GrpcTransportOptions.java#L60

Then, the session pool has a config between 8 (min) and 24 (max). In your test, you set the BatchCreateSessions call to be blocked (not responding forever) after initializing. After that, you simulate a traffic with 16 reads and 24 writes.

Let's say that 8 transactions get 8 sessions at first and then the 9th transaction cannot get a session. The 9th transaction will try to call BatchCreateSession to get a new one and be waiting for the new session available. The 9th transaction will be unlocked if one of the following two cases happens:

  1. a session is returned by BatchCreateSession request
  2. another transaction finishes and returns a session back to the session pool

Therefore, in this test, case 1 would not happen (never returns) and case 2 will let the remaining transactions to finish their jobs.

a) My question is that how is this test related to the issue that prepareReadWriteTransaction blocks all threads in executor (your fix to add a separate executor)? How does your test cover the case that prepareReadWriteTransaction is slow and blocks threads?

b) Another question after I read the code is that if we set the max sessions to 1500 and numChannels to 16, will it change the number of threads in executor? Does it mean that the number threads in executor is always fixed?

@olavloite
Copy link
Collaborator Author

@hengfengli

a) My question is that how is this test related to the issue that prepareReadWriteTransaction blocks all threads in executor (your fix to add a separate executor)? How does your test cover the case that prepareReadWriteTransaction is slow and blocks threads?

The test case does not cover that, as the problem is actually caused by BatchCreateSessions being slow. The fix does however also solve potential thread starvation problems for a situation where prepareReadWriteTransaction is slow as well. What happens in this test case is the following:

  1. The session pool contains MinSessions (in this example 8) read-only sessions and all seems to be well (writeFraction has been set to 0.0 for this test case for simplicity, but any value for writeFraction would in the end cause the same problem).
  2. The BatchCreateSessions rpc on the backend starts to show degraded performance, for example because of very heavy load from a secondary application using the same database as the main application.
  3. The main user application continues to execute transactions, and at a given moment gets a small burst of requests which triggers additional session creation in the pool as the number of requests cannot be handled by the 8 sessions currently in the pool. As these requests do not come at the same time, it will trigger 8 separate BatchCreateSessions RPCs. Each of these RPCs will be waiting forever and keeping one of the background threads of the session pool occupied. The default for the session pool is a thread pool of 8 threads, so the test case triggers exactly MinSessions + threadCount (8+8) read requests to be executed in parallel to ensure that all background threads will be used. All background threads of the session pool are now stuck on waiting for new sessions.
  4. We now have a session pool that still contains MinSessions healthy read-only sessions, but no available background threads for session creation and preparing transactions. All read requests can still be served by the existing sessions in the pool, as every time a read request finishes and releases a session to the pool, it is directly passed on to the next read waiter without any need for a background thread.
  5. The main user application now requests a write-prepared session. There are none available in the session pool, so the session pool takes one of the read-only sessions and schedules a prepareReadWriteTransaction request for it. This request will never be executed, because all threads of the background executor are stuck waiting for BatchCreateSessions requests, and the executor has a fixed size. There are now only 7 sessions left in the pool for serving read requests.
  6. Execute the above step as many times as there are sessions in the pool, and no more sessions will be available to serve read requests, because all are waiting for a prepareReadWriteTransaction to finish.

While it's not directly covered by the test case, the above would also apply the other way around. I.e. if the BeginTransaction RPC is slow, it will also have an effect on the ability of the session pool to create sessions, regardless whether the BatchCreateSessions RPC responds quickly or not. Using the same bounded executor to perform two different tasks, means that slowing down one of the tasks can have an impact on the other task as well. Separating them into two different executors will minimize that problem. It will obviously not solve the problem that a slow BatchCreateSessions RPC will have a performance impact on the user application.

b) Another question after I read the code is that if we set the max sessions to 1500 and numChannels to 16, will it change the number of threads in executor? Does it mean that the number threads in executor is always fixed?

The number of threads in the executor is determined based on the gRPC transport options. The default executor that will be used if you don't specify anything else is this one: /~https://github.com/googleapis/java-core/blob/74f8632732a47e3f68cde9a29b95da9b9547dfbb/google-cloud-core-grpc/src/main/java/com/google/cloud/grpc/GrpcTransportOptions.java#L54

A user application can change this by setting a custom executor factory like this:

    SpannerOptions opts =
        SpannerOptions.newBuilder()
            .setTransportOptions(
                GrpcTransportOptions.newBuilder().setExecutorFactory(someFactory).build())
            .build();

Note that the setup() method of the test case determines the number of sessions in the pool etc. based on the number of core threads in the executor that is used. This is done to ensure that all threads in the pool are actually blocked by the requests.

@hengfengli
Copy link
Contributor

@olavloite Your answer is great and well-explained! 👍 Completely resolve my confusion. Thank you very much! 👏

My concern about the second question is that most people may not know about the relationship between the session pool and the thread pool behind it. If they only change max to 1500 (min is 100) and numChannels to 16 without adjusting the number of threads in executor, will they end up with using a 8-threads pool executor to make 1400 BatchCreateSessions RPCs? Even for the new 8-threads pool prepareExecutor, we have to make 1500 BeginTransaction RPCs in order to convert 1500 read sessions to write sessions. Would it be very slow for this case?

@olavloite
Copy link
Collaborator Author

@hengfengli

My concern about the second question is that most people may not know about the relationship between the session pool and the thread pool behind it. If they only change max to 1500 (min is 100) and numChannels to 16 without adjusting the number of threads in executor, will they end up with using a 8-threads pool executor to make 1400 BatchCreateSessions RPCs? Even for the new 8-threads pool prepareExecutor, we have to make 1500 BeginTransaction RPCs in order to convert 1500 read sessions to write sessions. Would it be very slow for this case?

Yes, both concerns are certainly valid. Neither of them will normally cause any real problems, but the performance could be better. The best configuration is normally to keep MinSessions=MaxSessions, or at least configure MinSessions to the number of sessions that the user application will need under normal circumstances. I would say that there are two things we could consider to mitigate these potential issues:

  1. The session pool currently creates 1 new session when a request is made that cannot be served directly from the pool. We could change this to a higher value (e.g. 100), which would prevent repeated requests from creating sessions one by one. So in the case that a user application has configured MinSessions=100 and MaxSessions=1500 and session number 101 is needed, we could do a BatchCreateSessions call for 100 new sessions. The execution time difference between BatchCreateSessions for 1 or 100 sessions is negligible.
  2. As mentioned in a different discussion, moving preparing sessions for read/write transactions away from the session pool and into including the BeginTransaction option with the first statement of a read/write transaction would prevent the dependency on the executor in the session pool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the googleapis/java-spanner API. cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants