Skip to content
This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

*: refactor subtask management and communication model #116

Merged
merged 61 commits into from
May 22, 2019
Merged

*: refactor subtask management and communication model #116

merged 61 commits into from
May 22, 2019

Conversation

IANTHEREAL
Copy link
Collaborator

@IANTHEREAL IANTHEREAL commented Apr 18, 2019

What problem does this PR solve?

The subtask state machine and worker controller mechanisms are not ideal now.
Users use direct interaction to manipulate tasks, like user issue a create task request and the request fails, dm-worker wouldn't leave any information about it. the implementation is not conducive to debugging and make restoring tasks more difficult when dm-worker restarts. ref #77

What is changed and how it works?

  • Introducing log in dm-worker to store task request
    • use leveldb as local kv db, store task meta and task operation log in it
    • wrap task operation log as a queue-like object, providing operation
      • peek oldest non-handled log
      • forward handled pointer and mark the result of this log
    • make DM-worker's communication framework asynchronous
      • DM-worker would return success while it store the request operation into log
      • a background goroutine worker/handleTask would apply these logs in order and mark it's result
      • add a waitOperationOk function for DM-server to make the asynchronous request be asynchronous
    • simplify subtask state machine
      • only leave Run, Close operations and make them reentrant

I try to keep the original code of subtask.go and worker.go, and we will further simplify them while improve the test coverage.
In fact because the code is too much, I think it's better to issue first edition firstly. Don't worry, most of them are unit test.

Some review concerns

  • review the design and implement from multiple pointer
    • log
    • rpc interface
    • subtask state machine
    • worker controller
  • some point are really different
    • asynchronous request - even we make the asynchronous request be synchronous, but in fact it's different with old fashion, the former takes effect more slowly. In some failpoint panic testing , we should change checking result way
    • there a lot of problems related to the use and release and of resources, I remove pause and resume function, it may affect some mechanisms like online ddl

Check List

Tests

  • Unit test
  • Integration test
  • Manual test
    • start a dm cluster
    • start/pause/stop/resume task
  • No code

Code changes

  • Has interface methods change
  • Has persistent data change

Side effects

  • Breaking backward compatibility - we should upgrade dm-worker/master at the same time

@IANTHEREAL IANTHEREAL added the status/WIP This PR is still work in progress label Apr 18, 2019
@IANTHEREAL
Copy link
Collaborator Author

/run-all-tests

@IANTHEREAL
Copy link
Collaborator Author

/run-all-tests

@IANTHEREAL
Copy link
Collaborator Author

/run-all-tests

1 similar comment
@amyangfei
Copy link
Contributor

/run-all-tests

dm/worker/log.go Outdated
for {
select {
case <-ctx.Done():
log.Infof("[task log gc] goroutine exist!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Infof("[task log gc] goroutine exist!")
log.Info("[task log gc] goroutine exits!")

@amyangfei
Copy link
Contributor

/run-all-tests

3 similar comments
@amyangfei
Copy link
Contributor

/run-all-tests

@amyangfei
Copy link
Contributor

/run-all-tests

@amyangfei
Copy link
Contributor

/run-all-tests

dm/worker/worker.go Outdated Show resolved Hide resolved
@amyangfei
Copy link
Contributor

/run-all-tests

2 similar comments
@amyangfei
Copy link
Contributor

/run-all-tests

@zhouqiang-cl
Copy link
Contributor

/run-all-tests

@IANTHEREAL
Copy link
Collaborator Author

/run-all-tests

2 similar comments
@IANTHEREAL
Copy link
Collaborator Author

/run-all-tests

@IANTHEREAL
Copy link
Collaborator Author

/run-all-tests

Copy link
Member

@csuzhangxc csuzhangxc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@csuzhangxc csuzhangxc added status/LGT1 One reviewer already commented LGTM and removed status/PTAL This PR is ready for review. Add this label back after committing new changes labels May 21, 2019
@amyangfei amyangfei added status/LGT2 Two reviewers already commented LGTM, ready for merge and removed status/LGT1 One reviewer already commented LGTM labels May 21, 2019
@IANTHEREAL IANTHEREAL merged commit b8b1b2a into pingcap:master May 22, 2019
@IANTHEREAL IANTHEREAL deleted the ian/refine-task branch May 22, 2019 01:10
lichunzhu pushed a commit to lichunzhu/dm that referenced this pull request Apr 6, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
priority/normal Minor change, requires approval from ≥1 primary reviewer status/LGT2 Two reviewers already commented LGTM, ready for merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants