TestRegionStatistics is unstable #8319

Closed
HuSharp opened this issue Jun 21, 2024 · 3 comments · Fixed by #8320 or #8411
Labels
type/ci The issue is related to CI.

Comments

@HuSharp
Member

HuSharp commented Jun 21, 2024

Flaky Test

Which jobs are failing

--- FAIL: TestRegionStatistics (14.52s)
    cluster_test.go:262: 
        	Error Trace:	/home/runner/work/pd/pd/tests/server/cluster/cluster_test.go:262
        	Error:      	Not equal: 
        	            	expected: "pd2"
        	            	actual  : "pd1"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-pd2
        	            	+pd1
        	Test:       	TestRegionStatistics
FAIL

CI link

https://github.com/tikv/pd/actions/runs/9611005214/job/26508694960?pr=7756

Reason for failure (if possible)

This test runs only 2 PD servers and calls ResignLeader five times. Any network or disk problem can cause the leader lease to expire, which results in one more leader resignation.
With that extra resignation, the leader is very likely to be flagged as campaigning too frequently.
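
To make the reasoning above concrete, here is a minimal sketch of a sliding-window guard that rejects a campaign once too many have happened recently. It is not PD's implementation: the `campaignLimiter` type, the one-minute window, and the budget of five are all assumptions chosen only to mirror the five planned resignations in this test.

```go
package main

import (
	"fmt"
	"time"
)

// campaignLimiter is a hypothetical guard (not PD's code): it remembers when
// recent leader campaigns happened and rejects one that exceeds the budget.
type campaignLimiter struct {
	window   time.Duration // how far back campaigns are counted (assumed)
	maxTimes int           // campaigns allowed inside the window (assumed)
	times    []time.Time   // timestamps of campaigns still inside the window
}

// record notes a campaign at time now and returns false when the number of
// campaigns inside the window exceeds maxTimes.
func (l *campaignLimiter) record(now time.Time) bool {
	cutoff := now.Add(-l.window)
	kept := l.times[:0]
	for _, t := range l.times {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	l.times = append(kept, now)
	return len(l.times) <= l.maxTimes
}

// allAccepted replays n campaigns one second apart against a fresh limiter
// with a one-minute window and reports whether every campaign was accepted.
func allAccepted(maxTimes, n int) bool {
	l := &campaignLimiter{window: time.Minute, maxTimes: maxTimes}
	start := time.Unix(0, 0)
	for i := 0; i < n; i++ {
		if !l.record(start.Add(time.Duration(i) * time.Second)) {
			return false
		}
	}
	return true
}

func main() {
	// The five planned resignations fit the assumed budget of five.
	fmt.Println("5 campaigns accepted:", allAccepted(5, 5))
	// One extra campaign, e.g. after a lease expiry, exceeds it.
	fmt.Println("6 campaigns accepted:", allAccepted(5, 6))
}
```

Under these assumed numbers the test has no slack: the planned resignations already use the whole budget, so a single unplanned one is enough to trip the guard.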

Anything else

@HuSharp HuSharp added the type/ci The issue is related to CI. label Jun 21, 2024
ti-chi-bot bot pushed a commit that referenced this issue Jun 25, 2024
close #8319

This test runs only 2 PD servers and calls ResignLeader five times. Any network or disk problem can cause the leader lease to expire, which results in one more leader resignation.
The leader is then very likely to be flagged as campaigning too frequently, which makes the leader transfer fail.

Signed-off-by: husharp <jinhao.hu@pingcap.com>
@lhy1024
Contributor

lhy1024 commented Jul 15, 2024

--- FAIL: TestRegionStatistics (25.54s)
    cluster_test.go:264: 
        	Error Trace:	/home/runner/work/pd/pd/tests/server/cluster/cluster_test.go:264
        	Error:      	Should not be: "pd2"
        	Test:       	TestRegionStatistics
FAIL

https://github.com/tikv/pd/actions/runs/9936180084/job/27443847817

@lhy1024 lhy1024 reopened this Jul 15, 2024
@okJiang
Member

okJiang commented Jul 17, 2024

--- FAIL: TestRegionStatistics (17.47s)
    cluster_test.go:262: 
        	Error Trace:	/home/runner/work/pd/pd/tests/server/cluster/cluster_test.go:262
        	Error:      	Should not be: "pd3"
        	Test:       	TestRegionStatistics

https://github.com/tikv/pd/actions/runs/9967757112/job/27541904077

@HuSharp
Member Author

HuSharp commented Jul 17, 2024

--- FAIL: TestRegionStatistics (17.47s)
    cluster_test.go:262: 
        	Error Trace:	/home/runner/work/pd/pd/tests/server/cluster/cluster_test.go:262
        	Error:      	Should not be: "pd3"
        	Test:       	TestRegionStatistics

tikv/pd/actions/runs/9967757112/job/27541904077

Conclusion: the leader changes too frequently. This test resigns the leader many times, and a slow disk makes frequent leader changes very likely :(

cluster_test.go:262 checks that the PD leader is no longer pd3. The failure means that ResignLeader() was called while pd3 was the leader, but tc.WaitLeader() still returned pd3:

	leaderName = leaderServer.GetServer().Name()
	leaderServer.ResignLeader()
	re.NotEqual(tc.WaitLeader(), leaderName)
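
For illustration, the check above can be made tolerant of a brief re-election of the old leader by polling until a different leader shows up. This is only a sketch and not necessarily what the linked fixes do: `waitLeaderChange`, `fakeCluster`, and `demo` are hypothetical names, and `getLeader` stands in for whatever the test uses to look up the current leader.

```go
package main

import (
	"fmt"
	"time"
)

// waitLeaderChange polls getLeader until it returns a non-empty name other
// than oldLeader, or until timeout elapses. getLeader is a hypothetical
// stand-in for the test's leader lookup.
func waitLeaderChange(getLeader func() string, oldLeader string, timeout, interval time.Duration) (string, bool) {
	deadline := time.Now().Add(timeout)
	for {
		if name := getLeader(); name != "" && name != oldLeader {
			return name, true
		}
		if time.Now().After(deadline) {
			return "", false
		}
		time.Sleep(interval)
	}
}

// fakeCluster simulates a lookup that returns the given names in order and
// then keeps returning the last one.
func fakeCluster(names ...string) func() string {
	i := 0
	return func() string {
		name := names[i]
		if i < len(names)-1 {
			i++
		}
		return name
	}
}

// demo models the failure in this issue: pd3 is still reported as leader for
// two lookups before pd2 finally takes over.
func demo() (string, bool) {
	lookup := fakeCluster("pd3", "pd3", "pd2")
	return waitLeaderChange(lookup, "pd3", time.Second, 10*time.Millisecond)
}

func main() {
	name, ok := demo()
	fmt.Println(name, ok)
}
```

Polling like this only hides a transient re-election. It does not help when the other member's campaign is rejected outright, as in the ErrLeaderFrequentlyChange case.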

Let's check the log:

// now leader is pd3
[2024/07/17 03:38:56.625 +00:00] [INFO] [server.go:1737] ["campaign PD leader ok"] [campaign-leader-name=pd3]
[2024/07/17 03:38:57.143 +00:00] [INFO] [member.go:356] ["try to resign etcd leader to next pd-server"] [from=pd3] [to=]
// because of the frequent leader changes, leadership is transferred back to pd3 as well
[2024/07/17 03:38:58.149 +00:00] [INFO] [member.go:356] ["try to resign etcd leader to next pd-server"] [from=pd2] [to=]
[2024/07/17 03:38:58.190 +00:00] [INFO] [server.go:1737] ["campaign PD leader ok"] [campaign-leader-name=pd3]
[2024/07/17 03:38:58.650 +00:00] [ERROR] [server.go:1717] ["campaign PD leader meets error due to etcd error"] [campaign-leader-name=pd2] [error="[PD:server:ErrLeaderFrequentlyChange]leader pd2 frequently changed, leader-key is [/pd/7392444153452143012/leader]"]
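
To show how close together the leader changes in this log are, the timestamps of the two "campaign PD leader ok" lines for pd3 can be compared directly. The sketch below uses only the two log lines quoted above; the helper names `parseLogTime` and `gapMillis` are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// logTime matches the leading "[2024/07/17 03:38:56.625 +00:00]" timestamp.
var logTime = regexp.MustCompile(`^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{2}:\d{2})\]`)

// parseLogTime extracts the timestamp from one log line.
func parseLogTime(line string) (time.Time, error) {
	m := logTime.FindStringSubmatch(line)
	if m == nil {
		return time.Time{}, fmt.Errorf("no timestamp in %q", line)
	}
	return time.Parse("2006/01/02 15:04:05.000 -07:00", m[1])
}

// gapMillis returns the time between two log lines in milliseconds.
func gapMillis(first, second string) (int64, error) {
	a, err := parseLogTime(first)
	if err != nil {
		return 0, err
	}
	b, err := parseLogTime(second)
	if err != nil {
		return 0, err
	}
	return b.Sub(a).Milliseconds(), nil
}

// The two successful pd3 campaigns, copied from the log above.
const (
	firstCampaign  = `[2024/07/17 03:38:56.625 +00:00] [INFO] [server.go:1737] ["campaign PD leader ok"] [campaign-leader-name=pd3]`
	secondCampaign = `[2024/07/17 03:38:58.190 +00:00] [INFO] [server.go:1737] ["campaign PD leader ok"] [campaign-leader-name=pd3]`
)

func main() {
	ms, err := gapMillis(firstCampaign, secondCampaign)
	if err != nil {
		panic(err)
	}
	fmt.Printf("pd3 won two campaigns %d ms apart\n", ms)
}
```

pd3 wins leadership twice within about 1.6 seconds, with a resignation from pd3 and one from pd2 in between, which matches the "frequently changed" error reported for pd2.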

ti-chi-bot bot added a commit that referenced this issue Jul 19, 2024
close #8319

Signed-off-by: husharp <ihusharp@gmail.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
3 participants