Core, Spark3.5: Fix tests failure due to timeout #11654

Merged (1 commit) on Dec 17, 2024


manuzhang
Collaborator

@manuzhang manuzhang commented Nov 26, 2024

This PR attempts to fix the following test failures caused by timeouts.

#11047
#11066
#11651

@manuzhang manuzhang force-pushed the fix-tests-timeout branch 2 times, most recently from a43d09b to 6f8fb28 Compare November 27, 2024 04:02
@manuzhang manuzhang closed this Nov 27, 2024
@manuzhang manuzhang reopened this Nov 27, 2024
@manuzhang manuzhang force-pushed the fix-tests-timeout branch 2 times, most recently from daf014a to 45e1cd6 Compare December 10, 2024 10:10
@manuzhang
Collaborator Author

@nastra Please take another look, thanks!

@manuzhang manuzhang closed this Dec 12, 2024
@manuzhang manuzhang reopened this Dec 12, 2024
@manuzhang manuzhang requested a review from nastra December 17, 2024 04:51
@manuzhang manuzhang force-pushed the fix-tests-timeout branch 2 times, most recently from 7961140 to 55f635a Compare December 17, 2024 13:51
@nastra
Contributor

nastra commented Dec 17, 2024

@RussellSpitzer or @amogh-jahagirdar could either of you also take a look at this, please?

@@ -1609,8 +1611,14 @@ public synchronized void testMergeWithSnapshotIsolation()
     createOrReplaceView("source", Collections.singletonList(1), Encoders.INT());

     sql(
-        "ALTER TABLE %s SET TBLPROPERTIES('%s' '%s')",
-        tableName, MERGE_ISOLATION_LEVEL, "snapshot");
+        "ALTER TABLE %s SET TBLPROPERTIES('%s', '%s', '%s', '%s', '%s', '%s')",
Contributor


I think you've got too many commas here. I believe this should be ('%s' '%s', '%s' '%s', '%s' '%s'), and the same applies to the other Spark test.
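
To make the suggested format concrete, here is a minimal sketch that builds such a statement from key/value pairs: each pair is rendered as 'key' 'value', and commas appear only between pairs. The class name, the table name `db.tbl`, and the `10` ms value are hypothetical. The property keys are what I understand the Iceberg constants `MERGE_ISOLATION_LEVEL`, `COMMIT_MIN_RETRY_WAIT_MS` and `COMMIT_MAX_RETRY_WAIT_MS` resolve to, which is an assumption to verify against `TableProperties`. It does not reproduce the actual test code in this PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class TblPropertiesSql {
  // Renders each entry as 'key' 'value' and joins the pairs with ", ".
  static String alterTableProperties(String tableName, Map<String, String> props) {
    String pairs =
        props.entrySet().stream()
            .map(e -> String.format("'%s' '%s'", e.getKey(), e.getValue()))
            .collect(Collectors.joining(", "));
    return String.format("ALTER TABLE %s SET TBLPROPERTIES(%s)", tableName, pairs);
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, so the output is deterministic.
    Map<String, String> props = new LinkedHashMap<>();
    props.put("write.merge.isolation-level", "snapshot"); // assumed key string
    props.put("commit.retry.min-wait-ms", "10"); // hypothetical value
    props.put("commit.retry.max-wait-ms", "1000"); // value mentioned later in this thread
    System.out.println(alterTableProperties("db.tbl", props));
  }
}
```

Building the pair list programmatically avoids hand-counting `%s` placeholders and separators, which is where the extra commas crept in.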

Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment


So if I understand correctly, to address the flakiness of these tests that are timing out, we're lowering several commit properties (like the max retry and the min retry wait) so that we fail fast? Typically commits should not exceed these values, and any test that does would simply end up failing fast.

Do we have any indication of what the root cause could be for the commits in these tests taking so long?

To be clear, I'm open to giving the fix in this PR a try, but I think we should be a bit cautious that we're not introducing flakiness the other way, where another set of tests ends up failing more frequently because it doesn't wait long enough, for example.

@nastra
Contributor

nastra commented Dec 17, 2024

@amogh-jahagirdar what I saw in some local debugging is that these specific tests typically took quite a long time due to exponential retries and eventually timed out. Hence this PR lowers the min/max wait times for these retries in these 3 particular tests. Other tests shouldn't be affected by this.
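
As a rough illustration of why lowering the min/max wait shortens these tests, the sketch below adds up the waits of a simplified backoff model: the wait starts at the minimum, doubles on every retry, and is capped at the maximum. The doubling factor, the absence of jitter, the retry count of 10, and the concrete millisecond values are all assumptions chosen for illustration. This is not Iceberg's actual retry implementation.

```java
class BackoffEstimate {
  // Wait before the given 1-based retry: minWaitMs doubled (retry - 1) times, capped at maxWaitMs.
  static long waitBeforeRetry(int retry, long minWaitMs, long maxWaitMs) {
    double uncapped = minWaitMs * Math.pow(2.0, retry - 1);
    return (long) Math.min(uncapped, (double) maxWaitMs);
  }

  // Sum of the waits across all retries, ignoring the time spent in the commit attempts themselves.
  static long totalWaitMs(int numRetries, long minWaitMs, long maxWaitMs) {
    long total = 0;
    for (int retry = 1; retry <= numRetries; retry++) {
      total += waitBeforeRetry(retry, minWaitMs, maxWaitMs);
    }
    return total;
  }

  public static void main(String[] args) {
    // Hypothetical settings: 10 retries with a 100 ms minimum and a 60 s cap.
    System.out.println(totalWaitMs(10, 100, 60_000)); // 102300
    // Same retry count with a 10 ms minimum and a 1 s cap.
    System.out.println(totalWaitMs(10, 10, 1_000)); // 4270
  }
}
```

Under this model, the time spent sleeping between retries drops from roughly 102 seconds to roughly 4 seconds per commit that exhausts its retries.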

@amogh-jahagirdar
Contributor

Sounds good.

> these specific tests typically took quite a long time due to exponential retries and eventually timed out

Yeah, the part I was trying to unpack is why these particular tests retry so much. Like I said, though, that may be difficult to reason about, and we may just want to tune these properties in the interim, so all good!

@nastra nastra merged commit ed06c9c into apache:main Dec 17, 2024
49 checks passed
@manuzhang
Collaborator Author

Latest test failure in https://github.com/apache/iceberg/actions/runs/12387757253/job/34578581231

        Caused by:
        org.apache.iceberg.exceptions.CommitFailedException: Cannot commit: Base metadata location 'file:/tmp/hive12623576563843060988/table/metadata/00016-82aeb0b7-eb25-441d-9ef0-fbef9bb5b390.metadata.json' is not same as the current table metadata location 'file:/tmp/hive12623576563843060988/table/metadata/00017-77c2b7b4-59e6-4dbf-b69e-4d453f8ec3cd.metadata.json' for default.table
            at app//org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:219)
            at app//org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:125)
            at app//org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:433)
            at app//org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
            at app//org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
            at app//org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
            at app//org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
            at app//org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:405)
            at app//org.apache.iceberg.BaseOverwriteFiles.commit(BaseOverwriteFiles.java:30)
            at app//org.apache.iceberg.spark.source.SparkWrite.commitOperation(SparkWrite.java:233)
            at app//org.apache.iceberg.spark.source.SparkWrite$CopyOnWriteOperation.commitWithSnapshotIsolation(SparkWrite.java:480)

@manuzhang
Collaborator Author

@nastra COMMIT_MAX_RETRY_WAIT_MS='1000' is not useful when COMMIT_NUM_RETRIES=4 by default.
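
One way to read this remark, sketched under assumptions: if the wait starts at a minimum of 100 ms (which I believe is the default minimum retry wait, but treat it as an assumption) and doubles on each retry, the snippet below finds the first retry whose wait would exceed the cap. The class and method names are hypothetical, and the doubling model ignores any jitter.

```java
class RetryCapCheck {
  // Returns the 1-based retry whose uncapped wait first exceeds maxWaitMs,
  // assuming the wait starts at minWaitMs and doubles on every retry.
  static int firstCappedRetry(long minWaitMs, long maxWaitMs) {
    if (minWaitMs <= 0) {
      throw new IllegalArgumentException("minWaitMs must be positive");
    }
    int retry = 1;
    long wait = minWaitMs;
    while (wait <= maxWaitMs) {
      wait *= 2;
      retry++;
    }
    return retry;
  }

  public static void main(String[] args) {
    int numRetries = 4; // default retry count quoted in the comment above
    int firstCapped = firstCappedRetry(100, 1_000); // waits: 100, 200, 400, 800, 1600
    System.out.println("first capped retry: " + firstCapped);
    System.out.println("cap ever applied: " + (firstCapped <= numRetries));
  }
}
```

With these numbers the cap would first apply on the fifth retry, so with only four retries a 1000 ms maximum never changes any wait.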
