@shimizukko
Contributor

@shimizukko shimizukko commented Jan 26, 2026

test_start_back_to_back interferes with test_two_pools_healthy. The repaired fault in test_start_back_to_back (orphan container) appears in the check query output during test_two_pools_healthy. The test expects the query output to be clean at the beginning of the test, so reset the checker state by stopping (dmg check stop) and restarting with --reset (dmg check start --reset) after confirming the repair result.
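
The reset sequence described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the two `dmg` command lines come from the description, but `RESET_SEQUENCE`, `reset_checker_state`, and the injectable `run` parameter are hypothetical names, and the actual test most likely goes through the test framework's own dmg wrapper rather than `subprocess`.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# Command lines taken from the PR description; the names below are hypothetical.
RESET_SEQUENCE: List[List[str]] = [
    ["dmg", "check", "stop"],
    ["dmg", "check", "start", "--reset"],
]


def reset_checker_state(
    run: Optional[Callable[[Sequence[str]], object]] = None,
) -> List[List[str]]:
    """Issue the stop / start --reset sequence and return the commands run."""
    if run is None:
        # Default runner: raises CalledProcessError on a non-zero exit code.
        def run(cmd: Sequence[str]) -> object:
            return subprocess.run(cmd, check=True)

    issued = []
    for cmd in RESET_SEQUENCE:
        run(cmd)
        issued.append(list(cmd))
    return issued
```

Passing a fake `run` callable (for example `calls.append`) lets the sequence be checked without a DAOS system; calling `reset_checker_state()` with no arguments would invoke the real `dmg` binary.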

Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-func-hw-test-medium: false
Test-tag: DMGCheckStartCornerCaseTest

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

…rvers_once: False

test_start_back_to_back interferes with test_two_pools_healthy.
The fault injected during test_start_back_to_back (orphan
container) appears in the check query output during
test_two_pools_healthy. It's related to stale checker state
that isn't cleared between the tests. I tried to reproduce it,
but haven't been successful. In the meantime, the workaround is
to restart the servers between the tests.

The test works if executed individually, so run all the tests
in check_start_corner_case.py.

Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-func-hw-test-medium: false
Test-tag: DMGCheckStartCornerCaseTest
Signed-off-by: Makito Kano <[email protected]>
@github-actions

github-actions bot commented Jan 26, 2026

Ticket title is 'recovery/check_start_corner_case.py:DMGCheckStartCornerCaseTest.test_two_pools_healthy - Checker didn't detect inconsistent container label'
Status is 'In Review'
Labels: 'ci_master_weekly,weekly_test'
https://daosio.atlassian.net/browse/DAOS-18481

@shimizukko shimizukko marked this pull request as ready for review January 26, 2026 20:23
@shimizukko shimizukko requested review from a team as code owners January 26, 2026 20:23
Contributor

@daltonbohning daltonbohning left a comment

Why do the tests only work if executed serially? Are tests leaving the system in a bad state?

@shimizukko shimizukko changed the title from "DAOS-18481 test: recovery/check_start_corner_case.yaml - Add start_se…" to "DAOS-18481 test: recovery/check_start_corner_case.yaml - Reset checker state" Jan 28, 2026
@shimizukko
Contributor Author

Why do the tests only work if executed serially? Are tests leaving the system in a bad state?

I learned that the checker state doesn't get reset by simply restarting the servers (dmg system stop; dmg system start), so the query output still contains the old result, and that interferes with the subsequent test. We could reformat the system to clear it, but a more efficient way is to reset the checker by restarting it with --reset after confirming the repair result.
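
To make the interference concrete, here is a toy model of the behavior described in this comment. It is not DAOS code: `ToyChecker` and all of its methods are hypothetical, and the model only encodes the two claims made above, namely that a system restart leaves checker results in place and that starting the checker with a reset clears them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyChecker:
    """Toy stand-in for the checker state; purely illustrative."""

    reports: List[str] = field(default_factory=list)

    def record_repair(self, fault: str) -> None:
        # A repaired fault stays in the report list until the state is reset.
        self.reports.append(fault)

    def system_restart(self) -> None:
        # Models the observation above: restarting the servers alone
        # does not clear checker results.
        pass

    def check_start(self, reset: bool = False) -> None:
        # Models starting the checker with a reset: old results are dropped.
        if reset:
            self.reports.clear()

    def query(self) -> List[str]:
        return list(self.reports)


checker = ToyChecker()
checker.record_repair("orphan container")  # left by test_start_back_to_back
checker.system_restart()
stale = checker.query()   # the old repair is still reported
checker.check_start(reset=True)
clean = checker.query()   # nothing left for the next test to trip over
```

In this model, `stale` is what test_two_pools_healthy would see after a plain restart, while `clean` is the state it expects at the start of the test.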

@daltonbohning
Contributor

Thanks. I think this is a better approach than restarting the system.

@daltonbohning daltonbohning added the forced-landing label ("The PR has known failures or has intentionally reduced testing, but should still be landed.") Jan 28, 2026
@daltonbohning daltonbohning requested a review from a team January 28, 2026 15:48
@daltonbohning daltonbohning merged commit 0374aaa into master Jan 28, 2026
36 checks passed
@daltonbohning daltonbohning deleted the makito/DAOS-18481 branch January 28, 2026 15:49

4 participants