CI Test Failure: Racing Replacements For Soft-Deleted Disks
This test failed on a CI run on #8793:
https://github.com/oxidecomputer/omicron/pull/8793/checks?check_run_id=47633713534
Log showing the specific test failure:
https://buildomat.eng.oxide.computer/wg/0/details/01K2370FZHV5NHH94QQGDB1D0N/1Lv7O49z4b0pRR7N7spQ8PaTfcZTnb7XJAqOppFf4x6DomXP/01K23719HMEAPD9BPAF8QXKFRH
Excerpt from the log showing the failure:
7461 2025-08-07T22:46:37.091Z FAIL [ 75.113s] omicron-nexus::test_all integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume
7462 2025-08-07T22:46:37.091Z stdout ───
7463 2025-08-07T22:46:37.091Z
7464 2025-08-07T22:46:37.091Z running 1 test
7465 2025-08-07T22:46:37.091Z test integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume has been running for over 60 seconds
7466 2025-08-07T22:46:37.091Z test integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume ... FAILED
7467 2025-08-07T22:46:37.091Z
7468 2025-08-07T22:46:37.092Z failures:
7469 2025-08-07T22:46:37.092Z
7470 2025-08-07T22:46:37.092Z failures:
7471 2025-08-07T22:46:37.092Z integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume
7472 2025-08-07T22:46:37.092Z
7473 2025-08-07T22:46:37.092Z test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 452 filtered out; finished in 74.71s
7474 2025-08-07T22:46:37.092Z
7475 2025-08-07T22:46:37.092Z stderr ───
7476 2025-08-07T22:46:37.092Z log file: /var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.0.log
7477 2025-08-07T22:46:37.092Z note: configured to log to "/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.0.log"
7478 2025-08-07T22:46:37.092Z DB URL: postgresql://root@[::1]:62634/omicron?sslmode=disable
7479 2025-08-07T22:46:37.092Z DB address: [::1]:62634
7480 2025-08-07T22:46:37.092Z log file: /var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.2.log
7481 2025-08-07T22:46:37.092Z note: configured to log to "/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.2.log"
7482 2025-08-07T22:46:37.092Z log file: /var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.3.log
7483 2025-08-07T22:46:37.092Z note: configured to log to "/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.3.log"
7484 2025-08-07T22:46:37.092Z old_region_id: Some(Region identity, dataset_id: 149d20c7-9f4c-4024-be94-7855c1d3e364 (dataset), volume_id: 20f3c7c2-86c9-449f-b342-7bd173f625e5 (volume), block_size: ByteCount(ByteCount(512)), blocks_per_extent: 131072, extent_count: 16, port: Some(SqlU16(4000)), read_only: false, deleting: false, reservation_percent: TwentyFive })
7485 2025-08-07T22:46:37.092Z new region id: None
7486 2025-08-07T22:46:37.092Z
7487 2025-08-07T22:46:37.092Z thread 'integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume' panicked at nexus/tests/integration_tests/crucible_replacements.rs:1074:5:
7488 2025-08-07T22:46:37.092Z assertion failed: match last_background_task.last
7489 2025-08-07T22"); false }
7493 2025-08-07T22:46:37.092Z Ok(v) => !v.drive_invoked_ok.is_empty(),
7494 2025-08-07T22:46:37.092Z }
7495 2025-08-07T22:46:37.092Z }
7496 2025-08-07T22:46:37.092Z _ => false }
7497 2025-08-07T22
7498 2025-08-07T22:46:37.092Z stack backtrace:
7499 2025-08-07T22:46:37.092Z 0: __rustc::rust_begin_unwind
7500 2025-08-07T22:46:37.092Z at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:697:5
7501 2025-08-07T22:46:37.092Z 1: core::panicking::panic_fmt
7502 2025-08-07T22:46:37.092Z at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/panicking.rs:75:14
7503 2025-08-07T22:46:37.092Z 2: core::panicking::panic
7504 2025-08-07T22:46:37.093Z at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/panicking.rs:145:5
7505 2025-08-07T22:46:37.093Z 3: async_fn#0}
7506 2025-08-07T22
7508 2025-08-07T22:46:37.093Z at ./tests/integration_tests/crucible_replacements.rs:730:1
7509 2025-08-07T22:46:37.093Z 5: poll<&mut dyn core::future::future::Future<Output=()>>
7510 2025-08-07T22:46:37.093Z at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/future/future.rs:124:9
7511 2025-08-07T22:46:37.093Z 6: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
7512 2025-08-07T22:46:37.093Z at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/future/future.rs:124:9
7513 2025-08-07T22:46:37.093Z 7: closure#0}<core::block_on::closure#0}::closure_env#0}<core::block_on::closure#0}::closure_env#0}<core<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
7520 2025-08-07T22:46:37.094Z at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:742:25
7521 2025-08-07T22:46:37.094Z 11: tokio::runtime::scheduler::current_thread::Context::enter
7522 2025-08-07T22:46:37.094Z at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:432:19
7523 2025-08-07T22:46:37.094Z 12: closure#0}<core}
7526 2025-08-07T22:46:37.094Z at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:829:68
7527 2025-08-07T22:46:37.094Z 14: tokio::runtime::context::scoped::Scoped
# Diving Deep into the CI Failure
Let's break down this **test failure** in CI. The integration test `test_racing_replacements_for_soft_deleted_disk_volume`, one of the crucible replacement tests, failed during a CI run on pull request #8793 after roughly 75 seconds. The test panicked on a failed assertion. Let's walk through what the log shows and how to investigate it.
## Understanding the Test Failure: `test_racing_replacements_for_soft_deleted_disk_volume`
The core issue lies within the `test_racing_replacements_for_soft_deleted_disk_volume` test. This test, part of the `omicron-nexus` suite, is designed to verify how the system behaves under race conditions: it checks that concurrent operations are handled correctly when replacements race against a disk volume that has been soft-deleted. The log excerpt points to an assertion failure in the test code at `nexus/tests/integration_tests/crucible_replacements.rs:1074:5`, which is the exact location where the test observed an unexpected state.
Let's take a closer look at the relevant code snippet from the log:
7487 2025-08-07T22:46:37.092Z thread 'integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume' panicked at nexus/tests/integration_tests/crucible_replacements.rs:1074:5:
7488 2025-08-07T22:46:37.092Z assertion failed: match last_background_task.last
7489 2025-08-07T22"); false }
7493 2025-08-07T22:46:37.092Z Ok(v) => !v.drive_invoked_ok.is_empty(),
7494 2025-08-07T22:46:37.092Z }
7495 2025-08-07T22:46:37.092Z }
7496 2025-08-07T22:46:37.092Z _ => false }
7497 2025-08-07T22
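The assertion checks the last result of a background task, specifically whether the `drive_invoked_ok` field of the `RegionReplacementDriverStatus` is non-empty. In other words, the test expects at least one successful drive invocation as part of the replacement process. The excerpt is truncated, but the surviving fragments of the `match` show three ways the expression can evaluate to `false`: the catch-all `_ => false` arm (the last result is not the variant the test expects), an error arm that prints a message and returns `false`, and an `Ok(v)` arm where `drive_invoked_ok` is empty. The panic message alone does not say which of these fired.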
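The shape of that assertion can be sketched with stand-in types. This is a simplified illustration, not the test's actual code: `DriverStatus`, `LastResult`, and `assertion_holds` are hypothetical names, and a plain `Result` stands in for the deserialization step that the truncated log does not show in full.

```rust
// Simplified stand-ins for the types involved in the assertion. The real
// definitions live in Omicron; these shapes are assumptions for illustration.
#[derive(Debug, Default)]
struct DriverStatus {
    drive_invoked_ok: Vec<String>,
}

#[derive(Debug)]
enum LastResult {
    NeverCompleted,
    // A Result stands in for "parsing the task details succeeded or failed".
    Completed(Result<DriverStatus, String>),
}

/// Mirrors the shape of the failing assertion: true only when the task
/// completed, its details parsed, and at least one drive was invoked.
fn assertion_holds(last: &LastResult) -> bool {
    match last {
        LastResult::Completed(details) => match details {
            Err(e) => {
                eprintln!("{e}");
                false
            }
            Ok(v) => !v.drive_invoked_ok.is_empty(),
        },
        _ => false,
    }
}

fn main() {
    let cases = [
        ("never completed", LastResult::NeverCompleted),
        ("bad details", LastResult::Completed(Err("parse error".to_string()))),
        ("no drives invoked", LastResult::Completed(Ok(DriverStatus::default()))),
        (
            "drive invoked",
            LastResult::Completed(Ok(DriverStatus {
                drive_invoked_ok: vec!["request-1".to_string()],
            })),
        ),
    ];
    for (name, case) in &cases {
        println!("{name}: {}", assertion_holds(case));
    }
}
```

Only the last case satisfies the assertion, which is why it helps to establish from the detailed logs which of the other three states the test actually observed.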
## Analyzing the Logs and Contextual Information
To further diagnose the issue, let's consider the additional information in the logs. The stderr output records the database URL and address used during the test, along with the paths of the log files the test wrote on the CI machine:
* `/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.0.log`
* `/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.2.log`
* `/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.3.log`

These files are the best source for the sequence of events leading up to the failure. They can reveal detailed error messages, warnings, and the state of the system at various points during the test run.
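When going through log files like these, a small filter can narrow the output to the lines of interest. The sketch below assumes the logs are line-oriented text; the helper name `lines_mentioning` and the search term `region_replacement` are hypothetical choices for illustration, not identifiers confirmed by the log excerpt.

```rust
use std::{env, fs};

/// Returns the lines of `log` that contain every one of `needles`.
fn lines_mentioning<'a>(log: &'a str, needles: &[&str]) -> Vec<&'a str> {
    log.lines()
        .filter(|line| needles.iter().all(|n| line.contains(*n)))
        .collect()
}

fn main() {
    // The log path is taken from the command line; nothing is read if absent.
    if let Some(path) = env::args().nth(1) {
        match fs::read_to_string(&path) {
            Ok(log) => {
                // "region_replacement" is an assumed search term.
                for line in lines_mentioning(&log, &["region_replacement"]) {
                    println!("{line}");
                }
            }
            Err(e) => eprintln!("could not read {path}: {e}"),
        }
    }
}
```

Passing several needles (for example a component name plus `error`) keeps only the lines that contain all of them.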
The log excerpt also shows the `old_region_id` and `new region id` values printed by the test. The `old_region_id` is `Some(Region { ... })`, which means a region existed before the replacement process. The `new region id`, however, is `None`, which could indicate that the replacement had not allocated a new region by the time the test checked, or that the new region was not recorded. Either way, the mismatch between the two values is a critical clue.
## Potential Causes and Troubleshooting Steps
Based on the information available, here are some potential causes for the test failure and the corresponding troubleshooting steps:
1. **Race Condition:** The test name itself, `test_racing_replacements_for_soft_deleted_disk_volume`, suggests a focus on race conditions. It's possible that concurrent operations are interfering with the disk volume replacement process. This could be due to timing issues or synchronization problems in the code. To investigate this, we can:
* Review the code related to disk volume replacement and identify potential race conditions.
* Add more logging to the test to track the sequence of operations and identify where the race condition might be occurring.
* Consider using synchronization primitives (e.g., mutexes, semaphores) to protect critical sections of code.
2. **Database Issues:** The test involves database operations, and failures could be related to database connectivity, data corruption, or concurrency issues within the database. To address this, we can:
* Verify the database connection settings and ensure the database is running correctly.
* Examine the database logs for any errors or warnings.
* Check for any database deadlocks or lock contention issues.
3. **Soft Deletion Logic:** The test specifically deals with soft-deleted disk volumes. There might be issues in the logic that handles soft deletion and subsequent replacement. We can:
* Review the code responsible for soft deletion and ensure it's functioning as expected.
* Check if the soft deletion process is correctly marking the disk volume as deleted.
* Verify that the replacement process is correctly identifying and handling soft-deleted volumes.
4. **Region Replacement Driver:** The assertion failure involves the `RegionReplacementDriverStatus`, indicating a potential problem with the driver responsible for region replacement. We should:
* Inspect the `RegionReplacementDriver` code for any bugs or unexpected behavior.
* Ensure that the driver is correctly invoked during the replacement process.
* Verify that the driver is reporting the correct status after the replacement attempt.
5. **Serialization/Deserialization:** The test appears to use `serde_json` to deserialize the `RegionReplacementDriverStatus` from the background task's details. If that data is not in the expected format, deserialization fails and the assertion evaluates to `false` through its error arm. We need to:
* Verify that the `RegionReplacementDriverStatus` struct is correctly defined.
* Check if the JSON data being serialized and deserialized matches the structure of the struct.
* Look for any potential issues with the serialization or deserialization process.
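For the race-condition hypothesis in particular, one common way to make a timing-sensitive check more robust is to poll for the expected state until a deadline instead of checking it once. The sketch below shows that general technique with the standard library only. `wait_until` is a hypothetical helper, and whether the real test already polls in a similar way is not something the log excerpt tells us.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `check` every `interval` until it returns true or `timeout` elapses.
/// Returns whether the condition was observed in time.
fn wait_until(
    mut check: impl FnMut() -> bool,
    interval: Duration,
    timeout: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}

fn main() {
    // Stand-in for "the background task reports a drive invocation":
    // here the condition simply becomes true on the third poll.
    let mut polls = 0;
    let ok = wait_until(
        || {
            polls += 1;
            polls >= 3
        },
        Duration::from_millis(10),
        Duration::from_secs(1),
    );
    println!("condition observed: {ok} after {polls} polls");
}
```

Polling narrows timing windows but does not fix a genuine ordering bug, so it is worth confirming from the logs that the expected state is eventually reached before changing the test this way.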
## Reproducing the Failure Locally
One of the most effective ways to troubleshoot a test failure is to reproduce it locally. This allows us to run the test in a controlled environment, set breakpoints, and examine the state of the system at various points. To reproduce the failure, we should:
* Ensure we have the same environment as the CI environment (e.g., operating system, dependencies, database configuration).
* Run the test using the same command-line arguments and settings as in CI.
* Attach a debugger to the test process and step through the code to identify the exact point of failure.
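Because the failure may be intermittent, a single local run might pass. A rough sketch of a repeat-runner is shown below. It assumes the test is run with `cargo nextest run` and that `omicron-nexus` is the package name, as the `omicron-nexus::test_all` prefix in the CI log suggests. The exact flags and profile CI uses may differ, and `nextest_command` and `count_failures` are hypothetical helper names.

```rust
use std::process::Command;

/// Builds a `cargo nextest run` invocation for a single test filter.
/// The package and filter here are assumptions based on the CI log.
fn nextest_command(package: &str, filter: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(["nextest", "run", "-p", package, filter]);
    cmd
}

/// Calls `run_once` `runs` times and returns how many runs failed.
fn count_failures(runs: usize, mut run_once: impl FnMut() -> bool) -> usize {
    (0..runs).filter(|_| !run_once()).count()
}

fn main() {
    let package = "omicron-nexus";
    let filter = "test_racing_replacements_for_soft_deleted_disk_volume";
    // The number of repetitions comes from the first argument; with no
    // argument the command is only printed, not executed.
    let runs: usize = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(0);
    println!("{:?}", nextest_command(package, filter));
    let failures = count_failures(runs, || {
        nextest_command(package, filter)
            .status()
            .map(|s| s.success())
            .unwrap_or(false)
    });
    println!("{failures} of {runs} runs failed");
}
```

Running it with an argument such as `20` repeats the test that many times and reports how often it failed, which gives a rough sense of how flaky the test is locally.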
## Next Steps
To effectively resolve this **test failure**, the following steps should be taken:
1. **Examine the log files:** The detailed logs generated by the test are crucial for understanding the sequence of events and identifying the root cause.
2. **Reproduce the failure locally:** This allows for debugging and a deeper understanding of the issue.
3. **Review the code:** Pay close attention to the areas related to disk volume replacement, soft deletion, and the `RegionReplacementDriver`.
4. **Implement targeted logging:** Add more logging statements to the test and the relevant code to track the state of the system and identify the point of failure.
5. **Consider potential race conditions:** Use synchronization primitives if necessary to protect critical sections of code.
6. **Address any database issues:** Verify database connectivity, examine logs, and check for deadlocks or lock contention.
By systematically investigating these areas, we can pinpoint the root cause of the **test failure** and implement the necessary fixes. Let's get to work and make sure this test passes consistently!