CI Test Failure: Racing Replacements For Soft-Deleted Disks

by Luna Greco

This test failed on a CI run on #8793:

https://github.com/oxidecomputer/omicron/pull/8793/checks?check_run_id=47633713534

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01K2370FZHV5NHH94QQGDB1D0N/1Lv7O49z4b0pRR7N7spQ8PaTfcZTnb7XJAqOppFf4x6DomXP/01K23719HMEAPD9BPAF8QXKFRH

Excerpt from the log showing the failure:

    2025-08-07T22:46:37.091Z FAIL [ 75.113s] omicron-nexus::test_all integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume
    2025-08-07T22:46:37.091Z stdout ───
    2025-08-07T22:46:37.091Z
    2025-08-07T22:46:37.091Z running 1 test
    2025-08-07T22:46:37.091Z test integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume has been running for over 60 seconds
    2025-08-07T22:46:37.091Z test integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume ... FAILED
    2025-08-07T22:46:37.091Z
    2025-08-07T22:46:37.092Z failures:
    2025-08-07T22:46:37.092Z
    2025-08-07T22:46:37.092Z failures:
    2025-08-07T22:46:37.092Z     integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume
    2025-08-07T22:46:37.092Z
    2025-08-07T22:46:37.092Z test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 452 filtered out; finished in 74.71s
    2025-08-07T22:46:37.092Z
    2025-08-07T22:46:37.092Z stderr ───
    2025-08-07T22:46:37.092Z log file: /var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.0.log
    2025-08-07T22:46:37.092Z note: configured to log to "/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.0.log"
    2025-08-07T22:46:37.092Z DB URL: postgresql://root@[::1]:62634/omicron?sslmode=disable
    2025-08-07T22:46:37.092Z DB address: [::1]:62634
    2025-08-07T22:46:37.092Z log file: /var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.2.log
    2025-08-07T22:46:37.092Z note: configured to log to "/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.2.log"
    2025-08-07T22:46:37.092Z log file: /var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.3.log
    2025-08-07T22:46:37.092Z note: configured to log to "/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.3.log"
    2025-08-07T22:46:37.092Z old_region_id: Some(Region { identity: RegionIdentity { id: 77b77b81-81ae-40a1-bf0e-57db31e8143a, time_created: 2025-08-07T22:46:11.598723Z, time_modified: 2025-08-07T22:46:11.598723Z }, dataset_id: 149d20c7-9f4c-4024-be94-7855c1d3e364 (dataset), volume_id: 20f3c7c2-86c9-449f-b342-7bd173f625e5 (volume), block_size: ByteCount(ByteCount(512)), blocks_per_extent: 131072, extent_count: 16, port: Some(SqlU16(4000)), read_only: false, deleting: false, reservation_percent: TwentyFive })
    2025-08-07T22:46:37.092Z new region id: None
    2025-08-07T22:46:37.092Z
    2025-08-07T22:46:37.092Z thread 'integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume' panicked at nexus/tests/integration_tests/crucible_replacements.rs:1074:5:
    2025-08-07T22:46:37.092Z assertion failed: match last_background_task.last {
    2025-08-07T22:46:37.092Z     LastResult::Completed(last_result_completed) => {
    2025-08-07T22:46:37.092Z         match serde_json::from_value::<RegionReplacementDriverStatus>(last_result_completed.details)
    2025-08-07T22:46:37.092Z         {
    2025-08-07T22:46:37.092Z             Err(e) => { eprintln!("{e}"); false }
    2025-08-07T22:46:37.092Z             Ok(v) => !v.drive_invoked_ok.is_empty(),
    2025-08-07T22:46:37.092Z         }
    2025-08-07T22:46:37.092Z     }
    2025-08-07T22:46:37.092Z     _ => false }
    2025-08-07T22:46:37.092Z
    2025-08-07T22:46:37.092Z stack backtrace:
    2025-08-07T22:46:37.092Z    0: __rustc::rust_begin_unwind
    2025-08-07T22:46:37.092Z         at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/panicking.rs:697:5
    2025-08-07T22:46:37.092Z    1: core::panicking::panic_fmt
    2025-08-07T22:46:37.092Z         at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/panicking.rs:75:14
    2025-08-07T22:46:37.092Z    2: core::panicking::panic
    2025-08-07T22:46:37.093Z         at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/panicking.rs:145:5
    2025-08-07T22:46:37.093Z    3: async_fn#0}
    2025-08-07T22:46:37.093Z         at ./tests/integration_tests/crucible_replacements.rs:1074:5
    2025-08-07T22:46:37.093Z    4: {async_block#0
    2025-08-07T22:46:37.093Z         at ./tests/integration_tests/crucible_replacements.rs:730:1
    2025-08-07T22:46:37.093Z    5: poll<&mut dyn core::future::future::Future<Output=()>>
    2025-08-07T22:46:37.093Z         at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/future/future.rs:124:9
    2025-08-07T22:46:37.093Z    6: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
    2025-08-07T22:46:37.093Z         at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/future/future.rs:124:9
    2025-08-07T22:46:37.093Z    7: closure#0}<core:pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
    2025-08-07T22:46:37.093Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:742:54
    2025-08-07T22:46:37.093Z    8: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8::block_on::closure#0}:{closure#0::closure_env#0}<core:pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
    2025-08-07T22:46:37.094Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/task/coop/mod.rs:167:5
    2025-08-07T22:46:37.094Z    9: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8::block_on::closure#0}:{closure#0::closure_env#0}<core:pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
    2025-08-07T22:46:37.094Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/task/coop/mod.rs:133:5
    2025-08-07T22:46:37.094Z   10: {closure#0<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
    2025-08-07T22:46:37.094Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:742:25
    2025-08-07T22:46:37.094Z   11: tokio::runtime::scheduler::current_thread::Context::enter
    2025-08-07T22:46:37.094Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:432:19
    2025-08-07T22:46:37.094Z   12: closure#0}<core:pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
    2025-08-07T22:46:37.094Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:741:36
    2025-08-07T22:46:37.094Z   13: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}
    2025-08-07T22:46:37.094Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:829:68
    2025-08-07T22:46:37.094Z   14: tokio::runtime::context::scoped::Scoped::set
    2025-08-07T22:46:37.094Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context/scoped.rs:40:9
    2025-08-07T22:46:37.094Z   15: tokio::runtime::context::set_scheduler::{closure}}
    2025-08-07T22:46:37.094Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context.rs:176:26
    2025-08-07T22:46:37.094Z   16: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::impl#8}:enter::{closure_env#0<tokio::runtime::scheduler::current_thread::impl#8}:block_on::{closure_env#0<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)}>
    2025-08-07T22:46:37.094Z         at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/thread/local.rs:315:12
    2025-08-07T22:46:37.094Z   17: std::thread::local::LocalKey::with
    2025-08-07T22:46:37.094Z         at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/std/src/thread/local.rs:279:15
    2025-08-07T22:46:37.094Z   18: tokio::runtime::context::set_scheduler
    2025-08-07T22:46:37.094Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context.rs:176:9
    2025-08-07T22:46:37.094Z   19: tokio::runtime::scheduler::current_thread::CoreGuard::enter
    2025-08-07T22:46:37.095Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:829:27
    2025-08-07T22:46:37.095Z   20: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
    2025-08-07T22:46:37.095Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:729:19
    2025-08-07T22:46:37.095Z   21: closure#0}<core:pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
    2025-08-07T22:46:37.095Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:200:28
    2025-08-07T22:46:37.095Z   22: tokio::runtime::context::runtime::enter_runtime
    2025-08-07T22:46:37.095Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/context/runtime.rs:65:16
    2025-08-07T22:46:37.095Z   23: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
    2025-08-07T22:46:37.095Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/scheduler/current_thread/mod.rs:188:9
    2025-08-07T22:46:37.095Z   24: tokio::runtime::runtime::Runtime::block_on_inner
    2025-08-07T22:46:37.095Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/runtime.rs:356:47
    2025-08-07T22:46:37.095Z   25: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
    2025-08-07T22:46:37.095Z         at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.47.0/src/runtime/runtime.rs:330:13
    2025-08-07T22:46:37.095Z   26: test_racing_replacements_for_soft_deleted_disk_volume
    2025-08-07T22:46:37.095Z         at ./tests/integration_tests/crucible_replacements.rs:730:1
    2025-08-07T22:46:37.095Z   27: test_all::integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume::{{closure}
    2025-08-07T22:46:37.095Z         at ./tests/integration_tests/crucible_replacements.rs:733:2
    2025-08-07T22:46:37.095Z   28: core::ops::function::FnOnce::call_once
    2025-08-07T22:46:37.095Z         at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/ops/function.rs:250:5
    2025-08-07T22:46:37.095Z   29: core::ops::function::FnOnce::call_once
    2025-08-07T22:46:37.095Z         at /rustc/6b00bc3880198600130e1cf62b8f8a93494488cc/library/core/src/ops/function.rs:250:5


# Diving Deep into the CI Failure

Hey guys! Let's break down the CI **test failure** in `test_racing_replacements_for_soft_deleted_disk_volume`. The test, an integration test for Crucible replacements, panicked on a failed assertion during a CI run on pull request #8793. Let's dig into the details and see what might be going on and how to track it down.

## Understanding the Test Failure: `test_racing_replacements_for_soft_deleted_disk_volume`

The failure is in `test_racing_replacements_for_soft_deleted_disk_volume`, part of the `omicron-nexus` integration tests (the `test_all` binary). Going by its name, the test exercises racing replacements for a disk volume that has been soft-deleted, checking that the system handles concurrent operations correctly in that state. The log excerpt points to an assertion failure at `nexus/tests/integration_tests/crucible_replacements.rs:1074:5`, which gives us the exact location where the test saw an unexpected state.

Let's take a closer look at the relevant code snippet from the log:

    2025-08-07T22:46:37.092Z thread 'integration_tests::crucible_replacements::test_racing_replacements_for_soft_deleted_disk_volume' panicked at nexus/tests/integration_tests/crucible_replacements.rs:1074:5:
    2025-08-07T22:46:37.092Z assertion failed: match last_background_task.last {
    2025-08-07T22:46:37.092Z     LastResult::Completed(last_result_completed) => {
    2025-08-07T22:46:37.092Z         match serde_json::from_value::<RegionReplacementDriverStatus>(last_result_completed.details)
    2025-08-07T22:46:37.092Z         {
    2025-08-07T22:46:37.092Z             Err(e) => { eprintln!("{e}"); false }
    2025-08-07T22:46:37.092Z             Ok(v) => !v.drive_invoked_ok.is_empty(),
    2025-08-07T22:46:37.092Z         }
    2025-08-07T22:46:37.092Z     }
    2025-08-07T22:46:37.092Z     _ => false }
    2025-08-07T22:46:37.092Z


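The assertion inspects the last result of a background task. It passes only if that result is `LastResult::Completed`, its details deserialize into a `RegionReplacementDriverStatus`, and the `drive_invoked_ok` field is not empty. In other words, the test expects at least one successful drive invocation as part of the replacement process. The assertion fails if the task has not completed, if deserialization fails, or if `drive_invoked_ok` is empty (the drive step failed or was never invoked). Since the deserialization branch prints its error to stderr and the excerpt shows nothing printed between `new region id: None` and the panic, a parse error looks unlikely here.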
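
Because the assertion collapses several distinct outcomes into a single `false`, it can help to separate them while debugging. The following is a minimal sketch, not the test's actual code: `DriverStatus`, `LastResult`, `AssertionOutcome`, and `classify` are hypothetical stand-ins, the element type of `drive_invoked_ok` is assumed, and JSON parsing is modeled as a plain `Result` so the example needs only the standard library.

```rust
/// Hypothetical stand-in for the status the test deserializes.
/// Only the `drive_invoked_ok` field name comes from the log;
/// its element type is an assumption.
#[derive(Debug, Default)]
struct DriverStatus {
    drive_invoked_ok: Vec<String>,
}

/// Hypothetical stand-in for the background task's last result.
/// The real test parses JSON details; here the parse outcome is
/// modeled directly as a `Result`.
#[derive(Debug)]
enum LastResult {
    NeverCompleted,
    Completed(Result<DriverStatus, String>),
}

/// The distinct reasons the assertion could evaluate to `false`.
#[derive(Debug, PartialEq)]
enum AssertionOutcome {
    Pass,
    TaskNotCompleted,
    DetailsUnparseable(String),
    NothingDriven,
}

/// Mirror the shape of the failing assertion, but report which
/// branch was taken instead of a bare boolean.
fn classify(last: &LastResult) -> AssertionOutcome {
    match last {
        LastResult::NeverCompleted => AssertionOutcome::TaskNotCompleted,
        LastResult::Completed(Err(e)) => {
            AssertionOutcome::DetailsUnparseable(e.clone())
        }
        LastResult::Completed(Ok(status))
            if status.drive_invoked_ok.is_empty() =>
        {
            AssertionOutcome::NothingDriven
        }
        LastResult::Completed(Ok(_)) => AssertionOutcome::Pass,
    }
}
```

Printing the outcome of something like `classify` just before the assertion would show directly whether the task never completed or completed without driving anything.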

## Analyzing the Logs and Contextual Information

To further diagnose the issue, let's consider the additional information provided in the logs. The logs indicate the database URL and address used during the test, along with the locations of the log files generated by the test. These log files (`/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.0.log`, `/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.2.log`, and `/var/tmp/omicron_tmp/test_all-a7445ac8ad43db82-test_racing_replacements_for_soft_deleted_disk_volume.22394.3.log`) are invaluable for understanding the sequence of events leading up to the failure. A thorough examination of these logs can reveal detailed error messages, warnings, and the state of the system at various points during the test execution.
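
These log files can be large, so a first pass usually means narrowing them down to lines that mention the objects involved. Here is a small sketch of that filtering step using only the standard library; `grep_all` is a hypothetical helper, and the choice of search terms is up to the investigator.

```rust
/// Return every line of `log` that contains all of the `needles`,
/// paired with its 1-based line number.
fn grep_all<'a>(log: &'a str, needles: &[&str]) -> Vec<(usize, &'a str)> {
    log.lines()
        .enumerate()
        // Keep a line only if each search term occurs somewhere in it.
        .filter(|(_, line)| needles.iter().all(|n| line.contains(n)))
        // Convert the 0-based index into a 1-based line number.
        .map(|(i, line)| (i + 1, line))
        .collect()
}
```

To use it, read one of the log files with `std::fs::read_to_string` and pass identifiers taken from the failure output as search terms, for example the volume ID `20f3c7c2-86c9-449f-b342-7bd173f625e5` or the region ID `77b77b81-81ae-40a1-bf0e-57db31e8143a`.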

The log excerpt also shows the `old_region_id` and `new region id`. The `old_region_id` is `Some(Region { ... })`, which means a region existed before the replacement process. However, `new region id` is `None`, which could indicate that the region replacement process failed to create a new region or that the new region was not properly registered. This discrepancy between the old and new region IDs is a critical clue.

## Potential Causes and Troubleshooting Steps

Based on the information available, here are some potential causes for the test failure and the corresponding troubleshooting steps:

1.  **Race Condition:** The test name itself, `test_racing_replacements_for_soft_deleted_disk_volume`, suggests a focus on race conditions. It's possible that concurrent operations are interfering with the disk volume replacement process. This could be due to timing issues or synchronization problems in the code. To investigate this, we can:

    *   Review the code related to disk volume replacement and identify potential race conditions.
    *   Add more logging to the test to track the sequence of operations and identify where the race condition might be occurring.
    *   Consider using synchronization primitives (e.g., mutexes, semaphores) to protect critical sections of code.

2.  **Database Issues:** The test involves database operations, and failures could be related to database connectivity, data corruption, or concurrency issues within the database. To address this, we can:

    *   Verify the database connection settings and ensure the database is running correctly.
    *   Examine the database logs for any errors or warnings.
    *   Check for any database deadlocks or lock contention issues.

3.  **Soft Deletion Logic:** The test specifically deals with soft-deleted disk volumes. There might be issues in the logic that handles soft deletion and subsequent replacement. We can:

    *   Review the code responsible for soft deletion and ensure it's functioning as expected.
    *   Check if the soft deletion process is correctly marking the disk volume as deleted.
    *   Verify that the replacement process is correctly identifying and handling soft-deleted volumes.

4.  **Region Replacement Driver:** The assertion failure involves the `RegionReplacementDriverStatus`, indicating a potential problem with the driver responsible for region replacement. We should:

    *   Inspect the `RegionReplacementDriver` code for any bugs or unexpected behavior.
    *   Ensure that the driver is correctly invoked during the replacement process.
    *   Verify that the driver is reporting the correct status after the replacement attempt.

5.  **Serialization/Deserialization:** The test uses `serde_json` to deserialize the `RegionReplacementDriverStatus`. If the data being serialized or deserialized is not in the expected format, it can lead to errors. We need to:

    *   Verify that the `RegionReplacementDriverStatus` struct is correctly defined.
    *   Check if the JSON data being serialized and deserialized matches the structure of the struct.
    *   Look for any potential issues with the serialization or deserialization process.
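
For the race-condition hypothesis in item 1, a common mitigation in tests is to poll until the expected state appears, rather than asserting on a single snapshot of a background task's result. The sketch below shows the idea in synchronous, standard-library-only Rust. `poll_until` is a hypothetical helper; the real test is async, and the project's own test utilities may already provide an equivalent, so treat this as an illustration of the technique rather than a proposed patch.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Call `check` repeatedly until it returns `Some(value)` or
/// `timeout` elapses. Returns `None` if the deadline passes first.
fn poll_until<T>(
    mut check: impl FnMut() -> Option<T>,
    interval: Duration,
    timeout: Duration,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        // The condition is always checked at least once.
        if let Some(value) = check() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        sleep(interval);
    }
}
```

Whether polling is the right fix depends on what the investigation finds: it only helps if the background task eventually reports a non-empty `drive_invoked_ok`, and it would mask nothing if the drive step is genuinely never invoked.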

## Reproducing the Failure Locally

One of the most effective ways to troubleshoot a test failure is to reproduce it locally. This allows us to run the test in a controlled environment, set breakpoints, and examine the state of the system at various points. To reproduce the failure, we should:

*   Ensure we have the same environment as the CI environment (e.g., operating system, dependencies, database configuration).
*   Run the test using the same command-line arguments and settings as in CI.
*   Attach a debugger to the test process and step through the code to identify the exact point of failure.
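
An intermittent failure may need many runs before it shows up locally. The sketch below builds a command that runs just this test and repeats it until the first failure. It assumes `cargo-nextest` is installed (the CI log output is in nextest's format) and that the test lives in the `omicron-nexus` package, as the log's `omicron-nexus::test_all` prefix suggests; `nextest_command` and `first_failure` are hypothetical helper names.

```rust
use std::process::Command;

/// Build (but do not run) a command that runs tests whose names
/// match `test_name`. Assumes `cargo-nextest` is installed.
fn nextest_command(test_name: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args(["nextest", "run", "-p", "omicron-nexus", test_name]);
    cmd
}

/// Run the test up to `attempts` times. Returns the 1-based number
/// of the first failing attempt, or `None` if every attempt passed.
fn first_failure(
    test_name: &str,
    attempts: u32,
) -> std::io::Result<Option<u32>> {
    for attempt in 1..=attempts {
        let status = nextest_command(test_name).status()?;
        if !status.success() {
            return Ok(Some(attempt));
        }
    }
    Ok(None)
}
```

Calling `first_failure("test_racing_replacements_for_soft_deleted_disk_volume", 50)` from the repository root would stop at the first failing run, leaving its log files in place for inspection.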

## Next Steps

To effectively resolve this **test failure**, the following steps should be taken:

1.  **Examine the log files:** The detailed logs generated by the test are crucial for understanding the sequence of events and identifying the root cause.
2.  **Reproduce the failure locally:** This allows for debugging and a deeper understanding of the issue.
3.  **Review the code:** Pay close attention to the areas related to disk volume replacement, soft deletion, and the `RegionReplacementDriver`.
4.  **Implement targeted logging:** Add more logging statements to the test and the relevant code to track the state of the system and identify the point of failure.
5.  **Consider potential race conditions:** Use synchronization primitives if necessary to protect critical sections of code.
6.  **Address any database issues:** Verify database connectivity, examine logs, and check for deadlocks or lock contention.

By systematically investigating these areas, we can pinpoint the root cause of the **test failure** and implement the necessary fixes. Let's get to work and make sure this test passes consistently!