Summary
Three findings in internal/controller/vm/vm.go discovered during code review of #2634. None are introduced by that PR; all are pre-existing in upstream/main.
1. Data race in Stats() on c.vmmemProcess (most important)
Location: internal/controller/vm/vm.go:464-471
c.mu.RLock()
defer c.mu.RUnlock()
...
// Initialization of vmmemProcess to calculate stats properly for VA-backed UVMs.
if c.vmmemProcess == 0 {
vmmemHandle, err := vmutils.LookupVMMEM(ctx, c.uvm.RuntimeID(), &iwin.WinAPI{})
if err != nil {
return nil, fmt.Errorf("cannot get stats: %w", err)
}
c.vmmemProcess = vmmemHandle // ← write under RLock
}
Stats() writes c.vmmemProcess while holding only c.mu.RLock(). The same field is mutated under exclusive c.mu.Lock() in TerminateVM (vm.go:541) and read under exclusive lock in other paths.
Stats() is called by the shim's SandboxMetrics RPC, which containerd polls on a timer for every running pod. Two concurrent metric scrapes on the same Controller can both:
- acquire the read lock (allowed concurrently),
- observe vmmemProcess == 0,
- call LookupVMMEM (which succeeds and returns a handle), and
- assign to c.vmmemProcess — race + handle leak (one of the two opened handles is overwritten without being closed; the other is "lucky" to be the kept value).
Reproduction: add a parallel-Stats test under -race and watch the data-race detector fire.
Fix sketch (any one of these):
- Take c.mu.Lock() (exclusive) for the duration of Stats(), OR
- Initialize vmmemProcess eagerly during StartVM (under the exclusive lock already held there), OR
- Use a sync.Once per Controller for the lookup.
2. windows.Close(0) when vmmemProcess was never initialized
Location: internal/controller/vm/vm.go:541
// Best effort attempt to clean up the open vmmem handle.
_ = windows.Close(c.vmmemProcess)
If Stats() was never called (e.g., a sandbox that was Created and Terminated without ever transitioning to Running long enough to be scraped), c.vmmemProcess is 0. windows.Close(0) returns ERROR_INVALID_HANDLE. The error is swallowed by _ = so there is no functional damage, but it is technically wrong and produces ETW noise.
Fix: add a nil-handle guard:
if c.vmmemProcess != 0 {
_ = windows.Close(c.vmmemProcess)
}
3. waitForVMExit overwrites StateInvalid to StateTerminated
Location: internal/controller/vm/vm.go:328-333
c.mu.Lock()
if c.vmState != StateTerminated {
c.vmState = StateTerminated
}
c.mu.Unlock()
Behaviour question rather than a clear bug: the guard if c.vmState != StateTerminated overwrites StateInvalid to StateTerminated. StateInvalid is set by TerminateVM when uvm.Close fails, indicating a leaked HCS handle. If the background waitForVMExit then runs and sees the VM has actually exited, it currently masks the Invalid signal.
The state-transition table in state.go documents StateInvalid → StateTerminated only via "TerminateVM called". The current code adds a second path through waitForVMExit.
Possible interpretations:
- Intentional: once the VM is actually gone, Terminated is the truthful state regardless of how the controller arrived there. StateInvalid was about the leaked handle, which windows.Close in TerminateVM already attempted.
- Unintentional: the recovery contract assumes StateInvalid persists until a successful TerminateVM retry. If waitForVMExit sneaks in first and clears it, callers cannot distinguish "cleanly terminated" from "terminated with a leaked handle".
A regression guard test pinning the current behaviour was added in #2634 (TestWaitForVMExit_OverwritesInvalidToTerminated); that test should be updated if the intent is changed.
Suggested fix (if unintentional): narrow the guard to if c.vmState == StateRunning.
Out of scope