
[codex] fix studio expose reliability #17

Merged
DjDeveloperr merged 2 commits into main from fix-studio-expose-reliability
May 5, 2026


@DjDeveloperr
Collaborator

Summary

This PR fixes the Studio expose reliability issue where a local simdeck studio expose process could keep a Studio session alive even after its local daemon was gone. In that state Studio kept showing "Starting simulator" forever because the provider bridge continued to heartbeat as provider-online, but the SimDeck health and simulator probes could no longer succeed, so the session never returned to ready.

The fix makes the provider bridge lifecycle explicit. The server now passes the expose parent PID into the bridge, and the bridge exits when that parent process disappears. The bridge also tracks local daemon availability; if the local daemon remains unreachable for the configured timeout, the bridge marks the Studio session failed and exits rather than leaving a zombie provider-online session behind. Normal shutdown still marks sessions expired.
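The parent-PID check described above can be sketched as follows. This is a minimal illustration, not the bridge's actual code: `isProcessAlive` and `watchParent` are hypothetical names, and the one-second poll interval is an assumption. It relies on Node's `process.kill(pid, 0)`, which tests for process existence without delivering a signal.

```javascript
// Returns true if a process with this PID currently exists.
// Signal 0 performs the existence/permission check without sending a signal.
function isProcessAlive(pid) {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (error) {
    // EPERM means the process exists but we are not allowed to signal it.
    return error.code === "EPERM";
  }
}

// Hypothetical watcher: polls the parent PID and calls onGone once it disappears.
// Returns a function that stops the polling.
function watchParent(parentPid, onGone, intervalMs = 1000, isAlive = isProcessAlive) {
  const timer = setInterval(() => {
    if (!isAlive(parentPid)) {
      clearInterval(timer);
      onGone();
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

In a bridge process, `onGone` would be the place to mark the session and exit; the liveness function is injectable so the loop can be tested without spawning processes.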

This also keeps the remote streaming changes that were part of the reliability work: remote sessions default to software H.264, 30 fps, balanced quality, expose remote FPS choices of 15/30/60, and avoid tearing down recoverable remote WebRTC connections on short stalls. The stream-quality endpoint is now idempotent for identical settings so clients do not churn encoders by reposting the same config.
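To illustrate the idempotency point, here is a rough sketch of a handler that skips encoder reconfiguration when the posted settings match the current ones. The settings shape (`encoder`, `fps`, `quality`) and the function names are assumptions for illustration, not the server's real types.

```javascript
// Hypothetical settings shape: { encoder, fps, quality }.
function sameSettings(a, b) {
  return a.encoder === b.encoder && a.fps === b.fps && a.quality === b.quality;
}

// Wraps a reconfigure callback so identical requests become no-ops.
function createStreamQualityHandler(initial, reconfigureEncoder) {
  let current = { ...initial };
  return function applyStreamQuality(requested) {
    const next = { ...current, ...requested };
    if (sameSettings(current, next)) {
      // Identical config: report success without touching the encoder.
      return { changed: false, settings: current };
    }
    reconfigureEncoder(next);
    current = next;
    return { changed: true, settings: current };
  };
}
```

Reposting `{ fps: 30 }` against a 30 fps session returns `changed: false` and never calls `reconfigureEncoder`.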

The E2E WebRTC harness now supports Studio-hosted URLs directly. It derives the Studio SimDeck API proxy path from /simulator/:sessionId, can pull large visual-reference screenshots directly from the local daemon when Studio RPC body size is not appropriate for PNG screenshots, records sampler errors, and treats visual artifacts as sustained/repeated failures instead of failing on one unsynchronized screenshot/video sample.
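One way to express "sustained/repeated" artifact failures is sketched below. The thresholds (three consecutive bad samples, or more than 20% bad overall) and the function name are assumptions chosen for illustration; the harness may use different criteria.

```javascript
// samples: array of booleans, true when a screenshot/video comparison looked bad.
// Fails only when artifacts persist, so one unsynchronized sample is tolerated.
function hasSustainedArtifacts(samples, { minConsecutive = 3, maxRatio = 0.2 } = {}) {
  let run = 0;
  let longestRun = 0;
  let bad = 0;
  for (const isBad of samples) {
    if (isBad) {
      bad += 1;
      run += 1;
      longestRun = Math.max(longestRun, run);
    } else {
      run = 0;
    }
  }
  const ratio = samples.length > 0 ? bad / samples.length : 0;
  return longestRun >= minConsecutive || ratio > maxRatio;
}
```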

Validation

  • npm run build
  • npm run lint
  • npm run test
  • node --test scripts/studio-provider-bridge.test.mjs
  • Fresh Studio expose moved the session to ready.
  • Daemon-death regression: killed the local daemon under a running expose; the expose process exited and Studio status became failed instead of staying provider-online.
  • 60-second Studio-hosted WebRTC E2E passed with:
    • 0 reconnects
    • 0 packets lost
    • 0 decoder drops
    • max observed frame gap about 50ms
    • visual artifact failure ratio 0

DjDeveloperr force-pushed the fix-studio-expose-reliability branch from 755af98 to 3843ee5 on May 4, 2026, 21:13
@DjDeveloperr
Collaborator Author

Update after remote Mac mini repro:

  • The remote session showed substantial RTP packet loss over internet UDP (packetsLost increasing, RTT around 100-220ms) without TURN/relay candidates.
  • Realtime H.264 now advertises generic nack feedback as well as PLI/FIR so browsers can request missing RTP retransmits instead of waiting for a keyframe.
  • Server-side peer disconnected grace is increased to cover short remote ICE wobbles, preventing the server from killing a recoverable connection before the client recovers.
  • Provider bridge failure detection is now more precise. studio expose passes the local daemon supervisor PID and daemon log path to the bridge. The bridge fails immediately if that process is gone; if HTTP is unavailable, it prints recent daemon logs so we can see why the daemon stopped answering, rather than only printing fetch failed.
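For reference, the feedback types mentioned above appear in SDP as `a=rtcp-fb` attributes. The helper below is only a sketch of what those lines look like; the function name is hypothetical and the payload type in the usage note is an example value, since the real one is negotiated per session.

```javascript
// Builds the SDP feedback attribute lines for one video payload type.
function h264FeedbackLines(payloadType) {
  return [
    `a=rtcp-fb:${payloadType} nack`,     // generic NACK: request retransmission of lost packets
    `a=rtcp-fb:${payloadType} nack pli`, // Picture Loss Indication
    `a=rtcp-fb:${payloadType} ccm fir`,  // Full Intra Request
  ];
}
```

For example, `h264FeedbackLines(96)` yields three lines starting with `a=rtcp-fb:96`. With generic nack advertised, a browser can ask for the missing packets instead of waiting for the next keyframe.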

Validation rerun:

  • npm run build
  • npm run lint
  • npm run test
  • node --test scripts/studio-provider-bridge.test.mjs

DjDeveloperr force-pushed the fix-studio-expose-reliability branch from 3843ee5 to 342bfd7 on May 4, 2026, 21:22
@DjDeveloperr
Collaborator Author

Corrected note for the latest update:

  • A local REST/health outage no longer makes the Studio bridge mark the provider failed or exit while the daemon supervisor process is still alive. It now keeps the bridge up, logs the local failure reason, and tails recent daemon logs.
  • The bridge only marks the provider failed when the daemon supervisor PID is gone.
  • Added focused tests so an HTTP-only outage stays non-terminal, while a supervisor exit remains terminal.

This matches the case where the WebRTC stream can still be working even though the health or simulator list endpoints temporarily fail.
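The terminal versus non-terminal split in this note can be summarized as a small decision function. This is a sketch of the rule as described in this comment only; the action names and input shape are hypothetical.

```javascript
// supervisorAlive: whether the daemon supervisor PID still exists.
// httpReachable: whether the local REST/health probe succeeded.
function decideBridgeAction({ supervisorAlive, httpReachable }) {
  if (!supervisorAlive) {
    // Supervisor exit is the only terminal condition.
    return { action: "mark-failed-and-exit", terminal: true };
  }
  if (!httpReachable) {
    // HTTP-only outage: stay up, log the reason, tail recent daemon logs.
    return { action: "log-and-tail-daemon-log", terminal: false };
  }
  return { action: "heartbeat", terminal: false };
}
```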

DjDeveloperr force-pushed the fix-studio-expose-reliability branch from 342bfd7 to a5b420b on May 4, 2026, 21:30
@DjDeveloperr
Collaborator Author

Follow-up on the hang/reconnect cause:

The daemon log showed the concrete failure mode: the HTTP listener can hit accept error: Bad file descriptor, after which the runtime heartbeat can still be fresh and existing WebRTC media may continue, but Studio metadata/control/new requests are dead.

Latest update fixes recovery instead of masking it:

  • Provider bridge no longer exits on local HTTP outage while the daemon supervisor is alive.
  • Server watchdog now treats repeated local HTTP health probe failure as recoverable daemon failure even when the runtime heartbeat is fresh.
  • HTTP probe failure threshold is separate and faster: 3 consecutive failures, while stale runtime heartbeat remains 12.
  • Added tests for HTTP-listener unhealthy, transient HTTP failures, and stale heartbeat restart decisions.
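The two thresholds above can be pictured as independent consecutive-failure counters. The sketch below assumes both values count consecutive watchdog ticks, which the comment does not state explicitly; the function and result names are hypothetical.

```javascript
const HTTP_FAILURE_THRESHOLD = 3;     // consecutive failed HTTP health probes
const STALE_HEARTBEAT_THRESHOLD = 12; // consecutive ticks with a stale runtime heartbeat

function createWatchdog() {
  let httpFailures = 0;
  let staleHeartbeats = 0;
  return function observe({ httpOk, heartbeatFresh }) {
    // A success resets its counter, so transient failures never accumulate.
    httpFailures = httpOk ? 0 : httpFailures + 1;
    staleHeartbeats = heartbeatFresh ? 0 : staleHeartbeats + 1;
    let decision = "ok";
    if (httpFailures >= HTTP_FAILURE_THRESHOLD) {
      decision = "restart:http-listener-unhealthy";
    } else if (staleHeartbeats >= STALE_HEARTBEAT_THRESHOLD) {
      decision = "restart:stale-heartbeat";
    }
    if (decision !== "ok") {
      // Start counting afresh once a restart has been requested.
      httpFailures = 0;
      staleHeartbeats = 0;
    }
    return decision;
  };
}
```

The key case is a dead HTTP listener with a fresh heartbeat: the HTTP counter alone reaches 3 and triggers the restart.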

Local reproduction/verification:

  • Existing daemon log contained repeated accept error: Bad file descriptor entries, along with the old behavior of logging that it was keeping active streams alive.
  • 1,200-request stress run against health/metrics/simulators/stream-quality passed.
  • 60s repeated Studio-style health + simulator probe loop passed with 0 failures.
  • 60s local WebRTC E2E with concurrent probes passed: reconnects 0, packetsLost 0, decoder drops 0.

DjDeveloperr force-pushed the fix-studio-expose-reliability branch from a5b420b to 263df42 on May 4, 2026, 21:44
@DjDeveloperr
Collaborator Author

Follow-up from the remote Mac mini logs:

The bridge was repeatedly failing plain local fetches to http://127.0.0.1:4310, which means the Studio bridge/cloud path was alive but the local daemon HTTP listener was not accepting connections. The supervisor PID being alive was not enough; it could be supervising a dead/restarting/non-serving daemon child.

Latest update adds an explicit Studio-expose recovery path:

  • studio expose now passes the bridge a daemon restart command matching the expose stream settings.
  • If local HTTP stays unavailable for 45s, the bridge runs simdeck daemon restart, then reads daemon status to refresh local URL/token/PID/log path.
  • This preserves the same Studio session URL instead of leaving the page stuck or requiring a manual expose restart.
  • Existing daemon watchdog still restarts on repeated local health probe failure as a second line of defense.
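The 45-second rule can be sketched as an outage tracker with an injected clock. The names are hypothetical, and the choice to open a fresh window after requesting a restart is an assumption; the actual restart would be the simdeck daemon restart command described above.

```javascript
const RESTART_AFTER_MS = 45_000;

function createOutageTracker(restartAfterMs = RESTART_AFTER_MS) {
  let outageStartedAt = null;
  // httpOk: result of the latest local HTTP probe; nowMs: current time in ms.
  return function observe(httpOk, nowMs) {
    if (httpOk) {
      outageStartedAt = null;
      return "healthy";
    }
    if (outageStartedAt === null) outageStartedAt = nowMs;
    if (nowMs - outageStartedAt >= restartAfterMs) {
      // Assumption: begin a new outage window so restarts are not issued back to back.
      outageStartedAt = null;
      return "restart-daemon";
    }
    return "waiting";
  };
}
```

Passing the clock in as `nowMs` keeps the logic testable without real timers.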

Validation: npm run lint, npm run test, and npm run build pass. Added tests for preserving the remote restart args/default smooth software settings.

DjDeveloperr force-pushed the fix-studio-expose-reliability branch from 263df42 to 685f4b8 on May 4, 2026, 21:57
@DjDeveloperr
Collaborator Author

Added the shared encode/fanout improvement for multiple Studio viewers:

  • Removed the per-WebRTC-peer live refresh timer. Multiple viewers no longer multiply native refresh/encode requests.
  • Added one shared simulator-session refresh pump that runs only while there are active frame subscribers.
  • Each WebRTC peer still packetizes/sends independently, but they all consume the same encoded H.264 broadcast frames from the single native session encoder.
  • Raised the per-simulator WebRTC viewer cap from 4 to 16 now that refresh/encode is shared.
  • Stream-quality changes still reconfigure the shared native encoder through the registry, so settings apply to all viewers on that simulator.
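The shared pump idea looks roughly like this in miniature: one timer, started on the first subscriber and stopped on the last, with each encoded frame fanned out to every subscriber. This is a simplified sketch with hypothetical names; the real pump drives a native encoder rather than a callback.

```javascript
// encodeFrame: produces one encoded frame per tick (stands in for the native encoder).
function createSharedPump(encodeFrame, intervalMs = 33) {
  const subscribers = new Set();
  let timer = null;

  function tick() {
    const frame = encodeFrame(); // one encode per tick, regardless of viewer count
    for (const send of subscribers) send(frame);
  }

  return {
    subscribe(send) {
      subscribers.add(send);
      if (timer === null) timer = setInterval(tick, intervalMs);
      // Returns an unsubscribe function; the pump stops with the last subscriber.
      return () => {
        subscribers.delete(send);
        if (subscribers.size === 0 && timer !== null) {
          clearInterval(timer);
          timer = null;
        }
      };
    },
    tick,
    isRunning: () => timer !== null,
  };
}
```

Each peer still does its own packetizing inside `send`, but the encode count no longer scales with the number of viewers.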

Verification:

  • npm run lint
  • npm run test
  • npm run build
  • Two concurrent local WebRTC E2E viewers both reported 0 reconnects, 0 packet loss, 0 decoder drops.
  • Daemon metrics after the run showed frames_encoded=2424 and frames_sent=4836, matching one encode with roughly two sends per frame.
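As a quick sanity check on those metrics, the sends-per-encode ratio works out to just under 2, which is what two viewers sharing one encoder should produce:

```javascript
const framesEncoded = 2424;
const framesSent = 4836;

// With a shared encoder and two viewers, each encoded frame is sent about twice.
const sendsPerEncodedFrame = framesSent / framesEncoded;
console.log(sendsPerEncodedFrame.toFixed(3)); // roughly 1.995
```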

DjDeveloperr force-pushed the fix-studio-expose-reliability branch 2 times, most recently from bc4e9e7 to 7e44bd1 on May 4, 2026, 22:31
@DjDeveloperr
Collaborator Author

Added an expose/WebRTC multi-viewer stress test in scripts/e2e-webrtc-stress.mjs and wired it as npm run test:e2e:webrtc:stress.

What it does:

  • defaults to 10 clients total: 5 steady viewers + 5 churn viewers
  • supports local URLs and Studio expose /simulator/<id> URLs by routing metrics through /api/provider-sessions/<id>/simdeck
  • launches independent Chrome/CDP viewers, disables touch interactions for stress children, and uses structured JSON summaries from each child run
  • asserts no reconnects, no decoder drops by default, and no active stream leak after a settle window
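The URL routing in the second bullet can be sketched as below. The function name and the example hostname in the usage note are placeholders; only the /simulator/<id> and /api/provider-sessions/<id>/simdeck paths come from the description above.

```javascript
// Maps a viewer page URL to the base URL used for metrics requests.
// Studio expose pages route through the provider-session proxy; local URLs are used as-is.
function resolveMetricsBase(pageUrl) {
  const url = new URL(pageUrl);
  const match = url.pathname.match(/^\/simulator\/([^/]+)\/?$/);
  if (!match) return url.origin;
  return `${url.origin}/api/provider-sessions/${match[1]}/simdeck`;
}
```

For instance, a page at `https://studio.example.test/simulator/abc123` resolves to `https://studio.example.test/api/provider-sessions/abc123/simdeck`, while a local daemon URL resolves to its own origin.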

Local validation:

  • npm run lint
  • npm run test
  • npm run build
  • 2-client stress smoke passed: 1 steady + 1 churn, 0 reconnects, 0 decoder drops, no active stream leak ✅
  • strict 10-client short stress run intentionally failed on this machine due to decoder drops under load, with 0 reconnects and no active stream leak. That means the new test is catching the current multi-viewer video-health bottleneck instead of passing over it.

DjDeveloperr force-pushed the fix-studio-expose-reliability branch from 7e44bd1 to 6d05f91 on May 4, 2026, 23:27
@DjDeveloperr
Collaborator Author

Follow-up fixes pushed in 6d05f91:

  • Studio remote pages no longer try the local /api/simulators/:udid/control WebSocket fallback when the WebRTC data channel is not open. That fallback cannot work through provider RPC and was the source of the misleading GET /control failed: fetch failed logs.
  • Provider bridge now detects WebSocket upgrade requests before proxying and returns 426 without marking local SimDeck HTTP unavailable.
  • FPS/quality changes no longer tear down and recreate the WebRTC peer connection. They now POST /api/stream-quality in place and ask the active stream for a fresh keyframe. Encoder mode changes still reconnect because hardware/software can affect negotiated H.264 behavior.
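The upgrade detection in the second bullet can be approximated with a header check like the one below. This is a sketch: the function names and the result shape are hypothetical, and it assumes lowercased header keys as Node's http module provides them.

```javascript
// True when the request asks to switch protocols to WebSocket.
function isWebSocketUpgrade(headers) {
  const upgrade = String(headers["upgrade"] ?? "").toLowerCase();
  const connection = String(headers["connection"] ?? "").toLowerCase();
  const wantsUpgrade = connection.split(",").some((token) => token.trim() === "upgrade");
  return upgrade === "websocket" && wantsUpgrade;
}

// Decides what the bridge does before proxying a request to the local daemon.
function proxyDecision(headers) {
  if (isWebSocketUpgrade(headers)) {
    // Upgrades cannot pass through provider RPC: answer 426 Upgrade Required
    // and do not count this against local HTTP availability.
    return { status: 426, forward: false, countsAsLocalHttpFailure: false };
  }
  return { status: null, forward: true, countsAsLocalHttpFailure: false };
}
```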

Validated locally with npm run lint && npm run test && npm run build, plus node --test scripts/studio-provider-bridge.test.mjs.

DjDeveloperr force-pushed the fix-studio-expose-reliability branch from 6d05f91 to fba1036 on May 5, 2026, 00:58
DjDeveloperr force-pushed the fix-studio-expose-reliability branch from fba1036 to 620db98 on May 5, 2026, 01:02
DjDeveloperr merged commit 6236d1f into main on May 5, 2026
6 checks passed