
perf: Fast file sync with content SHA checks #5166

Draft
shreyas-goenka wants to merge 6 commits into main from shreyas-goenka/sha-snapshots

Conversation


@shreyas-goenka shreyas-goenka commented May 4, 2026

What

Sync now runs in three layers:

  1. Discovery (libs/git, libs/fileset) — walks the local tree.
  2. Snapshot diff (libs/sync/snapshot.go, diff.go) — compares against the local mtime snapshot to produce an action plan (puts/deletes/mkdirs/rmdirs). Unchanged from main.
  3. Remote filter (new, libs/sync/remote_filter.go) — pre-flight that bulk-fetches content SHAs from the workspace via /api/2.0/workspace/list-repo?return_wsfs_metadata=true and drops puts whose remote SHA already matches the local SHA.

Layer 3 only runs when the snapshot is fresh (no prior local state). With an existing snapshot, Layer 2 is precise and up to date in normal usage for DABs; paying for a bulk remote list would be wasted work.

The change is purely additive: snapshot schema is unchanged (still v1), and Layer 3 can only ever remove puts, never add. Errors and missing remote state degrade gracefully — worst case is the existing behavior (re-upload).
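
Conceptually, Layer 3 looks like this (a minimal sketch; helper names like `localSHA256Hex` and `filterPuts` are illustrative and not the actual code in libs/sync/remote_filter.go):

```go
// Sketch of the Layer 3 pre-flight. Given the puts produced by Layer 2, a
// local->remote name mapping, and the content SHAs returned by list-repo,
// keep only the puts whose local hash differs from (or is absent on) the
// remote. The filter can only drop puts; every error path keeps the put.
package sync

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

func localSHA256Hex(root, rel string) (string, error) {
	b, err := os.ReadFile(filepath.Join(root, rel))
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func filterPuts(root string, puts []string, localToRemote, remoteSHAs map[string]string) []string {
	var kept []string
	for _, p := range puts {
		remoteSHA, ok := remoteSHAs[localToRemote[p]]
		if !ok {
			kept = append(kept, p) // not on the remote yet: upload
			continue
		}
		localSHA, err := localSHA256Hex(root, p)
		if err != nil || localSHA != remoteSHA {
			kept = append(kept, p) // mismatch or read error: upload
			continue
		}
		// SHA match: drop this put.
	}
	return kept
}
```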

Why

CI runners (and any other context that wipes .databricks/ between deploys) today re-upload every file in the bundle, even when the workspace already has identical contents. Layer 3 makes the workspace's actual state authoritative, so a fresh runner skips uploads of already-current files.

Benchmarks

Head-to-head against an AWS workspace (id 4342141078796725). Each scenario: 3 trials, medians shown. Baseline = origin/main; PR = shreyas-goenka/sha-snapshots. Both binaries use import-file for upload.

Cold deploy of an unchanged bundle (CI re-deploy scenario)

This is the headline use case: a CI runner clones the repo and deploys, but nothing has actually changed since the previous deploy. Today, every file gets re-uploaded.

| Bundle size | Baseline (no Layer 3) | With Layer 3 | Saved | % saved |
|---|---|---|---|---|
| 20 files × 50 KB | 1.99 s | 0.89 s | 1.10 s | 55 % |
| 100 files × 50 KB | 9.24 s | 0.91 s | 8.33 s | 90 % |
| 200 files × 50 KB | 15.71 s | 1.02 s | 14.69 s | 94 % |
| 500 files × 50 KB | 34.63 s | 1.28 s | 33.35 s | 96 % |

The deploy's wall-clock cost with Layer 3 stays nearly flat (≈0.9–1.3 s from 20 to 500 files) because Layer 3 is dominated by one bulk list-repo call (~150–500 ms RTT) plus negligible local SHA-256 hashing (~1 GB/s).

Cold deploy with one large file changed

Bundle = 100 small files (50 KB each, all match remote) + one 10 MB wheel that's been re-built locally.

| Scenario | Baseline | With Layer 3 | Saved |
|---|---|---|---|
| 100 files + 10 MB wheel, wheel changed | 9.06 s | 0.91 s | 8.15 s (90 %) |

The 100 unchanged files are skipped via SHA match; only the wheel is uploaded.

Worst case: cold deploy to an empty workspace

Layer 3 has nothing to skip — its only effect is the wasted list-repo call.

| Scenario | Baseline | With Layer 3 | Cost |
|---|---|---|---|
| 100 files, empty remote | 9.08 s | 8.65 s | within noise (Layer 3 cost ≈ 200 ms, lost in upload-pool variance) |

Warm snapshot path is unchanged

When a local snapshot exists, Layer 3 is gated off. Performance must equal main.

| Scenario | Baseline | With Layer 3 | Δ |
|---|---|---|---|
| 100 files, warm snapshot, no changes | 0.72 s | 0.72 s | within noise |

Tree depth doesn't matter

list-repo is server-side recursive: it returns the entire subtree in one RTT. To verify, I compared list-repo against a hypothetical client-side parallel walk via /api/2.0/workspace/list (non-recursive, 16 workers).

| Tree shape | list-repo (current) | parallel list walk (w=16) |
|---|---|---|
| 200 files, flat | 154 ms | 151 ms |
| 200 files, depth 5, ~243 dirs | 170 ms | 20,133 ms (~100×) |
| 200 files, depth 10, ~1024 dirs | 176 ms | 26,929 ms (~150×) |
| 1000 files, depth 5 | 259 ms | (not run; would be even worse) |

list-repo is essentially flat with respect to depth. Cost vs. file count (separate measurement, 10 trials per N, medians):

| N | median list time |
|---|---|
| ≤ 200 | ~150 ms (RTT-bound; flat) |
| 500 | ~286 ms (server-side step; variance spikes here) |
| 1000–2000 | ~300–340 ms |
| 5000 | ~440 ms |

Bounded between ~150 ms for small bundles and < 500 ms even at N = 5000. Parallel client-side walking pays an RTT per intermediate directory and never recovers.

How the heuristic was chosen

We considered adding an N-files threshold to gate Layer 3. The data argues against it:

  • The list call's cost is bounded: ~150 ms for typical bundles, ~440 ms even at N = 5000 (see table above).
  • Local hashing is ≈ 1 GB/s — negligible for any realistic bundle size.
  • Per-file upload latency floor is ≈ 220 ms; with 8-worker parallelism, the break-even matched-file count is ~6.
  • Empty-remote worst case costs ~200 ms; common-case savings are seconds to tens of seconds.

So the only gate is snapshot.New. An N threshold would protect a rare edge case at the cost of the high-value common case.
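
For reference, the ~6-file break-even falls out of the numbers above as a rough back-of-envelope: with 8 workers the amortized per-file upload cost is ≈ 220 ms / 8 ≈ 27.5 ms, so skipping k matched files saves ≈ 27.5 × k ms, while the filter costs ~150 ms for a typical bundle; 150 / 27.5 ≈ 5.5, i.e. about 6 matched files pay for the list call.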

Tested

End-to-end against an AWS workspace (id 4342141078796725) for every workspace-filesystem object type DABs uses:

| Local file | object_type | Local SHA = remote content_sha256_hex? |
|---|---|---|
| plain file | FILE | yes |
| .py with # Databricks notebook source | NOTEBOOK (PYTHON) | yes |
| .sql with -- Databricks notebook source | NOTEBOOK (SQL) | yes |
| .ipynb | NOTEBOOK (PYTHON) | yes |
| .lvdash.json | DASHBOARD | yes |

The filter correctly handles the notebook extension-stripping in LocalToRemoteNames (so my-nb.py locally maps to my-nb in the list-repo response). Dashboards keep their full .lvdash.json extension.
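
As a tiny illustration of the lookup direction (the file names here are hypothetical; the real mapping comes from LocalToRemoteNames in the sync diff):

```go
// Illustrative only: how the filter keys its lookups against the list-repo map.
func lookupExamples(localToRemoteNames, remoteSHAs map[string]string) (notebookFound, dashboardFound bool) {
	// Notebook source: the extension is stripped remotely, so look up "my-nb".
	_, notebookFound = remoteSHAs[localToRemoteNames["my-nb.py"]]
	// Dashboard: the full .lvdash.json name is preserved remotely.
	_, dashboardFound = remoteSHAs[localToRemoteNames["dash.lvdash.json"]]
	return notebookFound, dashboardFound
}
```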

Notes / follow-ups

  • requires_sync_to_wsfs: true on a notebook or dashboard with a pending UI autosave would currently cause the filter to skip the upload (the workspace's flushed SHA matches local, but the in-flight UI state doesn't). Harmless for re-deploys from local source.
  • list-repo is an internal endpoint; the WHS migration plan does not currently include a recursive list. If list-repo goes away, we'll need the WHS-side replacement (ListTreeNodes with blob_info.sha256).

Test plan

  • Unit tests in libs/sync/remote_filter_test.go covering empty/error/match/mismatch/missing/notebook-rename/mixed-diff cases.
  • End-to-end verification against an AWS workspace (id 4342141078796725) for cold + warm snapshot paths.
  • Object-type matrix against an AWS workspace (id 4342141078796725).
  • Wall-clock benchmarks above.
  • Acceptance test in acceptance/bundle/deploy/ covering the cold-snapshot path. (Existing tests still pass; a dedicated regression test for Layer 3 would be a nice-to-have.)

Sync runs in three layers:

  1. Discovery (libs/git, libs/fileset): walk the local tree.

  2. Snapshot diff (libs/sync/snapshot.go, diff.go): compare against the
     local mtime snapshot to produce an action plan (puts/deletes/etc).

  3. Remote filter (new, libs/sync/remote_filter.go): pre-flight that
     bulk-fetches content SHAs from the workspace and drops puts whose
     remote SHA already matches the local SHA.

Layer 3 only runs when the snapshot is fresh (no prior local state) — the
case where Layer 2 produces false-positive puts at scale, e.g. on a CI
runner that has just cloned the repo. With an existing snapshot, Layer 2
is precise; paying for a bulk remote list would be wasted work.

The remote SHA list uses /api/2.0/workspace/list-repo with the
return_wsfs_metadata=true flag, exposed via a new RemoteFileMetadata type
and ListWithSHAs method on WorkspaceFilesClient. Errors and missing remote
state degrade gracefully: the filter returns the unmodified diff and the
worst case is the existing behavior (re-upload).
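
Roughly, the new surface looks like this (a sketch of the shape only; the JSON keys follow the list-repo response described elsewhere in this PR, while the Go field names and the exact signature are illustrative and may differ from the actual code):

```go
// Sketch of the per-object metadata surfaced by list-repo with
// return_wsfs_metadata=true.
type RemoteFileMetadata struct {
	Path             string `json:"path"`
	ObjectType       string `json:"object_type"`
	ContentSHA256Hex string `json:"content_sha256_hex"`
	HasWsfsMetadata  bool   `json:"has_wsfs_metadata"`
	Size             int64  `json:"size"`
	Language         string `json:"language"`
}

// ListWithSHAs (sketch signature): one recursive, server-side list of the
// subtree under root, returned keyed by workspace path. Errors propagate to
// the caller, which falls back to the unfiltered diff.
//
//	func (c *WorkspaceFilesClient) ListWithSHAs(ctx context.Context, root string) (map[string]RemoteFileMetadata, error)
```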

Verified end-to-end against bundle-dev: a sync that deletes the local
snapshot and re-runs produces zero uploads when contents are unchanged,
and uploads only the edited file when one file changes. Notebooks (.py
with the magic header and .ipynb) preserve raw uploaded bytes server-side,
so their SHAs match local SHAs verbatim — no notebook-specific carve-out
needed.

Snapshot schema is unchanged; this is purely additive.

Co-authored-by: Isaac
@shreyas-goenka changed the title from "sync: add a remote-SHA filter as a third layer of the upload pipeline" to "perf: Fast file sync with content SHA checks" on May 4, 2026
…cceptance

Acceptance tests were failing on the direct engine path because Layer 3
of sync (the remote-SHA filter) calls /api/2.0/workspace/list-repo, but
the testserver had no default handler. The CLI's filter logged a Warn
line ("could not fetch remote content SHAs ... No stub found") which
contaminated recorded output.txt diffs.

Add WorkspaceListRepo to the FakeWorkspace: walks the in-memory
files/directories maps, computes SHA-256 over each file's stored bytes,
and returns objects in the same shape as the real list-repo response
(path, object_type, content_sha256_hex, has_wsfs_metadata, size,
language). On a workspace with no imported files, returns
{"objects": []}, which causes Layer 3 to no-op cleanly.

Co-authored-by: Isaac
Layer 3 of sync issues a GET /api/2.0/workspace/list-repo with
return_wsfs_metadata=true on cold-snapshot deploys (which is what
acceptance tests do). Add the new request to the recorded fixtures for
both engines so the diff comparison passes. The User-Agent on the new
call inherits cmd/bundle_deploy and engine/<x> as expected.

Co-authored-by: Isaac
…tions

Companion to the previous fixture updates. The parent user_agent test
aggregates User-Agent observations from per-engine recorded request
logs and writes a shared output.txt; that needs the same list-repo
entries (one per engine).

Co-authored-by: Isaac
@shreyas-goenka
Contributor Author

The reason to make SHA checks an additive layer on top of local snapshots is that it's an independent mechanism for ruling out files we need to upload, and in the future we can choose to expand the scope of similar checks if the marginal performance gains justify it.

…aths

Behind a benchworkspace build tag so they don't run in CI. Three groups:

  - BenchmarkListRepoByCount: list-repo cost vs N (10/100/500/1000).
  - BenchmarkListRepoByContent: list-repo cost across plain files /
    notebooks (py/sql/ipynb) / dashboards / mixed at fixed N=200.
  - BenchmarkSyncRunOnceColdSnapshot: end-to-end Sync.RunOnce against a
    pre-warmed remote, with and without Layer 3, at N=20/100/500. The
    "without" variant zeroes out s.remoteFilter to bypass the new path.

Each run creates and tears down a unique /Users/$USER/.tmp/sync-bench-X
tree on the configured workspace.

Run instructions are in the file's top comment. Required env:

  DATABRICKS_BENCH_PROFILE=<profile>
  DATABRICKS_BENCH_USER=<email>
  go test -tags benchworkspace -bench=. -benchtime=5x -timeout=60m ./libs/sync/...
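
For orientation, the cold-snapshot group has roughly this shape (a sketch only: setupColdSync is a hypothetical stand-in for the fixture code, the per-iteration snapshot reset is elided, and the RunOnce call shape follows libs/sync as described above rather than being copied from it):

```go
//go:build benchworkspace

package sync

import (
	"context"
	"fmt"
	"testing"
)

func BenchmarkSyncRunOnceColdSnapshot(b *testing.B) {
	for _, n := range []int{20, 100, 500} {
		for _, layer3 := range []bool{true, false} {
			b.Run(fmt.Sprintf("n=%d/layer3=%v", n, layer3), func(b *testing.B) {
				s := setupColdSync(b, n) // hypothetical: local tree + pre-warmed remote
				if !layer3 {
					s.remoteFilter = nil // bypass the new path, per the commit message
				}
				b.ResetTimer()
				for i := 0; i < b.N; i++ {
					// Snapshot reset between iterations is elided in this sketch.
					if err := s.RunOnce(context.Background()); err != nil {
						b.Fatal(err)
					}
				}
			})
		}
	}
}
```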

Co-authored-by: Isaac
Expand the benchworkspace benchmarks so every cell can be combined with a
tree shape: flat (no nesting), small (depth 2, branch 2), medium (depth
4, branch 2), large (depth 6, branch 2). All four list-repo and sync
benchmarks now parameterize over (shape × N).

Add a test-only parallelWalk runner that lists everything under a path
by issuing non-recursive /api/2.0/workspace/list calls and recursing
client-side with a configurable worker pool. Add BenchmarkListWalkers
that compares list-repo vs parallel-walk (workers=8 and workers=32) head
to head across (shape × N). The runner is deliberately not exported and
not wired into Sync — production code uses list-repo. The bench is here
to make the comparison reproducible and to back the claim that
list-repo dominates parallel-walk at any non-flat shape.

BenchmarkListRepoByContent now also varies shape, so content × shape is
fully covered. Run all four groups with:

  go test -tags benchworkspace -bench=. -benchtime=5x -timeout=90m ./libs/sync/...
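
A condensed sketch of the test-only comparator, to show why it loses: one RTT per directory even with a worker pool (listDir stands in for the non-recursive /api/2.0/workspace/list call; error handling is elided):

```go
//go:build benchworkspace

package sync

import "sync"

type objectInfo struct {
	Path       string
	ObjectType string // "DIRECTORY", "FILE", "NOTEBOOK", ...
}

// parallelWalk lists a subtree by issuing one non-recursive list call per
// directory, recursing client-side with a bounded worker pool.
func parallelWalk(listDir func(dir string) ([]objectInfo, error), root string, workers int) []objectInfo {
	var (
		mu      sync.Mutex
		results []objectInfo
		wg      sync.WaitGroup
	)
	sem := make(chan struct{}, workers) // bounds concurrent list calls

	var walk func(dir string)
	walk = func(dir string) {
		defer wg.Done()
		sem <- struct{}{}
		objs, err := listDir(dir) // one RTT per directory: this is what loses
		<-sem
		if err != nil {
			return // the real runner surfaces errors; elided here
		}
		mu.Lock()
		results = append(results, objs...)
		mu.Unlock()
		for _, o := range objs {
			if o.ObjectType == "DIRECTORY" {
				wg.Add(1)
				go walk(o.Path)
			}
		}
	}

	wg.Add(1)
	go walk(root)
	wg.Wait()
	return results
}
```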

Co-authored-by: Isaac