
perf: Fast file sync with content SHA checks #5166

Draft
shreyas-goenka wants to merge 6 commits into main from shreyas-goenka/sha-snapshots

Conversation


@shreyas-goenka shreyas-goenka commented May 4, 2026

What

Sync now runs in three layers:

  1. Discovery (libs/git, libs/fileset) — walks the local tree.
  2. Snapshot diff (libs/sync/snapshot.go, diff.go) — compares against the local mtime snapshot to produce an action plan (puts/deletes/mkdirs/rmdirs). Unchanged from main.
  3. Remote filter (new, libs/sync/remote_filter.go) — pre-flight that bulk-fetches content SHAs from the workspace via /api/2.0/workspace/list-repo?return_wsfs_metadata=true and drops puts whose remote SHA already matches the local SHA.

Layer 3 only runs when the snapshot is fresh (no prior local state). With an existing snapshot, Layer 2 is precise and up to date in normal usage for DABs; paying for a bulk remote list would be wasted work.

The change is purely additive: snapshot schema is unchanged (still v1), and Layer 3 can only ever remove puts, never add. Errors and missing remote state degrade gracefully — worst case is the existing behavior (re-upload).
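
Conceptually, Layer 3 looks like this (a minimal sketch; helper names like `localSHA256Hex` and `filterPuts` are illustrative and not the actual code in libs/sync/remote_filter.go):

```go
// Sketch of the Layer 3 pre-flight. Given the puts produced by Layer 2, a
// local->remote name mapping, and the content SHAs returned by list-repo,
// keep only the puts whose local hash differs from (or is absent on) the
// remote. The filter can only drop puts; every error path keeps the put.
package sync

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

func localSHA256Hex(root, rel string) (string, error) {
	b, err := os.ReadFile(filepath.Join(root, rel))
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func filterPuts(root string, puts []string, localToRemote, remoteSHAs map[string]string) []string {
	var kept []string
	for _, p := range puts {
		remoteSHA, ok := remoteSHAs[localToRemote[p]]
		if !ok {
			kept = append(kept, p) // not on the remote yet: upload
			continue
		}
		localSHA, err := localSHA256Hex(root, p)
		if err != nil || localSHA != remoteSHA {
			kept = append(kept, p) // mismatch or read error: upload
			continue
		}
		// SHA match: drop this put.
	}
	return kept
}
```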

Why

CI runners (and any other context that wipes .databricks/ between deploys) today re-upload every file in the bundle, even when the workspace already has identical contents. Layer 3 makes the workspace's actual state authoritative, so a fresh runner skips uploads of already-current files.

Benchmarks

Head-to-head against an AWS workspace (id 4342141078796725). Each scenario: 3 trials, medians shown. Baseline = origin/main; PR = shreyas-goenka/sha-snapshots. Both binaries use import-file for upload.

Cold deploy of an unchanged bundle (CI re-deploy scenario)

This is the headline use case: a CI runner clones the repo and deploys, but nothing has actually changed since the previous deploy. Today, every file gets re-uploaded.

| Bundle size | Baseline (no Layer 3) | With Layer 3 | Saved | % saved |
|---|---|---|---|---|
| 20 files × 50 KB | 1.99 s | 0.89 s | 1.10 s | 55 % |
| 100 files × 50 KB | 9.24 s | 0.91 s | 8.33 s | 90 % |
| 200 files × 50 KB | 15.71 s | 1.02 s | 14.69 s | 94 % |
| 500 files × 50 KB | 34.63 s | 1.28 s | 33.35 s | 96 % |

The deploy's wall-clock cost with Layer 3 stays nearly flat (≈0.9–1.3 s from 20 to 500 files) because Layer 3 is dominated by one bulk list-repo call (~150–500 ms RTT) plus negligible local SHA-256 hashing (~1 GB/s).

Cold deploy with one large file changed

Bundle = 100 small files (50 KB each, all match remote) + one 10 MB wheel that's been re-built locally.

| Scenario | Baseline | With Layer 3 | Saved |
|---|---|---|---|
| 100 files + 10 MB wheel, wheel changed | 9.06 s | 0.91 s | 8.15 s (90 %) |

The 100 unchanged files are skipped via SHA match; only the wheel is uploaded.

Worst case: cold deploy to an empty workspace

Layer 3 has nothing to skip — its only effect is the wasted list-repo call.

| Scenario | Baseline | With Layer 3 | Cost |
|---|---|---|---|
| 100 files, empty remote | 9.08 s | 8.65 s | within noise (Layer 3 cost ≈ 200 ms, lost in upload-pool variance) |

Warm snapshot path is unchanged

When a local snapshot exists, Layer 3 is gated off. Performance must equal main.

| Scenario | Baseline | With Layer 3 | Δ |
|---|---|---|---|
| 100 files, warm snapshot, no changes | 0.72 s | 0.72 s | within noise |

Tree depth doesn't matter

list-repo is server-side recursive: it returns the entire subtree in one RTT. To verify, I compared list-repo against a hypothetical client-side parallel walk via /api/2.0/workspace/list (non-recursive, 16 workers).

| Tree shape | list-repo (current) | parallel list walk (w=16) |
|---|---|---|
| 200 files, flat | 154 ms | 151 ms |
| 200 files, depth 5, ~243 dirs | 170 ms | 20,133 ms (~100×) |
| 200 files, depth 10, ~1024 dirs | 176 ms | 26,929 ms (~150×) |
| 1000 files, depth 5 | 259 ms | (not run; would be even worse) |

list-repo is essentially flat with respect to depth. Cost vs. file count (separate measurement, 10 trials per N, medians):

| N | median list time |
|---|---|
| ≤ 200 | ~150 ms (RTT-bound; flat) |
| 500 | ~286 ms (server-side step; variance spikes here) |
| 1000–2000 | ~300–340 ms |
| 5000 | ~440 ms |

Bounded between ~150 ms for small bundles and < 500 ms even at N = 5000. Parallel client-side walking pays an RTT per intermediate directory and never recovers.

How the heuristic was chosen

We considered adding an N-files threshold to gate Layer 3. The data argues against it:

  • The list call's cost is bounded: ~150 ms for typical bundles, ~440 ms even at N = 5000 (see table above).
  • Local hashing is ≈ 1 GB/s — negligible for any realistic bundle size.
  • Per-file upload latency floor is ≈ 220 ms; with 8-worker parallelism, the break-even matched-file count is ~6.
  • Empty-remote worst case costs ~200 ms; common-case savings are seconds to tens of seconds.

So the only gate is snapshot.New. An N threshold would protect a rare edge case at the cost of the high-value common case.
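
For reference, the ~6-file break-even falls out of the numbers above as a rough back-of-envelope: with 8 workers the amortized per-file upload cost is ≈ 220 ms / 8 ≈ 27.5 ms, so skipping k matched files saves ≈ 27.5 × k ms, while the filter costs ~150 ms for a typical bundle; 150 / 27.5 ≈ 5.5, i.e. about 6 matched files pay for the list call.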

Tested

End-to-end against an AWS workspace (id 4342141078796725) for every workspace-filesystem object type DABs uses:

| Local file | object_type | Local SHA = remote content_sha256_hex? |
|---|---|---|
| plain file | FILE | yes |
| .py with # Databricks notebook source | NOTEBOOK (PYTHON) | yes |
| .sql with -- Databricks notebook source | NOTEBOOK (SQL) | yes |
| .ipynb | NOTEBOOK (PYTHON) | yes |
| .lvdash.json | DASHBOARD | yes |

The filter correctly handles the notebook extension-stripping in LocalToRemoteNames (so my-nb.py locally maps to my-nb in the list-repo response). Dashboards keep their full .lvdash.json extension.
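
As a tiny illustration of the lookup direction (the file names here are hypothetical; the real mapping comes from LocalToRemoteNames in the sync diff):

```go
// Illustrative only: how the filter keys its lookups against the list-repo map.
func lookupExamples(localToRemoteNames, remoteSHAs map[string]string) (notebookFound, dashboardFound bool) {
	// Notebook source: the extension is stripped remotely, so look up "my-nb".
	_, notebookFound = remoteSHAs[localToRemoteNames["my-nb.py"]]
	// Dashboard: the full .lvdash.json name is preserved remotely.
	_, dashboardFound = remoteSHAs[localToRemoteNames["dash.lvdash.json"]]
	return notebookFound, dashboardFound
}
```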

Notes / follow-ups

  • requires_sync_to_wsfs: true on a notebook or dashboard with a pending UI autosave would currently cause the filter to skip the upload (the workspace's flushed SHA matches local, but the in-flight UI state doesn't). Harmless for re-deploys from local source.
  • list-repo is an internal endpoint; the WHS migration plan does not currently include a recursive list. If list-repo goes away, we'll need the WHS-side replacement (ListTreeNodes with blob_info.sha256).

Test plan

  • Unit tests in libs/sync/remote_filter_test.go covering empty/error/match/mismatch/missing/notebook-rename/mixed-diff cases.
  • End-to-end verification against an AWS workspace (id 4342141078796725) for cold + warm snapshot paths.
  • Object-type matrix against an AWS workspace (id 4342141078796725).
  • Wall-clock benchmarks above.
  • Acceptance test in acceptance/bundle/deploy/ covering the cold-snapshot path. (Existing tests still pass; a dedicated regression test for Layer 3 would be a nice-to-have.)

Sync runs in three layers:

  1. Discovery (libs/git, libs/fileset): walk the local tree.

  2. Snapshot diff (libs/sync/snapshot.go, diff.go): compare against the
     local mtime snapshot to produce an action plan (puts/deletes/etc).

  3. Remote filter (new, libs/sync/remote_filter.go): pre-flight that
     bulk-fetches content SHAs from the workspace and drops puts whose
     remote SHA already matches the local SHA.

Layer 3 only runs when the snapshot is fresh (no prior local state) — the
case where Layer 2 produces false-positive puts at scale, e.g. on a CI
runner that has just cloned the repo. With an existing snapshot, Layer 2
is precise; paying for a bulk remote list would be wasted work.

The remote SHA list uses /api/2.0/workspace/list-repo with the
return_wsfs_metadata=true flag, exposed via a new RemoteFileMetadata type
and ListWithSHAs method on WorkspaceFilesClient. Errors and missing remote
state degrade gracefully: the filter returns the unmodified diff and the
worst case is the existing behavior (re-upload).
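
Roughly, the new surface looks like this (a sketch of the shape only; the JSON keys follow the list-repo response described elsewhere in this PR, while the Go field names and the exact signature are illustrative and may differ from the actual code):

```go
// Sketch of the per-object metadata surfaced by list-repo with
// return_wsfs_metadata=true.
type RemoteFileMetadata struct {
	Path             string `json:"path"`
	ObjectType       string `json:"object_type"`
	ContentSHA256Hex string `json:"content_sha256_hex"`
	HasWsfsMetadata  bool   `json:"has_wsfs_metadata"`
	Size             int64  `json:"size"`
	Language         string `json:"language"`
}

// ListWithSHAs (sketch signature): one recursive, server-side list of the
// subtree under root, returned keyed by workspace path. Errors propagate to
// the caller, which falls back to the unfiltered diff.
//
//	func (c *WorkspaceFilesClient) ListWithSHAs(ctx context.Context, root string) (map[string]RemoteFileMetadata, error)
```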

Verified end-to-end against bundle-dev: a sync that deletes the local
snapshot and re-runs produces zero uploads when contents are unchanged,
and uploads only the edited file when one file changes. Notebooks (.py
with the magic header and .ipynb) preserve raw uploaded bytes server-side,
so their SHAs match local SHAs verbatim — no notebook-specific carve-out
needed.

Snapshot schema is unchanged; this is purely additive.

Co-authored-by: Isaac
@shreyas-goenka changed the title from "sync: add a remote-SHA filter as a third layer of the upload pipeline" to "perf: Fast file sync with content SHA checks" on May 4, 2026
…cceptance

Acceptance tests were failing on the direct engine path because Layer 3
of sync (the remote-SHA filter) calls /api/2.0/workspace/list-repo, but
the testserver had no default handler. The CLI's filter logged a Warn
line ("could not fetch remote content SHAs ... No stub found") which
contaminated recorded output.txt diffs.

Add WorkspaceListRepo to the FakeWorkspace: walks the in-memory
files/directories maps, computes SHA-256 over each file's stored bytes,
and returns objects in the same shape as the real list-repo response
(path, object_type, content_sha256_hex, has_wsfs_metadata, size,
language). On a workspace with no imported files, returns
{"objects": []}, which causes Layer 3 to no-op cleanly.

Co-authored-by: Isaac
Layer 3 of sync issues a GET /api/2.0/workspace/list-repo with
return_wsfs_metadata=true on cold-snapshot deploys (which is what
acceptance tests do). Add the new request to the recorded fixtures for
both engines so the diff comparison passes. The User-Agent on the new
call inherits cmd/bundle_deploy and engine/<x> as expected.

Co-authored-by: Isaac
…tions

Companion to the previous fixture updates. The parent user_agent test
aggregates User-Agent observations from per-engine recorded request
logs and writes a shared output.txt; that needs the same list-repo
entries (one per engine).

Co-authored-by: Isaac
@shreyas-goenka
Contributor Author

The reason to make SHA checks an additive layer on top of local snapshots is that it's an independent mechanism for ruling out files we need to upload, and in the future we can choose to expand the scope of similar checks if the marginal performance gains justify it.

…aths

Behind a benchworkspace build tag so they don't run in CI. Three groups:

  - BenchmarkListRepoByCount: list-repo cost vs N (10/100/500/1000).
  - BenchmarkListRepoByContent: list-repo cost across plain files /
    notebooks (py/sql/ipynb) / dashboards / mixed at fixed N=200.
  - BenchmarkSyncRunOnceColdSnapshot: end-to-end Sync.RunOnce against a
    pre-warmed remote, with and without Layer 3, at N=20/100/500. The
    "without" variant zeroes out s.remoteFilter to bypass the new path.

Each run creates and tears down a unique /Users/$USER/.tmp/sync-bench-X
tree on the configured workspace.

Run instructions are in the file's top comment. Required env:

  DATABRICKS_BENCH_PROFILE=<profile>
  DATABRICKS_BENCH_USER=<email>
  go test -tags benchworkspace -bench=. -benchtime=5x -timeout=60m ./libs/sync/...
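
For orientation, the cold-snapshot group has roughly this shape (a sketch only: setupColdSync is a hypothetical stand-in for the fixture code, the per-iteration snapshot reset is elided, and the RunOnce call shape follows libs/sync as described above rather than being copied from it):

```go
//go:build benchworkspace

package sync

import (
	"context"
	"fmt"
	"testing"
)

func BenchmarkSyncRunOnceColdSnapshot(b *testing.B) {
	for _, n := range []int{20, 100, 500} {
		for _, layer3 := range []bool{true, false} {
			b.Run(fmt.Sprintf("n=%d/layer3=%v", n, layer3), func(b *testing.B) {
				s := setupColdSync(b, n) // hypothetical: local tree + pre-warmed remote
				if !layer3 {
					s.remoteFilter = nil // bypass the new path, per the commit message
				}
				b.ResetTimer()
				for i := 0; i < b.N; i++ {
					// Snapshot reset between iterations is elided in this sketch.
					if err := s.RunOnce(context.Background()); err != nil {
						b.Fatal(err)
					}
				}
			})
		}
	}
}
```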

Co-authored-by: Isaac
Expand the benchworkspace benchmarks so every cell can be combined with a
tree shape: flat (no nesting), small (depth 2, branch 2), medium (depth
4, branch 2), large (depth 6, branch 2). All four list-repo and sync
benchmarks now parameterize over (shape × N).

Add a test-only parallelWalk runner that lists everything under a path
by issuing non-recursive /api/2.0/workspace/list calls and recursing
client-side with a configurable worker pool. Add BenchmarkListWalkers
that compares list-repo vs parallel-walk (workers=8 and workers=32) head
to head across (shape × N). The runner is deliberately not exported and
not wired into Sync — production code uses list-repo. The bench is here
to make the comparison reproducible and to back the claim that
list-repo dominates parallel-walk at any non-flat shape.

BenchmarkListRepoByContent now also varies shape, so content × shape is
fully covered. Run all four groups with:

  go test -tags benchworkspace -bench=. -benchtime=5x -timeout=90m ./libs/sync/...
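
A condensed sketch of the test-only comparator, to show why it loses: one RTT per directory even with a worker pool (listDir stands in for the non-recursive /api/2.0/workspace/list call; error handling is elided):

```go
//go:build benchworkspace

package sync

import "sync"

type objectInfo struct {
	Path       string
	ObjectType string // "DIRECTORY", "FILE", "NOTEBOOK", ...
}

// parallelWalk lists a subtree by issuing one non-recursive list call per
// directory, recursing client-side with a bounded worker pool.
func parallelWalk(listDir func(dir string) ([]objectInfo, error), root string, workers int) []objectInfo {
	var (
		mu      sync.Mutex
		results []objectInfo
		wg      sync.WaitGroup
	)
	sem := make(chan struct{}, workers) // bounds concurrent list calls

	var walk func(dir string)
	walk = func(dir string) {
		defer wg.Done()
		sem <- struct{}{}
		objs, err := listDir(dir) // one RTT per directory: this is what loses
		<-sem
		if err != nil {
			return // the real runner surfaces errors; elided here
		}
		mu.Lock()
		results = append(results, objs...)
		mu.Unlock()
		for _, o := range objs {
			if o.ObjectType == "DIRECTORY" {
				wg.Add(1)
				go walk(o.Path)
			}
		}
	}

	wg.Add(1)
	go walk(root)
	wg.Wait()
	return results
}
```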

Co-authored-by: Isaac