perf: Fast file sync with content SHA checks #5166
Draft
shreyas-goenka wants to merge 6 commits into main
Conversation
Sync runs in three layers:
1. Discovery (libs/git, libs/fileset): walk the local tree.
2. Snapshot diff (libs/sync/snapshot.go, diff.go): compare against the
local mtime snapshot to produce an action plan (puts/deletes/etc).
3. Remote filter (new, libs/sync/remote_filter.go): pre-flight that
bulk-fetches content SHAs from the workspace and drops puts whose
remote SHA already matches the local SHA.
Layer 3 only runs when the snapshot is fresh (no prior local state) — the
case where Layer 2 produces false-positive puts at scale, e.g. on a CI
runner that has just cloned the repo. With an existing snapshot, Layer 2
is precise; paying for a bulk remote list would be wasted work.
The remote SHA list uses /api/2.0/workspace/list-repo with the
return_wsfs_metadata=true flag, exposed via a new RemoteFileMetadata type
and ListWithSHAs method on WorkspaceFilesClient. Errors and missing remote
state degrade gracefully: the filter returns the unmodified diff and the
worst case is the existing behavior (re-upload).
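A minimal sketch of that filtering step, assuming a simplified shape for the diff (a slice of relative put paths) and for ListWithSHAs (remote path to hex SHA-256); the real types in libs/sync/remote_filter.go differ:

```go
// Package remotefilter is a self-contained illustration, not the real libs/sync code.
package remotefilter

import (
	"context"
	"crypto/sha256"
	"encoding/hex"
	"log"
	"os"
	"path/filepath"
)

// remoteLister stands in for WorkspaceFilesClient; ListWithSHAs is assumed
// to return remote relative path -> content SHA-256 (hex).
type remoteLister interface {
	ListWithSHAs(ctx context.Context, root string) (map[string]string, error)
}

// filterPuts returns the puts that still need uploading. On any error it
// returns the input unchanged, so the worst case is the existing behavior
// (re-upload everything the snapshot diff asked for).
func filterPuts(ctx context.Context, client remoteLister, localRoot, remoteRoot string, puts []string) []string {
	remote, err := client.ListWithSHAs(ctx, remoteRoot)
	if err != nil {
		log.Printf("could not fetch remote content SHAs: %v", err)
		return puts
	}
	kept := make([]string, 0, len(puts))
	for _, rel := range puts {
		b, err := os.ReadFile(filepath.Join(localRoot, rel))
		if err != nil {
			kept = append(kept, rel) // cannot hash locally: keep the put
			continue
		}
		sum := sha256.Sum256(b)
		if remote[rel] == hex.EncodeToString(sum[:]) {
			continue // remote content already matches: drop the put
		}
		kept = append(kept, rel)
	}
	return kept
}
```

Because errors fall through to returning the puts unchanged, a failed or stubbed-out list-repo call can only ever cost a log line, never a missed upload.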
Verified end-to-end against bundle-dev: a sync that deletes the local
snapshot and re-runs produces zero uploads when contents are unchanged,
and uploads only the edited file when one file changes. Notebooks (.py
with the magic header and .ipynb) preserve raw uploaded bytes server-side,
so their SHAs match local SHAs verbatim — no notebook-specific carve-out
needed.
Snapshot schema is unchanged; this is purely additive.
Co-authored-by: Isaac
…cceptance
Acceptance tests were failing on the direct engine path because Layer 3
of sync (the remote-SHA filter) calls /api/2.0/workspace/list-repo, but
the testserver had no default handler. The CLI's filter logged a Warn
line ("could not fetch remote content SHAs ... No stub found") which
contaminated recorded output.txt diffs.
Add WorkspaceListRepo to the FakeWorkspace: walks the in-memory
files/directories maps, computes SHA-256 over each file's stored bytes,
and returns objects in the same shape as the real list-repo response
(path, object_type, content_sha256_hex, has_wsfs_metadata, size,
language). On a workspace with no imported files, returns
{"objects": []}, which causes Layer 3 to no-op cleanly.
Co-authored-by: Isaac
Layer 3 of sync issues a GET /api/2.0/workspace/list-repo with return_wsfs_metadata=true on cold-snapshot deploys (which is what acceptance tests do). Add the new request to the recorded fixtures for both engines so the diff comparison passes. The User-Agent on the new call inherits cmd/bundle_deploy and engine/<x> as expected.
Co-authored-by: Isaac
…tions
Companion to the previous fixture updates. The parent user_agent test aggregates User-Agent observations from per-engine recorded request logs and writes a shared output.txt; that needs the same list-repo entries (one per engine).
Co-authored-by: Isaac
Contributor
Author
The reason to make SHA checks an additive layer on top of local snapshots is that it is an independent mechanism for ruling out files we need to upload, and in the future we can choose to expand the scope to similar checks if the marginal performance gains are justified.
…aths
Behind a benchworkspace build tag so they don't run in CI. Three groups:
- BenchmarkListRepoByCount: list-repo cost vs N (10/100/500/1000).
- BenchmarkListRepoByContent: list-repo cost across plain files /
notebooks (py/sql/ipynb) / dashboards / mixed at fixed N=200.
- BenchmarkSyncRunOnceColdSnapshot: end-to-end Sync.RunOnce against a
pre-warmed remote, with and without Layer 3, at N=20/100/500. The
"without" variant zeroes out s.remoteFilter to bypass the new path.
Each run creates and tears down a unique /Users/$USER/.tmp/sync-bench-X
tree on the configured workspace.
Run instructions are in the file's top comment. Required env:
DATABRICKS_BENCH_PROFILE=<profile>
DATABRICKS_BENCH_USER=<email>
go test -tags benchworkspace -bench=. -benchtime=5x -timeout=60m ./libs/sync/...
Co-authored-by: Isaac
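For orientation, a rough skeleton of how a tag-gated, count-parameterized group like BenchmarkListRepoByCount is laid out; the setup helper and the elided API calls are placeholders, not the real bench code:

```go
//go:build benchworkspace

package sync

import (
	"fmt"
	"testing"
)

// setupBenchTree is a placeholder for the real helper that creates N files
// under a fresh /Users/$USER/.tmp/sync-bench-X tree and returns a cleanup func.
func setupBenchTree(b *testing.B, n int) (root string, cleanup func()) {
	b.Helper()
	// ... create the remote tree on the configured workspace ...
	return "/Users/user@example.com/.tmp/sync-bench-0", func() {}
}

func BenchmarkListRepoByCount(b *testing.B) {
	for _, n := range []int{10, 100, 500, 1000} {
		b.Run(fmt.Sprintf("n=%d", n), func(b *testing.B) {
			root, cleanup := setupBenchTree(b, n)
			defer cleanup()
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				_ = root // ... one bulk list-repo call over root per iteration ...
			}
		})
	}
}
```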
Expand the benchworkspace benchmarks so every cell can be combined with a tree shape: flat (no nesting), small (depth 2, branch 2), medium (depth 4, branch 2), large (depth 6, branch 2). All four list-repo and sync benchmarks now parameterize over (shape × N).

Add a test-only parallelWalk runner that lists everything under a path by issuing non-recursive /api/2.0/workspace/list calls and recursing client-side with a configurable worker pool. Add BenchmarkListWalkers that compares list-repo vs parallel-walk (workers=8 and workers=32) head to head across (shape × N). The runner is deliberately not exported and not wired into Sync — production code uses list-repo. The bench is here to make the comparison reproducible and to back the claim that list-repo dominates parallel-walk at any non-flat shape.

BenchmarkListRepoByContent now also varies shape, so content × shape is fully covered. Run all four groups with:

go test -tags benchworkspace -bench=. -benchtime=5x -timeout=90m ./libs/sync/...

Co-authored-by: Isaac
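A sketch of what such a client-side walker looks like; the real runner is test-only and unexported in libs/sync, and the lister interface and objectInfo shape here are assumptions used to show the recursion-with-worker-pool idea:

```go
// Package walk is a self-contained illustration of the client-side walker the
// benchmark compares against; it is not the CLI's implementation.
package walk

import (
	"context"
	"sync"
)

// objectInfo is an assumed shape for one entry returned by a non-recursive
// /api/2.0/workspace/list call.
type objectInfo struct {
	Path       string
	ObjectType string // "DIRECTORY", "FILE", "NOTEBOOK", ...
}

// lister abstracts the non-recursive list call.
type lister interface {
	List(ctx context.Context, path string) ([]objectInfo, error)
}

// parallelWalk lists everything under root by recursing client-side, with at
// most `workers` list calls in flight at a time.
func parallelWalk(ctx context.Context, l lister, root string, workers int) ([]objectInfo, error) {
	var (
		mu       sync.Mutex
		out      []objectInfo
		firstErr error
		wg       sync.WaitGroup
		sem      = make(chan struct{}, workers) // bounds concurrent API calls
	)

	var visit func(path string)
	visit = func(path string) {
		defer wg.Done()
		sem <- struct{}{}
		entries, err := l.List(ctx, path)
		<-sem // release the slot before recursing so children are not starved
		mu.Lock()
		if err != nil && firstErr == nil {
			firstErr = err
		}
		out = append(out, entries...)
		mu.Unlock()
		if err != nil {
			return
		}
		for _, e := range entries {
			if e.ObjectType == "DIRECTORY" {
				wg.Add(1)
				go visit(e.Path) // one extra round trip per intermediate directory
			}
		}
	}

	wg.Add(1)
	go visit(root)
	wg.Wait()
	return out, firstErr
}
```

Each level of the tree forces at least one additional round of list calls, which is why list-repo's single server-side recursive call wins at any non-flat shape.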
What
Sync now runs in three layers:
1. Discovery (libs/git, libs/fileset) — walks the local tree.
2. Snapshot diff (libs/sync/snapshot.go, diff.go) — compares against the local mtime snapshot to produce an action plan (puts/deletes/mkdirs/rmdirs). Unchanged from main.
3. Remote filter (new, libs/sync/remote_filter.go) — pre-flight that bulk-fetches content SHAs from the workspace via /api/2.0/workspace/list-repo?return_wsfs_metadata=true and drops puts whose remote SHA already matches the local SHA.

Layer 3 only runs when the snapshot is fresh (no prior local state). With an existing snapshot, Layer 2 is precise and up to date in normal usage for DABs; paying for a bulk remote list would be wasted work.
The change is purely additive: snapshot schema is unchanged (still v1), and Layer 3 can only ever remove puts, never add. Errors and missing remote state degrade gracefully — worst case is the existing behavior (re-upload).
Why
CI runners and any other context that wipes .databricks/ between deploys today re-upload every file in the bundle, even when the workspace already has them with identical contents. Layer 3 makes the workspace's actual state authoritative, so a fresh runner skips uploads of already-current files.

Benchmarks
Head-to-head against an AWS workspace (id 4342141078796725). Each scenario: 3 trials, medians shown. Baseline = origin/main, PR 2 = shreyas-goenka/sha-snapshots. Both binaries use import-file for upload.

Cold deploy of an unchanged bundle (CI re-deploy scenario)
This is the headline use case: a CI runner clones the repo and deploys, but nothing has actually changed since the previous deploy. Today, every file gets re-uploaded.
Layer 3's wall-clock cost stays nearly flat (≈0.9–1.3 s for 20 → 500 files) because it's dominated by one bulk list-repo call (~150–500 ms RTT) plus negligible local SHA-256 hashing (~1 GB/s).
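For scale: even 500 files of ~50 KB each (the size used in the changed-file scenario below) is only ~25 MB of local content, which hashes in roughly 25 ms at ~1 GB/s, so the single list-repo round trip dominates the layer's cost.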
Cold deploy with one large file changed
Bundle = 100 small files (50 KB each, all match remote) + one 10 MB wheel that's been re-built locally.
The 100 unchanged files are skipped via SHA match; only the wheel is uploaded.
Worst case: cold deploy to an empty workspace
Layer 3 has nothing to skip — its only effect is the wasted list-repo call.
Warm snapshot path is unchanged
When a local snapshot exists, Layer 3 is gated off. Performance must equal main.
Tree depth doesn't matter
list-repo is server-side recursive: it returns the entire subtree in one RTT. To verify, I compared list-repo against a hypothetical client-side parallel walk via /api/2.0/workspace/list (non-recursive, 16 workers).

(Depth comparison table: list-repo (current) vs. list walk (w=16).)

list-repo is essentially flat with respect to depth. Cost vs. file count (separate measurement, 10 trials per N, medians): bounded between ~150 ms for small bundles and < 500 ms even at N = 5000. Parallel client-side walking pays an RTT per intermediate directory and never recovers.
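For intuition: a depth-6, branch-2 tree (the "large" benchmark shape) contains dozens of directories, and even a fully parallel walk still needs at least six sequential rounds of list calls, one per level, versus a single round trip for list-repo.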
How the heuristic was chosen
We considered adding an N-files threshold to gate Layer 3; the data above argues against it. So the only gate is snapshot.New. An N threshold would protect a rare edge case at the cost of the high-value common case.

Tested
End-to-end against an AWS workspace (id 4342141078796725) for every workspace-filesystem object type DABs uses (local file → remote object type, each checked for content_sha256_hex):

- plain file → FILE
- .py with # Databricks notebook source → NOTEBOOK (PYTHON)
- .sql with -- Databricks notebook source → NOTEBOOK (SQL)
- .ipynb → NOTEBOOK (PYTHON)
- .lvdash.json → DASHBOARD

The filter correctly handles the notebook extension-stripping in LocalToRemoteNames (so my-nb.py locally maps to my-nb in the list-repo response). Dashboards keep their full .lvdash.json extension.
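A hedged sketch of that lookup-key logic, with a stand-in for what LocalToRemoteNames does; the real function's signature and extension handling live in libs/sync and may differ:

```go
package remotefilter

import "strings"

// remoteKey maps a local relative path to the name it appears under in the
// list-repo response. Notebooks are imported with their source extension
// stripped; plain files and .lvdash.json dashboards keep their full name.
// This is a simplified stand-in for LocalToRemoteNames, not the real function.
func remoteKey(localPath string, isNotebook bool) string {
	if !isNotebook {
		return localPath
	}
	for _, ext := range []string{".py", ".sql", ".ipynb"} {
		if strings.HasSuffix(localPath, ext) {
			return strings.TrimSuffix(localPath, ext)
		}
	}
	return localPath
}
```

So a notebook uploaded from my-nb.py is looked up in the list-repo results under my-nb, while dashboard.lvdash.json is looked up verbatim.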
Notes / follow-ups

- requires_sync_to_wsfs: true on a notebook or dashboard with a pending UI autosave would currently cause the filter to skip the upload (the workspace's flushed SHA matches local but the in-flight UI state doesn't). Harmless for re-deploys from local source.
- list-repo is an internal endpoint; the WHS migration plan does not currently include a recursive list. If list-repo goes away, we'll need the WHS-side replacement (ListTreeNodes with blob_info.sha256).

Test plan
- Unit tests in libs/sync/remote_filter_test.go covering empty/error/match/mismatch/missing/notebook-rename/mixed-diff cases.
- End-to-end runs against an AWS workspace (id 4342141078796725) for cold + warm snapshot paths.
- Benchmarks against an AWS workspace (id 4342141078796725).
- Acceptance tests in acceptance/bundle/deploy/ covering the cold-snapshot path. (Existing tests still pass; a dedicated regression test for Layer 3 would be a nice-to-have.)
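A hedged sketch of the table-driven shape such unit tests typically take, using the case names from the list above; keepPuts here is a simplified stand-in, not the API of the real remote_filter_test.go:

```go
package remotefilter

import (
	"reflect"
	"testing"
)

// keepPuts is a stand-in for the real filter: keep a put unless the remote
// SHA recorded for that path equals the local SHA.
func keepPuts(puts []string, local, remote map[string]string) []string {
	var kept []string
	for _, p := range puts {
		if sha, ok := remote[p]; ok && sha == local[p] {
			continue
		}
		kept = append(kept, p)
	}
	return kept
}

func TestRemoteFilterCases(t *testing.T) {
	tests := []struct {
		name   string
		puts   []string
		local  map[string]string
		remote map[string]string
		want   []string
	}{
		{"empty remote list keeps all puts", []string{"a.py"}, map[string]string{"a.py": "111"}, nil, []string{"a.py"}},
		{"matching SHA drops the put", []string{"a.py"}, map[string]string{"a.py": "111"}, map[string]string{"a.py": "111"}, nil},
		{"mismatching SHA keeps the put", []string{"a.py"}, map[string]string{"a.py": "222"}, map[string]string{"a.py": "111"}, []string{"a.py"}},
		{"missing remote entry keeps the put", []string{"b.py"}, map[string]string{"b.py": "333"}, map[string]string{"a.py": "111"}, []string{"b.py"}},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			if got := keepPuts(tc.puts, tc.local, tc.remote); !reflect.DeepEqual(got, tc.want) {
				t.Errorf("got %v, want %v", got, tc.want)
			}
		})
	}
}
```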