Usage¶
Setup¶
Initialize DVC in your repository (if not already done):
dvc init
git add .dvc
git commit -m "initialize DVC"
Add the Databricks Volume as a DVC remote:
dvc remote add -d myremote \
    dbvol:///Volumes/<catalog>/<schema>/<volume>/<path>
Set your Databricks profile:
export DATABRICKS_CONFIG_PROFILE=<your-profile-name>
Note
DVC remotes do not support arbitrary config keys, so the Databricks profile must be provided via this environment variable — it cannot be stored in
.dvc/config. Add the export to your ~/.zshrc or ~/.bashrc to make it permanent.
Standard DVC workflow¶
Track a data file:
dvc add data/dataset.csv
Push data to the Volume:
dvc push
Commit the pointer to git:
git add data/dataset.csv.dvc .gitignore
git commit -m "track dataset v1 with DVC"
git push
Pull data in another environment:
git clone <your-repo>
pip install dvc-databricks
export DATABRICKS_CONFIG_PROFILE=<your-profile-name>
dvc pull
How it works¶
Your git repo Databricks Volume (S3 / ADLS)
────────────────── ───────────────────────────────────
data/dataset.csv.dvc ──────► /Volumes/catalog/schema/vol/
.dvc/config └── files/md5/
├── ab/cdef1234... ← actual data
└── 9f/123abc... ← actual data
dvc add hashes the file and stores it in the local DVC cache (.dvc/cache).
A .dvc pointer file containing the MD5 hash is created next to your data file.
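A minimal pointer file looks something like this (the hash and size values are illustrative, and the exact set of fields depends on your DVC version):

```yaml
outs:
- md5: 5d41402abc4b2a76b9719d911017c592
  size: 5
  path: dataset.csv
```

This small YAML file is what gets committed to git in place of the data itself.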
dvc push uploads from the local cache to the Volume using the Databricks Files
API (WorkspaceClient.files.upload).
dvc pull downloads from the Volume into the local cache, then restores the file
to its original path.
Only .dvc pointer files are ever committed to git — the data stays on the Volume.
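The content-addressed layout above can be sketched in a few lines of Python. This is a simplified model of how an MD5 hash maps to a files/md5/ cache path and then to a location under the Volume root, not the plugin's actual implementation; the function names are illustrative:

```python
import hashlib
import posixpath


def md5_cache_suffix(data: bytes) -> str:
    """Return the files/md5/<first-two-hex-chars>/<rest> suffix used by the cache layout."""
    digest = hashlib.md5(data).hexdigest()
    return posixpath.join("files", "md5", digest[:2], digest[2:])


def remote_object_path(volume_root: str, data: bytes) -> str:
    """Where an object with this content would land under the Databricks Volume root."""
    return posixpath.join(volume_root, md5_cache_suffix(data))


print(remote_object_path("/Volumes/catalog/schema/vol", b"hello"))
# /Volumes/catalog/schema/vol/files/md5/5d/41402abc4b2a76b9719d911017c592
```

Because the path is derived purely from the content hash, identical files are stored once on the Volume no matter how many pointer files reference them.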