Usage
=====

Setup
-----

1. Initialize DVC in your repository (if not already done):

   .. code-block:: bash

      dvc init
      git add .dvc
      git commit -m "initialize DVC"

2. Add the Databricks Volume as a DVC remote:

   .. code-block:: bash

      dvc remote add -d myremote \
          dbvol:///Volumes/<catalog>/<schema>/<volume>/<path>

3. Set your Databricks profile:

   .. code-block:: bash

      export DATABRICKS_CONFIG_PROFILE=<profile-name>

   .. note::

      DVC remotes do not support arbitrary config keys, so the Databricks
      profile must be provided via this environment variable; it cannot be
      stored in ``.dvc/config``. Add the export to your ``~/.zshrc`` or
      ``~/.bashrc`` to make it permanent.

Standard DVC workflow
---------------------

Track a data file:

.. code-block:: bash

   dvc add data/dataset.csv

Push data to the Volume:

.. code-block:: bash

   dvc push

Commit the pointer to git:

.. code-block:: bash

   git add data/dataset.csv.dvc .gitignore
   git commit -m "track dataset v1 with DVC"
   git push

Pull data in another environment:

.. code-block:: bash

   git clone <repo-url>
   pip install dvc-databricks
   export DATABRICKS_CONFIG_PROFILE=<profile-name>
   dvc pull

How it works
------------

.. code-block:: text

   Your git repo                   Databricks Volume (S3 / ADLS)
   ──────────────────              ───────────────────────────────────
   data/dataset.csv.dvc  ──────►   /Volumes/catalog/schema/vol/
   .dvc/config                     └── files/md5/
                                       ├── ab/cdef1234...  ← actual data
                                       └── 9f/123abc...    ← actual data

``dvc add`` hashes the file and stores it in the local DVC cache
(``.dvc/cache``). A ``.dvc`` pointer file containing the MD5 hash is created
next to your data file. ``dvc push`` uploads from the local cache to the
Volume using the Databricks Files API (``WorkspaceClient.files.upload``).
``dvc pull`` downloads from the Volume into the local cache, then restores
the file to its original path. Only ``.dvc`` pointer files are ever committed
to git; the data stays on the Volume.
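The content-addressed layout shown above can be sketched in a few lines of
Python: the object's path under ``files/md5/`` is derived from the file's MD5
digest, with the first two hex characters used as a directory prefix. This is
an illustration of the addressing scheme only (the ``dvc_cache_path`` helper
is hypothetical, not part of DVC or this plugin):

```python
import hashlib


def dvc_cache_path(data: bytes) -> str:
    """Mimic DVC's content-addressed layout: files/md5/<first 2 hex>/<rest>.

    Sketch only; real DVC also handles directory objects (.dir files)
    and streams large files instead of hashing them in memory.
    """
    digest = hashlib.md5(data).hexdigest()
    return f"files/md5/{digest[:2]}/{digest[2:]}"


# The same content always maps to the same path, which is why
# `dvc push` can skip objects that already exist on the Volume.
print(dvc_cache_path(b"hello"))  # files/md5/5d/41402abc4b2a76b9719d911017c592
```

Because the path is a pure function of the content, pushing the same dataset
twice uploads nothing new, and renaming the local file changes only the
``.dvc`` pointer, not the stored object.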