proxmox-pure-snap-restore

Web app that restores Proxmox VE virtual machines from Pure Storage FlashArray snapshots. Disks are cloned via lvm-xcopy so the bytes never traverse the host: SCSI EXTENDED COPY on iSCSI/FC, and NVMe Copy (cross-namespace, Format 2h / TP4130) on NVMe-TCP. Falls back to a host-side qemu-img convert only when an NVMe controller does not support cross-namespace Copy, the restore is cross-array, or the operator explicitly requests it.

Features

  • Multi-cluster: connect one or more Proxmox clusters (API token or password auth) and one or more Pure FlashArrays (API token).
  • Automatic 1:1 mapping of Pure volumes ↔ Proxmox LVM storages via the SCSI serial / NVMe WWN of the PV backing each VG.
  • Per-node Pure host auto-match: if no host group is configured for a Proxmox connection, the backend reads each node's IQN / NQN / WWN and matches it against hosts defined on the array.
  • Tree view of VMs → on-array snapshots (ad-hoc and protection-group).
  • On-demand snapshot creation on the array.
  • Two restore modes:
    • Overwrite the source VM's disks in place (VM is stopped first).
    • Create a new VM with the source's exact configuration, fresh LVs, and optionally preserved MAC addresses.
  • Boot option: bring the restored VM up with all NICs link_down=1 so it cannot reach the network until the operator explicitly re-enables them (see the qm sketch after this list).
  • Background job runner with live log tail per restore (polled via REST).
  • Background inventory refresh: a detached asyncio task walks every configured Proxmox cluster on a fixed cadence (default 600s, configurable via APP_INVENTORY_REFRESH_SECONDS, set to 0 to disable) so deleted VMs and new Pure snapshots stay reflected in the local DB without an operator clicking Refresh.
  • Deleted-VM surfacing: when a VM is destroyed in Proxmox the app keeps its last-seen config in remembered_vms keyed by (vmid, vm_create_time), so multiple incarnations of the same VMID coexist. As long as a Pure snapshot of the deleted incarnation's disks survives, the inventory tree still lists it (with a deleted pill) and new_vm restores remain available — pinned to the exact remembered row by id.
  • Predates-disk safety: the inventory API records the first time it observed each VM-disk tuple (anchored to the latest Proxmox qmcreate task so VMID reuse is handled correctly) and refuses restores — HTTP 409 with a descriptive message — when the chosen snapshot was taken before the current disk existed. The UI disables the Restore button and shows a red "predates disk" pill for those snapshots. For deleted VMs the check anchors on the remembered incarnation's own create/last-seen time instead of the shared sighting row, so a VMID-reusing successor can't poison the deleted VM's view.
  • Force host-copy override: an opt-in checkbox on the restore dialog bypasses lvm-xcopy / NVMe Copy and forces a host-side qemu-img convert even on intra-array restores. Useful as a manual escape hatch when a Pure firmware revision rejects cross-namespace NVMe Copy (status 0x4002, Invalid Field in Command); the host path always works at the cost of read+write bandwidth across the SAN.
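
The NICs-down boot option amounts to restoring every net device with link_down=1 in the new VM's config. A minimal illustration using Proxmox's own CLI (VMID, MAC, and bridge are placeholders; the app itself does this through the Proxmox API rather than qm):

qm set 9001 --net0 "virtio=DE:AD:BE:EF:00:01,bridge=vmbr0,link_down=1"   # NIC present, link held down
qm set 9001 --net0 "virtio=DE:AD:BE:EF:00:01,bridge=vmbr0"               # operator re-enables the link later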

Architecture

flowchart LR
    subgraph Browser["User Browser"]
        UI["React + Vite SPA<br/>(Tailwind, React Query)"]
    end

    subgraph Host["Docker host"]
        subgraph FE["pps-frontend container"]
            NGX["nginx :443 (TLS)<br/>:80 → 301 → :443"]
        end
        subgraph BE["pps-backend container"]
            API["FastAPI (uvicorn) :8000<br/>routers: auth, connections,<br/>inventory, restore, security"]
            DB[("SQLite /data/app.db<br/>users, connections,<br/>vm_disk_sightings,<br/>remembered_vms,<br/>restore_jobs")]
            Jobs["Job runner<br/>(asyncio.create_task)<br/>+ log flusher"]
            Refresh["Inventory refresh<br/>(periodic asyncio task,<br/>default every 600s)"]
        end
    end

    subgraph Proxmox["Proxmox cluster"]
        PAPI["Proxmox API :8006<br/>(proxmoxer)"]
        NODE["Proxmox node(s)<br/>rescan-scsi-bus, multipath,<br/>vgimportclone, lvm-xcopy, dd"]
    end

    subgraph Pure["Pure FlashArray"]
        PREST["FlashArray REST<br/>(py-pure-client)"]
        PVOL[("Volumes +<br/>Snapshots")]
    end

    UI -- "HTTPS /api/*" --> NGX
    NGX -- "proxy_pass /api/" --> API
    API --- DB
    API --- Jobs
    Refresh --- DB
    Refresh -- "GET /tree path:<br/>upsert remembered_vms +<br/>vm_disk_sightings" --> PAPI
    Refresh -- "list snapshots" --> PREST
    Jobs -- "proxmoxer: VM list,<br/>config, tasks, clone" --> PAPI
    Jobs -- "py-pure-client:<br/>copy snap → temp vol,<br/>connect host/HG, list snaps" --> PREST
    Jobs -- "asyncssh: pvscan,<br/>vgimportclone, lvm-xcopy / dd,<br/>cleanup" --> NODE
    PREST --- PVOL
    PAPI -.- NODE
    PVOL -- "iSCSI / FC: XCOPY offload<br/>NVMe-TCP: NVMe Copy F2h<br/>(host-side qemu-img fallback)" --> NODE

Components

| Piece | Where | Purpose |
| --- | --- | --- |
| frontend/ | React 18 + Vite + Tailwind, served by nginx in pps-frontend | SPA; nginx also reverse-proxies /api/* to the backend container |
| backend/app/api/ | FastAPI routers: auth, connections, inventory, restore, security | Stateless HTTP surface |
| backend/app/services/ | proxmox (proxmoxer), pure (py-pure-client), ssh (asyncssh), mapping, restore, inventory_refresh, context, crypto, security, tls | Integrations, restore orchestration, periodic inventory refresh |
| backend/app/models/ | SQLAlchemy 2.x async | User, ProxmoxConnection, PureConnection, ProxmoxPureLink, SshCredential, VmDiskSighting, RememberedVm, RestoreJob |
| SQLite | ./data/app.db (bind-mounted into the backend container) | Persists users, encrypted connection secrets, disk sightings, remembered VM configs, job history + logs |

Restore flow (detailed)

sequenceDiagram
    autonumber
    participant U as Operator UI
    participant API as FastAPI /api/restore
    participant DB as SQLite
    participant J as Job runner asyncio
    participant PX as Proxmox API
    participant SSH as Proxmox node ssh
    participant PU as Pure FlashArray
    U->>API: POST /api/restore kind vmid snapshot
    API->>DB: load sightings + storage mappings
    API->>PU: list snapshots and validate created_at
    API-->>U: 409 if snapshot predates any disk
    API->>DB: insert RestoreJob pending
    API-->>U: 201 returning job_id
    API->>J: asyncio.create_task _run_job
    J->>PU: copy snapshot to temp volume pxrestore-XXXX
    J->>PU: connect temp vol to host group or matched per-node host
    J->>SSH: rescan-scsi-bus and multipath -r
    J->>SSH: find PV by WWN under /dev/mapper or /dev/nvme
    J->>SSH: vgimportclone to pxrestore_XXXX VG
    J->>SSH: vgchange -ay then list LVs as JSON
    alt options.force_host_copy OR cross-array
        J->>SSH: lvcreate then qemu-img convert with O_DIRECT
    else native offload
        J->>SSH: lvm-xcopy src to dst array offload
        Note over SSH: SCSI/FC EXTENDED COPY<br/>NVMe-TCP NVMe Copy F2h TP4130
        opt NVMe controller lacks cross-namespace Copy
            J->>SSH: fallback lvcreate then qemu-img convert
        end
    end
    opt new_vm flow
        J->>PX: allocate new VMID and disks replicate config
    end
    opt overwrite flow
        J->>PX: stop source VM
    end
    J->>SSH: vgchange -an pxrestore_XXXX then vgremove -f
    J->>PU: disconnect and delete temp volume
    J->>SSH: rescan and drop stale LUN
    J->>DB: RestoreJob.status success or failed plus error
  1. Stage. Pure API post_volumes(..., source=<snapshot>) creates a metadata-only clone pxrestore-<tag>. The temp volume is then connected to either the configured host group or the target node's matched Pure host.
  2. Attach. SSH into the target node: rescan-scsi-bus.sh -r, iscsiadm -m session --rescan (iSCSI), multipath -r, pvscan --cache --activate ay. Locate the new PV by WWN (3624a9370<serial> for SCSI, nvme* by-id for NVMe-TCP).
  3. Import. vgimportclone --basevgname pxrestore_<tag> <device> avoids a UUID collision with the source VG, then vgchange -ay.
  4. Copy. Per VM disk LV: lvm-xcopy /dev/pxrestore_<tag>/<lv> /dev/<target_vg>/<lv>. The driver picks the right offload primitive automatically (a condensed shell sketch follows this list):
    • SCSI/FC PV → SCSI EXTENDED COPY (LID1) via SG_IO.
    • NVMe-TCP PV → NVMe Copy with Source Descriptor Format 2h (TP4130 cross-namespace); CDFE is enabled on the controller automatically.
    • If the NVMe controller / firmware rejects Format 2h (status 0x4002), or the destination LV lives on a different Pure array (cross-array restore), or the operator ticked Force host copy in the restore dialog, the runner uses host-side lvcreate -L <bytes>B plus qemu-img convert -n -t none -T none -W -m 8 -f raw -O raw. Parallel coroutines + O_DIRECT measured ~3x the throughput of dd bs=8M on Pure NVMe-TCP because qemu-img keeps multiple I/Os in flight while dd serializes one block at a time.
  5. Finalize. Overwrite mode stops the source VM before the copy; new-VM mode allocates fresh LVs and replicates the source config (optionally preserving MACs, and optionally starting with link_down=1).
  6. Cleanup. vgchange -an, vgremove -f the temp VG; Pure delete_connections + delete_volumes the temp volume; one more rescan to evict the stale LUN.
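
A condensed shell sketch of steps 2–4 on the target node (the pxrestore tag, device path, VG and LV names are illustrative; the runner drives the same commands over asyncssh with error handling and log capture):

# Attach: surface the temp LUN and refresh LVM's view
rescan-scsi-bus.sh -r && multipath -r
pvscan --cache --activate ay
# Import: clone the VG under a collision-free name and activate it
vgimportclone --basevgname pxrestore_ab12 /dev/mapper/3624a9370<serial>
vgchange -ay pxrestore_ab12
# Copy, offloaded per VM disk LV (XCOPY on SCSI/FC, NVMe Copy F2h on NVMe-TCP)
lvm-xcopy /dev/pxrestore_ab12/vm-120-disk-0 /dev/<target_vg>/vm-120-disk-0
# Host-side fallback (cross-array, Force host copy, or no cross-namespace NVMe Copy)
lvcreate -L <bytes>B -n vm-120-disk-0 <target_vg>
qemu-img convert -n -t none -T none -W -m 8 -f raw -O raw \
    /dev/pxrestore_ab12/vm-120-disk-0 /dev/<target_vg>/vm-120-disk-0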

Inventory model & predates-disk safety

VMID reuse on Proxmox (destroy VM 120, later recreate VM 120) can make an old snapshot "look" compatible by VMID alone, but that snapshot was taken against a totally different volume. The inventory layer is built around a small identity model that catches this and also keeps deleted VMs restorable:

  • Every call to GET /api/inventory/{proxmox_id}/tree walks Proxmox task history for each node, picks the latest successful qmcreate per VMID, and uses that timestamp as the canonical vm_create_time.
  • vm_disk_sightings upserts one row per (proxmox_connection_id, vmid, storage, volume) with first_seen_at = <qmcreate start_time> (or "now" for cold-start when task history is unavailable, in which case predates checks are suppressed for that disk on the current response). Later refreshes realign first_seen_at if a newer creation event appears.
  • remembered_vms snapshots the live VM config and is keyed by (proxmox_connection_id, vmid, vm_create_time), so multiple incarnations of the same VMID coexist as separate rows. Each restore from a deleted VM is pinned to a specific row id (remembered_vm_id on the restore request) so the right incarnation's config is replayed.
  • A snapshot is flagged predates_disk=true when its Pure created time is earlier than first_seen_at for any current disk on a live VM, or earlier than vm_create_time (or last_seen_at as a fallback) for a deleted VM. Anchoring deleted VMs on their own remembered timestamps avoids leaking a VMID-reusing successor's disk birthdate into the deleted view.
  • POST /api/restore re-runs the same check server-side and returns HTTP 409 with a message like Snapshot '…test' was taken at 2026-04-23T13:01:35Z but disk 'vm-120-disk-0' on VM 120 was first observed at 2026-04-23T14:16:45Z; the snapshot predates this disk and cannot contain its LV. Refusing restore.
  • The UI disables the Restore button and renders a red "predates disk" pill with a tooltip explaining why.
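
From an API client's point of view the refusal looks like this (host, token, and the exact JSON field names are assumptions; the documented parameters are kind, vmid, and snapshot):

# Illustrative: a restore whose snapshot predates the disk is rejected server-side
curl -sk -X POST https://pps.example.com:8443/api/restore \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"kind": "overwrite", "vmid": 120, "snapshot": "<snapshot-name>"}'
# -> 409 Conflict with the "snapshot predates this disk ... Refusing restore." message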

Background inventory refresh

Two concerns drive the periodic refresh:

  1. The remembered_vms / vm_disk_sightings upsert only happens inside GET /api/inventory/{proxmox_id}/tree. If a VM is destroyed in Proxmox while no operator is in the UI, the next inventory load finds the VM already gone and has no remembered config to restore from.
  2. New Pure snapshots taken between operator visits should still surface in the tree without a manual refresh.

The backend therefore spawns a detached asyncio task on startup (app.services.inventory_refresh.start_periodic_refresh) that walks every ProxmoxConnection and calls get_tree on a fixed cadence. Each connection runs in its own session so a transient failure on one cluster does not block the others. Cadence is controlled by APP_INVENTORY_REFRESH_SECONDS (default 600); set it to 0 to disable the loop entirely.
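
For example, to refresh twice as often as the default (or disable the loop entirely), set the variable in the .env file read by docker compose and restart the backend:

# .env excerpt (illustrative)
APP_INVENTORY_REFRESH_SECONDS=300   # walk every cluster every 5 minutes
# APP_INVENTORY_REFRESH_SECONDS=0   # disable the background loop entirely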

Prerequisites

On each Proxmox node the app will SSH into:

  • sg3-utils (for rescan-scsi-bus.sh)

  • multipath-tools if you use multipath

  • lvm2 (present by default)

  • nvme-cli if any target storage is NVMe-TCP (used for initiator identification and rescans)

  • git, build-essential — only needed once, to build lvm-xcopy; the backend installs lvm-xcopy into /usr/local/lib/lvm-xcopy with a /usr/local/bin/lvm-xcopy launcher on first use. You can pre-install manually:

    git clone --depth 1 https://github.com/PureStorage-OpenConnect/lvm-xcopy
    cd lvm-xcopy && make && sudo install -m 0755 lvm-xcopy /usr/local/bin/lvm-xcopy

On the Pure array, one of:

  • Host group (recommended for clusters): create a host group containing every Proxmox node as a Pure host, and set its name as pure_host_group on the Proxmox connection. The temp volume is connected to the host group during a restore.
  • Per-node hosts: leave pure_host_group blank. The backend reads each node's IQN (/etc/iscsi/initiatorname.iscsi), NQN (/etc/nvme/hostnqn), and FC WWNs (/sys/class/fc_host/.../port_name), and matches them to hosts already defined on the array. The temp volume is connected only to the matched host for the target node.
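
The identifiers involved in per-node matching can be checked by hand on a node; these are the same files and sysfs paths the backend reads:

cat /etc/iscsi/initiatorname.iscsi        # iSCSI IQN
cat /etc/nvme/hostnqn                     # NVMe host NQN
cat /sys/class/fc_host/host*/port_name    # FC WWPNs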

Plus, on the array:

  • An API token with privileges to create/destroy volumes and connections and to read snapshots.

On Proxmox:

  • An API token (recommended) or a user/password with permissions on VM.Config.*, VM.PowerMgmt, Datastore.Allocate, Datastore.AllocateSpace, Datastore.Audit, Sys.Audit, VM.Audit.
  • SSH access to every node, either as root or as a user in the sudo group that can run lvm, vgimportclone, rescan-scsi-bus.sh, multipath, iscsiadm, nvme, dd, and lvm-xcopy without a password prompt (see the sudoers sketch below).
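
A minimal sudoers sketch for the non-root case (user name and binary paths are illustrative and vary by distribution; verify each with command -v before copying):

# /etc/sudoers.d/pps-restore  (illustrative)
pps ALL=(root) NOPASSWD: /usr/sbin/lvm, /usr/sbin/vgimportclone, \
    /usr/bin/rescan-scsi-bus.sh, /usr/sbin/multipath, /usr/sbin/iscsiadm, \
    /usr/sbin/nvme, /usr/bin/dd, /usr/local/bin/lvm-xcopy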

Deployment

The app runs entirely in containers and reaches Proxmox over the API (:8006) and SSH (:22). Deploy it on a separate Linux machine that can run Docker and has network access to both your Proxmox cluster(s) and your Pure FlashArray management interfaces.

Two supported flows. The first is the recommended one for production; the second is for local development or air-gapped sites.

Recommended: pull pre-built images from GHCR

Images are published to GitHub Container Registry by the Release workflow on every push to main and on every vX.Y.Z git tag. The container host needs only Docker — no source tree, no build toolchain.

One-shot install on a fresh Linux container host (run as root):

curl -fsSL https://raw.githubusercontent.com/PureStorage-OpenConnect/proxmox-pure-snap-restore/main/deploy/install.sh \
    | PPS_OWNER=purestorage-openconnect PPS_IMAGE_TAG=v0.2.1 bash

The script:

  1. Installs Docker via the upstream get.docker.com script (or deploy/install_docker.sh if you cloned the repo first).
  2. Lays out /opt/proxmox-pure-snap-restore/{data,docker-compose.yml,.env}.
  3. Generates APP_SECRET_KEY + APP_ENCRYPTION_KEY and writes the image coordinates (PPS_BACKEND_IMAGE, PPS_FRONTEND_IMAGE, PPS_IMAGE_TAG) into .env.
  4. docker compose pull && up -d.

After install:

cd /opt/proxmox-pure-snap-restore
deploy/upgrade.sh v0.2.2     # pin a new tag and pull
deploy/upgrade.sh v0.2.1     # roll back the same way
docker compose logs -f       # tail

Pinning a specific git tag in .env (PPS_IMAGE_TAG=v0.2.1) is the intended production posture; latest is fine for staging.

From source (development / air-gapped)

cp .env.example .env
# Generate keys:
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
# Paste into APP_ENCRYPTION_KEY. Also set APP_SECRET_KEY and APP_ADMIN_PASSWORD.

docker compose up -d --build
# UI: https://<host>:8443
# API: http://<host>:8000/api  (direct, internal use only)

For an existing remote host that already has the repo, make deploy-remote REMOTE_HOST=root@host rsyncs a tarball, runs deploy/remote_deploy.sh, and rebuilds locally on the host. Useful when the host can't reach ghcr.io.

TLS, admin seeding, persistent state

The first boot generates a self-signed TLS cert into ./data/certs/ (cert.pem + key.pem). Replace it from the Settings → Security tab in the UI: upload your own PEM cert and key, or click Generate & install to mint a fresh self-signed cert. The frontend container receives a SIGHUP and reloads nginx without dropping HTTP traffic.

The SQLite DB lives under ./data/app.db (bind-mounted into the backend container). The initial admin user is seeded from APP_ADMIN_USERNAME / APP_ADMIN_PASSWORD on first boot only. Leaving APP_ADMIN_PASSWORD blank seeds the admin with no password; the first sign-in is forced through a "set new password" page before any other part of the UI is reachable.

Configuration (environment variables, prefix APP_)

| Var | Default | Purpose |
| --- | --- | --- |
| APP_SECRET_KEY | dev-insecure-change-me | JWT signing key (HS256) |
| APP_ENCRYPTION_KEY | (none) | Fernet key that encrypts secrets at rest |
| APP_DB_URL | sqlite+aiosqlite:////data/app.db | Async SQLAlchemy URL |
| APP_CORS_ORIGINS | http://localhost:5173 | Comma-separated allowlist |
| APP_ADMIN_USERNAME | admin | Seeded on first boot |
| APP_ADMIN_PASSWORD | (blank) | Seeded on first boot; blank forces change-on-first-login |
| APP_JWT_EXPIRES_MINUTES | 480 | Session length |
| APP_LOG_LEVEL | INFO | Uvicorn/app log level |
| APP_INVENTORY_REFRESH_SECONDS | 600 | Cadence of the background inventory refresh task; 0 disables it |
| APP_TLS_CERT_DIR | /data/certs | Where the backend writes/reads cert.pem + key.pem for the frontend |
| APP_FRONTEND_CONTAINER | pps-frontend | Container name SIGHUP'd after a TLS cert update |
| APP_DOCKER_SOCKET | /var/run/docker.sock | Path to docker socket used to signal the frontend |

The image-coordinate variables (read by docker-compose.prod.yml) live in the same .env file:

| Var | Default | Purpose |
| --- | --- | --- |
| PPS_BACKEND_IMAGE | ghcr.io/purestorage-openconnect/proxmox-pure-snap-restore-backend | Backend image reference |
| PPS_FRONTEND_IMAGE | ghcr.io/purestorage-openconnect/proxmox-pure-snap-restore-frontend | Frontend image reference |
| PPS_IMAGE_TAG | latest | Tag pulled by docker compose pull; pin a vX.Y.Z tag for production |
| PPS_HTTP_PORT | 8080 | Host port that nginx serves HTTP on (redirects to HTTPS) |
| PPS_HTTPS_PORT | 8443 | Host port that nginx serves HTTPS on |
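
A minimal production-pinned .env excerpt combining both groups might look like this (values are examples):

APP_SECRET_KEY=<random string>
APP_ENCRYPTION_KEY=<Fernet key>
PPS_BACKEND_IMAGE=ghcr.io/purestorage-openconnect/proxmox-pure-snap-restore-backend
PPS_FRONTEND_IMAGE=ghcr.io/purestorage-openconnect/proxmox-pure-snap-restore-frontend
PPS_IMAGE_TAG=v0.2.1
PPS_HTTPS_PORT=8443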

Continuous integration & releases

Two GitHub Actions workflows live under .github/workflows/:

  • ci.yml runs on every PR and push to main:
    • Builds backend/Dockerfile --target test and runs pytest -v inside that image, so the test environment is bit-for-bit identical to the runtime image (same Python, same wheels) plus the [dev] extras.
    • Builds the frontend with npm install && npm run build to catch TypeScript and Vite breakage before merge.
  • release.yml runs on push to main and on v*.*.* git tags. It builds backend (target runtime) and frontend images and pushes them to ghcr.io/purestorage-openconnect/proxmox-pure-snap-restore-{backend,frontend} with the canonical tag set produced by docker/metadata-action:
    • :main and :sha-<short> on every push to main
    • :vX.Y.Z, :X.Y, :X, and :latest on each v*.*.* git tag

Cutting a release is therefore:

git tag v0.2.1
git push origin v0.2.1
# Wait for the Release workflow to finish, then on the host:
ssh root@<host> "cd /opt/proxmox-pure-snap-restore && deploy/upgrade.sh v0.2.1"

The image visibility on GHCR defaults to private. Make the two packages public from the GitHub Packages tab if you want the install script to work without docker login.

Local convenience

A Makefile wraps the common operations:

make test                 # backend pytest in the test image
make lint                 # ruff + mypy in the test image
make frontend-build       # vite build in a node:20 container
make build TAG=dev        # build both runtime images locally
make push TAG=dev REGISTRY=ghcr.io/alice
make up / make down       # docker compose up -d / down (dev compose)
make deploy-remote REMOTE_HOST=root@docker-host.example.com
make upgrade-remote REMOTE_HOST=root@docker-host.example.com

REST API (summary)

All routes are under /api. Auth is JWT bearer on everything except /api/auth/login and /api/health. Admin-only routes enforce require_admin.

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /api/auth/login | Exchange credentials for a JWT |
| GET | /api/auth/me | Current user (username, role, must_change_password) |
| POST | /api/auth/change-password | Change current user's password |
| GET | /api/connections/proxmox | List Proxmox connections (admin) |
| POST | /api/connections/proxmox | Create Proxmox connection (admin) |
| PATCH/DELETE | /api/connections/proxmox/{id} | Update/remove (admin) |
| POST | /api/connections/proxmox/{id}/test | Live ping the Proxmox API |
| GET / POST / PATCH / DELETE | /api/connections/pure[...] | Same for Pure arrays |
| GET / POST / PATCH / DELETE | /api/connections/ssh[...] | SSH credentials |
| GET | /api/inventory/{proxmox_id}/tree | Full VM → disks → snapshots tree, with mapping diagnostics, deleted-VM rows from remembered_vms, and predates_disk flags |
| POST | /api/inventory/snapshots | Create an ad-hoc Pure snapshot on a volume |
| POST | /api/restore | Start a restore (admin); accepts force_host_copy to bypass array offload, and remembered_vm_id to pin a deleted-VM restore to a specific incarnation; returns 409 if the snapshot predates any disk |
| GET | /api/restore | List recent jobs |
| GET | /api/restore/{id} | Job detail incl. streaming log buffer |
| GET | /api/security/tls | Current TLS cert status (admin) |
| POST | /api/security/tls/upload | Upload custom cert + key PEM (admin) |
| POST | /api/security/tls/regenerate | Mint a fresh self-signed cert (admin) |
| POST | /api/security/tls/reload | SIGHUP the frontend nginx to pick up the new cert (admin) |
| GET | /api/health | Liveness probe |
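
Once credentials have been exchanged for a JWT at /api/auth/login, the other routes are plain bearer-authenticated HTTP. For example (host and connection id are placeholders):

curl -sk -H "Authorization: Bearer $TOKEN" https://pps.example.com:8443/api/inventory/1/tree   # VM → disks → snapshots tree
curl -sk -H "Authorization: Bearer $TOKEN" https://pps.example.com:8443/api/restore            # recent restore jobs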

Development

# Backend
cd backend
python -m venv .venv && . .venv/Scripts/activate  # or: source .venv/bin/activate
pip install -e .[dev]
uvicorn app.main:app --reload

# Frontend
cd frontend
npm install
BACKEND_URL=http://localhost:8000 npm run dev

Security notes

  • All credentials (Proxmox secret, Pure API token, SSH private key and passphrase, SSH password) are encrypted at rest using Fernet with APP_ENCRYPTION_KEY. If you lose that key, the secrets are unrecoverable.
  • Passwords are hashed with Argon2id.
  • The restore orchestrator performs destructive operations. Overwrite mode stops the source VM before the copy; it does not take a confirmation snapshot automatically — consider triggering an ad-hoc Pure snapshot first (the Inventory page has a "Snapshot now" action for this).
  • vm_disk_sightings + the 409 predates check reduce the blast radius of operator error with VMID reuse, but they are not a substitute for array-side snapshot retention policies.
  • Multi-user with RBAC is scaffolded (role column on users, require_admin dep) but only a single admin is seeded today.

Limitations / roadmap

  • Assumes 1:1 Pure volume ↔ LVM VG mapping. VGs spanning multiple PVs are flagged unmapped and not restorable.
  • qcow2 / raw-file storages are not supported; the copy path requires LVM thick volumes so that lvm-xcopy (or the host-copy fallback) has an LV on both sides.
  • For NVMe-TCP storage, cross-namespace offload requires a controller that supports NVMe Copy Format 2h (TP4130) and advertises it in OCFS. On controllers without TP4130 support the runner falls back to the host-side copy (qemu-img convert), in which case restore time scales with network + device bandwidth rather than being metadata-only.
  • Protection-group snapshot naming shows up naturally in the tree (snapshot name prefixed with the PG), but the app does not yet let you create PGs.
  • No cluster-wide HA awareness: restore runs against a specific node.

License

Licensed under the Apache License, Version 2.0. Redistributions must retain the copyright, license, and any NOTICE file per §4 of the license.

About

A utility for restoring Proxmox LVM-based VMs from FlashArray snapshots.
