Using OADP for VM Data Protection
There are basically only three things in IT that can get you fired: theft, hitting the big red button in the data center, and f***ing up the backups
Gather round folks, let's put ourselves in the mindset of the typical VMware admin and I'll tell you a story. That's not hard for me to do because, well... that's what I did for a good portion of my career in IT. My first "real" job in IT was as a sysadmin on the Enterprise Virtual Platform team at a large health care company. One of the duties that fell into my lap on that team was data protection, which I have kind of a love/hate relationship with - mostly hate. When I first started working in IT, one of my team leads told me, "There are basically only three things in IT that can get you fired: theft, hitting the big red button in the data center, and f***ing up the backups." So why would any enterprise IT organization entrust something as vitally important as backups to the most junior member of the team? Honestly, because under the best of circumstances, handling the backups is a thankless job, and under the worst of circumstances it can get you fired. If we're being honest, the only time anyone except the poor data protection admin ever thinks about backups is when something has gone wrong and they need you to fix it - yesterday. The short answer: as the most junior member of the team, you're expendable.
Not wanting to lose my job over a failed VM restore, I poured countless hours into getting our backups whipped into shape. I inherited a disjointed mess of data protection solutions, including Tivoli backing up DB2 to tape, PHD Virtual (now known as Unitrends), and some one-off robocopy scripts. When I first took over the backups, they were limping along at an abysmal 30-40% success rate for nightly snapshots due to neglect and inattention. We had all sorts of problems, from expired license keys to backup disks running out of space because auto-retention rules weren't properly cleaning up old snapshots. We had hung snapshots all over vCenter because someone who had no idea what they were doing had tried to snapshot VMs with RDMs (raw device mappings). After a few months of near constant work, I got our backups working reliably for the most part - we'd still have one or two VMs fail per night, but it was a hell of a lot better than the mess I'd inherited. Miraculously, we never suffered a single data loss event during the time our backups were still in bad shape.

Then one day my backups did get put to the test - we were performing an ESXi upgrade. Because we were using Brocade fibre channel cards that weren't supported out of the box by the default drivers on ESXi, we always had to roll our own custom install ISO, which honestly wasn't that bad. We'd written automation to generate the install image with the appropriate kickstarts for each host, and the process was pretty much on-rails. The problem arose when the on-board RAID controller of one of the ESXi hosts failed. The kickstart we'd written simply specified installing on the 'first bootable disk'. Well... without the RAID controller present, the first bootable disk just happened to be a fibre channel LUN... which happened to be the LUN housing some of our mission critical MSSQL databases.

No... I was not the one who did the install on that particular ESXi host. However, I was working on another host when all of a sudden I saw a couple dozen VMs in the prod cluster go offline and vCenter started blowing up with errors. I figured it was just a blip at first, maybe due to the rolling reboots of ESXi hosts as they got redeployed, but after a couple of minutes, when the primary LUN and all the VMs running on it still wouldn't come back online, I started to panic. Then my boss called my desk demanding to know why the hospitals were unable to process any transactions. It took us a few minutes to unravel what had just happened. I ssh'd into the ESXi host and tried to see if I could get it to mount the LUN manually... huh, that's odd, it says the LUN is already mounted at another location. Which location? Root. It was mounted at root. I started to feel a knot forming in my gut as it dawned on me exactly what had happened. I decided to take a look at iLO and sure enough, it had been throwing an error about the RAID card being offline for a couple of weeks. We should have been alerted to the failure, but after looking through the settings on iLO, it turned out SNMP traps had never been configured. Someone didn't do their job when these servers were first spun up, but that was a problem for another day.
We ended up yanking the cables out of the back of that ESXi server and never put it back into production until that hardware was later decommed. Thankfully, after we went back in and wiped the ESXi install off the LUN, we were able to restore every single one of the VMs that had been running on it from the previous night's backup. It took about 4 hours to put all of the VMs back. We lost about half a day of financial transactions, which I'm sure was a paperwork nightmare for the hospitals to sort out, but in the end, not only was I able to restore all of the VMs, I managed to do so within our stated SLA. Did I get a high five, or a "thank you", or a "great job" for putting Humpty back together? No... I did not. I got chastised for weeks by the DBA team for the hours of SQL transactions that were lost, and my whole team got lectured about the cascading series of failures that led up to the data loss event - the misconfigured SNMP settings, the failed RAID card, the "poorly written" automation in our ESXi kickstarts (which, to be fair, had worked flawlessly through dozens of upgrades before). Not only that, but they revoked everyone's work from home privileges out of spite (ostensibly so that we would be able to respond to a future incident like this more quickly - I was in the office that day...). However, nobody got fired, which I'm calling a win. When I say data protection is a thankless job, I mean it - but it is an absolutely VITAL part of any IT organization.
Fast forward to today. With our VMware sysadmin hats on, essentially every IT organization out there is facing the same challenge: following the Broadcom acquisition of VMware, we're starting to hear rumblings in the field about customers seeing price hikes on their VMware licensing to the tune of a 700-800% increase. To put it bluntly, everyone is scrambling for an exit strategy, and we (Red Hat) are positioning Openshift Virtualization as our go-to solution for customers looking to repatriate their workloads off of VMware.
I'm not going to lie: as much as I love Openshift, especially Openshift Virtualization, trying to sell a container orchestration platform as a hypervisor is an uphill battle to begin with. Data protection has always been our Achilles' heel when it comes to positioning ourselves as a VMware competitor, despite the fact that OADP (Openshift API for Data Protection) is available as part of the Openshift platform at no additional cost.
For your typical virtualization admin, the expected workflow for data protection is:
- VM is snapshotted
- disk snapshot(s) and metadata are backed up somewhere else
Up until fairly recently, this simply wasn't possible. By default, OADP relies on CSI snapshots only, meaning that your data never egresses your cluster to be backed up elsewhere - which is absolutely not what people have in mind when they think about data protection. That would be like just taking VM snapshots in vCenter and hanging on to them, which is a huge no-no at pretty much every company I've ever worked at. Snapshots aren't meant to stick around any longer than is absolutely necessary - eg; for the duration of a complex software upgrade. However, with a feature called Data Mover enabled, both your PVCs and your desired state config get copied out to external storage (keep in mind, kubernetes is a declarative platform where all of the objects in a namespace are defined in YAML). With the release of OADP 1.3.1, you can now use kopia as the backend for Data Mover, which, tl;dr, means that you can now back up both block and filesystem PVCs to an S3 bucket and, perhaps more importantly, restore them. It cannot be overstated that this is an absolute game changer when it comes to our story around virtualization.
Admittedly, data protection on Openshift (and kubernetes in general) lags behind VMware, but honestly that should be expected. VMware has been around since the late 90s; Kubernetes has only been around since 2015, so of course there's going to be room for improvement. In terms of "can I back up and restore my VMs on Openshift"? Yeah, you can... and it's not going to cost you a single extra dollar to do so, unlike any data protection solution for VMware. Granted, Openshift lacks some features like changed block tracking, which means each VM backup is a full 1:1 copy and you're going to need to handle things like deduplication on your storage array, but if you're looking for an offramp from VMware and you just need to check the box for backups and restores - yes, OADP has you covered.
Prerequisites
In order to use OADP for backing up your VMs, you'll need a few things:
- Openshift 4.15.x with Openshift Virtualization
- OADP v1.3.x operator
- an S3 bucket
My 'prod' cluster is running Openshift 4.15.12 with the latest stable OCP-V operator, 4.15.2. I'm running the latest stable OADP operator, ver. 1.3.1. I have a MinIO server on my secondary NAS that provides S3 storage.
Configuration for OCP-V
Once you've deployed the OCP-V operator on your Openshift cluster and configured a hyperconverged instance (by default, it should be kubevirt-hyperconverged in the openshift-cnv namespace), the main things you'll need for properly working backups are CSI drivers with corresponding volume snapshot classes, and properly configured storage profiles. Although it's not 100% necessary, as a best practice it's recommended that you use a block storage class for VMs. As I've previously written about, I use democratic-csi for iSCSI and NFS storage backed by ZFS on my primary NAS, which has all flash storage. My secondary NAS is enterprise SATA hard disks with NVMe disks for read and write cache.
The default storage class for my prod cluster is actually NFS, but you can add an annotation to your iSCSI storage class that sets it as the default for virtual machines, eg;
oc patch storageclass zol-iscsi-stor01 -p '{"metadata":{"annotations":{"storageclass.kubevirt.io/is-default-virt-class":"true"}}}' --type=merge
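If you want to sanity check that the annotation actually landed, something like this should show it with a value of true (assuming the same storage class name as above):
oc get storageclass zol-iscsi-stor01 -o yaml | grep is-default-virt-class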
When you deploy OCP-V, it creates a new custom resource called a StorageProfile for each of your storage classes, which must be configured with some basic instructions on how those PVCs can be managed by OCP-V. You'll need to patch the spec of the corresponding StorageProfile to instruct kubevirt how block PVCs should be handled, eg;
oc patch storageprofile zol-iscsi-stor01 -p '{"spec":{"claimPropertySets":[{"accessModes":["ReadWriteMany"],"volumeMode":"Block"}],"cloneStrategy":"csi-clone","dataImportCronSourceFormat":"pvc"}}' --type=merge
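Again assuming the same storage class name, you can eyeball the resulting profile afterwards to make sure the claimPropertySets, clone strategy and source format took effect:
oc get storageprofile zol-iscsi-stor01 -o yaml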
Then you just need to make sure you have an appropriate volumesnapshotclass for your iSCSI storage class, eg;
apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Delete
driver: org.democratic-csi.iscsi
kind: VolumeSnapshotClass
metadata:
  creationTimestamp: "2024-05-25T16:38:30Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: zfs-iscsi
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: democratic-csi
    helm.sh/chart: democratic-csi-0.14.6
  name: zol-iscsi-stor01-snapclass
  resourceVersion: "1666735"
  uid: 7ed6b7d8-f84b-4697-8370-b0ff2557fb52
parameters:
  detachedSnapshots: "true"
  # the snapshotter secret reference is rendered by the democratic-csi Helm chart and is not set here
However, this assumes that A. your storage class supports CSI snapshots and B. the appropriate snapshot controller is deployed, either as part of the CSI driver deployment or added after the fact. Both are the case with Democratic CSI, but not all CSI drivers support snapshots. For example, if you're using ODF, which is natively compatible with Openshift, the cephfs and ceph-rbd storage classes support snapshots, but the two object storage classes ODF provides - ceph-rgw and noobaa - do not. If your storage class does not support CSI snapshots, I think it's still theoretically possible to do backups with kopia on OADP, but I have not successfully tested this yet.
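If you want to smoke test CSI snapshots before OADP ever gets involved, a quick way is to create a one-off VolumeSnapshot against an existing PVC and check that it reports READYTOUSE as true. A minimal sketch, assuming a PVC named my-test-pvc in a test namespace (both hypothetical - substitute a PVC that actually exists on your cluster) and the snapshot class shown above:
cat << EOF | oc apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: snap-smoke-test
  namespace: test
spec:
  volumeSnapshotClassName: zol-iscsi-stor01-snapclass
  source:
    persistentVolumeClaimName: my-test-pvc  # hypothetical PVC name - replace with an existing PVC
EOF
oc get volumesnapshot snap-smoke-test -n test
Just remember to delete it once you've confirmed it works - like I said, snapshots aren't meant to stick around.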
Configuring S3 storage
I won't go over how to deploy MinIO as that's not within the scope of this blog post, but I will say that if you want SSL enabled for your S3 server, it's a lot easier to use Nginx to reverse proxy to your MinIO server and handle the SSL termination than it is to try and get MinIO to handle SSL itself.
On your S3 server, you'll need to create the appropriate bucket and path for your backups, which will be defined in the DataProtectionApplication in the next step. In my case, the bucket I'm using is oadp > prod.
Configuring OADP
Setting up OADP is actually not that complicated. Basically, you just need to install the operator, create a secret with the credentials for your S3 storage, and then create a DataProtectionApplication to point OADP at your S3 server.
S3 secret
cat << EOF > credentials-velero
[default]
aws_access_key_id=YOURACCESSKEY
aws_secret_access_key=YOURSECRETKEY
EOF
oc create secret generic cloud-credentials -n openshift-adp --from-file cloud=credentials-velero
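It's worth double checking that the secret ended up in the right namespace with the right key name, since the DataProtectionApplication below expects a secret called cloud-credentials with a key called cloud:
oc describe secret cloud-credentials -n openshift-adp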
Then you create the DataProtectionApplication, eg;
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: prod-ocp
  namespace: openshift-adp
spec:
  backupLocations:
  - velero:
      config:
        insecureSkipTLSVerify: 'true'
        profile: default
        region: us-east-1
        s3ForcePathStyle: 'true'
        s3Url: 'https://minio.cudanet.org:9443'
      credential:
        key: cloud
        name: cloud-credentials
      default: true
      objectStorage:
        bucket: oadp
        prefix: prod
      provider: aws
  configuration:
    nodeAgent:
      enable: true
      uploaderType: kopia
    velero:
      defaultPlugins:
      - openshift
      - aws
      - kubevirt
      - csi
      defaultSnapshotMoveData: true
      defaultVolumesToFSBackup: false
      featureFlags:
      - EnableCSI
  snapshotLocations:
  - velero:
      config:
        profile: default
        region: us-east-1
      provider: aws
OADP will assume that your S3 storage is on AWS. I use MinIO as my S3 server locally, but you can use dummy values for settings that don't actually apply to your server, like region: us-east-1.
Once you create your DataProtectionApplication, OADP will validate the configuration and then create a couple of other custom resources, including a VolumeSnapshotLocation and a BackupStorageLocation for your S3 server. If you run into issues, make sure that things like your URL and credentials are correct.
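One quick way to confirm that OADP can actually reach your S3 server is to check that the BackupStorageLocation it created reports a phase of Available, eg;
oc get backupstoragelocations -n openshift-adp
If it shows Unavailable, the usual suspects are a bad s3Url, a wrong bucket name, or bad credentials.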
Testing Backup and Restore
I started by creating a simple Fedora VM from one of the provided templates and created a single text file in the fedora user's home directory with the words "Hello world" in it, just to prove that the restore isn't doing something screwy like deploying a new VM from the template.
Then I created a backup of the running VM. Truth be told, it's probably easier to just use the CLI to do your backups, eg;
echo "alias velero='oc -n openshift-adp exec deployment/velero -c velero -it -- ./velero'" >> ~/.bash_profile
source ~/.bash_profile
velero create backup test --snapshot-move-data=true --snapshot-volumes=true --include-namespaces test
With the typical VMware admin in mind, you would accomplish the same thing using the Web UI by navigating to create new > backup and making sure to specify those values by checking the appropriate boxes on the form, eg;
Once you kick off your backup, a few things will happen. One or more volumesnapshots will be created for each PVC in your namespace, a backuprepository will be created, and a dataupload will be created for each PVC in your namespace, which spins up a backup pod in the openshift-adp namespace running a kopia job to copy your block PVC data from your Openshift cluster to your S3 bucket. YMMV, but OADP uploads data to my MinIO server in ~20MB chunks.
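If you want to keep an eye on the backup from the CLI while it runs, the dataupload objects and the velero alias from above should both give you a decent view of progress, eg;
# watch the kopia uploads created for each PVC
oc get datauploads -n openshift-adp
# and get the full rundown of the backup, including any warnings or errors
velero backup describe test --details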
Once the backup completed (it only takes a few minutes; these VM templates are only a few GB), I destroyed the VM, made sure the PVC was gone, and then created a restore task.
Again, you can do this more easily with the CLI, eg;
velero create restore --from-backup test --restore-volumes=true
Or you can do so using the Web UI, eg;
Once you kick off the restore, it will create a restore object and a datadownload for each PVC in your namespace. After a few minutes (depending on the size of your VM... like I said, mine was only a few GB), a VM and a PVC were created in my virtualization namespace, the PVC was eventually bound, and the VM booted back up.
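To convince yourself everything actually came back, you can check the restore status and the restored objects from the CLI as well, eg;
# the restore created from the 'test' backup should show a status of Completed
velero restore get
# the datadownload objects track the kopia copy back out of S3 for each PVC
oc get datadownloads -n openshift-adp
# and the VM and its PVC should be back in the namespace they came from
oc get vm,pvc -n virtualization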
Success!
Scheduling VM Backups
A one-off backup of a VM is fine, but regular backups (and perhaps more importantly - regularly tested restores!) are a must for any IT organization running critical VM workloads. A backup schedule in OADP is defined as a Schedule custom resource with a cron expression, like this:
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: schedule-virtualization-nightly
  namespace: openshift-adp
spec:
  schedule: 0 0 * * *
  template:
    includedNamespaces:
    - virtualization
    snapshotVolumes: true
    snapshotMoveData: true
    storageLocation: prod-ocp-1
which would run a nightly backup at midnight. Worth noting: cron times on Openshift are expressed in UTC, so you may want to adjust accordingly, eg; midnight Greenwich Mean Time is only 5PM here in Phoenix, AZ, so I would use 0 7 * * * for nightly backups at local midnight. That's it.
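One last tip - once the schedule exists, you can use the velero CLI alias from earlier to confirm that it's enabled and that the nightly backups are actually completing, eg;
# the schedule should show up as Enabled with a timestamp for its last backup
velero schedule get
# and each nightly run shows up as its own backup object
velero backup get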