For a number of business reasons, we decided to move one of our existing Kubernetes clusters to a new geographic region.
The thing we were most worried about was the persistent volumes attached to MySQL instances for our test environments running in k8s.
There isn’t a straightforward way to do this. One common approach is to create a snapshot of etcd, but we’re on GKE so that’s out of the question. Luckily, we found Ark.
Ark is a disaster recovery tool for Kubernetes clusters. It can take backups of the whole cluster and restore them with a single command. We can even have it run on a schedule. Persistent volumes are also taken care of. It has good documentation, so setting it up would have been a breeze if not for an RBAC bug in GKE.
Download
A simple git clone git@github.com:heptio/ark.git
was all it took to download Ark. Its master branch is frequently updated and not stable, so the maintainers recommend checking out the latest tagged version. At the time of writing, the latest release was v0.9.5.
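From there, checking out the tagged release is just a couple of commands (ark here is simply the default clone directory):
cd ark
git checkout v0.9.5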
Setting it up
Ark works by creating custom resources in k8s for its operations, conveniently defined in a single yaml file.
I had to kubectl apply
that file to both the American cluster and the shiny new European cluster we’re moving into.
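If memory serves, those definitions live in a single manifest under examples/common in the checkout (double-check the path in your release), so on each cluster it was roughly:
# register Ark's namespace and custom resource definitions in the current cluster
kubectl apply -f examples/common/00-prereqs.yaml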
This is where the RBAC bug on GKE appears:
User "This email address is being protected from spambots. You need JavaScript enabled to view it." cannot create clusterrolebindings.rbac.authorization.k8s.io at the cluster scope: No policy matched.
To work around this, I had to have my Google account granted the cluster-admin
role in both clusters:
kubectl create clusterrolebinding paul-cluster-admin-binding --clusterrole=cluster-admin --user=<your-google-account>
Ironically, it spits out the same error unless you’re an account with the Owner IAM Role.
Apparently, this is a known issue on GKE:
Because of the way Container Engine checks permissions when you create a Role or ClusterRole, you must first create a RoleBinding that grants you all of the permissions included in the role you want to create. An example workaround is to create a RoleBinding that gives your Google identity a cluster-admin role before attempting to create additional Role or ClusterRole permissions. This is a known issue in the Beta release of Role-Based Access Control in Kubernetes and Container Engine version 1.6.
Cloud Storage Bucket
Apart from persistent volumes, Ark stores its backups in a cloud storage bucket. This bucket should be exclusive to Ark because each backup is stored in its own subdirectory in the bucket’s root. A service account will be needed to authorize Ark to upload files into the bucket.
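Creating the bucket itself is a one-liner with gsutil; ours is called neso-cluster-backup, which you'll see again in the config further down:
# create a bucket dedicated to Ark backups
gsutil mb gs://neso-cluster-backup/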
Service account
I created a service account just for Ark to use. It needs read and write access to the bucket. In GKE, persistent volumes are just disks attached to the nodes, so I had to give it permissions for those too. These are the permissions given to the service account (a sketch of the gcloud setup follows the list):
compute.disks.get
compute.disks.create
compute.disks.createSnapshot
compute.snapshots.get
compute.snapshots.create
compute.snapshots.useReadOnly
compute.snapshots.delete
compute.projects.get
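For reference, here's a rough sketch of that setup with gcloud, along the lines of Ark's GCP docs; the service account name, the custom role ID, and $PROJECT_ID are placeholders, so adjust to taste:
# create a dedicated service account for Ark
gcloud iam service-accounts create heptio-ark --display-name "Heptio Ark"
SERVICE_ACCOUNT_EMAIL=heptio-ark@$PROJECT_ID.iam.gserviceaccount.com
# bundle the compute permissions listed above into a custom role
gcloud iam roles create heptio_ark.server --project $PROJECT_ID \
    --permissions compute.disks.get,compute.disks.create,compute.disks.createSnapshot,compute.snapshots.get,compute.snapshots.create,compute.snapshots.useReadOnly,compute.snapshots.delete,compute.projects.get
# grant that role to the service account
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member serviceAccount:$SERVICE_ACCOUNT_EMAIL \
    --role projects/$PROJECT_ID/roles/heptio_ark.server
# give it read/write access to the backup bucket
gsutil iam ch serviceAccount:$SERVICE_ACCOUNT_EMAIL:objectAdmin gs://neso-cluster-backup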
The Ark server config
At this point the bucket has been created and Ark has been allowed to upload to it. Now it needs to know which bucket to use, by setting the Ark Config (a custom resource defined by Ark):
# examples/gcp/00-ark-config.yaml
...
backupStorageProvider:
  name: gcp
  bucket: neso-cluster-backup
...
The Ark server Deployment
To hand the service account off to Ark, a k8s secret named cloud-credentials
containing the service account key has to be created.
# download service account key
gcloud iam service-accounts keys create ark-svc-account \
    --iam-account $SERVICE_ACCOUNT_EMAIL
# create secret
kubectl create secret generic cloud-credentials \
    --namespace heptio-ark \
    --from-file cloud=ark-svc-account
In the Ark Deployment yaml file, there wasn’t anything that needed to be changed. All that’s left to start the server is to kubectl apply
the Config and the Deployment.
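Concretely, with the GCP example manifests (the deployment file name here is from memory, so check your checkout):
# point Ark at the bucket
kubectl apply -f examples/gcp/00-ark-config.yaml
# start the Ark server
kubectl apply -f examples/gcp/10-deployment.yaml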
Generating a backup
After everything was set up on both clusters and the Ark client was installed locally, it was time to put Ark to the test. Making sure kubectl
's context was set to the US cluster, we generated the backup with fingers crossed:
$ ark backup create us-cluster --exclude-namespaces kube-system,kube-public,heptio-ark
We gave it a few minutes and then:
$ ark backup get
NAME STATUS CREATED EXPIRES SELECTOR
us-cluster Completed 2018-09-21 15:59:35 +0800 +08 30d <none>
Restoring the backup
The backup includes all the resources, from pods to ingresses. We wanted to keep the IP addresses we used in the old cluster, so to free them up, down went the ingresses in the old cluster.
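Nothing fancy there, just a delete per namespace that had one (the namespace name is a placeholder):
# free up the static IPs by removing the ingresses in the old cluster
kubectl delete ingress --all --namespace <namespace>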
Now, with the kubectl
context set to the new cluster in Europe, it took a while for the cluster to see the backup, but it did appear eventually:
$ ark backup get
NAME STATUS CREATED EXPIRES SELECTOR
us-cluster Completed 2018-09-21 15:59:35 +0800 +08 30d <none>
$ ark restore create --from-backup us-cluster
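Restores can be listed the same way backups can, which is handy for checking their status:
$ ark restore get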
Ark was able to restore everything except the persistent volumes. Our applications could not connect to the databases. Taking a closer look, it appears Ark recreated the disks, but in the region where the backups were created. The maintainers are aware of this issue and have a fix lined up for v0.10.0.
We couldn’t wait for that release, though, so we had no choice but to move the databases out of k8s. We ultimately decided to spin up a Cloud SQL instance and stick our test environments’ databases there.
Conclusion
Ark is an awesome tool. Although the migration did not go as smoothly as it should have, some good came out of it. It forced us to move our databases outside of Kubernetes, which we shouldn’t have been doing in the first place. Also, we now have regular backups of our new cluster.