A full vSphere deployment is a long process that requires quite a lot of resources. In addition to that, vSphere is rather picky regarding its execution environment.
The CI of the VMware modules for Ansible runs on OpenStack. Our OpenStack providers use KVM-based hypervisors and expect images in the qcow2 format.
In this blog post, we will explain how we prepare a cloud image of vSphere (also called a golden image).
First thing, get a large ESXi instance
The vSphere (VCSA) installation process depends on an ESXi host. In our case we use a script and Virt-Lightning to prepare and run an ESXi image on Libvirt. But you can use your own ESXi node, as long as it meets the following minimal requirements:
12GB of memory
50GB of disk space
2 vCPUs
Deploy the vSphere VM (VCSA)
For this, I use my own role called goneri.ansible-role-vcenter-instance. It delegates the deployment to the vcsa-deploy command. As a result, you don’t need any human interaction during the whole process. This is handy if you want to deploy your vSphere in a CI environment.
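Roughly speaking, the invocation looks like the sketch below, assuming the role is available on Ansible Galaxy under this name; the playbook and inventory names are only placeholders, and the variables the role expects are documented in its README:

ansible-galaxy install goneri.ansible-role-vcenter-instance
ansible-playbook -i inventory deploy_vcsa.yml   # deploy_vcsa.yml is a hypothetical playbook that includes the role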
At the end of the process, you’ve got a large VM running on your ESXi node.
Cloud-Init is the de-facto tool that handles all the post-configuration tasks we expect from a Cloud image: inject the user’s SSH key, resize the filesystem, create a user account, etc.
By default, the vSphere VCSA comes with a gazillion disks. This is a problem in a cloud environment, where an instance is associated with a single disk image. So I also move the content of the different partitions into the root filesystem and adjust /etc/fstab to remove all the references to the other disks. This way I only have to maintain one qcow2 image.
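To give an idea of what this implies, here is a sketch of the operations for a single extra mount point; /storage/log is just an example and the real playbook goes through all of them:

cp -a /storage/log /storage/log.copy    # duplicate the content onto the root filesystem
umount /storage/log                     # detach the dedicated partition
rmdir /storage/log                      # the mount point is now an empty directory again
mv /storage/log.copy /storage/log       # put the data back in place, on the root fs this time
sed -i '\|/storage/log|d' /etc/fstab    # and drop the corresponding entry from /etc/fstab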
All these steps are handled by the following playbook: prepare_vm.yml
Prepare the final Qcow2 image
At this stage, the VM is still running, so I shut it down. Once this is done, I extract the raw image of the disk using curl. In the command below:
vCenterServerAppliance.raw is the name of the local file
192.168.123.5 is the IP address of my ESXi
vCenter-Server-Appliance is the name of the vSphere instance
vCenter-Server-Appliance-flat.vmdk is the associated raw disk
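The full command looks roughly like this. It relies on the datastore HTTP file access of the ESXi host; the datastore name (datastore1) and the credentials are assumptions you will need to adjust, and dcPath=ha-datacenter is the default for a standalone ESXi host:

curl -k -u root \
  -o vCenterServerAppliance.raw \
  "https://192.168.123.5/folder/vCenter-Server-Appliance/vCenter-Server-Appliance-flat.vmdk?dcPath=ha-datacenter&dsName=datastore1"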
The local .raw file is large (50GB); ensure you’ve got enough free space.
You can finally convert the raw file to a qcow2 file. You can use QEMU’s qemu-img for that; it will work fine, BUT the image will be monstrously large. I instead use virt-sparsify from the libguestfs project, which reduces the size of the image to the bare minimum.
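For instance, with the raw file downloaded above:

# qemu-img works, but the result stays monstrously large:
# qemu-img convert -f raw -O qcow2 vCenterServerAppliance.raw vCenterServerAppliance.qcow2
# virt-sparsify converts and drops the unused blocks in one pass:
virt-sparsify --convert qcow2 vCenterServerAppliance.raw vCenterServerAppliance.qcow2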
If your OpenStack provider uses Ceph, you will probably want to reconvert the image to a flat raw file before the upload. With vSphere 6.7U3 and before, you also need to force the use of an e1000 NIC; for that, add --property hw_vif_model=e1000 when you create the image.
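As an illustration, the final steps could look like this; the image name is arbitrary and the raw reconversion only makes sense for Ceph-backed providers:

# Ceph-backed provider: go back to a flat raw file first
qemu-img convert -f qcow2 -O raw vCenterServerAppliance.qcow2 vCenterServerAppliance-final.raw
# upload the image; with vSphere 6.7U3 and before, force the e1000 NIC
openstack image create --disk-format raw --container-format bare \
  --file vCenterServerAppliance-final.raw \
  --property hw_vif_model=e1000 \
  vcenter-server-appliance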
I’ve just done the whole process with vSphere 7.0.0U1 in 1h30 (Lenovo T580 laptop). I use the ./run.sh script from https://github.com/virt-lightning/vcsa_to_qcow2, which automates everything.
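Concretely, that boils down to:

git clone https://github.com/virt-lightning/vcsa_to_qcow2
cd vcsa_to_qcow2
./run.sh

Check the README of the repository for the prerequisites, notably where to find the VCSA installation ISO.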
The final result is certainly not supported by VMware, but we’ve already run hundreds of successful CI jobs with this kind of vSphere instance. The CI prepares a fresh CI lab in around 10 minutes.
This post compares the start-up duration of the most popular Cloud images. By start-up, I mean the time until we’ve got an operational SSH server.
For this test, I use a pet project called Virt-Lightning (https://virt-lightning.org/). This tool allows any Linux user to start standard Cloud images locally. It will prepare the meta-data and start a VM in your local libvirt. It’s very handy for people like me, who work on Linux and spend the day starting new VMs. The images are in the qcow2 format, and it uses the OpenStack meta-data format. Technically speaking, the performance should match what you get with OpenStack.
Actually, OpenStack is often slightly slower because it does some extra operations. It may need to create a volume on Ceph, or prepare extra network configuration.
The 2.0.0 release of Virt-Lightning exposes a public API. My test scenario is built on top of that: it uses Python to pull the different images and create a VM from each of them 10 times in a row.
All the images are public; Virt-Lightning can fetch them with the vl pull foo command:
vl pull centos-6
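If you want a quick idea of the numbers without going through the Python API, a rough shell equivalent of one iteration could look like this, assuming vl start only returns once the VM answers on SSH (the distribution name is just an example):

vl pull debian-10        # fetch the image once
time vl start debian-10  # boot it and wait for SSH
vl down                  # destroy the VM before the next run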
During the boot process, the VM will set up a static network configuration, resize the filesystem, create a user, and inject an SSH key.
By default, Virt-Lightning uses a static network configuration because it’s faster, and it gives better performance when we start a large number of VMs at the same time. I chose to stick with this.
I did my tests on my Lenovo T580, which comes with NVMe storage and 32GB of memory. I would be curious to see the results of the same scenario on a regular spinning disk.
The target images
For this test, I compare the following Linux distributions: CentOS, Debian, Fedora, Ubuntu and OpenSUSE. As far as I know, there is no public Cloud image available for the other common distributions. If you think I’m wrong, please post a comment below.
I also included the latest FreeBSD, NetBSD and OpenBSD releases. They don’t provide official Cloud images, so I reuse the unofficial ones from https://bsd-cloud-image.org/.
The lack of a pre-existing Windows image is the reason why this OS is not included.
Results
Debian 10 is by far the fastest image, with an impressive 15s on average, basically 5s less than any other Cloud image.
Regarding the BSDs, FreeBSD is the only system able to resize the root filesystem without a reboot. Consequently, OpenBSD and NetBSD need to boot twice in a row, which explains the big difference. The NetBSD kernel hardware probe is also rather slow; for instance, it takes 5s to initialize the ATA bus of the CDROM. This is why its results look rather bad.
Regarding Ubuntu, I was surprised by the boot duration of 18.04. It is about two times longer than 16.04. 20.04 is a bit better, but we are still far from the 15s of 14.04. I would be curious to know the origin of this. Maybe AppArmor?
The CentOS 6 results are not really consistent: they vary between 17.9s and 25.21s, the largest delta of all the distributions. That being said, CentOS 6 is rather old and won’t be supported anymore at the end of the year.
Conclusions
All the recent Linux images are based on systemd. It would be great to extract the metrics from systemd-analyze to understand what impacts the performance the most.
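For instance, once connected in the guest, these two commands already give a good overview:

systemd-analyze          # time spent in firmware, loader, kernel and userspace
systemd-analyze blame    # per-unit breakdown, sorted by start-up time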
Most of the time, when I deploy a test VM, the very first thing I do is install a few important packages. This scenario may be covered later in another blog post.