My GitHub workflow

The Ansible community uses GitHub to develop ansible-core and most of the Ansible Collections. The only exception I know of is OpenStack's ansible-collections-openstack, which is developed on Gerrit.

So, as an Ansible developer, my normal day-to-day activities involve a lot of GitHub interactions. I review Pull Requests (PRs) and prepare new PRs all the time.

Before joining Ansible, I was working with Gerrit, which is a nice alternative for collaborating on a stream of patches.

In Gerrit, each patch from a branch is a PR. Every time we update a patch, its sha2 changes, so Gerrit tracks patches with a dedicated ID called the Change-Id. It appears as an extra line in the body of the commit message. e.g:

     Change-Id: Ic8aaa0728a43936cd4c6e1ed590e01ba8f0fbf5b

Gerrit provides a tool called git-review to pull and push the patches. When a contributor pushes a series of patches, each patch is correctly tracked by Gerrit and updates the right existing PR. This allows the contributor to reorganize the patches, change the order of the series or import a patch from another branch.
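
For those who have never used it, the day-to-day git-review commands are roughly these (the change number is only an example):

git review           # push the patches of the current branch to Gerrit
git review -d 12345  # download change 12345 into a local branch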

With GitHub, a branch is a PR, and most of the time projects prefer to use the branch to trace the iterations of the PR:

  • my fancy feature
  • fix: correct the test-suite
  • fix: fix the fix
  • fox: typo in previous commit
  • bla

And this is fine, because most of the time, the branch will ultimately be squashed (one branch -> one Git commit) during the final merge.

The GitHub workflow is certainly friendlier for newcomers, but it tends to become a source of complexity when you want to work on several PRs at the same time. For instance, I work on a new feature, but I also want to cherry-pick an experimental commit from a contributor. In this case I must remove this commit before I push my branch back to GitHub, or the extra commit will end up in my feature branch.

Another example: if I'm working on a feature branch and find an issue with something unrelated, I need to switch to another branch to commit my fix and push it. This is cumbersome, and people often just merge the fix into their feature branch instead, which leads to confusion and questions during the code review.
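
Concretely, the clean way to handle it with plain Git looks something like this (a sketch; the file name, the branch names and the remote are only examples):

git stash                               # park the feature work in progress
git checkout -b quick-fix origin/main   # start a dedicated branch from main
git add docs/faq.md
git commit -m "doc: fix a typo in the FAQ"
git push -u origin quick-fix            # this branch becomes its own PR
git checkout my-feature                 # back to the feature work
git stash pop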

To put it simply, Gerrit allows better code modularity but also requires a better understanding of Git, which is a hurdle when we try to attract new contributors. This is the reason why we use the current workflow.

To address the problem I wrote a script called push-patch (https://github.com/goneri/push-patch). I use it to push just my commits. For instance, I work on this branch:

  • 1: doc: explain how to do something
  • 2: typo: adjust a little detail
  • 3: a workaround for issue #19 that should not be merged

The first two commits are not directly related to the feature I'm implementing, and I would like to submit them immediately.

push-patch allows me to push only changes 1 and 2, as two dedicated PRs. Both branches will be based on main and can be merged independently.

$ push-patch 1
$ push-patch 2

Now, and that's the cool part 😋! Let's imagine I want to push another revision of my first patch. I can use “git rebase -i” to adjust this commit and call push-patch again to push the updated patch.

$ git rebase -i main   # mark the first commit with "edit"
$ vim foo
$ git add foo
$ git rebase --continue
$ ./push-patch 1

Internally, push-patch uses git-notes to track the remote branch of the patch. The Public-Branch field records the name of the branch in my remote clone of the project and PR-Url is the URL of the PR in the upstream project. e.g:

commit 1198db8807ebf9f4099598bcd41df25d465cbcae (HEAD -> main)
Author: Gonéri Le Bouder <goneri@lebouder.net>
Date:   Thu Jan 7 11:31:41 2021 -0500

   elb_application_lb: enable the functional test
    
   Remove the `unsupported` aliases for the `elb_application_lb` test.
    
   Use HTTP instead of HTTPS to avoid the dependency on
   `iam:ListServerCertificates` and the other Certificate related operations.

Notes:
   Public-Branch: elb_application_lb-enable-the-functional-test_24328
    
   PR-Url: https://github.com/ansible-collections/community.aws/pull/348

This means that even if the patch content evolves, push-patch will still be able to update the right PR.

In a nutshell, for each patch it will:

  1. clone the project and switch to the main branch
  2. read the patch notes
    1. if a branch name already exists it will use it, otherwise it will create a new one
  3. switch to the branch
  4. cherry-pick the patch
  5. push the branch
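
To illustrate the logic, here is a simplified shell sketch of these steps. It is not the real push-patch, just an approximation: the clone step is skipped, and the origin remote and main branch names are assumptions.

#!/bin/bash
# Illustration only: a rough equivalent of the steps listed above.
set -eu
sha=$1

git checkout main                              # start from the main branch
branch=$(git notes show "$sha" 2>/dev/null \
         | sed -n 's/^Public-Branch: //p')     # read the patch notes
if [ -z "$branch" ]; then                      # no branch recorded yet: pick a name
    branch="patch-${sha:0:8}"
    git notes append -m "Public-Branch: ${branch}" "$sha"
fi
git checkout -B "$branch" main                 # switch to the branch
git cherry-pick "$sha"                         # cherry-pick the patch
git push -f origin "$branch"                   # push the branch
git checkout main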

push-patch expects just the sha2 of the commit to push. It also accepts a list of sha2. This is the reason why I often type things like this:

push-patch $(git log -2 --pretty=tformat:%H)

The command passes to push-patch the sha2 of the last two commits. push-patch will push them to the two associated branches upstream. And at the end, I can use git log, or better tig, to get the URL of the GitHub review.

Right now, the command is a shell script and depends on the hub command. I would like to rewrite it in a more suitable programming language.

What about you? Do you also use some special tools to handle your PRs?

Ansible: How we prepare the vSphere instances of the VMware CI

As explained quickly in CI of the Ansible modules for VMware: a retrospective, the Ansible CI uses OpenStack to spawn ephemeral vSphere labs. Our CI tests are run against them.

A full vSphere deployment is a long process that requires quite a lot of resources. In addition to that, vSphere is rather picky regarding its execution environment.

The CI of the VMware modules for Ansible runs on OpenStack. Our OpenStack providers use KVM-based hypervisors, which expect images in the qcow2 format.

In this blog post, we will explain how we prepare a cloud image of vSphere (also called golden image).

a full lab running on libvirt

First thing, get a large ESXi instance

The vSphere (VCSA) installation process depends on an ESXi. In our case we use a script and Virt-Lightning to prepare and run an ESXi image on libvirt. But you can use your own ESXi node as long as it meets the following minimal requirements:

  • 12GB of memory
  • 50GB of disk space
  • 2 vCPUs

Deploy the vSphere VM (VCSA)

For this, I use my own role called goneri.ansible-role-vcenter-instance. It delegates the deployment to the vcsa-deploy command. As a result, you don't need any human interaction during the full process. This is handy if you want to deploy your vSphere in a CI environment.

At the end of the process, you’ve got a large VM running on your ESXi node.

In my case, all these steps are handled by the following playbook: https://github.com/virt-lightning/vcsa_to_qcow2/blob/master/install_vcsa.yml

Tune-up the instance

Before you shut down the freshly created VM, you will want to make some adjustments.
I use the following playbook for this: prepare_vm.yml

During this step, I ensure that:

  • Cloud-Init is installed,
  • the root account is enabled with a real shell,
  • the virtio drivers are available

Cloud-Init is the de facto tool that handles all the post-configuration tasks we can expect from a cloud image: inject the user SSH key, resize the filesystem, create a user account, etc.

By default, the vSphere VCSA comes with a gazillion disks, which is a problem in a cloud environment where an instance is associated with a single disk image.
So I also move the content of the different partitions into the root filesystem and adjust /etc/fstab to remove all the references to the other disks. This way I only have to maintain one qcow2 image.
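
For a single mount point, the operation inside the VCSA boils down to something like this (a sketch; /storage/log is just one example, the real playbook loops over all the extra mount points listed in /etc/fstab):

mkdir -p /tmp/log-copy
cp -a /storage/log/. /tmp/log-copy/      # copy the data out of the dedicated partition
umount /storage/log                      # detach the dedicated disk
cp -a /tmp/log-copy/. /storage/log/      # put the data back, on the root filesystem this time
rm -rf /tmp/log-copy
sed -i '\#/storage/log#d' /etc/fstab     # drop the corresponding fstab entry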

All these steps are handled by the following playbook: prepare_vm.yml

Prepare the final Qcow2 image

At this stage, the VM is still running, so I shut it down.
Once this is done, I extract the raw image of the disk using the curl command:

curl -v -k --user 'root:!234AaAa56' -o vCenterServerAppliance.raw 'https://192.168.123.5/folder/vCenter-Server-Appliance/vCenter-Server-Appliance-flat.vmdk?dcPath=ha%252ddatacenter&dsName=local'

  • root:!234AaAa56 is my login and password
  • vCenterServerAppliance.raw is the name of the local file
  • 192.168.123.5 is the IP address of my ESXi
  • vCenter-Server-Appliance is the name of the vSphere instance
  • vCenter-Server-Appliance-flat.vmdk is the associated raw disk

The local .raw file is large (50GB); make sure you've got enough free space.

You can finally convert the raw file to a qcow2 file. You can use Qemu's qemu-img for that and it will work fine, BUT the image will be monstrously large. I instead use virt-sparsify from the libguestfs project. This command reduces the size of the image to the bare minimum.

virt-sparsify --tmp tmp --compress --convert qcow2 vCenterServerAppliance.raw vSphere.qcow2
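
For reference, the plain qemu-img conversion mentioned above would be something like the following (-c enables compression). It works, but the blocks of deleted files are carried over, so the resulting file stays much bigger:

qemu-img convert -f raw -O qcow2 -c vCenterServerAppliance.raw vSphere.qcow2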

Conclusion

You can upload the image to your OpenStack project with the following command:

openstack image create --disk-format qcow2 --file vSphere.qcow2 --property hw_qemu_guest_agent=no vSphere

If your OpenStack provider uses Ceph, you will probably want to reconvert the image to a flat raw file before the upload. With vSphere 6.7U3 and earlier, you also need to force the use of an e1000 NIC. For that, add --property hw_vif_model=e1000 to the command above.
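
The reconversion to a flat raw file is a qemu-img one-liner, and the upload command then uses --disk-format raw (keep an eye on the free space, the raw file is back to its full 50GB):

qemu-img convert -f qcow2 -O raw vSphere.qcow2 vSphere.raw
openstack image create --disk-format raw --file vSphere.raw --property hw_qemu_guest_agent=no vSphere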

I've just done the whole process with vSphere 7.0.0U1 in 1h30 (Lenovo T580 laptop). I use the ./run.sh script from https://github.com/virt-lightning/vcsa_to_qcow2, which automates everything.

The final result is certainly not supported by VMware, but we've already run hundreds of successful CI jobs with this kind of vSphere instance. The CI prepares a fresh lab in around 10 minutes.

Ansible and k8s: How to get the K8S_AUTH_API_KEY value?

The community.kubernetes collection accepts an api_key parameter that may sound a bit confusing. It's actually the value of the token of a serviceaccount: an OAuth 2.0 (Bearer) token, associated with a user and a secret key. It's rather similar to what we can do with a login and a password.

In this example, we want to run our playbook as the k8sadmin user. We need to find the token associated with this user. What we are actually looking for is a secret. You can list them this way:

[root@kind-vm ~]# kubectl -n kube-system get secret
NAME                                             TYPE                                  DATA   AGE
(...)
foobar                                           Opaque                                0      5h3m
foobar-token-w8lmt                               kubernetes.io/service-account-token   3      5h15m
foobar2-token-hpd6f                              kubernetes.io/service-account-token   3      5h9m
generic-garbage-collector-token-l7hvk            kubernetes.io/service-account-token   3      25h
horizontal-pod-autoscaler-token-sssg5            kubernetes.io/service-account-token   3      25h
job-controller-token-dnfds                       kubernetes.io/service-account-token   3      25h
k8sadmin-token-bklpd                             kubernetes.io/service-account-token   3      5h40m
(...)

We use the -n parameter to specify the kube-system namespace. Our service account is in the list: it's k8sadmin-token-bklpd. We can see the content of the token with this command:

[root@kind-vm ~]# kubectl -n kube-system describe secret k8sadmin-token-bklpd
Name:         k8sadmin-token-bklpd
Namespace:    kube-system
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: k8sadmin
             kubernetes.io/service-account.uid: 412bf773-ca8e-4afa-a778-dac0f11b7807

Type:  kubernetes.io/service-account-token

Data
====
namespace:  11 bytes
token:      eyJhbGciO(...)2A
ca.crt:     1066 bytes

Here, you're done. The token is in the command output. You now need to pass its content to Ansible. Just keep in mind the token needs to remain secret, so it's a good idea to encrypt it with Ansible Vault.
You can use the K8S_AUTH_API_KEY environment variable to pass the token to the k8s_* modules:

$ K8S_AUTH_API_KEY=eyJhbGciO(…)2A ansible-playbook my_playbook.yaml
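
If you want to avoid the manual copy/paste, the token can also be extracted directly from the secret. In the secret data it is stored base64-encoded, hence the base64 -d:

$ K8S_AUTH_API_KEY=$(kubectl -n kube-system get secret k8sadmin-token-bklpd -o jsonpath='{.data.token}' | base64 -d)
$ export K8S_AUTH_API_KEY
$ ansible-playbook my_playbook.yaml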

Ansible: Performance Impact of the Python version

Until recently, I was not really paying attention to the version of Python I was using with Ansible, as long as it was Python 3. The default version was always good enough for Ansible.

During the last weeks, I spent the majority of my time working on the performance of the community.kubernetes collection. The modules of this collection depend on a large library (the OpenShift SDK) and Python needs to reload it before every task execution. The goal was to benefit from what is already in place with vmware.vmware_rest: see my AnsibleFest presentation.

And while working on this, I realized that my metrics were not consistent: I was not able to reproduce some test cases I ran 2 months ago. After a quick investigation, it turned out that the Python version matters much more than expected.

To compare the different Python versions, I decided to run some tests.

The target host is a t2.medium instance (2 vCPUs, 4GiB) running on AWS. The operating system is Fedora 33, which is really handy here because it ships all the Python versions from 3.6 to 3.10!

I use the latest stable version of Ansible (2.10.3), which I install with pip in a Python virtual environment; each virtualenv gets the same list of dependencies.

Finally, I deploy Kubernetes on Podman with Kubernetes Kind.

For the first test, I use a Python one-liner to evaluate the time Python takes to load the OpenShift SDK. This is one of the operations that I want to optimize, so it matters a lot to me.

#!/bin/bash
python=$1
for i in $(seq 100); do
    echo ${i}
    venv-${python}/bin/python -c 'from openshift.dynamic import DynamicClient' > /dev/null 2>&1
done

Here the loading is done 100 times in a row.
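
The script expects the Python version as its argument and relies on the venv-<version> virtual environments mentioned earlier; a wrapper as simple as the time command is enough to reproduce this kind of measurement, e.g.:

time ./run.sh 3.9    # runs the import 100 times with venv-3.9/bin/python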

The result shows a steady improvement of the performance since Python 3.6.

             Python 3.6   Python 3.7   Python 3.8   Python 3.9   Python 3.10
time (sec)   48.401       45.088       41.751       40.924       40.385

With this test, the loading of the SDK is 16.5% faster with Python 3.10.

The next test does the same thing, but this time through Ansible. My test uses the following playbook:

- hosts: localhost
  gather_facts: false
  tasks:
    - k8s_info:
        kind: Pod
      with_sequence: count=100

It runs the k8s_info module 100 times in a row. In addition, I use an ansible.cfg with the following content. This way, ansible-playbook prints a nice summary of the task execution durations:

[defaults]
callback_whitelist = ansible.posix.profile_tasks

             Python 3.6   Python 3.7   Python 3.8   Python 3.9   Python 3.10
time (sec)   85.5         80.5         75.35        75.05        71.19

It's a 16.76% boost between Python 3.6 and Python 3.10. I was not expecting such a tight correlation between the two tests.

While Python is obviously not the fastest technology out there, it's great to see how its performance is getting better release after release. Python 3.10 is not even released yet and already looks promising.

If your playbooks use some modules with dependencies on large Python libraries, it may be interesting to give the latest Python versions a try.
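
Trying a newer interpreter is cheap. A minimal sketch, assuming python3.10 is already installed and that, like in my tests, the playbook needs the openshift package:

$ python3.10 -m venv ~/venv-3.10
$ ~/venv-3.10/bin/pip install ansible openshift
$ ~/venv-3.10/bin/ansible-playbook my_playbook.yaml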

And for those who are still running Python 2.7: I get a 49.2% performance boost between 2.7 and 3.10.

How to start minishift on Fedora-33

update: minishift is basically dead and won't support OpenShift 4. You probably want to use crc instead.

I just spent too much time trying to start minishift with my user account. After 3 hours fighting with permission issues between libvirt and the ~/.minishift directory, I finally decided to stay sane and use sudo… This is how I start minishift on my Fedora 33.

Install and start libvirt

sudo dnf install -y libvirt qemu-kvm
sudo systemctl start libvirtd

Fetch and install minishift and the kvm driver

sudo dnf install -y origin-clients
curl -L https://github.com/minishift/minishift/releases/download/v1.34.3/minishift-1.34.3-linux-amd64.tgz | sudo tar -C /usr/local/bin -xvz minishift-1.34.3-linux-amd64/minishift --strip-components=1
sudo curl -L https://github.com/dhiltgen/docker-machine-kvm/releases/download/v0.10.0/docker-machine-driver-kvm-centos7 -o /usr/local/bin/docker-machine-driver-kvm
sudo chmod +x /usr/local/bin/docker-machine-driver-kvm

Start minishift (with sudo…)

sudo minishift start

And configure the local user

sudo cp -r /root/.kube /home/goneri
sudo chown -R goneri:goneri /home/goneri/.kube

How to speed up your (API client) modules

The slide deck of my presentation for AnsibleFest 2020. It focuses on the modules designed to interact with a remote service (REST, SOAP, etc). In general these modules just wrap an SDK library; the presentation explains how to improve their performance. I actually use this strategy (ansible_turbo.module) with the vmware.vmware_rest collection to speed up the modules.

How we use auto-generated content in the documentation of our Ansible Collection

Introduction

Most of the content of the vmware.vmware_rest collection is auto-generated. This article focuses on the documentation and explains how we build it.

Auto-generated example blocks

This collection comes with an exhaustive series of functional tests. Technically speaking, these tests are just some Ansible playbooks that we run with ansible-playbook. They should exercise all the modules and, ideally, all the potential scenarios (e.g: create, modify, delete). If the playbook execution goes fine, the test is successful and we assume the modules are in a consistent state.

We can hardly auto-generate the prose of the documentation, but these playbooks are an interesting source of inspiration since they actually cover, and go beyond, all the use-cases that we want to document.

Our strategy is to record all the tasks and their results in a directory, and to have our documentation simply point at this content. This provides two interesting benefits:

  • We know our examples work fine because it’s actually the output of the CI.
  • When the format of a result changes, our documentation will take it into account automatically.

We import these files into our git repository, and git-diff shows us the differences from the previous version. It's an opportunity to spot a regression.

Cooking the collection

How do we collect the tasks and the results?

For this, we use a callback plugin ( https://github.com/goneri/ansible-collection-goneri.utils ). The configuration is done using three environment variables:

  • ANSIBLE_CALLBACK_WHITELIST=goneri.utils.collect_task_outputs: Ask Ansible to load the callback plugin.
  • COLLECT_TASK_OUTPUTS_COLLECTION=vmware.vmware_rest: Specify the name of the collection.
  • COLLECT_TASK_OUTPUTS_TARGET_DIR=/somewhere: Target directory where to write the results.


When we finally call the ansible-playbook command, the callback plugin is loaded, records all the interactions of the vmware.vmware_rest modules and stores the results in the target directory.

The final script looks like this:

#!/usr/bin/env bash
set -eux

export ANSIBLE_CALLBACK_WHITELIST=goneri.utils.collect_task_outputs
export COLLECT_TASK_OUTPUTS_COLLECTION=vmware.vmware_rest
export COLLECT_TASK_OUTPUTS_TARGET_DIR=$(realpath ../../../../docs/source/vmware_rest_scenarios/task_outputs/)
export INVENTORY_PATH=/tmp/inventory-vmware_rest
source ../init.sh
exec ansible-playbook -i ${INVENTORY_PATH} playbook.yaml

The documentation

Like a lot of Python projects, Ansible uses reStructuredText for its documentation. To include our samples we use the literalinclude directive. The result looks like this; the includes are done on lines 3 and 8:

Here we use ``vcenter_datastore_info`` to get a list of all the datastores:

.. literalinclude:: task_outputs/Retrieve_a_list_of_all_the_datastores.task.yaml

Result
------

.. literalinclude:: task_outputs/Retrieve_a_list_of_all_the_datastores.result.json

This is how the final result looks:

And the RETURN blocks?


Each Ansible module is supposed to come with a RETURN block (https://docs.ansible.com/ansible/latest/dev_guide/developing_modules_documenting.html#documentation-block) that describes the output of the module. Each key of the module output is documented in this structure.
The RETURN section and the task result above should be consistent. We can actually reformat the result and generate a JSON structure that matches the RETURN block expectation.
Once this is done, we just need to inject the content into the module file.

We reuse the task results in our modules with the following command:

./scripts/inject_RETURN.py ~/.ansible/collections/ansible_collections/vmware/vmware_rest/docs/source/vmware_rest_scenarios/task_outputs/ ~/git_repos/ansible-collections/vmware_rest/ --config-file config/inject_RETURN.yaml

vmware_rest: why a new Ansible Collection?

vmware.vmware_rest (https://galaxy.ansible.com/vmware/vmware_rest) is a new Ansible Collection for VMware. You can use it to manage the guests of your vCenter. If you're familiar with Ansible and VMware, you will notice this Collection overlaps with some features of community.vmware. You may think the two collections are competing and that it's a waste of resources. It's not that simple.

A bit of context is necessary to fully understand why it's not exactly the case. The development of the community.vmware collection started during the vCenter 6.0 cycle. At that time, the de facto SDK to build Python applications was pyvmomi, which you may know as the vSphere SDK for Python. This Python library relies on the SOAP interface that has been around for more than a decade. By comparison, the vSphere REST interface was still a novelty. Support for some important APIs was missing and the documentation was limited.

Today, the situation has evolved. Pyvmomi is not actively maintained anymore and some new services are only exposed on the REST interface; a good example is the tagging API. VMware has also introduced a new Python SDK called vSphere Automation SDK (https://github.com/vmware/vsphere-automation-sdk-python) to consume this new API. For instance, this is what community.vmware_tag_info uses underneath.

This new SDK comes at a cost for the users. They need to pull an extra Python dependency in addition to pyvmomi and, to make the situation worse, this library is not on PyPI (see: https://github.com/vmware/vsphere-automation-sdk-python/issues/38), Python's official package repository. They need to install it from GitHub instead. This is a source of confusion for our users.

From a development perspective, we don't like it when a module needs to load a gazillion Python dependencies, because this slows down the execution time and it's a source of complexity. But we cannot ditch pyvmomi immediately because a lot of modules rely on it. We could potentially rewrite these modules to use the vSphere Automation SDK.

These modules are already stable and of high quality. Many users depend on them. Modifying these modules to use the vSphere Automation SDK is risky. Any single regression would have wide impact.

Our users would be frustrated by a transition, especially because it would bring absolutely zero new features to them. It also means we would have to reproduce the exact same behaviour, and we would miss an opportunity to improve the modules.

Technically speaking, an application that consumes a REST interface doesn't really need an SDK. It can be handy sometimes, for instance for the authentication, but overall a standard HTTP client should be enough. After all, REST is meant to keep things simple.

The vSphere REST API is not always consistent, but it's well documented. VMware maintains a tool called vmware-openapi-generator (https://github.com/vmware/vmware-openapi-generator) to extract it in a machine-readable format (Swagger 2.0 or OpenAPI 3).

During our quest for a solution to this Python dependency problem, we designed a Proof of Concept (PoC). It was based on a set of modules with no dependency on any third-party library. And of course, these modules were auto-generated. We mentioned the conclusion of the PoC back in March, during the VMware / Ansible community weekly meeting (https://github.com/ansible/community/issues/423).

The feedback convinced us we were on the right path. And here we are, 5 months later. The first beta release of vmware.vmware_rest has just been announced on Ansible's blog!

CI of the Ansible modules for VMware: a retrospective

Simple VMware Provisioning, Management and Deprovisioning

Since January 2020, every new Ansible VMware pull request is tested by the CI against a real VMware lab. Building this CI environment around a real VMware lab has been a long journey, which I'll share in this blog post.

Ansible VMware provides more than 170 modules, each of them dedicated to a specific area. You can use them to manage your ESXi hosts, the vCenters, the vSAN, do the common day-to-day guest management, etc.

Our modules are maintained by a community of contributors, and a large number of the Pull Requests (PR) are contributions from newcomers. The classic scenario is a user who’s found a problem or a limitation with a module, and would like to address it.

This is the reason why our contributors are not necessarily developers. The average contributor doesn't necessarily have advanced Python experience, so we can hardly ask them to write Python unit tests. Requiring this level of work creates a barrier to contribution; it would be a source of confusion and frustration and we would lose a lot of valuable contributions. However, they are power users. They have a great understanding of VMware and Ansible, and so we maintain a test playbook for most of the modules.

Previously, when a new change was submitted, the CI was running the light Ansible sanity test-suite and an integration test against govcsim, a VMware API simulator (https://github.com/vmware/govmomi/tree/master/vcsim).

govcsim is a handy piece of software; you can start it locally to mock a vSphere infrastructure. But it doesn't fully support some important VMware components like the network devices or the datastores. As a consequence, the core reviewers were asked to download the changeset locally and run the functional tests against their own vSphere lab.

In our context, a vSphere lab is:

– a vCenter instance

– 2 ESXi

– 2 NFS datastores, with some pre-existing files.

We also had challenges with our test environment. Functional tests destroy or create network switches, enable IPv6, add new datastores, and rarely if ever restore the system to its initial configuration once complete. This left the labs in disarray, and the damage compounded with each series of tests. Consequently, the reviews were slow, and we were wasting days fixing our infrastructure. Since the tests were not reproducible and were run locally, it was hard to distinguish set-up errors from actual issues, and therefore hard to provide meaningful feedback to contributors: is this error coming from my set-up? I had to manually copy/paste the error for the contributor, sometimes several days after the initial commit.

This was a frustrating situation for us, and for the contributors. But well, we’ve spent years doing that…

You may think we like to suffer, which is probably true to some extent, but the real problem is that it's rather complex to automate the full deployment of a lab. vSphere (the VCSA) is an appliance VM in the OVA format. It has to be deployed on an ESXi. Officially, ESXi can't be virtualized, unless it runs on an ESXi itself. In addition, we use Evaluation licenses, and as a consequence, we cannot rely on features like snapshotting, and we have to redeploy our lab every 60 days.

We can do better! Some others did!

The Ansible network modules were facing similar challenges. Network devices are required to fully validate a change, but it's costly to stack and keep hundreds of devices operational just for validation.

They decided to invest in OpenStack and a CI solution called Zuul-CI (https://zuul-ci.org/). I don't want to elaborate too much on Zuul since the topic itself is worth a book. But basically, every time a change gets pushed, Zuul will spawn a multi-node test environment, prepare the test execution using… Ansible, yeah! And finally, run the test and collect the results. This environment makes use of appliances coming from the vendors. It's basically just a VM. OpenStack is pretty flexible for this use-case, especially when you've got top-notch support from the providers.

Let’s build some VMware Cloud images!

To run a VM in a cloud environment, it has to match the following requirements:

  • use one single disk image, a qcow2 in the case of OpenStack
  • support the hardware exposed by the hypervisor, qemu-kvm in our case
  • configure itself according to the metadata exposed by the cloud provider (IP, SSH keys, etc). This service is handled by Cloud-init most of the time.

ESXi cloud image

For ESXi, the first step was to deploy ESXi on libvirt/qemu-kvm. This works fine as long as we avoid virtio. And with a bit more effort, we can automate the process (https://github.com/virt-lightning/esxi-cloud-images). But our VM is not yet self-configuring. We need an alternative to Cloud-init. This is what esxi-cloud-init (https://github.com/goneri/esxi-cloud-init/) does for us. It reads the cloud metadata, prepares the network configuration of the ESXi host, and also injects the SSH keys.

The image build process is rather simple once you’ve got libvirt and virt-install on your machine:

$ git clone https://github.com/virt-lightning/esxi-cloud-images

$ cd esxi-cloud-images

$ ./build.sh ~/Downloads/VMware-VMvisor-Installer-7.0.0-15525992.x86_64.iso

(…)

$ ls esxi-6.7.0-20190802001-STANDARD.qcow2

The image can run on OpenStack, but also on libvirt. Virt-Lightning (https://virt-lightning.org/) is the tool we use to spawn our environment locally.

vCenter cloud image too?

update: See Ansible: How we prepare the vSphere instances of the VMware CI for a more detailed explanation of the VCSA deployment process.

We wanted to deploy vCenter on our instance, but this is daunting. vCenter has a slow installation process, it requires an ESXi host, and is extremely sensitive to any form of network configuration changes…

So the initial strategy was to spawn an ESXi instance, and deploy vCenter on it. This is handled by ansible-role-vcenter-instance (https://github.com/goneri/ansible-role-vcenter-instance). The full process takes about 25 minutes.

We became operational, but the deployment process overwhelmed our lab. Additionally, the ESXi instance (16GB of RAM) was too much to run on a laptop. I started investigating new options.

Technically speaking, the vCenter Server Appliance, or VCSA, is based on Photon OS, the Linux distribution of VMware, and the VM actually comes with 15 large disks. This is a bit problematic since our final cloud image must be a single disk and be as small as possible. I developed this strategy:

  1. connect to the running VCSA, move all the content from the extra partitions to the main partition, and drop the extra disks from /etc/fstab
  2. do some extra things regarding the network and Cloud-init configuration.
  3. stop the server
  4. extract the raw disk image from the ESXi datastore
  5. convert it to the qcow2 format
  6. and voilà! You’ve got a nice cloud image of your vCenter.

All the steps are automated by the following tool:  https://github.com/virt-lightning/vcsa_to_qcow2. It also enables virtio for better performance.

Preparing development environment locally

To simplify the deployment, I use the following tool: https://github.com/goneri/deploy-vmware-ci. It will use virt-lightning to spawn the nodes, and do the post-configuration with Ansible. It reuses the roles that we consume in the CI to configure the vCenter, the host names, and populate the datastore.

In this example, I use it to start my ESXi environment on my Lenovo T580 laptop; the full run takes 15 minutes: https://asciinema.org/a/349246

Being able to redeploy a work environment in 15 minutes has been a life changer. I often recreate it several times a day. In addition, the local deployment workflow reproduces what we do in the CI, so it's handy to validate a changeset or troubleshoot a problem.

The CI integration

Each Ansible module is different, which makes for different test requirements. We've got 4 topologies:

  • vcenter_only: one single vCenter instance
  • vcenter_1esxi_with_nested: one vCenter with an ESXi, this ESXi is capable of starting a nested VM.
  • vcenter_1_esxi_without_nested: the same, but this time we don't start a nested VM. Compared to the previous case, this set-up is compatible with all our providers.
  • vcenter_2_esxi_without_nested: well, like the previous one, but with a second ESXi, for instance to test HA or migration.

The nodeset definition is done in the following file: https://github.com/ansible/ansible-zuul-jobs/blob/master/zuul.d/nodesets.yaml

We split the hours-long test execution time across the different environments. An example of a job result:

As you can see, we still run govcsim in the CI, even if it's superseded by a real test environment. Since the govcsim jobs run faster, we assume that a failure there would also show up against the real lab, and we abort the other jobs. This is a way to save time and resources.

I would like to thank Chuck Copello for the helpful review of this blog post.

Cloud images, which one is the fastest?

Introduction

This post compares the start-up duration of the most popular cloud images. By start-up, I mean the time until we've got an operational SSH server.

For this test, I use a pet project called Virt-Lightning (https://virt-lightning.org/). This tool allows any Linux user to start standard cloud images locally. It prepares the meta-data and starts a VM in your local libvirt. It's very handy for people like me, who work on Linux and spend the day starting new VMs. The images are in the qcow2 format, and the tool uses the OpenStack meta-data format. Technically speaking, the performance should match what you get with OpenStack.

Actually, OpenStack is often slightly slower because it does some extra operations. It may need to create a volume on Ceph, or prepare extra network configuration.

The 2.0.0 release of Virt-Lightning exposes a public API. My test scenario is built on top of that. It uses Python to pull the different images, and creates a VM from each of them 10 times in a row.

All the images are public, Virt-Lightning can fetch them with the vl pull foo command:

vl pull centos-6

During the boot process, the VM will set up a static network configuration, resize the filesystem, create a user, and inject an SSH key.

By default, Virt-Lightning uses a static network configuration because it's faster, and it gives better performance when we start a large number of VMs at the same time. I chose to stick with this.

I did my tests on my Lenovo T580, which comes with NVMe storage and 32GB of memory. I would be curious to see the results with the same scenario, but on a regular spinning disk.

The target images

For this test, I compare the following Linux distributions: CentOS, Debian, Fedora, Ubuntu and OpenSUSE. As far as I know, there is no public Cloud image available for the other common distributions. If you think I’m wrong, please post a comment below.

I also included the latest FreeBSD, NetBSD and OpenBSD releases. They don't provide official cloud images. This is the reason why I reuse the unofficial ones from https://bsd-cloud-image.org/.

The lack of pre-existing Windows image is the reason why this OS is not included.

Results

Debian 10 is by far the fastest image with an impressive 15s on average. Basically, 5s less than any other Cloud Image.

Regarding the BSDs, FreeBSD is the only system able to resize the root filesystem without a reboot. Consequently, OpenBSD and NetBSD need to boot twice in a row. This explains the big difference. The NetBSD kernel hardware probe is rather slow, for instance it takes 5s to initialize the ATA bus of the CDROM. This is the reason why its results look rather bad.

About Ubuntu, I was surprised by the boot duration of Ubuntu 18.04. It is about two times longer than for 16.04. 20.04 is a bit better but still, we are far from the 15s of 14.04. I would be curious to know the origin of this. Maybe AppArmor?

The CentOS 6 results are not really consistent. They vary between 17.9s and 25.21s. This is the largest delta compared with the other distributions. This being said, CentOS 6 is rather old, and won't be supported anymore at the end of the year.

Conclusions

All the recent Linux images are based on systemd. It would be great to extract the metrics from systemd-analyze to understand what impacts the performance the most.
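
For the Linux images, this data is easy to collect once logged into the freshly booted VM, for instance:

systemd-analyze                  # time spent in kernel, initrd and userspace
systemd-analyze blame            # per-unit start-up duration, slowest first
systemd-analyze critical-chain   # units on the critical path of the default target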

Most of the time, when I deploy a test VM, the very first thing I do is install some important packages. This scenario may be covered later in another blog post.

Raw results for each image

CentOS 6

image from: https://cloud.centos.org/centos/6/images/CentOS-6-x86_64-GenericCloud.qcow2
Date: Thu, 08 Aug 2019 13:28:32 GMT
Size: 806748160

distro=centos-6, elapsed_time=023.20
distro=centos-6, elapsed_time=021.41
distro=centos-6, elapsed_time=024.97
distro=centos-6, elapsed_time=025.21
distro=centos-6, elapsed_time=020.29
distro=centos-6, elapsed_time=020.67
distro=centos-6, elapsed_time=020.13
distro=centos-6, elapsed_time=019.83
distro=centos-6, elapsed_time=020.09
distro=centos-6, elapsed_time=017.92

The average is 21.3s.

CentOS 7

image from: http://cloud.centos.org/centos/7/images/CentOS-7-x86_64-GenericCloud.qcow2
Date: Wed, 22 Apr 2020 12:24:07 GMT
Size: 858783744

distro=centos-7, elapsed_time=020.88
distro=centos-7, elapsed_time=020.51
distro=centos-7, elapsed_time=020.42
distro=centos-7, elapsed_time=020.58
distro=centos-7, elapsed_time=020.18
distro=centos-7, elapsed_time=021.14
distro=centos-7, elapsed_time=020.74
distro=centos-7, elapsed_time=020.80
distro=centos-7, elapsed_time=020.48
distro=centos-7, elapsed_time=020.15

Average: 20.5s

CentOS 8

image from: https://cloud.centos.org/centos/8/x86_64/images/CentOS-8-GenericCloud-8.1.1911-20200113.3.x86_64.qcow2
Date: Mon, 13 Jan 2020 21:57:45 GMT
Size: 716176896

distro=centos-8, elapsed_time=023.55
distro=centos-8, elapsed_time=023.27
distro=centos-8, elapsed_time=024.39
distro=centos-8, elapsed_time=023.61
distro=centos-8, elapsed_time=023.52
distro=centos-8, elapsed_time=023.49
distro=centos-8, elapsed_time=023.53
distro=centos-8, elapsed_time=023.30
distro=centos-8, elapsed_time=023.34
distro=centos-8, elapsed_time=023.67

Average: 23.5s

Debian 9

image from: https://cdimage.debian.org/cdimage/openstack/current-9/debian-9-openstack-amd64.qcow2
Date: Wed, 29 Jul 2020 09:59:59 GMT
Size: 594190848

distro=debian-9, elapsed_time=020.69
distro=debian-9, elapsed_time=020.59
distro=debian-9, elapsed_time=020.16
distro=debian-9, elapsed_time=020.30
distro=debian-9, elapsed_time=020.02
distro=debian-9, elapsed_time=020.01
distro=debian-9, elapsed_time=020.71
distro=debian-9, elapsed_time=020.48
distro=debian-9, elapsed_time=020.65
distro=debian-9, elapsed_time=020.57

Average is 20.4s.

Debian 10

image from: https://cdimage.debian.org/cdimage/openstack/current-10/debian-10-openstack-amd64.qcow2
Date: Sat, 01 Aug 2020 20:10:01 GMT
Size: 530629120

distro=debian-10, elapsed_time=015.25
distro=debian-10, elapsed_time=015.28
distro=debian-10, elapsed_time=014.88
distro=debian-10, elapsed_time=015.07
distro=debian-10, elapsed_time=015.39
distro=debian-10, elapsed_time=015.35
distro=debian-10, elapsed_time=015.47
distro=debian-10, elapsed_time=014.94
distro=debian-10, elapsed_time=015.57
distro=debian-10, elapsed_time=015.57

Average is 15.2s

Debian testing

Debian testing is a rolling release, so I won’t include it in the charts, but I found interesting to include it in the results.

image from: https://cdimage.debian.org/cdimage/openstack/testing/debian-testing-openstack-amd64.qcow2
Date: Mon, 01 Jul 2019 08:39:27 GMT
Size: 536621056

distro=debian-testing, elapsed_time=015.07
distro=debian-testing, elapsed_time=015.03
distro=debian-testing, elapsed_time=014.93
distro=debian-testing, elapsed_time=015.33
distro=debian-testing, elapsed_time=014.85
distro=debian-testing, elapsed_time=015.53
distro=debian-testing, elapsed_time=014.94
distro=debian-testing, elapsed_time=015.22
distro=debian-testing, elapsed_time=015.19
distro=debian-testing, elapsed_time=014.86

Average 15s

Fedora 31

image from: https://download.fedoraproject.org/pub/fedora/linux/releases/31/Cloud/x86_64/images/Fedora-Cloud-Base-31-1.9.x86_64.qcow2
Date: Wed, 23 Oct 2019 23:06:38 GMT
Size: 355350528

distro=fedora-31, elapsed_time=020.48
distro=fedora-31, elapsed_time=020.39
distro=fedora-31, elapsed_time=020.37
distro=fedora-31, elapsed_time=020.30
distro=fedora-31, elapsed_time=020.29
distro=fedora-31, elapsed_time=020.31
distro=fedora-31, elapsed_time=020.50
distro=fedora-31, elapsed_time=020.51
distro=fedora-31, elapsed_time=020.27
distro=fedora-31, elapsed_time=020.91

Average 20.4s

Fedora 32

image from: https://download.fedoraproject.org/pub/fedora/linux/releases/32/Cloud/x86_64/images/Fedora-Cloud-Base-32-1.6.x86_64.qcow2
Date: Wed, 22 Apr 2020 22:36:57 GMT
Size: 302841856

distro=fedora-32, elapsed_time=021.68
distro=fedora-32, elapsed_time=022.43
distro=fedora-32, elapsed_time=022.17
distro=fedora-32, elapsed_time=023.06
distro=fedora-32, elapsed_time=022.23
distro=fedora-32, elapsed_time=022.83
distro=fedora-32, elapsed_time=022.54
distro=fedora-32, elapsed_time=021.46
distro=fedora-32, elapsed_time=022.37
distro=fedora-32, elapsed_time=023.14

Average: 22.4s

FreeBSD 11.4

image from: https://bsd-cloud-image.org/images/freebsd/11.4/freebsd-11.4.qcow2
Date: Wed, 05 Aug 2020 01:24:32 GMT
Size: 412895744

distro=freebsd-11.4, elapsed_time=030.68
distro=freebsd-11.4, elapsed_time=030.64
distro=freebsd-11.4, elapsed_time=030.29
distro=freebsd-11.4, elapsed_time=030.29
distro=freebsd-11.4, elapsed_time=029.86
distro=freebsd-11.4, elapsed_time=029.74
distro=freebsd-11.4, elapsed_time=029.90
distro=freebsd-11.4, elapsed_time=029.77
distro=freebsd-11.4, elapsed_time=030.04
distro=freebsd-11.4, elapsed_time=029.70

Average 30s

FreeBSD 12.1

image from: https://bsd-cloud-image.org/images/freebsd/12.1/freebsd-12.1.qcow2
Date: Wed, 05 Aug 2020 01:46:11 GMT
Size: 479029760

distro=freebsd-12.1, elapsed_time=029.78
distro=freebsd-12.1, elapsed_time=030.32
distro=freebsd-12.1, elapsed_time=029.56
distro=freebsd-12.1, elapsed_time=029.60
distro=freebsd-12.1, elapsed_time=029.76
distro=freebsd-12.1, elapsed_time=029.89
distro=freebsd-12.1, elapsed_time=029.55
distro=freebsd-12.1, elapsed_time=029.66
distro=freebsd-12.1, elapsed_time=029.31
distro=freebsd-12.1, elapsed_time=029.77

Average 29.7

NetBSD 8.2

image from: https://bsd-cloud-image.org/images/netbsd/8.2/netbsd-8.2.qcow2
Date: Wed, 05 Aug 2020 02:06:57 GMT
Size: 155385856

distro=netbsd-8.2, elapsed_time=066.71
distro=netbsd-8.2, elapsed_time=067.80
distro=netbsd-8.2, elapsed_time=067.15
distro=netbsd-8.2, elapsed_time=066.97
distro=netbsd-8.2, elapsed_time=066.84
distro=netbsd-8.2, elapsed_time=067.01
distro=netbsd-8.2, elapsed_time=066.98
distro=netbsd-8.2, elapsed_time=067.73
distro=netbsd-8.2, elapsed_time=067.34
distro=netbsd-8.2, elapsed_time=066.90

Average 67.1

NetBSD 9.0

image from: https://bsd-cloud-image.org/images/netbsd/9.0/netbsd-9.0.qcow2
Date: Wed, 05 Aug 2020 02:25:11 GMT
Size: 149291008

distro=netbsd-9.0, elapsed_time=067.04
distro=netbsd-9.0, elapsed_time=066.92
distro=netbsd-9.0, elapsed_time=066.89
distro=netbsd-9.0, elapsed_time=067.24
distro=netbsd-9.0, elapsed_time=067.41
distro=netbsd-9.0, elapsed_time=067.13
distro=netbsd-9.0, elapsed_time=066.14
distro=netbsd-9.0, elapsed_time=066.75
distro=netbsd-9.0, elapsed_time=067.25
distro=netbsd-9.0, elapsed_time=066.60

Average: 66.9s

OpenBSD 6.6

image from: https://bsd-cloud-image.org/images/openbsd/6.7/openbsd-6.7.qcow2
Date: Wed, 05 Aug 2020 04:09:44 GMT
Size: 520704512

distro=openbsd-6.6, elapsed_time=048.80
distro=openbsd-6.6, elapsed_time=049.72
distro=openbsd-6.6, elapsed_time=049.07
distro=openbsd-6.6, elapsed_time=048.36
distro=openbsd-6.6, elapsed_time=049.28
distro=openbsd-6.6, elapsed_time=049.12
distro=openbsd-6.6, elapsed_time=049.36
distro=openbsd-6.6, elapsed_time=049.80
distro=openbsd-6.6, elapsed_time=048.05
distro=openbsd-6.6, elapsed_time=049.71

Average: 49.1s

OpenBSD 6.7

image from: https://bsd-cloud-image.org/images/openbsd/6.7/openbsd-6.7.qcow2
Date: Wed, 05 Aug 2020 04:09:44 GMT
Size: 520704512

distro=openbsd-6.7, elapsed_time=048.81
distro=openbsd-6.7, elapsed_time=048.96
distro=openbsd-6.7, elapsed_time=049.86
distro=openbsd-6.7, elapsed_time=049.12
distro=openbsd-6.7, elapsed_time=049.75
distro=openbsd-6.7, elapsed_time=050.63
distro=openbsd-6.7, elapsed_time=050.85
distro=openbsd-6.7, elapsed_time=049.92
distro=openbsd-6.7, elapsed_time=048.98
distro=openbsd-6.7, elapsed_time=050.83

Average: 49.7s

Ubuntu 14.04

image from: https://cloud-images.ubuntu.com/trusty/current/trusty-server-cloudimg-amd64-disk1.img
Date: Thu, 07 Nov 2019 15:38:05 GMT
Size: 264897024

distro=ubuntu-14.04, elapsed_time=014.40
distro=ubuntu-14.04, elapsed_time=014.42
distro=ubuntu-14.04, elapsed_time=014.94
distro=ubuntu-14.04, elapsed_time=015.44
distro=ubuntu-14.04, elapsed_time=015.64
distro=ubuntu-14.04, elapsed_time=014.59
distro=ubuntu-14.04, elapsed_time=015.02
distro=ubuntu-14.04, elapsed_time=015.22
distro=ubuntu-14.04, elapsed_time=015.44
distro=ubuntu-14.04, elapsed_time=015.44

Average: 15s

Ubuntu 16.04

image from: https://cloud-images.ubuntu.com/xenial/current/xenial-server-cloudimg-amd64-disk1.img
Date: Thu, 13 Aug 2020 08:36:38 GMT
Size: 309657600

distro=ubuntu-16.04, elapsed_time=015.13
distro=ubuntu-16.04, elapsed_time=015.39
distro=ubuntu-16.04, elapsed_time=015.42
distro=ubuntu-16.04, elapsed_time=015.62
distro=ubuntu-16.04, elapsed_time=015.29
distro=ubuntu-16.04, elapsed_time=015.60
distro=ubuntu-16.04, elapsed_time=015.62
distro=ubuntu-16.04, elapsed_time=015.21
distro=ubuntu-16.04, elapsed_time=015.62
distro=ubuntu-16.04, elapsed_time=015.67

Average: 15.4

Ubuntu 18.04

image from: https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img
Date: Wed, 12 Aug 2020 16:58:30 GMT
Size: 357302272

distro=ubuntu-18.04, elapsed_time=028.58
distro=ubuntu-18.04, elapsed_time=028.25
distro=ubuntu-18.04, elapsed_time=028.36
distro=ubuntu-18.04, elapsed_time=028.45
distro=ubuntu-18.04, elapsed_time=028.79
distro=ubuntu-18.04, elapsed_time=028.28
distro=ubuntu-18.04, elapsed_time=028.11
distro=ubuntu-18.04, elapsed_time=028.07
distro=ubuntu-18.04, elapsed_time=027.75
distro=ubuntu-18.04, elapsed_time=028.25

Average: 28.3s

Ubuntu 20.04

image from: https://cloud-images.ubuntu.com/focal/current/focal-server-cloudimg-amd64.img
Date: Mon, 10 Aug 2020 22:19:47 GMT
Size: 545587200

distro=ubuntu-20.04, elapsed_time=023.23
distro=ubuntu-20.04, elapsed_time=022.74
distro=ubuntu-20.04, elapsed_time=023.20
distro=ubuntu-20.04, elapsed_time=022.96
distro=ubuntu-20.04, elapsed_time=024.04
distro=ubuntu-20.04, elapsed_time=024.06
distro=ubuntu-20.04, elapsed_time=023.60
distro=ubuntu-20.04, elapsed_time=023.88
distro=ubuntu-20.04, elapsed_time=023.24
distro=ubuntu-20.04, elapsed_time=024.27

Average: 23.5s

OpenSUSE Leap 15.2

image from: https://download.opensuse.org/repositories/Cloud:/Images:/Leap_15.2/images/openSUSE-Leap-15.2-OpenStack.x86_64.qcow2
Date: Sun, 07 Jun 2020 11:42:01 GMT
Size: 566047744

distro=opensuse-leap-15.2, elapsed_time=027.10
distro=opensuse-leap-15.2, elapsed_time=027.61
distro=opensuse-leap-15.2, elapsed_time=027.07
distro=opensuse-leap-15.2, elapsed_time=027.12
distro=opensuse-leap-15.2, elapsed_time=027.57
distro=opensuse-leap-15.2, elapsed_time=026.86
distro=opensuse-leap-15.2, elapsed_time=027.25
distro=opensuse-leap-15.2, elapsed_time=027.10
distro=opensuse-leap-15.2, elapsed_time=027.69
distro=opensuse-leap-15.2, elapsed_time=027.39

Average: 27.3s