performance: expanduser with pathlib or os.path

Python 3 provides a new fancy library to manage pretty much all the path-related operations. This is a really welcome improvement since, before that, we had to use a long list of unrelated modules.

I recently had to choose between pathlib and os.path to expand a string in the ~/path format to an absolute path. Since performance was important, I took the time to benchmark the two options:

#!/usr/bin/env python3

import timeit

setup = '''
from pathlib import PosixPath
'''
with_pathlib = timeit.timeit("abs_remote_tmp = str(PosixPath('~/.ansible/tmp').expanduser())", setup=setup)

setup = '''
from os.path import expanduser
'''

with_os_path = timeit.timeit("abs_remote_tmp = expanduser('~/.ansible/tmp')", setup=setup)

print(f"with pathlib: {with_pathlib}\nwith os.path: {with_os_path}")

os.path is about 4 times faster (over the default 1,000,000 timeit iterations) for this very specific case. The fact that we need to instantiate a PosixPath object has an impact. Also, once again, we observe a nice performance boost from Python 3.8 onwards.
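
To see how much of the gap comes from the PosixPath instantiation alone, a quick follow-up benchmark is easy to add. This is an extra check, not part of the original test, and the numbers will vary on your machine:

import timeit

# Measure the PosixPath instantiation alone, without the expanduser() call.
instantiation_only = timeit.timeit(
    "PosixPath('~/.ansible/tmp')",
    setup="from pathlib import PosixPath",
)
print(f"PosixPath instantiation only: {instantiation_only}")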

Ansible collections and venv

I work on a large number of collections and, in order to test them properly, I have to switch between Python versions and the associated PyPI dependencies. Nothing special here, this is pretty much the life of all of us who work on Ansible collections.

Initially, I was maintaining a set of clean Python virtual environments, basically one per version of Python, and I was juggling between them. Sadly, it's easy to lose track of what's going on. Every time I switched to a different collection, I had to pull a new set of dependencies, and the order was never the same.

I ended up being genuinely frustrated by the time wasted looking at the pip freeze output to understand some oddity. It's so easy to mess up the whole cathedral. A good example is that I use pip install -e git/something a lot to install a local copy of a library, and as a result, any change there can potentially nuke the fragile little creature.

So now, I use another approach. I've got a script that spawns a virtual environment on the fly, pulls the right dependencies and initializes the shell. It may sound like a trivial thing, but I actually use it several times every day and I don't call pip freeze that much anymore.

For instance, if I need to work with Ansible 2.10 and Python 3.10, I just need to do:

$ cd .ansible/collections/ansible_collections/vmware/vmware_rest
$ source ~/bin/ansible-venv.fish 3.10 stable-2.10

and I'm ready to run ansible-playbook or ansible-test in my clean environment. And when I want to reinitialize the venv, I just have to remove the venv directory.

The script is here and depends on Fish, my favorite shell.
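
For those who do not use Fish, the core of the idea fits in a few lines of Python. This is only a hypothetical sketch, not the actual script: the cache directory, the archive URL and the dependency handling are assumptions.

#!/usr/bin/env python3
# Hypothetical sketch of ansible-venv: create a throwaway virtualenv for a
# given Python version and Ansible branch, then start a shell inside it.
import os
import subprocess
import sys
from pathlib import Path

python_version, ansible_branch = sys.argv[1:3]  # e.g. "3.10" "stable-2.10"
venv_dir = Path.home() / ".cache" / "ansible-venv" / f"{python_version}-{ansible_branch}"

if not venv_dir.exists():
    # create the virtualenv with the requested interpreter...
    subprocess.run([f"python{python_version}", "-m", "venv", str(venv_dir)], check=True)
    # ...and pull Ansible from the requested branch (assumed URL scheme)
    subprocess.run(
        [str(venv_dir / "bin" / "pip"), "install",
         f"https://github.com/ansible/ansible/archive/{ansible_branch}.tar.gz"],
        check=True)

# start an interactive shell with the virtualenv activated
env = dict(os.environ, VIRTUAL_ENV=str(venv_dir),
           PATH=f"{venv_dir / 'bin'}:{os.environ['PATH']}")
subprocess.run([os.environ.get("SHELL", "/bin/bash")], env=env)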

0ad, Firewall and Fedora33

By default, the firewall will prevent connections to your 0ad server. To adjust that, you need to open up port 20595 (UDP). These three commands create a firewalld service called 0ad, attach it to the default zone and reload the firewall:

$ sudo firewall-cmd --permanent --new-service=0ad --set-description="0ad, A free, open-source game of ancient warfare" --add-port=20595/udp
$ sudo firewall-cmd --zone=FedoraWorkstation --add-service=0ad --permanent
$ sudo firewall-cmd --reload

eGPU, Wayland, Gnome3 and Fedora

I've just got a Razer Core X that I use with a Radeon graphics card. By default, Wayland, well Mutter actually, continues to use the Intel card of my T580.

To force it to use the second card, I had to add a udev rule and reboot. And that's all!

$ cat /etc/udev/rules.d/61-mutter-primary-gpu.rules
ENV{DEVNAME}=="/dev/dri/card1", TAG+="mutter-device-preferred-primary"

note: You need Gnome 3.38.2 for this to work properly. See: https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1562

update: the post was initially written for Fedora 33, but the fix also works great with Fedora 34.

How to waste a Friday…

Yesterday morning I got frustrated by a really slow download speed on some files. What should have taken seconds with my 400Mb/s connection would actually take more than 16 minutes. In addition, I was able to reproduce the problem on my router.

Here, the curl statistics show a 26:52-minute-long download at 415kB/s:

$ curl -o /dev/null https://s3.us-east-2.amazonaws.com/zuul-images/fedora-open-vm-tools-livecd.iso
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  627M    0 2684k    0     0   398k      0  0:26:52  0:00:06  0:26:46  415k^C

Just to be sure, I retry with a lower MTU. The issue remains. Two friends in Montréal also confirm everything works well for them.

My laptop uses a wired connection that I've been using for years now. A large part of the population works from home and the Internet is under constant pressure. So I suspected an ISP bandwidth limitation to preserve the quality of service. I contacted them and… complained… a lot.

At some point, the technician manages to get my attention and asks me to connect the modem directly to my laptop.

This is an obviously pointless request since the downloads are also slow from the router. But well, if I want them to act, I also need to be cooperative. I give it a try and… I immediately feel embarrassed. It is damn fast! The download speed is actually even above the 400Mb/s ceiling. WTF.

So I start to reconsider my whole life in a deep introspection. How can this be real? My router runs OpenBSD 6.7 with an absolutely basic pf configuration. Its hardware is decent (Intel(R) Core(TM) i3-2120 CPU @ 3.30GHz). The router is connected to the modem with a very common Realtek RTL8125 and the LAN is connected to an Intel I350. I can download large files between my laptop and the router at full speed.

All this takes my laptop, the wire and the router's I350 out of the equation. The last potential culprit is the Realtek NIC. I replace it with another Intel I350 and retry. Same problem. The downloads are still slow on my laptop and the router. The Realtek NIC is innocent.

So I start thinking: why is the rest of the family not really affected by the poor performance of the Internet connection? And so, I try to download the same file from another computer. And it's fast… Damn, what's going on? I swap some ports on the switch just to be sure. The problem is now narrowed down to my laptop and its wire.

My laptop uses a bonding of the Wifi and the wired connection. This gives me the ability to move around with the laptop without losing all my open connections. For instance, I can unplug the wire in the middle of a meeting and the laptop will use the Wifi seamlessly. But well, I digress. I remove this special configuration. And… the problem remains.

But during this last check, I also saw a large number of RX errors associated with the laptop NIC. Interesting, let’s try another NIC. I plug in a USB3 Realtek RTL8156 NIC and this time… it works.

The wire is a CAT6 cable and is not that long (20m), but it sounds like the NIC (Intel I219-LM) of my T580 is a bit picky with the quality of the signal. It could also be a problem with the e1000e driver of the new 5.10 kernel. The cable is good, I've just tested it. Anyway, I've put a switch just before my laptop NIC and everything works great now.

I'm still not sure why my downloads are still slow on OpenBSD, but this is an adventure for another day. The slow downloads were all with HTTPS sites (S3 and a Caddy website), and the DF flag was on (TCP Don't Fragment), which exacerbates the impact of the transmission errors.

I found the whole situation to be interesting. It’s a series of wrong assumptions and the solution is really far from what I would have imagined.

Also, thank you Teksavvy for your great support.

update: this partially explains the OpenBSD S3 download problem.

update2: I’m now running Linux 5.10.11 and… I don’t see any RX errors anymore! The S3 download is just as fast as it should be. So this was indeed a problem with the e1000e driver.

update3: The problem is back. I'm not sure if it's a hardware limitation of the NIC itself. I now use the Realtek NIC all the time.

My GitHub workflow

The Ansible community uses GitHub to develop ansible-core and most of the Ansible collections. The only exception I know of is OpenStack's ansible-collections-openstack, which uses Gerrit.

So, as an Ansible developer, my normal day-to-day activities involve a lot of GitHub interactions. I review Pull Requests (PRs) and prepare new PRs all the time.

Before joining Ansible, I was working with Gerrit, which is a nice alternative solution for collaborating on a stream of patches.

In Gerrit, each patch from a branch is a PR. Every time we update a patch, its commit SHA changes, so Gerrit tracks patches with a dedicated ID called a Change-Id. It looks like an extra line in the body of the commit message, e.g.:

     Change-Id: Ic8aaa0728a43936cd4c6e1ed590e01ba8f0fbf5b

Gerrit provides a tool called git-review to pull and push the patches. When a contributor pushes a series of patches, each patch is correctly tracked by Gerrit and updates the right existing PR. This allows the contributor to reorganize the patches, change the order of a series or import a patch from another branch.

With GitHub, a branch is a PR and, most of the time, projects prefer to use the branch to trace the iterations of the PR:

  • my fancy feature
  • fix: correct the test-suite
  • fix: fix the fix
  • fox: typo in previous commit
  • bla

And this is fine, because most of the time, the branch will ultimately be squashed (one branch -> one Git commit) during the final merge.

The GitHub workflow is certainly friendlier for newcomers, but it tends to be a source of complexity when you want to work on several PRs at the same time. For instance, I work on a new feature, but I also want to cherry-pick an experimental commit from a contributor. In this case, I must remove this commit before I push my branch back to GitHub, or the extra commit will end up in my feature branch.

Another example: if I'm working on a feature branch and find an issue with something unrelated, I need to switch to another branch to commit my fix and push it. This is cumbersome, and often people just prefer to merge the fix into their feature branch, which leads to confusion and questions during the code review.

To put it simply, Gerrit allows better code modularity but also requires a better understanding of Git, which is a hurdle when we try to attract new contributors. This is the reason why we use the current workflow.

To address the problem I wrote a script called push-patch (https://github.com/goneri/push-patch). I use it to push just my commits. For instance, I work on this branch:

  • 1: doc: explain how to do something
  • 2: typo: adjust a little details
  • 3: a workaround for issue #19 that should not be merged

The first two commits are not directly related to the feature I'm implementing, and I would like to submit them immediately.

push-patch will allow me to push only changes 1 and 2, in two dedicated PRs. Both branches will be based on main and can be merged independently.

$ push-patch 1
$ push-patch 2

Now, and that's the cool part 😋! Let's imagine I want to push another revision of my first patch: I can use "git rebase -i" to adjust this commit and use push-patch again to push the updated patch.

$ git rebase -i main    # mark the first patch as 'edit'
$ vim foo
$ git add foo
$ git rebase --continue
$ ./push-patch 1

Internally, push-patch uses git-notes to trace the remote branch of the patch. The Public-Branch field traces the name of the branch in my remote clone of the project and PR-Url is the URL of the PR in the upstream project, e.g.:

commit 1198db8807ebf9f4099598bcd41df25d465cbcae (HEAD -> main)
Author: Gonéri Le Bouder <goneri@lebouder.net>
Date:   Thu Jan 7 11:31:41 2021 -0500

   elb_application_lb: enable the functional test
    
   Remove the `unsupported` aliases for the `elb_application_lb` test.
    
   Use HTTP instead of HTTPS to avoid the dependency on
   `iam:ListServerCertificates` and the other Certificate related operations.

Notes:
   Public-Branch: elb_application_lb-enable-the-functional-test_24328
    
   PR-Url: https://github.com/ansible-collections/community.aws/pull/348

This means that even if the patch content evolves, push-patch will still be able to continue to update the right PR.

In a nutshell, for each patch it will:

  1. clone the project and switch to the main branch
  2. read the patch notes
    1. if a branch name already exists it will use it, otherwise it will create a new one
  3. switch to the branch
  4. cherry-pick the patch
  5. push the branch
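
As an illustration, here is a rough Python sketch of that logic. It is not the real push-patch (which is a shell script built on hub); the "origin" remote, the "main" branch and the branch-naming scheme are assumptions, and the real script works in a scratch clone rather than switching branches in place.

#!/usr/bin/env python3
# Hypothetical sketch of the push-patch logic described above.
import subprocess
import sys


def git(*args):
    """Run a git command and return its stripped stdout."""
    return subprocess.run(
        ["git", *args], check=True, text=True, capture_output=True
    ).stdout.strip()


sha = sys.argv[1]

# read the patch notes to look for an existing Public-Branch entry
try:
    notes = git("notes", "show", sha)
except subprocess.CalledProcessError:
    notes = ""
branch = next(
    (line.split(":", 1)[1].strip()
     for line in notes.splitlines()
     if line.startswith("Public-Branch:")),
    None,
)
if branch is None:
    # no note yet: derive a new branch name and record it in the notes
    subject = git("log", "-1", "--pretty=%f", sha)  # sanitized subject line
    branch = f"{subject}_{sha[:6]}"
    git("notes", "append", "-m", f"Public-Branch: {branch}", sha)

# rebuild the branch on top of main, cherry-pick the patch and push it
current = git("rev-parse", "--abbrev-ref", "HEAD")
git("branch", "-f", branch, "origin/main")
git("switch", branch)
git("cherry-pick", sha)
git("push", "--force", "origin", branch)
git("switch", current)
print(f"pushed {sha[:8]} to {branch}")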

push-patch expects just the SHA of the commit to push. It also accepts a list of SHAs. This is the reason why I often type things like this:

push-patch $(git log -2 --pretty=tformat:%H)

The command passes to push-patch the SHAs of the last two commits. push-patch will push them to the two associated branches upstream. And at the end, I can use git log, or better tig, to get the URL of the GitHub review.

Right now, the command is a shell script and depends on the hub command. I would like to rewrite it with a better programming language.

What about you? Do you also use some special tools to handle your PR?

Ansible: How we prepare the vSphere instances of the VMware CI

As quickly explained in CI of the Ansible modules for VMware: a retrospective, the Ansible CI uses OpenStack to spawn ephemeral vSphere labs. Our CI tests are run against them.

A full vSphere deployment is a long process that requires quite a lot of resources. In addition to that, vSphere is rather picky regarding its execution environment.

The CI of the VMware modules for Ansible runs on OpenStack. Our OpenStack providers use KVM-based hypervisors, and they expect images in the qcow2 format.

In this blog post, we will explain how we prepare a cloud image of vSphere (also called golden image).

a full lab running on libvirt

First thing, get a large ESXi instance

The vSphere (VCSA) installation process depends on an ESXi host. In our case, we use a script and Virt-Lightning to prepare and run an ESXi image on libvirt. But you can use your own ESXi node as long as it respects the following minimal constraints:

  • 12GB of memory
  • 50GB of disk space
  • 2 vCPUs

Deploy the vSphere VM (VCSA)

For this, I use my own role called goneri.ansible-role-vcenter-instance. It delegates the deployment to the vcsa-deploy command. As a result, you don't need any human interaction during the full process. This is handy if you want to deploy your vSphere in a CI environment.

At the end of the process, you’ve got a large VM running on your ESXi node.

In my case, all these steps are handled by the following playbook: https://github.com/virt-lightning/vcsa_to_qcow2/blob/master/install_vcsa.yml

Tune-up the instance

Before you shut down the freshly created VM, you will want to do some adjustments.
I use the following playbook for this: prepare_vm.yml

During this step, I ensure that:

  • Cloud-Init is installed,
  • the root account is enabled with a real shell,
  • the virtio drivers are available

Cloud-Init is the de-facto tool that handles all the post-configuration tasks we can expect from a cloud image: inject the user's SSH key, resize the filesystem, create a user account, etc.

By default, the vSphere VCSA comes with a gazillion disks. This is a problem in a cloud environment, where an instance is associated with a single disk image.
So I also move the content of the different partitions into the root filesystem and adjust /etc/fstab to remove all the references to the other disks. This way, I only have to maintain one qcow2 image.

All these steps are handled by the following playbook: prepare_vm.yml
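
To illustrate the fstab part of that cleanup, here is a small hypothetical Python sketch of the idea. The list of mount points to keep is an assumption, and the authoritative version remains the playbook above.

#!/usr/bin/env python3
# Hypothetical sketch: once the content of the extra partitions has been
# copied into the root filesystem, comment out every fstab entry that still
# points at another disk.

KEEP_MOUNTPOINTS = {"/", "none", "swap"}  # assumption about what must stay

with open("/etc/fstab") as f:
    lines = f.readlines()

with open("/etc/fstab", "w") as f:
    for line in lines:
        fields = line.split()
        if not fields or line.lstrip().startswith("#"):
            f.write(line)  # keep blank lines and comments untouched
        elif fields[1] in KEEP_MOUNTPOINTS:
            f.write(line)  # keep the root filesystem and pseudo entries
        else:
            f.write("#" + line)  # entry for an extra disk: comment it out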

Prepare the final Qcow2 image

At this stage, the VM is still running, so I shut it down.
Once this is done, I extract the raw image of the disk using the curl command:

curl -v -k --user 'root:!234AaAa56' -o vCenterServerAppliance.raw 'https://192.168.123.5/folder/vCenter-Server-Appliance/vCenter-Server-Appliance-flat.vmdk?dcPath=ha%252ddatacenter&dsName=local'

  • root:!234AaAa56 is my login and password
  • vCenterServerAppliance.raw is the name of the local file
  • 192.168.123.5 is the IP address of my ESXi
  • vCenter-Server-Appliance is the name of the vSphere instance
  • vCenter-Server-Appliance-flat.vmdk is the associated raw disk

The local .raw file is large (50GB), ensure you’ve got enough free space.

You can finally convert the raw file to a qcow2 file. You can use Qemu's qemu-img for that, and it will work fine, BUT the image will be monstrously large. Instead, I use virt-sparsify from the libguestfs project. This command will reduce the size of the image to the bare minimum.

virt-sparsify --tmp tmp --compress --convert qcow2 vCenterServerAppliance.raw vSphere.qcow2

Conclusion

You can upload the image to your OpenStack project with the following command:

openstack image create --disk-format qcow2 --file vSphere.qcow2 --property hw_qemu_guest_agent=no vSphere

If your OpenStack provider uses Ceph, you will probably want to reconvert the image to a flat raw file before the upload. With vSphere 6.7U3 and before, you need to force the use of an e1000 NIC. For that, add --property hw_vif_model=e1000 to the command above.

I've just done the whole process with vSphere 7.0.0U1 in 1h30 (Lenovo T580 laptop). I use the ./run.sh script from https://github.com/virt-lightning/vcsa_to_qcow2, which automates everything.

The final result is certainly not supported by VMware, but we've already run hundreds of successful CI jobs with this kind of vSphere instance. The CI prepares a fresh CI lab in around 10 minutes.

Ansible and k8s: How to get the K8S_AUTH_API_KEY value?

The community.kubernetes collection accepts an api_key parameter that may sound a bit confusing. It's actually the value of the token of a service account: an OAuth 2.0 (Bearer) token associated with a user and a secret key. It's rather similar to what we can do with a login and a password.

In this example, we want to run our playbook as the k8sadmin user. We need to find the token associated with this user. What we are actually looking for is a secret. You can list them this way:

[root@kind-vm ~]# kubectl -n kube-system get secret
NAME                                             TYPE                                  DATA   AGE
(...)
foobar                                           Opaque                                0      5h3m
foobar-token-w8lmt                               kubernetes.io/service-account-token   3      5h15m
foobar2-token-hpd6f                              kubernetes.io/service-account-token   3      5h9m
generic-garbage-collector-token-l7hvk            kubernetes.io/service-account-token   3      25h
horizontal-pod-autoscaler-token-sssg5            kubernetes.io/service-account-token   3      25h
job-controller-token-dnfds                       kubernetes.io/service-account-token   3      25h
k8sadmin-token-bklpd                             kubernetes.io/service-account-token   3      5h40m
(...)

We use the -n parameter to specify the kube-system namespace. Our service account is in the list: it's k8sadmin-token-bklpd. We can see the content of the token with this command:

[root@kind-vm ~]# kubectl -n kube-system describe secret k8sadmin-token-bklpd
Name:         k8sadmin-token-bklpd
Namespace:    kube-system
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: k8sadmin
             kubernetes.io/service-account.uid: 412bf773-ca8e-4afa-a778-dac0f11b7807

Type:  kubernetes.io/service-account-token

Data
====
namespace:  11 bytes
token:      eyJhbGciO(...)2A
ca.crt:     1066 bytes

Here, you're done. The token is in the command output. You now need to pass its content to Ansible. Just keep in mind the token needs to remain secret, so it's a good idea to encrypt it with Ansible Vault.
You can use the K8S_AUTH_API_KEY environment variable to pass the token to the k8s_* modules:

$ K8S_AUTH_API_KEY=eyJhbGciO(…)2A ansible-playbook my_playbook.yaml
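
If you prefer to script the whole flow, the same thing can be done in a few lines of Python. This is only a sketch: the secret name is the one from the example above and my_playbook.yaml is a placeholder.

#!/usr/bin/env python3
# Rough sketch: fetch the token of the k8sadmin service account and run a
# playbook with it.
import base64
import json
import os
import subprocess

# The token is stored base64-encoded in the .data.token field of the secret.
raw = subprocess.run(
    ["kubectl", "-n", "kube-system", "get", "secret",
     "k8sadmin-token-bklpd", "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout
token = base64.b64decode(json.loads(raw)["data"]["token"]).decode()

# Pass the token to the k8s_* modules through the environment.
env = dict(os.environ, K8S_AUTH_API_KEY=token)
subprocess.run(["ansible-playbook", "my_playbook.yaml"], env=env, check=True)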

Ansible: Performance Impact of the Python version

Until recently, I was not really paying attention to the version of Python I was using with Ansible, as long as it was Python 3. The default version was always good enough for Ansible.

During the last weeks, I spent the majority of my time working on the performance of the community.kubernetes collection. The modules of this collection depend on a large library (the OpenShift SDK) and Python needs to reload it before every task execution. The goal was to benefit from what is already in place with vmware.vmware_rest: see my AnsibleFest presentation.

And while working on this, I realized that my metrics were not consistent: I was not able to reproduce some test cases that I did 2 months ago. After a quick investigation, it turned out that the Python version matters much more than expected.

To compare the different Python versions, I decided to run some tests.

The target host is a t2.medium instance (2 vCPUs, 4GiB) running on AWS, and the operating system is Fedora 33, which is really handy for this because it ships all the Python versions from 3.6 to 3.10!

I use the latest stable version of Ansible (2.10.3), which I install with pip in a Python virtual environment. The list of the dependencies present in the virtualenvs.

Finally, I deploy Kubernetes on Podman with Kubernetes Kind.

For the first test, I use a Python one-liner to evaluate the time Python takes to load the OpenShift SDK. This is one of the operations that I want to optimize for my work and so it matters a lot to me.

#!/bin/bash
python=$1
for i in $(seq 100); do
    echo ${i}
    venv-${python}/bin/python -c 'from openshift.dynamic import DynamicClient' >> /dev/null 2>&1
done

Here the loading is done 100 times in a row.

The result shows a steady improvement of the performance since Python 3.6.

             Python 3.6   Python 3.7   Python 3.8   Python 3.9   Python 3.10
time (sec)   48.401       45.088       41.751       40.924       40.385

With this test, the loading of the SDK is 16.5% faster with Python 3.10.

The next test does the same thing, but this time through Ansible. My test uses the following playbook:

- hosts: localhost
  gather_facts: false
  tasks:
    - k8s_info:
        kind: Pod
      with_sequence: count=100

It runs the k8s_info module 100 times in a row. In addition, I also use an ansible.cfg with the following content. This way, ansible-playbook returns a nice output of the task execution duration:

[defaults]
callback_whitelist = ansible.posix.profile_tasks

             Python 3.6   Python 3.7   Python 3.8   Python 3.9   Python 3.10
time (sec)   85.5         80.5         75.35        75.05        71.19

It's a 16.76% boost between Python 3.6 and Python 3.10. I was not expecting such a tight correlation between the two tests.
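
For reference, the percentages are simply the relative differences between the Python 3.6 and Python 3.10 columns of the two tables above:

# Relative improvement between the Python 3.6 and Python 3.10 columns.
sdk_load = (48.401 - 40.385) / 48.401 * 100   # first test, ~16.6%
playbook = (85.5 - 71.19) / 85.5 * 100        # second test, ~16.7%
print(f"SDK loading: {sdk_load:.1f}%  playbook run: {playbook:.1f}%")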

While Python is obviously not the fastest technology out there, it's great to see how its performance gets better release after release. Python 3.10 is not even released yet and looks promising.

If your playbooks use some modules that depend on large Python libraries, it may be interesting to give the latest Python versions a try.

And for those who are still running Python 2.7, I get a 49.2% performance boost between 2.7 and 3.10.