PVE: Virtualization for Work and Play (Part 3)
System Optimization…
In the previous post we installed Proxmox Virtual Environment (PVE) and configured our ZFS zpool storage. Let’s tweak our system to improve performance.
ZFS Tune-up
Solaris-based UNIX and Linux treat extended attributes differently, which has performance implications. By default, ZFS on Linux is set to xattr=on, which causes extended attributes to be stored in separate hidden directories. Changing this property to system attributes (xattr=sa) improves performance significantly under Linux as a result of extended attributes being stored more efficiently on disk.
To check the attribute, run zfs get -r xattr tank (replacing tank with your zpool name), and we’ll see something like the following:
NAME PROPERTY VALUE SOURCE
tank xattr on default
tank/vm-disks xattr on default
To update the property run zfs set xattr=sa tank
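After running the set command, a quick re-check should show the new value applied locally and inherited by child datasets; the output below is a sketch for a pool named tank:
zfs get -r xattr tank
NAME PROPERTY VALUE SOURCE
tank xattr sa local
tank/vm-disks xattr sa inherited from tank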
Linux supports POSIX ACLs (access control lists), which are stored as extended attributes and are not functional on other platforms. To check the attribute, run zfs get -r acltype tank
and we should get something like:
NAME PROPERTY VALUE SOURCE
tank acltype off local
tank/vm-disks acltype off inherited from tank
To update the property, run zfs set acltype=posixacl tank. Also, so that ACLs get passed to files created within a directory, we need to run zfs set aclinherit=passthrough tank as well.
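Both ACL properties can be applied and verified together; a minimal sketch, again assuming a pool named tank:
zfs set acltype=posixacl tank # store POSIX ACLs in extended attributes
zfs set aclinherit=passthrough tank # pass ACL entries through to newly created files
zfs get -r acltype,aclinherit tank # both properties should now report the new values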
Linux has two parameters related to tracking when files are accessed. The first is atime, which records the last access time; this creates a lot of overhead because every time a file is read, a write must go to disk to record the new access time. The second is relatime, which is a relative atime: it only updates the on-disk access time when the file has changed since its last access or the recorded time is more than a day old, so it writes on far fewer occasions. We can check both of them with zfs get -r atime tank && zfs get -r relatime tank
and we should see something like:
NAME PROPERTY VALUE SOURCE
tank atime on default
tank/vm-disks atime on default
NAME PROPERTY VALUE SOURCE
tank relatime off default
tank/vm-disks relatime off default
If we need to track access times, relatime is preferable. Here, though, relatime is already disabled, so we can turn off access-time tracking entirely by running zfs set atime=off tank.
We can set additional options for reliability. Run zpool set autoreplace=on tank so that ZFS can automatically switch to an available hot spare if hardware errors are detected on online disks. Run zpool set autoexpand=on tank to allow the pool to grow when all VDEVs have been replaced with larger ones. This must be set before any drives are replaced, so we may as well set it now.
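Since autoreplace and autoexpand are pool-level (zpool) rather than dataset-level (zfs) properties, we verify them with zpool get; a minimal check for a pool named tank:
zpool set autoreplace=on tank # enable automatic device replacement
zpool set autoexpand=on tank # enable automatic pool expansion
zpool get autoreplace,autoexpand tank # both should now report on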
About That: Many ZFS properties are not retroactive; they apply only to data written after the property is set. In other words, if you already have files or data stored on your ZFS pool, you would need to move them somewhere else (i.e. a backup) and then move them back so that the changes in properties are applied correctly.
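One way to rewrite existing data so the new properties take effect is to copy it into a fresh dataset created after the properties were set. The sketch below assumes default mountpoints, file-based VM disks, and a hypothetical dataset name new-disks:
zfs create tank/new-disks # new dataset inherits xattr=sa, posixacl, etc.
rsync -aHAX /tank/vm-disks/ /tank/new-disks/ # -H hard links, -A ACLs, -X extended attributes
# after verifying the copy, destroy the old dataset and rename the new one into place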
Graphics Processing Unit (GPU) Passthrough
Passthrough allows our Virtual Machine (VM) to access GPU hardware for games, graphics, and heavy computation (e.g. deep learning). We must enable IOMMU ("Input-Output Memory Management Unit") drivers, which map device-visible virtual addresses to actual physical addresses. IOMMU enables our VM to use those virtual addresses as if it were communicating directly with the GPU.
VFIO ("Virtual Function I/O") modules are part of an IOMMU device-agnostic framework for exposing direct device access to userspace, in a secure IOMMU protected environment. In other words, they provide access to non-privileged, low-overhead userspace drivers.
Enable the VFIO Modules:
Run nano /etc/modules
and add the following:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
Save and exit: press CTRL+X, Y for yes, and ENTER.
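Equivalently, for anyone scripting the setup, the same lines can be appended non-interactively (a sketch; check /etc/modules for duplicates first):
cat >> /etc/modules <<'EOF'
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
EOF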
Configure the VFIO Modules
Identify Passthrough Device
To identify the GPU to pass through, run lspci -nn | grep VGA:
21:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107 [GeForce GTX 745] [10de:1382] (rev a2)
28:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
Identify the GPU slot IDs (the first pair of numbers separated by a colon):
- My GPU Slot ID for passthrough is: 28:00
- My GPU Slot ID for the host is: 21:00
Identify the vendor IDs for passthrough: lspci -nns 28:00 | cut -d "(" -f 1 | cut -d ":" -f 3,4
NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80]
NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0]
- Vendor ID for GPU VGA device: 10de:1b80
- Vendor ID for GPU Audio device: 10de:10f0
Enable Passthrough Device
To enable passthrough, add the following module options (including the comma-separated vendor IDs identified in the prior step). This loads options for the vfio-pci kernel module, which maps memory regions from the PCI bus to the VM, and activates support for IOMMU groups.
Run nano /etc/modprobe.d/kvm.conf
and add some of the following options (see Table 1 for details):
# uncomment the first option if required for your system.
#options vfio_iommu_type1 allow_unsafe_interrupts=1
options vfio-pci ids=10de:1b80,10de:10f0
options vfio-pci disable_vga=1
options kvm-amd npt=0
options kvm ignore_msrs=1
Save and exit: press CTRL+X, Y for yes, and ENTER.
Option | Details
---|---
allow_unsafe_interrupts=1 | A workaround for platforms without interrupt remapping support, which provides device isolation. It removes protection against MSI-based interrupt injection attacks by guests; only trusted guests and drivers should be run with this configuration.
ids=10de:1b80,10de:10f0 | Assigns the desired GPU devices to the virtual PCI bus for use in our VM.
disable_vga=1 | Opts devices out of VGA arbitration if possible.
npt=0 | Disables Nested Page Tables if VM performance is very slow. Linux guests with Q35 and OVMF may work with npt on or off; however, a Linux guest with i440fx only works with npt disabled.
ignore_msrs=1 | Prevents some Nvidia applications from crashing the VM.

Table 1: Module options for GPU passthrough.
Update Boot Settings
Configure IOMMU and VFIO to load first so that framebuffer drivers don’t grab the GPU while booting. After these changes, commit them to GRUB and generate a new boot image.
Run nano /etc/default/grub
and change GRUB_CMDLINE_LINUX_DEFAULT="quiet"
as follows:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on kvm_amd.avic=1 rd.driver.pre=vfio-pci video=efifb:off"
Save and exit: press CTRL+X, Y for yes, and ENTER.
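This host runs an AMD CPU; on an Intel host, the analogous line would use the VT-d flag instead (a sketch, and kvm_amd.avic has no equivalent here):
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on rd.driver.pre=vfio-pci video=efifb:off"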
Afterward, run:
update-grub # update boot loader
update-initramfs -u # update boot image
reboot # reboot machine
After our computer reboots, run lspci -nnks 28:00 to check that the driver loaded correctly. If everything went well, we should see vfio-pci as the "Kernel driver in use" for each device.
28:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. GP104 [GeForce GTX 1080] [19da:1451]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
28:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. GP104 High Definition Audio Controller [19da:1451]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
Also, run dmesg | grep -e AMD-Vi -e vAPIC
to check our IOMMU settings.
[ 0.893699] AMD-Vi: IOMMU performance counters supported
[ 0.895145] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
[ 0.895146] AMD-Vi: Extended features (0xf77ef22294ada):
[ 0.895146] PPR NX GT IA GA PC GA_vAPIC
[ 0.895148] AMD-Vi: Interrupt remapping enabled
[ 0.895149] AMD-Vi: virtual APIC enabled
[ 0.895257] AMD-Vi: Lazy IO/TLB flushing enabled
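With IOMMU active, we can also confirm that both of the GPU's functions sit in their own IOMMU group, since every device in a group must be passed through together; the slot ID below is from this example system:
find /sys/kernel/iommu_groups/ -type l | grep 28:00 # both 28:00.0 and 28:00.1 should appear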
About That: The AMD Virtual Interrupt Controller (AVIC) virtualizes the local APIC registers of each vCPU via the virtual APIC (vAPIC) backing page. This allows guest access to certain APIC registers without needing to emulate the hardware behavior, and should speed up workloads that generate a large number of interrupts.
Final Thoughts
Congratulations! We have our PVE server configured and ready to use. We can now begin creating Virtual Machines (VMs) or Containers. In future posts, we’ll consider additional opportunities for enhancing performance and security for our server, VMs, and Containers.
Although we have configured passthrough on the server, our VMs still need to be configured to leverage that feature. Because Nvidia sells a commercial line of GPUs (Quadro) for virtualization, they do not officially support passthrough on their consumer line (GeForce) and actively try to inhibit it. We will have to consider potential workarounds to enable that functionality, which may involve future tweaks to our server settings.