PVE: Virtualization for Work and Play (Part 3)
System Optimization…
In the previous post we installed Proxmox Virtual Environment (PVE) and configured our ZFS zpool storage. Let’s tweak our system to improve performance.
ZFS Tune-up
Solaris-based UNIX and Linux treat extended attributes differently, which has performance implications. By default, ZFS on Linux is set to xattr=on, which causes extended attributes to be stored in separate hidden directories. Changing this property to system attributes (xattr=sa) improves performance significantly under Linux as a result of extended attributes being stored more efficiently on disk.
To check the attribute, run zfs get -r xattr tank (replacing tank with your zpool name), and we’ll see something like the following:
NAME PROPERTY VALUE SOURCE
tank xattr on default
tank/vm-disks xattr on default
To update the property run zfs set xattr=sa tank
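After running the set command, a quick re-check should show the new value applied locally and inherited by child datasets; the output below is a sketch for a pool named tank:
zfs get -r xattr tank
NAME PROPERTY VALUE SOURCE
tank xattr sa local
tank/vm-disks xattr sa inherited from tank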
Linux supports POSIX ACLs (access control lists), which are stored as extended attributes and are not functional on other platforms. To check the attribute, run zfs get -r acltype tank
and we should get something like:
NAME PROPERTY VALUE SOURCE
tank acltype off local
tank/vm-disks acltype off inherited from tank
To update the property, run zfs set acltype=posixacl tank. Also, so that ACLs get passed to files created within a directory, we need to run zfs set aclinherit=passthrough tank as well.
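Both ACL properties can be applied and verified together; a minimal sketch, again assuming a pool named tank:
zfs set acltype=posixacl tank # store POSIX ACLs in extended attributes
zfs set aclinherit=passthrough tank # pass ACL entries through to newly created files
zfs get -r acltype,aclinherit tank # both properties should now report the new values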
Linux has two parameters related to tracking when files are accessed. The first is atime, which records the last access time; this creates a lot of overhead because every time a file is read, a write must go to disk to record the new access time. The second is relatime, which is a relative atime: it only updates the on-disk access time when the file has changed since its last access or the recorded time is more than a day old, so it writes on far fewer occasions. We can check both of them with zfs get -r atime tank && zfs get -r relatime tank
and we should see something like:
NAME PROPERTY VALUE SOURCE
tank atime on default
tank/vm-disks atime on default
NAME PROPERTY VALUE SOURCE
tank relatime off default
tank/vm-disks relatime off default
If we need to track access times, relatime is preferable. Here, though, relatime is already disabled, so we can turn off access-time tracking entirely by running zfs set atime=off tank.
We can set additional options for reliability. Run zpool set autoreplace=on tank so that ZFS can automatically switch to an available hot spare if hardware errors are detected on online disks. Run zpool set autoexpand=on tank to allow the pool to grow when all VDEVs have been replaced with larger ones. This must be set before any drives are replaced, so we may as well set it now.
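Since autoreplace and autoexpand are pool-level (zpool) rather than dataset-level (zfs) properties, we verify them with zpool get; a minimal check for a pool named tank:
zpool set autoreplace=on tank # enable automatic device replacement
zpool set autoexpand=on tank # enable automatic pool expansion
zpool get autoreplace,autoexpand tank # both should now report on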
About That: Many ZFS properties are not retroactive; they apply only to data written after the property is set. In other words, if you already have files or data stored on your ZFS pool, you would need to move them somewhere else (i.e. a backup) and then move them back so that the changes in properties are applied correctly.
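One way to rewrite existing data so the new properties take effect is to copy it into a fresh dataset created after the properties were set. The sketch below assumes default mountpoints, file-based VM disks, and a hypothetical dataset name new-disks:
zfs create tank/new-disks # new dataset inherits xattr=sa, posixacl, etc.
rsync -aHAX /tank/vm-disks/ /tank/new-disks/ # -H hard links, -A ACLs, -X extended attributes
# after verifying the copy, destroy the old dataset and rename the new one into place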
Graphics Processing Unit (GPU) Passthrough
Passthrough allows our Virtual Machine (VM) to access GPU hardware for games, graphics, and heavy computation (e.g. deep learning). We must enable IOMMU ("Input-Output Memory Management Unit") drivers, which map device-visible virtual addresses to actual physical addresses. IOMMU enables our VM to use those virtual addresses as if it were communicating directly with the GPU.
VFIO ("Virtual Function I/O") modules are part of an IOMMU device-agnostic framework for exposing direct device access to userspace, in a secure IOMMU protected environment. In other words, they provide access to non-privileged, low-overhead userspace drivers.
Enable the VFIO Modules:
Run nano /etc/modules
and add the following:
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
Save and exit: press CTRL+X, Y for yes, and ENTER.
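Equivalently, for anyone scripting the setup, the same lines can be appended non-interactively (a sketch; check /etc/modules for duplicates first):
cat >> /etc/modules <<'EOF'
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
EOF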
Configure the VFIO Modules
Identify Passthrough Device
To identify the GPU to pass through, run lspci -nn | grep VGA:
21:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107 [GeForce GTX 745] [10de:1382] (rev a2)
28:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
Identify the GPU slot IDs (the first pair of numbers separated by a colon):
- My GPU Slot ID for passthrough is: 28:00
- My GPU Slot ID for the host is: 21:00
Identify the vendor IDs for passthrough: lspci -nns 28:00 | cut -d "(" -f 1 | cut -d ":" -f 3,4
NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80]
NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0]
- Vendor ID for GPU VGA device: 10de:1b80
- Vendor ID for GPU Audio device: 10de:10f0
Enable Passthrough Device
To enable passthrough, add the following module options (including the comma-separated vendor IDs identified in the prior step). This loads options for the vfio-pci kernel module, which maps memory regions from the PCI bus to the VM, and activates support for IOMMU groups.
Run nano /etc/modprobe.d/kvm.conf
and add some of the following options (see Table 1 for details):
# uncomment the first option if required for your system.
#options vfio_iommu_type1 allow_unsafe_interrupts=1
options vfio-pci ids=10de:1b80,10de:10f0
options vfio-pci disable_vga=1
options kvm-amd npt=0
options kvm ignore_msrs=1
Save and exit: press CTRL+X, Y for yes, and ENTER.
Option | Details
---|---
allow_unsafe_interrupts=1 | A workaround for platforms without interrupt remapping support, which provides device isolation. It removes protection against MSI-based interrupt injection attacks by guests; only trusted guests and drivers should be run with this configuration.
ids=10de:1b80,10de:10f0 | Assigns the desired GPU devices to the virtual PCI bus for use in our VM.
disable_vga=1 | Opts devices out of VGA arbitration if possible.
npt=0 | Disables Nested Page Tables if VM performance is very slow. Linux guests with Q35 and OVMF may work with npt on or off; however, a Linux guest with i440fx only works with npt disabled.
ignore_msrs=1 | Prevents some Nvidia applications from crashing the VM.

Table 1: Module options for GPU passthrough.
Update Boot Settings
Configure IOMMU and VFIO to load first so that framebuffer drivers don’t grab the GPU while booting. After these changes, commit them to GRUB and generate a new boot image.
Run nano /etc/default/grub
and change GRUB_CMDLINE_LINUX_DEFAULT="quiet"
as follows:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on kvm_amd.avic=1 rd.driver.pre=vfio-pci video=efifb:off"
Save and exit: press CTRL+X, Y for yes, and ENTER.
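This host runs an AMD CPU; on an Intel host, the analogous line would use the VT-d flag instead (a sketch, and kvm_amd.avic has no equivalent here):
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on rd.driver.pre=vfio-pci video=efifb:off"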
Afterward, run:
update-grub # update boot loader
update-initramfs -u # update boot image
reboot # reboot machine
After our computer reboots, run lspci -nnks 28:00 to check that the driver loaded correctly. If everything went well, we should see vfio-pci as the "Kernel driver in use" for each device.
28:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. GP104 [GeForce GTX 1080] [19da:1451]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
28:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1)
Subsystem: ZOTAC International (MCO) Ltd. GP104 High Definition Audio Controller [19da:1451]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
Also, run dmesg | grep -e AMD-Vi -e vAPIC
to check our IOMMU settings.
[ 0.893699] AMD-Vi: IOMMU performance counters supported
[ 0.895145] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
[ 0.895146] AMD-Vi: Extended features (0xf77ef22294ada):
[ 0.895146] PPR NX GT IA GA PC GA_vAPIC
[ 0.895148] AMD-Vi: Interrupt remapping enabled
[ 0.895149] AMD-Vi: virtual APIC enabled
[ 0.895257] AMD-Vi: Lazy IO/TLB flushing enabled
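With IOMMU active, we can also confirm that both of the GPU's functions sit in their own IOMMU group, since every device in a group must be passed through together; the slot ID below is from this example system:
find /sys/kernel/iommu_groups/ -type l | grep 28:00 # both 28:00.0 and 28:00.1 should appear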
About That: The AMD Virtual Interrupt Controller (AVIC) virtualizes the local APIC registers of each vCPU via the virtual APIC (vAPIC) backing page. This allows guest access to certain APIC registers without needing to emulate the hardware behavior, and should speed up workloads that generate a large number of interrupts.
Final Thoughts
Congratulations! We have our PVE server configured and ready to use. We can now begin creating Virtual Machines (VMs) or Containers. In future posts, we’ll consider additional opportunities for enhancing performance and security for our server, VMs, and Containers.
Although we have configured passthrough on the server, our VMs still need to be configured to leverage that feature. Because Nvidia sells a commercial line of GPUs (Quadro) for virtualization, they do not officially support passthrough on their consumer line (GeForce) and actively try to inhibit it. We will have to consider potential workarounds to enable that functionality, which may involve future tweaks to our server settings.