
The shape of VM to come

pclouzet edited this page May 21, 2025 · 17 revisions

Install a Virtual Machine using qemu

Install qemu

Get qemu
git clone https://gitlab.com/qemu-project/qemu.git

Build and install qemu

cd qemu
mkdir build
cd build
../configure --enable-slirp
make -j
sudo make install

Install a VM using qemu

Get an image of Debian 12.2.0 to boot as a virtual machine:
wget https://www.debian.org/distrib/netinst/debian-12.2.0-amd64-netinst.iso

Create a disk image (qcow2 format) where the VM will store its data:
qemu-img create -f qcow2 mydisk.img 20G

Install the Debian VM with qemu:

qemu-system-x86_64 -boot d -cdrom debian-12.2.0-amd64-netinst.iso -m 4G \
-device e1000,netdev=net0,mac=52:54:00:12:34:56 -netdev user,id=net0,hostfwd=tcp::10022-:22 \
-hda mydisk.img -accel kvm
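Once the installer has set up an SSH server in the guest, the hostfwd rule above lets us reach it from the host. A minimal sketch, assuming an account named "user" was created during installation:

```shell
# Guest port 22 is forwarded to host port 10022 by the hostfwd rule above.
# "user" is whatever account you created during the Debian install.
ssh -p 10022 user@localhost
```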

Follow all the instructions from the installer interface and you're done. -accel kvm speeds up the installation considerably (from 1h30 to 20min in my case).

Launch our new VM

Let's say we want to run Debian with 8 GB of RAM:
qemu-system-x86_64 -hda mydisk.img -m 8G -accel kvm

A VM can use a lot of resources and feel sluggish. We can lighten the load by disabling the graphical interface: open a terminal within the VM and run

sudo systemctl set-default multi-user.target
sudo reboot

Just in case, you can re-enable it with:

sudo systemctl set-default graphical.target
sudo reboot

Snapshot of an Image

Before going further, we can take snapshots of mydisk.img thanks to the qcow2 format, by creating an overlay image backed by it:

qemu-img create -f qcow2 -b mydisk.img -F qcow2 snapshot.img
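The backing relationship can be checked with qemu-img; a sketch, using the file names above:

```shell
# Show the overlay and its backing file; the chain should list mydisk.img
# as the backing file of snapshot.img.
qemu-img info --backing-chain snapshot.img
```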

'mydisk.img' should not be modified anymore: any change to the backing file could corrupt its snapshots.

Set some architecture features using qemu

Using qemu, let's give our VM 4 NUMA nodes, each with 4 CPUs, holding 4, 2, 1 and 1 GB of memory respectively:

qemu-system-x86_64 -hda snapshot.img -m 8G \
        -accel kvm \
        -smp cpus=16 \
        -object memory-backend-ram,size=4G,id=ram0 \
        -object memory-backend-ram,size=2G,id=ram1 \
        -object memory-backend-ram,size=1G,id=ram2 \
        -object memory-backend-ram,size=1G,id=ram3 \
        -numa node,nodeid=0,memdev=ram0,cpus=0-3 \
        -numa node,nodeid=1,memdev=ram1,cpus=4-7 \
        -numa node,nodeid=2,memdev=ram2,cpus=8-11 \
        -numa node,nodeid=3,memdev=ram3,cpus=12-15
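Inside the guest, the resulting topology can be checked (assuming the numactl package is installed):

```shell
# Each node should report 4 CPUs, with 4/2/1/1 GB of memory respectively.
numactl --hardware
# Alternatively:
lscpu | grep -i numa
```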

Add an nvdimm node

qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -machine pc,nvdimm=on \
        -m 8G,slots=1,maxmem=9G \
        -smp cpus=16 \
        -object memory-backend-ram,size=4G,id=ram0 \
        -object memory-backend-ram,size=2G,id=ram1 \
        -object memory-backend-ram,size=1G,id=ram2 \
        -object memory-backend-ram,size=1G,id=ram3 \
        -device nvdimm,id=nvdimm1,memdev=nvdimm1,unarmed=off,node=4 \
        -object memory-backend-file,id=nvdimm1,share=on,mem-path=img/nvdimm.img,size=1G \
        -numa node,nodeid=0,memdev=ram0,cpus=0-3 \
        -numa node,nodeid=1,memdev=ram1,cpus=4-7 \
        -numa node,nodeid=2,memdev=ram2,cpus=8-11 \
        -numa node,nodeid=3,memdev=ram3,cpus=12-15 \
        -numa node,nodeid=4

By running the command ndctl list -NRD, we can list the active and enabled nvdimm devices:

{
  "dimms":[
    {
      "dev":"nmem0",
      "id":"8680-56341200",
      "handle":1,
      "phys_id":0
    }
  ],
  "regions":[
    {
      "dev":"region0",
      "size":1073741824,
      "align":16777216,
      "available_size":0,
      "max_available_extent":0,
      "type":"pmem",
      "mappings":[
        {
          "dimm":"nmem0",
          "offset":0,
          "length":1073741824,
          "position":0
        }
      ],
      "persistence_domain":"unknown",
      "namespaces":[
        {
          "dev":"namespace0.0",
          "mode":"raw",
          "size":1073741824,
          "sector_size":512,
          "blockdev":"pmem0"
        }
      ]
    }
  ]
}
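Scripts can check the namespace mode without jq; a minimal sketch using grep on an abridged sample of that output:

```shell
# Abridged sample of the 'ndctl list -NRD' output shown above.
json='{"namespaces":[{"dev":"namespace0.0","mode":"raw","size":1073741824}]}'
# Pull out the "mode" field of the first namespace.
mode=$(printf '%s' "$json" | grep -o '"mode":"[^"]*"' | head -n1 | cut -d'"' -f4)
echo "$mode"   # raw
```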

By default, the namespaceX.Y (here namespace0.0) is set to raw mode, which means the nvdimm device acts as a memory disk that does not support DAX. We need to disable the namespace, create a new one, and finally set the mode to devdax with the following commands:

sudo ndctl disable-namespace namespace0.0
sudo ndctl create-namespace -m devdax
sudo daxctl reconfigure-device -m system-ram all --force

Node 4 is now configured as dax:

{
  "dimms":[
    {
      "dev":"nmem0",
      "id":"8680-56341200",
      "handle":1,
      "phys_id":0
    }
  ],
  "regions":[
    {
      "dev":"region0",
      "size":1073741824,
      "align":16777216,
      "available_size":0,
      "max_available_extent":0,
      "type":"pmem",
      "mappings":[
        {
          "dimm":"nmem0",
          "offset":0,
          "length":1073741824,
          "position":0
        }
      ],
      "persistence_domain":"unknown",
      "namespaces":[
        {
          "dev":"namespace0.0",
          "mode":"devdax",
          "map":"dev",
          "size":1054867456,
          "uuid":"ed8bb2a9-41fb-48e0-a0b2-7dbf0d9ca9ba",
          "chardev":"dax0.0",
          "align":2097152
        }
      ]
    }
  ]
}
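After the reconfiguration, the character device advertised in the JSON (chardev dax0.0) should exist in the guest; a quick check:

```shell
# The devdax namespace exposes a character device:
ls -l /dev/dax0.0
# And the reconfigured device should now be listed as system-ram:
daxctl list
```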

CXL

To be sure, we work with the latest Linux kernel: 6.7.0-rc3+

Persistent memory example

First we need a CXL host bridge (PCI eXpander Bridge, i.e., pxb-cxl "cxl.1"), then we attach a root port (cxl-rp, "root_port13" here), then a Type 3 device.
In this case it is a pmem device, so it needs two "memory-backend-file" objects: one for the memory ("pmem0" here) and one for its label storage area (LSA, i.e., "cxl-lsa0"). Finally we need a Fixed Memory Window (FMW, i.e., cxl-fmw) to map that memory in the host:

qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
        -machine q35,nvdimm=on,cxl=on \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 \
        -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -m 4G,slots=8,maxmem=8G \
        -smp 4 \
        -object memory-backend-ram,size=4G,id=mem0 \
        -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
        -object memory-backend-file,id=pmem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
        -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa.raw,size=256M \
        -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
        -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
        -device cxl-type3,bus=root_port13,persistent-memdev=pmem0,lsa=cxl-lsa0,id=cxl-pmem0 \
        -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

We need to create the region using cxl create-region and make it available as a NUMA node backed by the persistent memory:

sudo cxl create-region -m -d decoder0.0 -t pmem mem0
sudo daxctl reconfigure-device -m system-ram dax0.0 --force

Volatile memory example, with interleaving and a switch

Let's build a VM with 2 sockets. Each socket has 2 CPUs, 2 CXL devices and 1 switch.
We need one PXB per socket, with 2 root ports each, and a switch installed on each socket: 1 upstream port and 2 downstream ports per socket. The root ports carrying the upstream port of each switch have to be attached on slot 0, hence we need to distinguish the chassis of the two NUMA nodes. In this case the devices are volatile, so each socket needs two "memory-backend-ram" objects. Finally we set 2 Fixed Memory Windows to map both memories in the host:

qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
        -machine q35,nvdimm=on,cxl=on \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 \
        -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -m 2G,slots=8,maxmem=10G \
        -smp cpus=4,cores=2,sockets=2 \
        -object memory-backend-ram,size=1G,id=ram0 \
        -object memory-backend-ram,size=1G,id=ram1 \
        -object memory-backend-ram,id=cxl-mem0,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem1,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem2,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem3,share=on,size=256M \
        -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
        -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
        -device pxb-cxl,numa_node=0,bus_nr=24,bus=pcie.0,id=pxb-cxl.1 \
        -device pxb-cxl,numa_node=1,bus_nr=32,bus=pcie.0,id=pxb-cxl.2 \
        -device cxl-rp,port=0,bus=pxb-cxl.1,id=root_port1,chassis=0,slot=0 \
        -device cxl-rp,port=1,bus=pxb-cxl.1,id=root_port2,chassis=0,slot=1 \
        -device cxl-rp,port=2,bus=pxb-cxl.2,id=root_port3,chassis=1,slot=0 \
        -device cxl-rp,port=3,bus=pxb-cxl.2,id=root_port4,chassis=1,slot=2 \
        -device cxl-upstream,bus=root_port1,id=us0 \
        -device cxl-upstream,bus=root_port3,id=us1 \
        -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=3 \
        -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \
        -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=4 \
        -device cxl-type3,bus=swport1,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
        -device cxl-downstream,port=2,bus=us1,id=swport2,chassis=1,slot=5 \
        -device cxl-type3,bus=swport2,volatile-memdev=cxl-mem2,id=cxl-vmem2 \
        -device cxl-downstream,port=3,bus=us1,id=swport3,chassis=1,slot=6 \
        -device cxl-type3,bus=swport3,volatile-memdev=cxl-mem3,id=cxl-vmem3 \
        -M cxl-fmw.0.targets.0=pxb-cxl.1,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=pxb-cxl.2,cxl-fmw.1.size=4G

Here, we selected root_port1 and root_port3 to be plugged on slot 0 of chassis 0 and chassis 1 respectively. The bus_nr values of the PXBs may lead to error messages if those bus numbers are already in use; just change them to other values.
From the VM, list CXL memory devices with cxl list -M:

[
  {
    "memdev":"mem1",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:23:00.0"
  },
  {
    "memdev":"mem0",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:24:00.0"
  },
  {
    "memdev":"mem2",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1b:00.0"
  },
  {
    "memdev":"mem3",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1c:00.0"
  }
]

We can list decoders available with cxl list -D:

[
  {
    "root decoders":[
      {
        "decoder":"decoder0.0",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":-17985175553,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1
      },
      {
        "decoder":"decoder0.1",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":-22280142849,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1
      }
    ]
  }
]

We assemble a CXL region with the cxl create-region command. We need to select the decoder under which the region will be created and the CXL devices it will contain. Below, we first assemble mem1 and mem0, located under decoder0.1, with 2-way interleaving:

sudo cxl create-region -m -d decoder0.1 -t ram -w 2 mem1 mem0

And we assemble mem2 and mem3 under decoder0.0, each with 1-way interleaving:

sudo cxl create-region -m -d decoder0.0 -t ram -w 1 mem2
sudo cxl create-region -m -d decoder0.0 -t ram -w 1 mem3

We can see they are now available with the daxctl list command:

[
  {
    "chardev":"dax1.0",
    "size":268435456,
    "target_node":3,
    "align":2097152,
    "mode":"system-ram"
  },
  {
    "chardev":"dax3.0",
    "size":268435456,
    "target_node":3,
    "align":2097152,
    "mode":"system-ram"
  },
  {
    "chardev":"dax0.0",
    "size":536870912,
    "target_node":2,
    "align":2097152,
    "mode":"system-ram"
  }
]
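The sizes reported above line up with the interleave settings; a quick sanity check of the arithmetic:

```shell
# Each CXL device is 256M; a 2-way interleaved region aggregates two of them,
# while a 1-way region exposes a single device.
dev_size=$((256 * 1024 * 1024))
two_way=$((2 * dev_size))    # dax0.0: 536870912 bytes
one_way=$((1 * dev_size))    # dax1.0 / dax3.0: 268435456 bytes
echo "$two_way $one_way"
```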

New DAX devices should appear under /sys/bus/dax/devices. By default, new NUMA nodes appear offline. Run daxctl online-memory all to bring them online.

Both pmem and vmem

Let's build a VM with 4 sockets: one with only CPUs, one with a CXL pmem device, one with 2 CXL devices 2-way interleaved, and one with 2 CXL devices 1-way interleaved.

qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
        -machine q35,nvdimm=on,cxl=on \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 \
        -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -m 4G,slots=8,maxmem=10G \
        -smp cpus=8,cores=2,sockets=4 \
        -object memory-backend-ram,size=1G,id=ram0 \
        -object memory-backend-ram,size=1G,id=ram1 \
        -object memory-backend-ram,size=1G,id=ram2 \
        -object memory-backend-ram,size=1G,id=ram3 \
        -object memory-backend-ram,id=cxl-mem0,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem1,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem2,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem3,share=on,size=256M \
        -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/cxltest.raw,size=256M \
        -object memory-backend-file,id=cxl-lsa4,share=on,mem-path=/tmp/lsa.raw,size=256M \
        -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
        -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
        -numa node,nodeid=2,cpus=4-5,memdev=ram2 \
        -numa node,nodeid=3,cpus=6-7,memdev=ram3 \
        -device pxb-cxl,numa_node=0,bus_nr=24,bus=pcie.0,id=pxb-cxl.1 \
        -device pxb-cxl,numa_node=1,bus_nr=32,bus=pcie.0,id=pxb-cxl.2 \
        -device pxb-cxl,numa_node=3,bus_nr=40,bus=pcie.0,id=pxb-cxl.3 \
        -device cxl-rp,port=0,bus=pxb-cxl.1,id=root_port1,chassis=0,slot=0 \
        -device cxl-rp,port=1,bus=pxb-cxl.1,id=root_port2,chassis=0,slot=3 \
        -device cxl-rp,port=2,bus=pxb-cxl.2,id=root_port3,chassis=1,slot=0 \
        -device cxl-rp,port=3,bus=pxb-cxl.2,id=root_port4,chassis=1,slot=5 \
        -device cxl-rp,port=0,bus=pxb-cxl.3,id=root_port5,chassis=2,slot=0 \
        -device cxl-upstream,bus=root_port1,id=us0 \
        -device cxl-upstream,bus=root_port3,id=us1 \
        -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=7 \
        -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \
        -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=8 \
        -device cxl-type3,bus=swport1,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
        -device cxl-downstream,port=2,bus=us1,id=swport2,chassis=1,slot=9 \
        -device cxl-type3,bus=swport2,volatile-memdev=cxl-mem2,id=cxl-vmem2 \
        -device cxl-downstream,port=3,bus=us1,id=swport3,chassis=1,slot=10 \
        -device cxl-type3,bus=swport3,volatile-memdev=cxl-mem3,id=cxl-vmem3 \
        -device cxl-type3,bus=root_port5,persistent-memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem0 \
        -M cxl-fmw.0.targets.0=pxb-cxl.1,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=pxb-cxl.2,cxl-fmw.1.size=4G,cxl-fmw.2.targets.0=pxb-cxl.3,cxl-fmw.2.size=512M

TIP: how to identify which decoder corresponds to which device.
When listing with cxl list -Dv, look at the target id. Here decoder0.0 has id=24, which is the bus number attached to a node: in our previous qemu script, bus_nr=24 corresponds to numa_node=0.

"decoders:root0":[
      {
        "decoder":"decoder0.0",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:18",
            "alias":"ACPI0016:02",
            "position":0,
            "id":24
          }
        ]
      },
      {
        "decoder":"decoder0.1",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:20",
            "alias":"ACPI0016:01",
            "position":0,
            "id":32
          }
        ]
      },
      {
        "decoder":"decoder0.2",
        "size":536870912,
        "interleave_ways":1,
        "max_available_extent":536870912,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:28",
            "alias":"ACPI0016:00",
            "position":0,
            "id":40
          }
        ]
      }
]

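The id-to-node mapping from the tip above can be captured in a tiny hypothetical helper; the values come from the -device pxb-cxl lines of the qemu script in this section:

```shell
# Map a decoder target id (the PXB bus_nr) back to its qemu numa_node.
# bus_nr values are taken from the qemu script above; anything else is unknown.
decoder_node() {
  case "$1" in
    24) echo 0 ;;  # pxb-cxl.1 -> numa_node=0
    32) echo 1 ;;  # pxb-cxl.2 -> numa_node=1
    40) echo 3 ;;  # pxb-cxl.3 -> numa_node=3
    *)  echo unknown ;;
  esac
}
decoder_node 24   # 0
```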
Let's select id 24. It is attached to decoder0.0. To identify which memory devices are below that decoder, run cxl list -M:

[
  {
    "memdev":"mem0",
    "pmem_size":268435456,
    "serial":0,
    "numa_node":3,
    "host":"0000:29:00.0"
  },
  {
    "memdev":"mem1",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1b:00.0"
  },
  {
    "memdev":"mem4",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1c:00.0"
  },
  {
    "memdev":"mem3",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:23:00.0"
  },
  {
    "memdev":"mem2",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:24:00.0"
  }
]

We can see that mem1 and mem4 are located in numa_node 0.
So we can run sudo cxl create-region -m -t ram -d decoder0.0 -w 2 mem4 mem1 without any doubt about whether it is the right decoder with the right memory devices. Then we finalize the configuration with 1-way interleaving:

sudo cxl create-region -m -t ram -d decoder0.1 -w1 mem3
sudo cxl create-region -m -t ram -d decoder0.1 -w1 mem2

The region with persistent memory:

sudo cxl create-region -m -t pmem -d decoder0.2 mem0
sudo ndctl create-namespace -t pmem -m devdax -r region2 -f

And finally online all devices:

sudo daxctl online-memory all
sudo daxctl reconfigure-device -m system-ram dax2.0 --force

We observed error messages like "failed to create namespace: No space left on device" after running the namespace creation. To work around this issue, erase the files declared in the mem-path arguments (usually under /tmp/) of your qemu script and reboot the VM.

How to turn on a VM, run scripts, and turn it off automatically

The approach is to copy an rc.local script into the guest, so that when we turn on the VM, rc.local runs whatever we want before halting. Note that rc.local is executed as root. A log.txt file is written and transferred back to the host through scp.

Prepare for scp

Log in to the guest, switch to root with su and run ssh-keygen. Add the content of id_rsa.pub to the host's ~/.ssh/authorized_keys. From the guest, ssh to the host once (use its IP address), to get past the first-connection warning message.
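A sketch of that preparation, run as root inside the guest; the host login and IP address below are examples:

```shell
# Generate a key pair without a passphrase (as root in the guest).
ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa
# Append the public key to the host user's authorized_keys.
ssh-copy-id pclouzet@192.168.134.40
# Connect once to accept the host key, so later scp runs non-interactively.
ssh pclouzet@192.168.134.40 true
```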

Example of rc.local script

#!/bin/sh

# Guest and Host share the same login.
login=pclouzet
ip_adress_of_host=192.168.134.40
set -e

cd /home/$login
rm -rf archive*

# Get hwloc, unzip
wget https://ci.inria.fr/hwloc/job/basic/job/master/lastSuccessfulBuild/artifact/*zip*/archive.zip
unzip archive.zip
cd archive
filename=$(ls hwloc*.tar.bz2)
tar xf $filename
hwloc_dir="${filename%*.tar.bz2}"
cd $hwloc_dir

# Manual Install
./configure >> log.txt
make >> log.txt

# Test
./utils/lstopo/lstopo-no-graphics >> log.txt

# Copy log file and send it back to host
cp log.txt /home/$login
chown $login /home/$login/log.txt
scp /home/$login/log.txt pclouzet@$ip_adress_of_host:/mnt/scratch/$login/qemu/scripts/hwloc_test
# Shutdown guest
halt 

Embed rc.local into the VM's image

We use the virt-customize command to embed the rc.local script into the VM: virt-customize -a $your_img --copy-in rc.local:/etc/ (your image needs to be in qcow2 format and the VM turned off).

Set up a no-graphics VM

According to qemu's documentation, a simple -nographic argument is enough to avoid opening a window. For qemu yes, but we still need to modify the behavior of GRUB, otherwise we'd be stuck. As root, edit the /etc/default/grub file and add these lines:

# If you change this file, run 'update-grub' afterwards to update
# /boot/grub/grub.cfg.
# For full documentation of the options in this file, see:
#   info -f grub -n 'Simple configuration'

# to skip the window of grub
GRUB_GFXMODE=text
# to boot automatically (idle otherwise)
GRUB_DEFAULT=saved
GRUB_SAVEDEFAULT=true

# no time to waste
GRUB_HIDDEN_TIMEOUT=0
GRUB_TIMEOUT=0
GRUB_CMDLINE_LINUX_DEFAULT="quiet"
# to stay in the current terminal prompt
GRUB_CMDLINE_LINUX="console=ttyS0"

Update the grub configuration file with the command update-grub.

Run qemu

Here is an example:

#!/bin/bash
echo "... Running qemu ..."
qemu-system-x86_64 \
        -hda /mnt/scratch/pclouzet/qemu/img/new_clean.img \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 \
        -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -accel kvm \
        -machine q35,nvdimm=on,cxl=off \
        -m size=8G,slots=1,maxmem=9G \
        -smp cpus=16 \
        -object memory-backend-ram,size=4G,id=ram0 \
        -object memory-backend-ram,size=2G,id=ram1 \
        -object memory-backend-ram,size=1G,id=ram2 \
        -object memory-backend-ram,size=1G,id=ram3 \
        -numa node,nodeid=0,memdev=ram0,cpus=0-3 \
        -numa node,nodeid=1,memdev=ram1,cpus=4-7 \
        -numa node,nodeid=2,memdev=ram2,cpus=8-11 \
        -numa node,nodeid=3,memdev=ram3,cpus=12-15 \
        -object memory-backend-ram,size=1G,id=nvdimm0,share=on \
        -device nvdimm,id=nvdimm0,memdev=nvdimm0,unarmed=off,node=4 \
        -numa node,nodeid=4 \
        -nographic

To take back control of your terminal, press Ctrl-a x.
