VFIO Part I. VFIO Core

Posted 2019-06-01, updated 2019-10-31. Categories: Linux, QEMU-KVM.

The code in this article is based on Linux 5.1-rc6.

Overview

VFIO exposes two character device files as entry points for user programs, /dev/vfio/vfio and /dev/vfio/$GROUP, and also adds a few files to sysfs.

Start with /dev/vfio/vfio. It is a misc device, registered in vfio_init, the initialization function of the vfio module:

static struct miscdevice vfio_dev = {
    .minor = VFIO_MINOR,
    .name = "vfio",
    .fops = &vfio_fops,
    .nodename = "vfio/vfio",
    .mode = S_IRUGO | S_IWUGO,
};

static int __init vfio_init(void) {
    int ret;
    /* ... */
    ret = misc_register(&vfio_dev);
    /* ... */
}

Every open of /dev/vfio/vfio creates a corresponding Container, i.e. a struct vfio_container:

struct vfio_container {
    struct kref                 kref;
    struct list_head            group_list;
    struct rw_semaphore         group_lock;
    struct vfio_iommu_driver    *iommu_driver;
    void                        *iommu_data;
    bool                        noiommu;
};

VFIO Groups can be added to a Container, which keeps them in group_list, a linked list of VFIO Groups (struct vfio_group). The Container's job is to provide IOMMU services to its Groups through its iommu_driver:

struct vfio_iommu_driver {
    const struct vfio_iommu_driver_ops  *ops;
    struct list_head                    vfio_next;
};

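noiommu indicates whether the Container holds no-iommu Groups (a Container cannot hold no-iommu Groups and regular Groups at the same time). A no-iommu Group is a VFIO Group that is forcibly created even though no IOMMU backs it. Enabling this advanced feature (CONFIG_VFIO_NOIOMMU) is generally not recommended, so we simply ignore the related code.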

/dev/vfio/$GROUP obviously corresponds to a VFIO Group. Where it comes from is a bit more involved; the following fragment of vfio_init helps explain it:

/* /dev/vfio/$GROUP */
vfio.class = class_create(THIS_MODULE, "vfio");
if (IS_ERR(vfio.class)) {
    ret = PTR_ERR(vfio.class);
    goto err_class;
}

vfio.class->devnode = vfio_devnode;

ret = alloc_chrdev_region(&vfio.group_devt, 0, MINORMASK + 1, "vfio");
if (ret)
    goto err_alloc_chrdev;

cdev_init(&vfio.group_cdev, &vfio_group_fops);
ret = cdev_add(&vfio.group_cdev, vfio.group_devt, MINORMASK + 1);
if (ret)
    goto err_cdev_add;

vfio_devnode is defined as follows:

/**
 * Module/class support
 */
static char *vfio_devnode(struct device *dev, umode_t *mode)
{
    return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
}

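This dynamically allocates an entire major number (that is, all minors under that major) for VFIO Group character devices and registers the cdev. From then on, whenever a Device with a devt is created under the VFIO class (/sys/class/vfio), a /dev/vfio/$GROUP character device file is created.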

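VFIO consists of the VFIO core module and VFIO driver modules. VFIO Groups are created by the driver modules, the most common of which is vfio-pci. A VFIO driver is implemented as a device driver: it registers a Driver and calls vfio_add_group_dev in its probe function, which eventually calls device_create to create a Device for the VFIO Group (and thereby the /dev/vfio/$GROUP device file):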

/* vfio_add_group_dev --> vfio_create_group */
dev = device_create(vfio.class, NULL,
            MKDEV(MAJOR(vfio.group_devt), minor),
            group, "%s%d", group->noiommu ? "noiommu-" : "",
            iommu_group_id(iommu_group));

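As for the sysfs files mentioned above, they are also created by the VFIO driver: since it is itself a (virtual) device driver, it can naturally create sysfs directories and attributes.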

VFIO Group

Everything below uses vfio-pci as the example; the analysis is also relevant to other VFIO drivers.

Creation

We start with the creation of a VFIO Group. For vfio-pci this happens in vfio_pci_probe:

static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    struct vfio_pci_device *vdev;
    struct iommu_group *group;
    int ret;

    /* ... */

    ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
    if (ret) {
        vfio_iommu_group_put(group, &pdev->dev);
        kfree(vdev);
        return ret;
    }

    /* ... */

    return ret;
}

This creates a vfio_pci_device object, vdev, and uses vfio_add_group_dev, provided by VFIO Core, to create a VFIO Group. The data structures created by vfio_add_group_dev are analyzed in detail below.

First, VFIO Core has a global variable, vfio:

static struct vfio {
    struct class            *class;
    struct list_head        iommu_drivers_list;
    struct mutex            iommu_drivers_lock;
    struct list_head        group_list;
    struct idr              group_idr;
    struct mutex            group_lock;
    struct cdev             group_cdev;
    dev_t                   group_devt;
    wait_queue_head_t       release_q;
} vfio;

Here group_list is the linked list of all VFIO Groups, and group_idr is an IDR (a radix-tree-backed map) from minor numbers to VFIO Groups.

Now for the VFIO Group itself. Each VFIO Group corresponds to exactly one IOMMU Group:

struct vfio_group {
    struct kref                     kref;
    int                             minor;
    atomic_t                        container_users;
    struct iommu_group              *iommu_group;
    struct vfio_container           *container;
    struct list_head                device_list;
    struct mutex                    device_lock;
    struct device                   *dev;
    struct notifier_block           nb;
    struct list_head                vfio_next;
    struct list_head                container_next;
    struct list_head                unbound_list;
    struct mutex                    unbound_lock;
    atomic_t                        opened;
    wait_queue_head_t               container_q;
    bool                            noiommu;
    struct kvm                      *kvm;
    struct blocking_notifier_head   notifier;
};

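An IOMMU Group represents a set of devices whose IDs the hardware cannot tell apart (for example, because they all sit behind a PCIe-to-PCI bridge), so they have to share a single IOMMU page table.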

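The dev field of a VFIO Group points to the Device behind /dev/vfio/$GROUP and is unrelated to the Device passed to vfio_add_group_dev. VFIO Groups and IOMMU Groups correspond one to one, and because an IOMMU Group may contain several devices, a VFIO Group may contain several VFIO Devices, which it references through the device_list list. A VFIO Device is defined as follows: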

struct vfio_device {
    struct kref                     kref;
    struct device                   *dev;
    const struct vfio_device_ops    *ops;
    struct vfio_group               *group;
    struct list_head                group_next;
    void                            *device_data;
};

The pdev->dev passed to vfio_add_group_dev is stored in vfio_device->dev, vfio_pci_ops in vfio_device->ops, and vdev in vfio_device->device_data.

Now consider vfio_add_group_dev(dev, ops, device_data). The real purpose of this function is to create a VFIO Device and add it to the matching VFIO Group:

Step one: obtain the IOMMU Group from dev (the device behind the VFIO Device):

iommu_group = iommu_group_get(dev);
if (!iommu_group)
    return -EINVAL;

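Step two: search the VFIO Group list in the global variable vfio for a matching Group. If none is found, create a new one and point its iommu_group at the IOMMU Group obtained above. The creation is done in vfio_create_group, where the following code is worth noting: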

group->nb.notifier_call = vfio_iommu_group_notifier;

/*
 * blocking notifiers acquire a rwsem around registering and hold
 * it around callback.  Therefore, need to register outside of
 * vfio.group_lock to avoid A-B/B-A contention.  Our callback won't
 * do anything unless it can find the group in vfio.group_list, so
 * no harm in registering early.
 */
ret = iommu_group_register_notifier(iommu_group, &group->nb);
if (ret) {
    kfree(group);
    return ERR_PTR(ret);
}

This registers a callback with the kernel's IOMMU layer: when certain events occur on the IOMMU Group, the VFIO layer is notified and vfio_iommu_group_notifier runs.

The last step creates the VFIO Device. vfio_group_get_device(group, dev) is called first; if the VFIO Group already has a VFIO Device for dev, -EBUSY is returned. Otherwise vfio_group_create_device(group, dev, ops, device_data) is called:

static
struct vfio_device *vfio_group_create_device(struct vfio_group *group,
                                             struct device *dev,
                                             const struct vfio_device_ops *ops,
                                             void *device_data)
{
    struct vfio_device *device;

    device = kzalloc(sizeof(*device), GFP_KERNEL);
    if (!device)
        return ERR_PTR(-ENOMEM);

    kref_init(&device->kref);
    device->dev = dev;
    device->group = group;
    device->ops = ops;
    device->device_data = device_data;
    dev_set_drvdata(dev, device);

    /* No need to get group_lock, caller has group reference */
    vfio_group_get(group);

    mutex_lock(&group->device_lock);
    list_add(&device->group_next, &group->device_list);
    mutex_unlock(&group->device_lock);

    return device;
}

Group Level API

We first look at the API offered by /dev/vfio/$GROUP. The file supports only ioctl operations:

static const struct file_operations vfio_group_fops = {
    .owner              = THIS_MODULE,
    .unlocked_ioctl     = vfio_group_fops_unl_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl       = vfio_group_fops_compat_ioctl,
#endif
    .open               = vfio_group_fops_open,
    .release            = vfio_group_fops_release,
};

On open, the minor number is used to look up the corresponding VFIO Group in vfio.group_idr, and the file's private_data is then set to that VFIO Group:

group = vfio_group_get_from_minor(iminor(inode));
if (!group)
    return -ENODEV;

filep->private_data = group;

A VFIO Group has only four ioctls:

VFIO_GROUP_GET_STATUS, &status: fills in a struct vfio_group_status describing the state of the VFIO Group
VFIO_GROUP_SET_CONTAINER, &fd: takes a pointer to an fd that refers to a VFIO Container and adds the VFIO Group to that Container
VFIO_GROUP_UNSET_CONTAINER: removes the VFIO Group from its Container
VFIO_GROUP_GET_DEVICE_FD, str: takes a string naming a Device in the VFIO Group and returns an fd for that Device

In fact, besides argsz, vfio_group_status contains only a flags field, for which two bits are defined: VFIO_GROUP_FLAGS_VIABLE and VFIO_GROUP_FLAGS_CONTAINER_SET. The latter obviously indicates whether the VFIO Group is bound to a Container. For the meaning of viable, see the comment on the vfio_dev_viable function:

/*
 * A vfio group is viable for use by userspace if all devices are in
 * one of the following states:
 *  - driver-less
 *  - bound to a vfio driver
 *  - bound to a whitelisted driver
 *  - a PCI interconnect device
 *
 * We use two methods to determine whether a device is bound to a vfio
 * driver.  The first is to test whether the device exists in the vfio
 * group.  The second is to test if the device exists on the group
 * unbound_list, indicating it's in the middle of transitioning from
 * a vfio driver to driver-less.
 */

VFIO_GROUP_SET_CONTAINER calls the attach_group method of the Container's IOMMU Driver to add the Group to the Container:

driver = container->iommu_driver;
if (driver) {
    ret = driver->ops->attach_group(container->iommu_data,
                    group->iommu_group);
    if (ret)
        goto unlock_out;
}

Similarly, VFIO_GROUP_UNSET_CONTAINER calls the IOMMU Driver's detach_group method:

driver = container->iommu_driver;
if (driver)
    driver->ops->detach_group(container->iommu_data,
                  group->iommu_group);

VFIO_GROUP_GET_DEVICE_FD first calls the open method of the VFIO Device:

ret = device->ops->open(device->device_data);
if (ret) {
    vfio_device_put(device);
    return ret;
}

For vfio-pci this is vfio_pci_open, which mainly initializes the vfio_pci_device object passed in, based on the Configuration Space of the pdev behind vdev.

It then creates an anonymous inode for the VFIO Device, i.e. an orphan inode that lives under no directory and outside any mounted filesystem, and returns an fd for it:

/*
 * We can't use anon_inode_getfd() because we need to modify
 * the f_mode flags directly to allow more than just ioctls
 */
ret = get_unused_fd_flags(O_CLOEXEC);
if (ret < 0) {
    device->ops->release(device->device_data);
    vfio_device_put(device);
    return ret;
}

filep = anon_inode_getfile("[vfio-device]", &vfio_device_fops,
               device, O_RDWR);
if (IS_ERR(filep)) {
    put_unused_fd(ret);
    ret = PTR_ERR(filep);
    device->ops->release(device->device_data);
    vfio_device_put(device);
    return ret;
}

/*
 * TODO: add an anon_inode interface to do this.
 * Appears to be missing by lack of need rather than
 * explicitly prevented.  Now there's need.
 */
filep->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);

atomic_inc(&group->container_users);

fd_install(ret, filep);
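
From user space, the status ioctl described above can be exercised with very little code. The following is a minimal sketch, assuming the uapi header linux/vfio.h is available; the helper names vfio_status_viable and vfio_group_is_viable are hypothetical and exist only for illustration.

```c
#include <linux/vfio.h>
#include <sys/ioctl.h>

/* Decode the flags word returned by VFIO_GROUP_GET_STATUS.
 * Returns 1 if the viable bit is set, 0 otherwise. (Hypothetical helper.) */
int vfio_status_viable(unsigned int flags)
{
    return (flags & VFIO_GROUP_FLAGS_VIABLE) != 0;
}

/* Query an already opened /dev/vfio/$GROUP fd.
 * Returns 1 (viable), 0 (not viable) or -1 if the ioctl itself fails. */
int vfio_group_is_viable(int group_fd)
{
    /* argsz tells the kernel how large the caller's structure is */
    struct vfio_group_status status = { .argsz = sizeof(status) };

    if (ioctl(group_fd, VFIO_GROUP_GET_STATUS, &status) < 0)
        return -1;
    return vfio_status_viable(status.flags);
}
```

A caller would normally refuse to continue with VFIO_GROUP_SET_CONTAINER when this returns anything other than 1.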

Device Level API

The vfio_device_fops from the previous section is really just a wrapper around the ops of the VFIO Device: it forwards read, write, mmap and ioctl on the VFIO Device fd to the callbacks in device->ops.

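The meaning of read, write and mmap differs between VFIO drivers, but in general the VFIO device file is divided into several Regions, such as a PIO Region, an MMIO Region and the PCI Configuration Space. Each Region lives at a different offset of the device file and can be read, written and mapped separately. In addition, each VFIO device can have one or more IRQ Spaces, used for interrupt emulation. The related ioctls are: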

VFIO_DEVICE_GET_INFO, &info: returns a struct vfio_device_info describing the VFIO Device:

struct vfio_device_info {
    __u32   argsz;
    __u32   flags;
#define VFIO_DEVICE_FLAGS_RESET (1 << 0)    /* Device supports reset */
#define VFIO_DEVICE_FLAGS_PCI   (1 << 1)    /* vfio-pci device */
#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2) /* vfio-platform device */
#define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)    /* vfio-amba device */
#define VFIO_DEVICE_FLAGS_CCW   (1 << 4)    /* vfio-ccw device */
#define VFIO_DEVICE_FLAGS_AP    (1 << 5)    /* vfio-ap device */
    __u32   num_regions;    /* Max region index + 1 */
    __u32   num_irqs;       /* Max IRQ index + 1 */
};

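The information includes which kind of driver provides the VFIO Device (a vfio-mdev device emulates one of them), how many Regions it has, and how many IRQ Spaces it has.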

VFIO_DEVICE_GET_REGION_INFO, &info: queries further information about a Region. It takes and returns a struct vfio_region_info (apart from argsz, the user fills in only index):

struct vfio_region_info {
    __u32   argsz;
    __u32   flags;
#define VFIO_REGION_INFO_FLAG_READ  (1 << 0) /* Region supports read */
#define VFIO_REGION_INFO_FLAG_WRITE (1 << 1) /* Region supports write */
#define VFIO_REGION_INFO_FLAG_MMAP  (1 << 2) /* Region supports mmap */
#define VFIO_REGION_INFO_FLAG_CAPS  (1 << 3) /* Info supports caps */
    __u32   index;      /* Region index */
    __u32   cap_offset; /* Offset within info struct of first cap */
    __u64   size;       /* Region size (bytes) */
    __u64   offset;     /* Region offset from start of device fd */
};

VFIO_DEVICE_GET_IRQ_INFO, &info: queries information about an IRQ Space. It takes and returns a struct vfio_irq_info (apart from argsz, the user fills in only index):

struct vfio_irq_info {
    __u32   argsz;
    __u32   flags;
#define VFIO_IRQ_INFO_EVENTFD       (1 << 0)
#define VFIO_IRQ_INFO_MASKABLE      (1 << 1)
#define VFIO_IRQ_INFO_AUTOMASKED    (1 << 2)
#define VFIO_IRQ_INFO_NORESIZE      (1 << 3)
    __u32   index;      /* IRQ index */
    __u32   count;      /* Number of IRQs within this index */
};

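count is the number of IRQs in this IRQ Space; for example, an IRQ Space representing MSI-X interrupts can have at most 2048 IRQs. The EVENTFD flag indicates that the IRQ Space supports reporting interrupts through eventfd, the MASKABLE flag indicates that its IRQs can be masked and unmasked, and AUTOMASKED indicates that an IRQ is masked automatically after it fires once. NORESIZE indicates that the IRQs in the space are set up as a set, so new subindexes cannot be enabled without first disabling the entire index.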

VFIO_DEVICE_SET_IRQS, &irq_set: takes a struct vfio_irq_set that configures interrupts:

struct vfio_irq_set {
    __u32   argsz;
    __u32   flags;
#define VFIO_IRQ_SET_DATA_NONE      (1 << 0) /* Data not present */
#define VFIO_IRQ_SET_DATA_BOOL      (1 << 1) /* Data is bool (u8) */
#define VFIO_IRQ_SET_DATA_EVENTFD   (1 << 2) /* Data is eventfd (s32) */
#define VFIO_IRQ_SET_ACTION_MASK    (1 << 3) /* Mask interrupt */
#define VFIO_IRQ_SET_ACTION_UNMASK  (1 << 4) /* Unmask interrupt */
#define VFIO_IRQ_SET_ACTION_TRIGGER (1 << 5) /* Trigger interrupt */
    __u32   index;
    __u32   start;
    __u32   count;
    __u8    data[];
};

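index selects the IRQ Space, and start and count give the range of subindexes. The DATA and ACTION bits in flags combine as follows:

ACTION_MASK and ACTION_UNMASK mask and unmask the selected IRQs, respectively
- DATA_NONE selects every IRQ in [start, start + count - 1]
- DATA_BOOL means data[] is an array of bools whose elements indicate, in order, whether each of start through start + count - 1 is selected
ACTION_TRIGGER
- DATA_EVENTFD must be used first: data[] carries an array of eventfds, each registered as the trigger of the corresponding IRQ (-1 leaves that IRQ without a trigger). When the VFIO Device raises an interrupt, the kernel notifies the user program through the registered eventfd.
- Once eventfds are registered, DATA_NONE or DATA_BOOL can be used to manually fire a virtual interrupt on the selected IRQs
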
VFIO_DEVICE_RESET: resets the VFIO Device.
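
Because struct vfio_irq_set ends in a flexible data[] array, a VFIO_DEVICE_SET_IRQS request has to be allocated with room for its payload. The sketch below shows one way to build the DATA_EVENTFD plus ACTION_TRIGGER request described above. It assumes linux/vfio.h is available; build_trigger_request is a hypothetical helper name, not part of any VFIO API.

```c
#include <linux/vfio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build a request that registers `count` eventfds as triggers for
 * subindexes [start, start + count) of IRQ Space `index`.
 * An entry of -1 leaves the corresponding IRQ without a trigger.
 * The caller owns the returned buffer and must free() it.
 */
struct vfio_irq_set *build_trigger_request(uint32_t index, uint32_t start,
                                           const int32_t *eventfds,
                                           uint32_t count)
{
    size_t argsz = sizeof(struct vfio_irq_set) + count * sizeof(int32_t);
    struct vfio_irq_set *set = calloc(1, argsz);

    if (!set)
        return NULL;

    set->argsz = (uint32_t)argsz;   /* header plus eventfd payload */
    set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    set->index = index;
    set->start = start;
    set->count = count;
    memcpy(set->data, eventfds, count * sizeof(int32_t));
    return set;
}
```

The resulting buffer would then be passed as ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set), where device_fd is the fd obtained through VFIO_GROUP_GET_DEVICE_FD.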

VFIO Container

Container Level API

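A VFIO Container differs from a VFIO Group. A VFIO Group is bound to its /dev/vfio/$GROUP device file: each device file corresponds to exactly one VFIO Group and can be opened only once at a time; a second open while it is already open returns -EBUSY. A VFIO Container has a single entry point, /dev/vfio/vfio, and every open of that device file yields a new VFIO Container instance.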

A VFIO Container can do very little by itself. It has only three ioctls:

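VFIO_GET_API_VERSION: returns VFIO_API_VERSION (currently 0)
VFIO_CHECK_EXTENSION, ext: returns 1 if the extension ext is supported and 0 otherwise
VFIO_SET_IOMMU, type: sets the IOMMU Driver to the given type; at least one VFIO Group must be attached to the Container before this ioctl is called
- There are essentially only two types, Type1 IOMMU and sPAPR IOMMU. The former covers IOMMUs on architectures such as x86 and ARM, the latter IOMMUs on POWER
- We only care about Type1 IOMMU, which is further split into VFIO_TYPE1_IOMMU, VFIO_TYPE1v2_IOMMU and VFIO_TYPE1_NESTING_IOMMU; VFIO_TYPE1v2_IOMMU is generally the one to use
- Every type can be passed to VFIO_CHECK_EXTENSION to check whether the kernel supports it; users should perform this check before setting the IOMMU Driver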

Recall the definition of a VFIO Container: besides the IOMMU Driver there is also an iommu_data:

struct vfio_container {
    struct kref                 kref;
    struct list_head            group_list;
    struct rw_semaphore         group_lock;
    struct vfio_iommu_driver    *iommu_driver;
    void                        *iommu_data;
    bool                        noiommu;
};

In vfio_ioctl_set_iommu, the implementation of VFIO_SET_IOMMU, the IOMMU Data is obtained by calling the open method of the IOMMU Driver:

data = driver->ops->open(arg);
if (IS_ERR(data)) {
    ret = PTR_ERR(data);
    module_put(driver->ops->owner);
    continue;
}
/* ... */
container->iommu_driver = driver;
container->iommu_data = data;

In the Type1 IOMMU Driver, the returned IOMMU Data is a struct vfio_iommu (see below).

After this step, the attach_group method of the IOMMU Driver is called for every VFIO Group already attached to the Container:

list_for_each_entry(group, &container->group_list, container_next) {
    ret = driver->ops->attach_group(data, group->iommu_group);
    if (ret)
        goto unwind;
}
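
Putting the Container and Group ioctls together, the usual user-space sequence looks roughly like the sketch below. It assumes linux/vfio.h is available. The function name vfio_open_device and the struct vfio_handles are hypothetical, and the group path and device name are supplied by the caller (something like "/dev/vfio/26" and "0000:06:0d.0" on a real system).

```c
#include <fcntl.h>
#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* File descriptors that have to stay open while the device is in use. */
struct vfio_handles {
    int container;
    int group;
    int device;
};

/*
 * Open a Container, add the Group at group_path to it, select the
 * Type1 v2 IOMMU Driver and fetch an fd for device_name.
 * Returns 0 on success; on failure returns -1 and closes what it opened.
 */
int vfio_open_device(const char *group_path, const char *device_name,
                     struct vfio_handles *out)
{
    struct vfio_group_status status = { .argsz = sizeof(status) };

    out->container = out->group = out->device = -1;

    out->container = open("/dev/vfio/vfio", O_RDWR);
    if (out->container < 0)
        return -1;

    /* Check the API version and that the wanted IOMMU type is supported. */
    if (ioctl(out->container, VFIO_GET_API_VERSION) != VFIO_API_VERSION ||
        ioctl(out->container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU) <= 0)
        goto fail;

    out->group = open(group_path, O_RDWR);
    if (out->group < 0)
        goto fail;

    if (ioctl(out->group, VFIO_GROUP_GET_STATUS, &status) < 0 ||
        !(status.flags & VFIO_GROUP_FLAGS_VIABLE))
        goto fail;

    /* The Group has to be in the Container before VFIO_SET_IOMMU. */
    if (ioctl(out->group, VFIO_GROUP_SET_CONTAINER, &out->container) < 0 ||
        ioctl(out->container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU) < 0)
        goto fail;

    out->device = ioctl(out->group, VFIO_GROUP_GET_DEVICE_FD, device_name);
    if (out->device < 0)
        goto fail;
    return 0;

fail:
    if (out->group >= 0)
        close(out->group);
    close(out->container);
    out->container = out->group = out->device = -1;
    return -1;
}
```

The ordering mirrors the constraints stated above: the extension check precedes VFIO_SET_IOMMU, and VFIO_SET_IOMMU is issued only after a Group has been attached.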

IOMMU Driver (Type 1)

External Interface

All remaining operations on a VFIO Container are forwarded to its IOMMU Driver, including read, write, mmap and every ioctl other than the three above:

/* vfio_fops_read */
driver = container->iommu_driver;
if (likely(driver && driver->ops->read))
    ret = driver->ops->read(container->iommu_data,
                buf, count, ppos);

/* vfio_fops_write */
driver = container->iommu_driver;
if (likely(driver && driver->ops->write))
    ret = driver->ops->write(container->iommu_data,
                 buf, count, ppos);

/* vfio_fops_mmap */
driver = container->iommu_driver;
if (likely(driver && driver->ops->mmap))
    ret = driver->ops->mmap(container->iommu_data, vma);

/* vfio_fops_unl_ioctl */
default:
    driver = container->iommu_driver;
    data = container->iommu_data;

    if (driver) /* passthrough all unrecognized ioctls */
        ret = driver->ops->ioctl(data, cmd, arg);

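VFIO_CHECK_EXTENSION is in fact also forwarded to the IOMMU Driver. If the Container has no Driver yet, the IOMMU Drivers registered in the system are tried one by one with VFIO_CHECK_EXTENSION; the result is 1 if at least one of them returns 1 and 0 otherwise. If the Container already has a Driver, VFIO_CHECK_EXTENSION is issued to that Driver only.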

For the Type 1 IOMMU Driver we care about, the only important ioctls are VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA:

VFIO_IOMMU_MAP_DMA takes a struct vfio_iommu_type1_dma_map:

struct vfio_iommu_type1_dma_map {
    __u32   argsz;
    __u32   flags;
#define VFIO_DMA_MAP_FLAG_READ (1 << 0)     /* readable from device */
#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)    /* writable from device */
    __u64   vaddr;                          /* Process virtual address */
    __u64   iova;                           /* IO virtual address */
    __u64   size;                           /* Size of mapping (bytes) */
};

VFIO_IOMMU_UNMAP_DMA takes a struct vfio_iommu_type1_dma_unmap. The size that was actually unmapped is returned in size (it may differ from the size passed in):

struct vfio_iommu_type1_dma_unmap {
    __u32   argsz;
    __u32   flags;
    __u64   iova;               /* IO virtual address */
    __u64   size;               /* Size of mapping (bytes) */
};

The DMA remapping set up here applies to the whole Container, i.e. to every Group in it. This is discussed in detail below.
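
As an illustration of the two ioctls, here is a minimal user-space sketch, assuming linux/vfio.h is available and that container_fd refers to a Container whose IOMMU Driver has already been set. The helper names fill_dma_map, vfio_dma_map and vfio_dma_unmap are hypothetical. In practice vaddr, iova and size are expected to be page aligned.

```c
#include <linux/vfio.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Fill a map request that lets the device read and write the buffer. */
void fill_dma_map(struct vfio_iommu_type1_dma_map *map, void *vaddr,
                  uint64_t iova, uint64_t size)
{
    *map = (struct vfio_iommu_type1_dma_map){
        .argsz = sizeof(*map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uint64_t)(uintptr_t)vaddr,   /* process virtual address */
        .iova  = iova,                         /* address the device will use */
        .size  = size,
    };
}

/* Returns 0 on success, -1 with errno set on failure. */
int vfio_dma_map(int container_fd, void *vaddr, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    fill_dma_map(&map, vaddr, iova, size);
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

/* On success *size is updated to the size the kernel reports as unmapped. */
int vfio_dma_unmap(int container_fd, uint64_t iova, uint64_t *size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = iova,
        .size  = *size,
    };
    int ret = ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);

    if (ret == 0)
        *size = unmap.size;
    return ret;
}
```

Because the mapping belongs to the Container, a single vfio_dma_map call makes the buffer reachable at iova for every device in every Group attached to that Container.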

Internal Interface

An IOMMU Driver is really just an interface that supplies a set of callbacks, decoupled from any concrete implementation:

struct vfio_iommu_driver {
    const struct vfio_iommu_driver_ops  *ops;
    struct list_head                    vfio_next;
};

/**
 * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
 */
struct vfio_iommu_driver_ops {
    char            *name;
    struct module   *owner;
    void            *(*open)(unsigned long arg);
    void            (*release)(void *iommu_data);
    ssize_t         (*read)(void *iommu_data, char __user *buf,
                            size_t count, loff_t *ppos);
    ssize_t         (*write)(void *iommu_data, const char __user *buf,
                             size_t count, loff_t *size);
    long            (*ioctl)(void *iommu_data, unsigned int cmd,
                             unsigned long arg);
    int             (*mmap)(void *iommu_data, struct vm_area_struct *vma);
    int             (*attach_group)(void *iommu_data,
                                    struct iommu_group *group);
    void            (*detach_group)(void *iommu_data,
                                    struct iommu_group *group);
    int             (*pin_pages)(void *iommu_data, unsigned long *user_pfn,
                                 int npage, int prot,
                                 unsigned long *phys_pfn);
    int             (*unpin_pages)(void *iommu_data,
                                   unsigned long *user_pfn, int npage);
    int             (*register_notifier)(void *iommu_data,
                                         unsigned long *events,
                                         struct notifier_block *nb);
    int             (*unregister_notifier)(void *iommu_data,
                                           struct notifier_block *nb);
};

At present no IOMMU Driver implements the read, write or mmap callbacks, so read, write and mmap cannot actually be performed on a VFIO Container, although support may be added in the future.

The Type 1 IOMMU Driver implements the following callbacks:

static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
    .name                   = "vfio-iommu-type1",
    .owner                  = THIS_MODULE,
    .open                   = vfio_iommu_type1_open,
    .release                = vfio_iommu_type1_release,
    .ioctl                  = vfio_iommu_type1_ioctl,
    .attach_group           = vfio_iommu_type1_attach_group,
    .detach_group           = vfio_iommu_type1_detach_group,
    .pin_pages              = vfio_iommu_type1_pin_pages,
    .unpin_pages            = vfio_iommu_type1_unpin_pages,
    .register_notifier      = vfio_iommu_type1_register_notifier,
    .unregister_notifier    = vfio_iommu_type1_unregister_notifier,
};

Data Structures

vfio_iommu_type1_open creates a struct vfio_iommu, which is stored in the Container's iommu_data member:

struct vfio_iommu {
    struct list_head    domain_list;
    struct vfio_domain  *external_domain; /* domain for external user */
    struct mutex        lock;
    struct rb_root      dma_list;
    struct blocking_notifier_head notifier;
    unsigned int        dma_avail;
    bool                v2;
    bool                nesting;
};

domain_list is a linked list of struct vfio_domain:

struct vfio_domain {
    struct iommu_domain *domain;
    struct list_head    next;
    struct list_head    group_list;
    int                 prot;       /* IOMMU_CACHE */
    bool                fgsp;       /* Fine-grained super pages */
};

group_list in turn is a linked list of struct vfio_group. This is not the VFIO Group seen earlier: this one is defined in drivers/vfio/vfio_iommu_type1.c, the earlier one in drivers/vfio/vfio.c:

struct vfio_group {
    struct iommu_group  *iommu_group;
    struct list_head    next;
};

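Here one struct vfio_group corresponds to one VFIO Group, and therefore also to one IOMMU Group. Different IOMMU Groups can share the same IOMMU page table, in which case we say they belong to the same IOMMU Domain; struct vfio_domain corresponds to an IOMMU Domain. Finally, a Container can hold several IOMMU Domains, i.e. it can manage several IOMMU page tables at once. external_domain is an external IOMMU Domain managed by the VFIO driver; it can be ignored for now and is explained in detail in the analysis of vfio-mdev.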

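We ignore the case where the same IOMMU Group maps to different IOMMU page tables in different processes (for example, both VT-d and SMMU can select a page table by PASID). Linux 5.1-rc6 does not support this scenario yet; patches that have not been upstreamed can be found on Patchwork.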

dma_list is a red-black tree of struct vfio_dma, keyed by the interval [iova, iova + size) (the IOMMU Driver guarantees that these intervals do not overlap):

struct vfio_dma {
    struct rb_node      node;
    dma_addr_t          iova;       /* Device address */
    unsigned long       vaddr;      /* Process virtual addr */
    size_t              size;       /* Map size (bytes) */
    int                 prot;       /* IOMMU_READ/WRITE */
    bool                iommu_mapped;
    bool                lock_cap;   /* capable(CAP_IPC_LOCK) */
    struct task_struct  *task;
    struct rb_root      pfn_list;   /* Ex-user pinned pfn list */
};

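Each vfio_dma represents a small stretch of memory mapping, and these mappings apply to all IOMMU Domains and all IOMMU Groups in the Container. In other words, the page tables of the different IOMMU Domains in a Container have the same contents. Having several Domains is still meaningful, because the VFIO Groups added to a Container may be governed by different IOMMUs and therefore have to use different IOMMU Domains.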

Operations

Everything below ignores the effect that VFIO Groups driven by vfio-mdev have on the IOMMU Driver; vfio-mdev is covered in a separate article.

We first examine vfio_iommu_type1_attach_group(vfio_iommu, iommu_group):

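Step one: check whether the IOMMU Group is already attached to vfio_iommu; if so, return -EINVAL immediately.
Step two: walk the Devices (struct device) in the IOMMU Group; if they sit on different buses, return -EINVAL immediately, otherwise record their common bus (call it bus).
Step three: call iommu_domain_alloc(bus) to create an IOMMU Domain, then call iommu_attach_group(iommu_domain, iommu_group) to add the IOMMU Group to that Domain.
Step four: walk the domain_list of vfio_iommu looking for a Domain that can accommodate the IOMMU Group. If one is found, detach the IOMMU Group from the Domain created in the previous step, attach it to the existing Domain, and return directly: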

/*
 * Try to match an existing compatible domain.  We don't want to
 * preclude an IOMMU driver supporting multiple bus_types and being
 * able to include different bus_types in the same IOMMU domain, so
 * we test whether the domains use the same iommu_ops rather than
 * testing if they're on the same bus_type.
 */
list_for_each_entry(d, &iommu->domain_list, next) {
    if (d->domain->ops == domain->domain->ops &&
        d->prot == domain->prot) {
        iommu_detach_group(domain->domain, iommu_group);
        if (!iommu_attach_group(d->domain, iommu_group)) {
            list_add(&group->next, &d->group_list);
            iommu_domain_free(domain->domain);
            kfree(domain);
            mutex_unlock(&iommu->lock);
            return 0;
        }

        ret = iommu_attach_group(domain->domain, iommu_group);
        if (ret)
            goto out_domain;
    }
}

Otherwise, the DMA mappings have to be set up on the newly created IOMMU Domain: vfio_iommu_replay(iommu, domain) replays all DMA mapping requests, and the new Domain is finally added to the domain_list of vfio_iommu.
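
The reuse-or-create decision in step four can be modeled in plain C. The sketch below is a simplified user-space model, not kernel code: struct model_domain, find_compatible and model_attach_group are hypothetical names, the ops pointer merely stands in for domain->ops, and the kernel's fallback path for a failed attach to an existing Domain is left out.

```c
#include <stddef.h>

/* Simplified stand-in for struct vfio_domain (illustrative only). */
struct model_domain {
    const void *ops;            /* stands in for domain->ops */
    int prot;                   /* stands in for the IOMMU_CACHE setting */
    int ngroups;                /* how many groups share this domain */
    struct model_domain *next;
};

/* Return the first existing domain compatible with `candidate`, or NULL. */
struct model_domain *find_compatible(struct model_domain *head,
                                     const struct model_domain *candidate)
{
    for (struct model_domain *d = head; d; d = d->next)
        if (d->ops == candidate->ops && d->prot == candidate->prot)
            return d;
    return NULL;
}

/*
 * Attach a group for which `fresh` was just allocated. If a compatible
 * domain already exists, the group joins it and `fresh` is left unused
 * (the kernel frees it); otherwise `fresh` becomes a new list entry,
 * which is the point where the kernel replays the DMA mappings.
 * Returns the domain that ends up hosting the group.
 */
struct model_domain *model_attach_group(struct model_domain **head,
                                        struct model_domain *fresh)
{
    struct model_domain *d = find_compatible(*head, fresh);

    if (d) {
        d->ngroups++;
        return d;
    }
    fresh->ngroups = 1;
    fresh->next = *head;
    *head = fresh;
    return fresh;
}
```

The model makes the consequence visible: Groups behind IOMMUs with the same ops and the same prot end up sharing one Domain, and only a Group that needs a different Domain causes the mapping replay.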

Next come vfio_iommu_type1_register_notifier and vfio_iommu_type1_unregister_notifier, whose implementations are simple:

static int vfio_iommu_type1_register_notifier(void *iommu_data,
                                              unsigned long *events,
                                              struct notifier_block *nb)
{
    struct vfio_iommu *iommu = iommu_data;

    /* clear known events */
    *events &= ~VFIO_IOMMU_NOTIFY_DMA_UNMAP;

    /* refuse to register if still events remaining */
    if (*events)
        return -EINVAL;

    return blocking_notifier_chain_register(&iommu->notifier, nb);
}

static int vfio_iommu_type1_unregister_notifier(void *iommu_data,
                                                struct notifier_block *nb)
{
    struct vfio_iommu *iommu = iommu_data;

    return blocking_notifier_chain_unregister(&iommu->notifier, nb);
}

So when is iommu->notifier invoked? Only when the user calls VFIO_IOMMU_UNMAP_DMA:

/* vfio_iommu_type1_ioctl --> vfio_dma_do_unmap */
blocking_notifier_call_chain(&iommu->notifier,
                VFIO_IOMMU_NOTIFY_DMA_UNMAP,
                &nb_unmap);

The notifier registered here therefore does nothing more than invoke a callback on DMA unmap.

Tracing the callers of vfio_iommu_type1_register_notifier leads to vfio_register_notifier, which can also register a Group Notifier (the notifier member of the struct vfio_group in vfio.c):

switch (type) {
case VFIO_IOMMU_NOTIFY:
    ret = vfio_register_iommu_notifier(group, events, nb);
    break;
case VFIO_GROUP_NOTIFY:
    ret = vfio_register_group_notifier(group, events, nb);
    break;
default:
    ret = -EINVAL;
}

Likewise, the Group Notifier is fired at only one moment, namely when a VFIO Group is bound to KVM:

void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
{
    group->kvm = kvm;
    blocking_notifier_call_chain(&group->notifier,
                                 VFIO_GROUP_NOTIFY_SET_KVM, kvm);
}
EXPORT_SYMBOL_GPL(vfio_group_set_kvm);

Next is vfio_iommu_type1_ioctl. We only care about the implementations of VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA, i.e. vfio_dma_do_map and vfio_dma_do_unmap.

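vfio_dma_do_map first checks whether the IOVA range of the DMA mapping request overlaps an existing vfio_dma and returns -EEXIST if it does. It then creates a new vfio_dma object, inserts it into the red-black tree of vfio_iommu, and finally calls vfio_pin_map_dma on it to establish the DMA remapping.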
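
The overlap check can be illustrated with a small user-space model. This is only a sketch: the kernel keeps the ranges in a red-black tree, whereas the model below uses a fixed array for brevity, and the names struct model_iommu, ranges_overlap and model_dma_map are hypothetical. The error codes other than -EEXIST are choices of this model.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_DMA 16

struct model_dma {
    uint64_t iova;
    uint64_t size;
};

struct model_iommu {
    struct model_dma dma[MODEL_MAX_DMA];
    size_t n;
};

/* Two non-empty ranges overlap if each starts at or before the other's
 * last byte. Using inclusive ends avoids overflow at the top of the
 * address space. */
int ranges_overlap(uint64_t a, uint64_t a_size, uint64_t b, uint64_t b_size)
{
    uint64_t a_last = a + a_size - 1;
    uint64_t b_last = b + b_size - 1;

    return a <= b_last && b <= a_last;
}

/* Record a new mapping unless it is empty, wraps around, or overlaps. */
int model_dma_map(struct model_iommu *iommu, uint64_t iova, uint64_t size)
{
    if (size == 0 || iova + size - 1 < iova)
        return -EINVAL;

    for (size_t i = 0; i < iommu->n; i++)
        if (ranges_overlap(iova, size, iommu->dma[i].iova, iommu->dma[i].size))
            return -EEXIST;

    if (iommu->n == MODEL_MAX_DMA)
        return -ENOSPC;

    iommu->dma[iommu->n++] = (struct model_dma){ iova, size };
    return 0;
}
```

Adjacent ranges are accepted, since a range ending at 0x2fff and one starting at 0x3000 share no byte; only a genuine intersection yields -EEXIST.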

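Although the IOVA Region requested by the user and the corresponding HVA (host virtual address) Region are each contiguous, the HPAs (host physical addresses) behind the HVAs are not necessarily contiguous, so the request may have to be split further into several HPA Regions.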

vfio_pin_map_dma consists of a loop. Each iteration first calls vfio_pin_pages_remote to pin a run of contiguous physical memory, then calls vfio_iommu_map to create the IOVA-to-HPA DMA remapping:

while (size) {
    /* Pin a contiguous chunk of memory */
    npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
                      size >> PAGE_SHIFT, &pfn, limit);
    if (npage <= 0) {
        WARN_ON(!npage);
        ret = (int)npage;
        break;
    }

    /* Map it! */
    ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
                 dma->prot);
    if (ret) {
        vfio_unpin_pages_remote(dma, iova + dma->size, pfn,
                    npage, true);
        break;
    }

    size -= npage << PAGE_SHIFT;
    dma->size += npage << PAGE_SHIFT;
}

vfio_iommu_map is simple: it calls iommu_map on every IOMMU Domain in the Container in turn to set up the mapping:

list_for_each_entry(d, &iommu->domain_list, next) {
    ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
            npage << PAGE_SHIFT, prot | d->prot);
    if (ret)
        goto unwind;

    cond_resched();
}

vfio_pin_pages_remote is more complicated:

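Broadly, each call to vaddr_get_pfn translates one vaddr (HVA) into the page frame number (PFN) of its physical page. A for loop keeps fetching PFNs until one is not contiguous with the previous ones; that last, non-contiguous PFN is dropped, and what remains is a run of contiguous physical addresses that can be handed to vfio_iommu_map.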
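
The run detection and the surrounding chunking loop can be modeled in a few lines of plain C. This is a sketch under simplifying assumptions: the PFNs are given up front in an array rather than looked up and pinned page by page, and the names contiguous_run and count_map_calls are hypothetical.

```c
#include <stddef.h>

/* Length of the leading run of physically contiguous frames in pfns[0..n).
 * Models what one vfio_pin_pages_remote call hands back. */
size_t contiguous_run(const unsigned long *pfns, size_t n)
{
    size_t i;

    if (n == 0)
        return 0;
    for (i = 1; i < n; i++)
        if (pfns[i] != pfns[0] + i)
            break;              /* first non-contiguous PFN ends the run */
    return i;
}

/* Number of mapping calls needed to cover all n pages, one per run.
 * Models the while loop in vfio_pin_map_dma. */
size_t count_map_calls(const unsigned long *pfns, size_t n)
{
    size_t calls = 0;

    while (n) {
        size_t run = contiguous_run(pfns, n);

        pfns += run;            /* continue after the run just mapped */
        n -= run;
        calls++;
    }
    return calls;
}
```

For a buffer whose pages happen to be backed by frames 100, 101, 102, 200, 201 and 50, the model yields three runs, so three separate mapping calls would cover one contiguous IOVA range.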

Internally vaddr_get_pfn uses get_user_pages to obtain and pin the page (struct page) for an HVA, from which the PFN follows:

down_read(&mm->mmap_sem);
if (mm == current->mm) {
    ret = get_user_pages_longterm(vaddr, 1, flags, page, vmas);
} else {
    ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
                    vmas, NULL);
    /*
     * The lifetime of a vaddr_get_pfn() page pin is
     * userspace-controlled. In the fs-dax case this could
     * lead to indefinite stalls in filesystem operations.
     * Disallow attempts to pin fs-dax pages via this
     * interface.
     */
    if (ret > 0 && vma_is_fsdax(vmas[0])) {
        ret = -EOPNOTSUPP;
        put_page(page[0]);
    }
}
up_read(&mm->mmap_sem);

if (ret == 1) {
    *pfn = page_to_pfn(page[0]);
    return 0;
}

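The get_user_pages_* functions achieve the so-called "pinning" of memory by incrementing the _refcount of the struct page through try_get_page(page). The practical effect is:

Reclaim may still unmap the page and write it out to swap, but the physical frame itself is not freed while the extra reference is held
The page will not be migrated, i.e. the mapping between virtual and physical address is locked

By contrast, "locking" memory with the mlock system call means:

The memory will not be swapped out
The memory can be migrated, i.e. the virtual address stays the same while the physical address changes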

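vfio_pin_pages_remote also counts the total number of pages it pins, except that pages already pinned through the pin_pages callback (i.e. pages that are being pinned a second time) are not counted: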

if (!rsvd && !vfio_find_vpfn(dma, iova)) {
    if (!dma->lock_cap &&
        current->mm->locked_vm + lock_acct + 1 > limit) {
        put_pfn(pfn, dma->prot);
        pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
            __func__, limit << PAGE_SHIFT);
        ret = -ENOMEM;
        goto unpin_out;
    }
    lock_acct++;
}

The total is accumulated in lock_acct. At the end of the function, vfio_lock_acct(dma, lock_acct, false) adds lock_acct to the mm->locked_vm of the caller of DMA Map (mm->locked_vm += lock_acct).
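
The accounting rule can be summarized in a small user-space model. It is only a sketch: struct model_mm and model_account are hypothetical names, the already_pinned array stands in for the vfio_find_vpfn lookup, and the real function additionally unpins pages on failure.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the one mm field the accounting touches. */
struct model_mm {
    unsigned long locked_vm;
};

/*
 * Account npage pages against `limit` (in pages).
 * already_pinned[i] is true when page i was pinned earlier through the
 * pin_pages callback; such pages are skipped. lock_cap models a caller
 * with CAP_IPC_LOCK, for whom the limit is not enforced.
 * Returns the number of newly accounted pages, or -ENOMEM if the limit
 * would be exceeded, in which case locked_vm is left unchanged.
 */
long model_account(struct model_mm *mm, const bool *already_pinned,
                   size_t npage, unsigned long limit, bool lock_cap)
{
    unsigned long lock_acct = 0;

    for (size_t i = 0; i < npage; i++) {
        if (already_pinned[i])
            continue;
        if (!lock_cap && mm->locked_vm + lock_acct + 1 > limit)
            return -ENOMEM;
        lock_acct++;
    }

    mm->locked_vm += lock_acct;     /* mirrors mm->locked_vm += lock_acct */
    return (long)lock_acct;
}
```

The model shows why a large DMA mapping can fail with -ENOMEM for an unprivileged process: every newly pinned page is charged against the same budget that RLIMIT_MEMLOCK governs.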

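Double pinning can only happen when a VFIO Group driven by vfio-mdev is attached to the Container first and pins some pages through the pin_pages method, and a regular VFIO Group is attached afterwards. As described earlier, attaching the regular VFIO Group calls vfio_iommu_replay on the newly created IOMMU Domain, which also ends up in vfio_pin_pages_remote; pages already pinned by vfio-mdev are then left out of the count.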

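In theory mm->locked_vm counts the pages in an address space that are locked by mlock. Here mlock is not called and the VM_LOCKED flag is not set on the vma, yet the locked_vm count is increased. Apart from the RLIMIT_MEMLOCK check shown above, what purpose this serves is unclear.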

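The code that modifies mm->locked_vm already appears in the earliest traceable version, Tom Lyon's initial PATCH, which likewise used get_user_pages_fast to pin memory, so the original intent can no longer be determined.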