...
- Kubelet notices the pod being deleted.
- Kubelet invokes StopContainer CRI calls, which are forwarded to Virtlet based on the annotations of the containing pod sandbox.
- Virtlet stops the libvirt domain. libvirt sends a signal to qemu, which initiates the shutdown. If it doesn't quit in a reasonable time determined by pod's termination grace period, Virtlet will forcibly terminate the domain, thus killing the qemu process.
- After all the containers in the pod (the single container in case of Virtlet VM pod) are stopped, Kubelet invokes StopPodSandbox CRI call.
- Virtlet asks its tapmanager to remove the pod from the network by means of the CNI DEL command.
- After StopPodSandbox returns, the pod sandbox will eventually be GC'd by Kubelet by means of the RemovePodSandbox CRI call.
- Upon RemovePodSandbox, Virtlet removes the pod metadata from its internal database.
...
To solve these problems, we should first have a clear understanding of device plugins. A concept related to device plugins is Kubernetes extended resources. In short, sending a patch node request to the Kubernetes API server adds a custom resource type to the node, which is then used for quota accounting of that resource and for the corresponding QoS configuration.
...
To send the patch node request conveniently, first run the kubectl proxy command to start a temporary proxy to the Kubernetes API server, then send a JSON Patch request that adds six intel.com/devices resources to a node (the ~1 in the request path is the JSON Pointer escape for / and is decoded automatically).
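As a rough illustration of the request body, the sketch below builds the JSON Patch documents for adding and removing the extended resource, including the ~1 escaping. The helper names (`escapePointer`, `capacityPatch`) are made up for this example. The body would typically be sent as an HTTP PATCH with Content-Type application/json-patch+json to the node's status endpoint through the proxy; the exact node name and proxy address depend on your cluster and are not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// patchOp is one JSON Patch (RFC 6902) operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value,omitempty"`
}

// escapePointer applies JSON Pointer (RFC 6901) escaping to a single
// path token: "~" becomes "~0" and "/" becomes "~1".
func escapePointer(token string) string {
	return strings.NewReplacer("~", "~0", "/", "~1").Replace(token)
}

// capacityPatch builds a one-operation patch against the node's
// status.capacity for the given resource name. count is ignored
// by the "remove" operation and can be left empty.
func capacityPatch(op, resource, count string) (string, error) {
	body, err := json.Marshal([]patchOp{{
		Op:    op,
		Path:  "/status/capacity/" + escapePointer(resource),
		Value: count,
	}})
	return string(body), err
}

func main() {
	add, _ := capacityPatch("add", "intel.com/devices", "6")
	remove, _ := capacityPatch("remove", "intel.com/devices", "")
	fmt.Println(add)
	fmt.Println(remove)
}
```

The same helper covers the clean-up case: a "remove" operation on the same path deletes the extended resource from the node again.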
The node now advertises six intel.com/devices resources, which appear under the node's status.capacity.
Now we can use these resources in our pod by adding intel.com/devices: "1" to spec.containers.resources.requests/limits; the scheduler then counts the request against the node's capacity when placing the pod.
To clean up the extended resources, send another patch node request with a remove operation on the same path.
Device plugin
Overview
Kubernetes provides vendors with a mechanism called device plugins. Device plugins are simple gRPC servers that may run in a container deployed through the pod mechanism or in bare metal mode, and they perform the following three tasks:
- advertise devices.
- monitor devices (currently perform health checks).
- hook into the runtime to execute device specific instructions (e.g. clean GPU memory) and to take the steps needed to make the device available in the container.

The gRPC service a device plugin implements:

```protobuf
service DevicePlugin {
	// returns a stream of []Device
	rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
	rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
}
```
...
Why device plugin
- Very few devices are handled natively by Kubelet (CPU and memory)
- Need a sustainable solution for vendors to be able to advertise their resources to Kubelet and monitor them without writing custom Kubernetes code
- Need a consistent and portable way to consume hardware devices of a particular type (GPU, QAT, FPGA, etc.) in pods across k8s clusters
- ...
How it works
In Kubernetes, Kubelet offers a registration gRPC server that allows a device plugin to register itself with Kubelet. When starting, the device plugin makes a (client) gRPC call to the Register function that Kubelet exposes. Kubelet answers the RegisterRequest with a RegisterResponse containing any error Kubelet might have encountered (API version not supported, resource name already registered), and the device plugin starts its gRPC server if it did not receive an error. The RegisterRequest notifies Kubelet of the following information:
- Its own Unix socket name, on which it will receive requests from Kubelet through the gRPC APIs.
- The API version of the device plugin itself.
- The resource name it wants to advertise. The resource name must follow a specified format (vendor-domain/vendor-device), such as intel.com/qat.
After successful registration, Kubelet calls the ListAndWatch function of the device plugin. ListAndWatch lets Kubelet discover the devices and their properties and be notified of any status change (for example, a device becoming unhealthy). The list of devices is returned as an array describing every device of the resource (ID, health status). Kubelet records the resource and its number of devices in node.status.capacity/allocatable and reports it to the API server. The function keeps checking in a loop: once a device becomes abnormal or is unplugged from the machine, it sends the latest device list to Kubelet.
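The streaming behavior of ListAndWatch can be approximated with channels. This is only a sketch under stated assumptions: `Device` is a simplified local type, a Go channel stands in for the gRPC stream, and the `events` channel is a hypothetical source of health changes that a real plugin would produce from its own monitoring.

```go
package main

import "fmt"

// Device is a simplified local description of one device (ID, health).
type Device struct {
	ID      string
	Healthy bool
}

// snapshot copies the list so the receiver never shares state with the loop.
func snapshot(devices []Device) []Device {
	return append([]Device(nil), devices...)
}

// listAndWatch sends the full device list once, then sends the updated
// full list again after every health event. It closes out when the
// events channel is closed.
func listAndWatch(devices []Device, events <-chan Device, out chan<- []Device) {
	current := snapshot(devices)
	out <- snapshot(current)
	for ev := range events {
		for i := range current {
			if current[i].ID == ev.ID {
				current[i].Healthy = ev.Healthy
			}
		}
		out <- snapshot(current)
	}
	close(out)
}

// healthyCount is what the receiver could use to derive the allocatable count.
func healthyCount(devices []Device) int {
	n := 0
	for _, d := range devices {
		if d.Healthy {
			n++
		}
	}
	return n
}

func main() {
	events := make(chan Device)
	out := make(chan []Device)
	go listAndWatch([]Device{{ID: "vf0", Healthy: true}, {ID: "vf1", Healthy: true}}, events, out)

	fmt.Println("healthy devices after initial list:", healthyCount(<-out))
	events <- Device{ID: "vf1", Healthy: false} // simulate a device going bad
	fmt.Println("healthy devices after update:", healthyCount(<-out))
	close(events)
}
```

Sending the whole list on every change, rather than a delta, matches the description above: the receiver simply replaces its view of the resource with the latest list.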
...
When testing QAT SR-IOV support with the official Virtlet image together with the QAT device plugin, we took the simple, straightforward approach of adding the resource name qat.intel.com/generic, advertised by the QAT device plugin, to the fields spec.containers.resources.limits and spec.containers.resources.requests with the value "1". This works correctly in plain Kubernetes pods. In a Virtlet VM pod, however, we ran into a conflict caused by the way Virtlet translates the pod configuration into a virtual machine configuration. The issue is that when a QAT VF device is allocated to a Virtlet VM pod, Kubelet adds the extended device to kubeapi.ContainerConfig.Devices (k8s.io/kubernetes/pkg/kubelet/apis/cri/runtime/v1alpha2 - v1.14). Virtlet then incorrectly converts all of these devices into its volume devices and later treats them as block disks with disk drivers bound to them.
```go
for _, dev := range in.Config.Devices { ...
```
This causes errors after a VM pod starts, such as too many disks, disk read failures, and permission denied errors. Apart from this, I want to assign the QAT VF to the Virtlet pod by PCI passthrough, so I need to add the corresponding fields to the libvirt domain XML that Virtlet creates. Virtlet is a CRI implementation, and code analysis shows that the domain XML is built in createDomain(config *types.VMConfig) *libvirtxml.Domain (pkg/libvirttools/virtualization.go) using the libvirtxml "github.com/libvirt/libvirt-go-xml" Go module. With the whole workflow clear, the fix can now be made.
```go
domain := &libvirtxml.Domain{ ...
```
Fix
continue
Example
...