Eliminate FPGA admission webhook's mode #301

Closed
rojkov opened this issue Feb 24, 2020 · 2 comments · Fixed by #358
@rojkov
Contributor

rojkov commented Feb 24, 2020

Problem:

It is possible to run the FPGA device plugin in two different modes on different nodes of the same cluster. The admission webhook, however, can be configured to work with FPGA device plugins in only one mode at a time, either preprogrammed or orchestrated. The webhook needs to be redesigned to be agnostic about the device plugins' modes. Also, when operating in preprogrammed mode it is impossible to differentiate nodes providing the same accelerated function on different hardware, e.g. a user's request to dispatch a task onto stratix10-dcp1.0-nlb0 may well be dispatched to a node running the nlb0 accelerated function on an Arria10.

Solution:

  1. Modify the FPGA plugin to expose both AF and interface IDs in resource names in "preprogrammed" mode (currently only AF ID is exposed).
  2. Modify AcceleratorFunction CRDs to contain info on the hardware the accelerated function is intended to run on (interface ID) and the required mode of the FPGA plugin, so that a cluster admin can configure the webhook to operate with plugins in both modes.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: acceleratorfunctions.fpga.intel.com
spec:
  group: fpga.intel.com
  version: v1
  scope: Namespaced
  names:
    plural: acceleratorfunctions
    singular: acceleratorfunction
    kind: AcceleratorFunction
    shortNames:
    - af
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            afuId:
              type: string
              pattern: '^[0-9a-f]{8,128}$'
            interfaceId:
              type: string
              pattern: '^[0-9a-f]{8,128}$'
            mode:
              type: string
              pattern: '^(preprogrammed|orchestrated)$'
  3. Modify the webhook not to accept the -mode option and to translate requested resources using only AcceleratorFunction CRDs in the new format.

The format of resource names visible to a user is not changed. Basically the format can be anything, but it's expected to be in the form <hardware>-<firmware_release>-<accelerated_function>, e.g. arria10-dcp1.1-nlb0.

@rojkov
Contributor Author

rojkov commented Apr 6, 2020

So, in this patch I made the FPGA plugin expose AFs as fpga.intel.com/<interface_id><afu_id> to make AFUs provided by different HW distinguishable.

The problem, though, is that such a resource name is 64 characters long (32 + 32), whereas the maximum resource name length without the namespace is 63:

• Failure [6.023 seconds]
FPGA Admission Webhook
/home/rojkov/work/intel-device-plugins-for-kubernetes/test/e2e/fpgaadmissionwebhook/fpgaadmissionwebhook.go:36
  mutates created pods to reference resolved AFs [It]
  /home/rojkov/work/intel-device-plugins-for-kubernetes/test/e2e/fpgaadmissionwebhook/fpgaadmissionwebhook.go:51

  pod Create API error
  Unexpected error:
      <*errors.StatusError | 0xc0004a4320>: {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Pod \"webhook-tester\" is invalid: [spec.containers[0].resources.limits[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]: Invalid value: \"fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18\": name part must be no more than 63 characters, spec.containers[0].resources.limits[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]: Invalid value: \"fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18\": doesn't follow extended resource name standard, spec.containers[0].resources.requests[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]: Invalid value: \"fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18\": name part must be no more than 63 characters, spec.containers[0].resources.requests[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]: Invalid value: \"fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18\": doesn't follow extended resource name standard]",
              Reason: "Invalid",
              Details: {
                  Name: "webhook-tester",
                  Group: "",
                  Kind: "Pod",
                  UID: "",
                  Causes: [
                      {
                          Type: "FieldValueInvalid",
                          Message: "Invalid value: \"fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18\": name part must be no more than 63 characters",
                          Field: "spec.containers[0].resources.limits[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]",
                      },
                      {
                          Type: "FieldValueInvalid",
                          Message: "Invalid value: \"fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18\": doesn't follow extended resource name standard",
                          Field: "spec.containers[0].resources.limits[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]",
                      },
                      {
                          Type: "FieldValueInvalid",
                          Message: "Invalid value: \"fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18\": name part must be no more than 63 characters",
                          Field: "spec.containers[0].resources.requests[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]",
                      },
                      {
                          Type: "FieldValueInvalid",
                          Message: "Invalid value: \"fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18\": doesn't follow extended resource name standard",
                          Field: "spec.containers[0].resources.requests[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]",
                      },
                  ],
                  RetryAfterSeconds: 0,
              },
              Code: 422,
          },
      }
      Pod "webhook-tester" is invalid: [spec.containers[0].resources.limits[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]: Invalid value: "fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18": name part must be no more than 63 characters, spec.containers[0].resources.limits[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]: Invalid value: "fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18": doesn't follow extended resource name standard, spec.containers[0].resources.requests[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]: Invalid value: "fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18": name part must be no more than 63 characters, spec.containers[0].resources.requests[fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18]: Invalid value: "fpga.intel.com/bfac4d851ee856fe8c95865ce1bbaa2df7df405cbd7acf7222f144b0b93acd18": doesn't follow extended resource name standard]
  occurred

  /home/rojkov/work/intel-device-plugins-for-kubernetes/test/e2e/fpgaadmissionwebhook/fpgaadmissionwebhook.go:95

/cc @kad @bart0sh Do you mind if I remove the last character of FPGA interface IDs when exposing AF resources?
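
The length arithmetic and the proposed truncation can be reproduced with a small Go sketch. The function names are hypothetical; the 63-character limit is taken from the API server error above, and the IDs are the two halves of the rejected resource name, assuming the interface ID comes first.

```go
package main

import (
	"fmt"
	"strings"
)

// maxNamePartLen is the limit quoted in the API server error message.
const maxNamePartLen = 63

const namespace = "fpga.intel.com"

// AFResourceName builds fpga.intel.com/<interface_id><afu_id>.
func AFResourceName(interfaceID, afuID string) string {
	return namespace + "/" + interfaceID + afuID
}

// TruncatedAFResourceName drops the last character of the interface ID
// before concatenation, as proposed in this comment.
func TruncatedAFResourceName(interfaceID, afuID string) string {
	if interfaceID == "" {
		return AFResourceName(interfaceID, afuID)
	}
	return AFResourceName(interfaceID[:len(interfaceID)-1], afuID)
}

// NamePartLen returns the length of the part after the first "/",
// or of the whole string if there is no "/".
func NamePartLen(name string) int {
	return len(name[strings.Index(name, "/")+1:])
}

func main() {
	iface := "bfac4d851ee856fe8c95865ce1bbaa2d"
	afu := "f7df405cbd7acf7222f144b0b93acd18"

	full := AFResourceName(iface, afu)
	short := TruncatedAFResourceName(iface, afu)

	fmt.Println(NamePartLen(full), NamePartLen(full) <= maxNamePartLen)   // 64 false
	fmt.Println(NamePartLen(short), NamePartLen(short) <= maxNamePartLen) // 63 true
}
```

Dropping one hex digit discards 4 bits of the interface ID, so two interface IDs that differ only in their last digit would map to the same resource name.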

@bart0sh
Member

bart0sh commented Apr 6, 2020

I don't. At least for the POC version.
