P0 - Job Name, UID and Description #3935

yqwang-ms · 2019-11-29T07:52:56Z

If RestServer uses the Job UID instead of the Job Name as the job key to serve queries:
Pros:

Job Name can be very long (stored in an annotation as the Job Description)
History jobs (of the same name) can appear in the job list and job detail pages: List history jobs with job history API #3845

Cons:

RestServer may need to cache the mapping from UID to name so that it can query APIServer efficiently by name; otherwise, it has to iterate over all frameworks. (Or use the UID as a framework label.)
May require many changes in the backend

Proposal-1:

Job Name to submit idempotently,
Job UID to query uniquely,
Job Description to attach metadata arbitrarily.
UID generated by K8S

Add a new field named description to the PAI Job Spec. It can be any string of reasonable length (<10k), and RestServer stores it in a k8s framework annotation.
If the user specifies a job name (they want idempotence), RestServer submits with this job name as the k8s framework name, but still uses the k8s framework UID as the job key to serve queries (it may still be able to use the name to serve active job queries).
If the user does not specify a job name (they do not care about idempotence, like Aether), RestServer submits with an empty k8s framework name (k8s auto-generates it if metadata.generateName is set), and uses the k8s framework UID as the job key to serve queries (it may still be able to use the name to serve active job queries).
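The two submission paths above could be sketched roughly as follows. This is a minimal illustration, not the RestServer implementation: the annotation key `description` and the `job-` prefix for `generateName` are hypothetical, and the `apiVersion` is taken from the framework URLs that appear in the benchmarks later in this thread.

```python
from typing import Any, Dict, Optional


def build_framework_manifest(description: str, job_name: Optional[str] = None) -> Dict[str, Any]:
    """Build a Framework manifest along the lines of Proposal-1 (illustrative only)."""
    if len(description) >= 10_000:
        raise ValueError("description should stay under the ~10k limit suggested above")

    metadata: Dict[str, Any] = {
        # Hypothetical annotation key; the proposal does not name one.
        "annotations": {"description": description},
    }
    if job_name:
        # Named submission: posting the same name again conflicts instead of duplicating.
        metadata["name"] = job_name
    else:
        # Unnamed submission: the API server appends a random suffix to this prefix.
        metadata["generateName"] = "job-"  # hypothetical prefix

    return {
        "apiVersion": "frameworkcontroller.microsoft.com/v1",
        "kind": "Framework",
        "metadata": metadata,
    }
```

In both cases the job key returned to the client would be the `metadata.uid` that k8s assigns on creation, which is not known until the POST succeeds.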

Example:

  POST /jobs/  
      --> If Request's JOB_NAME is not empty
         --> Response includes its JOB_NAME and JOB_UID (Always the same if it has not been GCed)
      --> Else
         --> Response includes its JOB_NAME and JOB_UID (Always different)

  GET /jobs/{JOB_UID}
      --> Can query all jobs, both in history and active
      --> Response includes its JOB_NAME (Always the same)
      --> Useful for query both in history and active, such as webportal, etc

active jobs: jobs in k8s apiserver
history jobs: jobs only in elasticsearch

TBD:

  POST /activejobs/  
  GET /alljobs/{JOB_UID}
  GET /activejobs/{JOB_NAME}
      --> Can only query active jobs, i.e. jobs that have not been GCed to history
      --> Response includes its JOB_UID (Always the same if it has not been GCed)
      --> Useful for just checking existence in a stateless job submitter:
            If !(GET JOB_NAME)
               Prepare Externals: Cleanup previous intermediate data 
               POST JOB_NAME
            WATCH JOB_NAME
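The stateless submitter flow above can be sketched in Python. `FakeRestServer` is a hypothetical in-memory stand-in for the TBD `/activejobs` endpoints, not a real client; it only models the property that posting the same name while the job is active yields the same UID.

```python
from typing import Callable, Dict, Optional


class FakeRestServer:
    """In-memory stand-in for the proposed /activejobs endpoints (hypothetical)."""

    def __init__(self) -> None:
        self._active: Dict[str, str] = {}  # job name -> job UID

    def get_active_job(self, name: str) -> Optional[str]:
        """GET /activejobs/{JOB_NAME}: return the UID, or None if not active."""
        return self._active.get(name)

    def post_job(self, name: str) -> str:
        """POST /activejobs/: same name while active returns the same UID."""
        return self._active.setdefault(name, f"uid-{len(self._active) + 1}")


def submit_once(server: FakeRestServer, name: str, prepare_externals: Callable[[], None]) -> str:
    """Submit the job only if it is not already active, then return its UID."""
    if server.get_active_job(name) is None:
        prepare_externals()  # e.g. clean up previous intermediate data
        server.post_job(name)
    uid = server.get_active_job(name)
    assert uid is not None
    return uid
```

Calling `submit_once` twice with the same name runs the cleanup once and returns the same UID both times, which is the behavior a restarted submitter relies on.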

Proposal-2:

Job UID to submit idempotently and query uniquely,
Job Description to attach metadata arbitrarily.
UID generated by client

Assume the RestServer client (WebPortal/SDK/RawHttpClient) always generates a unique UID, which takes the place of the current PAI Job Name.
Or RestServer also always checks the current PAI Job Name for conflicts in the history server.

Pros:
In this way, we can merge the concepts JOB_NAME and JOB_UID in Proposal-1 into a single concept: JOB_UID. Furthermore, RestServer does not need to change much, e.g. it does not need to store the mapping from JOB_UID to JOB_NAME. So this proposal is simpler and smoother.

Cons:

It is more vulnerable to conflicts than a UID generated by a centralized server (how much needs to be measured); otherwise, checking for conflicts may bring the history server into the critical path.
To achieve idempotence, the client needs to persist the generated UID before submitting, to avoid duplicate submissions. So the client must depend on a distributed storage to tolerate transient submit failures and retries.
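The second con can be illustrated with a small sketch. A local JSON file stands in for the distributed storage mentioned above, and the function name and file layout are hypothetical. The point is only the ordering: the UID is persisted before the first submit, so every retry reuses it.

```python
import json
import os
import tempfile
import uuid


def load_or_create_uid(state_path: str) -> str:
    """Return the UID persisted for this submission, creating it on first call."""
    if os.path.exists(state_path):
        with open(state_path) as f:
            return json.load(f)["job_uid"]

    job_uid = uuid.uuid4().hex
    # Write to a temp file and rename, so a crash never leaves a half-written file.
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"job_uid": job_uid}, f)
    os.replace(tmp_path, state_path)
    return job_uid
```

A client would call `load_or_create_uid` before every POST attempt; if the process crashes between persisting and submitting, the retry picks up the same UID instead of generating a new one.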

Example:

  POST /jobs/
      --> Request includes client generated JOB_UID (as k8s framework name to submit)
      --> (TBD: RestServer also check the JOB_UID conflicts in history server)

  GET /jobs/{JOB_UID}
      --> Can query all jobs, both in history and active
      --> (TBD: May match multiple jobs, and need to choose one to return)

Proposal-3:

UID generated by RestServer.

Based on Proposal-1, but the UID is generated by RestServer instead of K8S. RestServer will use it as the k8s framework name to submit if the user does not specify a job name.
#3935 (comment)

Proposal-4:

Based on Proposal-1, but

Job Name to submit idempotently and attach metadata arbitrarily,
Job UID to query uniquely.
UID generated by RestServer or K8S.

#3935 (comment)

Cons are summarized at #3935 (comment)


debuggy · 2019-12-02T03:20:53Z

Supposing Job Names can be duplicated, how could the rest server know exactly which job is meant if only the Job Name is given by the user (or client)?

It is more straightforward if only one identifier, the UID, is used to refer to a job. The job name and description are only used for displaying and filtering. And the rest server does not need to cache the mapping from UID to name.

Example:

When submitting a job, the rest server gets a UID (no matter whether it is generated by the launcher or by itself) as the identifier, and a job name provided by the user.
When listing jobs, the rest server accepts a list request and returns the list of jobs (containing the UID in the response), allowing the job name (as well as pagination and others) as filter parameters.
When querying a job, the rest server only accepts the UID as the query ID for a specific job.
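A minimal in-memory sketch of these three operations, assuming the UID is generated by the store itself. The class and method names are hypothetical and only illustrate that the UID is the sole lookup key while the name is just a filter.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    uid: str
    name: str
    description: str = ""


class JobStore:
    """Jobs are keyed by UID; the name is only used for display and filtering."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def submit(self, name: str, description: str = "") -> Job:
        job = Job(uid=uuid.uuid4().hex, name=name, description=description)
        self._jobs[job.uid] = job
        return job

    def list_jobs(self, name: Optional[str] = None) -> List[Job]:
        """List all jobs, optionally filtered by (possibly duplicated) name."""
        return [j for j in self._jobs.values() if name is None or j.name == name]

    def get(self, uid: str) -> Optional[Job]:
        """Query a specific job; only the UID is accepted."""
        return self._jobs.get(uid)
```

Two submissions with the same name coexist here as separate entries, which is what makes history jobs of the same name listable.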

Explanation:
Most of the time a user gets a specific job from the job list, in which case the UID is given. If a user wants to query a specific job directly, they have to remember the job UID first.

yqwang-ms · 2019-12-02T04:00:50Z

@debuggy See the example: when using the name, the rest server only checks it against active jobs, i.e. via the k8s APIServer.
And we can only query APIServer by name, so with only the UID you cannot query APIServer; you have to first convert the UID to a name.

debuggy · 2019-12-02T05:23:49Z

@yqwang-ms Differentiating active jobs from history jobs will increase complexity, and there is no benefit from it. It is better to keep this difference inside the rest server and not expose it to the client.

In this design, how does a user get a history job by querying api server?

yqwang-ms · 2019-12-02T05:34:20Z

Differentiating active jobs from history jobs will increase complexity, and there is no benefit from it. It is better to keep this difference inside the rest server and not expose it to the client.

See GET /activejobs/{JOB_NAME} for the user's benefit. If we do not need it, we can remove this API.
Then we would only have POST /jobs/ and GET /jobs/{JOB_UID}

In this design, how does a user get a history job by querying api server?

GET /alljobs/{JOB_UID}
And we can even support other advanced queries, since we are using elasticsearch as the backend.

debuggy · 2019-12-02T06:28:56Z

@yqwang-ms So currently my idea supports the second endpoint, GET /alljobs/{JOB_UID}.

As for the first one, GET /activejobs/{JOB_NAME}, we would first have to let users know the definition of active jobs, which may cost additional effort.

By the way, is there any significant difference between active jobs and history jobs?

yqwang-ms · 2019-12-02T06:39:42Z

active jobs: jobs in k8s apiserver
history jobs: jobs only in elasticsearch

sterow · 2019-12-03T06:56:51Z

I am inclined to Proposal-1, as it at least keeps the job name unchanged.

I think we can just keep the original solution of using the job name to achieve idempotent job submission, and introduce UIDs just for job history. Job names will not be unique in job history, so the UID will be used there. So, based on Proposal-1, we will not change the old APIs, but add some enhancements:

If the user doesn't specify a job name, then we will use the UID as the job name automatically.
When querying jobs, we can let users query active jobs by job name, or history jobs by UID. We can prefix a special character such as ":" to specify a job UID, such as "GET /jobs/:xxxx-xxxx-xx"

With these minor modifications to Proposal-1, we can stay fully backward compatible while supporting history server job queries.
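The ":" prefix idea could look roughly like this. The function name and the returned tuple shape are hypothetical; the sketch only shows how a single path segment can be routed to a name lookup or a UID lookup. It relies on ":" never starting a job name, which holds for names matching the YARN-era pattern `^[a-zA-Z0-9_-]+$` quoted later in this thread.

```python
from typing import Tuple

UID_PREFIX = ":"


def parse_job_key(path_segment: str) -> Tuple[str, str]:
    """Classify the {key} in GET /jobs/{key} as a job name or a job UID."""
    if path_segment.startswith(UID_PREFIX):
        uid = path_segment[len(UID_PREFIX):]
        if not uid:
            raise ValueError("empty UID after ':' prefix")
        return ("uid", uid)
    if not path_segment:
        raise ValueError("empty job key")
    return ("name", path_segment)
```

For example, `parse_job_key(":xxxx-xxxx-xx")` is routed to a UID lookup (active or history), while `parse_job_key("job-x")` keeps the existing name-based behavior.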

yqwang-ms · 2019-12-03T07:34:37Z

If the user doesn't specify a job name, then we will use the UID as the job name automatically.

So this seems to mean that the UID is generated by RestServer instead of K8S.

So far, we have 3 proposals:
Proposal-2: UID generated by client
Proposal-3: UID generated by RestServer
Proposal-1: UID generated by K8S

abuccts · 2019-12-04T11:59:45Z

Another solution:

The rest server uses the job name as the key to identify jobs; the user needs to specify a job name:
- no length restriction
- needs to be unique among active jobs for idempotence; the rest server will check for duplicate job names
- the job name can be used to get an active job, as a label selector to query the api server
The k8s api server uses a UID as the key to identify frameworks; the UID can be generated in the rest server (by modifying the encode function) or by k8s (by using generateName):
- the UID is unique among all active and history jobs
- the UID can be used to get an active or history job

Pros:

allow long job name without changing user experience, transparent to user
won't break current jobs and apis
least modification (excluding history job part)

yqwang-ms · 2019-12-13T03:10:34Z

@abuccts to help check:
Use label as Name:

Label Perf
Some features are only available for Name, such as DNS. Are we sure we will not need them in the future?
Need to ensure job submission to be "one by one", to avoid duplicated name
Label value may still have some name limitation, will it change yarn naming?

UID related:

Test K8S generated name
Query by UID or Name design

abuccts · 2019-12-16T06:31:06Z

Use label as Name:

Label Perf

setup: 6 nodes, 1000 frameworks
results (in test order):

query framework A:

label selector (first run)

Running 10s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks?labelSelector=jobname=job0123
10 connections

┌─────────┬────────┬────────┬────────┬────────┬────────┬───────┬───────────┐
│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg    │ Stdev │ Max       │
├─────────┼────────┼────────┼────────┼────────┼────────┼───────┼───────────┤
│ Latency │ 776 ms │ 776 ms │ 776 ms │ 776 ms │ 776 ms │ 0 ms  │ 776.32 ms │
└─────────┴────────┴────────┴────────┴────────┴────────┴───────┴───────────┘
┌───────────┬─────┬──────┬─────┬─────────┬───────┬───────┬─────────┐
│ Stat      │ 1%  │ 2.5% │ 50% │ 97.5%   │ Avg   │ Stdev │ Min     │
├───────────┼─────┼──────┼─────┼─────────┼───────┼───────┼─────────┤
│ Req/Sec   │ 0   │ 0    │ 0   │ 1       │ 0.1   │ 0.31  │ 1       │
├───────────┼─────┼──────┼─────┼─────────┼───────┼───────┼─────────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 0 B │ 3.04 kB │ 304 B │ 911 B │ 3.04 kB │
└───────────┴─────┴──────┴─────┴─────────┴───────┴───────┴─────────┘

Req/Bytes counts sampled once per second.

1 requests in 10.07s, 3.04 kB read
9 errors (9 timeouts)

label selector (second run)

Running 10s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks?labelSelector=jobname=job0123
10 connections

┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬───────────┐
│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev    │ Max       │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼───────────┤
│ Latency │ 471 ms │ 502 ms │ 868 ms │ 876 ms │ 528.51 ms │ 88.09 ms │ 879.21 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴───────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 10      │ 10      │ 20      │ 23      │ 18.3    │ 3.83    │ 10      │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 30.4 kB │ 30.4 kB │ 60.8 kB │ 69.9 kB │ 55.6 kB │ 11.6 kB │ 30.4 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

183 requests in 10.08s, 556 kB read

framework name

Running 10s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks/labelperf0123
10 connections

┌─────────┬────────┬────────┬─────────┬─────────┬───────────┬───────────┬────────────┐
│ Stat    │ 2.5%   │ 50%    │ 97.5%   │ 99%     │ Avg       │ Stdev     │ Max        │
├─────────┼────────┼────────┼─────────┼─────────┼───────────┼───────────┼────────────┤
│ Latency │ 274 ms │ 286 ms │ 2907 ms │ 2908 ms │ 431.68 ms │ 531.53 ms │ 2908.37 ms │
└─────────┴────────┴────────┴─────────┴─────────┴───────────┴───────────┴────────────┘
┌───────────┬─────┬──────┬─────────┬────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%  │ 2.5% │ 50%     │ 97.5%  │ Avg     │ Stdev   │ Min     │
├───────────┼─────┼──────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 0   │ 0    │ 30      │ 40     │ 23      │ 11.88   │ 10      │
├───────────┼─────┼──────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 84.3 kB │ 112 kB │ 64.6 kB │ 33.4 kB │ 28.1 kB │
└───────────┴─────┴──────┴─────────┴────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

230 requests in 10.08s, 646 kB read

query framework B:

label selector (first run)

Running 10s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks?labelSelector=jobname=job0876
10 connections

┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬───────────┐
│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev    │ Max       │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼───────────┤
│ Latency │ 463 ms │ 536 ms │ 871 ms │ 874 ms │ 552.87 ms │ 85.42 ms │ 880.82 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴───────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 10      │ 10      │ 19      │ 20      │ 17.5    │ 3.7     │ 10      │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 29.2 kB │ 29.2 kB │ 55.5 kB │ 58.4 kB │ 51.1 kB │ 10.8 kB │ 29.2 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

175 requests in 10.08s, 511 kB read

label selector (second run)

Running 10s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks?labelSelector=jobname=job0876
10 connections

┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬──────────┐
│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev    │ Max      │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼──────────┤
│ Latency │ 482 ms │ 541 ms │ 929 ms │ 933 ms │ 563.28 ms │ 96.69 ms │ 939.2 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 10      │ 10      │ 20      │ 20      │ 17      │ 4.59    │ 10      │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 29.2 kB │ 29.2 kB │ 58.4 kB │ 58.4 kB │ 49.6 kB │ 13.4 kB │ 29.2 kB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

170 requests in 10.08s, 496 kB read

framework name

Running 10s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks/labelperf0876
10 connections

┌─────────┬────────┬────────┬────────┬────────┬───────────┬───────────┬───────────┐
│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev     │ Max       │
├─────────┼────────┼────────┼────────┼────────┼───────────┼───────────┼───────────┤
│ Latency │ 274 ms │ 275 ms │ 905 ms │ 906 ms │ 317.54 ms │ 126.89 ms │ 908.04 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴───────────┴───────────┘
┌───────────┬─────┬──────┬─────────┬────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%  │ 2.5% │ 50%     │ 97.5%  │ Avg     │ Stdev   │ Min     │
├───────────┼─────┼──────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 0   │ 0    │ 30      │ 40     │ 24      │ 14.32   │ 10      │
├───────────┼─────┼──────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 80.7 kB │ 108 kB │ 64.6 kB │ 38.5 kB │ 26.9 kB │
└───────────┴─────┴──────┴─────────┴────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

240 requests in 10.12s, 646 kB read

query all frameworks:

framework list

Running 10s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks
10 connections

┌─────────┬─────────┬─────────┬─────────┬─────────┬────────────┬────────────┬────────────┐
│ Stat    │ 2.5%    │ 50%     │ 97.5%   │ 99%     │ Avg        │ Stdev      │ Max        │
├─────────┼─────────┼─────────┼─────────┼─────────┼────────────┼────────────┼────────────┤
│ Latency │ 1317 ms │ 1918 ms │ 4647 ms │ 4647 ms │ 2596.08 ms │ 1211.68 ms │ 4647.47 ms │
└─────────┴─────────┴─────────┴─────────┴─────────┴────────────┴────────────┴────────────┘
┌───────────┬─────┬──────┬─────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%  │ 2.5% │ 50% │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────┼──────┼─────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 0   │ 0    │ 0   │ 6       │ 2.7     │ 2.73    │ 5       │
├───────────┼─────┼──────┼─────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 0 B │ 27.8 MB │ 12.5 MB │ 12.6 MB │ 23.1 MB │
└───────────┴─────┴──────┴─────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

27 requests in 10.08s, 125 MB read

the apiserver has a cache, but using a label selector may be expensive the first time.

Some features are only available for Name, such as DNS. Are we sure we will not need them in the future?

there's no such need yet, but we could also use the uid as the pod's hostname in DNS if needed?

Need to ensure job submission to be "one by one", to avoid duplicated name

it's ok to add locks in the rest server, but that may slow down submission
another option is caching previously submitted job names to check against

one shortcoming is that multiple rest servers for one cluster cannot avoid duplicate job names without persistent storage

Label value may still have some name limitation, will it change yarn naming?

for the previous YARN version, the name uses the ^[a-zA-Z0-9_-]+$ format
for a k8s label, the value syntax is ^(([a-zA-Z0-9][-a-zA-Z0-9_.]*)?[a-zA-Z0-9])?$, and it can be no longer than 63 chars
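The difference between the two formats can be checked with a short sketch. The function names are hypothetical. In the label pattern the hyphen is placed first in the character class so that Python's `re` treats it as a literal rather than as a range.

```python
import re
from typing import Iterable, List

YARN_NAME = re.compile(r"^[a-zA-Z0-9_-]+$")
K8S_LABEL_VALUE = re.compile(r"^(([a-zA-Z0-9][-a-zA-Z0-9_.]*)?[a-zA-Z0-9])?$")
MAX_LABEL_VALUE_LENGTH = 63


def is_valid_label_value(value: str) -> bool:
    """Check the k8s label value syntax and the 63-character limit."""
    return (
        len(value) <= MAX_LABEL_VALUE_LENGTH
        and K8S_LABEL_VALUE.fullmatch(value) is not None
    )


def legacy_names_broken_by_labels(names: Iterable[str]) -> List[str]:
    """Return names that were valid under the YARN format but are not valid label values."""
    return [
        n for n in names
        if YARN_NAME.fullmatch(n) is not None and not is_valid_label_value(n)
    ]
```

Names such as `_tmp` or `run-` pass the YARN pattern but fail as label values, because a label value must start and end with an alphanumeric character.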

UID related:

Test K8S generated name

metadata.generateName works but it only generates 5-char random string, reference
the uid is also needed in priority class and secret before creating framework

better to generate uid in rest server

Query by UID or Name design

query by name: /api/v2/jobs/{name}
query by uid: /api/v2/jobs?uid={uid}

yqwang-ms · 2019-12-16T10:28:12Z

Thanks @abuccts, it seems 1000 frameworks is too few to differentiate iteration from lookup. Could you please also test 30k frameworks (assuming 1k frameworks per day)? And you can also check the code to double-confirm the label time complexity.

fanyangCS · 2019-12-16T11:37:44Z

@abuccts , please do some stress test like the performance with 300k active jobs.

yqwang-ms · 2019-12-17T01:52:38Z

Thanks @abuccts

another option is caching previously submitted job names to check against

Cache maintenance is not easy and will make restserver stateful; you need to make sure it stays consistent with ApiServer at all times. Let's not try this in the first stage.

one shortcoming is that multiple rest servers for one cluster cannot avoid duplicate job names without persistent storage

Even with persistent storage, these multiple rest servers need to sync with each other, or each handle only its own naming partition/space. So we may lose rest server scalability.
BTW, do we already support multiple rest servers now?

Label value may still have some name limitation, will it change yarn naming?

So, there is still a breaking naming change, e.g. - and _ cannot be the first or last character.

the uid is also needed in priority class and secret before creating framework

The uid needed in the priority class and secret is the K8S-generated framework UID; you cannot use a restserver-generated UID for it.

metadata.generateName works but it only generates 5-char random string,

So it seems that during a K8S POST request, even if K8S generates a conflicting random string, it will not try to generate another one, but just returns 404.
So, if restserver generates the UID, it had better also check at least whether the UID already exists among active jobs.

abuccts · 2019-12-18T03:36:05Z

it seems 1000 frameworks is too few to differentiate iteration from lookup. Could you please also test 30k frameworks (assuming 1k frameworks per day)?

for 30k frameworks, apiserver will time out (exceeds 1m, returns 504) for several hours after the frameworks are created; here are the results after 15 hours:

query framework A:

label selector

Running 60s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks?labelSelector=jobname=02345
10 connections

┌─────────┬─────────┬──────────┬──────────┬──────────┬─────────────┬────────────┬─────────────┐
│ Stat    │ 2.5%    │ 50%      │ 97.5%    │ 99%      │ Avg         │ Stdev      │ Max         │
├─────────┼─────────┼──────────┼──────────┼──────────┼─────────────┼────────────┼─────────────┤
│ Latency │ 4069 ms │ 17033 ms │ 17328 ms │ 17328 ms │ 14656.92 ms │ 4671.51 ms │ 17328.74 ms │
└─────────┴─────────┴──────────┴──────────┴──────────┴─────────────┴────────────┴─────────────┘
┌───────────┬─────┬──────┬─────┬───────┬────────┬───────┬───────┐
│ Stat      │ 1%  │ 2.5% │ 50% │ 97.5% │ Avg    │ Stdev │ Min   │
├───────────┼─────┼──────┼─────┼───────┼────────┼───────┼───────┤
│ Req/Sec   │ 0   │ 0    │ 0   │ 2     │ 0.2    │ 1.07  │ 1     │
├───────────┼─────┼──────┼─────┼───────┼────────┼───────┼───────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 0 B │ 678 B │ 67.8 B │ 360 B │ 339 B │
└───────────┴─────┴──────┴─────┴───────┴────────┴───────┴───────┘

Req/Bytes counts sampled once per second.

12 requests in 60.2s, 4.07 kB read

Running 60s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks?labelSelector=jobname=02345
10 connections

┌─────────┬─────────┬─────────┬─────────┬─────────┬────────────┬───────────┬────────────┐
│ Stat    │ 2.5%    │ 50%     │ 97.5%   │ 99%     │ Avg        │ Stdev     │ Max        │
├─────────┼─────────┼─────────┼─────────┼─────────┼────────────┼───────────┼────────────┤
│ Latency │ 4377 ms │ 5355 ms │ 8089 ms │ 8220 ms │ 5417.34 ms │ 904.99 ms │ 8227.77 ms │
└─────────┴─────────┴─────────┴─────────┴─────────┴────────────┴───────────┴────────────┘
┌───────────┬─────┬──────┬───────┬─────────┬───────┬───────┬───────┐
│ Stat      │ 1%  │ 2.5% │ 50%   │ 97.5%   │ Avg   │ Stdev │ Min   │
├───────────┼─────┼──────┼───────┼─────────┼───────┼───────┼───────┤
│ Req/Sec   │ 0   │ 0    │ 1     │ 7       │ 1.79  │ 2.01  │ 1     │
├───────────┼─────┼──────┼───────┼─────────┼───────┼───────┼───────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 339 B │ 2.37 kB │ 605 B │ 681 B │ 339 B │
└───────────┴─────┴──────┴───────┴─────────┴───────┴───────┴───────┘

Req/Bytes counts sampled once per second.

107 requests in 60.15s, 36.3 kB read

framework name

Running 60s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks/labelperf02345
10 connections

┌─────────┬────────┬────────┬────────┬────────┬───────────┬──────────┬────────────┐
│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%    │ Avg       │ Stdev    │ Max        │
├─────────┼────────┼────────┼────────┼────────┼───────────┼──────────┼────────────┤
│ Latency │ 276 ms │ 328 ms │ 582 ms │ 671 ms │ 336.34 ms │ 82.95 ms │ 1064.08 ms │
└─────────┴────────┴────────┴────────┴────────┴───────────┴──────────┴────────────┘
┌───────────┬───────┬───────┬───────┬────────┬─────────┬─────────┬───────┐
│ Stat      │ 1%    │ 2.5%  │ 50%   │ 97.5%  │ Avg     │ Stdev   │ Min   │
├───────────┼───────┼───────┼───────┼────────┼─────────┼─────────┼───────┤
│ Req/Sec   │ 10    │ 20    │ 30    │ 40     │ 29.62   │ 5.25    │ 10    │
├───────────┼───────┼───────┼───────┼────────┼─────────┼─────────┼───────┤
│ Bytes/Sec │ 27 kB │ 54 kB │ 81 kB │ 108 kB │ 79.9 kB │ 14.2 kB │ 27 kB │
└───────────┴───────┴───────┴───────┴────────┴─────────┴─────────┴───────┘

Req/Bytes counts sampled once per second.

2k requests in 60.14s, 4.79 MB read

query framework B:

label selector

Running 60s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks?labelSelector=jobname=23450
10 connections

┌─────────┬─────────┬─────────┬──────────┬──────────┬────────────┬────────────┬─────────────┐
│ Stat    │ 2.5%    │ 50%     │ 97.5%    │ 99%      │ Avg        │ Stdev      │ Max         │
├─────────┼─────────┼─────────┼──────────┼──────────┼────────────┼────────────┼─────────────┤
│ Latency │ 4237 ms │ 5196 ms │ 10348 ms │ 10373 ms │ 5858.05 ms │ 1773.43 ms │ 10373.58 ms │
└─────────┴─────────┴─────────┴──────────┴──────────┴────────────┴────────────┴─────────────┘
┌───────────┬─────┬──────┬───────┬─────────┬───────┬───────┬───────┐
│ Stat      │ 1%  │ 2.5% │ 50%   │ 97.5%   │ Avg   │ Stdev │ Min   │
├───────────┼─────┼──────┼───────┼─────────┼───────┼───────┼───────┤
│ Req/Sec   │ 0   │ 0    │ 1     │ 9       │ 1.64  │ 2.24  │ 1     │
├───────────┼─────┼──────┼───────┼─────────┼───────┼───────┼───────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 339 B │ 3.05 kB │ 554 B │ 758 B │ 339 B │
└───────────┴─────┴──────┴───────┴─────────┴───────┴───────┴───────┘

Req/Bytes counts sampled once per second.

98 requests in 60.17s, 33.2 kB read

Running 60s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks?labelSelector=jobname=23450
10 connections

┌─────────┬─────────┬─────────┬─────────┬─────────┬───────────┬───────────┬────────────┐
│ Stat    │ 2.5%    │ 50%     │ 97.5%   │ 99%     │ Avg       │ Stdev     │ Max        │
├─────────┼─────────┼─────────┼─────────┼─────────┼───────────┼───────────┼────────────┤
│ Latency │ 4384 ms │ 4847 ms │ 6951 ms │ 6953 ms │ 5108.6 ms │ 682.66 ms │ 6961.91 ms │
└─────────┴─────────┴─────────┴─────────┴─────────┴───────────┴───────────┴────────────┘
┌───────────┬─────┬──────┬───────┬─────────┬───────┬───────┬───────┐
│ Stat      │ 1%  │ 2.5% │ 50%   │ 97.5%   │ Avg   │ Stdev │ Min   │
├───────────┼─────┼──────┼───────┼─────────┼───────┼───────┼───────┤
│ Req/Sec   │ 0   │ 0    │ 1     │ 7       │ 1.9   │ 2.02  │ 1     │
├───────────┼─────┼──────┼───────┼─────────┼───────┼───────┼───────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 339 B │ 2.37 kB │ 644 B │ 683 B │ 339 B │
└───────────┴─────┴──────┴───────┴─────────┴───────┴───────┴───────┘

Req/Bytes counts sampled once per second.

114 requests in 60.16s, 38.6 kB read

framework name

Running 60s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks/labelperf23450
10 connections

┌─────────┬────────┬────────┬────────┬─────────┬───────────┬───────────┬────────────┐
│ Stat    │ 2.5%   │ 50%    │ 97.5%  │ 99%     │ Avg       │ Stdev     │ Max        │
├─────────┼────────┼────────┼────────┼─────────┼───────────┼───────────┼────────────┤
│ Latency │ 276 ms │ 334 ms │ 912 ms │ 2259 ms │ 377.05 ms │ 355.23 ms │ 4451.22 ms │
└─────────┴────────┴────────┴────────┴─────────┴───────────┴───────────┴────────────┘
┌───────────┬─────┬──────┬───────┬────────┬─────────┬─────────┬───────┐
│ Stat      │ 1%  │ 2.5% │ 50%   │ 97.5%  │ Avg     │ Stdev   │ Min   │
├───────────┼─────┼──────┼───────┼────────┼─────────┼─────────┼───────┤
│ Req/Sec   │ 0   │ 0    │ 30    │ 39     │ 26.37   │ 9.98    │ 10    │
├───────────┼─────┼──────┼───────┼────────┼─────────┼─────────┼───────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 81 kB │ 105 kB │ 71.1 kB │ 26.9 kB │ 27 kB │
└───────────┴─────┴──────┴───────┴────────┴─────────┴─────────┴───────┘

Req/Bytes counts sampled once per second.

2k requests in 60.13s, 4.27 MB read

query all frameworks

framework list

Running 60s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks
10 connections

┌─────────┬──────────┬──────────┬──────────┬──────────┬────────────┬─────────┬─────────────┐
│ Stat    │ 2.5%     │ 50%      │ 97.5%    │ 99%      │ Avg        │ Stdev   │ Max         │
├─────────┼──────────┼──────────┼──────────┼──────────┼────────────┼─────────┼─────────────┤
│ Latency │ 58745 ms │ 58748 ms │ 58753 ms │ 58753 ms │ 58748.6 ms │ 2.73 ms │ 58753.33 ms │
└─────────┴──────────┴──────────┴──────────┴──────────┴────────────┴─────────┴─────────────┘
┌───────────┬─────┬──────┬─────┬───────┬─────────┬─────────┬────────┐
│ Stat      │ 1%  │ 2.5% │ 50% │ 97.5% │ Avg     │ Stdev   │ Min    │
├───────────┼─────┼──────┼─────┼───────┼─────────┼─────────┼────────┤
│ Req/Sec   │ 0   │ 0    │ 0   │ 0     │ 0.09    │ 0.65    │ 5      │
├───────────┼─────┼──────┼─────┼───────┼─────────┼─────────┼────────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 0 B │ 0 B   │ 6.62 MB │ 50.8 MB │ 397 MB │
└───────────┴─────┴──────┴─────┴───────┴─────────┴─────────┴────────┘

Req/Bytes counts sampled once per second.

5 requests in 60.25s, 397 MB read
5 errors (5 timeouts)

Running 60s test @ /apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks
10 connections

┌─────────┬──────┬──────┬───────┬──────┬──────┬───────┬──────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg  │ Stdev │ Max  │
├─────────┼──────┼──────┼───────┼──────┼──────┼───────┼──────┤
│ Latency │ 0 ms │ 0 ms │ 0 ms  │ 0 ms │ 0 ms │ 0 ms  │ 0 ms │
└─────────┴──────┴──────┴───────┴──────┴──────┴───────┴──────┘
┌───────────┬─────┬──────┬─────┬───────┬─────┬───────┬─────┐
│ Stat      │ 1%  │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────┼──────┼─────┼───────┼─────┼───────┼─────┤
│ Req/Sec   │ 0   │ 0    │ 0   │ 0     │ 0   │ 0     │ 0   │
├───────────┼─────┼──────┼─────┼───────┼─────┼───────┼─────┤
│ Bytes/Sec │ 0 B │ 0 B  │ 0 B │ 0 B   │ 0 B │ 0 B   │ 0 B │
└───────────┴─────┴──────┴─────┴───────┴─────┴───────┴─────┘

Req/Bytes counts sampled once per second.

0 requests in 60.39s, 0 B read
10 errors (10 timeouts)

please do some stress test like the performance with 300k active jobs.

it will take a lot of time to create 300k frameworks, and etcd times out frequently during creation:

etcdserver: request timed out

Internal error occurred: resource quota evaluates timeout

yqwang-ms · 2019-12-19T02:17:44Z

So, Proposal-4 seems to have the cons below:
1. Label cannot achieve idempotently in some corner cases, such as:
a. User submits job-x; RestServer gets job-x, finds nothing, so it directly posts the framework to ApiServer
b. RestServer crashes or receives a failure response before ApiServer receives the request (the request is in flight)
c. User submits job-x again (a retry); RestServer gets job-x, still finds nothing, so it directly posts the framework to ApiServer
d. ApiServer processes the request from step a.
e. ApiServer processes the request from step c.
f. By now, although the user only submitted one job, job-x, the backend has 2 frameworks.

2. Label is 20x slower than Name in large scale
30k Jobs
Label selector: 107 requests in 60.15s, 36.3 kB read
Framework name: 2k requests in 60.14s, 4.79 MB read

3. Label still have 63 length limitation
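The slowdown in item 2 can be cross-checked against the 30k-framework tables earlier in this thread (query framework A, label selector second run versus framework name). This is only arithmetic on the reported figures, not a new measurement.

```python
# Figures copied from the 30k-framework benchmark tables in this thread.
label_avg_rps = 1.79      # label selector, second run, average Req/Sec
name_avg_rps = 29.62      # framework name, average Req/Sec
label_median_ms = 5355    # label selector, second run, 50% latency
name_median_ms = 328      # framework name, 50% latency

throughput_ratio = name_avg_rps / label_avg_rps
latency_ratio = label_median_ms / name_median_ms

print(f"throughput: lookup by name is {throughput_ratio:.1f}x higher")
print(f"median latency: lookup by label is {latency_ratio:.1f}x slower")
```

Both ratios come out at roughly 16x to 17x, which is the same order of magnitude as the 20x quoted above. The "2k requests" total in the output is rounded, so dividing it by 107 slightly overstates the gap.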

For the API, we need to take this seriously and guarantee idempotence, instead of achieving it on a best-effort basis.
So, I still prefer to decouple idempotent submission and arbitrary metadata into different features, i.e. Proposal-1.
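The corner case in item 1 can be reproduced with a toy model. `FakeApiServer` is a hypothetical stand-in that only captures two properties: creating an existing name is rejected, and a label lookup is a separate read that is not atomic with the create that follows it.

```python
from typing import Dict, List


class Conflict(Exception):
    """Raised when a framework with the same name already exists."""


class FakeApiServer:
    def __init__(self) -> None:
        self.frameworks: Dict[str, Dict[str, str]] = {}  # framework name -> labels

    def find_by_label(self, key: str, value: str) -> List[str]:
        return [n for n, labels in self.frameworks.items() if labels.get(key) == value]

    def create(self, name: str, labels: Dict[str, str]) -> None:
        if name in self.frameworks:
            raise Conflict(name)
        self.frameworks[name] = dict(labels)


# Label as job name: check-then-create is not atomic.
api = FakeApiServer()
first_lookup = api.find_by_label("jobname", "job-x")   # step a: not found
retry_lookup = api.find_by_label("jobname", "job-x")   # step c: still not found
if not first_lookup:
    api.create("generated-1", {"jobname": "job-x"})    # step d
if not retry_lookup:
    api.create("generated-2", {"jobname": "job-x"})    # step e
duplicates = api.find_by_label("jobname", "job-x")     # step f: two frameworks

# Framework name as job name: the second create is rejected by the server.
named = FakeApiServer()
named.create("job-x", {})
try:
    named.create("job-x", {})
    second_rejected = False
except Conflict:
    second_rejected = True
```

With labels, uniqueness depends on the client-side check winning every race; with the framework name, the server enforces uniqueness regardless of how requests interleave.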

@fanyangCS @abuccts
Pls check

yqwang-ms · 2019-12-20T09:51:22Z

Discussed offline; here is the agreement:

We still go with Proposal-4, but with some adjustments:

1. Label cannot achieve idempotently in some corner cases, such as:

We will try to use MD5(UserName + JobName) as the FrameworkName to achieve near idempotence.

2. Label is 20x slower than Name in large scale

The FrameworkName can be computed via MD5, so there is no need to search by label anymore.

3. Label still have 63 length limitation

We will try to split (UserName + JobName) across multiple labels to get around this limitation.

The UID is FrameworkUID which is generated by K8S.

TBD:
Can we use an annotation instead of a label to store the (UserName + JobName)?
Given the mutual conversion:
UserName + JobName -> MD5 -> FrameworkName (we can already calculate the FrameworkName to look up, so no label is needed to look up the FrameworkObject)
FrameworkName -> FrameworkObject -> UserName + JobName
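
The two conversions can be sketched as follows. This is an illustration under assumptions, not the RestServer code: the `~` separator, the `userName`/`jobName` metadata keys, and the in-memory `frameworks` dict (standing in for ApiServer) are all hypothetical. A separator is used because plain concatenation would map ("ab", "c") and ("a", "bc") to the same hash; it only helps if the character cannot appear in user names.

```python
import hashlib


def framework_name(user_name, job_name, sep="~"):
    """UserName + JobName -> MD5 -> FrameworkName (32 lowercase hex characters)."""
    key = f"{user_name}{sep}{job_name}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()


# Hypothetical stand-in for ApiServer: FrameworkName -> FrameworkObject.
frameworks = {}


def submit(user_name, job_name):
    """Create the framework under its derived name; a repeat returns the same object."""
    name = framework_name(user_name, job_name)
    if name not in frameworks:
        frameworks[name] = {
            "name": name,
            # Stored on the object so the original names can be read back.
            "metadata": {"userName": user_name, "jobName": job_name},
        }
    return frameworks[name]


def lookup(user_name, job_name):
    """Direct lookup by computed name; no label search is needed."""
    return frameworks.get(framework_name(user_name, job_name))


first = submit("alice", "a very long job name with spaces")
second = submit("alice", "a very long job name with spaces")
print(first is second)                                   # True
print(lookup("alice", "a very long job name with spaces")["metadata"]["jobName"])
```

The hex digest uses only `0-9` and `a-f` and is 32 characters long, so it fits within the k8s naming rules regardless of what the original job name contains.
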

Planning:
P0: Relax the Job Name limitation. (No need to introduce UID in this step.)
P1: Support showing history jobs (of the same name) in the job list and job detail page: #3845 (UID needs to be introduced in this step).

abuccts · 2019-12-23T08:15:54Z

TBD:
Can we use an annotation instead of a label to store the (UserName + JobName)?
Given the mutual conversion:
UserName + JobName -> MD5 -> FrameworkName (we can already calculate the FrameworkName to look up, so no label is needed to look up the FrameworkObject)
FrameworkName -> FrameworkObject -> UserName + JobName

We discussed this offline and decided to use labels:

The job name is used in the API path, so a 63-character length limit is acceptable.
Labels can be used as filters while annotations cannot, and we can use labels to stay compatible with old jobs.
A database may be used in the future for partial search and large-scale queries.

fanyangCS · 2019-12-23T08:18:28Z

Regarding the length limit of the job name, I think we can keep the limit at 63 for now. If there is a further requirement, can we extend it beyond 63 by using multiple labels?

Update job name encoding method, use md5 hash instead. Query job by k8s label selector. Resolve job name related issues in #3935.

abuccts · 2019-12-25T11:58:02Z

Regarding the length limit of the job name, I think we can keep the limit at 63 for now. If there is a further requirement, can we extend it beyond 63 by using multiple labels?

Yes, it's possible to split a long job name across multiple labels, or we could switch to annotations if we don't need to use labels to query legacy jobs (base32 encoded) in the future.
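
A minimal sketch of the multiple-label idea, under stated assumptions: the `jobNamePart` key prefix and both helper names are hypothetical, and the base32 step is my own choice to keep every chunk within the label value character set (alphanumerics at both ends), not something decided in this thread.

```python
import base64

LABEL_VALUE_MAX = 63  # k8s limit on the length of a label value


def split_into_labels(job_name, prefix="jobNamePart"):
    """Encode a job name of any length into numbered labels of <= 63 chars each."""
    encoded = base64.b32encode(job_name.encode("utf-8")).decode("ascii").rstrip("=")
    chunks = [
        encoded[i:i + LABEL_VALUE_MAX]
        for i in range(0, len(encoded), LABEL_VALUE_MAX)
    ]
    return {f"{prefix}{i}": chunk for i, chunk in enumerate(chunks)}


def join_from_labels(labels, prefix="jobNamePart"):
    """Rebuild the original job name from the numbered labels."""
    parts = sorted(
        (int(key[len(prefix):]), value)
        for key, value in labels.items()
        if key.startswith(prefix)
    )
    encoded = "".join(value for _, value in parts)
    encoded += "=" * (-len(encoded) % 8)  # restore the stripped base32 padding
    return base64.b32decode(encoded).decode("utf-8")


long_name = "experiment: resnet50 / lr=0.1 / batch=256 / " * 5
labels = split_into_labels(long_name)
print(len(labels), max(len(v) for v in labels.values()))
print(join_from_labels(labels) == long_name)  # True
```

One trade-off of this layout is that an exact-match query has to supply every part as a selector term, and partial search on the original name is no longer possible on the encoded chunks.
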

* Update job name encoding method: use md5 hash instead. Query job by k8s label selector. Resolve job name related issues in #3935.
* Drop legacy jobs compatibility.

fanyangCS · 2020-01-03T05:56:03Z

We will remove the length constraint: lift the length limitation of job name #4101.
We do not support advanced job filters in the backend. Currently, job search is done in the frontend; in the future, we will use a database (with a trigger installed on the k8s API server) to keep the job info up to date and support advanced query/tag.
The uuid of a job is generated by k8s (we reuse the uuid of a k8s object created by the framework controller).

fanyangCS · 2020-02-27T06:50:21Z

done.

yqwang-ms changed the title ~~Consider to use Job UID instead of Job Name as Job Key~~ Job Name, UID and Description Nov 29, 2019

yqwang-ms self-assigned this Nov 29, 2019

yqwang-ms assigned sunqinzheng Dec 2, 2019

scarlett2018 mentioned this issue Dec 12, 2019

Pure K8S Beta Release Plan - v0.17 #3872

Closed

54 tasks

yqwang-ms assigned abuccts Dec 12, 2019

scarlett2018 added the high priority label Dec 17, 2019

scarlett2018 added this to the Pure K8S Beta Release milestone Dec 17, 2019

scarlett2018 changed the title ~~Job Name, UID and Description~~ P0 - Job Name, UID and Description Dec 17, 2019

abuccts added a commit that referenced this issue Dec 25, 2019

Update job name encoding method

f4bcef9

Update job name encoding method, use md5 hash instead. Query job by k8s label selector. Resolve job name related issues in #3935.

abuccts mentioned this issue Dec 25, 2019

[Rest Server] Update job name encoding method #4069

Merged

scarlett2018 mentioned this issue Feb 10, 2020

Feb end game plan #4177

Closed

fanyangCS closed this as completed Feb 27, 2020

hzy46 mentioned this issue Apr 27, 2020

add release note #4452

Merged

hzy46 mentioned this issue Jun 8, 2020

Use Database as History Server #4610

Closed

yqwang-ms mentioned this issue Jun 28, 2020

New RestServer Architecture: RestServer -> DB -> ApiServer #4651

Open

32 tasks
