This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Use Postgresql as an alternative storage for job history #4164

Merged
merged 31 commits into from
Mar 19, 2020

Conversation

hzy46
Contributor

@hzy46 hzy46 commented Jan 23, 2020

#2850

How to use

Set the following service configuration to enable internal storage and PostgreSQL, and to make rest-server read job history from the database.

cluster:
  common:
    job-history: "true"
    job-attempt-storage-engine: "db"

internal-storage:
  root-path: /paiInternal
  enable: true

postgresql:
  enable: true
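
The three settings above depend on each other, so a quick consistency check can help before deploying. The following is only a sketch: it assumes the service configuration has already been loaded into a plain dict, and the function name `check_job_history_config` is hypothetical (it is not part of paictl or this PR).

```python
def check_job_history_config(cfg):
    """Return a list of problems found in the job-history related settings."""
    problems = []
    common = cfg.get("cluster", {}).get("common", {})

    # The PR description sets job-history to the string "true".
    if common.get("job-history") != "true":
        problems.append('cluster.common.job-history is not "true"')

    # Assumption: the "db" engine needs both internal storage and PostgreSQL.
    if common.get("job-attempt-storage-engine") == "db":
        if not cfg.get("internal-storage", {}).get("enable"):
            problems.append("internal-storage.enable must be true for the db engine")
        if not cfg.get("postgresql", {}).get("enable"):
            problems.append("postgresql.enable must be true for the db engine")
    return problems


example = {
    "cluster": {"common": {"job-history": "true",
                           "job-attempt-storage-engine": "db"}},
    "internal-storage": {"root-path": "/paiInternal", "enable": True},
    "postgresql": {"enable": True},
}
print(check_job_history_config(example))
```

With the configuration from the description the list of problems is empty; switching `postgresql.enable` to false would produce one entry.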

What will happen if the database crashes

In that case, the health check of the job attempt API will not pass, and the user won't be able to click it.

image

On the backend side, fluentd will keep trying to send its buffered chunks to PostgreSQL indefinitely. Once the PostgreSQL database is restarted, the job attempt history will be recovered.
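
The recovery behavior can be illustrated with a small simulation. This is not fluentd's implementation and does not reflect the buffer settings used in this PR; it only sketches the general idea of retrying a buffered chunk until the database accepts it. The names `flush_with_retry` and `FlakyDatabase`, and the exponential backoff, are assumptions made for the example.

```python
import time


def flush_with_retry(chunks, send, wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Send each chunk in order, retrying a failed chunk until it succeeds."""
    for chunk in chunks:
        delay = wait
        while True:
            try:
                send(chunk)
                break  # chunk delivered, move on to the next one
            except ConnectionError:
                sleep(delay)
                delay = min(delay * 2, max_wait)  # back off, capped at max_wait


class FlakyDatabase:
    """Stand-in for a database that is down for the first few inserts."""

    def __init__(self, failures):
        self.failures = failures
        self.rows = []

    def insert(self, row):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("database is down")
        self.rows.append(row)


db = FlakyDatabase(failures=3)
# Skip the real sleeping so the example finishes immediately.
flush_with_retry(["attempt-0", "attempt-1"], db.insert, sleep=lambda s: None)
print(db.rows)
```

Because nothing is dropped while the database is unavailable, every chunk ends up stored once the database comes back, which matches the recovery described above.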

Test

  • Check if the job history works normally
    • Submit a special job that fails after 60 seconds (command: sleep 60s; echo failed at $(date); exit 233), and set a retry count for it.
    • Expect: all the job retry history will be recorded normally.
  • Check the behavior when the database is down
    • Stop the database service with ./paictl.py service stop -m postgresql
    • Submit a job with retries. The user will now not be able to see the retry history.
    • Start the database service by ./paictl.py service start -m postgresql
    • Expect: all the job retry history will be recorded normally after a few minutes
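
The failing command used in the first test can be tried locally before submitting a job. The sketch below assumes a POSIX `sh` is available; the helper name `failing_job_command` is hypothetical, and the sleep is shortened to 0 seconds so the check finishes immediately.

```python
import subprocess


def failing_job_command(seconds=60, exit_code=233):
    """Build the shell command from the test plan: sleep, log, then fail."""
    return f"sleep {seconds}s; echo failed at $(date); exit {exit_code}"


# Run with no delay; the job in the test plan uses the default of 60 seconds.
result = subprocess.run(
    ["sh", "-c", failing_job_command(seconds=0)],
    capture_output=True,
    text=True,
)
print(result.returncode)
print(result.stdout.strip())
```

A non-zero exit code such as 233 is what makes each attempt count as failed, so that retries are triggered and recorded.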

@hzy46 hzy46 force-pushed the zhiyuhe/postgresql_job_history branch from 30ef3a8 to 6bf6184 Compare March 11, 2020 10:29
@hzy46 hzy46 force-pushed the zhiyuhe/postgresql_job_history branch from 8323ae1 to 7451a1c Compare March 11, 2020 15:37
<format>
@type json
</format>
</filter>
Member

@yqwang-ms yqwang-ms Mar 12, 2020

Why remove it? #Closed

Contributor Author

@hzy46 hzy46 Mar 12, 2020

Actually, this log makes my debugging harder: fc-objectsnapshots is huge, which makes other useful logs difficult to see. If it is necessary, I can revert this change. #Closed

Member

If so, let's just comment it out.

Contributor Author

@hzy46 hzy46 Mar 12, 2020

OK, done. #Closed

@@ -143,4 +135,26 @@ data:
overflow_action block
</buffer>
</store>
{% if (cluster_cfg['internal-storage']['enable']) and (cluster_cfg['postgresql']['enable']) %}
<store>
Member

@yqwang-ms yqwang-ms Mar 12, 2020

Also add an IF for elasticsearch? #Closed

Contributor Author

@hzy46 hzy46 Mar 12, 2020

Done #Closed

&& sudo gem install rake bundler pg

# Build fluent-plugin-pgjson from scratch
# Original fluent-plugin-pgjson is from https://github.com/fluent-plugins-nursery/fluent-plugin-pgjson
Member

Add a comment explaining why we need to build our own.

Contributor Author

Done.

@hzy46 hzy46 requested a review from qfyin March 12, 2020 06:21
@@ -143,4 +144,27 @@ data:
overflow_action block
</buffer>
</store>
{% endif %}
{% if (cluster_cfg['internal-storage']['enable']) and (cluster_cfg['postgresql']['enable']) %}
Member

seems you can use if else...

Member

Better use something like:
job-attempt-storage-engine: "elasticsearch" # set it to "db" to use database backend



Contributor Author

@hzy46 hzy46 Mar 12, 2020

use cluster_cfg['cluster']['common']['job-attempt-storage-engine'] instead
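
The single-switch approach discussed in this thread can be sketched as follows. This is an illustration only, not the template logic merged in the PR: the function name `job_attempt_store` is hypothetical, and treating "elasticsearch" as the default when the key is missing is an assumption based on the reviewer's suggested config line.

```python
def job_attempt_store(cluster_cfg):
    """Pick the fluentd store backend from job-attempt-storage-engine."""
    common = cluster_cfg["cluster"]["common"]
    # Assumption: fall back to elasticsearch when the key is not set.
    engine = common.get("job-attempt-storage-engine", "elasticsearch")
    if engine == "db":
        return "postgresql"
    elif engine == "elasticsearch":
        return "elasticsearch"
    raise ValueError(f"unknown job-attempt-storage-engine: {engine!r}")


cfg = {"cluster": {"common": {"job-history": "true",
                              "job-attempt-storage-engine": "db"}}}
print(job_attempt_store(cfg))
```

Keying the choice on one value gives a plain if/else, so exactly one store is configured instead of two independently enabled blocks.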

Member

@yqwang-ms yqwang-ms left a comment

:shipit:

Contributor

@debuggy debuggy left a comment

approve with minor comment

src/rest-server/src/utils/postgresUtil.js (outdated, resolved)
@hzy46 hzy46 merged commit 2cb5a3c into master Mar 19, 2020
@hzy46 hzy46 mentioned this pull request Apr 27, 2020
@hzy46 hzy46 deleted the zhiyuhe/postgresql_job_history branch September 3, 2020 07:37
3 participants