[Bug] doris fe not ready after reboot fe/be #290

Open
2 of 3 tasks
ming12713 opened this issue Nov 12, 2024 · 5 comments
Labels
good first issue (Good for newcomers), question (Further information is requested)

Comments

@ming12713

Search before asking

  • I searched the issues and found no similar issues.

Version

2.1.1

What's Wrong?

2024-11-12 08:05:14,341 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 3917247, label: vtc_source_nome__KC_ods_vtc_source_nome__KC_2__KC_loshu_ods_vtc_source_nome__KC_0__KC_495084__KC_1730487306367, db id: 11154, table id list: 508351, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1730487306375, commit time: 1730487307994, finish time: -1, reason: 
2024-11-12 08:05:14,341 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 3917245, label: vtc_source_nome__KC_ods_vtc_source_nome__KC_1__KC_loshu_ods_vtc_source_nome__KC_0__KC_494027__KC_1730487305225, db id: 11154, table id list: 508351, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1730487305236, commit time: 1730487308002, finish time: -1, reason: 
2024-11-12 08:05:14,341 INFO (stateListener|83) [OlapTable.updateVisibleVersionAndTime():2591] updateVisibleVersionAndTime, tableName: ods_vtc_source_nome, visibleVersion, 344672, visibleVersionTime: 1730487308007
2024-11-12 08:05:14,341 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a VISIBLE transaction TransactionState. transaction id: 3917247, label: vtc_source_nome__KC_ods_vtc_source_nome__KC_2__KC_loshu_ods_vtc_source_nome__KC_0__KC_495084__KC_1730487306367, db id: 11154, table id list: 508351, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: VISIBLE, error replicas num: 0, replica ids: , prepare time: 1730487306375, commit time: 1730487307994, finish time: 1730487308007, reason: 
2024-11-12 08:05:14,342 INFO (stateListener|83) [OlapTable.updateVisibleVersionAndTime():2591] updateVisibleVersionAndTime, tableName: ods_vtc_source_nome, visibleVersion, 344673, visibleVersionTime: 1730487308018
2024-11-12 08:05:14,342 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a VISIBLE transaction TransactionState. transaction id: 3917245, label: vtc_source_nome__KC_ods_vtc_source_nome__KC_1__KC_loshu_ods_vtc_source_nome__KC_0__KC_494027__KC_1730487305225, db id: 11154, table id list: 508351, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: VISIBLE, error replicas num: 0, replica ids: , prepare time: 1730487305236, commit time: 1730487308002, finish time: 1730487308018, reason: 
2024-11-12 08:05:14,342 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 3917244, label: nome_raw_data__KC_ods_vtc_nome_raw_data__KC_2__KC_loshu_ods_vtc_nome_raw_data__KC_0__KC_1611187__KC_1730487305168, db id: 11154, table id list: 74966, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1730487305177, commit time: 1730487308558, finish time: -1, reason: 
2024-11-12 08:05:14,342 INFO (stateListener|83) [DatabaseTransactionMgr.replayUpsertTransactionState():2158] replay a COMMITTED transaction TransactionState. transaction id: 3917246, label: nome_raw_data__KC_ods_vtc_nome_raw_data__KC_1__KC_loshu_ods_vtc_nome_raw_data__KC_0__KC_1612525__KC_1730487305321, db id: 11154, table id list: 74966, callback id: -1, coordinator: BE: 10.42.1.19, transaction status: COMMITTED, error replicas num: 0, replica ids: , prepare time: 1730487305400, commit time: 1730487308568, finish time: -1, reason: 
/opt/apache-doris/fe/bin/start_fe.sh: line 265:   162 Killed                  ${LIMIT:+${LIMIT}} "${JAVA}" ${final_java_opt:+${final_java_opt}} -XX:-OmitStackTraceInFastThrow -XX:OnOutOfMemoryError="kill -9 %p" ${coverage_opt:+${coverage_opt}} org.apache.doris.DorisFE ${HELPER:+${HELPER}} ${OPT_VERSION:+${OPT_VERSION}} "${METADATA_FAILURE_RECOVERY}" "$@" < /dev/null

Doris was installed via the Operator with 1 FE node and 1 BE node. After restarting both the FE and the BE, the FE fails to start normally and reports the error shown above. The BE IP 10.42.1.19 in the log is the previous BE pod IP, not the Service (SVC) IP. The FE is configured to use the Service for service discovery, but the BE pod IP is now 10.42.1.6.
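To confirm which coordinator address the replayed transactions refer to, the FE log can be scanned for the `coordinator: BE: <ip>` field. The following is a minimal sketch, not part of Doris: the function name `coordinator_ips` is made up for illustration, and the two sample lines are abbreviated copies of the log lines above.

```python
import re

# Field formats taken from the replay log lines in this report.
COORDINATOR_RE = re.compile(r"coordinator: BE: (\d{1,3}(?:\.\d{1,3}){3})")
TXN_ID_RE = re.compile(r"transaction id: (\d+)")


def coordinator_ips(log_lines):
    """Map each coordinator BE IP to the sorted transaction ids that mention it."""
    found = {}
    for line in log_lines:
        ip_match = COORDINATOR_RE.search(line)
        txn_match = TXN_ID_RE.search(line)
        if ip_match and txn_match:
            found.setdefault(ip_match.group(1), set()).add(int(txn_match.group(1)))
    return {ip: sorted(ids) for ip, ids in found.items()}


# Abbreviated copies of two lines from the log above ("..." marks omitted fields).
sample = [
    "... replay a COMMITTED transaction TransactionState. transaction id: 3917247, "
    "label: ..., coordinator: BE: 10.42.1.19, transaction status: COMMITTED, ...",
    "... replay a COMMITTED transaction TransactionState. transaction id: 3917245, "
    "label: ..., coordinator: BE: 10.42.1.19, transaction status: COMMITTED, ...",
]

print(coordinator_ips(sample))  # {'10.42.1.19': [3917245, 3917247]}
```

Running this over the full fe.log would show whether every replayed transaction still points at the old pod IP.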

Pod network CIDR: 10.42.1.x/16
Service network CIDR: 10.43.48.x
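A quick way to check that an address belongs to the pod network rather than the Service network is the standard `ipaddress` module. This is only a sketch: the pod range follows the /16 stated above, while the Service prefix length is not given in the report, so the /16 used for it here is an assumption, and `10.43.48.10` is a made-up address.

```python
import ipaddress

# Pod range from the report (10.42.1.x/16 normalizes to 10.42.0.0/16).
POD_NETWORK = ipaddress.ip_network("10.42.0.0/16")
# Assumption: the report only shows 10.43.48.x, so /16 is a guess.
SERVICE_NETWORK = ipaddress.ip_network("10.43.0.0/16")


def classify(ip):
    """Return 'pod', 'service', or 'other' for a dotted-quad address."""
    addr = ipaddress.ip_address(ip)
    if addr in POD_NETWORK:
        return "pod"
    if addr in SERVICE_NETWORK:
        return "service"
    return "other"


print(classify("10.42.1.19"))   # old BE address from the log
print(classify("10.42.1.6"))    # current BE address
print(classify("10.43.48.10"))  # hypothetical Service address
```

Both BE addresses fall in the pod range, which matches the observation that the log records a pod IP and not the Service IP.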

What You Expected?

The FE should start normally after both the FE and the BE are restarted.

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@ming12713
Author

In my case, Kafka writes to Doris through a connector in sink mode. When Doris is restarted, the connector keeps writing data. The logs show the coordinator BE IP. Is it possible that the connector writes data with the Stream Load method? The transaction is synchronized to the FE metadata through bdb, but it has not yet been synchronized to the BE. If the BE is restarted at that moment, the FE may negotiate a coordinator BE IP that it can no longer connect to, which causes cluster issues. Is my understanding correct?

@intelligentfu8
Contributor

What is the DorisCluster spec? Please share the YAML. In k8s, if the pod IP is not static across restarts, please set enable_fqdn_mode = true so that the nodes communicate by FQDN.
The connector sink mode uses the Stream Load method to insert data.
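Before restarting, it may help to verify that the fe.conf mounted from the ConfigMap actually carries the suggested setting. The sketch below assumes fe.conf uses plain `key = value` lines with `#` comments, and treats a missing key as disabled; the helper names and the sample text are illustrative, not taken from the reporter's ConfigMap.

```python
def parse_conf(text):
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def fqdn_mode_enabled(text):
    """True only if enable_fqdn_mode is explicitly set to true."""
    # Assumption: an absent key means the feature is off.
    return parse_conf(text).get("enable_fqdn_mode", "false").lower() == "true"


# Illustrative fe.conf content, not the reporter's actual file.
sample_conf = """
# FE settings
http_port = 8030
enable_fqdn_mode = true
"""

print(fqdn_mode_enabled(sample_conf))         # True
print(fqdn_mode_enabled("http_port = 8030"))  # False
```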

@ming12713
Author

@intelligentfu8
I looked at the Stream Load mechanism. The FE selects a BE (Backend) as the coordinator node in a round-robin manner; the coordinator is responsible for scheduling the import job. The FE then returns an HTTP redirect to the client. The redirect uses the BE pod IP instead of the Service address, so the problem might be related to this.
https://doris.apache.org/docs/data-operate/import/import-way/stream-load-manual/
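To see what the redirect points at, one can capture the `Location` header returned by the FE and check whether its host is a raw IP or a hostname. This is a rough sketch under assumptions: the two URLs are made-up examples (database, table, and FQDN are hypothetical), and the helper names are not part of any Doris client.

```python
import ipaddress
from urllib.parse import urlsplit


def redirect_target(location):
    """Return (host, port) from a redirect Location header value."""
    parts = urlsplit(location)
    return parts.hostname, parts.port


def is_ip_literal(host):
    """True if host is a bare IPv4/IPv6 address rather than a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


# Hypothetical Location values; db, table, and FQDN are invented for illustration.
by_ip = "http://10.42.1.19:8040/api/demo_db/demo_tbl/_stream_load"
by_name = "http://be-0.example.svc.cluster.local:8040/api/demo_db/demo_tbl/_stream_load"

for location in (by_ip, by_name):
    host, port = redirect_target(location)
    print(host, port, "pod IP style" if is_ip_literal(host) else "FQDN style")
```

A redirect of the first kind stops working as soon as the BE pod gets a new IP, which is consistent with the behavior described above.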

doriscluster.yaml

apiVersion: v1
items:
- apiVersion: doris.selectdb.com/v1
  kind: DorisCluster
  metadata:
    labels:
      app.kubernetes.io/instance: doriscluster
      app.kubernetes.io/name: doriscluster
      app.kubernetes.io/part-of: doris-operator
    name: doriscluster
    namespace: doris
    resourceVersion: "18187746"
    uid: 9b4d358b-ac8c-491c-8701-6a7ce61f4bdb
  spec:
    beSpec:
      annotations:
        selectdb/dorisclsuter.component: be
      envVars:
      - name: HOME
        value: /opt/selectdb
      image: selectdb/doris.be-ubuntu:2.1.1
      limits:
        cpu: 24
        memory: 64Gi
      nodeSelector:
        kubernetes.io/hostname: loshu-kube-ds01
      persistentVolumes:
      - mountPath: /opt/apache-doris/be/storage
        name: doris-be
      replicas: 1
      requests:
        cpu: 2
        memory: 8Gi
      service:
        servicePorts:
        - nodePort: 32422
          targetPort: 9060
        - nodePort: 30652
          targetPort: 8040
        - nodePort: 30891
          targetPort: 9050
        - nodePort: 31420
          targetPort: 8060
        type: NodePort
      systemInitialization:
        command:
        - /sbin/sysctl
        - -w
        - vm.max_map_count=2000000
    feSpec:
      annotations:
        selectdb/dorisclsuter.component: fe
      configMapInfo:
        configMapName: fe-configmap
        resolveKey: fe.conf
      envVars:
      - name: HOME
        value: /opt/selectdb
      image: selectdb/doris.fe-ubuntu:2.1.1
      limits:
        cpu: 8
        memory: 32Gi
      nodeSelector:
        kubernetes.io/hostname: loshu-kube-ds
      persistentVolumes:
      - mountPath: /opt/apache-doris/fe/doris-meta
        name: doris-fe
      replicas: 1
      requests:
        cpu: 2
        memory: 4Gi
      service:
        servicePorts:
        - nodePort: 30148
          targetPort: 8030
        - nodePort: 30252
          targetPort: 9020
        - nodePort: 31341
          targetPort: 9030
        type: NodePort
      systemInitialization:
        command:
        - /sbin/sysctl
        - -w
        - vm.max_map_count=2000000

@intelligentfu8
Contributor

Yes, you are right. However, the SelectDB community is improving the Stream Load capability. They have fixed the issue in the Arrow Flight PR, and PRs for Flink and Spark are coming. Please refer to the issue for more details.

@intelligentfu8 added the "good first issue" and "question" labels on Nov 18, 2024
@ming12713
Author


Nice, thanks!
