Postgres sql failure on 1.9.0 and 1.10.0 #993

Closed
AyWa opened this issue Aug 26, 2022 · 18 comments

@AyWa

AyWa commented Aug 26, 2022

Describe the Bug

We are running Flipt in k8s in multiple environments:

  • locally, using Rancher Desktop k8s
  • in GCP

We are on 1.8.3 and everything works well, but when we try to update to 1.9.0 or 1.10.0 (even with a new database), we get this error:

getting db driver for: postgres: dial tcp: lookup postgres.database.svc.cluster.local: device or resource busy

Version Info

1.8.2 -> working
1.8.3 -> working
1.9.0 -> error
1.10.0 -> error

Our PostgreSQL image is postgres:12.

To Reproduce

We just start the Docker image with the environment variable FLIPT_DB_URL set to postgres://yy@postgres.database.svc.cluster.local:5432/flipt?sslmode=disable

I went through the list of changes, but I could not see any change related to Postgres. Did we miss something, or do we need to change something?

log


Version: 1.9.0
Commit: 9938aba8e9c67aa4789bf0eec384d4113dd567d1
Build Date: 2022-08-26T07:05:51Z
Go Version: go1.17.11

A newer version of Flipt exists at https://github.com/flipt-io/flipt/releases/tag/v1.10.0, 
please consider updating to the latest version.
time="2022-08-26T07:05:53Z" level=info msg="shutting down..."
time="2022-08-26T07:05:53Z" level=error msg="getting db driver for: postgres: dial tcp: lookup postgres.database.svc.cluster.local: device or resource busy"
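
For anyone triaging a similar log line: the message has the shape of a Go DNS lookup failure (`lookup <host>: ...`) rather than a refused or timed-out TCP connection. The Go sketch below shows one way to tell the two apart with `errors.As` and `*net.DNSError`. It is an illustration only. `isDNSFailure` and `sampleLookupError` are hypothetical names, the sample error is hand-built to resemble the log above, and the idea that Flipt's own wrapping preserves the error chain is an assumption that this thread does not confirm.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// isDNSFailure reports whether err, or any error it wraps, is a DNS lookup
// failure as opposed to a connection-level failure.
func isDNSFailure(err error) bool {
	var dnsErr *net.DNSError
	return errors.As(err, &dnsErr)
}

// sampleLookupError hand-builds an error shaped like the one in the log.
// Wrapping with %w is an assumption made for this sketch, not Flipt's code.
func sampleLookupError() error {
	dialErr := &net.OpError{
		Op:  "dial",
		Net: "tcp",
		Err: &net.DNSError{
			Err:  "device or resource busy",
			Name: "postgres.database.svc.cluster.local",
		},
	}
	return fmt.Errorf("getting db driver for: postgres: %w", dialErr)
}

func main() {
	err := sampleLookupError()
	fmt.Println(err)
	fmt.Println("DNS failure:", isDNSFailure(err))
}
```

If the check returns true, the database was never contacted, so the place to look is name resolution inside the container rather than Postgres itself.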
@markphelps
Collaborator

Thanks for the detailed bug report @AyWa !

Taking a 👀. I'm actually out of town this weekend but will try to provide an update in the next couple of days.

@AyWa
Author

AyWa commented Aug 26, 2022

> Thanks for the detailed bug report @AyWa !
>
> Taking a 👀. I'm actually out of town this weekend but will try to provide an update in the next couple of days.

Thank you, it is not really urgent. I will also try to reproduce it outside our environment (maybe just Docker Compose or a simple k8s setup).

@markphelps
Collaborator

Cool, in the meantime I bumped the Postgres example (using docker-compose) and the integration tests to use Postgres 12 instead of the old Postgres 10, to see if that exposes anything.

@markphelps
Collaborator

FYI, I've traced it to the migration check that automatically runs on Flipt startup:

https://github.com/flipt-io/flipt/blob/main/storage/sql/migrator.go#L42

I wonder if something changed in the underlying 'github.com/golang-migrate/migrate/database/postgres' library between versions that is causing this error. Going to keep digging this coming week.

@AyWa
Author

AyWa commented Aug 28, 2022

This bug seems to be a little tricky. With Docker Compose everything works, using:

# Use postgres/example user/password credentials
version: '3.1'
services:
  db:
    image: postgres:12
    restart: always
    ports:
      - 5432:5432
    environment:
      POSTGRES_PASSWORD: example

  flipt:
    image: flipt/flipt:v1.10.0
    restart: always
    ports:
      - 8080:8080
      - 9000:9000
    environment:
      FLIPT_DB_URL: "postgres://postgres:example@db:5432/flipt?sslmode=disable"

However, on k8s, I can reproduce it with:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    environment: local
  name: db
---
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    environment: local
  name: enablement
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: db
  name: db
  namespace: db
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: db
  template:
    metadata:
      labels:
        app: db
        app.kubernetes.io/name: db
    spec:
      containers:
      - env:
        - name: POSTGRES_PASSWORD
          value: "example"
        image: postgres:12
        imagePullPolicy: IfNotPresent
        name: db
        ports:
        - containerPort: 5432
          name: db
          protocol: TCP
        securityContext: {}
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    environment: local
  labels:
  name: db
  namespace: db
spec:
  ports:
  - name: db
    port: 5432
    protocol: TCP
    targetPort: db
  selector:
    app.kubernetes.io/name: db
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    environment: local
  labels:
    app.kubernetes.io/instance: flipt
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flipt
    app.kubernetes.io/version: v1.10.0
    helm.sh/chart: flipt-0.5.0
  name: flipt
  namespace: enablement
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    environment: local
  labels:
    app: flipt
  name: flipt
  namespace: enablement
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: flipt
  template:
    metadata:
      labels:
        app: flipt
        app.kubernetes.io/name: flipt
    spec:
      containers:
      - env:
        - name: FLIPT_DB_URL
          value: "postgres://postgres:example@db.db.svc.cluster.local:5432/flipt?sslmode=disable"
        - name: FLIPT_META_TELEMETRY_ENABLED
          value: "false"
        # image: flipt/flipt:v1.8.3 # is working
        image: flipt/flipt:v1.9.0 # is not
        imagePullPolicy: IfNotPresent
        name: flipt
        resources:
          limits:
            cpu: 256m
            memory: 128Mi
          requests:
            cpu: 128m
            memory: 32Mi
        securityContext: {}
      securityContext: {}

I think this example is correct, because with the image set to flipt/flipt:v1.8.3 it works, but with flipt/flipt:v1.9.0 it does not.
Is the Docker build the same as before? The only difference I can see is that DNS in k8s is a bit different from Docker Compose. 🤔

So far I have only tried this in my local k8s (rancher-desktop). I might try on a clean cloud k8s, but our stg is on GCP and we had the same issue there.

@AyWa
Author

AyWa commented Aug 29, 2022

So I tried something else today: if I deploy the db and Flipt in the same namespace, so that the db connection string can be changed to postgres://postgres:example@db:5432/flipt?sslmode=disable, then it works!

So I think there is an issue in the Docker image with resolving DNS names like db.db.svc.cluster.local.
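
As background on why the short name `db` and the long name `db.db.svc.cluster.local` can take different paths through a resolver, here is a small Go sketch of the `search` / `ndots` rules from resolv.conf. The search list and `ndots:5` used in `main` are assumptions based on the usual Kubernetes defaults for a pod in the `enablement` namespace; they were not read from the cluster in this thread, and `queryOrder` is a hypothetical helper, not part of Flipt or the Go standard library. The sketch only lists candidate names. It does not show which lookup produced the error above.

```go
package main

import (
	"fmt"
	"strings"
)

// queryOrder returns the names a stub resolver would try, in order, for the
// given name, following resolv.conf "search" and "ndots" semantics.
func queryOrder(name string, search []string, ndots int) []string {
	// A trailing dot marks the name as absolute: no search list is applied.
	if strings.HasSuffix(name, ".") {
		return []string{name}
	}
	var expanded []string
	for _, domain := range search {
		expanded = append(expanded, name+"."+domain+".")
	}
	absolute := name + "."
	if strings.Count(name, ".") >= ndots {
		// Enough dots: try the name as-is first, then the search list.
		return append([]string{absolute}, expanded...)
	}
	// Too few dots: walk the search list first, then try the name as-is.
	return append(expanded, absolute)
}

func main() {
	// Assumed pod defaults; check /etc/resolv.conf in the real container.
	search := []string{"enablement.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, name := range []string{"db", "db.db.svc", "db.db.svc.cluster.local"} {
		fmt.Println(name, "->", queryOrder(name, search, 5))
	}
}
```

Under these assumed settings, `db.db.svc.cluster.local` has only four dots, so the search domains are tried before the name itself.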

@markphelps
Collaborator

That looks promising, thank you for digging in! I will be able to take a deeper dive tomorrow morning. I think the problem may lie in the migrator library I linked earlier.

@markphelps
Collaborator

Likely related to #963 as well

@GeorgeMac
Member

GeorgeMac commented Aug 30, 2022

👋 This is likely to do with how flipt is now being built.

see: segmentio/kafka-go#285 and golang/go#35067

Meanwhile, try dropping the .cluster.local suffix, i.e. use db.db.svc:5432 as the host. That works.

Not sure why the resolver is changing for flipt. I thought it was a change to static compilation. But I think I was wrong there.
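
If the connection string is generated rather than typed by hand, the workaround of dropping `.cluster.local` can be applied mechanically. A minimal Go sketch follows, assuming the URL has the same shape as the `FLIPT_DB_URL` values shown earlier in this thread. `dropClusterSuffix` is a hypothetical helper name and is not part of Flipt.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// dropClusterSuffix removes a trailing ".cluster.local" from the host part
// of a database URL and leaves credentials, port, path and query untouched.
func dropClusterSuffix(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	host := strings.TrimSuffix(u.Hostname(), ".cluster.local")
	if port := u.Port(); port != "" {
		host = net.JoinHostPort(host, port)
	}
	u.Host = host
	return u.String(), nil
}

func main() {
	in := "postgres://postgres:example@db.db.svc.cluster.local:5432/flipt?sslmode=disable"
	out, err := dropClusterSuffix(in)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

This is a stopgap for affected images only. It does not address the resolver behavior itself.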

@markphelps
Collaborator

Thanks @GeorgeMac !! I think this may have broken when I switched how Flipt is built for release.

Pre #927 I was building locally, using musl to cross-compile to linux (statically linked). In #927 I changed over to using github actions to build/release (see: https://github.com/flipt-io/flipt/pull/927/files#diff-42e26dc67aed8aa3edb2472b4403288c1699fb6dc47419b9a475f0f224fe4689).

I wonder if I need to set the netdns flags now that I'm not using musl-cross?
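
For context on the netdns question: Go's `net` package documents several ways to prefer its built-in DNS client over the C library resolver, including the `netgo` build tag, the `GODEBUG=netdns=go` environment variable at run time, and `net.Resolver{PreferGo: true}` in code. The sketch below is a small stand-alone diagnostic that compares the default resolver with a pure-Go one for a single hostname. It is not Flipt code, `pureGoResolver` and `lookup` are hypothetical names, and a binary built from it locally may not pick the same default resolver as the released Flipt binary.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// pureGoResolver returns a resolver that prefers Go's built-in DNS client
// over the C library resolver.
func pureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

// lookup resolves host with the given resolver and prints the outcome.
func lookup(label string, r *net.Resolver, host string) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	addrs, err := r.LookupHost(ctx, host)
	if err != nil {
		fmt.Printf("%-8s error: %v\n", label, err)
		return
	}
	fmt.Printf("%-8s %v\n", label, addrs)
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: lookup <hostname>")
		return
	}
	host := os.Args[1]
	lookup("default", net.DefaultResolver, host)
	lookup("pure-go", pureGoResolver(), host)
}
```

Running it inside the affected pod with `db.db.svc.cluster.local` as the argument would show whether the two resolvers disagree in that environment.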

@markphelps
Collaborator

I created #1001 to address this.

I'm actually having a hard time reproducing/verifying the fix, however, because of the move to building everything in CI. I need to build on a Linux machine (to reproduce how GH Actions is building), and the only Linux machine I have available locally is an Ubuntu VM running in UTM. However, I installed the ARM Ubuntu version (I'm running on an M1 Mac), so goreleaser can't build it for x64 when running in the VM.

Also, goreleaser doesn't allow me to push a snapshot build to Docker Hub, only release builds, so I can't very easily create a test Docker image to deploy to k8s/kind to test. 😠

Couple ideas:

  1. Hack around goreleaser by pushing the image manually after the build step in the snapshot action
  2. Hack around goreleaser by saving the image to GH artifacts in the snapshot action then downloading locally
  3. Create an x64 emulated Ubuntu VM via UTM and try to build via goreleaser there, testing everything inside the VM via kind
  4. Have a lovely contributor clone my fix branch on an x64 linux machine or VM, build via goreleaser release --snapshot, test the image in k8s or kind

What do you think @GeorgeMac @AyWa ?

@AyWa
Author

AyWa commented Sep 1, 2022

I am actually running on an M1 Mac too, but I could try to build the image on a VM tomorrow and see if it works.

@GeorgeMac
Member

My suggestion would probably be in the 1/2 space.
I would get Actions to do the building for you and publish it somewhere.

The other option might be to play with an alternative base image to replicate it.
i.e. instead of depending on the host as the base for the build, copy the source into e.g. an ubuntu base image, build from there and copy from that base into the alpine target.

@GeorgeMac
Member

First things first. Here is an adjusted version of build/Dockerfile which creates a representative build process.
I took this image and recreated the problem with it in kind:

# https://goreleaser.com/docker/

FROM golang:1.17.13-buster AS build

RUN apt clean && apt update && apt install -y gcc-x86-64-linux-gnu

RUN curl https://raw.githubusercontent.com/creationix/nvm/master/install.sh | bash

RUN \. "$HOME/.nvm/nvm.sh" && nvm install --lts

RUN \. "$HOME/.nvm/nvm.sh" && nvm use --lts

RUN go install github.com/goreleaser/goreleaser@v1.6.2

RUN go install github.com/go-task/task/v3/cmd/task@latest

WORKDIR /flipt

ADD . /flipt

RUN \. "$HOME/.nvm/nvm.sh" && task prep -f

ENV ANALYTICS_KEY=foo
ENV CC=x86_64-linux-gnu-gcc
ENV GOARCH=amd64

RUN goreleaser --rm-dist --snapshot build 

FROM alpine:3.16.2

LABEL maintainer="dev@flipt.io"
LABEL org.opencontainers.image.name="flipt"
LABEL org.opencontainers.image.source="https://github.com/flipt-io/flipt"

RUN apk add --no-cache postgresql-client \
    openssl \
    ca-certificates

RUN mkdir -p /etc/flipt && \
    mkdir -p /var/opt/flipt

COPY --from=build /flipt/dist/flipt_linux_amd64/flipt /
COPY config/migrations/ /etc/flipt/config/migrations/
COPY config/*.yml /etc/flipt/config/

RUN addgroup flipt && \
    adduser -S -D -g '' -G flipt -s /bin/sh flipt && \
    chown -R flipt:flipt /etc/flipt /var/opt/flipt

EXPOSE 8080
EXPOSE 9000

USER flipt

CMD ["./flipt"]

Then I did the following to validate that the error can be reproduced and is fixed by the netgo tag:

  1. Replace build/Dockerfile with the contents above.
  2. Run docker buildx build -f ./build/Dockerfile .
  3. Take the resulting image sha and tag it with something appropriate (see the buildx output),
    e.g. docker tag sha256:abcdef flipt/flipt:amd64-no-musl
  4. Take the yaml shared by @AyWa and apply it to a fresh kind cluster:
    kind create cluster --name flipt
    kubectx kind-flipt
    kubectl apply -f cluster.yaml
  5. Load the image built earlier into kind: kind load --name flipt docker-image flipt/flipt:amd64-no-musl
  6. Update the flipt container image directive in cluster.yaml to flipt/flipt:amd64-no-musl.
  7. Run kubectl -n enablement logs <pod-name>; you should see the error.
  8. Update .goreleaser as you have done in the PR above.
  9. Repeat steps (2) through (5).
  10. Now that the image has been updated, run kubectl delete pod <flipt-pod>; this causes the pod to restart with the replaced image under the same tag.

The issue disappears for me 👍

➜  docker tag sha256:b3bcff949de5665f62d92f9270c01c3923e924e375ca48712f0dcbd641afb96c flipt/flipt:amd64-no-musl
➜  kind load --name flipt docker-image flipt/flipt:amd64-no-musl
Image: "flipt/flipt:amd64-no-musl" with ID "sha256:b3bcff949de5665f62d92f9270c01c3923e924e375ca48712f0dcbd641afb96c" not yet present on node "flipt-control-plane", loading...
➜  k get pods
NAME                     READY   STATUS             RESTARTS         AGE
flipt-7b7cbf97fb-8b66p   0/1     CrashLoopBackOff   19 (2m21s ago)   75m
➜  k delete pod flipt-7b7cbf97fb-8b66p
pod "flipt-7b7cbf97fb-8b66p" deleted
➜  k get pods
NAME                     READY   STATUS    RESTARTS   AGE
flipt-7b7cbf97fb-8vp9p   1/1     Running   0          3s
➜  k get pods
NAME                     READY   STATUS    RESTARTS   AGE
flipt-7b7cbf97fb-8vp9p   1/1     Running   0          4s
➜  k logs flipt-7b7cbf97fb-8vp9p

 _____ _ _       _
|  ___| (_)_ __ | |_
| |_  | | | '_ \| __|
|  _| | | | |_) | |_
|_|   |_|_| .__/ \__|
          |_|

Version: 7509911-snapshot
Commit: 7509911fc6413ce1678eabe7c28c1166607613c9
Build Date: 2022-09-01T12:07:36Z
Go Version: go1.17.13


API: http://0.0.0.0:8080/api/v1
UI: http://0.0.0.0:8080

@markphelps
Collaborator

Thanks @GeorgeMac for validating the fix! Great idea of building in the container.

@AyWa I went ahead and backported the fix for 1.9 and created v1.9.1 as well as forward fixed for 1.10 (v1.10.1).

Would you mind giving either v1.9.1 or v1.10.1 a try and see if they now work for you?

@AyWa
Author

AyWa commented Sep 1, 2022

> Thanks @GeorgeMac for validating the fix! Great idea of building in the container.
>
> @AyWa I went ahead and backported the fix for 1.9 and created v1.9.1 as well as forward fixed for 1.10 (v1.10.1).
>
> Would you mind giving either v1.9.1 or v1.10.1 a try and see if they now work for you?

Sure, I will try to test locally and in our stg cluster tomorrow. Thanks for the quick fix!!

@markphelps
Collaborator

Can confirm the fix in 1.9.1 when updating the yaml provided by @AyWa to use v1.9.1 of the image:

workspace/flipt - [main●] » kubectl -n enablement logs flipt-6996554dd9-zh7bf

 _____ _ _       _
|  ___| (_)_ __ | |_
| |_  | | | '_ \| __|
|  _| | | | |_) | |_
|_|   |_|_| .__/ \__|
          |_|

Version: 1.9.1
Commit: cab0f2d16398d6793a7238a67fb62917a1c9db55
Build Date: 2022-09-01T18:00:14Z
Go Version: go1.17.13

A newer version of Flipt exists at https://github.com/flipt-io/flipt/releases/tag/v1.10.1,
please consider updating to the latest version.

@AyWa
Author

AyWa commented Sep 2, 2022

It is working perfectly!!! Thank you.

AyWa closed this as completed Sep 2, 2022