Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding advance failover detection for Aurora managed DBs #2160

Open
xocasdashdash opened this issue Nov 2, 2024 · 5 comments
Open

Adding advance failover detection for Aurora managed DBs #2160

xocasdashdash opened this issue Nov 2, 2024 · 5 comments

Comments

@xocasdashdash
Copy link

Is your feature request related to a problem? Please describe.

When a db failover happens in Aurora clusters for some time until DNS gets reconfigured any connections are broken.

Potentially i'd like to be able to configure a monitor that detects this situation and:

  • invalidates current connections
  • provides DNS resolution that's faster than DNS ttl

Describe the solution you'd like
Looking at this https://github.com/aws/aws-advanced-jdbc-wrapper i think what i'd like to be able to have is some topology aware driver/connection pool.

Describe alternatives you've considered
I think the target_session_attrs covers some of the functionality that i'm thinking about, specifically giving it a list of hosts and requiring that the connection is in "write" mode.

This does not cover a failover scenario, because at that point we need to re-resolve the DNS (see here )

I think that this can be mostly doable within the context of pgx by doing the following:

  • Have a custom function that returns the IP of the writer
  • When a failover is detected, reset the connection pool (based on this)

I'm mostly writing here to see if anybody can think of another way of going at this problem, the aws jdbc driver acts as a load balancer from the point of view from the app:

For each client session, the driver maintains a logical connection (app to driver) and a physical connection (driver to database). If a physical connection fails due to a failover, the driver notices the failure and initiates its own failover handling protocol while keeping the logical connection alive

I don't think that pgx should be responsible for that but maybe there's a "simpler" approach where i can wrap pgx and do something pretty similar, but i'm not fully sure what this wrapper looks like. is it implementing the sql interfaces on top of pgx?

@jackc
Copy link
Owner

jackc commented Nov 2, 2024

I don't have experience with Aurora, but just wanted to make sure you are aware of pgtype.Config.DialFunc and pgtype.Config.LookupFunc. My guess is they could play a significant part in solution you end up with.

@xocasdashdash
Copy link
Author

yeah, I was checking those up too.

Is there any interface you think i could target to wrap?

I think that if i create my own driver it should be possible to do this dynamic connection transparently. As i understand it i should not be opening a connection until the first query is executed. Would that mean that i have to create my own Connection too?

Mostly asking to see if you can give me some pointers in regards to how to write my own db driver, feels like there's a lot of nuance to it to make this easily integrate with projects like https://github.com/jmoiron/sqlx or https://github.com/XSAM/otelsql

@felix-roehrich
Copy link
Contributor

pgxpool.Pool supports BeforeConnect and BeforeAcquire shouldn't this be enough to implement your use case?

@jackc
Copy link
Owner

jackc commented Nov 6, 2024

@xocasdashdash

Ah, you're looking to use database/sql. That might be a little trickier as pgx doesn't manage the connection pool, database/sql does.

If you're looking to write your own database/sql compatible driver then you would run into the same issue - the pool is out of your control. But if you want to try then I would look at stdlib/sql.go. pgx is written in a layers so the actual database/sql driver is less than 900 lines including comments. So the database/sql part is pretty accessible.

@xocasdashdash
Copy link
Author

@jackc I think i have a working approach

Wrap the pgxstdlib.Driver with something like this:

type Driver struct {
	pgxstdlib.Driver
	configMutex sync.Mutex
	sequence    int

	configs map[string]*auroradriver.Config
}

Then have my "auroradriver" create a connection monitor when a called to Open is done:

// Open opens a new connection to the database.
func (a *Driver) Open(name string) (driver.Conn, error) {
	c, ok := a.configs[name]
	c.StartMonitor()
	if !ok {
		return nil, fmt.Errorf(
			"connection not found: %s %w",
			name,
			ErrConnectionNotFound,
		)
	}
	d := c.Connector.Driver()
	return d.Open(name)
}

That monitor makes use of the AfterConnect and LookupFunc hooks:

// AfterConnect is a callback that is called after a connection is established.
func (m *Monitor) AfterConnect(_ context.Context, conn *pgconn.PgConn) error {
	m.configLock.Lock()
	defer m.configLock.Unlock()
	m.conns = append(m.conns, conn)
	return nil
}

// LookupFunc is a callback that is called to lookup the host.
func (m *Monitor) LookupFunc(_ context.Context, _ string) ([]string, error) {
	if m.monitorConfig.Preference == ReadOnly {
		return m.hostList, nil
	}
	return []string{m.writerHost}, nil
}

and then have something like this:

	watcher := m.hp.WatchTopologyChange()
        for {
		select {
		// Refresh the hosts when the watcher or the ticker fires.
		case <-t.C:
			if err := m.RefreshHosts(ctx); err != nil {
				m.logger.ErrorContext(ctx, "failed to refresh hosts", slog.AnyValue(err))
			}
		case <-watcher:
			m.logger.InfoContext(ctx, "topology change detected")
			if err := m.RefreshHosts(ctx); err != nil {
				m.logger.ErrorContext(ctx, "failed to refresh hosts", slog.AnyValue(err))
			}
			m.configLock.Lock()
			for _, conn := range m.conns {
				err := conn.Close(ctx)
				if err != nil {
					m.logger.ErrorContext(ctx, "failed to close connection", slog.AnyValue(err))
				}
			}
			m.configLock.Unlock()

		case <-m.controlCh:
			m.configLock.Lock()
			for _, conn := range m.conns {
				err := conn.Close(ctx)
				if err != nil {
					m.logger.ErrorContext(ctx, "failed to close connection", slog.AnyValue(err))
				}
			}
			m.conns = nil
			m.configLock.Unlock()
			return
		}
	}

that watches for changes in topology and have that close db connections when needed. At this point i'm assuming (haven't fully tested it) that closing the connections means that when a new connection is requested it'll provision it and it'll only kill the "live" connections

I think the approach is similar enough to the Advanced JDBC driver that it should work, i don't fully control the connections like the jdbc driver does but it might be enough for what i want

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants