Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Prostgres Vector Store] Add PGVector Driver option + Fix null character issue w/ TypeORM Driver #3367

Conversation

JJK801
Copy link
Contributor

@JJK801 JJK801 commented Oct 16, 2024

Fix #3362

Hi Flowise,

Here is a PR that improves PG Vector Store by:

  • Adding a driver config (which can be typeorm by default, or pgvector)
  • Adding an option to configure the contentColumnName (pgvector only), so that some can migrate easily from other tools
  • Fixing the 0x00 character issue (PG crash on insert) with typeorm driver

let me know if you need some more details or changes.

Comment on lines 116 to 133
const embeddingString = `[${query.join(',')}]`
let _filter = '{}'
let notExists = ''
if (filter && typeof filter === 'object') {
if (filter.$notexists) {
notExists = `OR NOT (metadata ? '${filter.$notexists}')`
delete filter.$notexists
}
_filter = JSON.stringify(filter)
}

const queryString = `
SELECT *, embedding <=> $1 as "_distance"
FROM ${tableName}
WHERE metadata @> $2
${notExists}
ORDER BY "_distance" ASC
LIMIT $3;`
Copy link
Contributor Author

@JJK801 JJK801 Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm doubtful about this logic... (i pasted it as is from legacy code)

The WHERE clause sounds like: Where metadata includes filter OR metadata keys not includes chatflowIdKey
which seems to make the first part optional when second part is truthy, and this is clearly not the expected behavior

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PGVector driver, i removed the chatflow part from filters and added a AND (<compareExistingValue> OR <keyNotExist>)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original intention is to filter these conditions:

  • metadata contains the JSON structure (filter)
    OR
  • metadata does not contains the key flowise_chatId

Say the following scenario:
1.) Owner of the chatflow upserted data to the postgres vector store through API, ex: [{pageContent: 'abc', metadata: {}}]
2.) User A interacts with the chatflow and upload some files. This will be upserted on the fly, resulting in [{pageContent: 'this is userA', metadata: { flowise_chatId: userA-session }}]
3.) User B interacts with the chatflow and upload some files. This will be upserted on the fly, resulting in [{pageContent: 'this is userB', metadata: { flowise_chatId: userB-session }}]

Now, when user A is asking questions, we want to only get 2 sources:

[
  {pageContent: 'abc', metadata: {}}
  {pageContent: 'this is userA', metadata: { flowise_chatId: userA-session }}
]

We do not want to get source with metadata of flowise_chatId: userB-session

Hence this results in the following filters:
metadata contains that specific chatId OR metadata that doesn't have the flowise_chatId key

Copy link
Contributor Author

@JJK801 JJK801 Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tanks for your response @HenryHengZJ 🙏

We're ok about the expected behavior 👍

Now imagine a 4th scenario:
For whatever reason, i add a metadata filter on the vector store node, ex : [{pageContent: 'this is userA', metadata: { flowise_chatId: userA-session, foo: 'bar' }}]

With the actual or clause, the or clause will be:

WHERE metadata @> '{ flowise_chatId: \'userA-session\', foo: \'bar\' }'
OR NOT (metadata ? 'flowise_chatId')

This will return:

  • Chatflow uploaded documents filtered with foo=bar
  • Any documents without flowise_chatId, not filtered with foo=bar

If we want the right behavior, the where clause should be:

WHERE metadata @> '{ foo: \'bar\' }'
AND (metadata @> '{ flowise_chatId: \'userA-session\' }' OR NOT (metadata ? 'flowise_chatId'))

This will return:

  • Chatflow uploaded documents filtered with foo=bar
  • Any documents filtered with foo=bar

Copy link
Contributor Author

@JJK801 JJK801 Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as an alternative, we can also do:

WHERE metadata @> '{ foo: \'bar\' }'
OR metadata @> '{ flowise_chatId: \'userA-session\' }'

This will return:

  • Any document with foo=bar
  • Chatflow uploaded documents, not filtered with foo=bar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think its 3 clauses:

  • Any documents with metadata contains foo: bar
  • or any documents with metadata contains foo: bar, AND flowise_chatId: <chatid>
  • or any documents with metadata contains foo: bar, AND without flowise_chatId

Copy link
Contributor Author

@JJK801 JJK801 Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we don't care about foo: bar if there is a flowise_chatId set, because if it's been uploaded via the chatflow, it's always part of the chatflow context (and filters will not be added to the document's metadata).

Basically, the expectations are:

  • Return any document uploaded via the chatflow
  • Return filtered documents from the vector store
  • Don't return the documents uploaded via other chatflows

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part:

AND NOT (metadata ? 'flowise_chatId')

is just present to ensure that we don't return documents from other chatflows if there is no filter set:

WHERE (metadata @> '{}' AND NOT (metadata ? 'flowise_chatId'))
OR metadata @> '{ flowise_chatId: \'userA-session\' }'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so this will results if either of these two conditions is true:

  • The metadata contains foo': 'bar and does not have flowise_chatId.
  • The metadata contains flowise_chatId': 'userA-session, regardless of whether foo': 'bar is present or not.

Rewind back the scenario, and see if this fits:
1.) Owner upserted documents without metadata: pageContent: 'this is doc 1', metadata: {}
2.) Owner upserted documents WITH metadata: pageContent: 'this is doc 2', metadata: { source: doc2 }
Owner publish the chatflow to public
3.) User A uploaded a file from the chat window, results in pageContent: 'this is user A', metadata: { flowise_chatId: userA }
4.) User B uploaded a file from the chat window, results in pageContent: 'this is user B', metadata: { flowise_chatId: userB }

Now 4 docs are sitting in vector store:
1.) pageContent: 'this is doc 1', metadata: {}
2.) pageContent: 'this is doc 2', metadata: { source: doc2 }
3.) pageContent: 'this is user A', metadata: { flowise_chatId: userA }
4.) pageContent: 'this is user B', metadata: { flowise_chatId: userB }

Scenario 1: Owner doesnt set any filter on metadata. User A when chatting, should only get documents (1) (2) and (3)

Scenario 2: Owner pre set filter on metadata. User A when chatting, should only get documents (2) and (3)

Looks like the new clause you have there satisfy these 2 scenarios

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0a39892

return this._postgresConnectionOptions
}

async getArgs(): Promise<PGVectorStoreArgs> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for pgvector, can we allow the option for choosing distance strategy? #2198

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably, i don't know about it but it's a good learning opportunity ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HenryHengZJ done in 967f0bb

⚠️ not tested ⚠️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i did it for both PGVector an TypeORM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! will test it out

@JJK801
Copy link
Contributor Author

JJK801 commented Oct 26, 2024

@HenryHengZJ Last commit (4e1f4c4) allows to set default Postgres credentials via env variables (for both vector store and record manager) and makes credentials optional if user and password are set.

@HenryHengZJ
Copy link
Contributor

@HenryHengZJ Last commit (4e1f4c4) allows to set default Postgres credentials via env variables (for both vector store and record manager) and makes credentials optional if user and password are set.

thanks for that! I understand you prefer to read the value from an env file, but given that we have too many components, doing this will introduce too much env variables, and we wanted to limit env variables to only settings of the flowise apps, not for each and every components.

having said that, you can achieve the same using Variables: https://docs.flowiseai.com/using-flowise/variables#using-variables

For example, if I wanted to read Pinecone Index value, I can create a variable as such:
image
You can use a static value, or runtime value which can directly read from env file.

Then use it in the component:
image

@JJK801
Copy link
Contributor Author

JJK801 commented Oct 27, 2024

thanks for that! I understand you prefer to read the value from an env file, but given that we have too many components, doing this will introduce too much env variables, and we wanted to limit env variables to only settings of the flowise apps, not for each and every components.
This is a specific case because Langchain's tools for Postgres do not use env vars (that's why i documented it in Flowise) but almost all others do.

The reason why i'm doing this, is because of the allow/deny all strategy in flowise, i can't allow users to manage the chatflows without providing em' the ability to delete/modify credentials (which can break my whole production application in a mistake).

Also, i wan't to be able to deploy my stack all in once without having to manually setup credentials on each stage and put identifiers in client apps, or it will be an headache to deploy. For this part, having the capacity to setup credentials as default for each category would do the job, but it it doesn't solve the deletion issue.

}

protected async adaptInstance(instance: PGVectorStore, metadataFilters?: any): Promise<PGVectorStore> {
const { [FLOWISE_CHATID]: chatId, ...pgMetadataFilter } = metadataFilters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this error when metadatFilters is undefined

const { [FLOWISE_CHATID]: chatId, ...pgMetadataFilter } = metadataFilters || {}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a7de414

I'm working on some issues of my project ATM, but i'll re-test all this stuff asap

instance.client = await instance.pool.connect()
}

await instance.client.connect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have this error:

Error inserting: Client has already been connected. You cannot reuse a client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm sorry, i implemented that because i got Too many connections error while using for a long period, but didn't test it since. I'll fix that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 3527383

@HenryHengZJ
Copy link
Contributor

thanks for that! I understand you prefer to read the value from an env file, but given that we have too many components, doing this will introduce too much env variables, and we wanted to limit env variables to only settings of the flowise apps, not for each and every components.
This is a specific case because Langchain's tools for Postgres do not use env vars (that's why i documented it in Flowise) but almost all others do.

The reason why i'm doing this, is because of the allow/deny all strategy in flowise, i can't allow users to manage the chatflows without providing em' the ability to delete/modify credentials (which can break my whole production application in a mistake).

Also, i wan't to be able to deploy my stack all in once without having to manually setup credentials on each stage and put identifiers in client apps, or it will be an headache to deploy. For this part, having the capacity to setup credentials as default for each category would do the job, but it it doesn't solve the deletion issue.

got it, in that case I'd suggest to add a README to the packages/components/vectorstore/postgres and same for record manager, for all the available env variables, and once these env variables are provided, it will override whatever you specified on the UI. I would remove the env variables from CONTRIBUTING.md and .env.example as those are meant for app settings. One effect of using env variables is that you will not have the flexbility to use different credentials if you have multiple flows

@JJK801
Copy link
Contributor Author

JJK801 commented Oct 28, 2024

got it, in that case I'd suggest to add a README to the packages/components/vectorstore/postgres and same for record manager, for all the available env variables, and once these env variables are provided, it will override whatever you specified on the UI. I would remove the env variables from CONTRIBUTING.md and .env.example as those are meant for app settings. One effect of using env variables is that you will not have the flexbility to use different credentials if you have multiple flows

LGTM 👍

The expected behavior is that env vars act as default values and makes credentials optional, but it can be override by credentials if set.

@JJK801
Copy link
Contributor Author

JJK801 commented Oct 28, 2024

got it, in that case I'd suggest to add a README to the packages/components/vectorstore/postgres and same for record manager, for all the available env variables

Done in 675641b

Copy link
Contributor

@HenryHengZJ HenryHengZJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for late reply, thank you so much @JJK801 ! all looks good

@JJK801
Copy link
Contributor Author

JJK801 commented Nov 1, 2024

my pleasure

@HenryHengZJ HenryHengZJ merged commit 15d59a9 into FlowiseAI:main Nov 1, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants