New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

serialize ContextItems in SerializedChatMessages #5910

Open

sqs wants to merge 6 commits into main from sqs/serialized-chat-ctx

Member

sqs commented Oct 15, 2024 •

edited

Loading

Previously, a SerializedChatMessage contained ContextItems, which had URI instances. This meant that the JSON serialization of SerializedChatMessage was (1) not stable, since URI.toJSON and vscode.Uri.toJSON differ and are not standardized, and (2) not readable without special re-hydration. This seems to not have caused major problems, but it absolutely could have, such as if we tried to call .fsPath (a dynamic getter) on a URI in a ChatMessage that we deserialized.

This also paves the way to make it so that dehydration/hydration are not needed across the postMessage boundary.

Test plan

CI & e2e

sqs added 2 commits

October 15, 2024 10:55


          centralize deserializeChatMessage

e3addfd


          store SerializedContextItem in SerializedChatMessage

2d5bf53

sqs requested review from dominiccooney and a team

October 15, 2024 18:02

github-actions bot commented Oct 15, 2024

‼️ Hey @sourcegraph/cody-security, please review this PR carefully as it introduces the usage of an unsafe_ function or abuses PromptString.

jamesmcnamara added 4 commits

October 15, 2024 12:25


          added a formatter blacklist entry for SerializedContextItem content

26c0a2f


          added an acceptlist for subtyping in code generator

a4c1087


          regenerated

b41e769


          added a better diagnostic message to opt-in for subtyping

d2d38d7

dominiccooney requested changes

View reviewed changes

Contributor

dominiccooney left a comment

Could we instead allow subtypes more generally, and not add an explicit list like this?

agent/src/cli/command-bench/strategy-chat.ts

@@ @@ -65,7 +65,7 @@ export async function evaluateChatStrategy( @@
                               const concisenessScore = await llm.judge(concisenessPrompt({ response: llmResponse }))
                               const contextItems = query?.contextFiles?.map(i => ({
                                   source: i.source,
-                                  file: i.uri.path + `:${i.range?.start.line}-${i.range?.end.line}`,
+                                  file: i.uri + `:${i.range?.start.line}-${i.range?.end.line}`,

Contributor

dominiccooney Oct 21, 2024

What consumes this? Because if it looks for that : separator for line ranges, it is going to encounter a : in file:/// now.

agent/src/cli/scip-codegen/BaseCodegen.ts

-                  public isStringType(type: scip.Type): boolean {
+                  public isStringType(
+                      type: scip.Type,
+                      options: TypecheckerOptions = { allowSubtyping: false }

Contributor

dominiccooney Oct 21, 2024

It is unclear to me what this means.

Consider the case immediately below:

if (type.has_constant_type) {
  return type.constant_type.constant.has_string_constant
}

In that sense, "some literal" : string has type "some literal" and is a subtype of string. So if allowSubtyping is false then we are returning an inaccurate result there.

Member Author

sqs Oct 21, 2024

@jamesmcnamara wrote this. @jamesmcnamara, can you pls reply?

Contributor

jamesmcnamara Oct 21, 2024 •

edited

Loading

This option allows us consider the a union type of T | undefined | null (or some subset) as assignable to T. Perhaps allowNullableSupertyping would be better?

agent/src/cli/scip-codegen/BaseCodegen.ts

+                          const sym = type.type_ref.symbol
+                          if (
+                              options.allowSubtyping &&
+                              [typescriptKeyword('undefined'), typescriptKeyword('null')].includes(sym)

Contributor

dominiccooney Oct 21, 2024

Question about the TypeScript AST here: Is a type like T? desugared to a representation like T | undefined by this point?

Member Author

sqs Oct 21, 2024

@jamesmcnamara wrote this. @jamesmcnamara, can you pls reply?

Contributor

jamesmcnamara Oct 21, 2024

Yes, can confirm.

agent/src/cli/scip-codegen/BaseCodegen.ts

+                              options.allowSubtyping &&
+                              [typescriptKeyword('undefined'), typescriptKeyword('null')].includes(sym)
+                          ) {
+                              return true

Contributor

dominiccooney Oct 21, 2024

If allow sub typing is true, this doesn't make sense to me, because string | undefined is a supertype of string... all values of string inhabit string | undefined...

Would it make sense to recast the function not as isStringType but something about a bound on a type involving string?

Member Author

sqs Oct 21, 2024

@jamesmcnamara wrote this. @jamesmcnamara, can you pls reply?

agent/src/cli/scip-codegen/Codegen.ts

+                                      })
+                                  ) {
+                                      diagnostic.message = dedent`${diagnostic.message} Note that these types are
+                                      sub-type compatible, so if you want to allow subtyping for this class, add the symbol to

Contributor

dominiccooney Oct 21, 2024

If this were safe, why wouldn't we just do it everywhere all the time?

If this is not safe, then it seems dubious to encourage people to enable it without any provisos.

Member Author

sqs Oct 21, 2024

@jamesmcnamara wrote this. @jamesmcnamara, can you pls reply?

dominiccooney reviewed

View reviewed changes

agent/src/cli/scip-codegen/BaseCodegen.ts

                           )
                       }
                       return false
                   }
-                  public compatibleSignatures(a: scip.SymbolInformation, b: scip.SymbolInformation): boolean {
-                      if (this.isStringTypeInfo(a) && this.isStringTypeInfo(b)) {
+                  public compatibleSignatures(

Contributor

dominiccooney Oct 21, 2024

I am uncertain about this approach.

IF I understand it, you're saying that

Item = {
    title?: string
    description?: string
} & (HasTitle | HasDescription)

is safe because HasTitle has a string field to discriminate on, and HasDescription has a string field to discriminate on? But IIRC the downstream doesn't work that way... it discriminates on a specific field.

And in practice the escape hatch works for SerializedContextItem because there's no discrimination happening... the consumer just gets that big union type. But would things blow up if we had a method with a type like SerializedContextItemTypeA | SerializedContextItemTypeB?

Member Author

sqs Oct 21, 2024

@jamesmcnamara wrote this. @jamesmcnamara, can you pls reply?

Contributor

jamesmcnamara Oct 21, 2024

There shouldn't be any method descriptions in here because this is used to generate code from the agent protocol so it should only contain immediate data types.

A type alias like type AorB = SerializedContextItemTypeA | SerializedContextItemTypeB should generate a discriminated union without any trouble.

The only issue we're trying to resolve here is the combination of a union type and an intersection type which is treated at the top level as an intersection type (as described in this comment). The current behavior is to check that every shared field in the sub-discriminated unions are identical and we throw an exception if they differ (even on nullability).

(Note that there is a bug in the existing implementation in that the output intersection type has all fields from all discriminated union members included, so:

type Intersect = {
  name: string
} & ({title: string} | {description: string})

would produce:

type Intersect = {
  name: string
  title: string
  description: string
}

instead of

type Intersect = {
  name: string
  title: string | undefined
  description: string | undefined
}

and that is not changed in this implementation)

The only case we're trying to handle here is where the the types of a field on the intersection type and the sub-union types differ, but they differ only in their nullability (as with the above Item type). 99% of the time, I think this was a mistake by the programmer, or this type should not be included in the agent protocol, hence it not being handled by default.

However in the 1% of cases where it is intentional, as with SerializedContextItem which takes advantage of the type-refinedness on the TypeScript side and is reasonably included in the protocol for chat migrations and exports, we have 4 options:

Define a protocol specific variant of SerializedContextItem (like we do for ProtocolAuthStatus) and convert every method coming out of the agent that uses that type to the simpler protocol variant.
Simplify the type on the TypeScript side to be consistent with the current code generation (would require inserting null handling in various parts of the app which would be misleading as they were never reachable)
Generate the type as an union type, inlining all properties of the intersection type onto each variant of the union.
Generate the type as an intersection type with each field using it's highest upper-bound supertype from the unions (or null).

1 is the most practical and probably the best practice but introduces an annoying maintenance burden of maintaining two types that are semantically identical but duplicated. 3 feels to me the most correct way but would require a careful, non-trivial implementation. 4 is what the current implementation does only for white-listed types (and indeed we see there is exactly one instance of this in the entire protocol so it's not likely a valuable case to invest in.)

@olafurpg's opinion was to use option 1 and in general to avoid allowing complex types to leak into the protocol definition. I implemented 4 because it was quick and allowed us to avoid the conversion types. But with this much pushback and the bug noted above, perhaps 1 is really the way we should go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet