
Commit d3a1591

feat(api): api update
1 parent 6e1f882 commit d3a1591

8 files changed: +427 additions, -361 deletions

.stats.yml

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 configured_endpoints: 54
-openapi_spec_hash: d01153406f196329a5d5d1efd8f3e50e
+openapi_spec_hash: 924a89b5f031d9215a5a701f834b132f
 config_hash: 930284cfa37f835d949c8a1b124f4807

src/codex/resources/projects/projects.py

Lines changed: 92 additions & 80 deletions
Large diffs are not rendered by default.

src/codex/resources/tlm.py

Lines changed: 184 additions & 160 deletions
Large diffs are not rendered by default.

src/codex/types/project_validate_params.py

Lines changed: 48 additions & 40 deletions
@@ -123,60 +123,66 @@ class ProjectValidateParams(TypedDict, total=False):
 `model`, and `max_tokens` is set to 512. You can set custom values for these
 arguments regardless of the quality preset specified.

-Args: model ({"gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano", "o4-mini", "o3",
-"gpt-4.5-preview", "gpt-4o-mini", "gpt-4o", "o3-mini", "o1", "o1-mini", "gpt-4",
-"gpt-3.5-turbo-16k", "claude-opus-4-0", "claude-sonnet-4-0",
-"claude-3.7-sonnet", "claude-3.5-sonnet-v2", "claude-3.5-sonnet",
-"claude-3.5-haiku", "claude-3-haiku", "nova-micro", "nova-lite", "nova-pro"},
-default = "gpt-4.1-mini"): Underlying base LLM to use (better models yield
-better results, faster models yield faster results). - Models still in beta:
-"o3", "o1", "o4-mini", "o3-mini", "o1-mini", "gpt-4.5-preview",
-"claude-opus-4-0", "claude-sonnet-4-0", "claude-3.7-sonnet",
-"claude-3.5-haiku". - Recommended models for accuracy: "gpt-4.1", "o4-mini",
-"o3", "claude-opus-4-0", "claude-sonnet-4-0". - Recommended models for low
-latency/costs: "gpt-4.1-nano", "nova-micro".
-
-max_tokens (int, default = 512): the maximum number of tokens that can be generated in the TLM response (and in internal trustworthiness scoring).
-Higher values here may produce better (more reliable) TLM responses and trustworthiness scores, but at higher runtimes/costs.
-If you experience token/rate limit errors while using TLM, try lowering this number.
+Args: model ({"gpt-5", "gpt-5-mini", "gpt-5-nano", "gpt-4.1", "gpt-4.1-mini",
+"gpt-4.1-nano", "o4-mini", "o3", "gpt-4.5-preview", "gpt-4o-mini", "gpt-4o",
+"o3-mini", "o1", "o1-mini", "gpt-4", "gpt-3.5-turbo-16k", "claude-opus-4-0",
+"claude-sonnet-4-0", "claude-3.7-sonnet", "claude-3.5-sonnet-v2",
+"claude-3.5-sonnet", "claude-3.5-haiku", "claude-3-haiku", "nova-micro",
+"nova-lite", "nova-pro"}, default = "gpt-4.1-mini"): Underlying base LLM to use
+(better models yield better results, faster models yield faster results). -
+Models still in beta: "gpt-5", "gpt-5-mini", "gpt-5-nano", "o3", "o1",
+"o4-mini", "o3-mini", "o1-mini", "gpt-4.5-preview", "claude-opus-4-0",
+"claude-sonnet-4-0", "claude-3.7-sonnet", "claude-3.5-haiku". - Recommended
+models for accuracy: "gpt-5", "gpt-4.1", "o4-mini", "o3", "claude-opus-4-0",
+"claude-sonnet-4-0". - Recommended models for low latency/costs: "gpt-4.1-nano",
+"nova-micro".
+
+log (list[str], default = []): optionally specify additional logs or metadata that TLM should return.
+For instance, include "explanation" here to get explanations of why a response is scored with low trustworthiness.
+
+custom_eval_criteria (list[dict[str, Any]], default = []): optionally specify custom evalution criteria beyond the built-in trustworthiness scoring.
+The expected input format is a list of dictionaries, where each dictionary has the following keys:
+- name: Name of the evaluation criteria.
+- criteria: Instructions specifying the evaluation criteria.
+
+max_tokens (int, default = 512): the maximum number of tokens that can be generated in the response from `TLM.prompt()` as well as during internal trustworthiness scoring.
+If you experience token/rate-limit errors, try lowering this number.
 For OpenAI models, this parameter must be between 64 and 4096. For Claude models, this parameter must be between 64 and 512.

-num_candidate_responses (int, default = 1): how many alternative candidate responses are internally generated in `TLM.prompt()`.
-`TLM.prompt()` scores the trustworthiness of each candidate response, and then returns the most trustworthy one.
-This parameter must be between 1 and 20. It has no effect on `TLM.score()`.
-Higher values here can produce more accurate responses from `TLM.prompt()`, but at higher runtimes/costs.
-When it is 1, `TLM.prompt()` simply returns a standard LLM response and does not attempt to auto-improve it.
+reasoning_effort ({"none", "low", "medium", "high"}, default = "high"): how much internal LLM calls are allowed to reason (number of thinking tokens)
+when generating alternative possible responses and reflecting on responses during trustworthiness scoring.
+Reduce this value to reduce runtimes. Higher values may improve trust scoring.
+
+num_self_reflections (int, default = 3): the number of different evaluations to perform where the LLM reflects on the response, a factor affecting trust scoring.
+The maximum number currently supported is 3. Lower values can reduce runtimes.
+Reflection helps quantify aleatoric uncertainty associated with challenging prompts and catches responses that are noticeably incorrect/bad upon further analysis.
+This parameter has no effect when `disable_trustworthiness` is True.

-num_consistency_samples (int, default = 8): the amount of internal sampling to measure LLM response consistency, a factor affecting trustworthiness scoring.
-Must be between 0 and 20. Higher values produce more reliable TLM trustworthiness scores, but at higher runtimes/costs.
+num_consistency_samples (int, default = 8): the amount of internal sampling to measure LLM response consistency, a factor affecting trust scoring.
+Must be between 0 and 20. Lower values can reduce runtimes.
 Measuring consistency helps quantify the epistemic uncertainty associated with
 strange prompts or prompts that are too vague/open-ended to receive a clearly defined 'good' response.
 TLM measures consistency via the degree of contradiction between sampled responses that the model considers plausible.
-
-num_self_reflections(int, default = 3): the number of self-reflections to perform where the LLM is asked to reflect on the given response and directly evaluate correctness/confidence.
-The maximum number of self-reflections currently supported is 3. Lower values will reduce runtimes/costs, but potentially also the reliability of trustworthiness scores.
-Reflection helps quantify aleatoric uncertainty associated with challenging prompts and catches responses that are noticeably incorrect/bad upon further analysis.
+This parameter has no effect when `disable_trustworthiness` is True.

 similarity_measure ({"semantic", "string", "embedding", "embedding_large", "code", "discrepancy"}, default = "discrepancy"): how the
 trustworthiness scoring's consistency algorithm measures similarity between alternative responses considered plausible by the model.
 Supported similarity measures include - "semantic" (based on natural language inference),
 "embedding" (based on vector embedding similarity), "embedding_large" (based on a larger embedding model),
 "code" (based on model-based analysis designed to compare code), "discrepancy" (based on model-based analysis of possible discrepancies),
-and "string" (based on character/word overlap). Set this to "string" for minimal runtimes/costs.
-
-reasoning_effort ({"none", "low", "medium", "high"}, default = "high"): how much internal LLM calls are allowed to reason (number of thinking tokens)
-when generating alternative possible responses and reflecting on responses during trustworthiness scoring.
-Higher reasoning efforts may yield more reliable TLM trustworthiness scores. Reduce this value to reduce runtimes/costs.
-
-log (list[str], default = []): optionally specify additional logs or metadata that TLM should return.
-For instance, include "explanation" here to get explanations of why a response is scored with low trustworthiness.
+and "string" (based on character/word overlap). Set this to "string" for minimal runtimes.
+This parameter has no effect when `num_consistency_samples = 0`.

-custom_eval_criteria (list[dict[str, Any]], default = []): optionally specify custom evalution criteria beyond the built-in trustworthiness scoring.
-The expected input format is a list of dictionaries, where each dictionary has the following keys:
-- name: Name of the evaluation criteria.
-- criteria: Instructions specifying the evaluation criteria.
+num_candidate_responses (int, default = 1): how many alternative candidate responses are internally generated in `TLM.prompt()`.
+`TLM.prompt()` scores the trustworthiness of each candidate response, and then returns the most trustworthy one.
+You can auto-improve responses by increasing this parameter, but at higher runtimes/costs.
+This parameter must be between 1 and 20. It has no effect on `TLM.score()`.
+When this parameter is 1, `TLM.prompt()` simply returns a standard LLM response and does not attempt to auto-improve it.
+This parameter has no effect when `disable_trustworthiness` is True.

-use_self_reflection (bool, default = `True`): deprecated. Use `num_self_reflections` instead.
+disable_trustworthiness (bool, default = False): if True, trustworthiness scoring is disabled and TLM will not compute trust scores for responses.
+This is useful when you only want to use custom evaluation criteria or when you want to minimize computational overhead and only need the base LLM response.
+The following parameters will be ignored when `disable_trustworthiness` is True: `num_consistency_samples`, `num_self_reflections`, `num_candidate_responses`, `reasoning_effort`, `similarity_measure`.
 """

 prompt: Optional[str]

@@ -647,6 +653,8 @@ class MessageChatCompletionDeveloperMessageParam(TypedDict, total=False):
 class Options(TypedDict, total=False):
     custom_eval_criteria: Iterable[object]

+    disable_trustworthiness: bool
+
     log: List[str]

     max_tokens: int
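The substantive change in this file is the new `disable_trustworthiness` option; the rest of the diff reorders and rewords the surrounding docstring. Below is a minimal usage sketch, not taken from this commit: it assembles an `Options` payload using only keys documented in the docstring above, and the import path simply mirrors src/codex/types/project_validate_params.py, so treat both as assumptions rather than verified public API.

# Hypothetical sketch: validate options that rely on custom evaluation criteria
# only, using the new disable_trustworthiness flag introduced in this commit.
from codex.types.project_validate_params import Options  # import path assumed from this diff

options: Options = {
    "model": "gpt-4.1-mini",  # documented default base LLM
    "max_tokens": 512,        # 64-4096 for OpenAI models, 64-512 for Claude models
    "custom_eval_criteria": [
        {
            "name": "Conciseness",  # name of the evaluation criteria
            "criteria": "The response should be no longer than three sentences.",
        }
    ],
    # New in this commit: skip trustworthiness scoring entirely. Per the docstring,
    # num_consistency_samples, num_self_reflections, num_candidate_responses,
    # reasoning_effort, and similarity_measure are then ignored.
    "disable_trustworthiness": True,
}

This combination is the use case the new docstring calls out: custom evaluation criteria without the overhead of trust scoring.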

src/codex/types/tlm_prompt_params.py

Lines changed: 48 additions & 40 deletions
@@ -45,60 +45,66 @@ class TlmPromptParams(TypedDict, total=False):
 `model`, and `max_tokens` is set to 512. You can set custom values for these
 arguments regardless of the quality preset specified.

-Args: model ({"gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano", "o4-mini", "o3",
-"gpt-4.5-preview", "gpt-4o-mini", "gpt-4o", "o3-mini", "o1", "o1-mini", "gpt-4",
-"gpt-3.5-turbo-16k", "claude-opus-4-0", "claude-sonnet-4-0",
-"claude-3.7-sonnet", "claude-3.5-sonnet-v2", "claude-3.5-sonnet",
-"claude-3.5-haiku", "claude-3-haiku", "nova-micro", "nova-lite", "nova-pro"},
-default = "gpt-4.1-mini"): Underlying base LLM to use (better models yield
-better results, faster models yield faster results). - Models still in beta:
-"o3", "o1", "o4-mini", "o3-mini", "o1-mini", "gpt-4.5-preview",
-"claude-opus-4-0", "claude-sonnet-4-0", "claude-3.7-sonnet",
-"claude-3.5-haiku". - Recommended models for accuracy: "gpt-4.1", "o4-mini",
-"o3", "claude-opus-4-0", "claude-sonnet-4-0". - Recommended models for low
-latency/costs: "gpt-4.1-nano", "nova-micro".
-
-max_tokens (int, default = 512): the maximum number of tokens that can be generated in the TLM response (and in internal trustworthiness scoring).
-Higher values here may produce better (more reliable) TLM responses and trustworthiness scores, but at higher runtimes/costs.
-If you experience token/rate limit errors while using TLM, try lowering this number.
+Args: model ({"gpt-5", "gpt-5-mini", "gpt-5-nano", "gpt-4.1", "gpt-4.1-mini",
+"gpt-4.1-nano", "o4-mini", "o3", "gpt-4.5-preview", "gpt-4o-mini", "gpt-4o",
+"o3-mini", "o1", "o1-mini", "gpt-4", "gpt-3.5-turbo-16k", "claude-opus-4-0",
+"claude-sonnet-4-0", "claude-3.7-sonnet", "claude-3.5-sonnet-v2",
+"claude-3.5-sonnet", "claude-3.5-haiku", "claude-3-haiku", "nova-micro",
+"nova-lite", "nova-pro"}, default = "gpt-4.1-mini"): Underlying base LLM to use
+(better models yield better results, faster models yield faster results). -
+Models still in beta: "gpt-5", "gpt-5-mini", "gpt-5-nano", "o3", "o1",
+"o4-mini", "o3-mini", "o1-mini", "gpt-4.5-preview", "claude-opus-4-0",
+"claude-sonnet-4-0", "claude-3.7-sonnet", "claude-3.5-haiku". - Recommended
+models for accuracy: "gpt-5", "gpt-4.1", "o4-mini", "o3", "claude-opus-4-0",
+"claude-sonnet-4-0". - Recommended models for low latency/costs: "gpt-4.1-nano",
+"nova-micro".
+
+log (list[str], default = []): optionally specify additional logs or metadata that TLM should return.
+For instance, include "explanation" here to get explanations of why a response is scored with low trustworthiness.
+
+custom_eval_criteria (list[dict[str, Any]], default = []): optionally specify custom evalution criteria beyond the built-in trustworthiness scoring.
+The expected input format is a list of dictionaries, where each dictionary has the following keys:
+- name: Name of the evaluation criteria.
+- criteria: Instructions specifying the evaluation criteria.
+
+max_tokens (int, default = 512): the maximum number of tokens that can be generated in the response from `TLM.prompt()` as well as during internal trustworthiness scoring.
+If you experience token/rate-limit errors, try lowering this number.
 For OpenAI models, this parameter must be between 64 and 4096. For Claude models, this parameter must be between 64 and 512.

-num_candidate_responses (int, default = 1): how many alternative candidate responses are internally generated in `TLM.prompt()`.
-`TLM.prompt()` scores the trustworthiness of each candidate response, and then returns the most trustworthy one.
-This parameter must be between 1 and 20. It has no effect on `TLM.score()`.
-Higher values here can produce more accurate responses from `TLM.prompt()`, but at higher runtimes/costs.
-When it is 1, `TLM.prompt()` simply returns a standard LLM response and does not attempt to auto-improve it.
+reasoning_effort ({"none", "low", "medium", "high"}, default = "high"): how much internal LLM calls are allowed to reason (number of thinking tokens)
+when generating alternative possible responses and reflecting on responses during trustworthiness scoring.
+Reduce this value to reduce runtimes. Higher values may improve trust scoring.
+
+num_self_reflections (int, default = 3): the number of different evaluations to perform where the LLM reflects on the response, a factor affecting trust scoring.
+The maximum number currently supported is 3. Lower values can reduce runtimes.
+Reflection helps quantify aleatoric uncertainty associated with challenging prompts and catches responses that are noticeably incorrect/bad upon further analysis.
+This parameter has no effect when `disable_trustworthiness` is True.

-num_consistency_samples (int, default = 8): the amount of internal sampling to measure LLM response consistency, a factor affecting trustworthiness scoring.
-Must be between 0 and 20. Higher values produce more reliable TLM trustworthiness scores, but at higher runtimes/costs.
+num_consistency_samples (int, default = 8): the amount of internal sampling to measure LLM response consistency, a factor affecting trust scoring.
+Must be between 0 and 20. Lower values can reduce runtimes.
 Measuring consistency helps quantify the epistemic uncertainty associated with
 strange prompts or prompts that are too vague/open-ended to receive a clearly defined 'good' response.
 TLM measures consistency via the degree of contradiction between sampled responses that the model considers plausible.
-
-num_self_reflections(int, default = 3): the number of self-reflections to perform where the LLM is asked to reflect on the given response and directly evaluate correctness/confidence.
-The maximum number of self-reflections currently supported is 3. Lower values will reduce runtimes/costs, but potentially also the reliability of trustworthiness scores.
-Reflection helps quantify aleatoric uncertainty associated with challenging prompts and catches responses that are noticeably incorrect/bad upon further analysis.
+This parameter has no effect when `disable_trustworthiness` is True.

 similarity_measure ({"semantic", "string", "embedding", "embedding_large", "code", "discrepancy"}, default = "discrepancy"): how the
 trustworthiness scoring's consistency algorithm measures similarity between alternative responses considered plausible by the model.
 Supported similarity measures include - "semantic" (based on natural language inference),
 "embedding" (based on vector embedding similarity), "embedding_large" (based on a larger embedding model),
 "code" (based on model-based analysis designed to compare code), "discrepancy" (based on model-based analysis of possible discrepancies),
-and "string" (based on character/word overlap). Set this to "string" for minimal runtimes/costs.
-
-reasoning_effort ({"none", "low", "medium", "high"}, default = "high"): how much internal LLM calls are allowed to reason (number of thinking tokens)
-when generating alternative possible responses and reflecting on responses during trustworthiness scoring.
-Higher reasoning efforts may yield more reliable TLM trustworthiness scores. Reduce this value to reduce runtimes/costs.
-
-log (list[str], default = []): optionally specify additional logs or metadata that TLM should return.
-For instance, include "explanation" here to get explanations of why a response is scored with low trustworthiness.
+and "string" (based on character/word overlap). Set this to "string" for minimal runtimes.
+This parameter has no effect when `num_consistency_samples = 0`.

-custom_eval_criteria (list[dict[str, Any]], default = []): optionally specify custom evalution criteria beyond the built-in trustworthiness scoring.
-The expected input format is a list of dictionaries, where each dictionary has the following keys:
-- name: Name of the evaluation criteria.
-- criteria: Instructions specifying the evaluation criteria.
+num_candidate_responses (int, default = 1): how many alternative candidate responses are internally generated in `TLM.prompt()`.
+`TLM.prompt()` scores the trustworthiness of each candidate response, and then returns the most trustworthy one.
+You can auto-improve responses by increasing this parameter, but at higher runtimes/costs.
+This parameter must be between 1 and 20. It has no effect on `TLM.score()`.
+When this parameter is 1, `TLM.prompt()` simply returns a standard LLM response and does not attempt to auto-improve it.
+This parameter has no effect when `disable_trustworthiness` is True.

-use_self_reflection (bool, default = `True`): deprecated. Use `num_self_reflections` instead.
+disable_trustworthiness (bool, default = False): if True, trustworthiness scoring is disabled and TLM will not compute trust scores for responses.
+This is useful when you only want to use custom evaluation criteria or when you want to minimize computational overhead and only need the base LLM response.
+The following parameters will be ignored when `disable_trustworthiness` is True: `num_consistency_samples`, `num_self_reflections`, `num_candidate_responses`, `reasoning_effort`, `similarity_measure`.
 """

 quality_preset: Literal["best", "high", "medium", "low", "base"]

@@ -110,6 +116,8 @@ class TlmPromptParams(TypedDict, total=False):
 class Options(TypedDict, total=False):
     custom_eval_criteria: Iterable[object]

+    disable_trustworthiness: bool
+
     log: List[str]

     max_tokens: int
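The same options docstring is updated here for TLM prompting, with the reordering putting the runtime-oriented knobs (`reasoning_effort`, `num_self_reflections`, `num_consistency_samples`, `similarity_measure`) ahead of the rest. A hedged sketch of a latency-oriented configuration follows; it is not part of this commit, the import path only mirrors src/codex/types/tlm_prompt_params.py, and the keys are taken from the docstring rather than from the truncated Options hunk above, so verify them against the generated SDK before relying on them.

# Hypothetical sketch: a low-latency Options payload for TLM prompting.
from codex.types.tlm_prompt_params import Options  # import path assumed from this diff

fast_options: Options = {
    "model": "gpt-4.1-nano",         # recommended above for low latency/costs
    "reasoning_effort": "none",      # fewest thinking tokens in internal LLM calls
    "num_self_reflections": 1,       # fewer reflections lower runtime (max is 3)
    "num_consistency_samples": 0,    # skip consistency sampling entirely
    "similarity_measure": "string",  # cheapest measure; ignored anyway when samples = 0
}

Per the docstring, explicit options like these apply regardless of the chosen `quality_preset`, so they override whatever that preset would otherwise set.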

0 commit comments
