Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: exception events are inconsistently placed in the callback trace map #9070

Closed
axiomofjoy opened this issue Nov 22, 2023 · 10 comments · Fixed by #9112
Closed

[Bug]: exception events are inconsistently placed in the callback trace map #9070

axiomofjoy opened this issue Nov 22, 2023 · 10 comments · Fixed by #9112
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@axiomofjoy
Copy link
Contributor

axiomofjoy commented Nov 22, 2023

Bug Description

LlamaIndex's callback system passes exception events that are distinct from the events from which the error arose. Callback implementers must infer from which callback event the exception arose from the trace_map passed to end_trace. However, the position of the exception event in the trace map relative to the event from which the exception arose is not consistent. Most of the time, the exception event is a sibling of the event from which it arose, but when the exception arises from a node post-processor like the Cohere re-ranker, it is a child. This makes it difficult for callback implementations to consistently attribute the exception to the correct event.

Ideally, exception information would always be attributable to an easily identifiable event in the trace map. There are several options:

  • Exception events are siblings of the event causing the error (how it currently works most of the time)
  • Exception events are children of the event causing the error (how it works for node post-processors)
  • Exceptions are not distinct callback events. Instead, exception information is passed to the on_event_end hook of the event in which the exception occurred (this might be the easiest to handle from the perspective of the callback implementation)

Version

latest main (a31d3c8)

Steps to Reproduce

This script shows how exception events can be siblings of the events causing the exception. It mocks a error from the OpenAI chat completions API and produces this output:

└── query
    ├── retrieve
    │   └── embedding
    └── synthesize
        ├── templating
        ├── llm
        ├── exception
        ├── exception
        └── exception

This script shows how exception events can be children of the event causing the exception. It mocks a CohereAPIError on the reranker client and produces the following output:

└── query
    ├── retrieve
    │   └── embedding
    └── reranking
        ├── exception
        ├── exception
        └── exception

Relevant Logs/Tracbacks

No response

@axiomofjoy axiomofjoy added bug Something isn't working triage Issue needs to be triaged/prioritized labels Nov 22, 2023
Copy link

dosubot bot commented Nov 22, 2023

🤖

Hello @axiomofjoy, it's good to see you again! I hope you're having a great day. Let's take a look at this together.

Thank you for bringing this issue to our attention. The inconsistency in the placement of exception events in the callback trace map is indeed a problem that needs to be addressed.

To make the placement of exception events consistent, regardless of whether the exception arises from a regular event or a node post-processor like the Cohere re-ranker, we can modify the event context manager in the CallbackManager class to include the event_id of the current event when an exception is raised.

Currently, when an exception is raised, a new event of type CBEventType.EXCEPTION is started, but no event_id is provided. This means a new event_id is generated in the on_event_start method, which could lead to inconsistencies in the trace map.

To fix this, we can modify the event context manager to pass the event_id of the current event to the on_event_start method when an exception is raised. Here's how you can do it:

@contextmanager
def event(
    self,
    event_type: CBEventType,
    payload: Optional[Dict[str, Any]] = None,
    event_id: Optional[str] = None,
) -> Generator["EventContext", None, None]:
    """Context manager for lanching and shutdown of events.

    Handles sending on_evnt_start and on_event_end to handlers for specified event.

    Usage:
        with callback_manager.event(CBEventType.QUERY, payload={key, val}) as event:
            ...
            event.on_end(payload={key, val})  # optional
    """
    # create event context wrapper
    event = EventContext(self, event_type, event_id=event_id)
    event.on_start(payload=payload)

    try:
        yield event
    except Exception as e:
        self.on_event_start(
            CBEventType.EXCEPTION, payload={EventPayload.EXCEPTION: e}, event_id=event_id
        )
        raise
    finally:
        # ensure event is ended
        if not event.finished:
            event.on_end()

This way, when an exception is raised, the CBEventType.EXCEPTION event will be linked to the current event in the trace map, ensuring consistent placement of exception events.

As for your question about how the trace_map is generated in the LlamaIndex codebase, the trace_map is not directly generated in the LlamaIndex codebase. However, the as_trace method from the callback_manager in the service_context is used in several places in the code. This method is likely responsible for generating or updating a trace_map or similar structure.

I hope this helps! If you have any further questions or need more clarification, please let me know.

Sources

@axiomofjoy
Copy link
Contributor Author

Related to this ticket: Arize-ai/phoenix#1618

@logan-markewich
Copy link
Collaborator

I like the idea of attaching to the on-end of the actual event 🤔 It would also solve the issue of duplicate error events

@logan-markewich
Copy link
Collaborator

Ok been exploring a few fixes -- I think the root issue here is that at some point we specified that LLM events (and a few others) are LEAF_EVNTS -- meaning they can't have children, hence, the odd sibling vs. child issue.

@logan-markewich
Copy link
Collaborator

Ok, decided to attach the error to the event.

With a slight change to the example scripts to check the payload of events, the traces look like this

"sibling" script check

└── query
    ├── retrieve
    │   └── embedding
    └── synthesize (error)
        ├── templating
        └── llm

"child" script check

└── query
    ├── retrieve
    │   └── embedding
    └── reranking (error)

@logan-markewich
Copy link
Collaborator

If an error happens outside of an event, but still within a trace, it will get logged as an actual exception event as before (But this should be extremely rare I think)

@axiomofjoy
Copy link
Contributor Author

Awesome, thanks for the quick fix @logan-markewich!

@axiomofjoy
Copy link
Contributor Author

Ok, decided to attach the error to the event.

With a slight change to the example scripts to check the payload of events, the traces look like this

"sibling" script check

└── query
    ├── retrieve
    │   └── embedding
    └── synthesize (error)
        ├── templating
        └── llm

"child" script check

└── query
    ├── retrieve
    │   └── embedding
    └── reranking (error)

@logan-markewich Thanks for the quick fix on this! Exception events arising from the Cohere re-ranker are now attached to the re-ranker spans! See the image below:

Screenshot 2023-11-27 at 10 05 52 PM

For a rate-limit error, I think the exception information should be associated with the LLM callback event rather than the synthesis event. Is it possible to change to this behavior?

In the example script, the output would look like:

└── query
    ├── retrieve
    │   └── embedding
    └── synthesize
        ├── templating
        └── llm (exception)

@logan-markewich
Copy link
Collaborator

Oof. It's not easy to fix/change that I think 😅

@axiomofjoy
Copy link
Contributor Author

Oof. It's not easy to fix/change that I think 😅

Got it, no worries @logan-markewich. It's a small issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants