Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PY] fix: Optimizing the to_string Function #2107

Merged
merged 3 commits into from
Oct 14, 2024

Conversation

jamiesun
Copy link
Contributor

@jamiesun jamiesun commented Oct 11, 2024

The to_string function has been optimized to add logic for handling string types and null values.

Linked issues

closes: #2065

Details

If non-English characters are used in the conversation, the entire conversation history is saved as an escaped character, causing confusion for the AI model

Change details

Describe your changes, with screenshots and code snippets as appropriate

code snippets:

def to_string(tokenizer: Tokenizer, value: Any, as_json: bool = False) -> str:
    """
    Converts a value to a string representation.
    Dates are converted to ISO strings and Objects are converted to JSON or YAML,
    whichever is shorter.

    Args:
        tokenizer (Tokenizer): The tokenizer object used for encoding.
        value (Any): The value to be converted.
        as_json (bool, optional): Flag indicating whether to return the value as JSON string.
          Defaults to False.

    Returns:
        str: The string representation of the value.
    """
    if value is None:
        return ""
    
    if hasattr(value, "isoformat") and callable(value.isoformat):
        # Used when the value is a datetime object
        return value.isoformat()
    value = todict(value)

    if as_json:
        return json.dumps(value, default=lambda o: o.__dict__, ensure_ascii=False)

    # Return shorter version of object
    yaml_str = yaml.dump(value, allow_unicode=True)
    json_str = json.dumps(value, default=lambda o: o.__dict__, ensure_ascii=False)
    if len(tokenizer.encode(yaml_str)) < len(tokenizer.encode(json_str)):
        return yaml_str

    return json_str

screenshots:

image

Attestation Checklist

  • My code follows the style guidelines of this project

  • I have checked for/fixed spelling, linting, and other errors

  • I have commented my code for clarity

  • I have made corresponding changes to the documentation (updating the doc strings in the code is sufficient)

  • My changes generate no new warnings

  • I have added tests that validates my changes, and provides sufficient test coverage. I have tested with:

    • Local testing
    • E2E testing in Teams
  • New and existing unit tests pass locally with my changes

The to_string function has been optimized to add logic for handling string types and null values.
@jamiesun
Copy link
Contributor Author

@microsoft-github-policy-service agree company="Terateams"

@corinagum corinagum changed the title refactor: Optimizing the to_string Function [JS] fix: Optimizing the to_string Function Oct 11, 2024
@corinagum corinagum changed the title [JS] fix: Optimizing the to_string Function [PY] fix: Optimizing the to_string Function Oct 11, 2024
@corinagum
Copy link
Collaborator

@jamiesun thank you for the contribution! We super appreciate it. Do you happen to know if this same issue occurs in JS / C# as well, or just Python? If they do, I'll create some new tickets to make sure we are tracking for all three scenarios.

lilyydu
lilyydu previously approved these changes Oct 11, 2024
Copy link
Contributor

@lilyydu lilyydu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks for adding this!

@lilyydu
Copy link
Contributor

lilyydu commented Oct 11, 2024

@jamiesun It looks like there's a linting issue- can you rerun and fix accordingly? Thanks!

@lilyydu lilyydu self-requested a review October 11, 2024 21:46
@jamiesun
Copy link
Contributor Author

@jamiesun It looks like there's a linting issue- can you rerun and fix accordingly? Thanks!

Thanks for the reminder, I've reworked the commit code

@lilyydu lilyydu merged commit d4bf9cc into microsoft:main Oct 14, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Chinese is stored as a unicode escape character and sent to AI, resulting in confusing replies
3 participants