refactor: Speed up function `_serialize_dataframe` by 123% in PR #6044 (`refactor-serialization`) #6078

codeflash-ai · 2025-02-03T12:10:00Z

⚡️ This pull request contains optimizations for PR #6044

If you approve this dependent PR, these changes will be merged into the original PR branch refactor-serialization.

This PR will be automatically closed if the original PR is merged.

📄 123% (2.23x) speedup for `_serialize_dataframe` in `src/backend/base/langflow/serialization/serialization.py`

⏱️ Runtime : 23.9 milliseconds → 10.7 milliseconds (best of 141 runs)
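
As a sanity check on the figures above, the reported percentage follows from the two runtimes if the gain is taken as the time saved divided by the optimized time. The sketch below assumes that definition; the helper names `speedup_percent` and `runtime_ratio` are illustrative and are not part of Codeflash or Langflow.

```python
def speedup_percent(original_ms: float, optimized_ms: float) -> float:
    """Percentage gain, defined here as (original - optimized) / optimized."""
    return (original_ms - optimized_ms) / optimized_ms * 100.0


def runtime_ratio(original_ms: float, optimized_ms: float) -> float:
    """How many times faster the optimized version runs."""
    return original_ms / optimized_ms


# Figures reported for this PR.
print(round(speedup_percent(23.9, 10.7)))   # 123
print(round(runtime_ratio(23.9, 10.7), 2))  # 2.23
```

Under this definition a 123% gain corresponds to a runtime ratio of about 2.23, meaning the optimized function takes a little under half the original time.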

📝 Explanation and details

This change makes `_serialize_dataframe` more efficient. The primary optimization removes the redundant `.apply()` call and truncates values directly on the converted records, which is faster.

Changes Made.

1. Removed redundant `apply` calls: the original code used nested `apply` calls, which can be very slow on larger DataFrames. The new implementation first converts the DataFrame to a list of dictionaries and then truncates values only where needed.
2. Optimized truncation logic: truncation is applied directly while iterating over each dictionary produced by the conversion. This reduces overhead and improves readability.

These changes should enhance the runtime performance of the serialization process, especially for larger DataFrames.
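
The diff itself is not reproduced on this page, so the following is only a sketch of the two styles described above, not the actual Langflow code. The names `truncate_value`, `serialize_with_apply`, `truncate_records`, and `serialize_with_records` are hypothetical, and the truncation rule is plain slicing, modeled on the helper shown in the generated tests. The functions only call DataFrame methods (`head`, `apply`, `to_dict`), so the block needs no import of its own.

```python
def truncate_value(value, max_length):
    """Illustrative helper: shorten strings that exceed max_length."""
    if max_length is not None and isinstance(value, str) and len(value) > max_length:
        return value[:max_length]
    return value


def serialize_with_apply(df, max_length, max_items):
    """Slower style (assumed): an inner Series.apply runs inside DataFrame.apply."""
    if max_items is not None and len(df) > max_items:
        df = df.head(max_items)
    truncated = df.apply(lambda col: col.apply(lambda v: truncate_value(v, max_length)))
    return truncated.to_dict(orient="records")


def truncate_records(records, max_length):
    """Faster style (assumed): a plain Python loop over already-converted records."""
    if max_length is None:
        # Nothing to truncate, so skip the per-value pass entirely.
        return records
    return [
        {key: truncate_value(value, max_length) for key, value in record.items()}
        for record in records
    ]


def serialize_with_records(df, max_length, max_items):
    """Convert to a list of dicts first, then truncate values if needed."""
    if max_items is not None and len(df) > max_items:
        df = df.head(max_items)
    return truncate_records(df.to_dict(orient="records"), max_length)
```

With pandas installed, `serialize_with_records(pd.DataFrame({'A': [1, 2], 'B': ['abcdef', 'gh']}), 3, None)` returns `[{'A': 1, 'B': 'abc'}, {'A': 2, 'B': 'gh'}]`. The gain in this style comes from doing the per-value work in ordinary Python after a single `to_dict` call instead of dispatching through pandas for every element.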

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 38 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | undefined |

🌀 Generated Regression Tests Details

import numpy as np
import pandas as pd
# imports
import pytest  # used for our unit tests
from langflow.serialization.serialization import _serialize_dataframe


# function to test
def _truncate_value(value, max_length, max_items):
    """Helper function to truncate values based on max_length."""
    if max_length is not None and isinstance(value, str) and len(value) > max_length:
        return value[:max_length]
    return value

# unit tests

def test_basic_functionality_no_limits():
    """Test simple DataFrame without truncation or row limits."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'abc'}, {'A': 2, 'B': 'def'}]

def test_basic_functionality_max_items():
    """Test DataFrame with max_items specified."""
    df = pd.DataFrame({'A': [1, 2, 3], 'B': ['abc', 'def', 'ghi']})
    codeflash_output = _serialize_dataframe(df, None, 2)
    expected = [{'A': 1, 'B': 'abc'}, {'A': 2, 'B': 'def'}]

def test_basic_functionality_max_length():
    """Test DataFrame with max_length specified."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abcdef', 'ghijkl']})
    codeflash_output = _serialize_dataframe(df, 3, None)
    expected = [{'A': 1, 'B': 'abc'}, {'A': 2, 'B': 'ghi'}]

def test_empty_dataframe():
    """Test empty DataFrame."""
    df = pd.DataFrame()
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = []

def test_single_row_dataframe():
    """Test single row DataFrame."""
    df = pd.DataFrame({'A': [1], 'B': ['abc']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'abc'}]

def test_single_column_dataframe():
    """Test single column DataFrame."""
    df = pd.DataFrame({'A': [1, 2, 3]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1}, {'A': 2}, {'A': 3}]

def test_mixed_data_types():
    """Test DataFrame with mixed data types."""
    df = pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.5], 'C': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 3.0, 'C': 'abc'}, {'A': 2, 'B': 4.5, 'C': 'def'}]

def test_none_values():
    """Test DataFrame with None values."""
    df = pd.DataFrame({'A': [1, None], 'B': ['abc', None]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'abc'}, {'A': None, 'B': None}]

def test_nan_values():
    """Test DataFrame with NaN values."""
    df = pd.DataFrame({'A': [1, np.nan], 'B': ['abc', np.nan]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'abc'}, {'A': np.nan, 'B': np.nan}]

def test_special_characters():
    """Test DataFrame with special characters in strings."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['a\nb\tc', 'd\te\nf']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'a\nb\tc'}, {'A': 2, 'B': 'd\te\nf'}]

def test_large_dataframe():
    """Test large DataFrame for performance and scalability."""
    df = pd.DataFrame({'A': range(1000), 'B': ['x' * 1000] * 1000})
    codeflash_output = _serialize_dataframe(df, None, None)

def test_large_string_values():
    """Test DataFrame with very long string values."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['x' * 1000, 'y' * 1000]})
    codeflash_output = _serialize_dataframe(df, 10, None)
    expected = [{'A': 1, 'B': 'xxxxxxxxxx'}, {'A': 2, 'B': 'yyyyyyyyyy'}]

def test_boundary_max_length_zero():
    """Test boundary value for max_length=0."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, 0, None)
    expected = [{'A': 1, 'B': ''}, {'A': 2, 'B': ''}]

def test_boundary_max_items_zero():
    """Test boundary value for max_items=0."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, None, 0)
    expected = []

def test_non_dataframe_input():
    """Test non-DataFrame input."""
    with pytest.raises(AttributeError):
        _serialize_dataframe([1, 2, 3], None, None)

def test_negative_max_length():
    """Test negative value for max_length."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, -1, None)
    expected = [{'A': 1, 'B': 'abc'}, {'A': 2, 'B': 'def'}]

def test_negative_max_items():
    """Test negative value for max_items."""
    df = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'def']})
    codeflash_output = _serialize_dataframe(df, None, -1)
    expected = [{'A': 1, 'B': 'abc'}, {'A': 2, 'B': 'def'}]

def test_multiindex_dataframe():
    """Test DataFrame with MultiIndex."""
    index = pd.MultiIndex.from_tuples([('a', 1), ('b', 2)], names=['first', 'second'])
    df = pd.DataFrame({'A': [1, 2]}, index=index)
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'first': 'a', 'second': 1, 'A': 1}, {'first': 'b', 'second': 2, 'A': 2}]

def test_datetime_dataframe():
    """Test DataFrame with datetime objects."""
    df = pd.DataFrame({'A': [pd.Timestamp('20230101'), pd.Timestamp('20230102')]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': pd.Timestamp('20230101')}, {'A': pd.Timestamp('20230102')}]
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import pandas as pd
# imports
import pytest  # used for our unit tests
from langflow.serialization.serialization import _serialize_dataframe


# function to test
def _truncate_value(value, max_length, max_items):
    if isinstance(value, str) and max_length is not None:
        return value[:max_length]
    return value

# unit tests

# Basic Functionality
def test_serialize_basic():
    df = pd.DataFrame({'A': [1, 2], 'B': ['foo', 'bar']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'foo'}, {'A': 2, 'B': 'bar'}]

def test_serialize_max_items():
    df = pd.DataFrame({'A': list(range(10)), 'B': ['foo']*10})
    codeflash_output = _serialize_dataframe(df, None, 5)
    expected = [{'A': i, 'B': 'foo'} for i in range(5)]

def test_serialize_max_length():
    df = pd.DataFrame({'A': [1, 2], 'B': ['foobarbaz', 'quxquux']})
    codeflash_output = _serialize_dataframe(df, 3, None)
    expected = [{'A': 1, 'B': 'foo'}, {'A': 2, 'B': 'qux'}]

# Edge Cases
def test_serialize_empty_dataframe():
    df = pd.DataFrame()
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = []

def test_serialize_single_row():
    df = pd.DataFrame({'A': [1], 'B': ['foo']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'foo'}]

def test_serialize_single_column():
    df = pd.DataFrame({'A': [1, 2, 3]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1}, {'A': 2}, {'A': 3}]

def test_serialize_mixed_data_types():
    df = pd.DataFrame({'A': [1, 2.5, 'three'], 'B': [None, 'foo', 3.14]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': None}, {'A': 2.5, 'B': 'foo'}, {'A': 'three', 'B': 3.14}]

# Large DataFrames
def test_serialize_large_dataframe():
    df = pd.DataFrame({'A': list(range(1000)), 'B': ['foo']*1000})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': i, 'B': 'foo'} for i in range(1000)]

def test_serialize_large_string_values():
    df = pd.DataFrame({'A': [1, 2], 'B': ['a'*1000, 'b'*1000]})
    codeflash_output = _serialize_dataframe(df, 10, None)
    expected = [{'A': 1, 'B': 'a'*10}, {'A': 2, 'B': 'b'*10}]

# Special Characters and Unicode
def test_serialize_special_characters():
    df = pd.DataFrame({'A': [1, 2], 'B': ['@foo@', '#bar#']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': '@foo@'}, {'A': 2, 'B': '#bar#'}]

def test_serialize_unicode_characters():
    df = pd.DataFrame({'A': [1, 2], 'B': ['😊', '🚀']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': '😊'}, {'A': 2, 'B': '🚀'}]

# Null and NaN Values
def test_serialize_nan_values():
    df = pd.DataFrame({'A': [1, 2, float('nan')], 'B': [float('nan'), 'foo', 'bar']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': float('nan')}, {'A': 2, 'B': 'foo'}, {'A': float('nan'), 'B': 'bar'}]

def test_serialize_none_values():
    df = pd.DataFrame({'A': [1, 2, None], 'B': [None, 'foo', 'bar']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': None}, {'A': 2, 'B': 'foo'}, {'A': None, 'B': 'bar'}]

# Boundary Conditions
def test_serialize_exact_max_items():
    df = pd.DataFrame({'A': list(range(5)), 'B': ['foo']*5})
    codeflash_output = _serialize_dataframe(df, None, 5)
    expected = [{'A': i, 'B': 'foo'} for i in range(5)]

def test_serialize_exact_max_length():
    df = pd.DataFrame({'A': [1, 2], 'B': ['foo', 'bar']})
    codeflash_output = _serialize_dataframe(df, 3, None)
    expected = [{'A': 1, 'B': 'foo'}, {'A': 2, 'B': 'bar'}]

# Nested DataFrames
def test_serialize_nested_values():
    df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], {'key': 'value'}]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': [1, 2]}, {'A': 2, 'B': {'key': 'value'}}]

# DataFrame with Custom Index
def test_serialize_custom_index():
    df = pd.DataFrame({'A': [1, 2], 'B': ['foo', 'bar']}, index=['row1', 'row2'])
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'foo'}, {'A': 2, 'B': 'bar'}]

# DataFrame with Duplicates
def test_serialize_duplicate_rows():
    df = pd.DataFrame({'A': [1, 1], 'B': ['foo', 'foo']})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': 'foo'}, {'A': 1, 'B': 'foo'}]

# DataFrame with Date and Time
def test_serialize_datetime_values():
    df = pd.DataFrame({'A': [1, 2], 'B': [pd.Timestamp('2023-01-01'), pd.Timestamp('2023-01-02')]})
    codeflash_output = _serialize_dataframe(df, None, None)
    expected = [{'A': 1, 'B': pd.Timestamp('2023-01-01')}, {'A': 2, 'B': pd.Timestamp('2023-01-02')}]
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
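
The comment above says the tests exist to compare the original and optimized implementations rather than to assert against `expected`. How Codeflash performs that comparison is not shown here; the sketch below is one way such a check could look, with hypothetical helpers `values_match` and `same_behavior`. It treats two NaN floats as equal, since several of the tests put NaN in the DataFrame and `nan == nan` is false in Python.

```python
import math


def values_match(a, b):
    """Equality that treats two NaN floats as equal and recurses into lists and dicts."""
    if isinstance(a, float) and isinstance(b, float) and math.isnan(a) and math.isnan(b):
        return True
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(values_match(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(values_match(x, y) for x, y in zip(a, b))
    return a == b


def same_behavior(original, optimized, args):
    """Run both callables on the same arguments; compare results or exception types."""
    def run(fn):
        try:
            return ("ok", fn(*args))
        except Exception as exc:
            return ("error", type(exc))

    kind_a, out_a = run(original)
    kind_b, out_b = run(optimized)
    if kind_a != kind_b:
        return False
    if kind_a == "error":
        return out_a is out_b
    return values_match(out_a, out_b)
```

For example, `same_behavior(old_impl, new_impl, (df, 3, None))` would be true when both implementations return matching records, or when both raise the same exception type, as in the non-DataFrame input test.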


…(`refactor-serialization`) (#6078)

* feat: Implement serialization functions for various data types and add a unified serialize method
* feat: Enhance serialization by adding support for primitive types, enums, and generic types
* fix: Update Pinecone integration to use VectorStore and handle import errors gracefully
* test: Add hypothesis-based tests for serialization functions across various data types
* refactor: Replace custom serialization logic with unified serialize function for consistency and maintainability
* refactor: Replace recursive serialization function with unified serialize method for improved clarity and maintainability
* refactor: Replace custom serialization logic with unified serialize function for improved consistency and clarity
* refactor: Enhance serialization logic by adding instance handling and streamlining type checks
* refactor: Remove custom dictionary serialization from ResultDataResponse for streamlined handling
* refactor: Enhance serialization in ResultDataResponse by adding max_items_length for improved handling of outputs, logs, messages, and artifacts
* refactor: Move MAX_ITEMS_LENGTH and MAX_TEXT_LENGTH constants to serialization module for better organization
* refactor: Simplify message serialization in Log model by utilizing unified serialize function
* refactor: Remove unnecessary pytest marker from TestSerializationHypothesis class
* optimize _serialize_bytes
  Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
* feat: Add support for numpy integer type serialization
* feat: Enhance serialization with support for pandas and numpy types
* test: Add comprehensive serialization tests for numpy and pandas types
* fix: Update _serialize_dispatcher to return string representation for unsupported types
* fix: Update _serialize_dispatcher to return the object directly instead of its string representation
* optmize conditional
  Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
* optimize length check
  Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
* fix: Update string and list truncation to include ellipsis for clarity
* ⚡️ Speed up function `_serialize_dataframe` by 123% in PR #6044 (`refactor-serialization`)

  Certainly! Here's a more efficient version of the given program. The primary optimization performed here is removing the redundant `.apply()` call and directly truncating values in a more performant way.

  ### Changes Made.

  1. **Removed redundant `apply` calls**: In the original code, there were nested `apply` calls which can be very slow on larger DataFrames. The new implementation converts the DataFrame to a list of dictionaries first and then truncates the values if needed.
  2. **Optimized truncation logic**: Applied truncation directly while iterating over the dictionary after conversion from a DataFrame. This reduces overhead and improves readability.

  These changes should enhance the runtime performance of the serialization process, especially for larger DataFrames.

---------

Co-authored-by: Gabriel Luiz Freitas Almeida <gabriel@langflow.org>
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>

ogabrielluiz and others added 23 commits January 31, 2025 12:03

feat: Implement serialization functions for various data types and add a unified serialize method
7eda663

feat: Enhance serialization by adding support for primitive types, enums, and generic types
1045c7a

fix: Update Pinecone integration to use VectorStore and handle import errors gracefully
42bd591

test: Add hypothesis-based tests for serialization functions across various data types
6cb6983

refactor: Replace custom serialization logic with unified serialize function for consistency and maintainability
9767766

refactor: Replace recursive serialization function with unified serialize method for improved clarity and maintainability
b20e0f9

refactor: Replace custom serialization logic with unified serialize function for improved consistency and clarity
9acdbfc

refactor: Enhance serialization logic by adding instance handling and streamlining type checks
bbb5286

refactor: Remove custom dictionary serialization from ResultDataResponse for streamlined handling
4b563dc

refactor: Enhance serialization in ResultDataResponse by adding max_items_length for improved handling of outputs, logs, messages, and artifacts
d8182dc

refactor: Move MAX_ITEMS_LENGTH and MAX_TEXT_LENGTH constants to serialization module for better organization
a898ab1

refactor: Simplify message serialization in Log model by utilizing unified serialize function
1a0116d

refactor: Remove unnecessary pytest marker from TestSerializationHypothesis class
ba01a10

optimize _serialize_bytes
01676d5
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>

feat: Add support for numpy integer type serialization
7ebb304

feat: Enhance serialization with support for pandas and numpy types
f16110b

test: Add comprehensive serialization tests for numpy and pandas types
550bec4

fix: Update _serialize_dispatcher to return string representation for unsupported types
573130c

fix: Update _serialize_dispatcher to return the object directly instead of its string representation
bcfe6f9

optmize conditional
9b51778
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>

optimize length check
824ae5f
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>

fix: Update string and list truncation to include ellipsis for clarity
59ad780

codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Feb 3, 2025

codeflash-ai bot mentioned this pull request Feb 3, 2025

refactor: Implement unified serialization function #6044

Merged

dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. enhancement New feature or request labels Feb 3, 2025

Base automatically changed from refactor-serialization to main February 3, 2025 15:22

dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Feb 3, 2025

Merge branch 'main' into codeflash/optimize-pr6044-2025-02-03T12.09.54

1c1c997

dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Feb 3, 2025

ogabrielluiz approved these changes Feb 3, 2025

View reviewed changes

dosubot bot added the lgtm This PR has been approved by a maintainer label Feb 3, 2025

ogabrielluiz enabled auto-merge February 3, 2025 15:35

ogabrielluiz changed the title ~~⚡️ Speed up function _serialize_dataframe by 123% in PR #6044 (refactor-serialization)~~ refactor: Speed up function _serialize_dataframe by 123% in PR #6044 (refactor-serialization) Feb 3, 2025

ogabrielluiz added this pull request to the merge queue Feb 3, 2025

Merged via the queue into main with commit d676aef Feb 3, 2025
28 of 36 checks passed

ogabrielluiz deleted the codeflash/optimize-pr6044-2025-02-03T12.09.54 branch February 3, 2025 16:03
