Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dependency Updates, Documentation Enhancements, New Features, and Code Refactoring #250

Open
wants to merge 73 commits into
base: version3.0
Choose a base branch
from

Conversation

mxsrm
Copy link
Contributor

@mxsrm mxsrm commented Dec 22, 2024

So, I have the first big truck of changes for you. I mainly did formatting (those are separate commits) and clippy changes, improving alot of things efficiency-wise. I also fixed a MRSV of minimum supported Rust version, dictated to us by our dependencies anyway.

I compressed the helper/crypt.rs part and pulled away all the static values we currently do not allow to be changed from user side anyway. Makes the code alot more readable.


Chore

  • update zip dependency to version 2.2.2

  • add categories to Cargo.toml

  • allow multiple versions of thiserror
    This fixes the following clippy warning:

    multiple versions for dependency thiserror: 1.0.69, 2.0.8
    for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#multiple_crate_versions

    thiserror and thiserror-impl are transitive dependencies:

    cargo tree --invert --package thiserror-impl@1.0.69
    
    thiserror-impl v1.0.69 (proc-macro)
    └── thiserror v1.0.69
        ├── html_parser v0.7.0
        │   └── umya-spreadsheet v2.2.0 (/home/mxsrm/code/umya-spreadsheet)
        └── rav1e v0.7.1
            └── ravif v0.11.11
                └── image v0.25.5
                    └── umya-spreadsheet v2.2.0 (/home/mxsrm/code/umya-spreadsheet)
    
    cargo tree --invert --package thiserror-impl@2.0.8
    
    thiserror-impl v2.0.8 (proc-macro)
    └── thiserror v2.0.8
        ├── pest v2.7.15
        │   ├── html_parser v0.7.0
        │   │   └── umya-spreadsheet v2.2.0 (/home/mxsrm/code/umya-spreadsheet)
        │   ├── pest_derive v2.7.15 (proc-macro)
        │   │   └── html_parser v0.7.0 (*)
        │   ├── pest_generator v2.7.15
        │   │   └── pest_derive v2.7.15 (proc-macro) (*)
        │   └── pest_meta v2.7.15
        │       └── pest_generator v2.7.15 (*)
        └── zip v2.2.2
            └── umya-spreadsheet v2.2.0 (/home/mxsrm/code/umya-spreadsheet)
  • remove unused js feature from Cargo.toml
    This commit removes the unused js feature from the [features]
    section in Cargo.toml. The js feature is no longer utilized in
    the project, and its removal helps to maintain a cleaner configuration.
    Previously, this feature led to the inclusion of the js feature from
    the getrandom dependency, which was removed in a prior commit.

    • Deleted the js feature entry from Cargo.toml.
  • set minimum supported Rust version to 1.79.0
    This commit sets the minimum supported Rust version in Cargo.toml
    to 1.79.0, as determined by the cargo-msrv tool. This version is
    dictated by the bitstream-io dependency, which is a transitive
    dependency of the image crate. Setting the minimum version ensures
    compatibility with the required features and functionality.

    • Added rust-version = "1.79.0" to Cargo.toml.
  • move lint configurations to lib.rs
    This commit removes the lint configurations from Cargo.toml and
    transfers them to lib.rs using attribute macros. This change
    centralizes lint settings within the codebase, making it easier to
    manage and understand the linting rules applied to the project.

    • Removed lint settings from Cargo.toml.
    • Added lint attributes in lib.rs for better backwards compatibility.
  • sort dependencies in Cargo.toml
    This commit sorts the dependencies in Cargo.toml for improved
    readability and organization. Sorting helps maintain a consistent
    format and makes it easier to locate specific dependencies.

Documentation

  • enhance documentation for encryption-related functions
    • Add detailed parameter descriptions and return values for all public functions
    • Improve documentation formatting and clarity for encryption methods
    • Add specific documentation for AES-256-CBC implementation details
    • Include clippy allow attributes for possible truncation warnings
    • Document HMAC and IV generation processes
    • Add comprehensive documentation for password-to-key conversion logic

New Features

  • enhance Excel timestamp conversion with additional calendar systems
    • add num-traits dependency for numeric casts
    • introduce DEFAULT_TIMEZONE constant for consistency
    • implement excel_to_date_time_object for converting Excel timestamps to NaiveDateTime
    • add convert_date_windows_1900 and convert_date_mac_1904 for specific calendar systems
    • enhance convert_date_crate to support both Windows 1900 and Mac 1904 systems
    • include detailed documentation for new functions with examples and panic behavior

Refactor

  • update is_address function for regex initialization

    • Refactored the is_address function to use OnceLock for lazy initialization of the regex pattern, improving performance by compiling the regex only once.
    • Enhanced documentation for is_address to clarify its parameters, return values, and potential panics.
    • Removed the previous implementation of is_address to streamline the code.
    • Added examples in the documentation to demonstrate usage.
  • remove unnecessary qualification

  • clippy lints

  • simplify make_buffer function and improve worksheet processing

    • Reorganized the make_buffer function for better readability and maintainability.
    • Replaced manual iteration with try_for_each for processing worksheets.
    • Consolidated worksheet processing logic to handle both deserialized and raw data more cleanly.
    • Improved comments for clarity on each processing step.
    • Streamlined the addition of various worksheet components (charts, drawings, comments, etc.).
    • Ensured proper handling of printer settings and other worksheet relationships.
  • remove leading underscores from method names and allow dead_code at crate level

    • Rename private methods across multiple modules by removing leading underscores to adhere to Rust naming conventions.
    • Additionally, add #![allow(dead_code)] at the crate level to silence dead code warnings.
  • centralize constants into a dedicated module
    Extract multiple static variables into the constants module, eliminating the need to pass them as parameters in function calls. This refactor simplifies function signatures, centralizes configuration values, and enhances maintainability. Future updates may revert this approach to accommodate support for additional algorithms as the crate evolves.

  • abstract random byte generation with macro and enhance RNG security

    • Introduced the generate_random_bytes! macro to eliminate repetitive code for generating random bytes.
    • Replaced rand::thread_rng with rand::rngs::OsRng to utilize a cryptographically secure random number generator.
    • Updated cryptographic key constants to use fixed-size arrays instead of slices for improved type safety.
    • Enhanced error handling in random byte generation by adding expect messages.
    • Removed unused imports and commented-out code to clean up the codebase.
    • Improved consistency in hash function implementations and key management.

    This refactoring improves code maintainability, readability, and security by centralizing random byte generation and ensuring the use of secure RNG sources.

  • replace is_none_or with match statements
    This commit refactors the code to replace the use of the is_none_or
    method with match statements. This change improves compatibility with
    a lower minimum supported Rust version, since is_none_or is only
    available in stable since Rust 1.82.0.

    • Updated is_support method in TwoCellAnchor to use match.
    • Refactored _has_vertical and has_horizontal methods in MergeCells
      to use match for start and end row/column checks.
  • replace hex decoding/encoding with hex-literal
    This commit simplifies the test code by replacing custom hex decoding
    and encoding functions with the hex-literal crate. This change
    improves readability and reduces boilerplate code in the tests.

    • Removed decode_hex and encode_hex helper functions.
    • Updated test cases to use the hex! macro for byte literals.
  • replace OsRng with thread_rng for random byte generation
    This commit refactors the random byte generation in the gen_random_bytes function by replacing the use of OsRng with thread_rng. This change simplifies the code and improves performance by using the thread-local random number generator.

    • Updated the gen_random_bytes function to utilize thread_rng() for filling the byte vector.
    • Removed the OsRng import as it is no longer needed.
  • replace ahash with std::hash::DefaultHasher
    This commit removes the ahash dependency and replaces its usage with the standard library's DefaultHasher. This change simplifies the dependency tree and reduces external dependencies while maintaining the same hashing functionality.

    • Removed the ahash crate from Cargo.toml.
    • Updated the SharedStringItem struct to use std::hash::DefaultHasher instead of AHasher from ahash.
  • rename cell_collection to cells, row_dimensions to rows, and column_dimensions to columns

    • Renamed cell_collection to cells for consistency and clarity.
    • Renamed row_dimensions to rows for improved readability.
    • Renamed column_dimensions to columns for better understanding.
    • Updated all method calls and references accordingly.
  • replace ThinVec with rust-lang Vec

  • replace lazy_static with built-in std::sync::OnceLock

  • remove lazy_static, add comments

  • fix typos, clippy lints

  • reimplement and cleanup helper/crypt.rs

  • replace getrandom with rand for random byte generation

    • Replaced getrandom crate with rand to simplify and modernize random number generation.
    • Updated gen_random_* functions to use rand::thread_rng().fill() for generating random bytes of specified lengths.
    • Adjusted encryption functions to use gen_random_bytes for generating random values.

Style

  • fix many clippy lints
  • fix formatting and improve check_sheet_name logic
    • Corrected the documentation comment for get_sheet_by_name to fix the syntax error.
    • Refactored the check_sheet_name method to use a more concise if-else structure for clarity.
    • Improved readability of the code while maintaining the same functionality.
  • add modern rustfmt.toml and apply it
  • enable more lints

Commit Statistics

  • 33 commits contributed to the release.
  • 11 days passed between releases.
  • 31 commits were understood as conventional.
  • 1 unique issue was worked on: #249

Commit Details

view details
  • #249
  • Uncategorized
    • Fix many clippy lints (e60fa40)
    • Fix formatting and improve check_sheet_name logic (6b3c774)
    • Update zip dependency to version 2.2.2 (ea1d8a2)
    • Update is_address function for regex initialization (4eab5d7)
    • Remove unnecessary qualification (505d766)
    • Add categories to Cargo.toml (37d1139)
    • Allow multiple versions of thiserror (de9f38f)
    • Clippy lints (8d41bc1)
    • Enhance documentation for encryption-related functions (f8d9659)
    • Simplify make_buffer function and improve worksheet processing (72faf4f)
    • Remove leading underscores from method names and allow dead_code at crate level (4dcea0e)
    • Centralize constants into a dedicated module (26e7021)
    • Add modern rustfmt.toml and apply it (98fcf7f)
    • Abstract random byte generation with macro and enhance RNG security (b1916f7)
    • Enhance Excel timestamp conversion with additional calendar systems (92a1af9)
    • Remove unused js feature from Cargo.toml (995f40d)
    • Set minimum supported Rust version to 1.79.0 (f76be7a)
    • Replace is_none_or with match statements (9511418)
    • Move lint configurations to lib.rs (6b2c88e)
    • Sort dependencies in Cargo.toml (517cf8c)
    • Replace hex decoding/encoding with hex-literal (739446e)
    • Replace OsRng with thread_rng for random byte generation (b65b035)
    • Replace ahash with std::hash::DefaultHasher (12926d6)
    • Rename cell_collection to cells, row_dimensions to rows, and column_dimensions to columns (cb80120)
    • Replace ThinVec with rust-lang Vec (714bf07)
    • Replace lazy_static with built-in std::sync::OnceLock (aaaec5f)
    • Remove lazy_static, add comments (f3de769)
    • Fix typos, clippy lints (71f0572)
    • Reimplement and cleanup helper/crypt.rs (30e3e9a)
    • Replace getrandom with rand for random byte generation (a317cce)
    • Enable more lints (48b864b)
    • Feature Request - Set Worksheet Gridlines #248 (5e20000)

mxsrm added 30 commits December 21, 2024 17:36
…ration

- Replaced `getrandom` crate with `rand` to simplify and modernize random number generation.
- Updated `gen_random_*` functions to use `rand::thread_rng().fill()` for generating random bytes of specified lengths.
- Adjusted encryption functions to use `gen_random_bytes` for generating random values.
…to rows, and column_dimensions to columns

- Renamed `cell_collection` to `cells` for consistency and clarity.
- Renamed `row_dimensions` to `rows` for improved readability.
- Renamed `column_dimensions` to `columns` for better understanding.
- Updated all method calls and references accordingly.
This commit removes the `ahash` dependency and replaces its usage with the standard library's `DefaultHasher`. This change simplifies the dependency tree and reduces external dependencies while maintaining the same hashing functionality.

- Removed the `ahash` crate from `Cargo.toml`.
- Updated the `SharedStringItem` struct to use `std::hash::DefaultHasher` instead of `AHasher` from `ahash`.
…tion

This commit refactors the random byte generation in the `gen_random_bytes` function by replacing the use of `OsRng` with `thread_rng`. This change simplifies the code and improves performance by using the thread-local random number generator.

- Updated the `gen_random_bytes` function to utilize `thread_rng()` for filling the byte vector.
- Removed the `OsRng` import as it is no longer needed.
This commit simplifies the test code by replacing custom hex decoding
and encoding functions with the `hex-literal` crate. This change
improves readability and reduces boilerplate code in the tests.

- Removed `decode_hex` and `encode_hex` helper functions.
- Updated test cases to use the `hex!` macro for byte literals.
This commit sorts the dependencies in `Cargo.toml` for improved
readability and organization. Sorting helps maintain a consistent
format and makes it easier to locate specific dependencies.
This commit removes the lint configurations from `Cargo.toml` and
transfers them to `lib.rs` using attribute macros. This change
centralizes lint settings within the codebase, making it easier to
manage and understand the linting rules applied to the project.

- Removed lint settings from `Cargo.toml`.
- Added lint attributes in `lib.rs` for better backwards compatibility.
This commit refactors the code to replace the use of the `is_none_or`
method with match statements. This change improves compatibility with
a lower minimum supported Rust version, since `is_none_or` is only
available in stable since Rust 1.82.0.

- Updated `is_support` method in `TwoCellAnchor` to use match.
- Refactored `_has_vertical` and `has_horizontal` methods in `MergeCells`
  to use match for start and end row/column checks.
This commit sets the minimum supported Rust version in `Cargo.toml`
to 1.79.0, as determined by the `cargo-msrv` tool. This version is
dictated by the `bitstream-io` dependency, which is a transitive
dependency of the `image` crate. Setting the minimum version ensures
compatibility with the required features and functionality.

- Added `rust-version = "1.79.0"` to `Cargo.toml`.
This commit removes the unused `js` feature from the `[features]`
section in `Cargo.toml`. The `js` feature is no longer utilized in
the project, and its removal helps to maintain a cleaner configuration.
Previously, this feature led to the inclusion of the `js` feature from
the `getrandom` dependency, which was removed in a prior commit.

- Deleted the `js` feature entry from `Cargo.toml`.
… calendar systems

* add `num-traits` dependency for numeric casts
* introduce `DEFAULT_TIMEZONE` constant for consistency
* implement `excel_to_date_time_object` for converting Excel timestamps to `NaiveDateTime`
* add `convert_date_windows_1900` and `convert_date_mac_1904` for specific calendar systems
* enhance `convert_date_crate` to support both Windows 1900 and Mac 1904 systems
* include detailed documentation for new functions with examples and panic behavior
…d enhance RNG security

- Introduced the `generate_random_bytes!` macro to eliminate repetitive code for generating random bytes.
- Replaced `rand::thread_rng` with `rand::rngs::OsRng` to utilize a cryptographically secure random number generator.
- Updated cryptographic key constants to use fixed-size arrays instead of slices for improved type safety.
- Enhanced error handling in random byte generation by adding `expect` messages.
- Removed unused imports and commented-out code to clean up the codebase.
- Improved consistency in hash function implementations and key management.

This refactoring improves code maintainability, readability, and security by centralizing random byte generation and ensuring the use of secure RNG sources.
Extract multiple static variables into the `constants` module, eliminating the need to pass them as parameters in function calls. This refactor simplifies function signatures, centralizes configuration values, and enhances maintainability. Future updates may revert this approach to accommodate support for additional algorithms as the crate evolves.
…_code at crate level

- Rename private methods across multiple modules by removing leading underscores to adhere to Rust naming conventions.
- Additionally, add `#![allow(dead_code)]` at the crate level to silence dead code warnings.
…sheet processing

- Reorganized the `make_buffer` function for better readability and maintainability.
- Replaced manual iteration with `try_for_each` for processing worksheets.
- Consolidated worksheet processing logic to handle both deserialized and raw data more cleanly.
- Improved comments for clarity on each processing step.
- Streamlined the addition of various worksheet components (charts, drawings, comments, etc.).
- Ensured proper handling of printer settings and other worksheet relationships.
…tions

* Add detailed parameter descriptions and return values for all public functions
* Improve documentation formatting and clarity for encryption methods
* Add specific documentation for AES-256-CBC implementation details
* Include clippy allow attributes for possible truncation warnings
* Document HMAC and IV generation processes
* Add comprehensive documentation for password-to-key conversion logic
This fixes the following `clippy` warning:
> multiple versions for dependency `thiserror`: 1.0.69, 2.0.8
> for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#multiple_crate_versions

`thiserror` and `thiserror-impl` are transitive dependencies:

```sh
cargo tree --invert --package thiserror-impl@1.0.69

thiserror-impl v1.0.69 (proc-macro)
└── thiserror v1.0.69
    ├── html_parser v0.7.0
    │   └── umya-spreadsheet v2.2.0 (/home/mxsrm/code/umya-spreadsheet)
    └── rav1e v0.7.1
        └── ravif v0.11.11
            └── image v0.25.5
                └── umya-spreadsheet v2.2.0 (/home/mxsrm/code/umya-spreadsheet)

cargo tree --invert --package thiserror-impl@2.0.8

thiserror-impl v2.0.8 (proc-macro)
└── thiserror v2.0.8
    ├── pest v2.7.15
    │   ├── html_parser v0.7.0
    │   │   └── umya-spreadsheet v2.2.0 (/home/mxsrm/code/umya-spreadsheet)
    │   ├── pest_derive v2.7.15 (proc-macro)
    │   │   └── html_parser v0.7.0 (*)
    │   ├── pest_generator v2.7.15
    │   │   └── pest_derive v2.7.15 (proc-macro) (*)
    │   └── pest_meta v2.7.15
    │       └── pest_generator v2.7.15 (*)
    └── zip v2.2.2
        └── umya-spreadsheet v2.2.0 (/home/mxsrm/code/umya-spreadsheet)
```
…lization

- Refactored the `is_address` function to use `OnceLock` for lazy initialization of the regex pattern, improving performance by compiling the regex only once.
- Enhanced documentation for `is_address` to clarify its parameters, return values, and potential panics.
- Removed the previous implementation of `is_address` to streamline the code.
- Added examples in the documentation to demonstrate usage.
- Corrected the documentation comment for `get_sheet_by_name` to fix the syntax error.
- Refactored the `check_sheet_name` method to use a more concise if-else structure for clarity.
- Improved readability of the code while maintaining the same functionality.
@schungx
Copy link
Contributor

schungx commented Dec 26, 2024

I'm happy to reimplement it if @MathNya thinks it useful. Or implement smallvec instead.

SmallVec solves a different problem: avoiding allocations. It does not save memory. A SmallVec can be larger than a Vec depending on your config.

You'd need to bemchmark to check if using SmallVec is worth it as it stores small number of items inline to avoid allocations, at the expense of larger memory footprint.

- Added `assert_sha256` macro for asserting SHA-256 hash equivalence.
- Added `print_sha256_hex` mmacro for easy printing of SHA-256 hashes.
- Move test helper macros to `helper/utils`.
- Updated the `crypt` module tests to use the new `assert_sha256` macro.
- Adjusted linting configurations in `lib.rs` to allow unused macros.
- Replaces the use of `once_cell::sync::OnceCell` with `std::sync::OnceLock` in `compile_regex` macro to improve compatibility and adhere to Rust's standard library practices.
- Removes now unused dependency on `once_cell`.

This change ensures that the codebase uses the latest and most idiomatic way to handle lazy initialization of static variables, enhancing both performance and safety.
@schungx
Copy link
Contributor

schungx commented Dec 27, 2024

A bit more details on the two different crates.

SmallVec I suggest we use on transitory purposes, for example preparing a list for processing inside a function. The list is destroyed after use and never goes outside of the function. Using SmallVec this way keeps everything on the stack for direct access and cache locality, and avoids all allocations if the number of items is often small. I have benchmarked a lot of such usages and it usually yields 10-15% speed improvements alone.

ThinVec should be used where every byte counts. If using a Vec does not affect a type's size (e.g. when another variant is larger) or the Vec is usually empty (where you can use Option<Box<Vec>>), then there is no need. I tend to use it in cases where having a Vec automatically makes a LOT of types larger, for example in a fundamental type which is embedded everywhere inside a lot of other data structures.

My actual suggestion is to look into the API itself and remove any mutable direct access to Vec's. For example, all those direct get_xxx_mut(&mut self) -> &mut Vec<T> should be replaced by some indirect write API. In that case, if there is no more need for the outside program to know what type is inside, you can then change everything to Box<[T]> which is two words and saves 8 bytes from a Vec since its length is immutable.

Update the library documentation to be more informative and
user-friendly.

The changes include:

- Improved descriptions and example code snippets for common
  use cases like reading/writing files, adding sheets,
  changing cell values, styling and inserting/removing rows/
  columns.
- Improved wording to enhance clarity and correct minor
  grammatical issues.
- Added a more detailed description of the lazy reader.
- Improved code example explanations.
- Added doc comment links to relevant modules/structs
  (e.g. [Style](crate::structs::style)).
This commit introduces a new macro `pub_mod_use!` to streamline module declarations and exports in `structs.rs`. The macro replaces individual module declarations with a single macro call, reducing code duplication and improving readability:

- Modules are now declared and exported using the `pub_mod_use!` macro.
- Each module is made private within its own scope and then re-exported with the desired visibility.
- The change reduces the number of lines significantly, from 494 to 170, enhancing maintainability.

This refactoring will make it easier to manage the visibility of modules and their contents in future updates.
@mxsrm
Copy link
Contributor Author

mxsrm commented Dec 28, 2024

If you want to substantially change the API you would end in a near-total re-write of the whole crate. If you do that, you could implement many things more efficiently, e.g. you could abstract the whole Unit types into a single generic one using Cow<'_, str>, e.g.:

use std::{
    borrow::Cow,
    fmt::Display,
    str::FromStr,
};

#[derive(Clone, Default, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct Value<T> {
    value: Option<T>,
}

impl<T> Value<T> {
    pub(crate) fn get_value(&self) -> Option<&T> {
        self.value.as_ref()
    }

    pub(crate) fn set_value(&mut self, value: T) -> &mut Self {
        self.value = Some(value);
        self
    }

    pub(crate) fn remove_value(&mut self) -> &mut Self {
        self.value = None;
        self
    }

    pub(crate) fn has_value(&self) -> bool {
        self.value.is_some()
    }
}

impl<T: Display + Default + Clone + FromStr> Value<T> {
    pub(crate) fn set_value_string<S: Into<String>>(&mut self, value: S) -> &mut Self {
        match value.into().parse::<T>() {
            Ok(parsed) => self.set_value(parsed),
            Err(_) => self.remove_value(), // Or handle the error differently, e.g., log it
        }
    }

    pub(crate) fn get_value_or_default(&self) -> T {
        self.value.clone().unwrap_or_default()
    }
}

macro_rules! create_and_export_ValueType {
    ($t:ty, $i:ident) => {
        pub type $i = Value<$t>;

        impl $i {
            pub(crate) fn get_value_string(&self) -> Cow<'a, str> {
                self.value
                    .as_ref()
                    .map_or_else(|| Cow::Borrowed(""), |v| Cow::Owned(v.to_string()))
            }
        }
    };
}

create_and_export_ValueType!(bool, BoolValue);
create_and_export_ValueType!(u8, ByteValue);
create_and_export_ValueType!(i8, SignedByteValue);
create_and_export_ValueType!(f64, DoubleValue);
create_and_export_ValueType!(i16, I16Value);
create_and_export_ValueType!(i32, I32Value);
create_and_export_ValueType!(i64, I64Value);
create_and_export_ValueType!(u16, U16Value);
create_and_export_ValueType!(u32, U32Value);
create_and_export_ValueType!(u64, U64Value);

pub type StringValue<'a> = Value<Cow<'a, str>>;
impl<'a> StringValue<'a> {
    pub(crate) fn get_value_string(&self) -> Cow<'a, str> {
        self.value.clone().unwrap()
    }
}

Or you could ditch the whole Value<T> idea and work using Traits that you implement directly onto Option<T>. I started on that adventure once or twice, but you just need to change too much to do it incrementally. Many functions want &str from the get_value_string method, others expect String etc.

This commit introduces the `AttrCollection` type alias, which is a vector of `AttrPair` structs.
`AttrPair` holds a key-value pair for XML attributes, where the value is a `Cow<str>`.
This allows for more efficient handling of string attributes, avoiding unnecessary allocations.

The commit also updates all structs to use `AttrCollection` instead of `Vec<(&str, &str)>` for XML attributes.
This change improves code readability and maintainability.

Additionally, the `AttrPair` struct implements `From` traits for various types, making it easier to create attribute pairs.
… update writer

This commit introduces a `From` implementation for `AttrPair` to convert it into a tuple of `(&str, Cow<str>)`. This allows for seamless conversion of `AttrPair` instances into a format that `quick_xml` can handle.

Additionally, the commit updates the `write_start_tag` function in `src/writer/driver.rs` to use the new `From` implementation. This change simplifies the code and makes it more efficient by avoiding manual tuple creation.
@schungx
Copy link
Contributor

schungx commented Dec 29, 2024

Agree. Not recommended to change the current API too much as it'll be too much rewrites for existing code.

- Changed the `argb` field in the `Color` struct from `Option<String>` to `Option<ARGB8>`.
- Added helper methods to convert between hex strings and ARGB8.
- Updated methods to handle the new ARGB8 type, including `set_argb`, `get_argb`, and `set_attributes`.
- Modified constants to use ARGB8 for predefined colors.
- Enhanced the `get_argb_with_theme` method to return ARGB8 values correctly.
- Added tests for hex conversion and color setting functionality.
- Modified the `Border` struct to use `Option<Box<Color>>` for the `color` field.
- Updated methods to handle the optional color, including `get_color`, `set_color`, and `set_attributes`.
- Adjusted the `get_hash_code` method to handle the optional color.
- Updated the `write_to` method to handle the optional color.
- Improved consistency in color handling within the `Border` struct.

This change drastically reduce the amount of heap allocations when cloning Worksheets.
@mxsrm
Copy link
Contributor Author

mxsrm commented Dec 29, 2024

In terms of memory efficiency (@schungx) I did some profiling with valgrind and the memory consumption upon cloning a Worksheet is huge because of the way Color, Border and Borders are represented. This is v2.2.1:

master_border
master_borders
master_color

This is the memory layout after re-implementing the structs, using the rgb crate as basis to represent colors:

pr_border
pr_borders
pr_color

I personally use umya to parse a plan from work, iterate over it, and color cells that contain specific names (to create personalized versions for many people at once).

Here massif data from Valgrind:

Master (v2.2.1)

image

This PR @ 89016c7

image

This PR @ 3c26246

Screenshot_20241229_183419


We could probably do alot more optimization, but this change was minimal-invasive and did not require changing alot of internal APIs.


@schungx
If you look at the first and last screenshot you see that removing thin_vec and using Vec did not matter much in Terms of Heap allocations. 122,6MB+23,9MB and some change @ thinvec versus 122.6MB and some change @ std_vec.

@mxsrm
Copy link
Contributor Author

mxsrm commented Dec 29, 2024

It is interesting to see what happens if you remove all the #[inline] instructions and let the compiler decide for himself (otherwise same code as this PR @ 3c26246).

No #[inline], lto = false

image

No #[inline], lto = true

image

So letting the compiler decide for itself and using lto if possible in the final application is better than force-inlining that many functions for speed and memory-allocations. I will push a commit to this PR to remove all #[inline] instructions.

The compiler knows best which functions to inline. If the end user wants
optimization, they should enable `lto = true` in their Cargo.toml.

See MathNya#250 (comment)
Comment on lines +482 to +483
// To pass an integration test where the hex string is "#333".
// https://github.com/MathNya/umya-spreadsheet/pull/113
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this should be valid behaviour? Is #333 really a valid color? I added padding so the previous integration test added in PR #113 still passes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the code.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mxsrm
Hmmm, this seems to be a mistake for #333333.

@MathNya
Copy link
Owner

MathNya commented Dec 31, 2024

@schungx
@mxsrm
Thank you for your support through the end of the year.
I am not very knowledgeable about performance enhancement and have left it to you.
I would be grateful if you could continue to support us if you would like.
I look forward to working with you again next year.

@c-git
Copy link

c-git commented Dec 31, 2024

I don't know if it's valid there but in CSS it's allowed. #333 is short for #333333

As per: https://www.w3schools.com/css/css_colors_hex.asp

3 Digit HEX Value

Sometimes you will see a 3-digit hex code in the CSS source.
The 3-digit hex code is a shorthand for some 6-digit hex codes.
The 3-digit hex code has the following form:
#rgb
Where r, g, and b represent the red, green, and blue components with values between 0 and f.

@schungx
Copy link
Contributor

schungx commented Dec 31, 2024

It is interesting to see what happens if you remove all the #[inline] instructions and let the compiler decide for himself

This is wrong. The compiler does NOT decide. It only decides within the same crate. Without #[inline], the compiler will NEVER inline across crate boundaries. You have to turn on LTO for those to inline.

I will push a commit to this PR to remove all #[inline] instructions.

I would strongly not recommend that. My own benchmarks have #[inline] substantially speeding up the linking phase when LTO is turned on, otherwise LLVM needs to inline tons of simple accessor functions where they would have been inlined by Rust much earlier on.

If LTO is turned off, any function without #[inline] will never be inlined, and there are a LOT of simple accessor functions. It makes things much slower. Try timing a tight loop iterating over thousands of cells.

Therefore, the standard for libraries is to be liberal with #[inline]. Look at the standard library -- it is practically littered with #[inline].

#[inline] is never for memory, it is intended for speed. Without #[inline], my release build links in 4 hours with LTO, while with #[inline] it is less than 1 hour. Without LTO, the program runs much more slowly. It is a no-win situation.

Check out this: https://matklad.github.io/2021/07/09/inline-in-rust.html

Fourth, when building libraries, proactively add #[inline] to small non-generic functions. Pay special attention to impls: Deref, AsRef and the like often benefit from inlining. A library can’t anticipate all usages upfront, it makes sense to not prematurely pessimize future users. Note that #[inline] is not transitive: if a trivial public function calls a trivial private function, you need to #[inline] both. See this benchmark for details.

Support three-digit hex color codes with and without leading hash.
This change addresses issues where short hex codes were not properly
expanded, ensuring compatibility with CSS standards and improving
user experience in color specification.

- Expand three-digit hex codes by repeating each digit.
- Add integration tests for both `#RGB` and `RGB` formats.
- Update existing tests to reflect new behavior.

See MathNya#250 (comment)
@mxsrm
Copy link
Contributor Author

mxsrm commented Dec 31, 2024

I don't know if it's valid there but in CSS it's allowed. #333 is short for #333333

As per: https://www.w3schools.com/css/css_colors_hex.asp

3 Digit HEX Value

Sometimes you will see a 3-digit hex code in the CSS source.
The 3-digit hex code is a shorthand for some 6-digit hex codes.
The 3-digit hex code has the following form:
#rgb
Where r, g, and b represent the red, green, and blue components with values between 0 and f.

I implented it in that way. See a496cbe.

@mxsrm
Copy link
Contributor Author

mxsrm commented Dec 31, 2024

@schungx @mxsrm Thank you for your support through the end of the year. I am not very knowledgeable about performance enhancement and have left it to you.

Not my area of expertise either. In daily life I am an interventional neuroradiologist that enjoys coding in his free time. @schungx is more knowledgable in that regard I think.

- Changed the file paths in `integration_test.rs` to reflect
  the actual names of the test result files for better clarity
  and consistency in testing procedures.
- Updated paths from 'bbb_new_sheet_value.xlsx' to
  'three_digit_hex_color_with_hash.xlsx' and
  'three_digit_hex_color_without_hash.xlsx' for the
  respective integration tests.
- Changed the return type of `get_sheet_collection_mut` from
  `&mut Vec<Worksheet>` to `&mut [Worksheet]` to improve type safety
  and flexibility in how the sheets can be manipulated.
- This adjustment ensures that users of the API cannot directly resize
  the vector, which aligns with the intended use of this method for
  accessing existing sheets only.
@schungx
Copy link
Contributor

schungx commented Jan 1, 2025

Not my area of expertise either. In daily life I am an interventional neuroradiologist that enjoys coding in his free time. @schungx is more knowledgable in that regard I think.

Just different experiences that's all.

I happen to be maintaining an LOB application running on limitrd hardware, so memory and CPU count a lot. For most applications I'd agree there may not be a big impact, unless we're dealing with massive spreadsheets with tens of thousands of rows, like some users who complained in issues.

In my own system, the standard handler times out after 30 seconds of processing and it'll be hell trying to get around it. I can get 2000 rows within that time limit on an LTO release build. So it keeps choking on spreadsheets with 2500 rows.

After the memory shrink it runs much faster probably due to better cache locality and it shows. I can now process 5000 rows without exceeding the limit, that makes it work just fine for my case.

So yeah, you have to be there to know something is amiss.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants