Refactor py #12

Open · wants to merge 64 commits into dev
Conversation

@ypriverol (Member) commented Sep 25, 2024

PR Type

enhancement, tests


Description

  • Implemented new classes and methods for feature selection, including univariate, multivariate, and machine learning approaches.
  • Added comprehensive tests for the new feature selection methods and data handling classes.
  • Updated utility functions and constants to support new feature selection functionalities.
  • Introduced example scripts for data conversion and pipeline execution.
  • Updated package configuration and dependencies to reflect the new project scope.
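For orientation, the snippet below is a minimal end-to-end sketch of the new API, assembled from the tests added in this PR; the import paths are assumptions based on the file layout in the walkthrough below.

import pandas as pd

from fslite.fs.fdataframe import FSDataFrame
from fslite.fs.univariate import FSUnivariate
from fslite.utils.datasets import get_tnbc_data_path

# Load the bundled TNBC example dataset (44 samples x 500 features).
df = pd.read_csv(get_tnbc_data_path(), sep="\t")
fs_df = FSDataFrame(df=df, sample_col="Sample", label_col="label")

# Univariate ANOVA filter keeping the top 80th percentile of features.
fs_univariate = FSUnivariate(
    fs_method="anova", selection_mode="percentile", selection_threshold=0.8
)
fs_df_filtered = fs_univariate.select_features(fs_df)

assert fs_df.count_features() == 500
df_filtered = fs_df_filtered.to_pandas()  # export the selection as a pandas DataFrame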

Changes walkthrough 📝

Relevant files

Enhancement (9 files)

ml.py (fslite/fs/ml.py): Add ML feature selection and cross-validation classes
  • Introduced FSMLMethod class for ML feature selection.
  • Added MLCVModel class for ML model creation with cross-validation.
  • Implemented feature selection and evaluation methods.
  +399/-0

fdataframe.py (fslite/fs/fdataframe.py): Implement FSDataFrame class for feature selection
  • Introduced FSDataFrame class for feature selection.
  • Implemented methods for handling sparse and dense matrices.
  • Added feature scaling and selection methods.
  +311/-0

multivariate.py (fslite/fs/multivariate.py): Add multivariate feature selection methods
  • Added FSMultivariate class for multivariate feature selection.
  • Implemented correlation and variance-based selection methods.
  +219/-0

univariate.py (fslite/fs/univariate.py): Implement univariate feature selection methods
  • Introduced FSUnivariate class for univariate feature selection.
  • Implemented various univariate selection methods.
  +217/-0

constants.py (fslite/fs/constants.py): Define constants and utilities for feature selection
  • Defined constants for feature selection methods.
  • Added utility functions for method validation.
  +156/-0

loom2parquetchunks.py (examples/loom2parquetchunks.py): Add script for loom to parquet conversion
  • Added script to convert loom files to parquet chunks.
  • Implemented metadata handling and chunk processing (a combined sketch of the two conversion scripts follows this walkthrough).
  +118/-0

utils.py (fslite/fs/utils.py): Update utilities for feature selection
  • Updated utility functions for feature selection.
  • Replaced the pyspark imputer with sklearn's SimpleImputer.
  +33/-16

methods.py (fslite/fs/methods.py): Define abstract class for feature selection methods
  • Introduced abstract class FSMethod for feature selection.
  • Defined error classes for invalid methods and data.
  +78/-0

loom2parquetmerge.py (examples/loom2parquetmerge.py): Add script for merging parquet files
  • Added script to merge parquet files incrementally.
  • Implemented batch processing to manage memory usage.
  +62/-0

Tests (4 files)

test_univariate_methods.py (fslite/tests/test_univariate_methods.py): Add tests for univariate feature selection
  • Added tests for univariate feature selection methods.
  • Included tests for various selection techniques.
  +146/-0

test_fsdataframe.py (fslite/tests/test_fsdataframe.py): Add tests for FSDataFrame class
  • Added tests for FSDataFrame initialization and scaling.
  • Included memory usage tests for large datasets.
  +121/-0

test_multivariate_methods.py (fslite/tests/test_multivariate_methods.py): Add tests for multivariate feature selection
  • Added tests for multivariate feature selection methods.
  • Included tests for correlation and variance methods.
  +122/-0

generate_big_tests.py (fslite/tests/generate_big_tests.py): Add script to generate large test datasets
  • Added script to generate large test datasets.
  • Implemented chunk processing for memory efficiency.
  +56/-0

Documentation (1 file)

fs_pipeline_example.py (fslite/pipeline/fs_pipeline_example.py): Add example pipeline for feature selection
  • Added example pipeline for feature selection.
  • Demonstrated univariate, multivariate, and ML methods.
  +69/-0

Configuration changes (1 file)

setup.py: Update package setup and dependencies
  • Updated package name and dependencies.
  • Changed project references from fsspark to fslite.
  +13/-10
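Both example scripts follow the same chunked-write / incremental-merge pattern. The sketch below illustrates that pattern with loompy and pyarrow; the file names, chunk size, and column naming are hypothetical, not the scripts' actual code.

import loompy
import pyarrow as pa
import pyarrow.parquet as pq

CHUNK = 5000  # hypothetical number of cells per chunk

# Step 1: export a loom matrix (genes x cells) to parquet, one column chunk at a time.
with loompy.connect("input.loom") as ds:  # hypothetical input file
    n_genes, n_cells = ds.shape
    writer = None
    for start in range(0, n_cells, CHUNK):
        stop = min(start + CHUNK, n_cells)
        chunk = ds[:, start:stop].T  # cells x genes
        table = pa.table({f"g{i}": chunk[:, i] for i in range(n_genes)})  # placeholder column names
        if writer is None:
            writer = pq.ParquetWriter("chunks.parquet", table.schema)
        writer.write_table(table)
    writer.close()

# Step 2: merge part-files incrementally, reading record batches to bound memory.
parts = ["chunks.parquet"]  # hypothetical list of part-files
schema = pq.ParquetFile(parts[0]).schema_arrow
with pq.ParquetWriter("merged.parquet", schema) as merged:
    for path in parts:
        for batch in pq.ParquetFile(path).iter_batches(batch_size=10_000):
            merged.write_table(pa.Table.from_batches([batch]))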

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    ypriverol and others added 21 commits September 23, 2024 09:44
Co-authored-by: codiumai-pr-agent-pro[bot] <151058649+codiumai-pr-agent-pro[bot]@users.noreply.github.com>

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Performance Concern
    The FSDataFrame class uses a memory threshold to decide between sparse and dense matrix storage. This approach may not be optimal for all datasets and could lead to performance issues with very large datasets.

    Code Smell
    The MLCVModel class has a complex initialization process with many parameters. Consider refactoring to use a builder pattern or configuration object to simplify the interface.
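One possible shape for that refactor is a small configuration object passed as a single argument; the dataclass below is purely illustrative, and its parameter names are hypothetical rather than the PR's actual API.

from dataclasses import dataclass
from typing import Optional

@dataclass
class MLCVConfig:
    # Hypothetical grouping of MLCVModel's constructor parameters.
    estimator_name: str = "random_forest"
    cv_folds: int = 5
    scoring: str = "accuracy"
    grid_params: Optional[dict] = None
    random_state: int = 42

# The model would then accept one object instead of a long signature:
# model = MLCVModel(config=MLCVConfig(estimator_name="svm", cv_folds=10))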

    Potential Bug
    The memory usage test in test_memory_fsdataframe uses memory_usage which may not accurately measure the memory usage of the FSDataFrame class, especially for large datasets.
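If memory_usage proves unreliable, Python's tracemalloc is one alternative; it reports interpreter-tracked allocations, and NumPy has registered its array buffers with tracemalloc since 1.13. The helper below is a sketch, not the PR's test.

import tracemalloc

def peak_memory_mib(fn, *args, **kwargs):
    # Run fn and return (result, peak traced allocation in MiB).
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1024 ** 2

# Example: _, peak = peak_memory_mib(FSDataFrame, df=df, sample_col="Sample", label_col="label")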


    codiumai-pr-agent-pro bot commented Sep 25, 2024


    PR Code Suggestions ✨

    Enhancement
    Use parameterized tests to reduce code duplication in univariate method tests

    Consider using parameterized tests to reduce code duplication across the different
    test functions for univariate methods.

    fslite/tests/test_univariate_methods.py [8-58]

    -def test_univariate_filter_corr():
    +import pytest
    +
    +@pytest.mark.parametrize("fs_method, selection_mode, selection_threshold, expected_features", [
    +    ("u_corr", None, 0.3, 211),
    +    ("anova", "percentile", 0.8, 4),
    +    # Add more test cases for other univariate methods
    +])
    +def test_univariate_filter(fs_method, selection_mode, selection_threshold, expected_features):
         """
    -    Test univariate_filter method with 'u_corr' method.
    -    :return: None
    +    Test univariate_filter method with different parameters.
         """
    -
    -    # import tsv as pandas DataFrame
         df = pd.read_csv(get_tnbc_data_path(), sep="\t")
    -
    -    # create FSDataFrame instance
         fs_df = FSDataFrame(df=df, sample_col="Sample", label_col="label")
     
    -    # create FSUnivariate instance
    -    fs_univariate = FSUnivariate(fs_method="u_corr", selection_threshold=0.3)
    -
    -    fsdf_filtered = fs_univariate.select_features(fs_df)
    -
    -    assert fs_df.count_features() == 500
    -    assert fsdf_filtered.count_features() == 211
    -
    -    # Export the filtered DataFrame as Pandas DataFrame
    -    df_filtered = fsdf_filtered.to_pandas()
    -    df_filtered.to_csv("filtered_tnbc_data.csv", index=False)
    -
    -
    -# test the univariate_filter method with 'anova' method
    -def test_univariate_filter_anova():
    -    """
    -    Test univariate_filter method with 'anova' method.
    -    :return: None
    -    """
    -
    -    # import tsv as pandas DataFrame
    -    df = pd.read_csv(get_tnbc_data_path(), sep="\t")
    -
    -    # create FSDataFrame instance
    -    fs_df = FSDataFrame(df=df, sample_col="Sample", label_col="label")
    -
    -    # create FSUnivariate instance
         fs_univariate = FSUnivariate(
    -        fs_method="anova", selection_mode="percentile", selection_threshold=0.8
    +        fs_method=fs_method,
    +        selection_mode=selection_mode,
    +        selection_threshold=selection_threshold
         )
     
         fsdf_filtered = fs_univariate.select_features(fs_df)
     
         assert fs_df.count_features() == 500
    -    assert fsdf_filtered.count_features() == 4
    +    assert fsdf_filtered.count_features() == expected_features
     
    -    # Export the filtered DataFrame as Pandas DataFrame
         df_filtered = fsdf_filtered.to_pandas()
    -    df_filtered.to_csv("filtered_tnbc_data.csv", index=False)
    +    df_filtered.to_csv(f"filtered_tnbc_data_{fs_method}.csv", index=False)
     
    Suggestion importance[1-10]: 8

    Why: Parameterized tests significantly reduce code duplication and improve test maintainability, making this a valuable enhancement.

    Add a more informative docstring to the package's __init__.py file

    Consider adding a more descriptive comment or docstring to provide information about
    the package's purpose and functionality.

fslite/__init__.py [1-2]

    -# eam
    -# 18.07.22
    +"""
    +fslite: A Python package for memory-efficient, high-performance feature selection on big and small datasets.
     
    +Author: eam
    +Date: 18.07.22
    +"""
    +
    Suggestion importance[1-10]: 8

    Why: Adding a descriptive docstring improves code readability and provides context about the package's purpose, which is beneficial for users and maintainers.

    Use a more specific exception type for invalid scoring methods

    Consider using a more specific exception type instead of ValueError for invalid
    scoring methods. This can help in better error handling and debugging.

    fslite/fs/ml.py [99-102]

     if score_func not in score_func_mapping:
    -    raise ValueError(
    +    raise InvalidMethodError(
             f"Invalid score_func '{score_func}'. Valid options are: {list(score_func_mapping.keys())}"
         )
     
    Suggestion importance[1-10]: 7

    Why: Using a more specific exception type like InvalidMethodError can improve error handling and debugging, making it easier to identify the source of the error.

    Use an Enum class for feature selection methods to improve type safety and maintainability

    Consider using an Enum class for the feature selection methods instead of a nested
    dictionary. This would provide better type safety and make the code more
    maintainable.

    fslite/fs/constants.py [7-31]

    +from enum import Enum, auto
    +
    +class FSMethod(Enum):
    +    ANOVA = auto()
    +    U_CORR = auto()
    +    F_REGRESSION = auto()
    +    MUTUAL_INFO_REGRESSION = auto()
    +    MUTUAL_INFO_CLASSIFICATION = auto()
    +
     FS_METHODS = {
         "univariate": {
             "title": "Univariate Feature Selection",
             "description": "Univariate feature selection refers to the process of selecting the most relevant features for "
             "a machine learning model by evaluating each feature individually with respect to the target "
             "variable using univariate statistical tests. It simplifies the feature selection process by "
             "treating each feature independently and assessing its contribution to the predictive "
             "performance of the model.",
             "methods": [
                 {
    -                "name": "anova",
    +                "name": FSMethod.ANOVA,
                     "description": "Univariate ANOVA feature selection (f-classification)",
                 },
    -            {"name": "u_corr", "description": "Univariate Pearson's correlation"},
    -            {"name": "f_regression", "description": "Univariate f-regression"},
    +            {"name": FSMethod.U_CORR, "description": "Univariate Pearson's correlation"},
    +            {"name": FSMethod.F_REGRESSION, "description": "Univariate f-regression"},
                 {
    -                "name": "mutual_info_regression",
    +                "name": FSMethod.MUTUAL_INFO_REGRESSION,
                     "description": "Univariate mutual information regression",
                 },
                 {
    -                "name": "mutual_info_classification",
    +                "name": FSMethod.MUTUAL_INFO_CLASSIFICATION,
                     "description": "Univariate mutual information classification",
                 },
             ],
         },
     
    Suggestion importance[1-10]: 7

    Why: Using an Enum class can improve type safety and maintainability, but the current dictionary structure is functional and the improvement is not critical.

    Add import statements and function definitions to the io.py module

    Consider adding import statements or function definitions to make the io.py module
    more useful and functional.

    fslite/utils/io.py [1]

    +from typing import Union
    +import pandas as pd
    +import pyarrow as pa
    +from pyspark.sql import DataFrame, SparkSession
     
    +def import_table(file_path: str, sep: str = '\t', n_partitions: int = 5) -> DataFrame:
    +    # Implementation here
    +    pass
     
    +def import_table_as_psdf(file_path: str, sep: str = '\t', n_partitions: int = 5) -> pd.DataFrame:
    +    # Implementation here
    +    pass
    +
    Suggestion importance[1-10]: 7

    Why: The suggestion adds useful functionality to the module, making it more practical and complete, although the specific implementations are not provided.
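For reference, a pandas-only implementation of the first stub could look like the sketch below, assuming the Spark-era n_partitions argument is dropped along with pyspark; the actual loader in this PR may differ.

import pandas as pd

def import_table(file_path: str, sep: str = "\t") -> pd.DataFrame:
    # BGZF (.bgz) files are gzip-compatible, so force gzip for them;
    # otherwise let pandas infer compression from the file extension.
    compression = "gzip" if file_path.endswith(".bgz") else "infer"
    return pd.read_csv(file_path, sep=sep, compression=compression)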

    Use a custom exception class for invalid data errors instead of ValueError

    Consider using a more specific exception type for InvalidDataError instead of
    ValueError. This would make it easier to catch and handle specific errors related to
    invalid data.

    fslite/fs/methods.py [72-78]

    -class InvalidDataError(ValueError):
    +class InvalidDataError(Exception):
         """
    -    Error raised when an invalid feature selection method is used.
    +    Error raised when invalid data is provided for feature selection.
         """
     
         def __init__(self, message):
             super().__init__(f"Invalid data frame: {message}")
     
    Suggestion importance[1-10]: 6

    Why: Using a custom exception class improves error handling specificity, but the current use of ValueError is not incorrect.

    Add specific information about the datasets used in the experiments

    Consider adding more specific information about the datasets used in the
    experiments, such as their names, sizes, and characteristics.

    docs/EXPERIMENTS.md [1-4]

     ## Experiments and Benchmarks
     
     This document contains the experiments and benchmarks that were conducted to evaluate the performance of fslite. 
     The experiments were conducted on the following datasets:
     
    +1. Dataset A: [Brief description, size, characteristics]
    +2. Dataset B: [Brief description, size, characteristics]
    +3. Dataset C: [Brief description, size, characteristics]
    +
    +Each dataset was chosen to represent different scenarios and challenges in feature selection tasks.
    +
    Suggestion importance[1-10]: 6

    Why: Providing detailed information about the datasets enhances the documentation's usefulness and helps users understand the context and applicability of the experiments.

    Best practice
    Use numpy's isclose() function for more robust floating-point comparison when checking sparsity

    Consider using numpy's built-in np.isclose() function instead of a direct comparison
    when checking for sparsity. This can help avoid potential floating-point precision
    issues.

    fslite/fs/fdataframe.py [118-143]

    -if sparsity > sparse_threshold:
    +if np.isclose(sparsity, sparse_threshold, atol=1e-8) or sparsity > sparse_threshold:
         if dense_matrix_size < memory_threshold * available_memory:
             # Use dense matrix if enough memory is available
             logging.info(
                 f"Data is sparse (sparsity={sparsity:.2f}) but enough memory available. "
                 f"Using a dense matrix."
             )
             self.__matrix = numerical_df.to_numpy(dtype=np.float32)
             self.__is_sparse = False
         else:
             # Use sparse matrix due to memory constraints
             logging.info(
                 f"Data is sparse (sparsity={sparsity:.2f}), memory insufficient for dense matrix. "
                 f"Using a sparse matrix representation."
             )
             self.__matrix = sparse.csr_matrix(
                 numerical_df.to_numpy(dtype=np.float32)
             )
             self.__is_sparse = True
     else:
         # Use dense matrix since it's not sparse
         logging.info(
             f"Data is not sparse (sparsity={sparsity:.2f}), using a dense matrix."
         )
         self.__matrix = numerical_df.to_numpy(dtype=np.float32)
         self.__is_sparse = False
     
    Suggestion importance[1-10]: 8

    Why: This suggestion addresses potential floating-point precision issues, which can be crucial for ensuring correct behavior in numerical computations.

    8
    Use a constant for the testdata directory path to improve maintainability

    Consider using a constant for the 'testdata' directory path to avoid repetition and
    make it easier to update if the directory structure changes.

    fslite/utils/datasets.py [7-22]

    +TESTDATA_DIR = Path(__file__).parent.parent / "testdata"
    +
     def get_tnbc_data_path() -> str:
         """
         Return path to example dataset (TNBC) with 44 samples and 500 features.
    -
         """
    -    tnbc_path = Path(__file__).parent.parent / "testdata/TNBC.tsv.gz"
    -    return tnbc_path.__str__()
    -
    +    return str(TESTDATA_DIR / "TNBC.tsv.gz")
     
     def get_tnbc_data_missing_values_path() -> str:
         """
         Return path to example dataset (TNBC) with missing values.
    +    """
    +    return str(TESTDATA_DIR / "TNBC_missing.tsv")
     
    -    """
    -    tnbc_path = Path(__file__).parent.parent / "testdata/TNBC_missing.tsv"
    -    return tnbc_path.__str__()
    -
    Suggestion importance[1-10]: 7

    Why: Defining a constant for the directory path enhances maintainability and reduces the risk of errors if the path changes, but the current implementation is not problematic.

    Update code examples to use keyword arguments for clarity

Consider updating the code examples to use keyword arguments for better readability and
consistency with modern Python practices.

    docs/README.data.md [42-44]

    -sdf = import_table('data.tsv.bgz',
    +sdf = import_table(file_path='data.tsv.bgz',
                        sep='\t',
                        n_partitions=5)
     
    Suggestion importance[1-10]: 5

    Why: Using keyword arguments improves code readability and clarity, but the change is minor and does not significantly impact functionality.

    Maintainability
    Use a dictionary mapping for univariate methods to improve maintainability and extensibility

    Consider using a dictionary mapping for univariate methods instead of multiple
    if-elif statements. This can make the code more maintainable and easier to extend
    with new methods in the future.

    fslite/fs/univariate.py [145-162]

    -if univariate_method == "anova":
    -    selected_features = self.univariate_feature_selector(
    -        df, score_func="f_classif", **kwargs
    -    )
    -elif univariate_method == "f_regression":
    -    selected_features = self.univariate_feature_selector(
    -        df, score_func="f_regression", **kwargs
    -    )
    -elif univariate_method == "u_corr":
    -    selected_features = univariate_correlation_selector(df, **kwargs)
    -elif univariate_method == "mutual_info_classification":
    -    selected_features = self.univariate_feature_selector(
    -        df, score_func="mutual_info_classif", **kwargs
    -    )
    -elif univariate_method == "mutual_info_regression":
    -    selected_features = self.univariate_feature_selector(
    -        df, score_func="mutual_info_regression", **kwargs
    -    )
    +method_mapping = {
    +    "anova": lambda: self.univariate_feature_selector(df, score_func="f_classif", **kwargs),
    +    "f_regression": lambda: self.univariate_feature_selector(df, score_func="f_regression", **kwargs),
    +    "u_corr": lambda: univariate_correlation_selector(df, **kwargs),
    +    "mutual_info_classification": lambda: self.univariate_feature_selector(df, score_func="mutual_info_classif", **kwargs),
    +    "mutual_info_regression": lambda: self.univariate_feature_selector(df, score_func="mutual_info_regression", **kwargs),
    +}
     
    +selected_features = method_mapping.get(univariate_method, lambda: [])()
    +
    Suggestion importance[1-10]: 6

    Why: This change enhances code maintainability and readability by reducing the complexity of method selection, making it easier to add new methods in the future.

    Performance
    Consider using a more efficient correlation computation method for large datasets

    Consider using a more efficient method to compute the correlation matrix for large
    datasets. The current implementation might not scale well for very large feature
    sets.

    fslite/fs/multivariate.py [131-139]

     # Compute correlation matrix
     if corr_method == "pearson":
         corr_matrix = np.corrcoef(f_matrix, rowvar=False)
     elif corr_method == "spearman":
         corr_matrix, _ = spearmanr(f_matrix)
     else:
         raise ValueError(
             f"Unsupported correlation method '{corr_method}'. Use 'pearson' or 'spearman'."
         )
     
    +# For large datasets, consider using a more efficient method
    +# For example, you could use pandas' corr() method with a custom function
    +# that computes correlation for chunks of the data at a time
    +
    Suggestion importance[1-10]: 5

    Why: While the suggestion is valid for improving performance with large datasets, it lacks a concrete implementation, making it less immediately actionable.
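One concrete chunked variant for the Pearson case: standardize the columns once, then accumulate Z.T @ Z block by block so each matrix-product temporary stays small. The function below is a sketch, not part of this PR.

import numpy as np

def pearson_corr_blocked(f_matrix: np.ndarray, block: int = 256) -> np.ndarray:
    # Standardize columns with the population std so that
    # (Z.T @ Z) / n_samples reproduces np.corrcoef(f_matrix, rowvar=False).
    n_samples = f_matrix.shape[0]
    z = (f_matrix - f_matrix.mean(axis=0)) / f_matrix.std(axis=0)
    n_features = z.shape[1]
    corr = np.empty((n_features, n_features), dtype=np.float64)
    for start in range(0, n_features, block):
        stop = min(start + block, n_features)
        # Each iteration materializes only a (block x n_features) slab.
        corr[start:stop, :] = z[:, start:stop].T @ z / n_samples
    return corr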


    💡 Need additional feedback? Start a PR chat


    CI Failure Feedback 🧐

    Action: build-linux

    Failed stage: Test with pytest [❌]

    Relevant error logs:
    1:  ##[group]Operating System
    2:  Ubuntu
    ...
    
    324:  #
    325:  #     $ conda activate base
    326:  #
    327:  # To deactivate an active environment, use
    328:  #
    329:  #     $ conda deactivate
    330:  ##[group]Run conda install flake8
    331:  conda install flake8
    332:  # stop the build if there are Python syntax errors or undefined names
    333:  flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
    334:  # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
    ...
    
    428:  Executing transaction: ...working... done
    429:  ============================= test session starts ==============================
    430:  platform linux -- Python 3.10.0, pytest-7.4.4, pluggy-1.5.0
    431:  rootdir: /home/runner/work/fslite/fslite
    432:  collected 13 items
    433:  fslite/tests/test_fsdataframe.py ...                                     [ 23%]
    434:  fslite/tests/test_multivariate_methods.py ....                           [ 53%]
    435:  fslite/tests/test_univariate_methods.py .F....                           [100%]
    436:  =================================== FAILURES ===================================
    ...
    
    452:  def get_handle(
    453:  path_or_buf: FilePath | BaseBuffer,
    454:  mode: str,
    455:  *,
    456:  encoding: str | None = None,
    457:  compression: CompressionOptions | None = None,
    458:  memory_map: bool = False,
    459:  is_text: bool = True,
    460:  errors: str | None = None,
    ...
    
    478:  supported for compression modes 'gzip', 'bz2', 'zstd' and 'zip'.
    479:  .. versionchanged:: 1.4.0 Zstandard support.
    480:  memory_map : bool, default False
    481:  See parsers._parser_params for more information. Only used by read_csv.
    482:  is_text : bool, default True
    483:  Whether the type of the content passed to the file/buffer is string or
    484:  bytes. This is not the same as `"b" not in mode`. If a string content is
    485:  passed to a binary file/buffer, a wrapper is inserted.
    486:  errors : str, default 'strict'
    487:  Specifies how encoding and decoding errors are to be handled.
    488:  See the errors argument for :func:`open` for a full list
    489:  of options.
    490:  storage_options: StorageOptions = None
    491:  Passed to _get_filepath_or_buffer
    492:  Returns the dataclass IOHandles
    493:  """
    494:  # Windows does not default to utf-8. Set to utf-8 for a consistent behavior
    495:  encoding = encoding or "utf-8"
    496:  errors = errors or "strict"
    497:  # read_csv does not know whether the buffer is opened in binary/text mode
    498:  if _is_binary_mode(path_or_buf, mode) and "b" not in mode:
    499:  mode += "b"
    500:  # validate encoding and errors
    501:  codecs.lookup(encoding)
    502:  if isinstance(errors, str):
    503:  codecs.lookup_error(errors)
    ...
    
    526:  ioargs.mode = ioargs.mode.replace("t", "")
    527:  elif compression == "zstd" and "b" not in ioargs.mode:
    528:  # python-zstandard defaults to text mode, but we always expect
    529:  # compression libraries to use binary mode.
    530:  ioargs.mode += "b"
    531:  # GZ Compression
    532:  if compression == "gzip":
    533:  if isinstance(handle, str):
    534:  # error: Incompatible types in assignment (expression has type
    ...
    
    552:  # "Union[str, BaseBuffer]", "str", "Dict[str, Any]"
    553:  handle = get_bz2_file()(  # type: ignore[call-overload]
    554:  handle,
    555:  mode=ioargs.mode,
    556:  **compression_args,
    557:  )
    558:  # ZIP Compression
    559:  elif compression == "zip":
    560:  # error: Argument 1 to "_BytesZipFile" has incompatible type
    ...
    
    564:  handle, ioargs.mode, **compression_args  # type: ignore[arg-type]
    565:  )
    566:  if handle.buffer.mode == "r":
    567:  handles.append(handle)
    568:  zip_names = handle.buffer.namelist()
    569:  if len(zip_names) == 1:
    570:  handle = handle.buffer.open(zip_names.pop())
    571:  elif not zip_names:
    572:  raise ValueError(f"Zero files found in ZIP file {path_or_buf}")
    573:  else:
    574:  raise ValueError(
    ...
    
    576:  f"Only one file per ZIP: {zip_names}"
    577:  )
    578:  # TAR Encoding
    579:  elif compression == "tar":
    580:  compression_args.setdefault("mode", ioargs.mode)
    581:  if isinstance(handle, str):
    582:  handle = _BytesTarFile(name=handle, **compression_args)
    583:  else:
    584:  # error: Argument "fileobj" to "_BytesTarFile" has incompatible
    ...
    
    591:  if "r" in handle.buffer.mode:
    592:  handles.append(handle)
    593:  files = handle.buffer.getnames()
    594:  if len(files) == 1:
    595:  file = handle.buffer.extractfile(files[0])
    596:  assert file is not None
    597:  handle = file
    598:  elif not files:
    599:  raise ValueError(f"Zero files found in TAR archive {path_or_buf}")
    600:  else:
    601:  raise ValueError(
    602:  "Multiple files found in TAR archive. "
    603:  f"Only one file per TAR archive: {files}"
    604:  )
    605:  # XZ Compression
    606:  elif compression == "xz":
    607:  # error: Argument 1 to "LZMAFile" has incompatible type "Union[str,
    ...
    
    620:  handle = zstd.open(
    621:  handle,
    622:  mode=ioargs.mode,
    623:  **open_args,
    624:  )
    625:  # Unrecognized Compression
    626:  else:
    627:  msg = f"Unrecognized compression type: {compression}"
    628:  raise ValueError(msg)
    ...
    
    632:  # Check whether the filename is to be opened in binary mode.
    633:  # Binary mode does not support 'encoding' and 'newline'.
    634:  if ioargs.encoding and "b" not in ioargs.mode:
    635:  # Encoding
    636:  handle = open(
    637:  handle,
    638:  ioargs.mode,
    639:  encoding=ioargs.encoding,
    640:  errors=errors,
    641:  newline="",
    642:  )
    643:  else:
    644:  # Binary mode
    645:  >               handle = open(handle, ioargs.mode)
    646:  E               FileNotFoundError: [Errno 2] No such file or directory: '../../examples/GSE156793.parquet'
    647:  /usr/share/miniconda/lib/python3.10/site-packages/pandas/io/common.py:882: FileNotFoundError
    648:  =========================== short test summary info ============================
    649:  FAILED fslite/tests/test_univariate_methods.py::test_univariate_filter_big_corr - FileNotFoundError: [Errno 2] No such file or directory: '../../examples/GSE156793.parquet'
    650:  ======================== 1 failed, 12 passed in 15.33s =========================
    651:  ##[error]Process completed with exit code 1.
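
The failure comes from a dataset that is not available in CI ('../../examples/GSE156793.parquet'). One common remedy, sketched below rather than taken from this PR, is to skip the test when the file is missing and to resolve the path relative to the test file instead of the working directory.

from pathlib import Path

import pytest

# Resolve relative to this test file, not the CI runner's working directory.
BIG_PARQUET = Path(__file__).parents[2] / "examples" / "GSE156793.parquet"

@pytest.mark.skipif(
    not BIG_PARQUET.exists(),
    reason="large example dataset is not available in CI",
)
def test_univariate_filter_big_corr():
    ...  # existing test body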
    

    ✨ CI feedback usage guide:

    The CI feedback tool (/checks) automatically triggers when a PR has a failed check.
    The tool analyzes the failed checks and provides several feedbacks:

    • Failed stage
    • Failed test name
    • Failure summary
    • Relevant error logs

    In addition to being automatically triggered, the tool can also be invoked manually by commenting on a PR:

    /checks "https://github.com/{repo_name}/actions/runs/{run_number}/job/{job_number}"
    

    where {repo_name} is the name of the repository, {run_number} is the run number of the failed check, and {job_number} is the job number of the failed check.

    Configuration options

    • enable_auto_checks_feedback - if set to true, the tool will automatically provide feedback when a check is failed. Default is true.
    • excluded_checks_list - a list of checks to exclude from the feedback, for example: ["check1", "check2"]. Default is an empty list.
    • enable_help_text - if set to true, the tool will provide a help message with the feedback. Default is true.
    • persistent_comment - if set to true, the tool will overwrite a previous checks comment with the new feedback. Default is true.
    • final_update_message - if persistent_comment is true and updating a previous checks message, the tool will also create a new message: "Persistent checks updated to latest commit". Default is true.

    See more information about the checks tool in the docs.
