Refactor py #12

Open · wants to merge 64 commits into base: dev from refactor-py
Commits (64)
82e5c3b
first line
ypriverol Sep 19, 2024
deb2df6
first iteration of pandas fdataframe.py
ypriverol Sep 19, 2024
70fec44
first iteration of pandas fdataframe.py
ypriverol Sep 19, 2024
b99aee0
first iteration of pandas fdataframe.py
ypriverol Sep 19, 2024
64fb1aa
update fdataframe class
enriquea Sep 20, 2024
cc471f0
minor refactory
enriquea Sep 20, 2024
30e0659
Merge branch 'refactor-py' of https://github.com/bigbio/fsspark into …
enriquea Sep 20, 2024
471dafa
first iteration of pandas fdataframe.py
ypriverol Sep 20, 2024
66d6118
Merge remote-tracking branch 'origin/refactor-py' into refactor-py
ypriverol Sep 20, 2024
174196a
first iteration of pandas fdataframe.py
ypriverol Sep 20, 2024
fa0d320
first iteration of pandas fdataframe.py
ypriverol Sep 20, 2024
0a8080b
added test univariate corr
enriquea Sep 20, 2024
8558656
refactor univariate methods (corr)
enriquea Sep 20, 2024
d2ca24d
update
enriquea Sep 20, 2024
516b4c6
added methods to select features and update FSDataFrame
enriquea Sep 20, 2024
a787707
move from unitests to pytests
ypriverol Sep 20, 2024
f75093d
move from unitests to pytests
ypriverol Sep 20, 2024
f15b4e8
minor changes to store sparse matrices
ypriverol Sep 21, 2024
ea15b18
fsspark -> fslite
ypriverol Sep 22, 2024
a4de03c
fsspark -> fslite
ypriverol Sep 22, 2024
032a422
better structure for methods in constants.py
ypriverol Sep 22, 2024
c2312c8
better structure for methods in constants.py
ypriverol Sep 22, 2024
10ee2e8
fsspark -> fslite
ypriverol Sep 22, 2024
a69ac12
Minor changes in constants.py
ypriverol Sep 22, 2024
3f56ded
black applied
ypriverol Sep 22, 2024
1fafeb5
clean more code.
ypriverol Sep 22, 2024
f2ce664
clean more code.
ypriverol Sep 22, 2024
6d1f54a
update in dependencies
ypriverol Sep 22, 2024
a0181aa
update in dependencies
ypriverol Sep 22, 2024
4a93621
update in dependencies
ypriverol Sep 22, 2024
5d70dfc
update in dependencies
ypriverol Sep 22, 2024
0eddddd
update in dependencies
ypriverol Sep 22, 2024
94703eb
smaller tests for CI/CD
ypriverol Sep 22, 2024
f67a259
smaller tests for CI/CD
ypriverol Sep 23, 2024
7a08e82
Another refactoring
ypriverol Sep 23, 2024
5e56b21
Another refactoring
ypriverol Sep 23, 2024
9b74ada
Another refactoring
ypriverol Sep 23, 2024
b1c4ad5
refactoring ml methods
ypriverol Sep 23, 2024
c657be9
refactoring ml methods
ypriverol Sep 23, 2024
c46167c
added file for experiments
ypriverol Sep 23, 2024
7b06d1e
minor comments
ypriverol Sep 23, 2024
35f58a2
minor refinements
ypriverol Sep 23, 2024
43dddb7
minor refinements
ypriverol Sep 23, 2024
b6e8eab
added example script to parse single-cell data
enriquea Sep 23, 2024
07a9dc5
implemented univariate selector methods (from sci-learn) and added te…
enriquea Sep 23, 2024
6c29cd8
added implementation for multivariate methods: variance and matrix_co…
enriquea Sep 24, 2024
5cbd7da
added tests for multivariate
enriquea Sep 24, 2024
cc493f6
loom2parquet examples
ypriverol Sep 24, 2024
4250a4e
Update fslite/fs/utils.py
ypriverol Sep 25, 2024
cc4e794
Update fslite/tests/test_ml_methods.py
ypriverol Sep 25, 2024
0ccd98d
Update fslite/tests/generate_big_tests.py
ypriverol Sep 25, 2024
0e24e2c
Update fslite/tests/generate_big_tests.py
ypriverol Sep 25, 2024
82a1a86
delete ML methods
ypriverol Sep 25, 2024
e2f7b9c
delete ML methods
ypriverol Sep 25, 2024
5a91f14
Update examples/loom2parquetmerge.py
ypriverol Sep 25, 2024
718b743
Update fslite/tests/generate_big_tests.py
ypriverol Sep 25, 2024
681a823
Update fslite/fs/methods.py
ypriverol Sep 25, 2024
8608117
Update fslite/fs/ml.py
ypriverol Sep 25, 2024
6ecbaca
Update fslite/fs/fdataframe.py
ypriverol Sep 25, 2024
d5cc974
Update fslite/fs/multivariate.py
ypriverol Sep 25, 2024
7ee27c8
delete ML methods
ypriverol Sep 25, 2024
d1f74d6
delete ML methods
ypriverol Sep 25, 2024
3909487
refactoring parquet SC generation
ypriverol Sep 25, 2024
07cb771
small changes
ypriverol Sep 26, 2024
28 changes: 15 additions & 13 deletions README.md
@@ -1,41 +1,43 @@
[![Python application](https://github.com/enriquea/fsspark/actions/workflows/python-app.yml/badge.svg?branch=main)](https://github.com/enriquea/fsspark/actions/workflows/python-app.yml)
[![Python Package using Conda](https://github.com/enriquea/fsspark/actions/workflows/python-package-conda.yml/badge.svg?branch=main)](https://github.com/enriquea/fsspark/actions/workflows/python-package-conda.yml)
[![Python application](https://github.com/bigbio/fslite/actions/workflows/python-app.yml/badge.svg?branch=main)](https://github.com/bigbio/fslite/actions/workflows/python-app.yml)
[![Python Package using Conda](https://github.com/bigbio/fslite/actions/workflows/python-package-conda.yml/badge.svg?branch=main)](https://github.com/bigbio/fslite/actions/workflows/python-package-conda.yml)

# fsspark
# fslite

---

## Feature selection in Spark
### Memory-Efficient, High-Performance Feature Selection Library for Big and Small Datasets

### Description

`fsspark` is a python module to perform feature selection and machine learning based on spark.
Pipelines written using `fsspark` can be divided roughly in four major stages: 1) data pre-processing, 2) univariate
`fslite` is a Python module that performs feature selection and machine learning using pre-built FS pipelines.
Pipelines written using `fslite` can be divided roughly into four major stages: 1) data pre-processing, 2) univariate
filters, 3) multivariate filters and 4) machine learning wrapped with cross-validation (**Figure 1**).
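
As a rough sketch of those four stages — expressed here with scikit-learn equivalents rather than the `fslite` API (the real pre-built pipeline lives in `fslite/pipeline/fs_pipeline_example.py`) — the workflow looks like this:

```python
# Illustrative only: the four stages expressed with scikit-learn equivalents,
# NOT the fslite API (see fslite/pipeline/fs_pipeline_example.py for that).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=500, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer()),                         # 1) data pre-processing
    ("univariate", SelectKBest(f_classif, k=100)),       # 2) univariate filter
    ("multivariate", VarianceThreshold(threshold=0.1)),  # 3) multivariate filter
    ("model", RandomForestClassifier(random_state=0)),   # 4) ML model
])

# stage 4 is wrapped with cross-validation
print(cross_val_score(pipe, X, y, cv=5).mean())
```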

`fslite` builds on our previous work [feseR](https://github.com/enriquea/feseR), originally implemented in R with the caret package; the accompanying publication is available [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0189875).

![Feature Selection flowchart](images/fs_workflow.png)
**Figure 1**. Feature selection workflow example implemented in fsspark.
**Figure 1**. Feature selection workflow example implemented in fslite.

### Documentation

The package documentation describes the [data structures](docs/README.data.md) and
[features selection methods](docs/README.methods.md) implemented in `fsspark`.
[feature selection methods](docs/README.methods.md) implemented in `fslite`.

### Installation

- pip
```bash
git clone https://github.com/enriquea/fsspark.git
cd fsspark
git clone https://github.com/bigbio/fslite.git
cd fslite
pip install . -r requirements.txt
```

- conda
```bash
git clone https://github.com/enriquea/fsspark.git
cd fsspark
git clone https://github.com/bigbio/fslite.git
cd fslite
conda env create -f environment.yml
conda activate fsspark-venv
conda activate fslite-venv
pip install . -r requirements.txt
```

4 changes: 4 additions & 0 deletions docs/EXPERIMENTS.md
@@ -0,0 +1,4 @@
## Experiments and Benchmarks

This document describes the experiments and benchmarks conducted to evaluate the performance of `fslite`.
The experiments were conducted on the following datasets:
32 changes: 17 additions & 15 deletions docs/README.data.md
@@ -1,9 +1,9 @@
## fsspark - data structures
## fslite - data structures

---

`fsspark` is a Python package that provides a set of tools for feature selection in Spark.
Here we describe the main data structures used in `fsspark` and how to use them.
`fslite` is a Python package that provides a set of tools for feature selection in Spark.
Here we describe the main data structures used in `fslite` and how to use them.

### Input data

@@ -32,30 +32,32 @@ The following is an example of a TSV file with a binary response variable:

### Import functions

`fsspark` provides two main functions to import data from a TSV file.
`fslite` provides two main functions to import data from a TSV file.

- `import_table` - Import data from a TSV file into a Spark Data Frame (sdf).

```python
from fsspark.utils.io import import_table
sdf = import_table('data.tsv.bgz',
sep='\t',
n_partitions=5)
from fslite.utils.io import import_table

sdf = import_table('data.tsv.bgz',
sep='\t',
n_partitions=5)
```

- `import_table_as_psdf` - Import data from a TSV file into a Spark Data Frame (sdf) and
convert it into a Pandas on Spark Data Frame (psdf).

```python
from fsspark.utils.io import import_table_as_psdf
psdf = import_table_as_psdf('data.tsv.bgz',
sep='\t',
from fslite.utils.io import import_table_as_psdf

psdf = import_table_as_psdf('data.tsv.bgz',
sep='\t',
n_partitions=5)
```
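
For a quick sanity check of either imported frame, the standard Spark and pandas-on-Spark APIs apply (a minimal sketch, nothing `fslite`-specific; `sdf` and `psdf` are the frames from the two examples above):

```python
# Quick sanity checks on the imported frames (plain Spark / pandas-on-Spark APIs).
sdf.printSchema()   # schema of the Spark DataFrame
sdf.show(n=5)       # first rows of the Spark DataFrame

print(psdf.shape)   # dimensions of the pandas-on-Spark DataFrame
print(psdf.head())  # first rows, pandas-style
```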

### The Feature Selection Spark Data Frame (FSDataFrame)

The `FSDataFrame` (**Figure 1**) is a core functionality of `fsspark`. It is a wrapper around a Spark Data Frame (sdf)
The `FSDataFrame` (**Figure 1**) is the core data structure of `fslite`. It is a wrapper around a Spark Data Frame (sdf)
that provides a set of methods to facilitate feature selection tasks. The `FSDataFrame` is initialized
with a Spark Data Frame (sdf) or a Pandas on Spark Data Frame (psdf) and two mandatory arguments:
`sample_col` and `label_col`. The `sample_col` argument is the name of the column in the sdf that
@@ -73,9 +75,9 @@ contains the response variable.
#### How to create a Feature Selection Spark Data Frame (FSDF)

```python
from fsspark.config.context import init_spark, stop_spark_session
from fsspark.fs.core import FSDataFrame
from fsspark.utils.io import import_table_as_psdf
from fslite.config.context import init_spark, stop_spark_session
from fslite.fs.core import FSDataFrame
from fslite.utils.io import import_table_as_psdf

# Init spark
init_spark()
8 changes: 4 additions & 4 deletions docs/README.methods.md
@@ -1,10 +1,10 @@

# fsspark - features selection methods
# fslite - feature selection methods

---

`fsspark `includes a set of methods to perform feature selection and machine learning based on spark.
A typical workflow written using `fsspark` can be divided roughly in four major stages:
`fslite` includes a set of methods to perform feature selection and machine learning based on Spark.
A typical workflow written using `fslite` can be divided roughly into four major stages:

1) data pre-processing.
2) univariate filters.
@@ -53,4 +53,4 @@ A typical workflow written using `fsspark` can be divided roughly in four major

### 5. Feature selection pipeline example

[FS pipeline example](../fsspark/pipeline/fs_pipeline_example.py)
[FS pipeline example](../fslite/pipeline/fs_pipeline_example.py)
19 changes: 12 additions & 7 deletions environment.yml
@@ -1,14 +1,19 @@
name: fsspark-venv
name: fslite-venv
channels:
- defaults
- conda-forge
dependencies:
- python==3.10
- pip
- pip:
- setuptools~=65.5.0
- pyspark~=3.3.0
- networkx~=2.8.7
- numpy~=1.23.4
- pandas~=1.5.1
- pyarrow~=8.0.0
- setuptools
- networkx
- numpy
- pyarrow
- pandas
- scipy
- scikit-learn
- psutil
- pytest
- matplotlib
- memory-profiler
136 changes: 136 additions & 0 deletions examples/loom2parquetchunks.py
@@ -0,0 +1,136 @@
# Import and convert to parquet a single-cell dataset: GSE156793 (loom format)
# GEO URL:
# https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE156793&format=file&file=GSE156793%5FS3%5Fgene%5Fcount%2Eloom%2Egz

# import libraries
import pandas as pd
import loompy
import pyarrow.parquet as pq
import pyarrow as pa

# define the path to the loom file
loom_file = "GSE156793_S3_gene_count.loom"

# connect to the loom file
ds = loompy.connect(loom_file)

# get shape of the data
print(ds.shape)

# retrieve the row attributes
print(ds.ra.keys())

# get gene ids
gene_ids = ds.ra["gene_id"]
print(gene_ids[0:10])

# get the column attributes
print(ds.ca.keys())

# get sample metadata
sample_id = ds.ca["sample"]
cell_cluster = ds.ca["Main_cluster_name"]
assay = ds.ca["Assay"]
development_day = ds.ca["Development_day"]

# make a dataframe with the sample metadata, defining the column types
sample_df = pd.DataFrame(
    {
        "sample_id": sample_id,
        "cell_cluster": cell_cluster,
        "assay": assay,
        "development_day": development_day,
    }
)

# print the first 5 rows
print(sample_df.head())

# Make 'cell_cluster' a categorical variable encoded as an integer
sample_df["cell_cluster"] = sample_df["cell_cluster"].astype("category")
sample_df["cell_cluster_id"] = sample_df["cell_cluster"].cat.codes

# Make 'assay' a categorical variable encoded as an integer
sample_df["assay"] = sample_df["assay"].astype("category")
sample_df["assay_id"] = sample_df["assay"].cat.codes

# Make 'sample_id' the index
sample_df = sample_df.set_index("sample_id")

# show the first 5 rows
print(sample_df.head())

# Save the sample metadata to parquet
sample_df.reset_index().to_parquet(
    "sample_metadata.parquet", index=False, engine="auto", compression="gzip"
)


# transpose dataset and convert to parquet.
# process the data per chunks.
chunk_size = 10000
writer = None
count = 0
number_chunks = 10 # number of chunks to process
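
# NOTE: a single ParquetWriter is reused across chunks, so each chunk is
# appended as a new row group and the full cell-by-gene matrix never has
# to be held in memory at once.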

for ix, selection, view in ds.scan(axis=1, batch_size=chunk_size):
    # retrieve the chunk
    matrix_chunk = view[:, :]

    # transpose the data
    matrix_chunk_t = matrix_chunk.T

    # convert to pandas dataframe
    df_chunk = pd.DataFrame(
        matrix_chunk_t, index=sample_id[selection.tolist()], columns=gene_ids
    )

    # merge chunk with sample metadata;
    # validate="one_to_one" raises if sample ids are duplicated on either side
    df_chunk = pd.merge(
        left=sample_df[["cell_cluster_id", "development_day", "assay_id"]],
        right=df_chunk,
        how="inner",
        left_index=True,
        right_index=True,
        sort=False,
        copy=True,
        indicator=False,
        validate="one_to_one",
    )

    # reset the index
    df_chunk = df_chunk.reset_index()

    # rename the index column
    df_chunk = df_chunk.rename(columns={"index": "sample_id"})

    if writer is None:
        # define the schema: metadata columns first, then one float32 column per gene
        schema = pa.schema(
            [
                pa.field("sample_id", pa.string()),
                pa.field("cell_cluster_id", pa.int8()),
                pa.field("development_day", pa.int64()),
                pa.field("assay_id", pa.int8()),
            ]
            + [pa.field(gene_id, pa.float32()) for gene_id in gene_ids]
        )

        # sanity check: the dataframe and the schema must have the same width
        print(len(list(df_chunk.columns)))
        print(len(schema))

        # create the parquet writer
        writer = pq.ParquetWriter("GSE156793.parquet", schema, compression="snappy")

    writer.write_table(pa.Table.from_pandas(df_chunk, preserve_index=False))

    print(f"Chunk {ix} saved")

    count += 1
    if count >= number_chunks:
        break

if writer is not None:
    writer.close()
    print("Concatenated parquet file written to GSE156793.parquet")

# close the loom connection
ds.close()
File renamed without changes.
File renamed without changes.