[WIP] Hdf5 io #4

Open
wants to merge 51 commits into
base: master
14183b8
hdf5 io funcs
alexey0308 Jul 30, 2021
0718a52
add h5py to prepare
alexey0308 Jul 30, 2021
b8e56e0
move cycle within with open()
alexey0308 Jul 30, 2021
456c3c7
correct level
alexey0308 Aug 2, 2021
a3f791e
add iteration over chr and test
alexey0308 Aug 2, 2021
a1cb3ed
use w to substitute hdf5 file if exists
alexey0308 Aug 2, 2021
d4aec09
upd version in test to silence it
alexey0308 Aug 2, 2021
3fd6f4f
add read_chromosome and use it to load chr matrix
alexey0308 Aug 2, 2021
08d97cf
add reading from hdf5 in smooth
alexey0308 Aug 2, 2021
275ba8a
rename to scan
alexey0308 Aug 2, 2021
30689e4
fix imports
alexey0308 Aug 5, 2021
bf9905b
hdf5 io funcs
alexey0308 Jul 30, 2021
c460728
add h5py to prepare
alexey0308 Jul 30, 2021
b17c190
move cycle within with open()
alexey0308 Jul 30, 2021
e3af8fa
correct level
alexey0308 Aug 2, 2021
76edae0
add iteration over chr and test
alexey0308 Aug 2, 2021
17198ac
use w to substitute hdf5 file if exists
alexey0308 Aug 2, 2021
e09a67f
upd version in test to silence it
alexey0308 Aug 2, 2021
c55cdbf
add read_chromosome and use it to load chr matrix
alexey0308 Aug 2, 2021
7b69007
add reading from hdf5 in smooth
alexey0308 Aug 2, 2021
153566c
rename to scan
alexey0308 Aug 2, 2021
c651780
rm class annotation
alexey0308 Aug 18, 2021
d6b12cf
corrected import of packages
LeonieKuechenhoff Aug 18, 2021
ce862f9
Merge branch 'hdf5' of https://github.com/LKremer/scbs into hdf5
LeonieKuechenhoff Aug 18, 2021
5aa75ea
ADD streamed writing of hd5f from coo matrix
alexey0308 Aug 18, 2021
7665c55
upd doc
alexey0308 Aug 19, 2021
25d024f
ADD calculate number of entries in the chromosome file
alexey0308 Aug 19, 2021
3393219
ADD ExitStack to avoid not closed files in exceptions
alexey0308 Aug 19, 2021
4899b92
Merge branch 'hdf5' into hdf5-streaming
alexey0308 Aug 19, 2021
4fdf1ca
ADD calculate number of entries in the chromosome file
alexey0308 Aug 19, 2021
1452972
ADD ExitStack to avoid not closed files in exceptions
alexey0308 Aug 19, 2021
b8be552
ADD test write from stream -> read hdf5
alexey0308 Aug 19, 2021
4c819c6
UPD version
alexey0308 Aug 19, 2021
b8ccd10
extract transformation of coo to csr
alexey0308 Aug 20, 2021
c049733
extract format class to file
alexey0308 Aug 20, 2021
5fb028d
return file names, but not closed descriptors
alexey0308 Aug 20, 2021
ecdb7e5
move return level
alexey0308 Aug 20, 2021
5259874
fix level
alexey0308 Aug 20, 2021
780562b
add description of chromosome class
alexey0308 Aug 20, 2021
6efe0c8
use to strategies to save
alexey0308 Aug 20, 2021
bf07c06
use compression and types
alexey0308 Aug 23, 2021
e40d074
add test with zeros in the end
alexey0308 Aug 23, 2021
fceee1d
add streamed write to the prepare main func
alexey0308 Aug 23, 2021
6328c27
ADD sparse facade object
alexey0308 Aug 25, 2021
1377fa8
ADD doc
alexey0308 Aug 25, 2021
76c211b
Merge branch 'hdf5-streaming' into hdf5
alexey0308 Aug 25, 2021
1b228ae
fix imports
alexey0308 Aug 25, 2021
cbe0562
import only if use
alexey0308 Aug 25, 2021
2f0af9b
add missing import
alexey0308 Aug 26, 2021
52658fa
sort imports
alexey0308 Aug 26, 2021
2dfddf4
add streaming flag
alexey0308 Sep 27, 2021
add h5py to prepare
alexey0308 committed Jul 30, 2021

Unverified

This commit is not signed, but one or more authors requires that any commit attributed to them is signed.
commit 0718a52aded364a3f4b05efb7e99074eb111d39a
9 changes: 5 additions & 4 deletions scbs/io.py
@@ -6,23 +6,24 @@ def write_sparse_hdf5(h5object: h5py.AttributeManager, matrix):
     h5object["indptr"] = matrix.indptr
     h5object["data"] = matrix.data
     h5object["indices"] = matrix.indices
-    h5object["format"] = matrix.format
+    h5object.attrs["format"] = matrix.format
+    h5object["shape"] = matrix.shape


 def read_sparse_hdf5(h5object: h5py.AttributeManager):
     """Read the matrix from the provided h5py File or Group."""
     try:
-        matformat = h5object["format"]
         constructor = {"csc": sparse.csc_matrix, "csr": sparse.csr_matrix}[
-            matformat
+            h5object.attrs["format"]
         ]
     except KeyError:
         raise KeyError(
             "The matrix format (csc, csr or coo) must be specified in the 'format' attribute."
         )
     try:
         return constructor(
-            h5object["data"], h5object["indices"], h5object["indptr"]
+            (h5object["data"], h5object["indices"], h5object["indptr"]),
+            shape=h5object["shape"],
         )
     except KeyError:
         raise Exception(
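
The hunk above stores a compressed sparse matrix as three datasets (data, indices, indptr) plus its format and shape, and rebuilds it from the (data, indices, indptr) tuple. A minimal round-trip sketch of that layout follows, assuming h5py, NumPy and SciPy are installed. The names write_sparse, read_sparse and roundtrip_demo are hypothetical and are not the PR's functions. Reading each dataset into memory with [()] and decoding a bytes-valued attribute are assumptions about a defensive way to do this, not something the diff shows.

```python
import h5py
import numpy as np
import scipy.sparse as sparse


def write_sparse(group, matrix):
    """Store a CSR/CSC matrix as three datasets, a shape dataset and a format attribute."""
    group.create_dataset("data", data=matrix.data)
    group.create_dataset("indices", data=matrix.indices)
    group.create_dataset("indptr", data=matrix.indptr)
    group.create_dataset("shape", data=matrix.shape)
    group.attrs["format"] = matrix.format  # "csr" or "csc"


def read_sparse(group):
    """Rebuild the matrix written by write_sparse (hypothetical helper)."""
    fmt = group.attrs["format"]
    if isinstance(fmt, bytes):  # assumption: some setups may return bytes
        fmt = fmt.decode()
    constructors = {"csc": sparse.csc_matrix, "csr": sparse.csr_matrix}
    if fmt not in constructors:
        raise ValueError(f"unsupported sparse format: {fmt!r}")
    # [()] loads each dataset into a NumPy array before handing it to SciPy.
    parts = (group["data"][()], group["indices"][()], group["indptr"][()])
    shape = tuple(int(n) for n in group["shape"][()])
    return constructors[fmt](parts, shape=shape)


def roundtrip_demo():
    # The last row is all zeros, so the shape cannot be inferred from the indices.
    dense = np.array([[0, 1, 0], [2, 0, 0], [0, 0, 3], [0, 0, 0]], dtype=np.int8)
    original = sparse.csc_matrix(dense)
    # In-memory HDF5 file: nothing is written to disk.
    with h5py.File("demo.h5", "w", driver="core", backing_store=False) as hfile:
        write_sparse(hfile.create_group("chr1"), original)
        restored = read_sparse(hfile["chr1"])
    return original, restored
```

The trailing all-zero row in the demo is there to show why the shape is stored explicitly: for a CSC matrix the number of rows is not recoverable from the index arrays when the last rows are empty.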
10 changes: 7 additions & 3 deletions scbs/prepare.py
@@ -1,10 +1,12 @@
 import numpy as np
+import h5py
 import gzip
 import os
 import scipy.sparse as sp_sparse
 from .utils import echo, secho
 import sys
 import pandas as pd
+from scbs.io import write_sparse_hdf5


 def prepare(input_files, data_dir, input_format):
@@ -35,13 +37,15 @@ def prepare(input_files, data_dir, input_format):
         echo(f"Populating {chrom_size} x {n_cells} matrix for chromosome {chrom}...")
         # populate with values from temporary COO file
         coo_path = os.path.join(data_dir, f"{chrom}.coo")
-        mat_path = os.path.join(data_dir, f"{chrom}.npz")
         mat = _load_csr_from_coo(coo_path, chrom_size, n_cells)
         n_obs_cell += mat.getnnz(axis=0)
         n_meth_cell += np.ravel(np.sum(mat > 0, axis=0))

-        echo(f"Writing to {mat_path} ...")
-        sp_sparse.save_npz(mat_path, mat)
+        echo(f"Writing {chrom} ...")
+        with h5py.File(os.path.join(data_dir, "methyl.hdf5"), "a") as hfile:
+            h5object = hfile.create_group(chrom)
+            write_sparse_hdf5(h5object, mat.tocsc())
+
         os.remove(coo_path) # delete temporary .coo file

     colname_path = _write_column_names(data_dir, cell_names)
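
The second hunk reopens methyl.hdf5 in append mode ("a") once per chromosome and creates one group per chromosome. Because the file survives between runs in that mode, a group left over from an earlier run can collide with create_group. The sketch below shows the same per-chromosome pattern with a stale group deleted first. That guard is an assumption added for illustration and is not part of this commit (the commit list shows a later commit switching to "w" instead). write_chromosome and demo are hypothetical names.

```python
import os
import tempfile

import h5py
import numpy as np
import scipy.sparse as sp_sparse


def write_chromosome(h5_path, chrom, mat):
    """Append one chromosome matrix to the HDF5 file as a CSC group."""
    csc = mat.tocsc()
    with h5py.File(h5_path, "a") as hfile:
        if chrom in hfile:
            del hfile[chrom]  # assumption: replace a group left by an earlier run
        group = hfile.create_group(chrom)
        group.create_dataset("data", data=csc.data)
        group.create_dataset("indices", data=csc.indices)
        group.create_dataset("indptr", data=csc.indptr)
        group.create_dataset("shape", data=csc.shape)
        group.attrs["format"] = csc.format


def demo():
    with tempfile.TemporaryDirectory() as data_dir:
        h5_path = os.path.join(data_dir, "methyl.hdf5")
        # Chromosome "1" is written twice on purpose to exercise the guard.
        for chrom in ("1", "2", "1"):
            mat = sp_sparse.csr_matrix(np.eye(5, 3))
            write_chromosome(h5_path, chrom, mat)
        with h5py.File(h5_path, "r") as hfile:
            return sorted(hfile.keys())
```

Opening the file once around the whole chromosome loop would avoid the repeated open and close; the sketch keeps the per-chromosome open only to mirror the structure of the hunk.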