Biology pipeline testing #27

Merged
merged 135 commits into from
Sep 14, 2023
be581de
First commit for testing
Aremaki Feb 8, 2023
74521de
Hot Fix
Aremaki Feb 8, 2023
7d2e2f4
Hot fix
Aremaki Feb 8, 2023
ca5a44e
Hot fix
Aremaki Feb 8, 2023
ea0218d
Hot fix
Aremaki Feb 8, 2023
d9c8c29
Hot fix
Aremaki Feb 8, 2023
232bec3
hot fix
Aremaki Feb 8, 2023
01d51c9
Hot Fix
Aremaki Feb 8, 2023
36e2c30
Hot Fix
Aremaki Feb 8, 2023
84a95c4
remove root termino
Aremaki Feb 8, 2023
88e1f54
Hot fix
Aremaki Feb 8, 2023
6ef2c33
Full refacto
Aremaki Mar 21, 2023
cdec8bd
Add registry
Aremaki Mar 23, 2023
9cbe726
Parquet connector to convert pandas via pyarrow
Aremaki Mar 23, 2023
9370dc3
Hot fix
Aremaki Mar 23, 2023
8553f1a
Hot fix
Aremaki Mar 24, 2023
8a6c30b
Hot fix
Aremaki Mar 24, 2023
5e498bb
Hot fix
Aremaki Mar 24, 2023
bec8c7b
Hot fix
Aremaki Mar 27, 2023
8ed4ded
Add Specialties_sets in Probe index
Aremaki Mar 31, 2023
8c1a3fd
Get child care site in the filter care site
Aremaki Mar 31, 2023
29b534f
Hot fix
Aremaki Apr 13, 2023
9e54694
Hot fix
Aremaki Apr 13, 2023
fa2b68c
Add n_result to viz
Aremaki Apr 17, 2023
8f407b7
Remove json_dir
Aremaki Apr 17, 2023
6e29db1
Add manual x_axis and y axis
Aremaki Apr 17, 2023
0683697
Hot fix
Aremaki Apr 17, 2023
6c828b2
First cleaning
Aremaki May 3, 2023
2c2ff5d
Ready for testing on real data
Aremaki May 10, 2023
8cda3be
Fix test
Aremaki May 10, 2023
5389c6a
Fix the model
Aremaki May 12, 2023
c50e7a3
Fix impute missing event
Aremaki May 16, 2023
9ecd3fd
Separate concepts set and concept codes
Aremaki May 30, 2023
1ecd8da
Hot fix
Aremaki Jun 2, 2023
16112e7
Fix impute missing column for note probe
Aremaki Jun 5, 2023
ff1e4a1
Hot fix
Aremaki Jun 5, 2023
6dcc92e
Update impute missing date
Aremaki Jun 7, 2023
bb9d4a1
Improve missing dates algo
Aremaki Jun 9, 2023
a2672b8
Hot fix: sort date before loss optimization
Aremaki Jun 9, 2023
40d26dd
adding age and pmsi type filter
svittoz Jun 26, 2023
de11d41
merging master
svittoz Jul 6, 2023
8872c0d
solving conflicts
svittoz Jul 6, 2023
2918806
solving conflicts
svittoz Jul 6, 2023
7a912b7
solving conflicts
svittoz Jul 6, 2023
8f389fa
solving conflicts
svittoz Jul 6, 2023
3cd4a1c
black correction
svittoz Jul 6, 2023
325e80c
black correction
svittoz Jul 6, 2023
1cbe6b1
tests correction
svittoz Jul 6, 2023
e346a96
precommit
svittoz Jul 6, 2023
6f86f84
precommit
svittoz Jul 6, 2023
eed7da8
small fix
svittoz Jul 11, 2023
c4a91f7
precommit
svittoz Jul 25, 2023
51c2c1a
persons test
svittoz Jul 25, 2023
61092ac
persons test
svittoz Jul 25, 2023
6199299
test
svittoz Jul 25, 2023
ab85549
test
svittoz Jul 25, 2023
ba205c5
test synthetic
svittoz Jul 26, 2023
4191cf6
fix test
svittoz Jul 26, 2023
34b6dfe
fix tests
svittoz Jul 26, 2023
248ddee
correcting test
svittoz Jul 28, 2023
9414a63
fixing error
svittoz Aug 2, 2023
e2474e8
person cleaning
svittoz Aug 2, 2023
ef6c179
changing naming
svittoz Aug 2, 2023
75994fe
adding tqdm
svittoz Aug 2, 2023
4554bd5
Merge branch 'biology_age_pmsi' of https://github.com/aphp/edsteva in…
svittoz Aug 2, 2023
1008455
remove change
svittoz Aug 2, 2023
e0ff48e
Merge branch 'main' into biology_age_pmsi
svittoz Aug 2, 2023
a66a9e0
persons to person
svittoz Aug 2, 2023
ddac179
Merge branch 'biology_age_pmsi' of https://github.com/aphp/edsteva in…
svittoz Aug 2, 2023
60fa6dc
measurement place
svittoz Aug 2, 2023
176d416
coverage test
svittoz Aug 2, 2023
3849615
coverage test
svittoz Aug 2, 2023
0e4d61e
remove itertools
svittoz Aug 2, 2023
c3a0308
Merge branch 'biology_age_pmsi' of https://github.com/aphp/edsteva in…
svittoz Aug 2, 2023
817aebc
precommit
svittoz Aug 2, 2023
197b20c
typo
svittoz Aug 2, 2023
c3756f2
correcting test
svittoz Aug 2, 2023
9ed31f1
fix test
svittoz Aug 2, 2023
d92c817
fix test
svittoz Aug 2, 2023
79dc102
fix test
svittoz Aug 2, 2023
d34304c
fix test
svittoz Aug 2, 2023
70b875e
fix test
svittoz Aug 2, 2023
d950d18
fix test
svittoz Aug 2, 2023
523eb79
Adam code review
svittoz Aug 3, 2023
1fa79b6
Adam review
svittoz Aug 3, 2023
e04b6ea
precommit
svittoz Aug 3, 2023
bf172e4
adding all modifications
svittoz Sep 6, 2023
84e3e79
merging main
svittoz Sep 6, 2023
80cc453
merge pb fixed
svittoz Sep 6, 2023
ebb6445
adding all modifications
svittoz Sep 6, 2023
a7e1f2b
correcting tests
svittoz Sep 6, 2023
6ab6fc9
remove .lock
svittoz Sep 6, 2023
95e2648
adding pyspark
svittoz Sep 7, 2023
0d3fd59
remove probe
svittoz Sep 7, 2023
fe5c687
fix coverage
svittoz Sep 7, 2023
4ec8ee1
add filtering on notes and measurements
svittoz Sep 7, 2023
79fc8d3
test coverage
svittoz Sep 7, 2023
399cbd4
coverage
svittoz Sep 7, 2023
1037c25
coverage
svittoz Sep 7, 2023
35c9850
kwargs
svittoz Sep 7, 2023
3cfd1e9
Apply suggestions from code review
svittoz Sep 8, 2023
44bce5b
fix typing
svittoz Sep 8, 2023
a2b85ea
cost / person not computed if no need
svittoz Sep 8, 2023
5bec33a
poetry.lock
svittoz Sep 8, 2023
e517f42
fix
svittoz Sep 8, 2023
43facdc
add age_ranges
svittoz Sep 8, 2023
7bc57df
Merge branch 'adding_ghm_cim' of https://github.com/aphp/edsteva into…
svittoz Sep 8, 2023
838ac24
adding age ranges
svittoz Sep 8, 2023
1fcb4d3
precommit
svittoz Sep 8, 2023
186884e
remove .lock
svittoz Sep 8, 2023
45217a3
.lock
svittoz Sep 8, 2023
8cd412c
.lock
svittoz Sep 8, 2023
19c5cf7
cost not None
svittoz Sep 8, 2023
317ebd5
cost not none
svittoz Sep 11, 2023
aa7db80
last changes
svittoz Sep 11, 2023
5c502af
change drop_duplicates position
svittoz Sep 11, 2023
67a1eee
duplicated
svittoz Sep 11, 2023
4e5e3dc
last fix
svittoz Sep 11, 2023
bbd5f19
coverage
svittoz Sep 11, 2023
d904c96
coverage
svittoz Sep 11, 2023
a219540
coverage
svittoz Sep 11, 2023
a280fac
last fix
svittoz Sep 11, 2023
7d83549
.lock
svittoz Sep 11, 2023
053fed7
contributing
svittoz Sep 11, 2023
8df1594
.lock
svittoz Sep 11, 2023
a6e4dec
dependencies
svittoz Sep 11, 2023
e0ec75c
adding a test on biology pipeline
svittoz Sep 13, 2023
8980a78
merging main
svittoz Sep 13, 2023
8706df8
clean code
svittoz Sep 13, 2023
0f2ece0
adding docstring
svittoz Sep 13, 2023
6f1b563
adding docstring
svittoz Sep 13, 2023
df6496e
pre-commit
svittoz Sep 13, 2023
ab7269d
small fix
svittoz Sep 13, 2023
677da3e
doc
svittoz Sep 14, 2023
fb1ae99
pre-commit
svittoz Sep 14, 2023
194 changes: 93 additions & 101 deletions edsteva/io/synthetic/biology.py
@@ -2,150 +2,142 @@
 import pandas as pd
 from loguru import logger

-from edsteva.io.synthetic.utils import (
-    generate_events_after_t0,
-    generate_events_after_t1,
-    generate_events_around_t0,
-    generate_events_around_t1,
-    generate_events_before_t0,
-)
-

 def generate_bio(
     generator: np.random.Generator,
-    t_start: int,
-    t_end: int,
-    n_events: int,
-    increase_time: int,
-    increase_ratio: float,
+    visit_care_site,
+    t0_visit,
+    date_col: str,
     bio_date_col: str,
+    id_visit_col,
     unit: str,
     concept_code: str,
     mode: str,
 ):
     if mode == "step":
         return _generate_bio_step(
             generator=generator,
-            t_start=t_start,
-            t_end=t_end,
-            n_events=n_events,
-            increase_time=increase_time,
-            increase_ratio=increase_ratio,
+            visit_care_site=visit_care_site,
+            t0_visit=t0_visit,
+            date_col=date_col,
             bio_date_col=bio_date_col,
+            id_visit_col=id_visit_col,
             unit=unit,
             concept_code=concept_code,
         )
     if mode == "rect":
         return _generate_bio_rect(
             generator=generator,
-            t_start=t_start,
-            t_end=t_end,
-            n_events=n_events,
-            increase_time=increase_time,
-            increase_ratio=increase_ratio,
+            visit_care_site=visit_care_site,
+            t0_visit=t0_visit,
+            date_col=date_col,
             bio_date_col=bio_date_col,
+            id_visit_col=id_visit_col,
             unit=unit,
             concept_code=concept_code,
         )


 def _generate_bio_step(
     generator: np.random.Generator,
-    t_start: int,
-    t_end: int,
-    n_events: int,
-    increase_time: int,
-    increase_ratio: float,
+    visit_care_site,
+    t0_visit,
+    date_col: str,
     bio_date_col: str,
+    id_visit_col,
     unit: str,
     concept_code: str,
 ):
-    t0 = generator.integers(t_start + increase_time, t_end - increase_time)
-    params = dict(
-        generator=generator,
-        t_start=t_start,
-        t_end=t_end,
-        n_events=n_events,
-        t0=t0,
-        increase_ratio=increase_ratio,
-        increase_time=increase_time,
+    t_end = visit_care_site[date_col].max()
+    t0 = generator.integers(t0_visit, t_end)
+    c_before = generator.uniform(0, 0.01)
+    c_after = generator.uniform(0.8, 1)
+
+    measurement_before_t0_visit = (
+        visit_care_site[visit_care_site[date_col] <= t0_visit][[id_visit_col, date_col]]
+        .sample(frac=c_before)
+        .rename(columns={date_col: bio_date_col})
     )
-    df = pd.concat(
-        [
-            generate_events_before_t0(**params),
-            generate_events_after_t0(**params),
-            generate_events_around_t0(**params),
-        ]
-    ).to_frame()
-    df.columns = [bio_date_col]
-    df["unit_source_value"] = unit
-    df["measurement_source_concept_id"] = concept_code
-    df["t_0_min"] = t0 - increase_time / 2
-    df["t_0_max"] = t0 + increase_time / 2
-    logger.debug("Generate measurement deploying as step function")
+    # Stratify visit between t0_visit and t0 to
+    # ensure that these elements are represented
+    # in the final measurements dataset.
+
+    measurement_before_t0 = (
+        visit_care_site[
+            (visit_care_site[date_col] <= t0) & (visit_care_site[date_col] > t0_visit)
+        ][[id_visit_col, date_col]]
+        .sample(frac=c_before)
+        .rename(columns={date_col: bio_date_col})
+    )
+
+    measurement_after_t0 = (
+        visit_care_site[visit_care_site[date_col] > t0][[id_visit_col, date_col]]
+        .sample(frac=c_after)
+        .rename(columns={date_col: bio_date_col})
+    )
+
+    measurement = pd.concat(
+        [measurement_before_t0_visit, measurement_before_t0, measurement_after_t0]
+    )
+
+    measurement[bio_date_col] = pd.to_datetime(measurement[bio_date_col], unit="s")
+    measurement["unit_source_value"] = unit
+    measurement["measurement_source_concept_id"] = concept_code
+    measurement["t_0"] = t0
+
+    logger.debug("Generate synthetic measurement deploying as step function")

-    return df
+    return measurement


 def _generate_bio_rect(
     generator: np.random.Generator,
-    t_start: int,
-    t_end: int,
-    n_events: int,
-    increase_time: int,
-    increase_ratio: float,
+    visit_care_site,
+    t0_visit,
+    date_col: str,
     bio_date_col: str,
+    id_visit_col,
     unit: str,
     concept_code: str,
 ):
-    t0 = generator.integers(
-        t_start + increase_time, (t_end + t_start) / 2 - increase_time
-    )
-    t1 = generator.integers(
-        (t_end + t_start) / 2 + increase_time, t_end - increase_time
+    t1_visit = visit_care_site["t_1_min"].max()
+    t0 = generator.integers(t0_visit, t0_visit + (t1_visit - t0_visit) / 3)
+    t1 = generator.integers(t0_visit + 2 * (t1_visit - t0_visit) / 3, t1_visit)
+    c_out = generator.uniform(0, 0.1)
+    c_in = generator.uniform(0.8, 1)
+
+    measurement_before_t0 = (
+        visit_care_site[visit_care_site[date_col] <= t0][[id_visit_col, date_col]]
+        .sample(frac=c_out)
+        .rename(columns={date_col: bio_date_col})
     )
-    t0_params = dict(
-        generator=generator,
-        t_start=t_start,
-        t_end=t1 - increase_time / 2,
-        n_events=n_events,
-        t0=t0,
-        increase_ratio=increase_ratio,
-        increase_time=increase_time,
+    measurement_between_t0_t1 = (
+        visit_care_site[
+            (visit_care_site[date_col] > t0) & (visit_care_site[date_col] <= t1)
+        ][[id_visit_col, date_col]]
+        .sample(frac=c_in)
+        .rename(columns={date_col: bio_date_col})
     )
-    before_t0 = generate_events_before_t0(**t0_params)
-    around_t0 = generate_events_around_t0(**t0_params)
-    # Raise n_visit to enforce a rectangle shape
-    between_t0_t1 = generate_events_after_t0(**t0_params)
-    t1_params = dict(
-        generator=generator,
-        t_start=t_start,
-        t_end=t_end,
-        n_events=n_events,
-        t1=t1,
-        increase_time=increase_time,
-        increase_ratio=increase_ratio,
+
+    measurement_after_t1 = (
+        visit_care_site[(visit_care_site[date_col] > t1)][[id_visit_col, date_col]]
+        .sample(frac=c_out)
+        .rename(columns={date_col: bio_date_col})
     )
-    around_t1 = generate_events_around_t1(**t1_params)
-    after_t1 = generate_events_after_t1(**t1_params)

-    df = pd.concat(
+    measurement = pd.concat(
         [
-            before_t0,
-            around_t0,
-            between_t0_t1,
-            around_t1,
-            after_t1,
+            measurement_before_t0,
+            measurement_between_t0_t1,
+            measurement_after_t1,
         ]
-    ).to_frame()
-
-    df.columns = [bio_date_col]
-    df["unit_source_value"] = unit
-    df["measurement_source_concept_id"] = concept_code
-    df["t_0_min"] = t0 - increase_time / 2
-    df["t_0_max"] = t0 + increase_time / 2
-    df["t_1_min"] = t1 - increase_time / 2
-    df["t_1_max"] = t1 + increase_time / 2
-    logger.debug("Generate measurement deploying as rectangle function")
-
-    return df
+    )
+
+    measurement[bio_date_col] = pd.to_datetime(measurement[bio_date_col], unit="s")
+    measurement["unit_source_value"] = unit
+    measurement["measurement_source_concept_id"] = concept_code
+    measurement["t_0"] = t0
+    measurement["t_1"] = t1
+    logger.debug("Generate synthetic measurement deploying as rectangle function")
+
+    return measurement
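The step-mode change in `_generate_bio_step` replaces event generation with sampling of existing visits: a small fraction of visits up to `t0` and a large fraction after it. The snippet below is a minimal sketch of that idea in isolation, not the PR's code. The function name `sample_step_measurements`, the column names `visit_id`, `visit_start` and `measurement_datetime`, and the fixed fractions are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def sample_step_measurements(
    visits: pd.DataFrame,
    t0: int,
    frac_before: float,
    frac_after: float,
    date_col: str = "visit_start",  # hypothetical column name (epoch seconds)
    id_col: str = "visit_id",  # hypothetical column name
    seed: int = 0,
) -> pd.DataFrame:
    """Keep a small share of visits up to t0 and a large share after it."""
    before = visits.loc[visits[date_col] <= t0, [id_col, date_col]]
    after = visits.loc[visits[date_col] > t0, [id_col, date_col]]
    sampled = pd.concat(
        [
            before.sample(frac=frac_before, random_state=seed),
            after.sample(frac=frac_after, random_state=seed),
        ]
    ).rename(columns={date_col: "measurement_datetime"})
    # Visit dates are epoch seconds here, so convert them back to datetimes.
    sampled["measurement_datetime"] = pd.to_datetime(
        sampled["measurement_datetime"], unit="s"
    )
    sampled["t_0"] = t0
    return sampled


# Toy visit table: 1000 visits spread uniformly over one million seconds.
rng = np.random.default_rng(0)
visits = pd.DataFrame(
    {
        "visit_id": np.arange(1000),
        "visit_start": rng.integers(0, 1_000_000, size=1000),
    }
)
out = sample_step_measurements(visits, t0=500_000, frac_before=0.01, frac_after=0.9)
```

Because every measurement row is drawn from an existing visit, each measurement keeps a valid visit identifier and date. The rectangle mode in the diff follows the same pattern with three windows (before `t0`, between `t0` and `t1`, after `t1`) instead of two.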
57 changes: 22 additions & 35 deletions edsteva/io/synthetic/synthetic.py
@@ -616,8 +616,8 @@ def _generate_measurement(
         mean_measurement: int = 1000,
         units: List[str] = ["g", "g/l", "mol", "s"],
     ):
-        t_min = self.t_min.timestamp()
-        t_max = self.t_max.timestamp()
+        self.t_min.timestamp()
+        self.t_max.timestamp()
         measurements = []
         visit_occurrence = visit_occurrence.sample(frac=0.9)
         for concept_name in src_concept_name:
@@ -626,50 +626,37 @@ def _generate_measurement(
             mean_value = (1 + units.index(unit)) * 2
             std_value = 1
             for care_site_id in hospital_ids:
-                t_start = t_min + self.generator.integers(0, (t_max - t_min) / 20)
-                t_end = t_max - self.generator.integers(0, (t_max - t_min) / 20)
-                valid_measurements = int(
-                    self.generator.normal(mean_measurement, mean_measurement / 5)
-                )
-                missing_value = int(self.generator.uniform(1, valid_measurements / 10))
-                n_measurements = valid_measurements + missing_value
-                increase_time = self.generator.integers(
-                    (t_end - t_start) / 100, (t_end - t_start) / 10
+                visit_care_site = visit_occurrence[
+                    visit_occurrence.care_site_id == care_site_id
+                ].reset_index(drop=True)
+                visit_care_site[self.date_col] = (
+                    visit_care_site[self.date_col].view("int64") // 10**9
                 )
-                increase_ratio = self.generator.uniform(150, 200)
-                concept_code = concept_name.split("_")[1]
-                unit = concept_name.split("_")[-1]
-                mean_value = (1 + units.index(unit)) * 2
-                std_value = 1
+
+                t0_visit = visit_care_site["t_0_max"].max()
                 params = dict(
                     generator=self.generator,
-                    t_start=t_start,
-                    t_end=t_end,
-                    n_events=n_measurements,
-                    increase_ratio=increase_ratio,
-                    increase_time=increase_time,
+                    visit_care_site=visit_care_site,
+                    date_col=self.date_col,
                     bio_date_col=self.bio_date_col,
+                    id_visit_col=self.id_visit_col,
                     unit=unit,
                     concept_code=concept_code,
+                    t0_visit=t0_visit,
                     mode=self.mode,
                 )
+
                 measurement = generate_bio(**params)
-                visit_care_site = visit_occurrence[
-                    visit_occurrence.care_site_id == care_site_id
-                ]
-                measurement[self.id_visit_col] = (
-                    visit_care_site[self.id_visit_col]
-                    .sample(
-                        n=measurement.shape[0],
-                        replace=True,
-                    )
-                    .reset_index(drop=True)
+
+                measurement["value_as_number"] = self.generator.normal(
+                    mean_value, std_value, measurement.shape[0]
                 )
-                measurement["value_as_number"] = [None] * missing_value + list(
-                    self.generator.normal(
-                        mean_value, std_value, measurement.shape[0] - missing_value
-                    )
+
+                valid_measurements = (
+                    self.generator.uniform(0, 1, measurement.shape[0]) > 0.01
                 )
+                measurement.loc[~valid_measurements, "value_as_number"] = None
+
                 measurements.append(measurement)

         measurements = pd.concat(measurements).reset_index(drop=True)
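Two small techniques appear in the `_generate_measurement` change: converting visit datetimes to epoch seconds before sampling, and blanking roughly 1% of `value_as_number` with a random boolean mask. The sketch below shows both on toy data. It is an illustration under assumptions, not the PR's code: the variable names and the toy values are hypothetical, and the epoch conversion uses timedelta division rather than the `.view("int64")` call in the diff, which, as far as I know, is deprecated for `Series` in recent pandas releases.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 1) Datetime -> epoch seconds -> datetime round trip.
dates = pd.Series(pd.to_datetime(["2020-01-01 00:00:00", "2021-06-15 12:00:00"]))
seconds = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
round_trip = pd.to_datetime(seconds, unit="s")

# 2) Draw values, then mark about 1% of them as missing with a boolean mask.
n = 10_000
values = pd.Series(rng.normal(4.0, 1.0, n))
is_valid = rng.uniform(0, 1, n) > 0.01  # True for roughly 99% of the rows
masked = values.where(is_valid)  # rows where is_valid is False become NaN
```

Masking after the draw means the missing values are scattered across the rows instead of being grouped at the start, which is what the removed `[None] * missing_value + list(...)` construction produced.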