Commit

Merge pull request #50 from aphp/fix-cache-issue
Fix latency due to koalas cache
svittoz authored Dec 1, 2023
2 parents cb73941 + 176fb15 commit 053a943
Showing 2 changed files with 5 additions and 4 deletions.
4 changes: 4 additions & 0 deletions changelog.md
@@ -1,5 +1,9 @@
 # Changelog
 
+## Unreleased
+### Fixed
+- Caching in spark instead of koalas to improve speed
+
 ## v0.1.6 (2023-09-27)
 ### Added
 - Module ``event_sequences`` to visualize individual sequences of events.
5 changes: 1 addition & 4 deletions eds_scikit/io/hive.py
@@ -12,7 +12,6 @@
 from pyspark.sql import SparkSession
 from pyspark.sql.types import LongType, StructField, StructType
 
-from ..utils.framework import bd
 from . import settings
 from .base import BaseData
 from .data_quality import clean_dates
@@ -227,12 +226,10 @@ def _read_table(
         if "person_id" in df.columns and person_ids is not None:
             df = df.join(person_ids, on="person_id", how="inner")
 
-        df = df.to_koalas()
+        df = df.cache().to_koalas()
 
         df = clean_dates(df)
 
-        bd.cache(df)
-
         return df
 
     def persist_tables_to_folder(
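
For readers skimming the diff, the net effect in `_read_table` is an ordering change: the Spark DataFrame is marked for caching first, and only then converted to koalas, replacing the earlier `bd.cache(df)` call on the converted frame. The sketch below mimics that call order with a hypothetical stand-in class (`FakeSparkDataFrame`, which is not part of pyspark or eds_scikit) so that it runs without a Spark session. `cache_then_convert` is likewise an illustrative name, not a function in this repository.

```python
class FakeSparkDataFrame:
    """Hypothetical stand-in for a pyspark DataFrame; it only records calls."""

    def __init__(self):
        self.calls = []

    def cache(self):
        # Like pyspark's DataFrame.cache(), return the DataFrame itself
        # so that further calls can be chained.
        self.calls.append("cache")
        return self

    def to_koalas(self):
        # Simplified: the real conversion returns a koalas DataFrame;
        # here a placeholder string stands in for it.
        self.calls.append("to_koalas")
        return "koalas DataFrame placeholder"


def cache_then_convert(df):
    """Mark the Spark-side frame for caching, then convert it (assumed helper)."""
    return df.cache().to_koalas()


fake = FakeSparkDataFrame()
converted = cache_then_convert(fake)
order = fake.calls
print(order)
```

With a real Spark DataFrame and `databricks.koalas` imported, the chained `df.cache().to_koalas()` call should behave the same way as the line added in this commit; the stand-in only makes the order of operations visible.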