Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dropnull to FirstK #556

Merged
merged 6 commits into from
Sep 27, 2024
Merged

Add dropnull to FirstK #556

merged 6 commits into from
Sep 27, 2024

Conversation

satrana42
Copy link
Contributor

@satrana42 satrana42 commented Sep 9, 2024

Add dropnull to FirstK aggregation to filter out None values from Optional input.

Copy link

ellipsis-dev bot commented Sep 9, 2024

Your free trial has expired. To keep using Ellipsis, sign up at https://app.ellipsis.dev for $20/seat/month or reach us at help@ellipsis.dev

limit=10,
dedup=False,
window=Continuous("1d"),
drop_nulls=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please make this a single word dropnull so that it reads similar to existing dropnull operator?

@satrana42
Copy link
Contributor Author

This is WIP. Unit tests pass, but integration test fails right now.

Copy link
Contributor

@aditya-nambiar aditya-nambiar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you land the first pr, rebase and then send this for review ?

Comment on lines 7 to 19
Aggregation to computes a rolling list of the earliest values for each group
within a window.

#### Parameters
<Expandable title="of" type="str">
Name of the field in the input dataset over which the aggregation should be
computed.
</Expandable>

<Expandable title="window" type="Window">
The continuous window within which aggregation needs to be computed. Possible
values are `"forever"` or any [time duration](/api-reference/data-types/duration).
</Expandable>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will all go once you rebase right?

@satrana42 satrana42 changed the title Add drop_nulls to FirstK Add dropnull to FirstK Sep 21, 2024
Copy link
Contributor

@aditya-nambiar aditya-nambiar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add to changelog and bump pyproject.toml

Comment on lines 4519 to 4566
@mock
def test_firstk_null(client):
@dataset
@source(webhook.endpoint("A"), disorder="14d", cdc="append")
class A:
id: int
value: Optional[int]
ts: datetime

@dataset(index=True)
class B:
id: int = field(key=True)
value: List[Optional[int]]
ts: datetime

@pipeline
@inputs(A)
def pipeline(cls, event: Dataset):
return event.groupby("id").aggregate(
value=FirstK(
of="value",
window=Continuous("forever"),
dedup=False,
limit=2,
dropnull=False
)
)

client.commit(datasets=[A, B], message="test")

now = datetime.now(timezone.utc)
yesterday = now - timedelta(days=1)
day_before_yesterday = now - timedelta(days=2)
df = pd.DataFrame(
{
"id": [1, 1, 2, 2, 3, 3],
"value": [None, 1, None, 2, None, 3],
"ts": [yesterday, now, now, day_before_yesterday, now, yesterday],
}
)
df["value"] = df["value"].astype("Int64")
client.log("fennel_webhook", "A", df)
client.sleep(30)

results, _ = client.lookup(
B,
keys=pd.DataFrame({"id": [1, 2, 3]}),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could parameterize this test and combine the two into 1.
you can explore using - @pytest.mark.parametrize(
only the assert would change according to the param i believe ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to make parameterize work. First I ran into pytest-dev/pytest#6810. Then I ran into:

$ pytest fennel/client_tests/test_dataset.py -k test_firstk_dropnull
========================================================= test session starts =========================================================
platform darwin -- Python 3.11.5, pytest-7.1.3, pluggy-1.5.0
rootdir: /Users/satrana42/Codes/client, configfile: pyproject.toml
plugins: rerunfailures-13.0, anyio-4.4.0, timeout-2.3.1, mock-3.14.0
collected 54 items / 52 deselected / 2 selected

fennel/client_tests/test_dataset.py EE                                                                                          [100%]

=============================================================== ERRORS ================================================================
________________________________ ERROR at setup of test_firstk_dropnull[False-Optional-values_firstk0] ________________________________
file /Users/satrana42/Codes/client/fennel/client_tests/test_dataset.py, line 4518
  @pytest.mark.integration
  @pytest.mark.parametrize(
      "dropnull,value_type,values_firstk",
      [
          (False, Optional[int], [[pd.NA, 1], [2, pd.NA], [3, pd.NA]]),
          (True, int, [[1], [2], [3]])
      ]
  )
  @mock
  def test_firstk_dropnull(client, dropnull, value_type, values_firstk):
E       fixture 'client' not found
>       available fixtures: anyio_backend, anyio_backend_name, anyio_backend_options, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, class_mocker, doctest_namespace, mocker, module_mocker, monkeypatch, package_mocker, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, session_mocker, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/Users/satrana42/Codes/client/fennel/client_tests/test_dataset.py:4518
___________________________________ ERROR at setup of test_firstk_dropnull[True-int-values_firstk1] ___________________________________
file /Users/satrana42/Codes/client/fennel/client_tests/test_dataset.py, line 4518
  @pytest.mark.integration
  @pytest.mark.parametrize(
      "dropnull,value_type,values_firstk",
      [
          (False, Optional[int], [[pd.NA, 1], [2, pd.NA], [3, pd.NA]]),
          (True, int, [[1], [2], [3]])
      ]
  )
  @mock
  def test_firstk_dropnull(client, dropnull, value_type, values_firstk):
E       fixture 'client' not found
>       available fixtures: anyio_backend, anyio_backend_name, anyio_backend_options, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, class_mocker, doctest_namespace, mocker, module_mocker, monkeypatch, package_mocker, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, session_mocker, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/Users/satrana42/Codes/client/fennel/client_tests/test_dataset.py:4518
======================================================= short test summary info =======================================================
ERROR fennel/client_tests/test_dataset.py::test_firstk_dropnull[False-Optional-values_firstk0]
ERROR fennel/client_tests/test_dataset.py::test_firstk_dropnull[True-int-values_firstk1]
================================================== 52 deselected, 2 errors in 9.22s ===================================================

Do you have an example where it works?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed on call: Don't use pytest's parametrize but pass the parameters in a loop or function with the deduplicated code of the two tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran into further issues with loop based parametrization. Reuse of client is the likely root cause because separate tests passed. I am skipping parametrization of these two tests.

@satrana42 satrana42 merged commit e7d9900 into main Sep 27, 2024
8 checks passed
@satrana42 satrana42 deleted the sat/drop_nulls_firstk branch September 27, 2024 16:39
satrana42 added a commit that referenced this pull request Oct 18, 2024
In this change, we update or add documentation for operator
changes done in #553, #556, #535, #560 and #578.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants