[SCHEMATIC-183] Update tests - Use magic mock and add parentId #1554

Merged
merged 13 commits into from
Dec 3, 2024
5 changes: 4 additions & 1 deletion schematic/store/synapse.py
@@ -705,7 +705,10 @@ def getFilesInStorageDataset(
ValueError: Dataset ID not found.
"""
file_list = []
Member Author

Work with Bryan to see the difference in speeds between dev branch and prod branch within signoz.

Collaborator

#1552 is necessary to filter out the differences between branches in GitHub runs. It's possible to get the data now; it's just a bit more difficult to filter. As of now, this is the average duration of this function plotted over time:

image

Collaborator

Feel free to pull develop into this feature branch and we can compare the develop branch to this feature branch performance.

Member Author

Thanks @BryanFauble. Done

Collaborator

@BryanFauble Nov 26, 2024

@thomasyu888 These are the results over the past 5 days. We can see your branch has a much better average execution time for this function, although it is similar to that of bwmac/SCHEMATIC-163/error-message-update for some reason.

image

We can select on a few fields to filter for what we want, average the duration of the function, then group by a few fields to get these results.
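For anyone who wants to reproduce the idea outside the dashboard, here is a minimal sketch of the same filter, group, and average steps in plain Python. The record layout, the field names (`branch`, `function`, `duration_ms`) and the numbers are invented for illustration; they are not SigNoz's schema or real measurements.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical span records. Field names and values are made up for
# illustration only; they are not real measurements.
spans = [
    {"branch": "develop", "function": "getFilesInStorageDataset", "duration_ms": 900.0},
    {"branch": "develop", "function": "getFilesInStorageDataset", "duration_ms": 1100.0},
    {"branch": "feature", "function": "getFilesInStorageDataset", "duration_ms": 400.0},
    {"branch": "feature", "function": "getFilesInStorageDataset", "duration_ms": 600.0},
    {"branch": "feature", "function": "query_fileview", "duration_ms": 50.0},
]


def average_duration_by_branch(records, function_name):
    """Filter to one function, group by branch, then average the duration."""
    grouped = defaultdict(list)
    for record in records:
        if record["function"] == function_name:  # filter step
            grouped[record["branch"]].append(record["duration_ms"])  # group step
    # aggregate step: one average per branch
    return {branch: mean(durations) for branch, durations in grouped.items()}


averages = average_duration_by_branch(spans, "getFilesInStorageDataset")
```

Grouping by more fields would only change the dictionary key, for example a `(branch, workflow)` tuple.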

Some other tests:
image

image


# HACK: must requery the fileview to get new files, since SynapseStorage will query the last state
# of the fileview which may not contain any new folders in the fileview.
# This is a workaround to fileviews not always containing the latest information
self.query_fileview(force_requery=True)
Member Author

Post in the team channel to discuss this hack; there's also a message in the #synapse channel for us to chime in on.

Collaborator

I wonder if there is something that we can do here with this API:

https://rest-docs.synapse.org/rest/POST/entity/id/table/query/async/start.html

Specifically:
"The last updated on date of the table (lastUpdatedOn) = 0x80"

If this gets updated as data is indexed into the table, we could fetch the lastUpdatedOn field before we query the table to know whether we need to re-query it.
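A rough sketch of that idea, assuming lastUpdatedOn really does change whenever the table is re-indexed (which is the open question here). `FileviewCache`, `fetch_last_updated_on` and `run_query` are hypothetical names; the two callables stand in for whatever client code would actually talk to Synapse, and only the `0x80` mask comes from the REST docs quoted above.

```python
from typing import Callable, Optional

# Part mask bit quoted from the REST docs for requesting lastUpdatedOn.
LAST_UPDATED_ON_MASK = 0x80


class FileviewCache:
    """Hypothetical helper: re-run the query only when lastUpdatedOn changes."""

    def __init__(
        self,
        fetch_last_updated_on: Callable[[int], str],
        run_query: Callable[[], list],
    ) -> None:
        # Both callables are injected, so no real Synapse call is made here.
        self._fetch_last_updated_on = fetch_last_updated_on
        self._run_query = run_query
        self._seen_last_updated_on: Optional[str] = None
        self._rows: Optional[list] = None

    def rows(self) -> list:
        current = self._fetch_last_updated_on(LAST_UPDATED_ON_MASK)
        if self._rows is None or current != self._seen_last_updated_on:
            # First use, or the table changed since the last query.
            self._rows = self._run_query()
            self._seen_last_updated_on = current
        return self._rows


# Usage with fakes: the timestamp changes only before the third call.
timestamps = iter(["t1", "t1", "t2"])
query_calls = []


def fake_fetch(part_mask: int) -> str:
    return next(timestamps)


def fake_query() -> list:
    query_calls.append("queried")
    return ["row"] * len(query_calls)


cache = FileviewCache(fake_fetch, fake_query)
first, second, third = cache.rows(), cache.rows(), cache.rows()
```

The cheap timestamp lookup still costs a round trip on every call, so this only pays off if the full fileview query is noticeably slower than that lookup.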

Member Author

@thomasyu888 Nov 26, 2024

Another thing to note: the failing test that motivated this hack fails because it falls outside the overall schematic flow.

For example, when people upload a bunch of files, they usually run manifest generate first. Manifest generate will hit a "this dataset doesn't exist" situation, but afterwards the fileview should, in theory, always contain the data. There have been incidents where it doesn't, but those are rare in the grand scheme of things.

So we could remove the hack and trigger the fileview indexing within the test, but that's probably best as a team decision. It also complicates things a bit, since it requires re-querying within the synapse storage context that is passed along to each test function.

Member Author

@thomasyu888 Dec 2, 2024

@GiaJordan After a team discussion, we decided that it would be better to modify the test to re-query the fileview, since this is such an edge case. The code already throws an error, and running it again outside of the test would work. It only runs less smoothly in the test because resources are created and destroyed.

FAILED tests/integration/test_metadata_model.py::TestMetadataModel::test_submit_filebased_manifest_file_and_entities_valid_manifest_submitted - LookupError: Dataset syn64313762 could not be found in fileview syn23643253.
FAILED tests/integration/test_metadata_model.py::TestMetadataModel::test_submit_filebased_manifest_file_and_entities_mock_filename - LookupError: Dataset syn64313765 could not be found in fileview syn23643253.

I commented out the HACK for now, and the CI will run. I also moved the HACK into the test: b8360cb
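As a rough illustration of what moving the re-query into the test can look like (this is not the contents of b8360cb), the sketch below forces a fileview refresh before the lookup. `list_files_with_fresh_fileview` is a hypothetical helper and the store is a `MagicMock`; only `query_fileview(force_requery=True)` and the `getFilesInStorageDataset` keyword arguments are taken from the diff in this PR.

```python
from unittest.mock import MagicMock, call


def list_files_with_fresh_fileview(store, dataset_id):
    """Hypothetical test helper: refresh the fileview, then list the files."""
    # Entities created moments ago by the test may not be in the cached
    # fileview yet, so force a re-query before looking them up.
    store.query_fileview(force_requery=True)
    return store.getFilesInStorageDataset(
        datasetId=dataset_id, fileNames=None, fullpath=True
    )


# A MagicMock stands in for the real storage object in this sketch.
store = MagicMock()
store.getFilesInStorageDataset.return_value = [("syn126", "parent_folder/test_file")]

result = list_files_with_fresh_fileview(store, "syn_mock")
```

Keeping the refresh in the test leaves the production path free of the extra query.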

Contributor

@thomasyu888 thanks for the information and the update!

# Get path to dataset folder by using childern to avoid cases where the dataset is the scope of the view
child_path = self.storageFileviewTable.loc[
self.storageFileviewTable["parentId"] == datasetId, "path"
72 changes: 33 additions & 39 deletions tests/test_store.py
@@ -11,7 +11,7 @@
import uuid
from contextlib import nullcontext as does_not_raise
from typing import Any, Callable, Generator
from unittest.mock import AsyncMock, patch
from unittest.mock import AsyncMock, MagicMock, patch

import pandas as pd
import pytest
@@ -464,49 +464,45 @@ def test_getDatasetProject(self, dataset_id, synapse_store):
(
True,
[
("syn126", "schematic - main/parent_folder/test_file"),
("syn126", "syn_mock", "schematic - main/parent_folder/test_file"),
(
"syn125",
"syn_mock",
"schematic - main/parent_folder/test_folder/test_file_2",
),
],
),
(False, [("syn126", "test_file"), ("syn125", "test_file_2")]),
(
False,
[
("syn126", "syn_mock", "test_file"),
("syn125", "syn_mock", "test_file_2"),
],
),
],
)
def test_getFilesInStorageDataset(self, synapse_store, full_path, expected):
mock_table_dataFrame_initial = pd.DataFrame(
{
"id": ["syn_mock"],
"path": ["schematic - main/parent_folder"],
}
)

mock_table_dataFrame_return = pd.DataFrame(
mock_table_dataframe_return = pd.DataFrame(
{
"id": ["syn126", "syn125"],
"parentId": ["syn_mock", "syn_mock"],
"path": [
"schematic - main/parent_folder/test_file",
"schematic - main/parent_folder/test_folder/test_file_2",
],
}
)
mock_table_return = build_table(
"Mock Table", "syn123", mock_table_dataFrame_return
)

with patch.object(synapse_store, "syn") as mocked_synapse_client:
with patch.object(
synapse_store, "storageFileviewTable"
) as mocked_fileview_table:
mocked_fileview_table.storageFileviewTable.return_value = (
mock_table_dataFrame_initial
)
mocked_synapse_client.tableQuery.return_value = mock_table_return
file_list = synapse_store.getFilesInStorageDataset(
datasetId="syn_mock", fileNames=None, fullpath=full_path
)
assert file_list == expected
with patch.object(
synapse_store, "storageFileviewTable", mock_table_dataframe_return
), patch.object(synapse_store, "query_fileview") as mocked_query:
# query_fileview is the function called to get the fileview
mocked_query.return_value = mock_table_dataframe_return

file_list = synapse_store.getFilesInStorageDataset(
datasetId="syn_mock", fileNames=None, fullpath=full_path
)
assert file_list == expected

@pytest.mark.parametrize(
"full_path",
@@ -516,27 +512,25 @@ def test_getFilesInStorageDataset(self, synapse_store, full_path, expected):
],
)
def test_get_files_in_storage_dataset_exception(self, synapse_store, full_path):
mock_table_dataFrame_initial = pd.DataFrame(
mock_table_dataframe_return = pd.DataFrame(
{
"id": ["child_syn_mock"],
"path": ["schematic - main/parent_folder/child_entity"],
"parentId": ["wrong_syn_mock"],
}
)
with patch.object(
synapse_store, "storageFileviewTable", mock_table_dataframe_return
), patch.object(synapse_store, "query_fileview") as mocked_query:
# query_fileview is the function called to get the fileview
mocked_query.return_value = mock_table_dataframe_return

with patch.object(synapse_store, "syn") as mocked_synapse_client:
with patch.object(
synapse_store, "storageFileviewTable"
) as mocked_fileview_table:
mocked_fileview_table.storageFileviewTable.return_value = (
mock_table_dataFrame_initial
with pytest.raises(
LookupError, match="Dataset syn_mock could not be found"
):
synapse_store.getFilesInStorageDataset(
datasetId="syn_mock", fileNames=None, fullpath=full_path
)
with pytest.raises(
LookupError, match="Dataset syn_mock could not be found"
):
file_list = synapse_store.getFilesInStorageDataset(
datasetId="syn_mock", fileNames=None, fullpath=full_path
)

@pytest.mark.parametrize("downloadFile", [True, False])
def test_getDatasetManifest(self, synapse_store, downloadFile):
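For readers skimming the test changes, the updated tests pass a concrete object as the third argument to `patch.object`, so the code under test sees real data with a `parentId` column rather than an auto-created mock. The sketch below shows that pattern in isolation. It is a simplified stand-in, not the schematic code: `Store` and `files_in_dataset` are hypothetical, and a list of dicts replaces the pandas DataFrame to keep the example dependency-free.

```python
from unittest.mock import patch


class Store:
    """Hypothetical stand-in for the storage class under test."""

    def __init__(self):
        self.storageFileviewTable = []

    def query_fileview(self, force_requery=False):
        raise RuntimeError("would hit the network")

    def files_in_dataset(self, dataset_id):
        # Refresh first, then select the children of the dataset by parentId.
        self.query_fileview(force_requery=True)
        return [
            (row["id"], row["path"])
            for row in self.storageFileviewTable
            if row["parentId"] == dataset_id
        ]


rows = [
    {"id": "syn126", "parentId": "syn_mock", "path": "parent_folder/test_file"},
    {"id": "syn125", "parentId": "other", "path": "elsewhere/test_file_2"},
]
store = Store()

# Third positional argument: the attribute is swapped for real data.
# Without it, patch.object installs a MagicMock, as it does for query_fileview.
with patch.object(store, "storageFileviewTable", rows), patch.object(
    store, "query_fileview"
) as mocked_query:
    found = store.files_in_dataset("syn_mock")
```

Both patches are undone when the `with` block exits, so `store.storageFileviewTable` goes back to its original empty list.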