
Updating to Mephisto 1.0 #4426

Merged: 28 commits into main from mephisto-1.0 on Mar 30, 2022

Conversation

@JackUrb (Contributor) commented Mar 16, 2022

Patch description
First set of steps to get the crowdsourcing tests running (no longer breaking the newer Mephisto conventions), but still not passing. More work is to be done there; leaving this open as a starting point for others to make comments and take over.

(@EricMichaelSmith : all crowdsourcing tests are passing now as of March 30th)

Testing steps

pytest tests/crowdsourcing

Review thread on these lines in the diff:

    max_num_tries = 6
    mock_worker_registration_name = f"MOCK_WORKER_{idx:d}"
    mock_worker_name = f"{mock_worker_registration_name}_sandbox"
    max_num_tries = 3
@JackUrb (Contributor, Author):

As an aside, retries should no longer be necessary with assert_sandbox_worker_created and await_channel_requests, which run the async loop until pending things are processed.
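
As a rough, standalone sketch of the idea behind those helpers (toy code under assumed names, not the actual ParlAI/Mephisto implementations): instead of polling an assertion inside a max_num_tries/sleep loop, drive the event loop until the pending registration work has been processed, then assert once.

    import asyncio

    registered_workers = set()

    async def register_worker(name: str) -> None:
        # Stand-in for an asynchronous registration request on the channel.
        await asyncio.sleep(0)
        registered_workers.add(name)

    async def main() -> None:
        # Kick off the registration, then await it instead of sleeping and retrying.
        await asyncio.gather(register_worker("MOCK_WORKER_0_sandbox"))
        # A single assertion, with no retry loop needed.
        assert "MOCK_WORKER_0_sandbox" in registered_workers

    asyncio.run(main())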

Reply (Contributor):

Just removed this retry loop without breaking the tests

@EricMichaelSmith (Contributor) commented:

@JackUrb Most of the crowdsourcing tests don't seem to be running currently due to ImportErrors - disabling the try/except blocks around them to see what's going on

@EricMichaelSmith (Contributor) left a comment:

Thanks for this PR and for doing these refactors - yeah, it looks like it's useful to have a few pieces of boilerplate code abstracted away, and not needing to do retries would help. I'm trying to get all 43 crowdsourcing tests to run to get a better sense of what needs to be done with this.

Resolved review threads (now outdated): parlai/crowdsourcing/utils/tests.py, tests/crowdsourcing/tasks/test_chat_demo.py
@EricMichaelSmith (Contributor) commented:

Okay, I got all 43 crowdsourcing checks to run so that we can debug them. @JackUrb 27 of them, the Fast-Acute ones, are currently failing with a "No live runs present" error due to no LiveTaskRuns being found when calling Operator.get_running_task_runs() within ParlAI's AbstractCrowdsourcingTest._get_live_run(). Do you know what might cause no task runs to be found?
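
For context, a minimal sketch of what a helper like AbstractCrowdsourcingTest._get_live_run() presumably does (the shape here is assumed, not copied from the ParlAI code; in particular, get_running_task_runs() is assumed to return a mapping of run ids to live runs):

    def _get_live_run(operator):
        # Ask Mephisto's Operator which task runs it currently considers live.
        live_runs = operator.get_running_task_runs()
        if len(live_runs) == 0:
            # If the blueprint failed to launch, nothing is running and we end up here.
            raise Exception('No live runs present')
        # The tests launch a single task, so return the one live run.
        return list(live_runs.values())[0]

So a blueprint that errors out at launch time would leave that mapping empty, which matches the "No live runs present" failures on the Fast-Acute tests.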

@JackUrb (Contributor, Author) commented Mar 17, 2022

Hadn't noticed that the blueprint was broken under the hood, leading to a blueprint launch error (and thus no running tasks); I can make a quick change for this. (Running locally identified the issue in the logs.)

@EricMichaelSmith (Contributor) commented Mar 17, 2022

> Hadn't noticed that the blueprint was broken under the hood, leading to a blueprint launch error (and thus no running tasks); I can make a quick change for this. (Running locally identified the issue in the logs.)

Great, thanks! Hmm, now I'm seeing an AgentTimeoutError in the CI check logs when cleaning up the test unit: https://app.circleci.com/pipelines/github/facebookresearch/ParlAI/11125/workflows/8f87deb8-70aa-47d3-86a7-dd9b474a900b/jobs/91214?invite=true#step-111-5068
The .wait() method of the agent here seems to be preventing the Fast-Acute checks from completing after erroring out.

@JackUrb (Contributor, Author) commented Mar 17, 2022

That would imply to me that the unit was still running when it was shut down, and thus the shutdown waited for the timeout. You may need to examine this locally to see what was and wasn't coming through the agent, as I'm unclear why this happened from the given info.

@EricMichaelSmith (Contributor) commented:

> That would imply to me that the unit was still running when it was shut down, and thus the shutdown waited for the timeout. You may need to examine this locally to see what was and wasn't coming through the agent, as I'm unclear why this happened from the given info.

Hmm, on my devfair, it looks like this issue with the unit being left hanging came from a ValueError due to the test unit not being registered in the data browser correctly:

self = <parlai.crowdsourcing.tasks.acute_eval.analysis.AcuteAnalyzer object at 0x7fc491f6e670>

    def _extract_to_dataframe(self) -> pd.DataFrame:
        """
        Extract the data from the run to a pandas dataframe.
        """
        units = self.mephisto_data_browser.get_units_for_task_name(self.run_id)
        responses: List[Dict[str, Any]] = []
        for unit in units:
            unit_details = self._parse_unit(unit)
            if unit_details is None:
                continue
            for idx in range(len(unit_details['data'])):
                response = self._extract_response_by_index(unit_details, idx)
                if response is not None:
                    responses.append(response)

        if len(responses) == 0:
>           raise ValueError('No valid results found!')
E           ValueError: No valid results found!

parlai/crowdsourcing/tasks/acute_eval/analysis.py:251: ValueError

So I suppose the question now is (1) whether the unit got saved in the data browser correctly, and if so, (2) why it's not being loaded back in

@JackUrb (Contributor, Author) commented Mar 17, 2022

> So I suppose the question now is (1) whether the unit got saved in the data browser correctly, and if so, (2) why it's not being loaded back in

My bet is that the unit has still not been marked as completed, which would happen in another thread (and I imagine that if this script is launched before the unit is completed, you won't get result data). I expect this is more likely (1) than (2). You'd likely want to dig into the TaskRunner to be sure that your TaskRunner's run_unit function completes.

Actually, this is likely it: we've changed the semantics so that live acts are now distinct from task submission. See the new StaticTaskRunner:

    def run_unit(self, unit: "Unit", agent: "Agent") -> None:
        """
        Static runners will get the task data, send it to the user, then
        wait for the agent to act (the data to be completed)
        """
        agent.await_submit(self.assignment_duration_in_seconds)
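
To make the implication for the tests concrete, here is a self-contained toy illustration of those await-submit semantics (toy classes and names, not Mephisto's real Agent API): if nothing ever submits, the runner blocks until the assignment duration elapses, the unit is never marked complete, and the analyzer later finds no valid results.

    import threading

    class ToyAgent:
        # Toy stand-in for an agent with submit-style semantics.
        def __init__(self):
            self._submitted = threading.Event()
            self.final_data = None

        def handle_submit(self, data):
            # Called when the (mock) worker submits the finished task.
            self.final_data = data
            self._submitted.set()

        def await_submit(self, timeout_s):
            # Blocks until a submission arrives or the assignment duration runs out.
            return self._submitted.wait(timeout=timeout_s)

    agent = ToyAgent()
    # In a test, something has to drive the submission; otherwise await_submit()
    # just waits out the timeout and the unit is left incomplete.
    threading.Timer(0.1, agent.handle_submit, args=({'choice': 'model_a'},)).start()
    assert agent.await_submit(timeout_s=2.0)
    assert agent.final_data is not None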

Resolved review thread (now outdated): .circleci/config.yml
@EricMichaelSmith marked this pull request as ready for review on March 30, 2022 at 13:43
@EricMichaelSmith (Contributor) left a comment:

All crowdsourcing tests seem to be passing now. No remaining issues that I can see.

@JackUrb merged commit c946fb3 into main on Mar 30, 2022
@JackUrb deleted the mephisto-1.0 branch on March 30, 2022 at 14:06