Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scatter table refactoring #1319

Merged

Conversation

nkanazawa1989
Copy link
Collaborator

Summary

This PR modifies ScatterTable which is introduced in #1253.

This change resolves some code issues in #1315 and #1243.

Details and comments

In the original design ScatterTable is tied to the fit models, and the columns contains model_name (str) and model_id (int). Also the fit module only allows to have three categorical data; "processed", "formatted", "fitted". However, #1243 breaks this assumption, namely, the StarkRamseyXYAmpScanAnalysis fitter defines two fit models which are not directly mapped to the results data. The data fed into the models is synthesized by consuming the input results data. The fitter needs to manage four categorical data; "raw", "ramsey" (raw results), "phase" (synthesized data for fit), and "fitted".

This PR relaxes the tight coupling of data to the fit model. In above example, "raw" and "ramsey" category data can fill new fields name (formally model_name) and class_id (model_id) without indicating a particular fit model. Usually, raw category data is just classified according to the data_subfit_map definition, and the map doesn't need to match with the fit models. The connection to fit models is only introduced in a particular category defined by new option value fit_category. This option defaults to "formatted", but StarkRamseyXYAmpScanAnalysis fitter would set "phase" instead. Thus fit model assignment is effectively delayed until the formatter function.

Also the original scatter table is designed to store all circuit metadata which causes some problem in data formatting, especially when it tries to average the data over the same x value in the group. Non-numeric data is averaged by builtin set operation, but this assumes the metadata value is hashable object, which is not generally true. This PR also drops all metadata from the scatter table. Note that important metadata fields for the curve analysis are one used for model classification (classifier fields), and other fields just decorate the table with unnecessary memory footprint requirements. The classifier fields and name (class_id) are sort of duplicated information. This implies the name and class_id fields are enough for end-users to reuse the table data for further analysis once after it's saved as an artifact.

- Remove context of fit model from data name and index; model_name -> name, model_id -> class_id
- Remove extra metadata from the scatter table
@wshanks wshanks added this to the Release 0.6 milestone Nov 15, 2023
data_subfif_map becomes a single source of the data name. This means data is classified without notion of the fit model. Index mapping is made in the formatter function through the data-model name identity.
@nkanazawa1989
Copy link
Collaborator Author

nkanazawa1989 commented Nov 16, 2023

Sorry @wshanks if you already started review. I added new commit 72a5e6a. The context of the commit is the following; In the first commit in the PR, _run_data_processing() still classifies the data based on the model names, which introduces hidden coupling to the fit model although "raw" category data should be unaware of it. In the second commit, I made the change so that the function classifies the data based on the data_subfit_map, and then the data index to model index mapping is done in the _format_data() function.

I understand that the point of the discussion is the (implicit) index matching between self._models and data_subfit_map key. In 72a5e6a the index matching is explicitly considered by the data-model name consistency.

nkanazawa1989 added a commit to nkanazawa1989/qiskit-experiments that referenced this pull request Nov 16, 2023
Copy link
Collaborator

@wshanks wshanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my in-line comments, I had one question that might require action and one typo fix. Besides that, I think all the changes here are good. My one hesitation with merging this PR as is though is documentation. For one thing, ScatterTable is not in the built documentation at all. From a comment in the ScatterTable code, I wonder if we could add the pandas .inv url to docs/conf.py? Beyond that, the column names are not documented. In particular, I think the curve analysis documentation (maybe curve-analysis-workflow?) should describe category, name, and class_id. The doc-strings for _run_data_processing and _format_data could also be clearer about what they do. It is hard to understand how the workflow manipulates the data in order to know how to make a customization like #1243 does with the current documentation.

source[idx]["shots"] = datum.get("shots", -1)

# Assign entry name and class id
# Enumerate starts at 1 so that unclassified data becomes class_id = 0.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should unclassified data be allowed at this point? Previously, an exception was raised for unclassified data. Maybe the comment should give an example of why data could remain unclassified.

It is not a big deal but one side effect of allowing unclassified data as class 0 is that the formatted data classes are offset by 1 from the raw data classes in the default case, which might be confusing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added release note about this behavior change and replaced class id with null value in 4aa8505. The scatter table can keep every user data from the experiment (at least in the "raw" category, as this doesn't need to be a part of fitting). So I believe this behavior is more convenient for users, especially when they rerun analysis from scatter table stored in the artifact.

qiskit_experiments/curve_analysis/curve_analysis.py Outdated Show resolved Hide resolved
qiskit_experiments/curve_analysis/curve_analysis.py Outdated Show resolved Hide resolved
averaged["xval"] = xv
averaged["yval"] = avg_yval
averaged["yerr"] = avg_yerr
averaged["model_id"] = mid
averaged["name"] = g_dict["name"][0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at this part for a while trying to understand. It seemed like there was a split here with the name preserving the class name but the id changing from the class to the model, until I realized that the model id only gets set if the class name matches the model name. I am not sure much can help with that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you suggest removing name field if nothing matches? This could be a bit confusing as you say, but this requires Stark analysis to overwrite the entire method just to keep the data name for visualization of formatted data. Another solution would be re-adding separate model_name and model_id field, but for almost all analyses this information is duplicated and may confuse the users.

@nkanazawa1989
Copy link
Collaborator Author

Thanks Will. I updated tutorial in 61ebcd9. In #1253 @coruscating also tried to find an approach for class doc rendering, and we also consulted with Qiskit core members, but no one cannot find valid solution. Likely subclassing a class in the external package is not good direction in terms of documentation (we should to contribute to the package instead).

@coruscating
Copy link
Collaborator

@wshanks The docs issue was that by default, all attributes inherited from pandas were rendered and trying to parse the pandas docstrings in our build fails with a lot of warnings. We already render externally inherited attributes like in https://qiskit.org/ecosystem/experiments/dev/stubs/qiskit_experiments.framework.ExperimentEncoder.html, but ideally we shouldn't. "inherited-members": None is already set in the autodoc options, which should work in theory but doesn't.

@itoko
Copy link
Contributor

itoko commented Nov 27, 2023

Sorry for my late comment. I think it's worth considering separation of the implementation (Pandas Dataframe) and the interface (ScatterTable) not only for fixing the issue on the doc generation but also for making the future implementation change easier (since we have no good control on the 3rd party software even though I know Pandas is a very stable and reliable library). Also, for example (data: ScatterTable), I as a developer would like to write data.raw_xval instead of data[data.category == "raw"].xval.to_numpy(). For Pandas users, we may have data.to_dataframe() function. I think forwarding is more suitable than inheritance for the implementation of ScatterTable. I'm expecting the overhead for the wrapping would be acceptable (sufficiently small). What do you think? @nkanazawa1989 @wshanks @coruscating

Copy link
Collaborator

@coruscating coruscating left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with merging this now and @nkanazawa1989 following up with another PR to alter the ScatterTable interface as suggested by @itoko.

Copy link
Collaborator

@wshanks wshanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. I read through the latest changes and I think everything looks good code-wise. I had final suggestions for the wording.

I like the documentation additions.

docs/tutorials/images/curve_analysis_structure.png Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
docs/tutorials/curve_analysis.rst Outdated Show resolved Hide resolved
@wshanks
Copy link
Collaborator

wshanks commented Dec 13, 2023

Should we make a new issue for @itoko's point? Or will you just start with a PR @nkanazawa1989? One thing similar to @itoko's point that I wanted to raise is that I turned one all warnings while trying qiskit 1.0 with qiskit-experiments and I noticed that our tests generate a lot of PendingDeprecationWarnings for the old CurveData methods like x, y, and y_err. Maybe adding convenience methods like data.raw_xval would allow us to update the analysis classes to stop generating deprecation warnings internally.

nkanazawa1989 and others added 2 commits December 22, 2023 10:18
Co-authored-by: Will Shanks <willshanks@us.ibm.com>
@nkanazawa1989
Copy link
Collaborator Author

I plan to start with PR, since we already have some consensus and no further discussion is needed. Regarding the warning, I think we just need to remove these pending warnings. With @itoko's suggestion of composition (forwarding), we need to wrap the data frame with Qiskit Experiments classes instead of inheriting from pandas object. This means .x doesn't automatically refers to the x column of the data frame. So we want to keep the method as a convenient accessor.

@nkanazawa1989 nkanazawa1989 added this pull request to the merge queue Dec 22, 2023
Merged via the queue into qiskit-community:main with commit 5bb1fb4 Dec 22, 2023
11 checks passed
@nkanazawa1989 nkanazawa1989 deleted the update-scatter-table branch December 22, 2023 04:04
nkanazawa1989 added a commit to nkanazawa1989/qiskit-experiments that referenced this pull request Jan 17, 2024
### Summary

This PR modifies `ScatterTable` which is introduced in qiskit-community#1253.

This change resolves some code issues in qiskit-community#1315 and qiskit-community#1243.

### Details and comments

In the original design `ScatterTable` is tied to the fit models, and the
columns contains `model_name` (str) and `model_id` (int). Also the fit
module only allows to have three categorical data; "processed",
"formatted", "fitted". However, qiskit-community#1243 breaks this assumption, namely,
the `StarkRamseyXYAmpScanAnalysis` fitter defines two fit models which
are not directly mapped to the results data. The data fed into the
models is synthesized by consuming the input results data. The fitter
needs to manage four categorical data; "raw", "ramsey" (raw results),
"phase" (synthesized data for fit), and "fitted".

This PR relaxes the tight coupling of data to the fit model. In above
example, "raw" and "ramsey" category data can fill new fields `name`
(formally model_name) and `class_id` (model_id) without indicating a
particular fit model. Usually, raw category data is just classified
according to the `data_subfit_map` definition, and the map doesn't need
to match with the fit models. The connection to fit models is only
introduced in a particular category defined by new option value
`fit_category`. This option defaults to "formatted", but
`StarkRamseyXYAmpScanAnalysis` fitter would set "phase" instead. Thus
fit model assignment is effectively delayed until the formatter
function.

Also the original scatter table is designed to store all circuit
metadata which causes some problem in data formatting, especially when
it tries to average the data over the same x value in the group.
Non-numeric data is averaged by builtin set operation, but this assumes
the metadata value is hashable object, which is not generally true. This
PR also drops all metadata from the scatter table. Note that important
metadata fields for the curve analysis are one used for model
classification (classifier fields), and other fields just decorate the
table with unnecessary memory footprint requirements. The classifier
fields and `name` (`class_id`) are sort of duplicated information. This
implies the `name` and `class_id` fields are enough for end-users to
reuse the table data for further analysis once after it's saved as an
artifact.

---------

Co-authored-by: Will Shanks <willshanks@us.ibm.com>
github-merge-queue bot pushed a commit that referenced this pull request Feb 6, 2024
### Summary

This PR updates the implementation of `ScatterTable` and
`AnalysisResultTable` based on the
[comment](#1319 (comment))
from @itoko .

### Details and comments

Current pattern heavily uses inheritance; `Table(DataFrame, MixIn)`, but
this causes several problems. Qiskit Experiments class directly depends
on the third party library, resulting in Sphinx directive mismatch and
poor robustness of the API. Instead of using inheritance, these classes
are refactored with composition and delegation, namely
```python
class Table:
    def __init__(self):
        self._data = DataFrame(...)
```
this pattern is also common in other software libraries using dataframe.
Since this PR removes unreleased public classes, this should be merged
before the release.

Although this updates many files, these are just delegation of data
handling logic to the class itself, which simplifies the implantation of
classes that operate the container objects. Also new pattern allows more
strict dtype management with dataframe.

---------

Co-authored-by: Will Shanks <wshaos@posteo.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants