-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scatter table refactoring #1319
Scatter table refactoring #1319
Conversation
- Remove context of fit model from data name and index; model_name -> name, model_id -> class_id - Remove extra metadata from the scatter table
7ecee38
to
8db9df9
Compare
data_subfif_map becomes a single source of the data name. This means data is classified without notion of the fit model. Index mapping is made in the formatter function through the data-model name identity.
Sorry @wshanks if you already started review. I added new commit 72a5e6a. The context of the commit is the following; In the first commit in the PR, I understand that the point of the discussion is the (implicit) index matching between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my in-line comments, I had one question that might require action and one typo fix. Besides that, I think all the changes here are good. My one hesitation with merging this PR as is though is documentation. For one thing, ScatterTable
is not in the built documentation at all. From a comment in the ScatterTable
code, I wonder if we could add the pandas .inv url to docs/conf.py? Beyond that, the column names are not documented. In particular, I think the curve analysis documentation (maybe curve-analysis-workflow
?) should describe category
, name
, and class_id
. The doc-strings for _run_data_processing
and _format_data
could also be clearer about what they do. It is hard to understand how the workflow manipulates the data in order to know how to make a customization like #1243 does with the current documentation.
source[idx]["shots"] = datum.get("shots", -1) | ||
|
||
# Assign entry name and class id | ||
# Enumerate starts at 1 so that unclassified data becomes class_id = 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should unclassified data be allowed at this point? Previously, an exception was raised for unclassified data. Maybe the comment should give an example of why data could remain unclassified.
It is not a big deal but one side effect of allowing unclassified data as class 0 is that the formatted data classes are offset by 1 from the raw data classes in the default case, which might be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added release note about this behavior change and replaced class id with null value in 4aa8505. The scatter table can keep every user data from the experiment (at least in the "raw" category, as this doesn't need to be a part of fitting). So I believe this behavior is more convenient for users, especially when they rerun analysis from scatter table stored in the artifact.
averaged["xval"] = xv | ||
averaged["yval"] = avg_yval | ||
averaged["yerr"] = avg_yerr | ||
averaged["model_id"] = mid | ||
averaged["name"] = g_dict["name"][0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at this part for a while trying to understand. It seemed like there was a split here with the name preserving the class name but the id changing from the class to the model, until I realized that the model id only gets set if the class name matches the model name. I am not sure much can help with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suggest removing name field if nothing matches? This could be a bit confusing as you say, but this requires Stark analysis to overwrite the entire method just to keep the data name for visualization of formatted data. Another solution would be re-adding separate model_name
and model_id
field, but for almost all analyses this information is duplicated and may confuse the users.
releasenotes/notes/add-dataframe-curve-data-a8905c450748b281.yaml
Outdated
Show resolved
Hide resolved
…gned data with null value.
9dca9fc
to
1f10cc2
Compare
Thanks Will. I updated tutorial in 61ebcd9. In #1253 @coruscating also tried to find an approach for class doc rendering, and we also consulted with Qiskit core members, but no one cannot find valid solution. Likely subclassing a class in the external package is not good direction in terms of documentation (we should to contribute to the package instead). |
@wshanks The docs issue was that by default, all attributes inherited from pandas were rendered and trying to parse the pandas docstrings in our build fails with a lot of warnings. We already render externally inherited attributes like in https://qiskit.org/ecosystem/experiments/dev/stubs/qiskit_experiments.framework.ExperimentEncoder.html, but ideally we shouldn't. |
Sorry for my late comment. I think it's worth considering separation of the implementation (Pandas Dataframe) and the interface (ScatterTable) not only for fixing the issue on the doc generation but also for making the future implementation change easier (since we have no good control on the 3rd party software even though I know Pandas is a very stable and reliable library). Also, for example ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good with merging this now and @nkanazawa1989 following up with another PR to alter the ScatterTable interface as suggested by @itoko.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. I read through the latest changes and I think everything looks good code-wise. I had final suggestions for the wording.
I like the documentation additions.
releasenotes/notes/add-dataframe-curve-data-a8905c450748b281.yaml
Outdated
Show resolved
Hide resolved
Should we make a new issue for @itoko's point? Or will you just start with a PR @nkanazawa1989? One thing similar to @itoko's point that I wanted to raise is that I turned one all warnings while trying qiskit 1.0 with qiskit-experiments and I noticed that our tests generate a lot of |
Co-authored-by: Will Shanks <willshanks@us.ibm.com>
I plan to start with PR, since we already have some consensus and no further discussion is needed. Regarding the warning, I think we just need to remove these pending warnings. With @itoko's suggestion of composition (forwarding), we need to wrap the data frame with Qiskit Experiments classes instead of inheriting from pandas object. This means |
### Summary This PR modifies `ScatterTable` which is introduced in qiskit-community#1253. This change resolves some code issues in qiskit-community#1315 and qiskit-community#1243. ### Details and comments In the original design `ScatterTable` is tied to the fit models, and the columns contains `model_name` (str) and `model_id` (int). Also the fit module only allows to have three categorical data; "processed", "formatted", "fitted". However, qiskit-community#1243 breaks this assumption, namely, the `StarkRamseyXYAmpScanAnalysis` fitter defines two fit models which are not directly mapped to the results data. The data fed into the models is synthesized by consuming the input results data. The fitter needs to manage four categorical data; "raw", "ramsey" (raw results), "phase" (synthesized data for fit), and "fitted". This PR relaxes the tight coupling of data to the fit model. In above example, "raw" and "ramsey" category data can fill new fields `name` (formally model_name) and `class_id` (model_id) without indicating a particular fit model. Usually, raw category data is just classified according to the `data_subfit_map` definition, and the map doesn't need to match with the fit models. The connection to fit models is only introduced in a particular category defined by new option value `fit_category`. This option defaults to "formatted", but `StarkRamseyXYAmpScanAnalysis` fitter would set "phase" instead. Thus fit model assignment is effectively delayed until the formatter function. Also the original scatter table is designed to store all circuit metadata which causes some problem in data formatting, especially when it tries to average the data over the same x value in the group. Non-numeric data is averaged by builtin set operation, but this assumes the metadata value is hashable object, which is not generally true. This PR also drops all metadata from the scatter table. Note that important metadata fields for the curve analysis are one used for model classification (classifier fields), and other fields just decorate the table with unnecessary memory footprint requirements. The classifier fields and `name` (`class_id`) are sort of duplicated information. This implies the `name` and `class_id` fields are enough for end-users to reuse the table data for further analysis once after it's saved as an artifact. --------- Co-authored-by: Will Shanks <willshanks@us.ibm.com>
### Summary This PR updates the implementation of `ScatterTable` and `AnalysisResultTable` based on the [comment](#1319 (comment)) from @itoko . ### Details and comments Current pattern heavily uses inheritance; `Table(DataFrame, MixIn)`, but this causes several problems. Qiskit Experiments class directly depends on the third party library, resulting in Sphinx directive mismatch and poor robustness of the API. Instead of using inheritance, these classes are refactored with composition and delegation, namely ```python class Table: def __init__(self): self._data = DataFrame(...) ``` this pattern is also common in other software libraries using dataframe. Since this PR removes unreleased public classes, this should be merged before the release. Although this updates many files, these are just delegation of data handling logic to the class itself, which simplifies the implantation of classes that operate the container objects. Also new pattern allows more strict dtype management with dataframe. --------- Co-authored-by: Will Shanks <wshaos@posteo.net>
Summary
This PR modifies
ScatterTable
which is introduced in #1253.This change resolves some code issues in #1315 and #1243.
Details and comments
In the original design
ScatterTable
is tied to the fit models, and the columns containsmodel_name
(str) andmodel_id
(int). Also the fit module only allows to have three categorical data; "processed", "formatted", "fitted". However, #1243 breaks this assumption, namely, theStarkRamseyXYAmpScanAnalysis
fitter defines two fit models which are not directly mapped to the results data. The data fed into the models is synthesized by consuming the input results data. The fitter needs to manage four categorical data; "raw", "ramsey" (raw results), "phase" (synthesized data for fit), and "fitted".This PR relaxes the tight coupling of data to the fit model. In above example, "raw" and "ramsey" category data can fill new fields
name
(formally model_name) andclass_id
(model_id) without indicating a particular fit model. Usually, raw category data is just classified according to thedata_subfit_map
definition, and the map doesn't need to match with the fit models. The connection to fit models is only introduced in a particular category defined by new option valuefit_category
. This option defaults to "formatted", butStarkRamseyXYAmpScanAnalysis
fitter would set "phase" instead. Thus fit model assignment is effectively delayed until the formatter function.Also the original scatter table is designed to store all circuit metadata which causes some problem in data formatting, especially when it tries to average the data over the same x value in the group. Non-numeric data is averaged by builtin set operation, but this assumes the metadata value is hashable object, which is not generally true. This PR also drops all metadata from the scatter table. Note that important metadata fields for the curve analysis are one used for model classification (classifier fields), and other fields just decorate the table with unnecessary memory footprint requirements. The classifier fields and
name
(class_id
) are sort of duplicated information. This implies thename
andclass_id
fields are enough for end-users to reuse the table data for further analysis once after it's saved as an artifact.