Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce dataframe to ExperimentData (step1) #1133

Merged

Conversation

nkanazawa1989
Copy link
Collaborator

@nkanazawa1989 nkanazawa1989 commented Apr 10, 2023

Summary

Executing approved design proposal in https://github.com/Qiskit/rfcs/blob/master/0007-experiment-dataframe.md. Introduction of the Artifact will be done in the followup.

This PR introduces dataframe to ExperimentData and replaces the conventional AnalysisResult with it. With this PR, experimentalist can visualize all analysis results with a single html table by dataframe=True.

image

Once can control the columns to show in the html table with columns option.

Details and comments

Experiment analysis is a local operation on client side, and data generated there must be agnostic to the payload format of remote service. In the current implementation, AnalysisResult appears in the BaseAnalysis and the data generated by the analysis is implicitly typecasted to this class. Note that AnalysisResult not only defines canonical data format but also implements the API for the IBM Experiment service. This tight coupling to the IBM Experiment service may limit data expressibility and an analysis class author always need to be aware of the data structure of the remote service.

In this PR, internal data structure is replaced with the flat dataframe, and one should be able to add arbitrary key-value pair to the analysis result without concerning about data model. Added key creates a new column in the table, which must be internally formatted and absorbed by extra field of the payload at a later time when these results are saved. This is implemented by qiskit_experiments.framework.experiment_data._series_to_service_result. Bypassing the generation of AnalysisResult object also drastically improves the performance of heavily parallelized analysis that generates huge amount of data.

raise AttributeError(f"'ThreadSafeDataFrame' object has no attribute '{item}'")

def __json_encode__(self) -> Dict[str, Any]:
return {
Copy link
Collaborator Author

@nkanazawa1989 nkanazawa1989 Apr 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire dataframe is JSON serializable with the experiment encoder/decoder. So you can call the save API for the batch experiment results in one piece. This drastically improves the data saving performance once experiment service updates its API.

Copy link
Collaborator

@coruscating coruscating left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nice work, here's my first round of comments. I have some general thoughts as well:

  • I think it would be helpful to add a label for the short analysis ID index column, otherwise it may look like a random string to a new user.
  • Should the display style for CurveFitResults be updated? Right now the newlines are showing as \n literals in the dataframe. If we keep the string representation the same, the default dataframe display style can be updated to render the newspaces (but then we should truncate the output or the cell will be very long).
  • This may be outside the scope of this PR, but when I run a BatchExperiment of ParalleExperiments of T1s, the results are difficult to distinguish from each other because the experiment IDs are all the same. What do you think about including an extra field that has the experiment ID of the innermost child experiment?
    image


if results:
for result in results:
supplementary = result.extra
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, if the user sets one of these fields to a different value in the extra parameters, for example by adding experiment=my_experiment, this will override the default value. Should we implement a check to disallow this since these should be reserved fields?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional and composite analysis requires this logic to update entry experiment name. Since we should return AnalysisResultData from the child experiment data, it updates experiment value through extra field. Alternatively, we can completely remove child data generation in the composite analysis, and generate them on the fly when user access ExperimentData.child_data for backward compatibility. This drastically simplifies the code stack, but should be done in another PR because I don't want to add multiple logical changes in a single PR.

qiskit_experiments/framework/experiment_data.py Outdated Show resolved Hide resolved
qiskit_experiments/framework/experiment_data.py Outdated Show resolved Hide resolved
qiskit_experiments/framework/experiment_data.py Outdated Show resolved Hide resolved
package_name="qiskit-experiments",
pending=True,
predicate=lambda dataframe: not dataframe,
)
def analysis_results(
self,
index: Optional[Union[int, slice, str]] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we planning on keeping index in the future? The user can retrieve specific rows from the dataframe themselves, though the syntax to retrieve from the dataframe by name would be more complicated than using index, so maybe it's good to keep from the user's perspective.

Copy link
Collaborator Author

@nkanazawa1989 nkanazawa1989 May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that what I discussed with Chris. We concluded to keep index, just as a syntactic sugar. Probably we can deprecate it if users are already familiar with the handling of dataframe, e.g.

df = exp_data.analysis_results()
df[df.name=="T1"]

I don't think this is complicated and we can remove hundreds lines of code by removing index. Let's set user feedback session after 0.6.

qiskit_experiments/framework/experiment_data.py Outdated Show resolved Hide resolved
@nkanazawa1989 nkanazawa1989 force-pushed the feature/dataframe-pr1 branch from c073705 to a6a0eab Compare May 12, 2023 06:25
Copy link
Contributor

@itoko itoko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @nkanazawa1989 This is the first step of great upgrades in experiment result data handling. The design doc is written well and it helps me a lot to understand the background of this PR.

  • Why don't we add a new method like add_analysis_result (without s) and deprecate add_analysis_results instead of upgrading add_analysis_results? (The new API seems to support only addition of single analysis result and have completely different arguments.)

  • In my understanding, result_id is a unique key of the table. Do we really need to have index as a yet-another key? (Note: I'm talking about index in pd.DataFrame not index argument of analysis_results method.)

This PR changes the data structure of analysis results from list of dictionaries (JSON) to a flat table (pandas DataFrame). Table data would be easier to manipulate in most cases. It must increase the (re)usability of analysis result data very much. Instead, information on the structure of experiments will be dropped from AnalysisResult expecting it is not usually used during analysis. Also the information is still reconstructable from other data stored in an ExperimentData (is it correct? @nkanazawa1989 ) if necessary.

I also add several minor comments inline.

qiskit_experiments/framework/experiment_data.py Outdated Show resolved Hide resolved
Comment on lines 350 to 378
# Convert Dataframe Series back into AnalysisResultData
# This is due to limitation that _run_analysis must return List[AnalysisResultData],
# and some composite analysis such as TphiAnalysis overrides this method to
# return extra quantity computed from sub analysis results.
# This produces unnecessary data conversion.
# The _run_analysis mechanism seems just complicating the entire logic.
# Since it's impossible to deprecate the usage of this protected method,
# we should implement new CompositeAnalysis class with much more efficient
# internal logic. Note that the child data structure is no longer necessary
# because dataframe offers more efficient data filtering mechanisms.
analysis_table = sub_expdata.analysis_results(verbosity=3, dataframe=True)
for _, series in analysis_table.iterrows():
data_dict = series.to_dict()
primary_info = {
"name": data_dict.pop("name"),
"value": data_dict.pop("value"),
"quality": data_dict.pop("quality"),
"device_components": data_dict.pop("components"),
}
analysis_results.append(result)
chisq = data_dict.pop("chisq", np.nan)
if chisq:
primary_info["chisq"] = chisq
data_dict["experiment"] = sub_expdata.experiment_type
if "experiment_id" in data_dict:
# Use experiment ID of parent experiment data.
# Sub experiment data is merged and discarded.
del data_dict["experiment_id"]
analysis_result = AnalysisResultData(**primary_info, extra=data_dict)
analysis_results.append(analysis_result)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm expecting this is a temporary conversion and will be removed after the breaking API change of the return type of _run_analysis planned in one of the following PRs. (I know the change is not easy...)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are removed with overhaul in 6ebd975.

@nkanazawa1989 nkanazawa1989 added this to the Release 0.6 milestone Jun 13, 2023
@nkanazawa1989 nkanazawa1989 added the Changelog: New Feature Include in the "Added" section of the changelog label Jun 13, 2023
@nkanazawa1989 nkanazawa1989 force-pushed the feature/dataframe-pr1 branch from 6e30147 to 6ebd975 Compare June 21, 2023 19:16
@nkanazawa1989 nkanazawa1989 marked this pull request as ready for review July 3, 2023 04:56
@nkanazawa1989
Copy link
Collaborator Author

Thanks @coruscating @itoko for your review. I updated the PR in response to your comments. Please read following replies to your review comments.

@coruscating

I think it would be helpful to add a label for the short analysis ID index column, otherwise it may look like a random string to a new user.

The short UUID corresponds to a pandas dataframe index, and index doesn't have a column name. We could update the stylesheet to show some name, but I feel this is overengineering. By default index is incremental number, but I like current format by analogy of the github commit hash.

Should the display style for CurveFitResults be updated? Right now the newlines are showing as \n literals in the dataframe. If we keep the string representation the same, the default dataframe display style can be updated to render the newspaces (but then we should truncate the output or the cell will be very long).

Yes, this object is very noisy in the table, but will be moved to artifact in the followup.

This may be outside the scope of this PR, but when I run a BatchExperiment of ParalleExperiments of T1s, the results are difficult to distinguish from each other because the experiment IDs are all the same. What do you think about including an extra field that has the experiment ID of the innermost child experiment?

Good point. If I understand your request correctly, user doesn't need to distinguish the difference of entry itself. However, if an experiment generates multiple results, there is no convenient label to group the results from the same experiment instance together. Since this is sort of an edge-case, I'll probably go with current implementation, and update in followup if there is any user feedback.

@itoko

Why don't we add a new method like add_analysis_result (without s) and deprecate add_analysis_results instead of upgrading add_analysis_results? (The new API seems to support only addition of single analysis result and have completely different arguments.)

I was thinking of the same thing, but I couldn't come up with reasonable method name. I think analysis_results vs analysis_result is very confusing and I can imagine a user always needs to check code or docstring to choose proper method. I think current implementation with kwargs is much more intuitive.

In my understanding, result_id is a unique key of the table. Do we really need to have index as a yet-another key? (Note: I'm talking about index in pd.DataFrame not index argument of analysis_results method.)

Yes, result ID is unique, but full hexadecimal UUID is too long. I was thinking to use result_id as a dataframe row index, but this creates very wide table which may hurt user experience (i.e. our browser is not friendly for horizontal scroll).

information on the structure of experiments will be dropped from AnalysisResult expecting it is not usually used during analysis. Also the information is still reconstructable from other data stored in an ExperimentData

Yes, this is what I intended to. Because this becomes a massive logic change in composite analysis in addition to the introduction of dataframe, I decided to do this cleanup in followup PR.

@nkanazawa1989 nkanazawa1989 force-pushed the feature/dataframe-pr1 branch 2 times, most recently from f5c2bab to ba431c8 Compare July 3, 2023 17:45
Copy link
Collaborator

@coruscating coruscating left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the nice update @nkanazawa1989. There seems to be a bug right now where analysis results are not saved to ResultsDB. I ran a simple T1 experiment:

exp=T1(physical_qubits=[0], delays=np.arange(1e-6, 2e-4, 3e-5))
exp.run(backend).block_for_results()
exp.save()

and the resulting analysis result could not be serialized:

{'data': (AnalysisResultData(result_id='bb822f9486344a03bd7e4d00a145766f', experiment_id='b4c968e8-ab44-4e48-823e-b31180a5f87a', result_type='@Parameters_T1Analysis', result_data={'_value': <qiskit_experiments.curve_analysis.curve_data.CurveFitResult object at 0x298618280>, '_chisq': nan, '_extra': {'unit': None}, '_source': {'class': 'qiskit_experiments.framework.analysis_result.AnalysisResult', 'data_version': 1, 'qiskit_version': '0.42.1'}, 'value': '(CurveFitResult)'}, device_components=[<Qubit(Q0)>], quality='bad', verified=False, tags=[], backend_name='ibm_nazca', creation_datetime=Timestamp('2023-07-17 19:49:08.242003+0000', tz='UTC'), updated_datetime=None, chisq=nan),)

This yields the error Unexpected token N in JSON at position 1792, but the save actually fails silently, as in the experiment saves without the analysis results and the user is not notified. I wonder, after the bug is fixed, if we should also change the default behavior so the user is aware of cases where the experiment did not fully save.

@property
def end_datetime(self) -> datetime:
"""Return the end datetime of this experiment data.

The end datetime is the time the latest job data was
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end_datetime now seems almost redundant with running_time. running_time is the timestamp of the successful job completion from the job side, while end_datetime sets the timestamp to be when the successful job is returned, so the only difference seems to be networking and processing latency. Perhaps we should change the behavior of end_datetime to be when the experiment actually finishes all its processing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm this is indeed misleading... I was thinking that running_time is the time that the first circuit is triggered on the backend. Do you know where this is publicly defined?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I misunderstood, my bad. It looks like currently running_time is the time the latest successful job started to run, and end_datetime is when the latest successful job is returned, is that right? Maybe running_time could be changed to the time the first job of the experiment started to run? But I think we can also keep it as-is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Done in f6e7294

This may return wrong datetime if server and client are in different timezone.

"""
return utc_to_local(self._running_time)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to @kt474, job.time_per_step() handles conversion and returns the timestamp in the user's local time, so I think we can keep all timestamps in the user's local time without the warning and conversion on our end.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. I just removed all redundant conversions and use tzlocal for creation time of the analysis entry. Done in cc5cf41

@CLAassistant
Copy link

CLAassistant commented Jul 18, 2023

CLA assistant check
All committers have signed the CLA.

Copy link

@Musa-Sina-Ertugrul Musa-Sina-Ertugrul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments are about using sets, I think you can attach set to class for uniqueness. I only reviewed push that has not outdated label.

"""
with self._lock:
# Order sensitive
new_columns = [c for c in new_columns if c not in self.get_columns()]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we convert self.get_columns() statement to set from list that will be faster. Also, using set wont be a problem because, property of column names that about uniqueness.

Copy link
Collaborator Author

@nkanazawa1989 nkanazawa1989 Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree set is faster but column name is order sensitive -- not only about uniqueness. This is basically designed so that important information must come on the left side on the pandas html table (and I also believe a user assumes the keys are added to the table in the same ordering with the add method call). Unfortunately Python doesn't provide ordered set in builtin (I don't want to add extra dependency without drastic performance gain) and just gave up using the set here.

Here is the result of very casual performance check on my laptop (typically columns size is at most 20, and new keys added by a heavily customized analysis might be ~3):
image

Indeed set operation is faster, but the difference is just less than 1 usec. Since this line is run only when the analysis class generates an extra key, when you run batch of 10 experiments (this is really rare case), the number of method call cannot exceed 10 times. So the difference that user may experience in some heavy setting would be something like few us.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added new test case for key order preservation in 2cda8dc

ValueError: When index is not in this table.
"""
with self._lock:
if index not in self._container.index:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned on my first review you can use set here too. Because indexes must be unique. I think, you can attach set to class blueprint for unique indexes according to my opinion it would not hold much space.

Copy link
Collaborator Author

@nkanazawa1989 nkanazawa1989 Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed this is the pandas Index object and this check is sufficiently performant. Typecast overhead is much expensive.
image

Note that index is not only added, but also being deleted. Adding new instance member of tracking the existing indices may make the code slightly faster, but this will add lines of code in multiple places and it increases maintenance overhead.

qiskit_experiments/database_service/utils.py Show resolved Hide resolved
qiskit_experiments/database_service/utils.py Show resolved Hide resolved
@itoko
Copy link
Contributor

itoko commented Jul 24, 2023

Thank you for the nice update @nkanazawa1989 . Deferring logic changes in composite analysis to a followup PR makes sense to me. Also I'm fine with upgrading add_analysis_results. LGTM except for the following two points that could be improved (sorry, I missed the second point at the first round of my review).

Regarding index, if we want it as a short name of UUID mainly for viewing purpose, how about generating it as a column on the fly when analysis_results(dataframe=True) is called? Such a lazy evaluation avoids the constant overhead of 8-chars collision check. It would also allow a shorter string length, e.g. 4 chars would be sufficient in most cases (for the uniqueness, maybe we need to try 4, 8, 16,... in some cases). You can query an entry with the short id (e.g. 'ab12') by df[df['result_id'].str.startswith('ab12')].

To me, current design of ThreadSafeDataFrame feels too specific to AnalysisResultTable and hence not useful in general. For example, having _column, _extra and container method with collapse_extra. Having just AnalysisResultTable(ThreadSafeContainer) makes more sense to me this time (since we have no concrete subclass of ThreadSafeDataFrame other than AnalysisResultTable, suggesting it's difficult to think of what is a good design for a common interface ThreadSafeDataFrame at this moment).

What do you think? @nkanazawa1989 (I'm happy to discuss these point offline if you want.)

nkanazawa1989 and others added 12 commits August 8, 2023 01:04
This change decouples AnalysisResult from ExperimentData. Since AnalysisResult not only defines data model but also provide API for the IBM experiment service, this coupling limits capability of experiment data analysis. ExperimentData.analysis_results still returns AnalysisResult for backward compatibility, but these object is not identical object. Returned object is AnalysisResult which is newly generated from dataframe.
These test cases use MagicMock for convenience, rather than using a mocked AnalysisResult. However, now added result entry is internally converted into dataframe format and the input object is discarded after add_analysis_results call. So it's no longer possible to track method call or evaluate identity of input object. Since user will never store MagicMock to experiment data, this change to unittest should not introduce any edge case and should not decrease the practical test coverage.
Co-authored-by: Helena Zhang <Helena.Zhang@ibm.com>
…, and it copies container into each process. Saving data in the copied container just discards added data when the process finishes. For now Qiskit Experiments don't use multi processing and we don't need to care these tests immediately.
…o provide convenient access to internal dataframe container so that ThreadSafeDataFrame becomes comparable with pandas DataFrame (indeed pandas implements lots of convenient method to manipulate data). However, this is dangerous in multithread environment, because lock is released after acquired item (this may be some method to manipulate data) is returned. The holder can still mutate the internal container asynchronously in this situation. To avoid this problem, __getattr__ is dropped and all necessary methods are implemented as method with reentrant lock.
- result id must be str(uuid) including "-"
- np.nan must be replaced with None
…added to extra field since IBM Experiment Service doesn't have data field for them.
@nkanazawa1989 nkanazawa1989 force-pushed the feature/dataframe-pr1 branch from ba431c8 to 2cda8dc Compare August 7, 2023 16:09
…me with its experiment type, qubits, and experiment id.
@nkanazawa1989
Copy link
Collaborator Author

nkanazawa1989 commented Aug 7, 2023

@coruscating Thanks for testing save with the real service. Fix is here c2cd090. This is not a serialization issue. The code generated the analysis result id with uuid4().hex and this generates a string without dash. However, likely IBM Experiment service expects str(uuid4()) which is a string including dash. I added regex to check this pattern. Another error of Unexpected token N in JSON at position 1792 is due to np.nan in the table. This must be replaced with None. These are very IBM specific but must be handled on client side. I also found a convenient function in the service, but this converter likely assumes column names in my draft branch, and doesn't work with current PR as-is. Anyways save/load work now.

It would be great if you could also check 665ff8f. This is related to figure name generation.

@itoko We cannot create dataframe without index, and we cannot remove index column without hacking the html representation of the dataframe. Adding short id as a new column just consumes extra horizontal space in the web browser, and this doesn't make sense. One thing we could do would be creating index with dataframe.loc[len(dataframe)], but we still need to consider the uniqueness of the index because ExperimentData allows users to delete entries, and hence the table length doesn't guarantee the uniqueness of index. Alternatively we can just give up the uniqueness of the index and just always use first 8 digit regardless of collision, but this will eventually require that user should check the length of returned entry (if there is collision they may get multiple unrelated entries) even though they filtered the table with a particular ID.

Here is the result of very casual performance test. Time consumption in the unique index generation is just 1.75%, so this logic is not critical to the entire performance.

exp_data = ExperimentData()

def add_100_entries():
    for i in range(100):
        exp_data.add_analysis_results(name=f"data_{i}", value=i)

image

Regarding the base class design, I plan to use this dataframe for curve data. Currently the curve data is represented by a unwieldy CurveData object. This takes arbitrary circuit metadata in addition to the fixed columns of xdata, ydata, yerr, etc..., so current base class implementation also fit in with this case.

…ary id as long as they don't use the save feature. The validation error is replaced with the user warning not to stop this workflow.
@itoko
Copy link
Contributor

itoko commented Aug 9, 2023

LGTM, thanks!

index or on-demand short_id column

I thought it's sufficient to tweak _repr_html_ to return self._container.to_html(index=False). That said, looking at the profiling results, as you pointed out, the unique index generation seems not a bottleneck in the expecting regime (~10K). The test adding 1000 (10K) rows takes 1.4 (18.4) seconds in my env while _unique_table_index takes 2.2% (7.3%) of the time. So I don't stick to using a column for short ID for now.

FYI: During my profiling, I encountered a weird performance regression in Pandas Dataframe when adding rows including a None entry in a column (e.g. quality in my example below). If rows have a column with None, the time scales super-linear to n_rows while it scales almost linear if they have no None. I don't think this is an issue for now since we are assuming ~3000 rows (~ batched three parallel experiments for a 1000-qubit system) for analysis result data size, right? But if we want to handle more rows in future, we may need to revisit this issue.

image

Script I used:

import time
from qiskit_experiments.framework import ExperimentData
def add_n_rows(nrows, none_quality=True):
    exp_data = ExperimentData()
    for i in range(nrows):
        exp_data.add_analysis_results(name=f"data_{i}", value=i, 
                        quality=None if none_quality else i,
                        components="",
                        experiment=i,
                        experiment_id="",
                        result_id="",
                        tags="",
                        backend=i,
                        run_time=i,
                        created_time=i
                )
times_nona=[]
for nrows in [1000, 2000, 3000, 4000, 5000]:
    elapsed_time = time.time()
    add_n_rows(nrows, none_quality=False)
    elapsed_time = time.time() - elapsed_time
    times_nona.append(elapsed_time)
times_withna=[]
for nrows in [1000, 2000, 3000, 4000, 5000]:
    elapsed_time = time.time()
    add_n_rows(nrows, none_quality=True)
    elapsed_time = time.time() - elapsed_time
    times_withna.append(elapsed_time)

ThreadSafeDataFrame

Good to know you designed it with two concrete subclasses, now I'm fine to have it. I see the doc says
This object is expected to be used internally in the ExperimentData.

@nkanazawa1989
Copy link
Collaborator Author

Thank you @itoko san! <1K entries is indeed a target of current implementation, i.e. running parallel experiment on few 100s qubits and batching them (e.g. combining T1 and T2) would generate ~1K entries. For future I also consider to switch to Polar.

Just fyi:

Main branch (I cannot run 3000 entries. Only 500 entries)

import time
from qiskit_experiments.framework import ExperimentData
from qiskit_experiments.framework import AnalysisResult

def add_n_rows(nrows, none_quality=True):
    exp_data = ExperimentData()
    data_list = []
    for i in range(nrows):
        data = AnalysisResult(
            name=f"data_{i}", 
            value=i, 
            quality=None if none_quality else i,
            device_components=["Q0"],
            experiment_id="",
            result_id="",
            tags="",
        )
        data_list.append(data)
    exp_data.add_analysis_results(data_list)

%timeit add_n_rows(500, none_quality=True)  
# 25.5 s ± 322 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In this branch with your code

%timeit add_n_rows(3000, none_quality=True) 
# 8.83 s ± 391 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There is already a drastic performance improvement (more than expected).

Copy link
Collaborator

@coruscating coruscating left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you! 🚀

@nkanazawa1989 nkanazawa1989 added this pull request to the merge queue Aug 11, 2023
Merged via the queue into qiskit-community:main with commit 06f0718 Aug 11, 2023
github-merge-queue bot pushed a commit that referenced this pull request Aug 29, 2023
### Summary

Follow-up of #1133 for data frame. 

### Details and comments

I realized I had some mistake when I wrote a base class of
`ThreadSafeDataFrame`. We plan to move the curve data (XY data) to the
artifact for better reusability of the data points. In addition,
[CurveData](https://github.com/Qiskit-Extensions/qiskit-experiments/blob/c66034c90dad73d705af25be7e9ed9617e7eb2ef/qiskit_experiments/curve_analysis/curve_data.py#L88-L113)
object which is a container of the XY data is also replaced with the
data frame in a follow up PR. Since this table should contain predefined
columns such as `xval`, `yval`, `yerr`, I was assuming this container
could be a subclass of `ThreadSafeDataFrame` -- but this was not a right
assumption.

Since the curve analysis always runs on a single thread (analysis
callbacks might be run on multiple threads and thus
`AnalysisResultTable` must be thread-safe), this curve data frame
doesn't need to be thread-safe object. In this PR, a functionality to
define default columns of the table is reimplemented as a Python mixin
class so that the function to add default columns will be used also for
curve data frame without introducing the unnecessary thread safe
mechanisms. `AnalysisResultTable` is a thread safe container but the
functionality to manage default columns is delegated to the mixin.
nkanazawa1989 added a commit to nkanazawa1989/qiskit-experiments that referenced this pull request Jan 10, 2024
### Summary

Executing approved design proposal in
https://github.com/Qiskit/rfcs/blob/master/0007-experiment-dataframe.md.
Introduction of the `Artifact` will be done in the followup.

This PR introduces dataframe to `ExperimentData` and replaces the
conventional
[AnalysisResult](https://github.com/Qiskit/qiskit-experiments/blob/main/qiskit_experiments/framework/analysis_result.py)
with it. With this PR, experimentalist can visualize all analysis
results with a single html table by `dataframe=True`.


![image](https://user-images.githubusercontent.com/39517270/230959307-d5a4d471-1659-4cf4-974e-cdb21e625ad8.png)

Once can control the columns to show in the html table with `columns`
option.


### Details and comments

Experiment analysis is a local operation on client side, and data
generated there must be agnostic to the payload format of remote
service. In the current implementation, `AnalysisResult` appears in the
`BaseAnalysis` and the data generated by the analysis is implicitly
typecasted to this class. Note that `AnalysisResult` *not only* defines
canonical data format *but also* implements the API for the IBM
Experiment service. This tight coupling to the IBM Experiment service
may limit data expressibility and an analysis class author always need
to be aware of the data structure of the remote service.

In this PR, internal data structure is replaced with the flat dataframe,
and one should be able to add arbitrary key-value pair to the analysis
result without concerning about data model. Added key creates a new
column in the table, which must be internally formatted and absorbed by
extra field of the payload at a later time when these results are saved.
This is implemented by
`qiskit_experiments.framework.experiment_data._series_to_service_result`.
Bypassing the generation of `AnalysisResult` object also drastically
improves the performance of heavily parallelized analysis that generates
huge amount of data.

---------

Co-authored-by: Helena Zhang <Helena.Zhang@ibm.com>
nkanazawa1989 added a commit to nkanazawa1989/qiskit-experiments that referenced this pull request Jan 10, 2024
### Summary

Follow-up of qiskit-community#1133 for data frame. 

### Details and comments

I realized I had some mistake when I wrote a base class of
`ThreadSafeDataFrame`. We plan to move the curve data (XY data) to the
artifact for better reusability of the data points. In addition,
[CurveData](https://github.com/Qiskit-Extensions/qiskit-experiments/blob/c66034c90dad73d705af25be7e9ed9617e7eb2ef/qiskit_experiments/curve_analysis/curve_data.py#L88-L113)
object which is a container of the XY data is also replaced with the
data frame in a follow up PR. Since this table should contain predefined
columns such as `xval`, `yval`, `yerr`, I was assuming this container
could be a subclass of `ThreadSafeDataFrame` -- but this was not a right
assumption.

Since the curve analysis always runs on a single thread (analysis
callbacks might be run on multiple threads and thus
`AnalysisResultTable` must be thread-safe), this curve data frame
doesn't need to be thread-safe object. In this PR, a functionality to
define default columns of the table is reimplemented as a Python mixin
class so that the function to add default columns will be used also for
curve data frame without introducing the unnecessary thread safe
mechanisms. `AnalysisResultTable` is a thread safe container but the
functionality to manage default columns is delegated to the mixin.
github-merge-queue bot pushed a commit that referenced this pull request Jan 10, 2024
### Summary

#1133 introduce a bug in `MitigatedTomographyAnalysis` that accidentally
drops extra analysis metadata. This PR fixes the bug.

### Details and comments

In the new data storage implementation with dataframe,
`ExperimentData.analysis_results` returns a copy of the protected
dataframe and mutating the returned object doesn't update the source.
This PR introduces new option to `TomographyAnalysis` to inject extra
metadata.

<img width="993" alt="screenshot"
src="https://github.com/Qiskit-Extensions/qiskit-experiments/assets/39517270/d8674af1-78ad-4870-bfea-d441f9dfd1e8">

(edit) note that reno is not necessary because 1133 is not released yet.
nkanazawa1989 added a commit to nkanazawa1989/qiskit-experiments that referenced this pull request Jan 17, 2024
…ity#1344)

### Summary

qiskit-community#1133 introduce a bug in `MitigatedTomographyAnalysis` that accidentally
drops extra analysis metadata. This PR fixes the bug.

### Details and comments

In the new data storage implementation with dataframe,
`ExperimentData.analysis_results` returns a copy of the protected
dataframe and mutating the returned object doesn't update the source.
This PR introduces new option to `TomographyAnalysis` to inject extra
metadata.

<img width="993" alt="screenshot"
src="https://github.com/Qiskit-Extensions/qiskit-experiments/assets/39517270/d8674af1-78ad-4870-bfea-d441f9dfd1e8">

(edit) note that reno is not necessary because 1133 is not released yet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: New Feature Include in the "Added" section of the changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants