EuroSys'22 Artifact Evaluation Comments
Paper #86 Reasoning about Configurable System Performance through the lens of Causality
Hi authors, thanks so much for submitting the artifact. I downloaded it from the public GitHub link (not the one listed in the artifact appendix) and finished the first example. Everything seems good so far.
Dear authors, I was able to set up Unicorn and run the first example as well. One thing to note: to get the results reproducible badge, we believe the exact setup from the paper's evaluation needs to be mimicked, specifically the online version of Unicorn. Would it be possible for us to get access to the hardware so that we can evaluate the online version of Unicorn?
Comment @A3 by Md Shahriar Iqbal miqbal@email.sc.edu
Hi reviewers, thank you for the excellent reviews. @A2, Sure. Please let me work on that to allow access to the hardware. I will notify you once it can be tested.
Comment @A4 by Md Shahriar Iqbal miqbal@email.sc.edu
Dear reviewers, we hope to provide access to the devices to run experiments for reproducibility within the next 3 days as there are some external dependencies. We will notify you immediately once it is ready to access and run. Thank you for your patience and time.
Comment @A5 by Md Shahriar Iqbal miqbal@email.sc.edu
Dear reviewers, the steps to run Unicorn in online mode are now in https://github.com/softsys4ai/unicorn/blob/master/artifact/REPRODUCE.md. A video run of the example is also provided to help start the experiment. Please let us know if you find any issues.
Hi, thanks a lot for preparing the online environment. However, 34.125.174.0 is not reachable from my network. Is that an internal IP address, or has the IP already been collected by the VM manager?
Comment @A7 by Md Shahriar Iqbal miqbal@email.sc.edu
Dear reviewer, can you please check again with the updated instructions? The issue is caused by a dynamic IP, as we use reverse tunneling to our device through Google Cloud. The new IP is 34.125.91.37.
Hi author, thanks for the update. The IP is now reachable, but "ssh -p 2200 nvidia@localhost" does not seem to work: the connection is refused. Do you have any idea how to solve that?
Comment @A9 by Md Shahriar Iqbal miqbal@email.sc.edu
Dear reviewer, we are sorry for the connection issues. Can you please try again now?
Comment @A10 by Pooyan Jamshidi pjamshid@cse.sc.edu
Dear reviewer, for your convenience, here are the instructions for accessing the device:
git clone https://github.com/softsys4ai/unicorn.git
chmod 400 ./unicorn/etc/key
ssh -i ./unicorn/etc/key nvidia@34.125.91.37
ssh -p 2200 nvidia@localhost
user: nvidia
password: nvidia
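Since the second hop depends on a forwarded port that may or may not be listening, a quick reachability check can save a failed login attempt. The following is only a minimal sketch using the Python standard library; the helper name `port_is_open` is hypothetical and not part of the Unicorn repository, and it assumes the check is run on the intermediate host after the first ssh hop.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # timeout) when nothing is accepting connections on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port 2200 on localhost is the forwarded port from the instructions above.
    if port_is_open("localhost", 2200):
        print("port 2200 is accepting connections")
    else:
        print("port 2200 is closed (connection refused or timed out)")
```

A closed port here would point to the tunnel being down rather than to a problem with the credentials.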
Thanks for the update. I just launched the 11-hour test. There is a small typo in the command: "python3 ./services/run_services.py Image" should be "python3 ./services/run_service.py Image" (remove the "s" before ".py").
Hi author, I saw the following error when running the script:
Connections discovered by the causal graph
[('sched_sched_wakeup_new', 'sched_sched_stat_runtime'), ('vm.vfs_cache_pressure', 'total_energy_consumption'), ('major-faults', 'sched_sched_load_avg_cpu'), ('logical_devices', 'sched_sched_overutilized'), ('branch-loads', 'raw_syscalls_sys_enter'), ('branch-misses', 'context-switches'), ('cache-misses', 'sched_sched_wakeup'), ('cache-misses', 'total_energy_consumption'), ('L1-dcache-load-misses', 'branch-misses'), ('migrations', 'sched_sched_wakeup_new'), ('sched_sched_load_avg_cpu', 'sched_sched_switch'), ('num_cores', 'raw_syscalls_sys_enter'), ('sched_sched_load_avg_cpu', 'branch-load-misses'), ('L1-dcache-load-misses', 'cache-misses'), ('instructions', 'L1-dcache-loads'), ('sched_sched_process_wait', 'L1-dcache-loads'), ('cycles', 'L1-dcache-load-misses'), ('sched_sched_switch', 'cycles'), ('sched_sched_wakeup', 'major-faults'), ('sched_sched_wakeup_new', 'total_energy_consumption'), ('vm.vfs_cache_pressure', 'total_energy_consumption'), ('major-faults', 'total_energy_consumption'), ('logical_devices', 'total_energy_consumption'), ('branch-loads', 'total_energy_consumption'), ('branch-misses', 'total_energy_consumption'), ('cache-misses', 'total_energy_consumption'), ('cache-misses', 'total_energy_consumption'), ('L1-dcache-load-misses', 'total_energy_consumption')]
--------------------------------------------------------------
Traceback (most recent call last):
File "./tests/run_unicorn_debug.py", line 301, in <module>
columns, options, NUM_PATHS)
File "./tests/run_unicorn_debug.py", line 54, in run_unicorn_loop
G = ADMG(columns, di_edges=di_edges, bi_edges=bi_edges)
File "/usr/local/lib/python3.6/dist-packages/ananke/graphs/admg.py", line 30, in __init__
super().__init__(vertices=vertices, di_edges=di_edges, bi_edges=bi_edges, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ananke/graphs/sg.py", line 30, in __init__
raise TypeError("TypeError: Graph is not acyclic")
TypeError: TypeError: Graph is not acyclic
Is this error expected?
Comment @A13 by Md Shahriar Iqbal miqbal@email.sc.edu
Dear reviewer, this is not expected. Please allow me some time to take a look.
Comment @A14 by Md Shahriar Iqbal miqbal@email.sc.edu
Dear reviewer, I was not able to reproduce the issue; the tests ran fine when I ran them. The graph can become cyclic when the feature values do not change, which can happen if perf stops working in the background. I have updated the test instructions in https://github.com/softsys4ai/unicorn/blob/master/artifact/REPRODUCE.md for better tracking, so that progress is not lost if an issue arises. Please let me know if you find any issue. Sometimes the Flask app gets killed if it uses more than 10000 MB of memory when it launches. In that case, please rerun the python3 ./services/run_service.py Image script.
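The "Graph is not acyclic" error in the traceback above can be checked for before the ADMG is constructed. The sketch below is a plain depth-first search written for illustration, not the check that ananke or Unicorn actually performs; the helper name `find_cycle` is hypothetical. The example edges are copied from the printed connections, under the assumption that those pairs were all passed as directed edges (the log does not say which pairs ended up in `di_edges` and which in `bi_edges`).

```python
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

Edge = Tuple[Hashable, Hashable]


def find_cycle(di_edges: Iterable[Edge]) -> Optional[List[Hashable]]:
    """Return one directed cycle as a node list (first == last), or None."""
    graph: Dict[Hashable, List[Hashable]] = {}
    for src, dst in di_edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {node: WHITE for node in graph}

    for root in graph:
        if colour[root] != WHITE:
            continue
        path = [root]
        stack = [iter(graph[root])]
        colour[root] = GREY
        while stack:
            for nxt in stack[-1]:
                if colour[nxt] == GREY:
                    # Back edge to a node on the current path: cycle found.
                    return path[path.index(nxt):] + [nxt]
                if colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    path.append(nxt)
                    stack.append(iter(graph[nxt]))
                    break
            else:
                # All successors explored: retire the node.
                colour[path.pop()] = BLACK
                stack.pop()
    return None


# Subset of the pairs printed in the log above (assumed to be directed).
edges = [
    ("major-faults", "sched_sched_load_avg_cpu"),
    ("sched_sched_load_avg_cpu", "sched_sched_switch"),
    ("sched_sched_switch", "cycles"),
    ("cycles", "L1-dcache-load-misses"),
    ("L1-dcache-load-misses", "cache-misses"),
    ("cache-misses", "sched_sched_wakeup"),
    ("sched_sched_wakeup", "major-faults"),
]

cycle = find_cycle(edges)
if cycle is not None:
    print(" -> ".join(str(node) for node in cycle))
```

If these seven pairs were indeed directed, they close a loop starting and ending at major-faults, which would be consistent with the error raised in the traceback.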
Comment @A15 by Md Shahriar Iqbal miqbal@email.sc.edu
Dear reviewer, considering the time available in the review period, I can also video record the entire session if you feel that is sufficient to receive the reproducibility badge. We immensely appreciate your time and effort.
Hi author, thanks a lot for the update. I totally understand that it is not easy to debug an 11-hour script in a container. Personally, I would accept a video of the 11-hour script, but I need to sync with the other reviewers. I will also retry the script.
BTW, "ssh -p 2200 nvidia@localhost" still fails with connection refused. It may be related to your debugging. Anyway, I will keep retrying.
Comment @A17 by Md Shahriar Iqbal miqbal@email.sc.edu
Dear reviewer, thank you, and sorry for the inconvenience. Please try again. The port closes every time the system reboots.
Comment @A18 by Md Shahriar Iqbal miqbal@email.sc.edu
I have now divided the scripts to run 7 bugs at a time instead of all 29, so there are four tests of about 3 hours each.
Hi author, thanks so much for the help recently. I just finished the 11-hour script successfully.
Comment @A20 by Md Shahriar Iqbal miqbal@email.sc.edu
Dear reviewer, thank you for taking so much time and effort. We apologize for all the inconvenience. Please let me know if you need any help or have any issues in comparing the results. For this online experiment, you would want to compare with the Energy Faults results in Table 2 for Xception (Image) on Xavier. For each bug, the gain and the time are the most useful values to compare.
Comment @A21 by Md Shahriar Iqbal miqbal@email.sc.edu
/data/measurement/output/debug_exp.csv contains the results. The reported time is in seconds.
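For a quick comparison against the table, the numeric columns of the results file can be averaged with the standard library. This is only a sketch: the column names are passed in by the caller because the thread does not state the header of debug_exp.csv, so any names used with it (for example "gain" or "time") are assumptions to be checked against the actual file. The helper name `summarise` is hypothetical.

```python
import csv
from statistics import mean
from typing import Dict, List


def summarise(path: str, columns: List[str]) -> Dict[str, float]:
    """Return the mean of each requested numeric column in a CSV file."""
    values: Dict[str, List[float]] = {name: [] for name in columns}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            for name in columns:
                cell = (row.get(name) or "").strip()
                if cell:  # skip empty or missing cells
                    values[name].append(float(cell))
    # Columns with no usable values are left out of the summary.
    return {name: mean(vals) for name, vals in values.items() if vals}
```

It could be called as `summarise("/data/measurement/output/debug_exp.csv", ["gain", "time"])` once the real column names are confirmed; the time values would then be in seconds, as noted above.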