Slurm Ground Motion Workflow
- Added site response code for BB step
- New entry in gmsim templates to control site response directory
- Added new State for a job failing due to Wall Clock Time
- Updated Enum values in the DB (failed is now 7, killed_WCT is 6)
- Updated function to determine number of retries to only count tasks that hit WCT
- Added setup.py such that importable code is installed as submodules
- Updated the structure to have clear separation between scripts and importables
- TACC support
- Unused test for manual_install
- Added cybershake_progress.py, which produces a formatted report and optionally sends alerts to a Slack channel
- Creates a temporary virtual environment for Jenkins testing
- Removed a dependence on Docker image
- Flags --site_specific and --site_v1d_dir to hf_sim.py
- Updated nesi_tests configs to use correct non uniform grid
- Replaced the use of direct argparse object with unpacked variables
- Updated auto_submit and automated testing to accommodate the change
- Flags --vm_perturbations and --ignore_vm_perturbations for use with velocity model perturbations
- If used the sim_params and root_params files will be updated for emod3d to use the relevant perturbation file
- Removed .travis.yml
- Removed Travis-CI Webhooks from the repository setting
- Removed the repository link from Travis-CI web.
- Jenkinsfile now controls the workflow of auto-testing
- Added "jenkins" user and group to Docker image, so that the Docker image can be run as a non-root user, preventing many root-owned files that need sudo to be cleaned up after testing
- Testing can be kicked off manually.
- Replaced build status badge
- Added Empirical calculation as a part of automated workflow
- Run_cybershake can take a list as part of each task entry
- Extra query modes for query_mgmt_db - todo and to list tasks that have exceeded max_retry count
- Extra parameter for BB_Sim - ability to choose the site_amp model from config
- IM plot to use python 3 workflow
- removed partition specifier for mahuika slurm scripts (as per NeSI recommendations)
- lowered core requirements on quick animation, less queue time
- Does not create new tasks for faults that are re-installed
- When marking a task as failed it will mark that task's dependencies as failed too. The output files are not touched
- Removed default partition on slurm scripts for Mahuika as they aren't necessary anymore
- query_mgmt_db now accepts lists as part of the task_config yaml entries.
- E2E tests bugs fixed
- Changed the default version of HF to 6.0.3
- Can be switched back to 5.4.5 with "--version 5.4.5" option for hf_sim.py
- If version 6.0.3 is used, it utilizes a new argument --stress_param_adj
- Changed HF Rayset default from 1 2 to 1, which is internally handled as [1].
- Added slurm templates related to Advanced_IM
- Added adv_im related logic to submit_sim_imcalc
- Added simple tests for adv_im output to check for blank fields
- Updated log_metadata to handle adv_im related metadata
- Updated automated workflow to be able to handle adv_im as a step
- Updated queue_monitor to be able to handle a specific case where job is missing in queue and also not stored in sacct history
- Updated unit test to test for 15 steps instead of 14
- Stations along the South or Eastern edge of the VM domain had a broken component, so for now these stations have been removed from the station list.
- End to end test data has been updated to remove the dropped stations
- Dashboard has been relocated to a separate dashboard repo
- Empirical calculation and aggregation scripts to the scripts/empirical directory
- Empirical calculation and aggregation orchestration script to the scripts/empirical directory
- Removed some buggy behaviour from rrups, which was previously failing due to bad bash
- If an event or fault only has 1 realisation to be run it is installed without the _REL suffix
- Fixed auto-close timeout behaviour for run_cybershake and auto_submit; they should now finish as expected, as they did previously
- Increased timeout to two iterations of each task subthread
- Empiricals don't work yet, but have had updates to be closer to working
- Test data for v16p1 simple and validation, and 18p5p3p4 simple updated to have data for the updated workflow with timeshifting. Old test data is still available for testing old workflow versions.
- Version files for gm versions 18.5.3.2, 18.5.3.1, 18.5.4.1
- LF now has additional sim duration time added to account for time shifts to preserve causality
- Version 3.0.4 of EMOD3D had a time shift of 1/flo, while subsequent versions have a time shift of 3/flo
- Updated deprecated DashTable attributes so the app can run without error
- For BB, HF2BB, LF2BB the dt can now be set and the given data will be interpolated to that level
- Added metadata logging for failed runs in queue monitor
- If squeue does not return the expected headers, it is assumed to have failed. In this case no jobs will be marked as failed or as requiring resubmission
- add_to_mgmt_queue now optionally takes in the slurm job id. If it is given, it is used to match the update with an existing database entry; if it is not given, the user is warned and the update is applied anyway
- If more than one entry in the database is updated by an update the user is alerted to this
- Updated the way station list is split to achieve improved load balancing for HF calculation
- Removed --seed: -1 option which allowed fully random run of HF calculation
- Added MIT License
- fixed logic bug checking completion of IM_calculation
- Added a script to aggregate HF logs into a csv with the amount of core hours lost to thread idling
- Dashboard now tries 4 hpc login nodes [maui|mahuika][01|02] instead of 2
- Logging has been moved to the qcore repository to allow its use in other repositories
- Added IM_plot, plot_ts and plot_srf slurm scripts to cybershake workflow
- Updated rrup to run as expected
- The nt check on install now rounds to the nearest integer, instead of rounding down to the next one.
- EMOD3D no longer dumps partial results to the output directory.
- Two more testing configurations to end to end tests
- Queue monitor stress test
- IM_calculation and visualisation added to environment creation
- Checks to install and BB_sim to ensure lf and hf have the same number of (extrapolated) steps
- On start, any tasks that can be retried under the current max_n_retries have retries added
- The flag -c is now available for the query_mgmt_db script to give a count of how many tasks are in each state
- plot_ts and plot_srf added
- Added loading test to test_merge_ts
- Template task_config for the automated wrapper
- Improved cross platform support for slurm scripts from Maui to Mahuika
- Queue monitor is now responsible for checking squeue and keeping the database up to date
- Improved logging for threaded scripts
- Shared library refactored out of shared: shared_automated_workflow
- slurm log file names are now determined by job number and name
- rrup re-enabled
- A wrapper for auto_submit and queue_monitor is now available
- LF and HF may now be converted to BB without the other, and IM calculations subsequently performed on them
- When tasks fail they will have a new task made for them instead of having their retry counter incremented
- The parts of the workflow relating to text based computation outputs have been removed
- This includes winbin_aio, match_seismo and hfsims-stats
- version is now the final parameter and is optional. If it is not provided it is assumed to be version 16.1
- Cybershake path is now converted to the absolute path of what is passed in
- The first srf of each fault is checked that it is within the bounds of the velocity model.
- As a result, out of bounds srfs will now cause validation to fail.
- HF seed is now an optional parameter to the install script
- Instead of passing the path to a cybershake config file, the required version should be passed to the install script
- Cybershake_config.json no longer needed. All relevant values have been moved to the root_defaults.yaml in the gmsim templates subdirectory.
- Improved submit and mgmt queue logging
- Changed to work with new automated workflow submit (see [19.4.2])
- Lots of minor improvements so it can handle running/testing a large quantity of simulations.
- Added core hours estimation unit tests
- Added extra LF test that checks if there are any zeros in the velocities
- Installation paths can now be provided as relative paths
- Files in the management db folder are not deleted if they are not valid update files
- HF seeds are now correctly communicated to the child processes
- LF checkpoint files are now removed when clean up occurs
- merge_ts now accepts absolute paths
- The relative path to the cybershake directory can now be passed to install_cybershake
- Changelog now has newest first
Changes to dashboard:
- Changed to use sreport to get total & daily core hours usage
- Changed dashboard test functions & test zip file download path
- Tidied up Maui dashboard
- Added Mahuika total core hours usage to dashboard
- Added functionality for collecting old total core hours usage
- fixed logic bug in inserting into maui/mahuika daily table
Changes to automated workflow:
- Split into two processes:
- Auto-submit: which submits job to the HPC & populates the mgmt db queue with updates for the mgmt db
- Queue monitor: Updates the status of the tasks in mgmt db
For how to use, see the updated README
Note: The auto-submit reads from the mgmt db, but NEVER writes or updates
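The split described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the JSON update files, the `state` table and its columns are hypothetical and are not taken from the workflow code. It only shows the idea that the submitting side drops update files into a queue folder, while a single monitor process is the only writer to the database.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def queue_update(queue_dir: Path, task: str, status: str) -> Path:
    """Auto-submit side: record a status update as a file; never touches the DB."""
    # Hypothetical naming scheme: a zero-padded counter keeps updates ordered.
    index = len(list(queue_dir.glob("*.json")))
    update_file = queue_dir / f"{index:06d}_{task}.json"
    update_file.write_text(json.dumps({"task": task, "status": status}))
    return update_file


def apply_updates(queue_dir: Path, db: sqlite3.Connection) -> int:
    """Queue-monitor side: the only writer; drains queued updates into the DB."""
    applied = 0
    for update_file in sorted(queue_dir.glob("*.json")):
        update = json.loads(update_file.read_text())
        db.execute(
            "UPDATE state SET status = ? WHERE task = ?",
            (update["status"], update["task"]),
        )
        update_file.unlink()  # a consumed update is removed from the queue
        applied += 1
    db.commit()
    return applied


def read_status(db: sqlite3.Connection, task: str) -> str:
    """Read-only lookup, the kind of access the submitting side is limited to."""
    row = db.execute("SELECT status FROM state WHERE task = ?", (task,)).fetchone()
    return row[0]


if __name__ == "__main__":
    queue_dir = Path(tempfile.mkdtemp())
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE state (task TEXT PRIMARY KEY, status TEXT)")
    db.execute("INSERT INTO state VALUES ('fault1_LF', 'created')")
    queue_update(queue_dir, "fault1_LF", "queued")
    apply_updates(queue_dir, db)
    print(read_status(db, "fault1_LF"))
```

Keeping a single writer avoids concurrent database updates from several submitting threads, which is one plausible motivation for the split.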
Changes to autosubmit:
- Machine queues are maintained individually
- The desired maximum length for all or each machine queue can be set with the -n flag
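As a rough illustration of per-machine queue limits, the sketch below computes how many more jobs each machine queue could accept. The function name `free_slots`, the machine names used as keys and the way the limit is represented (one integer for all machines, or a mapping with one limit per machine) are assumptions for this example; the actual syntax of the -n flag is not shown here.

```python
from typing import Dict, Union


def free_slots(
    queued: Dict[str, int], limit: Union[int, Dict[str, int]]
) -> Dict[str, int]:
    """Return how many more jobs each machine queue can accept."""
    slots = {}
    for machine, n_queued in queued.items():
        # A single integer applies to every machine; a dict gives per-machine limits.
        max_len = limit if isinstance(limit, int) else limit[machine]
        # Never report a negative number of slots for an over-full queue.
        slots[machine] = max(0, max_len - n_queued)
    return slots


if __name__ == "__main__":
    current = {"maui": 3, "mahuika": 7}
    print(free_slots(current, 5))
    print(free_slots(current, {"maui": 10, "mahuika": 8}))
```

Tracking each machine separately means a full queue on one machine does not block submissions to another.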