Draft Rucio data management plan
A description of how we manage, in Rucio, the data stored on disk at: https://twiki.cern.ch/twiki/bin/view/CMS/DMWMPG_Namespace and https://twiki.cern.ch/twiki/bin/view/CMS/DMWMPG_PrimaryDatasets
It is worth noting that ATLAS also uses Rucio to manage log files from production jobs, something we could consider as well. We would have to see how the ability to handle archives (zip files only) plays into this.
Rucio has a concept of a DID (data identifier) to describe all levels of data. There are three types of DID: file, dataset, and container. For CMS, these will correspond to file, block, and dataset respectively. Each DID consists of a "scope" and a "name". Only the combination of scope and name is guaranteed to be unique. Most Rucio APIs and commands interact with DIDs regardless of type.
For CMS the name portion will correspond to the LFN, block name, and dataset name from CMS, while scopes will be new functionality for us. Because of the way the CMS namespace is implemented, there will be some interactions between the scope and the namespace (see the Scope section).
For example, all of the following are valid DIDs:
- cms:/BTagMu/Run2018A-17Sep2018-v1/AOD (container)
- cms:/BTagMu/Run2018A-17Sep2018-v1/AOD#297bc346-d69c-4149-baf3-aaea1fe6111e
- cms:/store/data/Run2018A/BTagMu/AOD/17Sep2018-v1/60001/FF5E6F53-6F2D-2D47-B266-5B896392E75F.root
- user.dciangot:/store/test/rucio/temp/dciangot/testASO29.txt
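As a minimal sketch of how these identifiers break down, the snippet below splits a DID string at its first colon and guesses the CMS level from the shape of the name. The helper names (`parse_did`, `cms_level`) and the heuristic itself are illustrative assumptions, not part of Rucio.

```python
from typing import NamedTuple


class DID(NamedTuple):
    scope: str
    name: str


def parse_did(did: str) -> DID:
    """Split a 'scope:name' string at the first colon."""
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"not a scope:name DID: {did!r}")
    return DID(scope, name)


def cms_level(name: str) -> str:
    """Guess the CMS level from the name (illustrative heuristic only)."""
    if name.startswith("/store/"):
        return "file"      # LFN -> Rucio file
    if "#" in name:
        return "block"     # block name -> Rucio dataset
    return "dataset"       # dataset name -> Rucio container
```

For instance, `parse_did("cms:/BTagMu/Run2018A-17Sep2018-v1/AOD")` yields the scope `cms` and a name that `cms_level` classifies as a CMS dataset, that is, a Rucio container.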
Accounts must be given a quota at each RSE where they need to write. Some accounts will be used to produce new data, others are only for requesting replicas of existing data. Separating uses out into different accounts will improve accounting, quota enforcement, and understanding what rules are in place for what reasons.
root: This account will be able to perform any action in Rucio. No part of the production system will be given root access. Rather, this will be used for defining the topology (making RSEs, etc.) as well as forcing the removal of rules made with other accounts.
tier0: For data produced by the Tier0 and subscribed to CERN and Tier 1 MSS. Quotas are only needed at Tier0 and Tier1 sites.
production: For data and replicas produced by the production system. Quotas are needed at each site used for long-term disk storage.
staging (or another name): Used for transient data placement transfers by Unified for processing. Quotas are needed at each site used for production.
tape_recall: For replicas to be staged from tape by CRAB. We may wish to limit these quotas to a few known good analysis sites.
user.[jdoe]: For data produced by users. Each user would have quota on one RSE.
group.[physics]: For replicas of existing datasets to be stored at particular RSEs (does CMS need this?)
manager.[T2_XX_Yyyy]: Associated with an account at a particular RSE (effectively local quotas)
In Rucio the RSE is a virtual concept corresponding to a portion of a storage element. We propose to use at least two and perhaps as many as four RSEs per site (or PhEDEx Node Name, PNN). Each RSE will be named for the site/PNN as we have now, plus the extensions listed below. We require all of the RSEs below to be created at each site, except that _Tape is only needed at Tier1 sites.
- T1_US_FNAL_Disk – the normal disk endpoint for a Tier1
- T1_US_FNAL_Tape – the normal tape endpoint for a Tier1
- T2_CH_CERN - the disk endpoint of a Tier2/3
- T2_CH_CERN_Temp – this is where user data can be written. This is a different type of RSE called a non-deterministic RSE, in which both the PFN and the LFN are specified. This is needed for user data for two reasons:
  - It allows the site to manage, as now, /store/temp/user/ areas where each user is allowed to write their own files
  - The sub-directory where the user writes is /store/temp/user/[NAME].[HASH]/, where the hash is needed to allow a user to write with multiple DNs (HASH is the hash of the user's DN).
- T2_CH_CERN_Test – mapped to /store/test/rucio initially. This will initially be used for Rucio testing where Rucio can write and delete data without impacting CMS operations. Such an area may prove useful for load tests, so may be kept afterwards.
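The naming scheme above can be sketched as a small helper that derives the RSE names for a site. The function name `rse_names` is hypothetical, and the rule that Tier1 sites get `_Disk` and `_Tape` while other sites keep the bare site name for disk is read off the examples in the list rather than stated as policy.

```python
from typing import List


def rse_names(site: str) -> List[str]:
    """Return the RSE names proposed for one site/PNN (illustrative)."""
    tier = site.split("_", 1)[0]
    if tier == "T1":
        # Tier1 sites have separate disk and tape endpoints.
        names = [f"{site}_Disk", f"{site}_Tape"]
    else:
        # Tier2/3 sites use the bare site name as the disk endpoint.
        names = [site]
    # Every site also gets a user-writable area and a test area.
    names += [f"{site}_Temp", f"{site}_Test"]
    return names
```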
DIDs in Rucio have a wide array of attached metadata. We propose to use the following fields for these purposes
In Rucio, only the pair of scope and name is required to be unique. Because of the way we want to map scope to the upper level of the CMS LFN namespace, we will need to enforce the allowed scope/LFN pairings.
cms: This will be the default scope for CMS and will allow data to be stored into the following LFN prefixes:
- /store/data/
- /store/hidata/
- /store/mc/
- /store/himc/
- /store/relval/
- /store/hirelval/
- /store/express/
- /store/results/
- /store/backfill/
- /store/generator/
test: Data used for tests, including load tests
- /store/test/ (Note this same PFN space may be accessible with test RSEs if the site does not enforce a separate PFN namespace. This is the same situation with PhEDEx.)
user.[jdoe]: A scope for each user. This will automatically be filled from CRIC. Prefixing with "user." is necessary to keep users from overlapping with other groups, central accounts, etc.
- /store/user/rucio/[jdoe]/
- /store/temp/user/[jdoe] (these data can be registered on special RSEs where LFN and PFN are both specified. The LFN will be of the first form, the PFN will correspond to the /temp/ part of the LFN space)
group.[name]: Similar to user, but for groups
- /store/group/rucio/[name]/
unmerged: A special scope in which unmerged data can be stored and tracked until it is merged. These data can be deleted without storing the metadata into the archive.
- /store/unmerged/
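A minimal sketch of how the scope/LFN pairings listed above could be enforced is shown below. The prefixes are copied from the lists; the function names and the idea of doing this as a simple prefix check are assumptions for illustration, not an existing Rucio or CMS component.

```python
from typing import Tuple

# Fixed scopes and the LFN prefixes they may write to (from the lists above).
FIXED_SCOPES = {
    "cms": (
        "/store/data/", "/store/hidata/", "/store/mc/", "/store/himc/",
        "/store/relval/", "/store/hirelval/", "/store/express/",
        "/store/results/", "/store/backfill/", "/store/generator/",
    ),
    "test": ("/store/test/",),
    "unmerged": ("/store/unmerged/",),
}


def allowed_prefixes(scope: str) -> Tuple[str, ...]:
    """Return the LFN prefixes a scope may use; empty if the scope is unknown."""
    if scope in FIXED_SCOPES:
        return FIXED_SCOPES[scope]
    kind, _, who = scope.partition(".")
    if kind == "user" and who:
        # The temp prefix is taken literally from the list and has no
        # trailing slash, so it is a loose match.
        return (f"/store/user/rucio/{who}/", f"/store/temp/user/{who}")
    if kind == "group" and who:
        return (f"/store/group/rucio/{who}/",)
    return ()


def scope_allows(scope: str, lfn: str) -> bool:
    """True if the LFN falls under one of the prefixes allowed for the scope."""
    return lfn.startswith(allowed_prefixes(scope))
```

A check like this would reject, for example, a file under /store/user/rucio/jdoe/ registered in the `cms` scope, while accepting the same file in the `user.jdoe` scope.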
There is a subset of metadata used by Rucio in its own operation that must be provided. This includes:
- bytes (of file)
- md5 (checksum)
- adler32 (checksum)
There are also a number of fields internal to Rucio, which do not need to be supplied.
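The three required values can be computed in a single pass over a file with the Python standard library, as in the sketch below. The function name `file_metadata` is hypothetical, and formatting the adler32 value as eight lowercase hex digits is an assumption about the expected representation.

```python
import hashlib
import zlib


def file_metadata(path: str, chunk_size: int = 1024 * 1024) -> dict:
    """Compute size, md5, and adler32 for a file, reading it in chunks."""
    size = 0
    md5 = hashlib.md5()
    adler = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            size += len(chunk)
            md5.update(chunk)
            adler = zlib.adler32(chunk, adler)
    return {
        "bytes": size,
        "md5": md5.hexdigest(),
        # Assumed representation: zero-padded, 8 lowercase hex digits.
        "adler32": f"{adler & 0xFFFFFFFF:08x}",
    }
```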
Additionally there is optional metadata which may be useful for describing the data and for forming subscriptions (see below). All of these values should be set per file as well as per dataset. Setting per file helps with accounting at sites.
name: This will be the dataset (Rucio container), block (Rucio dataset), or LFN name
events: We can set this. Not sure what it would be used for
project: Use this for the Era (undefined length in CMS, 199 limit for all of processed dataset). For ATLAS this is a duplicate of the scope
datatype: Use this for the data tier (GEN, MINIAOD, RAW)
run_number: Only one run # can be attached per DID. This is probably not useful to us and should be ignored.
stream_name: This is the primary dataset (current limit of 70 characters, CMS limit is 99)
prod_step: The name of the WMAgent/cmsRun step which made the data. This can be the "TaskType" from WMAgent or the process name from the CMSSW config
version: We could use this or not to version datasets (version is already in the name)
campaign: The campaign we used to produce the data.
task_id: This field is an integer, so we can't use it for the request name or PREP ID. We should be able to attach the request name, original request name, and the PREP_ID(s) of each request to the datasets.
phys_group: If this could be passed from McM as a new request parameter, it's possible it could be useful.
Much of what we do today with Dynamo can be accomplished with Rucio rules and subscriptions. In Rucio a rule is similar to a PhEDEx subscription in that it specifies what to do with one DID. However, rules can be more general and have built-in expirations, so it is already possible to say "Keep two copies of this dataset on disk in Europe until one year from now."
Subscriptions are best viewed as "rule generators" which match on dataset metadata (name or any other field described in the metadata section above) to generate rules. Subscriptions have names, and all rules created by a subscription can be cleaned up by removing rules by the name of the creating subscription. A subscription is how we will generate rules like "One copy of all MINIAOD from campaign Spring19 in the USA".
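The "rule generator" idea can be illustrated with a small, self-contained model in plain Python. This is not the Rucio API: the `Subscription` class, the `rules_for` function, and the `country=US` expression are all hypothetical stand-ins chosen to mirror the MINIAOD example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Subscription:
    name: str
    filter: Dict[str, str]   # metadata field -> required value
    copies: int
    rse_expression: str      # where the copies should live


def rules_for(did_name: str, metadata: Dict[str, str],
              subscriptions: List[Subscription]) -> List[dict]:
    """Generate one rule per subscription whose filter matches the metadata."""
    rules = []
    for sub in subscriptions:
        if all(metadata.get(key) == value for key, value in sub.filter.items()):
            rules.append({
                "did": did_name,
                "copies": sub.copies,
                "rse_expression": sub.rse_expression,
                # Recording the creator lets rules be cleaned up by subscription.
                "subscription": sub.name,
            })
    return rules


# "One copy of all MINIAOD from campaign Spring19 in the USA"
miniaod_us = Subscription(
    name="spring19-miniaod-us",
    filter={"datatype": "MINIAOD", "campaign": "Spring19"},
    copies=1,
    rse_expression="country=US",  # hypothetical RSE attribute
)
```

Because each generated rule carries the name of its subscription, removing everything a subscription created reduces to filtering on that one field.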
We will need to translate what Dynamo is doing into a set of subscriptions and rules which are more or less static.
The popularity portion of what Dynamo is doing will be done with additional, short-lived, rules created by the dynamic data management daemons of Rucio.
FTS3 multi-dimensional scheduling offers two features:
- activity shares, which divide the assigned slots according to weights decided by the VO
- priorities, which reshuffle the jobs within an activity share
Currently CMS does not rely on these features. PhEDEx has its own intricate scheduling mechanism, and FTS3 acts simply as a FIFO queue.
Rucio provides an interface for both features; however, ATLAS is only using activity shares, as setting different priorities may cause starvation.
We propose to introduce the following activity shares in CMS Rucio (initial weights to be decided):
- tier-0 transfers
- pre-production transfers
- post-production transfers
- debug transfers
- popularity based replication
- rebalancing
- user transfers
- crab staging
- ASO
- default (pre-existing in FTS)
Activity and priority can be specified as attributes of the replication rule. Activity weights are defined in the FTS configuration and can be adjusted via the FTS API using the VO production role.
- FTS scheduling: http://fts3-docs.web.cern.ch/fts3-docs/docs/features.html#multidimensional-scheduler
- Rucio interface: https://rucio.readthedocs.io/en/latest/api/rule.html
- FTS interface: http://fts3-docs.web.cern.ch/fts3-docs/fts-rest/docs/api.html#activity-shares-configuration
- PhEDEx scheduling: https://twiki.cern.ch/twiki/bin/view/CMS/PhedexAdminDocsPriorityQueues
  - Priority levels (high, normal, low, reserved) are specified by the user in the transfer request.
  - Approved requests turn into subscriptions.
  - Subscribed files are allocated for transfer in order of priority and the time the data was requested, older requests first.
  - Files are allocated for routing until the 50 TB request window for each priority is filled.
  - Priority changes are propagated to the allocated file requests.
  - Once data are routed, transfer tasks are created with task_priority, time_assigned, and rank. The rank is ordered by priority and then the file's logical name.
  - Transfer tasks are fetched by the site FileDownload agent, sorted by time_assigned and rank, and submitted to FTS in that order.
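The last two steps of the PhEDEx scheme described above (rank assignment and submission order) can be sketched as two sorts. The `Task` type, the numeric priority mapping, and the function names are assumptions for illustration; the "reserved" level is left out because its position in the ordering is not stated here.

```python
from typing import Dict, List, NamedTuple

# Assumed ordering: lower number is served first.
PRIORITY_ORDER = {"high": 0, "normal": 1, "low": 2}


class Task(NamedTuple):
    lfn: str
    priority: str
    time_assigned: int


def assign_ranks(tasks: List[Task]) -> Dict[str, int]:
    """Rank tasks by priority, then by the file's logical name."""
    ordered = sorted(tasks, key=lambda t: (PRIORITY_ORDER[t.priority], t.lfn))
    return {task.lfn: rank for rank, task in enumerate(ordered)}


def submission_order(tasks: List[Task]) -> List[Task]:
    """Order tasks as fetched for download: by time_assigned, then rank."""
    ranks = assign_ranks(tasks)
    return sorted(tasks, key=lambda t: (t.time_assigned, ranks[t.lfn]))
```

Note that in this ordering an older low-priority task is submitted before a newer high-priority one, since time_assigned is the first sort key.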
VO | Activity | Weight |
---|---|---|
atlas | data brokering | 0.3 |
atlas | data consolidation | 0.2 |
atlas | data rebalancing | 0.5 |
atlas | default | 0.02 |
atlas | express | 0.4 |
atlas | functional test | 0.2 |
atlas | production input | 0.25 |
atlas | production output | 0.25 |
atlas | recovery | 0.4 |
atlas | staging | 0.5 |
atlas | t0 export | 0.7 |
atlas | t0 tape | 0.7 |
atlas | user subscriptions | 0.1 |
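To give a feel for what such weights mean, the sketch below divides a number of transfer slots among the activities that currently have queued transfers, in proportion to their weights. This is a simplified model and an assumption about the behaviour, not the actual FTS3 scheduler; the function name `split_slots` is hypothetical and only a few of the ATLAS weights from the table are reproduced.

```python
from typing import Dict, Iterable

# A few of the ATLAS weights from the table above.
ATLAS_WEIGHTS = {
    "t0 export": 0.7,
    "production input": 0.25,
    "user subscriptions": 0.1,
    "default": 0.02,
}


def split_slots(total_slots: int, weights: Dict[str, float],
                queued: Iterable[str]) -> Dict[str, float]:
    """Share slots among queued activities in proportion to their weights."""
    active = {name: weights[name] for name in queued if name in weights}
    total_weight = sum(active.values())
    if total_weight == 0:
        return {}
    # Fractional shares are returned; a real scheduler would round them.
    return {name: total_slots * w / total_weight for name, w in active.items()}
```

Under this model the weights do not need to sum to one: only the ratio between the activities that actually have work queued matters, so an idle activity gives its share to the others.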