
CMSPopularity is a community project to cover various aspects of CMS popularity via data-stream aggregation on HDFS.


cronosnull/CMSPopularity

 
 


CMSPopularity

CMSPopularity is a community project to cover various aspects of CMS popularity via data-stream aggregation on HDFS.

Introduction

We'll use the CMSSpark package to produce and collect various metrics from HDFS. These metrics represent user activity across various CMS data-services. For a description of the data-services and the available data, please refer to the CMSSpark package.

For previous efforts to aggregate different metrics, please refer to the summer student reports.

So far we feed data into the CERN MONIT system with the following dashboards:

Data Popularity Scrutiny Plot Specification

This histogram shows dataset usage by CMS jobs. The bins of the plot are labeled by number of accesses, where one access is equal to reading 100% of the events or files in the dataset. The 1-bin includes any non-zero reading of less than 150% of the dataset; higher access counts are rounded to the nearest integer. The 0-bin contains datasets created during the period but not used. The 0-old bin contains datasets created before the period but not used.
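The binning rule above can be sketched in a few lines of Python. This is only an illustration, not code from this repository: the function name `access_bin` and its arguments are hypothetical, and breaking ties upward (so that 2.5 accesses lands in the 3-bin) is an assumption, since the specification only says "nearest integer".

```python
def access_bin(fraction_read, created_in_period):
    """Return the bin label for one dataset.

    fraction_read: total amount read divided by the dataset size,
        so 1.0 means the whole dataset was read once (hypothetical input).
    created_in_period: True if the dataset was created during the plot period.
    """
    if fraction_read <= 0:
        # Unused datasets are split by creation time.
        return "0" if created_in_period else "0-old"
    if fraction_read < 1.5:
        # Any non-zero reading below 150% of the dataset counts as one access.
        return "1"
    # Assumption: ties are rounded up (1.5 -> 2, 2.5 -> 3).
    return str(int(fraction_read + 0.5))


labels = [access_bin(f, True) for f in (0.0, 0.2, 1.49, 1.5, 3.6)]
```

With these inputs, `labels` holds one bin label per dataset, which could then be used as the grouping key when the histogram is filled.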

Each bin is broken into three sub-bins that cover the last three months, the last six months, and the full time period of the plot.

Each bin is weighted by the dataset sizes in the bin. The size calculation starts with the average replica size at a site, which is the daily weighted average size of the dataset during the time it is present at a site. Then the average sizes are summed for each day of the dataset’s lifetime over all the sites where replicas are located. This sum is divided by the number of days in the period to give an overall daily weighted average of the CMS disk space taken up by the dataset during the period.
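As a rough sketch of the size calculation described above, the following Python function works from per-site lists of daily replica sizes. The function name, the input layout (one size value per day the replica is present at a site) and the site names are assumptions made for illustration; the actual aggregation in this project runs on HDFS data and may be organised differently.

```python
def period_weighted_size(daily_sizes_by_site, period_days):
    """Daily weighted average disk space used by one dataset over the period.

    daily_sizes_by_site: dict mapping a site name to a list with one replica
        size per day the dataset was present at that site (hypothetical layout).
    period_days: number of days in the plot period.
    """
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    total = 0.0
    for daily_sizes in daily_sizes_by_site.values():
        if not daily_sizes:
            continue  # the dataset was never present at this site
        days_present = len(daily_sizes)
        # Average replica size at the site while the replica was present.
        avg_replica_size = sum(daily_sizes) / days_present
        # Summing that average once per day present is the same as
        # multiplying it by the number of days present.
        total += avg_replica_size * days_present
    # Normalise by the length of the whole period.
    return total / period_days


example = period_weighted_size({"T1_A": [100] * 10, "T2_B": [100] * 5}, 10)
```

In this example a 100-unit dataset sits at one site for the full 10-day period and at a second site for 5 days, so it occupies 150 units on average.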

The plot is usually limited to showing data for datasets on T1s and T2s.

Tasks:

In this project we'll pursue the following tasks:

References:

  • CMS popularity
  • CMS data-management
  • CMSSpark
  • PySpark
