[v7r2] fixes for ES docs, SummarizeLogsAgent and StatesAccountingAgent can also commit to Monitoring #4905

Merged (4 commits) on Feb 3, 2021
109 changes: 88 additions & 21 deletions docs/source/AdministratorGuide/Systems/MonitoringSystem/index.rst
@@ -10,32 +10,34 @@ Monitoring System
Overview
=========

The Monitoring system is used to monitor various components of DIRAC. Currently, we have two monitoring types:
The Monitoring system is used to monitor various components of DIRAC. Currently, we have three monitoring types:

- WMSHistory: for monitoring the DIRAC WMS
- Component Monitoring: for monitoring DIRAC components such as services, agents, etc.
- Component Monitoring: for monitoring DIRAC components such as services, agents, etc.
- RMS Monitoring: for monitoring the DIRAC RequestManagement System (mostly the Request Executing Agent).

It is based on Elasticsearch distributed search and analytics NoSQL database. If you want to use it, you have to install the Monitoring service and
elasticsearch db. You can use a single node, if you do not have to store lot of data, otherwise you need a cluster (more than one node).
It is based on Elasticsearch, a distributed search and analytics NoSQL database.
If you want to use it, you have to install the Monitoring service and connect it to an Elasticsearch instance.

Install Elasticsearch
======================

You can found in https://www.elastic.co official web site. I propose to use standard tools to install for example: yum, rpm, etc. otherwise
you encounter some problems. If you are not familiar with managing linux packages, you have to ask your college or read some relevant documents.
This is not covered here, as the installation and administration of ES are not part of the DIRAC guide.
A note on the supported ES versions: ES7 and ES6 are supported, support for ES5 is not assured,
and support for ES6 will be dropped in a future release.

Configure the MonitoringSystem
===============================

You can run your El cluster without authentication or using User name and password. You have to add the following parameters:
You can run your Elasticsearch cluster without authentication, or with a user name and password. You have to add the following parameters:

- User
- Password
- Host
- Port

The User name and Password must be added to the local cfg file while the other can be added to the CS using the Configuration web application.
You have to handle the EL secret information in a similar way to what is done for the other supported SQL databases, e.g. MySQL
The *User* name and *Password* must be added to the local cfg file, while the others can be added to the CS using the Configuration web application.
You have to handle the ES secret information in a similar way to what is done for the supported SQL databases, e.g. MySQL.


For example::
@@ -47,24 +49,54 @@ For example::
User = test
Password = password
}
}


The following option can be set in `Systems/Monitoring/<Setup>/Databases/MonitoringDB`:

*IndexPrefix*: Prefix prepended to the indexes created in the ES instance. If this
is not present in the CS, the indexes are prefixed with the setup name.

For each monitoring type managed, the Period (how often a new index is created)
can be defined with::

MonitoringTypes
{
ComponentMonitoring
{
# Indexing strategy. Possible values: day, week, month, year, null
Period = month
}
RMSMonitoring
{
# Indexing strategy. Possible values: day, week, month, year, null
Period = month
}
WMSHistory
{
# Indexing strategy. Possible values: day, week, month, year, null
Period = day
}
}

The periods given above are also the defaults in the code.
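
To illustrate what a ``Period`` setting means in practice, the following Python sketch derives a dated index name from a prefix, a monitoring type and a period. The helper name ``index_name_for`` and the exact suffix formats are assumptions made for this example; they are not taken from the DIRAC code.

```python
from datetime import date


def index_name_for(prefix, monitoring_type, period, day):
    """Hypothetical helper: build an index name for the given period.

    The naming scheme and suffix formats are illustrative assumptions only.
    """
    base = "%s_%s" % (prefix.lower(), monitoring_type.lower())
    period = (period or "null").lower()
    if period == "day":
        return "%s-%s" % (base, day.strftime("%Y-%m-%d"))
    if period == "week":
        # ISO year and ISO week number, the week zero-padded to two digits
        iso_year, iso_week, _ = day.isocalendar()
        return "%s-%d-w%02d" % (base, iso_year, iso_week)
    if period == "month":
        return "%s-%s" % (base, day.strftime("%Y-%m"))
    if period == "year":
        return "%s-%s" % (base, day.strftime("%Y"))
    if period == "null":
        # no time-based splitting: everything goes to a single index
        return base
    raise ValueError("Unknown period: %s" % period)


print(index_name_for("Prod", "WMSHistory", "day", date(2021, 2, 3)))
print(index_name_for("Prod", "RMSMonitoring", "month", date(2021, 2, 3)))
```

With ``Period = day`` a new index name is produced every day, whereas ``null`` always yields the same name, so the choice trades the number of indexes against the size of each one.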


Enable WMSHistory monitoring
============================

You have to install the WorkloadManagemet/StatesMonitoringAgent. This agent is used to collect information using the JobDB and send it to the Elasticsearch database.
If you install this agent, you can stop the StatesAccounting agent.
You have to add ``Monitoring`` to the ``Backends`` option of WorkloadManagement/StatesAccountingAgent.
If you do so, this agent will collect information from the JobDB and send it to the Elasticsearch database.
The same agent can also report to the MySQL backend of the Accounting system (which is in fact the default).
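
As a minimal sketch of how an option such as ``Backends`` might be interpreted, the Python snippet below splits a comma-separated value into a list of backend names and falls back to a default when the option is empty. The helper ``parse_backends``, the comma-separated syntax and the backend name ``Accounting`` are assumptions for illustration, not the agent's actual code.

```python
def parse_backends(option_value, default="Accounting"):
    """Hypothetical helper: split a comma-separated Backends value.

    An empty or missing value falls back to the (assumed) default backend.
    """
    raw = option_value if option_value else default
    return [name.strip() for name in raw.split(",") if name.strip()]


backends = parse_backends("Accounting, Monitoring")
# Decide where the collected records would be sent
send_to_monitoring = "Monitoring" in backends
send_to_accounting = "Accounting" in backends
print(backends, send_to_monitoring, send_to_accounting)
```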

Note: You can use RabbitMQ for failover. This is optional as the agent already has a failover mechanism. You can configure RabbitMQ in the local dirac.cfg file
where the agent is running::
Optionally, you can use an MQ system (like RabbitMQ) for failover, even though the agent already has a simple failover mechanism.
You can configure the MQ in the local dirac.cfg file where the agent is running::

Resources
{
MQServices
{
hostname (for example lbvobox10.cern.ch)
hostname.some.where
{
MQType = Stomp
Port = 61613
@@ -86,24 +118,59 @@ where the agent is running::
Enable Component monitoring
===========================

You have to set DynamicMonitoring=True in the CS::
You have to set ``DynamicMonitoring=True`` in the CS::

Systems
{
Framework
{
SystemAdministrator
Framework
{
<instance>
{
Services
{
SystemAdministrator
{
...
DynamicMonitoring = True
}
...
DynamicMonitoring = True
}
}
}
}
}


.. image:: cs.png
:align: center


Enable RMS Monitoring
=====================

To enable RMSMonitoring, set the ``EnableActivityMonitoring`` flag to yes/true in the CS::


Systems
{
RequestManagement
{
<instance>
{
Agents
{
RequestExecutingAgent
{
...
EnableActivityMonitoring = True
}
}
}
}
}


or inside the ``/Operations`` section as a general flag.


Accessing the Monitoring information
=====================================

20 changes: 9 additions & 11 deletions src/DIRAC/MonitoringSystem/Client/MonitoringReporter.py
@@ -37,12 +37,11 @@ class MonitoringReporter(object):
"""
.. class:: MonitoringReporter

This class is used to interact with the db using failover mechanism.
This class is used to interact with the ES DB, using an MQ as a failover mechanism.

:param int __maxRecordsInABundle: limit the number of records to be inserted to the db.
:param threading.RLock __documentLock: is used to lock the local store when it is being modified.
:param __documents: contains the recods which will be inserted to the db.
:type __documents: python:list
:param list __documents: contains the records which will be inserted to the db.
:param str __monitoringType: type of the records which will be inserted to the db. For example: WMSHistory.
:param str __failoverQueueName: the name of the messaging queue. For example: /queue/dirac.certification
"""
@@ -62,8 +61,8 @@ def __del__(self):

def processRecords(self):
"""
It consumes all messages from the MQ (these are failover messages). In case of failure, the messages
will be inserted to the MQ again.
It consumes all messages from the MQ (these are failover messages).
In case of failure, the messages will be inserted to the MQ again.
"""
retVal = monitoringDB.pingDB() # if the db is not accessible, the records will not be processed from the MQ
if retVal['OK']:
@@ -74,10 +73,9 @@ def processRecords(self):

result = createConsumer("Monitoring::Queues::%s" % self.__failoverQueueName)
if not result['OK']:
gLogger.error("Fail to create Consumer: %s" % result['Message'])
return S_ERROR("Fail to create Consumer: %s" % result['Message'])
else:
mqConsumer = result['Value']
gLogger.error("Fail to create Consumer", result['Message'])
return S_ERROR("Fail to create Consumer")
mqConsumer = result['Value']

result = S_OK()
failedToProcess = []
@@ -128,8 +126,8 @@ def publishRecords(self, records, mqProducer=None):

def commit(self):
"""
It inserts the accumulated data to the db. In case of failure
it keeps in memory/MQ
It inserts the accumulated data to the db.
In case of failure it keeps the data in memory/MQ.
"""
# before we try to insert the data to the db, we process all the data
# which are already in the queue
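
The docstrings above describe a commit that tries the db first and keeps the data in memory/MQ on failure. A minimal, self-contained sketch of that idea is given below. The class ``FailoverBuffer`` and its callables ``insert`` and ``publish`` are hypothetical stand-ins, not the MonitoringReporter API; only the bundle limit and the lock mirror the parameters named in the class docstring.

```python
import threading


class FailoverBuffer(object):
    """Hypothetical stand-in illustrating the commit-with-failover idea."""

    def __init__(self, insert, publish, max_records=5000):
        self._insert = insert          # callable(list) -> bool, True on success
        self._publish = publish        # callable(list) -> bool, True on success
        self._max_records = max_records
        self._lock = threading.RLock()
        self._documents = []

    def add_record(self, record):
        with self._lock:
            self._documents.append(record)

    def commit(self):
        """Try the db first, then the queue; keep in memory what neither accepted."""
        with self._lock:
            documents, self._documents = self._documents, []
        sent = 0
        remaining = []
        for start in range(0, len(documents), self._max_records):
            bundle = documents[start:start + self._max_records]
            if self._insert(bundle):
                sent += len(bundle)
            elif not self._publish(bundle):
                remaining.extend(bundle)
        with self._lock:
            # put back what could not be stored anywhere, ahead of newer records
            self._documents = remaining + self._documents
        return sent

    def pending(self):
        with self._lock:
            return list(self._documents)
```

Records rejected by both the db and the queue stay in the local store, so a later ``commit`` call can retry them.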
22 changes: 2 additions & 20 deletions src/DIRAC/MonitoringSystem/ConfigTemplate.cfg
@@ -6,30 +6,12 @@ Services
Port = 9137
Authorization
{
Default = authenticated
Default = authenticated
FileTransfer
{
Default = authenticated
}
}
MonitoringTypes
{
ComponentMonitoring
{
# Indexing strategy. Possible values: day, week, month, year, null
Period = month
}
RMSMonitoring
{
# Indexing strategy. Possible values: day, week, month, year, null
Period = month
}
WMSHistory
{
# Indexing strategy. Possible values: day, week, month, year, null
Period = day
}
}
}
##END
}
}
25 changes: 24 additions & 1 deletion src/DIRAC/MonitoringSystem/DB/MonitoringDB.py
@@ -3,11 +3,34 @@

**Configuration Parameters**:

The following options can be set in ``Systems/Monitoring/<Setup>/Databases/MonitoringDB``
The following option can be set in `Systems/Monitoring/<Setup>/Databases/MonitoringDB`

* *IndexPrefix*: Prefix prepended to the indexes created in the ES instance. If this
is not present in the CS, the indexes are prefixed with the setup name.

For each monitoring type managed, the Period (how often a new index is created)
can be defined with::

MonitoringTypes
{
ComponentMonitoring
{
# Indexing strategy. Possible values: day, week, month, year, null
Period = month
}
RMSMonitoring
{
# Indexing strategy. Possible values: day, week, month, year, null
Period = month
}
WMSHistory
{
# Indexing strategy. Possible values: day, week, month, year, null
Period = day
}
}


"""
from __future__ import absolute_import
from __future__ import division
5 changes: 3 additions & 2 deletions src/DIRAC/ResourceStatusSystem/Agent/SummarizeLogsAgent.py
@@ -101,8 +101,8 @@ def execute(self):
self.log.error(deleteResult['Message'])
continue

if self.months:
self._removeOldHistoryEntries(element, self.months)
if self.months:
self._removeOldHistoryEntries(element, self.months)

return S_OK()

@@ -252,6 +252,7 @@ def _removeOldHistoryEntries(self, element, months):
:return: S_OK / S_ERROR
"""
toRemove = datetime.utcnow().replace(microsecond=0) - timedelta(days=30 * months)
self.log.info("Removing history entries", "older than %s" % toRemove)

deleteResult = self.rsClient.deleteStatusElement(element, 'History',
meta={'older': ['DateEffective', toRemove]})