Implements fine-grained data directories. Fixes #1051.
* Configurable data directories instead of a single /sfm-data directory.
* Updates reference to sfm-base Docker image
* Updates space monitoring.
Co-authored-by: SvenLieber <sven.lieber@ugent.be>
lwrubel authored Jun 22, 2021
1 parent 8a75aef commit 39cca2f
Showing 26 changed files with 304 additions and 97 deletions.
3 changes: 2 additions & 1 deletion Dockerfile
@@ -1,4 +1,5 @@
-FROM gwul/sfm-base@sha256:e68cb98bdc9dc23bbed734f3e507a0ffb866b007dffea038b6af8d88a62150e6
+# gwul/sfm-base 2.4.0
+FROM gwul/sfm-base@sha256:dc6f305631f756d9c25517b6322ec83a024c11c3e1e1d7be742d74a57b1f1028
MAINTAINER Social Feed Manager <sfm@gwu.edu>

# Install apache
5 changes: 3 additions & 2 deletions Dockerfile-consumer
@@ -1,4 +1,5 @@
-FROM gwul/sfm-base@sha256:e68cb98bdc9dc23bbed734f3e507a0ffb866b007dffea038b6af8d88a62150e6
+# gwul/sfm-base 2.4.0
+FROM gwul/sfm-base@sha256:dc6f305631f756d9c25517b6322ec83a024c11c3e1e1d7be742d74a57b1f1028
MAINTAINER Social Feed Manager <sfm@gwu.edu>

ADD . /opt/sfm-ui/
@@ -11,4 +12,4 @@ RUN chmod +x /opt/sfm-setup/invoke_consumer.sh
ENV DJANGO_SETTINGS_MODULE=sfm.settings.docker_settings
ENV LOAD_FIXTURES=false

-CMD ["/opt/sfm-setup/invoke_consumer.sh"]
+CMD ["/opt/sfm-setup/invoke_consumer.sh"]
5 changes: 3 additions & 2 deletions Dockerfile-runserver
@@ -1,4 +1,5 @@
-FROM gwul/sfm-base@sha256:e68cb98bdc9dc23bbed734f3e507a0ffb866b007dffea038b6af8d88a62150e6
+# gwul/sfm-base 2.4.0
+FROM gwul/sfm-base@sha256:dc6f305631f756d9c25517b6322ec83a024c11c3e1e1d7be742d74a57b1f1028
MAINTAINER Social Feed Manager <sfm@gwu.edu>

ADD . /opt/sfm-ui/
@@ -18,4 +19,4 @@ ENV DJANGO_SETTINGS_MODULE=sfm.settings.docker_settings
ENV LOAD_FIXTURES=false
EXPOSE 8000

-CMD ["/opt/sfm-setup/invoke_runserver.sh"]
+CMD ["/opt/sfm-setup/invoke_runserver.sh"]
2 changes: 1 addition & 1 deletion docker/consumer/invoke_consumer.sh
@@ -2,7 +2,7 @@
set -e

sh /opt/sfm-setup/setup_reqs.sh
-appdeps.py --wait-secs 90 --port-wait ${SFM_POSTGRES_HOST}:${SFM_POSTGRES_PORT} --file /opt/sfm-ui --port-wait ${SFM_RABBITMQ_HOST}:${SFM_RABBITMQ_PORT} --port-wait ui:8080 --file-wait /sfm-data/collection_set
+appdeps.py --wait-secs 90 --port-wait ${SFM_POSTGRES_HOST}:${SFM_POSTGRES_PORT} --file /opt/sfm-ui --port-wait ${SFM_RABBITMQ_HOST}:${SFM_RABBITMQ_PORT} --port-wait ui:8080 --file-wait /sfm-collection-set-data/collection_set

echo "Running consumer"
exec gosu sfm /opt/sfm-ui/sfm/manage.py startconsumer
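The ``--file-wait`` flag here blocks startup until the collection set directory exists. A minimal sketch of that polling behavior, for illustration only (this is not the actual ``appdeps.py`` implementation; the function name and timeout default are assumptions):

```shell
# Poll for a path until it exists or a timeout (in seconds) elapses,
# mimicking the --file-wait dependency check appdeps.py performs.
wait_for_file() {
  path="$1"
  timeout="${2:-90}"
  elapsed=0
  while [ ! -e "$path" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for $path" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "found $path"
}
```

For example, ``wait_for_file /sfm-collection-set-data/collection_set 90`` mirrors the consumer's dependency check on the collection set volume.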
2 changes: 1 addition & 1 deletion docker/db/Dockerfile
@@ -2,5 +2,5 @@ FROM postgres:9.6

MAINTAINER Laura Wrubel <sfm@gwu.edu>

-ENV PGDATA /sfm-data/postgresql/9.6/data
+ENV PGDATA /sfm-db-data/postgresql/9.6/data
ADD initdb.sql /docker-entrypoint-initdb.d/
2 changes: 1 addition & 1 deletion docker/ui/invoke.sh
@@ -2,7 +2,7 @@
set -e

sh /opt/sfm-setup/setup_reqs.sh
-appdeps.py --wait-secs 60 --port-wait ${SFM_POSTGRES_HOST}:${SFM_POSTGRES_PORT} --file /opt/sfm-ui --port-wait ${SFM_RABBITMQ_HOST}:${SFM_RABBITMQ_PORT} --file-wait /sfm-data/collection_set
+appdeps.py --wait-secs 60 --port-wait ${SFM_POSTGRES_HOST}:${SFM_POSTGRES_PORT} --file /opt/sfm-ui --port-wait ${SFM_RABBITMQ_HOST}:${SFM_RABBITMQ_PORT} --file-wait /sfm-collection-set-data/collection_set
sh /opt/sfm-setup/setup_ui.sh

echo "Running server"
2 changes: 1 addition & 1 deletion docker/ui/invoke_runserver.sh
@@ -2,7 +2,7 @@
set -e

sh /opt/sfm-setup/setup_reqs.sh
-appdeps.py --wait-secs 60 --port-wait ${SFM_POSTGRES_HOST}:${SFM_POSTGRES_PORT} --file /opt/sfm-ui --port-wait ${SFM_RABBITMQ_HOST}:${SFM_RABBITMQ_PORT} --file-wait /sfm-data/collection_set
+appdeps.py --wait-secs 60 --port-wait ${SFM_POSTGRES_HOST}:${SFM_POSTGRES_PORT} --file /opt/sfm-ui --port-wait ${SFM_RABBITMQ_HOST}:${SFM_RABBITMQ_PORT} --file-wait /sfm-collection-set-data/collection_set
sh /opt/sfm-setup/setup_ui.sh

echo "Running server"
29 changes: 23 additions & 6 deletions docs/install.rst
@@ -162,16 +162,33 @@ Note the following:

Configuration is documented in ``example.env``. For a production deployment, pay particular attention to the following:

-* Set new passwords for ``SFM_SITE_ADMIN_PASSWORD``, ``RABBIT_MQ_PASSWORD``, and ``POSTGRES_PASSWORD``.
+* Set new passwords for ``SFM_SITE_ADMIN_PASSWORD``, ``SFM_RABBIT_MQ_PASSWORD``, and ``SFM_POSTGRES_PASSWORD``.
* The `data volume strategy <https://docs.docker.com/engine/userguide/dockervolumes/#creating-and-mounting-a-data-volume-container>`_
-is used to manage the volumes that store SFM's data. By default, normal Docker volumes are used. To use a host volume
-instead, change the ``DATA_VOLUME`` and ``PROCESSING_VOLUME`` settings. Host volumes are recommended for production
-because they allow access to the data from outside of Docker.
+is used to manage the volumes that store SFM's data. By default, normal Docker volumes are used. Host volumes are recommended for production
+because they allow access to the data from outside of Docker. To use host volumes, change the following values to point
+to a directory or mounted filesystem (e.g. ``/sfm-data/sfm-mq-data:/sfm-mq-data``):
+  * ``DATA_VOLUME_MQ``
+  * ``DATA_VOLUME_DB``
+  * ``DATA_VOLUME_EXPORT``
+  * ``DATA_VOLUME_CONTAINERS``
+  * ``DATA_VOLUME_COLLECTION_SET``
+  * ``PROCESSING_VOLUME``
+* SFM allows data volumes to live on mounted filesystems and will monitor space usage of each. Many SFM instances are configured
+  with all data on the same server, however. If all data volumes are on the same filesystem:
+  * Change ``DATA_SHARED_USED`` to True.
+  * Set ``DATA_SHARED_DIR`` to the path of the parent directory on the filesystem, e.g. ``/sfm-data``.
+  * Provide a threshold for space usage warning emails to be sent by updating ``DATA_THRESHOLD_SHARED``.
+  * In ``docker-compose.yml``, uncomment the ``volumes`` section in the ``ui`` container definition so that the
+    ``DATA_SHARED_DIR`` is accessible to SFM for monitoring.
* Set the ``SFM_HOSTNAME`` and ``SFM_PORT`` appropriately. These are the public hostname (e.g., sfm.gwu.edu) and port (e.g., 80)
for SFM.
-* Email is configured by providing ``SFM_SMTP_HOST``, ``SFM_EMAIL_USER``, and ``SFM_EMAIL_PASSWORD``.
+* If running RabbitMQ or Postgres on another server, set appropriate values for ``SFM_RABBITMQ_HOST``, ``SFM_RABBITMQ_PORT``,
+  ``SFM_RABBITMQ_MANAGEMENT_PORT``, ``SFM_POSTGRES_HOST``, and ``SFM_POSTGRES_PORT``.
+* Email is configured by providing ``SFM_SMTP_HOST``, ``SFM_EMAIL_USER``, and ``SFM_EMAIL_PASSWORD``.
+  (If the configured email account is hosted by Google, you will need to configure the account to "Allow less secure apps."
+  Currently this setting is accessed, while logged in to the Google account, via https://myaccount.google.com/security#connectedapps).


* Application credentials for social media APIs are configured by providing the ``TWITTER_CONSUMER_KEY``,
``TWITTER_CONSUMER_SECRET``, ``WEIBO_API_KEY``, ``WEIBO_API_SECRET``, and/or ``TUMBLR_CONSUMER_KEY``,
``TUMBLR_CONSUMER_SECRET``. These are optional, but will make acquiring credentials easier for users.
@@ -188,7 +205,7 @@ Configuration is documented in ``example.env``. For a production deployment, pay
``SFM_COOKIE_CONSENT_BUTTON_TEXT``.
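Taken together, the per-service volume settings and shared-filesystem monitoring settings described above might look like the following ``.env`` excerpt. This is a hypothetical sketch: the host paths under ``/sfm-data`` and the threshold value are examples, not defaults.

```shell
# Hypothetical production .env excerpt using host volumes.
# Host paths and the threshold value are illustrative.
DATA_VOLUME_MQ=/sfm-data/sfm-mq-data:/sfm-mq-data
DATA_VOLUME_DB=/sfm-data/sfm-db-data:/sfm-db-data
DATA_VOLUME_EXPORT=/sfm-data/sfm-export-data:/sfm-export-data
DATA_VOLUME_CONTAINERS=/sfm-data/sfm-containers-data:/sfm-containers-data
DATA_VOLUME_COLLECTION_SET=/sfm-data/sfm-collection-set-data:/sfm-collection-set-data
PROCESSING_VOLUME=/sfm-data/sfm-processing:/sfm-processing

# All of the volumes above live on one filesystem, so monitor it as a unit.
DATA_SHARED_USED=True
DATA_SHARED_DIR=/sfm-data
DATA_THRESHOLD_SHARED=10GB
```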

Note that if you make a change to configuration *after* SFM is brought up, you will need to restart containers. If
-the change only applies to a single container, then you can stop the container with ``docker kill <container name>``. If
+the change only applies to a single container, then you can stop the container with ``docker stop <container name>``. If
the change applies to multiple containers (or you're not sure), you can stop all containers with ``docker-compose stop``.
Containers can then be brought back up with ``docker-compose up -d`` and the configuration change will take effect.

10 changes: 5 additions & 5 deletions docs/messaging_spec.rst
@@ -100,7 +100,7 @@ Harvest start messages specify for a harvester the details of a harvest. Example
{
"id": "sfmui:45",
"type": "flickr_user",
"path": "/sfm-data/collections/3989a5f99e41487aaef698680537c3f5/6980fac666c54322a2ebdbcb2a9510f5",
"path": "/sfm-collection-set-data/collections/3989a5f99e41487aaef698680537c3f5/6980fac666c54322a2ebdbcb2a9510f5",
"seeds": [
{
"id": "a36fe186fbfa47a89dbb0551e1f0f181",
@@ -132,7 +132,7 @@ Another example::
{
"id": "test:1",
"type": "twitter_search",
"path": "/sfm-data/collections/3989a5f99e41487aaef698680537c3f5/6980fac666c54322a2ebdbcb2a9510f5",
"path": "/sfm-collection-set-data/collections/3989a5f99e41487aaef698680537c3f5/6980fac666c54322a2ebdbcb2a9510f5",
"seeds": [
{
"id": "32786222ef374eb38f1c5d56321c99e8",
@@ -250,7 +250,7 @@ created during a harvest. Example::

{
"warc": {
"path": "/sfm-data/collections/3989a5f99e41487aaef698680537c3f5/6980fac666c54322a2ebdbcb2a9510f5/2015/07/28/11/harvest_id-2015-07-28T11:17:36Z.warc.gz",,
"path": "/sfm-collection-set-data/collection_set/3989a5f99e41487aaef698680537c3f5/6980fac666c54322a2ebdbcb2a9510f5/2015/07/28/11/harvest_id-2015-07-28T11:17:36Z.warc.gz",,
"sha1": "7512e1c227c29332172118f0b79b2ca75cbe8979",
"bytes": 26146,
"id": "aba6033aafce4fbabd846026ca47f13e",
@@ -311,7 +311,7 @@ Export start messages specify the requests for an export. Example::
"collection": {
"id": "005b131f5f854402afa2b08a4b7ba960"
},
"path": "/sfm-data/exports/45",
"path": "/sfm-export-data/export/45",
"format": "csv",
"dedupe": true,
"segment_size": 100000,
@@ -336,7 +336,7 @@ Another example::
"uid": "85779209@N08"
}
],
"path": "/sfm-data/exports/45",
"path": "/sfm-export-data/export/45",
"format": "json",
"segment_size": null
}
4 changes: 2 additions & 2 deletions docs/monitoring.rst
@@ -68,11 +68,11 @@ id will be something like `bffcae5d0603`.

Second, determine the harvest id. This is available from the harvest's detail page.

-Third, get the stdout log with ``docker exec -t <name> cat /sfm-data/containers/<container id>/log/<harvest id>.out.log``.
+Third, get the stdout log with ``docker exec -t <name> cat /sfm-containers-data/containers/<container id>/log/<harvest id>.out.log``.
To get the stderr log, substitute `.err` for `.out`.

To follow the log, use `tail -f` instead of `cat`. For example,
-``docker exec -t sfm_twitterstreamharvester_1 tail -f /sfm-data/containers/bffcae5d0603/log/d4493eed5f4f49c6a1981c89cb5d525f.err.log``.
+``docker exec -t sfm_twitterstreamharvester_1 tail -f /sfm-containers-data/containers/bffcae5d0603/log/d4493eed5f4f49c6a1981c89cb5d525f.err.log``.
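The log path composes mechanically from the container id and harvest id. A small sketch (both ids below are the placeholder values from the example above):

```shell
# Assemble the harvester stdout/stderr log paths from a container id
# and a harvest id (both values are placeholders for illustration).
container_id=bffcae5d0603
harvest_id=d4493eed5f4f49c6a1981c89cb5d525f
out_log="/sfm-containers-data/containers/${container_id}/log/${harvest_id}.out.log"
err_log="/sfm-containers-data/containers/${container_id}/log/${harvest_id}.err.log"
echo "$err_log"
```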

-----------------------------
RabbitMQ management console
12 changes: 6 additions & 6 deletions docs/portability.rst
@@ -16,13 +16,13 @@ has a complete set of JSON database records to support loading it into a differe

Here are the JSON database records for an example collection::

-[root@1da93afd43b5:/sfm-data/collection_set/4c59ebf2dcdc4a0e9660e32d004fa846/072ff07ea9954b39a1883e979de92d22/records# ls
+[root@1da93afd43b5:/sfm-collection-set-data/collection_set/4c59ebf2dcdc4a0e9660e32d004fa846/072ff07ea9954b39a1883e979de92d22/records# ls
collection.json groups.json historical_collection.json historical_seeds.json users.json
collection_set.json harvest_stats.json historical_collection_set.json info.json warcs.json
credentials.json harvests.json historical_credentials.json seeds.json

Thus, moving a collection set only requires moving/copying the collection set's directory; moving a collection
-only requires moving/copying a collection's directory. Collection sets are in ``/sfm-data/collection_set`` and
+only requires moving/copying a collection's directory. Collection sets are in ``/sfm-collection-set-data/collection_set`` and
are named by their collection set ids. Collections are subdirectories of their collection set
and are named by their collection ids.
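Because a collection directory is self-contained, moving one between collection sets is a plain directory copy. A sketch under stated assumptions: the ids and the ``/tmp`` root below are made up for illustration.

```shell
# Sketch: a collection moves between collection sets as a directory copy.
# The root, collection set ids, and collection id are illustrative.
ROOT=/tmp/sfm-collection-set-data/collection_set
mkdir -p "$ROOT/aaaa1111/bbbb2222/records"
echo '{}' > "$ROOT/aaaa1111/bbbb2222/records/collection.json"
mkdir -p "$ROOT/cccc3333"
# Copy the collection directory (named by its collection id) into the
# destination collection set directory.
cp -r "$ROOT/aaaa1111/bbbb2222" "$ROOT/cccc3333/"
```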

@@ -61,11 +61,11 @@ can be refreshed using the ``serializecollectionset`` and ``serializecollection``
Loading a collection set / collection
---------------------------------------

-1. Move/copy the collection set/collection to ``/sfm-data/collection_set``. Collection sets should be placed
+1. Move/copy the collection set/collection to ``/sfm-collection-set-data/collection_set``. Collection sets should be placed
in this directory. Collections should be placed into a collection set directory.
2. Execute the ``deserializecollectionset`` or ``deserializecollection`` management command::

-root@1da93afd43b5:/opt/sfm-ui/sfm# ./manage.py deserializecollectionset /sfm-data/collection_set/4c59ebf2dcdc4a0e9660e32d004fa846
+root@1da93afd43b5:/opt/sfm-ui/sfm# ./manage.py deserializecollectionset /sfm-collection-set-data/collection_set/4c59ebf2dcdc4a0e9660e32d004fa846

Note:

@@ -86,12 +86,12 @@ Note:
-------------------------------

1. Stop the source instance: ``docker-compose stop``.
-2. Copy the ``/sfm-data`` directory from the source server to the destination server.
+2. Copy the data directories (``/sfm-collection-set-data``, ``/sfm-containers-data``, ``/sfm-export-data``, ``/sfm-db-data``, ``/sfm-mq-data``) from their location on the source server to the destination server.
3. If preserving processing data, also copy the ``/sfm-processing`` directory from the source server to the destination
server.
4. Copy the ``docker-compose.yml`` and ``.env`` files from the source server to the destination server.
5. Make any changes necessary in the ``.env`` file, e.g., ``SFM_HOSTNAME``.
6. Start the destination instance: ``docker-compose up -d``.

-If moving between AWS EC2 instances and ``/sfm-data`` is on a separate EBS volume, the volume can be detached from
+If moving between AWS EC2 instances and one or more sfm data directories are on a separate EBS volume, the volume can be detached from
the source EC2 instances and attached to the destination EC2 instance.
10 changes: 5 additions & 5 deletions docs/processing.rst
@@ -18,17 +18,17 @@ Using a processing container requires familiarity with the Linux shell and shell
you are interested in using a processing container, please contact your SFM administrator for help.

When exporting/processing data, remember that harvested social media content are stored
-in ``/sfm-data``. ``/sfm-processing`` is provided to store your exports, processed data, or scripts. Depending
+in ``/sfm-collection-set-data``. ``/sfm-processing`` is provided to store your exports, processed data, or scripts. Depending
on how it is configured, you may have access to ``/sfm-processing`` from your local filesystem. See :doc:`storage`.

----------------------
Processing container
----------------------

To bootstrap export/processing, a processing image is provided. A container instantiated from this
-image is Ubuntu 14.04 and pre-installed with the warc iterator tools, ``find_warcs.py``, and some other
+image is Ubuntu 16.04 and pre-installed with the warc iterator tools, ``find_warcs.py``, and some other
useful tools. (Warc iterators and ``find_warcs.py`` are described below.) It will also have read-only
-access to the data from ``/sfm-data`` and read/write access to ``/sfm-processing``.
+access to the data in sfm data directories (e.g. ``/sfm-collection-set-data``) and read/write access to ``/sfm-processing``.

The other tools available in a processing container are:

@@ -103,7 +103,7 @@ For example, to get a list of the WARC files in a particular collection, provide
the collection id::

root@0ac9caaf7e72:/sfm-data# find_warcs.py 4f4d1
-/sfm-data/collection_set/b06d164c632d405294d3c17584f03278/4f4d1a6677f34d539bbd8486e22de33b/2016/05/04/14/515dab00c05740f487e095773cce8ab1-20160504143638715-00000-47-88e5bc8a36a5-8000.warc.gz
+/sfm-collection-set-data/collection_set/b06d164c632d405294d3c17584f03278/4f4d1a6677f34d539bbd8486e22de33b/2016/05/04/14/515dab00c05740f487e095773cce8ab1-20160504143638715-00000-47-88e5bc8a36a5-8000.warc.gz

(In this case there is only one WARC file. If there was more than one, it would be space separated. Use ``--newline`` to
to separate with a newline instead.)
@@ -173,7 +173,7 @@ This command puts all of the JSON files in the same directory, using the filenam

If you want to maintain the directory structure, but use a different root directory::

-cat source.lst | sed 's/sfm-data\/collection_set/sfm-processing\/export/' | sed 's/.warc.gz/.json/'
+cat source.lst | sed 's/sfm-collection-set-data\/collection_set/sfm-processing\/export/' | sed 's/.warc.gz/.json/'

Replace `sfm-processing\/export` with the root directory that you want to use.
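As a worked example of this rewrite on a single path (the collection set and collection ids below are made up):

```shell
# Rewrite a WARC path under the collection set root into a JSON path
# under /sfm-processing/export, keeping the directory structure.
echo "/sfm-collection-set-data/collection_set/abc123/def456/2016/05/04/14/example.warc.gz" \
  | sed 's/sfm-collection-set-data\/collection_set/sfm-processing\/export/' \
  | sed 's/.warc.gz/.json/'
# → /sfm-processing/export/abc123/def456/2016/05/04/14/example.json
```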
