Scheduled serialization fails due to write permission problems #532

Closed
justinlittman opened this issue Nov 15, 2016 · 34 comments

Comments

@justinlittman
Contributor

When run from Apache (instead of runserver), scheduled serialization fails as follows:

[Tue Nov 15 03:00:00.013820 2016] [wsgi:error] [pid 51:tid 140324986918656] INFO 2016-11-15 03:00:00,013 51 ui.serialize Serializing all
[Tue Nov 15 03:00:00.015603 2016] [wsgi:error] [pid 51:tid 140324986918656] INFO 2016-11-15 03:00:00,015 51 ui.serialize Serializing <Collection Set 1 "Test collection set">
[Tue Nov 15 03:00:00.025285 2016] [wsgi:error] [pid 51:tid 140324986918656] INFO 2016-11-15 03:00:00,025 51 ui.serialize Serializing <Collection 1 "Election filter">
[Tue Nov 15 03:00:00.027929 2016] [wsgi:error] [pid 51:tid 140324986918656] ERROR 2016-11-15 03:00:00,027 51 apscheduler.executors.default Job "serialize_all (trigger: cron[hour='3', minute='0'], next run at: 2016-11-16 03:00:00 EST)" raised an exception
[Tue Nov 15 03:00:00.027944 2016] [wsgi:error] [pid 51:tid 140324986918656] Traceback (most recent call last):
[Tue Nov 15 03:00:00.027947 2016] [wsgi:error] [pid 51:tid 140324986918656]   File "/usr/local/lib/python2.7/site-packages/apscheduler/executors/base.py", line 112, in run_job
[Tue Nov 15 03:00:00.027949 2016] [wsgi:error] [pid 51:tid 140324986918656]     retval = job.func(*job.args, **job.kwargs)
[Tue Nov 15 03:00:00.027950 2016] [wsgi:error] [pid 51:tid 140324986918656]   File "/opt/sfm-ui/sfm/ui/serialize.py", line 514, in serialize_all
[Tue Nov 15 03:00:00.027951 2016] [wsgi:error] [pid 51:tid 140324986918656]     serializer.serialize_all()
[Tue Nov 15 03:00:00.027952 2016] [wsgi:error] [pid 51:tid 140324986918656]   File "/opt/sfm-ui/sfm/ui/serialize.py", line 49, in serialize_all
[Tue Nov 15 03:00:00.027954 2016] [wsgi:error] [pid 51:tid 140324986918656]     self.serialize_collection_set(collection_set, force_serialize=force_serialize)
[Tue Nov 15 03:00:00.027955 2016] [wsgi:error] [pid 51:tid 140324986918656]   File "/opt/sfm-ui/sfm/ui/serialize.py", line 54, in serialize_collection_set
[Tue Nov 15 03:00:00.027957 2016] [wsgi:error] [pid 51:tid 140324986918656]     self.serialize_collection(collection, force_serialize=force_serialize)
[Tue Nov 15 03:00:00.027958 2016] [wsgi:error] [pid 51:tid 140324986918656]   File "/opt/sfm-ui/sfm/ui/serialize.py", line 73, in serialize_collection
[Tue Nov 15 03:00:00.027960 2016] [wsgi:error] [pid 51:tid 140324986918656]     self._initialize_records_dir(records_path)
[Tue Nov 15 03:00:00.027961 2016] [wsgi:error] [pid 51:tid 140324986918656]   File "/opt/sfm-ui/sfm/ui/serialize.py", line 249, in _initialize_records_dir
[Tue Nov 15 03:00:00.027963 2016] [wsgi:error] [pid 51:tid 140324986918656]     os.makedirs(records_path)
[Tue Nov 15 03:00:00.027965 2016] [wsgi:error] [pid 51:tid 140324986918656]   File "/usr/local/lib/python2.7/os.py", line 157, in makedirs
[Tue Nov 15 03:00:00.027966 2016] [wsgi:error] [pid 51:tid 140324986918656]     mkdir(name, mode)
[Tue Nov 15 03:00:00.027968 2016] [wsgi:error] [pid 51:tid 140324986918656] OSError: [Errno 13] Permission denied: '/sfm-data/collection_set/3ab3c4d518f14d738d5925da9e5ee6d0/ab38b36cbfcb4d3d8e89e7027cee29bd/records'

Steps to reproduce:

  1. Bring up an instance of SFM that is running Apache for SFM UI.
  2. Create some collections.
  3. Wait for scheduled serialization (or manipulate the configurable serialization time).
  4. Verify serialization by looking for README.txt files in collection set or collection directories.

Expected result:
Serialization occurs according to schedule.

Actual results:
Serialization fails because it runs within Apache and therefore as a non-root user. However, the non-root user does not have write permissions for /sfm-data.
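
One way to confirm this diagnosis from inside the container is to compare the identity of the running process with its access to the data directory. The following is a minimal sketch (Python 3, standard library only); `describe_write_access` is a hypothetical helper for illustration and is not part of SFM:

```python
import os


def describe_write_access(path):
    """Report who the process runs as and whether it can create entries in path."""
    # os.access checks the real uid/gid unless effective_ids is supported.
    use_effective = os.access in os.supports_effective_ids
    return {
        "path": path,
        "euid": os.geteuid(),  # effective user id of this process
        "egid": os.getegid(),  # effective group id of this process
        "exists": os.path.isdir(path),
        # Creating a subdirectory needs write and execute permission on path.
        "can_create": os.access(path, os.W_OK | os.X_OK,
                                effective_ids=use_effective),
    }


if __name__ == "__main__":
    print(describe_write_access("/sfm-data"))
```

Run under the same user as the Apache worker, a `can_create` of `False` would match the `Permission denied` raised by `os.makedirs` in the traceback above.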

@justinlittman justinlittman added this to the 1.4 milestone Nov 15, 2016
@justinlittman justinlittman self-assigned this Nov 15, 2016
@justinlittman
Contributor Author

This will also be a problem for #479

@justinlittman
Contributor Author

Workaround is to perform the serialization using the management command:

ubuntu@ip-172-30-3-149:~/sfm-docker$ docker exec -it sfm_ui_1 /bin/bash
root@f03e2cd8ee85:/opt/sfm-ui# cd sfm                                                                                                                               
root@f03e2cd8ee85:/opt/sfm-ui/sfm# ./manage.py serializecollectionset 1880fd483c534589b1423a9958d90469
Serializing
INFO 2016-11-15 12:36:21,241 277 ui.serialize Serializing <Collection Set 11 "Election 2016">
INFO 2016-11-15 12:36:21,264 277 ui.serialize Serializing <Collection 25 "Trump hashtags">
INFO 2016-11-15 12:36:21,618 277 ui.serialize Serializing <Collection 57 "HillaryClinton">
INFO 2016-11-15 12:36:21,874 277 ui.serialize Serializing <Collection 56 "Trump">
INFO 2016-11-15 12:36:22,120 277 ui.serialize Serializing <Collection 38 "Both parties search">
INFO 2016-11-15 12:36:22,281 277 ui.serialize Serializing <Collection 53 "ImWithHer">
INFO 2016-11-15 12:36:22,520 277 ui.serialize Serializing <Collection 34 "Trump search">
INFO 2016-11-15 12:36:22,985 277 ui.serialize Serializing <Collection 43 "Media organizations ">
INFO 2016-11-15 12:36:24,006 277 ui.serialize Serializing <Collection 52 "Election2016 ">
INFO 2016-11-15 12:36:24,230 277 ui.serialize Serializing <Collection 48 "VP Debate">
INFO 2016-11-15 12:36:24,285 277 ui.serialize Serializing <Collection 39 "HillaryClinton">
INFO 2016-11-15 12:36:24,457 277 ui.serialize Serializing <Collection 59 "MakeAmericaGreatAgain">
INFO 2016-11-15 12:36:24,600 277 ui.serialize Serializing <Collection 37 "Sample">
INFO 2016-11-15 12:36:24,747 277 ui.serialize Serializing <Collection 36 "TImelines">
INFO 2016-11-15 12:36:28,340 277 ui.serialize Serializing <Collection 49 "VP Debate">
INFO 2016-11-15 12:36:28,472 277 ui.serialize Serializing <Collection 41 "Press coverage">
INFO 2016-11-15 12:36:42,263 277 ui.serialize Serializing <Collection 45 "Journalists and Pundits">
INFO 2016-11-15 12:36:42,981 277 ui.serialize Serializing <Collection 55 "HillaryEmails">
INFO 2016-11-15 12:36:43,116 277 ui.serialize Serializing <Collection 61 "ElectionDay">
INFO 2016-11-15 12:36:43,348 277 ui.serialize Serializing <Collection 58 "BasketOfDeplorables">
INFO 2016-11-15 12:36:43,480 277 ui.serialize Serializing <Collection 46 "Other MIT deemed influencers ">
INFO 2016-11-15 12:36:44,536 277 ui.serialize Serializing <Collection 51 "TrumpTapes">
INFO 2016-11-15 12:36:44,683 277 ui.serialize Serializing <Collection 60 "MAGA">
INFO 2016-11-15 12:36:44,863 277 ui.serialize Serializing <Collection 44 "Celebrity accounts covering politics">
INFO 2016-11-15 12:36:45,497 277 ui.serialize Serializing <Collection 42 "Candidates and other politicans">
INFO 2016-11-15 12:36:46,640 277 ui.serialize Serializing <Collection 54 "NeverTrump">
INFO 2016-11-15 12:36:46,856 277 ui.serialize Serializing <Collection 50 "Debate">
Done serializing

@justinlittman
Contributor Author

To allow SFM UI under Apache to write to /sfm-data, the solution is to run the various services as the sfm user in the sfm group. This includes sfm-ui, harvesters, exporters, and ELK.
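
For illustration only, switching a process that starts as root over to a dedicated user and group can be sketched in Python as below. This is an assumption-laden sketch of the general technique, not how the SFM containers do it; `drop_privileges` is a hypothetical name.

```python
import os


def drop_privileges(uid, gid):
    """Switch the current process to uid:gid; return True if a switch happened.

    Must be called as root unless the process already has the target identity.
    """
    if os.geteuid() == uid and os.getegid() == gid:
        return False  # already running as the target identity
    os.setgroups([gid])  # drop supplementary groups inherited from root
    os.setgid(gid)       # group first: after setuid the gid can no longer change
    os.setuid(uid)
    return True
```

The order matters: once the user id is no longer root, the process loses the right to change its group.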

Branches for this ticket:

TODO:

  • After sfm-utils is committed and new sfm-base is built, the correct sha for the image needs to be placed in most Dockerfiles.
  • Directions need to be written for upgrading an existing SFM instance with both a docker and linked data volume.
  • Automatic build of sfm-data container needs to be added to Docker hub.

For testing:

  • Every container will need to be built using the appropriate branch.
  • Test every type of harvest and export.
  • Test with both Apache and runserver for SFM-UI.
  • Test scheduling of serialization.
  • Test with both docker and linked data volumes.

justinlittman pushed a commit that referenced this issue Nov 29, 2016
…zation back on. Turns off lazy initialization of sfm app when running under Apache.
@justinlittman
Contributor Author

New sfm-base image is built and the correct sha added to Dockerfiles. PRs created. Automatic build of sfm-data is set up.

@Tanych - you should be good to go with testing.

@Tanych
Contributor

Tanych commented Nov 29, 2016

@justinlittman for the other containers such as sfm-ui, sfm-twitter-harvester, etc., I need to build new images based on the sfm_532-user since sfm-base has changed, right?

@justinlittman
Contributor Author

justinlittman commented Nov 30, 2016 via email

@Tanych
Contributor

Tanych commented Nov 30, 2016

@justinlittman
I have built every image based on the sfm_532-user and the sfm-data using the automatic build from sfm-base.
The docker-compose.yml looks like this:

  • sfm-data
 data:
        image: gwul/sfm-data:master
  • UI(apache)
 ui:
        build:
            context: ../sfm-ui
            dockerfile: Dockerfile
  • harvesters
 twitterrestharvester:
        build:
            context: ../sfm-twitter-harvester
            dockerfile: Dockerfile-rest-harvester

Test

Creating a Twitter user timeline results in a permission error:

arc.gz to /sfm-data/collection_set/fd7bb2aae00f481eb153529c9206089b/32155f0a8cf84528a14f1481fd2b85b8/2016/11/30/15/8206844747c14e9796a48a081504e84f-20161130155800635-00000-58-c16001fcbe30-8000.warc.gz
Exception in thread warc_processing_thread:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/opt/sfm-utils/sfmutils/harvester.py", line 508, in _process_warc_thread
    os.makedirs(dest_path)
  File "/usr/local/lib/python2.7/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/usr/local/lib/python2.7/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/usr/local/lib/python2.7/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/usr/local/lib/python2.7/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/usr/local/lib/python2.7/os.py", line 150, in makedirs
    makedirs(head, mode)
  File "/usr/local/lib/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: '/sfm-data/collection_set/fd7bb2aae00f481eb153529c9206089b'

Check the permission stats of sfm-data:

  File: ‘/sfm-data/’
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: 801h/2049d	Inode: 3809951     Links: 6
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-11-30 10:28:11.737121154 -0500
Modify: 2016-11-29 16:19:04.482423185 -0500
Change: 2016-11-29 16:19:04.482423185 -0500
 Birth: -

Based on the Access info, only root has write permission; group and others only get read and execute.
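
The same reading of the mode bits can be sketched with Python's `stat` module. `may_write` is a hypothetical helper that applies only the classic owner/group/other check; it ignores ACLs and capabilities.

```python
import stat


def may_write(mode, owner_uid, owner_gid, uid, gids):
    """Apply the owner/group/other check for the write bit."""
    if uid == 0:
        return True  # root bypasses the permission bits
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# Values from the stat output above: 0755, owned by root:root.
# uid/gid 990 for sfm is the value that appears later in this thread.
print(stat.filemode(stat.S_IFDIR | 0o755))               # drwxr-xr-x
print(may_write(0o755, 0, 0, uid=990, gids={990}))       # False
print(may_write(0o755, 990, 990, uid=990, gids={990}))   # True
```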

Apache process owners:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        75  0.0  0.0   4328   748 ?        S    10:26   0:00 /bin/sh /usr/sbin/apachectl -DFOREGROUND
root        77  0.0  0.0  85052  6076 ?        S    10:26   0:00 /usr/sbin/apache2 -DFOREGROUND
sfm         78  0.2  0.2 1537156 78792 ?       Sl   10:26   0:09 /usr/sbin/apache2 -DFOREGROUND
sfm         79  0.0  0.0 2017060 12608 ?       Sl   10:26   0:01 /usr/sbin/apache2 -DFOREGROUND
sfm         80  0.0  0.0 1820324 11820 ?       Sl   10:26   0:01 /usr/sbin/apache2 -DFOREGROUND

@Tanych
Contributor

Tanych commented Nov 30, 2016

It seems that the setup_dirs chown didn't run. I will check it again.

@Tanych
Contributor

Tanych commented Nov 30, 2016

@justinlittman
I think the reason is that the data container can't find the SFM_UID and SFM_GID environment variables, so creating the group and user failed.

+ groupadd -r sfm --gid=
groupadd: invalid group ID ''
+ export COLLECTION_SET_DIR=/sfm-data/collection_set
+ [ ! -d /sfm-data/collection_set ]
+ echo Creating collection_sets directory
+ mkdir -p /sfm-data/collection_set
Creating collection_sets directory
+ chown sfm:sfm /sfm-data/collection_set
chown: invalid user: 'sfm:sfm'
Creating containers directory
+ export CONTAINERS_DIR=/sfm-data/containers
+ [ ! -d /sfm-data/containers ]
+ echo Creating containers directory
+ mkdir -p /sfm-data/containers
+ chown sfm:sfm /sfm-data/containers
chown: invalid user: 'sfm:sfm'
Creating export directory
+ export EXPORT_DIR=/sfm-data/export
+ [ ! -d /sfm-data/export ]
+ echo Creating export directory
+ mkdir -p /sfm-data/export
+ chown sfm:sfm /sfm-data/export
chown: invalid user: 'sfm:sfm'

We need to add SFM_UID and SFM_GID to the data environment in docker-compose.yml. With that in place, the ownership has changed:

drwxr-xr-x 2 sfm  sfm  4096 Nov 30 13:50 collection_set
drwxr-xr-x 3 sfm  sfm  4096 Nov 30 13:50 containers
drwxr-xr-x 2 sfm  sfm  4096 Nov 30 13:50 export
drwxr-xr-x 3 root root 4096 Nov 30 13:50 postgresql
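
The `groupadd: invalid group ID ''` line in the trace above is what an empty variable looks like once it reaches the command. A fail-fast check could be sketched as follows; `require_int_env` is a hypothetical helper written in Python for illustration, while the actual setup script in the trace is shell.

```python
import os


def require_int_env(name, environ=None):
    """Return the integer value of an environment variable or raise ValueError."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "").strip()
    if not raw:
        raise ValueError("%s is not set or is empty" % name)
    try:
        value = int(raw)
    except ValueError:
        raise ValueError("%s must be an integer, got %r" % (name, raw))
    if value < 0:
        raise ValueError("%s must not be negative, got %d" % (name, value))
    return value


if __name__ == "__main__":
    for var in ("SFM_UID", "SFM_GID"):
        try:
            print(var, "=", require_int_env(var))
        except ValueError as exc:
            print("configuration error:", exc)
```

Failing at this point gives a clearer message than the later `chown: invalid user` errors, which are only a consequence of the missing ids.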

@justinlittman
Contributor Author

Just pushed a correction for the example docker-compose.yml files: gwu-libraries/sfm-docker@4ca292c

@Tanych
Contributor

Tanych commented Nov 30, 2016

@justinlittman
sfm-web-harvester doesn't work. Heritrix can't start.
error: exec: "/opt/heritrix/bin/heritrix": stat /opt/heritrix/bin/heritrix: permission denied

@justinlittman
Contributor Author

@Tanych I can't reproduce. Check to make sure your docker-compose is similar to below:

    data:
#        build:
#            context: .
#            dockerfile: Dockerfile-data
#        image: ubuntu:14.04
        image: gwul/sfm-data:master
#        command: /bin/true
        volumes:
             - ${DATA_VOLUME}
        environment:
            - TZ
            - SFM_UID
            - SFM_GID

and

   heritrix:
#        image:  gwul/sfm-heritrix:master
        build:
            context: ../sfm-web-harvester/docker/heritrix
            dockerfile: Dockerfile
        ports:
            # Opens up the port for Heritrix admin console.
            - "${HERITRIX_ADMIN_PORT}:8443"
        environment:
            - HERITRIX_USER
            - HERITRIX_PASSWORD
            # Memory for heritrix
            - JAVA_OPTS=-Xmx512M
            - SFM_UID
            - SFM_GID
        volumes_from:
            - data

@Tanych
Contributor

Tanych commented Dec 1, 2016

@justinlittman
The docker-compose.yml is the same for heritrix and data. I am running sfm-ui on apache. I have cleared the volumes and images. To reproduce the error:

  • kill heritrix and clear the heritrix volumes
  • remove the heritrix image
  • build the image again
  • start sfm-heritrix and check the heritrix logs

The error is as follows:

error: exec: "/opt/heritrix/bin/heritrix": stat /opt/heritrix/bin/heritrix: permission denied

Actually, the heritrix file is owned by sfm and it has exec permission. Not sure how gosu works.

  File: ‘/opt/heritrix/bin/heritrix’
  Size: 4952      	Blocks: 16         IO Block: 4096   regular file
Device: 22h/34d	Inode: 131         Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (  990/     sfm)   Gid: (  990/     sfm)
Access: 2016-05-30 13:57:24.000000000 +0000
Modify: 2016-05-30 13:57:24.000000000 +0000
Change: 2016-12-01 16:40:50.388886330 +0000
 Birth: -

@justinlittman
Contributor Author

justinlittman commented Dec 1, 2016 via email

@Tanych
Contributor

Tanych commented Dec 1, 2016

sent through slack.

@justinlittman
Contributor Author

I can't replicate:

GLSS-F0G5RP:sfm-docker justinlittman$ docker logs -f sfm_heritrix_1
Oracle Corporation OpenJDK Runtime Environment 1.7.0_95-b00
Using ad-hoc HTTPS certificate with fingerprint...
SHA1:31:05:8A:CD:0D:B0:9A:6E:0B:AD:1D:F0:34:FB:0D:90:2B:58:E2:BF
Verify in browser before accepting exception.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
engine listening at port 8443
operator login set per command-line
NOTE: We recommend a longer, stronger password, especially if your web 
interface will be internet-accessible.
Heritrix version: 3.3.0-LBS-2016-02

Try: docker-compose build --no-cache heritrix

@Tanych
Contributor

Tanych commented Dec 1, 2016

I have tried this but the problem still occurs. When I dig into the build log, everything seems normal.
I will try to build on another machine.

@justinlittman
Contributor Author

@Tanych Problem solved:

justinlittman@gwdev-ychtan:/home/ychtan/sfmproject/sfm-docker$ docker-compose logs -f heritrix 
Attaching to sfm_heritrix_1
heritrix_1                | Oracle Corporation OpenJDK Runtime Environment 1.7.0_95-b00
heritrix_1                | Using ad-hoc HTTPS certificate with fingerprint...
heritrix_1                | SHA1:69:F3:6C:B9:18:E2:9E:7C:F3:02:BA:DF:A1:49:C2:03:C2:DC:CB:F5
heritrix_1                | Verify in browser before accepting exception.
heritrix_1                | SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
heritrix_1                | SLF4J: Defaulting to no-operation (NOP) logger implementation
heritrix_1                | SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
heritrix_1                | engine listening at port 8443
heritrix_1                | operator login set per command-line
heritrix_1                | NOTE: We recommend a longer, stronger password, especially if your web 
heritrix_1                | interface will be internet-accessible.
heritrix_1                | Heritrix version: 3.3.0-LBS-2016-02
heritrix_1                | 2016-12-02 21:17:22.926 INFO thread-12 org.archive.crawler.framework.Engine.addJobDirectory() added crawl job: sfm

@kerchner kerchner mentioned this issue Dec 5, 2016
@Tanych
Contributor

Tanych commented Dec 6, 2016

@justinlittman

SFM-UI Apache Instance Functions Testing

Building all harvesters, sfm-ui and sfm-data using the corresponding dockerfile

Harvesters and exporters

The testing results for running the harvesters:

Harvester Exporter Web harvester
Flickr user
Tumblr posts
Twitter search
Twitter user timeline
Weibo timeline
Twitter Filter pending pending
Twitter Sample

Twitter Filter harvester

For the Twitter filter, I added a seed with track:gwu libraries and waited for one day. The twitter stream harvester's logs show:

2016-12-06 09:50:49,723: __main__ --> Harvested 0 tweets
2016-12-06 09:50:50,100: __main__ --> Stopping since stop event set.
2016-12-06 09:50:50,272: sfmutils.warcprox --> Shutting down warcprox
2016-12-06 09:50:50,275: sfmutils.warcprox --> Terminating warcprox
2016-12-06 09:50:50,324: sfmutils.warcprox --> Waiting for warcprox to terminate

It may be that the seed has no tweets.

To test the exporter, I need to stop the harvester and wait for it to update. Not sure how long that takes.

Scheduling of serialization

Scheduling settings:

SERIALIZE_HOUR = env.get('SFM_SERIALIZE_HOUR', '3')
SERIALIZE_MINUTE = env.get('SFM_SERIALIZE_MINUTE', '0')
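
These defaults correspond to the `cron[hour='3', minute='0']` trigger in the log at the top of this issue. As a rough sketch of what a daily trigger at that time means, the next run can be computed with the standard library alone. This does not use APScheduler, assumes the two values are plain integers, and `next_daily_run` is a hypothetical helper.

```python
import os
from datetime import datetime, timedelta


def next_daily_run(now, hour, minute):
    """Return the next datetime strictly after `now` that falls on hour:minute."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow
    return candidate


if __name__ == "__main__":
    hour = int(os.environ.get("SFM_SERIALIZE_HOUR", "3"))
    minute = int(os.environ.get("SFM_SERIALIZE_MINUTE", "0"))
    print(next_daily_run(datetime.now(), hour, minute))
```

With the job firing at 2016-11-15 03:00:00, this yields 2016-11-16 03:00:00, which agrees with the "next run at" value in that log.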

README.txt created:

-rw-r--r-- 1 sfm sfm  913 Dec  6 03:00 README.txt

Serialization records:

-rw-r--r-- 1 sfm sfm  738 Dec  6 03:00 collection.json
-rw-r--r-- 1 sfm sfm  393 Dec  6 03:00 collection_set.json
-rw-r--r-- 1 sfm sfm  702 Dec  6 03:00 credentials.json
-rw-r--r-- 1 sfm sfm  109 Dec  6 03:00 groups.json
-rw-r--r-- 1 sfm sfm  452 Dec  6 03:00 harvest_stats.json
-rw-r--r-- 1 sfm sfm 2158 Dec  6 03:00 harvests.json
-rw-r--r-- 1 sfm sfm 1806 Dec  6 03:00 historical_collection.json
-rw-r--r-- 1 sfm sfm  559 Dec  6 03:00 historical_collection_set.json
-rw-r--r-- 1 sfm sfm  868 Dec  6 03:00 historical_credentials.json
-rw-r--r-- 1 sfm sfm 1264 Dec  6 03:00 historical_seeds.json
-rw-r--r-- 1 sfm sfm   90 Dec  6 03:00 info.json
-rw-r--r-- 1 sfm sfm  537 Dec  6 03:00 seeds.json
-rw-r--r-- 1 sfm sfm  677 Dec  6 03:00 users.json
-rw-r--r-- 1 sfm sfm 1239 Dec  6 03:00 warcs.json

It works as expected.

@Tanych
Contributor

Tanych commented Dec 6, 2016

@justinlittman

SFM-UI Runserver Functions Testing

Building all harvesters, sfm-ui and sfm-data using the corresponding dockerfile

sfm-ui building error

The yml looks like the following:

 ui:
        build:
            context: ../sfm-ui
            dockerfile: Dockerfile-runserver

Error with docker-compose build --no-cache ui or docker-compose build ui:

Use 'apt-get autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 118 not upgraded.
 ---> e5bb9d2a300a
Removing intermediate container d2fe566af61d
Step 4 : RUN setcap 'cap_net_bind_service=+ep' /usr/local/bin/python2.7
 ---> Running in 4e7ba533fbe4
Failed to set capabilities on file `/usr/local/bin/python2.7' (Invalid argument)
The value of the capability argument is not permitted for a file. Or the file is not a regular (non-symlink) file
ERROR: Service 'ui' failed to build: The command '/bin/sh -c setcap 'cap_net_bind_service=+ep' /usr/local/bin/python2.7' returned a non-zero code: 1

@Tanych
Contributor

Tanych commented Dec 6, 2016

@justinlittman
It seems that the AUFS storage driver doesn't support the extended attributes that setcap needs. Someone suggests setting CONFIG_AUFS_XATTR, but I am not sure whether it is supported on docker.

Related issues

moby/moby#20658
moby/moby#5650

@justinlittman
Contributor Author

@Tanych Rather than try to run runserver and Apache on port 80, I switched to port 8080. This eliminated the need to use setcap. See changes in sfm-ui and sfm-docker branches.
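
For background: on Linux, by default, binding to a port below 1024 requires root or the `cap_net_bind_service` capability, which is what the failing `setcap` step was trying to grant. A small sketch of that rule and of a bind check follows; both helper names are hypothetical, and the 1024 limit is the usual default rather than something this project sets.

```python
import socket

PRIVILEGED_PORT_LIMIT = 1024  # usual Linux default; adjustable on newer kernels


def needs_bind_privilege(port):
    """True if an unprivileged process would normally be refused this port."""
    return 0 < port < PRIVILEGED_PORT_LIMIT


def try_bind(port, host="127.0.0.1"):
    """Return True if this process can bind host:port right now."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError:  # covers both permission denied and address in use
        return False
    finally:
        sock.close()


print(needs_bind_privilege(80))    # True
print(needs_bind_privilege(8080))  # False
```

`try_bind` cannot tell "permission denied" from "address already in use"; inspecting the exception's `errno` would be needed to separate the two.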

@justinlittman
Contributor Author

Looking at the logic in the harvester, a Twitter filter that is blocking waiting for a tweet will not stop. See #579.

@Tanych
Contributor

Tanych commented Dec 7, 2016

@justinlittman
SFM-ui runserver, port 8080 not 80. Exporter common error below:

Exporter common error

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/kombu/mixins.py", line 177, in run
    for _ in self.consume(limit=None):  # pragma: no cover
  File "/usr/local/lib/python2.7/site-packages/kombu/mixins.py", line 199, in consume
    conn.drain_events(timeout=safety_interval)
  File "/usr/local/lib/python2.7/site-packages/librabbitmq/__init__.py", line 220, in drain_events
    self._basic_recv(timeout)
  File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 598, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 564, in receive
    [callback(body, message) for callback in callbacks]
  File "/opt/sfm-utils/sfmutils/consumer.py", line 145, in _callback
    self.on_message()
  File "/opt/sfm-utils/sfmutils/exporter.py", line 75, in on_message
    warc_paths = self._get_warc_paths(collection_id, seed_ids, harvest_date_start, harvest_date_end)
  File "/opt/sfm-utils/sfmutils/exporter.py", line 171, in _get_warc_paths
    exclude_web=True):
  File "/opt/sfm-utils/sfmutils/api_client.py", line 46, in warcs
    warcs = self._get("/api/v1/warcs/", params)
  File "/opt/sfm-utils/sfmutils/api_client.py", line 23, in _get
    resp = requests.get(url, params=clean_params)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 437, in send
    raise ConnectionError(e, request=request)
ConnectionError: HTTPConnectionPool(host='api', port=80): Max retries exceeded with url: /api/v1/warcs/?collection=cee606f9822b4ced8b372597490a4946&exclude_web=True (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f615cbc2390>: Failed to establish a new connection: [Errno 111] Connection refused',))

The default port the API client uses to find warcs seems to be 80.
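
Port 80 is simply what an `http://` URL without an explicit port resolves to, which matches `host='api', port=80` in the traceback. A sketch of making the port explicit is shown below; `api_base_url` and `effective_port` are hypothetical helpers, and nothing here is meant to describe how sfm-utils is actually configured.

```python
from urllib.parse import urlsplit


def api_base_url(host="api", port=8080, scheme="http"):
    """Build a base URL that always carries an explicit port."""
    return "%s://%s:%d" % (scheme, host, port)


def effective_port(url):
    """Return the port a client would actually connect to for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return {"http": 80, "https": 443}[parts.scheme]


print(effective_port("http://api/api/v1/warcs/"))          # 80
print(effective_port(api_base_url() + "/api/v1/warcs/"))   # 8080
```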

@justinlittman
Contributor Author

I've changed the Dockerfiles of all of the exporters so that they use port 8080. You'll need to rebuild all of the exporter images.

@Tanych
Contributor

Tanych commented Dec 7, 2016

@justinlittman
Exporting to html needs to create a tmp file to add the <html> tag, but it gets a permission error.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/kombu/mixins.py", line 177, in run
    for _ in self.consume(limit=None):  # pragma: no cover
  File "/usr/local/lib/python2.7/site-packages/kombu/mixins.py", line 199, in consume
    conn.drain_events(timeout=safety_interval)
  File "/usr/local/lib/python2.7/site-packages/librabbitmq/__init__.py", line 220, in drain_events
    self._basic_recv(timeout)
  File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 598, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/usr/local/lib/python2.7/site-packages/kombu/messaging.py", line 564, in receive
    [callback(body, message) for callback in callbacks]
  File "/opt/sfm-utils/sfmutils/consumer.py", line 145, in _callback
    self.on_message()
  File "/opt/sfm-utils/sfmutils/exporter.py", line 117, in on_message
    self._file_fix(filepath, prefix="<html><head><meta charset='utf-8'></head>\n", suffix="</html>")
  File "/opt/sfm-utils/sfmutils/exporter.py", line 146, in _file_fix
    with tempfile.NamedTemporaryFile(dir='.', delete=False) as outfile:
  File "/usr/local/lib/python2.7/tempfile.py", line 475, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags)
  File "/usr/local/lib/python2.7/tempfile.py", line 244, in _mkstemp_inner
    fd = _os.open(file, flags, 0600)
OSError: [Errno 13] Permission denied: './tmpoCUzQY'

Should all tmp files be created in /sfm-data/?
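
The traceback shows the temp file being created with `dir='.'`, i.e. in whatever the process's working directory happens to be. One possible approach, sketched here in Python 3 and not necessarily what sfm-utils ended up doing, is to create the temp file next to the file being rewritten. `wrap_file` is a hypothetical stand-in for `_file_fix`.

```python
import os
import tempfile


def wrap_file(filepath, prefix, suffix):
    """Rewrite filepath as prefix + original content + suffix."""
    target_dir = os.path.dirname(os.path.abspath(filepath))
    # Create the temp file beside the target instead of in the current directory.
    with tempfile.NamedTemporaryFile("w", dir=target_dir, delete=False,
                                     encoding="utf-8") as outfile:
        outfile.write(prefix)
        with open(filepath, "r", encoding="utf-8") as infile:
            for line in infile:
                outfile.write(line)
        outfile.write(suffix)
        tmp_name = outfile.name
    # Same directory means same filesystem, so this is a rename, not a copy.
    os.replace(tmp_name, filepath)
```

The export directory is one the process must already be able to write to, so the temp file inherits that guarantee. Note that `NamedTemporaryFile` creates files readable only by their owner, so the resulting file keeps those restrictive permissions unless they are changed afterwards.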

@justinlittman
Contributor Author

Fixed with gwu-libraries/sfm-utils@4c5a2ac

@Tanych
Contributor

Tanych commented Dec 7, 2016

@justinlittman

SFM-UI Runserver Instance Functions Testing

Building all harvesters, sfm-ui and sfm-data using the corresponding dockerfile

Harvesters and exporters

The testing results for running the harvesters:

Harvester Exporter Web harvester
Flickr user
Tumblr posts
Twitter search
Twitter user timeline
Weibo timeline

sfm_heritrix port

For sfm_heritrix, the default 8443 doesn't seem to work. It reports an error that 8443 is not available. I changed the port to 8081 in docker-compose.yml and it works.

Do I need to retest sfm-ui apache?

@justinlittman
Contributor Author

@Tanych 8443 is probably blocked by WRLC.

Just make sure sfm-ui apache builds and starts and test a single harvest and export.

@Tanych
Contributor

Tanych commented Dec 7, 2016

@justinlittman it's not the HERITRIX_ADMIN_PORT. I had already set that port as follows:
HERITRIX_ADMIN_PORT=8081. If I keep 8443 in docker-compose.yml, it doesn't work. I modified it like this:

 # Opens up the port for Heritrix admin console.
            - "${HERITRIX_ADMIN_PORT}:8082"

I will test apache with a single harvest and export.

@Tanych
Contributor

Tanych commented Dec 8, 2016

@justinlittman I have verified the harvesters and exporters on the sfm-ui apache. All of them work well.

@justinlittman
Contributor Author

@Tanych Thanks for your persistence in testing this.

justinlittman pushed a commit that referenced this issue Dec 9, 2016
…zation back on. Turns off lazy initialization of sfm app when running under Apache.
justinlittman added a commit that referenced this issue Dec 9, 2016
refs #532. Changes to use sfm user and group. Turns scheduled seriali…
@justinlittman
Contributor Author

This shows changes that need to be made to docker-compose.yml: gwu-libraries/sfm-docker@9937ed9#diff-cd93896a90210fd283cd7d7354c3f2f4

This shows changes that need to be made to .env: gwu-libraries/sfm-docker@9937ed9#diff-2c3589a38c2f2f6a6b8f033766584f40

@justinlittman
Contributor Author

Here are the steps for an upgrade:

  1. Copy over the new docker-compose.yml file and make changes to .env.
  2. docker-compose stop
  3. docker-compose run data chown -R 990:990 /sfm-data/collection_set
  4. docker-compose run data chown -R 990:990 /sfm-data/containers
  5. docker-compose run data chown -R 990:990 /sfm-data/elk
  6. docker-compose run data chown -R 990:990 /sfm-data/export
  7. docker-compose up -d

It is possible that the directories in step 5 or 6 don't exist and an error is returned. That's OK.
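
The ownership changes in steps 3 to 6, including the tolerance for missing directories, can be sketched in Python as follows. This is only an illustration of the idea and not a replacement for the `docker-compose run data chown` commands; `chown_tree` is a hypothetical helper and must run as root to change ownership to another user.

```python
import os


def chown_tree(root, uid, gid):
    """Recursively chown root; return False if root does not exist."""
    if not os.path.isdir(root):
        return False
    os.chown(root, uid, gid)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            # lchown changes a symlink itself instead of the file it points to.
            os.lchown(os.path.join(dirpath, name), uid, gid)
    return True


if __name__ == "__main__":
    for sub in ("collection_set", "containers", "elk", "export"):
        path = os.path.join("/sfm-data", sub)
        try:
            status = "changed" if chown_tree(path, 990, 990) else "missing, skipped"
        except OSError as exc:  # includes PermissionError when not run as root
            status = "failed: %s" % exc
        print(path, status)
```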
