Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: server logging via tempfile may have threading issues #1196

Closed
jku opened this issue Nov 2, 2020 · 1 comment · Fixed by #1198
Closed

tests: server logging via tempfile may have threading issues #1196

jku opened this issue Nov 2, 2020 · 1 comment · Fixed by #1198

Comments

@jku
Copy link
Member

jku commented Nov 2, 2020

Filing this after debugging rare errors in #1192

The tempfile approach to capturing server logging works fine in current develop branch but it triggers issues in PR 1192 where we start reading from the file at a rapid pace -- based on the results this seems to be a threading issue in file object as the log contents become corrupted (rarely but still reproducibly)

I just tried another approach (worker thread + Queue instead of a TemporaryFile) and it seems to work (this is completely unintegrated into the tests, just an example):

try:
  import queue
except ImportError:
  import Queue as queue # python2

import subprocess
import sys
import threading
import time


# Worker function to run in separate thread
# Reads from 'stream' (stdout), puts lines in Queue 'line_queue' (Queue is thread-safe)
def log_queue_worker(stream, line_queue):
  while True:
    log_line = stream.readline()
    line_queue.put(log_line)
    if len(log_line) == 0:
      # EOS
      break

# Start child process
proc = subprocess.Popen([sys.executable, "-u", "proxy_server.py"],
    stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

# Run log_queue_worker() in a thread
# The thread will exit when child process dies
log_queue = queue.Queue()
log_thread = threading.Thread(target=log_queue_worker, args=(proc.stdout, log_queue))
log_thread.daemon = True
log_thread.start()

# setup timeout variables
start = time.time()
elapsed = 0
timeout = 10

# Get lines from log_queue until port is found, child process exits, or we timeout
while True:
  try:
    line = log_queue.get(timeout=timeout - elapsed)
    if len(line) == 0:
      print ("process exit")
      break
    elif line.startswith(b"bind succeeded, server port is: "):
      print (line.rstrip('\n'))
      break
  except queue.Empty:
    print("timeout")
    break
  finally:
    elapsed = time.time() - start

This has the added benefit of not needing a single sleep (because the worker thread can block and main thread uses Queue.get() timeout)...

@jku
Copy link
Member Author

jku commented Nov 2, 2020

Maybe I'll document the reproduction as well in case we need to come back to it:

  • the issues only happen with PR 1192 (because it reads the tempfile more often)
  • the tests need to be repeated a hundreds or thousands of times to trigger the issue, e.g.:
    count=0
    while python test_proxy_use.py; do 
      count=$((count+1)); 
    done
    echo "Finished after $count iterations"
    
  • reproduction on python2 was much easier but it was possible with python3 as well
  • By printing the current log contents after reading, it's clear that the log line has been corrupted (instead of "Server starting\nbinding succeeded...\n" it might contain "Server starting\n\ninding succeeded...")

MVrachev added a commit to MVrachev/tuf that referenced this issue Nov 4, 2020
These changes can be summarized with the following bullets:
- Delegate generation of ports used for the tests to the OS
- Use thread-safe Queue for processes communication
instead of temporary files
- Remove all instances of port generation or hardcoded ports
- Make test_slow_retrieval.py fully conform with TestServerProcess

Delegate generation of ports used for the tests to the OS is much
better than if we manually generate them, because there is always
a chance that the port we have randomly pick turns out to be taken.
By giving 0 to the port argument we ask the OS to give us
an arbitrary unused port.

Use thread-safe Queue for processes communication instead of temporary
files became a necessity because of findings made by Jussi Kukkonen.
With the latest changes made in pr 1192 we were rapidly reading
from the temporary files and Jussi found that it happened rarely
the successful message "bind succeded..." to be corrupted.
It seems, this is a thread issue related to the thread redirecting
the subprocess stdout to the temp file and our thread rapidly
reading from the file.
By using a thread-safe Queue we eliminate this possibility.
For reference read:
theupdateframework#1196

Lastly, test_slow_retrieval.py and slow_retrieval.py were refactored.
Until now, slow_retrieval.py couldn't use the TestServerProcess class
from utils.py for a port generation because of a bug related to
httpd.handle_request().
Now, when we use httpd.serve_forever() we can refactor both of those
files and fully conform with TestServerProcess.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
MVrachev added a commit to MVrachev/tuf that referenced this issue Nov 11, 2020
These changes can be summarized with the following bullets:
- Delegate generation of ports used for the tests to the OS
- Use thread-safe Queue for processes communication
instead of temporary files
- Remove all instances of port generation or hardcoded ports
- Make test_slow_retrieval.py fully conform with TestServerProcess

Delegate generation of ports used for the tests to the OS is much
better than if we manually generate them, because there is always
a chance that the port we have randomly pick turns out to be taken.
By giving 0 to the port argument we ask the OS to give us
an arbitrary unused port.

Use thread-safe Queue for processes communication instead of temporary
files became a necessity because of findings made by Jussi Kukkonen.
With the latest changes made in pr 1192 we were rapidly reading
from the temporary files and Jussi found that it happened rarely
the successful message "bind succeded..." to be corrupted.
It seems, this is a thread issue related to the thread redirecting
the subprocess stdout to the temp file and our thread rapidly
reading from the file.
By using a thread-safe Queue we eliminate this possibility.
For reference read:
theupdateframework#1196

Lastly, test_slow_retrieval.py and slow_retrieval.py were refactored.
Until now, slow_retrieval.py couldn't use the TestServerProcess class
from utils.py for a port generation because of a bug related to
httpd.handle_request().
Now, when we use httpd.serve_forever() we can refactor both of those
files and fully conform with TestServerProcess.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
MVrachev added a commit to MVrachev/tuf that referenced this issue Nov 13, 2020
These changes can be summarized with the following bullets:
- Delegate generation of ports used for the tests to the OS
- Use thread-safe Queue for processes communication
instead of temporary files
- Remove all instances of port generation or hardcoded ports
- Make test_slow_retrieval.py fully conform with TestServerProcess

Delegate generation of ports used for the tests to the OS is much
better than if we manually generate them, because there is always
a chance that the port we have randomly pick turns out to be taken.
By giving 0 to the port argument we ask the OS to give us
an arbitrary unused port.

Use thread-safe Queue for processes communication instead of temporary
files became a necessity because of findings made by Jussi Kukkonen.
With the latest changes made in pr 1192 we were rapidly reading
from the temporary files and Jussi found that it happened rarely
the successful message "bind succeded..." to be corrupted.
It seems, this is a thread issue related to the thread redirecting
the subprocess stdout to the temp file and our thread rapidly
reading from the file.
By using a thread-safe Queue we eliminate this possibility.
For reference read:
theupdateframework#1196

Lastly, test_slow_retrieval.py and slow_retrieval.py were refactored.
Until now, slow_retrieval.py couldn't use the TestServerProcess class
from utils.py for a port generation because of a bug related to
httpd.handle_request().
Now, when we use httpd.serve_forever() we can refactor both of those
files and fully conform with TestServerProcess.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
MVrachev added a commit to MVrachev/tuf that referenced this issue Nov 13, 2020
These changes can be summarized with the following bullets:
- Delegate generation of ports used for the tests to the OS
- Use thread-safe Queue for processes communication
instead of temporary files
- Remove all instances of port generation or hardcoded ports
- Make test_slow_retrieval.py fully conform with TestServerProcess

Delegate generation of ports used for the tests to the OS is much
better than if we manually generate them, because there is always
a chance that the port we have randomly pick turns out to be taken.
By giving 0 to the port argument we ask the OS to give us
an arbitrary unused port.

Use thread-safe Queue for processes communication instead of temporary
files became a necessity because of findings made by Jussi Kukkonen.
With the latest changes made in pr 1192 we were rapidly reading
from the temporary files and Jussi found that it happened rarely
the successful message "bind succeded..." to be corrupted.
It seems, this is a thread issue related to the thread redirecting
the subprocess stdout to the temp file and our thread rapidly
reading from the file.
By using a thread-safe Queue we eliminate this possibility.
For reference read:
theupdateframework#1196

Lastly, test_slow_retrieval.py and slow_retrieval.py were refactored.
Until now, slow_retrieval.py couldn't use the TestServerProcess class
from utils.py for a port generation because of a bug related to
httpd.handle_request().
Now, when we use httpd.serve_forever() we can refactor both of those
files and fully conform with TestServerProcess.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant