Improve error logging, proxy robustness, and operational quality of life #7
…tiplex doesn't have a log file
…nd kb for heap so this works better on small servers
…ference startup scripts
Unresolved Mystery No. 1
Finding the source of the error message, given a reported error from Joshua
This is located in
    }).on('error', function (err) {
      let errstring = `### NETCREATE STARTUP ERROR: '${err.errno}'\n`;
      switch (err.errno) {
        case 'EADDRINUSE':
          errstring += `Another program is already using port ${config.port}.\n`;
          errstring += `Go to "http://localhost:${config.port}" to check if NetCreate is already running.\n\n`;
          errstring += `Still broken? See https://github.com/daveseah/netcreate-2018/issues/4\n`;
          break;
        default:
          errstring += `${err}`;
          console.log(err);
      }
      console.log(`\n\n${errstring}\n### PROGRAM STOP\n`);
      throw new Error(err.errno);
    });
Interestingly, we are seeing this being emitted from the … There could be a bug in this routine. It's originating in …
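One thing that may be worth checking in this routine: on recent Node versions the string code of a system error (such as 'EADDRINUSE') is in `err.code`, while `err.errno` is a negative number, so a `switch` on `err.errno` with string cases would fall through to `default`. The sketch below is only an illustration of matching on `err.code`; `describeStartupError` and its `port` parameter are hypothetical names, not part of the code base, and the `-98` value is just an example errno.

```javascript
// Hypothetical helper: build the startup error text from a Node system error.
// Assumption: err.code carries the string code; err.errno is used only as a
// fallback for runtimes that still put the string there.
function describeStartupError(err, port) {
  const code = err.code || String(err.errno);
  let errstring = `### NETCREATE STARTUP ERROR: '${code}'\n`;
  switch (code) {
    case 'EADDRINUSE':
      errstring += `Another program is already using port ${port}.\n`;
      break;
    default:
      errstring += `${err}`;
  }
  return errstring;
}

// Example with a stand-in error object shaped like a modern Node system error.
const sample = Object.assign(new Error('listen EADDRINUSE'), {
  code: 'EADDRINUSE',
  errno: -98
});
console.log(describeStartupError(sample, 3000));
```

Whether this explains the report is not confirmed; it only shows how the two properties differ.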
Unresolved Mystery No. 2
What's weird about this one is that the displayed number of active graphs is based not on running instances but on an "instance token" array that is a pool of spawned children. They are literally just javascript numbers. This array should always have the "base" deployment that's used to serve static assets, which is …
Ben captured a crash on his ubuntu server
The EADDRINUSE error is due to the child processes continuing to run after nc-multiplex has crashed; the crash itself comes from the request processing in the http-proxy-middleware filter function shown. However, WHY this is even happening (and why websocket upgrading is happening) is a mystery.
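For the orphaned-children half of this, one possible approach is to keep a registry of spawned children and signal them when the parent exits. This is a minimal sketch, not what nc-multiplex does; `ChildRegistry` is a hypothetical name, and it assumes each tracked object behaves like a Node `ChildProcess` (has `kill()` and emits `'exit'`).

```javascript
// Hypothetical registry of spawned child processes.
class ChildRegistry {
  constructor() {
    this.children = new Set();
  }

  // Track a child and forget it automatically once it exits.
  track(child) {
    this.children.add(child);
    child.once('exit', () => this.children.delete(child));
    return child;
  }

  // Ask every tracked child to terminate; returns how many were signalled.
  killAll(signal = 'SIGTERM') {
    let count = 0;
    for (const child of this.children) {
      if (child.kill(signal)) count++;
    }
    return count;
  }
}

const registry = new ChildRegistry();

// 'exit' handlers must be synchronous; kill() only sends a signal, so it fits.
process.on('exit', () => registry.killAll());

// Usage (scriptPath is a placeholder):
// const child = registry.track(require('child_process').fork(scriptPath));
```

This would not help if the parent itself is killed with SIGKILL, since no exit handler runs in that case.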
Adding hardening to process control and reporting of weird URLs.
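As an illustration of what reporting weird URLs without crashing could look like, here is a sketch of a filter in the `(pathname, req)` form that http-proxy-middleware accepts for custom filter functions. The `safeGraphFilter` name, the `GRAPH_PATH` pattern, and the allowed graph-name characters are assumptions made for this example; the real router may match differently.

```javascript
// Assumed route shape: /graph/<name>/<optional file>.
// The allowed characters for <name> are a guess for illustration.
const GRAPH_PATH = /^\/graph\/([A-Za-z0-9_-]+)(\/.*)?$/;

// Returns true only for well-formed graph URLs; never throws.
function safeGraphFilter(pathname, req) {
  try {
    if (typeof pathname !== 'string') return false;
    if (!GRAPH_PATH.test(pathname)) {
      const method = req && req.method ? req.method : 'UNKNOWN';
      console.log(`ignoring unexpected request: ${method} ${pathname}`);
      return false;
    }
    return true;
  } catch (err) {
    // Any surprise in the filter is logged and treated as "do not proxy".
    console.log(`filter error for ${pathname}: ${err}`);
    return false;
  }
}

console.log(safeGraphFilter('/graph/demo/', { method: 'GET' }));
console.log(safeGraphFilter('/cgi-bin/example', { method: 'GET' }));
```

The point of the try/catch is that a malformed request ends up as a log line rather than an exception inside the proxy.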
At this point the server may be stable enough that it no longer crashes, but it remains to be seen whether additional crashes happen.
… still works. This file was renamed to nc-launch-instance.jssh because it's not a pure js file, but a shell script
…so it feels less broken when cookie timeout
…when using pm2 avoid absurdist commands that look like "pm2 stop start"
req.params.graph is undefined
…onal location.reload() check
LOGGING NEWS
Joshua ran trials on 198 today, August 27, and they went well! Sri ran through the log and found one instance of a crash:
and another two instances from this specific IP address, which isn't recognized as one of ours:
This seems to support the theory that websocket connection attempts by random bots were the cause of increased server failure rates.
UPDATE NEWS
198 was running the older branch of …
Looking at that excerpt again, a lot of UADDRs drop from the network at the same time as that error. Unfortunately, the NetCreate server itself doesn't timestamp these connection status messages. Coincidence?
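If those connection status messages were timestamped, correlating them with the error would be straightforward. A minimal sketch of a timestamping wrapper follows; `stamp` and `makeTimestampedLogger` are hypothetical names, and the sample message text is made up.

```javascript
// Hypothetical helper: prefix a log line with an ISO-8601 UTC timestamp.
function stamp(message, date = new Date()) {
  return `${date.toISOString()} ${message}`;
}

// Wrap a logging function so every line it prints is timestamped.
function makeTimestampedLogger(write = console.log) {
  return (...args) => write(stamp(args.join(' ')));
}

const log = makeTimestampedLogger();
log('UADDR disconnected'); // message text is illustrative only
```

UTC ISO strings sort lexically, which makes it easy to line up two log files side by side.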
Digging further into better ways of handling errors like this, the original router code and the way it redirects things is hard to work with. I tried to not change it appreciably in making my annotations but I think it needs a rewrite, which means a whole new test cycle.
Joshua reports that with his Thursday Aug 19 2024 sessions, he noticed no problems; the server restart seems to take about 15 seconds and client reloads should just work. What's unknown is whether socket connections are somehow not getting dropped. I'm unable to reproduce the behavior on my Mac, so I'll have to use 159 to make some changes to try to reproduce this particular crash, probably next week.
NEW ERRORS
This log shows excerpts of a crash. There are two major errors noted in the logs.
First, ERR_INVALID_HTTP_TOKEN is triggered by IP 64.39.98.3 trying to log in hundreds of times. The error is originating in …
Then, ERR_NOT_NETMESG is thrown out of our own code in NetCreate itself (not nc-multiplex), occurring several times. Each error represents a proxied server crashing, because after these errors stop appearing in the log, the HEARTBEAT no longer lists any running instances.
After all the servers have crashed, there are various HPM proxy request errors that appear despite the …
This appears to be an attempt to access a cgi-bin script, a blind access attempt:
Also, the management page does not show running instances, meaning that nc-multiplex was unable to detect the crash of the child servers.
ATTACHMENTS
…ute a path in m_RouterGraph
# Conflicts: # nc-multiplex.js
ERRORS for 198 on SEP 6
on 159:
Fixed bug in OutOfMemory report, which didn't convert to MB. Also added finer-grained out-of-memory checks.
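For reference, the bytes-to-MB conversion and a tiered check could look roughly like the sketch below. It is not the code in this PR: `toMB`, `memoryLevel`, and the 150 MB / 50 MB thresholds are assumptions chosen for illustration. `os.freemem()` reports bytes, which is why the conversion matters.

```javascript
const os = require('os');

const BYTES_PER_MB = 1024 * 1024;

// Convert a byte count to megabytes, rounded to one decimal place.
function toMB(bytes) {
  return Math.round((bytes / BYTES_PER_MB) * 10) / 10;
}

// Classify free memory; the thresholds are illustrative, not from nc-multiplex.
function memoryLevel(freeBytes, lowMB = 150, criticalMB = 50) {
  const freeMB = toMB(freeBytes);
  if (freeMB < criticalMB) return 'critical';
  if (freeMB < lowMB) return 'low';
  return 'ok';
}

const free = os.freemem();
console.log(`free memory: ${toMB(free)} MB (${memoryLevel(free)})`);
```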
This PR adds robust logging to nc-multiplex and cleans up the code base, correcting numerous oversights related to logging. Most notably it hardens the proxy against crashes, adds much improved logging, and provides a way to autorestore datasets on server start.
TESTING
… nc-multiplex if it is running.
… dev-sri/timestamps branch in your nc-multiplex directory
… start-nc-multiplex.sh (see comments for pm2 instructions)
npm ci
tail -f log.txt to monitor the output
… http://host/manage to load databases
ADVANCED TESTING
Suggest tail -f log.txt in a terminal window to see what's happening in the log while you try these.
… wsping.jssh script (in .vscode/utilities) to connect via websocket to nc-multiplex root, and see if the log reports it or crashes (it should not be reported)
INSTALLATION ON DIGITAL OCEAN
Instructions from inside the nc-multiplex directory.
pm2 list to find the running instance (usually 'start.sh' or 'do-start.sh')
pm2 stop [instance]
git fetch to update branch information
git status to check what branch you're on
git checkout [branch] where [branch] is dev (if installing) or dev-sri/timestamps (if testing)
git pull to grab latest branch information
pm2 delete [instance] to remove the old startup script
mv log.txt log-yyyy-mmdd.txt followed by touch log.txt
pm2 start ./start-nc-multiplex.sh to start the new script
pm2 save
tail -n 100 -f log.txt to inspect the running log content as you try things out
BACKGROUND
Joshua reported that his NetCreate servers on Digital Ocean were crashing or becoming unresponsive. The cause of this crashing could not be determined by examining the log.txt file that captures output from nc-multiplex.js.
The preliminary thought was that the server was somehow rebooting due to the reported EADDRINUSE in the log, but because the logging did not capture adequate information and the server was rebooted manually every time the problem occurred, there was no way to find out when it happened. The system logs also did not seem to reveal any surprise reboots.
The next thought was that pm2 was rerunning the startup script do-start.sh on a cron job, but this seemed fishy because a process manager should only run when the instance stops working, not constantly try to rerun it. There was an error in the configuration of pm2 due to the change in NodeJS version (see RELATED ISSUES), but this was not thought to be a cause for a restart error because it prevented restarting.
Because we cannot find the trigger or timing of the errors, this augmented version of nc-multiplex has a lot of additional logging to make it easier to determine when the next crash happens.
RELATED ISSUES
A related fix in NetCreate is addressed in PR#242, which adds the missing ursys build instructions to npm run package. This command is necessary for correct functioning of nc-multiplex, and the missing build step may be the cause of unresponsive server errors. However, given the lack of issue tracking, this is speculative.
A problem with the pm2 process manager for NodeJS on the Digital Ocean servers was related to a recent change in node version from 18.16.0 to 18.18.2 as defined in .nvmrc. There was an old process script that referred to the old 18.16.0 path .nvm/node/versions/18.16.0/bin, which no longer existed. This was detected in the ubuntu system logs. It's not likely the cause of the reported server issues, though.
The use of npm run package as a way to initialize the netcreate-itest subrepo does not build source maps, which will show up as errors in the netcreate console. We will modify the packaging script.
COMPLETE CHANGE LOG
LOGGING IMPROVEMENTS
… start.sh script to not overwrite the log file every time. It now appends, and also captures stderr as well as stdout. A new script is now provided as part of the repo.
… nc-multiplex output so we could tell when particular events occurred.
… /manage, /, and /graph/:graph/:file? express route handling.
… nc-server-start.txt
CRASH PROTECTION
… nc-process-state.json file is found and valid. It will crash if the datasets are not found or there is a loading error and log the message.
… throw Error() in nc-multiplex because this could cause an actual crash we can't detect
… /, causing a crash.
… http-proxy-middleware express package to a version that doesn't have the crashing behavior (1.0.6 -> 3.0.0)
UI IMPROVEMENTS
/manage page now resets the cookie duration on refresh properly, and now includes a countdown
/manage page displays more pertinent memory statistics and now includes the server start.
CODE IMPROVEMENTS
… ps command
UTILITIES
All utilities are in the hidden .vscode/utilities directory, and must be copied into the nc-multiplex root directory to run.
log-less.sh strips ANSI color markers from the log and pipes it into less
log-rotate.sh renames log.txt to log.txt.0, log.txt.1 to log.txt.2 and so on, such that a lower-numbered log is more recent than the next higher-numbered one.
log-archive.sh uses a different log archiving strategy, creating a log-YYYY-MMDD-HHMM.txt file where the date is the time of creating the log, and then creating a blank log.txt
log-check provides a convenient way to scan for certain errors in the log.txt file using several grep commands.
memdump.jssh provides a way of checking for instances and system memory conveniently from the command line, filtering only for netcreate instances launched by nc-multiplex
… wsping.jssh that tries to open a websocket connection to a server's home page and send a message. This would reliably crash older versions of nc-multiplex.