Core dump when enabled CCW tracing for 2703 BSC device #319
Comments
Notes:
Need to collect additional information. In channel.c:
Will need to see the debug output starting at the beginning of the CCW chain in error, along with the dump. Please make sure compilation was with -g3 or -ggdb3 specified and preferably not optimized (-O0). Also need to know the compiler and compiler version/level used. If no failure is seen with these options, then it is definitely a compiler failure and you will need to update your compiler. |
|
I used this to disable optimization:
Enabled DEBUG_PREFETCH Then rebuilt the entire thing and re-installed. Now it crashes a bit later, but in a completely different location:
It appears that it tries to write an enormous amount of data, which is not at all correct. |
I think I may know what your problem is. What guest operating system are you running under Hercules, and is device 604 by chance used by BTAM? If my suspicion is correct that your guest is using BTAM, then the cause of your problem is very likely that BTAM is dynamically modifying its channel program(s) "on the fly". This has been a known issue (problem) for as long as I can remember. Back in the day when I was gainfully employed as a DOS/VSE System Software Developer (and VM Administrator to boot), we used BTAM extensively and I recall having to define certain special options to VM/SP to accommodate this standard BTAM behavior. Even VSE itself had a handshaking option to notify CP whenever it dynamically modified one of its channel programs.
As I said, I have not figured out precisely where in Hercules the "bug" is nor how to fix/workaround it yet, but if I discover anything I'll let you know. In the mean time, please confirm or deny that your Hercules guest is using BTAM (or a program similar to BTAM that dynamically modifies channel programs on the fly). If it's not, then what I've written above isn't your problem. I'm CC'ing Mark in the hope that he might be able to find the "bug" in channel.c. I use the term "bug" within scare quotes since I personally don't feel it's a bug per se. Rather, it is simply missing support for such unusual guest behavior. We'll probably need some type of new device flag to let Hercules's channel code know that this is a device whose channel programs are dynamically modified, and if set, to bypass CCW prefetch logic or something, but I leave all that to Mark. He's the channel expert. Hope that helps! |
Hello Fish and dasdman. Sounds interesting. Just to make clear, the OS is MVS 3.8j installed per the guidelines of Jay Moseley. On top of that there is TCAM running to control the terminal. I built a TSOMCP program which I have described here. I can run Hercules/Spinhawk with my patch to correctly handle ETX/ETB without any problems whatsoever. So there has to be some kind of difference in how Spinhawk deals with this and Hyperion-SDL. Most code differences are in channel.c, although there are some in commadpt.c as well. To test this I use my simple test program called 3274emu and the x3270 client. The 3274emu takes a TN3270 session and sends the data over BSC/TCP. 3274emu was designed as a testing tool when building the BSCBridge firmware for the SyncDongle hardware. All this to be able to connect a real Alfaskop terminal with accompanying cluster controller and floppy unit to the Hercules emulator. |
Well DUH! We know what the channel code differences are between spinhawk and SDL Hyperion and we know why that is causing your crash. That's not the unknown. The unknown is precisely where in SDL Hyperion the problem is and how to fix/workaround the problem. The issue has to do with CCW prefetching (I think) and the fact that the channel program is being dynamically modified on the fly by your guest unbeknownst to Hercules's channel logic, thereby causing it (SDL Hyperion's channel logic) to end up fetching garbage instead of the actual next CCW. Spinhawk doesn't crash because its channel logic doesn't do that. It always fetches the CCWs one at a time as they're being executed, so if MVS 3.8j's BTAM logic (presumed) dynamically changes a CCW, spinhawk will end up fetching that newly modified CCW and everything proceeds normally. (Its device tracing logic is also very different.) With SDL Hyperion however, due to CCW prefetching (Mark? Correct me if I'm wrong here!), instead of the next CCW being fetched, garbage is being fetched instead. Or something like that. I don't know the precise details, but honestly, it is absolutely no surprise whatsoever that spinhawk works fine since its channel logic isn't as sophisticated (functionally correct) as SDL Hyperion's is. The channel code in SDL Hyperion is much closer to conforming to actual architectural specification. Spinhawk's channel code is not, and technically does not actually behave the way real channels do. It works, sure, but it's not as conformant to the actual specs as SDL Hyperion's channel code is.
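To make the suspected failure mode concrete, here is a minimal toy model in Python. It is not Hercules's actual channel.c logic, and every function name in it is made up for illustration. It stores format-0 CCWs in a byte array, lets a hypothetical guest callback rewrite the next CCW while the current one is executing, and compares fetching each CCW at execution time with taking a snapshot of the whole chain up front. Whether prefetching really is the cause of this crash is only a suspicion at this point in the thread.

```python
# Toy model only: NOT Hercules's channel.c. All names are hypothetical.

CC = 0x40  # command-chaining flag in a format-0 CCW flag byte


def put_ccw0(storage, addr, cmd, data_addr, flags, count):
    """Store a format-0 CCW: cmd(1) | data address(3) | flags(1) | unused(1) | count(2)."""
    storage[addr:addr + 8] = (
        bytes([cmd])
        + data_addr.to_bytes(3, "big")
        + bytes([flags, 0])
        + count.to_bytes(2, "big")
    )


def get_ccw0(storage, addr):
    """Decode the format-0 CCW at addr into (cmd, data_addr, flags, count)."""
    cmd = storage[addr]
    data_addr = int.from_bytes(storage[addr + 1:addr + 4], "big")
    flags = storage[addr + 4]
    count = int.from_bytes(storage[addr + 6:addr + 8], "big")
    return cmd, data_addr, flags, count


def run_chain(storage, start, prefetch, on_execute=None):
    """Walk a command-chained program; return the (cmd, count) pairs executed."""
    cache = {}
    if prefetch:
        # Snapshot the whole chain before executing anything.
        addr = start
        while True:
            cache[addr] = get_ccw0(storage, addr)
            if not cache[addr][2] & CC:
                break
            addr += 8

    executed = []
    addr = start
    while True:
        ccw = cache[addr] if prefetch else get_ccw0(storage, addr)
        executed.append((ccw[0], ccw[3]))
        if on_execute:
            on_execute(storage, addr)  # the guest may patch storage here
        if not ccw[2] & CC:
            break
        addr += 8
    return executed


def build_program():
    storage = bytearray(4096)
    put_ccw0(storage, 0x100, 0x01, 0x800, CC, 4)   # write, chained to next CCW
    put_ccw0(storage, 0x108, 0x02, 0x900, 0, 16)   # read 16 bytes
    return storage


def guest_patch(storage, addr):
    # While the first CCW runs, the guest rewrites the second one.
    if addr == 0x100:
        put_ccw0(storage, 0x108, 0x02, 0x900, 0, 80)


fresh = run_chain(build_program(), 0x100, prefetch=False, on_execute=guest_patch)
stale = run_chain(build_program(), 0x100, prefetch=True, on_execute=guest_patch)
print("fetch on execute:", fresh)
print("prefetched      :", stale)
```

In this model the fetch-on-execute run sees the rewritten second CCW (count 80), while the prefetching run keeps using the count of 16 it captured before the guest patched storage. A real channel subsystem is far more involved, so treat this purely as a picture of the idea.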
Cool! Can I use this program to connect my tn3270 program to it and then have it (3274emu) connect to Hercules via a 2703 device defined the same way that you have it defined?
If so I might have to give it a try to see if I can recreate your crash on my Windows system. The only problem is, I don't have MVS 3.8j installed nor am I really that interested in doing so. I don't want to have to learn how to use yet another guest operating system just to debug/test Hercules. I'd rather have a simple standalone program I can use instead. Can you tell me the series/sequence of commands that leads up to the failure? I might be able to write a quick and dirty standalone program that recreates the problematic sequence. The only missing piece of the puzzle is where (i.e. which CCW) and how BTAM is changing its channel program on the fly (presuming that's what it's doing). If I knew that, then I could throw together a dirt simple standalone program to recreate the problem. THAT would go a long way towards us being able to identify and fix this bug! Thanks for any help you can provide and thank you for your continued patience as well. (p.s. what you're doing sounds really cool! Trying to connect a real mainframe device to Hercules!) |
Hello! So I add here a more detailed instruction on how to recreate this. I use Jay Moseley build of MVS and added built the Unpack it to the I also use the There is some description on how to build and use it in the README and the code itself. Then the steps are (since you are experts on IBM and Hercules these steps are most likely redundant, but for completeness):
|
FYI: The preferred way to create a Hercules log is to start Hercules itself with logging enabled via simple standard output redirection:
Refer to the Hercules " Note that you might need to make a minor change to one or more of the MVS 3.8j package's pre-provided script/batch files in order to accomplish that, but it is recommended that you take the time to do so. Starting Hercules with standard output redirected to a log file ( I really wish the people responsible for the MVS 3.8j package ("TK4-"?) would fix that. It's been a problem (bug) for a long time now. Being able to capture the Hercules startup/initialization messages in the logfile becomes very important when trying to debug a problem! |
I am personally unable to proceed with debugging this issue due to the unavailability of a Windows version of your |
Mattis, (@MattisLind) Have you tried Mark's suggested patch to
The below patch contains Mark's above suggested change. Please apply it to current git, rebuild Hercules, and then try your test again, letting us know whether or not Hercules still crashes or not. We're hoping it won't. Thanks! |
Hello? (@MattisLind) Anyone home? |
Been busy with work, life in general and other projects. Sorry. I have tested applying the above patch. It makes the situation worse. With this patch it is not possible to IPL the system properly any longer. After doing the IPL 150 in the Hercules console I get the usual "IEA101A SPECIFY SYSTEM FOR RELEASE 03.8 .VS2" Now what happens when I press enter is that I get a message from c3270 saying: Press to resume session. The log file is here: |
Dang. Oh well. I guess we'll have to let Mark look into this issue whenever he can (which as we both know may be quite a while given his current workload and financial situation). But don't despair! We'll get it figured out eventually! I promise you! As mentioned earlier, if I had a Windows version of your And while I do indeed have a version of Linux installed in a VMware virtual machine that I could use (CentOS 6.10), my Linux debugging skills are virtually non-existent so my trying to debug it using Linux is pretty much a no-go from the get-go. Sorry! |
You should be able to run 3274emu on the VMware virtual machine while Hyperion runs on Windows, I guess. The command line can be used to specify a different host than localhost if one wishes. 3274emu is just a simple network tool, and as long as the ports are not behind a firewall on the VMware machine it should work fine as far as I can see. |
MVS is in a minimal I/O mode for console support in handling the IEA101A message and response. The non-response issue also appeared heavily on real hardware when certain timings happened. One item to try, instead of just hitting enter, is to respond with “r 00,u”. A console device trace would also be helpful.
|
The same happens if I try r 0,u (or r 00,u for that matter) |
(Doh!) Thanks. I'm duly embarrassed for not having thought of that myself. I will give it a try as soon as I can and get back to you. (Might still be a day or three however due to my own workload.) |
FYI: "A day or three" == "(undefined length of time)".
I could not find any instructions on how to build I'm using CentOS 6.10 under VMware, which I use to build Hercules all the time, so I know I have all the pieces needed. I just need a makefile and a command I can use to build (p.s. it doesn't matter if I have to build a bunch of other stuff that I don't need. That's okay. As long as |
Sorry Fish! My bad. All these small stupid utils I just compiled directly on the command line without proper makefile.
EDIT: I just saw that in the top comment of the 3274emu.cpp file it says "Compile using ..." |
THANK YOU! I was able to successfully build and run Unfortunately however, my attempt was unsuccessful. First, after entering
Presuming that's normal I then entered the
Ever thinking positively, I presumed that was benign as well, and pushed on, eventually reaching steps 16-18:
However, after receiving the login prompt and entering the Do I need to enter a password on my login? Should the command instead be: Bottom line: since I was never able to logon to TSO (and thus was never able to reach step 19), I was never able to reproduce your crash. What now? |
What am I supposed to see (what does the 3270 screen look like) after logging on to TSO? (i.e. after entering |
After the login there should be a transient screen saying something about login in progress. A good idea is to use Hercules 3.13 as a reference. Make sure to include the patch I have created to make it work correctly. PR with BSC fixes I have BTW now bitten the bullet and compiled gcc 10.2.0 so the "old compiler" issue is off the table. It still crashes. Sometimes it crashes directly after logging in. Sometimes I am able to issue the HELP command and then it crashes during the printout of the HELP screens. Just like before Hercules is spewing out loads and loads of data over the TCP socket. While compiling I see quite a few warnings about potential buffer overflows: |
I'll try to remember to eventually give that a try. Thanks.
Obvious compiler bug. Maybe that's your problem? Maybe that's what's causing your crash? Just as a point of note too, so far I have been using current SDL Hyperion (i.e. as of commit 360156d) for all of my tests, not commit 2c0d41e which appears to be what you are using. I will next try 2c0d41e (the same as what you are using, but with your patch applied of course) In the mean time, have YOU tried using the latest and greatest most current SDL Hyperion yet? Does it also crash the same way? Finally, something else to consider: the tn3270 emulator you are using: x3270. Have you tried a different tn3270 client? I'm using Tom Brennan's Vista tn3270 on Windows, not x3270 (which I don't have on my Linux system). I had to make a minor change to your 2703 device statement (And just to make sure we're on the same page, are you using your 3274emu utility too? Or are you connecting directly to Hercules? I'd just like to confirm that too.) Oh yeah! I'm also using CCKD64 dasd too, not CCKD, but I seriously doubt that makes any difference. Let me do some more testing to see if I can eventually recreate your crash on my Windows system. Maybe the problem is in the difference in the way sockets are handled on Linux vs. the way they're handled on Windows? Dunno. <shrug> I'll also have to try running it on Linux instead too, to see whether that makes any difference. Maybe it's the difference in the way sockets are handled on Windows vs. on Linux that's causing the crash? Dunno. Something else for me to try. Bottom line: I'm trying to get to the bottom of this! But obviously having trouble recreating it. But don't despair: I'll eventually figure it out I'm sure.
I'm only itemizing the differences so we have a record of how my attempts to recreate your crash are obviously different from the environment that you use where the crash occurs, just in case one of them does unexpectedly make a difference. |
Not that this has anything to do with the issue at hand, that msg is a gcc "note", not a "warning". Here's an example:
The pre-processor expands that to:
What the compiler msg is stating is true. It's not a bug. Bill |
Ah. I see. GCC is just trying to be <cough> "helpful", eh? <me, rolls eyes> In my years of experience using gcc for Hercules development, gcc's overabundance of "helpful" warning and note messages only serve to accomplish the complete opposite of their intended function. They are rarely if ever helpful and only serve to get in the way and either alarm developers who then must go out of their way to suppress such <cough> "helpful" messages or cause actual bona-fide important messages to be missed due to their spamming the log with an overabundance of "helpful" messages (which, as said, only serve to accomplish the complete opposite of their intended effect). But then to be fair that's probably because Hercules's build default on Linux is to "show me everything!" instead of simply requesting default gcc message settings? Maybe we should consider fixing/changing that? |
To rule out potential environment problems I got myself a new cloud server. I installed it with Ubuntu 18.04. Not very old. Not the latest. I then installed gcc, g++, make and tools. gcc is now 7.5.0. I cloned the spinhawk repo and built it. Then downloaded the MVS environment from the link in this thread. I started Hercules and was able to connect to it from the c3270 console on my laptop, IPLed the system and started TCAM. Then started 3274emu and x3270 on my laptop and performed the steps above. Worked as expected. Was able to log in and was greeted with the TSO screen. I never applied my ETX/ETB length patch so there were some problems. But it worked. Then I did exactly the same steps with Hyperion-SDL. Cloned it from HEAD (6de77e1) and built the thing with ./configure and make. I did a make check which revealed some kind of problem. 1 out of 322 tests was failing:
Nevertheless I decided to give it a try so I did a make install and started Hercules in the MVS directory. The c3270 client on my laptop connected to Hyperion and got the Hyperion 3270 greeting screen. Then when IPLing the system nothing whatsoever happened. It just sat there. One could see the instruction count increasing. Typing exit at the Hercules prompt gave "HHC01420I Begin Hercules shutdown". But it never really exited. Had to Ctrl-Z and kill -9 it. |
I just tried Herc 3.13, and the behavior is exactly identical to when SDL Hyperion is used: whenever I connect my 3270 client (from Windows) to 3274emu (on Linux), I get the login prompt, but upon entering the logon command, the logon appears to fail because I'm almost immediately presented with another identical login prompt. I am unable to login to TSO. One thing that bothers me is the terminology you use when describing your connecting 3270 sessions:
I'm not familiar with the tn3270 client you are using, but why is it that you sometimes use the term "c3270" whereas other times you use "x3270"? Is there a difference? If so, WHAT is the difference? Is it possible for you to try your same test using a different 3270 client? What other tn3270 products are there for Linux? If needed, is there a way for ME to get x3270 installed on MY Linux system? I'm beginning to suspect the problem may be in what the 3270 client is sending to your 3274emu utility and what your 3274emu utility is then sending to Hercules. Thanks. (This is so very frustrating! It shouldn't be this difficult to get this working!)
|
Fish,
x3270 is the X windows version of the same program. On CentOS you should be able to install it with:
And then just run Bill |
Ubuntu 18.04 and gcc 7.5.0 are in my test suite for the build changes I'm working on, and How much ram and how many CPU cores are you running? Can you post the output from these: I'd like to try building with your exact settings. Thanks, |
Bill Lewis wrote:
Thank you, Bill! When I tried installing x3270, it said it was already installed (which I thought was strange), but then I tried installing x3270-x11, and sure enough, it installed just fine! Then I started x3270 just to see what it looked like and how it behaved. Its mouse/menu handling sucks. It behaves like the Mac: clicking on a menu displays the menu, but the click isn't sticky like the way I'm used to. As soon as you let the mouse button go, the menu you clicked on disappears! I was expecting normal menu behavior like the way it works on every other program on Linux and Windows and everywhere else: you click a menu and it drops down but doesn't disappear as soon as the mouse button is released. So then you can mouse over the menu item you want and click on that item to "execute" that menu command. But with x3270, its menus aren't "sticky"! No big deal though. I can deal with it. I then investigated how to do a "Reset" and "Clear" and "Erase EOF", and just could not figure it out! ..... at first. But then I noticed an interesting keyboard icon/image at the top right of the screen, so I clicked that and VOILA! A very nice auxiliary window appeared with buttons for all of the usual 3270 functions! PA1, Clear, Reset, Erase EOF, Erase Input, Cursor Up, Cursor Down, etc... Great! So now we get to the good part. Long story short: I was able to recreate the crash! Every time! But for me, the crash occurs as soon as I try to logon. I do reset and clear and get my login prompt. I then enter my login ( So now that I am able to recreate the bug, I can now begin looking into the cause of the crash and hopefully devise a fix for you. Given my current workload however (I'm also working on something else too at the same time), it might be a "day or three" Thank you both for the great help you've each provided! I couldn't have gotten this far without you guys. MUCH appreciated!! So hang in there Mattis! A fix is on the way!!
|
FYI: Lesson learned: apparently your I wish I had known that earlier! |
UPDATE:
It's not as consistent as I had originally thought. I just tried it again, and this time I was actually able to successfully logon to TSO! And not only that, after logging on (and seeing my nice TSO welcome screen), I entered "help", and that worked too! No crash! It displayed a nice(?) help screen (list of commands?). So I then entered "logoff", and I guess it logged me off, because the help screen changed. (logoff was no longer in the list). So then I entered "logoff" again, and THIS time it crashed! So the crash isn't as consistent as I originally thought. But hey, at least I am able to cause a crash without a lot of effort, so I am able to look into the problem. In fact, I think I may have already found the bug! Give me "a day or three" to do some more tests to make sure I've actually found it and that I know how to fix it, and then I'll go ahead and commit it and let you know. So hang loose! Things are looking good! (This is great!) |
Fixed by commit 1ba20b4. Closing. |
I don't know why the "mainsize check storage size" test failed for you, but it's really not that big of a deal. If it always fails for you however, then we might want to look into it, but for now, I'm not going to worry about it.
THIS however, is somewhat concerning! I'll have to give it a try on my Linux system to see whether or not it behaves similarly. FWIW, it works flawlessly on Windows. <shrug> If it continues to fail the same way every time you try it, then that's something we'll definitely need to look into. If that's the case, then please open a new GitHub Issue so we can investigate. Thanks. In the mean time, as my previous comment states, your original issue (the crash when CCW tracing a 2703 BSC device) should be fixed now. Please give SDL Hyperion another try to confirm. Thanks. |
x3270 is the guts and x3270-x11 is the GUI front end.
I don't run that thing, myself. Yeah, it's an X Windows (X11) program. And they, well, take some getting used to. Glad you got it going! |
Painful to reproduce and good to get it fixed! |
That is fantastic! Thanks! |
Fish, can we (meaning you, since I don't know how :-) create a new issue for this Mainsize Test problem? Bill |
See that big green "New issue" button at the top right of your screen? Click it. |
I get a core dump from Hyperion when trying to login over BSC to the 2703 emulation in commadpt.c. It works fine in Spinhawk but not when trying Hyperion.
I get the TSO Login message. Then I am able to start the login process and get a few login messages from TSO and then Hyperion barfs.
I am on commit 2c0d41e of SDL Hyperion.
Log file with t+604 (which is the 2703 used):
Trying gdb on the Hercules binary with the core gives this:
It appears that there is a wild BYTE pointer ccw dereferenced. The source of the problem is somewhere around line 5415 in the function execute_ccw_chain in channel.c, I believe:
hyperion/channel.c
Lines 5414 to 5417 in 928dd38
Unfortunately it is very hard to get a grip of the code since it is not only 1500 lines long but also filled with goto.
What the prefetch system does is unclear to me.
However disabling this code by setting line 5399 to "if (false)" at least makes it not crash any longer:
hyperion/channel.c
Lines 5398 to 5400 in 928dd38
The code that creates the path that is taken when it crashes was added in commit accb0fc. Whether this is the offending commit, or whether later commits caused the problem, is hard to say. I have been building those old versions but was unable to run them: "check-stop due to host error: Segmentation fault". No other diagnostics are given on what is happening.
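For illustration only: a wild CCW pointer of the kind described above could in principle be caught by validating the CCW address before touching storage. The sketch below is Python, not the code in channel.c, and fetch_ccw0 and CCWFetchError are hypothetical names. It assumes format-0 CCWs (the S/370 format an MVS 3.8j guest uses) held in a byte array that stands in for guest main storage.

```python
# Hypothetical sketch, not Hercules code: validate a CCW address before use.

class CCWFetchError(Exception):
    """Raised instead of reading outside the modelled main storage."""


def fetch_ccw0(storage, addr):
    """Return the format-0 CCW at addr as a dict, or raise CCWFetchError."""
    # CCWs are architecturally required to sit on a doubleword boundary.
    if addr % 8 != 0:
        raise CCWFetchError(f"CCW address {addr:#x} is not doubleword aligned")
    # All eight bytes must lie inside main storage.
    if addr < 0 or addr + 8 > len(storage):
        raise CCWFetchError(f"CCW address {addr:#x} is outside main storage")

    raw = storage[addr:addr + 8]
    return {
        "cmd": raw[0],
        "data_addr": int.from_bytes(raw[1:4], "big"),  # 24-bit data address
        "flags": raw[4],
        "count": int.from_bytes(raw[6:8], "big"),      # 16-bit byte count
    }


storage = bytearray(64)
storage[8:16] = bytes([0x02, 0x00, 0x09, 0x00, 0x20, 0x00, 0x00, 0x50])
good = fetch_ccw0(storage, 8)
print(good)

try:
    fetch_ccw0(storage, 0x7FFFFFF8)  # a "wild" address far beyond storage
except CCWFetchError as err:
    print("rejected:", err)
```

The design choice here is simply to fail loudly at the fetch rather than later at the dereference, so the bad address shows up in a diagnostic instead of a core dump. It says nothing about why the address went bad in the first place.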