Issues with OSA multiple IP support #203
On May 2, 2019 at 12:05 PM Fish-Git wrote:
> However when I stop the VIPAOWNER OSA ...
Forgive me Bob, but I am not very familiar with OSA devices nor how they're supposed to work. When you say "stop", what do you mean? How do you stop an OSA? For that matter, how does one define a VIPA?
When I implemented my changes (to both CTCI-WIN as well as to Hercules), I tested them by simply defining an OSA device like normal, and then using the TCPIP OBEYFILE command to, I think, add a VIPA (and then verify it could be pinged) and then delete it (and verify the IP address no longer responded to pings):
ADCD.Z21S.TCPPARMS(FISHADDV):
    INTERFACE VLINKX DEFINE
      VIRTUAL
      IPADDR 192.168.20.4

ADCD.Z21S.TCPPARMS(FISHDELV):
    INTERFACE VLINKX DELETE

    VARY TCPIP,,OBEYFILE,ADCD.Z21S.TCPPARMS(FISHADDV)
    VARY TCPIP,,OBEYFILE,ADCD.Z21S.TCPPARMS(FISHDELV)
Basically, I created the two members you see above: one called FISHADDV (containing the statements to, I think, define the VIPA) and one called FISHDELV (to, I think, delete the VIPA). I then used the two VARY TCPIP commands shown: one to add the VIPA to the OSA and the other to delete it. That's basically the only testing I did. (And the IP address I chose for the VIPA was in the same subnet as the OSA device itself.)
> What should happen is the addresses owned by the failing OSA should be taken over by a surviving OSA, and all addresses should continue to respond to ping (or TN3270).
Which brings me back to my original question: How does one "stop" an OSA? What is the command you use?
To start or stop an OSA (or LCS, for that matter), issue:
    V TCPIP,,STOP,osalink
If you are using interface statements for the OSA, like you are for the VIPA, use the name of the link in the start or stop statement. Adding and deleting the VIPA or OSA works, too!
Here are my definitions:
    INTERFACE VLINK10
      DEFINE VIRTUAL
      IPADDR 192.168.&ip..10
    INTERFACE LNK3000
      DEFINE IPAQENET
      PORTNAME OSA3000            ; MUST MATCH TRLE PORT NAME
      IPADDR 192.168.&IP..12/24   ; INTERFACE IP ADDRESS
      SOURCEVIPAINT VLINK10
    INTERFACE LNK3004
      DEFINE IPAQENET
      PORTNAME OSA3004            ; MUST MATCH TRLE PORT NAME
      IPADDR 192.168.&IP..13/24   ; INTERFACE IP ADDRESS
      SOURCEVIPAINT VLINK10
I'm using a z/OS system symbol for part of the IP address, so I can use the same profile on all my z/OS instances.
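A note on the &IP. notation above: in z/OS system symbol syntax the period after a symbol name ends the name and is consumed during substitution, which is why 192.168.&IP..12 needs two periods to produce a dotted address. The Python sketch below imitates only that period-terminated case, using a hypothetical symbol table with IP set to 20 (an assumption that matches the 192.168.20.x addresses mentioned later in this thread). It is not the real z/OS substitution routine, which has further rules (substrings, symbols without a trailing period, and so on).

```python
import re

# Hypothetical symbol table; on z/OS, IP would be a system symbol
# defined per system image.
SYMBOLS = {"IP": "20"}

# '&', a symbol name, then the terminating period, which is consumed
# by the substitution (so '&IP..12' becomes '20.12').
_SYMBOL_REF = re.compile(r"&([A-Za-z@#$][A-Za-z0-9@#$]*)\.")


def substitute(text: str, symbols: dict) -> str:
    """Replace period-terminated &NAME. references with their values.

    Names are matched exactly as written; unknown symbols are left as-is.
    """
    def repl(match):
        return symbols.get(match.group(1), match.group(0))

    return _SYMBOL_REF.sub(repl, text)


print(substitute("IPADDR 192.168.&IP..12/24", SYMBOLS))  # IPADDR 192.168.20.12/24
print(substitute("IPADDR 192.168.&IP..13/24", SYMBOLS))  # IPADDR 192.168.20.13/24
```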
My OSA delete looks like this, which also works for a VIPA:
    INTERFACE OSA3004 DELETE
My OSA add looks like this, which has the same syntax as in the TCPIP profile:
    INTERFACE LNK3004
      DEFINE IPAQENET
      PORTNAME OSA3004            ; MUST MATCH TRLE PORT NAME
      IPADDR 192.168.&IP..13/24   ; INTERFACE IP ADDRESS
      SOURCEVIPAINT VLINK10
And perhaps more importantly, who is responsible for managing the IP addresses assigned to an OSA?
z/OS? Or the OSA device itself?
The IP addresses assigned to the OSA are loaded into the OSA when the OSA is started in the IP stack, and should be removed when the OSA is stopped or deleted. z/OS should send the commands to the OSA to perform those functions, so z/OS initiates the actions.
That is to say, when you "stop" an OSA (however the heck you do that), or the OSA otherwise "fails" (i.e. stops responding), how does the other OSA magically know that it needs to "take over responsibility for" all of the IP addresses that the failing OSA was originally responsible for? How does it magically know that?
I would have thought that z/OS would have issued the appropriate "Add IP Address" commands to the surviving OSA, which it obviously didn't (since we do not see any Hercules HHC03805I IP Address Registration messages which always occur when z/OS registers a given IP address to an OSA).
So who's at fault here? Where is the bug? Why isn't z/OS registering the IP addresses that it knows were assigned to the OSA that failed to the surviving OSA? If it would do that, then I believe things would work just fine!
I've looked at this a bit, but need to do more research. What I recall is that when an OSA is started (LCS does the same thing, but all the LCS code is in the IP stack, not in the LCS adapter), a local ARP is sent out (it will not propagate to other subnets). If another adapter in the stack 'sees' the ARP, it knows that it can back up the newly started OSA (or LCS) if it fails.
If an adapter that has a backup (on the same LAN or VLAN) fails, the backup will take over ARP responsibility for the failing adapter and send a gratuitous ARP to let the gateway switches know that the MAC address for the failing adapter's IP addresses has changed.
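For readers who, like Fish, are wondering how ARP fits into this: a gratuitous ARP is an ordinary ARP packet (RFC 826 format) in which the sender and target protocol addresses are the same IP address, broadcast so that switches and other hosts update their caches with the new MAC. The Python sketch below builds one such frame as a surviving adapter might announce a taken-over address. The MAC address is a made-up, locally administered value, and nothing here claims to be what z/OS, the OSA, or CTCI-WIN actually transmit.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806


def gratuitous_arp(mac: bytes, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request.

    Sender and target protocol addresses are both `ip`, and the frame is
    broadcast so every station on the LAN can refresh its ARP cache entry
    for `ip` with the new MAC address.
    """
    if len(mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    ip_bytes = socket.inet_aton(ip)
    eth_header = struct.pack("!6s6sH", BROADCAST_MAC, mac, ETHERTYPE_ARP)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # operation: ARP request
        mac,          # sender hardware address (adapter taking over)
        ip_bytes,     # sender protocol address
        b"\x00" * 6,  # target hardware address (unused)
        ip_bytes,     # target protocol address == sender: "gratuitous"
    )
    return eth_header + arp_payload


# Hypothetical MAC for the surviving adapter; 192.168.20.10 stands in for the VIPA.
frame = gratuitous_arp(bytes.fromhex("020000003004"), "192.168.20.10")
print(len(frame))  # 42 (short frames are padded to the Ethernet minimum on the wire)
```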
I know the ARP sent at adapter start time is working because both my OSA adapters are in the same LAN group (which means they can back each other up). Here are the last few lines from a D TCPIP,,N,DEV command:
    LANGROUP: 00001
      NAME      STATUS      ARPOWNER  VIPAOWNER
      ----      ------      --------  ---------
      LNK3000   ACTIVE      LNK3000   YES
      LNK3004   ACTIVE      LNK3004   NO
As you can see, both links are active and LNK3000 has ARP responsibility for any VIPA in the stack, so you are supporting the ARP correctly. If the OSAs were on separate LANs or the ARP wasn't working properly, they would be in separate LAN groups.
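When repeating these stop/start tests, it can help to check the LAN group state mechanically. The following is a minimal Python sketch that assumes the column layout shown in this thread (NAME, STATUS, ARPOWNER, VIPAOWNER, whitespace-separated, with STATUS possibly being the two words NOT ACTIVE); it is not a parser for the full D TCPIP,,N,DEV output, and the type and function names are hypothetical.

```python
from typing import NamedTuple


class LanGroupEntry(NamedTuple):
    name: str
    status: str
    arp_owner: str
    vipa_owner: bool


def parse_lan_group(lines):
    """Parse the NAME/STATUS/ARPOWNER/VIPAOWNER rows of a LAN group summary."""
    entries = []
    for line in lines:
        fields = line.split()
        # Skip the LANGROUP line, the header, the dashed underline, and
        # anything else that is not a data row ending in YES or NO.
        if len(fields) < 4 or fields[-1] not in ("YES", "NO"):
            continue
        name, arp_owner, vipa = fields[0], fields[-2], fields[-1]
        status = " ".join(fields[1:-2])  # "ACTIVE" or "NOT ACTIVE"
        entries.append(LanGroupEntry(name, status, arp_owner, vipa == "YES"))
    return entries


before = """\
LANGROUP: 00001
NAME STATUS ARPOWNER VIPAOWNER
---- ------ -------- ---------
LNK3000 ACTIVE LNK3000 YES
LNK3004 ACTIVE LNK3004 NO
""".splitlines()

for entry in parse_lan_group(before):
    print(entry)
```

Splitting on whitespace rather than fixed columns keeps the sketch tolerant of the alignment differences between console captures.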
If I stop LNK3000, I get the following on the z/OS console:
    V TCPIP,TCPIP,STOP,LNK3000
    EZZ0060I PROCESSING COMMAND: VARY TCPIP,TCPIP,STOP,LNK3000
    EZZ0053I COMMAND VARY STOP COMPLETED SUCCESSFULLY
    EZD0040I INTERFACE LNK3004 HAS TAKEN OVER ARP RESPONSIBILITY FOR
             INACTIVE INTERFACE LNK3000
    EZZ4341I DEACTIVATION COMPLETE FOR INTERFACE LNK3000
And the D TCPIP,,N,DEV output now looks like this:
    IPV4 LAN GROUP SUMMARY
    LANGROUP: 00001
      NAME      STATUS      ARPOWNER  VIPAOWNER
      ----      ------      --------  ---------
      LNK3004   ACTIVE      LNK3004   YES
      LNK3000   NOT ACTIVE  LNK3004   NO
This is just how it should look, so I think z/OS is doing its job, but the IP addresses are not being deleted from the stopped OSA and registered on the surviving OSA. I suspect the commands are being sent to the OSA but are not being acted upon.
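The expected behaviour Bob describes can be summarised as: when an interface in a LAN group stops, its addresses are deregistered from that OSA and registered on the interface that took over ARP responsibility. The Python model below is only a sketch of that expectation, with hypothetical class and function names. It does not represent real QETH command codes, it simply picks the first still-active interface (how z/OS actually chooses is not modelled), and the thread has not yet established which commands z/OS really sends.

```python
from dataclasses import dataclass, field


@dataclass
class Osa:
    name: str
    active: bool = True
    registered: set = field(default_factory=set)  # IP addresses registered on this OSA


def stop_interface(group: dict, stopped: str) -> str:
    """Model the *expected* takeover when one OSA in a LAN group is stopped.

    Returns the name of the interface that takes over the addresses.
    Raises RuntimeError if no active interface remains in the group.
    """
    failing = group[stopped]
    failing.active = False
    survivors = [osa for osa in group.values() if osa.active]
    if not survivors:
        raise RuntimeError("no active interface left in the LAN group")
    new_owner = survivors[0]
    # Step 1: deregister the addresses from the stopped OSA
    # (this part does show up in the Hercules log).
    moved = failing.registered
    failing.registered = set()
    # Step 2: register the same addresses on the surviving OSA
    # (this is the part this issue reports as missing).
    new_owner.registered |= moved
    return new_owner.name


group = {
    "LNK3000": Osa("LNK3000", registered={"192.168.20.12", "192.168.20.10"}),
    "LNK3004": Osa("LNK3004", registered={"192.168.20.13"}),
}
owner = stop_interface(group, "LNK3000")
print(owner, sorted(group[owner].registered))
# LNK3004 ['192.168.20.10', '192.168.20.12', '192.168.20.13']
```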
Hope that helps a little.
…
If z/OS is not responsible for doing that, then I again have to ask, how the frick does the other OSA somehow magically know to take over responsibility for the failing OSA's IP Addresses?!
Thanks in advance for any enlightenment you or anyone else can provide!
|
Bob? FYI: I'd appreciate it very much if you would not respond/reply to GitHub Issues via email. I'd greatly prefer that you instead respond/reply directly via the GitHub Issues web page:
When you reply directly via their web page, I can make minor edits to your reply so it is more readable (prettier) by editing the fonts being used, etc. (just like I did with your original post). When you reply via email however, I cannot edit your reply, so oftentimes it is much harder (more difficult) to read. It's up to you whether or not you want to take the time to reply via their web page or continue to reply via email, but I'd rather that you reply directly via their web page. Thanks! |
No problem responding via GitHub. I'm still learning the proper protocol here. BTW, I see in the Hercules log that the IP address is being deregistered on the OSA I stopped, but it is not being registered on the surviving OSA, so things are mostly working. |
Thanks.
Thank you for all of that too. I'll try to update my TCPIP PROFILE to match yours so I can hopefully (maybe) reproduce your same set of tests.
I doubt that very much! Just because an ARP (whether normal or gratuitous or even reverse) is sent out, doesn't give any other device on the network permission to "take over" for that IP! (Besides, how could it reliably know when to "take over"?!) No, I rather suspect that instead, some type of control command (perhaps one we're not supporting but I don't know that yet) is sent to the OSA by z/OS to specifically ask it to "take over" for a given IP.
You're probably right. I would have expected the commands to be "delete ip" and "add ip" (which we do currently support, obviously), but apparently the commands are some different type (or new type) of command that, you are correct, we are either currently not supporting or not supporting properly. I will need @mcisho Ian Shorter's help with doing some QETH driver packet tracing to see if we can maybe discover what those new control commands are (or where we're going wrong with our current set of commands). Bottom line: there is obviously still much work to be done in order to support multiple IP addresses, both in my CTCI-WIN product as well as in (and probably especially in!) Hercules too. |
Right. I think that's the root of our problem. z/OS is either sending a new type of control command that we're not supporting, or we're not handling an existing control command properly. More testing/tracing needs to be done. |
I'm not sure I stated the purpose of the ARP at OSA startup correctly, but as you can see from the I'd be happy to do any tracing or other troubleshooting you need. For your reading enjoyment, I'll attach my full TCPIP profile. There is other fun stuff in there, too. It happens to be for LCS because all the functions in the profile work under Linux. Note that port 3023 has one function that I'd like to see work under Windows. Port 3023's IP address is dynamically created when TN3270A is started, and it can be started on multiple LPARs at the same time in a sysplex environment. The sysplex will determine which LPAR creates the IP address, and if that LPAR fails, another LPAR will take over the address. TN3270 sessions are then distributed (round robin in my profile) to LPARs listed in the
|
I have disabled IP routing on Windows, which changed my ping "expired in transit" errors into timeouts. At least the transit problem is fixed! Before disabling:
After:
|
Please see my response in Issue #204 (comment). You need to add a route for 10.0.0.2 to your network router so that it knows to send packets for 10.0.0.2 to your Hercules z/OS guest instead (i.e. to either 192.168.20.12 or 192.168.20.13, whichever you prefer). What's happening is, 10.0.0.2 is not within the same subnet as your Windows host (nor as your Hercules z/OS guest either for that matter), so the packet is being sent to the default gateway (i.e. to your network router). Your router is then sending the packets to ............... who knows where! But wherever it is sending them to, it's obviously not sending them to the same network segment that your Windows host or Hercules z/OS guest is on. That is to say, neither CTCI-WIN (which can only see packets that your Windows host sees) nor your Hercules z/OS guest is ever seeing any of the ping packets because they're being misrouted by your router to somewhere where you obviously don't want them (or need them) to go. By adding a route to your network router telling it to send (forward?) all packets destined to 10.0.0.2 to one of your Hercules z/OS guest's IP addresses, the pings should then be seen and properly replied to. Try that and let me know whether it resolves this issue or not. I suspect it will. |
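The routing argument above boils down to a longest-prefix-match lookup, which can be tried out with Python's standard ipaddress module. This sketch assumes the Windows host and the z/OS guest share 192.168.20.0/24 and that the router sits at 192.168.20.1 (a hypothetical address, not one given in this issue). The same lookup is what the router itself would perform once the suggested route for 10.0.0.2 is added there.

```python
import ipaddress


def next_hop(dest, routes):
    """Pick the next hop for `dest` by longest-prefix match.

    `routes` is a list of (network, gateway) pairs. A gateway of None means
    the network is directly connected, so the packet goes straight to `dest`.
    """
    addr = ipaddress.ip_address(dest)
    candidates = [
        (ipaddress.ip_network(net), gw)
        for net, gw in routes
        if addr in ipaddress.ip_network(net)
    ]
    if not candidates:
        raise LookupError(f"no route to {dest}")
    # The most specific matching route (longest prefix) wins.
    _, gateway = max(candidates, key=lambda c: c[0].prefixlen)
    return dest if gateway is None else gateway


# Assumed routing table: local /24 plus a default route via a hypothetical router.
routes = [("192.168.20.0/24", None), ("0.0.0.0/0", "192.168.20.1")]
print(next_hop("192.168.20.12", routes))  # 192.168.20.12 (same subnet, delivered directly)
print(next_hop("10.0.0.2", routes))       # 192.168.20.1 (off-subnet, default gateway)

# With a host route for 10.0.0.2 pointing at the z/OS guest, as suggested above:
routes.append(("10.0.0.2/32", "192.168.20.12"))
print(next_hop("10.0.0.2", routes))       # 192.168.20.12
```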
(Oops!) I have removed the "Invalid" and "Waiting to close" labels and have re-added the "Bug" and "Researching" labels because the original issue (problem) has not been resolved yet! The original problem was that when the first OSA (which was the VIPA owner) was removed (deleted), the VIPA IP address was not automatically moved over to the surviving second OSA like Bob claims it automatically should be. We know the cause but not the resolution yet. The cause is that when the first OSA ("tun0") is deleted, Hercules receives a As I stated earlier, I would have expected z/OS to have followed the So we're missing some key part of the puzzle somewhere. Either there's a command packet that z/OS is sending us that we're not processing or else there's some additional information being passed in the (Or .... Something else entirely different (unknown) is going on.) However, the bottom line is: IF a "Stop LAN" command packet is supposed to somehow automatically transfer the VIPA IP address over to the surviving OSA (which I'm not convinced of yet), then we need to figure out how to detect that situation and how to handle it (as well as how to magically know which OSA to transfer the VIPA IP address to!). More work (more research) obviously still needs to be done for this issue. (@mcisho Ian Shorter? I might need your help with this one, buddy!) |
Sorry about my wording. I did a git clone, which I called a download by mistake. I have not downloaded the zip file. I just git cloned into a directory called Then I deleted all the files from my I missed copying the message numbers. Here is a better copy from the Hercules log screen (don't know the official name).
|
Here you go. |
Good.
I presume you're using the command-line (Command Prompt), yes? What git client are you using? What was the command that you entered? Did you also ensure that the directory did not already exist before entering your git clone command? (So that the git clone could then create that directory fresh, from scratch?)
The official name, I guess, is the "panel" screen. It would be better, however, to copy messages from the Hercules logfile, not from the panel screen. When you start Hercules, do you specify a logfile? That is to say, when you start Hercules, you first open a Command Prompt window, yes? And then from there, you enter the command to start Hercules. What does the command that you enter look like? Does it look like:
Or does it look like:
The second format (with the This is important since some of the messages that Hercules issues can be much longer (wider) than your panel window. Thus what you see on the screen might not be the full text of the message, whereas the messages written to the logfile are always the full text of the message. I hope you already knew that and I apologize if you did. |
Okay, I see a problem right away! For some (as-yet-unknown) reason it appears the To verify, please do the following and then post the results:
I need to see the output of both commands. Thanks. |
(Oops!) (I forgot the most important step!) After step 3 (enter _dynamic_version command), you need to:
That is the command that will only display a couple of lines. The _dynamic_version command itself does not display anything at all. It runs silently. BUT... It defines (sets) the all-important Sorry about that. |
Here is what I get. I suspect the set ver is wrong. I did forget to clear hyperiongit the second time I did a git clone, but the first time it was empty.
    C:\hyperiongit>dir /b /ad
    C:\hyperiongit>_dynamic_version.cmd
    C:\hyperiongit>
    C:\hyperiongit>
No, I haven't been using the log file, but will change my batch file ASAP to do so. I will also delete the hyperiongit directory and try again. This is my git:
    C:\Users\HP\Documents\GitHub> git clone https://github.com/SDL-Hercules-390/hyperion.git c:\hyperiongit
|
Just tried it with new hyperiongit directory - same results. |
While the way you are doing it is fine (it will work), the way you're supposed to do a
For example:
That's the way it's normally done. But as I said, the way you're doing it should also work just fine. I'm just passing on some helpful information here that allows you to create a git clone wherever you want (instead of letting it default to some weird directory name in the root of your drive). |
Do this:
Attach the Thanks. |
Now I'm even more confused. I started Hercules:
IPLed the guest OS and here is a dir of the logfile:
It is in fact empty. I did a Not too worried about it now, but puzzling. |
Did you do the Exit completely from Hercules first, and then do your dir and you will then see that it now actually has some data in it (and you will now be able to edit the file too, whereas before, while Hercules was still up and running, you couldn't). |
The I have since committed a fix(*) for this, so please re-clone Hercules and try the above _dynamic_version.cmd test again (i.e. set traceon=1, set debug=1, _dynamic_version.cmd > dyn.txt 2>&1) and attach the file to your GitHub comment. Thanks.
(*) The fix is for the early debug exit, not for whatever is causing it to fail for you. I still don't know why it is failing for you. That's why I need to see the dyn.txt debugging output with trace/debug enabled. It will hopefully tell me where your _dynamic_version script is taking its wrong turn. |
I looked at the log file both during and after Hercules was running. Just for kicks, I'm going to clone Hercules with linux and copy the source to Windows, then build Hercules in both environments from the same source just to see if the version line is different between the two. I've always gotten the long character string under linux. OK that doesn't work. git must build folder names when cloning. |
Please re-clone and do the _dynamic_version test again! |
What are you using for git? My Powershell git seems to be really old. Deprecated, in fact:
|
Thank you. Looking at the output it appears "git.exe" is nowhere to be found anywhere in your PATH. That is the cause for the default non-git Hercules VERSION string that you're getting.
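The failure mode diagnosed here (no git.exe on the PATH, so the build falls back to the default non-git VERSION string) is easy to check for before building. A minimal Python sketch follows. shutil.which and subprocess.run are real standard-library calls, but the fallback string and the use of `git describe` are illustrative assumptions, not a description of what _dynamic_version.cmd actually runs.

```python
import shutil
import subprocess


def version_string(default="unknown (no git)", which=shutil.which):
    """Return a git-derived version if git is on the PATH, else `default`.

    `which` is injectable so the no-git case can be tested.
    """
    git = which("git")
    if git is None:
        # The situation diagnosed in this issue: git.exe not found on PATH.
        return default
    try:
        result = subprocess.run(
            [git, "describe", "--tags", "--always"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        # e.g. not inside a git work tree, or git could not be executed
        return default
    return result.stdout.strip()


print(version_string())
```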
I was about to ask you the same thing!
Powershell git? ((groan))
That's your problem: the only version of git that you have is a Powershell version of git, not a standard command-line version. And not only that, it's nowhere in your Windows PATH either. You need to install a quality git client. Do that, and I'm sure things will work much better for you. I'm using both Git for Windows as well as TortoiseGit, but that's only because, as a Hercules developer, I use git a LOT. It's up to you (optional) whether you want to also install TortoiseGit or not, but I would highly recommend installing at least Git for Windows. (and getting rid of that weird posh-git that you have) So try this:
For someone like you, I would in fact recommend not installing TortoiseGit. Doing so would be vast overkill for someone like you and would probably only confuse you further. To keep things simple, I would recommend installing only Git for Windows. Once you install Git for Windows your Hercules VERSION should then be correct. |
OK! I installed git for Windows and now have the proper version info!
Sorry about all the fuss. I didn't know the version of git would have an effect as long as everything (or so I thought!) downloaded correctly. Will check the cross-subnet ping later today. |
And when you do, please remember to post your comment to Issue #204. |
Ages ago you said:
I very much doubt the former, but the latter is quite possible. However, as we don't know what commands are, or are not, being sent, it's a moot point. |
Just checking for a status update: Is this issue still a problem? |
There are multiple problems associated with having more than one OSA defined. I think most, if not all, are related to the lack of ARP support in QETH and to CTCI-WIN (I think) sending gratuitous ARPs for all registered addresses on all OSAs, even though those addresses are registered to every OSA. I have no problem calling this a restriction until a possible future enhancement to QETH includes proper ARP support. |
I am going to close this GitHub Issue at this time due to it getting way off track from the original reported problem and would like to request that you create a brand new GitHub Issue describing your perceived unresolved problem in more specific detail so we can look into it (and pray that we can keep on track in that new issue!). Thanks. |
Environment is Windows 7, Hercules with multiple IP support and CTCI-WIN 3.7
I've added a second OSA and a same-subnet VIPA to my config and I can ping all three addresses. However when I stop the VIPAOWNER OSA, I'm getting mixed results. I can still ping the stopped OSA and sometimes the VIPA responds to ping, but a TN3270 session to those two addresses fails (the ping responses turn to timeouts on those two addresses when TN3270 is trying to connect).
The VIPA TAKEOVER function worked according to the z/OS console, but I did not see the address of the VIPA being deregistered on the stopped OSA and then registered on the running OSA on the Hercules console.
What should happen is the addresses owned by the failing OSA should be taken over by a surviving OSA, and all addresses should continue to respond to ping (or TN3270).
I don't think it matters, but I am using INTERFACE definitions instead of DEVICE, LINK and HOME.
NOTE: This issue is closely related to Issue #204.