-
Notifications
You must be signed in to change notification settings - Fork 836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unable to build openembedded system LuneOS, error in packaging linux-libc-headers #2665
Comments
Hey Eric. Thanks for filing, tho' in future, please do complete the issue template - your issue above omits requested info:
Please also attach the failure logfile noted at the end of your error dump above. And since this looks like a permissions issue, please also include a long file listing (inc. permissions) of the offending Thanks - appreciate your help diagnosing this issue. |
OK, so, it's not able to create the last bit there -- .../sstate-build-package//package. Paste of results of ls -al coming ls -al output for requested files/dirs:
|
Attached zip of strace. Please let me know if I can be of any further assistance. |
aka
aka |
Thanks for the super quick find, @therealkenc . So that seems to indicate that it's not something that will be going away anytime in the near future? So, I guess, there's a question involved -- what can be done to get around the problem? Clearly this is happening to a number of other things, so it seems like having a general idea for how to change some things to not encounter this problem would be a good course of action, and I'd be happy to help in any way I can. (i wonder if my idea of getting loopback device + ext fs would be a way to get 100% fs compatibility...) |
I've been thinking about that too since I posted #1529 and this turned out to be the cause of my #640 grief. Seems to me that with effort some cases (perhaps Other than fd caching scenarios like #1529, I speculate (without any evidence) that the handles sometimes aren't being held open for any particular reason. It is just that no one even thinks about it on Real Linux because it isn't a problem you'd worry about. Maybe they're even benign handle leaks. A strategic Or might not. It might take a wholesale refactoring, or be (practically speaking) impossible because the fd is open for a good reason. I have not looked at your specific fail closely. |
Also before you do anything else, run the thing as root just to make absolutely sure we're talking about the open directory handle problem and not something else. [Can't be too careful.] |
I'm not at all familiar with the python code that the problem is occurring in, though the number of times I see "python" in the issues that you've linked does make me wonder if there's something weird in python land. But I'll have a look at it and see if I see anything obvious as to if it might be unintentionally holding a file open before doing a move. Or is it a different process entirely that is holding a file inside the directory open, then python tries to move it? There shouldn't be other things occurring in there, but perhaps something is being a little sticky -- like in releases of Win8 when you'd have to wait minutes sometimes for files to be unlocked after a program had opened them, before you could delete. I'm pretty sure OE will absolutely 100% refuse to run as root, but I'll try it tomorrow anyway. At least knowing the root cause of the problem (file somewhere in tree held open while a rename/move occurs) might allow me to cast a critical eye on it and find something. |
My yocto rocko build in WSL was also hit by this bug. Would appreciate suggestion on how to get around it. I tried run as root (by disable sanity in sanity.conf) and still got hit with the bug. |
That's basically the litmus test. If you get |
hey @derekyu56 are you having the same problem in the same package? others are reporting that different kinds of builds (different distros, i don't recall which ones i read, but there were some people on the OE mailing list talking about it) are working, or different kinds of errors, it seems .. but i think it might be something specific to the linux-libc-headers recipe, or it's build/package process . . . i still haven't had time to go and open it up and check to see if there's anything obvious. |
Yes, It is the same package with a different version: |
Thanks @derekyu56 gives me some insight that it might be package specific. I have taken a very brief look at the python script, and I didn't see anything that just stuck out at me like a sore thumb -- I'm starting to suspect that Linux Python in general can be problematic in WSL, based on the number of tickets that therealkenc mentioned that also mention Python in them. I just had a thought -- I wonder if the script's current working directory happens to be the directory that it's trying to move. Anyway, that's just speculation without real data to go on. What I think I want to do next, when I have time, unless someone else gets to it before me (which is entirely possible, given my real serious lack of time to do personal projects right now) .. in local.conf:
then do a strace bitbake linux-libc-headers, and then go analyzing it to see if i can see what is being held open. I think first, though, i'll go and print the CWD at the point of failure. |
FWIW, i've tracked down the problem a little bit when running on a Windows drive -- a process called "Cooker" (according to 'top') spins up the CPU to max, and just hangs there like that until terminated (i let it run for 4 hours, normally it doesn't stick there for more than a few seconds on a real Linux machine, or when run from the WSL root drive) I'm also trying to put a little time into figuring out the problem with linux-libc-headers, and ran into another problem that seems to relate to fork() not working exactly like it's expected to -- the "db-native" package (not sure if that's included in the distro that others are using) fails when configuring, with "configure: fork: Invalid argument" followed by "wait_for: No record of process 2857". I saw some traffic on the oe-dev mailing list suggesting that fork issues in WSL would tank efforts to make it work, but they weren't specific as to what the issues might be. |
Use the bitbake -k option to continue after error, linux-libc-headers-4.12-r0 is not the only one that has this problem. Found at least two other packages have the same problem: ERROR: update-rc.d-0.7-r5 do_package: Error executing a python function in exec_python_func() autogenerated: This is not a package specific problem. |
So.. weirdly enough -- seems like it's caused in bitbake by some sort of a race condition. I have no idea why this fixed it, or why it appears to have fixed the problems with db-native and fork.. but .. I am currently running the build greatly overloading the normal number of tasks. I wanted to do that to see if it would crank up all the cores on my CPU, because using the regular number didn't. When I more than doubled the amount of operations going on, I've managed to get through more than double the build that I had previously, and it's still going. Either that, or those packages failed an hour ago, but it's still carrying on with other tasks. :-) I added this to my local.conf
The first time I ran it, it aborted sort of early in the process, then I ran it again, and it seems to be ripping right through both the spot it crashed on earlier (db-native), and linux-libc-headers, and a few other packages that i thought had failed on me in the past, too. It's not a great solution, but at least it gets farther into finding any more potential bugs, or having a built system. |
Try going the other way.
That is going to be slow as #$@&%* but might work. Possibly (probably?) one thread has not had a chance to close their handles before a different one tries to rename the directory. Why would increasing the count help? That's a pretty good question 🤷. Maybe more threads just means a higher chance of getting lucky for obscure scheduling reasons. Maybe single threaded fails all the same, but we won't know until someone tries. WSL is pretty rock solid in terms of syscall level concurrency, for what it is worth. The whole edifice would collapse if it wasn't. With the usual caveat that anything is possible per Murphy. |
I have a working theory that I don't know how to specifically test, that a previous step in the build process for that package has locked a file, then released it, but that NTFS or WSL or Bitbake or someone else somewhere in between isn't releasing it all the way.... and that increasing the task concurrency causes the amount of time in between the individual steps to increase, giving whatever has that file open/locked time to release it all the way. Unfortunately, I got a big ol' BSOD around the time I reached 2000 tasks completed in the build (out of 7000.. sigh) and now I get to clean up all the pieces of packages that are halfway through some step, and don't know how to recover. I had intended on going down to 1 thread once I had all the dependencies built to get to the linux-libc-headers package.. and much to my surprise, it blew right through. @derekyu56 thanks for the -k switch, didn't know about that. I wonder if there's anything common to those two packages and linux-libc-headers. My guess is that update-rc.d and linux-libc-headers probably don't have any sort of compilation step, and probably go straight from fetch to unpack to install then to package. |
So, running it with -k, everything passed until it hit towards the very end, when a ton of LuneOS specific packages hit, that were all dependent on the rest of the system. My suspicion about packages that don't include any involved build step seems to be right on -- 48 tasks failed, 47 of them with the exact same error in do_package. Every single one of those 47 packages appear to be files that are installed/packaged directly after the fetch step, with little to no handling in an intermediate configure/compile phase. The 48th was a compile error in gcc-cross-initial. Sorry for spamming people's inboxes on a closed topic -- if there's a better forum to discuss this, I'd take it there. It's clearly OE related, but only in WSL . . and not clear yet that a workaround / fix would be appropriate for OE. I see also that it didn't blow by linux-libc-headers .. it for some reason executed it towards the end of the build, instead of at the 900 package mark like usual. Possibly due to the amount of concurrency I enabled. ... before i lose the terminal output...
|
Anything possible, but I am gonna say 'nah'. File operations (including Here's a work-around that might be worth trying. Track down that line in the Python script quoted in the OP:
Guard the |
Looks like the solution is right in the python manual (https://docs.python.org/3/library/os.html): quote I replaced line 359 and line 620 os.rename() with os.replace() in file meta/classes/sstate.bbclass and the linux-libc-headers-4.12-r0 package passed the build. There are two other os.rename() in the same file. I did not change all of them These simple changes triggered a total rebuild. It really sucks. |
Sorry, my computer was too slow. It didn't fix the problem. the package showed later and os.replace() still generate EACCES error. |
Attached is a strace of mostly just the failing packaging process of a single package, after I had modified the script to retry if it failed, so there's a couple hundred retries after the initial failure. Retrying did not fix it. I tried to match up all the file opens to file closes, and I didn't see any mismatches.. i also checked the cwd of the script, and that wasn't the culprit either. I just don't have any idea what is holding which file open that would prevent the os.rename from succeeding. In this case, it's /home/eric/luneos/webos-ports-env/webos-ports/tmp-glibc/work/all-webos-linux/luna-webkit-api/2.0.0-1+gitAUTOINC+d0b393ac0c-r0/package that is failing to move to /home/eric/luneos/webos-ports-env/webos-ports/tmp-glibc/work/all-webos-linux/luna-webkit-api/2.0.0-1+gitAUTOINC+d0b393ac0c-r0/sstate-build-package//package |
Reopening if there's no positive evidence of an open handle. Thanks for digging. |
Thanks for taking the time to capture & share the trace @ericblade. Assigning to @scooley @SvenGroot @sunilmut for triage. |
Curiosity killed the cat so I looked.
FD 21 is never closed before the I don't know if I want to be so declarative as to dupe and close this issue since it has already been closed and re-opened once. But I'm about 98% sure directory |
Maybe (spitballing because I haven't thought about it hard enough) |
ok, i think i want to know how you figured that out .. and then i think i don't want to know because it's probably crazy. :-) so, just a few lines under the line you point out (line 612749)
There is (line 612771)
According to stackoverflow, this sets a "close-on-exec" flag, which causes the file descriptor to "be automatically (and atomically) closed when any of the exec family functions succeed". Then the rename fails at line 714814. So, 21 is a duplicate of fd 6, line 612747?
6 appears to be line 612733?
So.. it looks like that is a handle to ....-r0/psuedo/ not -r0/package/ ? Is CLOEXEC not working or not implemented? Are no exec() calls being executed between here and there? Are the exec() calls not entirely implemented? (edit: i had originally pinpointed FD 6 to a different file entirely, changed the last of this text quite a bit after I realized I needed to be at a different fd 6) |
Gold star for tenacity on this one 🌟. I think you might be onto something. I saw the Line 685930:
Is there any chance you can test this on Insider 17046? It could be that
Gotta say, if that does turn out to be the problem, getting from a |
Couple of final data points for the night. The good news is The bad news in doing the cursory test case, it looks like my first impression about that [edit Below is total nonsense because I failed to register that there is no |
Can you tell me if I was incorrect about the file handle appearing to point to a different directory than the one that it is failing to move? I don't currently have a machine on Insiders (the one that was got bit by the "does not boot on AMD devices" bug a few releases back, and had to be restored, and is now in use by someone else :| ) but I can probably get there. The race-condition idea turned out to be a non-starter -- because I had enabled such high amounts of parallelization, the failing tasks got bumped far to the back of the queue, instead of happening in the normal order that they go in, where you usually see all of the tasks for a specific package complete at once -- it was trying to do as much fetching and building as it could, and saving the packaging steps for later, or something like that. On the bright side, I did just remember that it should be possible to just run the failing step on it's own, and maybe I can strace from that, saving hundreds of thousands of lines maybe. |
@therealkenc Only CLONE_VFORK would cause 31182 to wait for 31196 to exit, and that wasn't included. RCU supports the included set of clone flags perfectly (CLONE_VM doesn't work by itself, but CLONE_VM | CLONE_THREAD works fine). |
@jstarks
You can pull out the processes with grep. |
So.. should I try on the Insider build? I'm still slightly confused as to the exact source of the problem -- i'm trying to find out which file is open, so that I might have a hope of working around the problem somewhere in one of the bitbake classes or recipe, now that i've narrowed it down to a common set of recipes that all have similar characteristics, there is probably? something that should be closing the open handle, but that isn't happening because of something in common among these recipes. That's why I was trying to track down which file is actually open, which lead me to the handle at -r0/psuedo instead of -r0/package .. which must be incorrect, though, since the open handle that is blocking the rename would have to be in the directory tree being moved, and -r0/psuedo is not in -r0/package |
[edit with re-think]
Personally I wouldn't. #1878 is not going to change anything.
Okay I see what you mean now. You could put a big sleep in the Python code after catching the error, and then look at |
well, i'm interested in putting the effort in to figure it out, so i can either post a workaround (fixing the recipes), or we can absolutely determine that there's not another bug hiding. My Linux builder is no longer supported in real Linux, so I'm itching to get rid of it :-) Where do I look in /proc ? and I'd guess I probably need to compare that to another strace run to see what the handles actually point to? |
[edit] There's gotta be an idiomatic way of finding the open file handles in Python too, but I wouldn't know. |
Duping into #1529. If 1529 flips status and building LuneOS still fails, open a new issue to that effect and we can take another run at it. |
I guess I come back to this about once a year, give or take. :-D I wanted to see if newest Windows release (1809) had made any headway on this, so I dug it out again. Stops in exactly the same spots, same packages (linux-libc-headers and base-files). SO, I finally opened up the code that was failing, and swapped out the python "os.rename" call to a "shutil.copy" call, and it has moved from being about 5% successful before failure, to about 15%, and still running without a failure, so far. IF I don't run out of disk space (i haven't attempted any of the hacks to move WSL to non-boot-disk yet), it's looking good. . . but will take several hours to complete. (in WSL it's moving at a much slower clip than my native Linux box, which is slower, has less memory, and only has a spinning disk.... but it's probably due to the sheer volume of disk access, which we know is a bit slow in WSL) |
Nah, nothing new on the underlying problem. As a time sink you might take a try at running the LuneOS build under the recently posted fuzzyTew/nodejswsl. His work-around is aimed at NodeJS scenarios, but assuming your open handle is transient (and I think it might be) it has a better than nothing chance of working for you. The hook is doing a sleep/retry along the lines of what I was suggesting back in November 2017, only lower down so you don't actually have to patch the Python code. |
Your Windows build number: Microsoft Windows [Version 10.0.16299.19]
(WSL Ubuntu fresh install from store)
What you're doing and what's happening:
Build instructions for the LuneOS system are here: http://webos-ports.org/wiki/Build_for_Tenderloin
Error text:
This seems to be some kind of permissions error, but I'm not sure what kind of permissions error. This does not occur when following identical instructions on an actual Ubuntu system running 16.04.
This occurs a couple of hours into a system build on my laptop, approximately task #478.
I am not intimately familiar with the OE build system, so advice for tracking down any further specifics would be helpful. I did notice the double "/" on the pathname at the end, the last time I tried to run this, and I attempted to fix that, but it had no effect on the overall outcome. I also tested that double // in paths work correctly, normally, and they seem to.
Building on DrvFS results in a hung process very early in the build process, not sure how to diagnose.
What should be happening:
it shouldn't error here, and should complete the OS build.
Strace of the failing command: I've got a 290MB strace, I could use some direction on how to trim it down specifically to an area that is relevant to the problem
The text was updated successfully, but these errors were encountered: