**** BEGIN LOGGING AT Thu Aug 15 02:59:58 2019 Aug 15 08:24:38 Can a recipe be made to build multiple images? e.g. bitbake multi-image Aug 15 08:26:00 where multi-image.bb instructs to build 3 or more images Aug 15 08:28:46 qt-x: yes, just have them as dependencies: do_sometask[depends] = "image1:do_image_complete image2:do_image_complete image3:do_image_complete" Aug 15 08:30:13 RP: Regarding that ResourceWarning I mentioned yesterday, I am building without the hash server enabled. Aug 15 08:30:14 awesome thanks RP Aug 15 08:49:44 Hi everyone Aug 15 08:50:24 I was wondering why, when an initramfs is built and bundled with bitbake, the kernel with the bundled initramfs is not actually packaged Aug 15 08:50:35 and only left lying around in the build tree Aug 15 08:56:05 New news from stackoverflow: /dev/fd/ socket or pipe links fail, NOT missing /dev/fd link Aug 15 09:11:47 In my case Image.gz.initramfs is created but only Image.gz is packaged Aug 15 09:12:23 This is inside kernel.bbclass Aug 15 11:25:15 LetoThe2nd: https://www.youtube.com/watch?v=Itn_at7kfVw Aug 15 11:33:38 scary thing to see in the morning! Aug 15 12:28:43 Does anyone have any input regarding the question I asked earlier? Aug 15 12:29:38 asabil: initramfs depends on a lot of different things, some images include them, some don't. It could also vary depending on the target architecture, kernel and machine. Basically it depends on a lot of different things Aug 15 12:30:05 RP: yes I know, my problem is with kernel.bbclass Aug 15 12:30:32 it generates .initramfs and then forgets about it Aug 15 12:30:54 it feels like a bug to me, but I am not sure Aug 15 12:31:14 https://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/classes/kernel.bbclass#n256 Aug 15 12:32:04 How to override a default Kconfig value using a meta layer? Aug 15 12:32:39 hello Aug 15 12:32:54 I have a vendor SDK which has a heavily patched u-boot + kernel + rootfs Aug 15 12:33:04 and then this is where it specifies the Package files https://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/classes/kernel.bbclass#n93 Aug 15 12:33:07 I plan to use u-boot and kernel and move building the rest to yocto Aug 15 12:33:14 project using the 4.9 linaro toolchain Aug 15 12:33:25 my target is poky rocko Aug 15 12:33:33 nothing ever refers to $imageType.initramfs Aug 15 12:34:03 but I'm having some issues like: ERROR: gmp-6.1.2-r0 do_package_qa: QA Issue: libgmpxx rdepends on external-linaro-toolchain-dbg [debug-deps] Aug 15 12:34:03 ERROR: gmp-6.1.2-r0 do_package_qa: QA Issue: /usr/lib/libgmpxx.so.4.5.2 contained in package libgmpxx requires libstdc++.so.6(CXXABI_1.3), but no providers found in RDEPENDS_libgmpxx? [file-rdeps] Aug 15 12:34:03 ERROR: gmp-6.1.2-r0 do_package_qa: QA Issue: /usr/lib/libgmpxx.so.4.5.2 contained in package libgmpxx requires libstdc++.so.6(CXXABI_ARM_1.3.3), but no providers found in RDEPENDS_libgmpxx? [file-rdeps] Aug 15 12:34:03 ERROR: gmp-6.1.2-r0 do_package_qa: QA Issue: /usr/lib/libgmpxx.so.4.5.2 contained in package libgmpxx requires libstdc++.so.6(GLIBCXX_3.4), but no providers found in RDEPENDS_libgmpxx? [file-rdeps] Aug 15 12:34:06 ERROR: gmp-6.1.2-r0 do_package_qa: QA Issue: /usr/lib/libgmpxx.so.4.5.2 contained in package libgmpxx requires libstdc++.so.6(GLIBCXX_3.4.11), but no providers found in RDEPENDS_libgmpxx? [file-rdeps] Aug 15 12:34:09 ERROR: gmp-6.1.2-r0 do_package_qa: QA Issue: /usr/lib/libgmpxx.so.4.5.2 contained in package libgmpxx requires libstdc++.so.6, but no providers found in RDEPENDS_libgmpxx? [file-rdeps]
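A minimal sketch of the wrapper recipe suggested above for "bitbake multi-image", assuming three hypothetical image recipes named image1, image2 and image3; only the do_image_complete task and the varflag-dependency idea come from the answer in the log, everything else here is illustrative:

    # multi-image.bb -- a recipe whose only job is to pull in several image builds
    # so that "bitbake multi-image" builds them all.
    SUMMARY = "Wrapper target that builds several images in one invocation"
    LICENSE = "MIT"
    LIC_FILES_CHKSUM = "file://${COMMON_LICENSE_DIR}/MIT;md5=0835ade698e0bcf8506ecda2f7b4f302"

    # Nothing to fetch or compile in this recipe itself.
    INHIBIT_DEFAULT_DEPS = "1"

    # "recipe:task" pairs; image1/image2/image3 stand in for real image recipes.
    do_build[depends] += "image1:do_image_complete image2:do_image_complete image3:do_image_complete"

Listing the images directly on the command line (bitbake image1 image2 image3) does the same job; the wrapper only helps when a single target name is wanted.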
Aug 15 12:34:18 is it even a good idea to try it like that? Aug 15 12:38:37 opennandra: the shlibs automatic dependency code is confused as it can't figure out what should be providing libstdc++ Aug 15 12:38:53 opennandra: a question for the supplier of this external toolchain Aug 15 12:39:06 RP: looks like it's a QA issue so can I just suppress it and try? Aug 15 12:39:14 does this thing even make sense? Aug 15 12:46:59 RP: I have some more information about the glibc-locale pseudo issue, but I still don't really understand the nature of the problem yet. Aug 15 12:47:22 It took 11 hours to get an instrumented pseudo to reproduce the problem. It just failed a bit ago. Aug 15 12:47:30 https://pastebin.com/yHc0HSme Aug 15 12:47:52 I don't know how much you looked at it, but there is a filtered log with the important bits stripped out. Aug 15 12:49:32 I am trying to understand the sequence of events that lead up to pseudo deciding the uid is wrong. Aug 15 13:12:45 opennandra: It's a warning that something is seriously wrong with your build. Sure you can turn it off but that doesn't fix it. Aug 15 13:13:08 jwessel: interesting. Let me see if I can page in from swap :) Aug 15 13:13:10 RP: ok, what is the suggested way then? Use some older yocto release? Aug 15 13:13:21 RP: I need to build stuff with an external toolchain Aug 15 13:17:46 opennandra: Fix the external toolchain, or ask the supplier of the external toolchain why it's broken. I know nothing about it so I can't really help. I know what that error means but I don't know what the correct fix is or anything about the toolchain. I do know that turning off the check will just make it fail later Aug 15 13:18:03 jwessel: the other interesting thing is "mode 100600" - where did that come from... Aug 15 13:18:33 RP: ok thanks a lot, I'll ask on the mailing list then Aug 15 13:19:17 jwessel: I've always thought that there was stale data in pseudo's db that by chance happens to corrupt a new file Aug 15 13:19:37 jwessel: If I was right about that, the question is where did the bad info come from (stale inode?) Aug 15 13:20:02 jwessel: how complete are your logs? can you tell if that inode has any previous history with an unrelated file? Aug 15 13:22:10 RP: Is there some way to tell bitbake to copy files from SSTATE_MIRRORS rather than creating symbolic links to them? In our case we have the global sstate cache on an NFS mount and I would prefer to copy the files to the local sstate cache rather than having to retrieve them via NFS each time. Aug 15 13:31:17 Saur: you'd have to tweak the fetcher code afaik Aug 15 13:33:02 RP: Ok. Would it be acceptable to add a way to force copying instead of linking? Either globally, or perhaps some way to do it per URL in SSTATE_MIRRORS? Aug 15 13:33:44 Saur: My reservation is just about more codepath combinations in the fetcher code :( Aug 15 13:34:05 per url in sstate_mirrors sounds like some kind of nightmare Aug 15 13:34:39 Probably not too easy to add either. Globally should probably be a lot easier. Aug 15 13:35:04 Ok, I'll have a look at the code and see what it would involve. Aug 15 13:36:34 Saur: The trouble is each time you add a binary "yes/no" decision for a feature like that into the fetcher, it doubles our test matrix Aug 15 13:37:02 Given the people we have maintaining it (or not), I'm rather averse to such controls Aug 15 13:38:43 RP: Yeah, I know. At the same time having the symbolic links to an NFS mount is less than optimal.
Especially when there is network failure and the NFS mount is gone for a day due to IT not being able to get the network working (yes, we had that the other day) :P Aug 15 13:39:29 Saur: You might be able to expose it over HTTP also? Aug 15 13:40:17 JPEW: Yeah, that is an alternative, but then it becomes a matter of authentication... With the NFS mount that is taken care of by who can mount it... Aug 15 13:40:37 Saur: Ah, ya that gets a little tricky Aug 15 13:42:05 JPEW: btw, I found another data corruption bug in runqueue. Hoping this explains a few weird things! Aug 15 13:42:36 RP: Cool. Did you get any stats from the AB? Aug 15 13:42:42 Saur: Its a tricky one. Once I accept such a patch we're stuck trying to maintain that API effectively indefinitely though :/ Aug 15 13:43:05 JPEW: {"connections": {"total_time": 2058.1817467394285, "max_time": 0.2676054770126939, "num": 1772291, "average": 0.0011613114024386676, "stdev": 0.003402929594519231}, "requests": {"total_time": 1224.0615269057453, "max_time": 0.26773543702438474, "num": 1772290, "average": 0.0006906666103773904, "stdev": 0.0005487492249695723}} Aug 15 13:43:18 JPEW: that was after a single build approximately completed Aug 15 13:43:30 RP: Any connection timeouts? Aug 15 13:43:38 JPEW: loads Aug 15 13:44:05 JPEW: https://autobuilder.yoctoproject.org/typhoon/#/builders/83/builds/331 - any warning is a timeout Aug 15 13:44:24 Hmm. Ok, that's unfortunate. It means my stats probably aren't capturing where the timeout happens :( Aug 15 13:44:32 RP: Yet without the possible to add those kinds of tweaks, my hands are very limited as I have no way of doing local modifications to bitbake, compared to classes and recipes that I can copy/append to locally. Aug 15 13:44:59 JPEW: I think it just means the server can't handle enough requests to stop some connections stalling Aug 15 13:45:37 i.e. the Kernel can't accept anymore connections? Ya that seems likely Aug 15 13:46:46 1772290 * (0.0011 + 0.0007) = 3190 seconds Aug 15 13:47:00 Saur: you can monkey patch bitbake. You'd just not *like* to do that Aug 15 13:47:35 No, I definitely don't like the idea of doing that... Aug 15 13:48:14 RP: So, I think we can say that connections are being serviced in a reasonably timely fashion once if the kernel actually allows them Aug 15 13:48:54 RP: That was included in the log I posted. Aug 15 13:49:09 That particular inode was used earlier, but it was still good at the time. Aug 15 13:51:04 I am not exactly sure if it is DB corruption, or some kind of a odd race condition. Aug 15 13:51:27 I suspect I'll have to add additional logging information but I am not sure what to add yet. Aug 15 13:52:52 This is the first time we have caught it "red handed" so to speak at the first time the bad entry is inputed into the DB. Aug 15 13:54:19 I'd like to be able to create a stand alone test that does that emits the same kind of log entries. Aug 15 13:55:35 What I don't know is if the client 1/2 of the operation is where things went bad. This is only the server side. I wasn't sure if the client picks up the UID info and just passes it along, or if the server is making some kind of decision. To me it looked like a brand new entry. Aug 15 13:57:14 jwessel: this is the challenge with debugging this, its very hard to tell Aug 15 13:57:54 I'll have to go read some more code and such. I have only been working on this intermittently, so I thought I might post what I had. Aug 15 13:58:19 JPEW: I guess we need to more threads in parallel answering the connections? 
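For context, the shared sstate mirror Saur describes above is normally wired up along these lines in local.conf (a sketch; the NFS mount point is a placeholder, and the trailing PATH is a literal keyword that bitbake expands to the per-object path inside the mirror):

    # Read-only shared sstate cache on an NFS mount.
    SSTATE_MIRRORS ?= "file://.* file:///mnt/nfs/shared-sstate/PATH"

As discussed above, objects satisfied from a file:// mirror currently end up as symlinks back into the mirror rather than copies in the local sstate cache, which is exactly the behaviour being questioned.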
Aug 15 13:58:40 What we know now definitively, is that it is a re-used inode and it has something to do with the hard links and mv operations. Aug 15 13:58:43 jwessel: its useful, I'm just also unfortunately in the middle of a complex mess with runqueue :( Aug 15 13:59:11 jwessel: yes Aug 15 13:59:22 I am in the middle of 2 or 3 other things myself :-) Aug 15 13:59:31 Ya, I think we should make the server use the socketserver.ThreadMixIn class to thread the server, then make the siggen code use a persistent connection Aug 15 13:59:51 JPEW: doesn't that code use a thread per connection? Aug 15 14:00:16 JPEW: I'm not really willing to go that far, bit risky Aug 15 14:01:34 If you want to share a persistent connection, it makes more sense because then it won't run out of threads to handle new connections (or you need a thread pool that is bounded by the maximum number of clients you expect at any given time) Aug 15 14:02:43 JPEW: I suspect the current design can handle persistent connections, its the shear number which is overloading it Aug 15 14:02:56 JPEW: I suspect a thread pool may be easier than persisting though :/ Aug 15 14:03:14 RP: Not with one thread.... the single thread will handle only one connection until it closes, so it would block all others Aug 15 14:04:48 RP: There are easy ways to make the connection persistent, just not with stock python modules. Aug 15 14:05:58 JPEW: hmm, I thought the thread worked differently to that :/ Aug 15 14:06:30 RP: let me look, I might be confusing something Aug 15 14:08:42 Rp: OK, I was right. The HTTP handler base class processes request on the connection until it closes: https://github.com/python/cpython/blob/3.7/Lib/http/server.py#L422 Aug 15 14:09:26 You could, I supposes pass *those* all off to yet another thread, but that seems messy Aug 15 14:09:34 JPEW: I think we're seeing it differently in that we're thinking about different threads Aug 15 14:10:18 JPEW: I'm looking at it from the perspective of the sockets being opened by the server. There is a thread dedicated to doing that and queuing them up which isn't blocked on closing Aug 15 14:14:27 RP: Correct... I was running ahead and trying to think about persistent connections and threads. I don't really see how adding more threads would help with non-persistent connections? Aug 15 14:15:06 JPEW: it depends where our bottleneck is and I guess on that we're still perhaps not quite in agreement Aug 15 14:19:11 RP: Ok, right. I think (assuming my stat code is correct) that the server handles requests timely once it actually accepts() them. The one metric we don't have is the amount of time a connection is pending in the kernel before userspace calls accept() to get it. Aug 15 14:20:09 So either 1) The connections are waiting for a long period of time in the listen queue before userspace calls accept() them Aug 15 14:20:29 JPEW: think about this maths, We have 40 different autobuilder targets each with 9000 tasks starting in parallel Aug 15 14:21:04 JPEW: that means 360,000 requests approximately in parallel which we need to answer in less than two minutes. Can the server do that? Aug 15 14:22:56 JPEW: with the average connection time it would take 418s, the average requests time, 248s, both of which are > 120s Aug 15 14:23:19 JPEW: so I suspect even the request time is too slow :/ Aug 15 14:23:44 (I'm open to persuasion I'm missing something) Aug 15 14:24:38 RP: You are correct. I suspect adding one more thread to the pool would cut the request time almost in half... 
the request time includes the I/O time required to read the data from TCP socket as well as write the results back. Aug 15 14:28:06 JPEW: right, we probably ideally want a pool of around 5-10 Aug 15 14:28:56 RP: I think doing that will significantly reduce the request time Aug 15 14:30:49 RP: But the connect time on it's own is still too long.... I suppose it's possible the reduction in request time would also reduce the connection time Aug 15 14:31:55 RP: Which actually seems pretty likely. The connect time can't *possibly* be shorter (on average) than the request time if the server is running full tilt with a single request handling trhead Aug 15 14:40:20 either my builds are much bigger or bitbake on one of most powerful servers I've access to is still slower than what people use, will send the parsing times to ML "shortly" last test running on No currently running tasks (28875 of 71749) Aug 15 15:35:56 JaMa: or you're hitting the same inotify issue that kanavin's profile seemed to show... Aug 15 16:05:05 RP: yes, we'll see -P is running now Aug 15 16:08:15 RP: those "ResourceWarning: unclosed" warnings are about the socket of PRServer on localhost, now I see in some builds that it even started PRServer twice Aug 15 16:08:29 NOTE: Started PRServer with DBfile: prserv.sqlite3, IP: 127.0.0.1, PORT: 44707, PID: 3947 Aug 15 16:08:39 NOTE: Terminating PRServer... Aug 15 16:08:44 NOTE: Started PRServer with prserv.sqlite3, IP: 127.0.0.1, PORT: 42189, PID: 3949 Aug 15 16:08:54 bb/codeparser.py:419: ResourceWarning: unclosed "while_clause": lambda x: (chain(x.condition, x.cmds), None), Aug 15 16:09:36 and it's triggered from various places (not just codeparser.py:419) Aug 15 16:16:26 RP: if I return at "Executing tasks", then it took only 6mins and there is only one notification in the profile (21 config_notification) Aug 15 16:18:48 JaMa: so its the iterating through the tasks part that is slow for you? Aug 15 16:22:05 JaMa: I think kanavin's numbers were Ctrl+C before exeuting tasks Aug 15 16:22:27 RP: looks like it, will let it run whole build with -n (with latest master-next it was "only" 90 mins) Aug 15 16:22:57 JaMa: that is *way* too long Aug 15 16:23:11 JaMa: did you have the -P output for that? Aug 15 16:23:40 or should I move the return to some better place? I'm looking at the execute() function but don't see where it would make most sense Aug 15 16:24:01 no, I have -P output only for this short 6min part Aug 15 16:27:29 90mins is *way* too long, but with older master-next it was over 10 hours, so it's nice improvement :) Aug 15 16:27:55 JaMa: Right, its better. What was it before we started messing with runqueue? 
Aug 15 16:28:38 before messing with runqueue (bitbake 1f630fdf0260db08541d3ca9f25f852931c19905) it is over 4 hours Aug 15 16:29:01 JaMa: so we did actually get better, its still just slow Aug 15 16:29:35 I can try even older revision, but on small sample (core-image-minimal) this revision was the fast baseline Aug 15 16:30:06 or something is messing with my benchmark like those PRserver connections Aug 15 16:30:12 will try to disable PRserv as well Aug 15 16:35:06 JaMa: I tried a "bitbake -n world" for poky so effectively oe-core and it takes 2m50 Aug 15 16:35:36 Hey, i'm trying to implement something similar to that: https://stackoverflow.com/questions/52729727/how-to-include-git-revision-of-image-layer-in-boot-output-or-etc-issue Aug 15 16:35:37 JaMa: appears to be around 200 tasks/second (12044 in total) Aug 15 16:35:46 but the variable with the layer revision seems to never be updated Aug 15 16:38:36 for me it was about 10 tasks/second, but most of the time was spent between "Executing tasks" message and the next line "Executing task (1 from 71749)" Aug 15 16:40:39 RP: I should also note that I had BB_NUMBER_THREADS = "8" on this 72threads machine Aug 15 16:42:59 most of the time it looks like this: Aug 15 16:43:02 PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command Aug 15 16:43:02 28719 mjansa 20 0 1328M 1265M 7096 R 100. 0.3 4:03.05 ├─ python3 /bitbake/bin/bitbake world -P -n Aug 15 16:43:06 28888 mjansa 20 0 496M 370M 8856 S 0.0 0.1 0:01.91 │ └─ python3 /bitbake/bin/bitbake-worker decafbadbad Aug 15 16:43:09 28889 mjansa 20 0 496M 370M 8856 S 0.0 0.1 0:00.01 │ └─ python3 /bitbake/bin/bitbake-worker decafbadbad Aug 15 16:43:12 28712 mjansa 20 0 164M 29380 9532 S 40.7 0.0 0:10.01 │ │ └─ python3 /bitbake/bin/bitbake world -P -n Aug 15 16:43:15 28718 mjansa 20 0 164M 29380 9532 R 17.7 0.0 0:04.14 │ │ └─ python3 /bitbake/bin/bitbake world -P -n Aug 15 16:44:58 and only the first one being busy after "NOTE: Executing Tasks" is shown Aug 15 16:45:45 JaMa: that makes sense, the workers get spawned but just return so only the cooker would be loaded Aug 15 16:47:22 arent workers spawned only when it also shows something like "Executing task (1 from 71749)" ? Aug 15 16:48:03 JaMa: yes Aug 15 16:48:19 I'm not that far, it will take 1-3 hours until I get to that point Aug 15 16:48:51 Initialising tasks: 100% |###############################################################################################################################################################################################################################| Time: 0:03:20 Aug 15 16:48:55 Sstate summary: Wanted 25260 Found 0 Missed 25260 Current 0 (0% match, 0% complete) Aug 15 16:48:58 NOTE: Executing Tasks Aug 15 16:50:27 JaMa: Setting to 8 threads gave 3m1 Aug 15 16:51:32 interesting, lets see what the profile data will show, but will probably send it tomorrow Aug 15 16:52:29 this build has also multilib enabled, so it almost doubles the number of available tasks, I should have used something a bit smaller Aug 15 17:13:56 JaMa: I'll have to run some more experiments, a build with meta-oe in here is definitely slower :/ Aug 15 17:15:54 JaMa: 33000 tasks, 22minutes Aug 15 17:26:33 RP: this is with latest bitbake? 
Aug 15 17:27:55 JaMa: yes Aug 15 17:28:29 JaMa: putting it under profiling now but I really need to look at the correctness problems of the last build :/ Aug 15 17:29:10 Looks like we've got some kind of nasty metadata race somewhere :/ Aug 15 17:31:17 will bisect where bitbake started terminating PRserv, it's not one of 2 commits I was suspecting (the cleanup for termination) Aug 15 17:31:53 looks like this one: 05888700 cooker: Improve hash server startup code to avoid exit tracebacks Aug 15 17:33:34 this issue is reproducible with 'PRSERV_HOST = "localhost:0"' in local.conf Aug 15 17:34:24 but probably not related to the delay I'm seeing, because for last run with -P I've disabled PRserv and rm_work as well and it's still sitting between "Executing tasks" and executing them Aug 15 17:39:05 RP: if I remove self.handlePRServ() added to reset() it works again Aug 15 17:40:51 JaMa: right, that would account for the socket issue but probably not the delays Aug 15 18:03:05 RP: Added more stats logging and multiple threads to the hash server: http://git.yoctoproject.org/cgit.cgi/poky-contrib/commit/?h=jpew/hashserve-stats&id=66c70708515cdeb903a93bd738b8f6a87cdd3926 Aug 15 18:07:38 do you have tips on how I could include the git revision of our custom layer in the yocto image? I have a hacky solution with a "nostamp" task, but as a consequence it forces the generation of the rootfs even if nothing changed, I'd like to avoid that Aug 15 18:15:59 dkc: check ./meta/classes/image-buildinfo.bbclass Aug 15 18:20:40 JaMa: looks like exactly like what I need, thanks Aug 15 18:31:15 is it expected that core-image-minimal depends on things like libx11 now? It seems to be caused by systemd -> dbus -> libx11: "dbus.do_package" -> "libx11.do_packagedata" Aug 15 18:39:43 it seems if i use fitimage and also INITRAMFS_IMAGE_BUNDLE my kernel has the initramfs twice. is there a way to prevent that? Aug 15 18:40:41 RP: I got another hit with more logs and things line up to a lack of understanding how this could happen. Aug 15 18:40:51 I am not so sure that pseudo is the problem either. Aug 15 18:41:10 If we assume that pseudo is not doing the inode assignment. Aug 15 18:41:27 It is asked to instantiate a hard link, but there is no "origination" location. Aug 15 18:42:01 Example: Aug 15 18:42:07 (exact 1 d /L/2.28-r0/locale-tree/usr/lib64/locale/ar_JO.ISO-8859-6/LC_NUMERIC) (exact 1 duf /L/2.28-r0/locale-tree/usr/lib64/locale/ar_JO.ISO-8859-6/LC_NUMERIC) Aug 15 18:42:07 (new?) new link: path /L/2.28-r0/locale-tree/usr/lib64/locale/ar_DZ.ISO-8859-6/LC_NUMERIC.tmp, ino 190901401, mode 0100644 Aug 15 18:42:07 No origin inode data: 190901401 [ no path ] Aug 15 18:43:14 I went and poked the actual dir structure, and sure enough. That is the only instance of that inode. It should have been linked off to other locations with the LC_NUMERIC, since they are all the same file. Aug 15 18:43:34 I don't really understand how that can happen though. Aug 15 18:44:14 jwessel: How can you hardlink to something which doesn't have an inode you're linking to? Aug 15 18:44:43 I have no idea. That is why I put a new log line in there, to prove it was hitting this condition. Aug 15 18:45:25 I need to trace back to the requestor, since down at the pseudo DB level / server level that information is long gone. Aug 15 18:45:47 Some how this file was turned into a copy instead of a link. Aug 15 18:45:56 jwessel: It at least gives a clue on what we're looking for... 
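The image-buildinfo.bbclass pointer above is normally enabled from local.conf or a distro config; a minimal sketch (IMAGE_BUILDINFO_VARS is optional, and this particular variable list is only an example):

    # Record layer names/revisions (plus selected variables) in a file inside
    # the generated rootfs, without resorting to a nostamp task.
    INHERIT += "image-buildinfo"
    IMAGE_BUILDINFO_VARS = "DISTRO DISTRO_VERSION"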
Aug 15 18:46:25 All the failures (I have 3 so far), have the exact same signature. Aug 15 18:47:44 I thought about putting in clean up code in pseudo to "fix it up" but it clearly isn't the right way to deal with this. Aug 15 18:47:55 We have a garbage in, garbage out situation. Aug 15 18:48:30 I just don't understand where the garbage came from. Clearly it is an error which should probably be fatal if you are asked to hard link something for which there is no reference. Aug 15 18:48:33 jwessel: right, I think we need to understand it more... Aug 15 18:48:40 I don't get how it becomes a copy though. Aug 15 18:49:12 that is odd... Aug 15 18:49:32 Just thought I'd provide an update or if you or anyone else has insights, I am open to any input. Aug 15 18:50:07 jwessel: I appreciate the update but I'm not close enough to the code to have useful input :( Aug 15 18:50:42 jwessel: I do agree its odd though as how would a hardlink become a copy. Maybe a libc call fails and this is the fallback? Aug 15 18:51:19 I was chatting with marka earlier. He mentioned there is glibc.bbclass or something which deals with some of this. Aug 15 18:51:35 It is specific to how the locales are copied. Aug 15 18:51:40 I'll look there next. Aug 15 18:52:29 I am not sure how to track down the caller yet, but we do know that by the time pseudo (the server side) is asked to process the hard link, it is already trashed. Aug 15 18:52:41 jwessel: yes, if its being copied in our code then it will try a hardlink, if it fails it will resort to a copy Aug 15 18:52:43 It is still possible the pseudo client end is broken in some way. Aug 15 18:53:16 jwessel: we have a the copy fall back for cases spanning different disks or similar Aug 15 18:55:06 jwessel: Its glibc-locale.inc and libc-package.bbclass Aug 15 18:55:40 I'll have to figure out what to instrument, but I'll start with old fashion code inspection first. Aug 15 18:55:56 I'd like to see what lines with what prints out in the logs. Aug 15 19:05:27 hmm... I seem to have stumbled on what happened, but it will take a while to figure out what to instrument next. Aug 15 19:05:29 https://pastebin.com/DwnTENVU Aug 15 19:05:52 The evidence show the original file was created and reference was purged. Aug 15 19:05:58 so it became a copy. Aug 15 19:12:06 jwessel: I wonder if some kind of atomic op guarantee on a libc call was broken by pseudo? Aug 15 19:12:52 Good question. I am inspecting through the pseudo code first, then I need to try and find the caller. Aug 15 19:13:03 calls used to be blocking until recently.. but that isn't the issue cause this stuff was broken before that change.. Aug 15 19:16:17 case OP_MAY_UNLINK: Aug 15 19:16:17 if (pdb_may_unlink_file(msg, msg->client)) { Aug 15 19:16:18 /* harmless, but client wants to know so it knows Aug 15 19:16:18 * whether to follow up... */ Aug 15 19:16:18 msg->result = RESULT_FAIL; Aug 15 19:16:18 } Aug 15 19:16:43 I am not so sure we do the right thing... Or a badly written app might not do the right thing. Aug 15 19:17:22 More instrumentation is required to find out exactly where we go off the rails. Aug 15 19:51:42 Hey folks. I was given a Poky SDK Installer. I ran it and tried running the 'environment-setup-armv5e-poky-linux-gnueabi' file that it created, but nothing seems to be happening Aug 15 19:53:29 that just sets some environment variables Aug 15 19:54:53 Do you know how i would compile a program using the Poky SDK that i installed? Aug 15 19:57:01 once you source it.. call 'make' or autoconf or.... 
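Spelled out, the SDK workflow described above looks roughly like this (the install prefix is an example; the environment-setup script name is the one from the log):

    # Source the environment file in the current shell; it exports CC, CXX,
    # CFLAGS, LDFLAGS and friends pointing at the cross toolchain and sysroot.
    . /opt/poky/environment-setup-armv5e-poky-linux-gnueabi

    # Build with those variables instead of calling gcc directly:
    $CC $CFLAGS $LDFLAGS -o hello hello.c
    # ...or run make/autotools/cmake, provided they honour those variables,
    # as advised above.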
Aug 15 19:57:16 if you just want to compile a single file, $CC -o output input.c Aug 15 19:57:28 no magic involved.. standard environment variables are set Aug 15 20:03:39 So after running 'environment-setup' i can just compile as i normally would and it will compile to the targets platform? Aug 15 20:04:07 that is the idea.. just be sure you use the environment variables and not direct calls to 'gcc' Aug 15 20:07:27 Ahah. I understand. Thank you for your help. Aug 15 20:08:46 One last question. How can i make sure that i 'use the ennironment variables'? Do i just have to make sure i run the 'environment-setup' file in the current shell session? Aug 15 20:09:41 Sorry for me naivety. I'm a new-grad and this is all pretty new to me Aug 15 20:09:47 vpaladino778: Make sure your build system (make, autotools, meson, cmake, whatever) uses them Aug 15 20:10:43 Including using them yourself if you are your own build system, e.g.: `$CC -o hello -c hello.c` Aug 15 20:49:45 JPEW: Going to try that patch, thanks! Aug 15 21:05:15 jwessel: have you seen this? http://git.openembedded.org/openembedded-core-contrib/tree/meta/recipes-devtools/pseudo/pseudo-test.bb?h=jansa/pseudo2 it's definitelly badly written app and breaks pseudo every single time (maybe with different root cause though) Aug 15 21:05:57 I had not seen that one. Aug 15 21:06:31 more details in https://bugzilla.yoctoproject.org/show_bug.cgi?id=12434#c48 Aug 15 21:06:32 Bug 12434: normal, Medium+, 2.8 M3, randy.macleod, ACCEPTED , pseudo: Incorrect UID/GID in packaged files Aug 15 21:10:03 JPEW: To continue that conversation from earlier, I just dreamt the idea of task specific hosttools using symlinks Aug 15 21:10:30 Its so crazy we have to do it... Aug 15 21:10:45 JaMa: It doesn't seem at first glance that is the same problem. Aug 15 21:11:02 JPEW: would solve the task rss contamination problem in theory to a large extent Aug 15 21:24:33 jwessel: we've had far too many rays of hope with this bug where we think we may have found it, then not.... Aug 15 21:24:53 I don't doubt that. This is fairly complex. Aug 15 21:25:39 I see something happening consistently with each build, even in the ones that don't actually fail. Aug 15 21:26:12 jwessel: I've suspected/hoped for that. Question is why some fail and some don't Aug 15 21:26:34 jwessel: if we could make it 100% failing we'd no doubt quickly figure it out :) Aug 15 21:26:35 That answer is pretty easy, from the logs. Aug 15 21:26:54 If I add some sleep into pseudo, it will probably fail every time, but I am not sure. Aug 15 21:27:31 The logs indicate it is picking up the underlying (what ever was used last) for the inode in the case of the hard link that was turned to a file. Aug 15 21:27:59 It happens a few times in every single build, but the ones that pass are the ones where has selected UID 0. Aug 15 21:28:23 It really should be 100% impossible for the hard link to have no source reference. Aug 15 21:28:59 That is why I put a print log there. I figured it shouldn't be hitting that line of code _ever_ Aug 15 21:29:10 interesting part for me was that it's consistent when restoring "bad" do_package archive from sstate (so once you create and store bad archive it will always fail as long as you're using the same sstate signature) Aug 15 21:29:19 I haven't been able to make a shell script which makes it act the same. Aug 15 21:29:21 jwessel: so my inode reuse theory is actually right! Aug 15 21:29:47 Well technically it appears to be the pseudo cache. 
Aug 15 21:30:08 The problem is a bit hard to track because each instance involves 3 inodes. Aug 15 21:30:31 jwessel: Right, that was what I'd theorised though - that some file known to pseudo was deleted and then a new file was created using the same inode so the permissions/user were copied Aug 15 21:31:00 and that pseudo writtes a lot of warnings and errors in each build even the ones which ended probably OK :) Aug 15 21:31:08 jwessel: it kind of implies we're missing deletion tracking with pseudo which is possible as we remove files out of pseudo context Aug 15 21:31:26 Is there any way that is happening? Aug 15 21:31:34 jwessel: I wish we could do path filtering in pseudo :/ Aug 15 21:31:47 jwessel: absolutely certain its happening Aug 15 21:31:52 It looked to me like the delete was actually there, and it more like there is something happening in parallel. Aug 15 21:32:10 jwessel: it was only a theory but I know we do delete files out of context Aug 15 21:32:35 (why would we load pseudo and do deletion under it?) Aug 15 21:32:55 As long as you are sane about it, it should be ok. Aug 15 21:33:17 jwessel: what could be interesting would be doing a sanity check of pseudo's db against the disk periodically Aug 15 21:33:20 I can say with certainty I know how to re-use the inodes. Aug 15 21:33:53 On a quiet disk I can allocate and re-allocate any which way I want. Aug 15 21:34:06 jwessel: I never figured that out :) Aug 15 21:34:25 But I am trying to limit it down to something reproducible. Aug 15 21:34:42 jwessel: right, that is what we need Aug 15 21:34:59 I can't see at the moment how to get it to the state where the ln proceeds without the whole "load my system and hope it happens..." Aug 15 21:35:30 At this point I do see the bad state with every single run of the do_package. Aug 15 21:36:42 * jwessel has a few more tests cases to try and duplicate it, before the day over for now. Aug 15 21:41:37 * RP is hoping jwessel can figure it out :) Aug 15 21:41:49 This might take days. Aug 15 21:42:34 As with all problems like this, finding the root cause doesn't mean it will be easy to fix. Aug 15 21:43:33 JPEW: new server swaps build warnings for failures: https://autobuilder.yoctoproject.org/typhoon/#/builders/15/builds/1171 Aug 15 21:44:11 jwessel: I know. I'm dealing with a similar multiheaded monster in the form of runqueue too :/ Aug 15 21:46:03 JPEW: {"connections": {"total_time": 130.6708268285729, "max_time": 0.027425227221101522, "num": 485853, "average": 0.0002689513635370635, "stdev": 0.0002935243837515032}, "requests": {"total_time": 824.3801162638701, "max_time": 0.43299578316509724, "num": 485848, "average": 0.0016967860653205739, "stdev": 0.0015868506651319188}} Aug 15 21:47:26 RP: hmm... did you remove the io and sql stats? Aug 15 21:47:38 JPEW: no? Aug 15 21:48:30 JPEW: oh, I think I've missed a commit Aug 15 21:48:49 Ah.... I'm actually impressed that applies cleanly :) Aug 15 21:49:13 JPEW: yes, so am I looking at it! Aug 15 21:49:31 RP: Git is a mysterious entity Aug 15 21:49:34 JPEW: would that cause the connection reset issue? Aug 15 21:50:05 Possibly? 
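A tiny illustration of the inode reuse jwessel describes; the behaviour is filesystem-dependent, but on a quiet volume a freed inode number is often handed straight back to the next file created:

    # Create a file and note its inode number.
    touch a; stat -c '%i %n' a
    # Free the inode, create a different file, and compare: the same number
    # frequently reappears, which is what lets a stale entry in pseudo's
    # database get attached to an unrelated new file.
    rm a; touch b; stat -c '%i %n' b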
Aug 15 21:50:26 I have been running bitbake-selftest hashserv.tests on my changes Aug 15 21:51:14 If the server was getting a python exception in the thread I think that would cause a connection reset Aug 15 21:52:16 JPEW: that was what I was just thinking looking at the missing import Aug 15 21:53:37 RP: time bitbake world -P -n is still running after over 5 hours not yet executing tasks (on older bitbake) so it might be even slower than 253min run done earlier today on the same build Aug 15 21:57:33 JaMa: That is just crazy :( Aug 15 21:57:57 JaMa: I need to figure out what is wrong. I think my more limited world build should give me the baseline profile data I need... Aug 15 21:58:25 JPEW: I've sorted out the patches and restarted everything, lets try again Aug 15 21:58:32 Cool Aug 15 21:58:55 * RP knows he's going to run out of time on this soon :( Aug 15 22:06:33 JPEW: https://autobuilder.yoctoproject.org/typhoon/#/builders/48/builds/938 :( Aug 15 22:08:45 RP: Argh.... OK. Aug 15 22:09:08 I'll try and keep the logging but revert back to the previous threading Aug 15 22:09:17 RP: OK. I have to go home, but I'll give it a think. Aug 15 22:09:42 JPEW: thanks, this was a good try :) Aug 15 22:10:37 JPEW: I'm just reverting so I can test the other changes in -next which are a mix of normal patches and runqueue fixes Aug 15 22:20:50 So, in a non-failed build I absolutely have the same problem each time, as I mentioned before. Aug 15 22:21:09 I see how it happens, but still don't know the root cause. Aug 15 22:22:50 There are is pile of these requests coming in, parallel. And the mv bit is not atomic. Aug 15 22:23:14 The unlink occurs before the rename Aug 15 22:23:36 I haven't determined that the problem is actually the fault of pseudo or not. Aug 15 22:23:56 I don't understand how all these db requests are coming in, in parallel. Aug 15 22:25:23 BTW, if we were to add a QA check against the files we know are supposed to all be hard linked together. It would fail 100% of the time. Aug 15 22:25:39 We get a couple strays that became hard copies each time. Aug 15 22:26:05 jwessel: That sounds like a good way to debug this if we have a sentinel we can spot.... Aug 15 22:26:26 By luck of the draw most of the time it has not failed due to the QA check being for the builder UID. Aug 15 22:26:48 jwessel: most files we delete would be owned by "root", not the build user Aug 15 22:26:53 It is going to be a different file every time, but the investigation thus far shows 100% failure rate. Aug 15 22:27:11 right statistically we process enough files something breaks Aug 15 22:27:36 I really need to have a hard look at the operations. Aug 15 22:27:46 But that is it for now. Aug 15 22:27:51 For today anyway. Aug 15 22:28:08 jwessel: its great progress as I can think of ways of debugging this further from there! :) Aug 15 22:42:17 kanavin, JaMa: the notifications are about the generation of task specific profiles Aug 15 22:46:25 JPEW: managed another connection reset without the threading patch :( Aug 15 22:46:55 definitely not as common Aug 15 22:54:56 RP: I'm getting close with that profile, not on task 50K from 62K Aug 15 22:56:52 now when some tasks are "running" there is some load on the worker process as well Aug 15 22:57:19 JaMa: that makes sense since I guess its instructed to fork() Aug 15 23:10:26 heh, it finished.. 
*a lot* of profile* files in this directory :), will upload in a sec Aug 15 23:16:18 http://paste.ubuntu.com/p/Nw6n5hTPmF/ http://paste.ubuntu.com/p/sydFGvYTxQ/ http://paste.ubuntu.com/p/RMGbm9DyZz/ Aug 15 23:17:00 the first is the "short 5 min" build til "Executing Tasks", then the profile.log.processed and last is from profile-worker.log.processed Aug 15 23:17:21 all 3 with bitbake 1f630fdf0260db08541d3ca9f25f852931c19905 (before most runqueue changes) Aug 15 23:35:15 maybe the logger.debug in scenequeue_updatecounters()? **** ENDING LOGGING AT Fri Aug 16 03:05:29 2019