**** BEGIN LOGGING AT Sun Jun 28 02:59:58 2020
Jun 28 09:33:20 RP: has the matchbox-wm crash been observed with sysvinit actually? In all failure links I have seen, it happens only with systemd
Jun 28 09:33:51 I have read through the code, and:
Jun 28 09:34:18 a) startup notification does not handle errors, and instead proceeds directly to accessing a struct member ---> crash
Jun 28 09:34:20 (boo!)
Jun 28 09:34:41 b) the only way the error can happen where it does is if the X server can't allocate memory
Jun 28 09:35:04 I suspect that because systemd starts everything at once, there is a brief situation where memory is exhausted
Jun 28 09:35:26 rburton: ^^^
Jun 28 10:40:47 kanavin_home1: We have seen it with both sysvinit and systemd, it's more common with systemd
Jun 28 10:40:59 kanavin_home1: It's an interesting theory, you could be right
Jun 28 10:41:51 kanavin_home1: I wonder if we can add some error handling to confirm the hypothesis?
Jun 28 10:42:30 RP: we will only get a BadRequest error code from the X server
Jun 28 10:43:05 RP: what I am doing now is starting 20 qemux86-64-alt builds, with RAM bumped to 1Gb :)
Jun 28 10:43:23 kanavin_home1: ah, yay X error handling :/
Jun 28 10:43:38 RP: I wonder if systemd has a facility for tracking RAM usage during boot, that would be handy
Jun 28 10:43:45 kanavin_home1: interesting. I wonder how many other errors you'll find!
Jun 28 10:44:06 kanavin_home1: with overcommit, tracking memory usage is rather hard :/
Jun 28 10:45:04 RP: if you do find a failure where systemd is not used, pls share the link :)
Jun 28 10:45:12 all the links in the bug are systemd
Jun 28 10:45:14 kanavin_home1: I think these only fail when the system is under load, the systemd boot times are always about twice as long on the failed builds. Another reproduction approach may be to pin multiple qemu images to a single core and boot all together
Jun 28 10:46:26 kanavin_home1: will do if I see one
Jun 28 10:48:17 RP: here's why I think it's OOM: https://cgit.freedesktop.org/xorg/xserver/tree/dix/atom.c#n72
Jun 28 10:48:36 all the error paths are due to not being able to allocate or reallocate memory
Jun 28 10:49:36 kanavin_home1: Is there some other error we could tweak this to for debugging?
Jun 28 10:49:49 kanavin_home1: it's definitely a good theory
Jun 28 10:50:16 RP: we could patch xserver to do something else in these error paths that would show up in the AB failure logs
Jun 28 10:51:15 kanavin_home1: right, that could be useful
Jun 28 10:51:18 like, oh, exit loudly
Jun 28 10:51:26 and abruptly
Jun 28 10:51:41 kanavin_home1: even dumping something into the syslog that parselogs would pick up would be enough
Jun 28 10:52:55 RP: right, the test build with more RAM succeeded https://autobuilder.yoctoproject.org/typhoon/#/builders/109/builds/976
Jun 28 10:53:01 RP: now let me start 20 of those at once :D
Jun 28 10:53:18 kanavin_home1: I don't think it will help since they'll all get farmed to individual workers
Jun 28 10:53:33 and if it is load-specific there is nothing else running
Jun 28 10:53:34 RP: right, but they still fail sporadically even then?
Jun 28 10:53:49 kanavin_home1: I've not seen a failure with a "normal" boot time
Jun 28 10:54:01 kanavin_home1: I can run a master-next in parallel with your test?
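A side note on the unchecked error path mentioned at 09:34 and the "yay X error handling" remark at 10:43: below is a minimal Xlib sketch of the kind of defensive handling a client such as matchbox-wm could add, so an error reported by the server is logged instead of the client later dereferencing a bogus result. This is generic illustration code, not matchbox-wm's actual source; compile with -lX11.

    #include <stdio.h>
    #include <X11/Xlib.h>

    /* Custom handler: log the error instead of letting Xlib's default
     * handler (or an unchecked return value) take the client down later. */
    static int on_x_error(Display *dpy, XErrorEvent *ev)
    {
        char msg[256];
        XGetErrorText(dpy, ev->error_code, msg, sizeof(msg));
        fprintf(stderr, "X error: %s (request code %d)\n", msg, ev->request_code);
        return 0;
    }

    int main(void)
    {
        Display *dpy = XOpenDisplay(NULL);
        if (!dpy) {
            fprintf(stderr, "cannot open display\n");
            return 1;
        }
        XSetErrorHandler(on_x_error);
        /* Interning an atom is the request that can fail with BadAlloc when
         * the server is out of memory (see the dix/atom.c error paths). */
        Atom a = XInternAtom(dpy, "_NET_STARTUP_ID", False);
        printf("atom = %lu\n", (unsigned long) a);
        XCloseDisplay(dpy);
        return 0;
    }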
Jun 28 10:55:09 RP: wait, but where do you see that the boot time is not normal? e.g. here everything looks normal to me https://autobuilder.yoctoproject.org/typhoon/#/builders/72/builds/2095/steps/8/logs/step5c
Jun 28 10:55:51 kanavin_home1: I think "normal" is the NFSD messages at around 3s
Jun 28 10:56:12 kanavin_home1: the failures are all with NFSD at 5-8s
Jun 28 10:56:17 approximately
Jun 28 10:56:40 kanavin_home1: we can try it, if we have a master-next in parallel that will give a reasonable load to the systems
Jun 28 10:56:48 RP: right - sure let's try that
Jun 28 10:57:08 RP: after that we can patch xserver, and you can carry that patch in -next without merging it maybe
Jun 28 10:57:33 kanavin_home1: that works
Jun 28 10:58:15 kanavin_home1: just tell me when we're ready for the -next. Timing may be tricky since a-full takes a short while to get going but you want your builds running first
Jun 28 10:58:26 (or you can start the -next actually :) )
Jun 28 10:58:39 oh, you can just start it now?
Jun 28 10:59:07 kanavin_home1: ok
Jun 28 10:59:45 kanavin_home1: it's away
Jun 28 11:01:06 RP: right, I'll start firing the test builds once builds are underway across workers
Jun 28 11:01:23 for good measure, I can start 20 increased-RAM builds, and 20 regular builds
Jun 28 11:01:38 if we get no failures in the first set, but failures in the second, that adds to the theory
Jun 28 11:02:23 it's a rainy day, otherwise I'd be cycling :)
Jun 28 11:03:19 RP: btw, bind updates were sent I think, will you be picking those separately?
Jun 28 11:05:08 kanavin_home1: more work is needed on the series, it failed testing
Jun 28 11:05:24 kanavin_home1: you should fire now before it gets going?
Jun 28 11:05:52 firing now
Jun 28 11:06:26 kanavin_home1: it forked out so most workers were "taken" :/
Jun 28 11:07:02 kanavin_home1: It's supposed to be rainy here but actually looks quite nice out...
Jun 28 11:07:16 Still, a rest would probably be good for me
Jun 28 11:12:03 RP: I fired 12 builds before it started throttling https://autobuilder.yoctoproject.org/typhoon/#/builders/109
Jun 28 11:15:39 kanavin_home1: cool
Jun 28 11:43:08 kanavin_home1: all green
Jun 28 11:44:05 kanavin_home1: the other set should be interesting
Jun 28 11:45:10 RP: yes, watching that too :)
Jun 28 11:45:33 would be disappointing if that is all green too
Jun 28 11:47:03 kanavin_home1: could be there isn't enough load from master-next, there isn't much delta there
Jun 28 11:47:29 RP: I am making a patch meanwhile
Jun 28 12:02:59 RP: all green. I'll now test a patch that always crashes X, then adjust it to crash only when mallocs fail, and send
Jun 28 12:03:37 kanavin_home1: sounds good. Shame about all green but this is the fun of intermittent bugs :/
Jun 28 12:16:48 kanavin_home1: For interest I tried 6 core-image-sato -c testimage in parallel, all pinned to the same cpu core. Didn't fail
Jun 28 12:19:21 RP: right, it is fiendishly tricky to create the exact conditions where it does fail - if my theory is right, all startup items need to be at the exact spots where they consume the most memory at the same time :-/
Jun 28 12:23:07 Spirit532: if you are still around, my guess is that you do += inside the appends and are having problems with the expansion/evaluation order. Yet another reason why I strongly recommend custom images over appending, and even more so over appending a base image.
Jun 28 12:23:11 Spirit532: have fun.
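The plan settled on above (10:50-10:51, and the 12:02 message about crashing only when mallocs fail) is to make the X server's allocation failure paths impossible to miss. Here is a minimal self-contained sketch of that idea, writing a marker to syslog so parselogs can spot it; the helper name xmalloc_loud and the XSERVER-OOM tag are invented for this example, the real change would live inside xorg-server's own error paths (e.g. dix/atom.c), and this is not the actual akanavin/make-x-crash patch.

    #include <stdio.h>
    #include <stdlib.h>
    #include <syslog.h>

    /* Hypothetical wrapper: allocate, and on failure leave an unmistakable
     * marker in syslog and die immediately rather than limping on. */
    static void *xmalloc_loud(size_t n, const char *where)
    {
        void *p = malloc(n);
        if (!p) {
            syslog(LOG_CRIT, "XSERVER-OOM: %zu byte allocation failed in %s",
                   n, where);
            abort();            /* "exit loudly and abruptly" */
        }
        return p;
    }

    int main(void)
    {
        openlog("Xorg-debug", LOG_PID | LOG_CONS, LOG_DAEMON);
        /* Stand-in for the MakeAtom() allocation suspected of failing. */
        void *node = xmalloc_loud(64, "MakeAtom");
        free(node);
        closelog();
        return 0;
    }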
Jun 28 12:34:11 kanavin_home1: I'll try again with core-image-sato-sdk and systemd instead of sysvinit
Jun 28 12:43:21 RP: meanwhile this patch was supposed to cause a fail but didn't http://git.yoctoproject.org/cgit/cgit.cgi/poky-contrib/commit/?h=akanavin/make-x-crash&id=c10c796c4ec42d006b55f07325461367557da4b6
Jun 28 12:44:29 RP: I need to dig deeper and observe what is actually happening, rather than just reading the source code :-/
Jun 28 13:25:16 kanavin_home1: hmm, looks like we're still missing some element of this
Jun 28 14:14:04 kanavin_home: system load of 498, world build in progress and seven systemd core-image-sato or sato-sdk images and no segv :/
Jun 28 14:16:51 RP: I found why I couldn't get X to "error out" - it actually did, but this wasn't noticed by the tests
Jun 28 14:17:36 kanavin_home: I've wondered about that, unless parse_logs sees a segv, they probably don't notice :(
Jun 28 14:18:11 printing to stdout/stderr seems to be simply discarded, so I am using X logging to write to Xorg.log instead, just checking now
Jun 28 14:19:12 RP: the reason why X quitting wasn't noticed is that it is run by xinit, and xinit doesn't notice that Xorg has turned into a zombie process
Jun 28 14:19:33 RP: and the runtime test is using a ps | grep X kind of thing, which succeeds even if Xorg is a zombie
Jun 28 14:23:16 kanavin_home: ah, that would explain it. We could do with improving that then...
Jun 28 14:47:16 RP: right, I got the failure to be properly reported https://autobuilder.yoctoproject.org/typhoon/#/builders/109/builds/1020
Jun 28 14:47:34 will now adjust the patch to report it only in spots where we want to see it :)
Jun 28 15:04:54 RP: patch sent
Jun 28 15:30:48 kanavin_home: Thanks, I've run a quick build containing it. I'll keep it in -next and see where we end up
Jun 28 15:33:27 RP: right, the failure should occur together with the matchbox-wm crash, and if it doesn't then we need to consider other possibilities
Jun 28 15:38:07 kanavin_home: we probably should have done this on a revision known to show the bug too :/
Jun 28 15:39:00 It's in theory possible something fixed it :/
Jun 28 15:43:42 RP: my idea was that this is kept in -next until the bug shows up again
Jun 28 15:45:57 kanavin_home: agreed, just thinking out loud
Jun 28 15:48:08 RP: the most recent failure was here I think https://autobuilder.yoctoproject.org/typhoon/#/builders/83/builds/1092
Jun 28 15:48:16 less than 24 hours ago :)
Jun 28 15:58:32 kanavin_home: right, we've seen it a lot recently
Jun 28 20:10:11 hi
Jun 28 20:12:04 just a meta-technical question: is there a good way to freeze multiple similar builds?
**** ENDING LOGGING AT Mon Jun 29 02:59:57 2020
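On the 14:19 observation that a "ps | grep X" style check passes even when Xorg is a zombie: a small standalone C sketch of the stricter check, reading the state field from /proc/<pid>/stat so that a zombie counts as dead. The real Yocto runtime tests are written in Python and are not reproduced here; this only illustrates the idea.

    #include <stdio.h>
    #include <stdlib.h>

    /* Return 1 if the pid exists and is not a zombie, 0 otherwise.  The field
     * after the parenthesised comm in /proc/<pid>/stat is the state letter;
     * 'Z' means the process has exited and only its table entry remains. */
    static int really_running(long pid)
    {
        char path[64];
        char comm[64];
        char state = '?';
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%ld/stat", pid);
        f = fopen(path, "r");
        if (!f)
            return 0;                               /* no such process */
        if (fscanf(f, "%*d (%63[^)]) %c", comm, &state) != 2)
            state = '?';
        fclose(f);
        return state != 'Z' && state != '?';
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 2;
        }
        long pid = strtol(argv[1], NULL, 10);
        printf("pid %ld is %s\n", pid,
               really_running(pid) ? "running" : "dead or a zombie");
        return really_running(pid) ? 0 : 1;
    }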