**** BEGIN LOGGING AT Mon Jun 15 02:59:58 2020
Jun 15 06:53:06 hi @zmatt, finally got the log from a crash; https://pastebin.com/NbZGHUs7
Jun 15 06:54:05 this is on the eMMC only, sdcard removed
Jun 15 09:14:14 zmatt: I'm looking at the dmtimer capture mode timing diagram in spruh73q.pdf (figure 20-5) and I'm wondering if it's really cycle-accurate
Jun 15 09:15:13 zmatt: if yes, it looks like there will be about 4 or 5 cycles of timer_clock from the actual pievent edge until the TCRR is actually transferred to the TCAR
Jun 15 09:15:26 that's quite a lot...
Jun 15 09:17:34 3 cycles for the edge resync, another one for the actual capture pulse to appear
Jun 15 09:47:45 and another one until the TCAR is updated.
Jun 15 12:30:47 mmc-utils doesn't seem to be able to write to the mmc, so I'm looking for the source code and I've been unable to locate it. I've found several repos but I'd prefer to download the source for the beaglebone. Could someone point me there?
Jun 15 12:38:46 https://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc-utils.git/
Jun 15 12:38:58 Thanks!
Jun 15 12:42:34 @mru, maybe you are able to help me resolve my issue. We talked a while ago about my BBBW that stops working, having solid USR0 and USR2 leds. I've got a dump of the serial debug line, could I perhaps get you to take a look at it? You can find it at https://pastebin.com/NbZGHUs7
Jun 15 12:44:26 something is trying to access address 0xf9e00004, which doesn't exist
Jun 15 12:44:35 no idea why that would happen
Jun 15 12:44:56 said something is apparently am33xx_clkdm_wakeup()
Jun 15 12:45:22 either your kernel is badly misconfigured or your device is broken
Jun 15 12:50:16 Ok, and how would it get badly misconfigured? I just flashed the console OS, so I would expect it to be "clean", or am I mistaken?
Jun 15 13:43:41 Munke: wow, what on earth is happening there
Jun 15 13:44:13 my guess was right though: I said it'd be either the sd card or the wifi driver... it's the wifi driver
Jun 15 13:50:05 the wifi firmware crashed, and normally the driver is able to recover that so smoothly you wouldn't even notice it happened without looking at the kernel log, but in this instance something is going terribly wrong during recovery
Jun 15 13:56:24 thinkfat: I'd expect rsync to take 1-2 cycles (with the worst case of 2 cycles being drawn in that diagram). the 1 cycle delay from that to capt_pulse == tcar1_enable is unfortunate but understandable, and the value loaded into TCAR will be the value of TCRR while tcar1_enable is high
Jun 15 13:57:56 so basically, if your rising edge happens while TCRR=n then I'd expect n+3 to be loaded into TCAR
Jun 15 13:59:28 thinkfat: rsync=resync
Jun 15 14:00:45 so this is one cycle worse than optimal
Jun 15 14:01:09 (for a synchronous timer circuit capturing an asynchronous event)
Jun 15 14:01:22 so you weren't talking about copying files with crazy speed :)
Jun 15 14:01:57 hehe, I use that command so much that it's keep in my muscle memory I guess ;)
Jun 15 14:02:01 *deep
Jun 15 14:02:04 or I just can't type
Jun 15 14:05:04 thinkfat: the exact cycle-accurate timing for both capture and pwm is something you could fairly easily test by sticking some painfully slow clock onto tclkin (possibly even controlling it manually using a gpio)
Jun 15 14:15:30 if an LCD screen is interfering with the use of other GPIO pins, would the problem be in the device overlay?
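The capture-latency question above (whether a rising edge that arrives while TCRR=n really latches n+3 into TCAR) can be checked along the lines zmatt suggests, by driving a painfully slow clock into tclkin and producing the capture edge yourself. The following is only a minimal sketch under stated assumptions: DMTIMER6 mapped at physical address 0x48048000 with the usual AM335x non-1ms dmtimer register offsets (TCRR at 0x3C, TCAR1 at 0x50), the timer already clocked, running and configured for capture on its input pin, and a host-controlled gpio (a placeholder, exported at /sys/class/gpio/gpio60) jumpered to that pin. If the timer module is not enabled, touching it through /dev/mem will itself produce a bus error.

    # Sketch: with a slow clock on tclkin, produce a capture edge ourselves and
    # compare the TCRR value around the edge with what ends up in TCAR1.
    # Base address, register offsets and the gpio path are assumptions/placeholders.
    import mmap, os, struct, time

    TIMER_BASE = 0x48048000          # assumption: DMTIMER6
    TCRR, TCAR1 = 0x3C, 0x50         # counter register, first capture register
    GPIO_VALUE = "/sys/class/gpio/gpio60/value"   # placeholder gpio, wired to the capture pin

    fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
    regs = mmap.mmap(fd, 0x1000, offset=TIMER_BASE)

    def rd(off):
        return struct.unpack_from("<I", regs, off)[0]

    def gpio(level):
        with open(GPIO_VALUE, "w") as f:
            f.write("1" if level else "0")

    gpio(0)
    n_before = rd(TCRR)              # counter value just before the edge
    gpio(1)                          # rising edge into the timer's capture input
    n_after = rd(TCRR)               # counter value just after the edge
    time.sleep(1.0)                  # let several (slow) timer ticks elapse
    print("edge between TCRR=%d and %d, TCAR1=%d" % (n_before, n_after, rd(TCAR1)))

With, say, a 10 Hz clock on tclkin, n_before and n_after will normally be equal, and the difference between TCAR1 and that value is exactly the capture latency in timer ticks being discussed.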
Jun 15 14:19:13 I am wondering if there was an update in the latest images that may require an update to the cape setup
Jun 15 14:19:40 I have been reading and others have used the cape in older builds of the beaglebone images.
Jun 15 14:20:14 I spent enough time working on a gui that I would prefer not cannibalizing it to keep other functionality.
Jun 15 14:29:43 MattB0ne: any pins used by the cape should be declared in the overlay, hence be visible using show-pins, and you wouldn't be able to claim those pins using another overlay
Jun 15 14:30:03 and, assuming the cape datasheet is correct, you know which pins those are
Jun 15 14:30:25 right, but that is the only thing keeping code from working vs completely broken
Jun 15 14:30:39 everything works with the cape off, nothing works with it on
Jun 15 14:31:17 I could try another cape. I had to physically modify it for it to turn on reliably
Jun 15 14:31:49 maybe something else is wrong in the hardware. I can assume since there are no overlay collisions that software-wise nothing is wrong
Jun 15 14:31:54 the cape has pass-through headers, right? just use a multimeter (in continuity-test mode) to confirm each pin you're trying to use on the pass-through header is physically connected to the corresponding pin on the main header
Jun 15 14:32:05 (while the cape is not attached to the BBB, obviously)
Jun 15 14:32:28 I can do that
Jun 15 14:32:45 that makes sense to rule out hardware
Jun 15 14:37:50 zmatt: the absolute cycle delay isn't so important, but I hope it's at least constant
Jun 15 14:40:50 thinkfat: it seems extremely improbable it wouldn't be constant, apart from the resynchronization time which intrinsically depends on where the capture signal edge is relative to the timer clock
Jun 15 15:19:45 zmatt: another question you might be able to answer: do all dmtimers share the same interconnect?
Jun 15 15:21:03 timers 0 and 1 are on the wakeup interconnect (L4WK), the rest are on the L4LS (aka L4_PER)
Jun 15 15:24:08 zmatt: the writes to the actual timer registers will be synchronous to the timer fclk, right?
Jun 15 15:26:17 writes will be resynchronized to the timer clock domain. in non-posted mode the write will wait for completion, in posted mode it happens asynchronously and the status can be checked in TWPS
Jun 15 15:26:44 writes to the actual timer registers, that is
Jun 15 15:27:18 (control, counter, reload, trigger, match)
Jun 15 15:41:49 @zmatt: well that is quite bad, I really can't use this device without wifi. Do you have any suggestions as to what I can do? Do you think this is a hardware or a software issue?
Jun 15 15:43:17 100% software
Jun 15 15:45:29 try downgrading to the latest 4.14-ti kernel (4.14.108-ti-r136) to see if it's a regression
Jun 15 15:45:30 rule #1: it's the wifi driver
Jun 15 15:45:44 rule #2: no really, it's the wifi driver
Jun 15 15:51:44 :/
Jun 15 15:52:12 well, software can be fixed!
Jun 15 15:52:14 over here, I suspect a wifi driver of munging data coming from other usb devices on the same hub
Jun 15 15:52:36 that.. shouldn't be possible
Jun 15 15:52:46 So do you have a guide for downgrading the kernel?
Jun 15 15:52:47 and yet it happens
Jun 15 15:53:12 mru: that would imply the fault is with the driver for the usb controller, or a hardware problem in the hub
Jun 15 15:53:13 it's an external driver for a realtek chip not supported in-tree
Jun 15 15:53:31 oh I love that.. at my work I saw the famous log print "this should never happen" printed to the error log ;) ..
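As a concrete illustration of the posted-write handshake described above (writes to the control, counter, reload, trigger and match registers complete asynchronously in the timer clock domain, with pending status reported in TWPS), here is a hedged sketch. The TWPS bit positions and register offsets are assumptions based on the standard AM335x dmtimer register map, and the DMTIMER6 base address is likewise an assumption.

    # Sketch of posted-mode timer writes: before writing a timer register, wait
    # until its W_PEND_* bit in TWPS clears, otherwise the new write may be lost.
    import mmap, os, struct

    TIMER_BASE = 0x48048000                      # assumption: DMTIMER6
    TWPS, TCRR, TLDR, TTGR = 0x48, 0x3C, 0x40, 0x44
    W_PEND = {TCRR: 1 << 1, TLDR: 1 << 2, TTGR: 1 << 3}   # assumed TWPS bit layout

    fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
    regs = mmap.mmap(fd, 0x1000, offset=TIMER_BASE)

    def write_posted(off, value):
        # spin until no write to this register is still pending in the timer clock domain
        while struct.unpack_from("<I", regs, TWPS)[0] & W_PEND[off]:
            pass
        struct.pack_into("<I", regs, off, value)

    write_posted(TLDR, 0)          # reload value
    write_posted(TCRR, 0)          # counter
    write_posted(TTGR, 0)          # any write here triggers a reload of TLDR into TCRR

In non-posted mode the interconnect access itself stalls until the write completes in the timer clock domain, so the polling loop is unnecessary; posted mode trades that stall for the obligation to check TWPS before touching the same register again.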
Jun 15 15:53:33 I suspect something like a use-after-free bug
Jun 15 15:53:49 oh right, if it's like kernel memory corruption then anything can happen of course
Jun 15 15:54:08 Munke: sudo apt-get install linux-image-4.14.108-ti-r136
Jun 15 15:54:21 oh
Jun 15 15:54:22 if I read a file from a flash stick while the wifi is active, I occasionally get bad data
Jun 15 15:54:28 there's also a script somewhere in /opt/ but I forgot where or what arguments to give it
Jun 15 15:54:35 usbmon sees the correct data
Jun 15 15:54:40 mru: oof
Jun 15 15:55:13 somewhere during its journey up the stack, it goes bad
Jun 15 16:02:25 This might be harder than I thought..
Jun 15 16:02:37 crashed before I could do anything :D
Jun 15 16:02:38 ..
Jun 15 16:03:36 Munke: ? you previously made it sound like it took non-negligible time to crash
Jun 15 16:04:33 It previously took days or at least many hours. now it did it in 5 minutes
Jun 15 16:04:48 due to the traffic of downloading a package?
Jun 15 16:04:48 trying to downgrade as you told me now
Jun 15 16:04:55 no, just booting
Jun 15 16:05:07 oh, after downgrading you mean?
Jun 15 16:05:08 but let's see
Jun 15 16:05:14 no, before
Jun 15 16:05:18 oh, just bad luck you mean
Jun 15 16:05:21 yes
Jun 15 16:05:24 :P
Jun 15 16:05:48 if you are used to crashes in hours and now they happen in minutes, it is quite fast ;)
Jun 15 16:05:57 oh! downgrade complete
Jun 15 16:07:35 debian@beaglebone:~$ uname -a
Jun 15 16:07:35 Linux beaglebone 4.14.108-ti-r136 #1buster SMP PREEMPT Mon Jun 8 15:38:39 UTC 2020 armv7l GNU/Linux <-- let's hope that is all
Jun 15 16:07:52 if that does fix the problem then obviously the regression needs to be reported
Jun 15 16:07:58 @zmatt, if this fails too like the one before, do you have any other kernel version to try?
Jun 15 16:08:48 @zmatt true, so this is an issue to report in the linux kernel?
Jun 15 16:09:27 Munke: I'd report it to https://github.com/beagleboard/Latest-Images/issues to begin with
Jun 15 16:11:56 Oh, ok then, but thanks, I can really tell that you've done this a couple of times before :)!
Jun 15 16:12:26 I have a bbbw running stably on 4.14 (albeit 4.14.78-bone17, but I doubt that matters)
Jun 15 16:13:44 it does occasionally trip a warning in the kernel log, but no panic
Jun 15 16:13:56 zmatt: should the pins be connected? for the LCD cape the only pins connected are the GNDs
Jun 15 16:14:31 MattB0ne: that's almost certainly a measurement error or you're measuring the wrong pins
Jun 15 16:15:07 ok let me try again. I am just focusing on the pins that should be usable according to the sheet
Jun 15 16:15:32 i.e. nearly all of P9 (except two pins iirc) and some of P8
Jun 15 16:16:03 should be passed through
Jun 15 17:20:17 ok it works
Jun 15 17:21:01 the issue falls in line with my past exploits
Jun 15 17:21:15 the pins were correct though
Jun 15 17:21:30 you mean user error?
Jun 15 17:23:14 they flip the headers
Jun 15 17:23:22 so while the pins were in the right spot
Jun 15 17:23:40 they change the orientation so that it is flipped
Jun 15 17:23:53 so I had pins in what I thought to be P9 but was P8
Jun 15 17:24:22 it is labeled but the silk screen makes it difficult to discern A vs B
Jun 15 17:24:34 the continuity test showed what was linked
Jun 15 17:24:59 moving the pins to the correct header got all the scripts working
Jun 15 17:26:26 I mean, of course they're flipped... the cape's passthrough headers are facing the opposite direction of the beaglebone's headers when it's plugged into this cape
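Following the downgrade steps above (apt-get install of the 4.14.108-ti-r136 image, then confirming with uname -a), a small sketch that double-checks the running kernel against the version installed and lists what is actually present in /boot. The expected version string is simply the one from this conversation, and the vmlinuz-<version> naming convention is an assumption about how the kernel packages lay out /boot.

    # Sketch: confirm the running kernel matches the downgraded package and see
    # which kernel versions exist in /boot (candidates for uname_r in uEnv.txt).
    import glob, os, re

    expected = "4.14.108-ti-r136"            # version from the conversation
    running = os.uname().release             # same as `uname -r`

    print("running kernel:", running)
    print("matches expected:", running == expected)

    installed = sorted(re.sub(r"^.*vmlinuz-", "", p)
                       for p in glob.glob("/boot/vmlinuz-*"))
    print("kernels in /boot:", installed)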
Jun 15 17:27:14 they're mirrored
Jun 15 17:27:44 I assumed the same orientation, essentially translated, but once again I was wrong
Jun 15 17:28:21 if you think about where P9 and P8 are on the BBB and then where they are when you plug it into the cape, it should be obvious
Jun 15 17:28:21 a lot of headache for a bonehead maneuver
Jun 15 17:28:47 (pout)
Jun 15 17:29:04 yeah I am pouting
Jun 15 17:32:18 anyways
Jun 15 17:32:51 I am building code around the py-uio debugger script. I can use the core object as my reporter of all things going on in the PRU
Jun 15 17:33:03 it appears to have all the data pulled out
Jun 15 17:33:23 going to use it to populate my gui with the encoder counts
Jun 15 17:33:24 just keep in mind that single-stepping the pru from a python script is obviously excruciatingly slow
Jun 15 17:33:50 debugging.py is just so you can see step by step what the pru is doing
Jun 15 17:34:07 but it means it'll do only a few instructions per second instead of 200000000 instructions per second
Jun 15 17:36:21 normally you'd just start the core and read the counter from memory
Jun 15 17:45:41 e.g. since the program writes it to offset 0x00 from c28 (typically configured to pruss shared memory) you'd read it from there: https://pastebin.com/z4A6TWyX
Jun 15 17:47:01 it might actually be more sensible to store to offset 0x00 from c24 (core local ram) and read it from there (core.dram.map( u32, 0x00 ) for python code, or offset 0x00 from c25 for the other pru core)
Jun 15 17:54:30 My use didn't need shared memory, only PRU local, so I used a shared ring buffer for logging and let the PRUs run full-speed. If one crashed it would still usually log enough that I could figure out why.
Jun 15 17:54:48 I mean, this is all pruss memory
Jun 15 17:54:53 I'm not talking about ddr3
Jun 15 17:55:56 the only difference between the three pruss local memories is the prioritization of initiators that try to access them simultaneously
Jun 15 17:56:44 (and accesses by different initiators to different pruss local memories proceed simultaneously)
Jun 15 18:00:33 since core0 is updating the counter at a potentially very high rate, putting it in core0's local ram would ensure that it'll never stall as a result of memory accesses by pru core 1 or the cortex-a8
Jun 15 18:02:45 @zmatt: Just happened again, will get dump
Jun 15 18:03:39 Munke: how odd, I wonder why this is happening in your specific situation
Jun 15 18:04:27 it is just sitting on my table
Jun 15 18:04:30 nothing else.
Jun 15 18:04:49 running nothing, with a freshly flashed console os
Jun 15 18:05:50 right, so presumably there's just something about your wifi situation that's causing it to go nuts. but I can try running the latest console image on a bbbw at the office when I get there, see what happens
Jun 15 18:06:26 there's currently one running there on a kernel 4.14 since the last power interruption 20 days ago
Jun 15 18:06:38 https://pastebin.com/spx7dAhT
Jun 15 18:07:02 what's that garbage at the start? that's really weird too
Jun 15 18:07:08 yup
Jun 15 18:07:42 using a raspberry pi to log this, maybe I should start using the newly bought usb part.
Jun 15 18:07:48 shouldn't matter
Jun 15 18:08:06 configured to baudrate 115200?
Jun 15 18:08:37 are you seeing log output from u-boot and the kernel during boot?
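For the "start the core and read the counter from memory" approach described above, a minimal py-uio sketch built around the core.dram.map(u32, 0x00) call mentioned in the conversation. It assumes the py-uio uio-pruss setup (Icss bound at /dev/uio/pruss/module) and a PRU0 program, already loaded and started by other means, that keeps a 32-bit count at offset 0x00 of its own local data RAM (the region the PRU itself reaches via c24).

    # Minimal sketch: poll a 32-bit counter kept by PRU0 at offset 0x00 of its
    # local data ram, using py-uio. Assumes the uio-pruss overlay is loaded and
    # the PRU program is already running.
    from time import sleep
    from ctypes import c_uint32 as u32
    from uio.ti.icss import Icss

    pruss = Icss("/dev/uio/pruss/module")
    core = pruss.core0

    # map the first word of PRU0 local data ram (what the PRU sees via c24)
    counter = core.dram.map(u32, 0x00)

    while True:
        print("encoder count:", counter.value)
        sleep(0.1)

Keeping the count in core0's local RAM matches the prioritization point above: core0's own updates won't stall because of reads from the other PRU core or the cortex-a8.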
Jun 15 18:08:47 yes, 115200
Jun 15 18:08:58 yes, I see that as well
Jun 15 18:09:24 ohh never mind, that's a clear-screen
Jun 15 18:09:29 \e[H\e[J
Jun 15 18:09:36 terminal sequence
Jun 15 18:11:09 was it connected to wifi while this was happening?
Jun 15 18:11:34 https://pastebin.com/U3AkKpaT
Jun 15 18:11:51 yes I assume
Jun 15 18:12:29 you still have the "quiet" option, you should disable that to actually get useful log output
Jun 15 18:12:30 that it was connected, I've never seen the wifi led off and it does autoconnect, so I assume that it was connected at that point
Jun 15 18:12:50 oh yeah, well a reflash does not preserve changes ;)
Jun 15 18:12:54 sec
Jun 15 18:14:33 like, it needs to reenable the sdio interface in an interrupt handler? that sounds like some pointless power management thing
Jun 15 18:16:30 there's probably a way to disable that and keep the sdio controller enabled... I do wonder how it's managing to get a bus error in the power management code though
Jun 15 18:18:12 https://pastebin.com/ZnfsQPV3
Jun 15 18:18:14 here we go
Jun 15 18:18:48 and like, literally _nobody_ on the entire internet has had this specific fault: https://www.google.com/search?hl=en&tbs=li:1&q=%22unhandled+fault%22+am33xx_pwrdm_read_pwrst
Jun 15 18:19:36 nope, I've got some skills in breaking stuff
Jun 15 18:23:35 Munke: what on earth was happening at the start of your first paste though... https://pastebin.com/NbZGHUs7
Jun 15 18:23:55 you were typing in your screen session and got that garbage echoed back?
Jun 15 18:24:35 I see bits of commands, but mostly garbage
Jun 15 18:24:41 I was not using it when it crashed, but obviously had been earlier on. but that ip/ifconfig is not of my doing
Jun 15 18:24:49 honestly :D
Jun 15 18:25:24 seems to me to be wireless related; wl at least was the driver for my laptop's wireless... oh the troubles I had with that...!
Jun 15 18:25:45 it's also not quite the same fault as the one you had now on 4.14
Jun 15 18:25:57 or maybe the function just got renamed
Jun 15 18:26:19 no google hits for that one either
Jun 15 18:26:32 [130491.356558] Unhandled fault: external abort on non-linefetch (0x1028) at 0xf9e00004 is the error from the one you just referenced
Jun 15 18:26:42 that line literally means nothing
Jun 15 18:26:57 the address is a virtual address and the error message is wrong
Jun 15 18:27:24 so it basically says "some fault at some address"
Jun 15 18:27:25 well that is great :D
Jun 15 18:28:23 But I think it is quite serious that it is so badly handled that the entire device crashes.. I would hope that such errors might be handled..
Jun 15 18:29:35 One thing @zmatt.. The antennas are still lying on the board, just as they were delivered. Do you think that it might be some insane interference between the antennas and the cpu?
Jun 15 18:32:05 but 0x1028 decodes to: synchronous fault (so the traceback should be valid), bus error (DECERR) on data read
Jun 15 18:32:26 Munke: dunno, but you should definitely rotate them to be pointing away from the board
Jun 15 18:32:49 having an antenna hugging the pcb is not exactly a good idea
Jun 15 18:32:58 they're just rotated inward to protect them during shipping
Jun 15 18:34:45 Munke: was this previous log (https://pastebin.com/NbZGHUs7) captured with or without "quiet"?
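Since the console dumps above are being captured from a raspberry pi on the 115200-baud debug UART, here is a sketch of a capture helper that timestamps each received line and flushes it straight to disk, so whatever precedes a crash isn't left sitting in a screen scrollback. The serial device path and output file are placeholders, and the pyserial package is assumed to be installed on the pi.

    # Sketch: log the BBB debug UART (115200 8N1) from the raspberry pi,
    # timestamping each line and flushing immediately so the tail survives a crash.
    import datetime
    import serial   # pyserial

    port = serial.Serial("/dev/ttyUSB0", 115200, timeout=1)   # placeholder device

    with open("bbb-console.log", "a", buffering=1) as out:
        while True:
            line = port.readline()           # returns b"" when the read times out
            if not line:
                continue
            stamp = datetime.datetime.now().isoformat(timespec="seconds")
            out.write(f"{stamp} {line.decode(errors='replace')}")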
Jun 15 18:34:55 it looks quiet to me
Jun 15 18:35:23 there should have been substantially more output associated with the "wlcore: ERROR Tx stuck" event
Jun 15 18:36:38 but maybe the default console log level filters it out even without 'quiet'
Jun 15 18:37:46 it could be useful to also have a 'journalctl -f -o short-monotonic' open (via ssh via usb networking) to log what's going on
Jun 15 18:38:25 maybe it won't catch what happens right before the crash, but maybe it will
Jun 15 19:15:31 @zmatt, will enable journalctl -f -o short-monotonic, then let's see what happens
Jun 15 19:28:25 I've started a screen on my raspberry pi and put it in detached mode.
Jun 15 19:28:45 But well, if we are "unlucky" it can take days before this happens again
Jun 15 19:29:55 @zmatt yes that was logged with quiet; quiet is removed now.
Jun 15 19:32:53 you're still on kernel 4.14 right? once you capture an event with this it might be useful to also capture one with 4.19 .. you can switch back by changing the "uname_r" variable in /boot/uEnv.txt to any installed kernel version, e.g. 4.19.94-ti-r42 (double-check what's in /boot if unsure what kernels are installed)
**** ENDING LOGGING AT Mon Jun 15 20:36:34 2020
**** BEGIN LOGGING AT Mon Jun 15 20:38:15 2020
Jun 15 21:32:51 fsfreeze works by doing a FIFREEZE ioctl, according to the source I'm looking at, but I can't find FIFREEZE in the 4.19.106-bone49 source to see what it's actually doing. Anyone know what I'm missing this time?
Jun 15 21:33:31 The Magic Of Asking. Even as I typed that, the latest find, found. \o/
Jun 15 21:33:55 fs/ioctl.c
Jun 15 21:34:14 That is exactly what I found. Thanks!
Jun 15 21:34:28 elixir is useful for stuff like that: https://elixir.bootlin.com/linux/v4.19.106/A/ident/FIFREEZE
Jun 15 21:35:35 Neat.
**** ENDING LOGGING AT Tue Jun 16 03:06:37 2020
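For the fsfreeze/FIFREEZE exchange above: fsfreeze opens the mountpoint and issues the FIFREEZE/FITHAW ioctls handled in fs/ioctl.c. A hedged sketch of doing the same from Python follows; the numeric constants are the usual _IOWR('X', 119, int) and _IOWR('X', 120, int) encodings from linux/fs.h stated here as assumptions rather than pulled from the headers, and the mountpoint is a placeholder. It needs root, and the filesystem stays frozen (blocking writers) until FITHAW.

    # Sketch: freeze and thaw a filesystem the way fsfreeze does, via the
    # FIFREEZE/FITHAW ioctls handled in fs/ioctl.c. Constants assumed from
    # the _IOWR('X', 119/120, int) definitions in linux/fs.h.
    import fcntl, os, sys, time

    FIFREEZE = 0xC0045877
    FITHAW   = 0xC0045878

    mountpoint = sys.argv[1] if len(sys.argv) > 1 else "/mnt"   # placeholder

    fd = os.open(mountpoint, os.O_RDONLY)    # any fd on the filesystem works
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)         # block new writes, flush dirty data
        print("frozen:", mountpoint)
        time.sleep(1)                        # e.g. take a snapshot here
    finally:
        fcntl.ioctl(fd, FITHAW, 0)
        os.close(fd)
        print("thawed:", mountpoint)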