**** BEGIN LOGGING AT Mon Jun 15 02:59:58 2020
Jun 15 06:53:06 hi @zmatt, finally got the log from a crash; https://pastebin.com/NbZGHUs7
Jun 15 06:54:05 this is on the eMMC only, sdcard removed
Jun 15 09:14:14 zmatt: I'm looking at the dmtimer capture mode timing diagram in spruh73q.pdf (figure 20-5) and I'm wondering if it's really cycle-accurate
Jun 15 09:15:13 zmatt: if yes, it looks like there will be about 4 or 5 cycles of timer_clock from the actual pievent edge until the TCRR is actually transferred to the TCAR
Jun 15 09:15:26 that's quite a lot...
Jun 15 09:17:34 3 cycles for the edge resync, another one for the actual capture pulse to appear
Jun 15 09:47:45 and another one until the TCAR is updated.
Jun 15 12:30:47 mmc-utils doesn't seem to be able to write to the mmc, so I'm looking for the source code and I've been unable to locate it. I've found several repos but I'd prefer to download the source for the beaglebone. Could someone point me there?
Jun 15 12:38:46 https://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc-utils.git/
Jun 15 12:38:58 Thanks!
Jun 15 12:42:34 @mru, maybe you are able to help me resolve my issue. We talked a while ago about my BBBW that stops working, having solid USR0 and USR2 leds. I've got a dump of the serial debug line, could I perhaps get you to take a look at it? You can find it at https://pastebin.com/NbZGHUs7
Jun 15 12:44:26 something is trying to access address 0xf9e00004, which doesn't exist
Jun 15 12:44:35 no idea why that would happen
Jun 15 12:44:56 said something is apparently am33xx_clkdm_wakeup()
Jun 15 12:45:22 either your kernel is badly misconfigured or your device is broken
Jun 15 12:50:16 Ok, and how would it get badly misconfigured? I just flashed the console OS, so I would expect it to be "clean", or am I mistaken?
Jun 15 13:43:41 Munke: wow, what on earth is happening there
Jun 15 13:44:13 my guess was right though: I said it'd be either the sd card or the wifi driver... it's the wifi driver
Jun 15 13:50:05 the wifi firmware crashed, and normally the driver is able to recover that so smoothly you wouldn't even notice it happened without looking at the kernel log, but in this instance something is going terribly wrong during recovery
Jun 15 13:56:24 thinkfat: I'd expect rsync to take 1-2 cycles (with the worst case of 2 cycles being drawn in that diagram). the 1 cycle delay from that to capt_pulse == tcar1_enable is unfortunate but understandable, and the value loaded into TCAR will be the value of TCRR while tcar1_enable is high
Jun 15 13:57:56 so basically, if your rising edge happens while TCRR=n then I'd expect n+3 to be loaded into TCAR
Jun 15 13:59:28 thinkfat: rsync=resync
Jun 15 14:00:45 so this is one cycle worse than optimal
Jun 15 14:01:09 (for a synchronous timer circuit capturing an asynchronous event)
Jun 15 14:01:22 so you weren't talking about copying files with crazy speed :)
Jun 15 14:01:57 hehe, I use that command so much that it's keep in my muscle memory I guess ;)
Jun 15 14:02:01 *deep
Jun 15 14:02:04 or I just can't type
Jun 15 14:05:04 thinkfat: the exact cycle-accurate timing for both capture and pwm is something you could fairly easily test by sticking some painfully slow clock onto tclkin (possibly even controlling it manually using a gpio)
Jun 15 14:15:30 if an LCD screen is interfering with the use of other GPIO pins, would the problem be in the device overlay?
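The capture-latency question above (whether a rising edge that arrives while TCRR=n really latches n+3 into TCAR) can be checked along the lines zmatt suggests, by driving a painfully slow clock into tclkin and producing the capture edge yourself. The following is only a minimal sketch under stated assumptions: DMTIMER6 mapped at physical address 0x48048000 with the usual AM335x non-1ms dmtimer register offsets (TCRR at 0x3C, TCAR1 at 0x50), the timer already clocked, running and configured for capture on its input pin, and a host-controlled gpio (a placeholder, exported at /sys/class/gpio/gpio60) jumpered to that pin. If the timer module is not enabled, touching it through /dev/mem will itself produce a bus error.

    # Sketch: with a slow clock on tclkin, produce a capture edge ourselves and
    # compare the TCRR value around the edge with what ends up in TCAR1.
    # Base address, register offsets and the gpio path are assumptions/placeholders.
    import mmap, os, struct, time

    TIMER_BASE = 0x48048000          # assumption: DMTIMER6
    TCRR, TCAR1 = 0x3C, 0x50         # counter register, first capture register
    GPIO_VALUE = "/sys/class/gpio/gpio60/value"   # placeholder gpio, wired to the capture pin

    fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
    regs = mmap.mmap(fd, 0x1000, offset=TIMER_BASE)

    def rd(off):
        return struct.unpack_from("<I", regs, off)[0]

    def gpio(level):
        with open(GPIO_VALUE, "w") as f:
            f.write("1" if level else "0")

    gpio(0)
    n_before = rd(TCRR)              # counter value just before the edge
    gpio(1)                          # rising edge into the timer's capture input
    n_after = rd(TCRR)               # counter value just after the edge
    time.sleep(1.0)                  # let several (slow) timer ticks elapse
    print("edge between TCRR=%d and %d, TCAR1=%d" % (n_before, n_after, rd(TCAR1)))

With, say, a 10 Hz clock on tclkin, n_before and n_after will normally be equal, and the difference between TCAR1 and that value is exactly the capture latency in timer ticks being discussed.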
Jun 15 14:19:13 I am wondering if there was an update in the latest images that may require an update to the cape setup
Jun 15 14:19:40 I have been reading and others have used the cape in older builds of the beaglebone images.
Jun 15 14:20:14 I spent enough time working on a gui that I would prefer not cannibalizing it to keep other functionality.
Jun 15 14:29:43 MattB0ne: any pins used by the cape should be declared in the overlay, hence be visible using show-pins, and you wouldn't be able to claim those pins using another overlay
Jun 15 14:30:03 and, assuming the cape datasheet is correct, you know which pins those are
Jun 15 14:30:25 right, but that is the only thing keeping code from working vs completely broken
Jun 15 14:30:39 everything works with the cape off, nothing works with it on
Jun 15 14:31:17 I could try another cape. I had to physically modify it for it to turn on reliably
Jun 15 14:31:49 maybe something else is wrong in the hardware. I can assume since there are no overlay collisions that software-wise nothing is wrong
Jun 15 14:31:54 the cape has pass-through headers, right? just use a multimeter (in continuity-test mode) to confirm each pin you're trying to use on the pass-through header is physically connected to the corresponding pin on the main header
Jun 15 14:32:05 (while the cape is not attached to the BBB, obviously)
Jun 15 14:32:28 I can do that
Jun 15 14:32:45 that makes sense to rule out hardware
Jun 15 14:37:50 zmatt: the absolute cycle delay isn't so important, but I hope it's at least constant
Jun 15 14:40:50 thinkfat: it seems extremely improbable it wouldn't be constant, apart from the resynchronization time which intrinsically depends on where the capture signal edge is relative to the timer clock
Jun 15 15:19:45 zmatt: another question you might be able to answer: do all dmtimers share the same interconnect?
Jun 15 15:21:03 timers 0 and 1 are on the wakeup interconnect (L4WK), the rest are on the L4LS (aka L4_PER)
Jun 15 15:24:08 zmatt: the writes to the actual timer registers will be synchronous to the timer fclk, right?
Jun 15 15:26:17 writes will be resynchronized to the timer clock domain. in non-posted mode the write will wait for completion, in posted mode it happens asynchronously and the status can be checked in TWPS
Jun 15 15:26:44 writes to the actual timer registers, that is
Jun 15 15:27:18 (control, counter, reload, trigger, match)
Jun 15 15:41:49 @zmatt: well that is quite bad, I really can't use this device without wifi. Do you have any suggestions as to what I can do? Do you think this is a hardware or a software issue?
Jun 15 15:43:17 100% software
Jun 15 15:45:29 try downgrading to the latest 4.14-ti kernel (4.14.108-ti-r136) to see if it's a regression
Jun 15 15:45:30 rule #1: it's the wifi driver
Jun 15 15:45:44 rule #2: no really, it's the wifi driver
Jun 15 15:51:44 :/
Jun 15 15:52:12 well, software can be fixed!
Jun 15 15:52:14 over here, I suspect a wifi driver of munging data coming from other usb devices on the same hub
Jun 15 15:52:36 that.. shouldn't be possible
Jun 15 15:52:46 So do you have a guide for downgrading the kernel?
Jun 15 15:52:47 and yet it happens
Jun 15 15:53:12 mru: that would imply the fault is with the driver for the usb controller, or a hardware problem in the hub
Jun 15 15:53:13 it's an external driver for a realtek chip not supported in-tree
Jun 15 15:53:31 oh I love that.. at my work I saw the famous log print "this should never happen" printed to the error log ;) ..
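As a concrete illustration of the posted-write handshake described above (writes to the control, counter, reload, trigger and match registers complete asynchronously in the timer clock domain, with pending status reported in TWPS), here is a hedged sketch. The TWPS bit positions and register offsets are assumptions based on the standard AM335x dmtimer register map, and the DMTIMER6 base address is likewise an assumption.

    # Sketch of posted-mode timer writes: before writing a timer register, wait
    # until its W_PEND_* bit in TWPS clears, otherwise the new write may be lost.
    import mmap, os, struct

    TIMER_BASE = 0x48048000                      # assumption: DMTIMER6
    TWPS, TCRR, TLDR, TTGR = 0x48, 0x3C, 0x40, 0x44
    W_PEND = {TCRR: 1 << 1, TLDR: 1 << 2, TTGR: 1 << 3}   # assumed TWPS bit layout

    fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
    regs = mmap.mmap(fd, 0x1000, offset=TIMER_BASE)

    def write_posted(off, value):
        # spin until no write to this register is still pending in the timer clock domain
        while struct.unpack_from("<I", regs, TWPS)[0] & W_PEND[off]:
            pass
        struct.pack_into("<I", regs, off, value)

    write_posted(TLDR, 0)          # reload value
    write_posted(TCRR, 0)          # counter
    write_posted(TTGR, 0)          # any write here triggers a reload of TLDR into TCRR

In non-posted mode the interconnect access itself stalls until the write completes in the timer clock domain, so the polling loop is unnecessary; posted mode trades that stall for the obligation to check TWPS before touching the same register again.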
Jun 15 15:53:33 I suspect something like a use-after-free bug
Jun 15 15:53:49 oh right, if it's like kernel memory corruption then anything can happen of course
Jun 15 15:54:08 Munke: sudo apt-get install linux-image-4.14.108-ti-r136
Jun 15 15:54:21 oh
Jun 15 15:54:22 if I read a file from a flash stick while the wifi is active, I occasionally get bad data
Jun 15 15:54:28 there's also a script somewhere in /opt/ but I forgot where or what arguments to give it
Jun 15 15:54:35 usbmon sees the correct data
Jun 15 15:54:40 mru: oof
Jun 15 15:55:13 somewhere during its journey up the stack, it goes bad
Jun 15 16:02:25 This might be harder than I thought..
Jun 15 16:02:37 crashed before I could do anything :D
Jun 15 16:02:38 ..
Jun 15 16:03:36 Munke: ? you previously made it sound like it took non-negligible time to crash
Jun 15 16:04:33 It previously took days or at least many hours. now it did it in 5 minutes
Jun 15 16:04:48 due to the traffic of downloading a package?
Jun 15 16:04:48 trying to downgrade as you told me now
Jun 15 16:04:55 no, just booting
Jun 15 16:05:07 oh, after downgrading you mean?
Jun 15 16:05:08 but let's see
Jun 15 16:05:14 no, before
Jun 15 16:05:18 oh, just bad luck you mean
Jun 15 16:05:21 yes
Jun 15 16:05:24 :P
Jun 15 16:05:48 if you are used to crashes in hours and now they happen in minutes, it is quite fast ;)
Jun 15 16:05:57 oh! downgrade complete
Jun 15 16:07:35 debian@beaglebone:~$ uname -a
Jun 15 16:07:35 Linux beaglebone 4.14.108-ti-r136 #1buster SMP PREEMPT Mon Jun 8 15:38:39 UTC 2020 armv7l GNU/Linux <-- let's hope that is all
Jun 15 16:07:52 if that does fix the problem then obviously the regression needs to be reported
Jun 15 16:07:58 @zmatt, if this fails too like the one before, do you have any other kernel version to try?
Jun 15 16:08:48 @zmatt true, so this is an issue to report in the linux kernel?
Jun 15 16:09:27 Munke: I'd report it to https://github.com/beagleboard/Latest-Images/issues to begin with
Jun 15 16:11:56 Oh, ok then, but thanks, I can really tell that you've done this a couple of times before :)!
Jun 15 16:12:26 I have a bbbw running stably on 4.14 (albeit 4.14.78-bone17, but I doubt that matters)
Jun 15 16:13:44 it does occasionally trip a warning in the kernel log, but no panic
Jun 15 16:13:56 zmatt: should the pins be connected? for the LCD cape the only pins connected are the GNDs
Jun 15 16:14:31 MattB0ne: that's almost certainly a measurement error or you're measuring the wrong pins
Jun 15 16:15:07 ok let me try again. I am just focusing on the pins that should be usable according to the sheet
Jun 15 16:15:32 i.e. nearly all of P9 (except two pins iirc) and some of P8
Jun 15 16:16:03 should be passed through
Jun 15 17:20:17 ok it works
Jun 15 17:21:01 the issue falls in line with my past exploits
Jun 15 17:21:15 the pins were correct though
Jun 15 17:21:30 you mean user error?
Jun 15 17:23:14 they flip the headers
Jun 15 17:23:22 so while the pins were in the right spot
Jun 15 17:23:40 they change the orientation so that it is flipped
Jun 15 17:23:53 so I had pins in what I thought to be P9 but was P8
Jun 15 17:24:22 it is labeled but the silk screen makes it difficult to discern A vs B
Jun 15 17:24:34 the continuity test showed what was linked
Jun 15 17:24:59 moving the pins to the correct header got all the scripts working
Jun 15 17:26:26 I mean, of course they're flipped... the cape's passthrough headers are facing the opposite direction of the beaglebone's headers when it's plugged into this cape
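Following the downgrade steps above (apt-get install of the 4.14.108-ti-r136 image, then confirming with uname -a), a small sketch that double-checks the running kernel against the version installed and lists what is actually present in /boot. The expected version string is simply the one from this conversation, and the vmlinuz-<version> naming convention is an assumption about how the kernel packages lay out /boot.

    # Sketch: confirm the running kernel matches the downgraded package and see
    # which kernel versions exist in /boot (candidates for uname_r in uEnv.txt).
    import glob, os, re

    expected = "4.14.108-ti-r136"            # version from the conversation
    running = os.uname().release             # same as `uname -r`

    print("running kernel:", running)
    print("matches expected:", running == expected)

    installed = sorted(re.sub(r"^.*vmlinuz-", "", p)
                       for p in glob.glob("/boot/vmlinuz-*"))
    print("kernels in /boot:", installed)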
Jun 15 17:27:14 they're mirrored
Jun 15 17:27:44 I assumed the same orientation, essentially translated, but once again I was wrong
Jun 15 17:28:21 if you think about where P9 and P8 are on the BBB and then where they are when you plug it into the cape, it should be obvious
Jun 15 17:28:21 a lot of headache for a bonehead maneuver
Jun 15 17:28:47 (pout)
Jun 15 17:29:04 yeah I am pouting
Jun 15 17:32:18 anyways
Jun 15 17:32:51 I am building code around the py-uio debugger script. I can use the core object as my reporter of all things going on in the PRU
Jun 15 17:33:03 it appears to have all the data pulled out
Jun 15 17:33:23 going to use it to populate my gui with the encoder counts
Jun 15 17:33:24 just keep in mind that single-stepping the pru from a python script is obviously excruciatingly slow
Jun 15 17:33:50 debugging.py is just so you can see step by step what the pru is doing
Jun 15 17:34:07 but it means it'll do only a few instructions per second instead of 200000000 instructions per second
Jun 15 17:36:21 normally you'd just start the core and read the counter from memory
Jun 15 17:45:41 e.g. since the program writes it to offset 0x00 from c28 (typically configured to pruss shared memory) you'd read it from there: https://pastebin.com/z4A6TWyX
Jun 15 17:47:01 it might actually be more sensible to store to offset 0x00 from c24 (core local ram) and read it from there (core.dram.map( u32, 0x00 ) for python code, or offset 0x00 from c25 for the other pru core)
Jun 15 17:54:30 My use didn't need shared memory, only PRU local, so I used a shared ring buffer for logging and let the PRUs run full-speed. If one crashed it would still usually log enough that I could figure out why.
Jun 15 17:54:48 I mean, this is all pruss memory
Jun 15 17:54:53 I'm not talking about ddr3
Jun 15 17:55:56 the only difference between the three pruss local memories is the prioritization of initiators that try to access them simultaneously
Jun 15 17:56:44 (and accesses by different initiators to different pruss local memories proceed simultaneously)
Jun 15 18:00:33 since core0 is updating the counter at a potentially very high rate, putting it in core0's local ram would ensure that it'll never stall as a result of memory accesses by pru core 1 or the cortex-a8
Jun 15 18:02:45 @zmatt: Just happened again, will get dump
Jun 15 18:03:39 Munke: how odd, I wonder why this is happening in your specific situation
Jun 15 18:04:27 it is just sitting on my table
Jun 15 18:04:30 nothing else.
Jun 15 18:04:49 running nothing, with a freshly flashed console os
Jun 15 18:05:50 right, so presumably there's just something about your wifi situation that's causing it to go nuts. but I can try running the latest console image on a bbbw at the office when I get there, see what happens
Jun 15 18:06:26 there's currently one running there on a kernel 4.14 since the last power interruption 20 days ago
Jun 15 18:06:38 https://pastebin.com/spx7dAhT
Jun 15 18:07:02 what's that garbage at the start? that's really weird too
Jun 15 18:07:08 yup
Jun 15 18:07:42 using a raspberry pi to log this, maybe I should start using the newly bought usb part.
Jun 15 18:07:48 shouldn't matter
Jun 15 18:08:06 configured to baudrate 115200?
Jun 15 18:08:37 are you seeing log output from u-boot and the kernel during boot?
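For the "start the core and read the counter from memory" approach described above, a minimal py-uio sketch built around the core.dram.map(u32, 0x00) call mentioned in the conversation. It assumes the py-uio uio-pruss setup (Icss bound at /dev/uio/pruss/module) and a PRU0 program, already loaded and started by other means, that keeps a 32-bit count at offset 0x00 of its own local data RAM (the region the PRU itself reaches via c24).

    # Minimal sketch: poll a 32-bit counter kept by PRU0 at offset 0x00 of its
    # local data ram, using py-uio. Assumes the uio-pruss overlay is loaded and
    # the PRU program is already running.
    from time import sleep
    from ctypes import c_uint32 as u32
    from uio.ti.icss import Icss

    pruss = Icss("/dev/uio/pruss/module")
    core = pruss.core0

    # map the first word of PRU0 local data ram (what the PRU sees via c24)
    counter = core.dram.map(u32, 0x00)

    while True:
        print("encoder count:", counter.value)
        sleep(0.1)

Keeping the count in core0's local RAM matches the prioritization point above: core0's own updates won't stall because of reads from the other PRU core or the cortex-a8.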
Jun 15 18:08:47 yes, 115200
Jun 15 18:08:58 yes, I see that as well
Jun 15 18:09:24 ohh never mind, that's a clear-screen
Jun 15 18:09:29 \e[H\e[J
Jun 15 18:09:36 terminal sequence
Jun 15 18:11:09 was it connected to wifi while this was happening?
Jun 15 18:11:34 https://pastebin.com/U3AkKpaT
Jun 15 18:11:51 yes I assume
Jun 15 18:12:29 you still have the "quiet" option, you should disable that to actually get useful log output
Jun 15 18:12:30 that it was connected, I've never seen the wifi led off and it does autoconnect, so I assume that it was connected at that point
Jun 15 18:12:50 oh yeah, well a reflash does not preserve changes ;)
Jun 15 18:12:54 sec
Jun 15 18:14:33 like, it needs to reenable the sdio interface in an interrupt handler? that sounds like some pointless power management thing
Jun 15 18:16:30 there's probably a way to disable that and keep the sdio controller enabled... I do wonder how it's managing to get a bus error in the power management code though
Jun 15 18:18:12 https://pastebin.com/ZnfsQPV3
Jun 15 18:18:14 here we go
Jun 15 18:18:48 and like, literally _nobody_ on the entire internet has had this specific fault: https://www.google.com/search?hl=en&tbs=li:1&q=%22unhandled+fault%22+am33xx_pwrdm_read_pwrst
Jun 15 18:19:36 nope, I've got some skills in breaking stuff
Jun 15 18:23:35 Munke: what on earth was happening at the start of your first paste though... https://pastebin.com/NbZGHUs7
Jun 15 18:23:55 you were typing in your screen session and got that garbage echoed back?
Jun 15 18:24:35 I see bits of commands, but mostly garbage
Jun 15 18:24:41 I was not using it when it crashed, but obviously had been earlier on. but that ip/ifconfig is not of my doing
Jun 15 18:24:49 honestly :D
Jun 15 18:25:24 seems to me to be wireless related; wl at least was the driver for my laptop's wireless... oh the troubles I had with that...!
Jun 15 18:25:45 it's also not quite the same fault as the one you had now on 4.14
Jun 15 18:25:57 or maybe the function just got renamed
Jun 15 18:26:19 no google hits for that one either
Jun 15 18:26:32 [130491.356558] Unhandled fault: external abort on non-linefetch (0x1028) at 0xf9e00004 is the error from the one you just referenced
Jun 15 18:26:42 that line literally means nothing
Jun 15 18:26:57 the address is a virtual address and the error message is wrong
Jun 15 18:27:24 so it basically says "some fault at some address"
Jun 15 18:27:25 well that is great :D
Jun 15 18:28:23 But I think it is quite serious that it is so badly handled that the entire device crashes.. I would hope that such errors might be handled..
Jun 15 18:29:35 One thing @zmatt.. The antennas are still lying on the board, just as they were delivered. Do you think that it might be some insane interference between the antennas and the cpu?
Jun 15 18:32:05 but 0x1028 decodes to: synchronous fault (so the traceback should be valid), bus error (DECERR) on data read
Jun 15 18:32:26 Munke: dunno, but you should definitely rotate them to be pointing away from the board
Jun 15 18:32:49 having an antenna hugging the pcb is not exactly a good idea
Jun 15 18:32:58 they're just rotated inward to protect them during shipping
Jun 15 18:34:45 Munke: was this previous log (https://pastebin.com/NbZGHUs7) captured with or without "quiet"?
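Since the console dumps above are being captured from a raspberry pi on the 115200-baud debug UART, here is a sketch of a capture helper that timestamps each received line and flushes it straight to disk, so whatever precedes a crash isn't left sitting in a screen scrollback. The serial device path and output file are placeholders, and the pyserial package is assumed to be installed on the pi.

    # Sketch: log the BBB debug UART (115200 8N1) from the raspberry pi,
    # timestamping each line and flushing immediately so the tail survives a crash.
    import datetime
    import serial   # pyserial

    port = serial.Serial("/dev/ttyUSB0", 115200, timeout=1)   # placeholder device

    with open("bbb-console.log", "a", buffering=1) as out:
        while True:
            line = port.readline()           # returns b"" when the read times out
            if not line:
                continue
            stamp = datetime.datetime.now().isoformat(timespec="seconds")
            out.write(f"{stamp} {line.decode(errors='replace')}")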
Jun 15 18:34:55 it looks quiet to me
Jun 15 18:35:23 there should have been substantially more output associated with the "wlcore: ERROR Tx stuck" event
Jun 15 18:36:38 but maybe the default console log level filters it out even without 'quiet'
Jun 15 18:37:46 it could be useful to also have a 'journalctl -f -o short-monotonic' open (via ssh via usb networking) to log what's going on
Jun 15 18:38:25 maybe it won't catch what happens right before the crash, but maybe it will
Jun 15 19:15:31 @zmatt, will enable journalctl -f -o short-monotonic, then let's see what happens
Jun 15 19:28:25 I've started a screen on my raspberry pi and put it in detached mode.
Jun 15 19:28:45 But well, if we are "unlucky" it can take days before this happens again
Jun 15 19:29:55 @zmatt yes that was logged with quiet; quiet is removed now.
Jun 15 19:32:53 you're still on kernel 4.14 right? once you capture an event with this it might be useful to also capture one with 4.19 .. you can switch back by changing the "uname_r" variable in /boot/uEnv.txt to any installed kernel version, e.g. 4.19.94-ti-r42 (double-check what's in /boot if unsure what kernels are installed)
**** ENDING LOGGING AT Mon Jun 15 20:36:34 2020
**** BEGIN LOGGING AT Mon Jun 15 20:38:15 2020
Jun 15 21:32:51 fsfreeze works by doing a FIFREEZE ioctl, according to the source I'm looking at, but I can't find FIFREEZE in the 4.19.106-bone49 source to see what it's actually doing. Anyone know what I'm missing this time?
Jun 15 21:33:31 The Magic Of Asking. Even as I typed that, the latest find, found. \o/
Jun 15 21:33:55 fs/ioctl.c
Jun 15 21:34:14 That is exactly what I found. Thanks!
Jun 15 21:34:28 elixir is useful for stuff like that: https://elixir.bootlin.com/linux/v4.19.106/A/ident/FIFREEZE
Jun 15 21:35:35 Neat.
**** ENDING LOGGING AT Tue Jun 16 03:06:37 2020
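For the fsfreeze/FIFREEZE exchange above: fsfreeze opens the mountpoint and issues the FIFREEZE/FITHAW ioctls handled in fs/ioctl.c. A hedged sketch of doing the same from Python follows; the numeric constants are the usual _IOWR('X', 119, int) and _IOWR('X', 120, int) encodings from linux/fs.h stated here as assumptions rather than pulled from the headers, and the mountpoint is a placeholder. It needs root, and the filesystem stays frozen (blocking writers) until FITHAW.

    # Sketch: freeze and thaw a filesystem the way fsfreeze does, via the
    # FIFREEZE/FITHAW ioctls handled in fs/ioctl.c. Constants assumed from
    # the _IOWR('X', 119/120, int) definitions in linux/fs.h.
    import fcntl, os, sys, time

    FIFREEZE = 0xC0045877
    FITHAW   = 0xC0045878

    mountpoint = sys.argv[1] if len(sys.argv) > 1 else "/mnt"   # placeholder

    fd = os.open(mountpoint, os.O_RDONLY)    # any fd on the filesystem works
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)         # block new writes, flush dirty data
        print("frozen:", mountpoint)
        time.sleep(1)                        # e.g. take a snapshot here
    finally:
        fcntl.ioctl(fd, FITHAW, 0)
        os.close(fd)
        print("thawed:", mountpoint)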