**** BEGIN LOGGING AT Sat Jan 19 02:59:56 2019
Jan 19 03:00:02 <mawk> to see which one of the two did something
Jan 19 03:00:32 <zmatt> like, I don't exclude a bug in the mcspi driver, but if so then there must be a reason why it gets triggered in this case specifically.  I've never really had any problems and we use spi heavily
Jan 19 03:01:27 <mawk> yeah I don't think it's from the spi driver
Jan 19 03:01:30 <mawk> it would fail way early
Jan 19 03:01:38 <mawk> for instance when computing the 802.15.4 checksum
Jan 19 03:01:51 <mawk> but then again changing stuff at the spi level shouldn't have fixed the upper levels
Jan 19 03:01:56 <zmatt> wait, it passes the checksum?
Jan 19 03:02:20 <mawk> there's a checksum in 802.15.4 and that one passes
Jan 19 03:02:25 <mawk> then in the upper levels that's where it's wrong
Jan 19 03:02:26 <zmatt> yet the data is corrupted?
Jan 19 03:02:31 <mawk> yes
Jan 19 03:02:44 <zmatt> I mean, that means the data gets corrupted before the checksum is calculated or after it is verified
Jan 19 03:02:58 <zmatt> i.e. not anywhere in the spi layer
Jan 19 03:03:05 <zmatt> if what you're saying is true
Jan 19 03:03:05 <mawk> yes
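
The reasoning above (a passing checksum confines the corruption to before the checksum is computed or after it is verified) can be illustrated with a small user-space sketch. It assumes the 802.15.4 frame check sequence is the 16-bit ITU-T CRC (polynomial x^16 + x^12 + x^5 + 1, initial value 0, least significant bit first) stored little-endian at the end of the frame. The names `fcs16` and `frame_fcs_ok` are hypothetical and do not come from the mrf24j40 or mac802154 code; where the check actually happens on this hardware is not established in the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit ITU-T CRC, assumed here to match the 802.15.4 FCS:
 * initial value 0, bits processed LSB first (0x8408 is 0x1021 reflected). */
static uint16_t fcs16(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);

    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1)
                crc = (uint16_t)((crc >> 1) ^ 0x8408);
            else
                crc = (uint16_t)(crc >> 1);
        }
    }
    return crc;
}

/* Hypothetical receive-side check: true when the trailing two bytes
 * (little-endian) equal the CRC of everything in front of them. */
static bool frame_fcs_ok(const uint8_t *frame, size_t len)
{
    if (len < 2)
        return false;

    uint16_t stored = (uint16_t)(frame[len - 2] | (frame[len - 1] << 8));
    return fcs16(frame, len - 2) == stored;
}
```

A byte that changes between the sender computing the FCS and the receiver checking it makes `frame_fcs_ok` fail. A byte that changes in a buffer after that check has run is invisible to it, which is why a passing checksum points away from the transfer itself.
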
Jan 19 03:03:18 <mawk> hmm
Jan 19 03:03:25 <mawk> well I guess the packets would be dropped if the checksum wasn't right
Jan 19 03:03:39 <zmatt> but then it makes no sense to be platform-dependent
Jan 19 03:03:46 <mawk> I'll check with tcpdump what it's saying about the checksum
Jan 19 03:03:48 <mawk> yeah
Jan 19 03:03:56 <mawk> that's why I thought of config/patches differences
Jan 19 03:04:04 <mawk> also it could be a subtle race condition
Jan 19 03:04:14 <zmatt> certainly
Jan 19 03:04:14 <mawk> that a lowered spi speed thus interrupt rate would "fix"
Jan 19 03:04:19 <mawk> the rpi is faster and has 4 cores
Jan 19 03:04:47 <zmatt> multicore definitely has potential to change behaviour
Jan 19 03:04:52 <zmatt> usually for the worse though
Jan 19 03:05:02 <mawk> yeah
Jan 19 03:06:27 <mawk> it keeps being right even with DMA re-enabled
Jan 19 03:06:35 <mawk> so it's the lowered SPI speed that did the trick
Jan 19 03:06:53 <mawk> maybe I can lower it even further but it's not a fix, it's just allowing more time, which hides the race condition
Jan 19 03:07:12 <zmatt> still seems odd though
Jan 19 03:07:14 <mawk> but also the errors are different now: it's only 2 bytes that are corrupted, and not with \0\0 like before
Jan 19 03:07:32 <mawk> and corrupted always at the same place in the buffer, I can put a hardware breakpoint on it maybe
Jan 19 03:07:36 <mawk> and debug the kernel through uart
Jan 19 03:20:02 <mawk> hmm
Jan 19 03:20:27 <zmatt> I'm also not getting a really good feeling looking at the mrf24j40 driver
Jan 19 03:20:31 <mawk> now there are only two bytes that are wrong, and their contents are precisely what was present earlier in the device receive buffer
Jan 19 03:20:42 <mawk> 49 4A
Jan 19 03:20:49 <mawk> I'm sending a ping request with data 00 01 02 ...
Jan 19 03:21:11 <mawk> sounds like an out of bounds copy or something
Jan 19 03:21:32 <mawk> but why is it triggered only on the BBB, that's a mystery
Jan 19 03:22:32 <zmatt> so, like, in mrf24j40.c
Jan 19 03:22:47 <mawk> yeah, that or above
Jan 19 03:22:51 <zmatt> hmm
Jan 19 03:22:53 <mawk> mrf24j40 is right above spi
Jan 19 03:23:13 <mawk> I expect a broken packet to be discarded by the mac802154 driver (which is the softMAC implementation, right above mrf24j40 which is the PHY driver I assume)
Jan 19 03:23:24 <mawk> or even by the above level which is ieee802154
Jan 19 03:23:39 <mawk> first I should see if packets with broken 802.15.4 checksums are discarded
Jan 19 03:24:01 <zmatt> I'm trying to understand what's preventing spi_async() from being called on one of the few spi_messages allocated by the driver while one is still in progress on that same spi_message object
Jan 19 03:25:06 <mawk> the driver must serialize the requests, isn't that why?
Jan 19 03:28:46 <zmatt> but... like...
Jan 19 03:28:59 <zmatt> write_tx_buf_complete() does not seem to signal any higher layer
Jan 19 03:29:41 <mawk> hmm
Jan 19 03:29:44 <zmatt> I guess maybe there's an irq that signals transmit complete?
Jan 19 03:29:58 <mawk> that's most probable, yes
Jan 19 03:29:59 <mawk> let me check
Jan 19 03:30:56 <mawk> yes
Jan 19 03:31:06 <zmatt> so maybe it's fine... it just makes me a bit nervous that there's no explicit check or obvious interlock that prevents it
Jan 19 03:31:08 <mawk> a TXNIF interrupt
Jan 19 03:31:24 <mawk> yeah
Jan 19 03:35:02 <mawk> with tcpdump I see that the 802.15.4 packet that is sent already contains the corruption
Jan 19 03:35:15 <mawk> uh no I'm tcpdumping it on the wrong machine
Jan 19 03:35:19 <mawk> well at least it passes the checksums
Jan 19 03:37:31 <zmatt> my race condition spider senses are tingling so hard while reading this code
Jan 19 03:38:10 <mawk> lol
Jan 19 03:38:25 <mawk> ok so when the packet is sent from BBB, the data shown in the 802.15.4 packet capture is fine
Jan 19 03:38:27 <zmatt> maybe it's still fine, but if so it's _extremely_ nonobvious why
Jan 19 03:38:28 <mawk> no corruption here
Jan 19 03:38:40 <mawk> so that hints at an actual bug in the module you're looking at
Jan 19 03:38:54 <mawk> and not the higher levels as I was expecting
Jan 19 03:39:48 <mawk> on arrival at the RPi the frame is corrupted
Jan 19 03:40:11 <zmatt> hmm, are you allowed to call spi_async() on an spi_message inside the completion handler for that same message?
Jan 19 03:41:03 <mawk> I think
Jan 19 03:42:05 <mawk> the docs say it can be called even from irq
Jan 19 03:43:00 <zmatt> that's not relevant for my question though
Jan 19 03:43:34 <zmatt> I think the answer may be yes though
Jan 19 03:43:47 <mawk> well it says "it can be invoked in contexts that can't sleep" then "the callback is invoked in a context that can't sleep"
Jan 19 03:44:59 <zmatt> my point was specifically whether, in the completion callback for an spi_message, you're allowed to reuse that same spi_message for a new transfer
Jan 19 03:45:24 <zmatt> which requires that the caller does absolutely nothing further with the message after calling the completion handler
Jan 19 03:46:16 <zmatt> but yeah, looking at the spi core code, that indeed seems to be the case
Jan 19 03:46:29 <mawk> well the docs say not to do anything with the message after you've submitted it
Jan 19 03:46:33 <mawk> so it seems to be the case yes
Jan 19 03:47:26 <zmatt> yeah I just realized it's a silly question... if you can't reuse/release the message in the callback handler, when would you be able to
Jan 19 03:50:59 <zmatt> this still feels like a race condition
Jan 19 03:51:58 <mawk> lowering the SPI speed on the RPi side made the problem worse
Jan 19 03:52:01 <mawk> I don't understand
Jan 19 03:52:11 <mawk> lowering on BBB makes it better, lowering on RPi makes it worse
Jan 19 03:52:22 <zmatt> yeah I was actually about to say that making things slower could easily make the race condition worse
Jan 19 03:53:10 <zmatt> so, the interrupt handler, mrf24j40_isr() disables the irq and starts the transfer to fetch the interrupt bits
Jan 19 03:53:24 <mawk> now it doesn't work anymore even with the correct speed put back again ! I don't understand what's going on
Jan 19 03:53:34 <mawk> yes
Jan 19 03:53:36 <zmatt> its completion reenables the irq and then handles various interrupts
Jan 19 03:53:51 <mawk> shouldn't it re-enable it after ?
Jan 19 03:54:00 <mawk> unless the handling is done in a guaranteed FIFO manner
Jan 19 03:55:44 <zmatt> that's a good question too, but wasn't even what I was getting at... I think in practice all the callbacks are coming from the kernel thread that handles the async transfers on the spi bus, hence they'll be properly serialized
Jan 19 03:56:11 <zmatt> not sure if that's really guaranteed in general, but it might very well be
Jan 19 03:57:04 <zmatt> however it does mean that another irq can be triggered already, hence a new intstat transfer can happen
Jan 19 03:57:42 <mawk> ah
Jan 19 03:57:43 <mawk> yes
Jan 19 03:58:11 <zmatt> i.e. mrf24j40_handle_rx() disabling packet rx might be too late to prevent a second rx interrupt
Jan 19 03:59:37 <zmatt> anyway, this is just a theory.. I feel like calling spi_async on an spi_message twice would probably result in a more spectacular crash&burn than mere corruption, but maybe I'm mistaken about that
Jan 19 04:00:08 <zmatt> I'd still be inclined to add some tracing and/or sanity checks to the driver to verify what's happening exactly
Jan 19 04:00:27 <zmatt> since this just overall doesn't feel like it's intrinsically safe
Jan 19 04:01:50 <mawk> I can try to add some locks
Jan 19 04:02:10 <mawk> I really don't know how I succeeded in making this work some minutes ago, now it's back to square one
Jan 19 04:02:39 <zmatt> or maybe like atomic test-and-set before spi_async and then clear it again in its completion handler
Jan 19 04:03:06 <zmatt> (and if the test-and-set reports it was already set, BUG())
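
The test-and-set sanity check suggested above can be sketched in user-space C11 as follows. This is only an illustration under stated assumptions: `struct guarded_msg` and its helpers are hypothetical names, `atomic_exchange` stands in for whatever atomic primitive the kernel code would use, and a `false` return marks the place where the driver would call BUG(). Nothing here is taken from mrf24j40.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical wrapper around one reusable message object. */
struct guarded_msg {
    atomic_bool in_flight;  /* true between submit and completion */
    int submissions;        /* number of accepted submissions */
};

static void guarded_msg_init(struct guarded_msg *m)
{
    atomic_init(&m->in_flight, false);
    m->submissions = 0;
}

/* Called where the driver would call spi_async(). Returns false if the
 * message is still in flight, which is the case the driver would treat
 * as a bug. */
static bool guarded_submit(struct guarded_msg *m)
{
    /* atomic_exchange returns the previous value: test and set in one step. */
    if (atomic_exchange(&m->in_flight, true))
        return false;

    m->submissions++;
    return true;
}

/* Called from the completion handler, before the message may be reused. */
static void guarded_complete(struct guarded_msg *m)
{
    bool was_in_flight = atomic_exchange(&m->in_flight, false);

    /* Completing a message that was never submitted is also a bug. */
    assert(was_in_flight);
    (void)was_in_flight;
}
```

With one such flag per preallocated message, a second submission before the first completion is reported at the moment it happens, instead of showing up later as corrupted data.
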
Jan 19 04:04:55 <mawk> yes
Jan 19 04:05:49 <mawk> I can again rule out the RPi, the tcpdump clearly shows that the 802.15.4 frame that arrives at the BBB is fine, the 802.15.4 frame that's coming out of the BBB is fine, but the frame that arrives at the RPi is corrupted
Jan 19 04:06:17 <mawk> also the corrupted data is always at the very end of the frame
Jan 19 04:06:30 <zmatt> well, no, that could just as easily mean it gets corrupted in the rpi's rx handling
Jan 19 04:06:39 <mawk> ah, right
Jan 19 04:06:55 <mawk> yeah
Jan 19 04:07:09 <zmatt> in fact, since the rx path looks iffier than the tx path, and the rpi has more opportunities for race conditions due to being multicore, that actually seems more likely to me
Jan 19 04:07:29 <mawk> can't I restrict the module to a single core ?
Jan 19 04:07:34 <mawk> set cpu affinity or something
Jan 19 04:08:00 <zmatt> I guess maybe if you do so for all relevant kernel threads and irqs
Jan 19 04:08:26 <mawk> yeah
Jan 19 04:08:42 <mawk> I didn't see a big amount of kernel threads related to mrf24j40 in htop
Jan 19 04:08:49 <mawk> so it should be easy
Jan 19 04:09:11 <mawk> or maybe I can just boot in single core mode ?
Jan 19 04:09:13 <zmatt> spi thread mainly probably
Jan 19 04:09:18 <zmatt> no idea, maybe
Jan 19 04:09:23 <mawk> yeah
Jan 19 04:10:26 <zmatt> having a known-good communication partner instead of both sides being suspect would definitely make things easier ;)
Jan 19 04:10:31 <mawk> yeah
Jan 19 04:10:36 <mawk> I should get another RPi at the lab
Jan 19 04:10:44 <mawk> or another BBB
Jan 19 04:11:05 <zmatt> none of those would be "known-good" though
Jan 19 04:11:33 <mawk> yeah but I can do with 2 RPis or 2 BBBs
Jan 19 04:11:38 <mawk> and determine which configuration is the good one
Jan 19 04:11:42 <zmatt> I guess spi sniffing would help in identifying whether the transmitter or receiver is fucking up
Jan 19 04:11:49 <mawk> yes
Jan 19 04:11:52 <zmatt> I could work on that in a bit
Jan 19 04:12:11 <mawk> couldn't a kprobe work?
Jan 19 04:12:29 <mawk> inside mcspi
Jan 19 04:13:07 <zmatt> I'm not familiar enough with kprobe to know whether it's suited for capturing data like that
Jan 19 04:13:17 <mawk> ah, right
Jan 19 04:13:18 <zmatt> if you feel it can work, go for it
Jan 19 04:16:50 <mawk> well I've got to go, thanks a lot for your help !
Jan 19 04:16:53 <mawk> and good night
Jan 19 04:16:57 <zmatt> good night
Jan 19 04:27:38 <set_> Hello!
Jan 19 04:31:29 <set_> Here comes a four-wheeled bot for your liking.
Jan 19 04:34:59 <set_> MotorCape and four wheel movement!
Jan 19 19:56:37 <mawk> the power button doesn't work anymore with 5.0.0-rc2
Jan 20 01:52:00 <charlie5> Regarding my confusion with Tait-Bryan angles the other day, the angles are as I expected when I used X-FORWARD IMU orientation (rather than X-UP)
**** ENDING LOGGING AT Sun Jan 20 02:59:57 2019