**** BEGIN LOGGING AT Sat Jan 19 02:59:56 2019 Jan 19 03:00:02 to see which one of the two did something Jan 19 03:00:32 like, I don't exclude a bug in the mcspi driver, but if so then there must be a reason why it gets triggered in this case specifically. I've never really had any problems and we use spi heavily Jan 19 03:01:27 yeah I don't think it's from the spi driver Jan 19 03:01:30 it would fail way early Jan 19 03:01:38 for instance when computing the 802.15.4 checksum Jan 19 03:01:51 but then again changing stuff at the spi level shouldn't have fixed the upper levels Jan 19 03:01:56 wait, it passes the checksum? Jan 19 03:02:20 there's a checksum in 802.15.4 and that one is passed Jan 19 03:02:25 then in the upper levels that's where it's wrong Jan 19 03:02:26 yet the data is corrupted? Jan 19 03:02:31 yes Jan 19 03:02:44 I mean, that means the data gets corrupted before the checksum is calculated or after it is verified Jan 19 03:02:58 i.e. not anywhere in the spi layer Jan 19 03:03:05 if what you're saying is true Jan 19 03:03:05 yes Jan 19 03:03:18 hmm Jan 19 03:03:25 well I guess the packets would be dropped if the checksum wasn't right Jan 19 03:03:39 but then it makes no sense to be platform-dependent Jan 19 03:03:46 I'll check with tcpdump what it's saying about the checksum Jan 19 03:03:48 yeah Jan 19 03:03:56 that's why I thought of config/patches differences Jan 19 03:04:04 also it could be a subtle race condition Jan 19 03:04:14 certainly Jan 19 03:04:14 that a lowered spi speed thus interrupt rate would "fix" Jan 19 03:04:19 the rpi is faster and has 4 cores Jan 19 03:04:47 multicore definitely has potential to change behaviour Jan 19 03:04:52 usually for the worse though Jan 19 03:05:02 yeah Jan 19 03:06:27 it keeps being right even with DMA re-enabled Jan 19 03:06:35 so it's the lowered SPI speed that did the trick Jan 19 03:06:53 maybe I can lower it even further but it's not a fix, it's just letting more time to prevent race conditions Jan 19 03:07:12 still seems odd though Jan 19 03:07:14 but also the errors are different now, now it's only 2 bytes that are corrupted, and not with \0\0 like b efore Jan 19 03:07:32 and corrupted always at the same place in the buffer, I can put a hardware breakpoint on it maybe Jan 19 03:07:36 and debug the kernel through uart Jan 19 03:20:02 hmm Jan 19 03:20:27 I'm also not getting a really good feeling looking at the mrf24j40 driver Jan 19 03:20:31 now there are only two bytes that are wrong, and their contents are precisely what was present earlier in the device receive buffer Jan 19 03:20:42 49 4A Jan 19 03:20:49 I'm sending a ping request with data 00 01 02 ... Jan 19 03:21:11 sounds like an out of bounds copy or something Jan 19 03:21:32 but why is it triggered only on the BBB, that's a mystery Jan 19 03:22:32 so, like, in mrf24j40.c Jan 19 03:22:47 yeah, that or above Jan 19 03:22:51 hmm Jan 19 03:22:53 mrf24j40 is right above spi Jan 19 03:23:13 I expect a broken packet to be discarded by the mac802154 driver (which is the softMAC implementation, right above mrf24j40 which is the PHY driver I assume) Jan 19 03:23:24 or even by the above level which is ieee802154 Jan 19 03:23:39 first I should see if packets with broken 802.15.4 checksums are discarded Jan 19 03:24:01 I'm trying to understand what's preventing spi_async() from being called on one of the few spi_messages allocated by the driver while one is still in progress on that same spi_message object Jan 19 03:25:06 the driver must serialize the requests, that's not why ? Jan 19 03:28:46 but... like... Jan 19 03:28:59 write_tx_buf_complete() does not seem to signal any higher layer Jan 19 03:29:41 hmm Jan 19 03:29:44 I guess maybe there's an irq that signals transmit complete? Jan 19 03:29:58 that's most probable, yes Jan 19 03:29:59 let me check Jan 19 03:30:56 yes Jan 19 03:31:06 so maybe it's fine... it just makes me a bit nervous that there's no explicit check or obvious interlock that prevents it Jan 19 03:31:08 a TXNIF interrupt Jan 19 03:31:24 yeah Jan 19 03:35:02 with tcpdump I see that the 802.15.4 packet that is sent already contains the corruption Jan 19 03:35:15 uh no I'm tcpdumping it on the wrong machine Jan 19 03:35:19 well at least it passes the checksums Jan 19 03:37:31 my race condition spider senses are tingling so hard while reading this code Jan 19 03:38:10 lol Jan 19 03:38:25 ok so when the packet is sent from BBB, the data shown in the 802.15.4 packet capture is fine Jan 19 03:38:27 maybe it's still fine, but if so it's _extremely_ nonobvious why Jan 19 03:38:28 no corruption here Jan 19 03:38:40 so hints at an actual bug in the module you're looking at Jan 19 03:38:54 and not the higher levels as I was expecting Jan 19 03:39:48 on arrival to the RPi the trame is corrupted Jan 19 03:40:11 hmm, are you allowed to call spi_async() on an spi_message inside the completion handler for that same message? Jan 19 03:41:03 I think Jan 19 03:42:05 the docs say it can be called even from irq Jan 19 03:43:00 that's not relevant for my question though Jan 19 03:43:34 I think the answer may be yes though Jan 19 03:43:47 well it says "it can be invoked in contexts that can't sleep" then "the callback is invoked in a context that can't sleep" Jan 19 03:44:59 my point was specifically whether, in the completion callback for an spi_message, you're allowed to reuse that same spi_message for a new transfer Jan 19 03:45:24 which requires that the caller does absolutely nothing further with the message after calling the completion handler Jan 19 03:46:16 but yeah, looking at the spi core code, that indeed seems to be the case Jan 19 03:46:29 well the docs say to don't do anything with the message after you've submitted it Jan 19 03:46:33 so it seems to be the case yes Jan 19 03:47:26 yeah I just realized it's a silly question... if you can't reuse/release the message in the callback handler, when would you be able to Jan 19 03:50:59 this still feels like a race condition Jan 19 03:51:58 lowering the SPI speed on the RPi side made the problem worse Jan 19 03:52:01 I don't understand Jan 19 03:52:11 lowering on BBB makes it better, lowering on RPi makes it worse Jan 19 03:52:22 yeah I was actually about to say that making things slower could easily make the race condition worse Jan 19 03:53:10 so, the interrupt handler, mrf24j40_isr() disables the irq and starts the transfer to fetch the interrupt bits Jan 19 03:53:24 now it doesn't work anymore even with the correct speed put back again ! I don't understand what's going on Jan 19 03:53:34 yes Jan 19 03:53:36 its completion reenables the irq and then handles various interrupts Jan 19 03:53:51 shouldn't it re-enable it after ? Jan 19 03:54:00 unless the handling is done in a guaranteed FIFO manner Jan 19 03:55:44 that's a good question too, but wasn't even what I was getting at... I think in practice all the callbacks are coming from the kernel thread that handles the async transfers on the spi bus, hence they'll be properly serialized Jan 19 03:56:11 not sure if that's really guaranteed in general, but it might very well be Jan 19 03:57:04 however it does mean that another irq can be triggered already, hence a new intstat transfer can happen Jan 19 03:57:42 ah Jan 19 03:57:43 yes Jan 19 03:58:11 i.e. mrf24j40_handle_rx() disabling packet rx might be too late to prevent a second rx interrupt Jan 19 03:59:37 anyway, this is just a theory.. I feel like calling spi_async on an spi_message twice would probably result in a more spectacular crash&burn than mere corruption, but maybe I'm mistaken about that Jan 19 04:00:08 I'd still be inclined to add some tracing and/or sanity checks to the driver to verify what's happening exactly Jan 19 04:00:27 since this just overall doesn't feel like it's intrinsically safe Jan 19 04:01:50 I can try to add some locks Jan 19 04:02:10 I really don't know how I succeeded in making this work some minutes ago, now it's back to square one Jan 19 04:02:39 or maybe like atomic test-and-set before spi_async and then clear it again in its completion handler Jan 19 04:03:06 (and if the test-and-set reports it was already set, BUG()) Jan 19 04:04:55 yes Jan 19 04:05:49 I can again rule out the RPi, the tcpdump clearly shows that the 802.15.4 frame that arrives at the BBB is fine, the 802.15.4 frame that's coming out of the BBB is fine, but the frame that arrives at the RPi is corrupted Jan 19 04:06:17 also the corrupted data is always at the very end of the frame Jan 19 04:06:30 well, no, that could just as easily mean it gets corrupted in the rpi's rx handling Jan 19 04:06:39 ah, right Jan 19 04:06:55 yeah Jan 19 04:07:09 in fact, since the rx path looks iffier than the tx path, and the rpi has more opportunities for race conditions due to being multicore, that actually seems more likely to me Jan 19 04:07:29 can't I restrict the module to a single core ? Jan 19 04:07:34 set cpu affinity or something Jan 19 04:08:00 I guess maybe if you do so for all relevant kernel threads and irqs Jan 19 04:08:26 yeah Jan 19 04:08:42 I didn't see a big amount of kernel threads related to mrf24j40 in htop Jan 19 04:08:49 so it should be easy Jan 19 04:09:11 or maybe I can just boot in single core mode ? Jan 19 04:09:13 spi thread mainly probably Jan 19 04:09:18 no idea, maybe Jan 19 04:09:23 yeah Jan 19 04:10:26 having a known-good communication partner instead of both sides being suspect would definitely make things easier ;) Jan 19 04:10:31 yeah Jan 19 04:10:36 I should get another RPi at the lab Jan 19 04:10:44 or another BBB Jan 19 04:11:05 none of those would be "known-good" though Jan 19 04:11:33 yeah but I can do with 2 RPis or 2 BBBs Jan 19 04:11:38 and determine which configuration is the good one Jan 19 04:11:42 I guess spi sniffing would help in identifying whether the transmitter or receiver is fucking up Jan 19 04:11:49 yes Jan 19 04:11:52 I could work on that in a bit Jan 19 04:12:11 a kprobe couldn't work ? Jan 19 04:12:29 inside mcspi Jan 19 04:13:07 I'm not familiar enough with kprobe to know whether it's suited for capturing data like that Jan 19 04:13:17 ah, right Jan 19 04:13:18 if you feel it can work, go for it Jan 19 04:16:50 well I've got to go, thanks a lot for your help ! Jan 19 04:16:53 and good night Jan 19 04:16:57 good night Jan 19 04:27:38 Hello! Jan 19 04:31:29 Here comes a four-wheeled bot for your liking. Jan 19 04:34:59 MotorCape and four wheel movement! Jan 19 19:56:37 the power button doesn't work anymore with 5.0.0-rc2 Jan 20 01:52:00 Regarding my confusion with tait-brian angles the other day, the angle are as I expected when I used X-FORWARD IMU orientation (rather than X-UP) **** ENDING LOGGING AT Sun Jan 20 02:59:57 2019