**** BEGIN LOGGING AT Fri Jun 05 02:59:57 2009
Jun 05 04:31:16 should php4 get svn rm'ed as its upstream is dead?
Jun 05 07:55:22 can someone tell me how to induce firstboot to run again?
Jun 05 07:56:12 hmm. maybe dd'ing /dev/zero to the jffs partition?
Jun 05 08:03:05 /bin/firstboot ?
Jun 05 08:03:22 yeah, but if it finds jffs2 already there it doesn't do anything
Jun 05 08:03:33 * russell_ trying to induce the nvram corruption
Jun 05 08:04:03 it should clear the contents of the jffs2 partition if the partition is mounted
Jun 05 08:04:24 i want it to do exactly what it does when it boots the first time
Jun 05 08:04:44 to see if that's the guilty party
Jun 05 08:04:48 umount the jffs partition then
Jun 05 08:04:56 can't, it's busy
Jun 05 08:05:53 boot into failsafe, or pivot_root to the squashfs partition
Jun 05 08:06:22 you know how to boot into failsafe on a wgt? i've never done it before.
Jun 05 08:06:23 and then unmount the mini_fo partition and you should be able to unmount jffs.
Jun 05 08:06:47 not possible on a wgt as far as I now.
Jun 05 08:06:49 know*
Jun 05 08:08:14 you know the pivot root incantation by chance?
Jun 05 08:09:46 should just be 'pivot_root / /rom', I believe
Jun 05 08:10:11 pivot_root: Device or resource busy
Jun 05 08:10:57 try pivot_root /rom / then
Jun 05 08:11:10 same
Jun 05 08:11:39 hmmm... can you kill all the processes you don't need?
Jun 05 08:13:33 down to ash init and klogd, same same
Jun 05 08:14:10 can you kill klogd (that's just the logging daemon I believe)
Jun 05 08:14:38 still no dice
Jun 05 08:15:04 hmmm...must be ash that's the problem
Jun 05 08:15:12 i think it's the other mounts
Jun 05 08:15:17 probably open to the current directory
Jun 05 08:15:23 * russell_ gradually unmounting everything
Jun 05 08:15:30 oh, right, forgot about them
Jun 05 08:15:43 can't unmount /dev
Jun 05 08:16:07 can you boot with a ramdisk image?
Jun 05 08:17:10 hmm. once a long time ago i had an nfsroot infrastructure for the wgt ...
not sure it still exists though
Jun 05 08:17:47 no, I just mean the ramdisk target (instead of jffs or squashfs)
Jun 05 08:26:53 it's limited to about 2.7 or 2.8 MB because of tftp limitations in the CFE
Jun 05 08:34:37 hmm. parsing firstboot, the business part seems to be the mtd erase bit
Jun 05 08:43:40 I think you have to make sure the DEADC0DE is present after the squashfs for it to erase
Jun 05 08:45:00 what board are you working on? (broadcom reference I mean)
Jun 05 08:46:27 it's a wgt634u
Jun 05 08:47:43 what broadcom board is that? (e.g. 96348GW-11?)
Jun 05 08:49:16 is there a command that emits what you want?
Jun 05 08:50:43 it might be shown in dmesg output near the top....if not I'd have to check the wiki and see if anyone has posted it
Jun 05 08:52:09 BCM95365R
Jun 05 08:52:45 ah, ok, not the boards I'm working on imagetag for
Jun 05 08:53:16 something is stomping on nvram the first time the image boots
Jun 05 08:53:25 * russell_ is trying to figure out what
Jun 05 08:53:45 how big is the image?
Jun 05 08:53:53 when it reboots, the CFE gets confused and decides it can't do anything more
Jun 05 08:54:39 2625536 bytes
Jun 05 08:55:50 ok, so it's not too big for flash; you had mentioned it might be jffs stomping on nvram, but unless the partitions are wrong, that's unlikely (what are the partitions?)
Jun 05 08:57:24 http://pastebin.ca/1448496
Jun 05 09:03:08 how big is your flash?
Jun 05 09:03:15 8 meg
Jun 05 09:03:43 I don't see anything obviously wrong then
Jun 05 09:04:05 yeah
Jun 05 09:04:08 and yet
Jun 05 09:04:12 rootfs_data is on an erase_boundary and that's where the jffs lives
Jun 05 09:04:13 something clearly is
Jun 05 09:04:27 fun, fun
Jun 05 09:04:44 do you have jtag
Jun 05 09:04:56 no
Jun 05 09:05:31 hmmm...it'd be useful to be able to see what was in the nvram after reboot
Jun 05 09:05:47 that i can get from cat'ing /dev/mtd4
Jun 05 09:05:57 i'm doing that now, as a matter of fact
Jun 05 09:06:02 right, ok
Jun 05 09:06:07 ramdisk image?
Jun 05 09:06:19 going to compare with what happens after firstboot
Jun 05 09:07:11 me wonders where precisely these are coming from: jffs2_scan_eraseblock(): End of filesystem marker found at 0x0
Jun 05 09:07:11 jffs2_build_filesystem(): unlocking the mtd device... done.
Jun 05 09:07:11 jffs2_build_filesystem(): erasing all blocks after the end marker... done.
Jun 05 09:07:26 if you do a small ramdisk image you could check what happens after reboot as well
Jun 05 09:08:02 the damage is occurring (almost certainly) before the reboot though
Jun 05 09:08:36 target/linux//driver/mtd-flash or similar
Jun 05 09:08:59 or is that the shell script...let me check
Jun 05 09:11:37 ooh. more interesting.
Jun 05 09:12:01 i reflashed with the image that caused the trouble, no problem on reboot this time.
Jun 05 09:12:12 oh?
Jun 05 09:12:23 maybe my setenv -p foo bar wrote to a different part, maybe
Jun 05 09:13:03 * russell_ should compare with the /dev/mtd4 on one i haven't broken and fixed
Jun 05 09:23:57 hmm. some differences
Jun 05 09:24:08 besides the macaddrs
Jun 05 09:24:48 the first difference doesn't appear until the 98309th byte of /dev/mtd4
Jun 05 09:27:04 is mtd4 the nvram?
Jun 05 09:28:40 yeah
Jun 05 09:30:31 running strings and sorting, here's the diff: http://pastebin.ca/1448511
Jun 05 09:31:12 (ignore the macaddr differences, they are from separate devices)
Jun 05 09:32:32 -boardtype=bcm95365r looks bad to me
Jun 05 09:32:44 thing is, they both boot
Jun 05 09:32:51 ok
Jun 05 09:32:56 what's the message on boot
Jun 05 09:33:09 you mean, the bad one?
Jun 05 09:33:14 yes
Jun 05 09:33:26 http://forum.openwrt.org/viewtopic.php?pid=89144
Jun 05 09:34:03 read the first one, then skip down to #7
Jun 05 09:34:22 if i restore the values from CFE, everything starts working again
Jun 05 09:34:43 i've had this happen on two separate devices with recent trunk images
Jun 05 09:41:55 is orig the good or the bad?
Jun 05 09:42:32 they are both booting.
orig is one that i've never "fixed up" from CFE
Jun 05 09:42:57 where "fixed up" means, replacing missing values with setenv -p
Jun 05 09:43:37 i would guess that orig is identical to stock, but i'm not certain of that
Jun 05 09:44:10 hmmm...maybe it's not the nvram settings that are the problem?
Jun 05 09:44:26 sure seems like it though
Jun 05 09:44:41 CFE shows them broken, i fix, it starts working again
Jun 05 09:44:56 what are they supposed to be?
Jun 05 09:45:11 what #7 shows from that forum post
Jun 05 09:47:43 i have copies of the flash from a bunch of these devices, and the only difference i'm seeing in their respective mtd4's is in the macaddrs
Jun 05 09:48:18 so i think we can conclude that -orig is stock
Jun 05 09:48:24 what about line4 from orig?
Jun 05 09:48:39 boardtype?
Jun 05 09:48:55 no the partial startup line
Jun 05 09:49:30 partial?
Jun 05 09:49:44 that's the full STARTUP line afaik
Jun 05 09:49:47 @@ -1,11 +1,13 @@ -8.1.1 -mask=255.255.255.0;boot -elf flash0.os:
Jun 05 09:50:12 line 1 in the original file I guess
Jun 05 09:50:24 oh
Jun 05 09:50:57 yeah, i don't know
Jun 05 09:51:15 strings can be funny that way
Jun 05 09:51:32 i'm not sure how the nvram partition is structured
Jun 05 09:51:57 can you scp mtd4 and hexedit it?
Jun 05 09:52:03 anyway, both of them boot. what i don't currently have is a copy of mtd4 when it's hosed.
Jun 05 09:52:13 right
Jun 05 09:52:14 yeah, i ran cmp
Jun 05 09:52:27 cmp -b -l ...
Jun 05 09:53:22 does your board have a ramdisk image format available?
(it'd be in the same place you configure jffs vs squashfs)
Jun 05 09:53:36 i didn't build one
Jun 05 09:53:58 i'm not sure what you hope to get out of that
Jun 05 09:54:24 you can cat mtd4 of a hosed device by booting over the network
Jun 05 09:54:48 i don't have a hosed device at the moment, though i suppose i could create another one
Jun 05 09:54:56 I think your cfe should have that option
Jun 05 09:56:38 if i assume it's being hosed, i can also grab a copy before i reboot
Jun 05 09:56:50 in which case, no fancy booting is needed
Jun 05 09:56:54 right
Jun 05 09:57:09 depends on where it's getting hosed
Jun 05 11:33:14 lars * r16345 /trunk/toolchain/gcc/Makefile:
Jun 05 11:33:14 [toolchain] disable tls for stdlibc++. fixes c++ inside a gcc-4.4.0
Jun 05 11:33:14 toolchain.
Jun 05 12:44:25 russell_, cshore: http://pastebin.com/m16e7e2a9
Jun 05 15:25:01 juhosg * r16346 /trunk/target/linux/generic-2.6/patches-2.6.30/215-mini_fo_2.6.30.patch: [kernel] generic-2.6/2.6.30: more mini_fo fixes
Jun 05 15:31:35 juhosg * r16347 /trunk/target/linux/ar71xx/files/arch/mips/ar71xx/mach-mzk-w300nh.c: [ar71xx] create a 'firmware' partition for MZK-W300NH board
Jun 05 19:29:42 juhosg * r16348 /trunk/target/linux/ar71xx/files/arch/mips/ar71xx/mach-mzk-w300nh.c: [ar71xx] fix a typo
Jun 05 19:37:49 nbd: ping
Jun 05 19:38:03 https://dev.openwrt.org/changeset?old=15055%40trunk%2Fpackage%2Fhostapd%2Ffiles&new=15055%40trunk%2Fpackage%2Fhostapd%2Ffiles
Jun 05 19:39:02 nbd: the above change broke things, if I set hwmode_11n=1 in /etc/config/wireless, hostapd.conf will have hw_mode=1 (should be g)
Jun 05 19:53:35 xxiao: maybe hwmode_11n is supposed to be set to "g"
Jun 05 19:53:42 it's just passed through
Jun 05 19:54:33 because above changeset does not use config_get_bool
Jun 05 19:54:43 therefore I assume it expects a string, not 1/0
Jun 05 19:56:42 from hostapd's document it states in hostapd.conf, 11n should be enabled via "hwmode_11n=1"
Jun 05 19:57:12 in openwrt you should be able to
set hw_mode to 11na or 11ng
Jun 05 19:57:21 and the scripts will handle hwmode_11n=1 internally
Jun 05 20:00:54 hwmode="$hwmode_11n" ---do I need set up hwmode_11n then?
Jun 05 20:01:23 no
Jun 05 20:01:29 leave it alone
Jun 05 20:02:37 probably we do not need the code "config_get hwmode_11n "$device" hwmode_11n ----- [ -n "$hwmode_11n" ] && { ----hwmode="$hwmode_11n"" then?
Jun 05 20:02:58 hwmode_11n is generated
Jun 05 20:03:25 you mean by wifi_fixup_hwmode?
Jun 05 20:03:45 ok i'll just set hwmode=11na or 11ng
Jun 05 20:04:09 yes
Jun 05 20:04:38 then in hostapd.conf it's renamed hw_mode, anyway. will try this tonight
Jun 05 20:05:26 nbd: any thoughts on the weird wgt634u nvram corruption problem?
Jun 05 20:06:06 no idea
Jun 05 20:06:48 nbd: is there a way in hostapd to force 11n-only mode?
Jun 05 20:07:34 dunno
Jun 05 20:10:10 nbd: do you know a known-good madwifi version that i can revert to use for now? madwifi keeps locking my ar71xx board these days whenever wifi is enabled
Jun 05 20:10:44 no
Jun 05 20:11:19 hmmm....crappy ar71xx
Jun 05 20:16:27 just wondering am I the only one seeing madwifi crashes on ar71xx? ath5k also misbehaves (crashes in mesh mode), i'm using ubnt routerstation
Jun 05 20:18:13 I've seen frequent reboots with stuff running madwifi these days but no crashes (oops)
Jun 05 20:30:07 xMff: maybe you're not using ar71xx, i ran madwifi on ixp4xx before and it ran well
Jun 05 22:15:57 fwiw, http://pastehtml.com/view/090605R6W44ene.html is an incomplete breakdown of openwrt feeds packages against their upstream versions
Jun 05 22:21:00 swalker: what are you using to generate that?
Jun 05 22:29:38 Bartman007: grepping/cutting for the Makefile info, coloring/upstream is by hand at least for now
Jun 05 22:30:09 wow.
thank you (for doing the grunt work)
Jun 05 22:51:41 * russell_ deciphers the nvram format, a simple rle style thing
Jun 05 22:52:55 russell_: if it's like broadcom, it's just key=val, separated by \0
Jun 05 22:53:04 and some header
Jun 05 22:53:18 with a magic ('FLSH')
Jun 05 22:53:57 there is an earlier FLSH section
Jun 05 22:54:16 but at 0x1e000 the stuff that shows up in printenv appears
Jun 05 22:54:59 that is 0x01,LEN+1,0x00,STR, ..., 0x00
Jun 05 22:55:34 FLSH appears at 0x18000
Jun 05 22:55:42 how big is your nvram?
Jun 05 22:55:59 it's erase size
Jun 05 22:56:06 131072
Jun 05 22:56:09 bytes
Jun 05 22:57:45 the stuff in the earlier section doesn't do the run-length encoding, at least not in the same way... haven't looked closely at it
Jun 05 22:57:49 at which offset is the one from printenv ?
Jun 05 22:58:02 0x1e000
Jun 05 22:58:38 although, /me looking at it again, some stuff in the printenv (like the board type or something) doesn't seem to appear in the 0x1e000 section
Jun 05 22:58:58 most of the ascii stuff in the 0x18000 section is sdram stuff
Jun 05 22:59:07 you run 2.6 ?
Jun 05 22:59:11 yes
Jun 05 22:59:27 which platform is the wgt thing again?
Jun 05 23:01:22 CONFIG_TARGET_brcm47xx=y
Jun 05 23:02:37 i'm trying to re-trigger the corruption and capture what the corruption looks like
Jun 05 23:02:43 I think it's the userspace nvram that messes things up
Jun 05 23:02:49 do you have an /etc/init.d/nvram ?
Jun 05 23:03:05 i do
Jun 05 23:03:16 that's causing the corruption I think
Jun 05 23:03:34 okay, didn't see that before
Jun 05 23:03:37 could you provide me with a dump of the nvram mtd partition before and after the corruption?
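The record layout described at 22:54:59 for the section at 0x1e000 (a 0x01 byte, a LEN+1 byte, a 0x00 byte, then the string, repeated, with a closing 0x00) can be walked with a few lines of Python. This is only a sketch of that description: the names `parse_records` and `build_records` are made up here, and reading the length byte as "the 0x00 byte plus the string" is an assumption taken from the "LEN+1" in the log, not something checked against the CFE source.

```python
def parse_records(data: bytes) -> list[str]:
    """Walk records shaped like: 0x01, LEN+1, 0x00, <LEN string bytes>.

    Parsing stops at the first byte that is not 0x01. The log shows a
    closing 0x00; erased flash would read 0xff, which also stops the walk.
    """
    out = []
    pos = 0
    while pos + 2 <= len(data) and data[pos] == 0x01:
        length = data[pos + 1]  # assumed: counts the 0x00 byte plus the string
        body = data[pos + 2 : pos + 2 + length]
        if length < 1 or len(body) < length:
            break  # truncated or malformed record, give up
        out.append(body[1:].decode("ascii", errors="replace"))
        pos += 2 + length
    return out


def build_records(strings: list[str]) -> bytes:
    """Inverse helper, only used to produce test data for this sketch."""
    blob = b""
    for s in strings:
        raw = s.encode("ascii")
        blob += bytes([0x01, len(raw) + 1, 0x00]) + raw
    return blob + b"\x00"
```

Against a real dump this would be called on the bytes starting at 0x1e000, for example `parse_records(open("mtd4-good", "rb").read()[0x1e000:])`, assuming the section really starts exactly there.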
Jun 05 23:03:50 i don't have one yet of the 'after corruption'
Jun 05 23:03:56 ok
Jun 05 23:04:06 just 'after-the-corruption-after-the-repair'
Jun 05 23:04:17 since the corruption prevents it booting
Jun 05 23:04:22 the nvram utility assumes some offsets that might be wrong on your device
Jun 05 23:04:30 that would make sense
Jun 05 23:05:25 clue.
Jun 05 23:05:40 https://dev.openwrt.org/browser/trunk/package/nvram/src/nvram.h#L115
Jun 05 23:05:52 nvram_set opo 0x0 shows up in the corrupted/fixed version and not in the original
Jun 05 23:06:26 xMff: nvram on wgt634u is different from other boards.
Jun 05 23:06:56 Bartman007: some hints on how to programmatically detect it right?
Jun 05 23:07:00 see line 50 of target/linux/brcm47xx/files-2.6.28/arch/mips/bcm47xx/nvram.c
Jun 05 23:07:07 ok
Jun 05 23:07:09 the timing on that is about right too
Jun 05 23:07:36 last touched 6 weeks ago or something
Jun 05 23:31:26 fwiw, the differences between a corrupted/fixed and pristine stock nvram in the 0x18000 section are the addition of sdram_refresh=0x8040, opo=0x0, sdram_init=0x0419, and sdram_config=0x0000. the only ascii string in the stock version is sdram_ncdl=0x00020080 (unchanged in corrupted/fixed except the location is offset somewhat)
Jun 05 23:31:57 start-offset is the same?
Jun 05 23:32:18 those values are applied by /etc/init.d/nvram
Jun 05 23:32:25 this is the stock:
Jun 05 23:32:29 00018000: 464c 5348 2c00 0000 5001 1904 0000 4080 FLSH,...P.....@.
Jun 05 23:32:29 00018010: 0000 0000 7364 7261 6d5f 6e63 646c 3d30 ....sdram_ncdl=0
Jun 05 23:32:29 00018020: 7830 3030 3230 3038 3000 0000 ffff ffff x00020080.......
Jun 05 23:32:29 I think fixup_linksys() is triggered for the wgt too
Jun 05 23:32:50 this is the first three lines of the corrupted/fixed:
Jun 05 23:33:02 00018000: 464c 5348 7000 0000 de01 1904 0000 4080 FLSHp.........@.
Jun 05 23:33:02 00018010: 8000 0200 7364 7261 6d5f 7265 6672 6573 ....sdram_refres
Jun 05 23:33:02 00018020: 683d 3078 3830 3430 006f 706f 3d30 7830 h=0x8040.opo=0x0
Jun 05 23:33:45 I see
Jun 05 23:34:07 those additional values are generated by the nvram code on commit
Jun 05 23:34:34 (this isn't a linksys! ;-)
Jun 05 23:34:42 I know
Jun 05 23:35:06 * russell_ just giddy we've kind of converged on the problem
Jun 05 23:35:08 has the wgt some /proc stuff that could be used to identify it?
Jun 05 23:35:40 good question ... unfortunately, i need to run out for sushi for a couple hours
Jun 05 23:37:11 cat /proc/cpuinfo yields, in part, this: cpu model : Broadcom BCM3302 V0.7
Jun 05 23:37:21 not sure that's unique enough
Jun 05 23:38:23 nope
Jun 05 23:38:28 xMff: can't you use broadcom-diag ?
Jun 05 23:38:45 okay, /me biab
Jun 05 23:39:23 /proc/diag/model ?
Jun 05 23:39:46 yes
Jun 05 23:40:05 if it's present on a wgt...
Jun 05 23:40:11 only needs its value
Jun 05 23:40:20 should be "Netgear WGT634U"
Jun 05 23:40:25 k
Jun 05 23:41:33 I can double check that it is detected properly this weekend, but that's what package/broadcom-diag/src/diag.c says
Jun 05 23:43:02 part 1 would be http://openwrt.pastebin.com/m4ff54f64 then
Jun 06 01:38:55 cat /proc/diag/model
Jun 06 01:38:55 Netgear WGT634U
Jun 06 02:01:39 in the meantime, i'll just /etc/init.d/nvram disable
Jun 06 02:01:52 (too late of course, in most instances)
Jun 06 02:05:13 jow * r16349 /trunk/package/nvram/files/nvram.init: [package] nvram: don't execute nvram fixups on the WGT634U
Jun 06 02:05:51 xMff: the other problem is why it was corrupting things
Jun 06 02:06:25 russell_: no idea, unless the checksum calculation is wrong somehow
Jun 06 02:06:53 russell_: can you run "nvram info" if the nvram is in a non-corrupted state?
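The two hex dumps pasted at 23:32 and 23:33 can be decoded by hand, which makes the stock vs corrupted/fixed difference easier to see. The sketch below assumes the usual Broadcom-style 20-byte header (magic, length, a word packing crc8/version/sdram_init, a word packing sdram_config/sdram_refresh, and an ncdl word, all little-endian). That field split is an assumption; the log itself only confirms the 'FLSH' magic and that the header carries an 8-bit checksum. Under that assumption the decoded numbers do agree with the sdram_init=0x0419, sdram_refresh=0x8040 and sdram_ncdl=0x00020080 values quoted at 23:31:26. `decode_header` is a name made up for this note.

```python
import struct

# First 20 bytes at 0x18000, copied from the two hex dumps in the log.
STOCK = bytes.fromhex("464c5348 2c000000 50011904 00004080 00000000")
FIXED = bytes.fromhex("464c5348 70000000 de011904 00004080 80000200")


def decode_header(raw: bytes) -> dict:
    """Split a 20-byte header into fields (layout is assumed, see above)."""
    magic, length, crc_ver_init, config_refresh, config_ncdl = struct.unpack(
        "<4sIIII", raw[:20]
    )
    return {
        "magic": magic,
        "len": length,                           # bytes covered by the header + data
        "crc8": crc_ver_init & 0xFF,             # the 8-bit checksum
        "version": (crc_ver_init >> 8) & 0xFF,
        "sdram_init": crc_ver_init >> 16,
        "sdram_config": config_refresh & 0xFFFF,
        "sdram_refresh": config_refresh >> 16,
        "sdram_ncdl": config_ncdl,
    }


if __name__ == "__main__":
    for name, raw in (("stock", STOCK), ("fixed", FIXED)):
        fields = decode_header(raw)
        print(name, {k: (v if isinstance(v, bytes) else hex(v)) for k, v in fields.items()})
```

Read this way, the length grows from 0x2c to 0x70 and the checksum byte changes, which fits the extra key=value strings the nvram code adds on commit.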
Jun 06 02:06:55 i'm about to create a corrupted version
Jun 06 02:07:02 yes
Jun 06 02:07:10 * russell_ pastebin'ing
Jun 06 02:07:52 http://pastebin.ca/1449330
Jun 06 02:08:25 that is with a "stock" nvram partition, with the exception of the devices macaddrs, which i modified with a hex editor
Jun 06 02:08:39 apparently no checksum on that part of the nvram
Jun 06 02:09:06 it has an 8bit checksum in the header
Jun 06 02:09:34 which seems to be correct
Jun 06 02:09:44 the macaddr stuff is in the later section
Jun 06 02:10:01 do all vars appear in "nvram show" ?
Jun 06 02:10:47 I wonder when a nvram counts as corrupted for this cfe
Jun 06 02:10:49 most of the things i see from CFE printenv do not show
Jun 06 02:11:11 things like et0macaddr, et1macaddr, etc
Jun 06 02:11:15 okay, it actually just prints the stuff that's stored on flash
Jun 06 02:11:33 so, there might be some pure virtual variables in kernel space
Jun 06 02:12:04 the et0macaddr is on the flash, in the later section, starting at 0x1e000
Jun 06 02:12:13 hm okay
Jun 06 02:12:36 let me see what changes when i run /etc/init.d/nvram start
Jun 06 02:13:42 hmm, 0x20000 (erase size) - 0x2000 (nvram size) is 0x1e000 ... don't get why it writes stuff in the wrong location
Jun 06 02:14:05 aha.
it's writing 0xff's all over
Jun 06 02:14:18 yeah
Jun 06 02:14:36 so it overwrites the real section?
Jun 06 02:14:51 there is data there, for sure
Jun 06 02:15:11 in the run-length-encoding i mentioned before
Jun 06 02:22:00 maybe the 0xFF confuses the nvram-find-heuristic
Jun 06 02:22:54 I just need an image before/after to compare them, to implement a work around
Jun 06 02:23:06 * russell_ can provide now
Jun 06 02:24:44 http://www.personaltelco.net/~russell/mtd-good
Jun 06 02:24:49 http://www.personaltelco.net/~russell/mtd-corrupt
Jun 06 02:25:36 #2 not found
Jun 06 02:26:07 oh, /me is dumbass, hold on
Jun 06 02:26:15 http://www.personaltelco.net/~russell/mtd4-corrupt
Jun 06 02:26:30 i meant to name the first one mtd4-good, but failed
Jun 06 02:27:09 http://www.personaltelco.net/~russell/mtd4-good now there too
Jun 06 02:29:13 /proc/mtd reports 0x20000 0x20000 for mtd4 ?
Jun 06 02:30:00 or 0x10000 0x20000 ?
Jun 06 02:30:45 mtd4: 00020000 00020000 "nvram"
Jun 06 02:32:04 hmm
Jun 06 02:33:05 that format is really weird
Jun 06 02:33:25 * russell_ not in a position to judge weirdness
Jun 06 02:33:33 first half starts at 0x8000, then a big gap, rest is at 0xE000
Jun 06 02:33:34 so i'll take your word for it
Jun 06 02:33:35 * xMff neither
Jun 06 02:33:51 yeah, and the second bit is in a different format
Jun 06 02:34:00 not just zero-terminated strings
Jun 06 02:35:31 0x01, STR1LEN+1, 0x00, STR1, 0x01, STR2LEN+1, 0x00, STR2, ..., 0x00
Jun 06 02:35:43 alright
Jun 06 02:38:01 at least, that's what i gathered from staring at it
Jun 06 02:48:47 russell_: http://openwrt.pastebin.com/m3a240c4d
Jun 06 02:48:53 only compile-tested so far
Jun 06 02:49:37 that will make the tool only see the first 0x10000 bytes, so it won't overwrite that other thing
Jun 06 02:49:40 ...
if it works
Jun 06 02:50:01 but, the part it wants to write to is at 0x18000
Jun 06 02:50:15 yes, 0x20000 - 0x2000
Jun 06 02:50:28 now it thinks the device is 0x10000
Jun 06 02:50:52 0x20000 - 0x2000 is 0x1e000
Jun 06 02:51:04 so it will write at 0x10000 - 0x200
Jun 06 02:51:08 *0x2000
Jun 06 02:51:49 * russell_ still not sure that's right
Jun 06 02:52:07 there are _supposed_ to be two parts, i think
Jun 06 02:52:28 * russell_ will look at the code though, before i babble more nonsense
Jun 06 02:52:31 according to the dumps you provided, it is. With the above change, the memory mapped area will be clipped after 0x10000
Jun 06 02:52:56 so it won't write 0xFF at 0x10000 and higher
Jun 06 02:53:27 but, but ... the first "FLSH" section is at 0x18000, which is higher than 0x10000
Jun 06 02:53:50 0x8000 here
Jun 06 02:53:57 in mtd4-good
Jun 06 02:54:19 the non-FLSH-thing is at 0xE000
Jun 06 02:54:24 in mtd4-good
Jun 06 02:54:32 what are you looking at it with?
Jun 06 02:54:37 "vbindiff"
Jun 06 02:55:27 * russell_ looking with hexl-mode on emacs and i see 0x18000 bytes of 0xff before FLSH
Jun 06 02:56:59 argh hmpf
Jun 06 02:57:02 sorry
Jun 06 02:57:09 cmp -l -b /tmp/mtd4-new /tmp/mtd4-corrupt | less
Jun 06 02:57:15 * xMff overlooked the first column
Jun 06 02:57:22 okay
Jun 06 02:57:57 nbd: ath5k similar pci bus error on ar71xx, ath9k worked fine: http://openwrt.pastebin.com/m842d18c
**** ENDING LOGGING AT Sat Jun 06 02:59:58 2009
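The offset mix-up near the end of the log (0x8000 vs 0x18000, 0xE000 vs 0x1e000) is the kind of thing that is easier to avoid by computing the numbers than by reading them off a viewer. A small sketch, with helper names invented for this note: `nvram_offset` reproduces the "partition size minus 0x2000" arithmetic quoted at 02:13:42 and 02:50:15, `find_magic` lists where 'FLSH' actually sits in a dump, and `diff_offsets` gives 0-based offsets of differing bytes (`cmp -l` numbers bytes from 1, in decimal).

```python
def nvram_offset(part_size: int, nvram_space: int = 0x2000) -> int:
    """Offset the log's arithmetic points at: end of partition minus nvram size."""
    return part_size - nvram_space


def find_magic(dump: bytes, magic: bytes = b"FLSH") -> list[int]:
    """Return every 0-based offset at which the magic occurs in the dump."""
    hits = []
    pos = dump.find(magic)
    while pos != -1:
        hits.append(pos)
        pos = dump.find(magic, pos + 1)
    return hits


def diff_offsets(a: bytes, b: bytes) -> list[int]:
    """Return 0-based offsets where the two dumps differ (common length only)."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    print(hex(nvram_offset(0x20000)))  # full 0x20000 partition
    print(hex(nvram_offset(0x10000)))  # partition clipped to 0x10000
    print(hex(98309 - 1))              # the "98309th byte" as a 0-based offset
```

With the 0x20000 partition reported by /proc/mtd this gives 0x1e000, and with the view clipped to 0x10000 it gives 0xe000. The "98309th byte" mentioned at 09:24:48 works out to 0-based offset 0x18004, which falls inside the header at 0x18000.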