NetBSD Problem Report #47057

From www@NetBSD.org  Thu Oct 11 17:17:06 2012
Return-Path: <www@NetBSD.org>
Received: from mail.netbsd.org (mail.netbsd.org [149.20.53.66])
	by www.NetBSD.org (Postfix) with ESMTP id 6D60263E407
	for <gnats-bugs@gnats.NetBSD.org>; Thu, 11 Oct 2012 17:17:06 +0000 (UTC)
Message-Id: <20121011171705.85EF163E3BF@www.NetBSD.org>
Date: Thu, 11 Oct 2012 17:17:05 +0000 (UTC)
From: royger@netbsd.org
Reply-To: royger@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: Xen NetBSD DomU file system trash under Linux Dom0
X-Send-Pr-Version: www-1.0

>Number:         47057
>Category:       port-xen
>Synopsis:       Xen NetBSD DomU file system trash under Linux Dom0
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    port-xen-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu Oct 11 17:20:00 +0000 2012
>Closed-Date:    Fri Nov 30 09:30:57 +0000 2012
>Last-Modified:  Fri Nov 30 09:30:57 +0000 2012
>Originator:     Roger Pau Monné
>Release:        6.0RC2
>Organization:
Citrix
>Environment:
NetBSD  6.0_RC2 NetBSD 6.0_RC2 (XEN3_DOMU) #6: Wed Sep 26 18:06:29 BST 2012  root@roger-xen:/root/obj/sys/arch/amd64/compile/XEN3_DOMU amd64
>Description:
This problem might be related to 'port-xen/47056', and the root cause might actually be the same, but I'm posting them as separate PRs until we can figure out whether they are related.

When doing heavy I/O inside a NetBSD DomU backed by a Linux Dom0 I get random file system crashes. I've seen this with FFSv1 and FFSv2, with WAPBL both enabled and disabled. The panics were usually about freeing an already freed block, but I've also seen that a fresh install can sometimes end up with corrupted files (when performing the install from the netbsd-INSTALL_XEN3_DOMU kernel).
>How-To-Repeat:
As with 'port-xen/47056', the easiest way to reproduce this is to build NetBSD from sources inside a DomU backed by an MP Linux Dom0.
>Fix:
I'm not sure about this, but I think we have a problem with reentrancy of the xen event channel callback (do_hypervisor_callback in hypervisor_machdep.c), but I haven't been able to find a fix for this.

The right solution might be to bind all events to CPU#0 and use a producer/consumer approach to dispatch them to different threads. This way it will be easier to block all events while we are in the callback itself, and then it's just a matter of calling the appropriate callback from the "consumer" thread. Also, we will be sure that callbacks won't be nested (ie. we will not have reentrant callbacks).
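
As a rough illustration of the producer/consumer idea, here is a minimal single-threaded sketch. It is not NetBSD code: `struct evq`, `evq_put`, `evq_drain`, `evq_note` and `EVQ_SIZE` are hypothetical names, and a real kernel version would also need memory barriers and a way to wake the consumer thread. The callback only records which port fired; the consumer later runs one handler at a time, so handlers cannot nest.

```c
#include <stdbool.h>
#include <stddef.h>

#define EVQ_SIZE 64u	/* hypothetical queue depth; must be a power of two */

/* Hypothetical queue of pending event-channel ports. */
struct evq {
	unsigned int ports[EVQ_SIZE];
	unsigned int prod;	/* advanced only by the callback (producer) */
	unsigned int cons;	/* advanced only by the dispatch thread (consumer) */
};

/* Producer side: meant to run in the callback while events are blocked. */
static bool
evq_put(struct evq *q, unsigned int port)
{
	if (q->prod - q->cons == EVQ_SIZE)
		return false;		/* queue full, caller must retry later */
	q->ports[q->prod % EVQ_SIZE] = port;
	q->prod++;
	return true;
}

/* Consumer side: drain the queue, calling one handler at a time. */
static unsigned int
evq_drain(struct evq *q, void (*handler)(unsigned int))
{
	unsigned int n = 0;

	while (q->cons != q->prod) {
		handler(q->ports[q->cons % EVQ_SIZE]);
		q->cons++;
		n++;
	}
	return n;		/* number of events dispatched */
}

/* Stand-in handler for illustration: remembers the last port it saw. */
static unsigned int evq_last_port;

static void
evq_note(unsigned int port)
{
	evq_last_port = port;
}
```

Because only the producer writes `prod` and only the consumer writes `cons`, the two sides never update the same index, which is what makes it easy to reason about nesting in this model.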

>Release-Note:

>Audit-Trail:
From: Manuel Bouyer <bouyer@antioche.eu.org>
To: gnats-bugs@NetBSD.org
Cc: port-xen-maintainer@NetBSD.org, gnats-admin@NetBSD.org,
        netbsd-bugs@NetBSD.org
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Thu, 11 Oct 2012 19:28:51 +0200

 On Thu, Oct 11, 2012 at 05:20:00PM +0000, royger@NetBSD.org wrote:
 > >Fix:
 > I'm not sure about this, but I think we have a problem with reentrancy of the xen event channel callback (do_hypervisor_callback in hypervisor_machdep.c), but I haven't been able to find a fix for this.


 Can you expand on this ? AFAIK this code is safe.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= <royger@NetBSD.org>
To: gnats-bugs@netbsd.org
Cc: port-xen-maintainer@netbsd.org, gnats-admin@netbsd.org, 
	netbsd-bugs@netbsd.org
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux Dom0
Date: Fri, 12 Oct 2012 17:52:22 +0200

 On Thu, Oct 11, 2012 at 7:30 PM, Manuel Bouyer <bouyer@antioche.eu.org> wrote:
 > The following reply was made to PR port-xen/47057; it has been noted by GNATS.
 >
 > From: Manuel Bouyer <bouyer@antioche.eu.org>
 > To: gnats-bugs@NetBSD.org
 > Cc: port-xen-maintainer@NetBSD.org, gnats-admin@NetBSD.org,
 >         netbsd-bugs@NetBSD.org
 > Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 >  Dom0
 > Date: Thu, 11 Oct 2012 19:28:51 +0200
 >
 >  On Thu, Oct 11, 2012 at 05:20:00PM +0000, royger@NetBSD.org wrote:
 >  > >Fix:
 >  > I'm not sure about this, but I think we have a problem with reentrancy of the xen event channel callback (do_hypervisor_callback in hypervisor_machdep.c), but I haven't been able to find a fix for this.
 >
 >
 >  Can you expand on this ? AFAIK this code is safe.

 I'm not sure, but I think we might have a problem when we call
 intr_biglock_wrapper. This function takes the kernel_lock, but just
 before calling it we call sti(), which allows further hypervisor
 callbacks. Isn't it possible that another hypervisor callback
 interrupts the execution of the handler, leaving the kernel_lock held
 and thus locking the system when this new callback tries to execute a
 handler?

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <royger@NetBSD.org>
Cc: gnats-bugs@NetBSD.org, port-xen-maintainer@NetBSD.org,
        gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Fri, 12 Oct 2012 18:10:27 +0200

 On Fri, Oct 12, 2012 at 05:52:22PM +0200, Roger Pau Monné wrote:
 > I'm not sure, but I think we might have a problem when we call
 > intr_biglock_wrapper, this function takes the kernel_lock, but just
 > before calling it we call sti(), which allows further hypervisor
 > callbacks. Isn't it possible that another hypervisor callback
 > interrupts the execution of the handler, leaving the kernel_lock held
 > and thus locking the system when this new callback tries to execute a
 > handler?

 Either the new callback wants to execute a MPSAFE handler and things will
 run fine, or it will also try to grab the kernel_lock, and will either
 succeed or be delayed depending on who did take the kernel_lock before.

 Remember that kernel_lock is not a simple mutex, it's reentrant on
 the same processor. So we may have one handler interrupted and the
 new callback can grab the kernel_lock, because it's already owned
 by its CPU. But then handler reentrancy is protected by the traditional
 spl mechanism.
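
The "reentrant on the same processor" behaviour described above can be modelled with an owner field and a nesting depth. This is only a single-threaded sketch under that assumption, not the actual kernel_lock implementation; `struct biglock`, `biglock_tryenter`, `biglock_exit` and `NO_OWNER` are hypothetical names, and a real lock would use atomic operations and spin or sleep instead of returning false.

```c
#include <stdbool.h>

#define NO_OWNER (-1)

/* Hypothetical per-CPU recursive lock; not the NetBSD kernel_lock code. */
struct biglock {
	int owner;		/* CPU index holding the lock, or NO_OWNER */
	unsigned int depth;	/* how many times the owner has taken it */
};

/* Try to take the lock on behalf of 'cpu'; the owner may re-enter. */
static bool
biglock_tryenter(struct biglock *l, int cpu)
{
	if (l->owner == NO_OWNER) {
		l->owner = cpu;
		l->depth = 1;
		return true;
	}
	if (l->owner == cpu) {
		l->depth++;	/* nested acquisition by the same CPU */
		return true;
	}
	return false;		/* held by another CPU: caller is delayed */
}

/* Release one level; the lock becomes free when depth reaches zero. */
static void
biglock_exit(struct biglock *l, int cpu)
{
	if (l->owner != cpu || l->depth == 0)
		return;		/* not the owner: nothing to release */
	if (--l->depth == 0)
		l->owner = NO_OWNER;
}
```

In this model an interrupted handler and the callback that interrupts it on the same CPU both succeed in taking the lock, while a callback on another CPU has to wait until the depth drops back to zero.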

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= <royger@NetBSD.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@netbsd.org, port-xen-maintainer@netbsd.org, 
	gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux Dom0
Date: Fri, 12 Oct 2012 18:55:56 +0200

 On Fri, Oct 12, 2012 at 6:10 PM, Manuel Bouyer <bouyer@antioche.eu.org> wrote:
 > Either the new callback wants to execute a MPSAFE handler and things will
 > run fine, or it will also try to grab the kernel_lock, and will either
 > succeed or be delayed depending on who did take the kernel_lock before.
 >
 > Remember that kernel_lock is not a simple mutex, it's reentrant on
 > the same processor. So we may have one handler interrupted and the
 > new callback can grab the kernel_lock, because it's already owned
 > by its CPU. But then handler reentrancy is protected by the traditional
 > spl mechanism.

 When you say it's protected by the same mechanism as normal handlers,
 I assume it's due to the code in evtchn.c, the "splx" section of
 evtchn_do_event, so there's no way for example that xbd_handler might
 be interrupted by another xbd_handler?

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <royger@NetBSD.org>
Cc: gnats-bugs@NetBSD.org, port-xen-maintainer@NetBSD.org,
        gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Fri, 12 Oct 2012 19:11:08 +0200

 On Fri, Oct 12, 2012 at 06:55:56PM +0200, Roger Pau Monné wrote:
 > When you say it's protected by the same mechanism as normal handlers,
 > I assume it's due to the code in evtchn.c, the "splx" section of
 > evtchn_do_event, so there's no way for example that xbd_handler might
 > be interrupted by another xbd_handler?

 yes.


 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: =?ISO-8859-1?Q?Roger_Pau_Monn=E9?= <roger.pau@citrix.com>
To: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>
Cc: Manuel Bouyer <bouyer@antioche.eu.org>, "port-xen-maintainer@netbsd.org"
	<port-xen-maintainer@netbsd.org>, "gnats-admin@netbsd.org"
	<gnats-admin@netbsd.org>, "netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>,
	"royger@netbsd.org" <royger@netbsd.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sat, 20 Oct 2012 13:31:37 +0200

 More info on this subject, I was able to get to ddb after the system
 froze (using +++++), here is the output:

 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 rip ffffffff80130bf5 cs e030 rflags 202 cr2
 7f7ff780c390 cpl 8 rsp ffffa0002e4848a8
 Stopped in pid 0.26 (system) at netbsd:breakpoint+0x5:  leave
 breakpoint() at netbsd:breakpoint+0x5
 xencons_tty_input() at netbsd:xencons_tty_input+0xc9
 xencons_handler() at netbsd:xencons_handler+0x79
 intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x1d
 evtchn_do_event() at netbsd:evtchn_do_event+0x15a
 call_evtchn_do_event() at netbsd:call_evtchn_do_event+0xd
 hypervisor_callback() at netbsd:hypervisor_callback+0x9e
 xenbus_thread() at netbsd:xenbus_thread+0xf5
 ds          ca00
 es          4a58
 fs          b9e8
 gs          6640
 rdi         1
 rsi         ffffffff80a7b303
 rbp         ffffa0002e4848a8
 rbx         ffffffff80a7b303
 rdx         2b
 rcx         2b
 rax         7f
 r8          ffffffff80a8f800
 r9          0
 r10         ffffffff80a8fa00
 r11         246
 r12         ffffa00002365090
 r13         ffffffff80a7b303
 r14         ffffa00002367340
 r15         1
 rip         ffffffff80130bf5    breakpoint+0x5
 cs          e030
 rflags      202
 rsp         ffffa0002e4848a8
 ss          e02b

 And the ps:

 db{0}> ps
 PID    LID S CPU     FLAGS       STRUCT LWP *               NAME WAIT
 26620    1 2   0         0   ffffa00008045b00                cc1
 28013    1 3   0        80   ffffa00002fe61a0   x86_64--netbsd-g wait
 23074    1 3   0        80   ffffa000080456e0            nbmkdep wait
 16970    1 3   0        80   ffffa00002628180                 sh wait
 14476    1 2   0         0   ffffa00002b6f600            nbmkdep
 24751    1 2   0         0   ffffa0000404a4c0                 as
 23651    1 3   0        80   ffffa000034d95a0                 sh wait
 16446    1 2   0     40000   ffffa00002fe65c0                cc1
 3398     1 3   0        80   ffffa00002fd95a0   x86_64--netbsd-g wait
 22279    1 3   0        80   ffffa0000387a980                 sh wait
 16566    1 3   0     40080   ffffa000076672c0             nbmake select
 22489    1 3   0     40080   ffffa0000236e520                 sh wait
 16321    1 3   0        80   ffffa00002aa7660             nbmake select
 21974    1 3   0        80   ffffa00002573960                 sh wait
 5656     1 3   0        80   ffffa0000236b8e0   x86_64--netbsd-g wait
 6264     1 3   0        80   ffffa000075881e0                 sh wait
 870      1 3   0        80   ffffa0000404a0a0             nbmake select
 28228    1 3   0        80   ffffa00002aa7a80                 sh wait
 19537    1 3   0        80   ffffa000076676e0             nbmake select
 19740    1 3   0        80   ffffa00007588600                 sh wait
 8737     1 3   0        80   ffffa00002f8e220             nbmake select
 29418    1 3   0        80   ffffa00002fa5aa0                 sh wait
 4427     1 3   0        80   ffffa0000263a5c0             nbmake select
 28916    1 3   0        80   ffffa000080452c0                 sh wait
 13113    1 3   0        80   ffffa00002fe69e0             nbmake select
 19412    1 3   0        80   ffffa000075691c0                 sh wait
 583      1 3   0        80   ffffa0000387a140             nbmake select
 5923     1 3   0        80   ffffa000085da860                 sh wait
 21434    1 3   0        80   ffffa0000404a8e0             nbmake select
 22103    1 3   0        80   ffffa00002f8ea60                 sh wait
 22077    1 3   0        80   ffffa00002b6f1e0             nbmake select
 6976     1 3   0        80   ffffa000085da020                 sh wait
 18463    1 3   0        80   ffffa000075695e0             nbmake select
 19784    1 3   0        80   ffffa00002f8e640                 sh wait
 8975     1 3   0        80   ffffa00007eb0420             nbmake select
 6597     1 3   0        80   ffffa000026289c0                 sh wait
 18499    1 3   0        80   ffffa00002aa7240             nbmake select
 649      1 3   0        80   ffffa0000263a1a0                 sh wait
 23152    1 3   0        80   ffffa0000387a560             nbmake select
 11133    1 3   0        80   ffffa00007588a20                 sh wait
 11482    1 2   0         0   ffffa00007569a00              getty
 15288    1 3   0        80   ffffa00007667b00                 sh wait
 6588     1 3   0        80   ffffa000034d99c0       screen-4.0.3 select
 541      1 3   0        80   ffffa000026285a0              getty nanoslp
 479      1 3   0        80   ffffa00002573540              getty nanoslp
 539      1 3   0        80   ffffa0000236e100              getty nanoslp
 532      1 3   0        80   ffffa000025fd580               cron nanoslp
 535      1 3   0        80   ffffa0000263a9e0              inetd kqueue
 333      1 3   0        80   ffffa000025fd9a0               sshd select
 463      1 3   0        80   ffffa00002590980             powerd kqueue
 307      1 2   0         0   ffffa000025fd160            syslogd
 249      1 3   0        80   ffffa00002590560             dhcpcd select
 1        1 3   0        80   ffffa0000236c0c0               init wait
 0       36 3   0       200   ffffa00002590140            physiod physiod
 0       35 3   0       200   ffffa0000236b4c0           aiodoned aiodoned
 0       34 3   0       200   ffffa0000236c900            ioflush syncer
 0       33 3   0       200   ffffa0000236b0a0           pgdaemon pgdaemon
 0       30 3   0       200   ffffa0000235e080          cryptoret crypto_w
 0       29 3   0       200   ffffa0000236c4e0        xen_balloon xen_balloon
 0       28 3   0       200   ffffa0000236d920              unpgc unpgc
 0       27 3   0       200   ffffa0000236d500        vmem_rehash vmem_rehash
 0    >  26 7   0       200   ffffa0000236e940             xenbus
 0       25 3   0       200   ffffa0000236d0e0           xenwatch evtsq
 0       15 3   0       200   ffffa0000235e4a0         pmfsuspend pmfsuspend
 0       14 3   0       200   ffffa0000235e8c0           pmfevent pmfevent
 0       13 3   0       200   ffffa00001ee4060         sopendfree sopendfr
 0       12 3   0       200   ffffa00001ee4480           nfssilly nfssilly
 0       11 3   0       200   ffffa00001ee48a0            cachegc cachegc
 0       10 3   0       200   ffffa00001ee3040              vrele vrele
 0        9 3   0       200   ffffa00001ee3460             vdrain vdrain
 0        8 3   0       200   ffffa00001ee3880          modunload mod_unld
 0        7 3   0       200   ffffa00001ed9020            xcall/0 xcall
 0        6 1   0       200   ffffa00001ed9440          softser/0
 0        5 1   0       200   ffffa00001ed9860          softclk/0
 0        4 1   0       200   ffffa00001ed6000          softbio/0
 0        3 1   0       200   ffffa00001ed6420          softnet/0
 0        2 1   0       201   ffffa00001ed6840             idle/0
 0        1 3   0       200   ffffffff805b4c80            swapper uvm

 I will try to create a patch that shows the value of the ring indexes,
 since I'm pretty sure they are screwed up, and the system was blocked in
 xenbus_thread because of that before the callback came in.

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <roger.pau@citrix.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
        "port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
        "gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>,
        "netbsd-bugs@netbsd.org" <netbsd-bugs@NetBSD.org>,
        "royger@netbsd.org" <royger@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sat, 20 Oct 2012 13:47:56 +0200

 On Sat, Oct 20, 2012 at 01:31:37PM +0200, Roger Pau Monné wrote:
 > More info on this subject, I was able to get to ddb after the system
 > froze (using +++++), here is the output:
 > 
 > […]
 > 
 > I will try to create a patch that shows the value of the ring indexes,
 > since I'm pretty sure they are screwed up, and the system was blocked in
 > xenbus_thread because of that before the callback came in.

 What would be interesting here is a
 tr/a ffffa0000236e940
 (the lwp pointer of the xenbus thread). And also what xenbus_thread+0xf5
 points to in sources.
 You can also try to type 'continue' and enter ddb again to see
 if things change (especially where in xenbus_thread it is
 interrupted).

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: =?ISO-8859-1?Q?Roger_Pau_Monn=E9?= <roger.pau@citrix.com>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
	"port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
	"gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>, "netbsd-bugs@netbsd.org"
	<netbsd-bugs@NetBSD.org>, "royger@netbsd.org" <royger@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sat, 20 Oct 2012 17:31:11 +0200

 On 20/10/12 13:47, Manuel Bouyer wrote:
 > What would be interesting here is a
 > tr/a ffffa0000236e940

 db{0}> tr/a ffffa0000236e940
 trace: pid 0 lid 26 at 0xffffa0002e4848a8
 breakpoint() at netbsd:breakpoint+0x5
 xencons_tty_input() at netbsd:xencons_tty_input+0xc9
 xencons_handler() at netbsd:xencons_handler+0x79
 intr_biglock_wrapper() at netbsd:intr_biglock_wrapper+0x1d
 evtchn_do_event() at netbsd:evtchn_do_event+0x15a
 call_evtchn_do_event() at netbsd:call_evtchn_do_event+0xd
 hypervisor_callback() at netbsd:hypervisor_callback+0x9e
 xenbus_thread() at netbsd:xenbus_thread+0xf5

 > (the lwp pointer of the xenbus thread). And also what xenbus_thread+0xf5
 > points to in sources.

 Since the kernel was compiled without -g I guess there's no way to get
 that now.

 > You can also try to type 'continue' and enter ddb again to see
 > if things change (especially where in xenbus_thread it is
 > interrupted).

 Nope, the system is completely frozen, tried several times and the trace
 is exactly the same. Will compile a new kernel with -g and let's see
 what I can get, but I bet xenbus_thread is blocked at:

 831: printk("XENBUS error %d while reading message\n", err);

 In fact I'm going to replace that with a panic.

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <roger.pau@citrix.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
        "port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
        "gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>,
        "netbsd-bugs@netbsd.org" <netbsd-bugs@NetBSD.org>,
        "royger@netbsd.org" <royger@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sat, 20 Oct 2012 17:38:53 +0200

 On Sat, Oct 20, 2012 at 05:31:11PM +0200, Roger Pau Monné wrote:
 > > You can also try to type 'continue' and enter ddb again to see
 > > if things change (especially where in xenbus_thread it is
 > > interrupted).
 > 
 > Nope, the system is completely frozen, tried several times and the trace
 > is exactly the same.

 You mean, it's always at xenbus_thread+0xf5 ? You never see other offsets ?

 > Will compile a new kernel with -g and let's see
 > what I can get, but I bet xenbus_thread is blocked at:
 > 
 > 831: printk("XENBUS error %d while reading message\n", err);
 > 
 > In fact I'm going to replace that with a panic.

 Maybe just a printf instead of a printk at first.
 If the offset never changes, I wonder if it could be stuck in
 	(void)HYPERVISOR_console_io(CONSOLEIO_write, ret, buf);

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: =?ISO-8859-1?Q?Roger_Pau_Monn=E9?= <roger.pau@citrix.com>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
	"port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
	"gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>, "netbsd-bugs@netbsd.org"
	<netbsd-bugs@NetBSD.org>, "royger@netbsd.org" <royger@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sat, 20 Oct 2012 17:48:08 +0200

 On 20/10/12 17:38, Manuel Bouyer wrote:
 > On Sat, Oct 20, 2012 at 05:31:11PM +0200, Roger Pau Monné wrote:
 >>> You can also try to type 'continue' and enter ddb again to see
 >>> if things change (especially where in xenbus_thread it is
 >>> interrupted).
 >>
 >> Nope, the system is completely frozen, tried several times and the trace
 >> is exactly the same.
 > 
 > You mean, it's always at xenbus_thread+0xf5 ? You never see other offsets ?

 No, no other offsets.

 > 
 >> Will compile a new kernel with -g and let's see
 >> what I can get, but I bet xenbus_thread is blocked at:
 >>
 >> 831: printk("XENBUS error %d while reading message\n", err);
 >>
 >> In fact I'm going to replace that with a panic.
 > 
 > Maybe just a printf instead of a printk at first.

 Tried that in the past (replacing the printk with a printf), and then I
 just get an infinite printf loop: intf->rsp_cons and intf->rsp_prod are
 corrupted, check_indexes in xb_read always returns false, and this leads
 to an infinite loop in xenbus_thread because process_msg always returns
 an error.

 I've now compiled a kernel that has the panic and prints the ring
 indexes. What's the best way to check who modifies intf->rsp_cons and
 intf->rsp_prod? Will ddb watch work on this kind of memory region?

 > If the offset never changes, I wonder if it could be stuck in
 > 	(void)HYPERVISOR_console_io(CONSOLEIO_write, ret, buf);
 > 

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <roger.pau@citrix.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
        "port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
        "gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>,
        "netbsd-bugs@netbsd.org" <netbsd-bugs@NetBSD.org>,
        "royger@netbsd.org" <royger@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sat, 20 Oct 2012 17:57:21 +0200

 On Sat, Oct 20, 2012 at 05:48:08PM +0200, Roger Pau Monné wrote:
 > > 
 > >> Will compile a new kernel with -g and let's see
 > >> what I can get, but I bet xenbus_thread is blocked at:
 > >>
 > >> 831: printk("XENBUS error %d while reading message\n", err);
 > >>
 > >> In fact I'm going to replace that with a panic.
 > > 
 > > Maybe just a printf instead of a printk at first.
 > 
 > Tried that in the past (replacing the printk with a printf), and then I
 > just get an infinite printf loop: intf->rsp_cons and intf->rsp_prod are
 > corrupted, check_indexes in xb_read always returns false, and this leads
 > to an infinite loop in xenbus_thread because process_msg always returns
 > an error.
 > 
 > I've now compiled a kernel that has the panic and prints the ring
 > indexes. What's the best way to check who modifies intf->rsp_cons and
 > intf->rsp_prod? Will ddb watch work on this kind of memory region?

 You can try, but at first glance I'd say it won't work. 

 Can you determine if it's cons or prod (or both) which is corrupted,
 and in which way ? What are the values when it's corrupted ?
 Are they always the same ?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: =?windows-1252?Q?Roger_Pau_Monn=E9?= <roger.pau@citrix.com>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
	"port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
	"gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>, "netbsd-bugs@netbsd.org"
	<netbsd-bugs@NetBSD.org>, "royger@netbsd.org" <royger@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sat, 20 Oct 2012 18:02:27 +0200

 On 20/10/12 17:57, Manuel Bouyer wrote:
 > On Sat, Oct 20, 2012 at 05:48:08PM +0200, Roger Pau Monné wrote:
 >>>
 >>>> Will compile a new kernel with -g and let's see
 >>>> what I can get, but I bet xenbus_thread is blocked at:
 >>>>
 >>>> 831: printk("XENBUS error %d while reading message\n", err);
 >>>>
 >>>> In fact I'm going to replace that with a panic.
 >>>
 >>> Maybe just a printf instead of a printk at first.
 >>
 >> Tried that in the past (replacing the printk with a printf), and then I
 >> just get in an infinite printf loop: intf->rsp_cons and intf->rsp_prod are
 >> corrupted, check_indexes in xb_read always returns false, and this leads
 >> to an infinite loop in xenbus_thread because process_msg always returns an error.
 >>
 >> I've now compiled a kernel that has the panic and prints the ring
 >> indexes. What's the best way to check who modifies intf->rsp_cons and
 >> intf->rsp_prod? Will ddb watch work on this kind of memory region?
 > 
 > You can try, but at first glance I'd say it won't work. 
 > 
 > Can you determine if it's cons or prod (or both) which is corrupted,
 > and in which way ? What are the values when it's corrupted ?
 > Are they always the same ?

 This is a trimmed excerpt of what I think is relevant. The first lines
 show the last known values of prod and cons before the corruption, and the
 rest is fairly self-explanatory:

 xenbus_xs (process_msg:763) xb_read hdr 0.
 xb_read: cons: 3470 prod: 3473
 Finished read of 3 bytes (0 to go)
 xenbus_xs (process_msg:776) xb_read body 0.
 xenbus_xs (process_msg:811) process_msg: type 7 body OK.
 xenbus_xs (read_reply:134) read_reply: type 7 body OK.
 xenbus_xs (xs_talkv:224) read done.

 […]

 xb_read: cons: 2403996137 prod: 3531897424
 xb_read EIO
 xenbus_xs (process_msg:763) xb_read hdr 5.
 panic: XENBUS error 5 while reading message

 cpu0: Begin traceback...
 printf_nolog() at netbsd:printf_nolog
 xenbus_thread() at netbsd:xenbus_thread+0x140
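
The index values in the trace above can be checked against the ring invariant directly. What follows is a minimal sketch, assuming a 1024-byte ring (the value of XENSTORE_RING_SIZE in Xen's public io/xs_wire.h) and a validity test of the same shape as the check_indexes helper mentioned earlier in this thread. The name ring_indexes_ok() is hypothetical and used only for illustration; this is not the NetBSD code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed ring size in bytes (XENSTORE_RING_SIZE in xs_wire.h). */
#define RING_SIZE 1024u

/*
 * Hypothetical helper: the indexes are free-running 32-bit counters, so
 * the number of unread bytes is (prod - cons) modulo 2^32. A result
 * larger than the ring itself means at least one index is garbage.
 */
static bool
ring_indexes_ok(uint32_t cons, uint32_t prod)
{
	return (uint32_t)(prod - cons) <= RING_SIZE;
}
```

With the logged values, ring_indexes_ok(3470, 3473) holds (3 bytes pending), while ring_indexes_ok(2403996137, 3531897424) fails because the difference is 1127901287, far more than the ring can hold.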

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <roger.pau@citrix.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
        "port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
        "gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>,
        "netbsd-bugs@netbsd.org" <netbsd-bugs@NetBSD.org>,
        "royger@netbsd.org" <royger@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sat, 20 Oct 2012 18:42:18 +0200

 --tKW2IUtsqtDRztdT
 Content-Type: text/plain; charset=iso-8859-1
 Content-Disposition: inline
 Content-Transfer-Encoding: 8bit

 On Sat, Oct 20, 2012 at 06:02:27PM +0200, Roger Pau Monné wrote:
 > > Can you determine if it's cons or prod (or both) which is corrupted,
 > > and in which way ? What are the values when it's corrupted ?
 > > Are they always the same ?
 > 
 > This is a trimmed excerpt of what I think is relevant. The first lines
 > show the last known values of prod and cons before the corruption, and the
 > rest is fairly self-explanatory:
 > 
 > xenbus_xs (process_msg:763) xb_read hdr 0.
 > xb_read: cons: 3470 prod: 3473
 > Finished read of 3 bytes (0 to go)
 > xenbus_xs (process_msg:776) xb_read body 0.
 > xenbus_xs (process_msg:811) process_msg: type 7 body OK.
 > xenbus_xs (read_reply:134) read_reply: type 7 body OK.
 > xenbus_xs (xs_talkv:224) read done.
 > 
 > […]

 Is there anything happening here?

 > 
 > xb_read: cons: 2403996137 prod: 3531897424

 So both cons and prod would be corrupted. As the domU is supposed to update
 rsp_cons only, I guess we're looking for something that is writing to
 random memory.

 Maybe the attached patch will help; anything trying to write to the page
 outside of xb_read and xb_write should get a page fault.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

 --tKW2IUtsqtDRztdT
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=diff

 Index: xenbus_comms.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/xen/xenbus/xenbus_comms.c,v
 retrieving revision 1.14
 diff -u -p -u -r1.14 xenbus_comms.c
 --- xenbus_comms.c	20 Sep 2011 00:12:24 -0000	1.14
 +++ xenbus_comms.c	20 Oct 2012 16:40:44 -0000
 @@ -37,6 +37,7 @@ __KERNEL_RCSID(0, "$NetBSD: xenbus_comms
  #include <sys/param.h>
  #include <sys/proc.h>
  #include <sys/systm.h>
 +#include <uvm/uvm_extern.h>

  #include <xen/xen.h>	/* for xendomain_is_dom0() */
  #include <xen/hypervisor.h>
 @@ -142,6 +143,10 @@ xb_write(const void *data, unsigned len)
  			continue;
  		if (avail > len)
  			avail = len;
 +		pmap_kenter_ma((vaddr_t)intf,
 +		    xen_start_info.store_mfn << PAGE_SHIFT,
 +		    VM_PROT_READ | VM_PROT_WRITE, 0);
 +		pmap_update(pmap_kernel());

  		memcpy(dst, data, avail);
  		data = (const char *)data + avail;
 @@ -151,6 +156,10 @@ xb_write(const void *data, unsigned len)
  		xen_rmb();
  		intf->req_prod += avail;
  		xen_rmb();
 +		pmap_protect(pmap_kernel(), (vaddr_t)intf,
 +		    (vaddr_t)intf + PAGE_SIZE,
 +		    VM_PROT_READ);
 +		pmap_update(pmap_kernel());

  		hypervisor_notify_via_evtchn(xen_start_info.store_evtchn);
  	}
 @@ -198,9 +207,17 @@ xb_read(void *data, unsigned len)
  		len -= avail;

  		/* Other side must not see free space until we've copied out */
 +		pmap_kenter_ma((vaddr_t)intf,
 +		    xen_start_info.store_mfn << PAGE_SHIFT,
 +		    VM_PROT_READ | VM_PROT_WRITE, 0);
 +		pmap_update(pmap_kernel());
  		xen_rmb();
  		intf->rsp_cons += avail;
  		xen_rmb();
 +		pmap_protect(pmap_kernel(), (vaddr_t)intf,
 +		    (vaddr_t)intf + PAGE_SIZE,
 +		    VM_PROT_READ);
 +		pmap_update(pmap_kernel());

  		XENPRINTF(("Finished read of %i bytes (%i to go)\n",
  		    avail, len));

 --tKW2IUtsqtDRztdT--
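
The idea behind the patch, keeping the shared page read-only except for a short window around each legitimate update, can be tried out in user space. The sketch below is an analogy only, assuming a POSIX system with mmap(2) and mprotect(2). struct guarded_page, guarded_page_init() and guarded_page_add() are hypothetical names; the kernel patch uses pmap_kenter_ma() and pmap_protect(), not these calls.

```c
/* Makes MAP_ANON visible on glibc when compiling in strict ISO C mode. */
#define _DEFAULT_SOURCE 1

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical wrapper around one page that is normally read-only. */
struct guarded_page {
	uint32_t *base;	/* start of the mapping */
	size_t len;	/* length of the mapping (one page) */
};

/* Map one anonymous page and leave it read-only. Returns 0 on success. */
static int
guarded_page_init(struct guarded_page *gp)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	void *p;

	if (pagesz <= 0)
		return -1;
	p = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	gp->base = p;
	gp->len = (size_t)pagesz;
	return mprotect(p, gp->len, PROT_READ);
}

/*
 * Add 'delta' to the 32-bit word at 'idx', opening a write window only
 * for the duration of the update. A store through this mapping made
 * outside this function faults instead of silently changing the page.
 */
static int
guarded_page_add(struct guarded_page *gp, size_t idx, uint32_t delta)
{
	if (idx >= gp->len / sizeof(uint32_t))
		return -1;
	if (mprotect(gp->base, gp->len, PROT_READ | PROT_WRITE) != 0)
		return -1;
	gp->base[idx] += delta;
	return mprotect(gp->base, gp->len, PROT_READ);
}
```

A guard of this kind only catches stores that go through this particular mapping. Writes through another mapping of the same memory, or from the other end of a shared ring, are not caught.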

From: =?ISO-8859-1?Q?Roger_Pau_Monn=E9?= <roger.pau@citrix.com>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
	"port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
	"gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>, "netbsd-bugs@netbsd.org"
	<netbsd-bugs@NetBSD.org>, "royger@netbsd.org" <royger@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sat, 20 Oct 2012 20:34:16 +0200

 On 20/10/12 18:42, Manuel Bouyer wrote:
 > So both cons and prod would be corrupted. As the domU is supposed to update
 > rsp_cons only, I guess we're looking for something that is writing to
 > random memory.
 > 
 > Maybe the attached patch will help; anything trying to write to the page
 > outside of xb_read and xb_write should get a page fault.

 I'm sorry to say that the patch didn't seem to help. Here is another
 output with your patch applied. It seems like prod and cons get
 overwritten with random data.

 xenbus_xs (process_msg:763) xb_read hdr 0.
 xb_read: cons: 3521 prod: 3523
 Finished read of 2 bytes (0 to go)
 xenbus_xs (process_msg:776) xb_read body 0.
 xenbus_xs (process_msg:811) process_msg: type 6 body 4.
 xenbus_xs (read_reply:134) read_reply: type 6 body 4.
 xenbus_xs (xs_talkv:224) read done.
 xenbus_xs (xs_talkv:202) write msg.
 xenbus_xs (xs_talkv:204) write msg err 0.
 xenbus_xs (xs_talkv:212) write iovect.
 xenbus_xs (xs_talkv:214) write iovect err 0.
 xenbus_xs (xs_talkv:222) read.
 xb_read: cons: 3523 prod: 3546
 Finished read of 16 bytes (0 to go)
 xenbus_xs (process_msg:763) xb_read hdr 0.
 xb_read: cons: 3539 prod: 3546
 Finished read of 7 bytes (0 to go)
 xenbus_xs (process_msg:776) xb_read body 0.
 xenbus_xs (process_msg:811) process_msg: type 16 body ENOENT.
 xenbus_xs (read_reply:134) read_reply: type 16 body ENOENT.
 xenbus_xs (xs_talkv:224) read done.
 xenbus_xs (xs_talkv:202) write msg.
 xenbus_xs (xs_talkv:204) write msg err 0.
 xenbus_xs (xs_talkv:212) write iovect.
 xenbus_xs (xs_talkv:214) write iovect err 0.
 xenbus_xs (xs_talkv:222) read.
 xb_read: cons: 3546 prod: 3565
 Finished read of 16 bytes (0 to go)
 xenbus_xs (process_msg:763) xb_read hdr 0.
 xb_read: cons: 3562 prod: 3565
 Finished read of 3 bytes (0 to go)
 xenbus_xs (process_msg:776) xb_read body 0.
 xenbus_xs (process_msg:811) process_msg: type 7 body OK.
 xenbus_xs (read_reply:134) read_reply: type 7 body OK.
 xenbus_xs (xs_talkv:224) read done.
 boot device: xbd0
 root on xbd0a dumps on xbd0b
 Your machine does not initialize mem_clusters; sparse_dumps disabled
 /: replaying log to memory
 root file system type: ffs
 Sat Oct 20 17:01:27 UTC 2012
 Starting root file system check:
 /dev/rxbd0a: file system is journaled; not checking
 /: replaying log to disk
 swapctl: setting dump device to /dev/xbd0b
 swapctl: adding /dev/xbd0b as swap device at priority 0
 Starting file system checks:
 Setting tty flags.
 Setting sysctl variables:
 ddb.onpanic: 1 -> 1
 Starting network.
 /etc/rc: WARNING: $hostname not set.
 IPv6 mode: host
 Configuring network interfaces: xennet0.
 Adding interface aliases:.
 Building databases: dev, utmp, utmpx.
 wsconscfg: Cannot open `/dev/ttyEcfg': Device not configured
 wsconscfg: Cannot open `/dev/ttyEcfg': Device not configured
 wsconscfg: Cannot open `/dev/ttyEcfg': Device not configured
 wsconscfg: Cannot open `/dev/ttyEcfg': Device not configured
 Starting syslogd.
 Mounting all filesystems...
 Clearing temporary files.
 Checking quotas: done.
 swapctl: setting dump device to /dev/xbd0b
 Starting virecover.
 Checking for core dump...
 savecore: no core dump
 Starting local daemons:.
 Updating motd.
 Starting powerd.
 Starting sshd.
 postfix: rebuilding /etc/mail/aliases (missing /etc/mail/aliases.db)
 newaliases: warning: valid_hostname: empty hostname
 newaliases: fatal: unable to use my own hostname
 /etc/rc.d/postfix exited with code 1
 Oct 20 17:01:46  postfix/sendmail[494]: fatal: unable to use my own hostname
 Starting inetd.
 Starting cron.
 The following components reported failures:
     /etc/rc.d/postfix
 See /var/run/rc.log for more information.
 Sat Oct 20 17:01:46 UTC 2012
 Oct 20 17:01:48  getty[589]: /dev/ttyE2: Device not configured
 Oct 20 17:01:48  getty[569]: /dev/ttyE3: Device not configured
 Oct 20 17:01:48  getty[501]: /dev/ttyE1: Device not configured

 NetBSD/amd64 (Amnesiac) (console)

 login: root
 Password:
 Oct 20 17:02:51  login: ROOT LOGIN (root) on tty console
 Last login: Sat Oct 20 16:06:11 2012 on console
 Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
     2006, 2007, 2008, 2009, 2010, 2011, 2012
     The NetBSD Foundation, Inc.  All rights reserved.
 Copyright (c) 1982, 1986, 1989, 1991, 1993
     The Regents of the University of California.  All rights reserved.

 NetBSD 6.99.14 (XEN3_DOMU) #2: Sat Oct 20 18:50:37 CEST 2012

 Welcome to NetBSD!

 Terminal type is vt100.
 We recommend that you create a non-root account and use su(1) for root
 access.
 # cd src
 # while [ 1 ]; do
 > ./build.sh -j3 -m amd64 -O ../obj -T ../tools build >log
 > done
 xb_read: cons: 706764616 prod: 2607
 xb_read EIO
 xenbus_xs (process_msg:763) xb_read hdr 5.
 panic: XENBUS error 5 while reading message

 fatal breakpoint trap in supervisor mode
 trap type 1 code 0 rip ffffffff8012e655 cs e030 rflags 246 cr2
 7f7ff780c390 ilevel 0 rsp ffffa0002e4c5b30
 curlwp 0xffffa00002367980 pid 0 lid 32 lowest kstack 0xffffa0002e4c2000
 Stopped in pid 0.32 (system) at netbsd:breakpoint+0x5:  leave
 breakpoint() at netbsd:breakpoint+0x5
 vpanic() at netbsd:vpanic+0x1f2
 printf_nolog() at netbsd:printf_nolog
 xenbus_thread() at netbsd:xenbus_thread+0x140
 ds          7980
 es          5b70
 fs          100
 gs          b880
 rdi         0
 rsi         d
 rbp         ffffa0002e4c5b30
 rbx         104
 rdx         0
 rcx         8
 rax         1
 r8          ffffffff8063d600    cpu_info_primary
 r9          1
 r10         0
 r11         ffffa0000238c000
 r12         ffffffff804adee8    copyright+0x3bae8
 r13         ffffa0002e4c5b70
 r14         ffffa00002367980
 r15         c2c2c2c2c2c2c2c2
 rip         ffffffff8012e655    breakpoint+0x5
 cs          e030
 rflags      246
 rsp         ffffa0002e4c5b30
 ss          e02b
 netbsd:breakpoint+0x5:  leave
 db{0}>

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <roger.pau@citrix.com>
Cc: "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
        "port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
        "gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>,
        "netbsd-bugs@netbsd.org" <netbsd-bugs@NetBSD.org>,
        "royger@netbsd.org" <royger@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sun, 21 Oct 2012 13:29:10 +0200

 --ibTvN161/egqYuK8
 Content-Type: text/plain; charset=iso-8859-1
 Content-Disposition: inline
 Content-Transfer-Encoding: 8bit

 On Sat, Oct 20, 2012 at 08:34:16PM +0200, Roger Pau Monné wrote:
 > I'm sorry to say that the patch didn't seem to help. Here is another
 > output with your patch applied. It seems like prod and cons get
 > overwritten with random data.

 OK, here's another patch, which also checks that the mapping doesn't
 change. But I wonder if the corruption occurs on the NetBSD side.
 Could you also add some debugging code on the other side?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

 --ibTvN161/egqYuK8
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=diff

 Index: xenbus_comms.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/xen/xenbus/xenbus_comms.c,v
 retrieving revision 1.14
 diff -u -p -u -r1.14 xenbus_comms.c
 --- xenbus_comms.c	20 Sep 2011 00:12:24 -0000	1.14
 +++ xenbus_comms.c	21 Oct 2012 11:26:23 -0000
 @@ -37,6 +37,7 @@ __KERNEL_RCSID(0, "$NetBSD: xenbus_comms
  #include <sys/param.h>
  #include <sys/proc.h>
  #include <sys/systm.h>
 +#include <uvm/uvm_extern.h>

  #include <xen/xen.h>	/* for xendomain_is_dom0() */
  #include <xen/hypervisor.h>
 @@ -121,7 +122,10 @@ xb_write(const void *data, unsigned len)
  	while (len != 0) {
  		void *dst;
  		unsigned int avail;
 +		paddr_t pa;

 +		KASSERT(pmap_extract_ma(pmap_kernel(), (vaddr_t)intf, &pa));
 +		KASSERT(pa == (xen_start_info.store_mfn << PAGE_SHIFT));
  		while ((intf->req_prod - intf->req_cons) == XENSTORE_RING_SIZE) {
  			XENPRINTF(("xb_write tsleep\n"));
  			tsleep(&xenstore_interface, PRIBIO, "wrst", 0);
 @@ -142,6 +146,10 @@ xb_write(const void *data, unsigned len)
  			continue;
  		if (avail > len)
  			avail = len;
 +		pmap_kenter_ma((vaddr_t)intf,
 +		    xen_start_info.store_mfn << PAGE_SHIFT,
 +		    VM_PROT_READ | VM_PROT_WRITE, 0);
 +		pmap_update(pmap_kernel());

  		memcpy(dst, data, avail);
  		data = (const char *)data + avail;
 @@ -151,6 +159,10 @@ xb_write(const void *data, unsigned len)
  		xen_rmb();
  		intf->req_prod += avail;
  		xen_rmb();
 +		pmap_protect(pmap_kernel(), (vaddr_t)intf,
 +		    (vaddr_t)intf + PAGE_SIZE,
 +		    VM_PROT_READ);
 +		pmap_update(pmap_kernel());

  		hypervisor_notify_via_evtchn(xen_start_info.store_evtchn);
  	}
 @@ -170,6 +182,10 @@ xb_read(void *data, unsigned len)
  	while (len != 0) {
  		unsigned int avail;
  		const char *src;
 +		paddr_t pa;
 +
 +		KASSERT(pmap_extract_ma(pmap_kernel(), (vaddr_t)intf, &pa));
 +		KASSERT(pa == (xen_start_info.store_mfn << PAGE_SHIFT));

  		while (intf->rsp_cons == intf->rsp_prod)
  			tsleep(&xenstore_interface, PRIBIO, "rdst", 0);
 @@ -198,9 +214,17 @@ xb_read(void *data, unsigned len)
  		len -= avail;

  		/* Other side must not see free space until we've copied out */
 +		pmap_kenter_ma((vaddr_t)intf,
 +		    xen_start_info.store_mfn << PAGE_SHIFT,
 +		    VM_PROT_READ | VM_PROT_WRITE, 0);
 +		pmap_update(pmap_kernel());
  		xen_rmb();
  		intf->rsp_cons += avail;
  		xen_rmb();
 +		pmap_protect(pmap_kernel(), (vaddr_t)intf,
 +		    (vaddr_t)intf + PAGE_SIZE,
 +		    VM_PROT_READ);
 +		pmap_update(pmap_kernel());

  		XENPRINTF(("Finished read of %i bytes (%i to go)\n",
  		    avail, len));

 --ibTvN161/egqYuK8--

From: =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= <royger@NetBSD.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>, 
	"gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>, 
	"port-xen-maintainer@netbsd.org" <port-xen-maintainer@netbsd.org>, 
	"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>, "netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux Dom0
Date: Sun, 21 Oct 2012 19:47:38 +0200

 --bcaec54ee10a2dcbb804cc9555ff
 Content-Type: text/plain; charset=UTF-8

 On Sun, Oct 21, 2012 at 1:29 PM, Manuel Bouyer <bouyer@antioche.eu.org> wrote:
 > OK, here's another patch, which also checks that the mapping doesn't
 > change. But I wonder if the corruption occurs on the NetBSD side.
 > Could you also add some debugging code on the other side?

 Still no luck with the new patch. I've been looking at the Linux code,
 and the attached patch (which takes the idea from Linux) mitigates the
 problem, but we still have it. I've also added the trace and verbose
 options to xenstored running in the Dom0, and there's no sign that
 anyone is writing to xenstore when the crash happens.

 Is it possible that someone writes to the machine address
 xen_start_info.store_mfn, and is there any way to check that nobody is
 mapping this ma to another va?

 --bcaec54ee10a2dcbb804cc9555ff
 Content-Type: application/octet-stream; name="patch.diff"
 Content-Disposition: attachment; filename="patch.diff"
 Content-Transfer-Encoding: base64
 X-Attachment-Id: f_h8kg24l21

 ZGlmZiAtLWdpdCBhL3N5cy9hcmNoL3hlbi94ZW5idXMveGVuYnVzX2NvbW1zLmMgYi9zeXMvYXJj
 aC94ZW4veGVuYnVzL3hlbmJ1c19jb21tcy5jCmluZGV4IDA0ZTRmMDIuLjllMjUzMDcgMTAwNjQ0
 Ci0tLSBhL3N5cy9hcmNoL3hlbi94ZW5idXMveGVuYnVzX2NvbW1zLmMKKysrIGIvc3lzL2FyY2gv
 eGVuL3hlbmJ1cy94ZW5idXNfY29tbXMuYwpAQCAtMTMzLDYgKzEzMyw3IEBAIHhiX3dyaXRlKGNv
 bnN0IHZvaWQgKmRhdGEsIHVuc2lnbmVkIGxlbikKIAkJcHJvZCA9IGludGYtPnJlcV9wcm9kOwog
 CQl4ZW5fcm1iKCk7CiAJCWlmICghY2hlY2tfaW5kZXhlcyhjb25zLCBwcm9kKSkgeworCQkJaW50
 Zi0+cmVxX2NvbnMgPSBpbnRmLT5yZXFfcHJvZCA9IDA7CiAJCQlzcGx4KHMpOwogCQkJcmV0dXJu
 IEVJTzsKIAkJfQpAQCAtMTgwLDYgKzE4MSw3IEBAIHhiX3JlYWQodm9pZCAqZGF0YSwgdW5zaWdu
 ZWQgbGVuKQogCQl4ZW5fcm1iKCk7CiAJCWlmICghY2hlY2tfaW5kZXhlcyhjb25zLCBwcm9kKSkg
 ewogCQkJWEVOUFJJTlRGKCgieGJfcmVhZCBFSU9cbiIpKTsKKwkJCWludGYtPnJzcF9jb25zID0g
 aW50Zi0+cnNwX3Byb2QgPSAwOwogCQkJc3BseChzKTsKIAkJCXJldHVybiBFSU87CiAJCX0K
 --bcaec54ee10a2dcbb804cc9555ff--
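
The mitigation described in this message, resetting both indexes when they fail validation, looks roughly like the following. This is a minimal stand-alone sketch rather than the attached patch: struct ring_indexes and ring_validate_or_reset() are hypothetical names, the 1024-byte ring size is an assumption, and the real code operates on the shared xenstore interface page.

```c
#include <errno.h>
#include <stdint.h>

#define RING_SIZE 1024u	/* assumed ring size in bytes */

/* Hypothetical stand-in for the response half of the shared ring header. */
struct ring_indexes {
	uint32_t rsp_cons;
	uint32_t rsp_prod;
};

/*
 * Return 0 if the indexes are consistent. Otherwise reset both to zero,
 * so that the next call starts from an empty ring instead of failing
 * forever, and report EIO to the caller.
 */
static int
ring_validate_or_reset(struct ring_indexes *r)
{
	if ((uint32_t)(r->rsp_prod - r->rsp_cons) <= RING_SIZE)
		return 0;
	r->rsp_cons = r->rsp_prod = 0;
	return EIO;
}
```

Resetting turns an endless EIO loop into a single failed read, but any reply that was in flight is lost and the source of the corruption is untouched, which would be consistent with the problem being mitigated rather than fixed.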

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <royger@NetBSD.org>
Cc: Roger Pau =?iso-8859-1?Q?Monn=E9?= <roger.pau@citrix.com>,
        "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
        "port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
        "gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>,
        "netbsd-bugs@netbsd.org" <netbsd-bugs@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sun, 21 Oct 2012 20:00:21 +0200

 On Sun, Oct 21, 2012 at 07:47:38PM +0200, Roger Pau Monné wrote:
 > On Sun, Oct 21, 2012 at 1:29 PM, Manuel Bouyer <bouyer@antioche.eu.org> wrote:
 > > OK, here's another patch, which also checks that the mapping doesn't
 > > change. But I wonder if the corruption occurs on the NetBSD side.
 > > Could you also add some debugging code on the other side?
 > 
 > Still no luck with the new patch. I've been looking at the Linux code,
 > and the attached patch (which takes the idea from Linux) mitigates the
 > problem, but we still have it.

 Does Linux do this silently, or does it complain when the ring
 corruption occurs?

 > I've also added the trace and verbose
 > options to xenstored running in the Dom0, and there's no sign that
 > anyone is writing to xenstore when the crash happens.
 > 
 > Is it possible that someone writes to the machine address
 > xen_start_info.store_mfn, and is there any way to check that nobody is
 > mapping this ma to another va?

 I've been thinking about checking this, but it's harder to do.
 Maybe it's easier to do this check in the hypervisor ?

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= <royger@NetBSD.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>, 
	"gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>, 
	"port-xen-maintainer@netbsd.org" <port-xen-maintainer@netbsd.org>, 
	"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>, "netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux Dom0
Date: Sun, 21 Oct 2012 20:10:36 +0200

 On Sun, Oct 21, 2012 at 8:00 PM, Manuel Bouyer <bouyer@antioche.eu.org> wrote:
 > Does Linux do this silently, or does it complain when the ring
 > corruption occurs?

 With the patch attached in the previous post, we will do the same as
 Linux (reset indexes and printk). I've never seen that happen in
 Linux, so I'm not sure if there's anything else.

 >> Is it possible that someone writes to the machine address
 >> xen_start_info.store_mfn, and is there any way to check that nobody is
 >> mapping this ma to another va?
 >
 > I've been thinking about checking this, but it's harder to do.
 > Maybe it's easier to do this check in the hypervisor ?

 Will check that; I'm not sure if there's an easy way to do this in the hypervisor.

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <royger@NetBSD.org>
Cc: Roger Pau =?iso-8859-1?Q?Monn=E9?= <roger.pau@citrix.com>,
        "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
        "port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
        "gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>,
        "netbsd-bugs@netbsd.org" <netbsd-bugs@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Sun, 21 Oct 2012 20:31:18 +0200

 --sdtB3X0nJg68CQEu
 Content-Type: text/plain; charset=iso-8859-1
 Content-Disposition: inline
 Content-Transfer-Encoding: 8bit

 On Sun, Oct 21, 2012 at 08:10:36PM +0200, Roger Pau Monné wrote:
 > On Sun, Oct 21, 2012 at 8:00 PM, Manuel Bouyer <bouyer@antioche.eu.org> wrote:
 > > Does Linux do this silently, or does it complain when the ring
 > > corruption occurs?
 > 
 > With the patch attached in the previous post, we will do the same as
 > Linux (reset indexes and printk). I've never seen that happen in
 > Linux, so I'm not sure if there's anything else.
 > 
 > >> Is it possible that someone writes to the machine address
 > >> xen_start_info.store_mfn, and is there any way to check that nobody is
 > >> mapping this ma to another va?
 > >
 > > I've been thinking about checking this, but it's harder to do.
 > > Maybe it's easier to do this check in the hypervisor ?
 > 
 > Will check that; I'm not sure if there's an easy way to do this in the hypervisor.

 You can also try the attached patch, which should catch a mapping to
 the same store's ma via regular pmap functions. If it's something more
 clever, we'll need more clever checks ...

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

 --sdtB3X0nJg68CQEu
 Content-Type: text/plain; charset=us-ascii
 Content-Disposition: attachment; filename=diff

 Index: x86/x86/pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/x86/x86/pmap.c,v
 retrieving revision 1.178
 diff -u -p -u -r1.178 pmap.c
 --- x86/x86/pmap.c	15 Jun 2012 13:53:40 -0000	1.178
 +++ x86/x86/pmap.c	21 Oct 2012 18:26:43 -0000
 @@ -325,6 +325,8 @@ kmutex_t pmaps_lock;

  static vaddr_t pmap_maxkvaddr;

 +extern void *xenstore_interface;
 +
  /*
   * XXX kludge: dummy locking to make KASSERTs in uvm_page.c comfortable.
   * actual locking is done by pm_lock.
 @@ -994,6 +996,9 @@ pmap_kenter_pa(vaddr_t va, paddr_t pa, v
  	} else
  #endif /* DOM0OPS */
  		npte = pmap_pa2pte(pa);
 +
 +	if (xenstore_interface != NULL)
 +		KASSERT(npte != (xen_start_info.store_mfn << PAGE_SHIFT));
  	npte |= protection_codes[prot] | PG_k | PG_V | pmap_pg_g;
  	npte |= pmap_pat_flags(flags);
  	opte = pmap_pte_testset(pte, npte); /* zap! */
 @@ -1026,6 +1031,8 @@ pmap_emap_enter(vaddr_t va, paddr_t pa, 
  #endif
  		npte = pmap_pa2pte(pa);

 +	if (xenstore_interface != NULL)
 +		KASSERT(npte != (xen_start_info.store_mfn << PAGE_SHIFT));
  	npte = pmap_pa2pte(pa);
  	npte |= protection_codes[prot] | PG_k | PG_V;
  	pmap_pte_set(pte, npte);
 @@ -3900,6 +3907,8 @@ pmap_enter_ma(struct pmap *pmap, vaddr_t
  	bool wired = (flags & PMAP_WIRED) != 0;
  	struct pmap *pmap2;

 +	if (xenstore_interface != NULL)
 +		KASSERT(ma != (xen_start_info.store_mfn << PAGE_SHIFT));
  	KASSERT(pmap_initialized);
  	KASSERT(curlwp->l_md.md_gc_pmap != pmap);
  	KASSERT(va < VM_MAX_KERNEL_ADDRESS);
 Index: xen/x86/xen_pmap.c
 ===================================================================
 RCS file: /cvsroot/src/sys/arch/xen/x86/xen_pmap.c,v
 retrieving revision 1.22
 diff -u -p -u -r1.22 xen_pmap.c
 --- xen/x86/xen_pmap.c	24 Jun 2012 18:31:53 -0000	1.22
 +++ xen/x86/xen_pmap.c	21 Oct 2012 18:26:43 -0000
 @@ -174,12 +174,16 @@ void
  pmap_kenter_ma(vaddr_t va, paddr_t ma, vm_prot_t prot, u_int flags)
  {
  	pt_entry_t *pte, opte, npte;
 +	extern void *xenstore_interface;

  	if (va < VM_MIN_KERNEL_ADDRESS)
  		pte = vtopte(va);
  	else
  		pte = kvtopte(va);

 +	if (xenstore_interface != NULL)
 +		KASSERT(ma != (xen_start_info.store_mfn << PAGE_SHIFT) ||
 +		    va == (vaddr_t)xenstore_interface);
  	npte = ma | ((prot & VM_PROT_WRITE) ? PG_RW : PG_RO) |
  	     PG_V | PG_k;
  	if (flags & PMAP_NOCACHE)

 --sdtB3X0nJg68CQEu--
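
The check added to pmap_kenter_ma in this patch boils down to one predicate: a new mapping may target the store's machine address only if it is made at the expected virtual address. A minimal sketch, assuming 4 KiB pages; mapping_allowed() and its parameters are hypothetical names for illustration and not part of the pmap API.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12	/* assumed page size: 4 KiB */

/*
 * Hypothetical predicate: 'ma' is the machine address being mapped at
 * virtual address 'va'. 'store_mfn' is the machine frame number of the
 * protected page and 'store_va' the only address allowed to map it.
 */
static bool
mapping_allowed(uint64_t va, uint64_t ma, uint64_t store_mfn,
    uint64_t store_va)
{
	uint64_t store_ma = store_mfn << PAGE_SHIFT_4K;

	return ma != store_ma || va == store_va;
}
```

A predicate like this only sees mappings that go through the instrumented functions; a mapping set up by any other path is invisible to it.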

From: =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= <royger@NetBSD.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>, 
	"gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>, 
	"port-xen-maintainer@netbsd.org" <port-xen-maintainer@netbsd.org>, 
	"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>, "netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux Dom0
Date: Mon, 22 Oct 2012 10:03:35 +0200

 On Sun, Oct 21, 2012 at 8:31 PM, Manuel Bouyer <bouyer@antioche.eu.org> wrote:
 > You can also try the attached patch, which should catch a mapping to
 > the same store's ma via regular pmap functions. If it's something more
 > clever, we'll need more clever checks ...

 No luck with this either. I've also found a sporadic:

 evtchn_do_event: handler 0xffffffff801206a7 didn't lower ipl 8 7

 which I would say is not related to the problem at hand (because
 sometimes I get the corruption without seeing this message). The
 handler in question is xen_timer_handler, which doesn't set any spl
 levels directly, although the mutex tmutex sets IPL_CLOCK.

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <royger@NetBSD.org>
Cc: Roger Pau =?iso-8859-1?Q?Monn=E9?= <roger.pau@citrix.com>,
        "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
        "port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
        "gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>,
        "netbsd-bugs@netbsd.org" <netbsd-bugs@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Mon, 22 Oct 2012 10:08:58 +0200

 On Mon, Oct 22, 2012 at 10:03:35AM +0200, Roger Pau Monné wrote:
 > On Sun, Oct 21, 2012 at 8:31 PM, Manuel Bouyer <bouyer@antioche.eu.org> wrote:
 > > You can also try the attached patch, which should catch a mapping to
 > > the same store's ma via regular pmap functions. If it's something more
 > > clever, we'll need more clever checks ...
 > 
 > No luck with this either. I've also found a sporadic:
 > 
 > evtchn_do_event: handler 0xffffffff801206a7 didn't lower ipl 8 7
 > 
 > which I would say is not related to the problem at hand (because
 > sometimes I get the corruption without seeing this message). The
 > handler in question is xen_timer_handler, which doesn't set any spl
 > levels directly, although the mutex tmutex sets IPL_CLOCK.

 I'm seeing this too. The problem is probably in something called by the
 clock handler, but I failed to find what. It's not a real problem because
 the Xen interrupt code will restore the IPL, but this means that
 something is not restoring the IPL properly somewhere.

 But it should be completely unrelated to the ring corruption issue.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= <royger@NetBSD.org>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: =?UTF-8?Q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>, 
	"gnats-bugs@NetBSD.org" <gnats-bugs@netbsd.org>, 
	"port-xen-maintainer@netbsd.org" <port-xen-maintainer@netbsd.org>, 
	"gnats-admin@netbsd.org" <gnats-admin@netbsd.org>, "netbsd-bugs@netbsd.org" <netbsd-bugs@netbsd.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux Dom0
Date: Mon, 22 Oct 2012 11:42:40 +0200

 On Mon, Oct 22, 2012 at 10:08 AM, Manuel Bouyer <bouyer@antioche.eu.org> wrote:
 > I'm seeing this too. The problem is probably in something called by the
 > clock handler, but I failed to find what. It's not a real problem because
 > the Xen interrupt code will restore the IPL, but this means that
 > something is not restoring the IPL properly somewhere.
 >
 > But it should be completely unrelated to the ring corruption issue.

 Another possibly unrelated problem: I've tried enabling XENDEBUG_LOW
 in x86_xpmap to see the ma passed by Xen at start, and I've got the
 following fault:

 xen_arch_pmap_bootstrap init_tables=0xffffffff80b0c000
 xen_bootstrap_tables(0xffffffff80b0c000, 0xffffffff80b21000, 9, 17)
 xen_bootstrap_tables text_end 0xffffffff8063a000 map_end 0xffffffff80b36000
 console 0x124afc xenstore 0x124afd
 L3 va 0xffffffff80b23000 pa 0xb23000 entry 0x124ae4007 -> L4[0x1ff]
 L2 va 0xffffffff80b24000 pa 0xb24000 entry 0x124ae3007 -> L3[0x1fe]
 L1 va 0xffffffff80b25000 pa 0xb25000 entry 0x124ae2007 -> L2[0]
 L1 va 0xffffffff80b26000 pa 0xb26000 entry 0x124ae1007 -> L2[0x1]
 L1 va 0xffffffff80b27000 pa 0xb27000 entry 0x124ae0007 -> L2[0x2]
 L1 va 0xffffffff80b28000 pa 0xb28000 entry 0x124adf007 -> L2[0x3]
 L1 va 0xffffffff80b29000 pa 0xb29000 entry 0x124ade007 -> L2[0x4]
 xenstore_interface va 0xffffffff80b0a000 pte 0x124afd000
 xencons_interface va 0xffffffff80b0b000 pte 0x124afc000
 va 0xffffffff80b0c000 pa 0xb0c000 entry 0x124afb005 -> L1[0x10c]
 va 0xffffffff80b0d000 pa 0xb0d000 entry 0x124afa005 -> L1[0x10d]
 va 0xffffffff80b0e000 pa 0xb0e000 entry 0x124af9005 -> L1[0x10e]
 va 0xffffffff80b0f000 pa 0xb0f000 entry 0x124af8005 -> L1[0x10f]
 va 0xffffffff80b10000 pa 0xb10000 entry 0x124af7005 -> L1[0x110]
 va 0xffffffff80b11000 pa 0xb11000 entry 0x124af6005 -> L1[0x111]
 va 0xffffffff80b12000 pa 0xb12000 entry 0x124af5005 -> L1[0x112]
 va 0xffffffff80b13000 pa 0xb13000 entry 0x124af4005 -> L1[0x113]
 va 0xffffffff80b14000 pa 0xb14000 entry 0x124af3005 -> L1[0x114]
 va 0xffffffff80b21000 pa 0xb21000 entry 0x124ae6005 -> L1[0x121]
 va 0xffffffff80b22000 pa 0xb22000 entry 0x124ae5005 -> L1[0x122]
 va 0xffffffff80b23000 pa 0xb23000 entry 0x124ae4005 -> L1[0x123]
 va 0xffffffff80b24000 pa 0xb24000 entry 0x124ae3005 -> L1[0x124]
 va 0xffffffff80b25000 pa 0xb25000 entry 0x124ae2005 -> L1[0x125]
 va 0xffffffff80b26000 pa 0xb26000 entry 0x124ae1005 -> L1[0x126]
 va 0xffffffff80b27000 pa 0xb27000 entry 0x124ae0005 -> L1[0x127]
 va 0xffffffff80b28000 pa 0xb28000 entry 0x124adf005 -> L1[0x128]
 va 0xffffffff80b29000 pa 0xb29000 entry 0x124ade005 -> L1[0x129]
 va 0xffffffff80b2a000 pa 0xb2a000 entry 0x124add005 -> L1[0x12a]
 va 0xffffffff80b2b000 pa 0xb2b000 entry 0x124adc005 -> L1[0x12b]
 va 0xffffffff80b2c000 pa 0xb2c000 entry 0x124adb005 -> L1[0x12c]
 va 0xffffffff80b2d000 pa 0xb2d000 entry 0x124ada005 -> L1[0x12d]
 va 0xffffffff80b2e000 pa 0xb2e000 entry 0x124ad9005 -> L1[0x12e]
 va 0xffffffff80b2f000 pa 0xb2f000 entry 0x124ad8005 -> L1[0x12f]
 va 0xffffffff80b30000 pa 0xb30000 entry 0x124ad7005 -> L1[0x130]
 va 0xffffffff80b31000 pa 0xb31000 entry 0x124ad6005 -> L1[0x131]
 va 0xffffffff80b32000 pa 0xb32000 entry 0x124ad5005 -> L1[0x132]
 va 0xffffffff80b33000 pa 0xb33000 entry 0x124ad4005 -> L1[0x133]
 va 0xffffffff80b34000 pa 0xb34000 entry 0x124ad3005 -> L1[0x134]
 va 0xffffffff80b35000 pa 0xb35000 entry 0x124ad2005 -> L1[0x135]
 L1 va 0xffffffff80b2a000 pa 0xb2a000 entry 0x124add007 -> L2[0x5]
 L1 va 0xffffffff80b2b000 pa 0xb2b000 entry 0x124adc007 -> L2[0x6]
 L1 va 0xffffffff80b2c000 pa 0xb2c000 entry 0x124adb007 -> L2[0x7]
 L1 va 0xffffffff80b2d000 pa 0xb2d000 entry 0x124ada007 -> L2[0x8]
 L1 va 0xffffffff80b2e000 pa 0xb2e000 entry 0x124ad9007 -> L2[0x9]
 L1 va 0xffffffff80b2f000 pa 0xb2f000 entry 0x124ad8007 -> L2[0xa]
 L1 va 0xffffffff80b30000 pa 0xb30000 entry 0x124ad7007 -> L2[0xb]
 L1 va 0xffffffff80b31000 pa 0xb31000 entry 0x124ad6007 -> L2[0xc]
 L1 va 0xffffffff80b32000 pa 0xb32000 entry 0x124ad5007 -> L2[0xd]
 L1 va 0xffffffff80b33000 pa 0xb33000 entry 0x124ad4007 -> L2[0xe]
 L1 va 0xffffffff80b34000 pa 0xb34000 entry 0x124ad3007 -> L2[0xf]
 L1 va 0xffffffff80b35000 pa 0xb35000 entry 0x124ad2007 -> L2[0x10]
 bt_pgd[PDIR_SLOT_PTE] va 0xffffffff80b21000 pa 0xb21000 entry 0x124ae5005
 pin PGD: b21000
 switch to PGD
 bt_pgd[PDIR_SLOT_PTE] now entry 0x124ae5005
 unpin old PGD
 *pde 0x124add027 addr 0xb2a000 pte 0xffffffff80b2a860
 xen_bootstrap_tables(0xffffffff80b21000, 0xffffffff80b0c000, 21, 17)
 xen_bootstrap_tables text_end 0xffffffff8063a000 map_end 0xffffffff80b28000
 console 0x124afc xenstore 0x124afd
 L3 va 0xffffffff80b0e000 pa 0xb0e000 entry 0x124af9007 -> L4[0x1ff]
 L2 va 0xffffffff80b0f000 pa 0xb0f000 entry 0x124af8007 -> L3[0x1fe]
 L1 va 0xffffffff80b10000 pa 0xb10000 entry 0x124af7007 -> L2[0]
 L1 va 0xffffffff80b11000 pa 0xb11000 entry 0x124af6007 -> L2[0x1]
 L1 va 0xffffffff80b12000 pa 0xb12000 entry 0x124af5007 -> L2[0x2]
 L1 va 0xffffffff80b13000 pa 0xb13000 entry 0x124af4007 -> L2[0x3]
 L1 va 0xffffffff80b14000 pa 0xb14000 entry 0x124af3007 -> L2[0x4]
 xenstore_interface va 0xffffffff80b0a000 pte 0x124afd000
 xencons_interface va 0xffffffff80b0b000 pte 0x124afc000
 va 0xffffffff80b0c000 pa 0xb0c000 entry 0x124afb005 -> L1[0x10c]
 va 0xffffffff80b0d000 pa 0xb0d000 entry 0x124afa005 -> L1[0x10d]
 va 0xffffffff80b0e000 pa 0xb0e000 entry 0x124af9005 -> L1[0x10e]
 va 0xffffffff80b0f000 pa 0xb0f000 entry 0x124af8005 -> L1[0x10f]
 va 0xffffffff80b10000 pa 0xb10000 entry 0x124af7005 -> L1[0x110]
 va 0xffffffff80b11000 pa 0xb11000 entry 0x124af6005 -> L1[0x111]
 va 0xffffffff80b12000 pa 0xb12000 entry 0x124af5005 -> L1[0x112]
 va 0xffffffff80b13000 pa 0xb13000 entry 0x124af4005 -> L1[0x113]
 va 0xffffffff80b14000 pa 0xb14000 entry 0x124af3005 -> L1[0x114]
 va 0xffffffff80b15000 pa 0xb15000 entry 0x124af2005 -> L1[0x115]
 va 0xffffffff80b16000 pa 0xb16000 entry 0x124af1005 -> L1[0x116]
 va 0xffffffff80b17000 pa 0xb17000 entry 0x124af0005 -> L1[0x117]
 va 0xffffffff80b18000 pa 0xb18000 entry 0x124aef005 -> L1[0x118]
 va 0xffffffff80b19000 pa 0xb19000 entry 0x124aee005 -> L1[0x119]
 va 0xffffffff80b1a000 pa 0xb1a000 entry 0x124aed005 -> L1[0x11a]
 va 0xffffffff80b1b000 pa 0xb1b000 entry 0x124aec005 -> L1[0x11b]
 va 0xffffffff80b1c000 pa 0xb1c000 entry 0x124aeb005 -> L1[0x11c]
 va 0xffffffff80b1d000 pa 0xb1d000 entry 0x124aea005 -> L1[0x11d]
 va 0xffffffff80b1e000 pa 0xb1e000 entry 0x124ae9005 -> L1[0x11e]
 va 0xffffffff80b1f000 pa 0xb1f000 entry 0x124ae8005 -> L1[0x11f]
 va 0xffffffff80b20000 pa 0xb20000 entry 0x124ae7005 -> L1[0x120]
 va 0xffffffff80b21000 pa 0xb21000 entry 0x124ae6005 -> L1[0x121]
 va 0xffffffff80b22000 pa 0xb22000 entry 0x124ae5005 -> L1[0x122]
 va 0xffffffff80b23000 pa 0xb23000 entry 0x124ae4005 -> L1[0x123]
 va 0xffffffff80b24000 pa 0xb24000 entry 0x124ae3005 -> L1[0x124]
 va 0xffffffff80b25000 pa 0xb25000 entry 0x124ae2005 -> L1[0x125]
 HYPERVISOR_shared_info va 0xffffffff80b26000 pte 0x80f5000
 va 0xffffffff80b26000 pa 0xb26000 entry 0x80f5005 -> L1[0x126]
 va 0xffffffff80b27000 pa 0xb27000 entry 0x124ae0005 -> L1[0x127]
 L1 va 0xffffffff80b15000 pa 0xb15000 entry 0x124af2007 -> L2[0x5]
 L1 va 0xffffffff80b16000 pa 0xb16000 entry 0x124af1007 -> L2[0x6]
 L1 va 0xffffffff80b17000 pa 0xb17000 entry 0x124af0007 -> L2[0x7]
 L1 va 0xffffffff80b18000 pa 0xb18000 entry 0x124aef007 -> L2[0x8]
 L1 va 0xffffffff80b19000 pa 0xb19000 entry 0x124aee007 -> L2[0x9]
 L1 va 0xffffffff80b1a000 pa 0xb1a000 entry 0x124aed007 -> L2[0xa]
 L1 va 0xffffffff80b1b000 pa 0xb1b000 entry 0x124aec007 -> L2[0xb]
 L1 va 0xffffffff80b1c000 pa 0xb1c000 entry 0x124aeb007 -> L2[0xc]
 L1 va 0xffffffff80b1d000 pa 0xb1d000 entry 0x124aea007 -> L2[0xd]
 L1 va 0xffffffff80b1e000 pa 0xb1e000 entry 0x124ae9007 -> L2[0xe]
 L1 va 0xffffffff80b1f000 pa 0xb1f000 entry 0x124ae8007 -> L2[0xf]
 L1 va 0xffffffff80b20000 pa 0xb20000 entry 0x124ae7007 -> L2[0x10]
 bt_pgd[PDIR_SLOT_PTE] va 0xffffffff80b0c000 pa 0xb0c000 entry 0x124afa005
 pin PGD: b0c000
 switch to PGD
 bt_pgd[PDIR_SLOT_PTE] now entry 0x124afa005
 unpin old PGD
 *pde 0x124af2027 addr 0xb15000 pte 0xffffffff80b15908
 (XEN) d17:v0: unhandled page fault (ec=0000)
 (XEN) Pagetable walk from 00007fffff803f60:
 (XEN)  L4[0x0ff] = 0000000124afb025 0000000000000b0c
 (XEN)  L3[0x1ff] = 0000000124af9027 0000000000000b0e
 (XEN)  L2[0x1fc] = 0000000000000000 ffffffffffffffff
 (XEN) domain_crash_sync called from entry.S
 (XEN) Domain 17 (vcpu#0) crashed on cpu#1:
 (XEN) ----[ Xen-4.3-unstable  x86_64  debug=y  Not tainted ]----
 (XEN) CPU:    1
 (XEN) RIP:    e033:[<ffffffff802eb85f>]
 (XEN) RFLAGS: 0000000000000206   EM: 1   CONTEXT: pv guest
 (XEN) rax: 00007fffff803f60   rbx: 0000000124add007   rcx: 0000000000000000
 (XEN) rdx: ffffffff8063dd40   rsi: 0000000124add007   rdi: 000ffffffffff000
 (XEN) rbp: ffffffff80b24ed0   rsp: ffffffff80b24e40   r8:  ffffffff80b25000
 (XEN) r9:  0000000000000010   r10: 00000000deadbeef   r11: 0000000000000000
 (XEN) r12: 0000000000000000   r13: 0000000000000004   r14: ffffffff007ec976
 (XEN) r15: 000ffffffffff000   cr0: 000000008005003b   cr4: 00000000000026f0
 (XEN) cr3: 0000000124afb000   cr2: 00007fffff803f60
 (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
 (XEN) Guest stack trace from rsp=ffffffff80b24e40:
 (XEN)    0000000000000000 0000000000000000 0000000000000000 ffffffff802eb85f
 (XEN)    000000010000e030 0000000000010006 ffffffff80b24e80 000000000000e02b
 (XEN)    ffffffff80b24ed0 00007fffffc05938 0000000100000000 ffffffff80460288
 (XEN)    0000000000000000 0000000000b25000 ffffffff80b21000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 ffffffff80b24f00 ffffffff80283b3a
 (XEN)    0000000000000000 0000000000000000 00000000756e6547 0000000000000000
 (XEN)    0000000000000000 ffffffff8010009b 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
 (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000

 This only happens with XENDEBUG_LOW set.
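
 For reading the Xen pagetable walk above, here is a minimal sketch of
 how a 64-bit virtual address splits into the four table indices on
 x86_64 with 4 KiB pages (9 bits per level, 12-bit page offset). The
 struct and function names are hypothetical and do not come from the
 NetBSD or Xen sources.

```c
#include <stdint.h>

/* Hypothetical helper type: one index per paging level. */
struct pt_index {
	unsigned l4, l3, l2, l1;
};

/*
 * Split a virtual address into x86_64 4-level page-table indices.
 * Each level consumes 9 bits; the low 12 bits are the page offset.
 */
static struct pt_index
pt_split(uint64_t va)
{
	struct pt_index ix;

	ix.l4 = (unsigned)((va >> 39) & 0x1ff);
	ix.l3 = (unsigned)((va >> 30) & 0x1ff);
	ix.l2 = (unsigned)((va >> 21) & 0x1ff);
	ix.l1 = (unsigned)((va >> 12) & 0x1ff);
	return ix;
}
```

 Applied to the faulting address 00007fffff803f60 this gives L4 0xff,
 L3 0x1ff and L2 0x1fc, the same slots Xen prints, and the walk stops
 at L2 because that entry is zero.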

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <royger@NetBSD.org>
Cc: Roger Pau =?iso-8859-1?Q?Monn=E9?= <roger.pau@citrix.com>,
        "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
        "port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
        "gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>,
        "netbsd-bugs@netbsd.org" <netbsd-bugs@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Mon, 22 Oct 2012 12:01:48 +0200

 On Mon, Oct 22, 2012 at 11:42:40AM +0200, Roger Pau Monné wrote:
 > Another possibly unrelated problem, I've tried enabling XENDEBUG_LOW
 > in x86_xpmap, to see the ma passed by Xen at start, and I've got the
 > following fault:

 This code has not been used for a long time, it's possible that it
 has not tracked other changes ...

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: =?ISO-8859-1?Q?Roger_Pau_Monn=E9?= <roger.pau@citrix.com>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: =?ISO-8859-1?Q?Roger_Pau_Monn=E9?= <royger@NetBSD.org>,
	"gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
	"port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
	"gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>, "netbsd-bugs@netbsd.org"
	<netbsd-bugs@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Tue, 23 Oct 2012 09:45:48 +0200

 I think I've found the problem. It seems to be related to
 xengnt_more_entries, but I still haven't been able to pinpoint exactly
 when the overwrite of xenstore_interface happens. The ring gets
 corrupted just after the call to xengnt_more_entries, but it's not the
 call itself that corrupts the ring.

 xengnt_more_entries start: prod: 3787 cons: 3787
 xengnt_more_entries: map 0x1610ff -> 0xffffa0002da45000
 xengnt_more_entries end: prod: 3787 cons: 3787
 xb_read: xenstore_interface: 0xffffffff80b0a000
 xb_read: cons: 673215352 prod: 1651402104
 xb_read EIO
 xenbus_xs (process_msg:763) xb_read hdr 5.
 panic: XENBUS error 5 while reading message
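
 To illustrate why xb_read gives up with EIO on those values, here is a
 minimal sketch of a ring index sanity check. It assumes the xenstore
 ring is 1024 bytes (XENSTORE_RING_SIZE in Xen's public xs_wire.h) and
 that the reader rejects any producer/consumer pair that is further
 apart than the ring size. The function names are hypothetical and this
 is not the actual NetBSD xenbus code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed ring size, following Xen's public io/xs_wire.h. */
#define XENSTORE_RING_SIZE 1024u

/*
 * Indexes are free-running 32-bit counters, so the amount of pending
 * data is prod - cons (modulo 2^32). It can never legitimately exceed
 * the ring size.
 */
static bool
ring_indexes_ok(uint32_t cons, uint32_t prod)
{
	return (uint32_t)(prod - cons) <= XENSTORE_RING_SIZE;
}

/*
 * Render the four bytes of an index in little-endian (memory) order,
 * replacing non-printable bytes with '.'.
 */
static void
index_as_bytes(uint32_t v, char out[5])
{
	for (int i = 0; i < 4; i++) {
		unsigned char b = (unsigned char)((v >> (8 * i)) & 0xff);

		out[i] = (b >= 0x20 && b < 0x7f) ? (char)b : '.';
	}
	out[4] = '\0';
}
```

 With cons 673215352 and prod 1651402104 the difference is far larger
 than 1024, so the check fails. Incidentally, those two values are the
 byte sequences "xs (" and "xenb" when read as little-endian ASCII,
 which would be consistent with text having been written over the ring
 header. That reading is an inference from the numbers and is not
 stated anywhere in this report.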

From: =?ISO-8859-1?Q?Roger_Pau_Monn=E9?= <roger.pau@citrix.com>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: =?ISO-8859-1?Q?Roger_Pau_Monn=E9?= <royger@NetBSD.org>,
	"gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
	"port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
	"gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>, "netbsd-bugs@netbsd.org"
	<netbsd-bugs@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Tue, 23 Oct 2012 16:01:06 +0200

 --------------090005000003030604070404
 Content-Type: text/plain; charset="ISO-8859-1"
 Content-Transfer-Encoding: 7bit

 Found the problem: grants 0 to 8 (both included) shouldn't be used,
 because they are reserved for the tools. I guess that's xenstore,
 xenconsole and friends, so that's where the corruption came from, and
 that's why the problem seemed to be related to xengnt_more_entries,
 because it gets called when those low grants are used. The attached
 patch solves the problem for me.

 --------------090005000003030604070404
 Content-Type: text/plain; charset="UTF-8"; x-mac-type=0; x-mac-creator=0;
 	name="0001-xen-don-t-use-grants-0-9.patch"
 Content-Transfer-Encoding: 7bit
 Content-Disposition: attachment;
 	filename="0001-xen-don-t-use-grants-0-9.patch"

 From b80f10a3c3d0b95d3cd2a60a4669a2118fdbb9ef Mon Sep 17 00:00:00 2001
 From: Roger Pau Monne <roger.pau@citrix.com>
 Date: Tue, 23 Oct 2012 15:21:18 +0200
 Subject: [PATCH] xen: don't use grants 0-9

 Not all grants from the first frame can be used, grants from 0 to 8
 (both included) are reserved for external tools. Using this grants
 caused system crashes and fs corruption.
 ---
  sys/arch/xen/xen/xengnt.c |   15 +++++++++++----
  1 files changed, 11 insertions(+), 4 deletions(-)

 diff --git a/sys/arch/xen/xen/xengnt.c b/sys/arch/xen/xen/xengnt.c
 index 621d2dc..2de4fd3 100644
 --- a/sys/arch/xen/xen/xengnt.c
 +++ b/sys/arch/xen/xen/xengnt.c
 @@ -51,6 +51,9 @@ __KERNEL_RCSID(0, "$NetBSD: xengnt.c,v 1.24 2012/06/30 23:36:20 jym Exp $");

  #define NR_GRANT_ENTRIES_PER_PAGE (PAGE_SIZE / sizeof(grant_entry_t))

 +/* External tools reserve first few grant table entries. */
 +#define NR_RESERVED_ENTRIES 8
 +
  /* Current number of frames making up the grant table */
  int gnt_nr_grant_frames;
  /* Maximum number of frames that can make up the grant table */
 @@ -161,7 +164,7 @@ xengnt_more_entries(void)
  	gnttab_setup_table_t setup;
  	u_long *pages;
  	int nframes_new = gnt_nr_grant_frames + 1;
 -	int i;
 +	int i, start_gnt;
  	KASSERT(mutex_owned(&grant_lock));

  	if (gnt_nr_grant_frames == gnt_max_grant_frames)
 @@ -204,9 +207,13 @@ xengnt_more_entries(void)

  	/*
  	 * add the grant entries associated to the last grant table frame
 -	 * and mark them as free
 +	 * and mark them as free. Prevent using the first grants (from 0 to 8)
 +	 * since they are used by the tools.
  	 */
 -	for (i = gnt_nr_grant_frames * NR_GRANT_ENTRIES_PER_PAGE;
 +	start_gnt = (gnt_nr_grant_frames * NR_GRANT_ENTRIES_PER_PAGE) <
 +				NR_RESERVED_ENTRIES + 1 ? NR_RESERVED_ENTRIES + 1 :
 +				(gnt_nr_grant_frames * NR_GRANT_ENTRIES_PER_PAGE);
 +	for (i = start_gnt;
  	    i < nframes_new * NR_GRANT_ENTRIES_PER_PAGE;
  	    i++) {
  		KASSERT(gnt_entries[last_gnt_entry] == XENGNT_NO_ENTRY);
 @@ -240,7 +247,7 @@ xengnt_get_entry(void)
  	last_gnt_entry--;
  	entry = gnt_entries[last_gnt_entry];
  	gnt_entries[last_gnt_entry] = XENGNT_NO_ENTRY;
 -	KASSERT(entry != XENGNT_NO_ENTRY);
 +	KASSERT(entry != XENGNT_NO_ENTRY && entry > NR_RESERVED_ENTRIES);
  	KASSERT(last_gnt_entry >= 0);
  	KASSERT(last_gnt_entry <= gnt_max_grant_frames * NR_GRANT_ENTRIES_PER_PAGE);
  	return entry;
 -- 
 1.7.7.5 (Apple Git-26)


 --------------090005000003030604070404--
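
 The start index computed by the patch above can be sketched in
 isolation as follows. This assumes 4 KiB pages and an 8-byte version 1
 grant_entry_t, which gives 512 entries per grant table frame; both
 sizes and the helper names are assumptions made for the illustration,
 and only NR_RESERVED_ENTRIES is taken from the patch.

```c
/* Assumed sizes: 4 KiB page, 8-byte v1 grant entry. */
#define PAGE_SIZE_BYTES		4096u
#define GRANT_ENTRY_BYTES	8u
#define ENTRIES_PER_PAGE	(PAGE_SIZE_BYTES / GRANT_ENTRY_BYTES)

/* From the patch: grants 0..8 are left to the tools. */
#define NR_RESERVED_ENTRIES	8u

/*
 * First grant reference that may be put on the free list when the
 * table grows from nr_frames to nr_frames + 1 frames. Only the very
 * first frame is affected by the reserved range.
 */
static unsigned
first_free_grant(unsigned nr_frames)
{
	unsigned first = nr_frames * ENTRIES_PER_PAGE;
	unsigned floor = NR_RESERVED_ENTRIES + 1;

	return (first < floor) ? floor : first;
}

/* Number of entries added to the free list by that growth step. */
static unsigned
grants_added(unsigned nr_frames)
{
	return (nr_frames + 1) * ENTRIES_PER_PAGE -
	    first_free_grant(nr_frames);
}
```

 Under these assumptions the first frame contributes grants 9 to 511
 (503 entries) and every later frame contributes all 512 of its
 entries.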

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Roger Pau =?iso-8859-1?Q?Monn=E9?= <roger.pau@citrix.com>
Cc: Roger Pau =?iso-8859-1?Q?Monn=E9?= <royger@NetBSD.org>,
        "gnats-bugs@NetBSD.org" <gnats-bugs@NetBSD.org>,
        "port-xen-maintainer@netbsd.org" <port-xen-maintainer@NetBSD.org>,
        "gnats-admin@netbsd.org" <gnats-admin@NetBSD.org>,
        "netbsd-bugs@netbsd.org" <netbsd-bugs@NetBSD.org>
Subject: Re: port-xen/47057: Xen NetBSD DomU file system trash under Linux
 Dom0
Date: Tue, 23 Oct 2012 21:57:48 +0200

 On Tue, Oct 23, 2012 at 04:01:06PM +0200, Roger Pau Monné wrote:
 > Found the problem: grants 0 to 8 (both included) shouldn't be used,
 > because they are reserved for the tools. I guess that's xenstore,
 > xenconsole and friends, so that's where the corruption came from, and
 > that's why the problem seemed to be related to xengnt_more_entries,
 > because it gets called when those low grants are used. The attached
 > patch solves the problem for me.

 I guess it's new behavior of the tools? Otherwise we should have hit
 this sooner: I see messages saying the kernel grows the grant entries
 pool on a regular basis.

 Anyway, good catch. One comment about the patch below.


 > >From b80f10a3c3d0b95d3cd2a60a4669a2118fdbb9ef Mon Sep 17 00:00:00 2001
 > From: Roger Pau Monne <roger.pau@citrix.com>
 > Date: Tue, 23 Oct 2012 15:21:18 +0200
 > Subject: [PATCH] xen: don't use grants 0-9
 > 
 > Not all grants from the first frame can be used, grants from 0 to 8
 > (both included) are reserved for external tools. Using this grants
 > caused system crashes and fs corruption.
 > ---
 >  sys/arch/xen/xen/xengnt.c |   15 +++++++++++----
 >  1 files changed, 11 insertions(+), 4 deletions(-)
 > 
 > diff --git a/sys/arch/xen/xen/xengnt.c b/sys/arch/xen/xen/xengnt.c
 > index 621d2dc..2de4fd3 100644
 > --- a/sys/arch/xen/xen/xengnt.c
 > +++ b/sys/arch/xen/xen/xengnt.c
 > @@ -51,6 +51,9 @@ __KERNEL_RCSID(0, "$NetBSD: xengnt.c,v 1.24 2012/06/30 23:36:20 jym Exp $");
 >  
 >  #define NR_GRANT_ENTRIES_PER_PAGE (PAGE_SIZE / sizeof(grant_entry_t))
 >  
 > +/* External tools reserve first few grant table entries. */
 > +#define NR_RESERVED_ENTRIES 8
 > +
 >  /* Current number of frames making up the grant table */
 >  int gnt_nr_grant_frames;
 >  /* Maximum number of frames that can make up the grant table */
 > @@ -161,7 +164,7 @@ xengnt_more_entries(void)
 >  	gnttab_setup_table_t setup;
 >  	u_long *pages;
 >  	int nframes_new = gnt_nr_grant_frames + 1;
 > -	int i;
 > +	int i, start_gnt;
 >  	KASSERT(mutex_owned(&grant_lock));
 >  
 >  	if (gnt_nr_grant_frames == gnt_max_grant_frames)
 > @@ -204,9 +207,13 @@ xengnt_more_entries(void)
 >  
 >  	/*
 >  	 * add the grant entries associated to the last grant table frame
 > -	 * and mark them as free
 > +	 * and mark them as free. Prevent using the first grants (from 0 to 8)
 > +	 * since they are used by the tools.
 >  	 */
 > -	for (i = gnt_nr_grant_frames * NR_GRANT_ENTRIES_PER_PAGE;
 > +	start_gnt = (gnt_nr_grant_frames * NR_GRANT_ENTRIES_PER_PAGE) <
 > +				NR_RESERVED_ENTRIES + 1 ? NR_RESERVED_ENTRIES + 1 :
 > +				(gnt_nr_grant_frames * NR_GRANT_ENTRIES_PER_PAGE);

 please rewrite with parentheses:
 +	start_gnt = (gnt_nr_grant_frames * NR_GRANT_ENTRIES_PER_PAGE) <
 +				(NR_RESERVED_ENTRIES + 1) ?
 +				(NR_RESERVED_ENTRIES + 1) :
 +				(gnt_nr_grant_frames * NR_GRANT_ENTRIES_PER_PAGE);

 then please commit and request pullups for netbsd-5 and -6.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: "Roger Pau Monne" <royger@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/47057 CVS commit: src/sys/arch/xen/xen
Date: Wed, 24 Oct 2012 13:07:47 +0000

 Module Name:	src
 Committed By:	royger
 Date:		Wed Oct 24 13:07:46 UTC 2012

 Modified Files:
 	src/sys/arch/xen/xen: xengnt.c

 Log Message:
 xen: don't use grants 0-8

 Not all grants from the first frame can be used, grants from 0 to 8
 (both included) are reserved for external tools. Using this grants
 caused system crashes and fs corruption.

 Closes PR port-xen/47057 and port-xen/47056
 Reviewed by bouyer@


 To generate a diff of this commit:
 cvs rdiff -u -r1.24 -r1.25 src/sys/arch/xen/xen/xengnt.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Stephen Borrill" <sborrill@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/47057 CVS commit: [netbsd-5] src/sys/arch/xen/xen
Date: Fri, 26 Oct 2012 11:31:50 +0000

 Module Name:	src
 Committed By:	sborrill
 Date:		Fri Oct 26 11:31:50 UTC 2012

 Modified Files:
 	src/sys/arch/xen/xen [netbsd-5]: xengnt.c

 Log Message:
 Pull up the following revision(s) (requested by royger in ticket #1805):
 	sys/arch/xen/xen/xengnt.c:	revision 1.25 via patch

 Prevents a memory corruption issue that freezes a Xen DomU and can also
 cause fs corruption. Addresses PR port-xen/47057 and port-xen/47056


 To generate a diff of this commit:
 cvs rdiff -u -r1.10.4.1 -r1.10.4.2 src/sys/arch/xen/xen/xengnt.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/47057 CVS commit: [netbsd-6] src/sys/arch/xen/xen
Date: Wed, 31 Oct 2012 16:15:09 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Wed Oct 31 16:15:09 UTC 2012

 Modified Files:
 	src/sys/arch/xen/xen [netbsd-6]: xengnt.c

 Log Message:
 Pull up following revision(s) (requested by royger in ticket #640):
 	sys/arch/xen/xen/xengnt.c: revision 1.25
 xen: don't use grants 0-8
 Not all grants from the first frame can be used, grants from 0 to 8
 (both included) are reserved for external tools. Using this grants
 caused system crashes and fs corruption.
 Closes PR port-xen/47057 and port-xen/47056
 Reviewed by bouyer@


 To generate a diff of this commit:
 cvs rdiff -u -r1.22.2.1 -r1.22.2.2 src/sys/arch/xen/xen/xengnt.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Jeff Rizzo" <riz@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/47057 CVS commit: [netbsd-6-0] src/sys/arch/xen/xen
Date: Wed, 31 Oct 2012 16:15:28 +0000

 Module Name:	src
 Committed By:	riz
 Date:		Wed Oct 31 16:15:28 UTC 2012

 Modified Files:
 	src/sys/arch/xen/xen [netbsd-6-0]: xengnt.c

 Log Message:
 Pull up following revision(s) (requested by royger in ticket #640):
 	sys/arch/xen/xen/xengnt.c: revision 1.25
 xen: don't use grants 0-8
 Not all grants from the first frame can be used, grants from 0 to 8
 (both included) are reserved for external tools. Using this grants
 caused system crashes and fs corruption.
 Closes PR port-xen/47057 and port-xen/47056
 Reviewed by bouyer@


 To generate a diff of this commit:
 cvs rdiff -u -r1.22.2.1 -r1.22.2.1.4.1 src/sys/arch/xen/xen/xengnt.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: royger@NetBSD.org
State-Changed-When: Fri, 30 Nov 2012 09:30:57 +0000
State-Changed-Why:
Fixed in src/sys/arch/xen/xen/xengnt.c version 1.25


>Unformatted:

Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.