NetBSD Problem Report #34293

From mlelstv@pepew.1st.de  Sat Aug 26 21:58:34 2006
Return-Path: <mlelstv@pepew.1st.de>
Received: from mail.netbsd.org (mail.netbsd.org [204.152.190.11])
	by narn.NetBSD.org (Postfix) with ESMTP id B793D63B8F2
	for <gnats-bugs@gnats.NetBSD.org>; Sat, 26 Aug 2006 21:58:33 +0000 (UTC)
Message-Id: <20060826215814.B14BA2CBAA@pepew.1st.de>
Date: Sat, 26 Aug 2006 23:58:14 +0200 (CEST)
From: mlelstv@serpens.de
Reply-To: mlelstv@serpens.de
To: gnats-bugs@NetBSD.org
Subject: vnd deadlocks on I/O buffers
X-Send-Pr-Version: 3.95

>Number:         34293
>Category:       kern
>Synopsis:       vnd deadlocks on I/O buffers
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Sat Aug 26 22:00:00 +0000 2006
>Closed-Date:    
>Last-Modified:  Sun Sep 10 14:55:01 +0000 2006
>Originator:     Michael van Elst
>Release:        NetBSD 4.99.1
>Organization:
-- 
                                Michael van Elst
Internet: mlelstv@serpens.de
                                "A potential Snark may lurk in every tree."
>Environment:


System: NetBSD pepew 4.99.1 NetBSD 4.99.1 (PEPEW) #32: Sat Aug 26 23:34:55 CEST 2006 src@pepew:/usr/obj/usr/src/sys/arch/i386/compile/PEPEW i386
Architecture: i386
Machine: i386
>Description:

When writing to a vnd device faster than the underlying
hardware can handle for a sufficient long enough time
then the system will exhaust all I/O buffers.  This is
expected. However, the vnd device then deadlocks and the
queued I/O will never complete.

The reason is that vnd uses nested I/O buffers. In order
to process a queued I/O request it has to allocate yet
another I/O buffer. When all I/O buffers have already
been exhausted it therfore waits forever.

>How-To-Repeat:

Create a vnd device for a large (e.g. 10GB) file on the local disk.
Write lots of data to the vnd device (e.g. mkfs.ext2).

Watch the system deadlock on I/O buffers. The writing process
and the vnd kernel thread are waiting on "getnewbuf".

>Fix:
I have changed the system to use a separate pool for nested I/O
buffers. This works, but there are probably better ways to
solve the problem.

>Release-Note:

>Audit-Trail:
From: Michael van Elst <mlelstv@serpens.de>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: kern/34293 vnd deadlocks on I/O buffers
Date: Sun, 27 Aug 2006 19:45:47 +0200

 Using a separate pool for nested buffers helps a little, but
 the same condition also prevents normal buffers from being
 allocated leading to the situation similar to the one described
 in kern/10731 and kern/20296.

 The following patch works against the root cause, which is the
 exhaustion of buffers.

 Index: vnd.c
 ===================================================================
 RCS file: /cvsroot/src/sys/dev/vnd.c,v
 retrieving revision 1.148
 diff -u -r1.148 vnd.c
 --- vnd.c	21 Jul 2006 16:48:48 -0000	1.148
 +++ vnd.c	27 Aug 2006 17:42:54 -0000
 @@ -507,6 +507,9 @@
  	if (vnddebug & VDB_FOLLOW)
  		printf("vndstrategy(%p): unit %d\n", bp, unit);
  #endif
 +	while (vnd->sc_maxactive > 0 && vnd->sc_active >= vnd->sc_maxactive) {
 +		tsleep(&vnd->sc_tab, PRIBIO, "vndac", 0);
 +	}
  	BUFQ_PUT(vnd->sc_tab, bp);
  	wakeup(&vnd->sc_tab);
  	splx(s);


 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

State-Changed-From-To: open->closed
State-Changed-By: christos@netbsd.org
State-Changed-When: Sun, 27 Aug 2006 14:45:54 -0400
State-Changed-Why:
fixed, thanks


From: Christos Zoulas <christos@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: PR/34293 CVS commit: src/sys/dev
Date: Sun, 27 Aug 2006 18:45:21 +0000 (UTC)

 Module Name:	src
 Committed By:	christos
 Date:		Sun Aug 27 18:45:20 UTC 2006

 Modified Files:
 	src/sys/dev: vnd.c

 Log Message:
 PR/34293: Michael van Elst: vnd deadlocks on I/O buffers
 Also fixes: PR/10731, PR/12189, PR/20296
 Sleep while there a buffer shortage.


 To generate a diff of this commit:
 cvs rdiff -r1.148 -r1.149 src/sys/dev/vnd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: closed->open
State-Changed-By: bouyer@netbsd.org
State-Changed-When: Sun, 03 Sep 2006 19:50:29 +0000
State-Changed-Why:
vnd.c 1.149 has been backed out.


From: Manuel Bouyer <bouyer@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: PR/34293 CVS commit: src/sys/dev
Date: Sun,  3 Sep 2006 19:49:34 +0000 (UTC)

 Module Name:	src
 Committed By:	bouyer
 Date:		Sun Sep  3 19:49:34 UTC 2006

 Modified Files:
 	src/sys/dev: vnd.c

 Log Message:
 Back out rev 1.149.
 From various discussion about vndstrategy (see
 http://mail-index.netbsd.org/tech-kern/2005/03/29/0034.html
 http://mail-index.netbsd.org/tech-kern/2005/03/23/0015.html)
 it's not correct to tsleep() in a strategy routine, which may be called from
 interrupt context.
 Unfortunably this reopens PR/10731, PR/12189, PR/20296, PR/34293

 As for what the correct fix it, this needs to be analysed deeper. I suspect
 throttling the caller in vnd only hides the problem; the same caller writing
 to some other device could exaust all buffers as well. If this driver doesn't
 need to allocate buffer this won't cause a deadlock, but it's bad for
 performances on systems with e.g. multiple drives. Also, others stacked
 block device drivers may also have this issue.


 To generate a diff of this commit:
 cvs rdiff -r1.149 -r1.150 src/sys/dev/vnd.c

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: Michael van Elst <mlelstv@serpens.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Sun, 3 Sep 2006 22:32:54 +0200

 On Sun, Sep 03, 2006 at 08:00:10PM +0000, Manuel Bouyer wrote:

 >  it's not correct to tsleep() in a strategy routine, which may be called from
 >  interrupt context.

 vndstrategy isn't called from interrupt context. But if that is
 a problem, the feedback loop can be put into both vndread()/vndwrite().


 >  Unfortunably this reopens PR/10731, PR/12189, PR/20296, PR/34293
 >  As for what the correct fix it, this needs to be analysed deeper.

 Then I suggest you do that before you break vnd.


 >  I suspect
 >  throttling the caller in vnd only hides the problem;

 No, it prevents the problem. The writer needs to be throttled to
 not overwhelm the consumer because in this case it just eats up
 all memory. That necessary feedback loop is created by the patch.


 >  the same caller writing
 >  to some other device could exaust all buffers as well.

 This is a general problem with every device but so far not seen as
 a problem. If there is memory to buffer the requests, the current
 buffering policy is to allow all buffers to be used. This is more or
 less harmless as normal device drivers will only consume and free
 buffers on the queue. 

 However, vnd is special in that it requires further buffers to be
 allocated to consume and free buffers on the queue. There is also
 no need to double buffer writes to vnd this way, once in the
 device queue and once in the underlying filesystem. That strategy
 is broken and this is also fixed by the patch that limits the
 double-buffer to whatever the device thread is prepared to consume.


 >  If this driver doesn't
 >  need to allocate buffer this won't cause a deadlock, but it's bad for
 >  performances on systems with e.g. multiple drives.

 Preventing vnd from consuming all buffers cannot be bad for
 performances on systems with multiple drives.


 >  Also, others stacked
 >  block device drivers may also have this issue.

 That is a possiblity. Find those drivers, resolve the issue and
 provide a proper solution for all cases.


 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Michael van Elst <mlelstv@serpens.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: kern/34293 (vnd deadlocks on I/O buffers)
Date: Sun, 3 Sep 2006 22:33:58 +0200

 On Sun, Sep 03, 2006 at 07:50:30PM +0000, bouyer@NetBSD.org wrote:


 Thanks for killing vnd again without thinking.


 > Synopsis: vnd deadlocks on I/O buffers
 > 
 > State-Changed-From-To: closed->open
 > State-Changed-By: bouyer@netbsd.org
 > State-Changed-When: Sun, 03 Sep 2006 19:50:29 +0000
 > State-Changed-Why:
 > vnd.c 1.149 has been backed out.
 > 
 > 
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: mlelstv@serpens.de
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Mon,  4 Sep 2006 09:26:22 +0900 (JST)

 > >  it's not correct to tsleep() in a strategy routine, which may be called from
 > >  interrupt context.
 > 
 > vndstrategy isn't called from interrupt context.

 yes, it is.
 please read the thread referred in the commit message.

 > But if that is
 > a problem, the feedback loop can be put into both vndread()/vndwrite().

 why is it enough?

 > >  I suspect
 > >  throttling the caller in vnd only hides the problem;
 > 
 > No, it prevents the problem. The writer needs to be throttled to
 > not overwhelm the consumer because in this case it just eats up
 > all memory. That necessary feedback loop is created by the patch.

 can you explain how it prevents the problem while vndthread() already has
 the same check?

 while vndthread is working on behalf of the pagedaemon here,
 it is not a pagedaemon.  so it can't get special treatment for pagedaemon
 from buffer cache etc and can deadlock.

 YAMAMOTO Takashi

From: Michael van Elst <mlelstv@serpens.de>
To: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Mon, 4 Sep 2006 07:22:32 +0200

 On Mon, Sep 04, 2006 at 09:26:22AM +0900, YAMAMOTO Takashi wrote:

 > > But if that is
 > > a problem, the feedback loop can be put into both vndread()/vndwrite().
 > 
 > why is it enough?

 It is not enough, but it catches the case where the device is
 accessed directly instead of through a filesystem.


 > > >  I suspect
 > > >  throttling the caller in vnd only hides the problem;
 > > 
 > > No, it prevents the problem. The writer needs to be throttled to
 > > not overwhelm the consumer because in this case it just eats up
 > > all memory. That necessary feedback loop is created by the patch.
 > 
 > can you explain how it prevents the problem while vndthread() already has
 > the same check?

 vndthread() does not have the same check. vndthread() is the consumer,
 you have to stop the producer that allocates all the buffers.


 > while vndthread is working on behalf of the pagedaemon here,
 > it is not a pagedaemon.  so it can't get special treatment for pagedaemon
 > from buffer cache etc and can deadlock.

 In my case it is definitely not working on behalf of the pagedaemon.


 Greetings,
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Michael van Elst <mlelstv@serpens.de>
To: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Mon, 4 Sep 2006 21:00:15 +0200

 On Mon, Sep 04, 2006 at 07:22:32AM +0200, Michael van Elst wrote:

 > It is not enough, but it catches the case where the device is
 > accessed directly instead of through a filesystem.

 I was mistaken. Access to the block device is not passed through
 the devread/devwrite routines. So it is necessary that vnd
 blocks in the strategy routine.

 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Michael van Elst <mlelstv@serpens.de>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Mon, 4 Sep 2006 22:51:02 +0200

 On Sun, Sep 03, 2006 at 10:32:54PM +0200, Michael van Elst wrote:
 > On Sun, Sep 03, 2006 at 08:00:10PM +0000, Manuel Bouyer wrote:
 > 
 > >  it's not correct to tsleep() in a strategy routine, which may be called from
 > >  interrupt context.
 > 
 > vndstrategy isn't called from interrupt context.

 It is, if called from another driver (e.g. Xen block device).

 > But if that is
 > a problem, the feedback loop can be put into both vndread()/vndwrite().

 I'm not sure we're allowed to tsleep() here either.

 > 
 > 
 > >  Unfortunably this reopens PR/10731, PR/12189, PR/20296, PR/34293
 > >  As for what the correct fix it, this needs to be analysed deeper.
 > 
 > Then I suggest you do that before you break vnd.

 This patch broke vnd for all Xen users, where it worked fine before.

 > 
 > 
 > >  I suspect
 > >  throttling the caller in vnd only hides the problem;
 > 
 > No, it prevents the problem. The writer needs to be throttled to
 > not overwhelm the consumer because in this case it just eats up
 > all memory. That necessary feedback loop is created by the patch.
 > 
 > 
 > >  the same caller writing
 > >  to some other device could exaust all buffers as well.
 > 
 > This is a general problem with every device but so far not seen as
 > a problem. If there is memory to buffer the requests, the current
 > buffering policy is to allow all buffers to be used. This is more or
 > less harmless as normal device drivers will only consume and free
 > buffers on the queue. 
 > 
 > However, vnd is special in that it requires further buffers to be
 > allocated to consume and free buffers on the queue. There is also
 > no need to double buffer writes to vnd this way, once in the
 > device queue and once in the underlying filesystem. That strategy
 > is broken and this is also fixed by the patch that limits the
 > double-buffer to whatever the device thread is prepared to consume.

 I understand all of this. But 1) the patch is broken (calling tsleep in
 a strategy routine) 2) whenever it's not a problem for other drivers is
 a matter of opinion. If has probably performances inpacts.
 > 
 > 
 > >  If this driver doesn't
 > >  need to allocate buffer this won't cause a deadlock, but it's bad for
 > >  performances on systems with e.g. multiple drives.
 > 
 > Preventing vnd from consuming all buffers cannot be bad for
 > performances on systems with multiple drives.

 No, I meant if all buffers end up in a single device queue.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Michael van Elst <mlelstv@serpens.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Mon, 4 Sep 2006 23:23:37 +0200

 On Mon, Sep 04, 2006 at 08:55:06PM +0000, Manuel Bouyer wrote:

 >  > vndstrategy isn't called from interrupt context.
 >  
 >  It is, if called from another driver (e.g. Xen block device).

 I believe that this is not allowed. The strategy routine is
 part of the top half of the driver, it operates in process
 context.

 Saying that, I scanned other drivers and found several that may
 sleep in their strategy() routine. Most promiment scsistrategy()
 does this (by virtue of calling scsipi_command which waits for
 a free scsi transfer context in scsipi_get_xs).


 >  > But if that is
 >  > a problem, the feedback loop can be put into both vndread()/vndwrite().
 >  I'm not sure we're allowed to tsleep() here either.

 vndread()/vndwrite() definitely do already tsleep() in physio() exactly
 when they are waiting for buffers. However, I was mistaken that these
 routines were passed when accessing the block device but they are
 only used by the raw device.


 >  This patch broke vnd for all Xen users, where it worked fine before.

 I agree. The patch probably leaves a lot space for improvement. But
 the real bug is calling the strategy() routine from interrupt context.


 >  I understand all of this. But 1) the patch is broken (calling tsleep in
 >  a strategy routine) 2) whenever it's not a problem for other drivers is
 >  a matter of opinion. If has probably performances inpacts.

 Calling tsleep() in strategy is allowed. You are right that sleeps
 should be avoided when possible for performance reasons.

 However, the patch improves performance significantly by avoiding
 the buffer drain.


 >  > Preventing vnd from consuming all buffers cannot be bad for
 >  > performances on systems with multiple drives.
 >  
 >  No, I meant if all buffers end up in a single device queue.

 That's what happening without the patch. The buffers queue up
 for vnd and the vndthread waits for free buffers. Deadlock.



 Greetings,
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: mlelstv@serpens.de
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Tue,  5 Sep 2006 21:53:09 +0900 (JST)

 > > while vndthread is working on behalf of the pagedaemon here,
 > > it is not a pagedaemon.  so it can't get special treatment for pagedaemon
 > > from buffer cache etc and can deadlock.
 > 
 > In my case it is definitely not working on behalf of the pagedaemon.

 did you have plenty of free memory?  it wasn't clear from the PR.

 you said "separate pool for nested I/O buffers" worked.
 it made me think you were low of memory, given that getiobuf() is not
 related to buffer cache watermarks.

 YAMAMOTO Takashi

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: mlelstv@serpens.de
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Tue,  5 Sep 2006 21:53:37 +0900 (JST)

 > >  > vndstrategy isn't called from interrupt context.
 > >  
 > >  It is, if called from another driver (e.g. Xen block device).
 > 
 > I believe that this is not allowed. The strategy routine is
 > part of the top half of the driver, it operates in process
 > context.

 have you read the referred thread?

 > Saying that, I scanned other drivers and found several that may
 > sleep in their strategy() routine. Most promiment scsistrategy()
 > does this (by virtue of calling scsipi_command which waits for
 > a free scsi transfer context in scsipi_get_xs).

 scsistrategy is not a d_strategy routine.

 > >  > But if that is
 > >  > a problem, the feedback loop can be put into both vndread()/vndwrite().
 > >  I'm not sure we're allowed to tsleep() here either.
 > 
 > vndread()/vndwrite() definitely do already tsleep() in physio() exactly
 > when they are waiting for buffers. However, I was mistaken that these
 > routines were passed when accessing the block device but they are
 > only used by the raw device.

 does your "mkfs.ext2" use a block device, rather than a raw device?

 YAMAMOTO Takashi

From: Michael van Elst <mlelstv@serpens.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Tue, 5 Sep 2006 20:17:15 +0200

 On Tue, Sep 05, 2006 at 12:55:04PM +0000, YAMAMOTO Takashi wrote:

 >  > I believe that this is not allowed. The strategy routine is
 >  > part of the top half of the driver, it operates in process
 >  > context.
 >  
 >  have you read the referred thread?

 I have read the thread. It says that the xen driver calls other
 drivers from interrupt context and which failed for vnd which
 at that time wouldn't queue infinite buffers without waiting.

 >  does your "mkfs.ext2" use a block device, rather than a raw device?

 It's not mine but it does.

 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Michael van Elst <mlelstv@serpens.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Tue, 5 Sep 2006 20:21:04 +0200

 On Tue, Sep 05, 2006 at 12:55:02PM +0000, YAMAMOTO Takashi wrote:
 > The following reply was made to PR kern/34293; it has been noted by GNATS.
 > 
 > From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
 > To: mlelstv@serpens.de
 > Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
 > 	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
 > Subject: Re: PR/34293 CVS commit: src/sys/dev
 > Date: Tue,  5 Sep 2006 21:53:09 +0900 (JST)
 > 
 >  > > while vndthread is working on behalf of the pagedaemon here,
 >  > > it is not a pagedaemon.  so it can't get special treatment for pagedaemon
 >  > > from buffer cache etc and can deadlock.
 >  > 
 >  > In my case it is definitely not working on behalf of the pagedaemon.
 >  
 >  did you have plenty of free memory?  it wasn't clear from the PR.

 I can't say. At that time the machine was hardly responsive as
 all disk I/O was stalled.

 >  you said "separate pool for nested I/O buffers" worked.

 It worked some times. But most times the processes were stuck
 waiting on getnewblk.

 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: Manuel Bouyer <bouyer@antioche.eu.org>
To: Michael van Elst <mlelstv@serpens.de>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Wed, 6 Sep 2006 23:23:02 +0200

 On Mon, Sep 04, 2006 at 11:23:37PM +0200, Michael van Elst wrote:
 > On Mon, Sep 04, 2006 at 08:55:06PM +0000, Manuel Bouyer wrote:
 > 
 > >  > vndstrategy isn't called from interrupt context.
 > >  
 > >  It is, if called from another driver (e.g. Xen block device).
 > 
 > I believe that this is not allowed. The strategy routine is
 > part of the top half of the driver, it operates in process
 > context.

 There was a discussion on tech-kern when I introduced the Xen
 backend driver, and the conclusion was that it's valid to call
 a strategy routine from interrupt context, and drivers should deal
 with that.

 > 
 > Saying that, I scanned other drivers and found several that may
 > sleep in their strategy() routine. Most promiment scsistrategy()
 > does this (by virtue of calling scsipi_command which waits for
 > a free scsi transfer context in scsipi_get_xs).

 This is a pseudo-strategy, part of the SCSI-specific ioctls implementation.
 It's internal details, it's not exported to other drivers.

 > >  This patch broke vnd for all Xen users, where it worked fine before.
 > 
 > I agree. The patch probably leaves a lot space for improvement. But
 > the real bug is calling the strategy() routine from interrupt context.

 You're redefining the strategy() interface here.

 > >  I understand all of this. But 1) the patch is broken (calling tsleep in
 > >  a strategy routine) 2) whenever it's not a problem for other drivers is
 > >  a matter of opinion. If has probably performances inpacts.
 > 
 > Calling tsleep() in strategy is allowed.

 Where did you get this ?

 > >  > performances on systems with multiple drives.
 > >  
 > >  No, I meant if all buffers end up in a single device queue.
 > 
 > That's what happening without the patch.

 And that's what happens in other driver's queue too.

 -- 
 Manuel Bouyer <bouyer@antioche.eu.org>
      NetBSD: 26 ans d'experience feront toujours la difference
 --

From: Michael van Elst <mlelstv@serpens.de>
To: Manuel Bouyer <bouyer@antioche.eu.org>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Thu, 7 Sep 2006 08:27:45 +0200

 On Wed, Sep 06, 2006 at 11:23:02PM +0200, Manuel Bouyer wrote:

 Hi Manuel,

 > > I believe that this is not allowed. The strategy routine is
 > > part of the top half of the driver, it operates in process
 > > context.
 > 
 > There was a discussion on tech-kern when I introduced the Xen
 > backend driver, and the conclusion was that it's valid to call
 > a strategy routine from interrupt context, and drivers should deal
 > with that.

 I now read that discussion (and one from 1996 when ccd had problems
 and one more when cgd had problems).

 So, in 4.4BSD the strategy functions operated in process context,
 and a few of them even used this to handle ressource shortages
 on the lower parts of the driver. The original vn driver does
 call VOP_STRATEGY itself, so did our vnd driver until you added
 the thread.

 Since strategy routines are not supposed to sleep for performance
 reasons and most therefore didn't, it was concluded that strategy
 routines have to be interrupt safe.

 I believe this change in the interface model was a bad decision,
 and it was done to make life for ccd, cgd and later xbd simpler.


 > > Saying that, I scanned other drivers and found several that may
 > > sleep in their strategy() routine. Most promiment scsistrategy()
 > > does this (by virtue of calling scsipi_command which waits for
 > > a free scsi transfer context in scsipi_get_xs).
 > 
 > This is a pseudo-strategy, part of the SCSI-specific ioctls implementation.
 > It's internal details, it's not exported to other drivers.

 I noticed. There are some left that do spin-locking (possible to do 
 in interrupts) and an ancient broken VAX driver that still shows
 that the BSD design let strategy routines operate in process context
 and sleep.


 > > >  This patch broke vnd for all Xen users, where it worked fine before.
 > > 
 > > I agree. The patch probably leaves a lot space for improvement. But
 > > the real bug is calling the strategy() routine from interrupt context.
 > 
 > You're redefining the strategy() interface here.

 No, some people did that 2 years ago (I believe Jason was the first caller,
 but others agreed). This change was done to accomodate cgd.


 > > >  I understand all of this. But 1) the patch is broken (calling tsleep in
 > > >  a strategy routine) 2) whenever it's not a problem for other drivers is
 > > >  a matter of opinion. If has probably performances inpacts.
 > > 
 > > Calling tsleep() in strategy is allowed.
 > 
 > Where did you get this ?

 That is 4.4BSD.


 > > >  > performances on systems with multiple drives.
 > > >  
 > > >  No, I meant if all buffers end up in a single device queue.
 > > 
 > > That's what happening without the patch.
 > 
 > And that's what happens in other driver's queue too.

 True. But with other drivers this does not end in a deadlock as
 they don't do the circle to the VOP layer. It still is a bad
 thing, maybe that SoC project to add some 'congestion control'
 to the filesystems develops a solution to that. The vnd driver
 itself however should protect itself against this deadlock
 that it generates itself. N.B. there is a second deadlock
 for io buffers, that only gets exhibited when memory runs
 out, but when the buffer deadlock is solved, this will show
 up.

 I am not sure about ccd, it calls VOP_STRATEGY() which then
 calls DEV_STRATEGY() but has some code path that may sleep.


 Greetings,
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: mlelstv@serpens.de
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Thu,  7 Sep 2006 21:51:01 +0900 (JST)

 > >  does your "mkfs.ext2" use a block device, rather than a raw device?
 > 
 > It's not mine but it does.

 ok, then i think i understand the problem.
 we should throttle activities creating dirty buffers.
 maybe by having a flag for getblk and friends to tell
 "we are not going to dirty the buffer".

 what your patch does is the opposite.  ie. throttling attempts of
 cleaning buffers.  i believe it makes the situation worse.

 YAMAMOTO Takashi

From: Michael van Elst <mlelstv@serpens.de>
To: gnats-bugs@NetBSD.org
Cc: 
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Thu, 7 Sep 2006 19:36:30 +0200

 On Thu, Sep 07, 2006 at 12:55:03PM +0000, YAMAMOTO Takashi wrote:
 > The following reply was made to PR kern/34293; it has been noted by GNATS.
 > 
 > From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
 > To: mlelstv@serpens.de
 > Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
 > 	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
 > Subject: Re: PR/34293 CVS commit: src/sys/dev
 > Date: Thu,  7 Sep 2006 21:51:01 +0900 (JST)
 > 
 >  > >  does your "mkfs.ext2" use a block device, rather than a raw device?
 >  > 
 >  > It's not mine but it does.
 >  
 >  ok, then i think i understand the problem.
 >  we should throttle activities creating dirty buffers.
 >  maybe by having a flag for getblk and friends to tell
 >  "we are not going to dirty the buffer".

 If I remember previous discussions there was some opposition in
 restricting writes this way. Please check the discussions about
 untarring source trees and waiting for X11 to respond again.

 Throttling getblk does not prevent the deadlock. For that you need
 to distinguish between the upper filesystem or block device, that
 needs to be throttled and the lower filesystem (or rather lowest
 if you start nesting deeper), that must be kept running to avoid the
 deadlock.

 There is also the problem about feedback. How would you know when
 to throttle (or rather when to stop)?. You could have a mechanism
 that looks at all the device queues, if buffers pile up there
 because the device is too slow, you stop generating more.

 But throttling getblk isn't good enough, you only want to throttle
 getting buffers that end in the particular slow device queue. And
 getblk doesn't have that information.

 The SoC-Project that adds congestion control to filesystems might
 eventually address this problem correctly.


 >  what your patch does is the opposite.  ie. throttling attempts of
 >  cleaning buffers.  i believe it makes the situation worse.

 The patch does not throttle attempts of cleaning buffers. The
 bottleneck for cleaning buffers is still the underlying filesystem
 where the virtual device file resides.

 With the patch when vnd throttles the writer it does two things:

 it prevents it from generating more dirty buffers and

 it delays the requeuing of dirty buffers _when the underlying
    filesystem is too slow to process them_.

 As a result it may not utilize the buffer cache effectively.
 You could provide a knob to control the queue length, currently
 that's hardcoded 8 for local filesystems and 2 for NFS.

 It also affects vnd operation only, unlike anything that interfers
 with the more global getblk.


 Greetings,
 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
	netbsd-bugs@netbsd.org, mlelstv@serpens.de
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Fri,  8 Sep 2006 09:06:13 +0900 (JST)

 >  >  ok, then i think i understand the problem.
 >  >  we should throttle activities creating dirty buffers.
 >  >  maybe by having a flag for getblk and friends to tell
 >  >  "we are not going to dirty the buffer".
 >  
 >  If I remember previous discussions there was some opposition in
 >  restricting writes this way. Please check the discussions about
 >  untarring source trees and waiting for X11 to respond again.

 can you provide a pointer?

 >  Throttling getblk does not prevent the deadlock. For that you need
 >  to distinguish between the upper filesystem or block device, that
 >  needs to be throttled and the lower filesystem (or rather lowest
 >  if you start nesting deeper), that must be kept running to avoid the
 >  deadlock.
 >  
 >  There is also the problem about feedback. How would you know when
 >  to throttle (or rather when to stop)?. You could have a mechanism
 >  that looks at all the device queues, if buffers pile up there
 >  because the device is too slow, you stop generating more.

 the point is to avoid filling up buffer cache with dirty data.
 ensuring that the amount of dirty data is less than, say,
 1/2 of total buffer cache, should be enough.
 it prevents deadlock because clean buffers can be reclaimed without i/o.
 well, i don't claim it's a perfect solution.  but i believe it's
 far better than your patch, which doesn't seem to make sense at all.

 >  But throttling getblk isn't good enough, you only want to throttle
 >  getting buffers that end in the particular slow device queue. And
 >  getblk doesn't have that information.
 > 
 >  The SoC-Project that adds congestion control to filesystems might
 >  eventually address this problem correctly.

 the device speed doesn't matter much here.
 once you fill buffer cache with dirty data, you lose.

 (i don't know what the soc-project is about.)

 >  >  what your patch does is the opposite.  ie. throttling attempts of
 >  >  cleaning buffers.  i believe it makes the situation worse.
 >  
 >  The patch does not throttle attempts of cleaning buffers. The
 >  bottleneck for cleaning buffers is still the underlying filesystem
 >  where the virtual device file resides.
 >  
 >  With the patch when vnd throttles the writer it does two things:
 > 
 >  it prevents it from generating more dirty buffers and
 >  
 >  it delays the requeuing of dirty buffers _when the underlying
 >     filesystem is too slow to process them_.

 "the writer" here includes attempts of cleaning buffers.
 it just introduces another deadlock.

 YAMAMOTO Takashi

From: Michael van Elst <mlelstv@serpens.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Fri, 8 Sep 2006 08:02:38 +0200

 On Fri, Sep 08, 2006 at 12:10:07AM +0000, YAMAMOTO Takashi wrote:
 > The following reply was made to PR kern/34293; it has been noted by GNATS.
 > 
 > From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
 > To: gnats-bugs@NetBSD.org
 > Cc: kern-bug-people@netbsd.org, gnats-admin@netbsd.org,
 > 	netbsd-bugs@netbsd.org, mlelstv@serpens.de
 > Subject: Re: PR/34293 CVS commit: src/sys/dev
 > Date: Fri,  8 Sep 2006 09:06:13 +0900 (JST)
 > 
 >  >  If I remember previous discussions there was some opposition in
 >  >  restricting writes this way. Please check the discussions about
 >  >  untarring source trees and waiting for X11 to respond again.
 >  
 >  can you provide a pointer?

 There is http://mail-index.netbsd.org/current-users/2004/08/30/0033.html,
 but there are earlier threads I have to look up.


 >  >  Throttling getblk does not prevent the deadlock. For that you need
 >  >  to distinguish between the upper filesystem or block device, that
 >  >  needs to be throttled and the lower filesystem (or rather lowest
 >  >  if you start nesting deeper), that must be kept running to avoid the
 >  >  deadlock.
 >  >  
 >  >  There is also the problem about feedback. How would you know when
 >  >  to throttle (or rather when to stop)?. You could have a mechanism
 >  >  that looks at all the device queues, if buffers pile up there
 >  >  because the device is too slow, you stop generating more.
 >  
 >  the point is to avoid filling up buffer cache with dirty data.
 >  ensuring that the amount of dirty data is less than, say,
 >  1/2 of total buffer cache, should be enough.
 >  it prevents deadlock because clean buffers can be reclaimed without i/o.
 >  well, i don't claim it's a perfect solution.  but i believe it's
 >  far better than your patch, which doesn't seem to make sense at all.

 If you limit dirty buffers to 1/2 of total buffer cache you will
 still deadlock. It doesn't matter what limit you put on dirty
 buffers. Whenever that limit it reached then vnd cannot process
 them because for doing so it would have to exceed that limit. You
 must make sure that the filesystem that vnd is calling through
 VOP_STRATEGY _can_ allocate new buffers, even those that it
 can use for writing.

 If my patch doesn't make sense, then try to understand why it works.


 >  >  But throttling getblk isn't good enough, you only want to throttle
 >  >  getting buffers that end in the particular slow device queue. And
 >  >  getblk doesn't have that information.
 >  
 >  the device speed doesn't matter much here.
 >  once you fill buffer cache with dirty data, you lose.

 No, if the blocks are allocated for another filesystem or another device,
 it is perfectly possible for these to be processed. So you only want
 to avoid blocks to be dirtied that end on the device that cannot
 process them or at least cannot process them fast enough.


 >  "the writer" here includes attempts of cleaning buffers.
 >  it just introduces another deadlock.

 The writer does never clean blocks. The writer can only create
 more dirty blocks. The blocks are "clean" again when they
 are processed by the device, i.e. in some iodone routine
 which is called in an interrupt. How should a sleeping process
 prevent that? In particular, at that time the writer is stopped
 in vndstrategy, we definitely know that the vndthread _is_ busy
 cleaning buffers, because that is the stop condition. We therefore
 make _sure_ that it can do so.


 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: mlelstv@serpens.de
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Sun, 10 Sep 2006 15:40:06 +0900 (JST)

 --NextPart-20060910152350-0213200
 Content-Type: Text/Plain; charset=us-ascii

 > >  >  If I remember previous discussions there was some opposition in
 > >  >  restricting writes this way. Please check the discussions about
 > >  >  untarring source trees and waiting for X11 to respond again.
 > >  
 > >  can you provide a pointer?
 > 
 > There is http://mail-index.netbsd.org/current-users/2004/08/30/0033.html,
 > but there are earlier threads I have to look up.

 thanks.

 > If you limit dirty buffers to 1/2 of total buffer cache you will
 > still deadlock. It doesn't matter what limit you put on dirty
 > buffers. Whenever that limit it reached then vnd cannot process
 > them because for doing so it would have to exceed that limit. You
 > must make sure that the filesystem that vnd is calling through
 > VOP_STRATEGY _can_ allocate new buffers, even those that it
 > can use for writing.

 which filesystem are you talking about?

 > If my patch doesn't make sense, then try to understand why it works.

 now i've tried your patch.  it didn't work for me.
 vndthread just got deadlock on "getnewbuf" without your
 throttling code triggered at all.

 	(/bigfile is 512M file w/o holes on ufs.)
 	vnconfig vnd0 /bigfile
 	newfs -F /dev/rvnd0d
 	mount /dev/vnd0d /mnt
 	dd if=/dev/zero of=/mnt/a bs=1m seek=1m  (ENOSPC)
 	vnconfig vnd1 /mnt/a
 	dd if=/dev/zero of=/dev/vnd1d bs=1m seek=1m  (deadlock)

 (i can't follow your recipe in the PR because i don't have 10G disk space.)

 > No, if the blocks are allocated for another filesystem or another device,
 > it is perfectly possible for these to be processed. So you only want
 > to avoid blocks to be dirtied that end on the device that cannot
 > process them or at least cannot process them fast enough.

 yes, it's what i meant.  sorry if it wasn't clear.

 > >  "the writer" here includes attempts of cleaning buffers.
 > >  it just introduces another deadlock.
 > 
 > The writer does never clean blocks. The writer can only create
 > more dirty blocks. The blocks are "clean" again when they
 > are processed by the device, i.e. in some iodone routine
 > which is called in an interrupt. How should a sleeping process
 > prevent that? In particular, at that time the writer is stopped
 > in vndstrategy, we definitely know that the vndthread _is_ busy
 > cleaning buffers, because that is the stop condition. We therefore
 > make _sure_ that it can do so.

 i think that our definitions of "writer" are different.
 mine is "callers of vndstrategy."
 yours is "callers of write(2)".  right?

 i've got tired of this discussion and made a proof-of-concept
 patch. (attached)  i hope it's a clearer explanation. :)

 YAMAMOTO Takashi

 --NextPart-20060910152350-0213200
 Content-Type: Text/Plain; charset=us-ascii
 Content-Disposition: attachment; filename="a.diff"

 Index: sys/buf.h
 ===================================================================
 --- sys/buf.h	(revision 1786)
 +++ sys/buf.h	(working copy)
 @@ -213,11 +213,13 @@ do {									\
  #define	B_WRITE		0x00000000	/* Write buffer (pseudo flag). */
  #define	B_XXX		0x02000000	/* Debugging flag. */
  #define	B_VFLUSH	0x04000000	/* Buffer is being synced. */
 +#define	B_DIRTYACC	0x08000000	/* */

  #define BUF_FLAGBITS \
      "\20\1AGE\3ASYNC\4BAD\5BUSY\6SCANNED\7CALL\10DELWRI" \
      "\11DIRTY\12DONE\14ERROR\15GATHERED\16INVAL\17LOCKED\20NOCACHE" \
 -    "\22CACHE\23PHYS\24RAW\25READ\26TAPE\30WANTED\32XXX\33VFLUSH"
 +    "\22CACHE\23PHYS\24RAW\25READ\26TAPE\30WANTED\32XXX\33VFLUSH" \
 +    "\34DIRTYACC"


  /*
 @@ -283,6 +285,11 @@ struct buf *getblk(struct vnode *, daddr
  struct buf *geteblk(int);
  struct buf *incore(struct vnode *, daddr_t);

 +struct buf *getblkx(struct vnode *, daddr_t, int, int, int, int);
 +
 +#define	BUFF_READ	1
 +#define	BUFF_NOALLOC	2
 +
  void	minphys(struct buf *);
  int	physio(void (*)(struct buf *), struct buf *, dev_t, int,
  	       void (*)(struct buf *), struct uio *);
 Index: kern/vfs_bio.c
 ===================================================================
 --- kern/vfs_bio.c	(revision 1786)
 +++ kern/vfs_bio.c	(working copy)
 @@ -116,6 +116,7 @@ __KERNEL_RCSID(0, "$NetBSD: vfs_bio.c,v 
  u_int	nbuf;			/* XXX - for softdep_lockedbufs */
  u_int	bufpages = BUFPAGES;	/* optional hardwired count */
  u_int	bufcache = BUFCACHE;	/* max % of RAM to use for buffer cache */
 +static size_t buf_dirtybytes;

  /* Function prototypes */
  struct bqueue;
 @@ -127,7 +128,7 @@ static void bufpool_page_free(struct poo
  static inline struct buf *bio_doread(struct vnode *, daddr_t, int,
      kauth_cred_t, int);
  static struct buf *getnewbuf(int, int, int);
 -static int buf_lotsfree(void);
 +static int buf_lotsfree(int);
  static int buf_canrelease(void);
  static inline u_long buf_mempoolidx(u_long);
  static inline u_long buf_roundsize(u_long);
 @@ -139,6 +140,7 @@ int count_lock_queue(void); /* XXX */
  #ifdef DEBUG
  static int checkfreelist(struct buf *, struct bqueue *);
  #endif
 +static void allocbufx(struct buf *, int, int, int);

  /*
   * Definitions for the buffer hash lists.
 @@ -361,6 +363,35 @@ buf_memcalc(void)
  	return (n);
  }

 +static void
 +bufcache_dirtyacc_relse(struct buf *bp)
 +{
 +
 +	if ((bp->b_flags & (B_DIRTYACC | B_DELWRI)) == B_DIRTYACC) {
 +		KASSERT(buf_dirtybytes >= bp->b_bcount);
 +		bp->b_flags &= ~B_DIRTYACC;
 +		buf_dirtybytes -= bp->b_bcount;
 +	}
 +}
 +
 +static void
 +bufcache_dirtyacc_write(struct buf *bp)
 +{
 +
 +	if ((bp->b_flags & B_DIRTYACC) == 0) {
 +		bp->b_flags |= B_DIRTYACC;
 +		buf_dirtybytes += bp->b_bcount; /* XXX */
 +		KASSERT(buf_dirtybytes <= bufmem); /* XXX */
 +	}
 +}
 +
 +static boolean_t
 +bufcache_dirtyacc_needthrottle(int flags)
 +{
 +
 +	return (flags & BUFF_READ) == 0 && buf_dirtybytes > bufmem / 4 * 3;
 +}
 +
  /*
   * Initialize buffers and hash links for buffers.
   */
 @@ -430,7 +461,7 @@ bufinit(void)
  }

  static int
 -buf_lotsfree(void)
 +buf_lotsfree(int flags)
  {
  	int try, thresh;
  	struct lwp *l = curlwp;
 @@ -439,6 +470,9 @@ buf_lotsfree(void)
  	if (l->l_flag & L_COWINPROGRESS)
  		return 1;

 +	if (bufcache_dirtyacc_needthrottle(flags))
 +		return 0;
 +
  	/* Always allocate if less than the low water mark. */
  	if (bufmem < bufmem_lowater)
  		return 1;
 @@ -729,6 +763,7 @@ bwrite(struct buf *bp)

  	s = splbio();
  	simple_lock(&bp->b_interlock);
 +	bufcache_dirtyacc_write(bp);

  	wasdelayed = ISSET(bp->b_flags, B_DELWRI);

 @@ -821,6 +856,7 @@ bdwrite(struct buf *bp)
  		reassignbuf(bp, bp->b_vp);
  	}

 +	bufcache_dirtyacc_write(bp);
  	/* Otherwise, the "write" is done, so mark and release the buffer. */
  	CLR(bp->b_flags, B_DONE);
  	simple_unlock(&bp->b_interlock);
 @@ -842,6 +878,7 @@ bawrite(struct buf *bp)

  	KASSERT(ISSET(bp->b_flags, B_BUSY));

 +	bufcache_dirtyacc_write(bp);
  	SET(bp->b_flags, B_ASYNC);
  	simple_unlock(&bp->b_interlock);
  	splx(s);
 @@ -987,6 +1024,7 @@ already_queued:
  	CLR(bp->b_flags, B_AGE|B_ASYNC|B_BUSY|B_NOCACHE);
  	SET(bp->b_flags, B_CACHE);

 +	bufcache_dirtyacc_relse(bp);
  	/* Allow disk interrupts. */
  	simple_unlock(&bp->b_interlock);
  	simple_unlock(&bqueue_slock);
 @@ -1032,6 +1070,14 @@ incore(struct vnode *vp, daddr_t blkno)
  struct buf *
  getblk(struct vnode *vp, daddr_t blkno, int size, int slpflag, int slptimeo)
  {
 +
 +	return getblkx(vp, blkno, size, slpflag, slptimeo, 0);
 +}
 +
 +struct buf *
 +getblkx(struct vnode *vp, daddr_t blkno, int size, int slpflag, int slptimeo,
 +    int flags)
 +{
  	struct buf *bp;
  	int s, err;
  	int preserve;
 @@ -1067,7 +1113,7 @@ start:
  		bremfree(bp);
  		preserve = 1;
  	} else {
 -		if ((bp = getnewbuf(slpflag, slptimeo, 0)) == NULL) {
 +		if ((bp = getnewbuf(slpflag, slptimeo, flags)) == NULL) {
  			simple_unlock(&bqueue_slock);
  			splx(s);
  			goto start;
 @@ -1126,12 +1172,23 @@ geteblk(int size)
   * start a write.  If the buffer grows, it's the callers
   * responsibility to fill out the buffer's additional contents.
   */
 +
  void
  allocbuf(struct buf *bp, int size, int preserve)
  {
 +
 +	allocbufx(bp, size, preserve, 0);
 +}
 +
 +static void
 +allocbufx(struct buf *bp, int size, int preserve, int flags)
 +{
  	vsize_t oldsize, desired_size;
  	caddr_t addr;
  	int s, delta;
 +	int dirtydelta = 0;
 +
 +	KASSERT((bp->b_flags & B_BUSY) != 0);

  	desired_size = buf_roundsize(size);
  	if (desired_size > MAXBSIZE)
 @@ -1159,10 +1216,19 @@ allocbuf(struct buf *bp, int size, int p
  	 * Update overall buffer memory counter (protected by bqueue_slock)
  	 */
  	delta = (long)desired_size - (long)oldsize;
 -
 +	if ((bp->b_flags & B_DIRTYACC) != 0) {
 +		dirtydelta -= oldsize;
 +		bp->b_flags &= ~B_DIRTYACC;
 +	}
 +	if ((flags & BUFF_READ) == 0) {
 +		dirtydelta += desired_size;
 +		bp->b_flags |= B_DIRTYACC;
 +	}
  	s = splbio();
  	simple_lock(&bqueue_slock);
 -	if ((bufmem += delta) > bufmem_hiwater) {
 +	bufmem += delta;
 +	buf_dirtybytes += dirtydelta;
 +	if (bufmem > bufmem_hiwater) {
  		/*
  		 * Need to trim overall memory usage.
  		 */
 @@ -1185,6 +1251,34 @@ allocbuf(struct buf *bp, int size, int p
  	splx(s);
  }

 +static struct buf *
 +bq_findvictim(struct bqueue *bq, int flags)
 +{
 +	struct buf *bp;
 +	const boolean_t throttledirty = bufcache_dirtyacc_needthrottle(flags);
 +
 +	TAILQ_FOREACH(bp, &bq->bq_queue, b_freelist) {
 +		simple_lock(&bp->b_interlock);
 +		if (!throttledirty || (bp->b_flags & B_DELWRI) != 0) {
 +			break;
 +		}
 +		simple_unlock(&bp->b_interlock);
 +	}
 +	return bp;
 +}
 +
 +static struct buf *
 +bufcache_findvictim(int flags)
 +{
 +	struct buf *bp;
 +
 +	bp = bq_findvictim(&bufqueues[BQ_AGE], flags);
 +	if (bp == NULL) {
 +		bp = bq_findvictim(&bufqueues[BQ_LRU], flags);
 +	}
 +	return bp;
 +}
 +
  /*
   * Find a buffer which is available for use.
   * Select something from a free list.
 @@ -1194,7 +1288,7 @@ allocbuf(struct buf *bp, int size, int p
   * Return buffer locked.
   */
  struct buf *
 -getnewbuf(int slpflag, int slptimeo, int from_bufq)
 +getnewbuf(int slpflag, int slptimeo, int flags)
  {
  	struct buf *bp;

 @@ -1205,7 +1299,7 @@ start:
  	 * Get a new buffer from the pool; but use NOWAIT because
  	 * we have the buffer queues locked.
  	 */
 -	if (!from_bufq && buf_lotsfree() &&
 +	if ((flags & BUFF_NOALLOC) == 0 && buf_lotsfree(flags) &&
  	    (bp = pool_get(&bufpool, PR_NOWAIT)) != NULL) {
  		memset((char *)bp, 0, sizeof(*bp));
  		BUF_INIT(bp);
 @@ -1219,15 +1313,11 @@ start:
  		return (bp);
  	}

 -	if ((bp = TAILQ_FIRST(&bufqueues[BQ_AGE].bq_queue)) != NULL ||
 -	    (bp = TAILQ_FIRST(&bufqueues[BQ_LRU].bq_queue)) != NULL) {
 -		simple_lock(&bp->b_interlock);
 +	bp = bufcache_findvictim(flags);
 +	if (bp != NULL) {
  		bremfree(bp);
  	} else {
 -		/*
 -		 * XXX: !from_bufq should be removed.
 -		 */
 -		if (!from_bufq || curproc != uvm.pagedaemon_proc) {
 +		if (curproc != uvm.pagedaemon_proc) {
  			/* wait for a free buffer of any kind */
  			needbuffer = 1;
  			ltsleep(&needbuffer, slpflag|(PRIBIO + 1),
 @@ -1305,10 +1395,11 @@ buf_trim(void)
  	long size = 0;

  	/* Instruct getnewbuf() to get buffers off the queues */
 -	if ((bp = getnewbuf(PCATCH, 1, 1)) == NULL)
 +	if ((bp = getnewbuf(PCATCH, 1, BUFF_NOALLOC | BUFF_READ)) == NULL)
  		return 0;

  	KASSERT(!ISSET(bp->b_flags, B_WANTED));
 +	KASSERT(!ISSET(bp->b_flags, B_DIRTYACC));
  	simple_unlock(&bp->b_interlock);
  	size = bp->b_bufsize;
  	bufmem -= size;
 Index: ufs/ufs/ufs_bmap.c
 ===================================================================
 --- ufs/ufs/ufs_bmap.c	(revision 1595)
 +++ ufs/ufs/ufs_bmap.c	(working copy)
 @@ -223,7 +223,8 @@ ufs_bmaparray(struct vnode *vp, daddr_t 
  			brelse(bp);

  		xap->in_exists = 1;
 -		bp = getblk(vp, metalbn, mp->mnt_stat.f_iosize, 0, 0);
 +		bp = getblkx(vp, metalbn, mp->mnt_stat.f_iosize, 0, 0,
 +		    BUFF_READ);
  		if (bp == NULL) {

  			/*

 --NextPart-20060910152350-0213200--

From: Michael van Elst <mlelstv@serpens.de>
To: gnats-bugs@NetBSD.org
Cc: kern-bug-people@NetBSD.org, gnats-admin@NetBSD.org,
	netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Sun, 10 Sep 2006 09:18:41 +0200

 On Sun, Sep 10, 2006 at 06:45:03AM +0000, YAMAMOTO Takashi wrote:

 >  
 >  > If you limit dirty buffers to 1/2 of total buffer cache you will
 >  > still deadlock. It doesn't matter what limit you put on dirty
 >  > buffers. Whenever that limit it reached then vnd cannot process
 >  > them because for doing so it would have to exceed that limit. You
 >  > must make sure that the filesystem that vnd is calling through
 >  > VOP_STRATEGY _can_ allocate new buffers, even those that it
 >  > can use for writing.
 >  
 >  which filesystem are you talking about?

 The filesystem where the virtual device resides on, in that
 my UFS.


 >  > If my patch doesn't make sense, then try to understand why it works.
 >  
 >  now i've tried your patch.  it didn't work for me.

 That's probably because you have a completely different test that
 shows a different problem.

 >  vndthread just got deadlock on "getnewbuf" without your
 >  throttling code triggered at all.
 >  
 >  	(/bigfile is 512M file w/o holes on ufs.)
 >  	vnconfig vnd0 /bigfile
 >  	newfs -F /dev/rvnd0d
 >  	mount /dev/vnd0d /mnt
 >  	dd if=/dev/zero of=/mnt/a bs=1m seek=1m  (ENOSPC)

 That creates a sparse file starting at offset 1TB.

 >  	vnconfig vnd1 /mnt/a
 >  	dd if=/dev/zero of=/dev/vnd1d bs=1m seek=1m  (deadlock)

 That would write the sparse file starting from offset 1TB.

 >  (i can't follow your recipe in the PR because i don't have 10G disk space.)

 You can probably use a smaller file as long as enough data gets
 written. mkfs.ext2 just writes the ext2-metadata, so a simple dd
 would require a much smaller file.


 >  > The writer does never clean blocks. The writer can only create
 >  > more dirty blocks. The blocks are "clean" again when they
 >  > are processed by the device, i.e. in some iodone routine
 >  > which is called in an interrupt. How should a sleeping process
 >  > prevent that? In particular, at that time the writer is stopped
 >  > in vndstrategy, we definitely know that the vndthread _is_ busy
 >  > cleaning buffers, because that is the stop condition. We therefore
 >  > make _sure_ that it can do so.
 >  
 >  i think that our definitions of "writer" are different.
 >  mine is "callers of vndstrategy."
 >  yours is "callers of write(2)".  right?

 Mine is "callers of vndstrategy", for writing the block device
 that's the same as "callers of write(2)".

 I have a look at your patch.

 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

From: yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi)
To: mlelstv@serpens.de
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Sun, 10 Sep 2006 17:07:19 +0900 (JST)

 > >  > If you limit dirty buffers to 1/2 of total buffer cache you will
 > >  > still deadlock. It doesn't matter what limit you put on dirty
 > >  > buffers. Whenever that limit it reached then vnd cannot process
 > >  > them because for doing so it would have to exceed that limit. You
 > >  > must make sure that the filesystem that vnd is calling through
 > >  > VOP_STRATEGY _can_ allocate new buffers, even those that it
 > >  > can use for writing.
 > >  
 > >  which filesystem are you talking about?
 > 
 > The filesystem where the virtual device resides on, in that
 > my UFS.

 how can vnd on UFS allocate new dirty buffer cache
 in order to make a progress?

 > >  > If my patch doesn't make sense, then try to understand why it works.
 > >  
 > >  now i've tried your patch.  it didn't work for me.
 > 
 > That's probably because you have a completely different test that
 > shows a different problem.

 IMO, it's fundamentally the same problem.

 > >  	dd if=/dev/zero of=/mnt/a bs=1m seek=1m  (ENOSPC)
 > 
 > That creates a sparse file starting at offset 1TB.
 > 
 > >  	vnconfig vnd1 /mnt/a
 > >  	dd if=/dev/zero of=/dev/vnd1d bs=1m seek=1m  (deadlock)
 > 
 > That would write the sparse file starting from offset 1TB.

 yes, they were intentional.
 (to make sure it uses indirect blocks, w/o requiring much disk space.)

 > >  (i can't follow your recipe in the PR because i don't have 10G disk space.)
 > 
 > You can probably use a smaller file as long as enough data gets
 > written. mkfs.ext2 just writes the ext2-metadata, so a simple dd
 > would require a much smaller file.

 i don't want to waste my time to find a case that the patch can work.
 there might be a case, but it doesn't make the patch correct.

 > Mine is "callers of vndstrategy", for writing the block device
 > that's the same as "callers of write(2)".

 no, they are fundamentally different.
 the former cleans (or, at least, start to clean) buffers and
 the latter dirties buffers.
 yes, there are exceptions.  eg. write(2) can start async writeback.
 but they don't alter the big picture.

 > I have a look at your patch.

 thanks.

 YAMAMOTO Takashi

From: Michael van Elst <mlelstv@serpens.de>
To: YAMAMOTO Takashi <yamt@mwd.biglobe.ne.jp>
Cc: gnats-bugs@NetBSD.org, kern-bug-people@NetBSD.org,
	gnats-admin@NetBSD.org, netbsd-bugs@NetBSD.org
Subject: Re: PR/34293 CVS commit: src/sys/dev
Date: Sun, 10 Sep 2006 16:50:21 +0200

 On Sun, Sep 10, 2006 at 05:07:19PM +0900, YAMAMOTO Takashi wrote:

 > > >  > If you limit dirty buffers to 1/2 of total buffer cache you will
 > > >  > still deadlock. It doesn't matter what limit you put on dirty
 > > >  > buffers. Whenever that limit it reached then vnd cannot process
 > > >  > them because for doing so it would have to exceed that limit. You
 > > >  > must make sure that the filesystem that vnd is calling through
 > > >  > VOP_STRATEGY _can_ allocate new buffers, even those that it
 > > >  > can use for writing.
 > > >  
 > > >  which filesystem are you talking about?
 > > 
 > > The filesystem where the virtual device resides on, in that
 > > my UFS.
 > 
 > how can vnd on UFS allocate new dirty buffer cache
 > in order to make a progress?

 VOP_STRATEGY could do this for updating metadata. I don't know
 wether UFS actually does this. From looking at iostat output
 while running your testcase it only does extra reads, so if
 it generates dirty buffers these don't end (quickly) on the device
 queue.

 > > >  	dd if=/dev/zero of=/mnt/a bs=1m seek=1m  (ENOSPC)
 > > >  	vnconfig vnd1 /mnt/a
 > > >  	dd if=/dev/zero of=/dev/vnd1d bs=1m seek=1m  (deadlock)

 > yes, they were intentional.
 > (to make sure it uses indirect blocks, w/o requiring much disk space.)

 I cannot reproduce that problem here. dd doesn't deadlock.


 -- 
                                 Michael van Elst
 Internet: mlelstv@serpens.de
                                 "A potential Snark may lurk in every tree."

>Unformatted:

NetBSD Home
NetBSD PR Database Search

(Contact us) $NetBSD: query-full-pr,v 1.39 2013/11/01 18:47:49 spz Exp $
$NetBSD: gnats_config.sh,v 1.8 2006/05/07 09:23:38 tsutsui Exp $
Copyright © 1994-2007 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.