NetBSD Problem Report #55415

From oster@fween.ca  Wed Jun 24 20:34:09 2020
Return-Path: <oster@fween.ca>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
	by mollari.NetBSD.org (Postfix) with ESMTPS id 56C831A9217
	for <gnats-bugs@gnats.NetBSD.org>; Wed, 24 Jun 2020 20:34:09 +0000 (UTC)
Message-Id: <20200624203407.16FAB52D76F@thog.fween.ca>
Date: Wed, 24 Jun 2020 14:34:07 -0600 (CST)
From: oster@netbsd.org
Reply-To: oster@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: vax no longer preempts in a timely fashion
X-Send-Pr-Version: 3.95

>Number:         55415
>Category:       port-vax
>Synopsis:       vax no longer preempts in a timely fashion
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    port-vax-maintainer
>State:          closed
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Jun 24 20:35:00 +0000 2020
>Closed-Date:    Tue Dec 19 14:35:42 +0000 2023
>Last-Modified:  Tue Dec 19 14:35:42 +0000 2023
>Originator:     Greg Oster
>Release:        NetBSD 9.99.68
>Organization:
>Environment:


System: NetBSD floyd 9.99.68 NetBSD 9.99.68 (GENERICFIX) #0: Wed Jun 24 10:35:26 CST 2020  oster@thog:/u1/builds/build300/src/obj/vax/u1/builds/build300/src/sys/arch/vax/compile/GENERICFIX vax
Architecture: vax
Machine: vax
>Description:

The vax port currently isn't doing kernel preemption correctly.  That is,
if a userland process decides not to yield the CPU, it can basically stop
the network, console, and any other processes from getting any CPU cycles.

The behaviour was found by attempting to build /usr/pkgsrc/devel/m4, 
and wondering why it kept hanging the machine hard at a certain spot. 
That spot was testing for strstr() functionality.  After much debugging,
it was determined that the machine hadn't actually frozen -- the strstr()
call was just consuming all CPU, and would continue to do so until the
call finished.  (Even though the code also had an alarm set to stop the
process after 5 seconds, that alarm would never fire, due to nothing else
getting any CPU cycles.)

>How-To-Repeat:

Attempt to build /usr/pkgsrc/devel/m4 on a 4000/60.  
Notice that the machine appears to be totally frozen.
The errant behaviour can also be observed on boot, where ping times
to the vax go from 2ms to 9*seconds* as various /etc/rc.d scripts are run.
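
A smaller reproducer than a full m4 build might look like the sketch below.
It is an assumption-laden stand-in, not the actual m4 configure test: the
helper name spin_until_alarm() is hypothetical, and it installs a SIGALRM
handler instead of relying on the default action.  On a kernel that preempts
correctly the call returns after the requested number of seconds; whether it
shows the hang on an affected VAX kernel has not been verified here.

```c
#include <assert.h>
#include <signal.h>
#include <unistd.h>

/* Set by the SIGALRM handler; volatile sig_atomic_t is the type that is
 * safe to write from a signal handler. */
static volatile sig_atomic_t alarm_fired;

static void
on_alarm(int signo)
{
	(void)signo;
	alarm_fired = 1;
}

/*
 * Hypothetical helper (not taken from the m4 sources): spin in userland
 * without making any system call until SIGALRM arrives, then return the
 * number of loop iterations completed.
 */
unsigned long
spin_until_alarm(unsigned int seconds)
{
	unsigned long iterations = 0;

	assert(seconds > 0);	/* alarm(0) would cancel the timer */
	alarm_fired = 0;
	signal(SIGALRM, on_alarm);
	alarm(seconds);
	while (!alarm_fired)
		iterations++;	/* pure CPU work, never yields voluntarily */
	return iterations;
}
```

Calling spin_until_alarm(5) from main() while pinging the machine from
another host should make any stall visible as a jump in round-trip times.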

>Fix:
Version 1.136 of src/sys/arch/vax/vax/trap.c removed the lines:
		if (curcpu()->ci_want_resched)
			preempt();
 from the trap() function, and version 1.103 of src/sys/arch/vax/include/cpu.h
removed the line:
               (ci)->ci_want_resched = 1; \
 from the cpu_need_resched #define.

Reverting both of these changes restores normal preemption behaviour.

>Release-Note:

>Audit-Trail:
From: Anders Magnusson <ragge@tethuvudet.se>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 25 Jun 2020 11:10:42 +0200

 Will it work if you only restore the removed line in cpu.h?



From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 25 Jun 2020 08:55:50 -0600

 On 6/25/20 3:15 AM, Anders Magnusson wrote:
 > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 > 
 > From: Anders Magnusson <ragge@tethuvudet.se>
 > To: gnats-bugs@netbsd.org
 > Cc:
 > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 > Date: Thu, 25 Jun 2020 11:10:42 +0200
 > 
 >   Will it work if you only restore the removed line in cpu.h?

 Yes, yes it does!  So it's just one line that needs to be restored to 
 get things working properly.

 Thanks for the hint!

 Later...

 Greg Oster

From: Anders Magnusson <ragge@tethuvudet.se>
To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Fri, 26 Jun 2020 08:54:15 +0200

 Den 2020-06-25 kl. 17:00, skrev Greg Oster:
 > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 >
 > From: Greg Oster <oster@netbsd.org>
 > To: gnats-bugs@netbsd.org
 > Cc:
 > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 > Date: Thu, 25 Jun 2020 08:55:50 -0600
 >
 >   On 6/25/20 3:15 AM, Anders Magnusson wrote:
 >   > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 >   >
 >   > From: Anders Magnusson <ragge@tethuvudet.se>
 >   > To: gnats-bugs@netbsd.org
 >   > Cc:
 >   > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 >   > Date: Thu, 25 Jun 2020 11:10:42 +0200
 >   >
 >   >   Will it work if you only restore the removed line in cpu.h?
 >   
 >   Yes, yes it does!  So it's just one line that needs to be restored to
 >   get things working properly.
 >
 Great!

 The other missing line should not be needed as I understand the code in 
 sched_resched_cpu().
 ci_want_resched should always be set already when cpu_need_resched() is 
 called.

 I'll try to fire up my 4000/90 this weekend and see if I can find this bug.

 -- R

From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Mon, 6 Jul 2020 08:50:11 -0600

 On 6/26/20 12:55 AM, Anders Magnusson wrote:
 > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 > 
 > From: Anders Magnusson <ragge@tethuvudet.se>
 > To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
 >   gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
 > Cc:
 > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 > Date: Fri, 26 Jun 2020 08:54:15 +0200
 > 
 >   Den 2020-06-25 kl. 17:00, skrev Greg Oster:
 >   > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 >   >
 >   > From: Greg Oster <oster@netbsd.org>
 >   > To: gnats-bugs@netbsd.org
 >   > Cc:
 >   > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 >   > Date: Thu, 25 Jun 2020 08:55:50 -0600
 >   >
 >   >   On 6/25/20 3:15 AM, Anders Magnusson wrote:
 >   >   > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 >   >   >
 >   >   > From: Anders Magnusson <ragge@tethuvudet.se>
 >   >   > To: gnats-bugs@netbsd.org
 >   >   > Cc:
 >   >   > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 >   >   > Date: Thu, 25 Jun 2020 11:10:42 +0200
 >   >   >
 >   >   >   Will it work if you only restore the removed line in cpu.h?
 >   >
 >   >   Yes, yes it does!  So it's just one line that needs to be restored to
 >   >   get things working properly.
 >   >
 >   Great!
 >   
 >   The other missing line should not be needed as I understand the code in
 >   sched_resched_cpu().
 >   ci_want_resched should always be set already when cpu_need_resched() is
 >   called.
 >   
 >   I'll try to fire up my 4000/90 this weekend and see if I can find this bug.
 >   
 >   -- R
 >   
 > 
 I had a few minutes to poke at this again... and can confirm that the 
 issue can be seen using simh as well.  I note that setting 
 ci_want_resched to 4 (RESCHED_UPREEMPT) or 8 (RESCHED_KPREEMPT) is 
 insufficient -- it is only with setting ci_want_resched to 1 (i.e. 
 likely blowing away the currently set value of 4) that scheduling 
 behaves properly.  Also: using:

    ci_want_resched |= 1;

 is also insufficient -- which tells me it's the absence of '4' or '8' 
 that makes the difference, not the setting of '1'.  But I haven't been 
 able to figure out why yet...


 Later...

 Greg Oster
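
The difference between the two experiments in the message above comes down
to plain flag arithmetic, sketched here as a userland model.  The values 4
and 8 are the ones quoted in the message; FLAG_ONE is a placeholder name for
the literal 1, and the helper functions are hypothetical, not kernel code.

```c
#include <assert.h>

/* Values quoted in the message above. */
#define RESCHED_UPREEMPT	4u
#define RESCHED_KPREEMPT	8u
/* Placeholder name for the literal 1 stored by the restored line. */
#define FLAG_ONE		1u

/* Models "(ci)->ci_want_resched = 1": discards whatever was set before. */
unsigned int
assign_one(unsigned int want_resched)
{
	(void)want_resched;
	return FLAG_ONE;
}

/* Models "ci_want_resched |= 1": keeps the bits that were already set. */
unsigned int
or_one(unsigned int want_resched)
{
	return want_resched | FLAG_ONE;
}

/* Nonzero if either preemption request bit is present. */
int
has_preempt_bits(unsigned int want_resched)
{
	return (want_resched & (RESCHED_UPREEMPT | RESCHED_KPREEMPT)) != 0;
}
```

Starting from RESCHED_UPREEMPT, assign_one() leaves no preemption bit set
while or_one() leaves 5, which matches the observation that only the plain
assignment changes the scheduling behaviour.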

From: oster@netbsd.org
To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 11:37:20 -0600

 On 6/26/20 12:55 AM, Anders Magnusson wrote:
 > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 > 
 > From: Anders Magnusson <ragge@tethuvudet.se>
 > To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
 >   gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
 > Cc:
 > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 > Date: Fri, 26 Jun 2020 08:54:15 +0200
 > 
 >   Den 2020-06-25 kl. 17:00, skrev Greg Oster:
 >   > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 >   >
 >   > From: Greg Oster <oster@netbsd.org>
 >   > To: gnats-bugs@netbsd.org
 >   > Cc:
 >   > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 >   > Date: Thu, 25 Jun 2020 08:55:50 -0600
 >   >
 >   >   On 6/25/20 3:15 AM, Anders Magnusson wrote:
 >   >   > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 >   >   >
 >   >   > From: Anders Magnusson <ragge@tethuvudet.se>
 >   >   > To: gnats-bugs@netbsd.org
 >   >   > Cc:
 >   >   > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 >   >   > Date: Thu, 25 Jun 2020 11:10:42 +0200
 >   >   >
 >   >   >   Will it work if you only restore the removed line in cpu.h?
 >   >
 >   >   Yes, yes it does!  So it's just one line that needs to be restored to
 >   >   get things working properly.
 >   >
 >   Great!
 >   
 >   The other missing line should not be needed as I understand the code in
 >   sched_resched_cpu().
 >   ci_want_resched should always be set already when cpu_need_resched() is
 >   called.
 >   
 >   I'll try to fire up my 4000/90 this weekend and see if I can find this bug.
 >   

 I've done a bit more debugging...   What I'm seeing is that in 
 kern_runq.c:sched_resched_cpu() the call to cpu_need_resched(ci, l, f)
 happens, and cpu_need_resched() sets up the AST.  Except it's only once in a 
 while that the trap with the AST fires, userret() gets called, and 
 preemption happens!  Sometimes the trap with AST fires once, and not 
 again... sometimes it fires 5 times in a row, and then misses.... but I 
 don't know why an AST that has been posted would subsequently get missed 
 sometimes....

 So it's able to hit a situation where cpu_need_resched() is called, but 
 the corresponding AST never fires. The loop in sched_resched_cpu() that 
 sets ci->ci_want_resched keeps thinking (correctly!) that the AST has 
 already been set up, and so doesn't try to call cpu_need_resched() again. 
 When it gets 'stuck' like this, we never see an AST until the process 
 completes.  (Nor do we see preemption until the process completes.)
 That seems to be because if I check the AST status with:

   if (mfpr(PR_ASTLVL) != AST_OK)

 that condition is always true... (meaning the AST is not set up...)

 Any ideas on how an AST can just 'disappear'?  (I'm using the same 
 mfpr() check right after the mtpr() setting of PR_ASTLVL, and there it 
 thinks it's set just fine... so how does it go missing a few moments 
 after????)

 Later...

 Greg Oster
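
The 'stuck' state described in the message above can be illustrated with a
small single-threaded model.  This is only a sketch under stated
assumptions: struct cpu_model and the model_*() functions are hypothetical
names, the real loop in sched_resched_cpu() uses a compare-and-swap and
compares preemption levels, and how the AST actually gets lost is exactly
the open question, so model_lose_ast() simply clears it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for the per-CPU state discussed above. */
struct cpu_model {
	unsigned int want_resched;	/* stands in for ci_want_resched */
	bool ast_pending;		/* stands in for ASTLVL == AST_OK */
	unsigned int posts;		/* how often the AST was posted */
};

/* Stand-in for cpu_need_resched(): post the AST. */
static void
model_post_ast(struct cpu_model *c)
{
	c->ast_pending = true;
	c->posts++;
}

/* Simplified request path: only the first request posts the AST. */
void
model_resched(struct cpu_model *c, unsigned int flag)
{
	if (c->want_resched == 0) {
		c->want_resched = flag;
		model_post_ast(c);
		return;
	}
	/* Already in progress, nothing to do. */
	c->want_resched |= flag;
}

/* Models the unexplained event: the AST vanishes without being taken. */
void
model_lose_ast(struct cpu_model *c)
{
	c->ast_pending = false;
}
```

After one model_resched(), a model_lose_ast(), and any number of further
model_resched() calls, ast_pending stays false and posts stays at 1: the
flag says a reschedule is in progress, so nothing ever reposts the AST.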

From: Anders Magnusson <ragge@tethuvudet.se>
To: gnats-bugs@netbsd.org, oster@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 21:07:37 +0200

 >   I've done a bit more debugging...   What I'm seeing is that in
 >   kern_runq.c:sched_resched_cpu() the call to cpu_need_resched(ci, l, f)
 >   happens, cpu_need_resched() sets up the AST.  Except it's only once in a
 >   while that the trap with the AST fires, userret() gets called, and
 >   preemption happens!  Sometimes the trap with AST fires once, and not
 >   again... sometimes it fires 5 times in a row, and then misses.... but I
 >   don't know why an AST that has been posted would subsequently get missed
 >   sometimes....
 >   
 >   So it's able to hit a situation where cpu_need_resched() is called, but
 >   the corresponding AST never fires. The loop in sched_resched_cpu() that
 >   sets ci->ci_want_resched keeps thinking (correctly!) that the AST has
 >   already been setup, and so doesn't try to call cpu_need_resched() again.
 >     When it gets 'stuck' like this, we never see an AST until the process
 >   completes.  (nor do we see preemption until the process completes.)
 >   That seems to be because if I check the AST status with:
 >   
 >     if (mfpr(PR_ASTLVL) != AST_OK)
 >   
 >   that condition is always true... (meaning the AST is not setup...)
 >   
 >   Any ideas on how an AST can just 'disappear'?  (I'm using the same
 >   mfpr() check right after the mtpr() setting of PR_ASTLVL, and there it
 >   thinks it's set just fine... so how does it go missing a few moments
 >   after????)
 >
 The AST is only acked if it has been taken.  This is done in trap(), 
 just before userret() is called.
 Losing the AST should not be possible.

 The VAX manual says that ASTLVL is not saved by svpctx, so if a 
 process switch occurs before the AST is delivered it will be lost.
 Can this ever happen? Since ASTs are intended to cause the process 
 switch, can a switch be called from a higher level of interrupt these days?

 You could add in your code something like:

 s = splhigh();
 mtpr(AST_OK, PR_ASTLVL);
 if (mfpr(PR_ASTLVL) != AST_OK)
      printf("ERROR\n");
 splx(s);

 and see if you still get a missing AST?

 -- Ragge
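
The concern raised above, that a pending AST could be lost across a process
switch because ASTLVL is not saved, can be sketched with a userland model.
Everything here is an assumption made for illustration: the struct and
function names are hypothetical, the MODEL_AST_* encodings are not the
values from cpu.h, and the model assumes that switching back in reloads the
register from a per-process save slot.  Whether such a switch can really
happen before the AST is delivered is the question the message leaves open.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder encodings for this model only, not the values in cpu.h. */
enum { MODEL_AST_NONE = 0, MODEL_AST_POSTED = 1 };

struct pcb_model {
	int saved_astlvl;	/* hypothetical PCB slot for the AST level */
};

struct cpu_regs {
	int astlvl;		/* stands in for the PR_ASTLVL register */
};

/* Switch away from a process; optionally save ASTLVL in its PCB first. */
void
model_switch_out(struct cpu_regs *cpu, struct pcb_model *pcb, bool save_ast)
{
	if (save_ast)
		pcb->saved_astlvl = cpu->astlvl;
	/* The register now belongs to whatever runs next. */
	cpu->astlvl = MODEL_AST_NONE;
}

/* Switch back to a process: the register is reloaded from the PCB. */
void
model_switch_in(struct cpu_regs *cpu, const struct pcb_model *pcb)
{
	cpu->astlvl = pcb->saved_astlvl;
}
```

With save_ast false, a posted AST comes back as MODEL_AST_NONE after a
switch out and in; with save_ast true it survives the round trip.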

From: oster@netbsd.org
To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 13:17:51 -0600

 On 7/30/20 1:10 PM, Anders Magnusson wrote:
 > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 > 
 > From: Anders Magnusson <ragge@tethuvudet.se>
 > To: gnats-bugs@netbsd.org, oster@netbsd.org
 > Cc:
 > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 > Date: Thu, 30 Jul 2020 21:07:37 +0200
 > 
 >   >   I've done a bit more debugging...   What I'm seeing is that in
 >   >   kern_runq.c:sched_resched_cpu() the call to cpu_need_resched(ci, l, f)
 >   >   happens, cpu_need_resched() sets up the AST.  Except it's only once in a
 >   >   while that the trap with the AST fires, userret() gets called, and
 >   >   preemption happens!  Sometimes the trap with AST fires once, and not
 >   >   again... sometimes it fires 5 times in a row, and then misses.... but I
 >   >   don't know why an AST that has been posted would subsequently get missed
 >   >   sometimes....
 >   >
 >   >   So it's able to hit a situation where cpu_need_resched() is called, but
 >   >   the corresponding AST never fires. The loop in sched_resched_cpu() that
 >   >   sets ci->ci_want_resched keeps thinking (correctly!) that the AST has
 >   >   already been setup, and so doesn't try to call cpu_need_resched() again.
 >   >     When it gets 'stuck' like this, we never see an AST until the process
 >   >   completes.  (nor do we see preemption until the process completes.)
 >   >   That seems to be because if I check the AST status with:
 >   >
 >   >     if (mfpr(PR_ASTLVL) != AST_OK)
 >   >
 >   >   that condition is always true... (meaning the AST is not setup...)
 >   >
 >   >   Any ideas on how an AST can just 'disappear'?  (I'm using the same
 >   >   mfpr() check right after the mtpr() setting of PR_ASTLVL, and there it
 >   >   thinks it's set just fine... so how does it go missing a few moments
 >   >   after????)
 >   >
 >   The AST is only acked if it has been taken.  This is done in trap(),
 >   just before userret() is called.
 >   Losing the AST should not be possible.
 >   
 >   Reading the VAX manual says that ASTLVL is not saved by svpctx, so if a
 >   process switch occurs before the AST is delivered it will be lost.
 >   Can this ever happen? Since ASTs are intended to cause the process
 >   switch, can a switch be called from a higher level of interrupt these days?
 >   
 >   You could add in your code something like:
 >   
 >   s = splhigh();
 >   mtpr(AST_OK, PR_ASTLVL);
 >   if (mfpr(PR_ASTLVL) != AST_OK)
 >        printf("ERROR\n");
 >   splx(s);
 >   
 >   and see if you still get a missing AST?

 I'm using this:

 #define	cpu_need_resched(ci, l, flags)		\
 	do {					\
 		__USE(flags);			\
 		mtpr(AST_OK,PR_ASTLVL);		\
 		if (mfpr(PR_ASTLVL) != AST_OK)	\
 			printf("AST NOT SET!\n");	\
 	} while (/*CONSTCOND*/ 0)

 and the "AST NOT SET!" is never printed.  I can try with splhigh/splx, 
 and will also look to see if there's a svpctx happening somewhere....

 Later...

 Greg Oster

From: Anders Magnusson <ragge@tethuvudet.se>
To: oster@netbsd.org, gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 21:24:47 +0200

 > I'm using this:
 >
 > #define    cpu_need_resched(ci, l, flags)        \
 >     do {                    \
 >         __USE(flags);            \
 >         mtpr(AST_OK,PR_ASTLVL);        \
 >         if (mfpr(PR_ASTLVL) != AST_OK)  \
 >         printf("AST NOT SET!\n"); \
 >     } while (/*CONSTCOND*/ 0)
 >
 > and the "AST NOT SET!" is never printed.  I can try with splhigh/splx, 
 > and will also look to see if there's a svpctx happening somewhere....
 >
 Ok, I misread you, sorry.  Ignore this.
 Hm, we should add a check in some of the context switch routines to see 
 if they are called by something other than an AST (actually, to see if 
 they are called with AST_OK).  If they are, then AST save code should be 
 added.

 But, I assume that someone should know whether context switches these 
 days can be called from something other than ASTs?

 -- Ragge

From: oster@netbsd.org
To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 13:25:48 -0600

 On 7/30/20 1:10 PM, Anders Magnusson wrote:
 > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 > 
 > From: Anders Magnusson <ragge@tethuvudet.se>
 > To: gnats-bugs@netbsd.org, oster@netbsd.org
 > Cc:
 > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 > Date: Thu, 30 Jul 2020 21:07:37 +0200
 > 
 >   >   I've done a bit more debugging...   What I'm seeing is that in
 >   >   kern_runq.c:sched_resched_cpu() the call to cpu_need_resched(ci, l, f)
 >   >   happens, cpu_need_resched() sets up the AST.  Except it's only once in a
 >   >   while that the trap with the AST fires, userret() gets called, and
 >   >   preemption happens!  Sometimes the trap with AST fires once, and not
 >   >   again... sometimes it fires 5 times in a row, and then misses.... but I
 >   >   don't know why an AST that has been posted would subsequently get missed
 >   >   sometimes....
 >   >
 >   >   So it's able to hit a situation where cpu_need_resched() is called, but
 >   >   the corresponding AST never fires. The loop in sched_resched_cpu() that
 >   >   sets ci->ci_want_resched keeps thinking (correctly!) that the AST has
 >   >   already been setup, and so doesn't try to call cpu_need_resched() again.
 >   >     When it gets 'stuck' like this, we never see an AST until the process
 >   >   completes.  (nor do we see preemption until the process completes.)
 >   >   That seems to be because if I check the AST status with:
 >   >
 >   >     if (mfpr(PR_ASTLVL) != AST_OK)
 >   >
 >   >   that condition is always true... (meaning the AST is not setup...)
 >   >
 >   >   Any ideas on how an AST can just 'disappear'?  (I'm using the same
 >   >   mfpr() check right after the mtpr() setting of PR_ASTLVL, and there it
 >   >   thinks it's set just fine... so how does it go missing a few moments
 >   >   after????)
 >   >
 >   The AST is only acked if it has been taken.  This is done in trap(),
 >   just before userret() is called.
 >   Losing the AST should not be possible.
 >   
 >   Reading the VAX manual says that ASTLVL is not saved by svpctx, so if a
 >   process switch occurs before the AST is delivered it will be lost.
 >   Can this ever happen? 

 Hmm... svpctx happens in softint_common(), which seems to be called from 
 lots of softFOO functions...   So if I'm reading this correctly, if we 
 happen to get into softint_common then the AST will get lost....

 > Since ASTs are intended to cause the process
 >   switch, can a switch be called from a higher level of interrupt these days?
 >   
 >   You could add in your code something like:
 >   
 >   s = splhigh();
 >   mtpr(AST_OK, PR_ASTLVL);
 >   if (mfpr(PR_ASTLVL) != AST_OK)
 >        printf("ERROR\n");
 >   splx(s);
 >   
 >   and see if you still get a missing AST?
 >   
 >   -- Ragge
 >   
 > 


 Later...

 Greg Oster

From: Anders Magnusson <ragge@tethuvudet.se>
To: gnats-bugs@netbsd.org, oster@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 21:38:35 +0200

 Den 2020-07-30 kl. 21:30, skrev oster@netbsd.org:
 > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 >
 > From: oster@netbsd.org
 > To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
 >   gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
 > Cc:
 > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 > Date: Thu, 30 Jul 2020 13:25:48 -0600
 >
 >   On 7/30/20 1:10 PM, Anders Magnusson wrote:
 >   > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 >   >
 >   > From: Anders Magnusson <ragge@tethuvudet.se>
 >   > To: gnats-bugs@netbsd.org, oster@netbsd.org
 >   > Cc:
 >   > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 >   > Date: Thu, 30 Jul 2020 21:07:37 +0200
 >   >
 >   >   >   I've done a bit more debugging...   What I'm seeing is that in
 >   >   >   kern_runq.c:sched_resched_cpu() the call to cpu_need_resched(ci, l, f)
 >   >   >   happens, cpu_need_resched() sets up the AST.  Except it's only once in a
 >   >   >   while that the trap with the AST fires, userret() gets called, and
 >   >   >   preemption happens!  Sometimes the trap with AST fires once, and not
 >   >   >   again... sometimes it fires 5 times in a row, and then misses.... but I
 >   >   >   don't know why an AST that has been posted would subsequently get missed
 >   >   >   sometimes....
 >   >   >
 >   >   >   So it's able to hit a situation where cpu_need_resched() is called, but
 >   >   >   the corresponding AST never fires. The loop in sched_resched_cpu() that
 >   >   >   sets ci->ci_want_resched keeps thinking (correctly!) that the AST has
 >   >   >   already been setup, and so doesn't try to call cpu_need_resched() again.
 >   >   >     When it gets 'stuck' like this, we never see an AST until the process
 >   >   >   completes.  (nor do we see preemption until the process completes.)
 >   >   >   That seems to be because if I check the AST status with:
 >   >   >
 >   >   >     if (mfpr(PR_ASTLVL) != AST_OK)
 >   >   >
 >   >   >   that condition is always true... (meaning the AST is not setup...)
 >   >   >
 >   >   >   Any ideas on how an AST can just 'disappear'?  (I'm using the same
 >   >   >   mfpr() check right after the mtpr() setting of PR_ASTLVL, and there it
 >   >   >   thinks it's set just fine... so how does it go missing a few moments
 >   >   >   after????)
 >   >   >
 >   >   The AST is only acked if it has been taken.  This is done in trap(),
 >   >   just before userret() is called.
 >   >   Losing the AST should not be possible.
 >   >
 >   >   Reading the VAX manual says that ASTLVL is not saved by svpctx, so if a
 >   >   process switch occurs before the AST is delivered it will be lost.
 >   >   Can this ever happen?
 >   
 >   Hmm... svpctx happens in softint_common(), which seems to be called from
 >   lots of softFOO functions...   So if I'm reading this correctly, if we
 >   happen to get into softint_common then the AST will get lost....
 >   
 AST is itself a softint (called at Softint level 2).
 But we should probably add saving of AST levels in the PCB anyway.

 -- Ragge

From: oster@netbsd.org
To: Anders Magnusson <ragge@tethuvudet.se>, gnats-bugs@netbsd.org,
 oster@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 14:49:55 -0600

 On 7/30/20 1:38 PM, Anders Magnusson wrote:
 > 
 > 
 > Den 2020-07-30 kl. 21:30, skrev oster@netbsd.org:
 >> The following reply was made to PR port-vax/55415; it has been noted 
 >> by GNATS.
 >>
 >> From: oster@netbsd.org
 >> To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
 >>   gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
 >> Cc:
 >> Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
 >> Date: Thu, 30 Jul 2020 13:25:48 -0600
 >>
 >>   On 7/30/20 1:10 PM, Anders Magnusson wrote:
 >>   > The following reply was made to PR port-vax/55415; it has been 
 >> noted by GNATS.
 >>   >
 >>   > From: Anders Magnusson <ragge@tethuvudet.se>
 >>   > To: gnats-bugs@netbsd.org, oster@netbsd.org
 >>   > Cc:
 >>   > Subject: Re: port-vax/55415: vax no longer preempts in a timely 
 >> fashion
 >>   > Date: Thu, 30 Jul 2020 21:07:37 +0200
 >>   >
 >>   >   >   I've done a bit more debugging...   What I'm seeing is that in
 >>   >   >   kern_runq.c:sched_resched_cpu() the call to 
 >> cpu_need_resched(ci, l, f)
 >>   >   >   happens, cpu_need_resched() sets up the AST.  Except it's 
 >> only once in a
 >>   >   >   while that the trap with the AST fires, userret() gets 
 >> called, and
 >>   >   >   preemption happens!  Sometimes the trap with AST fires once, 
 >> and not
 >>   >   >   again... sometimes it fires 5 times in a row, and then 
 >> misses.... but I
 >>   >   >   don't know why an AST that has been posted would 
 >> subsequently get missed
 >>   >   >   sometimes....
 >>   >   >
 >>   >   >   So it's able to hit a situation where cpu_need_resched() is 
 >> called, but
 >>   >   >   the corresponding AST never fires. The loop in 
 >> sched_resched_cpu() that
 >>   >   >   sets ci->ci_want_resched keeps thinking (correctly!) that 
 >> the AST has
 >>   >   >   already been setup, and so doesn't try to call 
 >> cpu_need_resched() again.
 >>   >   >     When it gets 'stuck' like this, we never see an AST until 
 >> the process
 >>   >   >   completes.  (nor do we see preemption until the process 
 >> completes.)
 >>   >   >   That seems to be because if I check the AST status with:
 >>   >   >
 >>   >   >     if (mfpr(PR_ASTLVL) != AST_OK)
 >>   >   >
 >>   >   >   that condition is always true... (meaning the AST is not 
 >> setup...)
 >>   >   >
 >>   >   >   Any ideas on how an AST can just 'disappear'?  (I'm using 
 >> the same
 >>   >   >   mfpr() check right after the mtpr() setting of PR_ASTLVL, 
 >> and there it
 >>   >   >   thinks it's set just fine... so how does it go missing a few 
 >> moments
 >>   >   >   after????)
 >>   >   >
 >>   >   The AST is only acked if it has been taken.  This is done in 
 >> trap(),
 >>   >   just before userret() is called.
 >>   >   Losing the AST should not be possible.
 >>   >
 >>   >   Reading the VAX manual says that ASTLVL is not saved by svpctx, 
 >> so if a
 >>   >   process switch occurs before the AST is delivered it will be lost.
 >>   >   Can this ever happen?
 >>   Hmm... svpctx happens in softint_common(), which seems to be called 
 >> from
 >>   lots of softFOO functions...   So if I'm reading this correctly, if we
 >>   happen to get into softint_common then the AST will get lost....
 > AST is itself a softint (called at Softint level 2).
 > But we should probably add saving of AST levels in the PCB anyway.

 I'm happy to test :)

 Later...

 Greg Oster

From: Greg Oster <oster@netbsd.org>
To: Anders Magnusson <ragge@tethuvudet.se>, gnats-bugs@netbsd.org
Cc: 
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Fri, 1 Jul 2022 17:25:06 -0600

 An update on this....

 If I add a separate check for cpu_intr_p() into 
 src/sys/kern/kern_runq.c:sched_resched_cpu() like this:

 ...
 	for (o = 0;; o = n) {
 		n = atomic_cas_uint(&ci->ci_want_resched, o, o | f);
 		if (__predict_true(o == n)) {
 			/*
 			 * We're the first to set a resched on the CPU.  Try
 			 * to avoid causing a needless trip through trap()
 			 * to handle an AST fault, if it's known the LWP
 			 * will either block or go through userret() soon.
 			 */

 			if (l != curlwp || cpu_intr_p()) {
 				cpu_need_resched(ci, l, f);
 			}
 			break;
 		}
 /* NEW CODE */
 		if (cpu_intr_p()) {
 			cpu_need_resched(ci, l, f);
 			break;
 		}
 /* END OF NEW CODE */
 		if (__predict_true(
 		    (n & (RESCHED_KPREEMPT|RESCHED_UPREEMPT)) >=
 		    (f & (RESCHED_KPREEMPT|RESCHED_UPREEMPT)))) {
 			/* Already in progress, nothing to do. */
 ...

 then the ping times drop from 9 seconds to 146ms, which is more in line 
 with what a 9.99.10 kernel does on the same hardware.  The new code 
 doesn't fire very often, but when it does, it would have been precisely 
 when the 'ping stalls' would occur.

 What I observed in testing that led me here is that during the high 
 ping times the kernel is stuck looping in the "Already in progress, 
 nothing to do." section when in fact, there are interrupts(?) coming in 
 that need servicing....  Is this maybe something to do with VAX having 
 hardware ASTs?

 These tests are also with a patch from Ragge to preserve ASTs in a PCB 
 across context switching.

 I suspect such a 'fix' wouldn't be appropriate for all the other 
 architectures, but I don't know that for sure.  Perhaps there's 
 something machine-dependent with VAX that doesn't fit into the current 
 machine-independent way of doing things?  Or is there still some other 
 bit missing in the VAX code that would accomplish the above?  What does 
 seem to be true is that on a VAX, the "nothing to do" isn't sufficient 
 for a performant (if we can call VAX that :) ) system.

 Later...

 Greg Oster
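
The effect of the experimental branch in the message above can be sketched
with a single-threaded userland model.  This is an illustration under stated
assumptions, not the kernel code: the names are hypothetical, the atomic
compare-and-swap loop is reduced to a plain test, the "l != curlwp" check is
omitted, and in_intr stands in for the result of cpu_intr_p().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU state. */
struct cpu_model {
	unsigned int want_resched;	/* stands in for ci_want_resched */
	unsigned int posts;		/* calls to the cpu_need_resched model */
};

/* Stand-in for cpu_need_resched(): count each time the AST is posted. */
static void
model_post_ast(struct cpu_model *c)
{
	c->posts++;
}

/*
 * in_intr models cpu_intr_p(); repost_in_intr enables the experimental
 * branch that posts again from interrupt context.
 */
void
model_resched(struct cpu_model *c, unsigned int flag, bool in_intr,
    bool repost_in_intr)
{
	if (c->want_resched == 0) {
		/* First request: record the flag and post the AST. */
		c->want_resched = flag;
		model_post_ast(c);
		return;
	}
	if (repost_in_intr && in_intr) {
		/* Experimental branch: post again even though a flag is set. */
		model_post_ast(c);
		return;
	}
	/* Already in progress, nothing to do. */
}
```

Without the extra branch, repeated requests from interrupt context post the
AST once; with it, every such request posts again, which is one way a lost
AST would get a second chance to be delivered.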



From: "Greg Oster" <oster@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55415 CVS commit: src/sys/arch/vax/include
Date: Sun, 10 Sep 2023 00:15:52 +0000

 Module Name:	src
 Committed By:	oster
 Date:		Sun Sep 10 00:15:52 UTC 2023

 Modified Files:
 	src/sys/arch/vax/include: cpu.h

 Log Message:
 With the overhaul of the scheduler code the semantics of
 ci_want_resched have changed, and for some reason vax
 still requires ci_want_resched set to 1 in order to do
 preemption.  This commit contains a workaround for the
 preemption issue discussed in PR#55415.

 XXX pullup-10


 To generate a diff of this commit:
 cvs rdiff -u -r1.106 -r1.107 src/sys/arch/vax/include/cpu.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55415 CVS commit: [netbsd-10] src/sys/arch/vax/include
Date: Mon, 11 Sep 2023 13:42:08 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Mon Sep 11 13:42:08 UTC 2023

 Modified Files:
 	src/sys/arch/vax/include [netbsd-10]: cpu.h

 Log Message:
 Pull up following revision(s) (requested by oster in ticket #365):

 	sys/arch/vax/include/cpu.h: revision 1.107

 With the overhaul of the scheduler code the semantics of
 ci_want_resched have changed, and for some reason vax
 still requires ci_want_resched set to 1 in order to do
 preemption.  This commit contains a workaround for the
 preemption issue discussed in PR#55415.


 To generate a diff of this commit:
 cvs rdiff -u -r1.106 -r1.106.2.1 src/sys/arch/vax/include/cpu.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Kalvis Duckmanton" <kalvisd@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55415 CVS commit: src/sys/arch/vax/vax
Date: Mon, 18 Dec 2023 22:40:01 +0000

 Module Name:	src
 Committed By:	kalvisd
 Date:		Mon Dec 18 22:40:01 UTC 2023

 Modified Files:
 	src/sys/arch/vax/vax: subr.S

 Log Message:
 vax: preserve AST requests raised when handling software interrupts

     PR port-vax/55415

     On return from a software interrupt, if the software interrupt LWP
     raised an AST request, copy the AST level from its PCB to the PCB
     of the interrupted LWP.

     Reviewed by <ragge>


 To generate a diff of this commit:
 cvs rdiff -u -r1.42 -r1.43 src/sys/arch/vax/vax/subr.S

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.
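
 The failure mode discussed in this PR, and the idea behind the change
 described in the log message above, can be illustrated with a small
 user-space model.  This is only a sketch under stated assumptions and is
 not the code in subr.S: every name below (model_pcb, model_astlvl_reg,
 model_load_context, model_request_ast, model_softint) is hypothetical,
 the numeric AST levels are illustrative, and the model assumes that the
 AST level is loaded from the PCB on a context load but is never saved
 back on a context save.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative AST levels for the model only. */
enum { MODEL_AST_USER = 3, MODEL_AST_NONE = 4 };

/* Hypothetical, stripped-down PCB: only the AST level is modelled. */
struct model_pcb {
	int astlvl;
};

/* Stands in for the per-CPU ASTLVL register. */
static int model_astlvl_reg = MODEL_AST_NONE;

/* Context load: the register is overwritten from the PCB. */
static void
model_load_context(const struct model_pcb *pcb)
{
	assert(pcb != NULL);
	model_astlvl_reg = pcb->astlvl;
}

/* An AST request made while 'cur' is the running context (assumption:
 * it is recorded both in the register and in the current PCB). */
static void
model_request_ast(struct model_pcb *cur)
{
	assert(cur != NULL);
	cur->astlvl = MODEL_AST_USER;
	model_astlvl_reg = MODEL_AST_USER;
}

/*
 * Run one software interrupt on top of a user LWP and return the AST
 * level seen by the user LWP afterwards.  With preserve_ast == false
 * the request raised by the softint LWP is overwritten on return; with
 * preserve_ast == true it is copied to the interrupted LWP's PCB first.
 */
int
model_softint(bool preserve_ast)
{
	struct model_pcb user = { MODEL_AST_NONE };
	struct model_pcb soft = { MODEL_AST_NONE };

	model_load_context(&user);	/* user LWP is running */
	model_load_context(&soft);	/* switch to the softint LWP */
	model_request_ast(&soft);	/* handler asks for a reschedule */

	if (preserve_ast && soft.astlvl != MODEL_AST_NONE) {
		user.astlvl = soft.astlvl;	/* hand the request over */
		soft.astlvl = MODEL_AST_NONE;
	}

	model_load_context(&user);	/* return to the interrupted LWP */
	return model_astlvl_reg;
}
```

 In this model, model_softint(false) leaves no AST pending for the user
 LWP, which matches the "AST disappears" symptom reported earlier in the
 thread, while model_softint(true) keeps the request pending.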

From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
 gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc: 
Subject: Re: PR/55415 CVS commit: src/sys/arch/vax/vax
Date: Mon, 18 Dec 2023 18:00:36 -0600

 On 2023-12-18 16.45, Kalvis Duckmanton wrote:
 > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
 > 
 > From: "Kalvis Duckmanton" <kalvisd@netbsd.org>
 > To: gnats-bugs@gnats.NetBSD.org
 > Cc:
 > Subject: PR/55415 CVS commit: src/sys/arch/vax/vax
 > Date: Mon, 18 Dec 2023 22:40:01 +0000
 > 
 >   Module Name:	src
 >   Committed By:	kalvisd
 >   Date:		Mon Dec 18 22:40:01 UTC 2023
 >   
 >   Modified Files:
 >   	src/sys/arch/vax/vax: subr.S
 >   
 >   Log Message:
 >   vax: preserve AST requests raised when handling software interrupts
 >   
 >       PR port-vax/55415
 >   
 >       On return from a software interrupt, if the software interrupt LWP
 >       raised an AST request, copy the AST level from its PCB to the PCB
 >       of the interrupted LWP.
 >   
 >       Reviewed by <ragge>

 I'm very pleased to update this ticket and report that Kalvis's change 
 means that this workaround:

   cvs rdiff -u -r1.106 -r1.107 src/sys/arch/vax/include/cpu.h

 is no longer needed!!

 Later...

 Greg Oster

From: "Kalvis Duckmanton" <kalvisd@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55415 CVS commit: src/sys/arch/vax/include
Date: Tue, 19 Dec 2023 00:29:48 +0000

 Module Name:	src
 Committed By:	kalvisd
 Date:		Tue Dec 19 00:29:48 UTC 2023

 Modified Files:
 	src/sys/arch/vax/include: cpu.h

 Log Message:
 vax: PR port-vax/55415

     Remove VAX-specific workaround to force pre-emption, as it is now
     no longer needed.

     tested by oster@


 To generate a diff of this commit:
 cvs rdiff -u -r1.107 -r1.108 src/sys/arch/vax/include/cpu.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55415 CVS commit: [netbsd-10] src/sys/arch/vax/vax
Date: Tue, 19 Dec 2023 12:26:01 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Tue Dec 19 12:26:01 UTC 2023

 Modified Files:
 	src/sys/arch/vax/vax [netbsd-10]: subr.S

 Log Message:
 Pull up following revision(s) (requested by kalvisd in ticket #508):

 	sys/arch/vax/vax/subr.S: revision 1.43

 vax: preserve AST requests raised when handling software interrupts

     PR port-vax/55415

     On return from a software interrupt, if the software interrupt LWP
     raised an AST request, copy the AST level from its PCB to the PCB
     of the interrupted LWP.

     Reviewed by <ragge>


 To generate a diff of this commit:
 cvs rdiff -u -r1.41.2.1 -r1.41.2.2 src/sys/arch/vax/vax/subr.S

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc: 
Subject: PR/55415 CVS commit: [netbsd-10] src/sys/arch/vax/include
Date: Tue, 19 Dec 2023 12:28:01 +0000

 Module Name:	src
 Committed By:	martin
 Date:		Tue Dec 19 12:28:01 UTC 2023

 Modified Files:
 	src/sys/arch/vax/include [netbsd-10]: cpu.h

 Log Message:
 Pull up following revision(s) (requested by kalvisd in ticket #509):

 	sys/arch/vax/include/cpu.h: revision 1.108

 vax: PR port-vax/55415

     Remove VAX-specific workaround to force pre-emption, as it is now
     no longer needed.

     tested by oster@


 To generate a diff of this commit:
 cvs rdiff -u -r1.106.2.1 -r1.106.2.2 src/sys/arch/vax/include/cpu.h

 Please note that diffs are not public domain; they are subject to the
 copyright notices on the relevant files.

State-Changed-From-To: open->closed
State-Changed-By: oster@NetBSD.org
State-Changed-When: Tue, 19 Dec 2023 14:35:42 +0000
State-Changed-Why:
kalvisd fixed the underlying issue!  Thanks!
(Pullups to -10 are done now too.)


>Unformatted:

Copyright © 1994-2023 The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.