NetBSD Problem Report #55415
From oster@fween.ca Wed Jun 24 20:34:09 2020
Return-Path: <oster@fween.ca>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.NetBSD.org", Issuer "mail.NetBSD.org CA" (not verified))
by mollari.NetBSD.org (Postfix) with ESMTPS id 56C831A9217
for <gnats-bugs@gnats.NetBSD.org>; Wed, 24 Jun 2020 20:34:09 +0000 (UTC)
Message-Id: <20200624203407.16FAB52D76F@thog.fween.ca>
Date: Wed, 24 Jun 2020 14:34:07 -0600 (CST)
From: oster@netbsd.org
Reply-To: oster@netbsd.org
To: gnats-bugs@NetBSD.org
Subject: vax no longer preempts in a timely fashion
X-Send-Pr-Version: 3.95
>Number: 55415
>Category: port-vax
>Synopsis: vax no longer preempts in a timely fashion
>Confidential: no
>Severity: critical
>Priority: high
>Responsible: port-vax-maintainer
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Wed Jun 24 20:35:00 +0000 2020
>Closed-Date: Tue Dec 19 14:35:42 +0000 2023
>Last-Modified: Tue Dec 19 14:35:42 +0000 2023
>Originator: Greg Oster
>Release: NetBSD 9.99.68
>Organization:
>Environment:
System: NetBSD floyd 9.99.68 NetBSD 9.99.68 (GENERICFIX) #0: Wed Jun 24 10:35:26 CST 2020 oster@thog:/u1/builds/build300/src/obj/vax/u1/builds/build300/src/sys/arch/vax/compile/GENERICFIX vax
Architecture: vax
Machine: vax
>Description:
The vax port currently isn't doing kernel preemption correctly. That is,
if a userland process decides not to yield the CPU, it can basically stop
the network, console, and any other processes from getting any CPU cycles.
The behaviour was found by attempting to build /usr/pkgsrc/devel/m4,
and wondering why it kept hanging the machine hard at a certain spot.
That spot was testing for strstr() functionality. After much debugging,
it was determined that the machine hadn't actually frozen -- the strstr()
call was just consuming all CPU, and would continue to do so until the
call finished. (Even though the code also had an alarm set to stop the
process after 5 seconds, that alarm would never fire, due to nothing else
getting any CPU cycles.)
>How-To-Repeat:
Attempt to build /usr/pkgsrc/devel/m4 on a 4000/60.
Notice that the machine appears to be totally frozen.
The errant behaviour can also be observed on boot, where ping times
to the vax go from 2ms to 9*seconds* as various /etc/rc.d scripts are run.
>Fix:
Version 1.136 of src/sys/arch/vax/vax/trap.c removed the lines:
    if (curcpu()->ci_want_resched)
        preempt();
from the trap() function, and version 1.103 of src/sys/arch/vax/include/cpu.h
removed the line:
    (ci)->ci_want_resched = 1; \
from the cpu_need_resched #define.
Reverting both of these changes restores normal preemption behaviour.
>Release-Note:
>Audit-Trail:
From: Anders Magnusson <ragge@tethuvudet.se>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 25 Jun 2020 11:10:42 +0200
Will it work if you only restore the removed line in cpu.h?
From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 25 Jun 2020 08:55:50 -0600
On 6/25/20 3:15 AM, Anders Magnusson wrote:
> The following reply was made to PR port-vax/55415; it has been noted by GNATS.
>
> From: Anders Magnusson <ragge@tethuvudet.se>
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> Date: Thu, 25 Jun 2020 11:10:42 +0200
>
> Will it work if you only restore the removed line in cpu.h?
Yes, yes it does! So it's just one line that needs to be restored to
get things working properly.
Thanks for the hint!
Later...
Greg Oster
From: Anders Magnusson <ragge@tethuvudet.se>
To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Fri, 26 Jun 2020 08:54:15 +0200
Den 2020-06-25 kl. 17:00, skrev Greg Oster:
> The following reply was made to PR port-vax/55415; it has been noted by GNATS.
>
> From: Greg Oster <oster@netbsd.org>
> To: gnats-bugs@netbsd.org
> Cc:
> Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> Date: Thu, 25 Jun 2020 08:55:50 -0600
>
> On 6/25/20 3:15 AM, Anders Magnusson wrote:
> > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
> >
> > From: Anders Magnusson <ragge@tethuvudet.se>
> > To: gnats-bugs@netbsd.org
> > Cc:
> > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> > Date: Thu, 25 Jun 2020 11:10:42 +0200
> >
> > Will it work if you only restore the removed line in cpu.h?
>
> Yes, yes it does! So it's just one line that needs to be restored to
> get things working properly.
>
Great!
The other missing line should not be needed as I understand the code in
sched_resched_cpu().
ci_want_resched should always be set already when cpu_need_resched() is
called.
I'll try to fire up my 4000/90 this weekend and see if I can find this bug.
-- R
From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Mon, 6 Jul 2020 08:50:11 -0600
On 6/26/20 12:55 AM, Anders Magnusson wrote:
> The following reply was made to PR port-vax/55415; it has been noted by GNATS.
>
> From: Anders Magnusson <ragge@tethuvudet.se>
> To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
> gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
> Cc:
> Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> Date: Fri, 26 Jun 2020 08:54:15 +0200
>
> Den 2020-06-25 kl. 17:00, skrev Greg Oster:
> > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
> >
> > From: Greg Oster <oster@netbsd.org>
> > To: gnats-bugs@netbsd.org
> > Cc:
> > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> > Date: Thu, 25 Jun 2020 08:55:50 -0600
> >
> > On 6/25/20 3:15 AM, Anders Magnusson wrote:
> > > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
> > >
> > > From: Anders Magnusson <ragge@tethuvudet.se>
> > > To: gnats-bugs@netbsd.org
> > > Cc:
> > > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> > > Date: Thu, 25 Jun 2020 11:10:42 +0200
> > >
> > > Will it work if you only restore the removed line in cpu.h?
> >
> > Yes, yes it does! So it's just one line that needs to be restored to
> > get things working properly.
> >
> Great!
>
> The other missing line should not be needed as I understand the code in
> sched_resched_cpu().
> ci_want_resched should always be set already when cpu_need_resched() is
> called.
>
> I'll try to fire up my 4000/90 this weekend and see if I can find this bug.
>
> -- R
>
>
I had a few minutes to poke at this again... and can confirm that the
issue can be seen using simh as well. I note that setting
ci_want_resched to 4 (RESCHED_UPREEMPT) or 8 (RESCHED_KPREEMPT) is
insufficient -- it is only with setting ci_want_resched to 1 (i.e.
likely blowing away the currently set value of 4) that scheduling
behaves properly. Also: using:
ci_want_resched |= 1;
is also insufficient -- which tells me it's the lack of '4' or '8' being
set that is the thing, not the setting of '1'. But I haven't been able
to figure out why yet...
Later...
Greg Oster
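The bit arithmetic described in the message above can be sketched in
userland C. This is only an illustration, not kernel code: the values 4
and 8 are the ones quoted above, while MODEL_LEGACY_ONE and the model_*
helpers are hypothetical names introduced here.

```c
/* Values 4 and 8 are the ones quoted in the message above. */
#define RESCHED_UPREEMPT 0x04u
#define RESCHED_KPREEMPT 0x08u
/* Hypothetical name for the bare value 1 stored by the restored line. */
#define MODEL_LEGACY_ONE 0x01u

/* Models "(ci)->ci_want_resched = 1": the previous bits are discarded. */
static unsigned model_assign_one(unsigned want)
{
    (void)want;
    return MODEL_LEGACY_ONE;
}

/* Models "ci_want_resched |= 1": the previous bits survive. */
static unsigned model_or_one(unsigned want)
{
    return want | MODEL_LEGACY_ONE;
}

/* Nonzero if either preempt bit is still set. */
static int model_has_preempt_bits(unsigned want)
{
    return (want & (RESCHED_UPREEMPT | RESCHED_KPREEMPT)) != 0;
}
```

Starting from 4, plain assignment leaves 1 with no preempt bits set,
whereas OR-ing leaves 5 with RESCHED_UPREEMPT still present. That is
consistent with the observation that the absence of the 4 or 8 bit, not
the presence of 1, is what changes the behaviour.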
From: oster@netbsd.org
To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 11:37:20 -0600
On 6/26/20 12:55 AM, Anders Magnusson wrote:
> The following reply was made to PR port-vax/55415; it has been noted by GNATS.
>
> From: Anders Magnusson <ragge@tethuvudet.se>
> To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
> gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
> Cc:
> Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> Date: Fri, 26 Jun 2020 08:54:15 +0200
>
> Den 2020-06-25 kl. 17:00, skrev Greg Oster:
> > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
> >
> > From: Greg Oster <oster@netbsd.org>
> > To: gnats-bugs@netbsd.org
> > Cc:
> > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> > Date: Thu, 25 Jun 2020 08:55:50 -0600
> >
> > On 6/25/20 3:15 AM, Anders Magnusson wrote:
> > > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
> > >
> > > From: Anders Magnusson <ragge@tethuvudet.se>
> > > To: gnats-bugs@netbsd.org
> > > Cc:
> > > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> > > Date: Thu, 25 Jun 2020 11:10:42 +0200
> > >
> > > Will it work if you only restore the removed line in cpu.h?
> >
> > Yes, yes it does! So it's just one line that needs to be restored to
> > get things working properly.
> >
> Great!
>
> The other missing line should not be needed as I understand the code in
> sched_resched_cpu().
> ci_want_resched should always be set already when cpu_need_resched() is
> called.
>
> I'll try to fire up my 4000/90 this weekend and see if I can find this bug.
>
I've done a bit more debugging... What I'm seeing is that in
kern_runq.c:sched_resched_cpu() the call to cpu_need_resched(ci, l, f)
happens, cpu_need_resched() sets up the AST. Except it's only once in a
while that the trap with the AST fires, userret() gets called, and
preemption happens! Sometimes the trap with AST fires once, and not
again... sometimes it fires 5 times in a row, and then misses.... but I
don't know why an AST that has been posted would subsequently get missed
sometimes....
So it's able to hit a situation where cpu_need_resched() is called, but
the corresponding AST never fires. The loop in sched_resched_cpu() that
sets ci->ci_want_resched keeps thinking (correctly!) that the AST has
already been setup, and so doesn't try to call cpu_need_resched() again.
When it gets 'stuck' like this, we never see an AST until the process
completes. (nor do we see preemption until the process completes.)
That seems to be because if I check the AST status with:
if (mfpr(PR_ASTLVL) != AST_OK)
that condition is always true... (meaning the AST is not setup...)
Any ideas on how an AST can just 'disappear'? (I'm using the same
mfpr() check right after the mtpr() setting of PR_ASTLVL, and there it
thinks it's set just fine... so how does it go missing a few moments
after????)
Later...
Greg Oster
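The "stuck" state described above can be modelled in a few lines of
single-threaded userland C. This is a simplified sketch, not the
kern_runq.c code: struct model_cpu and the model_* functions are
hypothetical, the atomic compare-and-swap is replaced by plain
assignments, and "losing" the AST is modelled by clearing a flag.

```c
#include <stdbool.h>

#define RESCHED_UPREEMPT 0x04u
#define RESCHED_KPREEMPT 0x08u
#define PREEMPT_MASK (RESCHED_UPREEMPT | RESCHED_KPREEMPT)

/* Hypothetical stand-in for the relevant per-CPU state. */
struct model_cpu {
    unsigned want_resched; /* models ci->ci_want_resched */
    bool ast_pending;      /* models "PR_ASTLVL == AST_OK" */
    unsigned ast_posts;    /* how often the AST has been posted */
};

/* Simplified model of the decision made in sched_resched_cpu(). */
static void model_request_resched(struct model_cpu *ci, unsigned f)
{
    if (ci->want_resched != 0 &&
        (ci->want_resched & PREEMPT_MASK) >= (f & PREEMPT_MASK))
        return; /* "Already in progress, nothing to do." */
    ci->want_resched |= f;
    ci->ast_pending = true; /* models cpu_need_resched() */
    ci->ast_posts++;
}

/* Models the AST vanishing while want_resched stays set. */
static void model_lose_ast(struct model_cpu *ci)
{
    ci->ast_pending = false;
}

/* Models the AST trap: returns true if preemption happened. */
static bool model_take_ast(struct model_cpu *ci)
{
    if (!ci->ast_pending)
        return false;
    ci->ast_pending = false;
    ci->want_resched = 0;
    return true;
}
```

In this model, once the AST is lost, every later
model_request_resched() call with the same flag returns early, so the
AST is never re-posted and model_take_ast() keeps returning false.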
From: Anders Magnusson <ragge@tethuvudet.se>
To: gnats-bugs@netbsd.org, oster@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 21:07:37 +0200
> I've done a bit more debugging... What I'm seeing is that in
> kern_runq.c:sched_resched_cpu() the call to cpu_need_resched(ci, l, f)
> happens, cpu_need_resched() sets up the AST. Except it's only once in a
> while that the trap with the AST fires, userret() gets called, and
> preemption happens! Sometimes the trap with AST fires once, and not
> again... sometimes it fires 5 times in a row, and then misses.... but I
> don't know why an AST that has been posted would subsequently get missed
> sometimes....
>
> So it's able to hit a situation where cpu_need_resched() is called, but
> the corresponding AST never fires. The loop in sched_resched_cpu() that
> sets ci->ci_want_resched keeps thinking (correctly!) that the AST has
> already been setup, and so doesn't try to call cpu_need_resched() again.
> When it gets 'stuck' like this, we never see an AST until the process
> completes. (nor do we see preemption until the process completes.)
> That seems to be because if I check the AST status with:
>
> if (mfpr(PR_ASTLVL) != AST_OK)
>
> that condition is always true... (meaning the AST is not setup...)
>
> Any ideas on how an AST can just 'disappear'? (I'm using the same
> mfpr() check right after the mtpr() setting of PR_ASTLVL, and there it
> thinks it's set just fine... so how does it go missing a few moments
> after????)
>
The AST is only acked if it has been taken. This is done in trap(),
just before userret() is called.
Losing the AST should not be possible.
Reading the VAX manual says that ASTLVL is not saved by svpctx, so if a
process switch occurs before the AST is delivered it will be lost.
Can this ever happen? Since ASTs are intended to cause the process
switch, can a switch be called from a higher level of interrupt these days?
You could add in your code something like:
    s = splhigh();
    mtpr(AST_OK, PR_ASTLVL);
    if (mfpr(PR_ASTLVL) != AST_OK)
        printf("ERROR\n");
    splx(s);
and see if you still get a missing AST?
-- Ragge
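The way an AST could be lost across a process switch can be sketched
with a small userland model. Several things here are assumptions rather
than facts from this thread: the numeric values chosen for AST_OK and
AST_NO, the idea that loading a context reloads the AST level from the
PCB, and all of the model_* names.

```c
/* Assumed values; the thread only uses the name AST_OK. */
enum { MODEL_AST_OK = 3, MODEL_AST_NO = 4 };

/* Hypothetical PCB holding only the field of interest. */
struct model_pcb {
    int astlvl;
};

/* Stands in for the PR_ASTLVL processor register. */
static int model_astlvl_reg = MODEL_AST_NO;

/* Models svpctx: the AST level is written back only if save_ast is set. */
static void model_svpctx(struct model_pcb *pcb, int save_ast)
{
    if (save_ast)
        pcb->astlvl = model_astlvl_reg;
}

/* Models loading a context (assumption: AST level comes from the PCB). */
static void model_ldpctx(const struct model_pcb *pcb)
{
    model_astlvl_reg = pcb->astlvl;
}

/* Post an AST for A, switch A -> B -> A, and report what A sees. */
static int model_roundtrip(int save_ast)
{
    struct model_pcb a = { MODEL_AST_NO }, b = { MODEL_AST_NO };

    model_ldpctx(&a);
    model_astlvl_reg = MODEL_AST_OK; /* models mtpr(AST_OK, PR_ASTLVL) */
    model_svpctx(&a, save_ast);
    model_ldpctx(&b);
    model_svpctx(&b, save_ast);
    model_ldpctx(&a);
    return model_astlvl_reg;
}
```

With save_ast set to 0 the posted AST is gone after the round trip; with
save_ast set to 1 it survives, which is the effect that saving the AST
level in the PCB would be expected to have.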
From: oster@netbsd.org
To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 13:17:51 -0600
On 7/30/20 1:10 PM, Anders Magnusson wrote:
> The following reply was made to PR port-vax/55415; it has been noted by GNATS.
>
> From: Anders Magnusson <ragge@tethuvudet.se>
> To: gnats-bugs@netbsd.org, oster@netbsd.org
> Cc:
> Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> Date: Thu, 30 Jul 2020 21:07:37 +0200
>
> > I've done a bit more debugging... What I'm seeing is that in
> > kern_runq.c:sched_resched_cpu() the call to cpu_need_resched(ci, l, f)
> > happens, cpu_need_resched() sets up the AST. Except it's only once in a
> > while that the trap with the AST fires, userret() gets called, and
> > preemption happens! Sometimes the trap with AST fires once, and not
> > again... sometimes it fires 5 times in a row, and then misses.... but I
> > don't know why an AST that has been posted would subsequently get missed
> > sometimes....
> >
> > So it's able to hit a situation where cpu_need_resched() is called, but
> > the corresponding AST never fires. The loop in sched_resched_cpu() that
> > sets ci->ci_want_resched keeps thinking (correctly!) that the AST has
> > already been setup, and so doesn't try to call cpu_need_resched() again.
> > When it gets 'stuck' like this, we never see an AST until the process
> > completes. (nor do we see preemption until the process completes.)
> > That seems to be because if I check the AST status with:
> >
> > if (mfpr(PR_ASTLVL) != AST_OK)
> >
> > that condition is always true... (meaning the AST is not setup...)
> >
> > Any ideas on how an AST can just 'disappear'? (I'm using the same
> > mfpr() check right after the mtpr() setting of PR_ASTLVL, and there it
> > thinks it's set just fine... so how does it go missing a few moments
> > after????)
> >
> The AST is only acked if it has been taken. This is done in trap(),
> just before userret() is called.
> Losing the AST should not be possible.
>
> Reading the VAX manual says that ASTLVL is not saved by svpctx, so if a
> process switch occurs before the AST is delivered it will be lost.
> Can this ever happen? Since ASTs are intended to cause the process
> switch, can a switch be called from a higher level of interrupt these days?
>
> You could add in your code something like:
>
> s = splhigh();
> mtpr(AST_OK, PR_ASTLVL);
> if (mfpr(PR_ASTLVL) != AST_OK)
>     printf("ERROR\n");
> splx(s);
>
> and see if you still get a missing AST?
I'm using this:
#define cpu_need_resched(ci, l, flags)          \
    do {                                        \
        __USE(flags);                           \
        mtpr(AST_OK,PR_ASTLVL);                 \
        if (mfpr(PR_ASTLVL) != AST_OK)          \
            printf("AST NOT SET!\n");           \
    } while (/*CONSTCOND*/ 0)
and the "AST NOT SET!" is never printed. I can try with splhigh/splx,
and will also look to see if there's a svpctx happening somewhere....
Later...
Greg Oster
From: Anders Magnusson <ragge@tethuvudet.se>
To: oster@netbsd.org, gnats-bugs@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 21:24:47 +0200
> I'm using this:
>
> #define cpu_need_resched(ci, l, flags)          \
>     do {                                        \
>         __USE(flags);                           \
>         mtpr(AST_OK,PR_ASTLVL);                 \
>         if (mfpr(PR_ASTLVL) != AST_OK)          \
>             printf("AST NOT SET!\n");           \
>     } while (/*CONSTCOND*/ 0)
>
> and the "AST NOT SET!" is never printed. I can try with splhigh/splx,
> and will also look to see if there's a svpctx happening somewhere....
>
Ok, I misread you, sorry. Ignore this.
Hm, should add a check in some of the context switch routines to see if
it is called by something else than an AST (actually, to see if it is
called with AST_OK). If it is, then AST save code should be added.
But, I assume that someone should know whether context switches these
days can be called from something else than ASTs?
-- Ragge
From: oster@netbsd.org
To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 13:25:48 -0600
On 7/30/20 1:10 PM, Anders Magnusson wrote:
> The following reply was made to PR port-vax/55415; it has been noted by GNATS.
>
> From: Anders Magnusson <ragge@tethuvudet.se>
> To: gnats-bugs@netbsd.org, oster@netbsd.org
> Cc:
> Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> Date: Thu, 30 Jul 2020 21:07:37 +0200
>
> > I've done a bit more debugging... What I'm seeing is that in
> > kern_runq.c:sched_resched_cpu() the call to cpu_need_resched(ci, l, f)
> > happens, cpu_need_resched() sets up the AST. Except it's only once in a
> > while that the trap with the AST fires, userret() gets called, and
> > preemption happens! Sometimes the trap with AST fires once, and not
> > again... sometimes it fires 5 times in a row, and then misses.... but I
> > don't know why an AST that has been posted would subsequently get missed
> > sometimes....
> >
> > So it's able to hit a situation where cpu_need_resched() is called, but
> > the corresponding AST never fires. The loop in sched_resched_cpu() that
> > sets ci->ci_want_resched keeps thinking (correctly!) that the AST has
> > already been setup, and so doesn't try to call cpu_need_resched() again.
> > When it gets 'stuck' like this, we never see an AST until the process
> > completes. (nor do we see preemption until the process completes.)
> > That seems to be because if I check the AST status with:
> >
> > if (mfpr(PR_ASTLVL) != AST_OK)
> >
> > that condition is always true... (meaning the AST is not setup...)
> >
> > Any ideas on how an AST can just 'disappear'? (I'm using the same
> > mfpr() check right after the mtpr() setting of PR_ASTLVL, and there it
> > thinks it's set just fine... so how does it go missing a few moments
> > after????)
> >
> The AST is only acked if it has been taken. This is done in trap(),
> just before userret() is called.
> Losing the AST should not be possible.
>
> Reading the VAX manual says that ASTLVL is not saved by svpctx, so if a
> process switch occurs before the AST is delivered it will be lost.
> Can this ever happen?
Hmm... svpctx happens in softint_common(), which seems to be called from
lots of softFOO functions... So if I'm reading this correctly, if we
happen to get into softint_common then the AST will get lost....
> Since ASTs are intended to cause the process
> switch, can a switch be called from a higher level of interrupt these days?
>
> You could add in your code something like:
>
> s = splhigh();
> mtpr(AST_OK, PR_ASTLVL);
> if (mfpr(PR_ASTLVL) != AST_OK)
>     printf("ERROR\n");
> splx(s);
>
> and see if you still get a missing AST?
>
> -- Ragge
>
>
Later...
Greg Oster
From: Anders Magnusson <ragge@tethuvudet.se>
To: gnats-bugs@netbsd.org, oster@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 21:38:35 +0200
Den 2020-07-30 kl. 21:30, skrev oster@netbsd.org:
> The following reply was made to PR port-vax/55415; it has been noted by GNATS.
>
> From: oster@netbsd.org
> To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
> gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
> Cc:
> Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> Date: Thu, 30 Jul 2020 13:25:48 -0600
>
> On 7/30/20 1:10 PM, Anders Magnusson wrote:
> > The following reply was made to PR port-vax/55415; it has been noted by GNATS.
> >
> > From: Anders Magnusson <ragge@tethuvudet.se>
> > To: gnats-bugs@netbsd.org, oster@netbsd.org
> > Cc:
> > Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
> > Date: Thu, 30 Jul 2020 21:07:37 +0200
> >
> > > I've done a bit more debugging... What I'm seeing is that in
> > > kern_runq.c:sched_resched_cpu() the call to cpu_need_resched(ci, l, f)
> > > happens, cpu_need_resched() sets up the AST. Except it's only once in a
> > > while that the trap with the AST fires, userret() gets called, and
> > > preemption happens! Sometimes the trap with AST fires once, and not
> > > again... sometimes it fires 5 times in a row, and then misses.... but I
> > > don't know why an AST that has been posted would subsequently get missed
> > > sometimes....
> > >
> > > So it's able to hit a situation where cpu_need_resched() is called, but
> > > the corresponding AST never fires. The loop in sched_resched_cpu() that
> > > sets ci->ci_want_resched keeps thinking (correctly!) that the AST has
> > > already been setup, and so doesn't try to call cpu_need_resched() again.
> > > When it gets 'stuck' like this, we never see an AST until the process
> > > completes. (nor do we see preemption until the process completes.)
> > > That seems to be because if I check the AST status with:
> > >
> > > if (mfpr(PR_ASTLVL) != AST_OK)
> > >
> > > that condition is always true... (meaning the AST is not setup...)
> > >
> > > Any ideas on how an AST can just 'disappear'? (I'm using the same
> > > mfpr() check right after the mtpr() setting of PR_ASTLVL, and there it
> > > thinks it's set just fine... so how does it go missing a few moments
> > > after????)
> > >
> > The AST is only acked if it has been taken. This is done in trap(),
> > just before userret() is called.
> > Losing the AST should not be possible.
> >
> > Reading the VAX manual says that ASTLVL is not saved by svpctx, so if a
> > process switch occurs before the AST is delivered it will be lost.
> > Can this ever happen?
>
> Hmm... svpctx happens in softint_common(), which seems to be called from
> lots of softFOO functions... So if I'm reading this correctly, if we
> happen to get into softint_common then the AST will get lost....
>
AST is itself a softint (called at Softint level 2).
But we should probably add saving of AST levels in the PCB anyway.
-- Ragge
From: oster@netbsd.org
To: Anders Magnusson <ragge@tethuvudet.se>, gnats-bugs@netbsd.org,
oster@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Thu, 30 Jul 2020 14:49:55 -0600
On 7/30/20 1:38 PM, Anders Magnusson wrote:
>
>
> Den 2020-07-30 kl. 21:30, skrev oster@netbsd.org:
>> The following reply was made to PR port-vax/55415; it has been noted
>> by GNATS.
>>
>> From: oster@netbsd.org
>> To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
>> gnats-admin@netbsd.org, netbsd-bugs@netbsd.org, oster@netbsd.org
>> Cc:
>> Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
>> Date: Thu, 30 Jul 2020 13:25:48 -0600
>>
>> On 7/30/20 1:10 PM, Anders Magnusson wrote:
>> > The following reply was made to PR port-vax/55415; it has been
>> noted by GNATS.
>> >
>> > From: Anders Magnusson <ragge@tethuvudet.se>
>> > To: gnats-bugs@netbsd.org, oster@netbsd.org
>> > Cc:
>> > Subject: Re: port-vax/55415: vax no longer preempts in a timely
>> fashion
>> > Date: Thu, 30 Jul 2020 21:07:37 +0200
>> >
>> > > I've done a bit more debugging... What I'm seeing is that in
>> > > kern_runq.c:sched_resched_cpu() the call to
>> cpu_need_resched(ci, l, f)
>> > > happens, cpu_need_resched() sets up the AST. Except it's
>> only once in a
>> > > while that the trap with the AST fires, userret() gets
>> called, and
>> > > preemption happens! Sometimes the trap with AST fires once,
>> and not
>> > > again... sometimes it fires 5 times in a row, and then
>> misses.... but I
>> > > don't know why an AST that has been posted would
>> subsequently get missed
>> > > sometimes....
>> > >
>> > > So it's able to hit a situation where cpu_need_resched() is
>> called, but
>> > > the corresponding AST never fires. The loop in
>> sched_resched_cpu() that
>> > > sets ci->ci_want_resched keeps thinking (correctly!) that
>> the AST has
>> > > already been setup, and so doesn't try to call
>> cpu_need_resched() again.
>> > > When it gets 'stuck' like this, we never see an AST until
>> the process
>> > > completes. (nor do we see preemption until the process
>> completes.)
>> > > That seems to be because if I check the AST status with:
>> > >
>> > > if (mfpr(PR_ASTLVL) != AST_OK)
>> > >
>> > > that condition is always true... (meaning the AST is not
>> setup...)
>> > >
>> > > Any ideas on how an AST can just 'disappear'? (I'm using
>> the same
>> > > mfpr() check right after the mtpr() setting of PR_ASTLVL,
>> and there it
>> > > thinks it's set just fine... so how does it go missing a few
>> moments
>> > > after????)
>> > >
>> > The AST is only acked if it has been taken. This is done in
>> trap(),
>> > just before userret() is called.
>> > Losing the AST should not be possible.
>> >
>> > Reading the VAX manual says that ASTLVL is not saved by svpctx,
>> so if a
>> > process switch occurs before the AST is delivered it will be lost.
>> > Can this ever happen?
>> Hmm... svpctx happens in softint_common(), which seems to be called
>> from
>> lots of softFOO functions... So if I'm reading this correctly, if we
>> happen to get into softint_common then the AST will get lost....
> AST is itself a softint (called at Softint level 2).
> But we should probably add saving of AST levels in the PCB anyway.
I'm happy to test :)
Later...
Greg Oster
From: Greg Oster <oster@netbsd.org>
To: Anders Magnusson <ragge@tethuvudet.se>, gnats-bugs@netbsd.org
Cc:
Subject: Re: port-vax/55415: vax no longer preempts in a timely fashion
Date: Fri, 1 Jul 2022 17:25:06 -0600
An update on this....
If I add a separate check for cpu_intr_p() into
src/sys/kern/kern_runq.c:sched_resched_cpu() like this:
...
    for (o = 0;; o = n) {
        n = atomic_cas_uint(&ci->ci_want_resched, o, o | f);
        if (__predict_true(o == n)) {
            /*
             * We're the first to set a resched on the CPU. Try
             * to avoid causing a needless trip through trap()
             * to handle an AST fault, if it's known the LWP
             * will either block or go through userret() soon.
             */
            if (l != curlwp || cpu_intr_p()) {
                cpu_need_resched(ci, l, f);
            }
            break;
        }
        /* NEW CODE */
        if (cpu_intr_p()) {
            cpu_need_resched(ci, l, f);
            break;
        }
        /* END OF NEW CODE */
        if (__predict_true(
            (n & (RESCHED_KPREEMPT|RESCHED_UPREEMPT)) >=
            (f & (RESCHED_KPREEMPT|RESCHED_UPREEMPT)))) {
            /* Already in progress, nothing to do. */
...
then the ping times drop from 9 seconds to 146ms, which is more in line
with what a 9.99.10 kernel does on the same hardware. The new code
doesn't fire very often, but when it does, it would have been precisely
when the 'ping stalls' would occur.
What I observed in testing that led me here is that during the high
ping times the kernel is stuck looping in the "Already in progress,
nothing to do." section when in fact, there are interrupts(?) coming in
that need servicing.... Is this maybe something to do with VAX having
hardware ASTs?
These tests are also with a patch from Ragge to preserve ASTs in a PCB
across context switching.
I suspect such a 'fix' wouldn't be appropriate for all the other
architectures, but I don't know that for sure. Perhaps there's
something machine-dependent with VAX that doesn't fit into the current
machine-independent way of doing things? Or is there still some other
bit missing in the VAX code that would accomplish the above? What does
seem to be true is that on a VAX, the "nothing to do" isn't sufficient
for a performant (if we can call VAX that :) ) system.
Later...
Greg Oster
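The effect of the added cpu_intr_p() check can be illustrated with a
simplified, single-threaded userland model. It is a sketch under stated
assumptions, not the kern_runq.c code: struct intr_model_cpu and the
intr_model_* functions are hypothetical, the in_intr argument stands in
for cpu_intr_p(), and the first-setter branch always posts the AST
rather than testing l != curlwp.

```c
#include <stdbool.h>

#define RESCHED_UPREEMPT 0x04u
#define RESCHED_KPREEMPT 0x08u
#define PREEMPT_MASK (RESCHED_UPREEMPT | RESCHED_KPREEMPT)

/* Hypothetical stand-in for the relevant per-CPU state. */
struct intr_model_cpu {
    unsigned want_resched; /* models ci->ci_want_resched */
    bool ast_pending;      /* models a posted AST */
    unsigned ast_posts;    /* how often the AST has been posted */
};

/* Models cpu_need_resched() posting the AST. */
static void intr_model_post_ast(struct intr_model_cpu *ci)
{
    ci->ast_pending = true;
    ci->ast_posts++;
}

/* Models the patched loop; in_intr stands in for cpu_intr_p(). */
static void intr_model_request(struct intr_model_cpu *ci, unsigned f,
    bool in_intr)
{
    if (ci->want_resched == 0) {
        /* First to set a resched on this CPU. */
        ci->want_resched = f;
        intr_model_post_ast(ci);
        return;
    }
    if (in_intr) {
        /* The added check: re-post even though a flag is set. */
        intr_model_post_ast(ci);
        return;
    }
    if ((ci->want_resched & PREEMPT_MASK) >= (f & PREEMPT_MASK))
        return; /* "Already in progress, nothing to do." */
    ci->want_resched |= f;
    intr_model_post_ast(ci);
}
```

In this model a request made outside interrupt context cannot recover
from a lost AST, while the next request made in interrupt context posts
it again, which matches the observation that the new code fires rarely
but at the moments when the stalls would otherwise occur.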
From: "Greg Oster" <oster@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55415 CVS commit: src/sys/arch/vax/include
Date: Sun, 10 Sep 2023 00:15:52 +0000
Module Name: src
Committed By: oster
Date: Sun Sep 10 00:15:52 UTC 2023
Modified Files:
src/sys/arch/vax/include: cpu.h
Log Message:
With the overhaul of the scheduler code the semantics of
ci_want_resched have changed, and for some reason vax
still requires ci_want_resched set to 1 in order to do
preemption. This commit contains a workaround for the
preemption issue discussed in PR#55415.
XXX pullup-10
To generate a diff of this commit:
cvs rdiff -u -r1.106 -r1.107 src/sys/arch/vax/include/cpu.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55415 CVS commit: [netbsd-10] src/sys/arch/vax/include
Date: Mon, 11 Sep 2023 13:42:08 +0000
Module Name: src
Committed By: martin
Date: Mon Sep 11 13:42:08 UTC 2023
Modified Files:
src/sys/arch/vax/include [netbsd-10]: cpu.h
Log Message:
Pull up following revision(s) (requested by oster in ticket #365):
sys/arch/vax/include/cpu.h: revision 1.107
With the overhaul of the scheduler code the semantics of
ci_want_resched have changed, and for some reason vax
still requires ci_want_resched set to 1 in order to do
preemption. This commit contains a workaround for the
preemption issue discussed in PR#55415.
To generate a diff of this commit:
cvs rdiff -u -r1.106 -r1.106.2.1 src/sys/arch/vax/include/cpu.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Kalvis Duckmanton" <kalvisd@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55415 CVS commit: src/sys/arch/vax/vax
Date: Mon, 18 Dec 2023 22:40:01 +0000
Module Name: src
Committed By: kalvisd
Date: Mon Dec 18 22:40:01 UTC 2023
Modified Files:
src/sys/arch/vax/vax: subr.S
Log Message:
vax: preserve AST requests raised when handling software interrupts
PR port-vax/55415
On return from a software interrupt, if the software interrupt LWP
raised an AST request, copy the AST level from its PCB to the PCB
of the interrupted LWP.
Reviewed by <ragge>
To generate a diff of this commit:
cvs rdiff -u -r1.42 -r1.43 src/sys/arch/vax/vax/subr.S
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Greg Oster <oster@netbsd.org>
To: gnats-bugs@netbsd.org, port-vax-maintainer@netbsd.org,
gnats-admin@netbsd.org, netbsd-bugs@netbsd.org
Cc:
Subject: Re: PR/55415 CVS commit: src/sys/arch/vax/vax
Date: Mon, 18 Dec 2023 18:00:36 -0600
On 2023-12-18 16.45, Kalvis Duckmanton wrote:
> The following reply was made to PR port-vax/55415; it has been noted by GNATS.
>
> From: "Kalvis Duckmanton" <kalvisd@netbsd.org>
> To: gnats-bugs@gnats.NetBSD.org
> Cc:
> Subject: PR/55415 CVS commit: src/sys/arch/vax/vax
> Date: Mon, 18 Dec 2023 22:40:01 +0000
>
> Module Name: src
> Committed By: kalvisd
> Date: Mon Dec 18 22:40:01 UTC 2023
>
> Modified Files:
> src/sys/arch/vax/vax: subr.S
>
> Log Message:
> vax: preserve AST requests raised when handling software interrupts
>
> PR port-vax/55415
>
> On return from a software interrupt, if the software interrupt LWP
> raised an AST request, copy the AST level from its PCB to the PCB
> of the interrupted LWP.
>
> Reviewed by <ragge>
I'm very pleased to update this ticket and report that Kalvis's change
means that this workaround:
cvs rdiff -u -r1.106 -r1.107 src/sys/arch/vax/include/cpu.h
is no longer needed!!
Later...
Greg Oster
From: "Kalvis Duckmanton" <kalvisd@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55415 CVS commit: src/sys/arch/vax/include
Date: Tue, 19 Dec 2023 00:29:48 +0000
Module Name: src
Committed By: kalvisd
Date: Tue Dec 19 00:29:48 UTC 2023
Modified Files:
src/sys/arch/vax/include: cpu.h
Log Message:
vax: PR port-vax/55415
Remove VAX-specific workaround to force pre-emption, as it is now
no longer needed.
tested by oster@
To generate a diff of this commit:
cvs rdiff -u -r1.107 -r1.108 src/sys/arch/vax/include/cpu.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55415 CVS commit: [netbsd-10] src/sys/arch/vax/vax
Date: Tue, 19 Dec 2023 12:26:01 +0000
Module Name: src
Committed By: martin
Date: Tue Dec 19 12:26:01 UTC 2023
Modified Files:
src/sys/arch/vax/vax [netbsd-10]: subr.S
Log Message:
Pull up following revision(s) (requested by kalvisd in ticket #508):
sys/arch/vax/vax/subr.S: revision 1.43
vax: preserve AST requests raised when handling software interrupts
PR port-vax/55415
On return from a software interrupt, if the software interrupt LWP
raised an AST request, copy the AST level from its PCB to the PCB
of the interrupted LWP.
Reviewed by <ragge>
To generate a diff of this commit:
cvs rdiff -u -r1.41.2.1 -r1.41.2.2 src/sys/arch/vax/vax/subr.S
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: "Martin Husemann" <martin@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/55415 CVS commit: [netbsd-10] src/sys/arch/vax/include
Date: Tue, 19 Dec 2023 12:28:01 +0000
Module Name: src
Committed By: martin
Date: Tue Dec 19 12:28:01 UTC 2023
Modified Files:
src/sys/arch/vax/include [netbsd-10]: cpu.h
Log Message:
Pull up following revision(s) (requested by kalvisd in ticket #509):
sys/arch/vax/include/cpu.h: revision 1.108
vax: PR port-vax/55415
Remove VAX-specific workaround to force pre-emption, as it is now
no longer needed.
tested by oster@
To generate a diff of this commit:
cvs rdiff -u -r1.106.2.1 -r1.106.2.2 src/sys/arch/vax/include/cpu.h
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
State-Changed-From-To: open->closed
State-Changed-By: oster@NetBSD.org
State-Changed-When: Tue, 19 Dec 2023 14:35:42 +0000
State-Changed-Why:
kalvisd fixed the underlying issue! Thanks!
(Pullups to -10 are done now too.)
>Unformatted: