NetBSD Problem Report #50939
From kivinen@fireball.acr.fi Fri Mar 11 14:17:49 2016
Return-Path: <kivinen@fireball.acr.fi>
Received: from mail.netbsd.org (mail.netbsd.org [199.233.217.200])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(Client CN "mail.netbsd.org", Issuer "Postmaster NetBSD.org" (verified OK))
by mollari.NetBSD.org (Postfix) with ESMTPS id CEE9D7ABE5
for <gnats-bugs@gnats.NetBSD.org>; Fri, 11 Mar 2016 14:17:49 +0000 (UTC)
Message-Id: <201603111254.u2BCsClJ009422@fireball.acr.fi>
Date: Fri, 11 Mar 2016 14:54:12 +0200 (EET)
From: kivinen@iki.fi
Reply-To: kivinen@iki.fi
To: gnats-bugs@NetBSD.org
Subject: Bug in GCC optimization causing i386 net-snmpd to crash
X-Send-Pr-Version: 3.95
>Number: 50939
>Category: pkg
>Synopsis: snmpd crashes when compiled with gcc -O2
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: adam
>State: closed
>Class: sw-bug
>Submitter-Id: net
>Arrival-Date: Fri Mar 11 14:20:01 +0000 2016
>Closed-Date: Fri Mar 05 04:53:32 +0000 2021
>Last-Modified: Fri Mar 05 04:53:32 +0000 2021
>Originator: Tero Kivinen
>Release: NetBSD 7.0_STABLE
>Organization:
IKI ry
>Environment:
System: NetBSD seuraava.iki.fi 7.0_STABLE NetBSD 7.0_STABLE (GENERIC) #0: Sat Mar 5 20:05:29 EET 2016 kivinen@seuraava.iki.fi:/usr/obj/sys/arch/i386/compile/GENERIC i386
Architecture: i386
Machine: i386
>Description:
I have a machine running net-snmpd, and as soon as the
monitoring script connects to snmpd and tries to read CPU
statistics, the code in
net-snmp-5.7.3/agent/mibgroup/hardware/cpu/cpu_sysctl.c
crashes. If net-snmpd is compiled without optimizations it
does not crash. This happens only on the i386 architecture; it
does not appear on amd64.
Before the crash the system prints an error message to
syslog:
sysctl vm.vm_meter failed (errno 0)
Debugging with gdb shows that execution enters
netsnmp_cpu_arch_load and makes the first few calls normally,
i.e. the cpu_stats call (line 200) etc., and then the
mem_mib call (line 218). But before storing the
mem_stats output into the cpu->* structure (line 220), it goes
on to run the NetBSD-specific code that reads kern.cp_time
(line 233 onward). After that it jumps back to
check the error status of the mem_mib call (line 219) and
prints the error message about sysctl vm.vm_meter
failing, even though the call actually succeeded. It then tries
to store the data into the cpu->* structure (line 220), but the
cpu variable has been trashed by this point (its value is
0x77), and this causes the crash.
>How-To-Repeat:
Install NetBSD 7.0 from CVS on an i386 machine. Install
/usr/pkgsrc/net/net-snmp; snmpd will crash
as soon as it calls netsnmp_cpu_arch_load.
I.e. start snmpd:
/etc/rc.d/snmpd start
On our system it crashed in less than a minute.
>Fix:
cd /usr/pkgsrc/net/net-snmp
make configure
<edit all Makefiles, and remove -O2 from the CFLAGS>
make install
/etc/rc.d/snmpd start
>Release-Note:
>Audit-Trail:
Responsible-Changed-From-To: toolchain-manager->adam
Responsible-Changed-By: maya@NetBSD.org
Responsible-Changed-When: Thu, 29 Sep 2016 21:19:02 +0000
Responsible-Changed-Why:
Over to maintainer
From: David Holland <dholland-pbugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: toolchain/50939: Bug in GCC optimization causing i386 net-snmpd
to crash
Date: Fri, 30 Sep 2016 06:41:16 +0000
On Fri, Mar 11, 2016 at 02:20:01PM +0000, kivinen@iki.fi wrote:
> Using gdb to debug the code it seems it starts executing
> netsnmp_cpu_arch_load, and does the first few calls normally,
> i.e. the cpu_stats call (line 200) etc, and then does the
> mem_mib call (line 218), but before actually storing the
> mem_stats output to the cpu->* structure (at line 220) it goes
> on and runs the NetBSD specific code reading kern.cp_time
> (line 233 forward) and after that is done it jumps back to
> check the error status of the mem_mib call (at line 219), thus
> printing out error message about the sysctl vm.vm_meter
> failing (even when it actually did succeed), and then it tries
> to store the data to cpu->* structure (at line 220), but as
> cpu variable has been trashed at this point, it has value of
> 0x77 and this will cause crash.
This sounds like it is overwriting its stack, probably in the mem_mib
call. Then when it returns from the mem_mib call it manages to go to
the wrong place. Can you check in the debugger if this is the case?
What gets trashed if you overwrite the stack can depend heavily on
compiler optimizations, so it's not necessarily a gcc bug.
I don't see anything obviously wrong with the code, but that isn't
conclusive.
Also, is this happening on real i386, or in a 32-bit chroot on an
amd64? Might also be a problem with the compat32 sysctl().
--
David A. Holland
dholland@netbsd.org
From: "Gavan Fantom" <gavan@netbsd.org>
To: gnats-bugs@gnats.NetBSD.org
Cc:
Subject: PR/50939 CVS commit: pkgsrc/net/net-snmp
Date: Fri, 6 Oct 2017 02:39:38 +0000
Module Name: pkgsrc
Committed By: gavan
Date: Fri Oct 6 02:39:38 UTC 2017
Modified Files:
pkgsrc/net/net-snmp: Makefile distinfo
pkgsrc/net/net-snmp/patches:
patch-agent_mibgroup_hardware_cpu_cpu__sysctl.c
Log Message:
net-snmp: Prevent crash on NetBSD/i386
A compiler bug causes incorrect compilation of the NetBSD-specific
code in cpu_sysctl.c. This results in a crash shortly after startup if
the machine has 2 or more CPUs.
Disable optimisation in netsnmp_cpu_arch_load() only.
This works around the problem reported in PR pkg/50939.
To generate a diff of this commit:
cvs rdiff -u -r1.120 -r1.121 pkgsrc/net/net-snmp/Makefile
cvs rdiff -u -r1.90 -r1.91 pkgsrc/net/net-snmp/distinfo
cvs rdiff -u -r1.6 -r1.7 \
pkgsrc/net/net-snmp/patches/patch-agent_mibgroup_hardware_cpu_cpu__sysctl.c
Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
From: Gavan Fantom <gavan@coolfactor.org>
To: gnats-bugs@netbsd.org, tech-toolchain@netbsd.org
Cc: dholland-pbugs@netbsd.org, maya@NetBSD.org, kivinen@iki.fi,
adam@netbsd.org
Subject: Re: pkg/50939: Bug in GCC optimization causing i386 net-snmpd to
crash
Date: Fri, 6 Oct 2017 02:14:39 +0100
Some time ago, David Holland wrote:
> This sounds like it is overwriting its stack, probably in the mem_mib
> call. Then when it returns from the mem_mib call it manages to go to
> the wrong place. Can you check in the debugger if this is the case?
>
> What gets trashed if you overwrite the stack can depend heavily on
> compiler optimizations, so it's not necessarily a gcc bug.
>
> I don't see anything obviously wrong with the code, but that isn't
> conclusive.
>
> Also, is this happening on real i386, or in a 32-bit chroot on an
> amd64? Might also be a problem with the compat32 sysctl().
I have reproduced this on NetBSD 7.1 on a real i386 machine.
The problem appears to be a compiler bug. Consider the following code,
from the middle of netsnmp_cpu_arch_load:
for (i = 0; i < cpu_num; i++) {
    netsnmp_cpu_info *ncpu = netsnmp_cpu_get_byIdx( i, 1 );
    size_t j = i * CPUSTATES;
    ncpu->user_ticks = (unsigned long long)ncpu_stats[j + CP_USER];
    ncpu->nice_ticks = (unsigned long long)ncpu_stats[j + CP_NICE];
    ncpu->sys2_ticks = (unsigned long long)ncpu_stats[j + CP_SYS]+cpu_stats[j + CP_INTR];
    ncpu->kern_ticks = (unsigned long long)ncpu_stats[j + CP_SYS];
    ncpu->idle_ticks = (unsigned long long)ncpu_stats[j + CP_IDLE];
    ncpu->intrpt_ticks = (unsigned long long)ncpu_stats[j + CP_INTR];
}
This is translated into the following block of code (disassembled by
gdb). The block is entered via a conditional branch from elsewhere, if
cpu_num > 0.
0xbba64c88 <+1039>: movl $0x1,0x4(%esp)
0xbba64c90 <+1047>: movl $0x0,(%esp)
0xbba64c97 <+1054>: call 0xbba09460 <netsnmp_cpu_get_byIdx@plt>
0xbba64c9c <+1059>: mov (%edi),%edx
0xbba64c9e <+1061>: mov 0x4(%edi),%ecx
0xbba64ca1 <+1064>: mov %edx,0x2008(%eax)
0xbba64ca7 <+1070>: mov %ecx,0x200c(%eax)
0xbba64cad <+1076>: mov 0x8(%edi),%edx
0xbba64cb0 <+1079>: mov 0xc(%edi),%ecx
0xbba64cb3 <+1082>: mov %edx,0x2010(%eax)
0xbba64cb9 <+1088>: mov %ecx,0x2014(%eax)
0xbba64cbf <+1094>: mov 0x10(%edi),%edx
0xbba64cc2 <+1097>: mov 0x14(%edi),%ecx
0xbba64cc5 <+1100>: add 0x54(%esp),%edx
0xbba64cc9 <+1104>: adc 0x58(%esp),%ecx
0xbba64ccd <+1108>: mov %edx,0x2068(%eax)
0xbba64cd3 <+1114>: mov %ecx,0x206c(%eax)
0xbba64cd9 <+1120>: mov 0x10(%edi),%edx
0xbba64cdc <+1123>: mov 0x14(%edi),%ecx
0xbba64cdf <+1126>: mov %edx,0x2030(%eax)
0xbba64ce5 <+1132>: mov %ecx,0x2034(%eax)
0xbba64ceb <+1138>: mov 0x20(%edi),%edx
0xbba64cee <+1141>: mov 0x24(%edi),%ecx
0xbba64cf1 <+1144>: mov %edx,0x2020(%eax)
0xbba64cf7 <+1150>: mov %ecx,0x2024(%eax)
0xbba64cfd <+1156>: mov 0x18(%edi),%edx
0xbba64d00 <+1159>: mov 0x1c(%edi),%ecx
0xbba64d03 <+1162>: mov %edx,0x2038(%eax)
0xbba64d09 <+1168>: mov %ecx,0x203c(%eax)
0xbba64d0f <+1174>: mov -0x258(%ebx),%eax
0xbba64d15 <+1180>: mov (%eax),%eax
0xbba64d17 <+1182>: cmp $0x1,%eax
0xbba64d1a <+1185>: jle 0xbba64ace <netsnmp_cpu_arch_load+597>
0xbba64d20 <+1191>: movl $0x1,0x4(%esp)
0xbba64d28 <+1199>: movl $0x1,(%esp)
0xbba64d2f <+1206>: call 0xbba09460 <netsnmp_cpu_get_byIdx@plt>
0xbba64d34 <+1211>: mov 0x28(%edi),%edx
0xbba64d37 <+1214>: mov 0x2c(%edi),%ecx
0xbba64d3a <+1217>: mov %edx,0x2008(%eax)
0xbba64d40 <+1223>: mov %ecx,0x200c(%eax)
0xbba64d46 <+1229>: mov 0x30(%edi),%esi
0xbba64d49 <+1232>: mov 0x34(%edi),%edi
0xbba64d4c <+1235>: mov %esi,0x2010(%eax)
0xbba64d52 <+1241>: mov %edi,0x2014(%eax)
The branch to 0xbba64ace is a branch back to continue the normal
execution of the code, where free(...) is called and life carries on.
Note that the compiler appears to have partially unrolled the loop, but
this is the end of that block of code. The next block of code happens to
be the cleanup code for sysctl(mem_mib, ...) failing, which logs "sysctl
vm.vm_meter failed". This appears to be purely coincidental; the
real failure here is that execution just falls off the end of this
half-finished loop unrolling.
0xbba64d58 <+1247>: call 0xbba0abf0 <__errno@plt>
0xbba64d5d <+1252>: mov (%eax),%eax
0xbba64d5f <+1254>: mov %eax,0x8(%esp)
0xbba64d63 <+1258>: lea -0x41e78(%ebx),%eax
0xbba64d69 <+1264>: mov %eax,0x4(%esp)
0xbba64d6d <+1268>: movl $0x3,(%esp)
0xbba64d74 <+1275>: call 0xbba0af70 <snmp_log@plt>
0xbba64d79 <+1280>: jmp 0xbba649cd <netsnmp_cpu_arch_load+340>
It does look like a machine with only one CPU would be spared this fate
as it would exit the loop after the first iteration and not try to
execute the second, incomplete, iteration. This problem should be
reproducible on any NetBSD/i386 machine with at least 2 CPUs.
Obviously in the short term, the package will need to work around this
by disabling optimisation, but this is clearly something the compiler is
getting wrong.
From: David Holland <dholland-pbugs@netbsd.org>
To: gnats-bugs@NetBSD.org
Cc:
Subject: Re: pkg/50939: Bug in GCC optimization causing i386 net-snmpd to
crash
Date: Fri, 6 Oct 2017 11:02:49 +0000
On Fri, Oct 06, 2017 at 04:45:00AM +0000, Gavan Fantom wrote:
> Note that the compiler appears to have partially unrolled the loop. But
> this is the end of that block of code. The next block of code happens to
> be the cleanup code sysctl(mem_mib, ...) failing, which logs "sysctl
> vm.vm_meter failed". This appears to be purely coincidental, and the
> real failure here is that execution just falls off the end of this
> half-finished loop unrolling.
That's... creative of it. :-/
--
David A. Holland
dholland@netbsd.org
State-Changed-From-To: open->closed
State-Changed-By: dholland@NetBSD.org
State-Changed-When: Fri, 05 Mar 2021 04:53:32 +0000
State-Changed-Why:
compiler bug, we're not going to be fixing it and there's been a workaround
in pkgsrc since 2017.
>Unformatted:
Copyright © 1994-2020
The NetBSD Foundation, Inc. ALL RIGHTS RESERVED.