Every time my code crashes, I rely on core dump files to find out where the crash happened. (See my previous post on how to produce and use a core dump file.) My debugging life had been happy thanks to this approach, until this time. When I loaded the core dump into GDB, I was disappointed to find that the entire stack trace was inside system libraries, with not a single frame from my own code.

TLDR: Check this patch.

Let’s start the journey.

Background

To help my dear readers better understand how my usual postmortem procedure works, take a look at this tiny buggy C++ program.

// compile with:
//   g++ -g -std=c++11 sigsegv.cc -o sigsegv -pthread
#include <thread>
#include <vector>
#include <iostream>

void foo() {
    std::vector<int> v;
    std::cout << v[100] << std::endl;
}

int main() {
    std::thread t(foo);
    t.join();
}

Not surprisingly, this program dies with a segmentation fault. To find out where the crash happened, you can either load the core dump file into GDB if the bug is hard to reproduce, or simply rerun the program under GDB otherwise. Let’s just run it in GDB:

$ gdb ./sigsegv
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Reading symbols from ./sigsegv...done.
(gdb) r
Starting program: /tmp/sigsegv
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6f4e700 (LWP 68189)]

Thread 2 "sigsegv" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6f4e700 (LWP 68189)]
0x0000000000400f5d in foo () at sigsegv.cc:8
8	    std::cout << v[100] << std::endl;

(gdb) bt
#0  0x0000000000400f5d in foo () at sigsegv.cc:8
#1  0x00000000004027dd in std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (this=0x617c48)
    at /usr/include/c++/5/functional:1531
#2  0x0000000000402736 in std::_Bind_simple<void (*())()>::operator()() (this=0x617c48)
    at /usr/include/c++/5/functional:1520
#3  0x00000000004026c6 in std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (this=0x617c30)
    at /usr/include/c++/5/thread:115
#4  0x00007ffff7b0dc80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff76296ba in start_thread (arg=0x7ffff6f4e700) at pthread_create.c:333
#6  0x00007ffff735f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

As you can see, GDB is able to show the exact line number of the crash scene, as usual.

So far so good. But this time, my code uses vector::at to access elements with bounds checking. It throws a std::out_of_range exception if the index is out of range.

// compile with:
//   g++ -g -std=c++11 exception.cc -o exception -pthread
#include <thread>
#include <vector>
#include <iostream>

void foo() {
    std::vector<int> v;
    std::cout << v.at(100) << std::endl;
}

int main() {
    std::thread t(foo);
    t.join();
}

It seems to be a safer practice to use at over operator[]. However, GDB just won’t show me where the crash happens:

$ gdb ./exception
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Reading symbols from ./exception...done.
(gdb) r
Starting program: /tmp/exception
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6f4e700 (LWP 68143)]
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 100) >= this->size() (which is 0)

Thread 2 "exception" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff6f4e700 (LWP 68143)]
0x00007ffff728d428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

(gdb) bt
#0  0x00007ffff728d428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff728f02a in __GI_abort () at abort.c:89
#2  0x00007ffff7ae484d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff7ae26b6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff7ae2701 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff7b0dd38 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff76296ba in start_thread (arg=0x7ffff6f4e700) at pthread_create.c:333
#7  0x00007ffff735f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Neat! The program told us, as its last words, that vector threw a std::out_of_range exception. I’m so moved. But how do I know where that happened?

Look at the stack trace: there is not a single frame from my code base. Yes, you can find the bug simply by staring at the code above. But this issue actually happened in a project of mine that has 10k lines of C++ code. I really need GDB to tell me the line number.

Now you see the problem.

Bug in some system library?

Suddenly, Niel, whose seat is next to me, told me that there could be some bugs in the underlying libraries.

To be honest, I usually don’t believe that my problems are caused by bugs in the compiler, the operating system, or the standard library. It’s just unlikely, since they are so widely used.

But Niel said that he ran into bugs in those libraries before, and he was willing to help me look into this problem, plus he is super chill. So we started to investigate this problem together.

Recover the ??

Staring at those ?? doesn’t help, so I decided to recover their real names. I thought I knew Ubuntu well enough that I naturally typed sudo apt install libstdc++-gdb. Sadly, that package doesn’t exist. It took me a while to figure out that the correct package for the debugging symbols is called libstdc++6-5-dbg, where 6 corresponds to libstdc++.so.6 and 5 is for GCC 5.4, as I’m using Ubuntu 16.04.

Now that we have the debugging symbols, GDB gives us more clues:

$ gdb ./exception
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Reading symbols from ./exception...done.
(gdb) r
Starting program: /tmp/exception
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6f4e700 (LWP 68314)]
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 100) >= this->size() (which is 0)

Thread 2 "exception" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff6f4e700 (LWP 68314)]
0x00007ffff728d428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

(gdb) bt
#0  0x00007ffff728d428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff728f02a in __GI_abort () at abort.c:89
#2  0x00007ffff7ae484d in __gnu_cxx::__verbose_terminate_handler ()
    at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00007ffff7ae26b6 in __cxxabiv1::__terminate (handler=<optimized out>)
    at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:47
#4  0x00007ffff7ae2701 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:57
#5  0x00007ffff7b0dd38 in std::execute_native_thread_routine (__p=<optimized out>)
    at ../../../../../src/libstdc++-v3/src/c++11/thread.cc:92
#6  0x00007ffff76296ba in start_thread (arg=0x7ffff6f4e700) at pthread_create.c:333
#7  0x00007ffff735f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Glibc

We decided to dive into each stack frame from the bottom up. The clone() frame doesn’t seem interesting, so we skipped it. Next we wanted to check pthread_create.c:333. After some searching, I realized that it is in glibc. But which version of glibc am I using? My idea was to run ldd first, to find out where the .so file is.

$ ldd ./exception
	linux-vdso.so.1 =>  (0x00007ffc77f54000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f23ae730000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f23ae51a000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f23ae150000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f23ade47000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f23aeab2000)

Now we know where the .so is, but what is the version?

$ ls -la /lib/x86_64-linux-gnu/libc.so.6
lrwxrwxrwx 1 root root 12 Mar  4 18:36 /lib/x86_64-linux-gnu/libc.so.6 -> libc-2.23.so

OK, now we can check pthread_create.c:333 in the glibc 2.23 source code:

THREAD_SETMEM (pd, result, CALL_THREAD_FCT (pd));  // pthread_create.c:333

But I want to know what CALL_THREAD_FCT does. It seems like a macro. I need to find its definition:

$ grep '#define CALL_THREAD_FCT' -r glibc-2.23
glibc-2.23/sysdeps/i386/nptl/tls.h:#define CALL_THREAD_FCT(descr) \

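I got lucky that it is indeed defined with #define CALL_THREAD_FCT, but that hit is for i386, which is not the architecture I’m using. (The x86_64 definition is written # define, with a space after the #, which is presumably why my grep pattern missed it.) I got lucky again when I guessed that it would be at glibc-2.23/sysdeps/x86_64/nptl/tls.h: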

# define CALL_THREAD_FCT(descr) \
  ({ void *__res;                                                             \
     asm volatile ("movq %%fs:%P2, %%rdi\n\t"                                 \
                   "callq *%%fs:%P1"                                          \
                   : "=a" (__res)                                             \
                   : "i" (offsetof (struct pthread, start_routine)),          \
                     "i" (offsetof (struct pthread, arg))                     \
                   : "di", "si", "cx", "dx", "r8", "r9", "r10", "r11",        \
                     "memory", "cc");                                         \
     __res; })

I don’t know assembly language, but it seems to me that this loads arg into the first-argument register and calls start_routine, that is, it calls start_routine with arg as its argument. It doesn’t seem very interesting either.

We decided to move on to the next frame.

libstdc++

So now we need the source code of libstdc++. I realized that libstdc++ is part of GCC. So, we need the source code of GCC 5.4. Let’s look at ../../../../../src/libstdc++-v3/src/c++11/thread.cc:92.

extern "C"
{
  static void*
  execute_native_thread_routine(void* __p)
  {
    thread::_Impl_base* __t = static_cast<thread::_Impl_base*>(__p);
    thread::__shared_base_type __local;
    __local.swap(__t->_M_this_ptr);

    __try
      {
        __t->_M_run();
      }
    __catch(const __cxxabiv1::__forced_unwind&)
      {
        __throw_exception_again;
      }
    __catch(...)
      {
        std::terminate();  // line 92
      }

    return nullptr;
  }
} // extern "C"

I was so shocked when I opened this file. Why would libstdc++ want to catch all exceptions?! Please, don’t! Just let it crash!

This code explains a lot. My code must be running inside the try block. When it throws an exception, the exception is caught by the catch-all block, which calls std::terminate() on line 92. And by the time control flow enters the catch block, the stack has already been unwound, erasing all the information that could have helped me debug.

Bug Report

This seems to me a real bug in libstdc++. I searched and found that someone had reported it as Bug #55917 in 2013, but it didn’t get fixed until GCC 8. The patch simply removes the try-catch block and lets the user code crash.

Upgrade to GCC 8

Now that we know the bug has been fixed in GCC 8, we can recompile our program with GCC 8. Because Ubuntu 16.04 doesn’t include GCC 8 in its default repositories, I have to use the ubuntu-toolchain-r/test PPA:

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-8

Now let’s recompile the buggy code and run GDB:

$ g++-8 -g -std=c++11 exception.cc -o exception -pthread
$ gdb ./exception
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Reading symbols from ./exception...done.
(gdb) r
Starting program: /tmp/exception
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6f42700 (LWP 69463)]
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 100) >= this->size() (which is 0)

Thread 2 "exception" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff6f42700 (LWP 69463)]
0x00007ffff7281428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

(gdb) bt
#0  0x00007ffff7281428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff728302a in __GI_abort () at abort.c:89
#2  0x00007ffff7ad78f7 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff7adda46 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff7adda81 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff7addcb4 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7ad97f5 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000000000401274 in std::vector<int, std::allocator<int> >::_M_range_check (this=0x7ffff6f41e00, __n=100)
    at /usr/include/c++/8/bits/stl_vector.h:960
#8  0x0000000000401033 in std::vector<int, std::allocator<int> >::at (this=0x7ffff6f41e00, __n=100)
    at /usr/include/c++/8/bits/stl_vector.h:981
#9  0x0000000000400dd7 in foo () at exception.cc:8
#10 0x00000000004013a7 in std::__invoke_impl<void, void (*)()>(std::__invoke_other, void (*&&)()) (
    __f=<unknown type in /tmp/exception, CU 0x0, DIE 0x6a01>) at /usr/include/c++/8/bits/invoke.h:60
#11 0x0000000000401093 in std::__invoke<void (*)()>(void (*&&)()) (__fn=<unknown type in /tmp/exception, CU 0x0, DIE 0x6e68>)
    at /usr/include/c++/8/bits/invoke.h:95
#12 0x00000000004019da in std::thread::_Invoker<std::tuple<void (*)()> >::_M_invoke<0ul> (this=0x615c28)
    at /usr/include/c++/8/thread:234
#13 0x000000000040199b in std::thread::_Invoker<std::tuple<void (*)()> >::operator() (this=0x615c28)
    at /usr/include/c++/8/thread:243
#14 0x0000000000401970 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)()> > >::_M_run (this=0x615c20)
    at /usr/include/c++/8/thread:186
#15 0x00007ffff7b0857f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#16 0x00007ffff761d6ba in start_thread (arg=0x7ffff6f42700) at pthread_create.c:333
#17 0x00007ffff735341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Look at Frame 9. It’s our code! It works!

The Lesson

Although it’s unlikely that the underlying software and libraries are buggy enough to affect an ordinary programmer’s life, it did happen, and it will probably happen again. So, don’t be afraid to challenge them.

And thanks to Niel. I wouldn’t have dived this deep into the problem without his help.