* * * * *
                                        
         “You can't fix the bug until you have filled out form 27B/6.”
                                        
I swear, I just cannot adjust to the new development process. Last month, as
I was starting work on “Project: Bradenburg [1],” I ended up using GCC (GNU's
Not Unix Compiler Collection) 11, and just on a whim, I decided to try GCC 11
on “Project: Lumbergh [2]” and that's when I found an issue—“Project:
Lumbergh” crashed immediately. It took about ten minutes to track down the
issue—undefined behavior! Effectively, “Project: Lumbergh” was taking a
pointer to a variable in an inner scope and using it outside said scope
(don't worry, there's an example of what I'm talking below). I duly recorded
a bug in Jira [3], but did not check in the patch as the other developer, CZ,
was doing a bunch of work on “Project: Lumbergh” and I wanted to run the bug
by him first just so he knew about it. In retrospect, I think I should have
just checked in the patch and asked forgiveness, because the ensuing
“process” nearly killed me.

First off, the bug basically languished due to the holliday season and
everybody taking vacation and what not. When I brought up the bug on Monday,
CZ demanded a test case to show it failing. Problem—there was no reason to
create a specific test case, because any test would have shown the crash.
Second problem—it's undefined behavior—it only shows (as far as I know) when
compiled with GCC 11. We don't use GCC 11 in production, and the GCC version
we use in production produces code that runs, thus not finding the bug any
ealier.

CZ tried using valgrind [4] to locate the issue and couldn't. Of course CZ
couldn't—the executable worked. As I tried to state, it only fails when
compiled with GCC 11 (which we as a company don't use yet).

I spent the week trying to convince him to just accept the patch because it
removed undefined behavior. CZ was adamant in trying to reproduce the issue
for himself. Because with our Corporate Overlords, all bugs must be
reproducable, must have a test case, and must pass said test case, before it
will be patched.

Even in the case of blatant undefined behavior!

**ARE YOU XXXXXXX KIDDING ME? IS OUR CODE SO XXXXXXX FRAGILE THAT MOVING
THREE VARIABLES DECLARATIONS WILL XXXXXXX BREAK THE PROGRAM? IS THAT WHAT YOU
ARE SO XXXXXXX AFRAID OF? XXXXXXXXXXXXXXXXXXXX!** 

[Calm down! Breath! In … out … in … out. —Editor] 

Sigh.  

Eventually, I was given the go-ahead to submit the patch. Here was my
revision notes:

> Bug fix XXXXXX—remove undefined behavior
> 
> This was found using GCC 11, and it's very hard to reproduce, since the
> code invoked undefined behavior by storing pointers to data in a scope that
> technically no longer exists when it's referenced.
> 
> The reason this is hard to find depends upon how the compiler lays out the
> stack frame for the given function. It can, for instance, collect all the
> variable created at all the scopes in the function and set aside space for
> all of them at once, which in that case the code will work without issue.
> Or it could collect all the variables from all the scopes, and create just
> enough space for all the variables, allowing the variables from one scope
> to reuse memory for another scope, in which case, it may work, or not,
> depending upon intervening scopes. Or, the compiler can create the scoped
> variables when the code enters the scope, in which case the code may lead
> to undeterministic behavior, depending upon the code following the scope.
> 
> An example:
> 
> -----[ C ]-----
> struct bar
> {
>   char b[MAX];
> };
> 
> struct foo
> {
>   char        f[MAX];
>   struct bar *sub;
> };
> 
> static int foo(A a, B b, C c)
> {
>   struct foo f;	  
>   bool       flag;
>   
>   flag = to_display(f.f,sizeof(f.f),a);
>   if (flag)
>   {
>     struct bar b;
>     
>     b.b = to_other_display(b.b,sizeof(b.b),b,c);
>     f.sub = &b;
>   }
>   else
>     f.sub = NULL;
>     
>   if (check_this())
>   {
>     int x = tointeger(b);
>     char buf[MAX];
>     
>     to_c_display(buf,sizeof(buf),c); 
>     syslog(LOG_DEBUG,"check=%d c=%s",x,buf);
>   }
>     
>   do_something(&f);
>   return 0;
> }
> -----[ END OF LINE ]-----
> 
> In this sample function, compiler A may collect all the local variables
> into a single block at the start of the function:
> 
> -----[ C ]-----
> struct foo f;
> bool       flag;
> struct bar b;
> int        x;
> char       buf[MAX];
> -----[ END OF LINE ]-----
> 
> and even though this provokes undefined behavior, it will still work, and
> valgrind won't find an issue. Compiler B might create space for all the
> variables, but could reuse the space for the two sub blocks:
> 
> -----[ C ]-----
> struct foo f;
> bool       flag;
> union
> {
>   struct bar b;
>   struct { int x ; char buf[MAX]; }
> };
> -----[ END OF LINE ]-----
> 
> (not valid C syntax, but it should give the intent I'm going for). In this
> case, the space pointed to by b could be overwritten if check_this()
> returns true. Compiler C may only create the space as needed, so the
> resulting assembly code may look something like:
> 
> -----[ Assembly ]-----
> foo	
> 	push	rbp
> 	mov	rbp,rsp
> 	sub	rsp,sizeof(struct foo) + sizeof(flag) + padding
> 	…
> 	cmp	al,[rbp+flag]
> 	jz	foo_skip
> 
> 	sub	rsp,sizeof(struct bar) + padding
> 	…
> 	lea	rbx,[rbp - f];
> 	lea	rdx,[rbp - b];
> 	mov	[rbx + sub],rdx
> 	…
> 	add	rsp,sizeof(struct bar) + padding
> 
> foo_skip:
> 	…
> 	call	check_this
> 	tst	al
> 	jz	foo_skip2
> 
> 	sub	rsp,sizeof(int) + sizeof(buf) + padding
> 	…
> 	add	rsp,sizeof(int) + sizeof(buf) + padding
> 
> foo_skip2:
> 
> 	call	do_something
> -----[ END OF LINE ]-----
> 
> And in this case, if check_this() even returns false, data in struct bar
> could be overwritten just with the call to check_this() itself.
> 
> This is the very definition of “undefined behavior” and in this case, it's
> might have been hard to find, even with tools like valgrind, since it
> really depends upon the code generated by the compiler. I think we're lucky
> in that GCC 11 layed out the code such that it crashed and thus, I was able
> to isolate the issue long before it becomes important.
> 

[1] gopher://gopher.conman.org/0Phlog:2021/12/10.1
[2] gopher://gopher.conman.org/0Phlog:2018/09/11.2
[3] https://www.atlassian.com/software/jira
[4] https://valgrind.org/

Email author at sean@conman.org