bug-guile
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#19883: Smob's mark_smob has become unreliable in Guile 2.x


From: David Kastrup
Subject: bug#19883: Smob's mark_smob has become unreliable in Guile 2.x
Date: Sun, 01 Mar 2015 11:09:46 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.50 (gnu/linux)

Mark H Weaver <address@hidden> writes:

> Hi David,
>
> Thank you for the minimal test case.  This is enormously helpful.
>
> David Kastrup <address@hidden> writes:
>
>> The symptom likely occurs when an object is collected and consequently
>> its free_smob function gets called.
>
> After some investigation, I believe I've confirmed that this is exactly
> what's happening.
>
> I added a field to the Smob class that is set to 12345678 when the
> object is created (and 21345678 at a later point), and then 87654321
> when free_smob is called.  I then observed that the segfault happens
> when the smob_mark function is called on a smob whose tag field is
> valid, but whose corresponding C++ object has 87654321 in the special
> field, indicating that its smob_free method was already called.  This
> happened with GC_MARKERS=1.

There are a few remedies for the test case that do not necessarily apply
to the full-blown code.

I was able to avoid the crashes here by declaring the mark_smob function
non-virtual.  Declaring it virtual was actually more an oversight than
anything else.  Not routing the scm_mark call through a pointer fetched
from the no longer existent virtual function table will stop the crashes
in the example code.

It's not an option for LilyPond proper, however, since the type lattice
is complex enough that routing some of the marking process through a
virtual member function is not really avoidable.

Another more likely effective patch (which already made it into
LilyPond's code base) _clears_ the SCM data (where the pointer to the
C++ structure is stored) in unregister_ptr.  Since all calls to the
member functions mark_smob (and print_smob accordingly) are routed
through the same mark_trampoline, letting this trampoline check for a
non-null pointer avoids the mark-after-free situation in the general
case.

However, this is still a probabilistic fix.  For one thing, there might
be race conditions inside of GUILE.  For another, the basic premise is
that calling the mark function with garbage must not result in crashes.

Since LilyPond relies on C++ data structures and containers for its
operation and those are also involved in the mark walks, "can be called
on arbitrary garbage" (of which already freed smobs are just one special
case, so catering for them is not sufficient) is not an option that can
be implemented with reasonable amount of effort in the mark_smob
routines of LilyPond.

However, from the GUILE side the mark_smob calls originate from objects
detected to be SCM cells, and GUILE should be able to manage areas
filled with SCM cells such that it can _reliably_ distinguish allocated
from unallocated cells and there is no such thing as random garbage.

When that is the case, it may be sufficient to clear out the tag field
in the course of calling the free_smob hook, avoiding any subsequent
mark calls to occur on a cell that has lost its type consistency anyway.
Again, this might involve race conditions: no idea.  Because of
potential race conditions, a reliable fix needs to be in GUILE or BDWGC
and cannot be done at the application end.  The best we will be able to
achieve there is "crashes almost never randomly".

Which is what we should get in LilyPond with
<URL:https://codereview.appspot.com/197690043> (already committed to the
code base as
<URL:http://git.savannah.gnu.org/cgit/lilypond.git/commit/?id=60eeb4a4ff217e3abbfcb6b4de94f0a6ef6d437d>.

> I believe this is a bug in BDWGC.  gc_mark.h warns:
>
> /* WARNING: Such a mark procedure may be invoked on an unused object    */
> /* residing on a free list.  Such objects are cleared, except for a     */
> /* free list link field in the first word.  Thus mark procedures may    */
> /* not count on the presence of a type descriptor, and must handle this */
> /* case correctly somehow.                                              */
>
> and protect Guile users from this issue by checking that the tag on the
> smob cell (first word) is correct before calling the user-defined
> smob_mark function.  However, in this case, the mark procedure was
> called on a smob cell whose first word was the correct tag and whose
> second word was non-zero and clearly a pointer to an already deleted
> instance of the C++ Smob class.
>
> I think the next step is to remove libguile from this test program, and
> turn it into a minimal demonstration of this apparent BDWGC bug, for
> submission to the BDWGC developers.

I don't know about the interplay of GUILE and BDWGC.  It may be that the
general C freestore operation that BDWGC provides cannot prevent the
"can be called on garbage" effect and functions need to be robust
against that.

However, the specific "garbage" here resulted directly from marking an
already freed smob, and with a frequency that is far above random.  When
that happens, it is extremely unlikely that memory is being either
successfully retained or freed.

And that sounds like a bad idea in BDWGC itself.

> FYI, here are the changes I made to the test program and the results I
> observed.
>
> First, some changes that I think might be needed for GC safety:
>
> --- smobs.hh.ORIG     2015-02-15 13:35:03.000000000 -0500
> +++ smobs.hh  2015-02-28 23:42:06.346803668 -0500
> @@ -270,8 +271,9 @@
>    }
>    SCM unprotected_smobify_self ()
>    {
> -    self_scm_ = Smob_base<Super>::register_ptr (static_cast<Super *> (this));
> -    return self_scm_;
> +    SCM s = Smob_base<Super>::register_ptr (static_cast<Super *> (this));
> +    self_scm_ = s;
> +    return s;

Don't see the intended difference here.  It's conceivable that with
several "volatile" declarations the compiler may generate different code
that would protect against self_scm_ being modified before the function
returns.  Is that supposed to help against non-atomic changes of
self_scm_?

> @@ -279,12 +281,12 @@
>    }
>    SCM unprotect ()
>    {
> -    unprotect_smob (self_scm_, &protection_cons_);
> -    return self_scm_;
> +    SCM s = self_scm_;
> +    unprotect_smob (s, &protection_cons_);
> +    return s;
>    }

I see how this one would be advisable in multi-threaded applications.

>    void smobify_self () {
> -    self_scm_ = unprotected_smobify_self ();
> -    protect ();
> +    protect_smob (unprotected_smobify_self (), &protection_cons_);
>    }
>    SCM self_scm () const { return self_scm_; }
>  };
>
>
> The changes to 'unprotected_smobify_self' and 'smobify_self" are
> probably only needed for a multithreaded program.  The idea is to try to
> ensure that the smob is protected from GC by being on the stack (and not
> merely in the malloc/free heap) until it is protected.
>
> The change to 'unprotect' is along the same lines, and might be needed
> even for single-threaded programs.  I don't know that
> 'scm_gc_unprotect_object' (called from 'unprotect_smob') will guarantee
> not to free the object passed to it if there are no other visible
> references to it.
>
> However, none of these changes fixed the problem here.

I'll put them in as there is no sane reason not to.  However, they
appear to mostly concern low-probability races whereas we still have an
elephant in the room.

> BTW, one other thing to keep in mind: it will never be enough if the
> only GC-visible reference to an object is the C++ Smob object.  There
> must always be a GC-visible reference to the SCM value.

That's why we need the mark functions in LilyPond.  It would be a really
large PITA to have to route everything through SCM since there is a
really large corpus of code and non-SCM data structures involved.

> (gdb) bt
> #0  0x0a2abde3 in ?? ()
> #1  0xb766dc4a in smob_mark (addr=0xa15f6e0, mark_stack_ptr=0x9e5a320, 
> mark_stack_limit=0x9e62000, env=0) at smob.c:335
> #2  0xb758664a in GC_mark_from (mark_stack_top=0x9e5a320, 
> mark_stack=0x9e5a000, mark_stack_limit=0x9e62000) at mark.c:737
> #3  0xb7587c07 in GC_mark_some (cold_gc_frame=0xbf874ed8 
> "pVZ\267M\332W\267\330N\207\277") at mark.c:386
> #4  0xb757da4d in GC_stopped_mark (address@hidden <GC_never_stop_func>) at 
> alloc.c:637
> #5  0xb757e4a9 in GC_try_to_collect_inner (stop_func=0xb757d5b0 
> <GC_never_stop_func>) at alloc.c:456
> #6  0xb757ee00 in GC_collect_or_expand (address@hidden, address@hidden, 
>     address@hidden) at alloc.c:1266
> #7  0xb757effc in GC_allocobj (address@hidden, address@hidden) at alloc.c:1355
> #8  0xb758440b in GC_generic_malloc_inner (address@hidden, address@hidden) at 
> malloc.c:133
> #9  0xb7581e8a in GC_register_finalizer_inner (address@hidden, address@hidden 
> <finalize_smob>, 
>     address@hidden, address@hidden, address@hidden, 
>     address@hidden <GC_null_finalize_mark_proc>) at finalize.c:524
> #10 0xb75821e5 in GC_register_finalizer_no_order (address@hidden, 
> address@hidden <finalize_smob>, 
>     address@hidden, address@hidden, address@hidden) at finalize.c:575
> #11 0xb761bfbb in scm_i_set_finalizer (address@hidden, address@hidden 
> <finalize_smob>, 
>     address@hidden) at finalizers.c:44
> #12 0xb766e1f8 in scm_i_new_smob (tc=7039, data=171214120) at smob.c:416
> #13 0x08049b12 in scm_new_smob (tc=7039, data=171214120)
>     at 
> /gnu/store/3lhr8q28q6f59774di9av7ncy09jd55d-guile-2.0.11/include/guile/2.0/libguile/smob.h:91
> #14 0x0804a8dc in Smob_base<Family>::register_ptr (p=0xa348528) at 
> smobs.tcc:54
> #15 0x0804a281 in Smob<Family>::unprotected_smobify_self (this=0xa34852c) at 
> smobs.hh:274
> #16 0x08049d0a in Smob<Family>::smobify_self (this=0xa34852c) at smobs.hh:289

That's when smobify_self tries to create a smob pointing to the
C++-allocated data structure.  This calls scm_new_smob, and in the
course of that, garbage is apparently collected.  And smob_mark is
called on an already freed C++ structure with a bad virtual table, so
when mark_trampoline tries routing the virtual table call to mark_smob,
it crashes.

With the already checked-in changes and your suggested changes, we'll
likely get the mark-after-free crashes under control.  They still seem
to warrant fixing at the BDWGC level as they, crashes aside, appear to
cause memory to be in some half-released state good for neither
retaining nor reallocation.

But the mark-on-general-garbage problem seems sort of system-immanent to
conservative garbage collection while making the management of C++ data
structures (often relying on some internal consistency outside of user
control) extremely awkward.  Fixing that at the SCM level seems both
desirable and doable and will likely require GUILE to not let garbage
(rather than recognizably unallocated cells) into its managed cell
store.

-- 
David Kastrup





reply via email to

[Prev in Thread] Current Thread [Next in Thread]