out-of-control GC

Discussion:

out-of-control GC

Linas Vepstas

2017-09-09 19:47:24 UTC

I've been experiencing problems with guile GC for a while now, that I've
mostly been able to ignore, but that are becoming debilitating, and I need
help, or need to find some solution.

Background: I have a C++ library that has guile bindings; it is wrapped up
as a guile module. It manages a lot of data, anywhere from a few gigabytes,
to a over a hundred gigabytes. Almost all this RAM usage should be
invisible, opaque to guile: so, for example, while my library is managing
5 or 10 GBytes, guile itself is managing maybe 20 or 50 or 100MBytes,
according to (gc-stats). This is normal and expected. Some of my programs
have guile use considerably more than this, in the multi-GByte range

However, the larger the RAM usage, the more frequently guile GC seems to
run, and the longer it takes, so that by the time my dataset is 100GBytes,
guile is using maybe 1GBytes, and is spending 99% of its time in GC. My app
slows to a crawl, and nothing gets done. I need to get GC under control,
because it is running pointlessly, because its clear that (almost) nothing
can be reclaimed, (usually) nothing is being reclaimed.

Here's my tool:

; Report the average amount of time spent in GC
(define avg-gc-cpu-time
(let ((last-gc (gc-stats))
(start-time (get-internal-real-time))
(run-time (get-internal-run-time)))
(lambda ()
(define now (get-internal-real-time))
(define run (get-internal-run-time))
(define cur (gc-stats))
(define gc-time-taken (* 1.0e-9 (- (cdar cur) (cdar last-gc))))
(define elapsed-time (* 1.0e-9 (- now start-time)))
(define cpu-time (* 1.0e-9 (- run run-time)))
(format #t "Elapsed time: ~5f secs. GC: ~5f% cpu-usage: ~5f%\n"
elapsed-time
(* 100 (/ gc-time-taken elapsed-time))
(* 100 (/ cpu-time elapsed-time))
)
(set! last-gc cur)
(set! start-time now)
(set! run-time run))))

Here's what it prints:

guile-en-r> (avg-gc-cpu-time)
Elapsed time: 46.03 secs. GC: 342.8% cpu-usage: 346.4%

guile-en-r> (avg-gc-cpu-time)
Elapsed time: 7288. secs. GC: 329.5% cpu-usage: 330.9%

guile-en-r> (avg-gc-cpu-time)
Elapsed time: 3205. secs. GC: 318.0% cpu-usage: 319.4%

Here's a typical gc from the start/middle of this.

(gc-stats)

((gc-time-taken . 634825386689) (heap-size . 16199680) (heap-free-size .
1925120) (heap-total-allocated . 15424779760) (heap-allocated-since-gc .
7692512) (protected-objects . 7) (gc-times . 3786))

In this case, the heap is 16MBytes, little or no work is being done in
guile, almost all is done in the C++ code. GC has taken 634 seconds ... a
lot but acceptable. After this point, the number of GC's skyrockets -- a
gc ever few seconds, sometimes tens of them per second. It sucks up cpu
time, and makes the guile prompt be very laggy. I'd give more examples,
except that now, my app has moved to the stage where it really is creating

(gc-stats)

((gc-time-taken . 281465535649189) (heap-size . 5175664640) (heap-free-size
. 1538387968) (heap-total-allocated . 67649385200) (heap-allocated-since-gc
. 6112) (protected-objects . 7) (gc-times . 11319))

That's right: 5GBytes of scheme stuff, and 281465 seconds = 78 cpu-hours
spent in GC. top shows 116GBytes total in the app, which is normal and
expected.

When my app is about 50GBytes in size, it will run for about 12 hours, with
an outrageous amount of time spent in GC, which I just ignore out of habit,
because I can wait 12 hours. This one, a bigger dataset, appears to be
stalled -- after 24 hours, less than 1% of the computations have been
completed. I'm screwed until I can somehow disable guile GC long enough
for the thing to make forward progress.

I'm sort-of willing to hack on guile internals, except that my project is
already late and I am panicking ...

--linas

--
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *

Linas Vepstas

2017-09-10 04:14:46 UTC