Discussion:
out-of-control GC
Linas Vepstas
2017-09-09 19:47:24 UTC
Permalink
I've been experiencing problems with guile GC for a while now, which I've
mostly been able to ignore, but which are becoming debilitating. I need
help, or need to find some solution.

Background: I have a C++ library that has guile bindings; it is wrapped up
as a guile module. It manages a lot of data, anywhere from a few gigabytes
to over a hundred gigabytes. Almost all of this RAM usage should be
invisible, opaque to guile: so, for example, while my library is managing
5 or 10 GBytes, guile itself is managing maybe 20 or 50 or 100 MBytes,
according to (gc-stats). This is normal and expected. Some of my programs
have guile use considerably more than this, in the multi-GByte range.

However, the larger the RAM usage, the more frequently guile GC seems to
run, and the longer it takes, so that by the time my dataset is 100GBytes,
guile is using maybe 1GByte, and is spending 99% of its time in GC. My app
slows to a crawl, and nothing gets done. I need to get GC under control,
because it is running pointlessly: it's clear that (almost) nothing can be
reclaimed, and (usually) nothing is being reclaimed.

Here's my tool:

; Report the average amount of time spent in GC
(define avg-gc-cpu-time
  (let ((last-gc (gc-stats))
        (start-time (get-internal-real-time))
        (run-time (get-internal-run-time)))
    (lambda ()
      (define now (get-internal-real-time))
      (define run (get-internal-run-time))
      (define cur (gc-stats))
      (define gc-time-taken (* 1.0e-9 (- (cdar cur) (cdar last-gc))))
      (define elapsed-time (* 1.0e-9 (- now start-time)))
      (define cpu-time (* 1.0e-9 (- run run-time)))
      (format #t "Elapsed time: ~5f secs. GC: ~5f% cpu-usage: ~5f%\n"
              elapsed-time
              (* 100 (/ gc-time-taken elapsed-time))
              (* 100 (/ cpu-time elapsed-time)))
      (set! last-gc cur)
      (set! start-time now)
      (set! run-time run))))

Here's what it prints:

guile-en-r> (avg-gc-cpu-time)
Elapsed time: 46.03 secs. GC: 342.8% cpu-usage: 346.4%

guile-en-r> (avg-gc-cpu-time)
Elapsed time: 7288. secs. GC: 329.5% cpu-usage: 330.9%

guile-en-r> (avg-gc-cpu-time)
Elapsed time: 3205. secs. GC: 318.0% cpu-usage: 319.4%

Here's a typical gc-stats from the start/middle of this run:
(gc-stats)
((gc-time-taken . 634825386689) (heap-size . 16199680)
 (heap-free-size . 1925120) (heap-total-allocated . 15424779760)
 (heap-allocated-since-gc . 7692512) (protected-objects . 7)
 (gc-times . 3786))

In this case, the heap is 16 MBytes; little or no work is being done in
guile, almost all is done in the C++ code. GC has taken 634 seconds ... a
lot, but acceptable. After this point, the number of GCs skyrockets -- a
gc every few seconds, sometimes tens of them per second. It sucks up cpu
time, and makes the guile prompt very laggy. I'd give more examples, except
that now my app has moved to the stage where it really is creating a lot
of guile-managed data:
(gc-stats)
((gc-time-taken . 281465535649189) (heap-size . 5175664640)
 (heap-free-size . 1538387968) (heap-total-allocated . 67649385200)
 (heap-allocated-since-gc . 6112) (protected-objects . 7)
 (gc-times . 11319))

That's right: 5GBytes of scheme stuff, and 281465 seconds = 78 cpu-hours
spent in GC. top shows 116GBytes total in the app, which is normal and
expected.

When my app is about 50GBytes in size, it will run for about 12 hours, with
an outrageous amount of time spent in GC, which I just ignore out of habit,
because I can wait 12 hours. This one, a bigger dataset, appears to be
stalled -- after 24 hours, less than 1% of the computations have been
completed. I'm screwed until I can somehow disable guile GC long enough
for the thing to make forward progress.

I'm sort-of willing to hack on guile internals, except that my project is
already late and I am panicking ...

--linas
--
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *
Linas Vepstas
2017-09-10 04:14:46 UTC
Permalink
To make this conversation less crazy, and more down-to-earth, here is a
demo that seems to spend most of its time in GC. Hard for me to say that
this is "wrong", right now, but it illustrates the scandalous nature of the
problem.

It's a simple loop, about 6 lines of code total; it mostly just wastes cpu
time. There's also a longer progress-report printer, similar to the
previous one I posted.

It runs at about 1 million iterations/minute, for me, of which 15 seconds
are spent "doing real work", and 45-75 seconds in gc. Note that GC runs in
parallel, usually using about 3-4 cores, so the fraction-cpu-time is the
fraction relative to wall-clock, so that 350% cpu time is typical (i.e. 3.5
cpu cores are running).

This seems to be a reasonable model of the behavior that I am seeing.

--linas

; Report the average amount of time spent in GC
(define avg-gc-cpu-time
  (let ((last-gc (gc-stats))
        (start-time (get-internal-real-time))
        (run-time (get-internal-run-time)))
    (lambda ()
      (define now (get-internal-real-time))
      (define run (get-internal-run-time))
      (define cur (gc-stats))
      (define gc-time-taken (* 1.0e-9 (- (cdar cur) (cdar last-gc))))
      (define elapsed-time (* 1.0e-9 (- now start-time)))
      (define cpu-time (* 1.0e-9 (- run run-time)))
      (define non-gc-cpu (- cpu-time gc-time-taken))
      (define ngc (- (assoc-ref cur 'gc-times)
                     (assoc-ref last-gc 'gc-times)))
      (format #t "Elapsed: ~5f secs. non-gc: ~5f secs Rate: ~5f gc/min %GC: ~5f% %cpu: ~5f%\n"
              elapsed-time
              non-gc-cpu
              (/ (* ngc 60) elapsed-time)
              (* 100 (/ gc-time-taken elapsed-time))
              (* 100 (/ cpu-time elapsed-time)))
      (set! last-gc cur)
      (set! start-time now)
      (set! run-time run))))

; a tail-recursive loop, seems to waste a lot of time in gc.
(define (make-bgl lis cnt)
  ; a bogus longer list
  (define longer (cons (format #f "~A" cnt) lis))
  ; periodically report statistics
  (if (eq? 0 (modulo cnt (* 1000 1000)))
      (begin (format #t "~A " cnt) (avg-gc-cpu-time)))
  ; periodically trim the list
  (if (eq? 0 (modulo cnt (* 4123 1123)))
      (set! longer '()))
  ; loop for a while.
  (if (eq? 0 cnt) lis
      (make-bgl longer (- cnt 1))))

; (define foo (make-bgl '() (* 1235 1000 1000)))
--
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *
Arne Babenhauserheide
2017-09-10 16:50:15 UTC
Permalink
Hi Linas,
Post by Linas Vepstas
To make this conversation less crazy, and more down-to-earth, here is a
demo that seems to spend most of its time in GC. Hard for me to say that
this is "wrong", right now, but it illustrates the scandalous nature of the
problem.
Do you have many string operations in your code?

If I replace

(cons (format #f "~A" cnt) lis)

with

(cons cnt lis)

GC is down to ~20% (from around 130%).

Best wishes,
Arne
--
Unpolitisch sein
heißt politisch sein
ohne es zu merken
David Kastrup
2017-09-10 18:03:57 UTC
Permalink
Post by Arne Babenhauserheide
Hi Linas,
Post by Linas Vepstas
To make this conversation less crazy, and more down-to-earth, here is a
demo that seems to spend most of its time in GC. Hard for me to say that
this is "wrong", right now, but it illustrates the scandalous nature of the
problem.
Do you have many string operations in your code?
If I replace
(cons (format #f "~A" cnt) lis)
with
(cons cnt lis)
GC is down to ~20% (from around 130%).
But the main point of using format (rather than string appends) is to
avoid creating lots of garbage-collected temporary expressions. A
factor of 6 seems to suggest that this expectation is not actually being
accommodated.
--
David Kastrup
Linas Vepstas
2017-09-10 20:23:19 UTC
Permalink
Hi Arne,
Post by Arne Babenhauserheide
Hi Linas,
Post by Linas Vepstas
To make this conversation less crazy, and more down-to-earth, here is a
demo that seems to spend most of its time in GC. Hard for me to say that
this is "wrong", right now, but it illustrates the scandalous nature of the
problem.
Do you have many string operations in your code?
If I replace
(cons (format #f "~A" cnt) lis)
with
(cons cnt lis)
GC is down to ~20% (from around 130%).
That demo tried to capture the spirit of a more complex system. The cons
that I have is actually (cons (wrapper-for-some-c++-code foobar) lis) or
maybe (cons (scheme-that-calls-other-scheme-that-calls-c++ foo) lis) ...
but it's not just cons -- it can also be for-each, map, or fold.

The actual printouts, in my instrumented code, look like this:

guile-en-r> (avg-gc-cpu-time)
Elapsed: 4461. secs. Rate: 4.236 gc/min %cpu-GC: 280.8% %cpu-use: 284.0%

guile-en-r> (avg-gc-cpu-time)
Elapsed: 1501. secs. Rate: 4.077 gc/min %cpu-GC: 280.4% %cpu-use: 283.8%

guile-en-r> (gc-stats)

((gc-time-taken . 353452428102078) (heap-size . 6485340160)
 (heap-free-size . 3497791488) (heap-total-allocated . 100778445248)
 (heap-allocated-since-gc . 9248) (protected-objects . 7)
 (gc-times . 12894))

which suggests that the vast majority of time is spent in GC, and not in
either scheme/guile, or my wrapped c++ code.

4 GCs/minute means one every 15 seconds, which sounds reasonable, except
that RAM usage is so huge that each GC takes many seconds to complete. By
examining heap-free-size before and after, there is almost no change --
i.e. there was very little to collect. Hmm ... perhaps I should monitor
that more closely, to see how true that really is ...
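
Something along these lines would do that monitoring (an untested sketch
that uses only what gc-stats already reports; the name report-gc-yield is
just made up here):

(define (report-gc-yield)
  "Force a collection and report how much heap was actually reclaimed."
  (let ((before (assoc-ref (gc-stats) 'heap-free-size)))
    (gc)   ; run a full collection right now
    (let ((after (assoc-ref (gc-stats) 'heap-free-size)))
      (format #t "heap-free-size: ~a -> ~a (reclaimed ~a bytes)\n"
              before after (- after before)))))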

Anyway, this demo gives a scenario where gc seems to run too often. I have
other scenarios where it does not run often enough -- I have one where the
working set is about 200MBytes, but guile happily grows to 10GB or 20GB RAM
over a few hours. My solution for that other case is to manually GC whenever
the heap-size reported by gc-stats gets above 1GB, because I know a priori
that my working set is smaller than that.
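
That workaround is roughly the following (a sketch; maybe-gc! is a made-up
name, and the limit is whatever you know a priori about your working set):

; collect only when Guile's own heap grows past a limit that is known
; to exceed the real working set
(define (maybe-gc! limit-bytes)
  (when (> (assoc-ref (gc-stats) 'heap-size) limit-bytes)
    (gc)))

; e.g. call (maybe-gc! (* 1024 1024 1024)) from the main processing loop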

The overall message seems to be that gc either runs too often, or not often
enough. Some sort of better planning is needed.

I'm motivated to explore this, as it actually impacts my work.

--linas
--
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *
Marko Rauhamaa
2017-09-10 20:36:20 UTC
Permalink
Post by Linas Vepstas
which suggests that the vast majority of time is spent in GC, and not
in either scheme/guile, or my wrapped c++ code.
That may well be normal.
Post by Linas Vepstas
4 gc's/minute means one every 15 seconds, which sounds reasonable,
except that RAM usage is so huge, that each GC takes many seconds to
complete.
That may or may not be normal. However, it should not be a function of
how much RAM has been allocated outside Guile's heap.
Post by Linas Vepstas
Anyway, this demo gives a scenario where gc seems to run too often.
How often is too often? I would imagine the objective is to minimize the
stop-the-world periods. IOW, you usually want the program to run very
steadily.
Post by Linas Vepstas
I have other scenarios where it does not run often enough -- I have
one where the working set is about 200MBytes, but guile happily grows
to 10GB or 20GB RAM over a few hours.
You mean the size of Guile's heap? That seems pathological. Maybe there
are lots of values in your data that accidentally look like heap
addresses.
Post by Linas Vepstas
My solution for that other case is to manually GC whenever gc-stats
gets above 1GB, because I know a priori my working set is smaller than
that.
The overall message seems to be that gc either runs too often, or not
often enough. Some sort of better planning is needed.
I'm motivated to explore this, as it actually impacts my work.
Absolutely, although as an application programmer you should be able to
trust that GC just works.


Marko
Linas Vepstas
2017-09-10 21:47:09 UTC
Permalink
Post by Marko Rauhamaa
Post by Linas Vepstas
which suggests that the vast majority of time is spent in GC, and not
in either scheme/guile, or my wrapped c++ code.
That may well be normal.
It might be "normal", but it's objectionable. If 90% of the cpu time is
spent in GC, and almost nothing at all was actually collected, then it's too
much. On the Amazon cloud, you get to pay by the minute. On laptops with
batteries, people complain. Especially when they are apple users, not linux
users.

If the compute job takes a week, and your user has discovered this GC
thing, then the user will conclude that it should take less than 24 hours,
and will tell you that you are stupid, and that guile sucks. I have been
told I'm stupid, and that guile sucks too many times.

CPU cycles aren't free.
Post by Marko Rauhamaa
Post by Linas Vepstas
4 gc's/minute means one every 15 seconds, which sounds reasonable,
except that RAM usage is so huge, that each GC takes many seconds to
complete.
That may or may not be normal. However, it should not be a function of
how much RAM has been allocated ourside Guile's heap.
The bigger the RAM usage, the slower it seems to be. It might be due to
fragmentation. It might be due to memory-bandwidth effects: if GC touches
every page or every other page, out of 100GB, it takes time to move those
pages from RAM through 3rd, 2nd and 1st-level cache. DDR3 has a bandwidth
of maybe 25GB/sec, if you are lucky; less, depending on the northbridge and
cpu design.
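
(Back of the envelope: just streaming 100 GBytes through the caches at
25 GB/sec already costs about 4 seconds per collection, before any
cache-miss or TLB effects -- which is roughly the size of the
stop-the-world pauses described below.)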

It's possible/likely that mixed c++ and guile code results in heavily
fragmented memory, where every 4K page has 100 bytes of guile, and 3900
bytes of non-collectible memory. The TLB will get severely thrashed in
this scenario. TLBs are infamously under-sized, and have been for decades.
Post by Marko Rauhamaa
Post by Linas Vepstas
Anyway, this demo gives a scenario where gc seems to run too often.
How often is too often? I would imagine the objective is to minimize the
stop-the-world periods. IOW, you usually want the program to run very
steadily.
It's too often if GC collected less than 10% of the current in-use heap.

In the current case I am looking at, stop-the-world lasts about 5 seconds,
and the guile in-use heap changes by less than 1% before and after.
Post by Marko Rauhamaa
Post by Linas Vepstas
I have other scenarios where it does not run often enough -- I have
one where the working set is about 200MBytes, but guile happily grows
to 10GB or 20GB RAM over a few hours.
You mean the size of Guile's heap?
Yes.
Post by Marko Rauhamaa
That seems pathological.
Yes.
Post by Marko Rauhamaa
Maybe there
are lots of values in your data that accidentally look like heap
addresses.
Maybe. But probably not. For this particular scenario, there are vast
quantities of 1MB to 10MB sized C strings that are sent to
scm_eval_string(), and are then freed. These C strings are approx 10%
scheme s-exps, and about 90% utf8 strings in various natural languages
(French, Chinese, etc.)

However, guile uses gc_malloc_atomic for these strings, and so GC should
never-ever peer into the contents of them. So I don't think it's "accidental
heap addresses". It's probably fragmentation, and the very heavy churn of
large malloc chunks.
Post by Marko Rauhamaa
Post by Linas Vepstas
My solution for that other case is to manually GC whenever gc-stats
gets above 1GB, because I know a priori my working set is smaller than
that.
The overall message seems to be that gc either runs too often, or not
often enough. Some sort of better planning is needed.
I'm motivated to explore this, as it actually impacts my work.
Absolutely, although as an application programmer you should be able to
trust that GC just works.
Well, yes, but I also have to handle user feedback about where I should
stick it.

The c++ code that I manage uses smart pointers for memory management. I
have no idea what fraction of cpu time is spent in that code, as there is
no simple, practical way of instrumenting that code and measuring it. It
might be a sloppy disaster for all I know -- but I don't know.

By contrast, the guile gc can be directly observed, and some of my users
observe it. Myself, I'm willing to partly ignore it, but it keeps coming up
as a possible bottleneck.

So let me redirect: is there some central spot within guile, where the
decision to run GC is made? If so, what file would that be in? If not, is
there some string I should grep for?

--linas
--
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *
Marko Rauhamaa
2017-09-10 22:38:30 UTC
Permalink
Post by Marko Rauhamaa
Post by Linas Vepstas
which suggests that the vast majority of time is spent in GC, and not
in either scheme/guile, or my wrapped c++ code.
That may well be normal.
It might be "normal", but its objectionable. If 90% of the cpu time is
spent in GC, and almost nothing at all was actually collected, then
its too much.
Hard to say. I don't know how much garbage Guile generates as part of
its normal execution model, but for example Python generates garbage
with each method call. It is very natural for each application of a Lisp
function to create a throw-away data structure through substitution.
On the Amazon cloud, you get to pay by the minute. On laptops with
batteries, people complain. Especially when they are apple users, not
linux users.
GC overhead is like RAM cache overhead or C function call overhead. It's
the price of doing business.
If the compute job takes a week, and your user has discovered this GC
thing, then the user will conclude that it should take less than 24
hours, and will tell you that you are stupid, and that guile sucks. I
have been told I'm stupid, and that guile sucks too many times.
CPU cycles aren't free.
I have no idea how good Guile's performance is. Python's performance is
probably about 100 times worse than C, yet it's fast becoming the
most popular programming language in the world.

Not everything in the world is done for performance. The #1 purpose for
a programming language is managing complexity. Relatively few
computation tasks require excellent performance, and there are other
programming languages for that: C, C++, Java, C#, Go.

Note that Java, C# and Go are performant despite GC. GC generally
doesn't have to be all that costly throughput-wise. The biggest
performance drain for high-level programming languages such as Scheme or
Python is the absence of compile-time typing. In fact, Python is adding
optional static type annotation (which I'm afraid might be a big
mistake).
However, guile uses gc_malloc_atomic for these strings, and so GC
should never-ever peer into the contents of them. So I don't think its
"accidental heap addresses". Its probably fragmentation, and the very
heavy churn of large malloc chunks.
You aren't thrashing by any chance...?
So let me redirect: is there some central spot within guile, where the
decision to run GC is made? If so, what file would that be in? If not,
is there some string I should grep for?
I believe that decision is left for GC_malloc(1L):

GC_malloc will attempt to reclaim inaccessible space automatically by
invoking a conservative garbage collector at appropriate points.


Marko
Arne Babenhauserheide
2017-09-11 16:17:21 UTC
Permalink
Post by Marko Rauhamaa
I have no idea how good Guile's performance is. Python's performance is
probably about 100 times worse than C, yet it's the fast becoming the
most popular programming language in the world.
One of the reasons for that is that you can get native C speed for most
non-trivial tasks by writing an extension in C.

If I understand it right, Linas is doing just that for Guile, and that
shows problems, which means that something is blocking the path to
much wider use cases. So getting that fixed would be great!

Best wishes,
Arne
--
Unpolitisch sein
heißt politisch sein
ohne es zu merken
Linas Vepstas
2017-09-14 18:31:59 UTC
Permalink
Hi Marko,

A casual reply, then ...
Post by Marko Rauhamaa
Not everything in the world is done for performance. The #1 purpose for
a programming language is managing complexity. Relatively few
computation tasks require excellent performance, and there are other
programming languages for that: C, C++, Java, C#, go
I agree in principle; the guts of my system are in C++, the casual
computations that string together the pipeline are in guile.

There are some 50+ dialects of scheme, and somewhere on the net is a giant
page of benchmarks comparing them. Guile ranks about 10th or 11th in that
list, which isn't bad, except that many of the #1 and #2 spots are
something like 10x faster. I'd like to see guile improve in those
rankings.
Post by Marko Rauhamaa
Note that Java, C# and go are performant despite GC.
Well, some decades ago, there were a couple of PhDs down the hall from me,
and they were full-time assigned to improving Java GC performance. That's
all they did, for years. And this was a long time ago. If Java weren't good
at this by now, it would be incomprehensible.
Post by Marko Rauhamaa
The biggest
performance drain for high-level programming languages such as Scheme or
Python is the absence of compile-time typing. In fact, Python is adding
optional static type annotation (which I'm afraid might be a big
mistake).
Well, that's a can of worms. Aside from static typing to benefit the
compiler, there's static typing to help the programmer understand what the
heck is being passed as an argument to some poorly documented, overly-long
function. This has always been a kind-of bugaboo of reading other people's
scheme code -- it's too often too hard to understand.

This is also why python is popular: any shmo feels they can code in it; it's
the new visual-basic of the 21st century, with all of the culture and
code-quality that implies.

One "obvious" solution to adding types in scheme is to go to caml/haskell,
but this does not obviously improve readability. (Although types are
effing fantastic from the math/theoretical side: my system makes very very
heavy use of types and type constructors, but that's unrelated to scheme.)

For a certain class of problems, coding in python simply doesn't work:
graph data structures are naturally recursive and functional; they're a
natural fit for scheme, and are hard/painful/nearly impossible to
manipulate in python.
Post by Marko Rauhamaa
You aren't thrashing by any chance...?
No.
OK thanks

--linas
--
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *
Marko Rauhamaa
2017-09-14 18:58:40 UTC
Permalink
Post by Linas Vepstas
Well that's a can of worms. Aside from static typing to benefit the
compiler, there's static typing to help the programmer understand what
the heck is being passed as an argument to some poorly documented,
overly-long function. This has always been a kind-of bugaboo of
reading other people's scheme code -- its too often too hard to
understand.
This is also why python is popular: any shmo feels they can code in
it; its the new visual-basic of the 21st century, with all of the
culture and code-quality that implies.
I strongly disagree. Python is popular because its level of abstraction
is optimal for a whole slew of use cases. I see no reason why Scheme
couldn't surpass Python some day in the future for the same reason.

Not everybody can write software in good style. In fact, in my
experience very few programmers can do it. That is not a property of the
programming language.

Static type declarations have undeniable quality benefits, but in my
experience the noise they introduce is also a quality problem. One of
the prime quality boosters for high-level programming languages is the
clear but concise expressivity. You can see the forest for the trees.
Post by Linas Vepstas
graph data structures are naturally recursive and functional; they're
a natural fit for scheme, and are hard/painful/nearly impossible to
manipulate in python.
I don't see that.

The greatest benefit of Scheme (and other Lisps) is the data/code
unification. All code is data and all data is code. Apart from that,
Python has adopted almost everything from Lisp. Now, Scheme and Lisp
famously allow you to expand the language syntax dynamically, but I'd
say you most often shouldn't use that power. Good Scheme code embraces
the lambda and leaves ad-hoc syntax aside.


Marko
Linas Vepstas
2017-09-14 20:15:17 UTC
Permalink
Hi Marko, long off-topic reply about the nature of programming languages.
Post by Marko Rauhamaa
Post by Linas Vepstas
Well that's a can of worms. Aside from static typing to benefit the
compiler, there's static typing to help the programmer understand what
the heck is being passed as an argument to some poorly documented,
overly-long function. This has always been a kind-of bugaboo of
reading other people's scheme code -- its too often too hard to
understand.
This is also why python is popular: any shmo feels they can code in
it; its the new visual-basic of the 21st century, with all of the
culture and code-quality that implies.
I strongly disagree. Python is popular because its level of abstraction
is optimal for a whole slew of use cases. I see no reason why Scheme
couldn't surpass Python some day in the future for the same reason.
What would that level of abstraction be? Racket (ex PLT Scheme) seems to
have taken steps in that direction; I can't tell if they've been successful
at this. Should guile move in that direction?
Post by Marko Rauhamaa
Not everybody can write software in good style. In fact, in my
experience very few programmers can do it. That is not a property of the
programming language.
Perhaps. Back in the day, when young programmers were learning C++ for the
first time, they felt obligated to use each of the nifty new tricks they
just learned: multiple inheritance, overloaded operators, virtual methods,
virtual base classes, templates: not just use, but abuse these, even when
something simpler was available. It was like working with quick-set
concrete: after the code was written, it seemed like it was impossible to
make any significant changes to it that would even compile, much less do
the right thing. This kind of code was hard to read because it was badly
written.

Sometimes, with scheme, I get the sense that if you haven't read SICP you
can't understand what the code does: that you can't understand the code
because it's well-written.
Post by Marko Rauhamaa
Static type declarations have undeniable quality benefits, but in my
experience the noise they introduce is also a quality problem. One of
the prime quality boosters for high-level programming languages is the
clear but concise expressivity. You can see the forest for the trees.
Speaking of SICP -- one of the early examples they have there is for
working with 2D numeric data, using either (x,y) under the covers, or
using (r,theta). Which is all very spiffy, except that, for the user who
just glances at the function signature, it's not clear exactly what that
function was expecting to get as arguments.

Somehow, that doesn't happen in Python, even in its current untyped state.
But I cannot put my finger on it; it's a gut-sense thing.
Post by Marko Rauhamaa
Post by Linas Vepstas
graph data structures are naturally recursive and functional; they're
a natural fit for scheme, and are hard/painful/nearly impossible to
manipulate in python.
I don't see that.
Hmm. I have C++ code with guile, cython and haskell bindings. It is used
to work with data structures that resemble json-like nested, recursive
tree-structures. By json-like I mean they are trees, with various
different types at the vertexes. A graph (directed or non-directed, with
loops in general) is specified by specifying the various trees that partly
span the graph. I kind of wish I had javascript bindings for this thing,
but can't figure out how.

The guile bindings for this are very elegant, and are about as readable as
json is. The python bindings are real clunkers -- they're much more
verbose, by factors of 3x to 5x. To add insult to injury, the insanely
stupid python whitespace rules mean that it is impossible to indent them
into the natural tree structure that they naturally "want" to have.

Perhaps our python bindings are mis-designed. They were created by a python
fanatic to be "pythonic" which I think means "I don't know how to code"
:-( Our guile bindings are also mis-designed (by me), I regret many of
the decisions I made... but they're still nicer than the python...
Post by Marko Rauhamaa
The greatest benefit of Scheme (and other Lisps) is the data/code
unification. All code is data and all data is code. Apart from that,
Python has adopted almost everything from Lisp. Now, Scheme and Lisp
famously allow you to expand the language syntax dynamically, but I'd
say you most often shouldn't use that power. Good Scheme code embraces
the lambda and leaves ad-hoc syntax aside.
Since we are off-topic, I want to mention two things that scheme doesn't
have that I'm supremely interested in. Of course, it would not be scheme,
any more, once these were added, but these are things important for some
unspecified future.

One is a knowledge-representation language subset. Something like the
datalog subset of prolog, or something like SQL: some subset of scheme that
can be manipulated and queried and persisted to a file. Think pattern
matching .. but not the current anemic, super-simplistic model used in
scheme, but something full-featured, like SQL. Applied not to tables, but
to s-exps.

Another neato trick would be some kind of term-rewriting system. Something
kind-of-ish like the scheme macro subsystem, but instead, the macros could
be invoked repeatedly, ad infinitum, on arbitrary data, instead of just
once on fixed data. Again, kind-of-like SQL query statements: you can
re-use them over and over, even as the table of data underneath it all
changes. Pattern-matching, all grown up.

If you had these two, you'd have a graph DB that far exceeded what the
tinkerpop folks can do. And with it being scheme ... or at least s-exps ...

--linas
Post by Marko Rauhamaa
Marko
--
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *
David Kastrup
2017-09-14 20:36:54 UTC
Permalink
Post by Linas Vepstas
Perhaps. Back in the day, when young programmers were learning C++
for the first time, they felt obligated to use each of the nifty new
tricks they just learned: multiple inheritance, overloaded operators,
virtual methods, virtual base classes, templates: not just use, but
abuse these, even when something simpler was available. It was like
working with quick-set concrete: after the code was written, it seemed
like it was impossible to make any significant changes to it that
would even compile, much less do the right thing. This kind of code
was hard to read cause it was badly written.
The language C++ has had a long-running inferiority complex towards
FORTRAN, basically with the motto "we can do the same, but user-defined,
and still get native performance of a statically typed system". This
gave it a monstrous, and actually unpredictably contorted type
conversion lattice in order to simulate Fortran's builtin complex types.
It also gave birth to a legion of attempts to do efficient
runtime-dimensioned multidimensional arrays, with the result that there
are no numeric libraries equivalent in usability and efficiency to
half-century old FORTRAN libraries like BLAS and LINPACK since there
still is no agreed-upon efficient multidimensional array type, let alone
one where the compiler can do good strength reduction.

This "we can do the same, but user-defined, and still get native
performance of a statically typed system" mantra pervades all language
standardization processes and basically leads to a Turing-complete
template, type and ambiguity resolution system that is a wholly
independent layer akin to the C preprocessor but syntactically
structure-preserving.

Scheme (and its overall LISP family) has no language syntax to speak of
and its input _is_ already a data structure, so "structure-preserving"
has always been a given for its preprocessing. It also never really was
into static typing, but "JIT compilation" makes it possible to reduce the
impact of that decision considerably.
Post by Linas Vepstas
Sometimes, with scheme, I get the sense that if you haven't read SICP
you can't understand what the code does: that you can't understand the
code cause it's well-written.
Well, that does not sound like a good language design goal either.
--
David Kastrup
Arne Babenhauserheide
2017-09-14 20:45:28 UTC
Permalink
Post by Marko Rauhamaa
Post by Linas Vepstas
This is also why python is popular: any shmo feels they can code in
it; its the new visual-basic of the 21st century, with all of the
culture and code-quality that implies.
I strongly disagree. Python is popular because its level of abstraction
is optimal for a whole slew of use cases. I see no reason why Scheme
couldn't surpass Python some day in the future for the same reason.
I somewhat disagree: Python became popular because code written in
Python is typically easy to understand, even for non-programmers. This
is aided by the Zen of Python, which is a seriously awesome basis for
creating a programming language: https://www.python.org/dev/peps/pep-0020/

Now so many people are using Python, that it is also popular because
there are so many people using it and building things for it.

Best wishes,
Arne
--
Unpolitisch sein
heißt politisch sein
ohne es zu merken
t***@tuxteam.de
2017-09-11 08:22:00 UTC
Permalink
On Sun, Sep 10, 2017 at 04:47:09PM -0500, Linas Vepstas wrote:

[...]
Post by Linas Vepstas
The bigger the RAM usage, the slower it seems to be. It might be due to
fragmentation [...]
Your guesses all make much sense (gc and non-gc data mixed at page level,
forcing the gc to touch many pages, etc.).

To just add one more: total memory usage (gc and non-gc) seems to enter
the gc triggering heuristics [1] (I couldn't follow in detail how, and I
don't know whether this is still valid in the 2.2 branch).

It just might be that Guile is seeing the process's outrageous memory
footprint and tries desperately to free some memory (which of course
can't succeed in your case because most memory usage isn't "guile's").

That might explain the gc being called far too often (and without
success).

Take this with two pounds of salt. My knowledge of Guile internals is
pathetic at best.

Cheers

[1] http://git.savannah.gnu.org/cgit/guile.git/tree/libguile/gc.c?h=v2.0.13#n770

-- tomás
Linas Vepstas
2017-09-14 20:19:57 UTC
Permalink
Tomas, Hans, Mark,

Thanks looking at it now.

--linas
Post by t***@tuxteam.de
To just add one more: total memory usage (gc and non-gc) seems to enter
the gc triggering heuristics [1] (I couldn't follow in detail how, and I
don't know whether this is still valid in the 2.2 branch).
[1] http://git.savannah.gnu.org/cgit/guile.git/tree/libguile/gc.c?h=v2.0.13#n770
Post by Hans Åberg
Does it have to do with this issue [1], that the Boehm GC initially uses
a small amount of heap? One can use, say,
export GC_INITIAL_HEAP_SIZE=50G
on the shell command line before running the program.
1. https://github.com/Macaulay2/M2/issues/500
Post by Mark H Weaver
If you'd like to learn how Boehm GC decides when to run, I would
encourage you to read <http://www.hboehm.info/gc/gcdescr.html>. I've
selected some relevant excerpts below.
--
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *
David Kastrup
2017-09-13 17:13:04 UTC
Permalink
Post by Marko Rauhamaa
Post by Linas Vepstas
The overall message seems to be that gc either runs too often, or not
often enough. Some sort of better planning is needed.
I'm motivated to explore this, as it actually impacts my work.
Absolutely, although as an application programmer you should be able
to trust that GC just works.
s/although/because/
--
David Kastrup
Arne Babenhauserheide
2017-09-10 22:17:08 UTC
Permalink
Hi Linas,
Post by Linas Vepstas
Post by Arne Babenhauserheide
If I replace
(cons (format #f "~A" cnt) lis)
with
(cons cnt lis)
GC is down to ~20% (from around 130%).
That demo tried to capture the spirit of a more complex system.
That’s great, by the way!

I did that experiment because the string formatting seems to create
lots of values the GC has to collect. If this triggers a scan of all
memory, any reduction of disposable value creation should help.

Just try putting a (gc-disable) just before you run make-bgl and see the
memory usage shoot through the roof almost instantly.

For the real-code: throw in a
Post by Linas Vepstas
The cons
that I have is actually (cons (wrapper-for-some-c++-code foobar) lis) or
maybe (cons (scheme-that-calls-other-scheme-that-calls-c++ foo) lis) ...
but its not just cons -- it can also be for-each, map, or fold.
guile-en-r> (avg-gc-cpu-time)
Elapsed: 4461. secs. Rate: 4.236 gc/min %cpu-GC: 280.8% %cpu-use: 284.0%
This indeed looks pretty painful.
Post by Linas Vepstas
Anyway, this demo gives a scenario where gc seems to run too often. I have
other scenarios where it does not run often enough -- I have one where the
working set is about 200MBytes, but guile happily grows to 10GB or 20GB RAM
over a few hours. My solution for that other case is to manually GC
whenever gc-stats gets above 1GB, because I know a priori my working set is
smaller than that.
The overall message seems to be that gc either runs too often, or not often
enough. Some sort of better planning is needed.
A short term solution could be to disable automatic triggering of GC in
cases where it fires too often (compared to the total memory
consumption).
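
For example, something along these lines (a rough sketch;
the-big-computation is just a stand-in for the real work):

;; suppress automatic collection around the hot section,
;; then collect once explicitly afterwards
(gc-disable)
(the-big-computation)   ; stand-in for the real work
(gc)
(gc-enable)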
Post by Linas Vepstas
I'm motivated to explore this, as it actually impacts my work.
That’s great! And it’s awesome that you’re pushing Guile that far. This
is what it needs to become a reliable real-world battered workhorse for
many more people.

A start might be to look at the problem description: GC is triggered too
often when your C++ datastructures are so huge that the growth in
Guile-managed memory is irrelevant, and it is triggered too seldom when
you only manage a small amount of data.

So maybe GC could benefit from a kind of hint from C++ about the size of
the datastructures which allows tuning the GC frequency to the total
amount of memory.
--
Unpolitisch sein
heißt politisch sein
ohne es zu merken
Arne Babenhauserheide
2017-09-10 22:36:51 UTC
Permalink
Hi Linas,
Post by Linas Vepstas
Post by Arne Babenhauserheide
Post by Linas Vepstas
To make this conversation less crazy, and more down-to-earth, here
is a demo that seems to spend most of its time in GC. Hard for me
to say that this is "wrong", right now, but it illustrates the
scandalous nature of the problem.
Do you have many string operations in your code?
If I replace
(cons (format #f "~A" cnt) lis)
with
(cons cnt lis)
GC is down to ~20% (from around 130%).
That demo tried to capture the spirit of a more complex system. The cons
that I have is actually (cons (wrapper-for-some-c++-code foobar) lis) or
maybe (cons (scheme-that-calls-other-scheme-that-calls-c++ foo) lis) ...
but its not just cons -- it can also be for-each, map, or fold.
I looked at format, because it seems to create a lot of values which
need to be collected.

Just try to run your function like this:

(gc-disable)
(define foo (make-bgl '() (* 1235 1000 1000)))


... and see memory usage go through the roof almost instantly.

Then try the same without the format to get a mostly tame memory
behavior.
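
That is, with the list-growing step in make-bgl changed to use the counter
directly, roughly:

; format-free variant of the list-growing step
(define longer (cons cnt lis))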
Post by Linas Vepstas
guile-en-r> (avg-gc-cpu-time)
Elapsed: 4461. secs. Rate: 4.236 gc/min %cpu-GC: 280.8% %cpu-use: 284.0%
That looks painful indeed.
Post by Linas Vepstas
Anyway, this demo gives a scenario where gc seems to run too often. I have
other scenarios where it does not run often enough -- I have one where the
working set is about 200MBytes, but guile happily grows to 10GB or 20GB RAM
over a few hours. My solution for that other case is to manually GC
whenever gc-stats gets above 1GB, because I know a priori my working set is
smaller than that.
The overall message seems to be that gc either runs too often, or not often
enough. Some sort of better planning is needed.
Maybe Guile needs some additional information from the C++
datastructures about the total size of the managed memory (and the
actual size of datastructures in foreign code) so it can judge whether
the added memory usage it sees is actually already relevant.
Post by Linas Vepstas
I'm motivated to explore this, as it actually impacts my work.
That’s great! And it is also awesome that you’re pushing Guile this far!
This helps Guile become a reliable tool for many more people.

Best wishes,
Arne
--
Unpolitisch sein
heißt politisch sein
ohne es zu merken
Mark H Weaver
2017-09-14 07:44:59 UTC
Permalink
Post by Linas Vepstas
The overall message seems to be that gc either runs too often, or not often
enough. Some sort of better planning is needed.
There are several variables available to adjust libgc's heuristics, e.g.
GC_use_entire_heap, GC_free_space_divisor, GC_full_frequency, with
associated environment variables, described in:
<https://github.com/ivmai/bdwgc/blob/master/doc/README.environment>
Post by Linas Vepstas
I'm motivated to explore this, as it actually impacts my work.
If you'd like to learn how Boehm GC decides when to run, I would
encourage you to read <http://www.hboehm.info/gc/gcdescr.html>. I've
selected some relevant excerpts below.

Mark


http://www.hboehm.info/gc/gcdescr.html

[...]
In non-incremental mode, we make a decision about whether to garbage
collect whenever an allocation would otherwise have failed with the
current heap size. If the total amount of allocation since the last
collection is less than the heap size divided by GC_free_space_divisor,
we try to expand the heap. Otherwise, we initiate a garbage
collection. This ensures that the amount of garbage collection work per
allocated byte remains constant.

The above is in fact an oversimplification of the real heap expansion
and GC triggering heuristic, which adjusts slightly for root size and
certain kinds of fragmentation. In particular:

* Programs with a large root set size and little live heap memory will
expand the heap to amortize the cost of scanning the roots.

* Versions 5.x of the collector actually collect more frequently in
nonincremental mode. The large block allocator usually refuses to
split large heap blocks once the garbage collection threshold is
reached. This often has the effect of collecting well before the heap
fills up, thus reducing fragmentation and working set size at the
expense of GC time. Versions 6.x choose an intermediate strategy
depending on how much large object allocation has taken place in the
past. (If the collector is configured to unmap unused pages, versions
6.x use the 5.x strategy.)

* In calculating the amount of allocation since the last collection we
give partial credit for objects we expect to be explicitly
deallocated. Even if all objects are explicitly managed, it is often
desirable to collect on rare occasion, since that is our only
mechanism for coalescing completely empty chunks.

It has been suggested that this should be adjusted so that we favor
expansion if the resulting heap still fits into physical memory. In many
cases, that would no doubt help. But it is tricky to do this in a way
that remains robust if multiple applications are contending for a single
pool of physical memory.

[...]

In incremental mode, the heap is always expanded when we encounter
insufficient space for an allocation. Garbage collection is triggered
whenever we notice that more than GC_heap_size/2 *
GC_free_space_divisor bytes of allocation have taken place. After
GC_full_freq minor collections a major collection is started.
Arne Babenhauserheide
2017-09-10 16:53:52 UTC
Permalink
Post by Linas Vepstas
To make this conversation less crazy, and more down-to-earth, here is a
demo that seems to spend most of its time in GC. Hard for me to say that
this is "wrong", right now, but it illustrates the scandalous nature of the
problem.
My approach was to optimize the function (in REPL: ,optimize <func>) and
then to look for boxed values.

Best wishes,
Arne
--
Unpolitisch sein
heißt politisch sein
ohne es zu merken
Mark H Weaver
2017-09-13 02:24:21 UTC
Permalink
Hi Linas,
Post by Linas Vepstas
To make this conversation less crazy, and more down-to-earth, here is a
demo that seems to spend most of its time in GC. Hard for me to say that
this is "wrong", right now, but it illustrates the scandalous nature of the
problem.
Its a simple loop, about 6 lines of code total, it mostly just wastes cpu
time. There's also a longer progress-report printer, similar to the
previous one I posted.
It runs at about 1 million iterations/minute, for me, of which 15 seconds
are spent "doing real work", and 45-75 seconds in gc. Note that GC runs in
parallel, usually using about 3-4 cores, so the fraction-cpu-time is the
fraction relative to wall-clock, so that %350 cpu time is typical (i.e. 3.5
cpu cores are running)
This seems to be a reasonable model of the behavior that I am seeing.
The indentation was stripped from your mail, so I re-indented the code below.
Post by Linas Vepstas
; a tail-recursive loop, seems to waste a lot of time in gc.
(define (make-bgl lis cnt)
  ; a bogus longer list
  (define longer (cons (format #f "~A" cnt) lis))
  ; periodically report statistics
  (if (eq? 0 (modulo cnt (* 1000 1000)))
      (begin (format #t "~A " cnt) (avg-gc-cpu-time)))
  ; periodically trim the list
  (if (eq? 0 (modulo cnt (* 4123 1123)))
      (set! longer '()))
  ; loop for a while.
  (if (eq? 0 cnt) lis
      (make-bgl longer (- cnt 1))))
; (define foo (make-bgl '() (* 1235 1000 1000)))
A few observations:

* The (format #f ...) allocates a temporary string output port, and that
turns out to be quite a lot of allocation, especially in Guile 2.2.
Glancing at the relevant code, I would estimate that it's at least 2.5
kilobytes spread over at least 10 heap blocks of various sizes. It
also involves registering a finalizer for the port, and adding to a
weak set, both of which are relatively expensive for the GC to deal
with. Using 'number->string' there would reduce memory allocation of
the loop above by an order of magnitude or more.

In theory, string ports could (and IMO should) be made a *lot*
lighter, but we're not there yet.

* The fact that you're using 'set!' to mutate 'longer' forces a
heap-allocated variable object to be allocated for 'longer', whereas
otherwise it could be allocated on the stack. See section 9.3.4
(Variables and the VM) in the Guile 2.2 manual for more on this, and
how 'set!' often makes things much less efficient.

* Since 'longer' is allocated within the loop, a new copy of that
(heap-allocated) variable object is created for each iteration.

Since you're running this loop about 1.2 billion times, and each
iteration is allocating at least 2.5k bytes, that comes out to around 3
terabytes of allocation during this loop. All of that must be reclaimed
by the garbage collector.

Also, since the port objects have finalizers, they are not deallocated
during the first collection that finds them to be unreachable. They are
added to a queue of objects to be finalized, and only after they have
been finalized and the next GC runs can they actually be deallocated.

Considering all of this, it's not surprising to me that the cost of this
loop is dominated by time spent in GC. The remaining things to do are
very lightweight operations by comparison.

Mark
Arne Babenhauserheide
2017-09-13 19:38:57 UTC
Permalink
Hi Mark,

Thank you for your great writeup!
Post by Mark H Weaver
* The (format #f ...) allocates a temporary string output port, and that
turns out to be quite a lot of allocation [...] Using 'number->string'
there would reduce memory allocation of the loop above by an order of
magnitude or more.

Post by Mark H Weaver
* The fact that you're using 'set!' to mutate 'longer' forces a
heap-allocated variable object to be allocated for 'longer', whereas
otherwise it could be allocated on the stack. See section 9.3.4
(Variables and the VM) in the Guile 2.2 manual for more on this, and
how 'set!' often makes things much less efficient.
* Since 'longer' is allocated within the loop, a new copy of that
(heap-allocated) variable object is created for each iteration.
I implemented a version which avoids both points via let-recursion and
number->string:

(define (make-bgl lis cnt)
  "a bogus longer list"
  (let loop ((longer (cons (number->string cnt) lis))
             (cnt cnt))
    ;; periodically report statistics
    (when (eq? 0 (modulo cnt (* 1000 1000)))
      (begin (format #t "~A " cnt) (avg-gc-cpu-time)))
    (cond
     ;; periodically trim the list
     ((and (eq? 0 (modulo cnt (* 4123 1123)) (> (length longer) 0)))
      (loop (list)
            cnt))
     ;; end recursion
     ((eq? 0 cnt)
      lis)
     (else
      (loop (cons (number->string cnt) longer)
            (- cnt 1))))))

This does indeed spend much less time in GC, but its memory usage is
still rising rapidly. Did I unknowingly create a memory leak?

Best wishes,
Arne
--
Unpolitisch sein
heißt politisch sein
ohne es zu merken
David Kastrup
2017-09-13 19:57:07 UTC
Permalink
Post by Arne Babenhauserheide
I implemented a version which avoids both points via let-recursion and
(define (make-bgl lis cnt)
"a bogus longer list"
(let loop ((longer (cons (number->string cnt) lis))
(cnt cnt))
;; periodically report statistics
(when (eq? 0 (modulo cnt (* 1000 1000)))
(begin (format #t "~A " cnt) (avg-gc-cpu-time)))
(cond
;; periodically trim the list
((and (eq? 0 (modulo cnt (* 4123 1123)) (> (length longer) 0)))
You probably mean

((and (eq? 0 (modulo cnt (* 4123 1123))) (> (length longer) 0))

instead, and (eq? 0 ...) is undefined behavior anyway even disregarding
that it let an error go unnoticed that would have bombed out had you
been using (= 0 ...) or equivalently (zero? ...) instead.

And (> (length longer) 0) is just horribly inefficient anyway (O(n) with
a turtle-hare hiding in the constant factor) compared to (pair? longer).
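
Putting the two together, the trimming clause would then read something like

  ;; parenthesis moved, and the cheaper zero? / pair? tests used instead:
  ((and (zero? (modulo cnt (* 4123 1123))) (pair? longer))
   (loop (list) cnt))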
Post by Arne Babenhauserheide
(loop (list)
cnt))
;; end recursion
((eq? 0 cnt)
lis)
(else
(loop (cons (number->string cnt) longer)
(- cnt 1))))))
This does indeed spend much less time in GC, but its memory usage is
still rising rapidly. Did I unknowingly create a memory leak?
Best wishes,
Arne
--
David Kastrup
Arne Babenhauserheide
2017-09-13 20:37:21 UTC
Permalink
Post by David Kastrup
Post by Arne Babenhauserheide
I implemented a version which avoids both points via let-recursion and
(define (make-bgl lis cnt)
"a bogus longer list"
(let loop ((longer (cons (number->string cnt) lis))
(cnt cnt))
;; periodically report statistics
(when (eq? 0 (modulo cnt (* 1000 1000)))
(begin (format #t "~A " cnt) (avg-gc-cpu-time)))
(cond
;; periodically trim the list
((and (eq? 0 (modulo cnt (* 4123 1123)) (> (length longer) 0)))
You probably mean
((and (eq? 0 (modulo cnt (* 4123 1123))) (> (length longer) 0))
instead, and (eq? 0 ...) is undefined behavior anyway even disregarding
that it let an error go unnoticed that would have bombed out had you
been using (= 0 ...) or equivalently (zero? ...) instead.
Ah, yes, there’s the memory leak. Thank you!
Post by David Kastrup
And (> (length longer) 0) is just horribly inefficient anyway (O(n) with
a turtle-hare hiding in the constant factor) compared to (pair? longer).
I didn’t know that this works — thank you!

(pair? (list)) → #f
(pair? (list 1)) → #t
Post by David Kastrup
Post by Arne Babenhauserheide
This does indeed spend much less time in GC, but its memory usage is
still rising rapidly. Did I unknowingly create a memory leak?
Thanks to David this is answered now.

Best wishes,
Arne
--
Unpolitisch sein
heißt politisch sein
ohne es zu merken
Linas Vepstas
2017-09-14 19:14:09 UTC
Permalink
Hi Mark,

Thanks... I'm not sure how to reply. So below I write "Yes, but...". My
alternative was to ignore your email, but that seems rude.
Post by Mark H Weaver
Hi Linas,
* The (format #f ...) allocates a temporary string output port, and that
turns out to be quite a lot of allocation, especially in Guile 2.2.
Glancing at the relevant code, I would estimate that it's at least 2.5
kilobytes spread over at least 10 heap blocks of various sizes. It
also involves registering a finalizer for the port, and adding to a
weak set, both of which are relatively expensive for the GC to deal
with. Using 'number->string' there would reduce memory allocation of
the loop above by an order of magnitude or more.
Yes but ... that code was meant to be an approximation of the "real" code.
In the "real" code, there are maybe a dozen-ish scheme calls, so the estimate
of 2.5KB spread over 10 blocks might be in the right range. In the "real"
code, there is also maybe 30KB of C++ mallocs, per iteration. Again, those
mallocs should be "invisible" to bdwgc, but they might serve to fragment
RAM badly.
Post by Mark H Weaver
* The fact that you're using 'set!' to mutate 'longer' forces a
heap-allocated variable object to be allocated for 'longer', whereas
otherwise it could be allocated on the stack. See section 9.3.4
(Variables and the VM) in the Guile 2.2 manual for more on this, and
how 'set!' often makes things much less efficient.
Heh. I was using set! for this demo, only to clear out the long list.
Otherwise, you'd get a multi-billion-entry list, which clearly blows out
the stack.
Post by Mark H Weaver
* Since 'longer' is allocated within the loop, a new copy of that
(heap-allocated) variable object is created for each iteration.
Hmm. That's interesting. Is that really right? The thing was meant to be a
tail-recursive loop, designed to run for some arbitrarily long time.
Clearly a heap allocation in a loop would be a disaster.

In my "real" code, I don't think I am using set! in this way (but I will
now have to pay closer attention). It's not clear that this isn't some kind
of guile design flaw -- I can't imagine a technical reason for why a set!
to a local variable would force it to go on the heap.
Post by Mark H Weaver
Since you're running this loop about 1.2 billion times, and each
iteration is allocating at least 2.5k bytes, that comes out to around 3
terabytes of allocation during this loop. All of that must be reclaimed
by the garbage collector.
Yeah, the 1.2 billion was to get it to run some 12+ hours, so that I could
verify if it slowly degrades over time (it didn't - it was stable after the
first few minutes). My loop was running 25 million times, and it took some
12 or 20 hours to finish. And it did seem to slowly degrade, but I'm not
quite sure.
Post by Mark H Weaver
Also, since the port objects have finalizers,
I don't think that any of my code is using ports in my loop, or that
there's anything else in the loop that needs finalization.

I need to create some better instrumentation. But at 1-3 days per run,
debugging gets tedious very quickly. One is not motivated to revisit the
problem once one finally comes out the other end. Until, of course, next
time.

--linas
--
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *
Kjetil Matheussen
2017-09-11 07:42:25 UTC
Permalink
Post by Linas Vepstas
I've been experiencing problems with guile GC for a while now, that I've
mostly been able to ignore, but that are becoming debilitating, and I need
help, or need to find some solution.
Background: I have a C++ library that has guile bindings; it is wrapped up
as a guile module. It manages a lot of data, anywhere from a few gigabytes,
to a over a hundred gigabytes. Almost all this RAM usage should be
invisible, opaque to guile: so, for example, while my library is managing
5 or 10 GBytes, guile itself is managing maybe 20 or 50 or 100MBytes,
according to (gc-stats). This is normal and expected. Some of my programs
have guile use considerably more than this, in the multi-GByte range
However, the larger the RAM usage, the more frequently guile GC seems to
run, and the longer it takes, so that by the time my dataset is 100GBytes,
guile is using maybe 1GBytes, and is spending 99% of its time in GC. My app
slows to a crawl, and nothing gets done. I need to get GC under control,
because it is running pointlessly, because its clear that (almost) nothing
can be reclaimed, (usually) nothing is being reclaimed.
Hi Linas,

This is probably not the reason, but could it be that your program has a
lot of global/static data, or that you are dynamically loading more and
more libraries that have global/static data? (Global/static data are
scanned by bdw-gc, unless explicitly prevented from doing so.)

To test this, you can add a "static_roots_callback" before calling GC_INIT
when your
program starts up. Here's an example of a "static_roots_callback":

https://github.com/kmatheussen/radium/blob/master/Qt/Qt_Main.cpp#L3028
(change "#if 0" at line 3094 to "#if 1" to print out info every time you
collect garbage)

It's initialized like this:

int main(){
  GC_register_has_static_roots_callback(gc_has_static_roots_func);
  GC_INIT();
  ...
}


Also, maybe you should ask on the bdw-gc mailing list too. The question is
probably more relevant there if it's a gc tuning issue (which it might not
be though).
Linas Vepstas
2017-09-14 18:41:35 UTC
Permalink
Post by Kjetil Matheussen
This is probably not the reason, but could it be that your program has a
lot of global/static data,
or that you are dynamically loading more and more libraries that has
global/static data?
(global/static data are scanned by bdw-gc, unless explicitly prevented
from doing so)
Thanks for the suggestion, but I don't think so. There are a few dozen
shared libs; I doubt they have more than a few KB of .data and .bss
segments. It's pretty "normal" C++ code.

Now, there are lots of threads, each of which has thread-local sections,
but it's hard to imagine how this would add up to more than a few megabytes.

--linas
--
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *
Hans Åberg
2017-09-11 18:13:25 UTC
Permalink
Post by Linas Vepstas
I've been experiencing problems with guile GC for a while now, that I've
mostly been able to ignore, but that are becoming debilitating, and I need
help, or need to find some solution.
Background: I have a C++ library that has guile bindings; it is wrapped up
as a guile module. It manages a lot of data, anywhere from a few gigabytes,
to a over a hundred gigabytes. Almost all this RAM usage should be
invisible, opaque to guile: so, for example, while my library is managing
5 or 10 GBytes, guile itself is managing maybe 20 or 50 or 100MBytes,
according to (gc-stats). This is normal and expected. Some of my programs
have guile use considerably more than this, in the multi-GByte range
However, the larger the RAM usage, the more frequently guile GC seems to
run, and the longer it takes, so that by the time my dataset is 100GBytes,
guile is using maybe 1GBytes, and is spending 99% of its time in GC. My app
slows to a crawl, and nothing gets done. I need to get GC under control,
because it is running pointlessly, because its clear that (almost) nothing
can be reclaimed, (usually) nothing is being reclaimed.
Does it have to do with this issue [1], that the Boehm GC initially uses a
small amount of heap? One can use, say,
export GC_INITIAL_HEAP_SIZE=50G
on the shell command line before running the program.

1. https://github.com/Macaulay2/M2/issues/500