Ticket #54 (closed defect: fixed)

Opened 4 years ago

Last modified 3 years ago

Memory usage much higher than 1.0.x, can lead to swapping a machine to death

Reported by: Jarod Wilson <jwilson@…> Owned by: oetiker
Priority: critical Milestone:
Component: rrd_graph Version: 1.2.15
Keywords: Cc: it@…, tester@…, jbreault@…, harlan@…

Description

Reported by Andrej Todosic (it@…) in Red Hat's bugzilla:  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=201241

Description of problem: When making many graphs, rrdtool eventually stops writing graphs and starts sucking up memory, until all RAM and swap is consumed and the entire machine gets wedged

Version-Release number of selected component (if applicable): rrdtool-1.2.15-3.fc4

How reproducible: Either using the 'larrd' add-on for BigBrother?, or the munin-graph part of munin.

Steps to Reproduce: 1. 'munin-graph --cron' Actual results: Machine runs out of RAM and swap. Kernel fault, requires reboot. Furthermore, the graphs that are drawn are visually very different (anti-aliased?) from previous versions of rrdtool.

Expected results: larrd and munin graph .jpg files created, as they were earlier today.

Additional info: BigBrother? was made from a tarball. It calls rrdtool from bash scripts. munin is munin-1.2.4-9.fc4, which may call librrd.so from Perl modules.

From the strace:

open("/var/lib/munin/touchtunes.com/kampfgruppe.touchtunes.com-tt_toparse-gen3-g.rrd",
O_RDONLY) = 5
fstat64(5, {st_mode=S_IFREG|0644, st_size=50604, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0xb7f76000
read(5, "RRD\0000001\0\0\0\0/%\300\307C+\37[\1\0\0\0\f\0\0\0,\1"..., 4096) = 4096
_llseek(5, 0, [4096], SEEK_CUR)         = 0
_llseek(5, 36864, [36864], SEEK_SET)    = 0
read(5, "\0\0\370\377\0\0\0\0\0\0\370\377\0\0\0\0\0\0\370\377\0"..., 4096) = 4096
_llseek(5, 32768, [32768], SEEK_SET)    = 0
read(5, "\0\0\370\377\0\0\0\0\0\0\370\377\0\0\0\0\0\0\370\377\0"..., 4096) = 4096
close(5)                                = 0
munmap(0xb7f76000, 4096)                = 0
time(NULL)                              = 1154630928
open("/usr/share/rrdtool/fonts/DejaVuSansMono-Roman.ttf", O_RDONLY) = 5
fcntl64(5, F_SETFD, FD_CLOEXEC)         = 0
fstat64(5, {st_mode=S_IFREG|0644, st_size=60444, ...}) = 0
mmap2(NULL, 60444, PROT_READ, MAP_PRIVATE, 5, 0) = 0xb7d36000
close(5)                                = 0
munmap(0xb7d36000, 60444)               = 0
[above six lines repeated 57 times.]
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1252, ...}) = 0
[above line repeated 60 times]
brk(0xa17c000)                          = 0xa17c000
brk(0xa19d000)                          = 0xa19d000
brk(0xa1be000)                          = 0xa1be000
brk(0xa1df000)                          = 0xa1df000
brk(0xa200000)                          = 0xa200000
brk(0xa221000)                          = 0xa221000
brk(0xa242000)                          = 0xa242000
brk(0xa263000)                          = 0xa263000
brk(0xa284000)                          = 0xa284000
brk(0xa2a5000)                          = 0xa2a5000
[... etc, repeat until entire machine dies]

Attachments

rrd.tar.bz2 Download (9.1 KB) - added by tester@… 3 years ago.
rrd files that make it crash

Change History

  Changed 4 years ago by oetiker

Hi,

can you please try reproducing this problem with rrdtool alone ... using a short perl script ...

as you describe it, it would indicate a memory leak, but why rrdtool should suddenly stop working and start eating up all memory is not explicable to me ...

tobi

  Changed 3 years ago by oetiker

Hi,

bugs where the reporter does not cooperate are difficult to tackle ...

tobi

  Changed 3 years ago by Jarod Wilson <jwilson@…>

While I haven't heard anything from the original reporter, someone else has provided a reproducer in the corresponding RH bugzilla ticket.

From Jeremias Reith (jr@…): Created an attachment,  https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=134039&action=view RRD that can be used to reproduce the bug

I was able to reproduce the bug on FC6 test2 (x86_64).

Steps to reproduce:

1. put the extracted rrd into the current path

2. Execute the following command

rrdtool graph - \
--logarithmic \
DEF:b="leak.rrd":ReadExecLogPosDiff:AVERAGE \
LINE1:b#837C04:"" > /dev/null

It looks like the --lograrithmic option has something to do with it

Here is a backtrace (got with a SIGKILL):

#0  0x00002aaaaafbc10d in _int_malloc () from /lib64/libc.so.6
#1  0x00002aaaaafbd73d in malloc () from /lib64/libc.so.6
#2  0x0000003366625202 in gfx_new_dashed_line () from /usr/lib64/librrd.so.2
#3  0x0000003366612f22 in horizontal_log_grid () from /usr/lib64/librrd.so.2
#4  0x000000336661561d in grid_paint () from /usr/lib64/librrd.so.2
#5  0x000000336661788e in graph_paint () from /usr/lib64/librrd.so.2
#6  0x0000003366618b9f in rrd_graph () from /usr/lib64/librrd.so.2
#7  0x000000336662c5ef in HandleInputLine () from /usr/lib64/librrd.so.2
#8  0x000000336662c776 in main () from /usr/lib64/librrd.so.2
#9  0x00002aaaaaf6aaa4 in __libc_start_main () from /lib64/libc.so.6
#10 0x0000000000400529 in ?? ()
#11 0x00007fff455f7018 in ?? ()
#12 0x0000000000000000 in ?? ()

  Changed 3 years ago by oetiker

  • status changed from new to assigned

does this help?

--- rrd_graph.c (revision 874) +++ rrd_graph.c (working copy) @@ -1721,7 +1721,8 @@

{1.0, 5.0, 10., 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0}, {1.0, 2.0, 5.0, 7.0, 10., 0.0, 0.0, 0.0, 0.0, 0.0}, {1.0, 2.0, 4.0, 6.0, 8.0, 10., 0.0, 0.0, 0.0, 0.0},

- {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.}}; + {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.}, + {0,0,0,0,0, 0,0,0,0,0} /* last line */ };

int i, j, val_exp, min_exp; double nex; /* number of decades in data */

@@ -1730,7 +1731,7 @@

int mid = -1; /* row in yloglab for major grid */ double mspac; /* smallest major grid spacing (pixels) */ int flab; /* first value in yloglab to use */

- double value, tmp; + double value, tmp, pre_value;

double X0,X1,Y0; char graph_label[100];

@@ -1749,11 +1750,11 @@

mid++; for(i = 0; yloglab[mid][i + 1] < 10.0; i++); mspac = logscale * log10(10.0 / yloglab[mid][i]);

- } while(mspac > 2 * im->text_prop[TEXT_PROP_LEGEND].size && mid < 5); + } while(mspac > 2 * im->text_prop[TEXT_PROP_LEGEND].size && yloglab[mid][0] > 0);

if(mid) mid--;

/* find first value in yloglab */

- for(flab = 0; frexp10(im->minval, &tmp) > yloglab[mid][flab]; flab++); + for(flab = 0; yloglab[mid][flab] < 10 && frexp10(im->minval, &tmp) > yloglab[mid][flab] ; flab++);

if(yloglab[mid][flab] == 10.0) {

tmp += 1.0; flab = 0;

@@ -1765,9 +1766,14 @@

X1=im->xorigin+im->xsize;

/* draw grid */

- while(1) { + pre_value = DNAN; + while(1) { +

value = yloglab[mid][flab] * pow(10.0, val_exp);

+ if ( AlmostEqual?2sComplement(value,pre_value,4) ) break; /* it seems we are not converging */ + pre_value = value; +

Y0 = ytr(im, value); if(Y0 <= im->yorigin - im->ysize) break;

  Changed 3 years ago by oetiker

does this help?

--- rrd_graph.c (revision 874)
+++ rrd_graph.c (working copy)
@@ -1721,7 +1721,8 @@
        {1.0, 5.0, 10., 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0},
        {1.0, 2.0, 5.0, 7.0, 10., 0.0, 0.0, 0.0, 0.0, 0.0},
        {1.0, 2.0, 4.0, 6.0, 8.0, 10., 0.0, 0.0, 0.0, 0.0},
-       {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.}};
+       {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.},
+       {0,0,0,0,0, 0,0,0,0,0} /* last line */ };
 
     int i, j, val_exp, min_exp;
     double nex;                /* number of decades in data */
@@ -1730,7 +1731,7 @@
     int mid = -1;      /* row in yloglab for major grid */
     double mspac;      /* smallest major grid spacing (pixels) */
     int flab;          /* first value in yloglab to use */
-    double value, tmp;
+    double value, tmp, pre_value;
     double X0,X1,Y0;   
     char graph_label[100];
 
@@ -1749,11 +1750,11 @@
        mid++;
        for(i = 0; yloglab[mid][i + 1] < 10.0; i++);
        mspac = logscale * log10(10.0 / yloglab[mid][i]);
-    } while(mspac > 2 * im->text_prop[TEXT_PROP_LEGEND].size && mid < 5);
+    } while(mspac > 2 * im->text_prop[TEXT_PROP_LEGEND].size && yloglab[mid][0] > 0);
     if(mid) mid--;
 
     /* find first value in yloglab */
-    for(flab = 0; frexp10(im->minval, &tmp) > yloglab[mid][flab]; flab++);
+    for(flab = 0; yloglab[mid][flab] < 10 && frexp10(im->minval, &tmp) > yloglab[mid][flab] ; flab++);
     if(yloglab[mid][flab] == 10.0) {
        tmp += 1.0;
        flab = 0;
@@ -1765,9 +1766,14 @@
     X1=im->xorigin+im->xsize;
 
     /* draw grid */
-    while(1) {
+    pre_value = DNAN;
+    while(1) {       
+
        value = yloglab[mid][flab] * pow(10.0, val_exp);
+        if (  AlmostEqual2sComplement(value,pre_value,4) ) break; /* it seems we are not converging */
 
+        pre_value = value;
+
        Y0 = ytr(im, value);
        if(Y0 <= im->yorigin - im->ysize) break;
 

  Changed 3 years ago by Jarod Wilson <jwilson@…>


I cannot reproduce the bug anymore.

This might be related to the latest freetype update on FC6 (2.2.1-4.fc6) as this is the only change that has something to do with rrdtool.

The provided patch breaks the build of rrdtool-1.2.15-3.fc6:

rrd_graph.c: In function 'horizontal_log_grid':
rrd_graph.c:1771: warning: implicit declaration of function
'AlmostEqual2sComplement'

I will try the current trunk of the svn repo if I get some time for it.

  Changed 3 years ago by oetiker

ok fixed the fix by declaring almost... early enough ...

can anyone still reproduce the problem ?

  Changed 3 years ago by Jeremias Reith <jr@…>

My last comment from  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=201241:

I was able to reproduce the bug again. This time on a current RHEL 4 (x86).

The given fix does not solve the problem (tried the svn snapshot of september 4th).

Is there anything else I can do in order to get this bug fixed?

  Changed 3 years ago by Jarod Wilson <jwilson@…>

Latest entry in the bugzilla ticket:

I suspect the actual problem may be with freetype. Jeremias, is there any way you can try rebuilding the FC6 freetype package on RHEL4, installing the resulting binraies, and retest?

  Changed 3 years ago by oetiker

Is there a way to reproduce this ? since your previous script does not fail anymore for me ...

cheers tobi

  Changed 3 years ago by tester@…

  • cc tester@… added

Using the freetype from fc6test3 on rhel4 didnt work for me. rrdtool graph is using lots of RAM with 1.2, 1.0 worked well with the same commands. After running massif on it, I get 98% of allocations coming from rrd_fetch.c:312 (rrd_fetch_fn). So I dont think it has anything to do with freetype.

  Changed 3 years ago by oetiker

tester,

you would not happen to have a sample script which creates the rrd, fills it with data and then dies graphing it ?

cheers tobi

Changed 3 years ago by tester@…

rrd files that make it crash

  Changed 3 years ago by tester@…

I can make it crash with the attached rrd files and the following command:

rrdtool graph rps.png --start end-2m --end 1160692529 -w 1024 -h 300 -v MB/sec --x-grid MINUTE:2:MINUTE:2:MINUTE:2:0:%X --alt-y-grid -X 0 -E DEF:7_16_def_rpl=nte-7-16.rrd:rplreceivedDefKb:AVERAGE DEF:7_16_gr_rpl=nte-7-16.rrd:rplreceivedGrKb:AVERAGE DEF:7_16_def_rq=nte-7-16.rrd:reqsentDefKb:AVERAGE DEF:7_16_gr_rq=nte-7-16.rrd:reqsentGrKb:AVERAGE CDEF:7_16_def=7_16_def_rpl,UN,0,7_16_def_rpl,IF,7_16_def_rq,UN,0,7_16_def_rq,IF,+,1024,/ CDEF:7_16_gr=7_16_gr_rpl,UN,0,7_16_gr_rpl,IF,7_16_gr_rq,UN,0,7_16_gr_rq,IF,+,1024,/ AREA:7_16_def#FF9000:ia-007_Write STACK:7_16_gr#FF0000:ia-007_Write_Guaranteed DEF:7_15_def_rpl=nte-7-15.rrd:rplreceivedDefKb:AVERAGE DEF:7_15_gr_rpl=nte-7-15.rrd:rplreceivedGrKb:AVERAGE DEF:7_15_def_rq=nte-7-15.rrd:reqsentDefKb:AVERAGE DEF:7_15_gr_rq=nte-7-15.rrd:reqsentGrKb:AVERAGE CDEF:7_15_def=7_15_def_rpl,UN,0,7_15_def_rpl,IF,7_15_def_rq,UN,0,7_15_def_rq,IF,+,1024,/ CDEF:7_15_gr=7_15_gr_rpl,UN,0,7_15_gr_rpl,IF,7_15_gr_rq,UN,0,7_15_gr_rq,IF,+,1024,/ STACK:7_15_def#009cFF:ia-007_Read STACK:7_15_gr#000cFF:ia-007_Read_Guaranteed DEF:8_16_def_rpl=nte-8-16.rrd:rplreceivedDefKb:AVERAGE DEF:8_16_gr_rpl=nte-8-16.rrd:rplreceivedGrKb:AVERAGE DEF:8_16_def_rq=nte-8-16.rrd:reqsentDefKb:AVERAGE DEF:8_16_gr_rq=nte-8-16.rrd:reqsentGrKb:AVERAGE CDEF:8_16_def=8_16_def_rpl,UN,0,8_16_def_rpl,IF,8_16_def_rq,UN,0,8_16_def_rq,IF,+,1024,/ CDEF:8_16_gr=8_16_gr_rpl,UN,0,8_16_gr_rpl,IF,8_16_gr_rq,UN,0,8_16_gr_rq,IF,+,1024,/ STACK:8_16_def#FFa800:sh28_Write STACK:8_16_gr#FF1800:sh28_Write_Guaranteed DEF:8_15_def_rpl=nte-8-15.rrd:rplreceivedDefKb:AVERAGE DEF:8_15_gr_rpl=nte-8-15.rrd:rplreceivedGrKb:AVERAGE DEF:8_15_def_rq=nte-8-15.rrd:reqsentDefKb:AVERAGE DEF:8_15_gr_rq=nte-8-15.rrd:reqsentGrKb:AVERAGE CDEF:8_15_def=8_15_def_rpl,UN,0,8_15_def_rpl,IF,8_15_def_rq,UN,0,8_15_def_rq,IF,+,1024,/ CDEF:8_15_gr=8_15_gr_rpl,UN,0,8_15_gr_rpl,IF,8_15_gr_rq,UN,0,8_15_gr_rq,IF,+,1024,/ STACK:8_15_def#00b4FF:sh28_Read STACK:8_15_gr#0024FF:sh28_Read_Guarantee

  Changed 3 years ago by human

  • cc jbreault@… added
  • priority changed from major to critical

here is an strace of one of the process killing the computer ... when this process run 1 min later 5 gig of ram consumed and iowaits go crazy and swap almost full ...

any idea ??

follow-up: ↓ 16   Changed 3 years ago by human

here is the link of the strace file  http://www.nexweb.ca/test_log.tgz

in reply to: ↑ 15   Changed 3 years ago by oetiker

Replying to human:

here is the link of the strace file  http://www.nexweb.ca/test_log.tgz

Hi,

I assume you are all working with the latest snapshot ... Use the limit command to set coredumpsize to unlimited. Now start the process and as soon as it starts growing (brk phase) use kill -SEGV $pid to produce a core file. Now load the core with gdb and run a bt command to see where that memory allocation is going wild ...

cheers tobi

follow-up: ↓ 19   Changed 3 years ago by human

This bit us (from munin-graph) when we updated from librrds-perl 1.2.11-0.6 today to 1.2.15-0.2 in Debian Linux "etch/testing". It went bananas (ram-wise) in both IA-32 and amd64, swallowing up 8GB before being killed. This was for a graph that has been running fine for a couple of months.

Kernel 2.6.17 and 2.6.16, and glibc 2.3.6.

The graph in question used "graph_args -l 1 --logarithmic --base 1000 --units-exponent 0". Removing these has apparently cured the problem (at the expense of the usefuless of the graph).

  Changed 3 years ago by human

  • cc harlan@… added

in reply to: ↑ 17   Changed 3 years ago by oetiker

Replying to human:

Kernel 2.6.17 and 2.6.16, and glibc 2.3.6. The graph in question used "graph_args -l 1 --logarithmic --base 1000 --units-exponent 0". Removing these has apparently cured the problem (at the expense of the usefuless of the graph).

can you please try the latest snapshot ?  http://oss.oetiker.ch/rrdtool/pub/beta/rrdtool-svn-snap.tar.gz

cheers tobi

follow-up: ↓ 21   Changed 3 years ago by human

Kernel 2.6.17 and 2.6.16, and glibc 2.3.6. The graph in question used "graph_args -l 1 --logarithmic --base 1000 --units-exponent 0". Removing these has apparently cured the problem (at the expense of the usefuless of the graph).

can you please try the latest snapshot ?  http://oss.oetiker.ch/rrdtool/pub/beta/rrdtool-svn-snap.tar.gz cheers tobi

Looks good so far! I'll let you know if it blows up at some point. Thank you very much....I'll ask the Debian maintainer to package the later version. If he wants a specific revision, is it r881 that fixed this? r887? (The reason I'm asking about the latter is because our values on this graph are usually zero. I understand that 10x=0 is unsolvable (via the math I studied, anyway), but it's still handy to have a logarithmic graph for this data.)

The logarithmic graphs used to have five minor Y-axis lines for every major division, and now have ten; this is fine with me, but thought I'd mention it in case it is unexpected by you.

You rock,
Pete

in reply to: ↑ 20   Changed 3 years ago by oetiker

Replying to human:

The logarithmic graphs used to have five minor Y-axis lines for every major division, and now have ten; this is fine with me, but thought I'd mention it in case it is unexpected by you.

A physicist who uses rrdtool to plot pressure time series contributed a patch recently (which also caused the breakage) that makes log plots behave more consistently across a wider range of values. The increassed number of lines is one of the symptoms of this patch.

I will be releasing rrdtool 1.2.16 RSN ... the rXXXX is just the svn revision and not something which should be packed I guess ... unless the debian person cares to isolate the change which fixes the log behavior.

You rock,
Pete

follow-ups: ↓ 23 ↓ 24   Changed 3 years ago by human

Unfortunatly, currently Debian is in a deep code freeze, so it would be helpful to isolate the fixes. Could you give some hints which changesets should be taken?

Cheers, Andi

in reply to: ↑ 22   Changed 3 years ago by oetiker

  • status changed from assigned to closed
  • resolution set to fixed

Replying to human:

Unfortunatly, currently Debian is in a deep code freeze, so it would be helpful to isolate the fixes. Could you give some hints which changesets should be taken?

I guess the important bits for this are [881] and [887]

cheers tobi

in reply to: ↑ 22 ; follow-up: ↓ 25   Changed 3 years ago by human

Replying to human:

Unfortunatly, currently Debian is in a deep code freeze, so it would be helpful to isolate the fixes. Could you give some hints which changesets should be taken?

Like Tobi just said, looks like both r881 and r887 should be picked up. Both apply to 1.2.15 with only a 2-line offset. I'm respinning fedora packages with a patch including those two changesets this afternoon.

in reply to: ↑ 24 ; follow-up: ↓ 26   Changed 3 years ago by oetiker

Replying to human:

So while you are at it, you might want to put this in too Patch

in reply to: ↑ 25   Changed 3 years ago by human

Replying to oetiker:

So while you are at it, you might want to put this in too Patch

Okay, that'll also be in the build I'm pushing today...

--jarod

Note: See TracTickets for help on using tickets.

NOTE: The content of this website is accessible with any browser. The graphical design though relies completely on CSS2 styles. If you see this text, this means that your browser does not support CSS2. Consider upgrading to a standard conformant browser like Mozilla Firefox or Opera but also Apple's Safari or KDE's Konqueror for example. It may also be that you are looking at a mirror page which did not copy the CSS for this page. Or if some pictu res are missing, then the mirror may not have picked up the contents of the inc directory.