31 December, 2007

Last Performance Blog

According to a reply on my previous blog, I could just as well test if having my hamster run in his wheel would increase drawingspeed as long as I don't use valgrind and cachegrind.

Well, I've never used those tools before, but hey. Let's give it a try.

Valgrind and Dolphin (resizing all the time).
Well, valgrind talked about memleaks. I get that. The details of that, however, don't tell me much (of course, that's an understatement).

==21180== LEAK SUMMARY:
==21180== definitely lost: 8,383 bytes in 342 blocks.

The output for individual parts where something is lost is like this:
==21180== 216 bytes in 1 blocks are definitely lost in loss record 175 of 257
==21180== at 0x4021765: malloc (in /usr/lib/valgrind/x86-linux/vgpreload_memcheck.so)
==21180== by 0x56BB885: _XimOpenIM (in /usr/lib/libX11.so.6.2.0)
==21180== by 0x56B88CF: _XimRegisterIMInstantiateCallback (in /usr/lib/libX11.so.6.2.0)
==21180== by 0x5699517: XRegisterIMInstantiateCallback (in /usr/lib/libX11.so.6.2.0)
==21180== by 0x503F73D: QXIMInputContext::QXIMInputContext() (qximinputcontext_x11.cpp:361)
==21180== by 0x503E4A5: QInputContextFactory::create(QString const&, QObject*) (qinputcontextfactory.cpp:120)
==21180== by 0x4B928F1: QApplication::inputContext() const (qapplication.cpp:4541)
==21180== by 0x4BD5046: QWidget::inputContext() (qwidget.cpp:245)
==21180== by 0x4C051CC: QWidget::destroy(bool, bool) (qwidget_x11.cpp:853)
==21180== by 0x4BD7D8F: QWidget::~QWidget() (qwidget.cpp:1264)
==21180== by 0x4EADB04: QLineEdit::~QLineEdit() (qlineedit.cpp:357)
==21180== by 0x4E76C8A: QComboBox::setLineEdit(QLineEdit*) (qcombobox.cpp:1599)

Above is with QtCurve as style, btw.

Well, I think that would explain it all if I just had more brains. Unfortunately, I don't, so let's go on to kcachegrind. Again with QtCurve, which, as I mentioned before, draws a bit faster than Oxygen.

Well, what can I say. Lovely graphs, that's for sure. As far as I can tell, this tool shows where the app is spending it's time. I guess the 100% ld-2.7.so doesn't mean that's what the app does all the time.
The < cycle 13 > I see as LibQtCore.so.4.4.0 doesn't explain much either. Digging down (this kcallgrind tool is pretty cool, even though I don't really understand it) I seem to see a lot of calls to QWidgetBackingStore?!? And there are over 2.4 million calls to isPaintOrScrollDoneEvent, but the % spend there (if that's what the % mean) is just 1.5%... Hmmm, again - big ?????

Or is that second line, self, the one which shows how much a function is called? Sounds like it could be true. In that case, at least 12% of the time of Dolphin (when resizing all the time, of course) is spend in varous libc-2.7.so calls.
The funny thing is that the orders for the libc-2.7.so functions seem to come all from the same Qt function: QListView::paintEvent < cycle 13 >. And then some KFileitemDelegate::paint things.

This could mean QListView is the problem (did I say something about QScrollBar? Does that have something to do with QListView?)? Or does this mean my hamster simply didn't run fast enough?

I'm sorry. This does not make anything clear to me. As I said before, I'm going to wait for someone actually knowledgeable in this area to say something.

A commenter on my previous blog did use callgrind, btw, but his findings were not very clear (to him, I mean, they were obviously not clear to me anyway).

Luckily I received also a few private mails, ranging from "Xephyr does indeed weird stuff, Qt4 apps draw much faster in it than in plain X" to "go on blogging about this". Which I won't, btw, as I never intended to go this far. Sure, many ppl seem to find it interesting, but the complaints about "don't whine, do something", while not making me happy, have a point. I can't do anything, so someone who actually CAN should take over. Or not, in that case - well. Shit happens. Maybe just use GTK apps, then. I had to install a bunch of GTK libs for kcachegrind anyway, for some weird reason:
libgnomecanvas, libgnome, libbonoboui, gnome-keyring, libgnomeui, graphviz, gail, gconf, gnome-vfs and libbonobo (including a bunch of errors about missing gtk things and other stuff, even though kcachegrind works fine)
Why KCachegrind needed that stuf: please let me know. I'd love to get rid of it again.

Edit: aah, probably optional (but apparently compiled in) dependencies of Graphviz... Weird, the graphviz page doesn't mention them, so I still think it's faulty.... So far for Arch linux is mean & lean.

Oh, and Happy new year for those who already are in 2008 when they read this ;-)

Edit2: And the Gwenview author, Aurélien Gâteau, just emailed me to ask me to test some improvements to the panning in Gwenview 4 he made in response to my blogs. So I did test it, and I am happy to tell you all the slower panning in Gwenview in KDE 4 is totally gone - the KDE 4 version is now as fast as the KDE 3 version! Thank you, Aurélien!

Edit3: make the blog sound less like I'll be crying all night because someone said bad things - after all, that's not the point, and I don't want to sound like I won't blog anymore if ppl are mean...


  1. The message I'm getting from this blog post is "don't be sarcastic or I stop writing". You shouldn't stop measuring performance and pointing out flaws just because somebody tells you you're not great at finding the exact cause of them.

    I'm sure there are many people who enjoy the "performance" blog posts, such as me.

    Sarcasm, irony and wit are essential parts of the Internet. I get comments like this all the time on my blog.

  2. Well, I was indeed rather unhappy with the comment - but that wasn't the only point. I think the person was right in that I got as far as I could go - someone else should go on and finish this.

    I'll sure blog about performance issues, but not this one - I've done more than I can (5 or 6 blogs about that? More than enough...).

    And I'm very happy Aurelién found and fixed the panning issue in Gwenview, at least I can say the blogs did something for users of KDE 4.0 ;-)

    BTW I am thinking about disabling anonymous comments, as criticism is much easier if you know who it comes from - I know, for example, Thomas loves me since I named my hamster after him, and any criticism he has for me isn't meant in any bad way ;-)
    Anonymous ppl, on the other hand, especially when they have a point, are less nice. But on the other hand, many anonymous posters (including ppl I actually know, as I found out later) had very interesting things to say. So I won't change anything yet.

  3. nice to read, that gwenview profits from this blog :)

    another comment to gwenview-kde4:

    scrolling in a thumbnail-view within a directory with many many images, while the thumbnails are generated was good in gwenview-kde3 because thumbnails at the actual-scroll-position were rendered first. I hope this will be done again for the kde4-version, this seems not to be the case yet.

  4. Hello Jos !

    First of all, while you are upset by one critical comment, please don't forget about all other positive comments and all other people who just silently enjoyed your blogs ! Even if you can't solve the issue yourself, at least you can raise attention to it and motivate other people to take a look at it. You have done even more: you actually spent you time trying to solve it.

    Profiler is not a magic: it can't tell you everything. Knowing that 80% of CPU time is spent in some mysterious function in nvidia_drv.so is helpful, but knowing that the problem disappears with intell drivers can be equally important and profiler can't give you this knowledge ! Your blog contains lots of very valuable information for a developer to start fixing the issue. This information just can't be gathered by a developer himself unless he owns 10 different PCs with different video hardware and operation systems.

    So please, keep up you great work ! And when you get negative comment next time, please take into account not that comment alone but also all other positive comments !

  5. I have to say your posts don't come off as whiny at all (you even made an effort to understand the profiler output, which isn't an easy task), and they're helping to improve the quality of KDE4 by keeping performance in the mind of the developers.

    Let's hope the Gwenview fix can give Dolphin devs a hint about what's slowing things down.

  6. I've tried to profile using oprofile. It can be better than valgring because it does system-wide profiling taking into account not only your application thread but anything that is done by your CPU including Xorg, drivers and kernel. The downside is that it is statistical profiler and sometimes can give wrong results.

    I've started konqueror4 in KDE3 session and profiled it only while resizing. The results are:

    95755 22.0982 vmlinux-dbg-2.6.20-16-generic
    65766 15.1774 nvidia_drv.so
    65392 15.0911 Xorg
    50486 11.6511 libqt-mt.so.3.3.7
    38042 8.7793 libnvidia-wfb.so.100.14.19
    25348 5.8498 libc-2.6.1.so
    16337 3.7702 libQtGui.so.4.3.2
    14582 3.3652 oprofiled
    7799 1.7998 oprofile
    6550 1.5116 libz.so.
    5994 1.3833 libX11.so.6.2.0
    4602 1.0620 libasound.so.2.0.0
    3755 0.8666 libkdecore.so.4.2.0
    3717 0.8578 libmad.so.0.2.1
    2522 0.5820 nvidia
    2062 0.4759 libkhtml.so.5.0.0
    2023 0.4669 kwin3_crystal.so
    1913 0.4415 libflashplayer.so
    1763 0.4069 libstdc++.so.6.0.9
    1697 0.3916 libkdeinit_kwin.so
    1501 0.3464 libQtCore.so.4.3.2

    (please keep in mind that this is not accumulated statistics: percentages includes only time spent in the library itself not taking into account libraries called by given library)

    Now I should download debug-enabled version of kernel, Xorg, qt and KDE and try to interpret symbol-level and callgraph profiles, but I have pending exams in university really soon so I have to postpone it.

    If someone will be able to find a time sooner that me, here is quick receipt to run oprofile:
    $ sudo opcontrol --vmlinux=/boot/vmlinux-DEBUG-VERSION --callgraph=32
    [ start you application and make sure nothing else uses your CPU ]
    $ sudo opcontrol --start
    [ at least for one minute do what you want to profile - resize window for example ]
    $ sudo opcontrol --stop

    After it you can parse the results. Most simple per-process results:
    $ opreport
    Per-symbol results:
    $ opreport -a -l
    Additional files with symbols (for example -dbg version of libs or kernel modules) can be added by -p [path_to_file1,path_to_file2,...] option.
    It is possible to convert results into a form parsable by kcachegrind:
    $ opreport -gdf | op2calltree
    - this will create lots of files named oprof.out.PROCNAME which can be opened by kcachegrind

  7. Jos - you made it to Digg! :)


  8. Well jos, I don't agree with you at all on this one. The poster you mention was a jerk, just that. And he didn't have a point at all.

    Jerk (let's call him that way) doesn't know what the scientific method is. Yes we all were guessing. But what you were doing was to change one factor at a time in order to isolate the myriad of possibilities. For example NVIDIA XRENDER performance, which I am afraid, has a lot to do with graphics on a X11 server. How someone can compare that with putting his bloody cat on top of his bloody monitor is beyond me.

    So you may think that your many blog posts about performance are enough and that you went as far as you could, and decide to drop the subject. That's fine. But that doesn't mean that Jerk was right.

    So please next time just ignore someone as stupid as that, will you?. It's not worth it.

    Take care and continue having fun with KDE and its great community.

  9. I really have enjoyed reading your QT4 performance posts - it's like reading a murder mystery! Whodunit? Was it the QScrollbar, in X, with an Nvidia card? Hehe.

    Anyway, I know nothing about profiling either, so it was interesting to see how much could be figured out by just testing stuff and asking the community. Plus I'm really glad to hear that Gwenview resizes faster now!

  10. Hi,

    personally I'm happy you decided this. I mean; the blogs went as far as it could go. We have a lot of reports from people that think redrawing is too slow; this is valuable data. Remember (like Zack says) there are no absolute numbers in measuring graphics; just "fast enough".

    There are a dozen ways to do profiling and there is never just one piece that is wrong. Which is why I agree with the poster you pointed out. The correct way of fixing this issue is to figure out exactly what is the slowest part and fix that. Then start profiling again and figure out what is the slowest part now, and fix that too. Until drawing is fast enough.

    Because of this problem that there may be multiple pieces causing slowdown I don't think your method of 'elimination' is useful. Profiling to figure out which is the slowest is useful.

    And it got too many commenters concluding based on guesswork that Qt was slow. Which is something I thinks is unfair and not in our best interrest to put on dig ;)

    After all; its open source software. Just empty out the QScrollBar painting method and see if things get faster :)

    Anyway; thanks and please do write that bugreport to qt-bugs with your observed issue.

  11. please don't disable anonymous comments - blogger often refuses to log me in for no apparent reason.

  12. Hello,

    In this phoronix post (they looks like GPU experts), they talks about a recent NVidia drivers release, and amongst the bug fix we can find:

    by fixing some other XRender issues. Another new fix in 169.07 is fix for scrolling ARGB X drawables in Qt

    Maybe it's related... it seems pretty old (December 20), but maybe it's related to your problem.

    Good luck hunting,

  13. Discovered this blog only now, and I'd like to share some experiences.

    The application I'm working on, which uses Qt 4 has similar performance problems, but they do not always show up.

    I've profiled Qt 4, and it's painting routines many times, and certainly in Qt 4.3.3 they are very fast!

    Still the high xorg load at times, drives me mad! My thoughts are, like some people posted here as well, that Qt 4 relies on XRENDER extensions a LOT, and drivers that do not properly implement it, or use software fallbacks suffer a lot, and xorg and sometimes the application itself start sucking cpu...

    One user reported 100% xorg load and ~ 2-4 frames/per second with his GeForce FX 5700 Ultra!!
    It appeared that Mandriva had configured xorg to NOT use RenderAccel !

    Enabling it the same operations required ~ 1% xorg cpu load, and painting was very smooth!

    So let's hope that xorg and drivers get updated in time before KDE4 and Qt 4 based applications get widely used.

  14. By default, cmake doesn't strip binary and library so you can try this:
    cd /path/to/kde4/bin
    for i in $(ls); do; strip --strip-all $i; done
    cd /path/to/kde4/lib
    for i in $(ls); do; strip --strip-all $i; done

  15. Better command:
    cd /path/to/kde4/bin
    find . | xargs file | grep "ELF 32-bit LSB executable" | cut -f 1 -d : | xargs strip --strip-all
    cd /path/to/kde4/lib
    find . | xargs file | grep "shared object" | cut -f 1 -d : | xargs strip --strip-all


Say something smart and be polite please!