Discussion:
[KPhotoAlbum] Hopefully final (at least for now) performance improvements
Robert Krawitz
2018-05-22 01:16:36 UTC
I'm done with this round of performance improvements to the loader.

On my system (2.8/3.7 GHz 4-core mobile Skylake, 2TB 2.5" Seagate
spinners, Crucial MX300 SATA SSD), I'm pretty much able to max out the
I/O system loading images from the HDD. It takes just shy of 15
minutes to load in the 10839 images totaling 92 GB, so that's a little
over 100 MB/sec. Cat'ing the files through dd into /dev/null takes a
little less time, maybe 14:30-14:50, but that's within a few percent
and as far as I'm concerned constitutes "full speed". The CPU is
nowhere near fully loaded. Note that I use noatime in my mount
options; without that, I probably wouldn't get as good performance.

Loading the images from the SSD takes about 4:35, or about 335 MB/sec.
The CPU as a whole runs about 60% user CPU and maybe 75% total CPU.
That's not maxing out the SSD by any means, but it's very respectable.
If I use two scout threads, it drops to about 3:55; three scout
threads is about 3:48. This is in the range of 390 MB/sec. That
again isn't maxing it out -- max is a little over 500 MB/sec -- but
it's clearly getting toward CPU-bound at that point when thumbnails
are being built, and I've found that it takes quite a few parallel
requests to max out the SSD. However, more scout threads hurt
performance on the HDD by about 10-11% (16:43). For now I don't see a
lot of point trying to goose this more; most people are still going to
be using HDD's to store their images, and if they aren't, they
probably won't object to taking 4 minutes to load what for most people
is an enormous number of photos.

It would be possible to get more performance if the images are split
over multiple disks. That's actually my use case; there are no >2TB
laptop HDD's, and one camera goes to one disk while the other(s) go to
the other. Usually most of the images go to one of the disks. If I
interleaved loading images on the two filesystems, and possibly used
one scout per disk or something, I probably could do better, but it's
a somewhat specialized use case and my problem at that point is
getting the images off the media in the first place.

So a more useful optimization would be to subclass the file searcher,
so my download script could feed the files in one at a time and allow
kpa to work in parallel with that. Since my cards are all
considerably slower than my storage, I would be completely limited by
the card I/O speed (at least until I get faster cards or in one case a
faster reader). That's something I may take a look at again next
fall.

So Johannes, I think we can take the Load-Performance branch through
review and merge it when you're comfortable. I'm quite confident that
this will solve some of the segfaults when exiting while thumbnails
are building, and perhaps some of the other problems that have been
seen that I think are due to misuse of QThreads.
--
Robert Krawitz <***@alum.mit.edu>

*** MIT Engineers A Proud Tradition http://mitathletics.com ***
Member of the League for Programming Freedom -- http://ProgFree.org
Project lead for Gutenprint -- http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton
Robert Krawitz
2018-05-22 02:07:18 UTC
BTW, I'd like the NFS users to give this a try. I suspect that NFS,
like SATA SSD, will actually benefit from more scout threads, so you
may want to try increasing that (in DB/NewImageFinder.cpp on the
Load-performance branch, try setting imageScoutCount to 2 or 3 and see
if you do any better).
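
That is, something along these lines in DB/NewImageFinder.cpp (the
exact declaration there may differ from this sketch):

```cpp
// DB/NewImageFinder.cpp (Load-performance branch)
// Number of scout threads pre-reading image files. 1 works best for a
// single HDD; SATA SSD -- and likely NFS -- may benefit from 2 or 3.
const int imageScoutCount = 2;
```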

Perhaps counterintuitively, I suspect that NFS behaves a lot like a
very slow SATA SSD -- the actual transfer rate off the media is very
fast compared to the protocol latency, so a higher degree of
parallelism will allow for better I/O overlap and therefore better
throughput. With hard disks, on the other hand, transfers are
dominated by rotational latency and head-seek latency --
things that are not amenable to parallelization. The only thing the
scout thread really gives us -- and it's not unimportant, to the tune
of 10% -- is the ability to keep the disk busy, because we don't have
to have the disk wait for us to finish processing an image and load
another one. So we can pipeline operations on HDD's, but not really
parallelize them, where we can with SATA SSD, and I suspect with NFS
too.

NVMe is something else; the interface transfer rate is quite a lot
faster, but the latency is also a lot lower. But the image loading
pipeline simply isn't fast enough on current processors that I have
available to me. It's possible that an i7-8700K or i9-7940X or the
like might just be fast enough to take some advantage of an NVMe
device, particularly if overclocked (the former because of its single
thread performance, the latter because of the combination of very good
single thread performance and high thread count to process
thumbnails). I don't have such a system available, but it's possible
that either of those with fast memory might be able to load images at
1-1.2 GB/sec with the kpa image pipe, which is a pretty good match for
NVMe throughput. Either one would likely need some extra scout
threads. But in reality, someone needs to have an awfully big photo
shoot, a ridiculous way to transfer data to the system (maybe raw
4K or 8K video frames over Infiniband?), and an absurd budget to make
this meaningful in any practical sense.

NVMe is simply too fast right now for most workloads to take full
advantage of it. Historically, it's not common for CPU to be the
limiting factor with data-intensive workloads, but NVMe with current
CPUs is an exception.
--
Robert Krawitz <***@alum.mit.edu>