Discussion:
[KPhotoAlbum] Speed up new image load time
Robert Krawitz
2017-05-29 21:27:49 UTC
Permalink
I've improved the load time for new images in kpa by batching the EXIF
information into a single database transaction, as I expected.
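
The gist, as a minimal sketch rather than the actual KPA code (the
table and column names here are made up): one transaction and one
prepared statement for the whole batch, instead of a commit per file.

    #include <QSqlDatabase>
    #include <QSqlQuery>
    #include <QStringList>

    void saveExifBatch(QSqlDatabase &db, const QStringList &files)
    {
        db.transaction();              // one transaction for the batch
        QSqlQuery query(db);
        query.prepare(QStringLiteral(
            "INSERT INTO exif (filename, width, height) "
            "VALUES (?, ?, ?)"));
        for (const QString &file : files) {
            query.bindValue(0, file);
            query.bindValue(1, 0);     // stand-ins for the EXIF fields
            query.bindValue(2, 0);
            query.exec();              // no commit (and no fsync) here
        }
        db.commit();                   // single commit at the end
    }

With SQLite, every statement outside an explicit transaction gets its
own implicit transaction and its own sync to disk, which is presumably
where the old per-file cost came from.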

Some timings, for loading 1133 images:

            Old     New
  20 MP     5:41    0:32
  0.2 MP    6:00+   0:07

Not sure why the old code took longer to read in the 0.2 MP images
than the full-size ones; probably due to whatever else was going on on
my system. But for the realistic case, I get about a 10x speedup
overall.

It appears that there were actually multiple places in the code
reading the EXIF data and storing it in the database, one file at a
time. I fixed what looked like one, and the time for the 0.2 MP files
dropped only to 1:45 or so, so there must have been others.

It looks like storing the EXIF data in the database takes about 3
seconds. The next big time consumer is file version detection; if I
turn that off, the total time drops to about 7 seconds. At that
point, in a realistic scenario, I'd likely be I/O-bound; if I were
loading 3000 images (30 GB, typically), I'd need on the order of
250-300 seconds just to read the data from disk. But if someone were
storing their images on NVMe, the CPU time might still matter.
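
(For what it's worth, 250-300 seconds for 30 GB works out to roughly
100-120 MB/s of sequential reads, i.e. a typical spinning disk.)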

In any event, this will make loading new images a lot more bearable.
I'm used to having to wait 30 minutes to load images into kpa; this
should get that down to 5-ish minutes (with I/O being the limiting
factor).
--
Robert Krawitz <***@alum.mit.edu>

*** MIT Engineers A Proud Tradition http://mitathletics.com ***
Member of the League for Programming Freedom -- http://ProgFree.org
Project lead for Gutenprint -- http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton
Robert Krawitz
2017-05-29 22:30:07 UTC
Permalink
>> if I were loading 3000 images (30 GB, typically), I'd need on the
>> order of 250-300 seconds just to read the data from disk.
> But do you need to read the whole data? If the image is for example a
> jpeg and contains a thumbnail in the header (many do), that could be
> used instead of reading the whole picture and creating one. Then there
> is no reason to read the whole picture and I/O will drop significantly.
Well, you need to read it at a bare minimum to compute the MD5 checksum.
As I often read my images over a network file system (AFS over
100Mbit/s is not uncommon for me), cutting I/O is a win.
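
For what it's worth, the embedded-thumbnail idea might look roughly
like this with exiv2 (which kpa already uses for EXIF). This is only a
sketch; the pData_/size_ members match exiv2 0.27, and newer releases
use accessors instead.

    #include <exiv2/exiv2.hpp>
    #include <QImage>

    // Pull the JPEG thumbnail out of the EXIF header, if there is one.
    // Only the metadata is parsed; the image data is never decoded.
    QImage embeddedThumbnail(const std::string &path)
    {
        auto image = Exiv2::ImageFactory::open(path);
        image->readMetadata();
        Exiv2::ExifThumbC thumb(image->exifData());
        Exiv2::DataBuf buf = thumb.copy();   // empty if no thumbnail
        QImage result;
        result.loadFromData(buf.pData_,
                            static_cast<int>(buf.size_), "JPG");
        return result;                       // null image if none
    }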
--
Robert Krawitz <***@alum.mit.edu>

Robert Krawitz
2017-05-29 22:47:05 UTC
Permalink
Post by Robert Krawitz
            Old     New
  20 MP     5:41    0:32
...
Post by Robert Krawitz
It looks like storing the EXIF data in the database takes about 3
seconds. The next big time consumer is file version detection; if I
turn that off, the total time drops to about 7 seconds. At that
point, in a realistic scenario, I'd likely be I/O-bound; if I were
loading 3000 images (30 GB, typically), I'd need on the order of
250-300 seconds just to read the data from disk. But if someone were
storing their images on NVMe, the CPU time might still matter.
Well, there's some very low-hanging fruit here: the modified-file
detection computes the MD5 checksum of each file twice! It's a very
simple matter to get rid of one of those; the time drops to about 20
seconds (which is consistent with what I saw running md5sum on all of
the files: it took about 10 seconds).
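
The checksum itself is just a streaming hash over the file; the fix is
simply not to do it twice per file. A hedged sketch (the function name
is mine, not KPA's):

    #include <QByteArray>
    #include <QCryptographicHash>
    #include <QFile>
    #include <QString>

    // CPU-cheap, but it still reads every byte of the file, so
    // computing it twice doubles the I/O.
    QByteArray md5Of(const QString &path)
    {
        QFile file(path);
        if (!file.open(QIODevice::ReadOnly))
            return QByteArray();
        QCryptographicHash hash(QCryptographicHash::Md5);
        hash.addData(&file);            // streams the whole file once
        return hash.result().toHex();
    }

The fix amounts to computing this once per new file and handing the
result to both the new-image scan and the modified-file check.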
--
Robert Krawitz <***@alum.mit.edu>

Robert Krawitz
2017-05-29 23:05:47 UTC
Permalink
Post by Robert Krawitz
...
Well, there's some very low-hanging fruit here: the modified-file
detection computes the MD5 checksum of each file twice! It's a very
simple matter to get rid of one of those; the time drops to about 20
seconds (which is consistent with what I saw running md5sum on all of
the files: it took about 10 seconds).
If I take out MD5 checksumming altogether it drops to about 8 seconds,
as would be expected.

Of that time, about 3-4 seconds is spent in what looks like saving the
EXIF data, 2-3 seconds scanning the filesystem, and 2-3 seconds
reading the files in (when I interrupted gdb several times during
that, it looked like most of it was library routines scanning the EXIF
headers).

So, 20-ish seconds to read in 1100 files, which would normally be
around 10 GB. And that's with a fairly slow processor; with a
contemporary fast processor it would be more like 10 seconds. With a
large amount of data, that would be completely I/O-bound unless you
had NVMe.

I think this problem is solved.
--
Robert Krawitz <***@alum.mit.edu>

Robert Krawitz
2017-05-29 23:35:50 UTC
Permalink
Post by Robert Krawitz
...
If I take out MD5 checksumming altogether it drops to about 8 seconds,
as would be expected.
Of that time, about 3-4 seconds is spent in what looks like saving the
EXIF data, 2-3 seconds scanning the filesystem, and 2-3 seconds
reading the files in (when I interrupted gdb several times during
that, it looked like most of it was library routines scanning the EXIF
headers).
So, 20-ish seconds to read in 1100 files, which would normally be
around 10 GB. And that's with a fairly slow processor; with a
contemporary fast processor it would be more like 10 seconds. With a
large amount of data, that would be completely I/O-bound unless you
had NVMe.
I think this problem is solved.
I tried the same experiment on my server (i7-5820K, with single
threads a bit more than twice as fast as my laptop). The first time I
loaded the new files, it was on pace to take something like a
minute. When I repeated it, it took 15 seconds. That's I/O-bound,
and short of not computing the MD5, there's not much we can do.

One option, if detect-duplicate-files isn't turned on, would be to
compute the MD5 checksum only when the thumbnails are created or the
image is viewed. Since the working set of the images is frequently
larger than RAM, this would save on I/O. But it would be rather
complicated, I suspect.
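
A speculative sketch of the idea (all names mine, not KPA's): defer
the hash until the first time somebody asks for it, which would be the
point where the image has to be read anyway.

    #include <QByteArray>
    #include <QCryptographicHash>
    #include <QFile>
    #include <QString>

    // Holds the path and computes the MD5 at most once, lazily.
    class LazyMd5
    {
    public:
        explicit LazyMd5(const QString &path) : m_path(path) {}

        QByteArray value()
        {
            if (m_sum.isEmpty()) {      // first use: hash the file
                QFile file(m_path);
                if (file.open(QIODevice::ReadOnly)) {
                    QCryptographicHash hash(QCryptographicHash::Md5);
                    hash.addData(&file);
                    m_sum = hash.result().toHex();
                }
            }
            return m_sum;               // cached thereafter
        }

    private:
        QString m_path;
        QByteArray m_sum;
    };

The thumbnail builder or the viewer would call value() while it has
the file hot in the page cache, so the hash would cost little or no
extra I/O.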

This may not be entirely accurate, because I ran it to a remote
display (my laptop). But I suspect it's not off by much.
--
Robert Krawitz <***@alum.mit.edu>

Johannes Zarl-Zierl
2017-05-30 19:20:13 UTC
Permalink
Hi Robert,

Thanks for providing these patches! They are appreciated ;-)

I'm a little sleep-deprived right now, so please bear with me if I
don't merge them right away.

@Tobias: If you have time to review and merge Robert's patches, I won't mind
:)

Cheers,
Johannes
Post by Robert Krawitz
...
I think this problem is solved.
Tobias Leupold
2017-05-30 19:24:31 UTC
Permalink
Post by Johannes Zarl-Zierl
@Tobias: If you have time to review and merge Robert's patches, I won't mind
:)
Sure, I'll have a look at them asap (once the excavator is no longer
shaping the area around the house ;-), but I'm not really sure I can
review them properly, because they touch parts of KPA I haven't worked
with yet.

Let's see ...
Robert Krawitz
2017-05-31 00:07:02 UTC
Permalink
Post by Johannes Zarl-Zierl
@Tobias: If you have time to review and merge Robert's patches, I won't mind
:)
Post by Tobias Leupold
Sure, I'll have a look at them asap (once the excavator is no longer
shaping the area around the house ;-), but I'm not really sure I can
review them properly, because they touch parts of KPA I haven't worked
with yet.
Let's see ...
I've merged the two patches (which are doing basically the same thing
to basically the same code) and cleaned up some warts. This one is
in much better shape.
--
Robert Krawitz <***@alum.mit.edu>

Johannes Zarl-Zierl
2017-07-17 18:57:38 UTC
Permalink
Hi Robert,
Post by Robert Krawitz
I've merged the two patches (which are doing basically the same thing
to basically the same code) and cleaned up some warts. This one is
in much better shape.
So this is basically 0003-fast-new-images.patch +
0004-dont-recompute-checksum.patch, and should be applicable on top of
0002-patch-fast-remove.patch?

Would it be a hassle for you to rebase it on current git master? The patch
does not apply cleanly...

Cheers,
Johannes
Robert Krawitz
2017-07-17 21:35:25 UTC
Permalink
Post by Johannes Zarl-Zierl
Hi Robert,
Post by Robert Krawitz
I've merged the two patches (which are doing basically the same thing
to basically the same code) and cleaned up some warts. This one is
in much better shape.
So this is basically 0003-fast-new-images.patch +
0004-dont-recompute-checksum.patch, and should be applicable on top of
0002-patch-fast-remove.patch?
Would it be a hassle for you to rebase it on current git master? The
patch does not apply cleanly...
I'm out of town, and network connectivity sucks right now.
--
Robert Krawitz <***@alum.mit.edu>

Robert Krawitz
2017-07-17 21:48:55 UTC
Permalink
Post by Johannes Zarl-Zierl
Hi Robert,
Post by Robert Krawitz
I've merged the two patches (which are doing basically the same thing
to basically the same code) and cleaned up some warts. This one is
in much better shape.
So this is basically 0003-fast-new-images.patch +
0004-dont-recompute-checksum.patch, and should be applicable on top of
0002-patch-fast-remove.patch?
Would it be a hassle for you to rebase it on current git master? The
patch does not apply cleanly...
I've attached the three relevant patches; I don't remember exactly
what my patch sets were then.
--
Robert Krawitz <***@alum.mit.edu>
