Discussion:
Recalc checksum upon image modification?
Tim McCormack
2008-04-06 19:36:23 UTC
I was under the impression that KPA would recalculate the checksum of
any image in the collection that had been modified since last DB save.
I've discovered that this is not true. Is there any way of forcing KPA
to do this?

At the very least, the checksum should be updated in index.xml whenever
one of the plugins is applied to the image.

Without this automatic recalc, KPA is unable to properly identify moved
images. Example:

1. Drop in some images, run KPA, save.
2. Edit image, run KPA, save.
3. Move image, run KPA: Moved/renamed image is not recognized.

I am under the impression that KimDaBa handled this properly. Perhaps it
is an actual bug? I'm running KPA 3.0.2 on the current Ubuntu (Gutsy
Gibbon).

- Tim McCormack
Jan Kundrát
2008-04-06 21:42:26 UTC
Post by Tim McCormack
I was under the impression that KPA would recalculate the checksum of
any image in the collection that had been modified since last DB save.
Hi Tim,
we don't check file checksums at startup as it's a very expensive
operation. My pretty small DB has about 16k images and takes 32GB on
disk. My laptop can process up to 300MB/s of data per core when
computing MD5 fed with 8k blocks (`openssl speed md5`). I
guess that Qt's implementation is not as optimized as OpenSSL's,
but even if it were, it would take close to two minutes of CPU time on
each startup of KPA. When using both cores and clever tricks, I could
get that down to one minute if I were lucky and the filesystem/disk
was fast. But my laptop is one year old, so it would take about twenty
minutes (!) just to read the data from disk. We have plenty of users
on older machines and/or with images stored on a network disk over
100Mbps or slower ethernet (and for some users, even just reading the
directory list and calling stat() on each file was a killer), so
enabling this globally is not really an option.
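
For reference, the arithmetic behind these estimates can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the figures quoted above (32GB, 300MB/s, twenty minutes, 100Mbps); the function names are made up for illustration, decimal units are assumed, and the network figure ignores protocol overhead.

```python
GB = 10**9  # decimal units are an assumption; the post does not say GB vs GiB
MB = 10**6


def transfer_seconds(total_bytes: float, bytes_per_second: float) -> float:
    """Time needed to push total_bytes through a stage of the given throughput."""
    return total_bytes / bytes_per_second


collection = 32 * GB

# Pure MD5 CPU time on one core at the quoted `openssl speed md5` rate.
md5_cpu_seconds = transfer_seconds(collection, 300 * MB)

# Disk throughput implied by "about twenty minutes just to read the data".
implied_disk_bytes_per_second = collection / (20 * 60)

# Best case over 100 Mbps ethernet (8 bits per byte, no overhead counted).
network_seconds = transfer_seconds(collection, 100 * 10**6 / 8)

print(f"MD5 CPU time:        {md5_cpu_seconds:.0f} s")
print(f"Implied disk speed:  {implied_disk_bytes_per_second / MB:.1f} MB/s")
print(f"100 Mbps network:    {network_seconds / 60:.0f} min")
```

Whichever stage is slowest (CPU, disk, or network) sets the startup delay, which is why the disk and network cases dominate here.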
Post by Tim McCormack
I've discovered that this is not true. Is there any way of forcing KPA
to do this?
Sure, Maintenance -> Recalculate Checksum does that.
Post by Tim McCormack
At the very least, the checksum should be updated in index.xml whenever
one of the plugins is applied to the image.
Good point, please add this as a wish to our bugzilla (bugs.kde.org).

Cheers,
-jkt
--
cd /local/pub && more beer > /dev/mouth
Tim McCormack
2008-04-07 01:14:19 UTC
Post by Jan Kundrát
we don't check file checksums at startup as it's a very
expensive operation.
Oh, I'm not saying to recalculate the MD5 sums of everything on startup
-- my proposal is to look for modified files on the basis of stat(), and
then recompute the MD5.
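
A minimal sketch of that idea in Python, not KPA's actual code: it assumes the database keeps a size and modification time next to each checksum. The `records` dictionary and all function names here are hypothetical.

```python
import hashlib
import os


def md5_of_file(path, block_size=8192):
    """Hash a file in fixed-size blocks so large images are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def make_record(path):
    """Hypothetical per-image DB entry: stat() data plus the checksum."""
    st = os.stat(path)
    return {"mtime_ns": st.st_mtime_ns, "size": st.st_size,
            "md5": md5_of_file(path)}


def refresh_checksums(records):
    """Re-hash only files whose size or mtime differs from the stored values.

    `records` maps path -> record as built by make_record(). Returns the
    list of paths that were re-hashed. Missing files are skipped here.
    """
    changed = []
    for path, rec in records.items():
        try:
            st = os.stat(path)
        except FileNotFoundError:
            continue  # moved or deleted; handling that is a separate step
        if (st.st_mtime_ns, st.st_size) != (rec["mtime_ns"], rec["size"]):
            rec.update(mtime_ns=st.st_mtime_ns, size=st.st_size,
                       md5=md5_of_file(path))
            changed.append(path)
    return changed
```

This still costs one stat() per known file on every run, so it only avoids the hashing work, not the directory and inode I/O.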

Good point regarding LAN-stored photo collections. Perhaps this could be
an option for startup (off by default), or even just a plugin.
Post by Jan Kundrát
Is there any way of forcing KPA to do this?
Sure, Maintenance -> Recalculate Checksum does that.
But that recalculates *all* checksums. :-/

KPA already calls stat() for all files, correct? Or does it just check
directory listings for new files?

- Tim McCormack
Jan Kundrát
2008-04-07 01:22:42 UTC
Post by Tim McCormack
Oh, I'm not saying to recalculate the MD5 sums of everything on startup
-- my proposal is to look for modified files on the basis of stat(), and
then recompute the MD5.
We don't store the file's last modification time in the DB.
Post by Tim McCormack
KPA already calls stat() for all files, correct? Or does it just check
directory listings for new files?
It used to stat() each file (via QDir), but as it was slow in some
setups, it just uses readdir() now.
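
To illustrate the difference (in Python rather than KPA's C++/Qt, and with made-up names): a directory listing alone is enough to spot files the database has not seen yet. `os.scandir()` returns entry names straight from the listing, and reading `entry.name` does not trigger a per-file stat().

```python
import os


def find_new_files(directory, known_names):
    """Return names present in `directory` but absent from `known_names`.

    Only the directory listing is read; no per-file stat() is requested.
    `known_names` stands in for the set of file names already in the DB.
    """
    new_names = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name not in known_names:
                new_names.append(entry.name)
    return sorted(new_names)
```

The trade-off is the one discussed in this thread: without stat() data there is no way to notice that an already-known file was modified.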

Cheers,
-jkt
--
cd /local/pub && more beer > /dev/mouth
Robert Krawitz
2008-04-07 01:36:19 UTC
Date: Sun, 06 Apr 2008 21:14:19 -0400
Post by Jan Kundrát
we don't check file checksums at startup as it's a very
expensive operation.
*Very*, *very* expensive. It would take many hours on a reasonably
fast machine to check a large collection.
Post by Jan Kundrát
Is there any way of forcing KPA to do this?
Sure, Maintenance -> Recalculate Checksum does that.
But that recalculates *all* checksums. :-/

KPA already calls stat() for all files, correct? Or does it just
check directory listings for new files?

Not any more, it doesn't. It's a bit more clever now and avoids
stat'ing files it already knows about or otherwise knows it's not
interested in. For me, that speeds up the scan for new images on a
cold system from about 60 seconds to about 3.

stat() is fast if the inode is already in memory. If it's on disk, it
requires at least one disk I/O to bring it in. My laptop disk is
5400 RPM with a 12 ms average seek time. While seek times are funny
things, that means that on average stat'ing a file will (everything
else being equal) require the seek plus half a rotation, i.e.
(12 + ((60000 / 5400) / 2)) milliseconds, or about 17.6 ms. I have
about 30,000 files in my images directory, so that would in principle
mean it would take over 500 seconds if it has to stat every file.

(In actual fact, it's much less than that -- typically about a minute,
as I said -- because multiple inodes are stored on one block, and when
the kernel reads in a block it caches it. But having to wait a minute
to start up on a cold system isn't very pleasant.)
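
The worst-case estimate above can be reproduced with a short Python calculation. It uses only the figures from this message (12 ms seek, 5400 RPM, 30,000 files) and the same simplifying assumption of one uncached disk access per stat(); the function names are illustrative.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for the sector to come around: half of one rotation."""
    ms_per_rotation = 60_000 / rpm
    return ms_per_rotation / 2


def cold_stat_estimate_seconds(n_files: int, seek_ms: float, rpm: float) -> float:
    """Worst case: every stat() pays one full seek plus rotational latency."""
    per_file_ms = seek_ms + avg_rotational_latency_ms(rpm)
    return n_files * per_file_ms / 1000


per_file = 12 + avg_rotational_latency_ms(5400)
total = cold_stat_estimate_seconds(30_000, seek_ms=12, rpm=5400)
print(f"{per_file:.1f} ms per file, {total:.0f} s for 30,000 files")
```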
--
Robert Krawitz <***@alum.mit.edu>

Tall Clubs International -- http://www.tall.org/ or 1-888-IM-TALL-2
Member of the League for Programming Freedom -- mail ***@uunet.uu.net
Project lead for Gutenprint -- http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton