Discussion:
[KPhotoAlbum] Search performance
Robert Krawitz
2018-10-17 01:05:35 UTC
After looking at the code and profiles, it's going to be hard to do a
lot better unless we change the internal representation of category
items. We're spending quite a bit of time doing string comparisons;
we'd have to adopt some kind of internal indexing to do much better.

Whether that's worthwhile is another matter, but it might improve
startup as well.
--
Robert Krawitz <***@alum.mit.edu>

*** MIT Engineers A Proud Tradition http://mitathletics.com ***
Member of the League for Programming Freedom -- http://ProgFree.org
Project lead for Gutenprint -- http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton
Johannes Zarl-Zierl
2018-10-17 19:45:07 UTC
Hi Robert,
Post by Robert Krawitz
After looking at the code and profiles, it's going to be hard to do a
lot better unless we change the internal representation of category
items.
One idea that I was mildly interested in lately was to use an in-memory sqlite
database or something like that (while keeping the index.xml as canonical
on-disk format).

I have to admit though, that I have no idea how that would affect performance
at all. My reasoning was more based on flexibility of search…
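
A minimal sketch of the in-memory database idea, not taken from the KPhotoAlbum sources: a small XML document is parsed once and copied into an in-memory SQLite database, so that searches become SQL queries while the XML remains the on-disk format. The element and attribute names (`image`, `option`, `value`) and the single `tags` table are hypothetical simplifications, not the real index.xml schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for index.xml (not the real schema).
SAMPLE_XML = """
<db>
  <image file="a.jpg"><option name="People"><value value="Alice"/><value value="Bob"/></option></image>
  <image file="b.jpg"><option name="People"><value value="Bob"/></option></image>
  <image file="c.jpg"><option name="Places"><value value="Vienna"/></option></image>
</db>
""".strip()


def load_into_memory(xml_text):
    """Parse the XML and copy every (file, category, item) triple into SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tags (file TEXT, category TEXT, item TEXT)")
    conn.execute("CREATE INDEX tags_by_item ON tags (category, item)")
    root = ET.fromstring(xml_text)
    rows = [
        (image.get("file"), option.get("name"), value.get("value"))
        for image in root.iter("image")
        for option in image.iter("option")
        for value in option.iter("value")
    ]
    conn.executemany("INSERT INTO tags VALUES (?, ?, ?)", rows)
    return conn


def files_tagged(conn, category, item):
    """Return the sorted file names that carry the given category item."""
    cur = conn.execute(
        "SELECT file FROM tags WHERE category = ? AND item = ? ORDER BY file",
        (category, item),
    )
    return [row[0] for row in cur]


conn = load_into_memory(SAMPLE_XML)
print(files_tagged(conn, "People", "Bob"))
```

The sketch says nothing about speed; it only shows where the flexibility would come from, since any combination of categories can be expressed as a query instead of hand-written matching code.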

Cheers,
Johannes
Robert Krawitz
2018-10-18 00:37:37 UTC
Post by Johannes Zarl-Zierl
Hi Robert,
Post by Robert Krawitz
After looking at the code and profiles, it's going to be hard to do a
lot better unless we change the internal representation of category
items.
One idea that I was mildly interested in lately was to use an
in-memory sqlite database or something like that (while keeping the
index.xml as canonical on-disk format).
This would carry a penalty every time we start kpa. I'm not sure how
severe the penalty would be, but I'd be very surprised if it were less
than 20% (and possibly considerably more).

I don't remember why we abandoned the SQLDB backend, but if we want an
SQL database, I think it would make more sense to store the data in
one in the first place. It would certainly make saving a lot faster,
since that would be a database update rather than a full save from
scratch. The index.xml format is certainly convenient for testing (it
can be edited by hand), but it's not likely the highest performance
way of doing the job.

My thought is that, rather than storing collections of category members
as strings, we store them as integer indices into tables of
category members. The compact index.xml representation already does
that, so we do have some code in kpa that knows how to do this.
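
A minimal sketch of that representation, assuming nothing about the actual kpa classes: each category keeps one table that maps member names to small integers, and every image stores a set of those integers instead of a list of strings. The class name `CategoryTable` and the sample data are hypothetical.

```python
class CategoryTable:
    """Maps each category member name to a small integer and back."""

    def __init__(self):
        self._index_of = {}  # name -> integer index
        self._names = []     # integer index -> name

    def intern(self, name):
        """Return the index for name, assigning a new one on first sight."""
        idx = self._index_of.get(name)
        if idx is None:
            idx = len(self._names)
            self._index_of[name] = idx
            self._names.append(name)
        return idx

    def lookup(self, name):
        """Return the index for name, or None if it was never interned."""
        return self._index_of.get(name)

    def name(self, idx):
        """Translate an index back to the member name (for display or saving)."""
        return self._names[idx]


people = CategoryTable()

# Each image stores integer indices, not the member strings themselves.
images = {
    "a.jpg": frozenset(people.intern(n) for n in ["Alice", "Bob"]),
    "b.jpg": frozenset(people.intern(n) for n in ["Bob"]),
    "c.jpg": frozenset(people.intern(n) for n in ["Carol"]),
}


def search(images, table, wanted):
    """Resolve the name once, then compare integers for every image."""
    idx = table.lookup(wanted)
    if idx is None:
        return []
    return sorted(f for f, members in images.items() if idx in members)


print(search(images, people, "Bob"))
```

The point is that the search string is resolved to an index once per query, and the per-image work is then integer comparison only. The Python version illustrates the data layout; it does not demonstrate the speedup one would hope for in kpa itself.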

My search case was rather extreme, and taking 4 seconds for that isn't
too bad. But it sure would be nice if even really big searches were
instantaneous.
Post by Johannes Zarl-Zierl
I have to admit though, that I have no idea how that would affect
performance at all. My reasoning was more based on flexibility of
search…
Joe
2018-10-18 05:11:43 UTC
Post by Robert Krawitz
Post by Johannes Zarl-Zierl
Hi Robert,
Post by Robert Krawitz
After looking at the code and profiles, it's going to be hard to do a
lot better unless we change the internal representation of category
items.
One idea that I was mildly interested in lately was to use an
in-memory sqlite database or something like that (while keeping the
index.xml as canonical on-disk format).
This would carry a penalty every time we start kpa. I'm not sure how
severe the penalty would be, but I'd be very surprised if it were less
than 20% (and possibly considerably more).
I don't remember why we abandoned the SQLDB backend, but if we want an
SQL database, I think it would make more sense to store the data in
one in the first place. It would certainly make saving a lot faster,
since that would be a database update rather than a full save from
scratch. The index.xml format is certainly convenient for testing (it
can be edited by hand), but it's not likely the highest performance
way of doing the job.
My thought is that, rather than storing collections of category members
as strings, we store them as integer indices into tables of
category members. The compact index.xml representation already does
that, so we do have some code in kpa that knows how to do this.
My search case was rather extreme, and taking 4 seconds for that isn't
too bad. But it sure would be nice if even really big searches were
instantaneous.
Post by Johannes Zarl-Zierl
I have to admit though, that I have no idea how that would affect
performance at all. My reasoning was more based on flexibility of
search…
I believe I saw comments on this list that the SQL code was very broken
and that was the reason for abandoning it at the time.

All this performance work/internals is stuff an end user would take for
granted (no glory), but it really helps. Thanks for all the hard work.

Joe
Martin Hoeller
2018-10-18 05:31:22 UTC
Hi!
Post by Robert Krawitz
Post by Johannes Zarl-Zierl
Hi Robert,
Post by Robert Krawitz
After looking at the code and profiles, it's going to be hard to do a
lot better unless we change the internal representation of category
items.
One idea that I was mildly interested in lately was to use an
in-memory sqlite database or something like that (while keeping the
index.xml as canonical on-disk format).
[...]
Post by Robert Krawitz
I don't remember why we abandoned the SQLDB backend,
The reason seemed to be complexity. Read this email thread from 6 years
ago: https://mail.kdab.com/pipermail/kphotoalbum/2012-April/005024.html
Post by Robert Krawitz
but if we want an
SQL database, I think it would make more sense to store the data in
one in the first place.
Well, I like the idea of an SQL backend... the question is: can we manage
this task? As far as I know we have 3 active committers (all with limited
time). So as 3 attempts were already made and failed, I doubt that another
attempt would succeed.

I really think the time would be better invested in other parts of
KPA.

Anyway, if somebody feels brave enough for another try, go
for it.

Just my 2¢,
- martin
Robert Krawitz
2018-10-18 12:48:23 UTC
Post by Martin Hoeller
Post by Robert Krawitz
Post by Johannes Zarl-Zierl
Hi Robert,
Post by Robert Krawitz
After looking at the code and profiles, it's going to be hard to do a
lot better unless we change the internal representation of category
items.
One idea that I was mildly interested in lately was to use an
in-memory sqlite database or something like that (while keeping the
index.xml as canonical on-disk format).
[...]
Post by Robert Krawitz
I don't remember why we abandoned the SQLDB backend,
The reason seemed to be complexity. Read this email thread from 6 years
ago: https://mail.kdab.com/pipermail/kphotoalbum/2012-April/005024.html
Yes, I can well understand why.
Post by Martin Hoeller
Post by Robert Krawitz
but if we want an
SQL database, I think it would make more sense to store the data in
one in the first place.
Well, I like the idea of an SQL backend... the question is: can we manage
this task? As far as I know we have 3 active committers (all with limited
time). So as 3 attempts were already made and failed, I doubt that another
attempt would succeed.
It certainly would not be an easy task; I just think that without it
we're going to have a lot of trouble improving performance very much.

We're spending so much time in string functions, though, that I wonder
whether shortening the XML tag and attribute names (to one
character/byte, perhaps?) would significantly improve load/save
times. It shrinks the size of my index.xml considerably, from 51 MB
to 40 MB.
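
A rough way to see the size effect of abbreviated names, as a sketch only: the code below builds a synthetic document, renames tags and attributes through a mapping, and compares the serialized sizes. The long and short names in `SHORT_TAGS` and `SHORT_ATTRS` are hypothetical, the real index.xml names may differ, and the resulting ratio depends entirely on the sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical long -> short name mappings (not the real index.xml vocabulary).
SHORT_TAGS = {"images": "I", "image": "i"}
SHORT_ATTRS = {"file": "f", "width": "w", "height": "h", "md5sum": "m"}


def build_sample(count):
    """Build a synthetic tree with one element per image."""
    root = ET.Element("images")
    for n in range(count):
        ET.SubElement(root, "image", {
            "file": f"2018/img_{n:05d}.jpg",
            "width": "6000",
            "height": "4000",
            "md5sum": f"{n:032x}",
        })
    return root


def shorten(elem):
    """Return a copy of the tree with tag and attribute names abbreviated."""
    new = ET.Element(
        SHORT_TAGS.get(elem.tag, elem.tag),
        {SHORT_ATTRS.get(key, key): value for key, value in elem.attrib.items()},
    )
    new.text = elem.text
    new.tail = elem.tail
    for child in elem:
        new.append(shorten(child))
    return new


long_form = ET.tostring(build_sample(1000))
short_form = ET.tostring(shorten(build_sample(1000)))
print(len(long_form), len(short_form))
```

This only measures bytes on disk. Whether load and save times shrink in proportion would have to be measured in kpa itself, since parsing cost is not purely a function of file size.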

We could also consider putting immutable data (in particular, MD5
checksum, image dimensions) in the EXIF database.

All of this has the feel of nibbling at the edges, to be sure. But I
do note that ODF is quite ruthless about using very short
tag/attribute names.
Johannes Zarl-Zierl
2018-10-18 19:26:16 UTC
Hi,
Post by Robert Krawitz
Post by Martin Hoeller
The reason seemed to be complexity. Read this email thread from 6 years
ago: https://mail.kdab.com/pipermail/kphotoalbum/2012-April/005024.html
Yes, I can well understand why.
IMO there were several problems with the approach that we took for the SQL
backend:

1) It was an *additional* backend+frontend (a new file format, as well as a
new database core). The "traditional" database still needed to be maintained
in addition to the new one. (I say "traditional" instead of "XMLDB" to avoid
confusion, because the backend has nothing to do with XML.)

2) Having a new file format prevented switching to the new format because many
people (at least if they were thinking like me) wouldn't trust the new format
right away and there was no easy way to switch back. Lack of user adoption
also means fewer testers, ...

3) Lack of unit testing for the database means that we can not make radical
changes with confidence.

This is, of course, my personal opinion and not a criticism of Jesper, Tuomas,
Jan and Henner. It's always easier to look back and point out problems than to
build something and anticipate them ;-)


Learning from the old mistakes means that we could try to do things
differently:

1) Switch the backend fully to in-memory SQL.

2) Don't introduce a new on-disk format. Of course, we can still cache the
database on disk for performance reasons and only load from index.xml if it
actually changed.

3) Unit testing is still not there. Before taking on such an endeavor, we need
to get testing in place. I have done a little write-up of my plans to increase
maintainability here:
https://phabricator.kde.org/T7752
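
The caching idea in point 2 can be sketched as follows, under stated assumptions: the cache stores a fingerprint of index.xml next to the parsed data, and the slow parse runs only when the fingerprint no longer matches. For brevity the cache here is a JSON file rather than a database, `slow_parse` is a placeholder for the real loader, and all function and file names are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def fingerprint(path):
    """Hash the file contents; a size+mtime check would be cheaper but weaker."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def slow_parse(path):
    """Placeholder for the expensive index.xml load."""
    with open(path, encoding="utf-8") as fh:
        return {"length": len(fh.read())}


def load(index_path, cache_path):
    """Return (data, source), where source is 'cache' or 'index.xml'."""
    current = fingerprint(index_path)
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            cached = json.load(fh)
        if cached.get("fingerprint") == current:
            return cached["data"], "cache"
    # Cache missing or stale: fall back to the canonical file and refresh it.
    data = slow_parse(index_path)
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump({"fingerprint": current, "data": data}, fh)
    return data, "index.xml"


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "index.xml")
    cache_path = os.path.join(tmp, "index.cache.json")
    with open(index_path, "w", encoding="utf-8") as fh:
        fh.write("<db/>")
    first = load(index_path, cache_path)   # no cache yet
    second = load(index_path, cache_path)  # unchanged file
    with open(index_path, "w", encoding="utf-8") as fh:
        fh.write("<db><image/></db>")      # simulate a hand edit
    third = load(index_path, cache_path)   # cache is stale

print(first[1], second[1], third[1])  # index.xml cache index.xml
```

Because the cache can always be thrown away and rebuilt, index.xml stays the only file that has to be trusted, backed up, or edited by hand.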
Post by Robert Krawitz
Post by Martin Hoeller
Post by Robert Krawitz
but if we want an
SQL database, I think it would make more sense to store the data in
one in the first place.
I should have mentioned in the first mail what I mean by "canonical file
format". I've no problems with storing the data into a persistent database for
caching.
But I still think that the index.xml format has good properties (resistant
against file corruption, easy/robust versioning, readable and writable "by
hand"). Also, many people use kphotoalbum on different machines in different
versions - with the XML format, you can easily pull that off as long as you
take some care.

If we take the caching approach, we should be able to eat our cake (index.xml
format, fast queries) and still have half of it (usually fast loading with
"slow" saving to index.xml).
Post by Robert Krawitz
Post by Martin Hoeller
Well, I like the idea of an SQL backend... the question is: can we manage
this task. As far as I know we have 3 active committers (all with limited
time). So as 3 attempts were already made and failed, I doubt that another
attempt would succeed.
It certainly would not be an easy task; I just think that without it
we're going to have a lot of trouble improving performance very much.
I think switching to an SQL backend is one goal that is worth investigating,
and it plays very nicely with my "big picture" (if you can call it that): invest
time now in the "boring" task of increasing maintainability and leverage the
new freedom to move kphotoalbum forward.
Post by Robert Krawitz
We're spending so much time in string functions, though, that I wonder
whether shortening the XML tag and attribute names (to one
character/byte, perhaps?) would significantly improve load/save
times. It shrinks the size of my index.xml considerably, from 51 MB
to 40 MB.
I'm not opposed to doing optimizations like that. I have to agree with you
though, that we won't get *huge* performance gains doing that.

Cheers,
Johannes
Andreas Schleth
2018-10-18 21:02:41 UTC
Hi everybody,

On 18.10.18 at 21:26, Johannes Zarl-Zierl wrote:
...
Post by Johannes Zarl-Zierl
I should have mentioned in the first mail what I mean by "canonical file
format". I've no problems with storing the data into a persistent database for
caching.
But I still think that the index.xml format has good properties (resistant
against file corruption, easy/robust versioning, readable and writable "by
hand"). Also, many people use kphotoalbum on different machines in different
versions - with the XML format, you can easily pull that off as long as you
take some care.
Yes, yes, yes!

E.g., I still use an old 4.2 KPA with all the glorious KIPI plugins to
adjust timestamps (when someone gives me pictures with the date/time off, to sync
them with my own images). This works nicely with index files otherwise
used with the latest git master.

And I occasionally tweak the database manually. E.g., setting the time to
somewhere between Jan 1st and Dec 31st makes the image show up in at
least 2 consecutive years. Changing this to Jan 2nd and Dec 30th is
just two commands in vim.

Even if I am usually a bit critical about XML because it is a bit chatty
(lots of text in names and attributes), it has the great benefit of
being very robust. Robustness must come first, then the code has to be
understood by future maintainers, then performance. We are talking about
data that we want to keep for (many) years to come. My own databases
date back to 2004/2005, when Blackie himself twiddled with the code.
This is at least 2 generations of maintainers back.

Thus, everybody involved did a really terrific job in keeping the file
format stable and backwards compatible over so long a time frame.
Post by Johannes Zarl-Zierl
If we take the caching approach, we should be able to eat our cake (index.xml
format, fast queries) and still have half of it (usually fast loading with
"slow" saving to index.xml).
I somewhat doubt that a large number of images really makes loading much
slower. There are other factors too, such as (maybe) total tree size or
type and size of media.

My image databases all load fairly quickly - all around 30 to 40k images:

***@wshome5:~/eigene_Bilder> time kphotoalbum -c index.xml
real    0m8,219s
user    0m5,219s
sys     0m0,448s
(open & close without save / tree size: 141,449,556 kB / 35457 images /
index 31 MB)

My movie database with only around 1k clips and movies takes "forever"
to load:

***@wshome5:~/Filme> time kphotoalbum -c index.xml
real    0m40,944s
user    0m8,874s
sys     0m4,718s
(open & close without saving / tree size: 1,568,340,336 kB / 1100 films
/ index 1,7 MB)

This big difference tells me (I did not look into the code) that looking
at a few large files takes KPA much longer than looking at many smaller
ones...
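
The two measurements above can be broken down a little further. The sketch below subtracts CPU time (user + sys) from wall-clock time (real) for each run; the remainder is time the process was not computing, typically waiting on I/O. This is only a rough guide: in a multi-threaded program CPU time is summed over all threads, so the subtraction can understate the waiting. The function names are illustrative.

```python
def seconds(text):
    """Convert time(1) output such as '0m8,219s' (comma decimal) to seconds."""
    minutes, rest = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(rest.replace(",", "."))


def breakdown(run):
    """Return (cpu_seconds, seconds_not_on_cpu) for one timing record."""
    cpu = seconds(run["user"]) + seconds(run["sys"])
    return cpu, seconds(run["real"]) - cpu


# Figures copied from the two runs quoted above.
runs = {
    "photos": {"real": "0m8,219s", "user": "0m5,219s", "sys": "0m0,448s"},
    "movies": {"real": "0m40,944s", "user": "0m8,874s", "sys": "0m4,718s"},
}

for name, run in runs.items():
    cpu, waiting = breakdown(run)
    print(f"{name}: cpu {cpu:.3f}s, not on CPU {waiting:.3f}s")
# photos: cpu 5.667s, not on CPU 2.552s
# movies: cpu 13.592s, not on CPU 27.352s
```

For the movie database roughly two thirds of the wall-clock time is not CPU time at all, which suggests the process spent most of the load waiting rather than parsing the 1,7 MB index.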

All my files sit on an NFS share (spinning rust) via Gigabit Ethernet.


Just my thoughts.

Best regards & thanks for keeping the project alive!

Andreas
Robert Krawitz
2018-10-18 22:39:47 UTC
Post by Andreas Schleth
...
Post by Johannes Zarl-Zierl
I should have mentioned in the first mail what I mean by "canonical file
format". I've no problems with storing the data into a persistent database for
caching.
But I still think that the index.xml format has good properties (resistant
against file corruption, easy/robust versioning, readable and writable "by
hand"). Also, many people use kphotoalbum on different machines in different
versions - with the XML format, you can easily pull that off as long as you
take some care.
Yes, yes, yes!
E.g., I still use an old 4.2 KPA with all the glorious KIPI plugins to adjust timestamps (when someone gives me pictures with the date/time off, to sync them with my own images). This works nicely with index files otherwise used with the latest git master.
I don't really disagree, just note that this is going to be the
limiting factor in startup and save performance.
Post by Andreas Schleth
Post by Johannes Zarl-Zierl
If we take the caching approach, we should be able to eat our cake (index.xml
format, fast queries) and still have half of it (usually fast loading with
"slow" saving to index.xml).
I somewhat doubt that a large number of images really makes loading
much slower. There are other factors too, such as (maybe) total tree
size or type and size of media.
I've measured it :-)
I have 275K images; it currently takes about 12 seconds to start up.
That's long enough to be annoying if I want to quickly check some
images.
Post by Andreas Schleth
real    0m8,219s
user    0m5,219s
sys     0m0,448s
(open & close without save / tree size: 141,449,556 kB / 35457 images / index 31 MB)
real    0m40,944s
user    0m8,874s
sys     0m4,718s
(open & close without saving / tree size: 1,568,340,336 kB / 1100 films / index 1,7 MB)
This big difference tells me (I did not look into the code) that
looking at a few large files takes KPA much longer than looking at
many smaller ones...
What version of kpa are you using, and on what CPU?

There *shouldn't* be any difference in startup time depending upon
storage or file type *unless* you have search for new files on startup
turned on, in which case it's going to search the directory for new
files. I can't judge that without knowing more about the details of
your filesystem structure. I'm very surprised by your results, unless
you're running on an old version of kpa.
Post by Andreas Schleth
All my files sit on an NFS share (spinning rust) via Gigabit Ethernet.
NFS is not a good storage back end for KPA or anything else that works
with a lot of files. The scout thread I implemented in kpa 5.4 should
help when actually loading new files.
Tobias Leupold
2018-10-18 21:34:30 UTC
Post by Johannes Zarl-Zierl
But I still think that the index.xml format has good properties (resistant
against file corruption, easy/robust versioning, readable and writable "by
hand"). Also, many people use kphotoalbum on different machines in different
versions - with the XML format, you can easily pull that off as long as you
take some care.
I must say that the XML database format was one reason for me to use KPA at
all back then. I thought I'd spend a lot of time tagging and a lot of
energy would flow into this database. And if the KPA thing ends up sucking one
day, I will still have that database that I can even read "by hand" with a
normal text editor. And that's why I thought "no matter what happens, my data
is safe and mine after all".

I don't know if other users have/had the same thought. But I think this
"special" approach (I don't think it's really common for that task) to save
the data is a good thing.

Surely, one can optimize. An in-memory SQLite database for fast search queries is
a nice idea for sure. But I would really stick to the XML file as the on-disk
storage format.

Just my thoughts ...
Robert Krawitz
2018-10-18 22:43:36 UTC
Post by Tobias Leupold
Post by Johannes Zarl-Zierl
But I still think that the index.xml format has good properties (resistant
against file corruption, easy/robust versioning, readable and writable "by
hand"). Also, many people use kphotoalbum on different machines in different
versions - with the XML format, you can easily pull that off as long as you
take some care.
I must say that the XML database format was one reason for me to use
KPA at all back then. I thought I'd spend a lot of time
tagging and a lot of energy would flow into this database. And if the
KPA thing ends up sucking one day, I will still have that database
that I can even read "by hand" with a normal text editor. And that's
why I thought "no matter what happens, my data is safe and mine
after all".
To be honest, I like it too. I've been known to edit my index.xml by
hand. But it does have performance implications for startup and save.
I do want to see what happens with abbreviating tag and attribute
names for those in the image path. It may not make much difference,
but even 10% difference is something not to be ignored.