Discussion:
[KPhotoAlbum] Search performance (at least) linear in number of search terms?
Robert Krawitz
2017-07-10 00:00:21 UTC
Permalink
I'm reorganizing my database as follows:

I have traditionally applied a keyword to each set of photos. When I
select the photos I'm using, I apply a second keyword named "$keyword
selection" (for appropriate value of $keyword).

This led to a lot of clutter, so I've created a new keyword named "All
Selections", which I apply to the selected image. So I can select the
image I want by selecting the and of the keyword and All Selections.

I went about this via the search dialog, by selecting all of the
keywords named "$keyword selection" (something like 100 keywords,
IIRC), and then applying the new keyword to that. This procedure was
very time consuming; it took maybe a minute on my core i7-920XM. This
seems too long for 222000 images and 100 keywords (each image has no
more than two keywords, usually). I'm going through one kcachegrind
trace (which took me about 30 minutes to collect, using only 40
keywords) and haven't been able thus far to work out quite what's
going on in there. For reference, it made about 80 million calls each
to DB::ImageInfo::hasCategoryInfo(QString const&, QString const&) and
DB::ImageInfo::hasCategoryInfo(QString const&, QSet<QString> const&).


It appears that most of the work is done by intersecting the set of
tags applied to the image and the set of tags being searched for, but
the way this intersection is done results in a lot of string hashes
being calculated (as opposed to the hashes being cached).

Furthermore, since each keyword (or other category member) has a
unique integer assigned, the matching could be done by integer
comparisons. That wouldn't work if someone wanted to do a search on a
wildcard/regexp, but it appears we don't allow that anyway.

So I'm still trying to puzzle my way through this.
--
Robert Krawitz <***@alum.mit.edu>

*** MIT Engineers A Proud Tradition http://mitathletics.com ***
Member of the League for Programming Freedom -- http://ProgFree.org
Project lead for Gutenprint -- http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton
Robert Krawitz
2017-07-16 03:34:31 UTC
Permalink
Post by Robert Krawitz
I have traditionally applied a keyword to each set of photos. When I
select the photos I'm using, I apply a second keyword named "$keyword
selection" (for appropriate value of $keyword).
This led to a lot of clutter, so I've created a new keyword named "All
Selections", which I apply to the selected image. So I can select the
image I want by selecting the and of the keyword and All Selections.
I went about this via the search dialog, by selecting all of the
keywords named "$keyword selection" (something like 100 keywords,
IIRC), and then applying the new keyword to that. This procedure was
very time consuming; it took maybe a minute on my core i7-920XM. This
seems too long for 222000 images and 100 keywords (each image has no
more than two keywords, usually). I'm going through one kcachegrind
trace (which took me about 30 minutes to collect, using only 40
keywords) and haven't been able thus far to work out quite what's
going on in there. For reference, it made about 80 million calls each
to DB::ImageInfo::hasCategoryInfo(QString const&, QString const&) and DB::ImageInfo::hasCategoryInfo(QString const&, QSet<QString> const&).
It appears that most of the work is done by intersecting the set of
tags applied to the image and the set of tags being searched for, but
the way this intersection is done results in a lot of string hashes
being calculated (as opposed to the hashes being cached).
Furthermore, since each keyword (or other category member) has a
unique integer assigned, the matching could be done by integer
comparisons. That wouldn't work if someone wanted to do a search on a
wildcard/regexp, but it appears we don't allow that anyway.
So I'm still trying to puzzle my way through this.
Well, I did find one piece of really low hanging fruit: the overlap
function (which, mind you, is used in only one place, so Set.cpp and
Set.h might just as well be removed altogether) was computing the
intersection of the two sets of strings rather than just determining
whether there is an intersection.

For my test case, it cut the time from about 18 seconds to maybe 13.
While with only 222,000 images and searching for about 40 tags it
should be able to do a lot better indeed than that, this is at least a
start in the right direction.

If you want, I'll do a patch that removes Set.h and Set.cpp altogether.

diff --git a/DB/ImageInfo.cpp b/DB/ImageInfo.cpp
index d5b8cb25..c5897fe7 100644
--- a/DB/ImageInfo.cpp
+++ b/DB/ImageInfo.cpp
@@ -132,7 +132,7 @@ bool ImageInfo::hasCategoryInfo( const QString& key, const QString& value ) cons

bool DB::ImageInfo::hasCategoryInfo( const QString& key, const StringSet& values ) const
{
- return Utilities::overlap( m_categoryInfomation[key], values );
+ return values.intersects( m_categoryInfomation[key] );
}
--
Robert Krawitz <***@alum.mit.edu>

*** MIT Engineers A Proud Tradition http://mitathletics.com ***
Member of the League for Programming Freedom -- http://ProgFree.org
Project lead for Gutenprint -- http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton
Tobias Leupold
2017-07-16 07:53:37 UTC
Permalink
Hi :-)

As far as I can grasp it, Set.h and Set.cpp is a remnant of KPA using older Qt
versions which did not provide the needed functionality.

In case of doubt, Johannes is the one to decide whether or not, but I think
this can be removed in favor of using Qt directly. Which does not only – as
you stated – speed up the code, but also simplify it.

Cheers, Tobias
Post by Robert Krawitz
Well, I did find one piece of really low hanging fruit: the overlap
function (which, mind you, is used in only one place, so Set.cpp and
Set.h might just as well be removed altogether) was computing the
intersection of the two sets of strings rather than just determining
whether there is an intersection.
For my test case, it cut the time from about 18 seconds to maybe 13.
While with only 222,000 images and searching for about 40 tags it
should be able to do a lot better indeed than that, this is at least a
start in the right direction.
If you want, I'll do a patch that removes Set.h and Set.cpp altogether.
diff --git a/DB/ImageInfo.cpp b/DB/ImageInfo.cpp
index d5b8cb25..c5897fe7 100644
--- a/DB/ImageInfo.cpp
+++ b/DB/ImageInfo.cpp
@@ -132,7 +132,7 @@ bool ImageInfo::hasCategoryInfo( const QString& key,
const QString& value ) cons
bool DB::ImageInfo::hasCategoryInfo( const QString& key, const StringSet&
values ) const {
- return Utilities::overlap( m_categoryInfomation[key], values );
+ return values.intersects( m_categoryInfomation[key] );
}
Robert Krawitz
2017-07-16 15:55:22 UTC
Permalink
Post by Tobias Leupold
Hi :-)
As far as I can grasp it, Set.h and Set.cpp is a remnant of KPA using older Qt versions which did not provide the needed functionality.
In case of doubt, Johannes is the one to decide whether or not, but I think this can be removed in favor of using Qt directly. Which does not only – as you stated – speed up the code, but also simplify it.
So, here's the big problem: it's just plain doing waaaaay too much
work, by at least an order of magnitude. Not to mention that the
number of calls to hasCategoryInfo() is linear in the number of
individual search terms (so each keyword being searched for is checked
twice), which since the intersection test is likely superlinear
results in the overall work being superquadratic. *Then* this is all
done almost 10 times for each search.

I set a breakpoint on DB::ImageSearchInfo::match on the demo database,
and caught a stack trace for each time it was hit on one particular
image. I found that that was being called 9x (and in some cases that
I haven't characterized 18x) for each image. And what's more, _both_
flavors of hasCategoryInfo() (in ImageInfo.cpp) are being called if
there's _not_ a match on a keyword (if there is a match, the call to
the QSet<QString> flavor is short circuited).

If I do a search for "desktop | scenic" in keywords, I get this trace
using debugging information below:

match "bar55.jpg"
hCI1 "bar55.jpg" "Keywords" "desktop" ""
hCI2 "bar55.jpg" "Keywords" "" ""
hCI1 "bar55.jpg" "Keywords" "fun" ""
hCI2 "bar55.jpg" "Keywords" "" ""
match "snow.jpg"
hCI1 "snow.jpg" "Keywords" "desktop" "|desktop"

whereas if I just search for "desktop" I get this:

match "bar55.jpg"
hCI1 "bar55.jpg" "Keywords" "desktop" ""
hCI2 "bar55.jpg" "Keywords" "" ""
match "snow.jpg"
hCI1 "snow.jpg" "Keywords" "desktop" "|desktop"

and if I search for all 5 keywords, I get -- no surprise -- this:

match "bar55.jpg"
hCI1 "bar55.jpg" "Keywords" "desktop" ""
hCI2 "bar55.jpg" "Keywords" "" ""
hCI1 "bar55.jpg" "Keywords" "fun" ""
hCI2 "bar55.jpg" "Keywords" "" ""
hCI1 "bar55.jpg" "Keywords" "new wave" ""
hCI2 "bar55.jpg" "Keywords" "" ""
hCI1 "bar55.jpg" "Keywords" "scanned in" ""
hCI2 "bar55.jpg" "Keywords" "" ""
hCI1 "bar55.jpg" "Keywords" "scenic" ""
hCI2 "bar55.jpg" "Keywords" "" ""
match "snow.jpg"
hCI1 "snow.jpg" "Keywords" "desktop" "|desktop"

I don't have time right now to rearchitect this, and it needs
discussion. For example, perhaps we could use a generation number
that's incremented each time the search conditions are changed, and
short circuit all this work otherwise. Or find some other way to
ensure that it's only called once.

Oh, and finally, ImageSearchInfo::match() doesn't completely short
circuit match failures. For examplee, it still iterates over the
categories in the options search. Maybe the compiler is clever and
optimizes that out, but I wouldn't want to bet on that.

diff --git a/DB/ImageSearchInfo.cpp b/DB/ImageSearchInfo.cpp
index 8d11a5c6..974edbd2 100644
--- a/DB/ImageSearchInfo.cpp
+++ b/DB/ImageSearchInfo.cpp
@@ -86,8 +86,8 @@ bool ImageSearchInfo::match( ImageInfoPtr info ) const
if ( !m_compiled )
compile();

- bool ok = true;
- ok = m_exifSearchInfo.matches( info->fileName() );
+ if (! m_exifSearchInfo.matches( info->fileName() ))
+ return false;

QDateTime actualStart = info->date().start();
QDateTime actualEnd = info->date().end();
@@ -108,27 +108,28 @@ bool ImageSearchInfo::match( ImageInfoPtr info ) const
bool b2 =( actualStart <= m_date.end() && m_date.end() <= actualEnd );
bool b3 = ( m_date.start() <= actualStart && ( actualEnd <= m_date.end() || m_date.end().isNull() ) );

- ok = ok && ( ( b1 || b2 || b3 ) );
+ if (! ( ( b1 || b2 || b3 ) ) ) return false;
} else if ( !m_date.end().isNull() ) {
bool b1 = ( actualStart <= m_date.end() && m_date.end() <= actualEnd );
bool b2 = ( actualEnd <= m_date.end() );
- ok = ok && ( ( b1 || b2 ) );
+ if (! ( ( b1 || b2 ) ) ) return false;
}

// -------------------------------------------------- Options
// alreadyMatched map is used to make it possible to search for
// Jesper & None
+ qDebug() << "match" << info->fileName().relative();
QMap<QString, StringSet> alreadyMatched;
for (CategoryMatcher* optionMatcher : m_categoryMatchers) {
- ok = ok && optionMatcher->eval(info, alreadyMatched);
+ if (! optionMatcher->eval(info, alreadyMatched) ) return false;
}


// -------------------------------------------------- Label
- ok = ok && ( m_label.isEmpty() || info->label().indexOf(m_label) != -1 );
+ if (! ( m_label.isEmpty() || info->label().indexOf(m_label) != -1 ) ) return false;

// -------------------------------------------------- RAW
- ok = ok && ( m_searchRAW == false || ImageManager::RAWImageDecoder::isRAW( info->fileName()) );
+ if (! ( m_searchRAW == false || ImageManager::RAWImageDecoder::isRAW( info->fileName()) ) ) return false;

// -------------------------------------------------- Rating

@@ -137,61 +138,55 @@ bool ImageSearchInfo::match( ImageInfoPtr info ) const
switch( m_ratingSearchMode ) {
case 1:
// Image rating at least selected
- ok = ok && ( m_rating <= info->rating() );
+ if (! ( m_rating <= info->rating() ) ) return false;
break;
case 2:
// Image rating less than selected
- ok = ok && ( m_rating >= info->rating() );
+ if (! ( m_rating >= info->rating() ) ) return false;
break;
case 3:
// Image rating not equal
- ok = ok && ( m_rating != info->rating() );
+ if (! ( m_rating != info->rating() ) ) return false;
break;
default:
- ok = ok && ((m_rating == -1 ) || ( m_rating == info->rating() ));
+ if (! ((m_rating == -1 ) || ( m_rating == info->rating() ))) return false;
break;
}
- }


// -------------------------------------------------- Resolution
- if ( m_megapixel )
- ok = ok && ( m_megapixel * 1000000 <= info->size().width() * info->size().height() );
+ if ( m_megapixel && (! ( m_megapixel * 1000000 <= info->size().width() * info->size().height() ) ) ) return false;

- if ( m_max_megapixel && m_max_megapixel > m_megapixel )
- ok = ok && ( m_max_megapixel * 1000000 > info->size().width() * info->size().height() );
+ if ( ( m_max_megapixel && m_max_megapixel > m_megapixel ) && (! ( m_max_megapixel * 1000000 > info->size().width() * info->size().height() ) ) ) return false;

// -------------------------------------------------- Text
QString txt = info->description();
if ( !m_description.isEmpty() ) {
QStringList list = m_description.split(QChar::fromLatin1(' '), QString::SkipEmptyParts);
Q_FOREACH( const QString &word, list ) {
- ok = ok && ( txt.indexOf( word, 0, Qt::CaseInsensitive ) != -1 );
+ if (! ( txt.indexOf( word, 0, Qt::CaseInsensitive ) != -1 ) ) return false;
}
}

// -------------------------------------------------- File name pattern
- ok = ok && ( m_fnPattern.isEmpty() ||
- m_fnPattern.indexIn( info->fileName().relative() ) != -1 );
+ if (! ( m_fnPattern.isEmpty() ||
+ m_fnPattern.indexIn( info->fileName().relative() ) != -1 ) ) return false;


#ifdef HAVE_KGEOMAP
// Search for GPS Position
- if (ok && m_usingRegionSelection) {
- ok = ok && info->coordinates().hasCoordinates();
- if (ok) {
- float infoLat = info->coordinates().lat();
- float infoLon = info->coordinates().lon();
- ok = ok
- && m_regionSelectionMinLat <= infoLat
- && infoLat <= m_regionSelectionMaxLat
- && m_regionSelectionMinLon <= infoLon
- && infoLon <= m_regionSelectionMaxLon;
+ if (m_usingRegionSelection) {
+ if (! info->coordinates().hasCoordinates() ) return false;
+ float infoLat = info->coordinates().lat();
+ float infoLon = info->coordinates().lon();
+ if (! (m_regionSelectionMinLat <= infoLat
+ && infoLat <= m_regionSelectionMaxLat
+ && m_regionSelectionMinLon <= infoLon
+ && infoLon <= m_regionSelectionMaxLon ) ) return false;
}
}
#endif
-
- return ok;
+ return true;
}
Post by Tobias Leupold
Post by Robert Krawitz
Well, I did find one piece of really low hanging fruit: the overlap
function (which, mind you, is used in only one place, so Set.cpp and
Set.h might just as well be removed altogether) was computing the
intersection of the two sets of strings rather than just determining
whether there is an intersection.
For my test case, it cut the time from about 18 seconds to maybe 13.
While with only 222,000 images and searching for about 40 tags it
should be able to do a lot better indeed than that, this is at least a
start in the right direction.
If you want, I'll do a patch that removes Set.h and Set.cpp altogether.
diff --git a/DB/ImageInfo.cpp b/DB/ImageInfo.cpp
index d5b8cb25..c5897fe7 100644
--- a/DB/ImageInfo.cpp
+++ b/DB/ImageInfo.cpp
@@ -132,7 +132,7 @@ bool ImageInfo::hasCategoryInfo( const QString& key,
const QString& value ) cons
bool DB::ImageInfo::hasCategoryInfo( const QString& key, const StringSet&
values ) const {
- return Utilities::overlap( m_categoryInfomation[key], values );
+ return values.intersects( m_categoryInfomation[key] );
}
The hasCategoryInfo() with debugging:

bool ImageInfo::hasCategoryInfo( const QString& key, const QString& value ) const
{
QString answer;
Q_FOREACH(QString str, m_categoryInfomation[key]) {
answer = answer + QString::fromLatin1("|");
answer = answer + str;
}
qDebug() << "hCI1" << m_fileName.relative() << key << value << answer;
return m_categoryInfomation[key].contains(value);
}

bool DB::ImageInfo::hasCategoryInfo( const QString& key, const StringSet& values ) const
{
QString answer, answer1;
Q_FOREACH(QString str, m_categoryInfomation[key]) {
answer = answer + QString::fromLatin1("|");
answer = answer + str;
}
Q_FOREACH(QString str, values) {
answer1 = answer1 + QString::fromLatin1("|");
answer1 = answer1 + str;
}
qDebug() << "hCI2" << m_fileName.relative() << key << answer1 << answer;
return values.intersects( m_categoryInfomation[key] );
}


All of the stack traces:

#7 0x00007f2a8137f493 in QMetaObject::activate(QObject*, int, int, void**) ()
at /usr/lib64/libQt5Core.so.5
#8 0x00007f2a84bc0c25 in QAbstractItemView::activated(QModelIndex const&) ()
at /usr/lib64/libQt5Widgets.so.5
#9 0x00007f2a84bc6ef6 in QAbstractItemView::mouseReleaseEvent(QMouseEvent*) ()
at /usr/lib64/libQt5Widgets.so.5
#10 0x00007f2a84beb38e in QListView::mouseReleaseEvent(QMouseEvent*) ()
at /usr/lib64/libQt5Widgets.so.5
#11 0x00007f2a849d1487 in QWidget::event(QEvent*) ()
at /usr/lib64/libQt5Widgets.so.5
#12 0x00007f2a84aaddde in QFrame::event(QEvent*) ()
at /usr/lib64/libQt5Widgets.so.5
#13 0x00007f2a84bcdcfb in QAbstractItemView::viewportEvent(QEvent*) ()
at /usr/lib64/libQt5Widgets.so.5
#14 0x00007f2a81355801 in QCoreApplicationPrivate::sendThroughObjectEventFilters(QObject*, QEvent*) () at /usr/lib64/libQt5Core.so.5
#15 0x00007f2a84993ad5 in QApplicationPrivate::notify_helper(QObject*, QEvent*) () at /usr/lib64/libQt5Widgets.so.5
#16 0x00007f2a8499aedc in QApplication::notify(QObject*, QEvent*) ()
at /usr/lib64/libQt5Widgets.so.5
#17 0x00007f2a81355935 in QCoreApplication::notifyInternal2(QObject*, QEvent*) () at /usr/lib64/libQt5Core.so.5
#18 0x00007f2a84999d59 in QApplicationPrivate::sendMouseEvent(QWidget*, QMouseEvent*, QWidget*, QWidget*, QWidget**, QPointer<QWidget>&, bool) ()
at /usr/lib64/libQt5Widgets.so.5
#19 0x00007f2a849e9b91 in () at /usr/lib64/libQt5Widgets.so.5
#20 0x00007f2a849ec0f3 in () at /usr/lib64/libQt5Widgets.so.5
#21 0x00007f2a84993afc in QApplicationPrivate::notify_helper(QObject*, QEvent*) () at /usr/lib64/libQt5Widgets.so.5
#22 0x00007f2a8499a840 in QApplication::notify(QObject*, QEvent*) ()
at /usr/lib64/libQt5Widgets.so.5
#23 0x00007f2a81355935 in QCoreApplication::notifyInternal2(QObject*, QEvent*) () at /usr/lib64/libQt5Core.so.5
#24 0x00007f2a819184ed in QGuiApplicationPrivate::processMouseEvent(QWindowSystemInterfacePrivate::MouseEvent*) () at /usr/lib64/libQt5Gui.so.5
#25 0x00007f2a8191a0a5 in QGuiApplicationPrivate::processWindowSystemEvent(QWindowSystemInterfacePrivate::WindowSystemEvent*) () at /usr/lib64/libQt5Gui.so.5
#26 0x00007f2a818f88ab in QWindowSystemInterface::sendWindowSystemEvents(QFlags<QEventLoop::ProcessEventsFlag>) () at /usr/lib64/libQt5Gui.so.5
#27 0x00007f2a70016e50 in () at /usr/lib64/libQt5XcbQpa.so.5
#28 0x00007f2a7b501134 in g_main_context_dispatch ()
at /usr/lib64/libglib-2.0.so.0
#29 0x00007f2a7b501388 in () at /usr/lib64/libglib-2.0.so.0
#30 0x00007f2a7b50142c in g_main_context_iteration ()
at /usr/lib64/libglib-2.0.so.0
#31 0x00007f2a813a611c in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () at /usr/lib64/libQt5Core.so.5
#32 0x00007f2a81353c2b in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () at /usr/lib64/libQt5Core.so.5
#33 0x00007f2a8135c1f4 in QCoreApplication::exec() ()
at /usr/lib64/libQt5Core.so.5
#34 0x0000000000460307 in main ()
--
Robert Krawitz <***@alum.mit.edu>

*** MIT Engineers A Proud Tradition http://mitathletics.com ***
Member of the League for Programming Freedom -- http://ProgFree.org
Project lead for Gutenprint -- http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton
Johannes Zarl-Zierl
2017-07-16 21:50:43 UTC
Permalink
Hi,
Post by Tobias Leupold
In case of doubt, Johannes is the one to decide whether or not, but I think
this can be removed in favor of using Qt directly. Which does not only – as
you stated – speed up the code, but also simplify it.
I don't think there is any doubt here: you both mentioned two concrete
advantages and no obvious disadvantages. Go for it...

Cheers,
Johannes
Robert Krawitz
2017-07-16 21:59:28 UTC
Permalink
Post by Johannes Zarl-Zierl
Post by Tobias Leupold
In case of doubt, Johannes is the one to decide whether or not, but I think
this can be removed in favor of using Qt directly. Which does not only – as
you stated – speed up the code, but also simplify it.
I don't think there is any doubt here: you both mentioned two
concrete advantages and no obvious disadvantages. Go for it...
OK. I'm going to be out of town, so I may have less access off and
on.
--
Robert Krawitz <***@alum.mit.edu>

*** MIT Engineers A Proud Tradition http://mitathletics.com ***
Member of the League for Programming Freedom -- http://ProgFree.org
Project lead for Gutenprint -- http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton
Robert Krawitz
2017-07-17 03:35:02 UTC
Permalink
Post by Tobias Leupold
In case of doubt, Johannes is the one to decide whether or not, but I think
this can be removed in favor of using Qt directly. Which does not only – as
you stated – speed up the code, but also simplify it.
I don't think there is any doubt here: you both mentioned two concrete advantages and no obvious disadvantages. Go for it...
It was messier than I expected because there were a lot of uses of
StringSet, which was defined in Set.h. I renamed Set.h to StringSet.h
and cleaned up the uses of Set.h (some of which were gratuitous).

I guess I could have done it the easy way and just removed Set.cpp and
everything but StringSet from Set.h, but I was a glutton for punishment.
--
Robert Krawitz <***@alum.mit.edu>

*** MIT Engineers A Proud Tradition http://mitathletics.com ***
Member of the League for Programming Freedom -- http://ProgFree.org
Project lead for Gutenprint -- http://gimp-print.sourceforge.net

"Linux doesn't dictate how I work, I dictate how Linux works."
--Eric Crampton
Loading...