Item11664: optimize indexer

Priority: Enhancement
Current State: Closed
Released In: n/a
Target Release:
Applies To: Extension
Component: SolrPlugin
Branches: trunk
Reported By: MichaelDaum
Waiting For:
Last Change By: MichaelDaum
see also: Support.Question1060

Solr-Indexer needs about 2-3 Minutes to index such a Topic

That sounds incredibly long for indexing a single topic.

So how exactly is this topic protected by ACLs:

  • how many users have got access?
  • how many don't?
  • are access rights assigned on topic level or web level?
  • how many attachments does the topic have?
  • how do you specify ACLs: using lists of users, lists of groups, or a mixture?
  • what do the user groups look like with regard to nesting?

I'd like to profile this myself, but need to configure a similar setup over here...

-- MichaelDaum - 16 Mar 2012

With 300+ webs, 40k+ topics and 2000+ authenticated users, we observe all kinds of settings.

  • When indexing a single topic, it does not matter where the ACLs are set. Only when indexing multiple topics does it help that ACLs set in WebPreferences have to be checked only once.
  • The percentage of users granted access seems to have an impact. In our case it's probably 2000 out of 2400*
    • * Foswiki seems to remember users that have left the organisation
    • There are authenticated users who are not members of the organization. They can only be identified by not being in a certain group. There is no group of 'externals', and users feel very uncomfortable with deny rules anyway.
  • It matters only marginally whether ACLs are defined in deeply nested LDAP groups or in a flat wiki group listing 2000 WikiNames; we tested both.
  • We operate almost exclusively with ALLOW rules. In rare cases we have DENY for WikiGuest.

Profiling points to non-ideal application logic:
  1. whenever a topic contains ALLOW|DENY, a flag $topicHasPerms is set
  2. Foswiki then iterates through all users to check whether each user has view access
    • we observed that how the ACLs are defined has almost no performance impact

We guess it would be faster to determine the list of allowed users by logic derived from ALLOWTOPICVIEW and DENYTOPICVIEW and then iterate through the list of allowed users.
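The idea of deriving the allowed users from the rules can be sketched roughly as follows. This is a language-neutral illustration in Python, not Foswiki code: the function names, the `groups` mapping and the precedence rules (DENY wins, an empty ALLOW list means no restriction) are simplifying assumptions, and web-level settings and admin users are ignored.

```python
def expand(names, groups, _seen=None):
    """Resolve a mixed list of user and group names into a set of user names."""
    seen = set() if _seen is None else _seen
    users = set()
    for name in names:
        if name in groups:
            if name not in seen:  # guard against cyclic group nesting
                seen.add(name)
                users |= expand(groups[name], groups, seen)
        else:
            users.add(name)
    return users


def allowed_viewers(allow, deny, groups, all_users):
    """Users who may view a topic, derived from its ALLOW and DENY lists."""
    denied = expand(deny, groups)
    # Assumption: an empty ALLOW list means "no restriction".
    candidates = expand(allow, groups) if allow else set(all_users)
    # Drop names that are no longer known users, then apply the DENY list.
    return (candidates & set(all_users)) - denied
```

With this shape the cost depends on the size of the rule lists and groups, and each group is expanded at most once per topic instead of once per user.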

Or an even more radical change: add the allow/deny rules to the index and include group membership in the Solr query. This would make a full reindex less important, as new users would get useful results immediately, not only after the next full index run.
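A rough sketch of what the query side of this could look like, assuming the index stores user and group names in a multi-valued field and that the groups of the searching user are already known at query time. The function names and the default field name are hypothetical, and DENY rules are not handled here.

```python
def quote(term):
    """Quote a term as a Solr phrase, escaping backslashes and double quotes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def acl_filter_query(user, groups_of_user, field="access_granted"):
    """Build a filter query matching documents granted to the user or one of their groups."""
    tokens = [user] + sorted(groups_of_user)
    return "%s:(%s)" % (field, " OR ".join(quote(t) for t in tokens))


# Hypothetical user and groups, for illustration only.
fq = acl_filter_query("JohnDoe", {"StaffGroup", "AdminGroup"})
# fq == 'access_granted:("JohnDoe" OR "AdminGroup" OR "StaffGroup")'
```

The index would then hold a few rule entries per topic rather than one entry per allowed user, at the price of resolving the user's groups on every search.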

-- AndreLichtsteiner - 16 Mar 2012

How often do you register new users? How often are they backed out?

So from what you outline the test case we should optimize for is a topic that has got a single ALLOWTOPICVIEW = myorganization rule which allows 2000 of 2400 users to access the topic. Will check.

For now, the only API that we've got to interact with the ACL system of Foswiki is Foswiki::Func::checkAccessPermission("view", $user, ... , $topic, $web). Any other approach to reading the allow/deny settings will have to bypass this API to be more efficient.

The problem at hand can be illustrated as reading a big matrix (topics, users): while indexing a topic, we need the complete row of entries in that matrix for that specific topic. The current Foswiki API, however, only allows you to sample one cell of that matrix, with no other way to operate on that data. This data does not formally exist in Foswiki. As such, it has to be built up partially during the index process. That's what is taking so long.

For normal operations, checking a single point in the ACL matrix is enough. What we need here is a way to get all users that have access rights to a topic.
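The matrix picture can be made concrete with a small sketch. It is a toy model in Python, not the Foswiki API: `CountingAcl` and `row_by_sampling` are hypothetical names, and the per-cell check merely stands in for a single access check per (topic, user) pair.

```python
class CountingAcl:
    """Toy ACL store that counts how often a single matrix cell is probed."""

    def __init__(self, viewers_by_topic):
        self.viewers_by_topic = viewers_by_topic
        self.calls = 0

    def check_access(self, user, topic):
        # Stand-in for a per-cell check; not the real Foswiki call.
        self.calls += 1
        return user in self.viewers_by_topic.get(topic, set())


def row_by_sampling(acl, topic, users):
    """Collect one matrix row by probing every (topic, user) cell."""
    return {user for user in users if acl.check_access(user, topic)}


# Numbers taken from the thread: 2000 of 2400 users may view the topic.
users = ["User%d" % i for i in range(2400)]
acl = CountingAcl({"Main.SomeTopic": set(users[:2000])})
row = row_by_sampling(acl, "Main.SomeTopic", users)
```

Building the row this way costs one check per known user for every protected topic, regardless of how the rule itself is written.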

With regard to your more "radical change" proposal: Apache ManifoldCF comes with a search component that can be integrated into the Solr server to bring ACLs directly to the search engine itself. This, however, would need quite a change in the way Foswiki operates on identities by means of using "access tokens" granted to outside services...

-- MichaelDaum - 16 Mar 2012

Andre, what exactly would you propose to improve the application logic that collects ACLs? That's still unclear from what you write. Could you elaborate, please?

-- MichaelDaum - 19 Nov 2012

I've rewritten the access_granted generator; it's much faster now. Topics that used to take 15 minutes to index now take something like 10 seconds (including stuff that needs to be done only once per index run; indexing an individual topic takes about half a second if several thousand entries need to be added to access_granted).

The cool solution would be Andre's "radical change", but I don't quite see how it could realistically be done, given that we can't do a fast "give me all groups of this user" lookup at search time (or can we?).

I'm attaching my changes as a diff. I hope they apply cleanly to 1.10.

-- JanKrueger - 18 Feb 2013

This was merged a year ago as far as I can see.

-- MichaelDaum - 23 Feb 2015
Topic revision: r14 - 23 Feb 2015, MichaelDaum