Duplicate Search Results Coming from internal search

This thread is closed to new posts. You must sign in to post in the forums.
1/11/2011 1:29:27 PM

Re: Duplicate Search Results Coming from internal search

The internal search engine is a complex system because it has to account for permissions and whether the user can see the page, so that we don't leak any secure data in search results. That means when view roles change on pages or content, the content has to be re-indexed. Crawler-based search engines have it much easier: they only see public pages and can index by the specific url, so there are no duplicates. We store the view roles in the search index along with the content, and we pass in the user's roles during search so we can filter the results.
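To make the role filtering concrete, here is a rough sketch in Python. It is not mojoPortal's actual code; the names `IndexedItem` and `search_visible`, the sample roles, and the simple title matching are all assumptions made up for illustration. It only shows the idea of storing view roles with each indexed item and filtering by the roles passed in at search time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedItem:
    """Hypothetical index document: content plus the roles allowed to view it."""
    title: str
    url: str
    view_roles: frozenset


def search_visible(index, term, user_roles):
    """Return items whose title matches term and that share a role with the user."""
    roles = set(user_roles)
    return [
        item for item in index
        if term.lower() in item.title.lower() and item.view_roles & roles
    ]


# Illustrative data only.
sample_index = [
    IndexedItem("Pricing overview", "/pricing.aspx", frozenset({"All Users"})),
    IndexedItem("Pricing for partners", "/partners.aspx", frozenset({"Partners"})),
]

anonymous_hits = search_visible(sample_index, "pricing", {"All Users"})
partner_hits = search_visible(sample_index, "pricing", {"All Users", "Partners"})
```

Because the roles live inside the index, a change to view roles means the stored copy is stale until the item is re-indexed, which is why re-indexing is needed in that case.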

In mojoPortal, content is not indexed by url and not by a crawler. The "Page" is not what gets indexed; the feature instances on the page are, and each searchable feature is responsible for indexing its own content. So the structure of the search index is not the same as the structure of the site. There is a document in the index for each indexable item, and each matching document is a hit in search results, even if more than one document points to the same url. Pages don't know how to index anything, and they may contain any number of searchable/indexable features, including custom features that developers implement themselves, which may also implement search.
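Here is a small Python sketch of why that produces several hits for one url. It is only an illustration under assumed names (`build_feature_index`, `find_hits`, and the sample page are hypothetical), not how the real index is built.

```python
from collections import Counter

# Hypothetical page: one url hosting three feature instances, two of them searchable.
sample_page = {
    "url": "/home.aspx",
    "features": [
        {"title": "Welcome", "body": "Welcome to our mojoPortal site", "searchable": True},
        {"title": "News", "body": "mojoPortal release news", "searchable": True},
        {"title": "Login", "body": "", "searchable": False},
    ],
}


def build_feature_index(pages):
    """Emit one index document per searchable feature instance, not one per page."""
    docs = []
    for page in pages:
        for feature in page["features"]:
            if feature["searchable"]:
                docs.append(
                    {"url": page["url"], "title": feature["title"], "body": feature["body"]}
                )
    return docs


def find_hits(docs, term):
    """Every matching document is its own hit, whatever url it points to."""
    return [doc for doc in docs if term.lower() in doc["body"].lower()]


hits = find_hits(build_feature_index([sample_page]), "mojoportal")
hits_per_url = Counter(hit["url"] for hit in hits)
print(hits_per_url["/home.aspx"])  # 2
```

Two searchable instances on the same page both match, so the same url shows up twice in the results.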

The only url-based filtering approach I can think of would be done during databinding of the search results, where we "could" keep track of the url for each result and filter an item out if the previous result had the same url. The downside is that this is problematic for paged search results: if the page size is 10 and the first 10 items have the same url, that kind of filtering would render only 1 row on the first page of results. So, I'm not real keen on that approach.
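A quick Python sketch of that paging problem, using made-up helper names (`page_of_results`, `drop_consecutive_duplicate_urls`) and made-up data. It assumes the filtering happens after the page of results has already been fetched, which is what filtering at databinding time would amount to.

```python
def page_of_results(hits, page_number, page_size=10):
    """Slice one page out of the full hit list (page_number starts at 1)."""
    start = (page_number - 1) * page_size
    return hits[start:start + page_size]


def drop_consecutive_duplicate_urls(rows):
    """Skip a row when the previous row on the same page had the same url."""
    kept = []
    previous_url = None
    for row in rows:
        if row["url"] != previous_url:
            kept.append(row)
        previous_url = row["url"]
    return kept


# Ten hits from feature instances on the same page, then one hit from another page.
all_hits = [{"url": "/home.aspx", "title": f"Instance {i}"} for i in range(10)]
all_hits.append({"url": "/about.aspx", "title": "About"})

first_page = drop_consecutive_duplicate_urls(page_of_results(all_hits, 1))
print(len(first_page))  # 1
```

The pager still thinks page 1 holds 10 items, but only 1 row renders, and the remaining result is pushed to a second page by itself.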

Suggestions I have to help with this are:

  • If you are creating pages with only Html content features and using the built-in column layout, you can instead use 1 instance of Html and do the layout inside that instance using content templates. This would improve page performance by reducing hits to the database, and it would also mean there is only one indexable Html item on the page and therefore only 1 search result. For example, the home page on this site appears to have 3 columns, but it is really just 1 instance of Html with the column layout internal to the instance.
  • The new setting that lets you exclude an Html instance from search may help with this; it is already implemented in the source code repository.
  • You could make the search results look more distinct, even though they point to the same page, by showing the instance title in each result link. Add this setting to user.config: <add key="ShowModuleTitleInSearchResultLink" value="true"/> I can't remember whether this requires rebuilding the index; if the titles don't show up, try rebuilding the index.

Hope it helps,

Joe
