Site Search Overview
The mojoPortal content management system search engine is built on Lucene.Net. As content is created and updated, it is added to the search index, a set of files that live in the /Data/Sites/[SiteID]/index folder. The roles allowed to view each item are stored in the index as well, so users cannot find things in search that their role membership does not allow them to see. Search results are filtered by comparing the user's roles with the roles required to view the content.
There is not a one-to-one correspondence between search results and URLs in the internal search engine.
The internal search engine indexes content differently from public search engines such as Google or Bing. Public search engines index your content by crawling your site, so each crawlable URL corresponds to one document in their index, and one search result corresponds to one URL in your site. The internal Lucene search index is not built by crawling the site and is not keyed by URL, so there is no one-to-one correspondence between URLs in your site and documents in the search index. Instead, each indexable item is a separate document in the index. If a page has 2 Html content articles on it, each article is represented by a separate document in the search index, even though both may be associated with the same URL.
Indexing is implemented at the feature level, not at the page level, and it happens when content is edited or role permissions are changed. Pages are just containers for features; they know nothing about the features they contain or the content within those features. In developer terminology, this is a separation of concerns. A feature instance may also be published on more than one page, in which case there will be a search result for each page that has the content.
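The feature-level indexing described above can be illustrated with a small sketch. This is not mojoPortal or Lucene.Net code: `IndexDoc`, `index_item`, and `search` are hypothetical names, the page URLs are made up, and an in-memory list stands in for the Lucene index. It only shows that each indexable item becomes its own document, so several results can share one URL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexDoc:
    url: str      # page URL where the item is published
    feature: str  # feature that owns the item
    title: str
    body: str


def index_item(index, feature, title, body, page_urls):
    """Add one document per page that the feature instance is published on."""
    for url in page_urls:
        index.append(IndexDoc(url, feature, title, body))


def search(index, term):
    """Naive substring match standing in for a real Lucene query."""
    term = term.lower()
    return [d for d in index if term in d.title.lower() or term in d.body.lower()]


index = []
# Two Html Content items on the same page, one of them also published elsewhere.
index_item(index, "Html Content", "Welcome", "About our search engine", ["/home.aspx"])
index_item(index, "Html Content", "News", "Search engine upgraded", ["/home.aspx", "/news.aspx"])

results = search(index, "search engine")
for doc in results:
    print(doc.url, "-", doc.title)
```

In this sketch, both items on /home.aspx match, so that URL appears twice in the results, and the item published on two pages produces one result per page.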
There is one downside to this: if there are 2 articles on a page and both match the search terms, you will see 2 results, one for each article, even though they are on the same page and URL. The same applies to the Image Gallery feature (there is more than one Gallery feature; this refers to the one that supports captions and descriptions, since it is the captions and descriptions that get indexed). Each image corresponds to a document in the search index, so if more than one image has a description that matches the search term, you will get a search result for each matching image, even though they are all on the page that contains the Image Gallery.
mojoPortal CMS does allow you to use Google or Bing for site search instead of, or in addition to, the internal Lucene search engine, so if you don't like this difference you could use one of those. However, the advantage of the internal search index is that it can index content that is not on public pages, such as content on pages protected by roles. Google and Bing cannot crawl these protected pages and therefore cannot index their content. The internal search engine stores the allowed view roles with the content in the search index, and when searching, the user's roles are passed in as a parameter. Results for protected items are shown if the user is in an allowed role and filtered out otherwise.
Changes as of mojoPortal version 220.127.116.11
In mojoPortal CMS version 18.104.22.168 we added some new fields to the search index and changed the way we store some existing fields. To keep upgrades compatible with existing sites, we added configuration options so that search still works with indexes created by older versions of mojoPortal. As a result, some of the new search engine enhancements are disabled by default. Existing sites can use the new features by changing the configuration and then rebuilding the search index. For large sites with lots of traffic, rebuilding the search index can be risky and is not something to undertake lightly; if your site is not very large and does not have much traffic, it is fairly trivial. You should read the article about rebuilding the search index to decide whether you want to do it.
The new features available if you rebuild the search index include the ability to filter search by feature and the optional ability to highlight matching terms in the search results. However, it is important to understand that in order to highlight the results, the content has to be stored in the index in addition to the data needed to search, which makes the index much larger on disk. Here on mojoportal.com as of 2009-05-27, the search index is about 28MB on disk, where I think it was previously about 8MB.
The relevant Web.config/user.config settings to use these new features are as follows:
<add key="SearchUseBackwardCompatibilityMode" value="false" />
This is set to true by default in order to not break existing search indexes. To use the new features, set this to false before you rebuild the index.
<add key="DisableSearchFeatureFilters" value="false" />
This is set to true by default because it can only work if you have rebuilt the index. So if you are going to rebuild the index then set this to false.
<add key="EnableSearchResultsHighlighting" value="true" />
This is false by default. If you are going to rebuild the search index and want to use results highlighting then you should set this to true before you rebuild the index. If you rebuild the index with this set to false it will not store the needed data for results highlighting, so if you later decide you want to use results highlighting then you will have to set this to true and rebuild the index again.
All the searchable features now have a search name in addition to their feature name, because the feature name is not always appropriate for the search list. So the "Blog" feature is listed in search as "Blogs", and "Event Calendar Pro" is listed as "Events". You can find the search name for a feature under Administration Menu > Advanced Tools > Feature Installation. Typically the name is really a resource key, so the text is retrieved from a .resx file and can be localized. However, it can also be customized: if you want to change how a feature is listed, you can add a setting in user.config to override it. For example, the "Html Content" feature is listed as "Articles" based on the resource key "HtmlContentSearchName", as seen in the feature definition. If you want it listed as "Documents" instead, add a setting in user.config like this:
<add key="HtmlContentSearchName" value="Documents" />
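The lookup order implied here, override first and localized resource second, can be sketched as follows. This is an illustration and not mojoPortal's actual code: `search_name` is a hypothetical helper, and plain dictionaries stand in for user.config and the .resx file.

```python
def search_name(resource_key, config, resources):
    """Return the config override if present, else the localized resource text."""
    override = config.get(resource_key)
    if override:
        return override
    # Fall back to the resource text, or to the key itself if nothing is defined.
    return resources.get(resource_key, resource_key)


resources = {"HtmlContentSearchName": "Articles"}  # stands in for a .resx file
config = {"HtmlContentSearchName": "Documents"}    # stands in for user.config

print(search_name("HtmlContentSearchName", {}, resources))
print(search_name("HtmlContentSearchName", config, resources))
```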
Also, you may not want all the searchable features listed. For example, users who have purchased Event Calendar Pro and are not using the other Event Calendar may wish to exclude the other one from the search list. In that case you can put a comma-separated list of feature GUIDs to exclude in the user.config setting shown below. You get the feature GUID from the same place where you find the search name. An example that removes the free Event Calendar is as follows:
<add key="SearchableFeatureGuidsToExclude" value="c5e6a5df-ac2a-43d3-bb7f-9739bc47194e" />
To exclude more features just separate the GUIDs with a comma and no spaces.
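As a rough illustration of how such an exclusion list could be applied, here is a minimal sketch. It is not mojoPortal's implementation: `parse_excluded` and `searchable_features` are hypothetical helpers, the second GUID is a made-up placeholder, and the display names are only for illustration.

```python
FREE_EVENT_CALENDAR = "c5e6a5df-ac2a-43d3-bb7f-9739bc47194e"  # GUID from the example above
PLACEHOLDER_GUID = "00000000-0000-0000-0000-000000000001"     # hypothetical, not a real feature GUID


def parse_excluded(setting_value):
    """Split the comma-separated setting into a set of lowercase GUID strings."""
    return {guid.lower() for guid in setting_value.split(",") if guid}


def searchable_features(features, setting_value):
    """Keep only the features whose GUID is not in the exclusion list."""
    excluded = parse_excluded(setting_value)
    return [name for guid, name in features if guid.lower() not in excluded]


features = [
    (FREE_EVENT_CALENDAR, "Event Calendar"),
    (PLACEHOLDER_GUID, "Events"),
]

remaining = searchable_features(features, FREE_EVENT_CALENDAR)
print(remaining)
```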
One of the things that changed as of version 22.214.171.124, if you are using a rebuilt index, is the way role filtering is done. Previously we retrieved all matches and then removed from the display any results the user's roles did not allow. The problem with this is that the search results page may say there are 100 matching results when the user doesn't actually get to see that many, so the numbers are wrong. In the new approach we pass the user's roles into the search query and filter by role right in the query, so the numbers are always correct.
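The difference between the two approaches can be shown with a small sketch. This is a simplified illustration, not the actual Lucene.Net query code: the function names, the dictionary-based documents, and the role names are assumptions made for the example.

```python
def allowed(doc, user_roles):
    """A document is visible if it is public or shares a role with the user."""
    return "All Users" in doc["view_roles"] or bool(doc["view_roles"] & user_roles)


def search_then_filter(index, term, user_roles):
    """Older approach: count every match, then hide disallowed ones at display time."""
    matches = [d for d in index if term in d["text"]]
    visible = [d for d in matches if allowed(d, user_roles)]
    return len(matches), visible


def search_with_roles(index, term, user_roles):
    """Newer approach: the role check is part of the query, so the count is accurate."""
    visible = [d for d in index if term in d["text"] and allowed(d, user_roles)]
    return len(visible), visible


index = [
    {"text": "public pricing page", "view_roles": {"All Users"}},
    {"text": "internal pricing notes", "view_roles": {"Staff"}},
    {"text": "admin pricing overrides", "view_roles": {"Admins"}},
]

old_count, old_visible = search_then_filter(index, "pricing", {"Staff"})
new_count, new_visible = search_with_roles(index, "pricing", {"Staff"})
print(old_count, len(old_visible))
print(new_count, len(new_visible))
```

With the older approach the reported count includes a result the user cannot see, while the newer approach reports only what is actually shown.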
OpenSearch with Autodiscovery
Firefox and Internet Explorer both support OpenSearch provider plugins, which enable users to add your site search to their search toolbar.
If the user clicks the drop-down list in their search bar, it will show an entry for "Add [your site name] Search".