Search results - character entities

This is the place to report bugs and get support. When posting in this forum, please always provide as much detail as possible.

Please do not report problems with a custom build or custom code in this forum. If you are producing your own build from the source code and have problems or questions, ask in the developer forum, do not report it as a bug.

When posting in this forum, please try to provide as many relevant details as possible. Particularly the following:

  • What operating system were you running when the bug appeared?
  • What database platform is your site using?
  • What version of mojoPortal are you running?
  • What version of .NET do you use?
  • What steps are necessary to reproduce the issue? Compare expected results vs actual results.
This thread is closed to new posts.
12/12/2007 4:32:38 AM
Total Posts 9

Search results - character entities

Hi Joe,

In search results, each found record shows its name and the beginning of its text. The beginning is some number of the first characters of the text, but if it ends in the middle of a character entity, the found records show a beginning that looks like "sample text&ia...".

Also, if the text has many character entities, the beginning of the text shown in found records is quite short.

Regarding these problems, I'd like to ask how (or where) characters are translated to character entities. Is it just on the client side? Is it possible to turn this translation off? I ask because some Czech characters are translated and others are not, so I can only search for phrases that contain no translated characters. And if I translated the searched phrase into entities (in the source code of the search engine), I think it would translate all Czech characters into entities, and it would then be impossible to find phrases with Czech characters that weren't translated.

 

Thanks

12/12/2007 6:45:33 AM
Total Posts 18439

Re: Search results - character entities

Hi,

I think there are several issues going on.

1. Some content is not found when searching. This could be due to a lack of support for the Czech language in Lucene.NET; it might be worth reviewing which languages are supported to confirm this. I have seen this problem before with other languages that are not yet supported. For example, text with certain Swedish characters is also known not to be searchable.

2. The intro content is sometimes truncated in the middle of a word or character entity.

The intro text is only for display, so it is not connected to problem 1: it is not used in the actual search, only in the display of results.

When we retrieve the results from the search index, we apply some security to prevent cross-site scripting. If you look in the markup of SearchResults.aspx you'll see:

SecurityHelper.PreventCrossSiteScripting(DataBinder.Eval(Container.DataItem, "Intro").ToString())

Since the actual content is stored raw, we use a white list approach to filter out or encode anything that doesn't match the white list of allowed markup.

So in trying to solve this, I think we first need to find out whether the language is supported and, if not, perhaps you can look into what is needed to support it in Lucene.NET. The other thing is that we may need a smarter way of truncating the string to produce an intro for display without breaking on a character entity or word. The intro text is created in mojoPortal.Business.WebHelpers.IndexHelper.RebuildIndex(...) at the time of content indexing, so that is where we would plug in a better solution.
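To make the "smarter truncation" idea concrete, here is a minimal sketch of a truncation helper that backs off when the cut would land inside a character entity or a word. IntroTruncator, Truncate and LooksLikeEntity are hypothetical names used only for illustration; this is not the existing mojoPortal code, and the entity check is a simple heuristic (an ampersand, up to nine letters, digits or '#', then a semicolon).

```csharp
using System;

// Hypothetical helper for illustration only; not part of mojoPortal.
public static class IntroTruncator
{
    public static string Truncate(string text, int maxLength)
    {
        if (string.IsNullOrEmpty(text) || maxLength <= 0) return string.Empty;
        if (text.Length <= maxLength) return text;

        int cut = maxLength;

        // If the cut lands inside an entity such as "&iacute;",
        // move it back to just before the ampersand.
        int amp = text.LastIndexOf('&', cut - 1);
        if (amp >= 0)
        {
            int semi = text.IndexOf(';', amp);
            if (semi >= cut && LooksLikeEntity(text, amp, semi))
            {
                cut = amp;
            }
        }

        // If the cut is not on whitespace, back up to the previous
        // whitespace so the last word is not split. When there is no
        // earlier whitespace, keep the (entity-safe) hard cut.
        if (cut > 0 && !char.IsWhiteSpace(text[cut]))
        {
            int space = text.LastIndexOfAny(new[] { ' ', '\t', '\n', '\r' }, cut - 1);
            if (space > 0) cut = space;
        }

        return text.Substring(0, cut).TrimEnd() + "...";
    }

    // Heuristic: "&" + 1 to 9 letters, digits or '#' + ";".
    private static bool LooksLikeEntity(string text, int amp, int semi)
    {
        int nameLength = semi - amp - 1;
        if (nameLength < 1 || nameLength > 9) return false;
        for (int i = amp + 1; i < semi; i++)
        {
            char c = text[i];
            if (!char.IsLetterOrDigit(c) && c != '#') return false;
        }
        return true;
    }
}
```

With this sketch, Truncate("sample text&iacute;k more", 14) gives "sample..." rather than "sample text&ia...": the cut first moves back to the ampersand, and then to the previous space, because "text&iacute;k" is a single word.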

Hope it helps,

Joe

12/12/2007 7:22:52 AM
Total Posts 18439

Re: Search results - character entities

One correction to my previous remarks. I said we are storing it raw, but in the case of the intro (which again is only used for display, not search) we apply a regex to remove markup before truncating the text to create the intro. I'm not sure whether this has an impact on the actual problem of what is displayed. I think it would not filter the Czech characters, so as far as I understand the encoding into character entities is still happening in SecurityHelper.PreventCrossSiteScripting(...) during display.

We store the actual content in the db raw, and the raw content is also used for indexing into the search index. The intro is a piece of data stored in the index for each item but is not used in search; it's only used during display of items that matched the search.
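As a rough illustration of the order of operations described here (strip markup with a regex, then truncate), the sketch below also decodes character entities before truncating. That extra step is an assumption, not something the current code is known to do: it only helps if the stored content already contains entities (for example because the editor produced them), in which case counting decoded characters keeps the intro from coming out short and from ending mid-entity. IntroBuilder and BuildIntro are hypothetical names, the tag regex is deliberately naive, and System.Net.WebUtility requires .NET 4.0 or later (System.Web.HttpUtility.HtmlDecode is the older equivalent).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of mojoPortal.
public static class IntroBuilder
{
    // Naive tag stripper; real-world HTML may need a proper parser.
    private static readonly Regex Tags = new Regex("<[^>]+>", RegexOptions.Compiled);
    private static readonly Regex Spaces = new Regex(@"\s+", RegexOptions.Compiled);

    public static string BuildIntro(string html, int maxLength)
    {
        if (string.IsNullOrEmpty(html) || maxLength <= 0) return string.Empty;

        // 1. Remove markup, leaving a space so adjacent words stay apart.
        string text = Tags.Replace(html, " ");

        // 2. Decode entities so "&iacute;" counts as one character.
        text = WebUtility.HtmlDecode(text);

        // 3. Collapse runs of whitespace left behind by the removed tags.
        text = Spaces.Replace(text, " ").Trim();

        // 4. Truncate on decoded characters, so no entity can be split.
        if (text.Length <= maxLength) return text;
        return text.Substring(0, maxLength).TrimEnd() + "...";
    }
}
```

Because the result is decoded plain text, it would still have to be encoded again at display time, which is the job the existing SecurityHelper.PreventCrossSiteScripting(...) call on the results page is described as doing.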

Joe
