Proper query syntax

Simple terms and phrases

A search can use simple terms, that is, single words, or phrases, that is, groups of words enclosed in double quotes, for example "Pawiak prison". Enclosing a phrase in quotation marks returns only those documents that contain the whole phrase.

Search terms can be combined with Boolean operators. One can also use wildcard characters to stand for a single character or a sequence of characters, search for terms that are similar in spelling, search for terms that lie within a given distance of each other, and set the relevance of particular search terms.

Boolean operators

  • AND – the symbol && can be used in place of the word AND – means that all terms joined by the operator must occur in a document. For example, the query Starzyński && Lorentz finds only those documents that contain both names. AND is applied by default when more than one word is entered, so the query Starzyński Lorentz returns the same results.
  • OR – the symbol || can be used in place of the word OR – requires at least one of the terms to occur in a document. For example, the query Starzyński || Lorentz finds documents that contain the name Starzyński, the name Lorentz, or both.
  • NOT – the symbol ! can be used in place of the word NOT – excludes documents that contain the negated term. For example, the query "Adolf Hitler" NOT Platz finds documents that contain the phrase "Adolf Hitler" but do not contain the word Platz. This operator cannot be used on its own. For example, the query NOT "Adolf Hitler" will not return correct results.
  • + (the required term operator) requires the term that directly follows + to occur in a document, while the remaining terms are optional. For example, the query +Warsaw ghetto finds documents that must contain the word Warsaw and may or may not contain the word ghetto.
  • - (the forbidden term operator) works similarly to the operator NOT. For example, the query "Adolf Hitler" -"Adolf Hitler Platz" finds documents that contain the name "Adolf Hitler" but do not contain the phrase "Adolf Hitler Platz".
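
The effect of AND, OR and NOT can be pictured with a few lines of Python. This is only a toy model, not the search engine itself: the three sample documents and the helper names (docs, words, matching) are invented for illustration, and each document is treated as a plain set of lower-cased words.

```python
# Hypothetical sample documents, invented for this illustration.
docs = {
    1: "Starzyński met Lorentz in Warsaw",
    2: "Starzyński was the mayor of Warsaw",
    3: "Lorentz directed the National Museum",
}

def words(text):
    """Split a document into a set of lower-cased words."""
    return set(text.lower().split())

def matching(predicate):
    """Return the sorted ids of documents whose word set satisfies the predicate."""
    return sorted(doc_id for doc_id, text in docs.items() if predicate(words(text)))

# Starzyński && Lorentz: both terms must occur.
both = matching(lambda w: "starzyński" in w and "lorentz" in w)
# Starzyński || Lorentz: at least one term must occur.
either = matching(lambda w: "starzyński" in w or "lorentz" in w)
# Starzyński NOT Lorentz: the first term must occur, the second must not.
without = matching(lambda w: "starzyński" in w and "lorentz" not in w)

print(both, either, without)
```

With these sample documents, only document 1 satisfies the AND query, all three satisfy the OR query, and only document 2 satisfies the NOT query.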

Wildcard characters

  • ? – replaces any single character. For example, the query Adamsk? matches both Adamski and Adamska.
  • * – replaces any sequence of characters. For example, the query g*n finds words such as gen, gun, gin etc. A wildcard character cannot be used as the first character of a search term.
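
One way to picture how the two wildcards match is to translate them into a regular expression. The sketch below is an illustration only, not the search engine's own code. The function names wildcard_to_regex and matches are hypothetical, and it assumes that * may also stand for an empty sequence, as in Apache Lucene.

```python
import re

def wildcard_to_regex(pattern):
    """Translate ? and * into a regular expression; other characters are literal."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")       # exactly one character
        elif ch == "*":
            parts.append(".*")      # any run of characters (assumed: possibly empty)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, word):
    """True if the whole word fits the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

print(matches("Adamsk?", "Adamski"), matches("Adamsk?", "Adamska"))
print(matches("g*n", "gun"), matches("g*n", "gene"))
```

Under these assumptions Adamsk? fits Adamski and Adamska but not Adamsk, and g*n fits gun but not gene.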

Fuzzy search

Fuzzy search finds simple terms that are similar in spelling, for example Holocaust and Holokaust. Documents that contain such terms can be found by adding the tilde symbol to the term holocaust: holocaust~.

The required degree of similarity can be set with a similarity coefficient ranging from 0 (no similarity) to 1 (identical terms). The coefficient is set to 0.5 by default. To change it, append the tilde symbol followed by the desired coefficient to the search term, for example holocaust~0.4.
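
To get a feeling for the similarity coefficient, one can compute the edit distance between two terms, that is, the number of single-character insertions, deletions and substitutions needed to turn one into the other. The sketch below is a rough model under stated assumptions: the formula (1 minus the distance divided by the length of the shorter term) and the names edit_distance, similarity and fuzzy_match are illustrative and need not agree exactly with what the search engine computes.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,      # delete char_a
                               current[j - 1] + 1,   # insert char_b
                               substitution))        # substitute or keep
        previous = current
    return previous[-1]

def similarity(term, word):
    """Assumed formula: 1 - distance / length of the shorter term, floored at 0."""
    term, word = term.lower(), word.lower()
    if term == word:
        return 1.0
    shorter = min(len(term), len(word))
    if shorter == 0:
        return 0.0
    return max(0.0, 1.0 - edit_distance(term, word) / shorter)

def fuzzy_match(term, word, threshold=0.5):
    """True if the word is at least as similar to the term as the threshold demands."""
    return similarity(term, word) >= threshold

print(edit_distance("holocaust", "holokaust"))
print(fuzzy_match("holocaust", "Holokaust"))
```

Holocaust and Holokaust differ in one letter out of nine, so under this formula their similarity is about 0.89: enough to match at the default coefficient 0.5, but not at 0.9.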

Proximity search

It is also possible to search for terms that occur within a given distance of each other (the so-called proximity search). If one remembers that a document contains two words a short distance apart, for example Gestapo and torture, one can enclose them in double quotes and append the tilde symbol with the maximum distance in words: "Gestapo torture"~6.
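
The idea behind proximity search can be sketched by comparing word positions. The following is a simplified reading, not the search engine's exact rule: it assumes that two words match when their positions in the text differ by at most the given number, in either order. The function names positions and near and the sample sentence are invented for illustration.

```python
import re

def positions(words, term):
    """Indices at which the term occurs in the list of words."""
    return [i for i, w in enumerate(words) if w == term]

def near(text, first, second, max_distance):
    """True if some occurrence of `first` lies within max_distance words of `second`."""
    words = re.findall(r"\w+", text.lower())
    return any(abs(i - j) <= max_distance
               for i in positions(words, first.lower())
               for j in positions(words, second.lower()))

sample = "The Gestapo used torture during interrogations"
print(near(sample, "Gestapo", "torture", 6))
print(near(sample, "Gestapo", "torture", 1))
```

In the sample sentence the two words are two positions apart, so they match with a distance of 6 but not with a distance of 1.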

Term relevance specification

The relevance of a term can be raised by appending the symbol ^ followed by a number greater than 1. For example, the query Lange^4 Sajnóg finds documents that contain both names, but matches for the term marked as more relevant (Lange) carry more weight, so documents in which it features prominently appear higher in the list of results. Relevance is set to 1 by default.
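
The effect of a relevance value on the ordering of results can be shown with a deliberately simple scoring rule. The actual ranking formula of the search engine is considerably more elaborate. Here it is merely assumed that each occurrence of a term adds its relevance value to the document's score. The function name score and the two sample texts are hypothetical.

```python
import re

def score(text, boosts):
    """Toy score: each occurrence of a term contributes that term's relevance value."""
    words = re.findall(r"\w+", text.lower())
    return sum(boost * words.count(term) for term, boost in boosts.items())

doc_a = "Lange wrote to Sajnóg and Sajnóg replied"
doc_b = "Lange and Lange met Sajnóg"

plain = {"lange": 1, "sajnóg": 1}      # Lange Sajnóg
boosted = {"lange": 4, "sajnóg": 1}    # Lange^4 Sajnóg

print(score(doc_a, plain), score(doc_b, plain))
print(score(doc_a, boosted), score(doc_b, boosted))
```

Without a relevance value both sample texts score the same. With Lange^4 the text that mentions Lange twice moves ahead.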

Query grouping

Terms and phrases used in complex queries can be grouped with parentheses. As in arithmetic expressions, grouping makes the meaning of a long query unambiguous.

The query "obóz zagłady w Treblince" AND (Holokaust OR Holocaust) finds documents that contain the phrase "obóz zagłady w Treblince" and at least one of the two words Holokaust and Holocaust.
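
Why the parentheses matter can be seen by evaluating the same three conditions with two different groupings. The sample text and variable names below are invented for illustration, and Python's own and/or stand in for the query operators.

```python
# Hypothetical document text, invented for this illustration.
text = "Holocaust survivors from Warsaw".lower()

has_phrase = "obóz zagłady w treblince" in text
has_holokaust = "holokaust" in text.split()
has_holocaust = "holocaust" in text.split()

# "obóz zagłady w Treblince" AND (Holokaust OR Holocaust)
grouped = has_phrase and (has_holokaust or has_holocaust)
# ("obóz zagłady w Treblince" AND Holokaust) OR Holocaust
regrouped = (has_phrase and has_holokaust) or has_holocaust

print(grouped, regrouped)
```

The sample text lacks the phrase, so the first grouping rejects it, while the second grouping accepts it because the word Holocaust alone is enough.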

Special characters

The characters used to build complex queries (+ - && || ! ( ) { } [ ] ^ " ~ * ? : \) are part of the query syntax and are not treated as part of a search term. To search for one of them literally, precede it with the so-called escape character \. For example, to find the text (2+2)*2 use the query \(2\+2\)\*2.
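
When queries are assembled by a program, escaping can be automated. The helper below is a minimal sketch: the name escape_query is hypothetical, and as a simplification it escapes every single & and | rather than only the pairs && and ||.

```python
# Characters from the list above; & and | are escaped individually (a simplification).
SPECIAL_CHARACTERS = set('+-&|!(){}[]^"~*?:\\')

def escape_query(text):
    """Put a backslash in front of every character that has a role in query syntax."""
    return "".join("\\" + ch if ch in SPECIAL_CHARACTERS else ch for ch in text)

print(escape_query("(2+2)*2"))
```

Applied to (2+2)*2, the helper produces the escaped query from the example above.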

Source

Full instructions on constructing queries: Apache Lucene Query Parser Syntax.

The text was originally published on the website of Kujawsko-Pomorska Digital Library.

Content is available under Creative Commons Attribution – Share Alike 2.5 Poland License.