Indexing Metadata with Indexsheets

Metadata is a great way to enrich your content. However, simply having metadata does not help you or your end users very much; you must put it to use. Index sheets enable you to take advantage of both internal and external metadata by indexing it according to the metadata tags or properties within your content. An index sheet takes a source document and indexes its contents according to the indexing rules you set up within the index sheet. Index sheets, unlike metadata, are specific to NXT.

What are Index Sheets?

Index sheets, in short, are XML documents that follow XSL (eXtensible Stylesheet Language) rules. More precisely, an index sheet is an XML document that follows a rule-based system based on XPath and a subset of XSLT. Index sheets are stylesheets for indexing. Because index sheets are based on an XPath/XSLT rule system, the syntax you use is likely to be both familiar and standard. Index sheet file names use the .xil extension, which stands for "eXtensible Indexing Language."

Index sheets are simple in structure but powerful in purpose. An index sheet deals with the inside of a document, whereas the makefile deals with a document as a whole: the makefile is not concerned with what is in the document, but an index sheet is.

Default Index Sheets

When you install NXT Online Server or NXT Builder, you receive premade, fully functional, out-of-the-box index sheets. You can modify and customize any or all of these index sheets to fit your indexing needs. Table 1 lists these index sheets, the content and document types they index, and a brief description of each.

Library Manager, the content bridges, ccBuild, and Manage Content all use these default index sheets to further index your content (beyond a simple full-text index) as you build your content collections.

Table 1. Default Index Sheets
Default Index Sheet	Content Type	Document Type	Description
HTML.XIL	text/html	HTML	Indexes most HTML elements and designates the "Title" to be the document title in the TOC.
HTML-title.XIL	text/html	HTML	Similar to HTML.xil, except that it creates a table of contents hierarchy from the header (H1, H2, and so on) elements.
XML.XIL	text/xml	XML	Indexes all elements by element name for any given XML document.
PDF.xil	application/pdf	Adobe Acrobat PDF	Transforms the PDF document to XML then indexes the (metadata) properties of PDF documents. This indexsheet is used by default for PDF indexing.
PDF-transform.xil	application/pdf	Adobe Acrobat PDF	Transforms the PDF document to XML then indexes the (metadata) properties of PDF documents. This PDF indexsheet was introduced in NXT 4.10 for custom indexing of PDF files.
MSExcel.XIL	application/msexcel	Microsoft Excel	Indexes the (metadata) properties of MSExcel documents.
MSWord.XIL	application/msword	Microsoft Word	Transforms the MSWord document to HTML then indexes the (metadata) properties of MSWord documents.
ODT.XIL	application/vnd.oasis.opendocument.text	OpenDocument Text (ODT)	Indexes the (metadata) properties of ODT documents.
MSPowerPoint.XIL	application/mspowerpoint	Microsoft PowerPoint	Indexes the (metadata) properties of MSPowerPoint documents.
Metadata.XIL		Metadata	Indexes all external metadata files leveraging Dublin Core elements as well as non-Dublin Core standard elements within those metadata documents.

Note: To index ODT documents, you must install the corresponding version of the OpenOffice IFilters, which is included in Apache OpenOffice 4.0 and higher, or similar software (for example, LibreOffice 4.3 and higher). You can download and install the required software manually from the official site.

Indexing PDF files

The NXT indexing engine uses the following indexsheets to index PDF files: PDF.xil, PDF-transform.xil.

To revert to the NXT version 4.9 indexing behavior, use the PDF-transform.xil file. You can also use PDF-transform.xil if you want to customize your indexing process, for example, to generate facets for PDF files or to use an external utility during indexing.

To take advantage of the faster PDF indexing rules introduced in NXT 4.10, use PDF.xil, which is used by default. With NXT 4.10 and higher, these rules let you index large PDF files faster.

Also, since NXT 4.10, a dedicated tool, PDFSupport, is installed to ensure that PDF files are processed correctly when several NXT products are installed. This tool is useful when you uninstall one NXT product but keep another: the registration for the PDF libraries is still processed correctly.

Creating Index Sheets

You can use or modify the default index sheets as you wish, or you can create an index sheet from scratch. Whichever path you choose, the information in this section can help you meet your indexing needs.

Index sheets are fixed. If you edit them, you must start your collection over (in Library Manager, see Start Collection Over menu item) and will need to redistribute your collection.

Index sheets are powerful in what they do and in what they enable your end users to do; however, their makeup is relatively simple. Two building blocks make up an index sheet: template blocks and fields.

Because index sheets are XML documents that generally follow XSLT/XPath rules and are specific to NXT, some markup within an index sheet is standard XML/XSL and some is proprietary (NXT specific). To differentiate the two, index sheets use "namespaces".

Namespaces are used in XML to avoid naming collisions with other documents. A namespace identifier becomes part of an element name and precedes the element name within the "<" and ">" characters, like this: <namespace:element_name>. The namespace identifier for XML/XSL standard items is "xsl" and that for the NXT items is "np."

You must declare the namespaces you will use in an XML document. That declaration happens at the top of the document (usually immediately following the XML declaration statement), and each namespace declaration is prefaced by xmlns: (this stands for XML namespace). Figure 1 shows the namespace declaration statement of an index sheet where two namespaces are being declared: xsl and np.
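As a rough illustration of how xmlns: declarations tie a prefix to a namespace, the following Python sketch reads the declarations from a small sample document and maps each element back to its prefixed name. The sample is not a real index sheet: the root element name and the URI bound to the np prefix are placeholders assumed for the example, and only the xsl URI is the standard XSLT namespace.

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Sample only: the root element and the "np" URI are assumed placeholders,
# not values taken from NXT. The "xsl" URI is the standard XSLT namespace.
SHEET = """<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:np="urn:example:np-placeholder">
  <xsl:template match="title">
    <np:index/>
  </xsl:template>
</xsl:stylesheet>
"""


def declared_namespaces(xml_text):
    """Return {prefix: uri} for every xmlns: declaration in the document."""
    found = {}
    for _event, (prefix, uri) in ET.iterparse(StringIO(xml_text), events=("start-ns",)):
        found[prefix] = uri
    return found


def prefixed_name(tag, namespaces):
    """Turn ElementTree's '{uri}local' form back into 'prefix:local'."""
    if not tag.startswith("{"):
        return tag  # element is not in any namespace
    uri, _, local = tag[1:].partition("}")
    for prefix, known_uri in namespaces.items():
        if known_uri == uri:
            return f"{prefix}:{local}"
    return tag


NS = declared_namespaces(SHEET)
TAGS = [prefixed_name(el.tag, NS) for el in ET.fromstring(SHEET).iter()]
print(NS)
print(TAGS)
```

The point of the sketch is only that the prefix is shorthand: the parser identifies an element by the URI its prefix was declared with, which is why both prefixes must be declared before they are used.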

Template Blocks

Template blocks are XPath/XSLT items. Template blocks define and delineate the indexing "action items" for an index sheet. These template blocks follow the XPath rules for matching on elements. Template blocks begin with the start tag <xsl:template ...> and end with </xsl:template>.

Template blocks look for, or "match" on, element names (they can also match on attributes of elements) and index the contents of the element (what is between the tags). Keep in mind that the tag or element names of your metadata provide the basis for the matching of your template blocks.

Fields

Fields are NXT items. Fields are indexing aliases for the matched-on elements. For example, you could match on an element <feline> and index it as "cat"; "cat" would be the field name for the "feline" element. You must index each element you instruct the index sheet to find (match on) as something. That something may be the same as the element name or different. You must define any field name that differs from the element name.

Fields can be defined in two different places: within the index sheet or within the makefile (see makefile.dtd). Regardless of where you define your fields, field definitions exist outside of the template blocks.

After fields are defined (if needed) you can use those fields within the template blocks to index your content. Figure 2 shows an example of a "two-action" (two template blocks) Index sheet with field definition and usage.

Remember that everything an index sheet indexes is indexed as a field. In Figure 2 you see two actions (template blocks) but only one definition. The first template block matches on the "title" element, then indexes the element's contents as a field by the same name as the element name (this is the function of the "field-element-name" attribute). The second template block matches on the element "dc:creator" and indexes it as "author," thus using the field defined toward the top of the index sheet. Indexsheet.dtd governs the structure of all index sheets.

Note: You may match on more than one element to apply a rule. When doing so, use the pipe character "|" to separate the element names, and do not add spaces between the element names and the pipe character in the template match. For example, 'H2|H3' is valid; 'H2 | H3' is not.
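The no-spaces rule is easy to check before a build. The following Python sketch is a hypothetical pre-flight helper, not part of NXT; it handles only plain lists of element names separated by pipes, not full XPath patterns.

```python
def split_match(pattern):
    """Split a template match such as 'H2|H3' into its element names.

    Raises ValueError if the pattern contains whitespace or an empty name,
    mirroring the rule that no spaces may surround the pipe character.
    """
    if any(ch.isspace() for ch in pattern):
        raise ValueError(f"whitespace is not allowed in match pattern: {pattern!r}")
    names = pattern.split("|")
    if any(name == "" for name in names):
        raise ValueError(f"empty element name in match pattern: {pattern!r}")
    return names


print(split_match("H2|H3"))
try:
    split_match("H2 | H3")
except ValueError as err:
    print("rejected:", err)
```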

Apply-Templates

In Figure 2, notice that in the middle of the template block there is an XML/XSL empty element: <xsl:apply-templates/>. The NXT XIL language requires the "apply-templates" element within each template block. The purpose of the "apply-templates" element is twofold.

First, the apply-templates element enables NXT to accomplish the indexing indicated by the np:index element. Apply-templates acts like an "on" switch or a "go" button, telling NXT to do what np:index indicates.

Second, from XSLT, apply-templates tells NXT to process the children (content between the tags) of the matched-on element. When the index sheet matches on an element, the np:index element in the template block indicates what to do with the element content. The xsl:apply-templates element starts the process by "grabbing" the children of the matched-on element (the data between the start and end tags of the matched-on element). The child content is indexed according to the rules indicated in the np:index element. The same data that was just indexed is then "processed" or "parsed" for elements that may be applicable to any template blocks in your index sheet. If there are elements within the child data that apply to any template blocks in your index sheet those appropriate template blocks will be applied to the respective matching elements; thus the element name "apply-templates".

Without the apply-templates element nested within the np:index element of the template block, indexing will not happen. This differs from XSLT. In XSLT, if the apply-templates element is not included, the action of the template block will still occur but the children of the matched-on element will not be processed.
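The two roles of apply-templates can be pictured with a small Python model. This is an illustrative sketch, not NXT's implementation: template blocks are reduced to a dictionary from element name to field name (an assumed simplification), the matched element's content is indexed, and the children are then processed so that nested elements get their own chance to match. Continuing into the children of unmatched elements is also an assumption of the sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def run_templates(element, templates, index):
    """Index a matched element, then process its children recursively."""
    field = templates.get(element.tag)
    if field is not None:
        # Index everything between the start and end tags, whitespace-normalized.
        text = " ".join("".join(element.itertext()).split())
        index[field].append(text)
    # The "apply-templates" step: children are checked against the same templates.
    for child in element:
        run_templates(child, templates, index)


def build_index(xml_text, templates):
    """Return {field name: [indexed text, ...]} for one document."""
    index = defaultdict(list)
    run_templates(ET.fromstring(xml_text), templates, index)
    return dict(index)


# Sample document and templates, invented for the example.
DOC = "<section><title>Cats</title><para>About <term>felines</term> today</para></section>"
TEMPLATES = {"para": "para", "term": "keyword"}

print(build_index(DOC, TEMPLATES))
```

In the sample, the text of the para element is indexed as a whole, and the nested term element is then indexed again under its own field, which is the behavior the second purpose describes.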

Other Index Sheet Features

Using index sheets to index your HTML and XML content enables you to standardize or normalize your end users' ability to search for content. For example, suppose you have some HTML and XML documents on your Content Network that use "author" tags to designate the person who created the document, whereas other documents, within the document metadata (internal or external), use the Dublin Core "dc:creator" tags for the same type of person. With index sheets and XIL you do not have to index these as separate entities. You can match on these elements separately but index them as the same field name. Figure 3 shows two examples of this.

One benefit of being able to index content in this fashion is that your end users can search for "author" and get results for both "author" and "dc:creator". Another benefit is that you can use one index sheet on more than one document. The relationship of index sheets to documents is one-to-many: you can use one index sheet to index many documents, but each document can be indexed by only one index sheet.
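The normalization described above can be sketched in Python as a lookup from element name to a shared field name. This models the idea only and is not XIL syntax; the sample documents and names are invented for the example, while the URI is the standard Dublin Core elements namespace.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace

# Two different element names are indexed under the same field name.
FIELD_FOR_TAG = {
    "author": "author",
    f"{{{DC}}}creator": "author",
}


def index_document(xml_text):
    """Return (field, value) pairs for every element that has a field mapping."""
    hits = []
    for element in ET.fromstring(xml_text).iter():
        field = FIELD_FOR_TAG.get(element.tag)
        if field and element.text:
            hits.append((field, element.text.strip()))
    return hits


# Sample documents, invented for the example.
DOC_A = "<doc><author>First Author</author></doc>"
DOC_B = f'<doc xmlns:dc="{DC}"><dc:creator>Second Author</dc:creator></doc>'

print(index_document(DOC_A) + index_document(DOC_B))
```

A search on the single field "author" would then find both documents, even though only one of them uses an element with that name.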

Implementing Index Sheets

The use of index sheets permeates every collection builder application in the NXT 4 product family. So, once you have created your indexsheet, depending on what method you use for creating content collections (Library Manager, ccBuild, or Manage Content), you will need to choose the appropriate implementation process. The implementation process for each of these applications is different.

Within Library Manager

Content collections that you build with Library Manager use index sheets to extend the full-text indexing that NXT does by default. NXT applies indexsheets to your library collection content according to the content type of your content, similar to Manage Content. Generally, Library Manager and the content bridges leverage the out-of-the-box, default index sheets to index the content going into the content collections. However, you can implement additional index sheets for Library Manager and the content bridges to use to index your library content. Implementing an indexsheet in Library Manager is on a per collection basis.

Index sheets are fixed. If you edit them, you must start your collection over (in Library Manager, see Start Collection Over menu item) and will need to redistribute your collection.

Library Manager allows you to implement indexsheets only for collection builder nodes in your library. Because collection reference nodes reference content collections that were built outside of the Library Manager interface, NXT assumes that the appropriate indexsheets were applied at the time the referenced collection was built. In addition, only one content bridge allows you to implement other indexsheets or manipulate which indexsheet indexes which document type: File System Content Bridge. All other content bridges inherently use the appropriate indexsheet based on their content type.

To completely implement an indexsheet with either of the allowable content bridges, you must complete the following two-step process:

You must perform the steps in this order; otherwise, when you try to choose the indexsheet from the Index Sheet property drop-down list, the indexsheet will not be in the list. Also, if you have a new document type that you want indexed with your new indexsheet, you need to add that document type, indicate the publishing rules, and select the appropriate indexsheet.

Within ccBuild (Makefile)

To implement indexsheets within a ccBuild-built content collection, use your content collection makefile. Implementing indexsheets into the ccBuild process is a two-step process: configure the indexsheet elements, then configure the indexsheet attribute on each document element.

Remember that you can apply one indexsheet to multiple documents but that each document can only reference one indexsheet. Using the MakeStart utility, you can initially construct your makefile with a single indexsheet for your content collection by entering the path and name of the indexsheet in the appropriate text box.

Configuring the Indexsheet Element

MakeStart takes the indexsheet information and creates and configures an indexsheet element for you. All indexsheet elements must be nested children of the content-collection element and come before the first document element in your makefile. The indexsheet element must contain two attributes: id and source.

The id attribute can be any value you choose, but it must be unique among all other IDs in your collection. The source attribute is the path and file name of the indexsheet you are adding. Each indexsheet element is an empty element (there is nothing between the start and end tags) and cannot have any elements nested within it.

The only instance where the ID must be a particular value is with indexing external metadata. Figure 4 also shows the method for associating external metadata to a document within a content collection. Assign an indexsheet element's id attribute value of "metadata" to designate an indexsheet for indexing metadata content in your collection. The indexsheet that you specify in your makefile as the "metadata" indexsheet will be used on all metadata files in your content collection (these are the files referenced by document-element-nested metadata elements).

To include other indexsheets within your content collection, you must add subsequent indexsheet elements directly below the indexsheet element created by MakeStart. Include all necessary attributes and values in each indexsheet element or ccBuild will encounter a fatal error when it tries to build your content collections. Figure 4 shows part of a makefile with multiple indexsheet elements.

Configuring the Indexsheet Attribute

Once you add the indexsheet elements to your makefile, you need to decide which indexsheets will index which content collection documents. You designate the indexsheet to index a given document with the indexsheet attribute within the document element for each indexable document. The value of the indexsheet attribute corresponds to the id attribute value of one of your indexsheet elements.

Once you have configured both the indexsheet elements and attributes for your collection content, you can execute ccBuild and build or update your content collection. Then you can go to your NXT site and run searches to see the effects of your indexsheets.
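Because ccBuild stops with a fatal error on an incomplete indexsheet element, it can help to check the makefile first. The Python sketch below is a hypothetical checker, not an NXT tool. It uses only the element and attribute names described in this section (content-collection, indexsheet with id and source, document with indexsheet); the source attribute on the sample document elements and the file names are assumptions, and a real makefile governed by makefile.dtd will contain more than this. The sketch looks only at direct children of the root and checks id uniqueness only among indexsheet elements.

```python
import xml.etree.ElementTree as ET

# Sample makefile fragment; document attributes and paths are assumed.
MAKEFILE = """<content-collection>
  <indexsheet id="html" source="IndexSheets/HTML.xil"/>
  <indexsheet id="metadata" source="IndexSheets/Metadata.xil"/>
  <document source="intro.html" indexsheet="html"/>
  <document source="guide.html" indexsheet="html"/>
</content-collection>
"""


def check_makefile(xml_text):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    ids = set()
    seen_document = False
    for child in ET.fromstring(xml_text):
        if child.tag == "indexsheet":
            if seen_document:
                problems.append("indexsheet element appears after a document element")
            if len(child) > 0:
                problems.append("indexsheet element must be empty")
            sheet_id, source = child.get("id"), child.get("source")
            if not sheet_id or not source:
                problems.append("indexsheet element needs both id and source")
            elif sheet_id in ids:
                problems.append(f"duplicate indexsheet id {sheet_id!r}")
            else:
                ids.add(sheet_id)
        elif child.tag == "document":
            seen_document = True
            ref = child.get("indexsheet")
            if ref is not None and ref not in ids:
                problems.append(f"document references unknown indexsheet id {ref!r}")
    return problems


print(check_makefile(MAKEFILE))
```

The sample also shows the one-to-many relationship: two document elements reference the same indexsheet id, while each document carries a single indexsheet attribute.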

Within Manage Content

Manage Content uses a subset of the default indexsheets (see Table 2); for example, it does not use "HTML-title.xil." Manage Content chooses one of these indexsheets to apply to a given document by the document's content type, or MIME type. When you add a document or file to a content collection with Manage Content, NXT automatically indexes that document on the fly. To do this, NXT must know which indexsheet is appropriate for that document.

NXT knows a document's content type by its extension (this is how your operating system determines which application can open a file). NXT checks the extension against your operating system's registry to find the MIME-type that corresponds with a particular extension. Once NXT identifies a document's type, it knows, by the pairings in Table 2 (these are listed in the Index Sheets field of the Add a Collection dialog) which indexsheet to apply to the document. NXT then applies the indexsheet and indexes the document, and places a copy of the document and index information in your content collection.

Table 2. Default Manage Content Indexsheets
text/html=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/HTML.xil;
text/xml=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/XML.xil;
metadata=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/Metadata.xil;
application/msword=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/MSWord.xil;
application/pdf=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/PDF.xil;
application/pdf=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/PDF-transform.xil;
application/vnd.ms-powerpoint=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/MSPowerPoint.xil;
application/x-mspowerpoint=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/MSPowerPoint.xil;
application/vnd.ms-excel=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/MSExcel.xil;
application/x-msexcel=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/MSExcel.xil
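The pairings in Table 2 form a semicolon-separated list of MIME type and path pairs, so the lookup can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the sample reuses three entries from Table 2, and keeping the first entry when a MIME type appears twice (as application/pdf does) is an assumption, since the source does not say how NXT resolves duplicates. Python's mimetypes module stands in for the operating system registry lookup.

```python
import mimetypes

# Three entries copied from Table 2, joined the way the field lists them.
TABLE_2_SAMPLE = (
    r"text/html=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/HTML.xil;"
    r"application/pdf=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/PDF.xil;"
    r"application/pdf=C:\Program Files\Rocket\NXT 4\Online Server/IndexSheets/PDF-transform.xil"
)


def parse_pairs(text):
    """Parse 'mime=path;mime=path' into an ordered list of (mime, path)."""
    pairs = []
    for entry in text.split(";"):
        entry = entry.strip()
        if entry:
            mime, _, path = entry.partition("=")
            pairs.append((mime.strip(), path.strip()))
    return pairs


def sheet_for(mime, pairs):
    """Return the first indexsheet path listed for a MIME type (assumed rule)."""
    for known_mime, path in pairs:
        if known_mime == mime:
            return path
    return None


def sheet_for_file(filename, pairs):
    """Guess the MIME type from the file extension, then look up the sheet."""
    mime, _encoding = mimetypes.guess_type(filename)
    return sheet_for(mime, pairs) if mime else None


PAIRS = parse_pairs(TABLE_2_SAMPLE)
print(sheet_for("application/pdf", PAIRS))
print(sheet_for_file("report.html", PAIRS))
```

Keeping the pairs as an ordered list rather than a dictionary preserves both application/pdf entries, so the resolution rule stays visible and easy to change.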

During the build process, NXT obtains the MIME type from the server registry based on the extension of the document, and then applies the appropriate Indexsheet based on that MIME type. NXT does this for each document in each of your Content Services and Manage Content.

You can designate a different indexsheet for indexing a certain type of content by modifying the value (path and name) of the indexsheet paired with that MIME type. You must do this when you create your collection with Manage Content. Once the content collection is built, you cannot modify the indexsheets that NXT applies to the content you add to the collection. If you do change an indexsheet in this way, that change applies only to that specific content collection.