Enriching Your Content with Metadata

Your end users have two main ways of finding content on your site: navigating the table of contents and running searches. Two options may seem too few for finding content efficiently, but you can enhance both of them to help your end users find content faster.

What is Metadata?

Metadata is descriptive data: data that describes data. It is similar to a dictionary, which uses words to describe words. Metadata is a useful tool for describing and finding content.

Metadata can be nearly any data that describes another set of data or provides added information about it. Metadata can be a set of document properties that describe a document, such as "title", "author", "subject", "abstract", "date", "ID", "name", and "size", to name a few. Metadata can also be a completely separate paragraph describing a document, like a document abstract or summary. Furthermore, metadata can be a set of specialty tags or elements surrounding specific data within an XML or HTML document.

You can create your own metadata structure to fit the specific needs of your content. However, there is a cross-industry standard set of accepted metadata elements and attributes for the electronic environment. This standard set of metadata elements is known as the Dublin Core (see http://dublincore.org/documents/1999/07/02/dces/ for a complete list of the elements).

Although the Dublin Core has established a standard set of metadata elements that you can use, you are not restricted to those fifteen elements. You can use the Dublin Core metadata elements, create your own, or combine the two. The first step to expanding your end users' ability to search for and find content is to enrich your content by adding metadata to it. Metadata can exist in various places. There are two types of metadata: internal and external.
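As a rough illustration of mixing standard and custom elements, the following Python sketch separates a metadata record into Dublin Core fields and custom fields. The element names are the fifteen from the Dublin Core Metadata Element Set; the sample record and the function name split_record are made up for this example and are not part of NXT.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def split_record(record):
    """Split a metadata record into (Dublin Core fields, custom fields)."""
    standard = {k: v for k, v in record.items() if k.lower() in DUBLIN_CORE}
    custom = {k: v for k, v in record.items() if k.lower() not in DUBLIN_CORE}
    return standard, custom


# Hypothetical record: "department" and "reviewed" are custom elements.
record = {
    "title": "Installation Guide",
    "creator": "J. Smith",
    "date": "2004-06-01",
    "department": "Support",
    "reviewed": "yes",
}
standard, custom = split_record(record)
print(sorted(standard))
print(sorted(custom))
```

A check like this can help you see at a glance which of your fields any Dublin Core-aware tool would recognize and which are specific to your own content.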

Internal Metadata

In mark-up language documents (XML and HTML), you can include metadata within the document it describes. This is "internal metadata." Within XML and HTML documents, internal metadata is contained within descriptive elements (start and end tags). Placing metadata within meaningful, descriptive elements enhances your ability to index that metadata content appropriately. Figure 1 shows an example of an HTML document with internal metadata elements.

Note: Properties on MS Word, ODT, PDF, and other non-mark-up documents are also considered internal metadata. However, with these types of documents the set of properties is usually fixed: you cannot add properties, only modify the values of the existing ones. Therefore, this discussion of adding metadata focuses on HTML and XML documents.

Figure 1. Internal Metadata in an HTML Document

Not all element tags in an HTML document are metadata elements. Most, in fact, exist to control how the document is displayed. The "title" tag is one that is metadata for that document. Metadata tags are tags or element names that describe or further define the content between the tags. Your browser processes HTML pages. It disregards any tags that it does not understand and simply displays the content between them. Because the browser ignores metadata tagging in this way, you can mark up your content as much as you want without affecting the document's display.
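To make the idea concrete, here is a minimal Python sketch that pulls metadata out of a small HTML page using the standard library's html.parser module. The sample page, the custom author tag, and the MetadataCollector class are all invented for this illustration; this is not how NXT indexes documents, only a way to see which parts of a page act as metadata.

```python
from html.parser import HTMLParser

# Hypothetical HTML page. <title> and <meta> are standard HTML; the
# <author> element in the body is a made-up custom tag that a browser
# would ignore while still displaying its text.
SAMPLE_HTML = """<html>
<head>
<title>Installation Guide</title>
<meta name="subject" content="Installing the product">
</head>
<body>
<h1>Installation Guide</h1>
<p>Written by <author>J. Smith</author>.</p>
</body>
</html>"""


class MetadataCollector(HTMLParser):
    """Collect text from selected tags and name/content pairs from <meta>."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.metadata = {}
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.metadata[attrs["name"]] = attrs.get("content", "")
        elif tag in self.wanted:
            self._current = tag
            self.metadata[tag] = ""

    def handle_data(self, data):
        if self._current:
            self.metadata[self._current] += data

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None


collector = MetadataCollector(["title", "author"])
collector.feed(SAMPLE_HTML)
print(collector.metadata)
```

The display-only tags (h1, p) are skipped, while the title, the meta entry, and the custom author tag each yield a named metadata value.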

XML documents differ from HTML documents in that all XML document tags are metadata tags: XML documents are pure content, with no display information. You use XSL style sheets to display XML content. Figure 2 shows an example of internal metadata in an XML document.

Figure 2. Internal Metadata in an XML Document

Notice that the tags in this XML document clearly describe the content between them. Having these types of descriptive tags within your HTML or XML content enables you to index these documents beyond a full-text index by using indexsheets.
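The following Python sketch suggests why descriptive tags are useful for indexing. It parses a small XML document with the standard library's xml.etree.ElementTree module and groups the text by tag name, roughly the way a field-based index separates one kind of content from another. The document and all of its element names are invented for this example, and the code does not reflect how indexsheets are actually written or applied.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; all element names are invented for illustration.
SAMPLE_XML = """<recipe>
  <name>Pancakes</name>
  <author>J. Smith</author>
  <ingredient>Flour</ingredient>
  <ingredient>Milk</ingredient>
  <instructions>Mix the ingredients and fry the batter.</instructions>
</recipe>"""

root = ET.fromstring(SAMPLE_XML)

# Group the text of each child element under its tag name, so that
# "author" content is kept apart from "ingredient" content.
fields = {}
for child in root:
    fields.setdefault(child.tag, []).append(child.text)

print(fields["author"])
print(fields["ingredient"])
```

With the content separated by field, a search for "Flour" could be limited to ingredients rather than matching anywhere in the full text.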

External Metadata and RDFs

Metadata can also exist, for all types of documents and files (including XML and HTML), in a file or document separate from the document it describes. This type of metadata is "external metadata." External metadata is contained in a Resource Description Framework (RDF) file. An RDF file is an XML document that is specifically structured to house metadata. RDF files also enable you to associate metadata with non-text files such as images and audio/video files. Figure 3 shows an example of an RDF file containing metadata for the graphic.jpg image.

Figure 3. External Metadata in an RDF File

Notice that the RDF file in Figure 3 contains both Dublin Core elements and custom (non-Dublin Core) elements. You may implement alternative forms of external metadata, as long as these files follow XML specifications. For more information regarding RDF, see either of the following web sites:

http://www.ukoln.ac.uk/metadata/resources/dc/datamodel/WD-dc-rdf/
http://www.w3.org/TR/REC-rdf-syntax
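
As a hedged illustration of what an external metadata file can look like, the Python sketch below builds a small RDF/XML document for graphic.jpg that mixes Dublin Core elements with one custom element. The RDF and Dublin Core namespace URIs are the standard ones; the custom namespace, the campaign element, the field values, and the build_rdf function are assumptions made for this example. The exact RDF structure that NXT expects is not shown here, so treat the output as a generic RDF file rather than a template.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
CUSTOM_NS = "http://example.com/custom#"  # hypothetical custom namespace

# Use readable prefixes in the serialized output.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)
ET.register_namespace("ex", CUSTOM_NS)


def build_rdf(about, dc_fields, custom_fields):
    """Build an rdf:RDF element describing the resource named by `about`."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": about}
    )
    for name, value in dc_fields.items():
        ET.SubElement(desc, f"{{{DC_NS}}}{name}").text = value
    for name, value in custom_fields.items():
        ET.SubElement(desc, f"{{{CUSTOM_NS}}}{name}").text = value
    return root


rdf = build_rdf(
    "graphic.jpg",
    {"title": "Product Logo", "creator": "J. Smith"},  # Dublin Core elements
    {"campaign": "Spring Launch"},                     # made-up custom element
)
xml_text = ET.tostring(rdf, encoding="unicode")
print(xml_text)
```

Keeping the custom elements in their own namespace makes it easy to tell them apart from the Dublin Core elements when the file is indexed or reviewed.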

Associating External Metadata

Once you create your external metadata files, you must make a "connection" between each metadata file and the file it describes. NXT has three different ways to associate external metadata files with their parent documents:

  1. Filename association
  2. Makefile association
  3. Manage Content properties

The last two methods for associating external metadata files with their parent files are specific to NXT; filename association is not.

Filename Association

Filename association means that the external metadata file has the same name as the parent document, extension included, with a metadata extension (usually .rdf) appended. So, the name of an RDF file for the graphic.jpg image in Figure 3 would be graphic.jpg.rdf. This is the method of association you would use with the File System Content Bridge.
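The naming rule can be sketched in a few lines of Python. The function name metadata_path and the second sample path are made up for this example; the rule itself (append the metadata extension to the parent's full name) comes from the description above.

```python
from pathlib import Path


def metadata_path(parent, extension=".rdf"):
    """Return the metadata file path for a parent document.

    The metadata extension is appended to the full parent name, so the
    parent's own extension is kept.
    """
    parent = Path(parent)
    return parent.with_name(parent.name + extension)


print(metadata_path("graphic.jpg"))
print(metadata_path("docs/manual.pdf", ".meta"))
```

Note that the extension is appended rather than substituted: the metadata file for graphic.jpg is graphic.jpg.rdf, not graphic.rdf.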

Handling Metadata with the File System Content Bridge

When you use the File System Content Bridge to build a content collection in Library Manager and you have external metadata files that describe the content, you must "tell" Library Manager to handle and index those external metadata files. Library Manager handles content for the File System Content Bridge with content rules according to document type. You can add a property value to the rules for a given document type to handle external metadata. Use the following process to have Library Manager index and handle external metadata files for the File System Content Bridge:

  1. Place each external metadata file in the same directory (folder) as its parent document
  2. Select the content collection builder node that includes your parent content
  3. Open the Edit Content Rules dialog from the Content Rules property of your collection builder node
  4. Select a document type that you have external metadata files for (from the Members list of the Content Rules dialog)
  5. Open the Edit Properties dialog from the Property Values property field
  6. Add the metadata property to the Members list
  7. If your external metadata files use an extension other than ".rdf", modify the Property Source Value property accordingly (.rdf is the default extension for external metadata files, and the default Property Source value is Constant)
  8. Close the Edit Properties dialog
  9. Repeat steps 4 - 8 for each document type you have metadata for
  10. Confirm that the Publish and Publish Content rules for the "Default" document type (top of the Members list) are set to False; otherwise, perform these steps:
    1. Add a new document type with the extension of your metadata files (.rdf or other extension)
    2. Set the Publish and Publish Content rules to False (this keeps your external metadata files from being published)
  11. Close the Edit Content Rules dialog, save the changes to your library, and rebuild that content collection

Most of this process is for adding external metadata to an existing content collection in your library. If you know that you already have, or will have, external metadata, you can instead build a new content collection builder node using the Metadata in .RDF Files File System Content Bridge template. Figure 4 shows this option.

Figure 4. File System Content Bridge Metadata Template Option

This template pre-defines each File System Content Bridge default document type with the metadata property set for .rdf extension files, which eliminates steps 2 - 9 (you still perform the usual steps for building a File System Content Bridge collection builder). If your external metadata files use a different extension, modify the Property Source Value accordingly. You should always do step 10 to make sure you do not publish your metadata files.

When you add the metadata property to your content rules, you "tell" NXT to take any file with the extension that you indicate for a given document type, associate it with its namesake parent document, and index it with the Metadata.xil indexsheet. Metadata.xil is a premade, out-of-the-box indexsheet for indexing metadata files. Therefore, all you need to do is set the metadata property for NXT to perform the indexing.

If you set this property for a document type, the build system will try to find a metadata file for all documents of the given type. If the build system finds a file of that document type that does not have an associated metadata file, it will log a warning for each of those documents.
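If you want to catch missing metadata files before a build, a small pre-flight check along the following lines may help. This Python sketch is a standalone script, not part of NXT; the function name find_missing_metadata, the sample file names, and the wording of the printed warning are all assumptions. It relies only on the filename association rule (parent name plus metadata extension, in the same directory).

```python
import tempfile
from pathlib import Path


def find_missing_metadata(directory, doc_extension, meta_extension=".rdf"):
    """Return names of documents of one type that have no metadata file
    with the same name plus meta_extension in the same directory."""
    missing = []
    for doc in sorted(Path(directory).glob("*" + doc_extension)):
        if not doc.with_name(doc.name + meta_extension).exists():
            missing.append(doc.name)
    return missing


# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "a.jpg.rdf", "b.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    missing = find_missing_metadata(tmp, ".jpg")
    for name in missing:
        print(f"warning: no metadata file found for {name}")
```

Running a check like this per document type mirrors what the build system reports, so you can add the missing files (or accept the warnings) before you rebuild the collection.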

Setting this property tells the build system to index each metadata file with the Metadata.xil indexsheet (or the indexsheet that has the ID value of Metadata).

Makefile Association

Another way to associate an external metadata file with a document is in your content collection makefile, if you built, or plan to build, the content collection with ccBuild. This association is accomplished in a two-step process:

  1. Nest a metadata element within the parent document element
  2. Add an indexsheet element specifically for indexing metadata files in your content collection

The makefile.dtd allows only 0 or 1 metadata elements to be nested within any given document element. Figure 5 shows a makefile indexsheet element designating an indexsheet for metadata (id="metadata") and a document element with a nested metadata file element.

Figure 5. Metadata Association in the Makefile

This type of association is as close as you get to a "physical" association in the electronic world. Unlike with the File System Content Bridge, neither the name nor the extension of the metadata file designates it as a metadata file or determines its parent document. The nesting determines the metadata file's parent, and the metadata element name designates the file as a metadata file. The file that you reference in the location attribute of the metadata element can have any name and any extension.

Remember to include a metadata indexsheet element so that NXT indexes your metadata files; otherwise they go unindexed. You can have only one "metadata" indexsheet per content collection, which is why the metadata element does not have an attribute to identify an indexsheet to use for indexing. Make sure, therefore, that your metadata indexsheet is sufficient to cover all the metadata files in your content collection.
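The two constraints described in this section (at most one metadata element per document element, and one metadata indexsheet per content collection) can be checked with a short script. The Python sketch below is only an illustration under stated assumptions: the document, metadata, and indexsheet element names, the location attribute on the metadata element, and id="metadata" come from the text above, while the makefile root element, the other location attributes, the file names, and the check_metadata function are hypothetical. Consult makefile.dtd for the real structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical makefile fragment. Only the document, metadata, and
# indexsheet element names, the metadata location attribute, and
# id="metadata" come from the text; everything else is assumed.
SAMPLE_MAKEFILE = """<makefile>
  <indexsheet id="metadata" location="Metadata.xil"/>
  <document location="graphic.jpg">
    <metadata location="graphic-info.xml"/>
  </document>
  <document location="readme.htm"/>
</makefile>"""


def check_metadata(makefile_xml):
    """Return a list of problems found in the metadata associations."""
    root = ET.fromstring(makefile_xml)
    problems = []
    documents = list(root.iter("document"))
    # Each document element may nest at most one metadata element.
    for doc in documents:
        if len(doc.findall("metadata")) > 1:
            problems.append(
                f"{doc.get('location')}: more than one metadata element"
            )
    # If any document uses metadata, exactly one metadata indexsheet is needed.
    uses_metadata = any(doc.find("metadata") is not None for doc in documents)
    sheets = [s for s in root.iter("indexsheet") if s.get("id") == "metadata"]
    if uses_metadata and len(sheets) != 1:
        problems.append('expected exactly one indexsheet with id="metadata"')
    return problems


problems = check_metadata(SAMPLE_MAKEFILE)
print(problems)
```

An empty list means the fragment satisfies both constraints. Note that the metadata file in the sample is named graphic-info.xml to show that, with makefile association, the name and extension are free choices.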

Manage Content Properties

With Manage Content, you can add metadata on the fly to documents in a content collection. Figure 6 shows the Properties interface for adding metadata to a content collection document. When you use Manage Content to assign metadata to a file, NXT places this metadata in a name-associated RDF file.

Figure 6. Adding Metadata with Manage Content

The metadata fields you see in Figure 6 correspond (top to bottom) to the Dublin Core elements of title, subject, creator, and description, respectively.

Document Properties vs. Metadata

You may be wondering what the difference is between metadata and the properties of a document (like a Microsoft Word document). The short answer is that there is no difference. The longer answer is that when NXT indexes certain files, such as Microsoft Office and PDF files, it converts the documents to HTML. This conversion creates documents with internal metadata, part of which contains the document properties. NXT then applies an indexsheet to the converted documents to index that internal metadata. Once NXT has finished indexing those documents, it deletes the HTML versions. Thus, in the end, NXT handles document properties the same way it handles other forms of metadata.