Data Model Structure
This chapter describes the hierarchical structure of the AIP data model that is based on the PREMIS reference model. It covers four levels of objects that are the basis for the Rosetta data model. Further information related to the PREMIS reference model can be found at: http://www.loc.gov/standards/premis/
The following two diagrams illustrate the entities of the AIP data model:
The entities in the data model are defined as follows.
Intellectual Entity
An intellectual entity (IE) is a distinct intellectual or artistic creation that is considered relevant to a designated community in the context of digital preservation. An intellectual entity can include other intellectual entities or, alternatively, it can have one or more representations with files. There are two types of intellectual entities – structural and content:
- Structural IEs – Structural IEs represent a complex object (such as a dataset with various content items or a whole journal with multiple issues to preserve). They hold the metadata and structure of the complex object (for example, the dataset metadata with the structure of its content items, or the metadata and structure of a journal and its issues) and relate either to other structural IEs, creating a nested hierarchy of objects for preservation, or to content IEs. The actual digital content (for example, the items under a dataset with their own metadata, or the articles and issues of a journal that are preserved) is stored and represented by content IEs. Therefore, structural IEs can contain relationships to other IEs, either structural or content, but not representations with files. The relationships are kept in a tree structure in which each parent IE keeps the relationships to its immediate child IEs. A child IE does not hold a relationship to its parent IE, and circular relationships are not allowed.
- Content IEs – Content IEs hold the actual digital content (for example, the items under a dataset with their own metadata, the articles and issues of a journal that are preserved, books, maps, photographs, and databases). Content IEs can stand alone or be related to a structural IE. They hold the IE’s metadata and representations with files but cannot hold relationships to other IEs.
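The containment rules for the two IE types can be sketched in a few lines of Python. This is an illustrative model only, not the Rosetta implementation: the class names (StructuralIE, ContentIE, Representation) and the methods add_child and contains are hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Representation:
    """Hypothetical stand-in for a representation: a labeled set of files."""
    label: str
    files: List[str] = field(default_factory=list)


@dataclass
class ContentIE:
    """Holds metadata and representations; has no links to other IEs."""
    title: str
    representations: List[Representation] = field(default_factory=list)


@dataclass
class StructuralIE:
    """Holds metadata and links to immediate child IEs; has no representations."""
    title: str
    children: list = field(default_factory=list)  # StructuralIE or ContentIE

    def add_child(self, child):
        # Reject a link that would make this IE its own descendant (a cycle).
        if child is self or (isinstance(child, StructuralIE) and child.contains(self)):
            raise ValueError("circular relationships are not allowed")
        self.children.append(child)

    def contains(self, target) -> bool:
        """Return True if target appears anywhere below this IE."""
        for child in self.children:
            if child is target:
                return True
            if isinstance(child, StructuralIE) and child.contains(target):
                return True
        return False


# Example hierarchy: journal -> issue -> article (titles are made up).
journal = StructuralIE("Journal")
issue = StructuralIE("Issue 1")
article = ContentIE("Article A", [Representation("master", ["article.pdf"])])
journal.add_child(issue)
issue.add_child(article)
```

In this sketch only the parent stores the link, mirroring the rule that a child IE does not hold a relationship to its parent. Attempting issue.add_child(journal) raises ValueError because it would create a circular relationship.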
Representations
A representation is the set of files, including structural metadata, needed for a complete and reasonable rendition of an intellectual entity. There can be more than one representation for the same intellectual entity. For example, a journal article may be complete in one PDF file and this single file will then constitute the representation. However, another journal article may consist of one SGML file and two image files. In this case, these three files will constitute the representation. A third article may be represented by one TIFF image for each of 12 pages plus an XML file of structural metadata showing the order of the pages. In this case, 13 files will constitute the representation. (PREMIS data dictionary, p. 14)
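The three journal-article examples above can be written out as plain file lists. The file names below are hypothetical; only the number of files in each representation comes from the text.

```python
# File names are hypothetical; only the file counts come from the examples above.
representations = {
    "single PDF": ["article1.pdf"],
    "SGML with images": ["article2.sgml", "figure1.tif", "figure2.tif"],
    # Twelve page images plus one XML file of structural metadata.
    "page images": [f"page{n:02d}.tif" for n in range(1, 13)] + ["structure.xml"],
}

# Number of files that make up each representation.
file_counts = {label: len(files) for label, files in representations.items()}
```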
Files
A file is a named and ordered sequence of bytes that is known by an operating system. A file can be zero or more bytes and has a file format, access permissions, and file system characteristics such as size and last modification date.
Bitstreams
A bitstream is contiguous or non-contiguous data within a file that has meaningful common properties for preservation purposes. A bitstream cannot be transformed into a standalone file without the addition of file structure (headers, and so forth) and/or reformatting to comply with a particular file format.
A bitstream is defined in the PREMIS data model as a set of bits embedded within a file. This differs from common usage, where a bitstream could, in theory, span more than one file.
A good example of a file with embedded bitstreams is a TIFF file containing two images.
According to the TIFF file format specification, a TIFF file must contain a header that includes information about the file. It may then contain one or more images. In the data model, each of these images is a bitstream and can have properties such as identifiers, location, inhibitors, and detailed technical metadata (for example, color space).
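To make the idea of images embedded in one TIFF file concrete, the following sketch walks the chain of image file directories (IFDs) in classic TIFF data and counts them. It is an illustration under stated assumptions, not part of Rosetta: the function name count_tiff_images and the helper _ifd are hypothetical, and the sample bytes form a skeletal two-directory structure rather than a complete, viewable image.

```python
import struct


def count_tiff_images(data: bytes) -> int:
    """Count the image file directories (IFDs) in classic TIFF data."""
    # Bytes 0-1 give the byte order: "II" little-endian, "MM" big-endian.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    # Bytes 2-3 hold the number 42; bytes 4-7 hold the offset of the first IFD.
    magic, offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF header")
    count = 0
    while offset != 0:
        # Each IFD: 2-byte entry count, 12 bytes per entry, 4-byte next offset.
        (entries,) = struct.unpack(order + "H", data[offset:offset + 2])
        next_at = offset + 2 + 12 * entries
        (offset,) = struct.unpack(order + "I", data[next_at:next_at + 4])
        count += 1
    return count


def _ifd(next_offset: int) -> bytes:
    """Build a skeletal little-endian IFD with one entry (illustration only)."""
    entry = struct.pack("<HHII", 256, 4, 1, 1)  # ImageWidth, LONG, count 1, value 1
    return struct.pack("<H", 1) + entry + struct.pack("<I", next_offset)


# Header (8 bytes), first IFD at offset 8, second IFD at offset 26.
sample = b"II" + struct.pack("<HI", 42, 8) + _ifd(26) + _ifd(0)
image_count = count_tiff_images(sample)
```

In terms of the data model, each directory found this way corresponds to one embedded image, that is, one bitstream inside the single TIFF file.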
Some bitstreams have the same properties as files and some do not. The image embedded within the TIFF file clearly has properties that are different from the file itself. However, three TIFF files can also be aggregated within a larger TAR file. In this case, the three TIFF files are filestreams, but they have all the properties of TIFF files. For details, see: http://www.loc.gov/standards/premis/v2/premis-2-1.pdf
Rosetta bitstream functionality is limited to filestreams only. Real bitstreams (embedded objects within a file) are not functionally supported. However, the data model itself accommodates both types of bitstreams.
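The filestream case (files aggregated within a TAR file) can be demonstrated with the Python standard library. The sketch below is not Rosetta code; the member names are hypothetical and the payloads are placeholder bytes standing in for three TIFF files, not real images.

```python
import io
import tarfile

# Placeholder payloads standing in for three TIFF files (not real images).
payloads = {f"image{n}.tif": b"II*\x00" + bytes([n]) * 8 for n in (1, 2, 3)}

# Aggregate the three files into one TAR container held in memory.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name, data in payloads.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Read each member back: the bytes return unchanged, with no file
# structure added or reformatting needed. This is what makes it a filestream.
buffer.seek(0)
extracted = {}
with tarfile.open(fileobj=buffer, mode="r") as tar:
    for member in tar.getmembers():
        extracted[member.name] = tar.extractfile(member).read()
```

Each member is recovered byte for byte as a standalone file, in contrast to an image embedded in a TIFF file, which would need a header and other file structure added before it could stand alone.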