WARC
The WARC template is based on the standard metadata elements that are parsed into the XML file created by the WARC file splitter. Source paths usually take the form metadata/<tag>.
Control
| Control field | Source/Content | Additional Normalization Rules |
|---|---|---|
| Source ID | From configuration file. | Required field. |
| Record ID | Source ID + Source Record-ID | Required field. |
| Source system | From configuration file. | |
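
The Record ID rule above can be sketched as follows. This is a minimal illustration, not the product's implementation: the function name `build_control`, the output keys, the sample values, and the use of plain concatenation with no separator are all assumptions, since the table only states "Source ID + Source Record-ID".

```python
def build_control(source_id: str, source_record_id: str, source_system: str) -> dict:
    """Build the control fields for one record (hypothetical helper).

    Source ID and Record ID are required fields, so empty inputs are rejected.
    """
    if not source_id:
        raise ValueError("Source ID is a required field")
    if not source_record_id:
        raise ValueError("Source Record-ID is needed to build the Record ID")
    return {
        "sourceid": source_id,
        # Assumption: the two parts are joined directly, with no separator.
        "recordid": source_id + source_record_id,
        "sourcesystem": source_system,
    }


# Sample values are made up for illustration.
control = build_control("WARC", "000123", "Other")
print(control["recordid"])
```

Because the Source ID is a prefix of the Record ID, records harvested from different sources stay distinct even when their source record IDs collide.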
Display
| Display Field | Source/Content | Additional Normalization Rules |
|---|---|---|
| Type | Constant – “website” | |
| title | metadata/title | |
| | OR, if not present, the URI is taken: warc-target-uri | |
| creator | metadata/author | |
| contributor | metadata/producer | |
| creation date | metadata/created | |
| format | metadata/content-type and metadata/resource-type | The two fields are merged |
| subject | metadata/keywords | |
| description | metadata/description | |
| language | metadata/language | |
| rights | metadata/rights | |
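
Two rules in the Display table carry logic: the title falls back to the target URI, and format is built from two source fields. The sketch below illustrates both under stated assumptions. The XML layout (a `record` root with `metadata` and `warc_record` children), the sample values, the helper names, and the "; " separator used for the merge are hypothetical; the source only gives the paths and says the two fields are merged.

```python
import xml.etree.ElementTree as ET

# Hypothetical splitter output; the real XML layout may differ.
SAMPLE = """<record>
  <warc_record>
    <warc-target-uri>http://example.org/page</warc-target-uri>
  </warc_record>
  <metadata>
    <author>Jane Doe</author>
    <content-type>text/html</content-type>
    <resource-type>response</resource-type>
  </metadata>
</record>"""


def text_at(root, path):
    """Return the stripped text at path, or None if missing or empty."""
    node = root.find(path)
    if node is None or node.text is None:
        return None
    return node.text.strip() or None


def display_fields(root):
    # Title: metadata/title, or the target URI when no title is present.
    title = text_at(root, "metadata/title") or text_at(
        root, "warc_record/warc-target-uri"
    )
    # Format: merge content-type and resource-type (separator is an assumption).
    parts = [
        value
        for value in (
            text_at(root, "metadata/content-type"),
            text_at(root, "metadata/resource-type"),
        )
        if value
    ]
    return {
        "type": "website",  # constant
        "title": title,
        "creator": text_at(root, "metadata/author"),
        "format": "; ".join(parts) if parts else None,
    }


fields = display_fields(ET.fromstring(SAMPLE))
print(fields["title"], "|", fields["format"])
```

The sample record has no metadata/title element, so the title comes from the target URI.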
Links
| Link | Source/Content | Additional Normalization Rules |
|---|---|---|
| link to resource | warc_record/warc-target-uri | |
Search
| Search Field | Source/Content | Additional Normalization Rules |
|---|---|---|
| creatorcontrib | metadata/author | |
| | metadata/producer | |
| title | metadata/title | |
| | OR, if not present, the URI is taken: warc-target-uri | |
| description | metadata/description | |
| subject | metadata/keywords | |
| fulltext | content | This tag includes the content of the harvested web page. |
| recordid | from PNX control/recordid | |
| resource type | from PNX display/type | |
| creation date | metadata/created | |
| format | metadata/content-type and metadata/resource-type | The two fields are merged |
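
Some search fields draw on more than one source: creatorcontrib collects both metadata/author and metadata/producer, while recordid and resource type are taken from PNX sections that were already built. A minimal sketch of that dependency, assuming the parsed metadata and the PNX control and display sections are available as plain dictionaries (the function name, dictionary keys, and sample values are hypothetical):

```python
def search_fields(metadata: dict, control: dict, display: dict) -> dict:
    """Derive a few search fields from parsed metadata and existing PNX sections."""
    # creatorcontrib indexes the author and the producer, skipping missing values.
    creatorcontrib = [
        metadata[key] for key in ("author", "producer") if metadata.get(key)
    ]
    return {
        "creatorcontrib": creatorcontrib,
        "recordid": control["recordid"],    # from PNX control/recordid
        "resource type": display["type"],   # from PNX display/type
    }


result = search_fields(
    {"author": "Jane Doe", "producer": "Example Press"},
    {"recordid": "WARC000123"},
    {"type": "website"},
)
print(result["creatorcontrib"])
```

This ordering implies that the control and display sections must be populated before the search section is derived.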
Sort
| Sort | Source/Content | Additional Normalization Rules |
|---|---|---|
| title | Copied from PNX display/title | |
| author | Copied from PNX display/creator | |
Facets
| Facet | Source/Content | Additional Normalization Rules |
|---|---|---|
| language | metadata/language | |
| topic | metadata/keywords | |
| toplevel | Constant: online_resources | |
| prefilter | from PNX display/type | |
| resource type | from PNX display/type | |
Duplicate Record Detection
No dedup vectors are predefined.
FRBR
No FRBR vectors are predefined.
Delivery and Scoping
| Delivery Field | Source | Additional Normalization Notes |
|---|---|---|
| Delivery category | Online Resource | Modify as relevant |
Ranking
No Ranking fields are predefined.
Enrichment
No enrichment fields are predefined.
Additional Data
No additional data fields are predefined.

