WARC
The WARC template is based on the standard metadata elements that are parsed into the XML file created by the WARC file splitter. Source paths usually take the form metadata/<tag>.
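As a rough illustration of the metadata/<tag> path convention, the sketch below reads values from a record with Python's standard `xml.etree.ElementTree`. The sample XML layout (a `record` root with a `metadata` child), the sample values, and the helper name `get_values` are assumptions made for this example; the actual output of the WARC file splitter may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record; the real splitter output may be laid out differently.
SAMPLE = """<record>
  <warc-target-uri>http://example.org/</warc-target-uri>
  <metadata>
    <title>Example page</title>
    <author>Jane Doe</author>
  </metadata>
</record>"""


def get_values(record, path):
    """Return the stripped, non-empty text values found at a path such as 'metadata/title'."""
    return [
        element.text.strip()
        for element in record.findall(path)
        if element.text and element.text.strip()
    ]


record = ET.fromstring(SAMPLE)
print(get_values(record, "metadata/title"))     # tag present in the sample
print(get_values(record, "metadata/keywords"))  # tag absent, so the list is empty
```

Returning a list rather than a single string keeps the helper usable when a tag occurs more than once in a record.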
Control
Control field | Source/Content | Additional Normalization Rules |
---|---|---|
Source ID | From configuration file. | Required field. |
Record ID | Source ID + Source Record-ID | Required field. |
Source system | From configuration file. | |
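The Record ID rule can be sketched as follows. This is a minimal illustration that assumes the "+" in the table means plain string concatenation with no separator; the function name `build_record_id` and the sample values are hypothetical.

```python
def build_record_id(source_id, source_record_id):
    """Build a record ID from the source ID and the source record ID.

    Assumption: the two parts are concatenated directly, with no separator.
    Both parts are needed to build the ID, so an empty value raises an error.
    """
    if not source_id or not source_record_id:
        raise ValueError("source_id and source_record_id must both be non-empty")
    return f"{source_id}{source_record_id}"


# Hypothetical values: the source ID comes from the configuration file.
print(build_record_id("WARC", "000123"))
```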
Display
Display Field | Source/Content | Additional Normalization Rules |
---|---|---|
Type | Constant – “website” | |
title | metadata/title, OR if not present the URI is taken: warc-target-uri | |
creator | metadata/author | |
contributor | metadata/producer | |
creation date | metadata/created | |
format | metadata/content-type and metadata/resource-type | The two fields are merged |
subject | metadata/keywords | |
description | metadata/description | |
language | metadata/language | |
rights | metadata/rights | |
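Two of the display rules carry logic beyond a straight copy: the title falls back to the target URI, and the format merges two source fields. The sketch below shows one way these could behave, assuming the metadata has already been read into a plain dictionary keyed by tag name. The function names and the `"; "` separator are assumptions; the table states only that the two fields are merged, not how.

```python
def display_title(metadata, warc_target_uri):
    """Use metadata/title when present; otherwise fall back to the target URI."""
    title = (metadata.get("title") or "").strip()
    return title if title else warc_target_uri


def display_format(metadata, separator="; "):
    """Merge metadata/content-type and metadata/resource-type into one value.

    The separator is an assumption made for this sketch.
    """
    parts = [metadata.get("content-type"), metadata.get("resource-type")]
    return separator.join(part.strip() for part in parts if part and part.strip())


# Hypothetical record without a title: the URI is used instead.
meta = {"content-type": "text/html", "resource-type": "response"}
print(display_title(meta, "http://example.org/"))
print(display_format(meta))
```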
Links
Link | Source/Content | Additional Normalization Rules |
---|---|---|
link to resource | warc_record/warc-target-uri | |
Search
Search Field | Source/Content | Additional Normalization Rules |
---|---|---|
creatorcontrib | metadata/author and metadata/producer | |
title | metadata/title, OR if not present the URI is taken: warc-target-uri | |
description | metadata/description | |
subject | metadata/keywords | |
fulltext | content | This tag includes the content of the harvested web page. |
recordid | from PNX control/recordid | |
resource type | from PNX display/type | |
creation date | metadata/created | |
format | metadata/content-type and metadata/resource-type | The two fields are merged |
Sort
Sort | Source/Content | Additional Normalization Rules |
---|---|---|
title | Copied from PNX display/title | |
author | Copied from PNX display/creator | |
Facets
Facet | Source/Content | Additional Normalization Rules |
---|---|---|
language | metadata/language | |
topic | metadata/keywords | |
toplevel | Constant: online_resources | |
prefilter | from PNX display/type | |
resource type | from PNX display/type | |
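The facet rules combine a constant, values copied from the already-built display section, and values taken from the source metadata. A minimal sketch, assuming the metadata is available as a dictionary keyed by tag name and that empty values are simply omitted (the function name `build_facets` and the omission rule are assumptions):

```python
def build_facets(metadata, display_type):
    """Assemble facet values following the table above.

    metadata: dict of source tags (hypothetical in-memory form of metadata/<tag>).
    display_type: the value already assigned to PNX display/type.
    """
    facets = {
        "toplevel": "online_resources",  # constant from the table
        "prefilter": display_type,       # from PNX display/type
        "resource type": display_type,   # from PNX display/type
    }
    # Facets taken directly from source metadata; skipped when empty (assumption).
    for facet, tag in (("language", "language"), ("topic", "keywords")):
        value = (metadata.get(tag) or "").strip()
        if value:
            facets[facet] = value
    return facets


print(build_facets({"language": "eng", "keywords": "web archiving"}, "website"))
```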
Duplicate Record Detection
No dedup vectors are predefined.
FRBR
No FRBR vectors are predefined.
Delivery and Scoping
Delivery Field | Source | Additional Normalization Notes |
---|---|---|
Delivery category | Online Resource | Modify as relevant |
Ranking
No Ranking fields are predefined.
Enrichment
No enrichment fields are predefined.
Additional Data
No additional data fields are predefined.