Manage Your Research Data: Documentation & Metadata

Resources to help you prepare your data for open access and archiving

What is Data Documentation?

Data documentation can take different forms depending on the discipline and/or nature of the research project. For example, codebooks are a form of documentation for questionnaire/survey data; lab notebooks are a primary form of data documentation in the sciences and engineering. 

Essentially, data documentation explains:

Who?

  • Who collected this data?
  • Who or what were the subjects under study?

What?

  • What data was collected, and for what purpose?
  • What is the content and structure of the data?

Where?

  • Where was this data collected?
  • What were the experimental conditions that produced it?

When?

  • When was the data collected?
  • Is the data part of a series, or ongoing experiment?

Why?

  • Why was this experiment performed?
  • How does it relate to your research question?

Data Documentation Best Practices

  • Describe the contents of data files
  • Define each parameter and its units
  • Explain the formats for dates, time, geographic coordinates, and other parameters
  • Define any coded values
  • Describe quality flags or qualifying values
  • Define missing values
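
As an illustration of the practices above, here is a minimal sketch of a machine-readable data dictionary. The variable names, codes, and missing-value markers are hypothetical and chosen only for the example; a real project would substitute its own.

```python
# Hypothetical data dictionary: every variable name, code, and
# missing-value marker below is invented for illustration.
DATA_DICTIONARY = {
    "temp_c": {
        "description": "Air temperature at the sampling site",
        "units": "degrees Celsius",
        "missing_value": -999,
    },
    "sample_date": {
        "description": "Date the sample was collected",
        "format": "YYYY-MM-DD (ISO 8601)",
        "missing_value": "",
    },
    "site_type": {
        "description": "Habitat category of the sampling site",
        "codes": {1: "forest", 2: "grassland", 3: "wetland"},
        "missing_value": 9,
    },
    "qc_flag": {
        "description": "Quality flag assigned during review",
        "codes": {"G": "good", "S": "suspect", "B": "bad"},
        "missing_value": "",
    },
}


def decode(variable, raw_value):
    """Translate a raw value into its documented meaning, or None if missing."""
    entry = DATA_DICTIONARY[variable]
    if raw_value == entry["missing_value"]:
        return None
    # Coded variables are looked up; uncoded values pass through unchanged.
    return entry.get("codes", {}).get(raw_value, raw_value)
```

Keeping the definitions in one structure means the same file can document the data for human readers and drive any script that cleans or checks it.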

Data Documentation, Organization, and Metadata

5-minute video from the University of Minnesota's Data Management Course. 

What is Metadata?

Metadata is data about data. 

Data with associated metadata is often called "structured data". Because the metadata follows a standard, the data is machine-readable.

Metadata Best Practices

  • Consistent data entry is important
  • Avoid extraneous punctuation
  • Avoid most abbreviations
  • Use templates and macros when possible
  • Extract pre-existing metadata
  • Keep a data dictionary
  • Always use an established metadata standard

Metadata Sample Elements

Sample Elements to Include in Metadata:

  • Title
  • Creator
  • Identifier
  • Subject
  • Funders
  • Rights
  • Access Information
  • Language
  • Dates
  • Location
  • Methodology
  • Data Processing
  • Sources
  • List of File Names
  • File Formats
  • File Structure
  • Variable List
  • Code Lists
  • Versions
  • Checksums
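
Several of these elements (list of file names, file formats, checksums) can be generated rather than typed by hand. The sketch below is one possible approach using only the Python standard library. The function names and the choice of SHA-256 are assumptions made for this example, not requirements of any particular repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_manifest(folder):
    """List every file under a folder with its format, size, and checksum."""
    root = Path(folder)
    return [
        {
            "file_name": str(p.relative_to(root)),
            "file_format": p.suffix.lstrip(".").lower(),  # extension only
            "size_bytes": p.stat().st_size,
            "sha256": sha256_of(p),
        }
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
```

Recomputing the checksums later and comparing them with the stored values is a simple way to confirm that files have not changed or been corrupted.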

Common Metadata Standards

Here are some common metadata standards in scientific disciplines.

Engineering

CSMD-CCLRC Core Scientific Metadata Model

A study-data oriented model, primarily in support of the ICAT data management infrastructure software. The CSMD is designed to support data collected within a large-scale facility’s scientific workflow; however the model is also designed to be generic across scientific disciplines.

Sponsored by the Science and Technology Facilities Council, there is reference to CSMD 3.0 development in 2010; however, the latest full specification available is v 2.0, from 2004.

ISA-Tab

The Investigation/Study/Assay (ISA) tab-delimited (TAB) format is a general purpose framework with which to collect and communicate complex metadata (e.g. sample characteristics, technologies used, type of measurements made) from 'omics-based' experiments employing a combination of technologies.

Created by core developers from the University of Oxford, ISA-TAB v1.0 was released in November 2008.

MIBBI - Minimum Information for Biological and Biomedical Investigations

A common portal to a group of nearly 40 checklists of Minimum Information for various biological disciplines. The MIBBI Foundry is developing a cross-analysis of these guidelines to create an intercompatible, extensible community of standards.

The concept was realized initially through the joint efforts of the Proteomics Standards Initiative, the Genomic Standards Consortium and the MGED RSBI Working Groups. The latest project to register with MIBBI is the MIABie guidelines for reporting biofilm research, as of January 2012.

Environmental & Physical Sciences

CF (Climate and Forecast) Metadata Conventions

The CF standard was originally framed as a standard for data written in netCDF format, with model-generated climate forecast data particularly in mind. However, it is equally applicable to observational datasets, and can be used to describe other formats. It is a standard for “use metadata” that aims both to distinguish quantities (such as physical description, units, and prior processing) and to locate the data in space–time.

Sponsored by the NetCDF Climate and Forecast Metadata Convention, the current version dates from December 2011.

CIM-Common Information Model

The Common Information Model (CIM) describes climate data, the models and software from which they derive, the geographic grids used to calculate and project them, and the experimental processes (typically simulations) that produced them.

The CIM was originally developed by the EU-funded Metafor Project. It is now maintained and developed by Earth Science Documentation (ES-DOC). The latest release dates from 2012.

CSMD-CCLRC Core Scientific Metadata Model

A study-data oriented model, primarily in support of the ICAT data management infrastructure software. The CSMD is designed to support data collected within a large-scale facility’s scientific workflow; however the model is also designed to be generic across scientific disciplines.

Sponsored by the Science and Technology Facilities Council, there is reference to CSMD 3.0 development in 2010; however, the latest full specification available is v 2.0, from 2004.

DIF - Directory Interchange Format

An early metadata initiative from the Earth sciences community, intended for the description of scientific data sets. It includes elements focusing on instruments that capture data, temporal and spatial characteristics of the data, and projects with which the dataset is associated. It is defined as a W3C XML Schema.

Sponsored by the Global Change Master Directory, the DIF Writer's Guide Version 6 is from November 2010.

FGDC/CSDGM - Federal Geographic Data Committee Content Standard for Digital Geospatial Metadata

A widely used but no longer current standard defining the information content for a set of digital geospatial data required by the US Federal Government.

CSDGM was sponsored by the US Federal Geographic Data Committee.  However, in September 2010 the FGDC endorsed ISO 19115 and began encouraging federal agencies to transition to ISO metadata.

ISO 19115

An internationally-adopted schema for describing geographic information and services. It provides information about the identification, the extent, the quality, the spatial and temporal schema, spatial reference, and distribution of digital geographic data.

Sponsored by the International Organization for Standardization (ISO), ISO 19115:2003 was last reviewed in February 2009.

Observations and Measurements

This encoding is an essential dependency for the OGC Sensor Observation Service (SOS) Interface Standard. More specifically, this standard defines XML schemas for observations, and for features involved in sampling when making observations. These provide document models for the exchange of information describing observation acts and their results, both within and between different scientific and technical communities.

PDBx/mmCIF - Protein Data Bank Exchange Dictionary and the Macromolecular Crystallographic Information Framework

The Protein Data Bank (PDB) archive is the single worldwide archival repository of information about the 3D structures of proteins, nucleic acids, and complex assemblies, managed by the Worldwide PDB (wwPDB). The PDB Exchange Dictionary (PDBx) is used by the wwPDB to define data content for deposition, annotation and archiving of PDB entries. PDBx incorporates the community standard metadata representation, the Macromolecular Crystallographic Information Framework (mmCIF), originally developed under the auspices of the International Union of Crystallography (IUCr). PDBx has been extended by the wwPDB to include descriptions of other experimental methods that produce 3D macromolecular structure models such as Nuclear Magnetic Resonance Spectroscopy, 3D Electron Microscopy and Tomography.

Repository-Developed Metadata Schemas

Some repositories have decided that current standards do not fit their metadata needs, and so have created their own requirements.

 

Life Sciences

ABCD - Access to Biological Collection Data

The Access to Biological Collections Data (ABCD) Schema is an evolving comprehensive standard for the access to and exchange of data about specimens and observations (a.k.a. primary biodiversity data). The ABCD Schema attempts to be comprehensive and highly structured, supporting data from a wide variety of databases. It is compatible with several existing data standards. Parallel structures exist so that either (or both) atomised data and free-text can be accommodated.

Sponsored by Biodiversity Information Standards TDWG - the Taxonomic Databases Working Group, the current specification was last modified in 2007.

Darwin Core

A body of standards, including a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries.

Sponsored by Biodiversity Information Standards (TDWG), the current standard was last modified in October 2009.
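
To show how a shared glossary of terms works in practice, here is a small sketch that writes one occurrence record to CSV using Darwin Core term names as column headers. The record itself is invented for illustration, and the choice of these five terms is an assumption. A real dataset would follow the requirements of the repository receiving it.

```python
import csv
import io

# Column names are Darwin Core terms; the record is invented for illustration.
FIELDS = [
    "scientificName",
    "eventDate",
    "decimalLatitude",
    "decimalLongitude",
    "basisOfRecord",
]

records = [
    {
        "scientificName": "Danaus plexippus",
        "eventDate": "2012-07-16",
        "decimalLatitude": "44.97",
        "decimalLongitude": "-93.23",
        "basisOfRecord": "HumanObservation",
    }
]

# Write to an in-memory buffer so the example has no side effects on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(records)
occurrence_csv = buffer.getvalue()
```

Because the column names come from a published vocabulary, another researcher can interpret the file without a project-specific key.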

EML - Ecological Metadata Language

Ecological Metadata Language (EML) is a metadata specification particularly developed for the ecology discipline. It is based on prior work done by the Ecological Society of America and associated efforts (Michener et al., 1997, Ecological Applications).

Sponsored by ecoinformatics.org, EML Version 2.1.1 was released in 2011.

Genome Metadata

Genome metadata on PATRIC consists of 61 different metadata fields, called attributes, which are organized into the following seven broad categories: Organism Info, Isolate Info, Host Info, Sequence Info, Phenotype Info, Project Info, and Others.

ISA-Tab

The Investigation/Study/Assay (ISA) tab-delimited (TAB) format is a general purpose framework with which to collect and communicate complex metadata (e.g. sample characteristics, technologies used, type of measurements made) from 'omics-based' experiments employing a combination of technologies.

Created by core developers from the University of Oxford, ISA-TAB v1.0 was released in November 2008.

MIBBI - Minimum Information for Biological and Biomedical Investigations

A common portal to a group of nearly 40 checklists of Minimum Information for various biological disciplines. The MIBBI Foundry is developing a cross-analysis of these guidelines to create an intercompatible, extensible community of standards.

The concept was realized initially through the joint efforts of the Proteomics Standards Initiative, the Genomic Standards Consortium and the MGED RSBI Working Groups. The latest project to register with MIBBI is the MIABie guidelines for reporting biofilm research, as of January 2012.

Observ-OM

Observ-OM is founded on four basic concepts to represent any kind of observation: Targets, Features, Protocols (and their Applications), and Values. It is intended to lower the barrier for future data sharing and facilitate integrated search across panels and species. All models, formats, documentation, and software are available for free and open source (LGPLv3) at http://www.observ-om.org.

OME-XML - Open Microscopy Environment XML

OME-XML is a vendor-neutral file format for biological image data, with an emphasis on metadata supporting light microscopy. It can be used as a data file format in its own right, or as a way of encoding metadata within a TIFF or BigTIFF file (for which purpose there is the OME-TIFF specification).

The standard is maintained by the Open Microscopy Environment Consortium, and was last updated in June 2012.

PDBx/mmCIF - Protein Data Bank Exchange Dictionary and the Macromolecular Crystallographic Information Framework

The Protein Data Bank (PDB) archive is the single worldwide archival repository of information about the 3D structures of proteins, nucleic acids, and complex assemblies, managed by the Worldwide PDB (wwPDB). The PDB Exchange Dictionary (PDBx) is used by the wwPDB to define data content for deposition, annotation and archiving of PDB entries. PDBx incorporates the community standard metadata representation, the Macromolecular Crystallographic Information Framework (mmCIF), originally developed under the auspices of the International Union of Crystallography (IUCr). PDBx has been extended by the wwPDB to include descriptions of other experimental methods that produce 3D macromolecular structure models such as Nuclear Magnetic Resonance Spectroscopy, 3D Electron Microscopy and Tomography.

Protocol Data Element Definitions

A draft set of data elements required by the National Institutes of Health (U.S.) for the submission of trial information to the ClinicalTrials.gov registry and results database.

Repository-Developed Metadata Schemas

Some repositories have decided that current standards do not fit their metadata needs, and so have created their own requirements.

 

Here are some common metadata standards in social science disciplines.

Social and Behavioral Sciences

DDI - Data Documentation Initiative

A widely-used international standard for describing data from the social, behavioral, and economic sciences. Expressed in XML, the DDI metadata specification supports the entire research data life cycle.

Sponsored by the DDI Alliance, DDI version 3.2 was released in 2014.

MIDAS-Heritage

A British cultural heritage standard for recording information on buildings, archaeological sites, shipwrecks, parks and gardens, battlefields, areas of interest and artefacts.

Sponsored by the Forum on Information Standards in Heritage, MIDAS Version 1.1 was released in October 2012.

OAI-ORE - Open Archives Initiative Object Reuse and Exchange

The goal of these standards is to expose the rich content in aggregations of Web resources to applications that support authoring, deposit, exchange, visualization, reuse, and preservation. The standards support the changing nature of scholarship and scholarly communication, and the need for cyberinfrastructure to support that scholarship, with the intent to develop standards that generalize across all web-based information including the increasingly popular social networks of “Web 2.0”.

QuDEx - Qualitative Data Exchange Format

The QuDEx standard/schema is a software-neutral format for qualitative data that preserves annotations of, and relationships between, data and other related objects. It can be viewed as the optimal baseline data exchange model for the archiving and interchange of data and metadata.

SDMX - Statistical Data and Metadata Exchange

A set of common technical and statistical standards and guidelines to be used for the efficient exchange and sharing of statistical data and metadata.

Sponsoring institutions include BIS, ECB, EUROSTAT, IMF, OECD, UN, and the World Bank. Technical Specification 2.1 was amended in May 2012.

Arts & Humanities

These disciplines often use the social and behavioral sciences standard known as DDI.

DDI - Data Documentation Initiative

A widely-used international standard for describing data from the social, behavioral, and economic sciences. Expressed in XML, the DDI metadata specification supports the entire research data life cycle.

Sponsored by the DDI Alliance, DDI version 3.2 was released in 2014.

MIDAS-Heritage

A British cultural heritage standard for recording information on buildings, archaeological sites, shipwrecks, parks and gardens, battlefields, areas of interest and artefacts.

Sponsored by the Forum on Information Standards in Heritage, MIDAS Version 1.1 was released in October 2012.

OAI-ORE - Open Archives Initiative Object Reuse and Exchange

The goal of these standards is to expose the rich content in aggregations of Web resources to applications that support authoring, deposit, exchange, visualization, reuse, and preservation. The standards support the changing nature of scholarship and scholarly communication, and the need for cyberinfrastructure to support that scholarship, with the intent to develop standards that generalize across all web-based information including the increasingly popular social networks of “Web 2.0”.

 


RDA Metadata Directory is maintained by Sean Chen and Kate Anne Alderete.
The theme is maintained by Dustin Allen.
This page was generated by GitHub Pages.

 

Retrieved from http://rd-alliance.github.io/metadata-directory/standards/

Text Encoding Initiative (TEI)

The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. In addition to the Guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications and software developed for or adapted to the TEI.

General Research Data

Dates Standard

Use ISO 8601 technical standard:

  • Year only: YYYY (e.g. 1997)
  • Year and month:  YYYY-MM (e.g. 1997-07)
  • Complete date: YYYY-MM-DD (e.g. 1997-07-16)
  • Date, time and zone: YYYY-MM-DDThh:mm±hh (e.g. 1997-07-06T15:45-08)
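
The four forms above can be produced with Python's standard datetime module, as in the sketch below. The variable names are illustrative. Note that Python writes the zone offset in the longer ±hh:mm form (for example -08:00), which is also valid ISO 8601.

```python
from datetime import date, datetime, timedelta, timezone

collected = date(1997, 7, 16)
year_only = collected.strftime("%Y")      # YYYY
year_month = collected.strftime("%Y-%m")  # YYYY-MM
complete = collected.isoformat()          # YYYY-MM-DD

# A fixed offset of eight hours behind UTC, used here only as an example zone.
utc_minus_8 = timezone(timedelta(hours=-8))
stamp = datetime(1997, 7, 6, 15, 45, tzinfo=utc_minus_8)
with_zone = stamp.isoformat(timespec="minutes")  # date, time and zone

# Reading a timestamp back in restores both the time and the offset.
parsed = datetime.fromisoformat("1997-07-06T15:45-08:00")
```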

Media Types Standard

Use MIME media types to describe file formats. Top-level types include: application, audio, example, image, message, model, multipart, text, video
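
As a rough sketch, Python's standard mimetypes module can suggest a media type from a file name. The helper name media_type and the fallback to application/octet-stream for unrecognized extensions are choices made for this example. The guess is based on the extension only, so results should be checked for unusual formats.

```python
import mimetypes


def media_type(file_name):
    """Guess a MIME type from the file extension, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(file_name)
    return guessed or "application/octet-stream"


def top_level_type(file_name):
    """Return only the part before the slash, e.g. 'image' or 'text'."""
    return media_type(file_name).split("/")[0]
```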

CERIF - Common European Research Information Format

The Common European Research Information Format is the standard that the EU recommends to its member states for recording information about research activity. Since version 1.6 it has included specific support for recording metadata for datasets.

DataCite Metadata Schema

A set of mandatory metadata that must be registered with the DataCite Metadata Store when minting a DOI persistent identifier for a dataset. The domain-agnostic properties were chosen for their ability to aid in accurate and consistent identification of data for citation and retrieval purposes. DataCite XML example file

Sponsored by the DataCite consortium, version 3.0 was released in 2013.
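
A simple completeness check before requesting a DOI might look like the sketch below. It assumes the five properties that version 3 of the schema treats as mandatory (identifier, creator, title, publisher, publication year). The dictionary keys, the sample record, and the DOI shown are hypothetical, and the current schema documentation should be consulted for the authoritative list.

```python
# Assumed mandatory properties; the key spellings are this example's own.
MANDATORY = ("identifier", "creators", "titles", "publisher", "publicationYear")


def missing_mandatory(record):
    """Return the mandatory properties that are absent or empty in a record."""
    return [name for name in MANDATORY if not record.get(name)]


# Hypothetical record with a placeholder DOI and no publisher yet.
draft = {
    "identifier": "10.1234/example-dataset",
    "creators": ["Doe, Jane"],
    "titles": ["Stream temperature survey, 2012"],
    "publicationYear": "2013",
}
problems = missing_mandatory(draft)
```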

 

DCAT - Data Catalog Vocabulary

By using DCAT to describe datasets in data catalogs, publishers increase discoverability and enable applications easily to consume metadata from multiple catalogs. It further enables decentralized publishing of catalogs and facilitates federated dataset search across sites. Aggregated DCAT metadata can serve as a manifest file to facilitate digital preservation.

Dublin Core

A basic, domain-agnostic standard which can be easily understood and implemented, and as such is one of the best known and most widely used metadata standards.  Sponsored by the Dublin Core Metadata Initiative, Dublin Core was published as ISO Standard 15836 in February 2009.

OAI-ORE - Open Archives Initiative Object Reuse and Exchange

The goal of these standards is to expose the rich content in aggregations of Web resources to applications that support authoring, deposit, exchange, visualization, reuse, and preservation. The standards support the changing nature of scholarship and scholarly communication, and the need for cyberinfrastructure to support that scholarship, with the intent to develop standards that generalize across all web-based information including the increasingly popular social networks of “Web 2.0”.

Observations and Measurements

This encoding is an essential dependency for the OGC Sensor Observation Service (SOS) Interface Standard. More specifically, this standard defines XML schemas for observations, and for features involved in sampling when making observations. These provide document models for the exchange of information describing observation acts and their results, both within and between different scientific and technical communities.

PROV

Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. The PROV Family of Documents defines a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments such as the Web.

RDF Data Cube Vocabulary

The standard provides a means to publish multi-dimensional data, such as statistics, on the web in such a way that it can be linked to related data sets and concepts using the W3C RDF (Resource Description Framework) standard. The model underpinning the Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations.

Repository-Developed Metadata Schemas

Some repositories have decided that current standards do not fit their metadata needs, and so have created their own requirements.

Retrieved from http://rd-alliance.github.io/metadata-directory/standards/

Data Documentation Initiative (DDI) Tools

DDI Controlled Vocabularies

Colectica for DDI (subscription) or Colectica for Excel (free)

Colectica software allows you to design, document, and publish your statistical data and survey research using open data standards (e.g., DDI). The Colectica platform consists of several software tools for viewing, creating, and managing your metadata.

Dublin Core Metadata Example

The Dublin Core Metadata Schema is composed of Elements (e.g. date) and Modifiers (e.g. accessioned). Some Elements are Required (e.g. title) and some are Repeatable (e.g. more than one author). The schema also specifies a controlled vocabulary. Two examples are the preferred subject terms used to characterize the data and the Scientific[species]Name of specimens, both of which help other researchers discover the dataset.
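
The sketch below builds a small Dublin Core record as XML with Python's standard library, including a repeated creator element. The namespace URI is the standard one for the Dublin Core element set. The wrapping metadata element, the function name, and all field values are hypothetical, and a repository may expect a different wrapper or additional qualifiers.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)  # serialize elements with the dc: prefix


def dublin_core_record(fields):
    """Build an XML record; each field maps to a list so elements can repeat."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, values in fields.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return root


# Invented values for illustration only.
record = dublin_core_record(
    {
        "title": ["Stream temperature survey, 2012"],
        "creator": ["Doe, Jane", "Roe, Richard"],
        "date": ["2012-07-16"],
        "subject": ["Hydrology"],
    }
)
xml_text = ET.tostring(record, encoding="unicode")
```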