Manage Your Research Data: Documentation, Organization, Metadata

This guide provides a primer on the fundamentals of data management.

On This Page

Documentation
Defines "documentation," provides examples, and gives best practices.

Organization
Provides an overview of best practices in organization, naming schemes, and file formats.

Metadata
Defines "metadata," discusses standards, provides recommendations, and discusses FAIR (findable, accessible, interoperable, and reusable) data.

Documentation

Documentation varies between disciplines (see DMPtool for more details about individual federal agency requirements).
In general terms, documentation is the supplemental material that provides the information needed to read, understand, identify, and reuse data.
These documents may include readme files, data dictionaries, codebooks, glossaries, definition files, lab notebooks, or other supporting documents.

Content in Documentation

Documentation should identify the context, scope, and format of the data collected in a project. Examples of information included with documentation:
Data collection methods
File types/formats
Variable names
Software used
Version control details
Data sources used
Access restrictions and confidentiality issues
Naming conventions

General Best Practice
Documentation should be in file types that are nonproprietary or open source (.txt, .csv, .ods).  
Documents should describe the content of data files, define values, and explain variables and parameters.
Ultimately, the goal of documentation is to allow future researchers to read, understand, and potentially reuse data.
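As one illustration of the practices above, a data dictionary can be kept as a plain CSV file with one row per variable. The sketch below is only an example: the variable names, the column layout (name, type, units, description), and the function name build_data_dictionary are assumptions made for illustration, not a required format.

```python
import csv
import io

# Hypothetical variables for an imagined dataset; replace with your own.
VARIABLES = [
    {"name": "participant_id", "type": "integer", "units": "",
     "description": "Unique ID assigned at enrollment"},
    {"name": "visit_date", "type": "date", "units": "YYYY-MM-DD",
     "description": "Date of the study visit"},
    {"name": "temp_c", "type": "float", "units": "degrees Celsius",
     "description": "Body temperature; -999 means missing"},
]

FIELDS = ["name", "type", "units", "description"]


def build_data_dictionary(variables):
    """Return the data dictionary as CSV text, one row per variable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(variables)
    return buffer.getvalue()


if __name__ == "__main__":
    # Print the CSV; in practice you might save it next to the data files.
    print(build_data_dictionary(VARIABLES))
```

Saving the result as a .csv file keeps the documentation itself in a nonproprietary format, and recording codes for missing values (such as the made-up -999 here) is one way to define values for future users.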

Organization

Creating a clear and consistent organizational scheme at the beginning of a project is extremely important. Many aspects of organization, such as file type and storage solutions, will be addressed as part of your Data Management Plan, but some other considerations include:
Choosing the file format(s)
Naming schemes
Version control
Long- and short-term storage

Naming Best Practice
* Use descriptive names that identify content and version without being too long (fewer than 25 characters).
* Names may also indicate researcher, equipment, lab, or date. This varies by the needs of the project.
* Avoid special characters like ! @ # $ % ^ & *.
* Add versions or dates into names. Examples:
DataFileName_1.0 = original document
DataFileName_1.1 = original document with minor revisions
DataFileName_2.0 = document with substantial revisions
image1_v1.jpg
image1_v2.jpg
image2_v1.jpg
image2_v2.jpg
dataset1_20210402_RMM
dataset1_20210301_RMM
dataset1_20200814_RMM
When using dates, use numerals only and order them year, month, day (YYYYMMDD) so that file names sort chronologically. Example: 1/26/21 would be 20210126.
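The naming guidelines above can be applied consistently with a small helper. The following is a minimal sketch, not a prescribed tool: the function names build_name and is_valid_name are hypothetical, and the set of allowed characters (letters, digits, underscore, hyphen, period) and the choice to count the extension toward the length limit are assumptions.

```python
import re
from datetime import date

MAX_LENGTH = 25  # the guideline above suggests fewer than 25 characters

# Assumption: only letters, digits, underscore, hyphen, and period are allowed.
ALLOWED = re.compile(r"[A-Za-z0-9_.-]+")


def build_name(label, day, initials, extension=""):
    """Combine a label, a YYYYMMDD date, and researcher initials."""
    return f"{label}_{day.strftime('%Y%m%d')}_{initials}{extension}"


def is_valid_name(name):
    """Check a name against the length and special-character guidelines."""
    return len(name) < MAX_LENGTH and ALLOWED.fullmatch(name) is not None


if __name__ == "__main__":
    name = build_name("dataset1", date(2021, 4, 2), "RMM")
    print(name, is_valid_name(name))
    print("notes!.txt", is_valid_name("notes!.txt"))
```

Because the date is written year first, sorting such names alphabetically also sorts them chronologically.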

File Formats 
Consider using file types that can be opened without subscription software. These options include:
Video images: MOV, MPEG, AVI, MXF
Text: XML, PDF/A, HTML, ASCII, UTF-8
Sounds: WAVE, AIFF, MP3, MXF
Containers: TAR, GZIP, ZIP
Statistics: ASCII, DTA, POR, SAS, SAV
Images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
Tables: CSV
Databases: XML, CSV
Geospatial: SHP, DBF, GeoTIFF, NetCDF      
Web archives: WARC

 

Metadata

Metadata standards vary between disciplines (see DMPtool for more details about individual federal agency requirements), but metadata is broadly described as "data about data." Metadata provides contextual information about the collected data, indicating the creator, creation date, format, subject, and other important details.

At a minimum, metadata should contain the 15 elements identified by Dublin Core standards (text below is from the Dublin Core guide):
Title: A name given to the resource. Typically a Title will be a name by which the resource is formally known.
Creator: An entity primarily responsible for making the resource. Examples of a Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the entity.
Subject: The topic of the resource. Typically the subject will be represented using keywords, key phrases, or classification codes. Recommended best practice is to use a controlled vocabulary.
Description: An account of the resource. Description may include but is not limited to: an abstract, a table of contents, a graphical representation, or a free-text account of the resource.
Publisher: An entity responsible for making the resource available. Examples of a Publisher include a person, an organization, or a service. Typically, the name of a Publisher should be used to indicate the entity.
Contributor: An entity responsible for making contributions to the resource. Examples of a Contributor include a person, an organization, or a service. Typically, the name of a Contributor should be used to indicate the entity.
Coverage: The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant. Spatial topic and spatial applicability may be a named place or a location specified by its geographic coordinates. Temporal topic may be a named period, date, or date range. A jurisdiction may be a named administrative entity or a geographic place to which the resource applies. Recommended best practice is to use a controlled vocabulary such as the Thesaurus of Geographic Names [TGN]. Where appropriate, named places or time periods can be used in preference to numeric identifiers such as sets of coordinates or date ranges.
Date: A point or period of time associated with an event in the lifecycle of the resource. Date may be used to express temporal information at any level of granularity. Recommended best practice is to use an encoding scheme, such as the W3CDTF profile of ISO 8601 [W3CDTF].
Type: The nature or genre of the resource. Recommended best practice is to use a controlled vocabulary such as the DCMI Type Vocabulary [DCMITYPE]. To describe the file format, physical medium, or dimensions of the resource, use the Format element.
Format: The file format, physical medium, or dimensions of the resource. Examples of dimensions include size and duration. Recommended best practice is to use a controlled vocabulary such as the list of Internet Media Types [MIME].
Identifier: An unambiguous reference to the resource within a given context. Recommended best practice is to identify the resource by means of a string conforming to a formal identification system.
Source: A related resource from which the described resource is derived. The described resource may be derived from the related resource in whole or in part. Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system.
Rights: Information about rights held in and over the resource. Typically, rights information includes a statement about various property rights associated with the resource, including intellectual property rights.
Language: A language of the resource. Recommended best practice is to use a controlled vocabulary such as RFC 4646 [RFC4646].
Relation: A related resource. Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system.
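To show how such elements might be recorded in practice, the sketch below writes a few Dublin Core elements as XML using Python's standard library. It is an illustration under stated assumptions: the record values are made up, the wrapper element name "metadata" and the function name to_dublin_core_xml are hypothetical, and only a subset of the elements is shown. The namespace URI is the one published for the Dublin Core element set.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record; every value here is invented for illustration.
RECORD = {
    "title": "Stream Temperature Readings, 2021",
    "creator": "Example Lab",
    "date": "2021-01-26",      # ISO 8601 style date
    "format": "text/csv",      # Internet Media Type
    "language": "en",
    "rights": "CC BY 4.0",
}


def to_dublin_core_xml(record):
    """Serialize a dict of Dublin Core elements to an XML string."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # assumed wrapper element, not part of DC
    for element, value in record.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_dublin_core_xml(RECORD))
```

A repository or discipline may require a specific schema or wrapper, so check its documentation before adopting a layout like this one.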

Metadata Standards
Metadata requirements vary between disciplines and funding sources; some commonly used standards are listed below:
General
Dublin Core (DC)
Metadata Object Description Schema (MODS)
Humanities
Text Encoding Initiative (TEI)
Visual Resources Association Core (VRA)
Social Sciences
Data Documentation Initiative (DDI)
Natural Sciences
Darwin Core
Integrated Taxonomic Information System (ITIS)
Earth Sciences
Directory Interchange Format (DIF)
Standard for the Exchange of Earthquake Data (SEED)
Ecology
Ecological Metadata Language (EML)
Geography & Geospatial
Federal Geographic Data Committee (FGDC)
ISO 19115

The Digital Curation Centre also provides recommended metadata standards in:
Social Science & Humanities
Physical Science
General Research Data
Earth Science
Biology 

FAIR Data

The ultimate goal of metadata is to make data findable, accessible, interoperable, and reusable ("FAIR Data"). The defining principles of FAIR data are:
Findable
F1. (Meta)data are assigned a globally unique and persistent identifier
F2. Data are described with rich metadata (defined by R1 below)
F3. Metadata clearly and explicitly include the identifier of the data they describe
F4. (Meta)data are registered or indexed in a searchable resource
Accessible
A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
A1.1 The protocol is open, free, and universally implementable
A1.2 The protocol allows for an authentication and authorisation procedure, where necessary
A2. Metadata are accessible, even when the data are no longer available
Interoperable
I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (Meta)data use vocabularies that follow FAIR principles
I3. (Meta)data include qualified references to other (meta)data
Reusable
R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
R1.1. (Meta)data are released with a clear and accessible data usage license
R1.2. (Meta)data are associated with detailed provenance
R1.3. (Meta)data meet domain-relevant community standards