During our training events, we often discuss with the audience, consisting of data stewards and other research supporters, what sort of information researchers need. Specifically, during a 2023 training course on FAIR organised by the DCC-PO, we gathered frequently asked questions about FAIR data. Below, you can find a selection of these questions, accompanied by answers and sources for further information. If you have a question regarding research data management, FAIR data, or open science, please contact us – perhaps your question will be the next to appear below!

If you have questions that are more specifically related to the DANS Data Stations, have a look here.

FAIR data

What is FAIR and why should I implement it?

The FAIR guiding principles for scientific data management and stewardship are a set of guidelines that you can follow to make your data more reusable by others.

  • F – Findable: ensuring both humans and machines are able to find your data
  • A – Accessible: clarifying how humans and machines can gain access to your data
  • I – Interoperable: aligning your data with other data available in the community by using the same language and establishing connections
  • R – Reusable: allowing others to reuse your data by describing and licensing your data clearly

Being able to find and reuse data on a large scale benefits science as a whole ecosystem, and can prevent costly duplicate efforts. Making your data FAIR can help it reach a broader audience and increase its impact.

Further information:

FAIR is not exclusive to data; there are also FAIR principles for research software: Barker, M., Chue Hong, N.P., Katz, D.S. et al. Introducing the FAIR Principles for research software. Sci Data 9, 622 (2022). https://doi.org/10.1038/s41597-022-01710-x

Can my data be FAIR?

FAIR is not an all-or-nothing principle, so your data can be more or less FAIR depending on the practices that you implement. Based on the original FAIR principles, various interpretations of metrics and practices have been defined that can help you figure out what steps you can take to increase the FAIRness of your data. Many elements of FAIR are discussed in the other FAQs. You can also use a tool to assess the FAIRness of your data, as a starting point for identifying ways to improve it. See https://fairassist.org/#!/ for an overview of such tools.



 

What is the difference between FAIR data and Open data?

Data being FAIR is not the same as data being open. Your data can be restricted or closed access, while still being considerably FAIR. The ‘A’ in FAIR stands for accessible, but the principle does not dictate that data should be openly accessible, only that it should be clearly communicated what the level of accessibility is. If your data is not openly accessible, it is important that you clearly state what the conditions of access and reuse are (for example, you may provide access upon request, after confirming the purpose of the reuse or the role of the requester). You can do this by providing a data availability statement or data access protocol, and specifying a licence for the data.



Data Management Plans (DMPs)

Why do I have to have a DMP?

Most funders and many institutions now require a DMP for each research project. The idea behind this is to make research transparent, replicable, and as open as possible. By writing a DMP before the start of a project, and by keeping this up-to-date throughout, you are thinking ahead about how to manage and share your research data. By putting effort into the DMP, less effort is subsequently required to manage the data effectively throughout the project. A good DMP can also reduce the risk of data loss or other threats that can negatively impact your data (e.g. obsolescence of software). 

Which DMP template should I use?

This depends on what it is for: funders and institutions often have their own templates or guidelines that they want you to adhere to, so the funder and university library websites are good first points to check. In Europe, more and more research funders follow the requirements and guidance developed by Science Europe. Their document is a good place to start, and while templates and guidelines vary, the main themes identified by Science Europe are part of every template. There are also various DMP tools that you can use, and these often have inbuilt templates. Good and free-to-use examples are Argos and DMPonline. Added advantages are that they support joint writing and that the DMP produced can also be machine-readable.

Further information:

Horizon Europe DMP template: https://enspire.science/wp-content/uploads/2021/09/Horizon-Europe-Data-Management-Plan-Template.pdf

Are example DMPs available?

Because the information in a DMP differs per research project, institution, discipline, and so on, it is most useful to check out examples from similar projects. You can find public DMPs in Argos and DMPonline, on Zenodo, and, for example, on the DMP Use Case Project website. Your institution may have its own template in which certain aspects are already filled in for you, like how data is shared or backed up in your institution.

 

Metadata

What is metadata?

Metadata is “data about data”, so data that gives information about other data. It is important in order to discover, find, and reuse data. You can think of project-level metadata, like the title of the project, what it is about (e.g., keywords), who is involved (e.g., who are the authors of a dataset), who is the funder, and so on. Other important metadata relate to what the dataset contains (e.g., what sort of files), how it was collected, how it was processed, and how it can be accessed and reused (e.g., licences, PIDs such as DOI). There is also metadata at the data level, such as information about the file (data type, format, etc.) and about variables used in the file. Structured metadata can also be machine-readable, and therefore more easily found during catalogue or internet searches.


What is a metadata standard?

As mentioned briefly under ‘What is metadata?’, there are different types of metadata, and these contain different metadata elements (e.g., ‘title’, ‘author’, etc.). For different purposes or in different disciplines, different combinations of these elements are required. A metadata standard is a subject-specific agreed-on or recommended (‘standard’) group of metadata elements. The metadata standard includes which metadata elements should be used, and can also include rules on the syntax and which controlled vocabularies should be used (per metadata element). 

Internationally recognized, and widely used, metadata standards are:

  • Dublin Core terms, for bibliographic information and discovery of other ‘information objects’. The Dublin Core Metadata Element Set contains fifteen basic metadata elements, like ‘Title’, ‘Creator’, ‘Date’, ‘Description’, while the DCMI metadata terms include additional elements. 
  • DataCite Metadata Schema for citation and retrieval of datasets. It contains twenty main elements, including for example ‘Title’, ‘Creator’, ‘Publisher’, ‘Subject’, ‘Language’, ‘Rights’. Many of its elements are similar to Dublin Core.
  • Data Documentation Initiative (DDI) for surveys and other observational methods in the social, behavioural, economic, and health sciences.

If you deposit your dataset in a trustworthy repository, the repository will already have metadata standards in place. DANS uses metadata from Dublin Core, DDI, and DataCite, following OpenAIRE guidelines and other community recommendations when deciding which metadata elements to include.


How do I document metadata?

As a researcher, it is good to be aware of the importance and types of metadata, as it does not take much time to document metadata if you start from the beginning of a project (see ‘What is metadata?’ and ‘What is a metadata standard?’). There are metadata templates that help you check whether you are collecting all relevant metadata (metadata fields), like dublincoregenerator. The most important thing to consider is whether you are adding all information that would be necessary to find, access, and reuse your data. You can collect your metadata simply in a text file and embed data-level information in the data files (think for example about EXIF information in an image file). To generate machine-readable metadata, you can submit your dataset to a trustworthy data repository (see also ‘Where should I store my data?’), so there is no need to do this yourself. If you want to, however, it is possible to generate a machine-readable .xml file yourself, for example using the dublincoregenerator mentioned above.
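To illustrate what a self-made machine-readable metadata file could look like, here is a minimal sketch that writes a handful of Dublin Core elements as XML using Python's standard library. The field values and the `metadata` wrapper element are invented for this example; a repository or a generator tool may expect a different wrapper or additional elements.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dublin_core(fields):
    """Return an XML element holding one dc:* child per (name, value) pair."""
    # "metadata" is a hypothetical wrapper element chosen for this sketch.
    root = ET.Element("metadata")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Example values are invented for illustration only.
record = build_dublin_core([
    ("title", "Interview transcripts on commuting habits"),
    ("creator", "Jansen, A."),
    ("date", "2023-05-01"),
    ("description", "Transcripts of 20 semi-structured interviews."),
    ("rights", "CC BY 4.0"),
])

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The same list of (name, value) pairs could just as well be kept in a plain text file during the project and only converted at deposit time.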


Where and how do I register metadata?

If you deposit your dataset in a repository, like one of the DANS Data Stations, you will be asked to fill out metadata fields. In this way, the metadata elements are connected to your dataset. Always make sure to deposit your data in a repository where you have the option to add metadata, ideally as rich (extensive) as possible. This increases the chance that others will find and reuse the data.

Depending on the repository, the metadata is also indexed (i.e. findable through search engines) or harvested by metadata catalogues. Metadata in the DANS Data Station Archaeology, for example, are also visible in the ARIADNE Portal, while metadata of the DANS Data Station Social Sciences and Humanities can be seen in the ODISSEI Portal.

In some cases you may want to increase the visibility of your output by adding its metadata manually to a register or catalogue. For example, if you deposited training or teaching materials (e.g. slides or exercise templates) in Zenodo, you could also add their metadata to the TeSS platform, a platform for finding training materials, to increase their visibility. (NB: while you can add metadata referring to your data in multiple places, you should never deposit your actual data or other materials twice.)


File formats

What is a file format?

Files are saved in different formats. File formats are standard ways to encode information for storage on a computer. You can usually tell the file format from the file extension (the suffix), like .jpg or .docx.

There are normally several possible formats for the same type of file. For example, you can save an image as a JPG, TIFF, PNG, and more. There is a basic distinction between proprietary formats, which are owned by a company or organisation, and non-proprietary or open formats, which are publicly accessible.


What is meant by ‘preferred (file) formats’?

Preferred formats are file formats that an organisation or repository, like DANS, is confident, based on international agreements, offer the best long-term guarantee in terms of usability, accessibility and sustainability. As a general guideline, these are file formats which:

  • are frequently used
  • have open specifications (i.e. they are non-proprietary)
  • are independent of specific software, developers, or vendors.

It is good to know that in practice it will not always be possible to use a format which meets all three criteria. For example, a proprietary format may be the one that is most used for a specific purpose in your field, like SPSS .sav files or ESRI shapefiles. In this case, it is best to store your data both in the proprietary, often-used format and in a preferred, open format which has better chances of long-term preservation and accessibility.
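As a rough illustration, the sketch below sorts the files of a dataset into those whose extension is on a list of preferred formats and those for which an additional open copy would be advisable. The extension set is a hypothetical placeholder, not the actual list of any repository; always check the list published by the repository you deposit in.

```python
from pathlib import PurePath

# Hypothetical examples only: consult your repository's own list of
# preferred formats rather than relying on this set.
PREFERRED_EXTENSIONS = {".csv", ".txt", ".xml"}


def split_by_preference(filenames):
    """Sort file names into (preferred, other) based on their extension."""
    preferred, other = [], []
    for name in filenames:
        suffix = PurePath(name).suffix.lower()  # compare case-insensitively
        if suffix in PREFERRED_EXTENSIONS:
            preferred.append(name)
        else:
            other.append(name)
    return preferred, other


preferred, other = split_by_preference(
    ["survey.sav", "survey.csv", "codebook.TXT", "sites.shp"]
)
print("preferred format:", preferred)
print("consider adding an open copy of:", other)
```

Note that an extension is only a hint about the format; it does not prove that the file content actually conforms to that format.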


Why should I use preferred formats?

We recommend that you use preferred formats (see the question on preferred formats above) because they offer the best long-term guarantee that your data will remain usable, accessible, and sustainable. Proprietary file formats, for example, may require specific software to be opened, and this software may not remain available. However, there are situations where a proprietary format may be the most common format used in your field, like SPSS .sav files in the social sciences or ESRI shapefiles in archaeology or geography. There are also cases where proprietary formats contain additional information that may be lost upon conversion. In such cases, we recommend that you archive your files in both the proprietary format and a preferred format. The preferred-format copy may not contain the most information or be the most usable version at the moment, but it has better prospects in the long term. At DANS, at least, we will normally accept any file type if there is a good reason for using it, so you are never forced to use preferred formats. Just keep in mind that with non-preferred formats there is no guarantee that the data will remain readable in the long term.


Publishing and sharing datasets

My data cannot be understood by someone else, so what would be the point of publishing them?

Firstly, it is important to document your data well so that they can be understood by others. Even if your field is small and not many other people work with, for example, the same instrument in the lab or the same type of data, there will nonetheless be others out there who do and there may be other people in future who will conduct the same type of research or even want to build on yours. In the end, we can guess but cannot anticipate how science will develop and who will be interested in your data in the future. By documenting your data well (what did you do, how did you do it, what do the variables in your files mean?), for example in a readme file, you make the data understandable to others.

Secondly, if you publish your data by depositing them under an open licence in a repository, you not only make them available to others, but also ensure their long-term preservation. Even if only a few others will be able to understand them, at least they are preserved for those who do.

Furthermore, your funder or institution may require that you make your data available in a research data repository. The fact that your research has been done is also useful to know for others, so even if the data are not reused as such, the metadata may give important information to others.


My data are not useful to others, so why would I make them available for reuse?

Are you very sure that your data are not useful to anyone else? When you started your project, you presumably had a research question you wanted to answer and a reason for wanting to answer it – others may also find the same thing interesting and worthwhile. Moreover, by making your data available, you also ensure that others can see what the conclusions in your journal paper or chapter are based on; in other words, it is important for research integrity. Furthermore, your funder or institution may require that you make your data available in a research data repository.

How do I make sure others can interpret my data?

You can do this by supplying sufficient information, in the form of metadata and data documentation. What does one need to know to understand the data? Think of project-level information, like what was the research about, why was it conducted, and by whom, and of data-level information like what methods were used to measure, process, and analyse the data, and what do the abbreviations and variable names in the data file mean? You can add this information for example as a readme file (see ‘What is a readme file?’), including a codebook. Also make sure to add all relevant files to your dataset, not just your data, but also, for example, experimental protocols, the questionnaire that was used to collect the data, or scripts used for analysis. 


What is a readme file?

A readme file is a text file in which you provide information about your dataset. It is good practice to add such a file to your dataset. It should contain all relevant information for your data to be understood and reused, such as information about the methods used, what the data mean (e.g. abbreviations, definitions, units of measurement), and access and reuse information. You can find more details and a good example in the 4TU.Research Data guidelines for creating a README file.
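If you want to start from an empty outline, a small script can generate one. The sketch below is only an illustration: the section headings and hints are assumptions derived from the elements listed above, and are not copied from the 4TU.Research Data guidelines.

```python
# Section headings and hints are assumptions for this sketch; adapt them
# to the guidelines of your institution or repository.
SECTIONS = [
    ("General information", "title, creators, contact details"),
    ("Methods", "how the data were collected and processed"),
    ("Data-specific information", "abbreviations, definitions, units of measurement"),
    ("Access and reuse", "licence, access conditions, how to cite"),
]


def readme_skeleton(dataset_title):
    """Return the text of an empty readme with one block per section."""
    lines = [f"README for: {dataset_title}", ""]
    for heading, hint in SECTIONS:
        lines += [heading.upper(), f"[{hint}]", ""]
    return "\n".join(lines)


text = readme_skeleton("Example dataset")
print(text)
```

Saving the result as a plain .txt file keeps the readme itself in an open, software-independent format.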


How do I decide where to publish my data?

We recommend using a research data repository to publish your data. An alternative is to add your data as supplementary material to a journal article, but in this case you may lose copyright of the data, the data may end up behind a paywall, and they are probably not preserved in the long term.

But which research data repository should you choose? Sometimes your funder or institution will require a specific repository. If not, you can follow these recommendations by OpenAIRE:

  1. Preferably use a trustworthy (certified) domain repository. A domain-specific repository offers specialist expertise and metadata fields. Trustworthy repositories in particular also offer long-term preservation services.
  2. Alternatively, use an institutional repository. An institutional repository probably accepts all types of data from any discipline, but it may not curate the data and therefore may not offer long-term sustainable access to the data. OpenAIRE therefore recommends using such a repository only if it does offer long-term access.
  3. If options 1 and 2 are not available or suitable, you can use a generic, or catch-all, repository, like Zenodo. While you may reach a wide audience, long-term preservation and access are not always guaranteed, and domain-specific metadata are not normally available.
  4. Find a repository by searching in re3data.org. You can find trustworthy repositories by filtering on “Certificate”, or you can filter on other characteristics that you find especially important for your data, e.g. support for a specific metadata schema or licence.

To ensure you have found a trustworthy repository, look out for a CoreTrustSeal, Nestor, or ISO 16363 certification label on the repository’s website.


Sensitive data

How can I share a dataset containing personal data?

There are ethical considerations and legal regulations, like the General Data Protection Regulation (GDPR), to take into account here, and we recommend that you consult the privacy officer at your institution if at all possible. A brief, non-exhaustive summary: firstly, if your data processing is based on consent, you need to ensure that sharing data is covered in the informed consent form, and you could even ask for consent for sharing identifiable personal data. Secondly, unless you have consent to share identifiable personal data and there are good grounds to do so, you should de-identify the data through anonymisation or pseudonymisation. If, however, this would result in so much information loss that the data are no longer useful, you may still be able to share the data, for example with restricted access or through decentralised re-analysis.


 

How can I anonymise or pseudonymise my data?

Data can be considered anonymised when the process of removing identifying characteristics is irreversible, while with pseudonymised data there is a key that can be used to re-identify the data, i.e. to link them back to the original data (see the Guidebook Making Qualitative Data Reusable or the OpenAIRE Sensitive Data Guide).
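The role of the key can be illustrated with a minimal sketch. It replaces one direct identifier with a random pseudonym and returns the key separately; the field names and records are invented for the example. It does not deal with indirect identifiers (such as a rare profession combined with a place of residence), which can still make people recognisable.

```python
import secrets


def pseudonymise(records, id_field):
    """Replace the identifier in each record with a random pseudonym.

    Returns the pseudonymised records and the key (pseudonym -> original),
    which must be stored separately and securely.
    """
    key = {}        # pseudonym -> original identifier
    assigned = {}   # original identifier -> pseudonym
    output = []
    for record in records:
        original = record[id_field]
        if original not in assigned:
            pseudonym = "P-" + secrets.token_hex(4)
            while pseudonym in key:  # extremely unlikely, but avoid clashes
                pseudonym = "P-" + secrets.token_hex(4)
            assigned[original] = pseudonym
            key[pseudonym] = original
        output.append({**record, id_field: assigned[original]})
    return output, key


# Invented example records.
records = [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": 51},
    {"name": "Alice", "age": 35},
]
pseudo, key = pseudonymise(records, "name")
print(pseudo)
```

As long as the key exists, the data are pseudonymised rather than anonymised. Deleting the key is not by itself enough to make the data anonymous, since other fields may still identify a person.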

The main aim is to remove any identifying characteristics. Guides and tools are available to help you do this.


My data cannot be shared - what do I do?

There are some data that cannot be shared, for example because the dataset contains sensitive data, like personal or confidential data. If it is not possible to take the sensitive data out of the dataset, or to de-identify them, without information loss, it is probably indeed not possible to share the data openly. However, you can still archive the data in a repository, with restricted access. In this way the data remains preserved in the long term and the metadata are available. If others are interested in accessing and/or reusing the data, they can put in an access request, which can be assessed objectively on a case-by-case basis. 


Rights and licences

What is a licence?

“A data licence is a legal arrangement between the creator of the data and the end user, or the place the data will be deposited, specifying what users can do with the data” (Deutz et al. 2020). If you assign a licence, you immediately make clear how people can or cannot use the data. Data without a licence or rights waiver cannot be reused.

For data as well as many other research outputs, the Creative Commons licences are the most used. You can find an overview of all of them at the Creative Commons webpage. For software and code there are other types of licences available, like the MIT, GNU GPL, and Apache licences.


Which licence should I choose?

You should choose a licence which is as open as possible; nowadays this is often also a funder requirement.

For data or publications a Creative Commons licence is very suitable and widely known. There are six Creative Commons licences with combinations of four elements:

  • Attribution (BY)
  • Share Alike (SA) (you have to reshare under the same terms)
  • Non-Commercial (NC)
  • No Derivatives (ND) (you cannot change the original work)

There is also a Creative Commons Public Domain dedication, meaning that you give up all copyright; this is the CC0 (CC Zero).

At DANS we recommend sharing your work as openly as possible, which would mean using CC0. Nonetheless, we understand that as a researcher it is important to be acknowledged for your work, and for this the CC BY licence is suitable (or a CC BY licence with additional restrictions). The CC BY licence is probably the most used licence for open research data and open access publications. More restrictive licences like the CC BY-SA (share-alike, i.e. the output coming out of the reuse has to be licensed under identical terms) or the CC BY-NC (non-commercial) are also frequently used, but we do not recommend them, as reuse can become more difficult. For example, a not-for-profit organisation providing training at cost price or even less may still be considered commercial. Or, in another example, if you collect data together in a database and some of it is ‘share-alike’, this means that you have to share your database under the same terms, even if you want to make it more or less open.

You may, however, not always have the choice, as publishers and repositories may only give you a limited number of choices. 


Who owns the data?

This is not straightforward to answer. Legally, at least under Dutch law, data cannot be owned. This makes some sense if you think for example about measuring the temperature with a thermometer – no one owns the fact that it is 15.7 degrees Celsius. However, while you may come across this “no one owns data” statement, it is not very relevant for most research data. This is because if you organise or otherwise put your stamp on the data, they are no longer just bare facts – in other words, the work is ‘original’. Nonetheless, a better way of looking at it is perhaps to ask who is responsible for (processed, organised, collected) data and who can decide what happens with it. It depends on where you are and which laws apply to you, but in the Netherlands as well as many other places it is the institution, i.e. the employer, who is responsible. In practice, the decision-making on, for example, where the data is stored in the long term is often left to the researcher, but this varies between institutions. If the policy in your institution is not clear, it is best to ask for this to be clarified. If you are not working for an employer, for example if you are an independent researcher, you are responsible for the data yourself. For students who produce data as part of their studies the situation is not always clear – if there are no clear guidelines and rules in your institution, we recommend that you ask your supervisor for these.

More information:

  • LCRDM (in preparation). Data sovereignty, data governance and digital sovereignty. 

 

Vocabularies

What are vocabularies?

A vocabulary is essentially a list of terms. You can use these for the documentation of your data or metadata. A vocabulary can be a simple list or highly structured.

Vocabularies can be organised in different ways, for example as:

  • A list. This can be a simple, flat list (for example a list of the provinces of the Netherlands), but also a data dictionary (a list of terms with definitions), or a controlled vocabulary (a list of terms with a process to manage it, such as policies on who maintains and curates it). You often see these as the values in dropdown menus.
  • A taxonomy: a hierarchical system, or classification scheme, with groups of classes and subclasses and relations between them. Taxonomies can be represented as a tree structure (for example, a classification of professions).
  • A thesaurus: a controlled vocabulary where the terms (or concepts) are connected through relations. Examples are the European Language Social Science Thesaurus and the Art and Architecture Thesaurus.
  • An axiomatic or formal ontology: a shared system of classes, subclasses, and relations, using formal logic. This allows for the deduction of new information: if your data point is an instance of a leaf, and in the ontology we can see that a leaf is part of a plant and a plant is an organism and that organisms are or have been alive, we can say that the leaf was part of a (once) living thing.
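The leaf example can be mimicked in a few lines of code. The sketch below is a toy that simply follows stated relations one step at a time; it is not formal-logic reasoning. A real ontology would be expressed in a formal language such as OWL and processed by a dedicated reasoner, and the statements here are invented to mirror the example.

```python
# Toy statements mirroring the leaf example: subject -> (relation, object).
STATEMENTS = {
    "leaf": ("is part of", "plant"),
    "plant": ("is a", "organism"),
    "organism": ("is or has been", "alive"),
}


def chain(term):
    """Follow the statements from a term and return each step taken."""
    steps = []
    seen = set()  # guards against endless loops in circular statements
    while term in STATEMENTS and term not in seen:
        seen.add(term)
        relation, target = STATEMENTS[term]
        steps.append((term, relation, target))
        term = target
    return steps


for subject, relation, target in chain("leaf"):
    print(subject, relation, target)
```

None of the individual statements says anything about a leaf being alive; that information only appears by chaining them, which is the kind of deduction an ontology makes possible.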

These concepts are also called ‘ontologies’ or ‘semantic artefacts’, although these terms do not have exactly the same meaning as vocabularies. To complicate things further, the terms can have different meanings to different communities – while confusing, it helps to be aware of this.

More information:

  • Maineri, Angelica Maria. (2022). Controlled vocabularies for the social sciences: what they are, and why we need them. Zenodo. https://doi.org/10.5281/zenodo.7157800
  • Pp. 14-19 in Yann Le Franc, Luiz Bonino, Hanna Koivula, Jessica Parland-von Essen, & Robert Pergl. (2022). D2.8 FAIR Semantics Recommendations Third Iteration (V1.0). Zenodo. https://doi.org/10.5281/zenodo.6675295

Why should I use a (controlled) vocabulary?

The main reason is that by having a set list you avoid inconsistencies through typos and the use of different words for the same (or a similar) concept. This, in turn, helps to make your data more interoperable with other datasets. It also facilitates searching and filtering of your dataset, especially if relations between terms are specified. For example, even if you have only specified sub-disciplines in your dataset, you could still easily find all entries related to the ‘humanities’ in general, if these parent-child relationships exist in the thesaurus you have used. Using a controlled vocabulary also helps with machine-readability. 
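Both benefits, catching inconsistent terms and filtering via parent-child relations, can be shown in a small sketch. The mini-thesaurus and the dataset entries below are invented for illustration and do not come from an existing vocabulary.

```python
# Hypothetical mini-thesaurus: each term points to its parent
# (None means the term is at the top level).
PARENT = {
    "humanities": None,
    "history": "humanities",
    "archaeology": "humanities",
    "social sciences": None,
    "sociology": "social sciences",
}


def falls_under(term, ancestor):
    """True if term equals ancestor or has it somewhere up the parent chain."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)  # unknown terms have no parent
    return False


# Invented entries; "archeology" is a deliberate spelling variant.
entries = [
    ("dataset A", "history"),
    ("dataset B", "sociology"),
    ("dataset C", "archeology"),
]

invalid = [name for name, term in entries if term not in PARENT]
humanities = [name for name, term in entries if falls_under(term, "humanities")]
print("terms not in the vocabulary:", invalid)
print("humanities entries:", humanities)
```

Dataset C is flagged because its term is not in the vocabulary, and for the same reason it is missed by the ‘humanities’ filter: exactly the kind of inconsistency a controlled vocabulary helps to prevent.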


 

Where can I find suitable vocabulary terms for my data?

Which terms to use, and from which vocabulary, depends on the context – mainly on your discipline and topic of research. Because interoperability is the main aim here, it is worthwhile checking what others in the same field are doing: which terms are they using, and do these come from a certain controlled vocabulary or thesaurus? 

Other good starting points are registries and repositories for vocabularies / ontologies. Bartoc gives an overview of such registries, repositories and other vocabulary services, and it is also a registry in its own right, for any discipline. You can also search the FAIRsharing – Standards registry. Some disciplines, like the biosciences, have quite extensive ontology repositories. For some other disciplines such registries do not yet exist as such at the time of writing, but there are ‘awesome’ lists of recommendations for the humanities and the social sciences, for example.


The term I need is in no vocabulary, what now?

Because vocabularies are by default simplifications of reality, not all words will be available as vocabulary terms. There is often a balance to be struck between having terms that are applicable across different research projects, and therefore interoperable, and having more, or more specific, information contained in the term. If you do need a certain term for your data, but it is not part of the vocabulary you use, you can do two things:

  1. You can contact the people responsible for the vocabulary. This may seem daunting, but a good vocabulary is built on community consensus and has procedures in place exactly for this. Your term may, if agreed by ‘the community’, be the next new term in the vocabulary!
  2. You could simply use your own term in addition to the other terms from one or more vocabularies. To still make your dataset optimally interoperable, we then advise you firstly to explain the term well (for example in your readme file) so that there can be no ambiguity about what you mean by it, and secondly to ‘map’ the term to a term in an existing vocabulary. With tabular data you can think of an extra column, or a ‘mapping’ table. For example, if you study ancient tombs and certain region-specific types are important to you but not available in an existing vocabulary, you could of course still use your typology, but also indicate that your term(s) are meant as ‘children’ of the parent term ‘tombs’ in the Getty Art and Architecture Thesaurus (http://vocab.getty.edu/page/aat/300005926). Depending on your knowledge of how to do this or the help available in your institution, it is preferable to do this using a machine-readable syntax and format.
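A ‘mapping’ table for the tomb example could be produced as a simple CSV file, as in the sketch below. The local terms, their definitions, and the column names are invented for illustration; only the parent URI is taken from the text above. A CSV file is machine-readable in a basic sense, but it is not a formal semantic mapping format.

```python
import csv
import io

# Parent term 'tombs' in the Getty Art and Architecture Thesaurus,
# as given in the text above.
AAT_TOMBS = "http://vocab.getty.edu/page/aat/300005926"

# Local terms and definitions are invented for this sketch.
LOCAL_TERMS = [
    ("rock-cut chamber tomb", "Tomb cut into bedrock with a single chamber."),
    ("stone-lined pit tomb", "Pit grave whose walls are lined with stone slabs."),
]


def mapping_table(terms, parent_uri):
    """Return CSV text that maps each local term to a parent term URI."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["local_term", "definition", "relation", "parent_term_uri"])
    for term, definition in terms:
        writer.writerow([term, definition, "child of", parent_uri])
    return buffer.getvalue()


table = mapping_table(LOCAL_TERMS, AAT_TOMBS)
print(table)
```

Such a table can be deposited alongside the data and referred to in the readme file, so that both the explanation of the term and its mapping travel with the dataset.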
