Standard Generalized Markup Language (SGML) and Extensible Markup Language (XML) are markup languages used for text documents and datasets, both to present them to people and to enable data exchange between computers.

XML is a simplified subset of SGML: every XML file is also an SGML file. Because XML has a much stricter syntax, it is easier to parse and validate. HTML (Hypertext Markup Language) originated as another application of SGML (up to and including version 4.01); it is primarily intended for presenting rich text and layout and for hyperlinking to other documents.

In addition to “regular” HTML there is also XHTML, which is HTML under the stricter rules of XML.

SGML and XML are hardly being developed any further. HTML version 5 was officially recognized as a W3C standard (Recommendation) in October 2014. As Web technology continues to evolve, HTML is expected to be developed further.

XML, HTML and SGML are common and suitable markup language formats, provided the files are valid and complete (see the Validity and Completeness sections below). Apart from these formats, there are XML-based or SGML-based formats that can only be read by special software. Such files cannot be accepted without further verification; please check with DANS.

Validity

A valid markup language document is well-formed and also complies with the rules that apply to its file format.

A well-formed document follows the basic syntax rules of the markup language. Well-formed XML complies with syntax rules that state, among other things, that the character encoding actually used matches the encoding declared in the file; that the file contains no prohibited characters; that there is exactly one root element; and that each <tag> is correctly terminated with a matching </tag>.
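The well-formedness rules above can be checked with any XML parser. The following is a minimal sketch using Python's standard library; the function name is_well_formed and the sample strings are illustrative assumptions, not part of any DANS tooling. Note that it only tests well-formedness, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser rejects unclosed tags, multiple root elements,
        # prohibited characters and similar syntax violations.
        return False
    return True


good = "<root><item>text</item></root>"
unclosed = "<root><item>text</root>"   # <item> is never closed
two_roots = "<a/><b/>"                 # more than one root element

print(is_well_formed(good))       # True
print(is_well_formed(unclosed))   # False
print(is_well_formed(two_roots))  # False
```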

The rules governing the content of a markup document are described in a DTD (Document Type Definition) or (XML) schema file. At the top of XML and HTML documents there is a reference to the DTD or schema used. This reference should resolve to the actual schema file. Ideally, the schema file is included with the deposit, unless it is available from a reliable public service.
If a non-standard schema or DTD file is used, the data depositor should consult DANS beforehand.
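To see which DTD or schema a document refers to, the reference can be read from the file itself. The sketch below uses Python's standard library and assumes the two most common conventions: a DOCTYPE declaration for DTDs, and the xsi:schemaLocation or xsi:noNamespaceSchemaLocation attribute on the root element for XML schemas. The function names and the sample documents (including the example.org namespace and note.xsd) are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def dtd_reference(xml_text):
    """Return (publicId, systemId) from the DOCTYPE, or None if there is none."""
    doctype = minidom.parseString(xml_text).doctype
    if doctype is None:
        return None
    return (doctype.publicId, doctype.systemId)


def schema_locations(xml_text):
    """Return the schema file locations named on the root element."""
    root = ET.fromstring(xml_text)
    # xsi:schemaLocation holds "namespace location" pairs; keep the locations.
    pairs = root.get(f"{{{XSI}}}schemaLocation", "").split()
    locations = pairs[1::2]
    no_namespace = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
    if no_namespace:
        locations.append(no_namespace)
    return locations


xhtml = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body/></html>"""

note = """<note xmlns="http://example.org/note"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://example.org/note note.xsd">hello</note>"""

print(dtd_reference(xhtml))
print(schema_locations(note))  # ['note.xsd']
```

A relative location such as note.xsd indicates that the schema file is expected next to the document, so it would have to be supplied along with it.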

Through schemas and DTDs, entirely new “file formats” can be defined, such as SVG (Scalable Vector Graphics, for vector images), TEI (Text Encoding Initiative, used to format and annotate text), and MathML (for mathematical formulas).
The World Wide Web Consortium (W3C) manages the specifications for HTML and XML, and provides a Markup Validator that can validate both XHTML and HTML. In addition, it can validate a number of other formats, such as MathML and SMIL.

Completeness

A markup language document may depend on other file formats, either in separate files or embedded within the same file. All files associated with an XML/HTML/SGML file must be included. Common related files are XSLT stylesheets, CSS definition files and JS/ES scripts; see the related files listed below.
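Completeness of an HTML file can be checked roughly by listing the local files it refers to and testing whether each one is present. The sketch below uses Python's standard library and, as a simplifying assumption, looks only at link href and script src attributes; the class and function names and the sample page are hypothetical. References with a URL scheme (such as https:) are treated as external and skipped.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse


class ReferenceCollector(HTMLParser):
    """Collect href/src values from <link> and <script> tags."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and attributes.get("href"):
            self.references.append(attributes["href"])
        elif tag == "script" and attributes.get("src"):
            self.references.append(attributes["src"])


def local_references(html_text):
    """Return references that point to local files rather than to URLs."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    collector.close()
    return [
        ref for ref in collector.references
        if not urlparse(ref).scheme and not ref.startswith("//")
    ]


def missing_files(html_text, base_dir):
    """Return local references that do not exist under base_dir."""
    base = Path(base_dir)
    return [ref for ref in local_references(html_text) if not (base / ref).is_file()]


page = """<html><head>
<link rel="stylesheet" href="style.css">
<script src="app.js"></script>
<script src="https://example.org/lib.js"></script>
</head><body></body></html>"""

with tempfile.TemporaryDirectory() as folder:
    Path(folder, "style.css").write_text("body { margin: 0; }")
    print(missing_files(page, folder))  # ['app.js']
```

A fuller check would also follow references inside the related files themselves, for example @import rules in CSS or xsl:include in XSLT.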

Preferred formats 

  • XML (.xml)
  • HTML (.html)
  • Related files: .css, .xslt, .js, .es

Non-preferred formats 

  • SGML (.sgml)
  • Markdown (.md)