Data publishers can improve the quality of their datasets by identifying and addressing potential issues prior to publication with the help of the new GBIF data validator.

The service runs the same checks as those carried out after datasets are published on GBIF.org, so publishers can find and fix errors beforehand rather than having them spotted and flagged once the data are public. It is also the first tool that interprets and validates a dataset’s content as well as its structure.
Users who upload or drag and drop a dataset in an accepted format to the validator quickly receive a report that interprets the data and highlights potential issues with its content, syntax and structure. Supported file types include Darwin Core Archives (DwC-A) and standard GBIF dataset templates as well as simple CSV files that contain Darwin Core terms in the first row. Those wishing to validate large datasets can also submit dataset URLs.
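As a rough illustration of the CSV requirement, the sketch below checks which column names in a file's first row are recognised Darwin Core terms. It is not the validator's own code: the function name `split_header` and the small `KNOWN_TERMS` set are assumptions made for this example, and the full Darwin Core vocabulary is much larger.

```python
import csv
import io

# Illustrative subset of Darwin Core term names (assumption: a real check
# would use the complete vocabulary, not this short list).
KNOWN_TERMS = {
    "occurrenceID", "scientificName", "eventDate", "basisOfRecord",
    "decimalLatitude", "decimalLongitude", "countryCode",
}

def split_header(csv_text):
    """Return (recognised, unrecognised) column names from the first row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty list if the file has no rows
    names = [name.strip() for name in header]
    recognised = [n for n in names if n in KNOWN_TERMS]
    unrecognised = [n for n in names if n not in KNOWN_TERMS]
    return recognised, unrecognised

# Hypothetical sample file with one non-Darwin Core column.
sample = (
    "occurrenceID,scientificName,eventDate,myNotes\n"
    "1,Puma concolor,2017-03-01,seen at dusk\n"
)
recognised, unrecognised = split_header(sample)
print("Recognised:", recognised)
print("Unrecognised:", unrecognised)
```

A quick local check like this can catch misspelled or custom column names before a file is submitted, but only the validator's report reflects how the data will actually be interpreted.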
Processing time varies with the size of the dataset. However, since each new validation process generates a unique job ID, users who have large datasets or are pressed for time can bookmark their report URLs and return to them later.

 Each validation report contains:
  • A quick summary of the dataset that indicates whether GBIF.org can successfully index the file or not
  • An overview of any GBIF interpretation issues with the dataset
  • A detailed rundown of any issues with the metadata, dataset core and extensions
  • The number of records successfully interpreted
  • The frequency of terms used in the dataset
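
To give a sense of what a term-frequency figure could look like, the sketch below counts how many records carry a non-empty value for each column of a CSV file. This is one plausible reading of "frequency of terms", not a description of how the validator computes it; `term_frequency` and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter

def term_frequency(csv_text):
    """Count the records that have a non-empty value for each column."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for term, value in row.items():
            # DictReader stores surplus fields under the key None; skip them,
            # along with missing (None) or blank values.
            if term is not None and value and value.strip():
                counts[term] += 1
    return counts

# Hypothetical sample: three records, two of them with a gap.
sample = (
    "occurrenceID,scientificName,eventDate\n"
    "1,Puma concolor,2017-03-01\n"
    "2,Puma concolor,\n"
    "3,,2017-03-05\n"
)
counts = term_frequency(sample)
for term, n in counts.items():
    print(term, n)
```

Columns with counts well below the number of records point to sparsely populated terms that may be worth reviewing before publication.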

Users of the data validator can also preview how their metadata will appear once it’s published on GBIF.org.

Like all GBIF tools, the data validator is open-source software, with its source code and documentation available in the project’s GitHub repository.
User feedback will be both welcome and vital to refining this service and helping data publishers resolve potential issues with their datasets quickly and effectively.