Git annex tutorial

Git annex tutorial software license#

With the growing complexity of handling large scale datasets, the trustworthiness of derivative data can be at stake as large-scale computations are more difficult to reproduce, comprehend, and verify. multiple copies of the data, computationally inefficient processing). Storage and computational demands strain the capabilities of even well-endowed research institutions’ high-performance compute (HPC) infrastructure - rendering the analysis of these datasets unaffordable using methods common in fields accustomed to smaller datasets (e.g. Though large-scale datasets present unique research opportunities, they also constitute immense challenges. This development is accompanied by a growing awareness of the importance to make the data more findable, accessible, interoperable, and reusable (FAIR) 2, and increasing availability of research standards and tools that facilitate data sharing and management 3. The Wind Integration National Dataset (WIND) Toolkit 1, CERN data (), or NASA Earth data () are only some of the prominent examples of large, openly shared datasets across scientific disciplines. The amount of data available to researchers has steadily grown, but over the past decade, a focus on diverse, representative samples has resulted in datasets of unprecedented size. We demonstrate the framework’s performance using two showcases: one highlighting data sharing and transparency (using the dataset) and another highlighting scalability (using the largest public brain imaging dataset available: the UK Biobank dataset). It affords the capture of machine-actionable computational provenance records that can be used to retrace and verify the origins of research outcomes, as well as be re-executed independent of the original computing infrastructure. The framework attempts to minimize platform idiosyncrasies and performance-related complexities. Here we introduce a DataLad-based, domain-agnostic framework suitable for reproducible data processing in compliance with open science mandates.

Git annex tutorial software license#

However, they also pose considerable challenges for the findability, accessibility, interoperability, and reusability (FAIR) of research outcomes due to infrastructure limitations, data usage constraints, or software license restrictions.

Large-scale datasets present unique opportunities to perform scientific investigations with unprecedented breadth.