The web provides access to millions of datasets. This data can have additional impact when used beyond the context for which it was originally created. Reuse is supported by several recent developments, from global initiatives to publish scientific and government data, to policies that mandate data sharing in some domains, to a growing number of data science communities. There is plenty of advice on how to make a dataset easier to reuse, including technical standards, legal frameworks, and guidelines. While some of these are more widely adopted than others, overall we have very little empirical insight into what makes one dataset more reusable than another, and which of the existing enabling tools and principles, if any, make a difference.
In this paper, we propose a way to close this gap. We explore potential reuse features through a literature review and present a case study of openly accessible datasets on GitHub, a popular platform for sharing code and data. We describe a corpus of more than 1.4 million tabular data files drawn from over 65,000 GitHub repositories. Using GitHub's engagement metrics as proxies for dataset reuse, we relate them to the reuse features identified in the literature and devise an initial deep-neural-network model to predict a dataset's reusability.
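To make the approach concrete, the sketch below illustrates the general idea rather than the paper's actual pipeline or model: a small feed-forward network is trained to predict a synthetic engagement proxy (log-scaled star counts) from hypothetical per-repository features. All feature names, coefficients, and data are illustrative assumptions, and scikit-learn's MLPRegressor stands in for the deep model.

```python
# Minimal sketch (not the authors' implementation): predict a reuse proxy
# from hypothetical dataset features with a small feed-forward network.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical per-repository features inspired by the reuse literature:
# has_readme, has_license, number of tabular files, mean file size (kB),
# mean column count. In the study these would be extracted from the corpus.
n_repos = 5_000
X = np.column_stack([
    rng.integers(0, 2, n_repos),      # has_readme (0/1)
    rng.integers(0, 2, n_repos),      # has_license (0/1)
    rng.poisson(20, n_repos),         # number of tabular data files
    rng.lognormal(4, 1, n_repos),     # mean file size (kB)
    rng.poisson(8, n_repos),          # mean column count
])

# Synthetic engagement proxy, e.g. log(1 + stars); in the study this would
# come from GitHub's metrics rather than being simulated.
y = 0.8 * X[:, 0] + 0.6 * X[:, 1] + 0.02 * X[:, 2] + rng.normal(0, 0.5, n_repos)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small feed-forward regressor stands in for the paper's deep model.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)
print(f"held-out R^2: {model.score(X_test, y_test):.2f}")
```

The same structure carries over to the real setting: replace the simulated features with ones extracted from repositories and the simulated target with observed engagement metrics.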
This work highlights the practical gap between high-level principles and the actionable insights that would allow data publishers and tool designers to implement functionality that demonstrably facilitates reuse. We believe our findings can motivate new, empirical approaches to studying data sharing and reuse, which could help identify high-value data assets, inform the planning and resourcing of data-publishing projects, and assess the impact of open-data policies.