Faculty Advisor

Naphtali Rishe

Faculty Advisor

Victor Potapenko

Location

East and Center Ballrooms

Start Date

29-3-2017 12:00 PM

End Date

29-3-2017 2:00 PM

Session

Session 2

Session Topic

Poster

Abstract

Computer scientists undertaking research often struggle to find or generate the right data to formulate or test hypotheses, validate models, or test algorithms, and as a result spend more time and effort searching for or producing data than using it to perform scientific research. This data barrier is a significant impediment to scientific progress, because most of that progress occurs in the long tail of computer science: in many smaller, silo-like laboratories that rarely share data. This paper surveys nineteen existing data repositories based on their respective feature sets. Of the nineteen repositories reviewed, only six were found to have feature sets that differentiate them from standard digital libraries. The most distinctive features were in-browser data preview and manipulation, community-based data curation, data access via APIs, licensing of data, real-time data publishing, data versioning, and virtual workspaces in which users can create and manage research projects with fully traceable activity logs and interlinked artifacts such as data, papers, and code. These features were scattered among various repositories; no single repository platform provided them all as a complete package. None of the surveyed repositories offered features that would allow for the purchase or sale of data. Studies based on a representative sample show that providers of data in the scientific domain offer it for free approximately 80% of the time. The survey showed that no single data repository provides the right combination of features and tools geared towards the long tail of computer science. Moreover, none of the surveyed repositories provides a monetary incentive for computer scientists to share data. We contend that a true data marketplace that implements a particular set of features would lower data-related barriers on the path to scientific progress.

Comments

**Abstract Only**

File Type

Poster

A Review of Data Repositories for the Long Tail of Computer Science

Rights Statement

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).