From the perspective of Julie M. Birkholz, Postdoctoral Researcher on the WeChangEd Project:
As the data scientist on the WeChangEd project, I am responsible for two core tasks: 1) maintaining the digital relational database where we store, organize and query our data within the NodeGoat infrastructure, and 2) investigating, through a network approach, the relational components of the women editors, periodicals and contributions. These two tasks are fundamentally linked, as the project leader, Marianne, had the insight to consider the entire research pipeline when developing this project. In this short post I aim to highlight a few digital humanities aspects of this research project, focusing specifically on the data and the research pipeline. In an upcoming post I will discuss the second task of my work, network analysis, so stay tuned!
A data pipeline is a term I use to explain the process that the physical data undergoes: from identifying possible sources, to the method of identifying and extracting from those sources, to the method of storage, identification, organization and analysis. Although often overlooked, the decisions we make when collecting and storing the data already influence the patterns we can later identify to better understand a phenomenon. For example, in this project (potential) relations are core to the research, and thus as literary scholars we may go into the archive knowing that we want to start at a certain source or look for a specific item given our research question. In doing so we explicitly ignore other information that could potentially be seen as data. Using our notes, or (let’s hope) OCR-readable PDFs, and the metadata that the archive may have stored on the items of interest, we seek to make sense of these data. We may put them into a spreadsheet or a database, which helps us to organize potential relations, and so forth.
In the WeChangEd project our data of interest are relational in nature, and thus we have organized and stored these data in a relational database. Additionally, we know that libraries, and potentially other researchers, also have (likely different) information about the items in our database, whether it be the women editors, the periodicals or even the specific contributions. In order to facilitate a possible exchange and reuse of these data we implemented an RDF data model to organize them (see Schelstraete, Jasper, and Marianne Van Remoortel. “Towards a sustainable and collaborative data model for periodical studies.” MEDIA HISTORY (2017), and the previous blog post documenting the process of organizing the dots). This permits an efficient union (the technical term for linking data) of Linked Data: data stored as triples, with specific labels/identifiers for variables/attributes, that have the potential to be linked. This maximizes the use of these data, aids our efforts to confirm their reliability, and thus results in more complete information for understanding the phenomenon in question. Storing the data in this way also means we can easily project the data as a graph or network, where we can uncover the role of the relationships that we defined within our model.
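To make the idea of a union of triples more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the project's actual data model or the NodeGoat implementation: every identifier (`ex:editor_A`, `ex:edited`, `ex:periodical_X`, and so on) is a hypothetical placeholder, and plain Python sets stand in for a real triple store.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple.
# All identifiers below are hypothetical placeholders, not project data.
our_triples = {
    ("ex:editor_A", "ex:edited", "ex:periodical_X"),
    ("ex:editor_A", "ex:contributedTo", "ex:periodical_Y"),
}

# A second, independently collected source (for example, a library catalogue).
library_triples = {
    ("ex:editor_A", "ex:edited", "ex:periodical_X"),  # overlaps with our data
    ("ex:editor_B", "ex:edited", "ex:periodical_Y"),
    ("ex:periodical_X", "ex:publishedIn", "ex:city_Z"),
}

# Because both sources use the same identifiers, linking them is a set union:
# identical statements collapse into one, new statements are added.
merged = our_triples | library_triples


def to_graph(triples):
    """Project triples as a graph: subject -> set of (predicate, object) edges."""
    graph = defaultdict(set)
    for subject, predicate, obj in triples:
        graph[subject].add((predicate, obj))
    return graph


def editors_of(triples, periodical):
    """Return a sorted list of subjects linked to a periodical by 'ex:edited'."""
    return sorted(s for s, p, o in triples if p == "ex:edited" and o == periodical)


graph = to_graph(merged)
print(len(merged))                            # number of distinct statements
print(editors_of(merged, "ex:periodical_Y"))  # editors known only from the second source
```

The point of the sketch is that shared identifiers do the linking work: once two sources label the same editor or periodical in the same way, combining them requires no manual matching, and the merged triples can be read directly as nodes and labelled edges.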
So when someone asks me about our database I have to correct them: we do have a physical space or infrastructure where we store and label these data, but it is not a closed unit with very specific boundaries or sub-corpora. Instead, this infrastructure serves as the storage and organizational unit where these data are uniformly labeled so that they can be efficiently linked with other data. This essentially means its use and reuse are limitless, which is fundamental to the project goals! Testing has already begun on an alpha version of the data store. This allows us to explore how we can accurately share these data with the public and, more importantly, encourage others to use this best practice of data modeling to expand our corpora and thus our knowledge of these women editors, periodicals and the time period.
Thus, in the WeChangEd Project the data pipeline we are implementing is not only an evolving and important resource for the study of women editors in Europe and Russia from 1720 to 1910; it also has a number of digital humanities components that help address current gaps in our understanding of how to accurately consider, reuse, project, and analyze historical relational data. I have the great job of being able to investigate all of this with an innovative and multidisciplinary team.