
By sharing our data, and doing this in an open, public, community fashion, we can determine the best practices for our field

Name: Prof Laura A. Janda
Position: Professor of Russian Linguistics
Institution: UiT The Arctic University of Norway
Country: Norway


An interview with Prof Laura A. Janda on 18 April 2017

What experiences made you realise the importance of sharing data?

Around 2007, I went to a conference and I realised that I needed to learn statistics. I went back to my university and took some courses, and since then I have written a textbook on statistics for linguistics, and I have developed a course here at UiT for linguists who want to use statistical methods in their research. I realised early on that one of the hardest things about learning to use statistics is figuring out what model fits your data, and it really helps to see examples of what other people have done. If I could see an example and see that it is similar to what I have done, then I could more easily relate to it. When I started, I didn’t take my courses in the Linguistics Department, but in the Psychology Department, with psychology professors as my teachers. Psychologists have been working with statistics much longer than we have, so they are much further along on this learning curve.

Another experience that has pushed me in the direction of Open Data is my work as associate editor of our journal, Cognitive Linguistics. While our journal has always been data-friendly, and there has never been an issue published that didn’t have a statistical analysis of data, around 2008 we crossed the 50% line for the first time. Over 50% of the articles published in our journal involve statistics, and we are probably never going back. While I don’t think we will ever make it to 100%, we are now very much dominated by statistical analyses of data.

Also, I found that it’s a problem as an editor and as a reviewer if you can’t see the data. It’s very important to provide access to the data so that others can see how it was done and learn from it, or even try to replicate it. In this way, we support the scientific method and the integrity of our field overall. It’s also important for transparency, to avoid fraud. We haven’t had any big scandals in linguistics the way that we have seen, for instance, in medicine, but it’s always possible for people to fudge their data a little bit. This is harder to do if the data is all open and public.

How are you involved with Open Data?

I felt it would really help if we had a single source for linguists to find data and code, and learn about it. We had the idea of launching a website that would house these kinds of open data resources, and we approached our library. To our great delight, they thought this was a wonderful project and were willing to spend months – and even years – on it, handling many of the professional and technical aspects, which would have been very difficult or impossible for me to tackle on my own, or even with help from my colleagues here in linguistics. This was very much a partnership, and we were lucky to be aided by our excellent library colleagues.

Working with TROLLing has even changed my own working habits. When you do a lot of theoretical and statistical studies, it can be hard to locate your own data or even understand how it was put together if it isn’t annotated well enough, especially once you’ve moved on to something new. Today, of course, I know exactly what all those fields mean, but will I know in a month, or in a year, or in 10 years? One nice thing about having a resource like TROLLing is that it really forces me to upload all my data in a place where I can find it again, and I can direct others to find it. Also, if I have gone through the exercise of annotating the data in a way that I hope makes it clear even to somebody who doesn’t know me and has no previous knowledge of my data, then, hopefully, it will be clear enough for me when I revisit the data later. Nowadays, it’s easier to go back to TROLLing to find my own data and code – and I know it’s always there, and it’s safe – rather than having to dig around in my own files.

I use my open data in teaching, too. There is a textbook that I use in my course, with some datasets and analyses for people to go through. But I have my own data, and there is something different about using your own data, because you know it so well. I give my students a dataset for each type of statistical analysis they are to learn. I give them my own dataset and my own code, and then we work through it. I can answer all their questions and really give them a full experience of what it’s like to work with data and code. It’s not like you can just collect data and shovel them over to some statistician; say the word “verb,” for example, and the shutters go down and he or she may not understand the linguistic terms. You have to analyse the data yourself, because the statistician will never understand it the way you do. Also, you have to have some idea of what the models are that you are going to use in the end, in order to collect the data that will be amenable to that kind of modelling in that kind of analysis.

One of my colleagues said, when we were making the instructional videos: “Laura, you have to make these instructional videos such that even your grandmother could upload data onto TROLLing.” I think we came pretty close to that. I think it’s fairly self-explanatory with the instructional videos, and I have always felt that research and teaching go hand in hand. I have never been involved in a research project that didn’t have some sort of teaching angle to it. Conversely, whenever I am teaching, I always try to think about what we still need to learn. That is one of the great things about teaching: you see the students, you can see the gears turning in their heads, and you can see that they see things from a different perspective. I learn from them constantly, and that again feeds back into the teaching and research. It’s a continuous cycle.

The students, therefore, are getting a simulated experience of hands-on working with the data. They get the data, they get the code, we go through it, we all sit there together, they all have their computers open; it’s a hands-on experience of working directly with the data.

What do you consider to be Open Data concerns?

One thing has concerned me quite a bit recently. We sometimes have trouble finding academic research positions for many of our graduates in linguistics. However, there are some corporations that are very interested in hiring statistically capable linguistics graduates – mostly big corporations like Google, Amazon, Apple, Facebook and such. And these are the public organisations, in the sense that everyone knows they exist. But they are doing a lot of clandestine research on you and me, using linguistics and big data, and everything that they do is kept undercover as company secrets. It’s spyware, let’s put it that way. They are spying on us; they are using linguistics and data techniques in order to spy on us. And they are not alone. There are also various governmental organisations running similar spyware operations. This is something that is pretty much unstoppable. It’s going to happen and we can’t prevent it. But the more that we put things out there ourselves and make things as public as possible, I think that is our only defence – that we have all these things in plain sight, and not let it all be shut behind the doors of spying operations and major corporations.

What inspires you and makes you optimistic about the future of Open Science?

I think that statistical studies and data studies in linguistics are here to stay. That’s definitely part of our future. In the future, probably all linguistics programs will have courses in statistics for students, and statistical analysis will be expected when submitting articles to journals. My hope for the future, then, is that TROLLing will continue to be a clearinghouse for those materials, a place where people can upload their materials, share with each other and learn from each other. One never knows when collecting data what sort of structure in that data might have been overlooked, structure that somebody else could find. That is one of the really exciting things about this time that we are living in, that suddenly we have access to so much data and a way to look for the structure, thanks to the sophisticated statistical software. In that sense, I think we are living in very exciting times.

I want to mention the dissertation by Jaap Kamphuis that was defended in Leiden. I had met the author at a couple of conferences, and we were familiar with each other’s work. I was asked to be an examiner at his dissertation defence. I received a copy and was reading through it when I realised that he had taken the method that we had used, and he had gotten it from TROLLing, from our open data site. He had applied the method to different data, and used it in a different way; it was so exciting that I practically cried! This wouldn’t have happened if it weren’t for TROLLing. He might have read my article, but would have then had to call me to find out what methodology I’d used. Instead, he was able to go to TROLLing, download it, and see how it was done, and he said: “Yeah, I can do the same.”

What still needs to be done to get more people to share and open up their research data?

A big challenge is to educate people so that they understand that everybody gains, that nobody loses anything. That is also one of the things that we have safeguarded in TROLLing. We have instructions on how to cite the data, and once you place your data in TROLLing, everyone will recognise the data as yours, because your name was associated with it first. We record posting dates and related information. You can’t lose anything; you can only gain — more perspectives from more researchers and possibly greater interest in your research.

Finally, could you mention one important positive consequence of data sharing?

In psychology, they have been doing statistical analyses for a long time, and in linguistics we have come to this rather late. But that means that we are in a formative period where we are discovering which methods are going to work best for us. By sharing our data, and doing this in an open, public, community fashion, we can determine the best practices for our field, and help it advance by setting standards. I think this is really important.

Copyright: Creative Commons CC-BY Licence. University Library, UiT The Arctic University of Norway

Tags: Open Data, Open Science, accessibility, career development, code, community, data citation, data curation, education, ethics, fraud, integrity, peer review, privacy, provenance, public, publishing, replication, repository, research, sharing, skills, software, standards, statistical analysis, teaching, tools, transparency, visibility
