One of the inaugural Mozilla Science Fellows, Christie is an insect ecologist based in the Zipkin Lab at Michigan State University. She also works with the National Science Foundation-funded Long Term Ecological Research Network, and is interested in how we can use big(ish) ecological data and open science approaches to help build sustainable agricultural systems. The scale of Christie’s work requires collaboration, the integration of disparate data, and sound data management practices, which has led to a passion for improving data management, policy, and teaching about best practices.
- Christie’s Blog: Practical Data Management for Bug Counters
- Research website
- @cbahlai on Twitter and GitHub
- YouTube Channel
- My digital toolbox: Ecologist Christie Bahlai talks data hygiene (Nature.com)
Can we start with an overview of your work and what you do?
My name is Christie Bahlai. I’m an ecologist at Michigan State University and a Mozilla Science Fellow alumna. I was part of the first group from the inaugural year of the program.
I’m interested in open science as it relates to data and making scientific work flow more efficiently and more equitably. Using open science brings people in, and frankly, it makes my life easier.
In my research, I work on broad-scale projects where I look at systems over long time periods. The project I’ve been working on most recently started when I was eight years old. Obviously, this is something that I couldn’t have done all myself: this research relies on me working with others, and it relies on other people having good data management skills for me to be able to work with them.
Can you tell me about a time when you’ve felt a sense of success?
There are a few examples within the realm of open science, but the most recent big success was during my fellowship. I developed an open science and reproducible research course. My colleagues at the university were really skeptical about the concept at first, but I did it.
The whole idea behind this course was to give students real data that had been collected by someone else and get them to do an analysis, write up the data, write up the experiment, and publish it. We just got back our paper from Royal Society Open Science and it came back with only “minor revisions”. That’s quite unusual. Usually, papers come back with “reject and resubmit” or “major revisions.” [The paper, Thermally moderated firefly activity is delayed by precipitation extremes, has since been published.]
It was a great victory — a published paper generated entirely using open science principles. We published a preprint on bioRxiv, and two days later Science Magazine contacted me and wanted to highlight our work. It really gave legitimacy to what I was doing in open science and confirmed we didn’t have to go the way of conventional scientists.
What feedback did you get from people who were skeptical about this approach to open science and reproducible research?
Those who were skeptical were very excited that it had worked, but were a little disappointed the Science Magazine story didn’t mention their names or the name of the site where the data was collected. It was such a short story that the magazine said, “Scientists in Michigan did this,” instead of specifically naming the university or authors. But otherwise, they were really excited that it had worked as a proof of concept. In fact, they funded me to give a second offering of the class in the spring.
Have you observed working open and following open science practices propel someone’s career or work?
I was working with a student from a developing country. She was in the graduate program at my institution, and I was mentoring her later on in her thesis. We got to talking about the constraints that she would be facing back in her home country, and what we could do to best serve her needs when she returned.
She had a U.S. visa that required her to return to her home country for at least two years after her graduate program was done. That’s a very standard visa given to most people coming from developing nations into the United States. But it affects how their education needs to be framed: we can’t assume that a student will have the same access to Western resources after they finish their American education.
The U.S. institution she attended wasn’t taking into account that she would have to go back to her home country after obtaining her education. I said to her, “Where should we publish this paper? Will you be able to access a university library when you get back to your home country?” She just laughed and said, “Oh no. Well, I could if I paid $40 US dollars every time I wanted to access it. And it’s unlikely the people I’ll be working with will have access to the papers I’m producing for my thesis.”
Then, when we were looking at her analysis and her interpretation, I realized that she had been taught to do statistics using expensive proprietary software. So I said, “Are you going to have access to this when you get back to your home country?” Again, she just laughed and replied, “Oh, absolutely not. I’m never going to be using this program ever again.” I said, “Well, how would you feel if we re-did your analysis using software that you will have access to?” There are free statistical programs that she could use. She said, “Well, I’m not keen on learning a new statistical language this late in my program, but it makes a lot more sense, and will most likely make me more employable to use software that my peers can use.”
So we worked on translating all of her analysis into R, and I pushed her advisory committee in the direction of publishing in an open access journal. It’s not something we do that often in applied ecology, as article processing charges tend to be more expensive and open access journals are less recognized than the other ecology journals. [Many well-recognized journals in ecology are subscription-based and free of article processing charges, or have charges subsidized by scientific societies, but their content is only available to subscribers.]
We ended up publishing with PLOS ONE because it was among the highest profile of the group. She got to hit submit on her paper just before she left to return to her home in Central Asia — she was thrilled. Not long after she returned to her home country she got a good job with an NGO, working in development in her own country. Her commitment to open science worked in her favor, and she will be able to share those practices with others.
This student’s work also helped us to develop a program that increased wheat yields in a food-insecure nation by about 50 percent. Wheat is a staple crop in this country, but the yields were very poor compared to most of the nations in the area. Through this program, we employed simple tactics to increase profitability and yield. Having her work published in an open access journal allowed the group to say, “Oh, we should publish this result in an open access journal too!” And we did. We wanted the people of that country to have access to our work.
How about an example of a challenge?
People in my field tend to hold their data pretty closely. I work with a fair bit of proprietary data — data that is produced by others and protected by copyright, patent, or trade secrets. That’s a challenge when I’m sharing code, because if someone else runs that code it’s not going to work.
Scientists have never been trained to share data, even if it’s through publicly funded research. There’s a perception that if data isn’t shared, it’s owned. The current funding system and the prevailing culture within science perpetuate that belief. There was a big editorial in the New England Journal of Medicine early last year that referred to people who use other people’s data as “research parasites.” It’s a tough issue. I have spent a lot of time as a data-collecting researcher in the field, tediously recording each number; it’s hard work. Yet, it is fundamentally the spirit of science to build on other people’s work. Fortunately, the attitude of owned data is shifting as more journals are requiring access to it.
I’ve found making code open is an easier sell than making data open because people don’t feel as much ownership of analysis code, at least in my field, even though that is the lifeblood of what we do.
Especially with experimental scientists, data is still something that they’re very scared to share. People are terrified that when they show their work, people will find that they’re wrong. That’s frustrating to me because that’s part of the point of science. It’s supposed to be self-correcting.
Any other challenges you’ve faced?
People are very skeptical about technology and new ideas, especially when they come from people who don’t look like them. It’s been a challenge being a woman pushing technology in a field traditionally dominated by men who don’t like technology.
There’s been a lot of, “Well, that’s not how we do things,” or “this isn’t right because it’s not how we’ve always done it.” But then I explain, “What you’re thinking of as science is actually just the practices shaped by the technology you had available to you. Now we have new technology and we can use it to make things better. We can make all of our lives easier if we use technology to our advantage to make ourselves better communicators.”
Some people don’t agree with that sentiment. Some think that science is inherently competitive and that they need to keep their work secret or it will be stolen. Everyone has a story about their cousin’s friend’s uncle who had their data scooped and career ruined, but there are very few concrete examples of this happening.
There’s a self-reinforcing culture that starts with grad students early on. They take up the same practices, because they hear the same threatening stories over and over again, and then they tell these stories to the next generation of students. It’s a self-perpetuating system that keeps the people who follow these practices in power.
There’s also a concern in the field of science that group work will dilute authorship. The lone genius, or at least someone perceived as one, is the person that is rewarded the most in science. They’re the person who gets the big job, the funding, and the most likely to be published. They are seen as the person moving the field forward. I’m still a young enough, idealistic scientist to think, “You know, maybe we should try to do this better. Maybe we can advance our own work. Maybe we can advance everybody’s work by working together.” I think it’s the people working together in the trenches that are really the ones moving the field forward.
One of the advances I’ve heard about is adding citations to datasets, in addition to published research papers. Is that something you’ve seen?
That’s something that hasn’t yet been looked at in my field. While it wouldn’t benefit me much personally because I don’t create a lot of data, I consider citations on datasets important and think they should be included. Unfortunately, the reality is researchers who evaluate my performance don’t know what to do with that. They don’t necessarily look at a dataset as a research product.
I’ve been working on a dataset with a previous supervisor for years and we plan on publishing it. The data has been publicly available for over 10 years, but it hasn’t been published, so it doesn’t have a DOI [Digital Object Identifier] associated with it, which would make it easily citable.
I’ve published four papers using this dataset and every time I describe the methods like it’s a new standalone experiment. It kind of feels like reinventing the wheel each time — imagine you had to explain the complete history of evolutionary theory each time you wrote a paper on evolution, instead of just citing the key references you’re building on.
It doesn’t make sense, but that’s how it’s done with data reuse in my field. Attempts to simply cite the data have been met with strong resistance during peer review — they want the complete data collection methodology described.
A perverse consequence is that, because of rules about plagiarism, the same data collection methodology has been described slightly differently four times in the literature. You can only push so many things at once with an established researcher, so you nudge them.
My next question is how you’ve approached solving this challenge. You’ve already started to touch on this — do you have anything to add?
As far as open data, where researchers and administrators are concerned, I’ve been trying to meet them where they are — to show them where open science techniques can benefit them.
I had an issue convincing university administrators of the value of my course. The initial course offering was called, “Open Science and Reproducible Research,” which students liked, but established researchers were very skeptical of the term “open science” and what that meant. So I changed the title of the course to “Reproducible Quantitative Methods.” Instead of framing it as an open science course, I framed it as a reproducible research course, focusing on technology and how to use technology for more efficient communication.
Collaborators will shut you down pretty quickly if you come to them saying, “We’re using an open framework.” We’re often coming into very established places, and this is a relatively new idea to people. Most scientists know that you need to adopt new technology and that you need to be reproducible. Most scientists know that our field’s going to be increasingly quantitative, as we’re producing more data than we know what to do with. All it took to convince researchers and administrators of the value of my course offering was to rebrand it to fit their view.
As far as publishing research goes, I believe that publishing in an open access journal is a big step in the right direction. It’s not always easy to do, though, because it can be more expensive, or come across as less reputable because they’re less known. When I introduce the idea of publishing in an open journal, I try to find one they’ve heard of, like PLOS ONE.
The cost of publishing in an open access journal is a huge disincentive. [See @emckiernan13’s Twitter thread on this.] It’s one of the most common complaints I hear: people who want to go open access but can’t afford a PLOS ONE charge of $1450 USD. Consider that a typical research budget in a Canadian lab may be $30,000 to $50,000 CAD per year. U.S. labs tend to be much richer.
Now, turning to the broadest issue in the Mozilla universe: keeping the internet open and free. What, for you, is the open internet?
An open internet is one without walled gardens. This is directly relevant to science — publishers have created walled gardens by controlling access to papers. Then they convince the producers of the product that they need to copyright and restrict access in order to be successful.
The restriction of papers to university libraries is something that frustrates me because most people don’t have access to that knowledge. And if we just took out the walled gardens, people would have more access to scientific knowledge.
To play devil’s advocate, what about the arguments around revenue models? That journals need to charge for access to remain sustainable, given the resources they put into operations and publishing.
Many open access journals, including PLOS ONE, are struggling with revenue models. I know that PeerJ has gone through several different models, including pay to publish, charging by author, and charging per paper.
There are costs to producing scientific papers. Overhead costs for publishers include bandwidth and data storage, but most of the profit goes to the publishing enterprise itself. The current model doesn’t pay reviewers or editors of papers. It’s not the intellectual contributors being paid — and that’s a problem.
Stepping back for a moment — can you tell me about a time when the open aspects of the internet have been important for you?
Realizing the information that’s already available on the web was utterly revolutionary to me. I’m an associate scientist at one of the sites of the U.S. Long-Term Ecological Research Network. We have about 26 sites across the U.S., French Polynesia, and Antarctica. Most of the research sites were set up in the mid to late ’80s, and have a policy of making all associated data available to the public. When I realized all these datasets were available and open to the public, I realized the powerful potential of the open web. Anyone, anywhere, can use these data. The sad thing is, the data aren’t used by people outside the network as much as they could be. Now we’re all looking at how we can synthesize our data to make it even more accessible, to help people make better use of this resource.
You say this was a “revolution” to you? Can you expand? What possibilities did you see there?
I’m still growing as an ecologist — finding my place — but I’m learning that I’m a person who’s primarily interested in patterns. The patterns that I’m interested in most are how things and interactions evolve over long time periods. The data from this long-term research project is an important resource which allows us to see how things are changing over time.
One of the datasets I initially looked at followed a community of lady beetles in southwestern Michigan. We have 10 different plant communities at this site, which are sampled every week during the summer season, and there are about 13 species that we capture on the regular. We happened to catch the invasion of three different species of lady beetle and see how the community shifted and changed over time in response to this major disturbance.
Long-term studies and data also enable us to ask climate change questions. You can’t really ask a climate change question in the short term because you don’t really know if it’s climate change or if it’s just natural weather variability. Over the long term, you can tell if there are directional changes in a population, for example, or if you’re seeing more extremes. Then you can document how this is changing the community of organisms over time.
How did you get involved with Mozilla, and what has that been like?
I did a physics degree for my undergrad, which seemed like a good idea at the time, but over the course of that degree I found a real passion for biology and ecology — so for my Master’s degree I focused on these. Then, during my Master’s and PhD, I had people coming to me and saying, “Oh, Christie, you did math and physics for your undergrad, can you help me analyze my data?” I said, “Yeah, sure.”
This has been a fruitful thing for me because it means I get involved in a lot of projects. I noticed there were some issues with people’s data, so I would have to spend a significant amount of time on data management. Essentially, they weren’t documenting their data and they were formatting it in ways that were not machine-readable. I’d spend 90 percent of my time cleaning up their data before running a fairly simple analysis.
I was annoyed because there was obviously something completely lacking in the undergraduate biology training area. That’s another thing I noticed in students — and the reason I was always asked to deal with stats. Students hit the data and statistics portion of their thesis and say, “I hate this,” and then have a meltdown moment. I thought we need to address this through training, because that’s the most fun part.
So I started writing a blog about data management. I didn’t even realize there was a community around it at the time, but I got contacted by Greg Wilson of Software Carpentry, who said, “So, sounds like you and I have some ideas in common.” At the time, Software Carpentry was emerging from the Mozilla Science Lab and it has since become its own entity.
My introduction to the community came through these collaborations and through a sister group, Data Carpentry. They have spreadsheet lessons that I hear my own voice in because many of them are derived from my blog posts about data management. While active in this community, I heard about a call for Mozilla Science Fellows, so I applied.
My application focused on developing a completely new approach to training biology students in data, from statistical philosophy to data management and all the aspects of the open workflow — bringing it all together. Through this, I’ve found my people!
How has your involvement with Mozilla impacted you?
It’s been really awesome because I have resources to put towards developing my course. I have time to put towards it. I have the support of the network to put towards it. This is in addition to what I mentioned earlier about making myself credible to university administrators.
It is very powerful to have the branding and influence of a well-known tech entity behind you. When you make an argument to administrators, you can say, “I am working with Mozilla. Yes, THAT Mozilla!”
Mozilla is an entity consisting of people who are very successful at what they do. They know technology, and that gives credibility to what I’m doing in the classroom. Administrators get excited because they see there’s a community of people who know what they’re doing behind us.
In terms of the support you get from the Mozilla community, what does that look like?
I worked with people in the Mozilla Science Lab on developing curriculum. I was able to incorporate all of MSL’s experiences in this area to develop a curriculum that works, and that speaks to students.
When we developed the curriculum, I made sure to include what I feel is the most important aspect of the course: Communication. I learned this from working with Mozilla people — technology is really just a means to be a better communicator.
In the course I argue that through learning the hard skills — tech skills, such as learning to code — they’re also learning soft skills, like better communication of their science. They learn to comment on their code — to annotate it — and preserve their code better so that they can communicate with collaborators and the public more efficiently.
When students learn to share data and document it well, they’re learning to be better communicators. When they understand how to navigate authorship and software, they’re learning incredibly important soft skills in science. There are interpersonal relationship issues associated with how to author different things: how do we get credit in science and in technology? Knowing how to navigate soft skills in science is just as important as the hard skills, which is why I include them in my course.
I also bring in speakers from the community who share specific examples of what they are working on and how they navigate interactions, and how these open technologies and practices have helped them succeed.
Do you have any feedback for Mozilla, or can you tell me about a time where things didn’t really quite meet your expectations? What could be better?
I really enjoyed my time with Mozilla, but it was a bit of a culture shock in the beginning. The way things operate there is radically different from academia. In academia, the emphasis is slow: polish, polish, polish, until you go crazy and you can’t look at anything anymore. At Mozilla, everything is much faster, and that was a big adjustment for me. I don’t know if I’m necessarily the academic who spends too much time on polishing, but the pace at Mozilla was a challenge at first.
The language used in Mozilla was a bit inaccessible at the beginning. I thought, “What have I got myself into? These people don’t speak English here.” Academia has a similar issue — there’s unique jargon for every field.
Finally, it was hard to get used to the idea that people cared about what I was doing — I definitely wasn’t expecting that. In science, we’re very siloed. It seems unreal to work with an organization that cares about what’s going on in other parts of the organization.
Early on, I was skeptical of their motives. I thought, “Are people trying to steal my work? Is this what people in academia warned me about? Is that what’s going on?” I was paranoid because I was conditioned to be suspicious and guarded of others.
What do you think is behind people’s caring? What might be the rationale or values?
I feel the people are genuinely invested in the community. They want to help and help you do well, and they want you to be able to help them do well. I’ve really adopted this new mentality in my academic work: that we move forward together. It sometimes makes academics suspicious of me and how I advertise my courses to students.
The skills I’ve learned from Mozilla are so relevant in the classroom. I just don’t think that we should expect everyone to get to the same place. I think that we should work together: to each according to their needs and from each according to their skills. Everyone is different.
We should capitalize on that diversity rather than try to make everyone the same. I’ve had a number of students say, “OK, I’m starting at this point. Do you think I can still take your course?” I say, “Well, I don’t set a bar for students in this course. What I want you to do is improve your skills the most you can.”
How might these stories that we’re collecting be useful to you, if at all?
They’re very useful to me because I am constantly in the business of marketing open science to people who are hesitant about it. The more data I can share with them, the more compelling it is. Any stories that can show them how open science is helping people achieve success are useful to me.
An example of this is diversity initiatives. If I can show them proof of how open science helps retain women and minorities in science, they’ll say, “Oh, well, we should do this because we’ve been told by our bosses that we need to retain women and minorities in sciences.” And that’s a win.
One of the things I like to do when I’m talking to administrators specifically about Mozilla is to show a group picture from the ecological network that I work with and a group picture from Mozilla — then I ask them, “Which one of these looks more diverse?”
In much of ecology, it’s often about 25 to 30 percent women — where most of the upper echelons are white men. People of color are rarer still, and gender diversity — specifically non-binary gender diversity — is very poor.
In contrast, Mozilla has a wonderful cornucopia of people. You start to wonder, “Why doesn’t the ecology community look like that?” Clearly, these are people who are interested and engaged, so why does it look like a non-random sample? There are clearly biases in our sampling technique if the people in ecology look so dramatically different from the people at Mozilla.