Source: © Lucinda Rogers @ Heart Agency
By Rachel Brazil
2 September 2024
While more researchers are adopting open access, open data, open peer review and open projects, some significant barriers are hindering progress
Twenty years ago the debate surrounding open science focused on access to journals. By 2020 around 25% of all chemistry papers published were open access, and now most of the major publishers of chemistry journals offer some version of open access. But more researchers are starting to realise that other elements of open science are ripe for development. There is a tantalising future where chemists share their data in ways that allow easy reuse, awakening a new era of innovation.
One of the biggest culprits slowing this down is the humble pdf file, often the format for supplementary data submitted to journals. ‘Google and all those internet indexes have trouble reading pdfs and understanding what’s in them,’ says chemist Simon Coles from the University of Southampton. ‘Discoverability of data is really hampered by the fact that this is the way we operate.’ Coles is director of the UK Physical Sciences Data-science Service (PSDS), which is working to create an interconnected lake of data from UK physical science research.
The urgency to do this is in part linked to the recent explosion in machine learning methods. ‘There’s no hope of us even getting out of the starting blocks with all these fancy new technologies if we don’t have the right data to train the algorithms,’ says Coles. ‘We’re like a Porsche with shopping trolley wheels on – we can’t get anywhere because of the foundations on which we’re building.’
Crystallography is one part of chemistry ahead of the game. In the 1990s the community invested time and effort into developing methods to share data. What started with a crystal structure database developed into a common language to describe the data and then an information file format. ‘Towards the end of the 90s, we were actually publishing papers in this cif format … it was very easy to pick up an example from a crystallographic experiment and run simulations that you could then extrapolate into a whole different space,’ says Coles.
The PSDS, a partnership between the University of Southampton and the UK’s Science and Technology Facilities Council, is now hoping to create the type of architecture that will allow open sharing in all areas of chemistry. For about four years they have been working on an infrastructure to collect, collate and curate data – in other words, to ensure the data can be found, re-used and analysed. It’s an enormous task, says Coles, because the data chemists use is so diverse: ‘It can be everything from biological systems to materials and all in between.’ There has been progress in the fundamentals of assigning chemical structure data, but it’s still early days: ‘The ability to unify this sort of disparate data is still a long way out.’
Similar initiatives are going on in Germany’s chemistry community through NFDI4Chem, one of 30 government funded consortia to open up data. In 2020, chemist Johannes Liermann from the Johannes Gutenberg University Mainz joined a team that is setting up several data repositories and has been tasked to come up with standards for metadata as part of a five-year project, working with international organisations including the PSDS.
One of the distinctions that Liermann makes is the type of openness they are looking to promote, which they describe as FAIR – findable, accessible, interoperable and reusable. ‘It’s important, [as] this is often a concern in chemistry, to distinguish between open and FAIR data,’ says Liermann. FAIR data allows constraints on access; for example, due to a patent application. But the overarching principle of FAIR data is that there should be a record that it exists and it is machine readable.
NFDI4Chem has started with a focus on molecule related data. The harder task is encouraging cultural change through the adoption of electronic lab notebooks, which will ultimately provide a smooth journey from data collection and documentation in the lab to publication and open sharing.
New ways of working, new discoveries
Open science is more than just a way to share data – it’s also a fundamentally different and more collaborative way of doing science that could speed up progress. Medicinal chemist Matthew Todd from University College London School of Pharmacy in the UK has been experimenting with open drug discovery since the mid 2000s. He was looking for a cheap method to separate the enantiomers present in the drug praziquantel, which is commonly used in Africa to treat the parasitic infection bilharzia. The unwanted enantiomer in the current drug makes the tablet large and extremely bitter.
‘We set up a lab book online in collaboration with the University of Southampton, we got a bit of money for a research grant, and we started to post our experiments every day online,’ says Todd. They quickly developed a community of chemists willing to provide input, including many from industry, and found a cost-effective resolving agent. ‘A key thing about these open projects is that you can change what you’re doing as you’re doing it, with advice from the community, rather than waiting until the end and failing,’ says Todd, who expects distribution of a paediatric formulation of the single enantiomer drug to start soon.
Since then, Todd says, ‘everything we do is open,’ including the Open Source Malaria project that has produced publications featuring more than 50 authors. His collaborative open working approach now needs to be matched with open data. He points to the global Protein Data Bank that since 1971 has served as the single open repository of information about the 3D structures of proteins, nucleic acids and complex assemblies. This ultimately led to AlphaFold, the AI system developed by Google DeepMind that can now accurately predict 3D protein structure from its amino acid sequence. ‘We want to do the same thing for hit finding for new targets,’ says Todd. ‘That will only work if [the data is] open.’
This is in the pipeline with initiatives such as the Structural Genomics Consortium (SGC), a global public-private partnership that intends to openly publish small molecule screening data for thousands of human proteins. Drugs have already been inspired and developed by the private sector from such probes made available by the SGC. ‘That’s a great example of how people need to relax a little bit about sharing stuff, because it can still lead to great projects in the private sector that can help people, even if there’s no IP,’ says Todd.
Under review
One aspect of science that remains stubbornly closed is peer review, with most journals still opting for private and anonymous reviews. Ken Carslaw at the Institute for Climate and Atmospheric Science at the University of Leeds, UK, co-founded the journal Atmospheric Chemistry and Physics in 2001, which pioneered a model of transparent peer review where papers submitted are published as preprints alongside reviews and author responses and remain there even if not accepted for publication.
With an open approach, ‘people are more civil when they’re writing their reviews, they also tend to be more thorough, because they know it will be public and not just for the editor’s eyes, even if it’s not attributed,’ says Carslaw (most reviewers still prefer to stay anonymous). He also thinks it improves the quality of submissions because people know they will immediately be public.
Carslaw would like to see more journals being open: ‘It’s good for public confidence in science that we’re not hiding things, [it] prevents scientific fraud, [it] prevents all manner of things.’ But he doesn’t see much interest in the model in chemistry, which he suggests may be linked to the dominance of large publishers. The discipline has also generally been slower than others in opening up: the chemistry preprint server ChemRxiv, for example, launched 26 years after arXiv, its physics equivalent.
The situation is similar with data. ‘In some disciplines, the access to underlying data has got a lot better, or the culture has changed,’ says Coles. ‘In chemistry that hasn’t changed massively.’ Todd also sees chemistry lagging behind disciplines such as genomics. He suggests this may be because chemists create novel molecules, which engenders a greater sense of ownership.
One big problem is the lack of incentives to work in an open way. ‘People are nervous they will get scooped if they share everything,’ says Todd. But open lab books overcome this concern ‘because everything is time stamped and it’s very clear who’s done what and when’. There is also little recognition or reward for the time and effort it takes to work in a more open way – it is unlikely to get you a promotion or grant funding. Moving forward, Coles thinks there needs to be a data citation system where the originator of any data re-used in a follow-on study gets some credit.
Intellectual property is another barrier that Sabina Leonelli from the University of Exeter, UK, says is a particular issue in chemistry. Leonelli is a philosopher of science who studies open science and transformations in research systems. She thinks industry involvement keeps a lot of data out of the public domain: ‘This research is not even findable according to the FAIR principles.’ Plus, Leonelli says, ‘what we’re seeing in a lot of private companies is actually tightening up of trademarking and intellectual property around data, because people are now so aware that it is a very valuable asset.’
In academia at least, there is a push for culture change, but Todd says the process is still ‘difficult and slow’. Liermann thinks the key is convincing colleagues that they are the ones who will benefit from a more open system – for example, allowing them to keep better track of data created within their own research groups, rather than it being hidden in a shelf full of theses.
Leonelli is also interested in how science can be more open and inclusive globally. Current open data initiatives still don’t allow those without large resources to participate and she has seen the discrepancies first hand in her interactions with crop researchers in Ghana. ‘In many places in the world, there are structural issues with accessing broadband, with accessing the kind of computing facilities and expertise that allow you to take advantage of some of these tools.’ This often adds to the problem of research agendas skewed to the interests of richer countries, with globally important areas like agriculture receiving little attention. Can science really be considered open if it doesn’t address the needs of the whole world?
For Coles there is a future, not too far away, where open science has revolutionised the way chemists work. He envisions chemists using shared data and machine learning tools to test hypotheses and only then resort to the laboratory to confirm the results – a laboratory that may itself be fully robotic. ‘If we get it right with respect to pooling and managing and indexing our data, then we’re not that far away from that goal,’ he says.