When there’s so much searchable literature readily available, do you ever wonder how much of it is truly original?
A paper in ‘Scientific Data’ by Lukas Gienapp et al. describes the development, compilation and validation of the Webis-STEREO-21 dataset, a huge collection of 91 million cases of reused text passages found in 4.2 million unique open-access publications. The dataset is designed to act as a foundation for investigating scientific text reuse within and across disciplines.
Assessing the scale of ‘text recycling’ and plagiarism, whether intentional or unintentional, is technically challenging, so this initiative is an important step towards making that assessment possible. You can read the full article at: https://www.nature.com/articles/s41597-022-01908-z