In a briefing on Monday, research leaders across tech, academia and the government joined the White House to announce an open data set full of scientific literature on the novel coronavirus. The COVID-19 Open Research Dataset, known as CORD-19, will also add relevant new research moving forward, compiling it into one centralized hub. The new data set is machine readable, making it easily parsed for machine learning purposes — a key advantage according to researchers involved in the ambitious project.
In a press conference, U.S. CTO Michael Kratsios called the new data set the “most extensive collection of machine readable coronavirus literature to date.” Kratsios characterized the project as a “call to action” for the AI community, which can employ machine learning techniques to surface unique insights in the body of data. To guide researchers combing through the data, the National Academies of Sciences, Engineering, and Medicine collaborated with the World Health Organization to develop “high priority” questions about the coronavirus related to genetics, incubation, treatment, symptoms and prevention.
The partnership, announced today by the White House Office of Science and Technology Policy, brings together the Chan Zuckerberg Initiative, Microsoft Research, the Allen Institute for Artificial Intelligence, the National Institutes of Health’s National Library of Medicine, Georgetown University’s Center for Security and Emerging Technology, Cold Spring Harbor Laboratory and the Kaggle AI platform, owned by Google.
The database brings together nearly 30,000 scientific articles about the virus known as SARS-CoV-2, as well as related viruses in the broader coronavirus group. Around half of those articles make the full text available. Critically, the database will include pre-publication research from resources like medRxiv and bioRxiv, open access archives for pre-print health sciences and biology research.