“Science is based on building on, reusing and openly criticizing the published body of scientific knowledge. For science to effectively function, and for society to reap the full benefits from scientific endeavors, it is crucial that science data be made open.” Panton Principles
“A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.” Open Definition
I could not agree more. However, what do open science and open data mean for mathematics?
As exciting as the open science and open data movements are, they appear at first glance to be largely unrelated to the world of pure mathematics, which revolves around theorems and proofs instead of experimental data. And theorems and proofs are “open” the moment they are published, right? Does this mean that mathematics is already “open”?
Of course, the word “published” is loaded in this context: The debate around open access publishing in academia is ongoing and far from settled. My personal view is that the key challenge is economic: We need new funding models for open access publishing - a subject I have written a lot about recently. However, in this blog post I want to talk about something else:
What does open mathematics mean beyond math papers being freely available to anyone, under an open license?
The goal is to make mathematics more useful to everyone. This includes:
- Discovery. How can we make relevant mathematical research easier to find?
- Understanding. How can we make math papers easier to comprehend for a wider range of people?
- Exploration. How can we allow readers to interact with our work?
- Application. How can we make mathematical knowledge easier to use in many different contexts?
- Modification. How can we make it easier to build upon mathematical research?
We can open up new possibilities in each of these areas by reimagining what it means to publish mathematical research.
Mathematics as data
Examples, definitions, theorems, proofs, algorithms - these are the staples of mathematical research and constitute the main body of tangible mathematical knowledge. Traditionally we view these “items” of mathematical knowledge as prose. What if we start to view examples, definitions, theorems, proofs and algorithms as data?
Examples have always been the foundation of any mathematical theory and the discovery of new examples has been a key driver of research. As the systematic search for examples (with computers and without) is becoming increasingly important in many fields, experimental mathematics has flourished in recent years. However, while many researchers publish the results of their experiments, and some great open databases exist, experimental results often remain stuck in a tarball on a personal website. Moreover, the highly structured nature of the mathematical objects encoded has led to a profusion of special-purpose file formats, which makes data hard to reuse or even parse. Finally, there is a wealth of examples created with pen and paper that either are never published at all, or remain stuck in the prose of a math paper. To make examples easier to discover, explore and reuse, we should:
- Create decentralized databases of examples. Think both OEIS and “github for examples”.
- Promote the use of standard formats to represent structured data, such as YAML or JSON.
- Acquire the parsing skills to deal with special-purpose file formats where necessary.
- Complement LaTeX papers with data sets of examples in machine readable form.
- Make uploading data sets on the arXiv common practice.
- Publish examples even if they don’t make it in a paper.
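To make the idea of "examples as data" a little more concrete, here is a minimal sketch in Python of what a machine-readable example could look like. It stores a lattice polygon as JSON together with a few invariants computed from its vertices. The record layout, the field names and the helper functions `polygon_invariants` and `to_record` are my own illustrative assumptions, not an existing standard or database schema.

```python
import json
from fractions import Fraction
from math import gcd


def polygon_invariants(vertices):
    """Compute basic invariants of a simple lattice polygon.

    `vertices` is a list of [x, y] integer pairs, listed in order around
    the boundary. The area comes from the shoelace formula, the number of
    boundary lattice points from gcds of edge vectors, and the number of
    interior lattice points from Pick's theorem (A = I + B/2 - 1).
    """
    n = len(vertices)
    twice_area = 0
    boundary = 0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
        boundary += gcd(abs(x1 - x0), abs(y1 - y0))
    area = Fraction(abs(twice_area), 2)
    interior = area - Fraction(boundary, 2) + 1
    return {
        "area": str(area),  # exact rational stored as text, e.g. "1/2"
        "boundary_points": boundary,
        "interior_points": int(interior),
    }


def to_record(name, vertices):
    """Bundle an example with its invariants (hypothetical record layout)."""
    return {
        "kind": "lattice_polygon",
        "name": name,
        "vertices": vertices,
        "invariants": polygon_invariants(vertices),
    }


record = to_record("unit square", [[0, 0], [1, 0], [1, 1], [0, 1]])

# Serialize to JSON and read it back to confirm nothing is lost.
text = json.dumps(record, indent=2, sort_keys=True)
assert json.loads(text) == record
print(text)
```

A file like this can sit next to the LaTeX source of a paper, and anyone can parse it with a standard JSON library instead of writing a parser for a special-purpose format.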
The rise of experimental mathematics goes hand in hand with the rise of algorithms in pure mathematics. Even in areas that were solidly the domain of pen-and-paper mathematics, theoretical algorithms and their practical implementation play an increasingly important role. We are now in the great position where many papers could be accompanied by working code - where papers could be run instead of read. Unfortunately, few math papers actually come with working code; and even if they do, the experiments presented therein are typically not reproducible (or modifiable) at the push of a button. Many important math software packages remain notoriously hard to compile and use. Moreover, a majority of mathematicians remains firmly attached to low-level languages, choosing small constant-factor improvements in speed over the usability, composability and readability afforded by higher-level languages. While Sage has done wonders to improve interoperability and usability of mathematical software, the mathematical community is still far away from having a vibrant and open ecosystem like the one available in statistics. (There is a reason why package managers are a cornerstone of any programming language that successfully fosters a community.) In order to make papers about algorithms actually usable and to achieve the goal of reproducible research in experimental mathematics, we should:
- Publish the software we write. This includes publishing the scripts we use to run our experiments in order to make them easily reproducible.
- Write software to be used - even outside of our own office. Invest the time to polish and document code.
- Use common package repositories to publish software, not just the personal home page.
- Prefer high-level languages over low-level languages to make our libraries easier to reuse and our code easier to read and modify.
- Make software easy to install.
- Make coding part of the pure math curriculum, not just part of applied math.
Theorems and proofs are the main subject of the vast majority of pure math papers - and we do not consider them as data. However, opening up theorems and proofs to automatic processing by making their semantic content accessible to computers has vast potential. This is not just about using AI to discover new theorems a couple of decades in the future. More immediate applications (in teaching as well as research) include using computers to discover theorems in the existing literature that are relevant to the question at hand, to explore where a proof breaks when modifying assumptions, to get feedback while writing a proof about the soundness of our arguments or to verify correctness after a proof is done. The automatic and interactive theorem proving communities have made tremendous progress over the last decades, and their tools are commonly used in software verification. To be able to apply these methods in everyday mathematics, we should:
- Develop formal tools suitable for everyday use by working mathematicians (as opposed to experts in software verification or formal methods).
- Start formalizing parts of the mathematical articles we write.
- Create the infrastructure to publish and integrate formal content with prose articles.
- Explore the use of formal methods in teaching mathematics.
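To give a flavor of what formalized content looks like, here is a deliberately tiny sketch in Lean 4, using only the core library. The theorem name `add_comm_nat` is my own choice for illustration; the proof simply appeals to the existing core lemma `Nat.add_comm`. Real mathematical articles would of course involve far richer statements, so take this as a toy and not as a model of everyday formalization.

```lean
-- A statement and its machine-checked proof: addition of natural
-- numbers is commutative. The proof reuses the core lemma Nat.add_comm.
theorem add_comm_nat (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Once stated formally, the theorem can be applied like a function.
example : 2 + 3 = 3 + 2 := add_comm_nat 2 3
```

The point is that the statement is data: a computer can search for it, check it, and apply it to a concrete instance.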
Mathematics for people
The points mentioned so far focus on making mathematical knowledge more accessible for computers. How can we make mathematical knowledge more usable for humans?
First of all, there is of course the issue of accessibility. From screen readers to Braille displays and beyond, there is a wealth of assistive technologies that can benefit from mathematics published in modern formats. For example, MathML provides richer information to assistive technologies than do PDF documents. Adopting modern formats and publishing technology can do a world of good here and have many positive side-effects, such as making math content more readable on mobile devices as well. But even assuming readers are comfortably viewing math content on a desktop screen, there is a lot of room for improving the way mathematical articles are presented.
Communication depends on the audience. Math papers are generally written for other experts in the same field of mathematics, and as such, their style is usually terse and assumes familiarity with facts and conventions well-known to this core audience. However, a paper can also be useful to other readers who would prefer a different writing style: Researchers from other fields might prefer a summary that briefly lays out the main results and their context without assuming specific prior knowledge. Students would appreciate a wealth of detail in the proofs to learn the arguments a senior researcher takes for granted. Newcomers could benefit from links to relevant introductory material elsewhere. And everyone appreciates richly illustrated examples.
A single static PDF document is not the best tool for achieving all of the above objectives at the same time. By experimenting with dynamic, interactive documents, we can create articles that are more useful to a wider range of audiences. Documents could be “folded” by default, giving readers an overview first and allowing them to drill down for details where needed, possibly all the way to a formal proof. Examples could be presented side-by-side with the results they illustrate instead of the two being interleaved in a linear text. Software can enable readers to dynamically rearrange the text, for example by “pinning” definitions from the preliminaries to the screen while working through the proofs. Procedurally generated figures can be modified and explored interactively. Algorithms can be run and their execution observed - and articles could even be used as libraries from other software. Social annotation frameworks can allow readers everywhere to engage in a dialogue.
As soon as we leave the printed page behind us, the possibilities are endless. However, for these visions to fulfill their potential, openness is key. In particular:
- File formats have to be open and not proprietary. Everyone has to be able to create their own software for reading and writing these files.
- File formats have to be easily extensible, so that everyone can experiment with what a “document” can be.
- It should be possible to inspect a document to learn how it was written. (Think “show source” on a web page.) This way, authors can learn from each other by default.
- There is no reason why there should be separate software programs for “reading” and “writing” a document. The transition from “reading” to “working with” to “building on” can and should be seamless.
- Finally, licenses have to allow all of the above.
Conclusion
Open data matters for pure mathematics. Taking open principles seriously can transform mathematical research and make it more useful and relevant both within academia and in the real world.
To conclude, I want to add three more thoughts:
- Intuition is a form of mathematical knowledge that I have not mentioned so far. In my view, it is the most important one, but at the same time it is by far the most difficult one to convey in writing. The best way to communicate intuition is through dialogue. Intuitive explanations on a printed page can confuse more than they help, which is why they are often excluded from published papers. Dynamic documents can offer new room for experimenting with intuitive explanations - these cannot replace interaction in person, but they can be very valuable for anyone without access to an expert in the subject area.
- Open science and open mathematics have to be at least as much about education as they are about research. Open data that requires a team of data scientists and a compute cluster to make use of may create huge value for industrial applications, but excludes a large part of society. One of the key tenets of open knowledge is that it should be open to everyone. Being open to everyone is not just about licenses and price, however. It also means giving everyone the means to benefit from these resources. Open mathematics should empower not just experts, but learners at all levels.
- Making even a tiny fraction of these ideas happen will require a huge amount of work, and this work does not come for free. Making one math paper open means that a second paper is not going to get written. I think this is a worthy investment and creates far more value for society. However, as long as the academic job market values publication counts above everything else, this may not be a wise choice for many, career-wise. The transition to open mathematics will require both young researchers who are willing to take that risk and academic search committees who value innovation.