The data we already have
In early 2024, my labmate published a method to improve brain tumor diagnosis using standard-of-care MRI. There was a press release. A television crew came to the institute. For about a week, my boss’s phone didn’t stop ringing. Then we stopped working on the project. The results were positive and the method wasn’t wrong, but the work stopped because we didn’t have enough data to seriously consider taking the algorithm to clinical use.
This is common in medical imaging research. During my PhD I watched a pan-cancer project narrow itself, almost without anyone deciding, into a colorectal-only project, because that was the data we had: an approval already signed with the colorectal department in our hospital. I also watched a paper come back from review with the most significant criticism being the absence of an external test set, which we could not produce. None of this was dramatic. It was the everyday reality of the field. The worst consequence is not the abandoned projects but what it does to your imagination. Most of the work I have seen, and most of the work I have done, has been shaped more by the data we can access than by what we want to know. You learn, after a while, to adjust your curiosity. You think of a question, and then you think: is there data for this? And if there is not, you think of a different question.
So medical imaging research lacks data. That is not news. The bottleneck I want to describe here is more specific: the data exists. It just doesn’t move. Every CT and MRI scanner in every public hospital in Spain produces images, every hour of the day, every day of the year. These images serve their primary purpose (clinical care) and then they sit on hospital servers, accumulating. Having now experienced what they can do for research, I find it a waste that so few of them are ever used a second time. They contain so much information, even when collected before any research project was defined. The brain tumor project I described was done entirely with retrospective data. This is the most ordinary thing anyone in the field will tell you, and it took me a long time to see it as a problem we could actually work to solve, rather than a fixed fact of the field.
Accessing medical image data, even as a researcher inside the affiliated institution, is extraordinarily slow and sometimes unsuccessful. You apply for approvals. You write protocols. You specify in advance what you will do with the data. You agree to conditions about storage and transfer. You wait. Sometimes the answer is yes, eventually. Sometimes it is no, or not now, or not in that form. The patient is not entirely absent from this process, but they are not active. They sign a consent form, usually agreeing to a specific study within a single institution. After that, the institution stores the data and holds the power to share it, or not, with other researchers. Most of us write the same polite fiction in our papers: data will be shared upon request. People are, understandably, protective of the data because data leads to papers, papers lead to funding, and funding enables the work. Hospital-affiliated groups have built careers on privileged access to their institutions’ patients.
But this power that we researchers have over data is wrong. How is it that somewhere along the way we accepted the idea that a measurement taken inside a hospital belongs to the hospital? Not absolutely, I know, but in practice yes: the hospital decides where the data goes, if anywhere, because the hospital bears the legal responsibility. The patient, whose body produced the measurement, is not considered an active participant in the decision.
I think the reason this hasn’t changed is that nobody has actually asked the patients in any meaningful way. I have never met a patient who was offered a way to donate their anonymized imaging study to a public research archive. I have met many patients who would do it if asked. The option does not exist because the system has never been built to ask them.
A recent opinion article in PLOS Digital Health states that “framing data sharing solely within the concept of individual ownership is insufficient,” that “true individual control is impractical and risks placing an undue burden on individuals to protect their data from misuse,” and that the answer is collaborative governance: data access committees, ethics boards, third-party intermediaries who represent patient interests by proxy. They cite UK Biobank’s Ethics and Governance Council as the model, where participants consent broadly at enrollment and an independent body speaks for them thereafter. I disagree. UK Biobank is a genuinely successful model, but we do not yet have the evidence to conclude that a patient-initiated model would fail, because nobody has built one to test it. The argument against patient initiation is currently theoretical. Granular control over every secondary use of one’s data over decades is impractical, but a patient-initiated model does not have to look like that. It can be a single, one-time decision at the point of care, when an imaging study has already served its clinical purpose. A button with a simple message: would you like to add this study, anonymized, to a public research archive? That does not seem to me a burdensome decision. It is structurally similar to the one Spanish families already make about organ donation at the moment of death, and they make it, eight times out of ten, because the system is built to make the conversation possible.
A patient-initiated model would dissolve several problems at once. If the patient is exercising a right rather than the hospital granting access, institutional liability disappears. The researcher’s competitive edge also disappears, because the dataset is no longer held inside one institution. And the patient, who actually generated the measurement, becomes an agent rather than a subject. My hypothesis is the following: when given a clear, low-friction option to donate an anonymized imaging study after it has served its clinical purpose, a meaningful proportion of patients will choose to do so. I genuinely don’t know if this is true. It might fail. Patients might not engage with the option. Or they might engage, but the resulting data quality might not be good enough to be useful. Both would be informative outcomes. The point is that the experiment has never been run.
In Catalonia, every patient in the public health system has access to La Meva Salut, a digital portal that already lets them view lab reports, book appointments, and access their imaging results. It was widely adopted during COVID when we needed it to verify our digital vaccination certificates, and it has remained in active use. So part of the technical infrastructure for patient-initiated donation already exists. After an imaging study has served its clinical purpose, a patient could see a clear option in La Meva Salut: would you like to donate an anonymized version of this study to a public research archive? If they say yes, the image, paired with anonymized minimal metadata (age, sex, anatomical region, diagnosis), is added to a Catalan open data bank as a new public dataset under a standard license.
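To make “anonymized minimal metadata” concrete, here is a small sketch of what the metadata-reduction step might look like. Everything in it is hypothetical — the field names, the whitelist, the age bracketing are illustrative choices, not a specification of how La Meva Salut or any real archive works:

```python
# Hypothetical sketch: reduce a study's metadata record to the minimal
# fields a donated image would carry. Field names and policy are
# illustrative, not drawn from any real system.

ALLOWED_FIELDS = {"sex", "anatomical_region", "diagnosis"}

def anonymize_metadata(record: dict) -> dict:
    """Keep only whitelisted fields and coarsen age to a 5-year bracket.

    Identifiers (name, patient ID, exact dates) are never copied over:
    anonymization by whitelist rather than by deletion, so a new
    upstream field cannot leak into the archive by default.
    """
    out = {k: record[k] for k in ALLOWED_FIELDS if k in record}
    if "age" in record:
        lo = (record["age"] // 5) * 5  # e.g. 63 -> bracket "60-64"
        out["age_bracket"] = f"{lo}-{lo + 4}"
    return out

donated = anonymize_metadata({
    "name": "Jane Doe",        # dropped: not in the whitelist
    "patient_id": "12345",     # dropped
    "age": 63,                 # coarsened to a bracket
    "sex": "F",
    "anatomical_region": "brain",
    "diagnosis": "glioma",
})
# donated -> {"sex": "F", "anatomical_region": "brain",
#             "diagnosis": "glioma", "age_bracket": "60-64"}
```

The whitelist design choice matters more than the details: a donation pipeline should have to argue a field *in*, never remember to strip one *out*.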
The primary outcome of this pilot would not be a perfect dataset, or even a large one. It would be a measure of how many patients choose to donate when asked. A pilot at the scale of a single Catalan hospital would generate enough data to answer the donation-rate question within months. If the rate is meaningful, this could become the start of a new public archive of medical imaging data. If it is really successful, in the long term it could go beyond medical images to other types of data. But medical images are a good place to start. They are easily anonymized, easy to explain to patients as a thing being shared, and the research community already uses standard formats. If the experiment is not successful, we will have learned something important about the limits of patient-driven models.
Two things make this experiment feasible today: COVID and AI. We understand, in a way we did not in 2019, why medical research matters and why data flow during a crisis is not optional. The cultural moment for asking patients to participate as agents, not subjects, is now. And the most useful side effect COVID has left us in Catalonia is La Meva Salut, a patient portal that exists and is trusted. As for AI, we now have a clearer language for talking about data systems and recognizing the role of data in AI’s recent successes. AlphaFold, whose developers won the 2024 Nobel Prize in Chemistry, was built on decades of openly released protein structures. In medical imaging, one of the most cited papers, nnU-Net, was developed and validated on dozens of public datasets, showing what becomes possible when the data is actually available. It is clearer now than it has ever been that the field as a whole is bottlenecked on data access rather than on method. The architectures are smart enough. They need better and more data.
The brain tumor project did not resume. We moved on to other questions, other datasets. I do not think of it as a failure, exactly. It was a normal outcome, given the conditions. But it has stayed with me as a small example of a larger pattern. There is no shortage of data in medicine. There is just a shortage of ways to use it. This shortage is the result of choices, and choices can be changed. Until they are, many more projects will end the way that one did: not with an answer, but with an absence.