The responsible use of AI in science is no longer just a matter of good practice; it is increasingly part of the wider regulatory and governance landscape. In the European Union, for example, the AI Act includes transparency obligations (Article 50), reinforcing the idea that AI-generated outputs should not be treated as neutral, self-validating, or beyond scrutiny. That legal reference matters, but the deeper point is practical rather than formal. In scientific work, the real question is not simply whether AI has been used, but whether its use has been framed in a way that makes the output understandable, controllable, and fit for purpose. Responsible AI in science, in other words, is not just about disclosure; it is about designing processes that make AI outputs usable, reviewable, and accountable.
Framing the problem: orchestration vs non-orchestration
A useful way to think about this is to move beyond the blurred label of “AI” itself. Today, the boundary is not always obvious. Would you call a simple script that produces a summary of a scientific paper “AI-based”? What about a workflow that classifies study types, extracts variables, flags uncertainty, and sends selected cases for expert review? These are very different systems, yet they are often discussed under the same umbrella. That is why, in practice, the more meaningful distinction is between non-orchestrated and orchestrated use. Non-orchestrated use is the familiar chat-based pattern: drag and drop a genotoxicity paper into ChatGPT, ask what the study shows, maybe ask whether it looks positive or negative, and move on. This can be helpful for quick exploration, but it remains fragile: context is managed manually, outputs do not naturally scale across many documents, and the model may drift as the context window fills with mixed instructions, partial evidence, or earlier assumptions. Thus, the real issue is not whether a tool looks intelligent, but whether the workflow around it is controlled enough to support scientific judgement.
Why orchestration matters
In our view, the value of AI in science emerges when language models are embedded within structured workflows rather than used as standalone answer engines. A general chat can help interpret a paper; an orchestrated system can help build an evidence base. Consider, for example, a scientific project currently being developed at Innovamol: the structured extraction of genotoxicity data from different sources. A non-orchestrated use might mean uploading one paper at a time and asking open questions about the result. An orchestrated pipeline, by contrast, can process many documents in batch, apply a predefined schema, separate deterministic fields from interpretation-heavy fields, and flag uncertain cases for review. Deterministic elements might include study identifiers, OECD test guideline references, species, strain, route of exposure, dose groups, or bibliographic metadata. Less deterministic elements might include the rationale for the derived study outcome, the interpretation of narrative evidence, or the mapping of free text to endpoint categories. What matters here is not only technical design, but also scientific design. A strong orchestration layer requires knowing which variables can be standardised safely, which require interpretation, and where uncertainty should be surfaced rather than hidden. In other words, orchestration is not a cosmetic layer on top of a model; it is the discipline that determines whether the model will generate convenience or scientific value.
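To make the idea concrete, the split described above can be sketched in a few lines of Python. This is a minimal illustration, not Innovamol's actual pipeline: the field names (`study_id`, `oecd_tg`, `outcome_rationale`, and so on) are hypothetical placeholders standing in for a real extraction schema.

```python
# Hypothetical schema for a genotoxicity study record. Deterministic fields
# can be captured and validated mechanically; interpretation-heavy fields are
# routed to an LLM step and always remain open to expert review.

DETERMINISTIC_FIELDS = {
    "study_id", "oecd_tg", "species", "strain", "route", "publication_year",
}
INTERPRETIVE_FIELDS = {
    "outcome_rationale", "endpoint_interpretation", "uncertainty_notes",
}

def route_fields(raw: dict) -> dict:
    """Split one raw extraction into deterministic vs interpretation-heavy parts."""
    deterministic = {k: v for k, v in raw.items() if k in DETERMINISTIC_FIELDS}
    interpretive = {k: v for k, v in raw.items() if k in INTERPRETIVE_FIELDS}
    unknown = set(raw) - DETERMINISTIC_FIELDS - INTERPRETIVE_FIELDS
    return {
        "deterministic": deterministic,  # validated against rules and vocabularies
        "interpretive": interpretive,    # sent to an LLM step, then expert review
        "unknown": sorted(unknown),      # schema violations surfaced, not hidden
    }
```

The point of the sketch is the separation itself: the schema, not the model, decides which variables are treated as facts to validate and which as interpretations to review.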
In other words: good orchestration does not ask the model to do everything; it uses the model where language understanding adds value, while anchoring the rest in rules, structure, validation, and fallback logic.
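The "rules, structure, validation, and fallback logic" can likewise be sketched. The snippet below is an illustrative assumption, not a real implementation: the controlled vocabularies (including the OECD test guideline identifiers) are abbreviated examples, and a production system would carry far richer checks.

```python
# Hypothetical validation step with fallback logic: values for deterministic
# fields are checked against controlled vocabularies; anything that fails is
# flagged for expert review rather than silently accepted.

CONTROLLED_VOCAB = {
    "route": {"oral", "inhalation", "dermal", "intraperitoneal"},
    "oecd_tg": {"TG 471", "TG 474", "TG 487", "TG 489"},  # illustrative subset
}

def validate(record: dict) -> tuple[dict, list[str]]:
    """Return (accepted fields, review flags) for one extracted record."""
    clean, flags = {}, []
    for key, value in record.items():
        vocab = CONTROLLED_VOCAB.get(key)
        if vocab is None or value in vocab:
            clean[key] = value
        else:
            # Fallback: do not guess or correct; route to human review.
            flags.append(f"{key}={value!r} not in controlled vocabulary")
    return clean, flags
```

The design choice worth noting is the fallback: an out-of-vocabulary value is never repaired by the model, only surfaced as a flag, so uncertainty stays visible all the way to the reviewer.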
The distinction becomes clearer when the two approaches are viewed side by side. Using the example of genotoxicity data extraction, the table below contrasts a non-orchestrated use of a general chat interface with an orchestrated workflow designed to process evidence systematically, consistently, and at scale.
| Aspect | Non-orchestrated use (e.g. ChatGPT) | Orchestrated use (e.g. a structured toolkit) |
| --- | --- | --- |
| Typical scenario | Drag and drop one genotoxicity paper into ChatGPT and ask: "Is this study positive or negative?" | Run a structured pipeline across many papers using a predefined extraction schema |
| Scale | One paper or a few papers at a time | Batch processing across tens or hundreds of papers |
| Context handling | Manual, conversation-based, easy to overload | Controlled, segmented, task-specific |
| Deterministic variables | Usually handled implicitly or inconsistently | Explicitly captured, e.g. study ID, OECD TG, species, strain, route, dose, publication year |
| Interpretation-heavy variables | Asked directly in free text, often with limited controls | Routed to LLM steps, e.g. study outcome rationale, endpoint interpretation, uncertainty notes |
| Error control | Mainly dependent on user vigilance | Schema checks, controlled vocabularies, fallback logic, expert review |
| Output | Plausible answer in prose | Structured, reviewable evidence record |
| Best use | Fast exploration or first reading | Scientific extraction, curation, and scalable evidence analysis |
As the comparison shows, the difference is not simply one of convenience or speed. It is a difference in operating model: one approach supports ad-hoc interpretation, while the other is designed for structured, reviewable, and scientifically usable evidence generation.
In conclusion, we believe that the scientific community should avoid two opposite mistakes. The first is to assume that these tools should not be used in science at all, because that risks leaving organisations behind while others learn how to use them productively. The second is to adopt them too casually, as if fluent output were enough to justify trust. Scientific AI only becomes useful when its boundaries, limits, and failure modes are understood clearly and managed deliberately, and that in turn requires deep domain knowledge as much as technical knowledge. The challenge is not to reject these tools, nor to romanticise them, but to orchestrate them properly around real scientific problems. At Innovamol, this is exactly the direction we continue to pursue: staying active on these developments and, true to our mission, helping to organise scientific data.
“We shape our tools and thereafter our tools shape us” – Marshall McLuhan

