The Model Selection Question
Why picking an AI model isn't picking a database—it's picking a portfolio, and the portfolio changes quarterly.
The Question Everyone Asks
It's the first question we get from investors, partners, and fellow builders:
"Which models do you use?"
—everyone
The question assumes a singular answer—as if we picked a database vendor in 2015 and built our stack around it. But AI models aren't databases. They're more like employees: you hire different people for different roles, their capabilities change over time, and the job market shifts faster than your org chart.
The real answer is a portfolio strategy—one that balances capability against cost, speed against depth, and flexibility against lock-in. And that portfolio needs to be dynamic, because the model landscape is evolving faster than any technology stack in history.
This post documents how we think about model selection at Fibonacci. Not as a one-time vendor decision, but as an ongoing optimization problem across multiple axes.
1. The Evaluation Axes
Models don't compete on a single dimension. Anyone who tells you "Model X is the best" is oversimplifying. Best at what? For whom? At what cost? The evaluation space is multidimensional, and different applications weight these dimensions differently.
1.1 Intelligence
The axis everyone talks about. But "intelligence" itself fractures into subspecialties:
- Analytical reasoning: Can the model work through complex, multi-step logical problems? Can it maintain coherence across a long chain of inference?
- Domain knowledge: Does the model understand the nuances of drug discovery, molecular biology, clinical trial design? This isn't just about having read the papers—it's about understanding the implicit assumptions experts make.
- Emotional intelligence: For patient-facing applications, can the model navigate sensitive conversations with appropriate tone? Can it recognize distress signals and respond appropriately?
- Creative synthesis: Can it generate genuinely novel hypotheses by connecting disparate domains? Or does it just recombine patterns it's seen before?
The current frontier models—GPT-5.1, Gemini 3 Pro, Claude Opus 4.5—are remarkably close on standard benchmarks. The differences emerge in edge cases: complex multi-hop reasoning, domain-specific inference, tasks requiring sustained context over thousands of tokens. These edge cases are exactly where drug discovery lives.
1.2 Speed and Latency
Two distinct concerns that often get conflated:
- Throughput: How many tokens per second can the model generate? This matters for batch processing—churning through thousands of papers, generating candidate hypotheses at scale.
- Latency: How long until the first token arrives? This matters for interactive use—a scientist asking questions in real-time shouldn't wait 30 seconds for a response to begin.
For development iteration, speed is paramount. When you're debugging a prompt or testing a hypothesis, the difference between a 2-second response and a 20-second response isn't 18 seconds—it's the difference between flow state and context-switching. We've found that developer productivity roughly doubles when we drop response times below the threshold of conscious waiting.
1.3 Multimodal Capabilities
Drug discovery is inherently multimodal. We work with:
- Chemical structures: 2D molecular diagrams, 3D conformations, crystal structures
- Biological images: Microscopy, histopathology, imaging biomarkers
- Charts and figures: The majority of information in scientific papers is locked in figures, not text
- Audio: Dictated clinical notes, patient interviews, conference presentations
A model that can only process text is looking at drug discovery through a keyhole. The ability to reason directly over images—extracting data from a gel electrophoresis image, interpreting a dose-response curve, understanding a pathway diagram—is increasingly non-negotiable.
1.4 Tool-Calling Proficiency
Here's something underappreciated outside of teams actually building with these systems: tool-calling reliability might matter more than raw intelligence. A model that can reliably call external APIs, query databases, execute code, and coordinate multi-step workflows is dramatically more useful than a "smarter" model that hallucinates function parameters or forgets to chain calls correctly.
This capability varies wildly across models, and the differences aren't captured by the most hyped benchmarks. Some models struggle with complex JSON schemas. Others hallucinate function names. Still others fail to chain multiple tool calls correctly. We're regularly surprised by where a given model breaks down on tool use, which is why we measure it ourselves, as in the sketch below.
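Here's a minimal sketch of that kind of in-house check, assuming a hypothetical `call_model` function that returns the model's proposed tool call as a JSON string; the test-case shape is illustrative, not our production harness.

```python
# Minimal sketch of a tool-calling reliability check. `call_model` is a
# hypothetical stand-in for whatever client you use; it is assumed to return
# the model's proposed tool call as a JSON string of the form
# {"name": "...", "arguments": {...}}.
import json
from dataclasses import dataclass, field

@dataclass
class ToolCase:
    prompt: str                  # user request the model must turn into a tool call
    expected_tool: str           # the function it should pick
    required_params: set = field(default_factory=set)

def tool_call_pass_rate(call_model, cases: list[ToolCase]) -> float:
    """Fraction of cases where the model emits valid JSON, picks the right
    tool, and includes every required parameter."""
    passed = 0
    for case in cases:
        try:
            call = json.loads(call_model(case.prompt))
        except json.JSONDecodeError:
            continue                                   # malformed JSON
        if call.get("name") != case.expected_tool:
            continue                                   # wrong or hallucinated function name
        if case.required_params <= set(call.get("arguments", {})):
            passed += 1
    return passed / len(cases) if cases else 0.0
```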
1.5 The Qualitative Dimension
There's a quality that defies benchmarks but announces itself immediately in practice. Some people call it "Big Model Smell"—the ineffable sense that you're talking to something that gets it versus something that's doing a very good impression of getting it.
You know the difference. One model gives you the Wikipedia answer, properly formatted, technically correct, utterly lifeless. Another gives you something that feels like it was thought—a response that anticipated your follow-up question, that noticed the unstated assumption in your prompt, that pushed back on the part of your question that didn't quite make sense.
This manifests as:
- Writing quality: Does the prose feel alive, or does it have that telltale AI blandness—the aggressive hedging, the bullet-point-ification of nuance, the relentless positivity that reads like a corporate press release?
- Judgment: Does the model know when to stop? When to ask for clarification instead of confabulating? When to say "I don't know" instead of generating plausible-sounding nonsense?
- Surprise: Does it ever generate insights that make you think "I wouldn't have thought of that"? Or does it only ever give you back a more polished version of what you already knew?
These qualities matter enormously for scientist-facing applications. A tool that feels like a capable colleague gets used daily. One that feels like a fancy autocomplete gets used once, benchmarked, and forgotten. The adoption curve isn't about features—it's about whether people want to talk to it.
1.6 Openness and Customization
The open vs. closed weights debate is often framed ideologically. We frame it pragmatically: can you adapt the model to your specific distribution, and under what constraints?
Off-the-shelf models are trained on the internet. For specialized applications—clinical trial design, regulatory writing, molecular property prediction—you often need something more specific. The question is how you get there.
Open Weights
- Fine-tune on proprietary data without sending it anywhere
- Self-host for data sovereignty and compliance
- No API rate limits or availability dependencies
- Predictable costs at scale
Closed APIs
- No infrastructure to manage
- Automatic improvements as provider upgrades
- Often better performance at the frontier
- Some offer fine-tuning, but your data touches their servers
For production systems handling sensitive patient data, open weights with self-hosting often wins. For rapid prototyping and non-sensitive workloads, closed APIs are faster to deploy. The right choice depends on the sensitivity of your data, the specificity of your task, and your willingness to maintain custom infrastructure.
1.7 Price
Token economics matter more than most teams admit. At scale, the difference between $15/million tokens and $1/million tokens is the difference between "run this analysis on everything" and "run this analysis on a sample."
But price optimization is nuanced. A cheaper model that requires three retries costs more than an expensive model that gets it right the first time. A model that needs verbose prompting consumes more tokens than one that understands terse instructions. Total cost of ownership includes error handling, prompt engineering overhead, and developer time.
Here's a counterintuitive example: Claude Opus 4.5 costs roughly 5x more per token than Sonnet 4.5. But on coding tasks, Opus uses up to 76% fewer tokens to reach the same outcome—less backtracking, less verbose reasoning, fewer failed attempts. The "expensive" model can actually be cheaper. Per-token pricing is a red herring; per-task pricing is what matters.
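To make the per-task point concrete, here's a back-of-the-envelope calculation. The prices, token counts, and success rates below are illustrative placeholders, not measured or published figures; the shape of the arithmetic is what matters.

```python
# Back-of-the-envelope comparison of per-task cost. The numbers are illustrative
# placeholders; plug in your own measurements.
def cost_per_task(price_per_mtok: float, tokens_per_attempt: int, success_rate: float) -> float:
    """Expected dollars per completed task, assuming failed attempts are retried
    until one succeeds (expected attempts = 1 / success_rate)."""
    expected_attempts = 1 / success_rate
    return price_per_mtok * (tokens_per_attempt / 1_000_000) * expected_attempts

# A "cheap" model that is verbose and needs retries...
cheap = cost_per_task(price_per_mtok=3.0, tokens_per_attempt=12_000, success_rate=0.6)
# ...versus an "expensive" model that is terse and usually right the first time.
pricey = cost_per_task(price_per_mtok=15.0, tokens_per_attempt=3_000, success_rate=0.95)

print(f"cheap model:  ${cheap:.4f} per task")   # ~ $0.0600
print(f"pricey model: ${pricey:.4f} per task")  # ~ $0.0474
```

Once retries and verbosity are priced in, the nominally 5x-more-expensive model comes out cheaper per completed task.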
1.8 Transparency and Constraints
Models come with baggage. Some refuse to discuss certain topics for political reasons. Others have been trained with geographic biases or content restrictions that manifest in unexpected ways. Some will refuse to generate content that's entirely appropriate for medical contexts because it pattern-matches to restricted categories.
For drug discovery, we need models that can discuss:
- Controlled substances and their mechanisms
- Detailed toxicology and adverse events
- Sensitive patient scenarios
- Historical medical events and their context
A model that refuses to engage with these topics—or that injects inappropriate caveats into every response—is useless for our work, regardless of how well it benchmarks.
[Radar chart: candidate models plotted across the evaluation axes, higher values toward the edge. Frontier models excel at intelligence; open models win on speed and price.]
2. Our Current Framework
Given these axes, how do we actually make decisions? Our framework has three tiers.
2.1 Intelligence Floor: Non-Negotiable
There's a minimum capability threshold below which a model simply can't do the job. Drug discovery reasoning is complex—multi-hop inference, integration of heterogeneous evidence, detection of subtle inconsistencies. Models below a certain capability level make errors that are worse than no answer at all.
For production systems, we default to frontier-class models. The cost difference between a frontier model and a mid-tier model is often 5-10x. The capability difference for complex tasks is often the difference between "works" and "doesn't work." Saving money on the model is false economy if it means the system can't do its job.
Current candidates that meet our intelligence floor:
- Claude Opus 4.5
- Gemini 3 Pro
- GPT-5.1
Each of these has straightforward paths to HIPAA-compliant deployment via their enterprise offerings. We maintain integrations with all three. The frontier is contested, and leadership changes quarterly. Lock-in to a single provider is a strategic vulnerability.
[Scatter plot: capability versus cost on a log scale, Pareto frontier highlighted. Models on the frontier offer the best capability for their price point; no model below the line is strictly better.]
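In practice the floor is a hard filter applied before any cost or speed comparison. A minimal sketch, assuming a hypothetical internal eval table; the axes, scores, and thresholds are placeholders, not benchmark results.

```python
# Sketch: apply the intelligence floor as a hard filter before comparing on cost
# or speed. Scores and thresholds are placeholders from a hypothetical internal
# eval suite, not published benchmark numbers.
FLOOR = {"multi_hop_reasoning": 0.85, "evidence_integration": 0.80, "tool_use": 0.90}

internal_evals = {
    "claude-opus-4.5": {"multi_hop_reasoning": 0.91, "evidence_integration": 0.88, "tool_use": 0.94},
    "gemini-3-pro":    {"multi_hop_reasoning": 0.90, "evidence_integration": 0.86, "tool_use": 0.92},
    "gpt-5.1":         {"multi_hop_reasoning": 0.89, "evidence_integration": 0.87, "tool_use": 0.93},
    "fast-dev-model":  {"multi_hop_reasoning": 0.74, "evidence_integration": 0.70, "tool_use": 0.88},
}

def meets_floor(scores: dict[str, float]) -> bool:
    """A model qualifies for production only if it clears every axis of the floor."""
    return all(scores.get(axis, 0.0) >= minimum for axis, minimum in FLOOR.items())

production_candidates = [name for name, scores in internal_evals.items() if meets_floor(scores)]
print(production_candidates)  # the fast dev model drops out; the frontier models remain
```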
2.2 Development Velocity: Speed Over Capability
Here's a heresy: for most of your development cycle, the smartest model is the wrong model.
Production and development have fundamentally different failure modes. In production, a wrong answer is catastrophic. In development, a slow answer is catastrophic. The developer waiting 45 seconds for a response isn't just losing 45 seconds—they're losing their entire train of thought. They check Slack. They context-switch. They forget what they were testing. By the time the response arrives, they've mentally moved on.
We've measured this. A developer running experiments with 2-second latency will run 15x more experiments per hour than one waiting 30 seconds. Not 15% more. Fifteen times more. The compounding effects are brutal: faster iteration means faster debugging, faster debugging means faster feature development, faster feature development means faster learning about what actually works.
The teams treating frontier models as their default development environment are optimizing for the wrong thing. They're buying a Ferrari to learn parallel parking.
For development, we use fast, cheap models that are "good enough":
- Open-weight models on specialized hardware: GPT-OSS 120B on Cerebras or Groq delivers near-frontier quality at dramatically lower latency. What was a 30-second wait becomes 2 seconds. The cognitive experience is completely different—it's the difference between a conversation and an email thread.
- Efficient frontier models: Kimi-K2 offers better qualitative feel than most open models while maintaining speed. GLM 4.6 excels at tool-calling and code generation—often better than larger models for agentic workflows, because it was trained specifically for that use case.
- Qwen series: Strong multilingual capabilities, good reasoning, competitive with closed models at a fraction of the cost. The quality gap that existed 18 months ago has largely closed.
The key insight: most development work isn't testing the limits of model capability. It's testing integration, debugging prompts, iterating on UX. You're not asking the model to settle P vs. NP; you're asking it to parse JSON without hallucinating extra fields. For these tasks, a 90%-as-good model at 10x the speed isn't a compromise—it's strictly dominant.
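Operationally, the split is just configuration. A minimal sketch, assuming an `APP_ENV` environment variable and using model identifiers from this post; the config shape itself is illustrative.

```python
# Sketch: the dev/prod split lives in configuration, not in code paths.
# Model identifiers follow the post; the structure is illustrative.
import os

MODELS_BY_ENV = {
    "dev":  "gpt-oss-120b",      # fast and cheap: iteration, prompt debugging, UX work
    "ci":   "gpt-oss-120b",      # tests should not wait on a frontier queue
    "prod": "claude-opus-4.5",   # frontier quality where wrong answers are expensive
}

def default_model() -> str:
    """Pick the model from the deployment environment, defaulting to the fast tier."""
    env = os.environ.get("APP_ENV", "dev")
    return MODELS_BY_ENV.get(env, MODELS_BY_ENV["dev"])
```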
2.3 Specialized Fine-Tuning: When Generic Fails
General-purpose models are trained on the internet. The internet is not a drug discovery corpus. For specific tasks where our needs diverge significantly from typical usage, fine-tuned or RL-adapted models become necessary.
Good candidates for specialization:
- Tasks with clear, measurable objectives: If you can define a reward function, you can do RL. Clinical trial protocol generation with specific formatting requirements. Regulatory document classification. Structured data extraction from specific document types.
- Tasks with proprietary training data: If you have a corpus that doesn't exist on the public internet—internal assay results, proprietary clinical data, curated expert annotations—fine-tuning can unlock performance that generic models can't match.
- Tasks where generic models have systematic biases: If a model consistently mishandles your domain's edge cases, targeted training may be more efficient than elaborate prompting.
We haven't yet deployed fine-tuned models in production, but we're building the infrastructure to do so. The trigger will be finding a task where prompt engineering hits diminishing returns and the performance gap justifies the investment.
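To illustrate what "if you can define a reward function, you can do RL" looks like for structured extraction, here's a sketch of a checkable reward. The field names, weights, and partial-credit scheme are hypothetical; the point is that the objective is mechanically verifiable.

```python
# Sketch of a reward function for constrained generation (e.g. structured
# extraction with strict formatting). Field names and weights are hypothetical.
import json

REQUIRED_FIELDS = {"compound_id", "assay", "ic50_nm"}

def reward(model_output: str, gold: dict) -> float:
    """1.0 for valid JSON with all required fields and correct values,
    partial credit for parseable-but-incomplete output, 0.0 otherwise."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    if not REQUIRED_FIELDS <= set(parsed):
        return 0.25                              # parseable but missing fields
    correct = sum(parsed[k] == gold.get(k) for k in REQUIRED_FIELDS)
    return 0.5 + 0.5 * (correct / len(REQUIRED_FIELDS))
```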
3. The Portfolio View
Putting it together, our model strategy looks less like a vendor selection and more like an investment portfolio:
Production Tier (Frontier)
- Claude Opus 4.5 on Bedrock—primary workhorse
- Gemini 3 Pro on Vertex—multimodal tasks, long context
- GPT-5.1 on Azure—pending compliance approval
Optimized for: accuracy, reliability, compliance
Development Tier (Fast)
- GPT-OSS 120B on Cerebras—speed demon for iteration
- Kimi-K2—when qualitative feel matters
- GLM 4.6—agentic workflows, tool-calling
- Qwen 3—cost-effective general purpose
Optimized for: latency, cost, developer experience
Specialized Tier (Custom)
- Fine-tuned models for specific extraction tasks
- RL-optimized models for constrained generation
- Domain-adapted embeddings for retrieval
Optimized for: task-specific performance, proprietary advantage
[Diagram: how we decide which tier (production, development, specialized) to use for different tasks.]
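Written down as configuration rather than prose, the portfolio looks like this. The model identifiers follow the post; the structure is illustrative, and a model swap is a one-line edit here rather than a refactor.

```python
# The portfolio as declarative configuration. Identifiers follow the post;
# the structure is illustrative.
PORTFOLIO = {
    "production": {
        "models": ["claude-opus-4.5", "gemini-3-pro", "gpt-5.1"],
        "optimized_for": ["accuracy", "reliability", "compliance"],
    },
    "development": {
        "models": ["gpt-oss-120b", "kimi-k2", "glm-4.6", "qwen-3"],
        "optimized_for": ["latency", "cost", "developer experience"],
    },
    "specialized": {
        "models": [],  # fine-tuned checkpoints land here once deployed
        "optimized_for": ["task-specific performance", "proprietary advantage"],
    },
}
```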
3.1 The Routing Problem
A portfolio is only useful if you can allocate intelligently. Which queries go to which models? This is itself an optimization problem.
Simple heuristics work surprisingly well:
- Complexity estimation: Short, simple queries go to fast models. Complex, multi-step queries go to frontier models.
- Stakes assessment: Patient-facing outputs always use frontier. Internal development uses whatever is fastest.
- Capability matching: Vision tasks route to vision models. Long documents route to long-context models.
Over time, we expect to build learned routers that dynamically select models based on query characteristics and historical performance. But explicit routing rules get you 80% of the benefit with 20% of the complexity.
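A minimal sketch of those explicit rules, assuming the signals (stakes, images, document length) are already known at call time; the thresholds and model choices are illustrative, and a learned router would replace this function without changing the interface around it.

```python
# Sketch of the explicit routing rules described above. Thresholds and model
# choices are illustrative placeholders.
def route(query: str, patient_facing: bool = False, has_images: bool = False,
          doc_tokens: int = 0) -> str:
    """Return the model to use for a query, based on simple heuristics."""
    if patient_facing:
        return "claude-opus-4.5"        # stakes assessment: frontier only
    if has_images:
        return "gemini-3-pro"           # capability matching: vision tasks
    if doc_tokens > 100_000:
        return "gemini-3-pro"           # capability matching: long context
    if len(query.split()) > 200:
        return "claude-opus-4.5"        # crude complexity estimate: long, multi-part asks
    return "gpt-oss-120b"               # everything else goes to the fast tier
```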
4. The Temporal Dimension
Everything written above will be obsolete in six months. That's not pessimism—it's the defining fact of the field.
Consider: the model that was state-of-the-art when we started writing this post may not be state-of-the-art when you read it. The pricing we quoted has probably dropped. The "emerging" capability we mentioned is now table stakes. This is not an exaggeration—it's the lived experience of anyone building in this space.
The teams that treat model selection as a one-time decision are building on quicksand. They sign a three-year enterprise agreement with Provider X, build their entire stack around Provider X's APIs and quirks, and then watch in horror as Provider Y releases something 2x better at half the price. They're locked in. Their "strategic partnership" has become a strategic liability.
This has brutal implications for how you architect:
- Don't over-invest in model-specific infrastructure. Every hour you spend optimizing for one model's idiosyncrasies is technical debt when you inevitably switch. Your abstractions should survive the model change—if they don't, you've built a house of cards.
- Maintain multiple integrations. Provider lock-in is dangerous when the competitive landscape shifts quarterly. The switching cost should be measured in hours, not months. Services like OpenRouter make this easier—one API, dozens of models, instant switching.
- Re-evaluate constantly. A model that was best-in-class in January may be third-best by June. Build evaluation into your process as a continuous loop, not a one-time vendor selection that gets revisited "next year."
- Watch the open-source frontier. The gap between open and closed models is narrowing faster than anyone predicted. Today's enterprise moat—"only we have access to GPT-X"—may evaporate next quarter when an open model matches performance. Your competitive advantage needs to be something other than model access.
Our model strategy is designed to be disposable. Not our infrastructure, not our abstractions, not our data pipelines—but the specific models plugged into them. We're not betting the company on any single provider's roadmap. We're building a slot where any model can fit.
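Concretely, "a slot where any model can fit" is just a narrow interface with thin adapters behind it. A minimal sketch, with provider internals elided as stubs; the class and method names here are ours, not any vendor SDK's.

```python
# Sketch: application code depends on a tiny protocol, and each provider
# integration is a small adapter behind it. The adapters are hypothetical stubs.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class BedrockOpus:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the Bedrock client here")

class OpenRouterModel:
    def __init__(self, model_id: str):
        self.model_id = model_id       # swapping models is a change to this string
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the OpenRouter endpoint here")

def summarize(model: ChatModel, paper_text: str) -> str:
    """Application code sees only the ChatModel protocol, never a vendor SDK."""
    return model.complete(f"Summarize the key findings:\n\n{paper_text}")
```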
Conclusion: The Meta-Strategy
"Which model do you use?" is a question from a world that no longer exists—a world where you picked a vendor and lived with the choice for a decade.
The right question is: "How do you decide which model to use for each task, and how quickly can you change that decision when the landscape shifts?"
Our answer:
- Evaluate across multiple axes—not just benchmarks, but speed, cost, qualitative feel, tool-calling reliability, and transparency. The "best" model depends entirely on what you're optimizing for.
- Maintain a portfolio—frontier models for production stakes, fast models for development velocity, specialized models for narrow tasks. Different jobs, different tools.
- Route intelligently—match the model to the task, not vice versa. The routing logic is where the leverage lives.
- Stay liquid—the landscape changes faster than planning cycles. Build for adaptability, not for any specific future state.
The companies that win in AI-native drug discovery won't be those that picked the "right" model in 2025. They'll be those that built systems capable of absorbing whatever models exist in 2027, 2030, and beyond—systems where the model is a hot-swappable component, not a load-bearing wall.
The model is a component. The system is the product. The adaptability is the moat.