The Ontological Advantage: Building Knowledge Graphs That Reason
Why structured data with strong schemas is the moat that separates insight from noise in drug discovery.
The Data Paradox
Pharma has a data problem, but not the one you think. The industry doesn't suffer from a shortage of data—it drowns in it. Every year, tens of thousands of papers are published, thousands of clinical trials report results, and petabytes of genomic, proteomic, and chemical data accumulate in repositories around the world.
Yet drug discovery remains stubbornly inefficient. The average cost to bring a drug to market now exceeds $2.6 billion. Attrition rates haven't meaningfully improved in decades. The promise of "Big Data" in pharma has, for the most part, failed to materialize.
The problem isn't data volume. It's data structure.
Most pharmaceutical companies have built what they call "data lakes." In practice, these are data swamps—vast repositories where information goes to die. There's a Cask of Amontillado quality to it: knowledge walled in by its own data, each new CSV and unstructured PDF adding another brick. Analysts spend 80% of their time excavating and reconciling, and only 20% actually analyzing. That ratio should be inverted.
At Fibonacci, we recognized early that the bottleneck in AI-driven drug discovery isn't algorithmic sophistication—it's the quality of the substrate those algorithms operate on. Our Medicine Engine is built on a foundation of highly structured knowledge graphs with rigorous ontological constraints. This isn't a technical curiosity. It's our core competitive advantage.
The Limits of Vector Search
When LLMs first arrived, the industry converged on a pattern called RAG—Retrieval-Augmented Generation. The idea was simple: embed your documents as vectors, and when a user asks a question, find the most semantically similar chunks and feed them to the model.
For simple fact retrieval, this works. "What is the molecular weight of gefitinib?" Vector search finds the relevant paragraph, and the LLM extracts the answer.
But vector RAG suffers from a fundamental limitation we call contextual fragmentation. When you embed text as vectors, you capture semantic similarity—but you lose structure. You lose the relationships between entities. You lose the causal chains that connect evidence to conclusions.
Ask a vector-based system: "What are the emerging resistance mechanisms for EGFR inhibitors, and how do they connect to alternative therapeutic strategies?" The system will return a dozen chunks that mention EGFR and resistance. But it cannot synthesize them. It cannot trace the path from mutation → mechanism → pathway → alternative target. The topology is invisible.
This is the core problem that knowledge graphs solve. A knowledge graph doesn't just store facts—it stores the structure of reasoning. When an LLM queries a knowledge graph, it doesn't just retrieve similar text. It traverses a pre-computed reasoning structure, following edges that encode biological causality. This is what enables multi-hop reasoning and global synthesis.
The industry calls this paradigm GraphRAG[1]—and it represents a fundamental shift from stateless information retrieval to stateful semantic memory.
1. Why Knowledge Graphs?
A knowledge graph is, at its core, a way of representing information as a network of entities and relationships. Nodes are things—drugs, proteins, diseases, genes, clinical trials. Edges are the relationships between them—"inhibits," "treats," "upregulates," "is_contraindicated_with."
This might sound like a relational database with extra steps. It's not. The difference is fundamental.
Relational databases are designed for transactions. They excel at answering questions you anticipated when you designed the schema: "How many units of Drug X did we sell in Q3?" But they struggle with exploratory queries that traverse relationships: "What pathways connect this kinase inhibitor to cardiovascular side effects?"
Drug discovery is inherently a graph problem. A drug binds to a target. That target participates in pathways. Those pathways are dysregulated in diseases. The disease has a patient population. The population has demographic characteristics that affect trial design. Every entity is connected to every other entity through a web of relationships.
The key insight isn't just that topology exists—it's that relevance is defined by adjacency. When you care about a target, you want to know what compounds hit it, what pathways it participates in, what diseases it's implicated in. You don't want all targets; you want this target's neighborhood. The graph structure itself tells you what context matters.
This is why graphs and LLMs are such a powerful combination. When an LLM needs to research a compound, it can do a depth-first traversal from that node—following edges to targets, then to pathways, then to diseases—and everything it encounters is logically relevant. The graph defines the search space. In a table, you'd need to know in advance which joins to make. In a vector store, you'd retrieve semantically similar chunks that might be about completely unrelated compounds. The graph gives you structured relevance.
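The idea of "structured relevance" can be sketched in a few lines. The graph below is a toy adjacency list with hypothetical entity and relationship names; a depth-limited traversal from one compound reaches only its own mechanistic neighborhood, never the unrelated branch.

```python
from collections import defaultdict

# Toy graph: adjacency list of (relationship, neighbor) pairs.
# Entity and relation names are illustrative, not a production schema.
graph = defaultdict(list)
edges = [
    ("gefitinib", "INHIBITS", "EGFR"),
    ("EGFR", "PARTICIPATES_IN", "EGFR_signaling"),
    ("EGFR_signaling", "DYSREGULATED_IN", "NSCLC"),
    ("imatinib", "INHIBITS", "ABL1"),  # unrelated branch
]
for src, rel, dst in edges:
    graph[src].append((rel, dst))

def neighborhood(start, depth):
    """Depth-limited traversal: everything reached is structurally relevant."""
    seen, frontier = {start}, [start]
    for _ in range(depth):
        frontier = [dst for node in frontier
                    for _, dst in graph[node] if dst not in seen]
        seen.update(frontier)
    return seen

# Three hops from gefitinib reach its disease context, not imatinib's.
assert "NSCLC" in neighborhood("gefitinib", 3)
assert "ABL1" not in neighborhood("gefitinib", 3)
```

A vector store would have to hope the relevant chunks embed nearby; here, relevance falls out of adjacency.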
1.1 Iterative Boundary Expansion
The advantage of graphs isn't cleaner query syntax—LLMs can write SQL just fine. The real power is that graphs enable iterative knowledge expansion at the boundary.
When an agent traverses a graph, it can reach the edge of what's known and make a decision: is this enough context, or do I need to go deeper? If a compound's target has sparse pathway annotations, the agent can autonomously trigger a literature search to fill in that gap. If a disease node lacks recent clinical trial connections, the agent can query ClinicalTrials.gov and extend the graph in real-time.
This is fundamentally different from static querying. The graph isn't just a database—it's a map of known and unknown territory. Agents explore the frontier, identify where knowledge is thin, and iteratively enrich the regions that matter for the current question. The structure of the graph tells them where to look next.
Consider a drug repurposing question: "Find approved kinase inhibitors that might work in pancreatic cancer." An agent doesn't just run a query. It traverses from pancreatic cancer to dysregulated pathways, from pathways to involved kinases, from kinases to known inhibitors. At each hop, it evaluates confidence: is this edge well-supported? If not, it pauses to gather more evidence before continuing. The graph provides the skeleton; the agent fills in the muscle.
Figure: an agent traversing entity relationships, evaluating confidence at each edge.
2. The Ontology Is the Moat
A knowledge graph without an ontology is just a fancy way to store spaghetti. The ontology is what transforms a graph from a data structure into a reasoning engine.
An ontology defines three things:
- Schema: What types of entities exist, and what types of relationships can connect them.
- Semantics: What those entities and relationships mean—not just their labels, but their logical implications.
- Constraints: What configurations are valid, and what configurations are impossible.
The "NoSQL" movement of the 2010s popularized the idea of "schema-free" databases. This was always a misnomer. You don't eliminate schema by not defining it explicitly—you just make it implicit, inconsistent, and impossible to enforce.
Figure: does this statement conform to our ontology's type constraints?
In drug discovery, implicit schemas are lethal. Consider a simple example: a data entry says "Drug X treats EGFR." Is that valid? It depends on what EGFR refers to. EGFR could be:
- The gene (a segment of DNA that encodes a protein)
- The protein (the receptor that sits on cell surfaces)
- A pathway (EGFR signaling cascade)
- A mutation (EGFR L858R, common in lung cancer)
Drugs don't "treat" genes or proteins—they bind to proteins, inhibit enzymes, or modulate pathways. Drugs treat diseases. Without an ontology that enforces these distinctions, your database will accumulate thousands of semantically invalid statements. Garbage in, garbage out—at scale.
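The enforcement mechanism can be as simple as a typed domain/range check per relationship. This is a minimal sketch, assuming hypothetical entity and relation names; a production ontology would be far richer, but the principle is identical.

```python
# Minimal sketch of ontology type constraints. Entity and relation
# names are illustrative, not a production schema.
ENTITY_TYPES = {"EGFR_gene": "Gene", "EGFR_protein": "Protein",
                "NSCLC": "Disease", "gefitinib": "Compound"}

# Each relationship is valid only between specific (subject, object) types.
RELATION_DOMAINS = {
    "TREATS":  ("Compound", "Disease"),
    "BINDS":   ("Compound", "Protein"),
    "ENCODES": ("Gene", "Protein"),
}

def validate(subj, rel, obj):
    """Reject statements whose subject/object types violate the schema."""
    dom, rng = RELATION_DOMAINS[rel]
    return ENTITY_TYPES[subj] == dom and ENTITY_TYPES[obj] == rng

assert validate("gefitinib", "TREATS", "NSCLC")             # drugs treat diseases
assert not validate("gefitinib", "TREATS", "EGFR_protein")  # not proteins
```

A statement like "Drug X treats EGFR" fails the check at ingestion rather than polluting the graph.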
2.1 The Cost of Late Binding
Some teams argue for a "late-binding" approach: ingest everything now, impose structure later. This is a trap.
Late binding works when data volumes are small and the domain is simple. In drug discovery, neither condition holds. By the time you realize your data model is broken, you have millions of malformed records. Fixing them isn't a schema migration—it's an archaeological excavation.
We enforce schema at ingestion. Every fact that enters our knowledge graph must conform to our ontology. If a data source provides information in a format that doesn't map cleanly to our types and relationships, we either transform it explicitly or reject it. There is no "we'll figure it out later."
2.2 Dynamic Schema Induction
Here's where LLMs change the equation. Traditional ontology engineering required human experts to predefine every entity type and relationship before data ingestion. If the schema didn't anticipate a relationship, that information was lost.
Modern systems can perform dynamic schema induction—the LLM itself discovers the ontological structure as it processes data. When it encounters "SpaceX launched Starship" and "NASA launched Artemis," it can induce the schema pattern: Organization performs Launch on Spacecraft—without being explicitly programmed with these categories.
This doesn't mean we abandon rigor. We use LLM-induced schemas as proposals that are validated against our core ontology. The LLM might discover a new relationship type we hadn't anticipated; that becomes a candidate for inclusion in our formal schema after human review. This hybrid approach—LLM flexibility with human-curated guardrails—gives us the best of both worlds: schema plasticity without sacrificing semantic precision.
2.3 Open-World vs. Closed-World
A critical ontological decision is whether to operate under open-world or closed-world semantics.
In a closed-world system, if something isn't in the database, it's assumed to be false. This is how SQL databases work: if there's no row for "Drug X treats Disease Y," then Drug X doesn't treat Disease Y.
In an open-world system, absence of information is not evidence of absence. If "Drug X treats Disease Y" isn't in the database, we simply don't know whether it's true or false.
Drug discovery demands open-world reasoning. The whole point is to discover relationships that aren't yet known. A system that treats unknown relationships as false will never propose novel hypotheses. Our ontology explicitly distinguishes between "known true," "known false," and "unknown"—and many of our most valuable queries specifically target the "unknown" space.
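The three-valued distinction is easy to make concrete. In this sketch (with hypothetical triples), absence from the assertion store means unknown, never false:

```python
from enum import Enum

class Truth(Enum):
    KNOWN_TRUE = "known_true"
    KNOWN_FALSE = "known_false"
    UNKNOWN = "unknown"

# Hypothetical assertion store; absence means UNKNOWN, not false.
assertions = {
    ("drug_x", "TREATS", "disease_y"): Truth.KNOWN_TRUE,
    ("drug_x", "TREATS", "disease_z"): Truth.KNOWN_FALSE,
}

def lookup(triple):
    """Open-world lookup: unrecorded triples are hypothesis space."""
    return assertions.get(triple, Truth.UNKNOWN)

assert lookup(("drug_x", "TREATS", "disease_y")) is Truth.KNOWN_TRUE
assert lookup(("drug_x", "TREATS", "disease_w")) is Truth.UNKNOWN
```

Queries targeting the "unknown" space simply filter on `Truth.UNKNOWN` instead of treating missing rows as negatives.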
3. Anatomy of a Drug Discovery Ontology
Building an ontology for drug discovery is not a weekend project. It requires deep domain expertise, careful design decisions, and continuous refinement. Here's how we structured ours.
3.1 Core Entity Types
Our ontology defines a hierarchy of entity types, each with specific properties and relationship constraints:
- Compound: Any chemical entity, from small molecules to biologics. Properties include molecular weight, structure (SMILES/InChI), lipophilicity, and synthetic tractability scores. Subtypes: SmallMolecule, Peptide, Antibody, Oligonucleotide.
- Target: A molecular entity that a compound interacts with. Almost always a protein, but can include RNA or DNA. Properties include sequence, structure (if known), druggability score, and expression patterns across tissues.
- Pathway: A biological process involving multiple molecular interactions. Properties include pathway type (signaling, metabolic, regulatory), member entities, and known disease associations.
- Disease: A pathological condition. Properties include ICD codes, affected tissues, prevalence, and known genetic associations. Subtypes follow the disease ontology hierarchy (DOID).
- ClinicalTrial: A registered study testing a compound in humans. Properties include phase, status, endpoints, patient population, and results (if available).
- Publication: A scientific paper or patent. Properties include authors, date, journal, and extracted claims (with confidence scores).
3.2 The Rigid Core, Flexible Shell
Here's a nuance that gets lost in the "graph vs. relational" debate: we don't view these as opposing paradigms. Our architecture is a superset—a rigidly structured relational core, extended by a flexible graph layer.
Some things must be tables. Diseases have ICD-10 codes. Asset classes have regulatory definitions. Treatment algorithms for standard of care are decision trees with specific branch points. These aren't fuzzy—they're normative. We maintain them in traditional relational structures with strict schemas, foreign key constraints, and audit trails.
But on top of that rigid core, we layer the graph. The graph captures what the tables can't: the emerging hypotheses, the contested relationships, the nuanced mechanistic connections that don't fit into predefined columns. "EGFR inhibition may sensitize tumors to checkpoint blockade in patients with high TMB"—that's not a row in a table. That's a path through the graph, with provenance and confidence scores attached.
The relational core provides the ground truth. The graph provides the hypothesis space. Queries can traverse both: "Given the standard-of-care treatment algorithm for NSCLC, what novel combinations have mechanistic support in our graph?" The answer requires joining structured treatment protocols with unstructured pathway analysis. That's the power of the hybrid.
3.3 Relationship Semantics
Relationships are not just labels—they carry semantic weight. Our ontology defines relationship types with precise meanings:
Compound → Target Relationships
- BINDS: Physical interaction (with affinity value)
- INHIBITS: Reduces target activity (with IC50)
- ACTIVATES: Increases target activity (with EC50)
- MODULATES: Alters activity without clear direction
Target → Disease Relationships
- IMPLICATED_IN: Genetic or functional evidence of involvement
- CAUSAL_FOR: Strong evidence of causality (rare, valuable)
- BIOMARKER_OF: Correlates with disease but may not be causal
Compound → Disease Relationships
- INDICATED_FOR: Approved for treatment
- INVESTIGATED_FOR: In clinical trials
- CONTRAINDICATED_FOR: Should not be used
- REPURPOSING_CANDIDATE: Computational prediction (with confidence)
Every relationship also carries metadata: the source of the assertion, the confidence level, the date it was added, and whether it was asserted by a human curator or inferred by our reasoning engine.
3.4 Standing on Shoulders: Existing Ontologies
We didn't build our ontology from scratch. The biomedical community has spent decades developing standardized ontologies, and ignoring them would be both arrogant and inefficient.
Our ontology incorporates and extends:
- ChEMBL: Chemical structures and bioactivity data
- UniProt: Protein sequences and annotations
- Gene Ontology (GO): Biological processes and molecular functions
- Disease Ontology (DOID): Disease classifications and relationships
- MeSH: Medical subject headings for literature indexing
- SNOMED-CT: Clinical terminology for healthcare data
The challenge is integration. These ontologies were developed independently, with different design philosophies and overlapping scope. We maintain explicit mappings between our internal types and these external standards, allowing us to ingest data from any source that uses them.
4. Integration Without Corruption
A knowledge graph is only as valuable as the data it contains. In drug discovery, relevant data is scattered across hundreds of sources: public databases (PubChem, ChEMBL, ClinicalTrials.gov), commercial datasets (Cortellis, Clarivate), proprietary experimental data, and the ever-growing corpus of scientific literature.
Integrating these sources is where most knowledge graph projects fail. The naive approach—write a custom ETL pipeline for each source—leads to what we call the N-squared problem.
4.1 The N-Squared Trap
If you have N data sources and you want them to interoperate, the naive approach requires a mapping for every ordered pair: N(N−1) of them. Source A needs to talk to Source B, Source C, and so on. When you add source N+1, you need N new pairwise mappings.
This doesn't scale. At 10 sources, that's 90 directed mappings. At 100 sources, 9,900. Each mapping is a potential point of failure, and maintaining them becomes a full-time job.
Our solution is the ontology-as-hub pattern. Every source maps to our ontology exactly once. The ontology becomes the lingua franca—the common language that all sources speak. Adding a new source requires one mapping, not N mappings. Integration becomes O(N) instead of O(N²).
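The arithmetic behind the pattern is worth making explicit:

```python
def pairwise_mappings(n):
    """Naive integration: one mapping per ordered pair of distinct sources."""
    return n * (n - 1)

def hub_mappings(n):
    """Ontology-as-hub: each source maps to the ontology exactly once."""
    return n

assert pairwise_mappings(10) == 90
assert pairwise_mappings(100) == 9900
assert hub_mappings(100) == 100
```

At 100 sources, the hub pattern replaces 9,900 fragile pairwise mappings with 100 maintained ones.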
4.2 Entity Resolution: The Identity Crisis
The hardest problem in data integration isn't format conversion—it's entity resolution. When two sources mention "EGFR," are they talking about the same thing?
The answer is usually "it depends." EGFR in a genomics paper refers to the gene. EGFR in a structural biology paper refers to the protein. EGFR in an oncology paper might refer to the mutation (EGFR T790M) or the signaling pathway.
We maintain a canonical entity registry with explicit cross-references to external identifiers. When we ingest data, we don't just match on name—we resolve to our canonical entity using context clues (what type of entity does this source typically reference?), explicit identifiers (UniProt ID, ChEMBL ID), and machine learning models trained on disambiguation.
When resolution is ambiguous, we don't guess—and we don't hide the uncertainty either. Every resolved entity carries an explicit confidence score and a full evidence trail: which source made the claim, what context surrounded it, and what reasoning led to the resolution. This isn't just for auditing; it's for downstream decision-making. A drug-target link with 95% confidence from three independent sources is treated differently than one with 60% confidence from a single abstract. The graph doesn't just store facts—it stores how much we trust those facts.
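A resolution step might look like the sketch below. The identifiers, heuristics, and confidence values are all hypothetical; the point is the shape of the return value: a canonical entity plus an explicit confidence and evidence trail, never a silent guess.

```python
# Hypothetical canonical registry keyed by external identifiers.
registry = {
    "P00533": "EGFR_protein",   # UniProt accession (illustrative mapping)
    "HGNC:3236": "EGFR_gene",   # HGNC gene id (illustrative mapping)
}

def resolve(mention, external_id=None, context_type=None):
    """Return (canonical_id, confidence, evidence). Heuristics are illustrative."""
    if external_id in registry:
        return registry[external_id], 0.95, f"explicit id {external_id}"
    if context_type == "structural_biology":
        # Context clue: structural biology papers usually mean the protein.
        return "EGFR_protein", 0.70, "context: structural biology paper"
    return None, 0.0, "unresolved: queued for human review"

entity, conf, evidence = resolve("EGFR", external_id="P00533")
assert entity == "EGFR_protein" and conf > 0.9
```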
4.3 Provenance: Trust, But Verify
Not all facts are created equal. A binding affinity measured in a peer-reviewed Nature paper is more trustworthy than a prediction from a computational model. A Phase 3 trial result is more definitive than a Phase 1 observation.
Every assertion in our knowledge graph carries provenance metadata:
- Source: Where did this fact come from?
- Evidence type: Experimental, computational, curated, inferred?
- Confidence: How reliable is this assertion?
- Timestamp: When was this added or updated?
- Curator: Who (or what system) added this?
This provenance metadata isn't just for auditing—it's used in query-time reasoning. When our models make predictions, they can weight evidence by reliability. When contradictions arise, we can trace them to their sources and resolve them.
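Weighting evidence by reliability can be sketched as a simple aggregation. The weights below are illustrative placeholders, not calibrated values:

```python
# Illustrative reliability weights by evidence type (not calibrated values).
WEIGHTS = {"experimental": 1.0, "curated": 0.8, "computational": 0.4}

def support(provenance_records):
    """Aggregate weighted support for a claim across provenance records."""
    return sum(WEIGHTS[evidence_type] * confidence
               for evidence_type, confidence in provenance_records)

# A claim backed by a trial result plus curation outweighs a model prediction.
phase3_backed = [("experimental", 0.9), ("curated", 0.8)]
model_only = [("computational", 0.6)]
assert support(phase3_backed) > support(model_only)
```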
5. Reasoning and Inference
A knowledge graph that only stores what you explicitly tell it is just a database with a trendy name. The real power of knowledge graphs emerges when they can derive new knowledge from existing facts.
- If A inhibits B, and B phosphorylates C, then A indirectly affects C.
- If A affects C, and C activates D, then A modulates D.
- If A modulates D, and D is upregulated in F, then A may treat F.
- If D is upregulated in F, and E is dysregulated in G, find shared targets.
5.1 Transitive Inference
The simplest form of reasoning is transitive closure. If we know:
- Drug A inhibits Kinase B
- Kinase B phosphorylates Protein C
- Protein C activates Pathway D
- Pathway D is upregulated in Disease E
Then we can infer: Drug A may reduce activity of Pathway D, and therefore may have therapeutic potential in Disease E.
This inference isn't certain—biology is messier than logic—but it generates hypotheses. Every inferred edge becomes a candidate for experimental validation.
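The chain above can be followed mechanically. This toy fact base mirrors the example (names are illustrative), and the traversal surfaces the inferred path from drug to disease:

```python
# Toy fact base mirroring the chain above; names are illustrative.
facts = [
    ("drug_a", "INHIBITS", "kinase_b"),
    ("kinase_b", "PHOSPHORYLATES", "protein_c"),
    ("protein_c", "ACTIVATES", "pathway_d"),
    ("pathway_d", "UPREGULATED_IN", "disease_e"),
]

def infer_chain(start, target):
    """Follow outgoing edges from start; return the path if it reaches target."""
    path, node = [start], start
    while node != target:
        nxt = next((obj for subj, _, obj in facts if subj == node), None)
        if nxt is None:
            return None  # chain broke before reaching the target
        path.append(nxt)
        node = nxt
    return path

assert infer_chain("drug_a", "disease_e") == [
    "drug_a", "kinase_b", "protein_c", "pathway_d", "disease_e"]
```

Each inferred path is a hypothesis with provenance: every hop corresponds to an asserted edge that can be audited.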
5.2 Contradiction Detection
Scientific literature is full of contradictions. One paper says Drug X inhibits Target Y; another says it activates it. Both can't be true (usually).
Our reasoning engine actively searches for contradictions by checking logical consistency across the graph. When contradictions are detected, they're flagged for investigation. Often, the resolution reveals important nuance: the drug might inhibit the target at low concentrations but activate it at high concentrations. Or the papers were studying different isoforms.
Contradictions aren't bugs—they're features. They point to areas where our understanding is incomplete or where the literature needs reconciliation.
5.3 Hierarchical Abstraction: Community Detection
One of the most powerful techniques in modern knowledge graphs is community detection—algorithms that identify clusters of densely connected nodes. In drug discovery, these communities correspond to biological themes: a cluster of kinases in the MAPK pathway, a cluster of compounds targeting neuroinflammation, a cluster of trials in a specific therapeutic area.
We use algorithms like Leiden to hierarchically partition our graph:
- Level 0: The entire knowledge base
- Level 1: Major therapeutic areas (Oncology, Immunology, Neurology)
- Level 2: Specific disease mechanisms (EGFR signaling, checkpoint inhibition)
- Level 3: Atomic clusters of related entities
For each community at each level, we generate a textual summary—a "Community Report" that describes the key entities, relationships, and insights within that cluster. This creates a hierarchical index of our knowledge.
When an LLM needs to answer a broad question—"What are the emerging themes in kinase inhibitor development?"—it doesn't query millions of individual edges. It queries the community summaries at the appropriate level of abstraction. This is how we enable global sensemaking: synthesis across the entire corpus, not just retrieval of similar chunks.
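To illustrate the idea without pulling in a graph library, the sketch below uses plain connected components as a stand-in for the communities a Leiden-style algorithm would find on a real, densely connected graph (Leiden itself optimizes modularity and produces a hierarchy; this is only the simplest possible analogue):

```python
from collections import defaultdict

# Toy undirected graph; node names are illustrative.
edges = [("EGFR", "ERBB2"), ("ERBB2", "lapatinib"),      # one biological theme
         ("PD-1", "PD-L1"), ("PD-L1", "atezolizumab")]   # another theme

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def communities(adj):
    """Connected components as a stand-in for detected communities."""
    seen, result = set(), []
    for node in adj:
        if node in seen:
            continue
        comp, stack = set(), [node]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        result.append(comp)
    return result

# Two clusters emerge: an ERBB-family theme and a checkpoint-blockade theme.
assert len(communities(adj)) == 2
```

Each cluster would then get a generated summary, forming one level of the hierarchical index.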
5.4 The Unknown Unknown Finder
Perhaps the most valuable queries are those that identify gaps in knowledge. Our system can answer questions like:
- "Which kinases have no known inhibitors but are implicated in diseases with high unmet need?"
- "Which approved drugs have never been tested in indications where their targets are dysregulated?"
- "Which protein-protein interactions lack any published modulators?"
These "negative space" queries are impossible in traditional databases. They require reasoning about what doesn't exist, which in turn requires knowing what could exist. The ontology makes this possible.
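In set terms, a negative-space query is a difference followed by an intersection over typed entity sets. The sets below are illustrative stand-ins for what typed graph queries would return:

```python
# Illustrative entity sets; in production these come from typed graph queries.
kinases = {"KIN1", "KIN2", "KIN3"}
has_known_inhibitor = {"KIN1"}
disease_implicated = {"KIN2", "KIN3"}  # implicated in high-unmet-need disease

# "Which kinases have no known inhibitors but are implicated in disease?"
# The answer enumerates gaps to investigate, not facts about biology.
candidates = (kinases - has_known_inhibitor) & disease_implicated
assert candidates == {"KIN2", "KIN3"}
```

The ontology is what makes the universe of "all kinases" enumerable in the first place; without typed entities, there is no set to subtract from.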
6. The LLM Inflection Point
For decades, knowledge graph construction was a bottleneck. Extracting structured facts from unstructured text—papers, patents, clinical reports—required either expensive manual curation or brittle rule-based NLP systems. The economics didn't work: it cost more to populate the graph than the graph was worth.
Large Language Models changed this equation overnight.
LLMs are, at their core, machines for transforming unstructured text into structured representations. They excel at exactly the tasks that made knowledge graph construction expensive: entity recognition, relationship extraction, coreference resolution, and semantic normalization. What used to require a team of PhD curators now happens at API speed.
6.1 From Papers to Triples
Consider a sentence from a research paper: "Administration of gefitinib resulted in significant tumor regression in EGFR-mutant non-small cell lung cancer patients who had progressed on platinum-based chemotherapy."
A well-prompted LLM can extract multiple structured facts from this single sentence.
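For illustration, the triples such an extraction pass might emit could look like this (entity identifiers and relation names are hypothetical, not our production schema):

```python
# Hypothetical triples extracted from the gefitinib sentence above.
extractions = [
    ("gefitinib", "INVESTIGATED_FOR", "NSCLC"),
    ("NSCLC", "HAS_BIOMARKER", "EGFR_mutation"),
    ("gefitinib", "EFFECTIVE_IN", "EGFR_mutant_NSCLC"),
    ("patient_population", "PROGRESSED_ON", "platinum_chemotherapy"),
]

# Every extraction is a candidate assertion, not yet a graph fact.
assert all(len(triple) == 3 for triple in extractions)
```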
Critically, the LLM can map these extractions to our ontology. It doesn't just extract "gefitinib"—it resolves to our canonical Compound entity with ChEMBL ID. It doesn't just extract "EGFR mutation"—it recognizes this as a Biomarker entity that should be linked to the EGFR Target via a HAS_MUTATION relationship.
We process thousands of papers per day this way. Each extraction is assigned a confidence score based on the LLM's uncertainty and the complexity of the source sentence. High-confidence extractions are added directly to the graph. Lower-confidence extractions are queued for human review—but the human reviewer is now validating proposed facts, not reading raw papers. That's a 10x productivity gain.
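The routing logic is deliberately simple. The threshold below is an illustrative placeholder, not a production value:

```python
CONFIDENCE_THRESHOLD = 0.85  # illustrative cutoff, not a production value

def route(extraction, confidence):
    """High-confidence facts go straight to the graph; the rest to review."""
    return "graph" if confidence >= CONFIDENCE_THRESHOLD else "review_queue"

assert route(("gefitinib", "INVESTIGATED_FOR", "NSCLC"), 0.93) == "graph"
assert route(("gefitinib", "MODULATES", "autophagy"), 0.55) == "review_queue"
```

The productivity gain comes from what lands in the review queue: proposed triples with source context attached, not raw PDFs.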
6.2 Natural Language as Query Interface
Knowledge graphs have traditionally required specialized query languages—SPARQL, Cypher, Gremlin. These languages are powerful but inaccessible. A medicinal chemist shouldn't need to learn graph query syntax to ask questions about drug-target interactions.
LLMs bridge this gap. We expose our knowledge graph through a natural language interface where users can ask questions in plain English:
- "What approved drugs target the same pathway as imatinib?"
- "Show me all kinase inhibitors that have been tested in pancreatic cancer."
- "Which targets in the Wnt pathway have no known small-molecule modulators?"
The LLM translates these questions into formal graph queries, executes them, and returns results in natural language—with citations to the underlying data sources. The ontology is essential here: it provides the semantic scaffolding that allows the LLM to understand which entity types and relationships are valid.
6.3 Ontology-Grounded Generation
The synergy between LLMs and knowledge graphs runs both directions. The knowledge graph improves LLM outputs by providing grounding—a structured source of truth that constrains hallucination.
When our scientists ask the system to generate a hypothesis about a new target, the LLM doesn't freeform speculate. It retrieves relevant subgraphs from our knowledge base, reasons over the relationships, and generates hypotheses that are grounded in existing evidence. Every claim in the output can be traced back to specific nodes and edges in the graph.
This is the architecture pattern we call Retrieval-Augmented Reasoning (RAR). The knowledge graph provides the facts; the LLM provides the synthesis. Neither alone is sufficient. Together, they're transformative.
6.4 The Virtuous Cycle
The LLM populates the graph. The graph grounds the LLM. This creates a flywheel:
- LLM extracts facts from new literature → Graph grows
- Larger graph provides richer context → LLM extractions improve
- Better extractions mean higher precision → Less human review needed
- Freed-up humans focus on edge cases → Graph quality increases
- Higher quality graph → Better grounding for downstream tasks
Before LLMs, building a comprehensive drug discovery knowledge graph was a multi-year, multi-million-dollar endeavor that only the largest pharma companies could attempt. Today, it's table stakes for any serious AI-native biotech. The technology has democratized; the differentiator is now the quality of the ontology and the rigor of the integration pipeline.
That's exactly where we've invested.
7. Operationalizing the Graph
A knowledge graph sitting in a database is an asset. A knowledge graph integrated into operational workflows is a capability.
7.1 Query Patterns for Drug Discovery
We've developed a library of query patterns that encode common drug discovery reasoning:
- Shortest Path: What's the mechanistic connection between this drug and that side effect?
- Subgraph Matching: Find all drug-target-pathway combinations that match a successful precedent.
- Similarity Search: Given this drug, find others with similar target profiles.
- Counterfactual Analysis: If this target weren't inhibited, what pathways would be affected?
These patterns are exposed as APIs that other systems can call. Our clinical trial design system queries the graph to identify patient stratification biomarkers. Our safety prediction models traverse the graph to anticipate off-target effects.
7.2 Graph Neural Networks: Learning on Structure
Traditional machine learning treats data points as independent. Graph neural networks (GNNs) exploit the relational structure of knowledge graphs to make better predictions.
We use GNNs for:
- Link prediction: Given the existing graph, what edges are likely missing?
- Node classification: Based on a compound's neighborhood in the graph, predict its toxicity profile.
- Graph embeddings: Convert subgraphs into vectors for downstream ML models.
The quality of the ontology directly impacts GNN performance. Well-typed nodes and semantically meaningful edges give the network better signal than a flat, untyped graph.
7.3 The Feedback Loop
The knowledge graph isn't static—it evolves. When our ML models make predictions (e.g., "Drug X likely binds Target Y"), those predictions can be added to the graph as inferred assertions with associated confidence scores.
When predictions are validated experimentally, the confidence is updated, and the assertion is promoted from "inferred" to "experimental." When predictions are falsified, they're marked as such, and the model learns from the failure.
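The promotion logic can be sketched as a small state update (status names and confidence values are illustrative):

```python
def update_assertion(assertion, outcome):
    """Promote or retire an inferred edge based on experimental outcome."""
    if outcome == "validated":
        assertion["status"] = "experimental"
        assertion["confidence"] = max(assertion["confidence"], 0.95)
    elif outcome == "falsified":
        assertion["status"] = "falsified"
        assertion["confidence"] = 0.0
    return assertion

edge = {"triple": ("drug_x", "BINDS", "target_y"),
        "status": "inferred", "confidence": 0.6}
assert update_assertion(edge, "validated")["status"] == "experimental"
```

Falsified edges are kept, not deleted: a marked failure is training signal and institutional memory.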
This creates a virtuous cycle: the graph improves the models, and the models improve the graph.
Conclusion: The Compounding Advantage
Knowledge graphs exhibit network effects. Every new entity added creates potential connections to every existing entity. Every new relationship type enables new queries. The value of the graph grows superlinearly with its size.
More importantly, the ontology is institutional memory. When a scientist leaves, their knowledge often leaves with them. When a company's understanding is encoded in a rigorous ontology, it survives team turnover. It can be queried, extended, and reasoned over by anyone—human or machine—who understands the schema.
This is why we built our Medicine Engine on this foundation. Not because knowledge graphs are fashionable, but because they're correct. Drug discovery is fundamentally about understanding complex relationships. Our tools should reflect that reality.
The companies that will win in AI-driven drug discovery aren't those with the most data. They're those with the most structured data—data that machines can reason over, query efficiently, and extend automatically. The ontology isn't overhead. It's the moat.
[1] The term "GraphRAG" was popularized by Microsoft Research's GraphRAG project, which demonstrated the power of graph-based retrieval for complex queries. Our implementation differs in specifics but shares the core insight: graphs enable reasoning that vector search cannot.