Methodology
PepFold is a 5-stage computational pipeline that takes SNP variant identifiers (rsIDs) as input and produces annotated pharmacogenomic reports with ranked peptide candidates and Fmoc-SPPS synthesis protocols.
Pipeline Architecture
Stage 1: Variant Annotation
- Source: NCBI ClinVar database
- Input: List of rsIDs (e.g., rs429358)
- Output: Clinical significance (pathogenic, benign, drug-response, etc.), associated gene symbol, review status
- Filtering: Variants classified as purely "benign" are excluded from downstream analysis
- Performance: Common pharmacogenomic variants are cached locally to reduce latency
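The filtering and caching steps above can be sketched as follows. This is a minimal illustration, not PepFold's implementation: the `annotate` function, the record layout, and the `rsEXAMPLE*` identifiers are hypothetical, the records are made-up placeholders rather than real ClinVar annotations, and an in-memory `lru_cache` stands in for whatever local cache the pipeline actually uses.

```python
from functools import lru_cache

# Illustrative placeholder records only; these are NOT real ClinVar annotations.
_EXAMPLE_RECORDS = {
    "rsEXAMPLE1": ("GENE_A", ("drug-response",), "reviewed"),
    "rsEXAMPLE2": ("GENE_B", ("benign",), "reviewed"),
    "rsEXAMPLE3": ("GENE_C", ("benign", "drug-response"), "reviewed"),
}


@lru_cache(maxsize=1024)
def annotate(rsid: str) -> tuple:
    """Hypothetical lookup; a real pipeline would query ClinVar here."""
    return _EXAMPLE_RECORDS[rsid]


def is_purely_benign(significance) -> bool:
    """True only when every clinical-significance label is exactly 'benign'."""
    labels = {s.strip().lower() for s in significance}
    return labels == {"benign"}


def filter_variants(rsids):
    """Annotate each rsID and drop variants classified as purely benign."""
    kept = []
    for rsid in rsids:
        gene, significance, review_status = annotate(rsid)
        if not is_purely_benign(significance):
            kept.append({
                "rsid": rsid,
                "gene": gene,
                "significance": list(significance),
                "review_status": review_status,
            })
    return kept


kept = filter_variants(["rsEXAMPLE1", "rsEXAMPLE2", "rsEXAMPLE3"])
kept_ids = [v["rsid"] for v in kept]
```

Note that a variant carrying both "benign" and another label (the third record) is kept in this sketch, which is one reading of "purely benign"; the actual rule may differ.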
Stage 2: Target Mapping
- Source: UniProt protein database
- Input: Gene symbols from Stage 1
- Output: Protein sequence, functional domains, binding sites, protein name
- Region selection: Proprietary algorithm identifies optimal candidate interaction regions on each target protein
Stage 3: Peptide Generation
Primary: NVIDIA BioNeMo Evo 2
- 40B-parameter genomic foundation model
- AI-driven candidate generation from target binding regions
- Multiple candidates generated per target for ranking
Fallback: Proprietary Rational Design Engine
When the Evo 2 API is unavailable, the pipeline activates a proprietary rational design engine that generates candidates using biochemical complementarity principles. This fallback is deterministic and reproducible.
Note: The generation method (evo2 or rational_design) is recorded per candidate in the report.
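The primary/fallback behaviour and the per-candidate method label can be sketched like this. Everything here is hypothetical: `generate_with_evo2` is a placeholder that always fails (it does not call any real API), and `rational_design` is a toy hash-based generator that only demonstrates determinism; it is not the proprietary engine and has no biochemical meaning.

```python
import hashlib
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


class Evo2Unavailable(Exception):
    """Raised by the hypothetical Evo 2 client when the API cannot be reached."""


@dataclass(frozen=True)
class Candidate:
    sequence: str
    method: str  # "evo2" or "rational_design", as recorded in the report


def generate_with_evo2(region: str, n: int) -> list:
    """Placeholder for the real API call; always unavailable in this sketch."""
    raise Evo2Unavailable("API not reachable")


def rational_design(region: str, n: int) -> list:
    """Toy deterministic generator: same region and index give the same peptide."""
    peptides = []
    for i in range(n):
        digest = hashlib.sha256(f"{region}:{i}".encode()).digest()
        peptides.append("".join(AMINO_ACIDS[b % 20] for b in digest[:10]))
    return peptides


def generate_candidates(region: str, n: int = 3) -> list:
    """Try the primary generator, fall back on failure, and tag each candidate."""
    try:
        sequences, method = generate_with_evo2(region, n), "evo2"
    except Evo2Unavailable:
        sequences, method = rational_design(region, n), "rational_design"
    return [Candidate(seq, method) for seq in sequences]


candidates = generate_candidates("MKTAYIAK")
```

Because the fallback derives its output solely from the input region, repeated runs yield identical candidates, which is the property the text describes.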
Stage 4: Structure Prediction
- Source: ESMFold (Meta)
- Input: Peptide amino acid sequence
- Output: PDB-format 3D structure with per-residue pLDDT confidence scores
- Rendering: Interactive 3D viewers in HTML reports, colored by confidence
- Failure mode: If ESMFold is unavailable, the candidate is flagged accordingly
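As an illustration of working with the Stage 4 output, the sketch below reads per-residue pLDDT from PDB text. It assumes the scores are stored in the B-factor column (columns 61-66) on a 0-100 scale, which is the usual convention for predicted-structure files but should be checked against the actual output. The confidence bands use the thresholds popularised by the AlphaFold Database; whether PepFold's viewers use the same bands is an assumption. `make_ca_line` exists only to build demo input.

```python
def make_ca_line(serial: int, res_name: str, res_seq: int, plddt: float) -> str:
    """Build a fixed-width PDB ATOM record for a C-alpha atom (demo input only)."""
    return (
        f"ATOM  {serial:5d}  CA  {res_name:>3} A{res_seq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


def per_residue_plddt(pdb_text: str) -> list:
    """Return one pLDDT value per residue, read from the C-alpha B-factor field."""
    scores = []
    for line in pdb_text.splitlines():
        # Atom name occupies columns 13-16, B-factor columns 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def confidence_band(plddt: float) -> str:
    """Map a 0-100 pLDDT value to a band (AlphaFold DB style thresholds)."""
    if plddt >= 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low"


demo_pdb = "\n".join([
    make_ca_line(1, "GLY", 1, 92.5),
    make_ca_line(2, "ALA", 2, 71.0),
    make_ca_line(3, "SER", 3, 45.25),
])
plddt_values = per_residue_plddt(demo_pdb)
bands = [confidence_band(v) for v in plddt_values]
mean_plddt = sum(plddt_values) / len(plddt_values)
```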
Stage 5: Multi-Dimensional Scoring
Candidates are evaluated using a proprietary scoring system across four complementary dimensions. Scoring is deterministic and rule-based. It does not rely on machine learning.
| Dimension | What it measures |
|---|---|
| Binding affinity | Estimated interaction strength between the peptide and target protein binding region |
| Structural confidence | Quality and reliability of the predicted 3D fold, informed by ESMFold confidence metrics |
| Clinical relevance | Strength of the clinical evidence for the underlying genetic variant, derived from ClinVar annotations |
| Novelty | Sequence uniqueness relative to other candidates, rewarding diverse therapeutic approaches |
Each dimension is weighted according to a proprietary formula. Candidates are ranked by composite score, and the top N per target are selected for the report.
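The ranking step can be sketched as a weighted sum followed by a per-target top-N selection. The weights below are equal placeholders chosen purely for illustration, since the real formula is proprietary; the dimension keys, the candidate dictionary layout, and the assumption that each dimension is already scaled to 0-1 are likewise hypothetical.

```python
from collections import defaultdict

DIMENSIONS = (
    "binding_affinity",
    "structural_confidence",
    "clinical_relevance",
    "novelty",
)

# Placeholder weights: NOT the proprietary formula.
WEIGHTS = {dim: 0.25 for dim in DIMENSIONS}


def composite_score(scores: dict) -> float:
    """Weighted sum over the four dimensions (each assumed to lie in 0-1)."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in DIMENSIONS)


def top_n_per_target(candidates: list, n: int) -> dict:
    """Group candidates by target and keep the n highest composite scores."""
    by_target = defaultdict(list)
    for cand in candidates:
        by_target[cand["target"]].append(cand)
    return {
        target: sorted(
            group,
            # Sequence is a tie-breaker so the ordering stays deterministic.
            key=lambda c: (-composite_score(c["scores"]), c["sequence"]),
        )[:n]
        for target, group in by_target.items()
    }


def uniform(value: float) -> dict:
    """Helper for the demo: the same score in every dimension."""
    return {dim: value for dim in DIMENSIONS}


demo = [
    {"target": "T1", "sequence": "AAAA", "scores": uniform(0.75)},
    {"target": "T1", "sequence": "CCCC", "scores": uniform(0.25)},
    {"target": "T1", "sequence": "DDDD", "scores": uniform(0.5)},
    {"target": "T2", "sequence": "EEEE", "scores": uniform(0.5)},
]
ranked = top_n_per_target(demo, n=2)
```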
Synthesis Protocol Generation
Each top-ranked candidate receives a complete Fmoc-SPPS (solid-phase peptide synthesis) protocol adapted to its specific sequence properties.
- Resin selection: Automatically chosen based on the peptide's C-terminal residue
- Coupling sequence: Step-by-step amino acid additions with adapted reagents and timing
- Cleavage & deprotection: Cocktail formulation adapted to the sequence's amino acid composition
- Purification: RP-HPLC parameters tailored to the peptide's physicochemical properties
- Quality control: ESI-MS, analytical HPLC, amino acid analysis, endotoxin testing
- Estimates: Cost and time projections based on sequence length and complexity
IMPORTANT: These are computationally generated templates requiring laboratory optimization and validation by qualified chemists.
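For the ESI-MS quality-control step, the expected mass of a candidate can be computed directly from its sequence. The sketch below assumes an unmodified linear peptide with free N- and C-termini built from the 20 standard residues; protecting groups, amidation, and other modifications are not handled. The function names are illustrative and do not come from PepFold.

```python
# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once for the free termini
PROTON = 1.007276  # mass of the charge carrier in positive-mode ESI


def monoisotopic_mass(sequence: str) -> float:
    """Neutral monoisotopic mass of an unmodified linear peptide."""
    try:
        return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER
    except KeyError as exc:
        raise ValueError(f"Unsupported residue: {exc.args[0]!r}") from None


def expected_mz(sequence: str, charge: int) -> float:
    """Expected m/z of the [M + zH]z+ ion for a given positive charge state."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (monoisotopic_mass(sequence) + charge * PROTON) / charge


gg_mass = monoisotopic_mass("GG")   # glycylglycine
gg_mz_1 = expected_mz("GG", 1)
```

Comparing the observed ions against `expected_mz` for the first few charge states is a quick identity check before the more detailed analytical HPLC and amino acid analysis.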
Report Format
- HTML: Interactive report with 3D molecular viewers and Plotly charts
- PDF: Print-ready export for distribution and archival
- Sections: Variant annotations, target mapping, ranked candidates with score breakdowns, 3D structure viewers, synthesis protocol per candidate
Data Sources
| Source | Purpose | Availability |
|---|---|---|
| NCBI ClinVar | Variant annotation and clinical significance | Public, free |
| UniProt | Protein target mapping and binding site data | Public, free |
| NVIDIA BioNeMo Evo 2 | AI-driven peptide candidate generation | Free tier available |
| ESMFold (Meta) | 3D structure prediction | Public, free |
In addition to these public data sources, PepFold applies proprietary algorithms for region selection, candidate design, and multi-dimensional scoring.
Limitations
- No molecular docking or free energy calculations
- Binding scores are heuristic, not physics-based
- ESMFold predictions are computational models, not experimental structures
- When the Evo 2 API is unavailable, candidate generation falls back to rational design (the method used is flagged in each report)
- Synthesis protocols require wet-lab optimization
- Pipeline outputs have not been experimentally validated
Reproducibility
- The pipeline is deterministic given identical external API responses
- The rational design fallback is fully reproducible across runs
- All external data sources are timestamped in reports
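One way to make the timestamping above checkable is to store a small provenance record per external response. This is a sketch under assumptions: the record fields and the `provenance_record` helper are hypothetical, and the SHA-256 digest is an addition of this example (the text only states that sources are timestamped). The digest lets two runs confirm they saw byte-identical API responses.

```python
import hashlib
from datetime import datetime, timezone


def provenance_record(source: str, payload: bytes, retrieved_at=None) -> dict:
    """Describe one external response: where it came from, when, and its digest."""
    timestamp = retrieved_at or datetime.now(timezone.utc)
    return {
        "source": source,
        "retrieved_at": timestamp.isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def same_inputs(run_a: list, run_b: list) -> bool:
    """True when two runs consumed identical responses from identical sources."""
    def key(run):
        return sorted((r["source"], r["sha256"]) for r in run)
    return key(run_a) == key(run_b)


fixed_time = datetime(2026, 1, 1, tzinfo=timezone.utc)
record = provenance_record("ClinVar", b'{"example": "payload"}', fixed_time)
```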
Citation
If you use PepFold in your research, please cite:
PepFold: Pharmacogenomic Variant-to-Synthesis Pipeline. Olam Création, 2026. https://pepfold.com