PepFold

Methodology

PepFold is a five-stage computational pipeline that takes SNP identifiers (rsIDs) as input and produces annotated pharmacogenomic reports with ranked peptide candidates and Fmoc-SPPS synthesis protocols.

Pipeline Architecture

rsIDs → [1. ClinVar] → [2. UniProt] → [3. Evo 2 / Rational Design] → [4. ESMFold] → [5. Scoring] → Report

Stage 1: Variant Annotation

  • Source: NCBI ClinVar database
  • Input: List of rsIDs (e.g., rs429358)
  • Output: Clinical significance (pathogenic, benign, drug-response, etc.), associated gene symbol, review status
  • Filtering: Variants classified as purely "benign" are excluded from downstream analysis
  • Performance: Common pharmacogenomic variants are cached locally to reduce latency
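The filtering and caching behaviour described above can be sketched as follows. This is a minimal illustration, not PepFold's implementation: the `VariantAnnotation` record, the `fetch` callable, and the reading of "purely benign" (every listed significance term is exactly "benign") are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantAnnotation:
    """Hypothetical record holding the Stage 1 outputs for one rsID."""
    rsid: str
    gene: str
    clinical_significance: str
    review_status: str


def is_purely_benign(significance: str) -> bool:
    """True only when every listed significance term is exactly 'benign'.

    Assumption: combined labels are separated by ';', ',' or '/'.
    """
    terms = [t.strip().lower() for t in re.split(r"[;,/]", significance) if t.strip()]
    return bool(terms) and all(t == "benign" for t in terms)


def filter_variants(annotations):
    """Drop purely benign variants; keep everything else for downstream stages."""
    return [a for a in annotations if not is_purely_benign(a.clinical_significance)]


def annotate(rsid, cache, fetch):
    """Return the cached annotation if present; otherwise fetch it once and store it."""
    if rsid not in cache:
        cache[rsid] = fetch(rsid)
    return cache[rsid]
```

Under this reading, a variant labelled "Benign; drug response" is kept, because it carries a second, non-benign term.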

Stage 2: Target Mapping

  • Source: UniProt protein database
  • Input: Gene symbols from Stage 1
  • Output: Protein sequence, functional domains, binding sites, protein name
  • Region selection: Proprietary algorithm identifies optimal candidate interaction regions on each target protein
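As an illustration of the lookup step only (region selection is proprietary and not shown), a gene symbol could be turned into a UniProt REST search URL along these lines. The restriction to reviewed human entries (`organism_id:9606`, `reviewed:true`) and the chosen return fields are assumptions for this sketch and should be checked against the UniProt API documentation before use.

```python
from urllib.parse import urlencode

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def uniprot_query_url(gene_symbol: str, organism_id: int = 9606) -> str:
    """Build a search URL for one gene symbol.

    Assumption: only reviewed (Swiss-Prot) human entries are wanted.
    """
    params = {
        "query": f"gene_exact:{gene_symbol} AND organism_id:{organism_id} AND reviewed:true",
        # Assumed field names for protein name, sequence, domains, and binding sites.
        "fields": "accession,protein_name,sequence,ft_domain,ft_binding",
        "format": "json",
        "size": 1,
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"
```

The function only builds the URL; sending the request and parsing the JSON response are left out so the sketch stays independent of any HTTP client.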

Stage 3: Peptide Generation

Primary: NVIDIA BioNeMo Evo 2

  • 40-billion-parameter genomic foundation model
  • AI-driven candidate generation from target binding regions
  • Multiple candidates generated per target for ranking

Fallback: Proprietary Rational Design Engine

When the Evo 2 API is unavailable, the pipeline activates a proprietary rational design engine that generates candidates using biochemical complementarity principles. This fallback is deterministic and reproducible.

Note: The generation method (evo2 or rational_design) is recorded per candidate in the report.
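The primary/fallback behaviour and the per-candidate method label can be sketched as below. The `primary` and `fallback` callables are hypothetical stand-ins for the Evo 2 client and the rational design engine, and treating `ConnectionError` and `TimeoutError` as "API unavailable" is an assumption of this example.

```python
def generate_candidates(target_region, primary, fallback, n=5):
    """Try the primary generator; on a connection failure use the fallback.

    Each returned candidate records which method produced it.
    """
    try:
        sequences = primary(target_region, n)
        method = "evo2"
    except (ConnectionError, TimeoutError):
        # Assumption: these two exceptions signal that the API is unavailable.
        sequences = fallback(target_region, n)
        method = "rational_design"
    return [{"sequence": s, "method": method} for s in sequences]
```

Recording the method next to each sequence, rather than once per run, keeps the label accurate even if availability changes between targets.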

Stage 4: Structure Prediction

  • Source: ESMFold (Meta)
  • Input: Peptide amino acid sequence
  • Output: PDB-format 3D structure with per-residue pLDDT confidence scores
  • Rendering: Interactive 3D viewers in HTML reports, colored by confidence
  • Failure mode: If ESMFold is unavailable, the candidate is flagged in the report as having no predicted structure
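ESMFold writes per-residue pLDDT into the B-factor column of its PDB output, so the confidence values can be read back with plain fixed-width parsing. The sketch below is one way to do that, assuming a single-chain peptide and taking one value per residue from the alpha-carbon (CA) atom; the function names are illustrative. Whether values fall on a 0-1 or 0-100 scale depends on the ESMFold distribution used, so the code does not rescale them.

```python
def ca_plddt(pdb_text: str) -> dict[int, float]:
    """Map residue number -> pLDDT, read from the B-factor field of CA atoms.

    Uses the fixed-width PDB ATOM layout: atom name in columns 13-16,
    residue number in columns 23-26, B-factor in columns 61-66.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def mean_plddt(pdb_text: str) -> float:
    """Average CA pLDDT over the whole peptide."""
    scores = ca_plddt(pdb_text)
    if not scores:
        raise ValueError("no CA atoms found in PDB text")
    return sum(scores.values()) / len(scores)
```

The per-residue dictionary is the kind of data a viewer could use to color the structure by confidence, and the mean gives a single summary value per candidate.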

Stage 5: Multi-Dimensional Scoring

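Candidates are evaluated using a proprietary scoring system across four complementary dimensions. The scoring rules themselves are deterministic and do not use machine learning, although the structural confidence dimension takes ESMFold's model-derived confidence metrics as input.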

Dimension | What it measures
Binding affinity | Estimated interaction strength between the peptide and target protein binding region
Structural confidence | Quality and reliability of the predicted 3D fold, informed by ESMFold confidence metrics
Clinical relevance | Strength of the clinical evidence for the underlying genetic variant, derived from ClinVar annotations
Novelty | Sequence uniqueness relative to other candidates, rewarding diverse therapeutic approaches

Each dimension is weighted according to a proprietary formula. Candidates are ranked by composite score, and the top N per target are selected for the report.
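The weighting and top-N selection can be illustrated with a simple weighted sum. The actual formula and weights are proprietary; the linear form, the weight values, the 0-1 score range, and the dictionary layout below are placeholders chosen only to make the example concrete.

```python
# Hypothetical weights; the real formula is proprietary and may not be linear.
WEIGHTS = {"binding": 0.4, "structure": 0.3, "clinical": 0.2, "novelty": 0.1}


def composite_score(scores, weights=WEIGHTS):
    """Weighted sum of per-dimension scores (each assumed to lie in [0, 1])."""
    return sum(weights[dim] * scores[dim] for dim in weights)


def top_n_per_target(candidates, n):
    """Group candidates by target, rank by composite score, keep the best n."""
    by_target = {}
    for cand in candidates:
        by_target.setdefault(cand["target"], []).append(cand)
    return {
        target: sorted(group, key=lambda c: composite_score(c["scores"]), reverse=True)[:n]
        for target, group in by_target.items()
    }
```

Because the function is a pure computation over its inputs, the same candidates and weights always produce the same ranking, which matches the deterministic, rule-based character described above.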

Synthesis Protocol Generation

Each top-ranked candidate receives a complete Fmoc-SPPS (solid-phase peptide synthesis) protocol adapted to its specific sequence properties.

  • Resin selection: Automatically chosen based on the peptide's C-terminal residue
  • Coupling sequence: Step-by-step amino acid additions with adapted reagents and timing
  • Cleavage & deprotection: Cocktail formulation adapted to the sequence's amino acid composition
  • Purification: RP-HPLC parameters tailored to the peptide's physicochemical properties
  • Quality control: ESI-MS, analytical HPLC, amino acid analysis, endotoxin testing
  • Estimates: Cost and time projections based on sequence length and complexity

IMPORTANT: These are computationally generated templates requiring laboratory optimization and validation by qualified chemists.
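For the ESI-MS quality-control step listed above, the expected signal can be estimated from the sequence alone. The sketch below sums standard monoisotopic residue masses, adds one water, and converts to m/z for a given charge state. It assumes an unmodified linear peptide with a free N-terminus and a free C-terminal acid; a C-terminal amide or any side-chain modification would shift the values. The function names are illustrative.

```python
# Standard monoisotopic residue masses in daltons (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O added for the free termini
PROTON = 1.007276   # mass of one proton


def monoisotopic_mass(sequence: str) -> float:
    """Neutral monoisotopic mass of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


def expected_mz(sequence: str, charge: int) -> float:
    """Expected m/z of the [M + zH]z+ ion in positive-mode ESI-MS."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (monoisotopic_mass(sequence) + charge * PROTON) / charge
```

An unknown residue letter raises `KeyError`, which is deliberate: silently skipping it would give a plausible but wrong mass.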

Report Format

  • HTML: Interactive report with 3D molecular viewers and Plotly charts
  • PDF: Print-ready export for distribution and archival
  • Sections: Variant annotations, target mapping, ranked candidates with score breakdowns, 3D structure viewers, synthesis protocol per candidate

Data Sources

Source | Purpose | Availability
NCBI ClinVar | Variant annotation and clinical significance | Public, free
UniProt | Protein target mapping and binding site data | Public, free
NVIDIA BioNeMo Evo 2 | AI-driven peptide candidate generation | Free tier available
ESMFold (Meta) | 3D structure prediction | Public, free

In addition to these public data sources, PepFold applies proprietary algorithms for region selection, candidate design, and multi-dimensional scoring.

Limitations

  • No molecular docking or free energy calculations
  • Binding scores are heuristic, not physics-based
  • ESMFold predictions are computational models, not experimental structures
  • Peptide generation may fall back from Evo 2 to rational design (the method used is recorded per candidate in each report)
  • Synthesis protocols require wet-lab optimization
  • Pipeline outputs have not been experimentally validated

Reproducibility

  • Pipeline is deterministic given the same external API responses
  • The rational design fallback is fully reproducible across runs
  • All external data sources are timestamped in reports
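One way to make "same external API responses" checkable is to store a timestamp and a content hash alongside each response. The following is a sketch of that idea, not a description of how PepFold records its timestamps; the record layout and function name are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(source, payload, retrieved_at=None):
    """Describe one external response: where it came from, when, and its hash.

    The payload is serialized with sorted keys so that logically identical
    JSON objects always hash to the same value.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    when = retrieved_at or datetime.now(timezone.utc)
    return {
        "source": source,
        "retrieved_at": when.isoformat(),
        "sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }
```

Two runs whose records carry identical hashes for every source received the same inputs, so any difference in their outputs would point to the pipeline rather than to upstream data changes.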

Citation

If you use PepFold in your research, please cite:

PepFold: Pharmacogenomic Variant-to-Synthesis Pipeline. Olam Création, 2026. https://pepfold.com