PDFBench

A Benchmark for De novo Protein Design from Function

1School of Computer Science and Technology, East China Normal University
2Institute of Artificial Intelligence (TeleAI), China Telecom
3School of Computer Science, Fudan University

Abstract

In recent years, while natural language processing and multimodal learning have seen rapid advancements, the field of de novo protein design has also experienced significant growth. However, most current methods rely on proprietary datasets and evaluation rubrics, making fair comparisons between different approaches challenging. Moreover, these methods often employ evaluation metrics that capture only a subset of the desired properties of designed proteins, lacking a comprehensive assessment framework. To address these issues, we introduce PDFBench, the first comprehensive benchmark for evaluating de novo protein design from function. PDFBench supports two tasks: description-guided design and keyword-guided design. To ensure fair and multifaceted evaluation, we compile 22 metrics covering sequence plausibility, structural fidelity, and language-protein alignment, along with measures of novelty and diversity. We evaluate five state-of-the-art baselines, revealing their respective strengths and weaknesses across tasks. Finally, we analyze inter-metric correlations, exploring the relationships between the four categories of metrics and offering guidelines for metric selection. PDFBench establishes a unified framework to drive future advances in function-driven de novo protein design.
Figure 1: Summary of existing models for protein design from function.

Keyword-Guided

In Table 1, Table 2, Table 3, and Table 4, we report the benchmark results for the keyword-guided task. Based on these results, our key findings are as follows:

  • The keyword-guided task is more challenging than the description-guided task. Compared to their performance on the description-guided task, baselines exhibit substantial degradation on the keyword-guided task, with the exception of PPL-ProtGPT2, Novelty, and Diversity. For structure-based metrics, the optimal pLDDT score drops from 75.25 to 66.20, while the corresponding PAE increases from 10.96 to 14.53. In terms of language alignment, the ProTrek Score declines from 17.50 to 14.14, and the optimal Keyword Recovery score decreases from 42.65% to 28.74%. Most sequence-based metrics show a similar downward trend.
  • Fine-tuning provides limited benefit. Unlike in the description-guided task, fine-tuning yields only modest gains on the keyword-guided task, likely due to its greater complexity. Specifically, both ProteinDTFT and PAAGFT show only marginal improvements on most structure-based and language alignment metrics, with PAAGFT even declining in ProTrek Score and PAE.
  • No model demonstrates satisfactory Foldability. For the four foldability metrics, a pLDDT score greater than 70 and a PAE less than 10 are generally required to indicate that a designed protein folds well (a minimal sketch of this thresholding follows this list). However, the highest pLDDT score, 66.20, is achieved by Pinal, with only 45.97% of its sequences exceeding the pLDDT > 70 threshold, while the lowest PAE, 12.55, is achieved by Chroma, with just 40.80% of its sequences meeting the PAE < 10 criterion. These results indicate that no model exhibits satisfactory foldability on the keyword-guided task.
  • Proteins designed by Pinal perform best in Language Alignment. Pinal significantly outperforms all other baselines across all Language Alignment metrics, with its advantage even more pronounced than on the description-guided task. This suggests that Pinal maintains relatively strong performance and robustness on the more challenging keyword-guided task.
  • ESM3-designed proteins perform best only in Plausibility. ESM3, optimized specifically for multi-hot encoded keywords, takes InterPro IDs as input rather than keyword-name prompts, and achieves the best performance only on Plausibility. Moreover, it shows suboptimal or even inferior performance on more critical metrics such as Foldability and Language Alignment.
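As referenced above, the foldability criterion simply thresholds per-sequence predicted-structure quality. Below is a minimal, illustrative Python sketch of that computation, assuming per-sequence mean pLDDT and mean PAE values have already been obtained from a structure predictor; the function and variable names are ours, not PDFBench's actual code.

import numpy as np

def foldability_rates(plddt, pae, plddt_thresh=70.0, pae_thresh=10.0):
    """Mean scores plus the fraction of designs passing the common
    foldability thresholds (pLDDT > 70 and PAE < 10)."""
    plddt = np.asarray(plddt, dtype=float)  # mean pLDDT per designed sequence
    pae = np.asarray(pae, dtype=float)      # mean PAE per designed sequence (Angstrom)
    pct_plddt = 100.0 * np.mean(plddt > plddt_thresh)
    pct_pae = 100.0 * np.mean(pae < pae_thresh)
    return float(plddt.mean()), pct_plddt, float(pae.mean()), pct_pae

# Toy example (not benchmark data).
mean_plddt, pct_plddt70, mean_pae, pct_pae10 = foldability_rates(
    plddt=[82.1, 64.3, 71.8], pae=[6.2, 14.9, 12.7])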
Table 1: Sequence-based results on the keyword-guided task
Models PPL-ProtGPT2 ↓ PPL-ProGen ↓ Repeat ↓ Rep-2 ↓ Rep-5 ↓ ESMscore-F1 ↑ ESMscore-Precision ↑ ESMscore-Recall ↑ GT-Identity ↑
Natural 604.34 8.28 2.36 37.01 1.14 100.00 100.00 100.00 100.00
Random (U) 2436.77±21.26 21.87±0.03 0.01±0.00 0.28±0.00 0.00±0.00 0.72±0.00 0.82±0.00 0.64±0.00 0.32±0.13
Random (E) 3041.98±38.97 18.70±0.02 1.15±0.06 33.95±0.06 0.00±0.00 72.47±0.05 82.78±0.05 65.01±0.06 0.19±0.06
ProteinDT 1429.68±22.19 11.52±0.17 10.87±0.58 66.57±0.36 6.03±0.59 72.42±0.04 80.69±0.12 66.34±0.04 0.10±0.09
ProteinDTFT 1860.43±7.88 14.09±0.04 4.38±0.11 50.41±0.16 0.86±0.08 73.71±0.04 82.57±0.05 67.09±0.04 0.37±0.17
Pinal 584.25±12.33 8.48±0.07 14.69±0.21 61.50±0.18 4.99±0.07 72.43±0.08 75.67±0.06 69.86±0.12 5.87±0.44
PAAG 2545.89±9.82 18.26±0.01 2.36±0.09 33.98±0.29 0.04±0.01 73.58±0.01 83.40±0.03 66.38±0.03 0.10±0.04
PAAGFT 2044.39±31.75 14.74±0.11 4.10±0.33 47.92±0.36 0.90±0.21 73.59±0.06 83.19±0.11 66.53±0.03 0.05±0.05
Chroma 1322.71±18.13 12.21±0.02 2.70±0.12 49.07±0.05 0.29±0.05 72.76±0.05 80.06±0.06 67.13±0.04 0.11±0.09
ESM3 282.82±7.74 6.05±0.17 29.09±4.31 67.56±2.73 21.81±4.09 69.02±1.24 72.89±0.56 66.07±1.73 2.19±1.68
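The Repetitiveness columns in Table 1 (Repeat, Rep-2, Rep-5) quantify repeated patterns within a designed sequence. As a rough, assumption-laden illustration only, the sketch below computes an n-gram repetition rate in the spirit of the rep-n metrics used for text generation (the percentage of n-grams that are duplicates); PDFBench's exact definitions may differ, and the helper name rep_n is ours.

def rep_n(sequence: str, n: int) -> float:
    """Percentage of length-n substrings that are repeats,
    i.e. 100 * (1 - unique n-grams / total n-grams)."""
    ngrams = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * (1.0 - len(set(ngrams)) / len(ngrams))

# A repetitive stretch scores far higher than a varied one.
print(rep_n("AAAAAAAAGGGGGGGG", 2))                   # 80.0
print(rep_n("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 2))  # low (few repeated bigrams)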
Table 2: Structure-based results on the keyword-guided task
Models pLDDT ↑ pLDDT > 70 (% of seqs) ↑ PAE ↓ PAE < 10 (% of seqs) ↑ TMscore (%) ↑
Natural 80.57 88.87 7.10 82.13 100.00
Random (U) 24.95±0.05 0.05±0.09 23.57±0.03 0.16±0.16 17.44±0.03
Random (E) 27.89±0.12 0.16±0.16 23.45±0.02 0.31±0.00 17.42±0.07
ProteinDT 38.72±0.20 0.10±0.18 26.26±0.06 0.00±0.00 11.58±0.01
ProteinDTFT 38.66±0.11 1.04±0.36 23.90±0.06 0.42±0.32 14.30±0.06
Pinal 66.20±0.61 45.97±1.59 14.53±0.29 32.94±1.61 15.76±0.22
PAAG 33.78±0.18 0.16±0.16 23.02±0.06 0.21±0.36 17.28±0.12
PAAGFT 41.06±0.24 1.14±0.09 24.10±0.01 0.62±0.63 0.05±0.05
Chroma 62.30±0.67 29.52±0.86 12.55±0.22 40.80±1.46 18.55±0.07
ESM3 62.89±3.44 37.36±9.16 15.65±1.64 29.15±9.78 19.92±2.35
Table 3: Language Alignment on the keyword-guided task
Models ProTrek Score (%) ↑ Retrieval Accuracy@4 (%) ↑ Retrieval Accuracy@10 (%) ↑ Retrieval Accuracy@20 (%) ↑ Keyword Recovery (%) ↑
Natural 20.32 86.29 77.41 67.6 100.00
Random (U) 7.18±0.17 26.74±0.24 11.68±1.09 5.45±1.40 0.00±0.00
Random (E) 6.43±0.04 26.90±0.44 10.70±0.63 6.02±0.63 0.00±0.00
ProteinDT 3.89±0.17 31.57±0.77 13.97±0.63 7.43±0.63 0.03±0.05
ProteinDTFT 6.28±0.08 44.91±1.80 24.87±0.80 17.03±1.19 1.08±0.14
Pinal 14.14±0.29 69.83±2.01 59.08±2.95 49.95±1.97 28.74±0.24
PAAG 5.40±0.04 28.92±1.40 12.57±1.09 7.74±0.88 0.15±0.03
PAAGFT 3.86±0.07 32.03±0.65 15.47±1.56 9.14±0.39 0.02±0.03
Chroma 3.13±0.14 26.58±1.56 10.90±1.50 6.02±0.24 0.18±0.08
ESM3 4.87±1.51 37.85±9.71 23.31±7.22 17.13±6.48 11.68±7.36
Table 4: Novelty and Diversity on the keyword-guided task
Models Novelty % ↑ Diversity % ↑
Natural 39.16 -
Random (U) 98.70±0.01 97.98
Random (E) 98.23±0.04 99.75
ProteinDT 99.22±0.11 99.76
ProteinDTFT 99.02±0.02 99.54
Pinal 75.40±0.72 93.70
PAAG 98.60±0.08 99.17
PAAGFT 97.06±0.14 99.82
Chroma 98.00±0.17 98.18
ESM3 85.12±4.19 97.63
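Novelty and Diversity (Table 4) both build on pairwise sequence similarity: Novelty rewards designs that are dissimilar from natural proteins, while Diversity rewards dissimilarity among the designs themselves. The sketch below is only a hedged illustration of that idea, using a crude difflib ratio as a stand-in for alignment-based sequence identity; PDFBench's actual implementation (e.g. a dedicated aligner or search tool) will differ, and the function names are ours.

from difflib import SequenceMatcher
from itertools import combinations

def identity(a: str, b: str) -> float:
    # Crude stand-in for alignment-based sequence identity, in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def novelty(designed, natural):
    """Average over designs of (1 - max identity to any natural sequence), in %."""
    scores = [1.0 - max(identity(d, nat) for nat in natural) for d in designed]
    return 100.0 * sum(scores) / len(scores)

def diversity(designed):
    """Average pairwise dissimilarity among the designed sequences, in %.
    Requires at least two designs."""
    pairs = list(combinations(designed, 2))
    return 100.0 * sum(1.0 - identity(a, b) for a, b in pairs) / len(pairs)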

Description-Guided

In Table 5, Table 6, Table 7, and Table 8, we report the benchmark results for the description-guided task on all 22 main metrics. Based on these results, our key findings are as follows:

  • Chroma performs unevenly on foldability and language alignment. Chroma performs relatively well in terms of foldability but shows markedly poor results across all six language alignment metrics. This may be attributed to Chroma's incorporation of protein structures from the PDB in its training, while its emphasis on "programmability" potentially reduces its focus on language alignment. In addition, Chroma-designed proteins perform well in terms of sequence plausibility, Repeat, Novelty, and Diversity.
  • Fine-tuning improves performance tremendously. ProteinDTFT and PAAGFT demonstrate substantial performance gains over their original versions across PPL-ProtGPT2, pLDDT, PAE, and all language alignment metrics. Notably, the fine-tuned versions exhibit one to two orders of magnitude improvement in Retrieval Accuracy. However, fine-tuning also results in more pronounced declines in GT-Identity, GT-TMscore, Novelty, and Diversity. In summary, we argue that fine-tuning on SwissMolInst greatly enhances the models' understanding of function descriptions at the cost of reduced sequence and structural diversity.
  • Pinal can design plausible, foldable, and language-aligned proteins. Apart from Repeat, Novelty, and Diversity, Pinal performs best on 11 out of 14 metrics. These results suggest that the two-stage approach employed by Pinal for protein sequence design excels in sequence plausibility, structural foldability, and language alignment.
  • Baselines receive a fairer evaluation. PAAG uses only the Success Rate metric (closely aligned with the definition of Keyword Recovery) to evaluate conditional generation and claims to outperform ProteinDT and Chroma. However, both PAAG and PAAGFT perform significantly worse on Keyword Recovery. ProteinDT introduces the Retrieval Accuracy metric to assess the language alignment of designed proteins (see the ranking sketch after this list), reporting accuracies of over 90%. However, while ProteinDTFT ranks second only to Pinal in language alignment, no model actually reaches such a high level of accuracy in our evaluations.
  • Baselines receive a more comprehensive evaluation. Proteins generated by all versions of PAAG and ProteinDT exhibit poor foldability and sequence plausibility, aspects not considered in their original evaluations. Although these models show a clear advantage in Novelty and Diversity, this alone is insufficient to establish strong design capabilities for description-guided tasks.
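As noted above, Retrieval Accuracy@T asks whether an oracle text-protein model (ProTrek in PDFBench) can pick the true functional description of a designed protein out of T candidates. The sketch below shows only the ranking logic under the random negative-sampling variant discussed later; score_fn stands in for a real text-protein similarity model, and all names are illustrative rather than PDFBench's actual code.

import random

def retrieval_accuracy_at_t(proteins, descriptions, score_fn, t=4, seed=0):
    """For each (protein, true description) pair, sample t-1 negative
    descriptions and count a hit when the true one scores highest."""
    rng = random.Random(seed)
    hits = 0
    for i, (prot, true_desc) in enumerate(zip(proteins, descriptions)):
        negatives = rng.sample(
            [d for j, d in enumerate(descriptions) if j != i], t - 1)
        candidates = [true_desc] + negatives
        scores = [score_fn(prot, d) for d in candidates]
        hits += int(scores.index(max(scores)) == 0)
    return 100.0 * hits / len(proteins)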
Table 5: Sequence-based results on the description-guided task
Models PPL-ProtGPT2 ↓ PPL-ProGen ↓ Repeat ↓ Rep-2 ↓ Rep-5 ↓ ESMscore-F1 ↑ ESMscore-Precision ↑ ESMscore-Recall ↑ GT-Identity ↑
Natural 318.15 5.99 1.99 44.24 0.21 100.00 100.00 100.00 100.00
Random (U) 2484.03±4.53 21.71±0.00 0.01±0.00 0.34±0.00 0.00±0.00 0.71±0.00 0.82±0.01 0.63±0.00 0.37±0.03
Random (E) 3136.88±4.17 18.68±0.00 0.01±0.00 0.41±0.00 0.00±0.00 0.72±0.00 0.83±0.01 0.64±0.00 0.23±0.04
ProteinDT 1576.23±4.33 12.41±0.01 6.83±0.10 62.36±0.15 2.69±0.05 72.80±0.05 81.44±0.03 66.38±0.05 0.18±0.02
ProteinDTFT 1213.38±1.92 10.80±0.02 3.80±0.03 50.08±0.08 0.79±0.03 78.61±0.04 83.68±0.04 74.68±0.04 16.88±0.08
Pinal 308.97±0.67 5.81±0.02 12.83±0.13 58.13±0.16 58.13±0.16 76.63±0.06 77.74±0.08 75.99±0.06 18.65±0.15
PAAG 2782.70±9.64 17.84±0.02 2.34±0.02 45.66±0.03 0.05±0.01 73.26±0.03 83.10±0.02 66.04±0.03 0.17±0.02
PAAGFT 1332.35±3.82 11.09±0.01 3.53±0.04 52.25±0.05 0.83±0.05 77.67±0.00 83.80±0.02 73.04±0.02 13.99±0.12
Chroma 1322.38±13.61 12.22±0.02 0.03±0.00 0.55±0.00 0.00±0.00 0.73±0.00 0.80±0.00 0.67±0.00 0.22±0.04
Table 6: Structure-based results on the description-guided task
Models pLDDT ↑ pLDDT > 70 (% of seqs) ↑ PAE ↓ PAE < 10 (% of seqs) ↑ TMscore (%) ↑
Natural 80.64 81.27 9.20 65.73 100.00
Random (U) 22.96±0.04 0.16±0.04 24.85±0.01 0.56±0.03 16.95±0.03
Random (E) 25.77±0.03 0.20±0.06 24.71±0.01 0.60±0.04 17.10±0.00
ProteinDT 38.29±0.05 0.98±0.17 25.13±0.02 0.40±0.09 13.94±0.03
ProteinDTFT 51.42±0.04 25.61±0.30 18.57±0.02 23.92±0.17 22.79±0.10
Pinal 75.25±0.19 68.97±0.35 10.96±0.10 58.44±0.38 23.77±0.15
PAAG 28.39±0.07 0.07±0.03 25.38±0.01 0.10±0.03 14.63±0.03
PAAGFT 50.37±0.17 23.86±0.22 19.96±0.02 21.99±0.26 21.67±0.11
Chroma 59.18±0.09 20.17±0.23 15.03±0.04 28.62±0.62 17.93±0.02
Table 7: Language Alignment on the description-guided task
Models ProTrek Score (%) ↑ EvoLlama Score (%) ↑ Retrieval Accuracy@4 (%) ↑ Retrieval Accuracy@10 (%) ↑ Retrieval Accuracy@20 (%) ↑ Keyword Recovery (%) ↑
Natural 27.00 60.33 68.57 45.23 28.06 100.00
Random (U) 1.03±0.04 36.23±0.07 29.33±0.40 13.06±0.39 6.89±0.07 1.86±0.00
Random (E) 1.04±0.06 34.11±0.11 29.39±0.46 12.91±0.24 6.78±0.36 1.86±0.00
ProteinDT 1.20±0.06 40.57±0.05 34.24±0.23 16.93±0.39 9.28±0.36 2.03±0.03
ProteinDTFT 13.89±0.04 52.84±0.20 68.88±0.61 55.56±0.37 47.29±0.53 37.00±0.08
Pinal 17.50±0.08 53.42±0.31 71.53±0.66 63.49±0.81 57.95±0.31 42.65±0.38
PAAG 1.29±0.04 34.39±0.18 29.54±0.23 13.63±0.23 7.06±0.27 1.86±0.00
PAAGFT 10.04±0.06 49.69±0.12 55.66±0.17 40.66±0.73 33.66±0.55 28.86±0.07
Chroma 2.10±0.02 40.10±0.24 29.63±0.27 13.58±0.62 7.33±0.33 1.94±0.03
Table 8: Novelty and Diversity on the description-guided task
Models Novelty % ↑ Diversity % ↑
Natural 36.12 -
Random (U) 98.77±0.03 97.01
Random (E) 98.46±0.01 99.56
ProteinDT 96.93±0.12 99.23
ProteinDTFT 75.09±0.04 79.87
Pinal 55.56±0.17 95.28
PAAG 98.90±0.02 99.15
PAAGFT 78.55±0.11 86.09
Chroma 97.27±0.02 97

Rethinking the Evaluation Metrics

In this section, we provide a comprehensive analysis of the correlations among different evaluation metrics on PDFBench.

  • Does PPL Accurately Reflect pLDDT and PAE? In the low PPL range, proteins are well folded, exhibiting high pLDDT scores and low PAE values. In contrast, proteins in the high PPL range struggle to fold into plausible structures. Within the medium PPL range, proteins with higher PPL values tend to display lower pLDDT scores and higher PAE values.
    Figure 2: (a) presents the distribution of PPL and pLDDT. (b) displays the distribution of PPL and PAE. (c) illustrates the Pearson correlation among these metrics. Note that PPL values are categorized into three ranges: values below 500 indicate a low PPL range, values between 500 and 2,000 represent a medium PPL range, and values above 2,000 correspond to a high PPL range. (A minimal sketch of this binning and correlation analysis appears after this list.)
  • Do Repetitive Patterns Lead to Lower Structural Plausibility? High Repeat scores (typically above 10) in protein sequences are associated with lower structural plausibility as indicated by lower pLDDT and higher PAE values.
    Figure 3: (a) presents the distribution of Repeat and pLDDT. (b) presents the distribution of Repeat and PAE.
  • How Faithfully Do Designed Proteins Align with Functional Descriptions? The ProTrek Score and the EvoLlama Score are two oracle-based metrics that demonstrate high agreement in evaluating protein functions. Proteins with a ProTrek Score above 15 and an EvoLlama Score above 50 are considered well-aligned, indicating they faithfully match the input functional descriptions.
    Figure 4: (a) presents the distribution of the ProTrek Score and EvoLlama Score for natural proteins. (b) and (c) present the distributions of the GT-TMscore, ProTrek Score, and EvoLlama Score for proteins designed by Pinal and the fine-tuned ProteinDT.
  • Does High Structural Similarity Indicate High Language Alignment? A high GT-TMscore generally indicates functional similarity among structurally similar proteins. However, high structural similarity is not a prerequisite for designing well-aligned proteins. Therefore, relying solely on this metric provides limited insight into whether designed proteins align well with their functional descriptions.
  • Retrieval Accuracy Varies with the Retrieval Strategy. As shown in Figure 5, the retrieval strategy employed in the Retrieval Accuracy metric has a significant impact on the evaluation results. Randomly sampling the negative pairs can serve as a workaround, but caution should be exercised when interpreting the absolute values of the metric.
    Figure 5: Experimental results for Soft, Random, and Hard Retrieval Accuracy. The Random variant corresponds to the original implementation.
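The PPL analysis in Figure 2 bins perplexity into low (below 500), medium (500 to 2,000), and high (above 2,000) ranges and reports Pearson correlations against pLDDT and PAE. A minimal sketch of that binning and correlation computation is given below; the per-protein arrays are placeholders and SciPy is assumed to be available, so treat this as an illustration rather than the benchmark's exact analysis code.

import numpy as np
from scipy.stats import pearsonr

def ppl_bin(ppl: float) -> str:
    # Bin boundaries follow the ranges used in Figure 2.
    if ppl < 500:
        return "low"
    if ppl <= 2000:
        return "medium"
    return "high"

def ppl_structure_correlation(ppl, plddt, pae):
    """Pearson correlation of PPL against pLDDT and PAE, plus per-bin means."""
    ppl, plddt, pae = map(np.asarray, (ppl, plddt, pae))
    r_plddt, _ = pearsonr(ppl, plddt)   # expected negative: high PPL, low pLDDT
    r_pae, _ = pearsonr(ppl, pae)       # expected positive: high PPL, high PAE
    bins = np.array([ppl_bin(p) for p in ppl])
    per_bin = {b: (plddt[bins == b].mean(), pae[bins == b].mean())
               for b in ("low", "medium", "high") if (bins == b).any()}
    return r_plddt, r_pae, per_bin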

BibTeX

@misc{kuang2025pdfbenchbenchmarknovoprotein,
    title={PDFBench: A Benchmark for De novo Protein Design from Function}, 
    author={Jiahao Kuang and Nuowei Liu and Changzhi Sun and Tao Ji and Yuanbin Wu},
    year={2025},
    eprint={2505.20346},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2505.20346}, 
}