Tables 1 to 4 report the benchmark results for the keyword-guided task. Based on these results, our key findings are as follows:
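
Table 1: Keyword-guided task, plausibility, repetitiveness, ESMscore, and GT-Identity.

Models | PPL-ProtGPT2 ↓ (Plausibility) | PPL-ProGen ↓ (Plausibility) | Repeat ↓ (Repetitiveness) | Rep-2 ↓ (Repetitiveness) | Rep-5 ↓ (Repetitiveness) | ESMscore F1 ↑ | ESMscore Precision ↑ | ESMscore Recall ↑ | GT-Identity ↑ |
---|---|---|---|---|---|---|---|---|---|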
Natural | 604.34 | 8.28 | 2.36 | 37.01 | 1.14 | 100.00 | 100.00 | 100.00 | 100.00 |
Random (U) | 2436.77±21.26 | 21.87±0.03 | 0.01±0.00 | 0.28±0.00 | 0.00±0.00 | 0.72±0.00 | 0.82±0.00 | 0.64±0.00 | 0.32±0.13 |
Random (E) | 3041.98±38.97 | 18.70±0.02 | 1.15±0.06 | 33.95±0.06 | 0.00±0.00 | 72.47±0.05 | 82.78±0.05 | 65.01±0.06 | 0.19±0.06 |
ProteinDT | 1429.68±22.19 | 11.52±0.17 | 10.87±0.58 | 66.57±0.36 | 6.03±0.59 | 72.42±0.04 | 80.69±0.12 | 66.34±0.04 | 0.10±0.09 |
ProteinDTFT | 1860.43±7.88 | 14.09±0.04 | 4.38±0.11 | 50.41±0.16 | 0.86±0.08 | 73.71±0.04 | 82.57±0.05 | 67.09±0.04 | 0.37±0.17 |
Pinal | 584.25±12.33 | 8.48±0.07 | 14.69±0.21 | 61.50±0.18 | 4.99±0.07 | 72.43±0.08 | 75.67±0.06 | 69.86±0.12 | 5.87±0.44 |
PAAG | 2545.89±9.82 | 18.26±0.01 | 2.36±0.09 | 33.98±0.29 | 0.04±0.01 | 73.58±0.01 | 83.40±0.03 | 66.38±0.03 | 0.10±0.04 |
PAAGFT | 2044.39±31.75 | 14.74±0.11 | 4.10±0.33 | 47.92±0.36 | 0.90±0.21 | 73.59±0.06 | 83.19±0.11 | 66.53±0.03 | 0.05±0.05 |
Chroma | 1322.71±18.13 | 12.21±0.02 | 2.70±0.12 | 49.07±0.05 | 0.29±0.05 | 72.76±0.05 | 80.06±0.06 | 67.13±0.04 | 0.11±0.09 |
ESM3 | 282.82±7.74 | 6.05±0.17 | 29.09±4.31 | 67.56±2.73 | 21.81±4.09 | 69.02±1.24 | 72.89±0.56 | 66.07±1.73 | 2.19±1.68 |

Table 2: Keyword-guided task, pLDDT, PAE, and TMscore.

Models | pLDDT ↑ | pLDDT > 70 (%) ↑ | PAE ↓ | PAE < 10 (%) ↑ | TMscore % ↑ |
---|---|---|---|---|---|
Natural | 80.57 | 88.87 | 7.10 | 82.13 | 100.00 |
Random (U) | 24.95±0.05 | 0.05±0.09 | 23.57±0.03 | 0.16±0.16 | 17.44±0.03 |
Random (E) | 27.89±0.12 | 0.16±0.16 | 23.45±0.02 | 0.31±0.00 | 17.42±0.07 |
ProteinDT | 38.72±0.20 | 0.10±0.18 | 26.26±0.06 | 0.00±0.00 | 11.58±0.01 |
ProteinDTFT | 38.66±0.11 | 1.04±0.36 | 23.90±0.06 | 0.42±0.32 | 14.30±0.06 |
Pinal | 66.20±0.61 | 45.97±1.59 | 14.53±0.29 | 32.94±1.61 | 15.76±0.22 |
PAAG | 33.78±0.18 | 0.16±0.16 | 23.02±0.06 | 0.21±0.36 | 17.28±0.12 |
PAAGFT | 41.06±0.24 | 1.14±0.09 | 24.10±0.01 | 0.62±0.63 | 0.05±0.05 |
Chroma | 62.30±0.67 | 29.52±0.86 | 12.55±0.22 | 40.80±1.46 | 18.55±0.07 |
ESM3 | 62.89±3.44 | 37.36±9.16 | 15.65±1.64 | 29.15±9.78 | 19.92±2.35 |
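
Table 3: Keyword-guided task, ProTrek Score, Retrieval Accuracy, and Keyword Recovery.

Models | ProTrek Score % ↑ | Retrieval Accuracy@4 % ↑ | Retrieval Accuracy@10 % ↑ | Retrieval Accuracy@20 % ↑ | Keyword Recovery % ↑ |
---|---|---|---|---|---|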
Natural | 20.32 | 86.29 | 77.41 | 67.60 | 100.00 |
Random (U) | 7.18±0.17 | 26.74±0.24 | 11.68±1.09 | 5.45±1.40 | 0.00±0.00 |
Random (E) | 6.43±0.04 | 26.90±0.44 | 10.70±0.63 | 6.02±0.63 | 0.00±0.00 |
ProteinDT | 3.89±0.17 | 31.57±0.77 | 13.97±0.63 | 7.43±0.63 | 0.03±0.05 |
ProteinDTFT | 6.28±0.08 | 44.91±1.80 | 24.87±0.80 | 17.03±1.19 | 1.08±0.14 |
Pinal | 14.14±0.29 | 69.83±2.01 | 59.08±2.95 | 49.95±1.97 | 28.74±0.24 |
PAAG | 5.40±0.04 | 28.92±1.40 | 12.57±1.09 | 7.74±0.88 | 0.15±0.03 |
PAAGFT | 3.86±0.07 | 32.03±0.65 | 15.47±1.56 | 9.14±0.39 | 0.02±0.03 |
Chroma | 3.13±0.14 | 26.58±1.56 | 10.90±1.50 | 6.02±0.24 | 0.18±0.08 |
ESM3 | 4.87±1.51 | 37.85±9.71 | 23.31±7.22 | 17.13±6.48 | 11.68±7.36 |

Table 4: Keyword-guided task, Novelty and Diversity.

Models | Novelty % ↑ | Diversity % ↑ |
---|---|---|
Natural | 39.16 | - |
Random (U) | 98.70±0.01 | 97.98 |
Random (E) | 98.23±0.04 | 99.75 |
ProteinDT | 99.22±0.11 | 99.76 |
ProteinDTFT | 99.02±0.02 | 99.54 |
Pinal | 75.40±0.72 | 93.70 |
PAAG | 98.60±0.08 | 99.17 |
PAAGFT | 97.06±0.14 | 99.82 |
Chroma | 98.00±0.17 | 98.18 |
ESM3 | 85.12±4.19 | 97.63 |

Tables 5 to 8 report the benchmark results for the description-guided task on all 22 main metrics. Based on these results, our key findings are as follows:
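
Table 5: Description-guided task, plausibility, repetitiveness, ESMscore, and GT-Identity.

Models | PPL-ProtGPT2 ↓ (Plausibility) | PPL-ProGen ↓ (Plausibility) | Repeat ↓ (Repetitiveness) | Rep-2 ↓ (Repetitiveness) | Rep-5 ↓ (Repetitiveness) | ESMscore F1 ↑ | ESMscore Precision ↑ | ESMscore Recall ↑ | GT-Identity ↑ |
---|---|---|---|---|---|---|---|---|---|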
Natural | 318.15 | 5.99 | 1.99 | 44.24 | 0.21 | 100.00 | 100.00 | 100.00 | 100.00 |
Random (U) | 2484.03±4.53 | 21.71±0.00 | 0.01±0.00 | 0.34±0.00 | 0.00±0.00 | 0.71±0.00 | 0.82±0.01 | 0.63±0.00 | 0.37±0.03 |
Random (E) | 3136.88±4.17 | 18.68±0.00 | 0.01±0.00 | 0.41±0.00 | 0.00±0.00 | 0.72±0.00 | 0.83±0.01 | 0.64±0.00 | 0.23±0.04 |
ProteinDT | 1576.23±4.33 | 12.41±0.01 | 6.83±0.10 | 62.36±0.15 | 2.69±0.05 | 72.80±0.05 | 81.44±0.03 | 66.38±0.05 | 0.18±0.02 |
ProteinDTFT | 1213.38±1.92 | 10.80±0.02 | 3.80±0.03 | 50.08±0.08 | 0.79±0.03 | 78.61±0.04 | 83.68±0.04 | 74.68±0.04 | 16.88±0.08 |
Pinal | 308.97±0.67 | 5.81±0.02 | 12.83±0.13 | 58.13±0.16 | 58.13±0.16 | 76.63±0.06 | 77.74±0.08 | 75.99±0.06 | 18.65±0.15 |
PAAG | 2782.70±9.64 | 17.84±0.02 | 2.34±0.02 | 45.66±0.03 | 0.05±0.01 | 73.26±0.03 | 83.10±0.02 | 66.04±0.03 | 0.17±0.02 |
PAAGFT | 1332.35±3.82 | 11.09±0.01 | 3.53±0.04 | 52.25±0.05 | 0.83±0.05 | 77.67±0.00 | 83.80±0.02 | 73.04±0.02 | 13.99±0.12 |
Chroma | 1322.38±13.61 | 12.22±0.02 | 0.03±0.00 | 0.55±0.00 | 0.00±0.00 | 0.73±0.00 | 0.80±0.00 | 0.67±0.00 | 0.22±0.04 |

Table 6: Description-guided task, pLDDT, PAE, and TMscore.

Models | pLDDT ↑ | pLDDT > 70 (%) ↑ | PAE ↓ | PAE < 10 (%) ↑ | TMscore % ↑ |
---|---|---|---|---|---|
Natural | 80.64 | 81.27 | 9.20 | 65.73 | 100.00 |
Random (U) | 22.96±0.04 | 0.16±0.04 | 24.85±0.01 | 0.56±0.03 | 16.95±0.03 |
Random (E) | 25.77±0.03 | 0.20±0.06 | 24.71±0.01 | 0.60±0.04 | 17.10±0.00 |
ProteinDT | 38.29±0.05 | 0.98±0.17 | 25.13±0.02 | 0.40±0.09 | 13.94±0.03 |
ProteinDTFT | 51.42±0.04 | 25.61±0.30 | 18.57±0.02 | 23.92±0.17 | 22.79±0.10 |
Pinal | 75.25±0.19 | 68.97±0.35 | 10.96±0.10 | 58.44±0.38 | 23.77±0.15 |
PAAG | 28.39±0.07 | 0.07±0.03 | 25.38±0.01 | 0.10±0.03 | 14.63±0.03 |
PAAGFT | 50.37±0.17 | 23.86±0.22 | 19.96±0.02 | 21.99±0.26 | 21.67±0.11 |
Chroma | 59.18±0.09 | 20.17±0.23 | 15.03±0.04 | 28.62±0.62 | 17.93±0.02 |
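
Table 7: Description-guided task, ProTrek Score, EvoLlama Score, Retrieval Accuracy, and Keyword Recovery.

Models | ProTrek Score % ↑ | EvoLlama Score % ↑ | Retrieval Accuracy@4 % ↑ | Retrieval Accuracy@10 % ↑ | Retrieval Accuracy@20 % ↑ | Keyword Recovery % ↑ |
---|---|---|---|---|---|---|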
Natural | 27.00 | 60.33 | 68.57 | 45.23 | 28.06 | 100.00 |
Random (U) | 1.03±0.04 | 36.23±0.07 | 29.33±0.40 | 13.06±0.39 | 6.89±0.07 | 1.86±0.00 |
Random (E) | 1.04±0.06 | 34.11±0.11 | 29.39±0.46 | 12.91±0.24 | 6.78±0.36 | 1.86±0.00 |
ProteinDT | 1.20±0.06 | 40.57±0.05 | 34.24±0.23 | 16.93±0.39 | 9.28±0.36 | 2.03±0.03 |
ProteinDTFT | 13.89±0.04 | 52.84±0.20 | 68.88±0.61 | 55.56±0.37 | 47.29±0.53 | 37.00±0.08 |
Pinal | 17.50±0.08 | 53.42±0.31 | 71.53±0.66 | 63.49±0.81 | 57.95±0.31 | 42.65±0.38 |
PAAG | 1.29±0.04 | 34.39±0.18 | 29.54±0.23 | 13.63±0.23 | 7.06±0.27 | 1.86±0.00 |
PAAGFT | 10.04±0.06 | 49.69±0.12 | 55.66±0.17 | 40.66±0.73 | 33.66±0.55 | 28.86±0.07 |
Chroma | 2.10±0.02 | 40.10±0.24 | 29.63±0.27 | 13.58±0.62 | 7.33±0.33 | 1.94±0.03 |

Table 8: Description-guided task, Novelty and Diversity.

Models | Novelty % ↑ | Diversity % ↑ |
---|---|---|
Natural | 36.12 | - |
Random (U) | 98.77±0.03 | 97.01 |
Random (E) | 98.46±0.01 | 99.56 |
ProteinDT | 96.93±0.12 | 99.23 |
ProteinDTFT | 75.09±0.04 | 79.87 |
Pinal | 55.56±0.17 | 95.28 |
PAAG | 98.90±0.02 | 99.15 |
PAAGFT | 78.55±0.11 | 86.09 |
Chroma | 97.27±0.02 | 97.00 |

In this section, we analyze the correlations among the evaluation metrics in PDFBench.
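
As a rough illustration of what a metric-to-metric correlation looks like in practice, the sketch below computes a Spearman rank correlation between the mean pLDDT and mean PAE values reported above for the keyword-guided task. The choice of Spearman correlation, the helper names (`ranks`, `pearson`, `spearman`), and the use of per-model means as data points are all assumptions made for this example. The procedure actually used in the analysis is not specified in this section.

```python
from math import sqrt

# Mean pLDDT and PAE per model, copied from the keyword-guided results above.
# Using per-model means as data points is an assumption made for illustration.
KEYWORD_GUIDED = {
    "Natural": (80.57, 7.10),
    "Random (U)": (24.95, 23.57),
    "Random (E)": (27.89, 23.45),
    "ProteinDT": (38.72, 26.26),
    "ProteinDTFT": (38.66, 23.90),
    "Pinal": (66.20, 14.53),
    "PAAG": (33.78, 23.02),
    "PAAGFT": (41.06, 24.10),
    "Chroma": (62.30, 12.55),
    "ESM3": (62.89, 15.65),
}


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


plddt = [v[0] for v in KEYWORD_GUIDED.values()]
pae = [v[1] for v in KEYWORD_GUIDED.values()]
rho = spearman(plddt, pae)
print(f"Spearman rho (pLDDT vs. PAE): {rho:.2f}")
```

The coefficient comes out negative, which matches the arrows in the table headers: higher pLDDT tends to accompany lower PAE. If SciPy is available, `scipy.stats.spearmanr` returns the same coefficient along with a p-value.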
@misc{kuang2025pdfbenchbenchmarknovoprotein, title={PDFBench: A Benchmark for De novo Protein Design from Function}, author={Jiahao Kuang and Nuowei Liu and Changzhi Sun and Tao Ji and Yuanbin Wu}, year={2025}, eprint={2505.20346}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2505.20346}, }