Home

Introduction

FINITE Overview

Recovering functional neighborhoods to resolve the functional dark proteome in sparse interactomes

A computational framework that reformulates protein function prediction as functional neighborhood inference.

A substantial fraction of the proteome remains functionally unresolved, particularly in sparse interactomes where available molecular evidence is fragmented across sequence, structure, and observed associations. In such cases, the main obstacle to Gene Ontology (GO) prediction is often not the absence of functional signal but a neighborhood mismatch: observable molecular context does not align with the functional neighborhoods required to resolve function at appropriate levels of specificity. Here we propose FINITE, a framework that reformulates protein function prediction as functional neighborhood inference. FINITE first recovers latent neighborhood signals from intra-protein representations, then reorganizes them into inter-protein functional context, and finally distills the recovered organization into ontology-level inference over the GO hierarchy. In the learned representation space, the recovered neighborhoods exhibit an information-content (IC)-organized multi-core architecture, in which proteins and GO terms co-localize according to functional specificity without explicit IC supervision. This organization is consistent with FINITE's strongest improvements, observed across three GO branches, for sparsely connected proteins and high-IC GO terms. These results support neighborhood recovery as a computational strategy for resolving function in sparsely characterized regions of the proteome.

Figure 1. Overview of protein function prediction under sparse interactomes. (A) Schematic illustration of intra-protein and inter-protein network organization, showing low-degree peripheral regions and high-degree cores at both the residue-contact and protein-protein interaction scales. (B) Conceptual formulation of GO function prediction as a neighborhood-inference process, where neighborhood structures recovered from intra-protein or inter-protein representations are mapped to Gene Ontology terms. (C) Degree distribution of Swiss-Prot protein interaction networks, highlighting strong network imbalance with a small fraction of high-degree hub proteins and a majority of low-degree nodes. (D) One-hop neighborhood size distributions for GO-annotated Swiss-Prot proteins across protein–protein, protein–GO, and GO–GO relationships, illustrating heterogeneous and sparse neighborhood contexts.

Figure 2. Model architecture of FINITE. Overview of the FINITE framework for protein function inference across intra-protein, inter-protein, and gene ontology levels. (A) Latent neighborhood representation learning: proteins are embedded using alignment-derived and language-model features, enabling weakly similar proteins to form latent neighborhoods in representation space without direct GO supervision. (B) Functional neighborhood network learning: latent relationships are projected into an inductively constructed heterogeneous network integrating protein-protein, protein-GO, and GO-GO relations, where localized subgraphs are sampled and attended to infer functional organization. (C) Neighborhood distillation: functional neighborhood representations are distilled and aligned with GO embeddings to predict GO terms at different semantic scopes, yielding function predictions driven by network-level organization rather than direct feature-to-label mapping.

If you use FINITE, please cite the following article:

INTRODUCTION

We propose FINITE, a functional neighborhood inference framework that reformulates protein function prediction under sparse interactomes as a multi-scale process of neighborhood recovery, reorganization, and distillation. FINITE first addresses the recovery of latent neighborhoods from weak intra-protein evidence by extracting functional regularities from sequence-based or alignment-derived representations, which may remain inaccessible when local structural information or annotation priors are insufficient. It then addresses the reorganization of recovered signals into inter-protein functional context by converting representation-level proximity into functional neighborhood networks, thereby constructing relational neighborhoods even when observed protein-protein or protein-GO associations are sparse. Finally, it distills the reorganized neighborhoods into ontology-level inference over the GO hierarchy.

NOTE

If you are interested in this research area or have any questions, please contact us; we will do our best to answer them. If you use our research results, please cite this article.

Exploration

From fragmented evidence to functional neighborhood recovery under sparse interactomes.

Frontier Perspective

Functional dark proteome under sparse interactomes

Across sequenced proteomes, a substantial fraction of proteins cannot be functionally resolved at any specific level of biological description. This regime reflects not the absence of biological function, but the fragmentation of available evidence across molecular, relational, and ontology-level contexts.

Core problem: Fragmented evidence

Computational consequence: Neighborhood mismatch

FINITE perspective: Functional neighborhood recovery

Problem: Fragmented evidence → Neighborhood mismatch → Functional neighborhood recovery

Functional dark proteome as a problem of recoverability

A substantial fraction of proteins cannot be resolved at any specific level of biological description, not because biological function is absent, but because available evidence remains fragmented across molecular and relational contexts.

We refer to this regime as the functional dark proteome, where uncertainty arises not only from whether function can be assigned, but from the range of specificity levels at which functional claims are supported. In this sense, darkness reflects a failure of recoverability rather than a lack of biological organization.

The Gene Ontology provides the hierarchical structure that makes this problem explicit: shallow terms capture broad categories, whereas deep, high-information-content terms specify precise molecular, process-level, or cellular roles. Resolving function therefore depends not only on assigning a term, but on determining the level of specificity that available evidence can support.
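
The relationship between depth and information content can be illustrated with a small sketch. It assumes the common annotation-frequency definition, IC(t) = log2(N / n_t), where n_t counts the proteins annotated with term t or any of its descendants and N is the number of annotated proteins. This page does not state which IC definition FINITE uses, and the toy hierarchy, protein names, and function names below are hypothetical.

```python
import math

# Toy GO-like hierarchy: term -> parents (hypothetical terms, not real GO IDs).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "ion_binding": ["binding"],
    "zinc_ion_binding": ["ion_binding"],
    "catalytic_activity": ["root"],
}

# Direct annotations for four hypothetical proteins.
ANNOTATIONS = {
    "P1": {"zinc_ion_binding"},
    "P2": {"ion_binding"},
    "P3": {"binding"},
    "P4": {"catalytic_activity"},
}


def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents[t])
    return seen


def information_content(annotations, parents):
    """IC(t) = log2(N / n_t), counting each protein once per propagated term."""
    counts = {t: 0 for t in parents}
    for terms in annotations.values():
        propagated = set()
        for t in terms:
            propagated |= ancestors(t, parents)
        for t in propagated:
            counts[t] += 1
    n = len(annotations)
    return {t: math.log2(n / c) for t, c in counts.items() if c > 0}


ic = information_content(ANNOTATIONS, PARENTS)
for term in PARENTS:
    print(f"{term}: IC = {ic[term]:.2f}")
```

In this toy example the root term has zero IC because every protein carries it after propagation, while the deepest term is supported by a single protein and receives the highest IC. That is the pattern the paragraph above describes: the most specific terms are also the ones with the least supporting annotation.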

Problem

Why conventional GO prediction becomes unreliable

Standard approaches often assume that functional evidence can be recovered from strong priors such as sequence similarity, conserved domains, or observed molecular associations. Under sparse interactomes, however, these priors become incomplete or unevenly distributed, making it difficult to place proteins within the functional neighborhoods required for GO inference.

Mismatch

Observable neighborhoods do not match functional neighborhoods

FINITE is motivated by neighborhood mismatch: the observable neighborhoods derived from sequence, structure, or interaction data do not necessarily align with the functional neighborhoods needed to resolve function at an appropriate level of GO specificity. This mismatch becomes especially visible in under-characterized proteins and sparse relational settings.

Empirical structure of sparse interactomes

Network imbalance and multi-scale sparsity jointly constrain functional neighborhood inference.

Degree imbalance in protein interaction networks

Protein interaction networks exhibit strong structural imbalance, with a small fraction of hub proteins and a majority of low-degree nodes. As a result, most proteins reside in sparsely connected regions where observable neighborhoods are limited in size and informativeness.

This imbalance implies that conventional neighborhood-based inference is inherently uneven: proteins within well-connected cores benefit from dense relational context, whereas peripheral proteins lack sufficient local evidence for reliable functional assignment.
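
A minimal sketch of how degree imbalance could be quantified from an interaction list is shown here. The edge list, protein names, and the threshold used to call a protein "low-degree" are hypothetical choices for illustration, not values taken from the FINITE analysis.

```python
from collections import Counter

# Toy undirected interaction list (hypothetical protein names).
EDGES = [
    ("HUB", "A"), ("HUB", "B"), ("HUB", "C"), ("HUB", "D"),
    ("HUB", "E"), ("A", "B"), ("F", "G"),
]
# Proteins with no observed interaction still belong to the proteome.
ISOLATED = ["H", "I"]


def degree_table(edges, isolated=()):
    """Count distinct observed interaction partners per protein."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    for p in isolated:
        neighbors.setdefault(p, set())
    return {p: len(n) for p, n in neighbors.items()}


def low_degree_fraction(degrees, max_degree=1):
    """Fraction of proteins whose degree is at most max_degree."""
    low = sum(1 for d in degrees.values() if d <= max_degree)
    return low / len(degrees)


degrees = degree_table(EDGES, ISOLATED)
histogram = Counter(degrees.values())
print("degree histogram:", sorted(histogram.items()))
print("low-degree fraction:", round(low_degree_fraction(degrees), 2))
```

Including proteins without any observed interaction matters here: dropping them would hide exactly the peripheral region where neighborhood-based inference has the least to work with.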

Multi-scale sparsity across molecular and ontology-linked neighborhoods

Sparsity extends beyond protein–protein interactions and propagates across the entire evidence chain. At the intra-protein level, functional signals may be diffuse or weakly localized, while at the inter-protein level, observed associations are incomplete and unevenly sampled.

Along the ontology-linked chain, neighborhood coverage remains heterogeneous: many protein–protein neighborhoods are absent, protein–GO and GO–GO associations are limited, and GO annotations follow a long-tail distribution. Critically, the most specific and biologically informative functions correspond to regions with the weakest neighborhood support.

Together, these properties establish a systematic relationship between sparsity and functional specificity. The regions of the ontology that are most informative are also those with the most fragmented supporting evidence. This structural mismatch explains why conventional propagation-based approaches tend to succeed for broadly annotated functions while failing to resolve high-specificity functional assignments.

How FINITE reframes the problem

A multi-scale process of neighborhood recovery, reorganization, and distillation.

Step 1

Recover latent neighborhoods

FINITE first recovers latent neighborhood signals from intra-protein representations, extracting functional regularities even when conventional sequence or structural priors are weak, diffuse, or insufficient.

Step 2

Reorganize into inter-protein context

These recovered signals are then reorganized into inter-protein functional context by constructing functional neighborhood networks from representation-level proximity, even when direct protein-protein or protein-GO associations are sparse.
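
One simple way to turn representation-level proximity into a neighborhood network is a k-nearest-neighbor graph over protein embeddings. The sketch below uses cosine similarity and a fixed k as stand-ins. This page does not specify the similarity measure or construction rule FINITE uses, and the embeddings and function names are hypothetical.

```python
import math

# Toy protein embeddings (hypothetical 3-dimensional vectors).
EMBEDDINGS = {
    "P1": [1.0, 0.0, 0.0],
    "P2": [0.9, 0.1, 0.0],
    "P3": [0.0, 1.0, 0.0],
    "P4": [0.0, 0.9, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_neighborhoods(embeddings, k=1):
    """Link every protein to its k most similar proteins in embedding space."""
    graph = {}
    for name, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != name
        ]
        scored.sort(reverse=True)
        graph[name] = [other for _, other in scored[:k]]
    return graph


neighborhoods = knn_neighborhoods(EMBEDDINGS, k=1)
for protein, neighbors in neighborhoods.items():
    print(protein, "->", neighbors)
```

Because every protein receives k neighbors regardless of how many interactions were observed for it, a construction of this kind gives sparsely connected proteins a relational context that the observed interactome alone would not provide.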

Step 3

Distill into ontology-level inference

FINITE finally distills the recovered organization into inference over the GO hierarchy, enabling functional assignment at the level of specificity supported by the available evidence rather than defaulting to shallow annotations.

What emerges in the learned representation space

The recovered neighborhoods are not merely denser approximations of observed graphs. In the learned representation space, they exhibit an information-content-organized multi-core architecture.

Within this organization, proteins and GO terms of matching functional specificity co-localize without explicit IC supervision. This computational finding helps explain why FINITE shows its strongest gains for sparsely connected proteins and deep, high-information-content GO terms, where direct evidence is weakest but biologically informative annotations are most valuable.

Interpretation

From prediction to diagnosis of functional recoverability

Under this view, prediction failure is not only a matter of missing labels or missing edges. It reflects disrupted organization that prevents evidence from reaching the appropriate level of functional specificity.

Trajectory

Toward guided exploration of dark functional regions

The current release provides an initial exploration-oriented presentation of this framework. Subsequent updates will extend this page into more explicit exploration workflows, showing how FINITE can be used to investigate sparsely connected proteins, high-information-content GO regions, and broader functional dark proteome scenarios.

FRONTIER PERSPECTIVE

This page situates FINITE within the broader problem of the functional dark proteome under sparse interactomes, where biological organization is likely to persist but remains only partially recoverable from fragmented evidence.

EXPLORATION OUTLOOK

Future updates will expand this framework into more explicit exploration workflows, linking functional neighborhood recovery to practical analyses of sparsely characterized proteins and deep functional regions.

Documentation

Method overview, architectural logic, and framework documentation for FINITE.

Framework Documentation

From latent neighborhoods to ontology-level functional inference.

This page documents the core architecture of FINITE as a functional neighborhood inference framework. Rather than describing protein function as an isolated molecular attribute, FINITE models function as a property of recovered and reorganized neighborhoods across representation, network, and ontology levels.

Architectural overview

FINITE implements a three-stage inference pathway that connects intra-protein representation learning, inter-protein functional neighborhood organization, and ontology-level annotation.

This formulation is motivated by neighborhood mismatch under sparse interactomes. Instead of treating GO prediction as direct feature-to-label mapping or propagation over observed graphs alone, FINITE explicitly models how latent neighborhoods are recovered, reorganized into relational context, and distilled into functional assignments across the specificity spectrum.

Stage 1

Latent neighborhood representation learning

Each protein is embedded into a learned representation space using alignment-derived and language-model features. The objective at this stage is not to predict GO labels directly, but to recover latent neighborhood structure in which proteins sharing functional mechanisms are brought into proximity, even when pairwise sequence or structural similarity is weak.

This stage therefore serves as the entry point for neighborhood recovery: it extracts higher-order regularities that are not directly visible from raw sequence similarity, local structural cues, or sparse interaction evidence alone.

Stage 2

Functional neighborhood network learning

The latent neighborhood relations recovered in representation space are projected into an inductively constructed functional neighborhood network that integrates protein–protein, protein–GO, and GO–GO relations. Localized heterogeneous subgraphs are formed around each query protein, allowing the model to reorganize functional evidence in an explicit relational context.

This stage converts pairwise proximity into functional neighborhood structure. A protein is therefore interpreted not only through similar proteins, but through the joint organization of proteins, GO terms, and semantic relations within a local neighborhood network.
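
The idea of a localized heterogeneous subgraph can be sketched with typed edges and a bounded breadth-first expansion around a query protein. This is an illustrative data structure only: the identifiers, the hop limit, and the expansion rule are assumptions, and the sampling and attention mechanisms of FINITE are not reproduced here.

```python
from collections import deque

# Typed edges of a toy heterogeneous network (all identifiers hypothetical).
TYPED_EDGES = [
    ("P1", "P2", "protein-protein"),
    ("P2", "GO:A", "protein-GO"),
    ("P3", "GO:B", "protein-GO"),
    ("GO:A", "GO:B", "GO-GO"),
    ("P4", "P3", "protein-protein"),
]


def build_adjacency(typed_edges):
    """Store each undirected edge in both directions with its relation type."""
    adjacency = {}
    for u, v, relation in typed_edges:
        adjacency.setdefault(u, []).append((v, relation))
        adjacency.setdefault(v, []).append((u, relation))
    return adjacency


def local_subgraph(adjacency, query, max_hops=2):
    """Collect nodes within max_hops of the query and the typed edges
    leaving nodes that are closer than max_hops."""
    hops = {query: 0}
    edges = set()
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # boundary node: keep it, but do not expand further
        for neighbor, relation in adjacency.get(node, []):
            edges.add((min(node, neighbor), max(node, neighbor), relation))
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops, edges


adjacency = build_adjacency(TYPED_EDGES)
hops, edges = local_subgraph(adjacency, "P1", max_hops=2)
print("nodes by hop:", hops)
print("edges:", sorted(edges))
```

Keeping the relation type on every edge lets a downstream model treat protein-protein, protein-GO, and GO-GO links differently while still reasoning over a single local neighborhood.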

Stage 3

Neighborhood distillation for ontology-level inference

The neighborhood representations produced by the previous stage are distilled and aligned with GO embeddings to generate annotation probabilities. This step must preserve evidence across a wide range of functional specificity, from broadly annotated terms with extensive neighborhood support to deep, high-information-content terms supported by only a few proteins.

Rather than collapsing neighborhood evidence into a single classification decision, this stage translates recovered neighborhood organization into ontology-level assignments that remain sensitive to the information-content spectrum of GO.
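
As a rough illustration of aligning a neighborhood representation with GO embeddings, the sketch below scores each term with a dot product followed by a sigmoid, so that every term receives its own probability. The scoring function, vectors, and names are assumptions made for illustration; the actual distillation and alignment objective of FINITE is not described on this page.

```python
import math

# Hypothetical GO term embeddings and one distilled protein representation.
GO_EMBEDDINGS = {
    "GO:broad": [1.0, 0.0],
    "GO:specific": [0.5, 2.0],
    "GO:unrelated": [-2.0, -1.0],
}
PROTEIN_VECTOR = [1.0, 1.0]


def sigmoid(x):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def annotation_probabilities(protein_vec, go_embeddings):
    """Score each GO term independently by sigmoid(dot(protein, term))."""
    probs = {}
    for term, vec in go_embeddings.items():
        score = sum(p * g for p, g in zip(protein_vec, vec))
        probs[term] = sigmoid(score)
    return probs


probs = annotation_probabilities(PROTEIN_VECTOR, GO_EMBEDDINGS)
for term, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{term}: {p:.3f}")
```

Scoring terms independently, rather than with a softmax over all terms, fits the multi-label nature of GO annotation: a protein can carry a broad term and a specific term at the same time, and a rare high-IC term is not forced to compete with frequent ones for probability mass.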

How the three stages work together

The three stages form a single inference pathway from intra-protein signals to ontology-level annotation.

By design, each stage addresses a different aspect of neighborhood mismatch. The first stage recovers latent neighborhoods that are not visible in raw representations. The second stage reorganizes these neighborhoods into an explicit functional neighborhood network. The third stage distills this recovered organization into GO predictions across the full specificity spectrum.

Under this framework, the expected advantage of FINITE is not uniform across the ontology. If neighborhood recovery is effective, its strongest impact should appear in high-information-content regions where supporting evidence is sparse and mismatch is most severe.

Terminology

This site uses functional neighborhood to describe the local organization of proteins and GO terms that share biological relevance at a given level of specificity. Neighborhood mismatch refers to the discordance between observable neighborhoods derived from sequence, structure, or known interactions, and the functional neighborhoods required for reliable GO inference.

Current documentation scope

The present page focuses on architectural interpretation and framework logic. Documentation for source code usage, environment setup, and local execution will be added in subsequent updates after the code release structure is finalized.

ARCHITECTURAL LOGIC

FINITE models protein function as a property of recovered neighborhood organization rather than an isolated molecular label. The framework connects representation learning, functional neighborhood network construction, and ontology-level distillation within a unified inference pathway.

DOCUMENTATION OUTLOOK

Subsequent updates will extend this page with source code structure, configuration notes, local execution guidance, and additional usage documentation aligned with the released implementation.

Download

Data resources, model checkpoints, and implementation materials for FINITE.

Resource Hub

Core resources for functional neighborhood inference under sparse interactomes.

This page organizes the current FINITE resources into data, model, and implementation layers.
Public links will be added progressively as the release structure is finalized.

Release scope: Data + Models + Code Archive

Resource overview

FINITE resources are organized around the three-stage inference pathway, from protein-level inputs and neighborhood construction to ontology-level inference.

The current release plan includes benchmark datasets, ontology files, multiple sequence alignment data, functional neighborhood network representations, graph-structured neighborhood resources, stage-specific model checkpoints, and code or archive entries for reproducible use.

Data Resources

Input data, ontology structure, neighborhood representations, and graph-level resources.

Core dataset (TAR): Sparse interactomes dataset used for functional annotation benchmarks and neighborhood inference.
Multiple sequence alignment data (TGZ): Alignment-derived evidence used for latent neighborhood representation learning.
Functional neighborhood node embeddings (NPZ): Node-level representations combining ESM2 features and latent neighborhood representations.
Functional neighborhood network (TAR): Graph-structured neighborhood data integrating protein–protein, protein–GO, and GO–GO relations.
Neighborhood score resources (TAR): Recovered neighborhood scores used for downstream functional reorganization and distillation.

Model Checkpoints

Weights corresponding to the three stages of the FINITE inference pathway.

Stage 1 checkpoint (TGZ): Weights for latent neighborhood representation learning from alignment-derived and language-model features.
Stage 2 checkpoint (ZIP): Weights for functional neighborhood network learning over heterogeneous local subgraphs.
Stage 3 checkpoint (ZIP): Weights for neighborhood distillation and ontology-level inference across the GO specificity spectrum.

Code Repository

Implementation files, training logic, and documentation for reproducible use.

GitHub Repository

The repository, including the README and documentation, will be updated continuously.

Zenodo Archive

Archived release entry for datasets, checkpoints, and supporting materials.

Zenodo Resources

The archived release entry will be added together with the first public resource package.

Release note

The current page defines the resource structure of the FINITE release rather than the final public endpoints.

Resource links will be completed in subsequent updates. The present layout is intended to clarify what kinds of data, model weights, and implementation materials will be made available for reproducing the three-stage neighborhood inference framework.

RESOURCE STRUCTURE

The FINITE release is organized around data resources, stage-specific model checkpoints, and reproducible implementation materials. This structure follows the logic of latent neighborhood recovery, functional neighborhood network construction, and ontology-level distillation.

RELEASE OUTLOOK

Public links for datasets, code, and archived materials will be added progressively as the release package is finalized and synchronized with the project documentation.

About

FINITE is a research project centered on functional neighborhood inference for protein function prediction under sparse interactomes.

The project addresses a central difficulty in functional annotation: in under-characterized settings, the primary obstacle is often not the absence of biological signal, but the fragmentation of available evidence across molecular, relational, and ontology-level contexts. FINITE was developed to study how latent functional neighborhoods can be recovered, reorganized, and distilled into Gene Ontology inference under these conditions.

Project perspective

FINITE frames protein function not as an isolated molecular label, but as a property of neighborhood organization across scales. This perspective links intra-protein representation learning, inter-protein functional neighborhood structure, and ontology-level annotation within a unified inference pathway.

Scientific context

The project is motivated by the functional dark proteome under sparse interactomes, where proteins and GO regions remain difficult to resolve because supporting evidence is weak, fragmented, or unevenly distributed. In this setting, FINITE provides a computational framework for studying functional recoverability rather than relying only on direct evidence matching.

Current site scope

This site is intended as a research-facing project page for FINITE. It currently emphasizes problem framing, methodological interpretation, architectural documentation, and resource organization. Subsequent updates will extend the site with more explicit exploration workflows and public release materials for code and data.

Research environment

FINITE is developed within a broader research program on computational biology and biological sequence intelligence, including protein language models, sequence–structure–function inference, ontology-aware annotation, and graph-based biological representation learning.

Protein function prediction and ontology-aware annotation
Protein language models and biological sequence representation learning
Graph-based modeling of biological networks and neighborhood structure
Sequence–structure–function inference in under-characterized systems
Computational analysis of sparse, heterogeneous, and incomplete biological evidence

For related research and laboratory information, please visit bliulab.net.

Acknowledgements

We thank laboratory members, collaborators, and reviewers for their time, feedback, and support during the development of this project. This work is supported by the National Natural Science Foundation of China (Nos. 62325202 and 62473049) and the Zhongguancun Academy (Project No. 20240101).

The website interface also draws on open-source front-end components and standard web development utilities used for scientific presentation.

PROJECT POSITIONING

FINITE is presented here as a problem-driven research framework for studying functional neighborhood recovery under sparse interactomes, rather than as a standalone performance-oriented prediction tool.

SITE OUTLOOK

Future updates will continue to connect project background, methodological interpretation, downloadable resources, and guided exploration workflows within a unified site structure.

Contact

For scientific questions, framework details, or collaboration inquiries related to FINITE.

Project Contact

Contact points for the FINITE framework, manuscript, and resource development.

FINITE is a research-driven project on functional neighborhood inference under sparse interactomes. Please contact the appropriate person depending on the nature of your inquiry.

Project lead and first author

Jiangyi Shao

PhD, Computer Science, Beijing Institute of Technology

E-mail: jyshao@bliulab.net

Responsible for the conceptual formulation of functional neighborhood inference, model design, and manuscript development. For questions regarding the framework, methodology, or technical aspects of FINITE, please use this address.

Scientific correspondence

Bin Liu

Professor, Beijing Institute of Technology

E-mail: bliu@bliulab.net

Corresponding author of the FINITE study. For formal scientific communication, manuscript-related discussion, and collaboration inquiries, please use this address.

Project contributors and platform support

Yunjie Wang (representative)

Zhongguancun Academy

E-mail: yjwang@bliulab.net

Contributors to the FINITE project include Jianxiang Zhao, Ziwen Wang, Mengjie Li, and Yunjie Wang, who participate in method implementation, experimental data preparation, and evaluation. The project is supported in part by the Zhongguancun Academy platform, including computational resources and student research participation.

Suggested inquiry types

Conceptual questions about functional neighborhood inference
Interpretation of FINITE results or representation-space organization
Requests regarding datasets, model checkpoints, or future releases
Discussion of applications under sparse interactomes or functional dark proteome settings

AUTHOR ROLES

The FINITE framework is primarily developed by the first author, with the corresponding author overseeing scientific direction. Contributors support implementation, data preparation, and evaluation across stages of the framework.

PLATFORM SUPPORT

The project is supported by institutional research platforms, including Zhongguancun Academy, which provides computational resources and enables student participation in model development and experimental analysis.
