---
tags:
- protein language model
pipeline_tag: text-classification
---

# PDeepPP model

`PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and combining transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in a variety of contexts.

## Model description

`PDeepPP` is a flexible model architecture that integrates transformer-based self-attention with convolutional operations to capture both global and local sequence features. The model consists of:

1. A **Self-Attention Global Features module** for capturing long-range dependencies.
2. A **TransConv1d module**, combining transformer and convolutional layers.
3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.
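
To make the hybrid design concrete, here is a minimal, illustrative sketch of how a self-attention branch and a convolutional branch can be fused; the module name, dimensions, and additive fusion below are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: self-attention for global context plus a 1D
# convolution for local motifs. Names, dimensions, and the additive fusion
# are assumptions, not the released PDeepPP architecture.
class HybridBlock(nn.Module):
    def __init__(self, dim=1280, heads=8, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):  # x: (batch, seq_len, dim)
        global_feats, _ = self.attn(x, x, x)  # long-range dependencies
        local_feats = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local patterns
        return global_feats + local_feats
```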

The model is trained with a loss function that combines a classification loss with additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's `transformers` library, allowing seamless integration with other tools and workflows.
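
As a rough sketch of such an objective, the function below adds an L2 penalty to a standard cross-entropy classification loss; the actual regularizers and weighting used to train `PDeepPP` are not specified here, so treat this as an assumption-laden illustration:

```python
import torch
import torch.nn as nn

# Sketch of a combined objective: cross-entropy classification loss plus an
# explicit L2 regularization term. The actual regularizers and weighting used
# to train PDeepPP may differ; reg_weight is an arbitrary placeholder.
def combined_loss(logits, labels, model, reg_weight=1e-4):
    cls_loss = nn.functional.cross_entropy(logits, labels)
    reg_loss = sum(p.pow(2).sum() for p in model.parameters())
    return cls_loss + reg_weight * reg_loss
```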

## Intended uses

`PDeepPP` was developed and validated on PTM and BPS datasets, but its applications are not limited to these tasks. Thanks to its flexible architecture and robust feature extraction, `PDeepPP` can be applied to a wide range of protein sequence analysis tasks. Specifically, the model has been validated on the following datasets:

1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues.
2. **BPS datasets**: Used for analyzing biologically active regions of protein sequences (Biologically Active Protein Sequences, BPS) to support downstream analyses.
|
| | Although the model was trained and validated on PTM and BPS datasets, `PDeepPP`’s architecture enables users to generalize and extend its capabilities to other protein sequence analysis tasks, such as embedding generation, sequence classification, or task-specific analyses. |
| |
|
| | --- |
| |
|
| | ### Key features |
| |
|

- **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness in identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions.
- **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing input data and task objectives.
- **PTM mode**: Focuses on sequences centered around specific residues (S, T, Y) to analyze post-translational modification activity (see the windowing sketch after this list).
- **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features.
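
For intuition on PTM mode's input convention, the helper below shows one way to extract fixed-length windows centered on candidate S/T/Y residues, padded with the `X` character used in the usage example further down; it is an illustrative reimplementation, not the internals of `PDeepPPProcessor`:

```python
# Illustrative only: build PTM-mode windows centered on S/T/Y residues,
# padded with "X" to a fixed length. PDeepPPProcessor's actual
# implementation may differ.
def ptm_windows(sequence, residues="STY", target_length=33, pad_char="X"):
    half = target_length // 2
    windows = []
    for i, aa in enumerate(sequence):
        if aa in residues:
            left = sequence[max(0, i - half):i]
            right = sequence[i + 1:i + 1 + half]
            window = left.rjust(half, pad_char) + aa + right.ljust(half, pad_char)
            windows.append((i, window))
    return windows

# Example: every S/T/Y in the sequence yields one 33-residue window
print(ptm_windows("ESHINQKWVCK"))
```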

## How to use

To use `PDeepPP`, install the required dependencies, including `torch`, `transformers`, and `fair-esm` (which provides the `esm` package used below for pretrained embeddings):

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
pip install fair-esm
```
Before proceeding, make sure the `DataProcessor_pdeeppp.py` and `Pretraining_pdeeppp.py` files are in the same directory as your script, since the example imports from them directly. Here is an example of how to use `PDeepPP` to process protein sequences and obtain predictions:
|
| | ```python |
| | import torch |
| | import esm |
| | from DataProcessor_pdeeppp import PDeepPPProcessor |
| | from Pretraining_pdeeppp import PretrainingPDeepPP |
| | from transformers import AutoModel |
| | |
| | # Global parameter settings |
| | device = torch.device("cpu") |
| | pad_char = "X" # Padding character |
| | target_length = 33 # Target length for sequence padding |
| | mode = "PTMS" # Mode setting (only configured in example.py) |
| | esm_ratio = 0.95 # Ratio for ESM embeddings |
| | |
| | # Load the PDeepPP model |
| | model_name = "fondress/PDeepPP_N-linked-glycosylation-N" |
| | model = AutoModel.from_pretrained(model_name, trust_remote_code=True) # Directly load the model |
| | |
| | # Initialize the PDeepPPProcessor |
| | processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length) |
| | |
| | # Example protein sequences (test sequences) |
| | protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"] |
| | |
| | # Preprocess the sequences |
| | inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt") # Dynamic mode parameter |
| | processed_sequences = inputs["raw_sequences"] |
| | |
| | # Load the ESM model |
| | esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D() |
| | esm_model = esm_model.to(device) |
| | esm_model.eval() |
| | |
| | # Initialize the PretrainingPDeepPP module |
| | pretrainer = PretrainingPDeepPP( |
| | embedding_dim=1280, |
| | target_length=target_length, |
| | esm_ratio=esm_ratio, |
| | device=device |
| | ) |
| | |
| | # Extract the vocabulary and ensure the padding character 'X' is included |
| | vocab = set("".join(protein_sequences)) |
| | vocab.add(pad_char) # Add the padding character |
| | |
| | # Generate pretrained features using the PretrainingPDeepPP module |
| | pretrained_features = pretrainer.create_embeddings( |
| | processed_sequences, vocab, esm_model, esm_alphabet |
| | ) |
| | |
| | # Ensure pretrained features are on the same device |
| | inputs["input_embeds"] = pretrained_features.to(device) |
| | |
| | # Perform prediction |
| | model.eval() |
| | outputs = model(input_embeds=inputs["input_embeds"]) # Use pretrained features as model input |
| | logits = outputs["logits"] |
| | |

# Compute probability distributions over the output classes
softmax = torch.nn.Softmax(dim=-1)  # Apply softmax on the class dimension
probabilities = softmax(logits)

# Keep the positive-class probability and threshold it at 0.5
# (assumes a binary site/non-site head with the positive class at index 1)
positive_probs = probabilities[:, 1]
predicted_labels = (positive_probs >= 0.5).long()

# Print the prediction results for each sequence
print("\nPrediction Results:")
for i, seq in enumerate(processed_sequences):
    print(f"Sequence: {seq}")
    print(f"Probability: {positive_probs[i].item():.4f}")
    print(f"Predicted Label: {predicted_labels[i].item()}")
    print("-" * 50)
```
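
The `mode` argument controls how the processor treats input sequences. Switching to BPS mode should only require changing the mode string; note that `"BPS"` is an assumed value inferred from the modes described above, so verify it against the values `PDeepPPProcessor` actually accepts:

```python
# "BPS" is an assumed mode string; check PDeepPPProcessor's accepted values.
inputs = processor(sequences=protein_sequences, mode="BPS", return_tensors="pt")
```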

## Training and customization

`PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:

- **Number of transformer layers**
- **Hidden layer size**
- **Dropout rate**
- **PTM type** and other task-specific parameters

Refer to `PDeepPPConfig` for details.
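
As a minimal sketch of overriding configuration values before fine-tuning, the snippet below loads the released configuration and modifies one field; the attribute name is a hypothetical stand-in for whatever `PDeepPPConfig` actually exposes:

```python
from transformers import AutoConfig, AutoModel

# Load the released configuration; trust_remote_code is needed because
# PDeepPPConfig is defined in the model repository, not in transformers.
config = AutoConfig.from_pretrained(
    "fondress/PDeepPP_N-linked-glycosylation-N", trust_remote_code=True
)

# Hypothetical attribute name mirroring the hyperparameters listed above;
# check the actual PDeepPPConfig fields before relying on it.
config.dropout = 0.2

# Build a freshly initialized model from the modified configuration
model = AutoModel.from_config(config, trust_remote_code=True)
```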

## Citation

If you use `PDeepPP` in your research, please cite the associated paper or repository:

```
@article{your_reference,
  title={PDeepPP: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},
  year={2025}
}
```