---
tags:
- protein language model
pipeline_tag: text-classification
---

# PDeepPP model

`PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and combining transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in a variety of contexts.

## Model description

`PDeepPP` is a flexible model architecture that integrates transformer-based self-attention with convolutional operations to capture both global and local sequence features. The model consists of:

1. A **Self-Attention Global Features module** for capturing long-range dependencies.
2. A **TransConv1d module**, combining transformer and convolutional layers.
3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.
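
To make the hybrid design concrete, here is a minimal, illustrative sketch of how a self-attention branch and a convolutional branch can be fused; the module name, dimensions, and additive fusion below are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

# Illustrative sketch only: self-attention for global context plus a 1D
# convolution for local motifs. Names, dimensions, and the additive fusion
# are assumptions, not the released PDeepPP architecture.
class HybridBlock(nn.Module):
    def __init__(self, dim=1280, heads=8, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):  # x: (batch, seq_len, dim)
        global_feats, _ = self.attn(x, x, x)  # long-range dependencies
        local_feats = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local patterns
        return global_feats + local_feats
```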

The model is trained with a loss function that combines a classification loss with additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's `transformers` library, allowing seamless integration with other tools and workflows.
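
As a rough sketch of such an objective, the function below adds an L2 penalty to a standard cross-entropy classification loss; the actual regularizers and weighting used to train `PDeepPP` are not specified here, so treat this as an assumption-laden illustration:

```python
import torch
import torch.nn as nn

# Sketch of a combined objective: cross-entropy classification loss plus an
# explicit L2 regularization term. The actual regularizers and weighting used
# to train PDeepPP may differ; reg_weight is an arbitrary placeholder.
def combined_loss(logits, labels, model, reg_weight=1e-4):
    cls_loss = nn.functional.cross_entropy(logits, labels)
    reg_loss = sum(p.pow(2).sum() for p in model.parameters())
    return cls_loss + reg_weight * reg_loss
```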

## Intended uses

`PDeepPP` was developed and validated on PTM and BPS datasets, but its applications are not limited to these tasks. Thanks to its flexible architecture and robust feature extraction, `PDeepPP` can be applied to a wide range of protein sequence analysis tasks. Specifically, the model has been validated on the following datasets:

1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues.
2. **BPS datasets**: Used for analyzing biologically active regions of protein sequences (Biologically Active Protein Sequences, BPS) to support downstream analyses.
|
| | Although the model was trained and validated on PTM and BPS datasets, `PDeepPP`’s architecture enables users to generalize and extend its capabilities to other protein sequence analysis tasks, such as embedding generation, sequence classification, or task-specific analyses. |
| |
|
| | --- |
| |
|
| | ### Key features |
| |
|

- **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness in identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions.
- **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing input data and task objectives.
- **PTM mode**: Focuses on sequences centered around specific residues (S, T, Y) to analyze post-translational modification activity (see the windowing sketch after this list).
- **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features.
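
For intuition on PTM mode's input convention, the helper below shows one way to extract fixed-length windows centered on candidate S/T/Y residues, padded with the `X` character used in the usage example further down; it is an illustrative reimplementation, not the internals of `PDeepPPProcessor`:

```python
# Illustrative only: build PTM-mode windows centered on S/T/Y residues,
# padded with "X" to a fixed length. PDeepPPProcessor's actual
# implementation may differ.
def ptm_windows(sequence, residues="STY", target_length=33, pad_char="X"):
    half = target_length // 2
    windows = []
    for i, aa in enumerate(sequence):
        if aa in residues:
            left = sequence[max(0, i - half):i]
            right = sequence[i + 1:i + 1 + half]
            window = left.rjust(half, pad_char) + aa + right.ljust(half, pad_char)
            windows.append((i, window))
    return windows

# Example: every S/T/Y in the sequence yields one 33-residue window
print(ptm_windows("ESHINQKWVCK"))
```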

## How to use

To use `PDeepPP`, install the required dependencies, including `torch`, `transformers`, and `fair-esm` (which provides the `esm` package used below for pretrained embeddings):

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
pip install fair-esm
```
Before proceeding, make sure the `DataProcessor_pdeeppp.py` and `Pretraining_pdeeppp.py` files are in the same directory as your script, since the example imports from them directly. Here is an example of how to use `PDeepPP` to process protein sequences and obtain predictions:
|
| | ```python |
| | import torch |
| | import esm |
| | from DataProcessor_pdeeppp import PDeepPPProcessor |
| | from Pretraining_pdeeppp import PretrainingPDeepPP |
| | from transformers import AutoModel |
| | |
| | # Global parameter settings |
| | device = torch.device("cpu") |
| | pad_char = "X" # Padding character |
| | target_length = 33 # Target length for sequence padding |
| | mode = "PTMS" # Mode setting (only configured in example.py) |
| | esm_ratio = 0.95 # Ratio for ESM embeddings |
| | |
| | # Load the PDeepPP model |
| | model_name = "fondress/PDeepPP_N-linked-glycosylation-N" |
| | model = AutoModel.from_pretrained(model_name, trust_remote_code=True) # Directly load the model |
| | |
| | # Initialize the PDeepPPProcessor |
| | processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length) |
| | |
| | # Example protein sequences (test sequences) |
| | protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"] |
| | |
| | # Preprocess the sequences |
| | inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt") # Dynamic mode parameter |
| | processed_sequences = inputs["raw_sequences"] |
| | |
| | # Load the ESM model |
| | esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D() |
| | esm_model = esm_model.to(device) |
| | esm_model.eval() |
| | |
| | # Initialize the PretrainingPDeepPP module |
| | pretrainer = PretrainingPDeepPP( |
| | embedding_dim=1280, |
| | target_length=target_length, |
| | esm_ratio=esm_ratio, |
| | device=device |
| | ) |
| | |
| | # Extract the vocabulary and ensure the padding character 'X' is included |
| | vocab = set("".join(protein_sequences)) |
| | vocab.add(pad_char) # Add the padding character |
| | |
| | # Generate pretrained features using the PretrainingPDeepPP module |
| | pretrained_features = pretrainer.create_embeddings( |
| | processed_sequences, vocab, esm_model, esm_alphabet |
| | ) |
| | |
| | # Ensure pretrained features are on the same device |
| | inputs["input_embeds"] = pretrained_features.to(device) |
| | |
| | # Perform prediction |
| | model.eval() |
| | outputs = model(input_embeds=inputs["input_embeds"]) # Use pretrained features as model input |
| | logits = outputs["logits"] |
| | |

# Compute probability distributions over the output classes
softmax = torch.nn.Softmax(dim=-1)  # Apply softmax on the class dimension
probabilities = softmax(logits)

# Keep the positive-class probability and threshold it at 0.5
# (assumes a binary site/non-site head with the positive class at index 1)
positive_probs = probabilities[:, 1]
predicted_labels = (positive_probs >= 0.5).long()

# Print the prediction results for each sequence
print("\nPrediction Results:")
for i, seq in enumerate(processed_sequences):
    print(f"Sequence: {seq}")
    print(f"Probability: {positive_probs[i].item():.4f}")
    print(f"Predicted Label: {predicted_labels[i].item()}")
    print("-" * 50)
```
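
The `mode` argument controls how the processor treats input sequences. Switching to BPS mode should only require changing the mode string; note that `"BPS"` is an assumed value inferred from the modes described above, so verify it against the values `PDeepPPProcessor` actually accepts:

```python
# "BPS" is an assumed mode string; check PDeepPPProcessor's accepted values.
inputs = processor(sequences=protein_sequences, mode="BPS", return_tensors="pt")
```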

## Training and customization

`PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:

- **Number of transformer layers**
- **Hidden layer size**
- **Dropout rate**
- **PTM type** and other task-specific parameters

Refer to `PDeepPPConfig` for details.
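
As a minimal sketch of overriding configuration values before fine-tuning, the snippet below loads the released configuration and modifies one field; the attribute name is a hypothetical stand-in for whatever `PDeepPPConfig` actually exposes:

```python
from transformers import AutoConfig, AutoModel

# Load the released configuration; trust_remote_code is needed because
# PDeepPPConfig is defined in the model repository, not in transformers.
config = AutoConfig.from_pretrained(
    "fondress/PDeepPP_N-linked-glycosylation-N", trust_remote_code=True
)

# Hypothetical attribute name mirroring the hyperparameters listed above;
# check the actual PDeepPPConfig fields before relying on it.
config.dropout = 0.2

# Build a freshly initialized model from the modified configuration
model = AutoModel.from_config(config, trust_remote_code=True)
```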

## Citation

If you use `PDeepPP` in your research, please cite the associated paper or repository:

```
@article{your_reference,
  title={PDeepPP: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},
  year={2025}
}
```