| | --- |
| | tags: |
| | - model_hub_mixin |
| | - pytorch_model_hub_mixin |
| | pipeline_tag: feature-extraction |
| | --- |
| | |
| | # SegmentEnformer |
| |
|
| | SegmentEnformer is a segmentation model that leverages [Enformer](https://www.nature.com/articles/s41592-021-01252-x) to predict the location of several types of genomic
| | elements in a sequence at single-nucleotide resolution. It was trained on 14 classes of elements: gene elements (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory elements (polyA signal, tissue-invariant and
| | tissue-specific promoters and enhancers, and CTCF-bound sites).
| |
|
| |
|
| | **Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI) |
| |
|
| | ### Model Sources |
| |
|
| |
|
| | - **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer) |
| | - **Paper:** [Segmenting the genome at single-nucleotide resolution with DNA foundation models](https://www.biorxiv.org/content/biorxiv/early/2024/03/15/2024.03.14.584712.full.pdf) |
| |
|
| | ### How to use |
| |
|
| | Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the model.
| | PyTorch, `einops` and `enformer_pytorch` must also be installed.
| | |
| | ``` |
| | pip install --upgrade git+https://github.com/huggingface/transformers.git |
| | pip install torch einops enformer_pytorch==0.7.6
| | ``` |
| | |
| | The following snippet retrieves the logits for dummy DNA sequences.
| | |
| | ``` |
| | import torch
| | from transformers import AutoModel
| | 
| | model = AutoModel.from_pretrained("InstaDeepAI/segment_enformer", trust_remote_code=True)
| | 
| | def encode_sequences(sequences):
| |     # One-hot encode nucleotides in A, C, G, T order; N maps to all zeros.
| |     one_hot_map = {
| |         'a': torch.tensor([1., 0., 0., 0.]),
| |         'c': torch.tensor([0., 1., 0., 0.]),
| |         'g': torch.tensor([0., 0., 1., 0.]),
| |         't': torch.tensor([0., 0., 0., 1.]),
| |         'n': torch.tensor([0., 0., 0., 0.]),
| |         'A': torch.tensor([1., 0., 0., 0.]),
| |         'C': torch.tensor([0., 1., 0., 0.]),
| |         'G': torch.tensor([0., 0., 1., 0.]),
| |         'T': torch.tensor([0., 0., 0., 1.]),
| |         'N': torch.tensor([0., 0., 0., 0.])
| |     }
| | 
| |     def encode_sequence(seq_str):
| |         one_hot_list = []
| |         for char in seq_str:
| |             # Any other character gets a uniform distribution over the four bases.
| |             one_hot_vector = one_hot_map.get(char, torch.tensor([0.25, 0.25, 0.25, 0.25]))
| |             one_hot_list.append(one_hot_vector)
| |         return torch.stack(one_hot_list)
| | 
| |     # A list of sequences yields a batch; a single string yields one encoded sequence.
| |     if isinstance(sequences, list):
| |         return torch.stack([encode_sequence(seq) for seq in sequences])
| |     else:
| |         return encode_sequence(sequences)
| | 
| | # Two dummy sequences of 196,608 nucleotides each.
| | sequences = ["A"*196608, "G"*196608]
| | one_hot_encoding = encode_sequences(sequences)
| | preds = model(one_hot_encoding)
| | print(preds['logits'])
| | ``` |
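
The logits can be turned into per-nucleotide probabilities for each element type. The sketch below assumes, and this is an assumption not stated in this card, that the logits have shape `(batch, sequence_length, num_classes, 2)`, where the last axis holds the "absent" and "present" scores for each class. The helper name `logits_to_probabilities` and the dummy tensor are illustrative only; check the actual shape of `preds['logits']` before relying on it.

```python
import torch


def logits_to_probabilities(logits: torch.Tensor) -> torch.Tensor:
    """Softmax over the last axis, keeping the 'element present' probability.

    Assumes (not confirmed by the model card) that the last axis has size 2
    and that index 1 corresponds to the element being present.
    """
    return torch.softmax(logits, dim=-1)[..., -1]


# Dummy logits standing in for preds['logits']:
# 2 sequences, 8 positions, 14 classes, 2 scores per class.
dummy_logits = torch.zeros(2, 8, 14, 2)
probabilities = logits_to_probabilities(dummy_logits)
print(probabilities.shape)  # torch.Size([2, 8, 14])
```

With all-zero logits every probability is 0.5, which makes the dummy input a convenient sanity check.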
| | |
| | ## Training data |
| | |
| | The **SegmentEnformer** model was trained on all human chromosomes except chromosomes 20 and 21, which were held out as the test set, and chromosome 22, which was used as the validation set.
| | During training, sequences are randomly sampled from the genome together with their annotations. The sequences in the validation and test sets, however, are kept fixed by
| | sliding a window of length 196 kb (the original Enformer input length) over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
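
The fixed evaluation windows described above could be generated along the following lines. This is a minimal sketch, not the authors' pipeline: the function name `sliding_windows`, the non-overlapping stride (equal to the window length), the choice to drop a trailing partial window, and the example chromosome length are all assumptions made for illustration.

```python
from typing import Iterator, Tuple

WINDOW_LENGTH = 196_608  # input length used in the usage snippet of this card


def sliding_windows(
    chrom_length: int,
    window: int = WINDOW_LENGTH,
    stride: int = WINDOW_LENGTH,  # assumption: non-overlapping windows
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) intervals that fit entirely in the chromosome."""
    for start in range(0, chrom_length - window + 1, stride):
        yield start, start + window


# Hypothetical chromosome length, chosen only to keep the example small.
windows = list(sliding_windows(chrom_length=1_000_000))
print(len(windows))  # 5
print(windows[-1])   # (786432, 983040)
```

Because the windows depend only on the chromosome length, the window length and the stride, the same evaluation sequences are produced on every run, unlike the random sampling used during training.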
| | |
| | ## Training procedure |
| | |
| | ### Preprocessing |
| | |
| | The DNA sequences are one-hot encoded, as in the Enformer model.
| | |
| | ### Architecture |
| | |
| | The model is composed of the Enformer backbone, whose prediction heads are removed and replaced by a 1-dimensional U-Net segmentation head made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these
| | blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels, respectively.
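
To make the shape of such a head concrete, here is a rough PyTorch sketch of a 1D U-Net with 2 downsampling and 2 upsampling blocks of 2 convolutions each. It is not the released implementation. The class names, kernel size, activation, average pooling, nearest-neighbour upsampling, additive skip connections and the final `(positions, classes, 2)` output layout are all assumptions. How the head's output is mapped to single-nucleotide resolution is not covered here.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Two 1D convolutions; kernel size 3 and SiLU are assumed, not documented."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


class UNet1DHead(nn.Module):
    """Hypothetical 1D U-Net head with 2 down and 2 up blocks."""

    def __init__(self, embed_dim: int, widths=(1024, 2048), num_classes: int = 14):
        super().__init__()
        w1, w2 = widths
        self.down1 = ConvBlock(embed_dim, w1)
        self.down2 = ConvBlock(w1, w2)
        self.pool = nn.AvgPool1d(kernel_size=2)  # assumed downsampling op
        self.up2 = ConvBlock(w2, w1)
        self.up1 = ConvBlock(w1, embed_dim)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")  # assumed
        self.classifier = nn.Conv1d(embed_dim, 2 * num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, embed_dim, length), with length divisible by 4.
        skip1 = self.down1(x)                 # (batch, w1, length)
        skip2 = self.down2(self.pool(skip1))  # (batch, w2, length / 2)
        h = self.pool(skip2)                  # (batch, w2, length / 4)
        h = self.up2(self.upsample(h) + skip2)  # (batch, w1, length / 2)
        h = self.up1(self.upsample(h) + skip1)  # (batch, embed_dim, length)
        logits = self.classifier(h)             # (batch, 2 * classes, length)
        batch, _, length = logits.shape
        # Rearrange to (batch, length, classes, 2).
        return logits.reshape(batch, -1, 2, length).permute(0, 3, 1, 2)


# Small widths keep the example light; the card reports 1,024 and 2,048 kernels.
head = UNet1DHead(embed_dim=8, widths=(16, 32), num_classes=14)
out = head(torch.randn(2, 8, 12))
print(out.shape)  # torch.Size([2, 12, 14, 2])
```

The skip connections carry the higher-resolution features from each downsampling block to the matching upsampling block, which is what distinguishes a U-Net from a plain encoder-decoder.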
| | |
| | ### BibTeX entry and citation info |
| | |
| | ```bibtex |
| | @article{de2024segmentnt, |
| | title={SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models}, |
| | author={de Almeida, Bernardo P and Dalla-Torre, Hugo and Richard, Guillaume and Blum, Christopher and Hexemer, Lorenz and Gelard, Maxence and Pandey, Priyanka and Laurent, Stefan and Laterre, Alexandre and Lang, Maren and others}, |
| | journal={bioRxiv}, |
| | pages={2024--03}, |
| | year={2024}, |
| | publisher={Cold Spring Harbor Laboratory} |
| | } |
| | |
| | ``` |