---
license: other
license_name: raml-v1.0
pipeline_tag: text-generation
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- RxNN
- RxLM
- ReactiveTransformer
- Event-Driven
- MemorySystem
- ShortTermMemory
- Real-Time
- ReactiveLanguageModel
- RealTimeLanguageModel
language:
- en
datasets:
- roneneldan/TinyStories
- ReactiveAI/TinyStories-Plus-Interaction-SFT
- ReactiveAI/TinyStories-MRL
library_name: RxLM
gated: true
extra_gated_prompt: "Accept [Reactive AI Model & Architecture License (RAML) v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) terms to access the repository and use the model. Reactive Transformer (pending patent #P.453260) is available for free for non-commercial usage. For commercial usage please contact Reactive AI at licensing@rxai.dev"
extra_gated_fields:
  Company: text
  Country: country
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to use this model for non-commercial use ONLY: checkbox
extra_gated_heading: "You need to agree to use this model only for research or education purposes under the Reactive AI Model & Architecture License (RAML) v1.0"
extra_gated_description: "The repository will be available instantly after accepting the license terms"
extra_gated_button_content: "Accept license terms"
---

# RxT-Alpha (Supervised)
World's first experimental (PoC) real-time **Reactive Language Model (RxLM)**, based on the revolutionary **Reactive Transformer**
architecture - it processes only single interactions/messages, with all conversational context moved to **Short-Term Memory**,
managed by an **Attention-Based Memory System**.

**RxLMs** have linear computational/inference cost scaling (`O(NT)`) compared to the quadratic growth of **LLMs** (`O(N²T)`),
where `N` is the number of messages in the conversation and `T` is the number of tokens in a single interaction. Thanks to that
scaling, they are roughly `N` times faster and cheaper than **LLMs** over a whole conversation.
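
As a rough illustration of that scaling (simple arithmetic with an assumed fixed interaction length, not a benchmark), the snippet below counts how many tokens each approach has to process over a conversation:

```python
# Rough illustration of O(N²T) vs O(NT) token processing - not a benchmark.
T = 256   # tokens per interaction (query + answer)
N = 32    # messages in the conversation

# A stateless LLM re-reads the growing history on every turn.
llm_tokens = sum(n * T for n in range(1, N + 1))   # ~ N² * T / 2
# An RxLM processes only the current interaction; context lives in STM.
rxlm_tokens = N * T                                # N * T

print(llm_tokens, rxlm_tokens, llm_tokens / rxlm_tokens)  # the ratio grows with N
```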
|
That's not all of the advantages - event-driven, real-time processing with memory is a lot more natural and human-like
than the data-driven approach of LLMs (processing the full conversation history every time). It's a crucial milestone in the
development of AGI and awareness models.

> This is the _Supervised_ version of the model with a "weak" memory system - the result of Supervised Memory System Training (SMST). It's
> able to remember information between interactions (without passing it explicitly in the prompt/chat template), but it
> has to be refined in the next Memory Reinforcement Learning (MRL) stage for full functionality.

After the supervised stages, the model reached a `~3.7/10.0` initial mean reward in MRL (min. `~1.8/10.0`, max. `~6.2/10.0`).

## Reactive Transformer Architecture
Experimental research model made to test our Reactive Transformer architecture and Attention-Based Memory System.

Reactive Transformer has additional Short-Term Memory (STM) layers, connected to the model with Memory Cross-Attention, and updated by the Memory Encoder and Memory Attention.
The Short-Term Memory state is kept between interactions/events (single messages), not between tokens in a sequence - that's the key difference between RxNNs and RNNs.
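
The resulting interaction cycle can be sketched as illustrative pseudocode - the module names (`decoder`, `encoder`, `memory_attention`) and call signatures below are placeholders, not the RxLM API:

```python
# Illustrative pseudocode of one event-driven Reactive Transformer cycle (not the RxLM API).
def interaction_cycle(query, stm):
    # 1. Generate the answer for the current message only, reading past context
    #    from the Short-Term Memory through memory cross-attention.
    answer = decoder.generate(query, memory=stm)
    # 2. Encode the full interaction (query + answer) with the memory encoder.
    encoded = encoder(query, answer)
    # 3. Update the STM state with memory attention, after the answer
    #    has already been streamed to the user.
    stm = memory_attention(stm, encoded)
    return answer, stm
```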
|
The goal of the architecture is to process only single messages and keep the conversation history in Short-Term Memory - we believe that this is the key requirement
for awareness and AGI. Processing all the chat history on every interaction is not natural, and that's not how human awareness works. The Reactive Transformer
architecture is therefore a first step in the transition from language models to awareness models.

To balance the number of parameters, the decoder is based on a Mixture-of-Experts architecture, while the encoder uses regular
dense feed-forward layers. This model uses the gated self/interlayer variant of the memory attention network with sigmoid residual gates.
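
The residual gate can be read as a learned, per-slot interpolation between the previous memory state and the attention update. A minimal sketch, assuming the gate is a linear layer over the update with one sigmoid output per slot (the exact inputs and shapes in RxLM may differ):

```python
import torch

# stm_prev: previous STM state [slots, dim], update: memory attention output [slots, dim]
def gated_residual(stm_prev: torch.Tensor, update: torch.Tensor, gate_linear: torch.nn.Linear):
    gate = torch.sigmoid(gate_linear(update))        # [slots, 1] - one gate value per STM slot
    # gate -> keep new information, (1 - gate) -> keep the previous memory
    return gate * update + (1.0 - gate) * stm_prev   # [slots, dim]
```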
|
### Architecture details:
- dim: 256
- layers: 12
- heads (for split): 16
- **Decoder:**
  - self-attention: symmetric Sparse Query Attention
    - query/key/value heads: 8/16
  - memory cross-attention: symmetric Sparse Query Attention
    - query/key/value heads: 8/16
  - Mixture-of-Experts Feed Forward
    - experts: 36
    - active experts: 6
    - SwiGLU feed forward with 384 dim
  - size: ~141M (~34.8M activated)
- **Encoder:**
  - self-attention: symmetric Sparse Query Attention
    - query/key/value heads: 8/16
  - SwiGLU feed forward with 784 dim
  - size: ~13.8M
- **Memory Attention:**
  - variant: **Gated Self/Interlayer Memory Attention**
  - attention layers: symmetric Sparse Query Attention
    - query/key/value heads: 8/16
  - residual gate: linear with sigmoid activation (per STM slot)
  - size: ~5.53M
- RoPE for self-attention, memory cross-attention (query only) and memory attention (key only)
- RMS Norm for all normalization layers
- vocab: 20k (English only)
- interaction (query + answer) length: 256 tokens
- STM size: 12 layers * 256 slots (* 256 dim)
- size: ~160M
- Library: RxLM
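
For a quick sense of scale, the Short-Term Memory defined above is a fixed-size state that does not grow with conversation length - simple arithmetic on the numbers listed above:

```python
layers, slots, dim = 12, 256, 256
stm_values = layers * slots * dim          # 786,432 values kept between interactions
stm_fp32_mb = stm_values * 4 / 1024 ** 2   # ~3 MB in float32
print(stm_values, round(stm_fp32_mb, 2))
```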
|
<!-- <img src="https://raw.githubusercontent.com/RxAI-dev/RxNN/refs/heads/main/assets/research/reactive-transformer-self-interlayer.png" width="800" /> -->

<!-- <img src="https://raw.githubusercontent.com/RxAI-dev/RxNN/refs/heads/main/assets/research/residual-gate.png" width="800" /> -->

> RxT-Alpha models intentionally use a very short sequence length and STM size (256 tokens), but that isn't their "full" context size - it's only for a single
> message. The "full" context is theoretically infinite, restricted by the STM size and the model's memory abilities. For the PoC models we want to
> reach 16 steps in the Memory Reinforcement Learning curriculum, which should enable fluent conversations over about 4k tokens of context
> for this model. These sizes are good for research; final models will handle SOTA context lengths.
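
The ~4k figure follows directly from the curriculum arithmetic:

```python
interaction_len = 256   # tokens per single interaction (query + answer)
mrl_steps = 16          # target curriculum length for the PoC
print(mrl_steps * interaction_len)  # 4096 tokens of conversation covered by STM retention
```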
|
<!-- <img src="https://raw.githubusercontent.com/RxAI-dev/RxNN/refs/heads/main/assets/research/stm-abms.png" width="800"> -->

## RxT-Alpha Training
Models from the RxT-Alpha series are part of the first PoCs for the Reactive Transformer, Attention-Based Memory System and Memory Reinforcement Learning,
used mainly to test the library and architecture basics before training bigger models (which are still relatively small, as this is a PoC).

They are trained to generate simple stories based on [**roneneldan/TinyStories**](https://huggingface.co/datasets/roneneldan/TinyStories),
and follow-up answers to questions about those stories.

Supervised Memory System Training includes four steps before proceeding to the Reinforcement Learning stages.

### Base Models Pre-Training
The decoder was trained together with the encoder and an additional MLM head model, using Joint LM Training (with MLM and autoregressive losses)
and the [**roneneldan/TinyStories**](https://huggingface.co/datasets/roneneldan/TinyStories) dataset.
Both encoder and decoder use a shared embedding layer.
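
A minimal sketch of what the joint objective optimizes, assuming hypothetical `encoder_mlm` and `decoder` modules and standard masking conventions (the real RxLM trainer API differs):

```python
import torch.nn.functional as F

# Illustrative joint objective: one batch, two losses, shared embeddings (placeholder modules).
def joint_lm_loss(encoder_mlm, decoder, input_ids, masked_ids, mlm_labels):
    # MLM loss: encoder + MLM head predict the masked tokens (-100 = ignored position)
    mlm_logits = encoder_mlm(masked_ids)
    mlm_loss = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100)

    # Autoregressive loss: decoder predicts the next token on the same data
    ar_logits = decoder(input_ids)
    ar_loss = F.cross_entropy(ar_logits[:, :-1].flatten(0, 1), input_ids[:, 1:].flatten())

    return mlm_loss + ar_loss
```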
|
### Supervised Fine-Tuning
**RxT-Alpha** models were fine-tuned to generate real-time interactions (sequences) on our synthetic dataset (improved in v3),
inspired by TinyStories - [**ReactiveAI/TinyStories-Plus-Interaction-SFT**](https://huggingface.co/datasets/ReactiveAI/TinyStories-Plus-Interaction-SFT).

Models were fine-tuned using the Joint LM Training mode (for memory cross-attention pre-training) - see the sketch below:
- encode the data with the encoder and calculate the MLM loss for it
- save the encoder layers' results as Short-Term Memory (available to the decoder through memory cross-attention)
- process the data with the decoder and calculate the autoregressive loss
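
In code terms, the three bullets map roughly onto the following sketch - `encoder`, `decoder`, `stm` and the `shifted_cross_entropy` helper are placeholders, not the RxLM trainer API:

```python
# Sketch of the Joint LM fine-tuning flow for one interaction (placeholders, not RxLM API).
def joint_sft_step(encoder, decoder, stm, masked_ids, mlm_labels, input_ids):
    # 1. Encode the interaction and compute the MLM loss on the masked tokens
    encoded_layers, mlm_loss = encoder(masked_ids, labels=mlm_labels)
    # 2. Save the per-layer encoder outputs as Short-Term Memory,
    #    which the decoder reads through memory cross-attention
    stm.set_from(encoded_layers)
    # 3. Process the same interaction with the decoder and compute the autoregressive loss
    ar_logits = decoder(input_ids, memory=stm)
    ar_loss = shifted_cross_entropy(ar_logits, input_ids)
    return mlm_loss + ar_loss
```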
|
This training results in a decoder with ~95% accuracy, because it has access to information about all next tokens through memory cross-attention. In the next training stages it
will access previous interactions' data through those layers.

### Self-Supervised Memory Attention Pre-Training
Memory Attention was pre-trained to combine accumulated Short-Term Memory states with the next interaction's data processed by the
encoder, using a weighted mean (with randomized, arbitrary weights) as labels and negative cosine similarity as the loss. The label weights
depend on the inner step (see the sketch below):
- in the first step, when the STM is in its initial random normal state, 90% of the new encoded data is used
- follow-up steps use `50% - step * 5%` of the new encoded data
- each step could have 0-15% random differences in the weights
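
Written out concretely, the schedule looks roughly like this - the exact randomization used in RxLM may differ, so treat `new_data_weight` as an interpretation of the description above:

```python
import random

def new_data_weight(step: int, jitter: float = 0.15) -> float:
    """Fraction of the newly encoded interaction used in the weighted-mean label."""
    if step == 0:
        base = 0.90                # first step: STM is still random normal noise
    else:
        base = 0.50 - step * 0.05  # follow-up steps: 45%, 40%, 35%, ...
    # 0-15% random difference in the weights (interpreted here as relative jitter)
    return max(0.0, base * (1.0 + random.uniform(-jitter, jitter)))

# label = w * encoded_new + (1 - w) * accumulated_stm, with extra random noise
# added to both inputs and labels during training
for step in range(6):
    print(step, round(new_data_weight(step), 3))
```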
|
Additionally, random noise is added to both inputs and labels.

This model was trained on six arbitrarily selected steps using the [**ReactiveAI/TinyStories-MRL**](https://huggingface.co/datasets/ReactiveAI/TinyStories-MRL)
dataset - `steps-6` subset, `supervised` split.

> This stage is fast and could reach convergence even after a single epoch.

### Supervised Memory-Aware Training
Finally, with the pre-trained/fine-tuned components, in the last supervised stage the model is trained to use previous/accumulated STM
states as the memory cross-attention input, instead of the same sequence as the decoder's input (see the sketch below):
- the previous (or first) interaction is processed by the encoder and used to update memory
- the next interaction is processed by the decoder, using related information from the STM
- the loss is calculated from the decoder's logits, and gradients propagate through memory attention to the encoder
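
A hedged sketch of one such step over a pair of consecutive interactions - all module names and the `autoregressive_loss` helper are placeholders, not the RxLM trainer API:

```python
# Sketch of Supervised Memory-Aware Training on two consecutive interactions (placeholders).
def memory_aware_step(encoder, memory_attention, decoder, stm, prev_ids, next_ids):
    # 1. The previous interaction goes through the encoder and updates the STM
    encoded_prev = encoder(prev_ids)
    stm = memory_attention(stm, encoded_prev)
    # 2. The next interaction is decoded while cross-attending to the previous content in STM
    logits = decoder(next_ids, memory=stm)
    # 3. Autoregressive loss on the decoder; gradients flow back through memory attention
    #    into the encoder (once those components are unfrozen)
    loss = autoregressive_loss(logits, next_ids)
    return loss, stm
```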
|
In this stage we use a gradual unfreeze strategy (sketched below):
- start by training only the decoder
- after N epochs, unfreeze memory attention
- after another K epochs, unfreeze the encoder
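
A minimal sketch of that schedule in PyTorch terms - the `decoder`/`memory_attention`/`encoder` attributes and the `n`/`k` thresholds are placeholders for whatever the trainer actually exposes:

```python
def apply_unfreeze_schedule(model, epoch: int, n: int, k: int) -> None:
    # Stage 1: the decoder is always trainable
    for p in model.decoder.parameters():
        p.requires_grad = True
    # Stage 2: after N epochs, also train memory attention
    for p in model.memory_attention.parameters():
        p.requires_grad = epoch >= n
    # Stage 3: after another K epochs, also train the encoder
    for p in model.encoder.parameters():
        p.requires_grad = epoch >= n + k
```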
|
## Next Stage: Memory Reinforcement Learning
The model is able to generate grammatically correct answers with basic retention between interactions, and is ready for
**Memory Reinforcement Learning** in the next stage. More info soon.

### Research paper: https://arxiv.org/abs/2510.03561

## Usage
The model can be loaded and used for interactions (it can run on CPU) with our RxLM framework (https://github.com/RxAI-dev/RxLM):
```python
from rxlm.rxt.models import RxTAlpha
from rxlm.training.tokenizer import load_tokenizer_from_hf_hub

# Load the tokenizer and the model from the Hugging Face Hub
tokenizer = load_tokenizer_from_hf_hub('ReactiveAI/RxT-Alpha-Supervised')
model = RxTAlpha.from_pretrained('ReactiveAI/RxT-Alpha-Supervised', tokenizer=tokenizer)
model.share_components()  # currently required to connect embeddings/STM

seq_len = 256

# Memory init - plays a role similar to a "system prompt" in LLMs
stm_init_state = model.tokenize_full_interaction('System prompt like', 'Initial memory for the model', max_seq_len=seq_len)
model.init_stm_state(**stm_init_state)

# Helper function: stream one interaction and print tokens as they are generated
def interaction(query: str):
    tokenized_query = model.tokenize_query(query, max_seq_len=seq_len)
    for token_id in model.interact(**tokenized_query, max_seq_len=seq_len, temperature=1.0):
        if token_id == -1:
            print('\n', '[Start memory update...]')  # special signal: answer finished, memory update started
        elif token_id == -2:
            print('[Memory updated]')  # special signal: memory update finished
        else:
            txt_token = model.stringify_token(token_id)
            print(txt_token, end='')

# Process the first interaction
interaction('Tell me a story about...')
# Process a follow-up interaction - the model recalls the story from STM instead of re-reading it
interaction('What was that story about?')
```