---
language:
- zh
- en
- de
- fr
license: mit
pipeline_tag: feature-extraction
library_name: transformers
tags:
- embeddings
- lora
- sociology
- retrieval
- feature-extraction
- sentence-transformers
---

# THETA: Textual Hybrid Embedding-based Topic Analysis

## Model Description

THETA is a domain-specific embedding model fine-tuned using LoRA on top of Qwen3-Embedding models (0.6B and 4B). It is designed to generate dense vector representations for texts in the sociology and social science domain.

The model is suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).

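Semantic search and similarity computation over embeddings usually reduce to comparing vectors by cosine similarity. The following is a minimal sketch in plain Python, assuming the embeddings are already available as lists of floats; the function names `cosine_similarity` and `rank_documents` and the toy 3-dimensional vectors are illustrative and are not part of this repository.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (illustrative values only).
query = [1.0, 0.0, 1.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = rank_documents(query, docs)
```
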
**Base Models:**
- [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
- [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B)

**Fine-tuning Methods:**
- **Unsupervised:** SimCSE (contrastive learning)
- **Supervised:** Label-guided contrastive learning with LoRA

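For readers unfamiliar with SimCSE, its unsupervised objective encodes each text twice with different dropout masks and trains the model to pick the matching view out of the batch. Below is a rough sketch of that in-batch contrastive (InfoNCE) loss in plain Python. The function names, the toy 2-D vectors, and the temperature of 0.05 (the value from the SimCSE paper) are assumptions for illustration; the loss configuration actually used to train THETA is not documented here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce_loss(view_a, view_b, temperature=0.05):
    """Mean in-batch contrastive loss; view_a[i] and view_b[i] embed the same text."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching view (index i) as the correct class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Hypothetical embeddings for a batch of two texts.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
```

The loss is close to zero when each text's two views point the same way (`aligned`) and large when the views are mismatched (`swapped`).
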
## Intended Use

This model is intended for text embedding generation, semantic similarity computation, document retrieval, and downstream NLP tasks requiring dense representations.

It is **not** designed for text generation or decision-making in high-risk scenarios.

## Model Architecture

| Component | Detail |
|---|---|
| Base model | Qwen3-Embedding (0.6B / 4B) |
| Fine-tuning | LoRA (Low-Rank Adaptation) |
| Output dimension | 1024 (0.6B) / 2560 (4B) |
| Framework | Transformers (PyTorch) |

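As background on the LoRA row above: LoRA keeps a pretrained weight matrix `W` frozen and learns a low-rank correction, so a layer computes `W x + (alpha / r) * B (A x)` with `A` of shape `(r, d_in)` and `B` of shape `(d_out, r)`. The sketch below illustrates that formula in plain Python with made-up numbers. The rank, `alpha`, and target layers used for THETA are not stated in this card, so every value here is an assumption.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the LoRA rank."""
    r = len(A)                          # A has shape (r, d_in), B has shape (d_out, r)
    base = matvec(W, x)                 # frozen pretrained projection
    update = matvec(B, matvec(A, x))    # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def trainable_fraction(d_in, d_out, r):
    """Adapter parameter count relative to the full d_out x d_in weight matrix."""
    return r * (d_in + d_out) / (d_in * d_out)

# Tiny example: 2x2 frozen weight with a rank-1 adapter (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]      # shape (1, 2)
B = [[0.5], [0.0]]    # shape (2, 1)
y = lora_forward(W, A, B, [2.0, 3.0], alpha=2.0)
```

For a hypothetical 1024 x 1024 projection with rank 8, `trainable_fraction(1024, 1024, 8)` shows the adapter adds under 2% of that layer's parameters, which is why LoRA adapters are small enough to ship separately from the base model.
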
## Repository Structure

```
CodeSoulco/THETA/
├── 0.6B/
│   ├── supervised/
│   └── unsupervised/
├── 4B/
│   ├── supervised/
│   └── unsupervised/
└── logs/
```

Pre-computed embeddings are available in a separate dataset repo: [CodeSoulco/THETA-embeddings](https://huggingface.co/datasets/CodeSoulco/THETA-embeddings).

## Training Details

- **Fine-tuning method:** LoRA
- **Training domain:** Sociology and social science texts
- **Datasets:** germanCoal, FCPB, socialTwitter, hatespeech, mental_health
- **Objective:** Improve domain-specific semantic representation
- **Hardware:** Dual NVIDIA GPUs

## How to Use

```python
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch

# Load the base model and its tokenizer
base_model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", trust_remote_code=True)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "CodeSoulco/THETA",
    subfolder="0.6B/unsupervised/germanCoal"
)

# Generate embeddings
text = "Social structure and individual behavior"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Qwen3-Embedding is a decoder-only model without a CLS token, so pool the final token.
# For a single unpadded input, that token is at position -1.
embeddings = outputs.last_hidden_state[:, -1, :]
```

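When several texts are embedded in one batch, the tokenizer pads them to a common length, so a fixed position no longer lines up with the same token in every row. If embeddings are pooled from the final token, as in the usage example on the Qwen3-Embedding model card, the position of the last non-padding token has to be read from the attention mask. The helper below is a minimal sketch in plain Python; `last_token_positions` is a hypothetical name, and the masks are hand-written stand-ins for `inputs["attention_mask"].tolist()`.

```python
def last_token_positions(attention_mask):
    """For each row of 0/1 mask values, return the index of the last attended token."""
    positions = []
    for row in attention_mask:
        attended = [i for i, flag in enumerate(row) if flag == 1]
        if not attended:
            raise ValueError("row has no attended tokens")
        positions.append(attended[-1])
    return positions

# Right-padded batch: real tokens first, padding (0) at the end.
right_padded = [[1, 1, 1, 0, 0],
                [1, 1, 1, 1, 1]]
# Left-padded batch: padding first, so the last real token sits in the final column.
left_padded = [[0, 0, 1, 1, 1],
               [1, 1, 1, 1, 1]]

right_positions = last_token_positions(right_padded)
left_positions = last_token_positions(left_padded)
```

With right padding the positions differ per row, while with left padding every row ends on a real token. Given such positions, the pooled vectors could be gathered with `outputs.last_hidden_state[torch.arange(len(positions)), torch.tensor(positions)]`.
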
## Limitations

- Fine-tuned for sociology/social science domain; may not generalize well to unrelated topics.
- Performance depends on input text length and quality.
- Does not generate text and should not be used for generative tasks.

## License

This model is released under the **MIT License**.

## Citation

```bibtex
@misc{theta2026,
  title={THETA: Textual Hybrid Embedding--based Topic Analysis},
  author={CodeSoul},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/CodeSoulco/THETA}
}
```