Efficient LLM
FlashDecoding++: Faster Large Language Model Inference on GPUs (arXiv:2311.01282)
S-LoRA: Serving Thousands of Concurrent LoRA Adapters (arXiv:2311.03285)
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization (arXiv:2311.06243)
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores (arXiv:2311.05908)
Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying (arXiv:2311.09578)
I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization (arXiv:2311.10126)
SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985)
A Survey of Resource-efficient LLM and Multimodal Foundation Models (arXiv:2401.08092)
SliceGPT: Compress Large Language Models by Deleting Rows and Columns (arXiv:2401.15024)
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (arXiv:2401.15077)
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression (arXiv:2403.15447)
A Controlled Study on Long Context Extension and Generalization in LLMs (arXiv:2409.12181)