LLM Architecture
[ModernBERT] Modern Bidirectional Encoder🔗
arXiv: https://arxiv.org/abs/2412.13663 (18 Dec 2024)
The paper introduces ModernBERT, a new family of encoder-only transformer models that brings modern optimizations to BERT-style architectures.
Key Features:
- Architectural Improvements:
- Uses GeGLU activation in the feed-forward layers (a minimal sketch follows the Key Features list)
- RoPE positional embeddings
- Alternating local-global attention
- Native 8192 sequence length
- Optimized for efficient inference on common GPUs
- Full model unpadding for better efficiency
- Training:
- Trained on 2 trillion tokens
- Includes code data in training mixture
- Uses a modern BPE tokenizer with a 50,368-token vocabulary
- Unique Advantages:
- Successfully combines modern LLM architecture improvements with encoder-only models
- Achieves better downstream performance than prior encoders while maintaining high efficiency
- Represents the first major Pareto improvement (accuracy vs. speed and memory) over older encoders like BERT
- Code-aware design: uses a code-aware tokenizer that properly handles programming syntax
- Training on code makes ModernBERT well suited for code-related tasks while keeping strong performance on traditional NLP tasks
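A minimal PyTorch sketch of the GeGLU feed-forward block mentioned above; the hidden sizes and the absence of bias terms are illustrative assumptions, not ModernBERT's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """Gated-GELU feed-forward block: down_proj(GELU(x W_gate) * (x W_value))."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)   # W_gate
        self.value_proj = nn.Linear(d_model, d_hidden, bias=False)  # W_value
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.value_proj(x))

# Example: batch of 2 sequences, 8 tokens each, model width 768
ffn = GeGLUFeedForward(d_model=768, d_hidden=2048)
print(ffn(torch.randn(2, 8, 768)).shape)  # torch.Size([2, 8, 768])
```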
Limitations:
- Trained with an MLM-only objective (Masked Language Modeling)
- Not trained with RTD (Replaced Token Detection), which may hurt classification results
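As a quick usage sketch, ModernBERT can be exercised through the Hugging Face transformers fill-mask pipeline (MLM being its only pre-training objective, as noted above). The checkpoint name answerdotai/ModernBERT-base refers to the released base model, and a recent transformers version with ModernBERT support is assumed:

```python
from transformers import pipeline

# Fill-mask with ModernBERT; assumes a transformers release that includes ModernBERT support.
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

for pred in fill_mask("The capital of France is [MASK]."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```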
[GLiNER] Generalist Model for NER using Bidirectional Transformer🔗
arXiv: https://arxiv.org/abs/2311.08526 (14 Nov 2023)
Key Points:
Problem & Solution:
- Traditional NER models are limited to a fixed set of predefined entity types
- GLiNER is a compact model that can identify arbitrary entity types, given as natural-language labels at inference time
- Uses a bidirectional transformer encoder to extract entities in parallel
Architecture:
- Uses a bidirectional transformer (e.g., BERT/DeBERTa) as the backbone
- Components:
- Pre-trained textual encoder
- Span representation module
- Entity representation module
- Treats NER as matching entity types with text spans in latent space
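A toy sketch of that matching idea: entity-type prompts and candidate spans are projected into a shared latent space, and a span receives a type when the sigmoid of their dot product clears a threshold. The span representation (concatenated start/end token states) and layer sizes below are illustrative, not GLiNER's exact modules:

```python
import torch
import torch.nn as nn

class SpanTypeMatcher(nn.Module):
    """Toy GLiNER-style matcher: score(span, type) = sigmoid(<span_emb, type_emb>)."""

    def __init__(self, hidden: int, latent: int):
        super().__init__()
        self.span_proj = nn.Sequential(nn.Linear(2 * hidden, latent), nn.ReLU(), nn.Linear(latent, latent))
        self.type_proj = nn.Sequential(nn.Linear(hidden, latent), nn.ReLU(), nn.Linear(latent, latent))

    def forward(self, token_states, spans, type_states):
        # token_states: (seq_len, hidden) from the bidirectional encoder
        # spans: list of (start, end) token indices; type_states: (num_types, hidden)
        span_reprs = torch.stack([torch.cat([token_states[s], token_states[e]]) for s, e in spans])
        span_emb = self.span_proj(span_reprs)        # (num_spans, latent)
        type_emb = self.type_proj(type_states)       # (num_types, latent)
        return torch.sigmoid(span_emb @ type_emb.T)  # (num_spans, num_types) match scores

matcher = SpanTypeMatcher(hidden=768, latent=512)
scores = matcher(torch.randn(12, 768), [(0, 1), (3, 5)], torch.randn(3, 768))
print(scores.shape)  # torch.Size([2, 3])
```

All span-type scores come out of a single forward pass, which is what enables parallel extraction rather than token-by-token generation.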
Performance & Training:
- Parallel entity extraction vs sequential generation in LLMs
- Compact design (50M-300M parameters) vs billions in LLMs
- Effective negative entity sampling during training (see the sketch after this list)
- Entity type dropping as a regularization technique
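A schematic of those two training tricks, not GLiNER's exact procedure; the sampling count and drop probability are made-up parameters:

```python
import random

def sample_training_types(gold_types, all_types, num_negatives=2, drop_prob=0.1):
    """Build the entity-type prompt for one training example:
    randomly drop some gold types (regularization) and mix in sampled negative types."""
    kept = [t for t in gold_types if random.random() > drop_prob]      # entity-type dropping
    pool = [t for t in all_types if t not in gold_types]               # candidate negatives
    negatives = random.sample(pool, k=min(num_negatives, len(pool)))   # negative entity sampling
    return kept + negatives

print(sample_training_types(["person", "city"], ["person", "city", "date", "organization", "drug"]))
```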
Limitations:
- Lower performance on informal text (e.g., tweets)
- Reduced effectiveness on non-Latin scripts
- Room for improvement in low-resource languages
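For reference, a usage sketch with the gliner Python package; the checkpoint name urchade/gliner_base and the predict_entities interface are assumptions based on the publicly released library:

```python
from gliner import GLiNER  # pip install gliner

# Load a released GLiNER checkpoint (name assumed; see the project page for current models).
model = GLiNER.from_pretrained("urchade/gliner_base")

text = "Ada Lovelace worked with Charles Babbage on the Analytical Engine in London."
labels = ["person", "invention", "location"]  # arbitrary entity types chosen at inference time

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["text"], "->", ent["label"])
```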
[SetFit] Sentence Transformer Fine-tuning🔗
arXiv: https://arxiv.org/abs/2209.11055 (22 Sep 2022)
SetFit is a two-stage framework for few-shot text classification (a code sketch follows the two stages below):
- Siamese Fine-tuning Stage:
- Builds positive/negative text pairs from the few labeled examples (same class = positive, different class = negative)
- Fine-tunes a Sentence Transformer on these pairs with a contrastive objective
- Produces text embeddings better adapted to the target task
- Classification Stage:
- Uses the fine-tuned embeddings from Stage 1
- Trains a simple classifier (logistic regression) on these embeddings
- Produces the final classification output
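A condensed sketch of the two stages using sentence-transformers and scikit-learn directly; the official setfit library wraps this flow, and the backbone name, pair construction, and hyperparameters here are illustrative:

```python
from itertools import combinations
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sklearn.linear_model import LogisticRegression

# A handful of labeled examples (few-shot setting).
texts = ["great movie", "loved it", "terrible film", "waste of time"]
labels = [1, 1, 0, 0]

# Stage 1: siamese/contrastive fine-tuning on generated pairs
# (same label -> positive pair, label 1.0; different labels -> negative pair, label 0.0).
pairs = [
    InputExample(texts=[texts[i], texts[j]], label=float(labels[i] == labels[j]))
    for i, j in combinations(range(len(texts)), 2)
]
st_model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")  # illustrative backbone
loader = DataLoader(pairs, shuffle=True, batch_size=4)
st_model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(st_model))],
             epochs=1, show_progress_bar=False)

# Stage 2: train a simple classifier head on the adapted embeddings.
clf = LogisticRegression().fit(st_model.encode(texts), labels)
print(clf.predict(st_model.encode(["what a fantastic movie"])))  # expected: [1]
```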
Advantages:
- No prompts or verbalizers needed
- Much smaller parameter count than PEFT- and PET-based few-shot methods
- Faster training time
- Works well in multilingual settings
- Comparable accuracy to larger models