LLM Architecture
[ModernBERT] Modern Bidirectional Encoder🔗
arXiv: https://arxiv.org/abs/2412.13663 (18 Dec 2024)
The paper introduces ModernBERT, a new family of encoder-only transformer models that brings modern optimizations to BERT-style architectures.
Key Features:
- Architectural Improvements:
- Uses GeGLU activation in the feed-forward layers (a minimal sketch follows the Key Features list)
- RoPE positional embeddings
- Alternating local-global attention
- Native 8192 sequence length
- Optimized for efficient inference on common GPUs
- Full model unpadding for better efficiency
- Training:
- Trained on 2 trillion tokens
- Includes code data in training mixture
- Uses a modern BPE tokenizer with a 50,368-token vocabulary
- Unique Advantages:
- Successfully combines modern LLM architecture improvements with encoder-only models
- Achieves better downstream performance than prior encoders while maintaining high efficiency
- Represents the first major Pareto improvement (accuracy vs. speed and memory) over older encoders like BERT
- Code-aware design: uses a code-aware tokenizer that properly handles programming syntax
- Training on code makes ModernBERT well suited for code-related tasks while keeping strong performance on traditional NLP tasks
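A minimal PyTorch sketch of the GeGLU feed-forward block mentioned above; the hidden sizes and the absence of bias terms are illustrative assumptions, not ModernBERT's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """Gated-GELU feed-forward block: down_proj(GELU(x W_gate) * (x W_value))."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)   # W_gate
        self.value_proj = nn.Linear(d_model, d_hidden, bias=False)  # W_value
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.value_proj(x))

# Example: batch of 2 sequences, 8 tokens each, model width 768
ffn = GeGLUFeedForward(d_model=768, d_hidden=2048)
print(ffn(torch.randn(2, 8, 768)).shape)  # torch.Size([2, 8, 768])
```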
Limitations:
- Trained with an MLM-only objective (Masked Language Modeling)
- Not trained with RTD (Replaced Token Detection), which may hurt classification results
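As a quick usage sketch, ModernBERT can be exercised through the Hugging Face transformers fill-mask pipeline (MLM being its only pre-training objective, as noted above). The checkpoint name answerdotai/ModernBERT-base refers to the released base model, and a recent transformers version with ModernBERT support is assumed:

```python
from transformers import pipeline

# Fill-mask with ModernBERT; assumes a transformers release that includes ModernBERT support.
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

for pred in fill_mask("The capital of France is [MASK]."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```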
[GLiNER] Generalist Model for NER using Bidirectional Transformer🔗
arXiv: https://arxiv.org/abs/2311.08526 (14 Nov 2023)
Key Points:
Problem & Solution:
- Traditional NER models are limited to a fixed set of predefined entity types
- GLiNER is a compact model that can identify arbitrary entity types, given as natural-language labels at inference time
- Uses a bidirectional transformer encoder to extract entities in parallel
Architecture:
- Uses a bidirectional transformer (e.g., BERT/DeBERTa) as the backbone
- Components:
- Pre-trained textual encoder
- Span representation module
- Entity representation module
- Treats NER as matching entity types with text spans in latent space
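A toy sketch of that matching idea: entity-type prompts and candidate spans are projected into a shared latent space, and a span receives a type when the sigmoid of their dot product clears a threshold. The span representation (concatenated start/end token states) and layer sizes below are illustrative, not GLiNER's exact modules:

```python
import torch
import torch.nn as nn

class SpanTypeMatcher(nn.Module):
    """Toy GLiNER-style matcher: score(span, type) = sigmoid(<span_emb, type_emb>)."""

    def __init__(self, hidden: int, latent: int):
        super().__init__()
        self.span_proj = nn.Sequential(nn.Linear(2 * hidden, latent), nn.ReLU(), nn.Linear(latent, latent))
        self.type_proj = nn.Sequential(nn.Linear(hidden, latent), nn.ReLU(), nn.Linear(latent, latent))

    def forward(self, token_states, spans, type_states):
        # token_states: (seq_len, hidden) from the bidirectional encoder
        # spans: list of (start, end) token indices; type_states: (num_types, hidden)
        span_reprs = torch.stack([torch.cat([token_states[s], token_states[e]]) for s, e in spans])
        span_emb = self.span_proj(span_reprs)        # (num_spans, latent)
        type_emb = self.type_proj(type_states)       # (num_types, latent)
        return torch.sigmoid(span_emb @ type_emb.T)  # (num_spans, num_types) match scores

matcher = SpanTypeMatcher(hidden=768, latent=512)
scores = matcher(torch.randn(12, 768), [(0, 1), (3, 5)], torch.randn(3, 768))
print(scores.shape)  # torch.Size([2, 3])
```

All span-type scores come out of a single forward pass, which is what enables parallel extraction rather than token-by-token generation.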
Performance & Training:
- Parallel entity extraction vs sequential generation in LLMs
- Compact design (50M-300M parameters) vs billions in LLMs
- Effective negative entity sampling during training (see the sketch after this list)
- Entity type dropping as a regularization technique
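A schematic of those two training tricks, not GLiNER's exact procedure; the sampling count and drop probability are made-up parameters:

```python
import random

def sample_training_types(gold_types, all_types, num_negatives=2, drop_prob=0.1):
    """Build the entity-type prompt for one training example:
    randomly drop some gold types (regularization) and mix in sampled negative types."""
    kept = [t for t in gold_types if random.random() > drop_prob]      # entity-type dropping
    pool = [t for t in all_types if t not in gold_types]               # candidate negatives
    negatives = random.sample(pool, k=min(num_negatives, len(pool)))   # negative entity sampling
    return kept + negatives

print(sample_training_types(["person", "city"], ["person", "city", "date", "organization", "drug"]))
```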
Limitations:
- Lower performance on informal text (e.g., tweets)
- Reduced effectiveness on non-Latin scripts
- Room for improvement in low-resource languages
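For reference, a usage sketch with the gliner Python package; the checkpoint name urchade/gliner_base and the predict_entities interface are assumptions based on the publicly released library:

```python
from gliner import GLiNER  # pip install gliner

# Load a released GLiNER checkpoint (name assumed; see the project page for current models).
model = GLiNER.from_pretrained("urchade/gliner_base")

text = "Ada Lovelace worked with Charles Babbage on the Analytical Engine in London."
labels = ["person", "invention", "location"]  # arbitrary entity types chosen at inference time

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(ent["text"], "->", ent["label"])
```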
[SetFit] Sentence Transformer Fine-tuning🔗
arXiv: https://arxiv.org/abs/2209.11055 (22 Sep 2022)
SetFit is a two-stage framework for few-shot text classification (a code sketch follows the two stages below):
- Siamese Fine-tuning Stage:
- Builds positive/negative text pairs from the few labeled examples (same class = positive, different class = negative)
- Fine-tunes a Sentence Transformer on these pairs with a contrastive objective
- Produces text embeddings better adapted to the target task
- Classification Stage:
- Uses the fine-tuned embeddings from Stage 1
- Trains a simple classifier (logistic regression) on these embeddings
- Produces the final classification output
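A condensed sketch of the two stages using sentence-transformers and scikit-learn directly; the official setfit library wraps this flow, and the backbone name, pair construction, and hyperparameters here are illustrative:

```python
from itertools import combinations
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sklearn.linear_model import LogisticRegression

# A handful of labeled examples (few-shot setting).
texts = ["great movie", "loved it", "terrible film", "waste of time"]
labels = [1, 1, 0, 0]

# Stage 1: siamese/contrastive fine-tuning on generated pairs
# (same label -> positive pair, label 1.0; different labels -> negative pair, label 0.0).
pairs = [
    InputExample(texts=[texts[i], texts[j]], label=float(labels[i] == labels[j]))
    for i, j in combinations(range(len(texts)), 2)
]
st_model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")  # illustrative backbone
loader = DataLoader(pairs, shuffle=True, batch_size=4)
st_model.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(st_model))],
             epochs=1, show_progress_bar=False)

# Stage 2: train a simple classifier head on the adapted embeddings.
clf = LogisticRegression().fit(st_model.encode(texts), labels)
print(clf.predict(st_model.encode(["what a fantastic movie"])))  # expected: [1]
```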
Advantages:
- No prompts or verbalizers needed
- Much smaller parameter count than PEFT- and PET-based few-shot methods
- Faster training time
- Works well in multilingual settings
- Comparable accuracy to larger models