Semantic Deduplication
TOON Converter includes a Semantic Deduplication engine. Unlike traditional deduplication, which looks only for exact matches, this feature uses embedding models (via sentence-transformers) to identify items that are semantically identical or highly similar.
This is particularly useful for:
* Cleaning datasets before RAG ingestion
* Reducing context window usage by removing redundant information
* Normalizing user inputs
Installation
To use semantic features, you need to install the optional dependencies:
pip install "toonverter[semantic]"
# Or directly:
pip install sentence-transformers scikit-learn
CLI Usage
The deduplicate command processes a file and removes semantic duplicates from lists found within the data structure.
# Basic usage
toon deduplicate input.json -o cleaned.json
# Customize model and threshold
toon deduplicate input.json \
--model all-MiniLM-L6-v2 \
--threshold 0.85 \
-o cleaned.json
Arguments:
* input_file: Path to source data file (JSON, TOON, YAML, etc.)
* --output, -o: Path to save result. If omitted, prints to stdout.
* --model: SentenceTransformer model name (default: all-MiniLM-L6-v2)
* --threshold: Cosine similarity threshold (0.0 - 1.0). Higher means stricter matching. Default: 0.9.
* --language-key: Name of the field that holds each object's language, if your objects have one.
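To make the --threshold semantics concrete, here is a minimal sketch of cosine similarity on hand-made toy vectors. The vectors stand in for real embeddings, and is_duplicate is a hypothetical helper, not part of toonverter. Whether the tool compares with >= or > is an assumption here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_duplicate(a, b, threshold=0.9):
    # Hypothetical helper: treats a pair as duplicates when the
    # similarity reaches the threshold (>= is an assumption).
    return cosine_similarity(a, b) >= threshold

# Toy 2-D "embeddings"; real models produce hundreds of dimensions.
apple = [1.0, 0.0]
fuji_apple = [0.9, 0.3]

score = cosine_similarity(apple, fuji_apple)  # roughly 0.949
print(is_duplicate(apple, fuji_apple, threshold=0.9))   # True
print(is_duplicate(apple, fuji_apple, threshold=0.95))  # False
```

Raising the threshold from 0.9 to 0.95 stops this pair from matching, which is what "higher means stricter" amounts to in practice.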
Python API
You can use the deduplicate function directly in your Python code:
import toonverter as toon

data = {
    "items": [
        "Apple",
        "Banana",
        "Fuji Apple",  # Might be deduplicated against "Apple" depending on threshold
        "Orange"
    ]
}

# Deduplicate
cleaned = toon.deduplicate(data, threshold=0.8)
Advanced Usage: Custom Text Extraction
For complex objects, you might want to control exactly what text is used for embedding comparison. You can use the SemanticDeduplicator class directly:
from toonverter.analysis.deduplication import SemanticDeduplicator

def my_text_extractor(item):
    # Only compare based on title and description
    return f"{item.get('title', '')} {item.get('description', '')}"

deduper = SemanticDeduplicator(text_extraction_func=my_text_extractor)
cleaned_data = deduper.optimize(data)
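To see what a text extractor contributes on its own, the sketch below runs a variant of the extractor over a few invented records without importing toonverter. The record fields (id, updated) are illustrative assumptions, and the variant skips empty fields so that a missing description does not leave a trailing space.

```python
def my_text_extractor(item):
    """Build the comparison text from title and description only."""
    parts = [item.get("title", ""), item.get("description", "")]
    # Skip empty fields so a missing description leaves no trailing space.
    return " ".join(p for p in parts if p)

# Invented sample records: the first two differ only in fields
# that the extractor ignores.
records = [
    {"id": 1, "title": "Reset password", "description": "How to reset it", "updated": "2024-01-01"},
    {"id": 2, "title": "Reset password", "description": "How to reset it", "updated": "2024-06-30"},
    {"id": 3, "title": "Delete account"},
]

texts = [my_text_extractor(r) for r in records]
for text in texts:
    print(repr(text))
```

Records 1 and 2 produce the same text even though their id and updated fields differ, so any comparison based on the extracted text sees them as identical.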
Performance
* Exact Match: The system always performs an O(N) hash-based exact deduplication first.
* Semantic: Comparison is O(N^2) within each list, so very large lists (>10k items) can be slow. It is best suited to document chunks, tag lists, or moderate-sized datasets.
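The two stages can be sketched as follows. This is an illustration of the general technique under stated assumptions, not toonverter's actual implementation: dedupe, cosine, and the TOY_VECTORS lookup are hypothetical, and the greedy "keep the first occurrence" policy is an assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dedupe(items, embed, threshold=0.9):
    # Stage 1: O(N) exact deduplication with a hash set, preserving order.
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)

    # Stage 2: O(N^2) in the worst case. Each survivor is compared
    # against every item kept so far and dropped if any is too similar.
    kept, kept_vecs = [], []
    for item in unique:
        vec = embed(item)
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(item)
            kept_vecs.append(vec)
    return kept

# Hand-made stand-ins for embeddings; a real setup would call a model.
TOY_VECTORS = {
    "Apple": [1.0, 0.0],
    "Fuji Apple": [0.9, 0.3],
    "Banana": [0.0, 1.0],
}

items = ["Apple", "Banana", "Apple", "Fuji Apple"]
strict = dedupe(items, TOY_VECTORS.__getitem__, threshold=0.95)
loose = dedupe(items, TOY_VECTORS.__getitem__, threshold=0.9)
print(strict)  # ['Apple', 'Banana', 'Fuji Apple']
print(loose)   # ['Apple', 'Banana']
```

Running the cheap exact pass first shrinks the list before the quadratic pass, which is why the order of the stages matters for large inputs.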