Semantic Deduplication
======================

TOON Converter includes a powerful **Semantic Deduplication** engine. Unlike traditional deduplication which looks for exact matches, this feature uses embedding models (via ``sentence-transformers``) to identify items that are *semantically* identical or highly similar.

This is particularly useful for:
* Cleaning datasets before RAG ingestion
* Reducing context window usage by removing redundant information
* Normalizing user inputs

Installation
------------

To use semantic features, you need to install the optional dependencies:

.. code-block:: bash

   pip install "toonverter[semantic]"
   # Or directly:
   pip install sentence-transformers scikit-learn

CLI Usage
---------

The ``deduplicate`` command processes a file and removes semantic duplicates from lists found within the data structure.

.. code-block:: bash

   # Basic usage
   toon deduplicate input.json -o cleaned.json

   # Customize model and threshold
   toon deduplicate input.json \
       --model all-MiniLM-L6-v2 \
       --threshold 0.85 \
       -o cleaned.json

Arguments:
* ``input_file``: Path to source data file (JSON, TOON, YAML, etc.)
* ``--output, -o``: Path to save result. If omitted, prints to stdout.
* ``--model``: SentenceTransformer model name (default: ``all-MiniLM-L6-v2``)
* ``--threshold``: Cosine similarity threshold (0.0 - 1.0). Higher means stricter matching. Default: ``0.9``.
* ``--language-key``: If objects have a specific language field, you can specify it.

Python API
----------

You can use the ``deduplicate`` function directly in your Python code:

.. code-block:: python

   import toonverter as toon

   data = {
       "items": [
           "Apple",
           "Banana", 
           "Fuji Apple",  # Might be deduplicated against "Apple" depending on threshold
           "Orange"
       ]
   }

   # Deduplicate
   cleaned = toon.deduplicate(data, threshold=0.8)

Advanced Usage: Custom Text Extraction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For complex objects, you might want to control exactly what text is used for embedding comparison. You can use the ``SemanticDeduplicator`` class directly:

.. code-block:: python

   from toonverter.analysis.deduplication import SemanticDeduplicator

   def my_text_extractor(item):
       # Only compare based on title and description
       return f"{item.get('title', '')} {item.get('description', '')}"

   deduper = SemanticDeduplicator(text_extraction_func=my_text_extractor)
   cleaned_data = deduper.optimize(data)

Performance
-----------

* **Exact Match**: The system always performs an O(N) hash-based exact deduplication first.
* **Semantic**: This is O(N^2) within each list. For very large lists (>10k items), this can be slow. It is recommended for document chunks, tag lists, or moderate-sized datasets.