PD.Cipher Tokenizer SDK

Welcome to the PD.Cipher Tokenizer SDK documentation. The SDK is a high-performance Rust tokenization library with Python bindings, designed for secure, encrypted machine learning workflows.

Overview

The PD.Cipher Tokenizer SDK is built on a composable pipeline architecture that lets you create custom tokenization pipelines for a variety of data types, including:

  • Text data with multiple languages and scripts
  • Binary data and byte sequences
  • Structured data (JSON, XML)
  • Multimodal inputs (vision, audio)

Key Features

  • High Performance: Built in Rust with parallel processing capabilities
  • Composable Architecture: Mix and match components to create custom pipelines
  • Security First: Designed for encrypted ML workflows with deterministic output
  • Python Bindings: Easy integration with Python ML frameworks
  • Hugging Face Compatible: Works seamlessly with the Hugging Face ecosystem

Architecture

The SDK follows a pipeline architecture inspired by Hugging Face tokenizers:

Input Data → [DataSource] → [Normalizer] → [PreTokenizer] → [Model] → [PostProcessor] → Encoding

Components

  1. DataSource: Handles input ingestion from various sources
  2. Normalizer: Performs text normalization (Unicode normalization, lowercasing, accent removal)
  3. PreTokenizer: Splits input into initial chunks (whitespace, punctuation)
  4. Model: Core tokenization algorithms (BPE, WordPiece, Unigram)
  5. PostProcessor: Adds special tokens, handles padding and truncation
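
For example, a text pipeline is assembled by wiring these components together in the order shown in the diagram above. The following is an illustrative sketch only: the builder-style API and the names CipherTokenizer, Normalizer::nfc, PreTokenizer::whitespace, Model::bpe, and PostProcessor::template are assumptions, not documented signatures; see the examples/ directory for the actual interface.

    // Illustrative sketch of composing a text tokenization pipeline.
    // NOTE: all type and method names below are assumed, not confirmed by the SDK docs.
    use cipher_tokenizer::{CipherTokenizer, Model, Normalizer, PostProcessor, PreTokenizer};

    fn build_text_pipeline() -> CipherTokenizer {
        CipherTokenizer::builder()
            // Normalizer: Unicode normalization, lowercasing, accent stripping.
            .normalizer(Normalizer::nfc().lowercase().strip_accents())
            // PreTokenizer: split on whitespace and punctuation.
            .pre_tokenizer(PreTokenizer::whitespace())
            // Model: core algorithm (BPE here; WordPiece and Unigram are alternatives).
            .model(Model::bpe("vocab.json", "merges.txt"))
            // PostProcessor: add special tokens, handle padding and truncation.
            .post_processor(PostProcessor::template("[CLS] $A [SEP]"))
            .build()
    }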

Getting Started

To get started with the PD.Cipher Tokenizer SDK:

  1. Install the SDK using cargo add cipher-tokenizer
  2. Check out the examples in the examples/ directory
  3. Read the API documentation below
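
Once installed, a typical first program builds a tokenizer and encodes a short string. The sketch below assumes a CipherTokenizer type with default() and encode() methods and an Encoding with tokens() and ids() accessors; these names are illustrative guesses, so check the examples/ directory and the API documentation for the real signatures.

    // Minimal usage sketch (hypothetical API): encode a sentence into token IDs.
    use cipher_tokenizer::CipherTokenizer;

    fn main() {
        // Assumed constructor that loads a default text pipeline.
        let tokenizer = CipherTokenizer::default();

        // Encode returns an Encoding; the accessor names below are illustrative.
        let encoding = tokenizer.encode("Hello, PD.Cipher!");
        println!("tokens: {:?}", encoding.tokens());
        println!("ids:    {:?}", encoding.ids());
    }

Because the tokenizer's output is deterministic, the same input always produces the same token IDs, which is what makes it suitable as the first stage of an encrypted ML workflow.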

Use Cases

The PD.Cipher Tokenizer SDK is ideal for:

  • Secure ML Pipelines: Tokenize data before encryption for privacy-preserving ML (see the sketch after this list)
  • High-Performance NLP: Process large text corpora efficiently
  • Multimodal AI: Handle text, vision, and audio data in unified pipelines
  • Custom Tokenization: Build domain-specific tokenizers for specialized applications
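
To make the first use case concrete: data is tokenized inside the trusted environment first, and only the encrypted token IDs leave it. The sketch below reuses the hypothetical CipherTokenizer API from the earlier examples and pairs it with the third-party aes-gcm crate purely for illustration; nothing in this documentation states that the SDK bundles encryption primitives, and the little-endian serialization step is a choice made for the example.

    // Sketch: tokenize first, then encrypt the token IDs (privacy-preserving pipeline).
    // CipherTokenizer, encode(), and ids() are assumed names; aes-gcm is a separate crate.
    use aes_gcm::{
        aead::{Aead, AeadCore, KeyInit, OsRng},
        Aes256Gcm,
    };
    use cipher_tokenizer::CipherTokenizer;

    fn main() {
        // 1. Tokenize the raw text (deterministic: equal inputs give equal IDs).
        let tokenizer = CipherTokenizer::default();
        let encoding = tokenizer.encode("sensitive patient note");
        let ids = encoding.ids(); // assumed to return an owned Vec<u32>

        // 2. Serialize the token IDs to bytes (little-endian, chosen for this example).
        let plaintext: Vec<u8> = ids.iter().flat_map(|id| id.to_le_bytes()).collect();

        // 3. Encrypt the serialized IDs with AES-256-GCM before storage or transport.
        let key = Aes256Gcm::generate_key(OsRng);
        let cipher = Aes256Gcm::new(&key);
        let nonce = Aes256Gcm::generate_nonce(&mut OsRng); // 96-bit nonce, unique per message
        let ciphertext = cipher
            .encrypt(&nonce, plaintext.as_slice())
            .expect("encryption failed");

        println!("encrypted {} token IDs into {} bytes", ids.len(), ciphertext.len());
    }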

Support

For questions and support: