# PD.Cipher Tokenizer SDK
Welcome to the PD.Cipher Tokenizer SDK documentation. This SDK provides a high-performance Rust tokenization library with Python bindings designed for secure, encrypted machine learning.
## Overview
The PD.Cipher Tokenizer SDK is built on a composable pipeline architecture that allows you to create custom tokenization pipelines for various data types including:
- Text data with multiple languages and scripts
- Binary data and byte sequences
- Structured data (JSON, XML)
- Multimodal inputs (vision, audio)
## Key Features
- **High Performance**: Built in Rust with parallel processing capabilities
- **Composable Architecture**: Mix and match components to create custom pipelines
- **Security First**: Designed for encrypted ML workflows with deterministic output
- **Python Bindings**: Easy integration with Python ML frameworks
- **Hugging Face Compatible**: Works seamlessly with the Hugging Face ecosystem
## Architecture
The SDK follows a pipeline architecture inspired by Hugging Face tokenizers:
```
Input Data → [DataSource] → [Normalizer] → [PreTokenizer] → [Model] → [PostProcessor] → Encoding
```
### Components
- `DataSource`: Handles input ingestion from various sources
- `Normalizer`: Performs text normalization (Unicode, lowercasing, accent removal)
- `PreTokenizer`: Splits input into initial chunks (whitespace, punctuation)
- `Model`: Core tokenization algorithms (BPE, WordPiece, Unigram)
- `PostProcessor`: Adds special tokens, handles padding and truncation
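The stage-per-component chain above can be sketched as a set of Rust traits composed into a pipeline. All trait, struct, and method names below are illustrative stand-ins, not the SDK's actual API; the `LengthModel` is a toy placeholder for a real BPE/WordPiece/Unigram model.

```rust
// Illustrative sketch of the pipeline architecture; names are hypothetical.

/// Normalizer stage: text-to-text cleanup.
trait Normalizer {
    fn normalize(&self, input: &str) -> String;
}

/// PreTokenizer stage: split normalized text into initial chunks.
trait PreTokenizer {
    fn pre_tokenize(&self, input: &str) -> Vec<String>;
}

/// Model stage: map chunks to token ids.
trait Model {
    fn tokenize(&self, chunks: &[String]) -> Vec<u32>;
}

struct Lowercase;
impl Normalizer for Lowercase {
    fn normalize(&self, input: &str) -> String {
        input.to_lowercase()
    }
}

struct Whitespace;
impl PreTokenizer for Whitespace {
    fn pre_tokenize(&self, input: &str) -> Vec<String> {
        input.split_whitespace().map(String::from).collect()
    }
}

/// Toy model: token id = chunk length (stand-in for BPE/WordPiece/Unigram).
struct LengthModel;
impl Model for LengthModel {
    fn tokenize(&self, chunks: &[String]) -> Vec<u32> {
        chunks.iter().map(|c| c.len() as u32).collect()
    }
}

/// Pipeline composes the stages in the order shown in the diagram.
struct Pipeline {
    normalizer: Box<dyn Normalizer>,
    pre_tokenizer: Box<dyn PreTokenizer>,
    model: Box<dyn Model>,
}

impl Pipeline {
    fn encode(&self, input: &str) -> Vec<u32> {
        let normalized = self.normalizer.normalize(input);
        let chunks = self.pre_tokenizer.pre_tokenize(&normalized);
        self.model.tokenize(&chunks)
    }
}

fn main() {
    let pipeline = Pipeline {
        normalizer: Box::new(Lowercase),
        pre_tokenizer: Box::new(Whitespace),
        model: Box::new(LengthModel),
    };
    println!("{:?}", pipeline.encode("Hello Cipher World")); // [5, 6, 5]
}
```

Trait objects (`Box<dyn …>`) make the stages swappable at runtime, which is what "mix and match components" implies; a real implementation might prefer generics for static dispatch.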
## Getting Started
To get started with the PD.Cipher Tokenizer SDK:
1. Install the SDK with `cargo add cipher-tokenizer`
2. Check out the examples in the `examples/` directory
3. Read the API documentation below
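To make the BPE algorithm named under `Model` concrete, here is a toy sketch of a single BPE training iteration: count adjacent symbol pairs, then merge the most frequent pair everywhere it occurs. This is a self-contained illustration, not the SDK's implementation; the tie-break on equal counts is an arbitrary choice made here for determinism.

```rust
// Toy BPE merge step: illustrative only, not the SDK's implementation.
use std::collections::HashMap;

/// Count adjacent symbol pairs across a sequence.
fn pair_counts(symbols: &[String]) -> HashMap<(String, String), usize> {
    let mut counts = HashMap::new();
    for w in symbols.windows(2) {
        *counts.entry((w[0].clone(), w[1].clone())).or_insert(0) += 1;
    }
    counts
}

/// Merge every occurrence of `pair` into a single symbol.
fn merge(symbols: &[String], pair: &(String, String)) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < symbols.len() {
        if i + 1 < symbols.len() && symbols[i] == pair.0 && symbols[i + 1] == pair.1 {
            out.push(format!("{}{}", pair.0, pair.1));
            i += 2;
        } else {
            out.push(symbols[i].clone());
            i += 1;
        }
    }
    out
}

fn main() {
    // Start from "banana" as single characters.
    let symbols: Vec<String> = "banana".chars().map(|c| c.to_string()).collect();
    let counts = pair_counts(&symbols);
    // Pick the most frequent pair; break count ties lexicographically
    // so the result is deterministic despite HashMap iteration order.
    let best = counts
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then(a.0.cmp(&b.0)))
        .unwrap()
        .0;
    println!("{:?}", merge(&symbols, &best)); // ["b", "a", "na", "na"]
}
```

A full BPE trainer repeats this loop until a target vocabulary size is reached, recording each merged pair as a vocabulary entry.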
## Use Cases
The PD.Cipher Tokenizer SDK is ideal for:
- **Secure ML Pipelines**: Tokenize data before encryption for privacy-preserving ML
- **High-Performance NLP**: Process large text corpora efficiently
- **Multimodal AI**: Handle text, vision, and audio data in unified pipelines
- **Custom Tokenization**: Build domain-specific tokenizers for specialized applications
## Support
For questions and support:
- GitHub Issues: cipher-tokenizer-sdk/issues
- Documentation: cipher-tokenizer-sdk.probabilitydrive.com
- Email: support@probabilitydrive.com