The discussion revolves around the introduction and significance of a Python tokenizer written in C++, aiming to be a drop-in replacement for existing solutions like tiktoken. Key themes emerge regarding performance optimization, the software development lifecycle, the role of Python in AI/ML, and the specifics of tokenization algorithms.
Performance Optimization and Drop-in Replacements
A central theme is the value and beauty of creating a drop-in replacement that offers significant performance improvements. This is highlighted as a major selling point for adoption.
- One user states, "There’s something beautiful about creating a drop in replacement for something that improves performance substantially."
- The motivation for such work is evident, as one comment suggests, "Agreed. I figured nobody would use it otherwise."
- The importance of clearly communicating these improvements is also emphasized: "Put it in there readme & description. It's a big selling point."
- There's a general sentiment that substantial progress in areas like Transformers & LLMs is currently shifting towards performance optimization: "I feel as though we're at a stage where most substantial progress is being made on the performance side."
- One user expresses admiration for engineers who produce performant C++ code in AI/ML, indicating a belief that "good chance some solid engineering tradeoffs have been made and dramatic improvement will be seen."
- The potential for performance gains by writing parts of AI/ML infrastructure in C++ is recognized, particularly for "drop in and fix key bottlenecks."
- A comparison to another project with a similar goal is made: "Nice work! I tried something similar a while back ago: https://github.com/kevmo314/tokie"
- The desire for compatibility is expressed: "Would it be possible to eliminate that little vocab format conversion requirement for the vocab I see in the test against tiktoken? It would be nice to have a fully compatible drop in replacement without having to think about details." The author of the new tokenizer confirms a successful update to achieve this: "Alright, 0.1.1 should now be a true drop-in replacement. I'll write up some examples soon." (A usage sketch of the drop-in pattern follows this list.)
- When asked whether the optimizations could be upstreamed to existing libraries, the author replies: "I've reached out to the guy who maintains Tiktoken to talk about this."
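To make the "drop-in replacement" idea concrete, here is a minimal usage sketch. The tiktoken calls shown are its real interface; the commented-out swap to a tokendagger module with the same surface is an assumption about how the replacement might be packaged, not something confirmed in the discussion.

```python
# Hedged sketch: get_encoding/encode/decode are the tiktoken calls a drop-in
# replacement would need to mirror. The "tokendagger" import below is a
# hypothetical module name used purely for illustration.
import tiktoken
# import tokendagger as tiktoken   # hypothetical swap: same call sites, different backend

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Hello, world!")
print(tokens)               # a list of integer token IDs
print(enc.decode(tokens))   # round-trips back to "Hello, world!"
```

The appeal described in the thread is exactly this: if the interface matches, adopters change at most an import and keep the rest of their code untouched.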
Software Development Lifecycle and "Make It Right"
The discussion touches on different philosophies for software development, often framed as stages of "making it work," "making it right," and "making it fast." This prompts a debate on the meaning and relationship between these stages, particularly in the context of ML where initial functionality might be imperfect.
- A common adage is presented: "Make it work. Make it fast. Make it pretty."
- An alternative phrasing is offered: "Make It Work, Make It Right, Make It Fast."
- The distinction between "make it work" and "make it right" is explored, with some arguing they are synonymous ("if it's not right, it doesn't work").
- However, another user clarifies that in ML, "it does work to a degree even if it's not 100% correct." This leads to the idea that "make it work" often involves "hacking," and "make it right" addresses incremental corrections.
- The concept of "right" is further elaborated: "make it work can be a hacky, tech debt laden implementation. Making it right involves refactoring/rewriting with an eye towards maintainability, testability, etc etc."
- An analogy is used to differentiate: "My mentor used say it is the difference between a screw and glue... You can glue some things together and prove that it works, but eventually you learn that anytime you had to break something to fix it, you should've used a screw."
- Another interpretation of "Make it Right" is "Make it Beautiful," meaning "non-hacky, easily maintainable, easily extendable, well tested, and well documented."
- The phrase "Make it work" can also be read as covering initial functionality, even if the result is rough.
- A user points to ancient wisdom: "A similar concept dates back to 30BC: https://en.wikipedia.org/wiki/De_architectura Firmitas, utilitas, venustas - Strong, useful, and beautiful."
- The Huggingface transformers library is mentioned as undergoing a refactor for extensibility and performance: "The Huggingface transformers lib is currently undergoing a refactor to get rid of cruft and make it more extensible, hopefully with some perf gains."
The Role of Python in AI/ML
A significant portion of the discussion centers on the suitability of Python for AI/ML development, particularly in contrast to lower-level languages like C++ or Rust. The trade-offs between iteration speed, ecosystem, and raw performance are debated.
- One user suggests moving away from Python for ML, arguing, "In the long run it doesn’t make sense just because it’s the language ML engineers are familiar with."
- This provocative statement is met with strong disagreement: "No! This is not good. Iteration speed trumps all in research, most of what Python does is launch GPU operations, if you're having slowdowns from Pythonland then you're doing something terribly wrong. Python is an excellent (and yes, fast!) language for orchestrating and calling ML stuff. If C++ code is needed, call it as a module." (A minimal sketch of this call-native-code-as-a-module pattern follows this list.)
- Another defends Python's utility: "It makes plenty of sense. Python handles strings well, has a great package ecosystem, and is easy to write/learn for non-programmers. It can be easily embedded into a notebook (which is huge for academics) and is technically a "write once run anywhere" platform in theory. It's great."
- The prevalence of performance-critical code being written in C/Cython within Python libraries is noted: "Most of that is already happening under the hood. A lot of performance-sensitive code is already written in C or cython. For example numpy, scikit learn, pandas. Lots of torch code is either C or CUDA."
- The reason for Python's adoption by researchers is explained by its conciseness: "ML researchers aren’t using python because they are dumb. They use it because what takes 8 lines in Java can be done with 2 or 3 (including import json) in python for example."
- One user argues that the actual bottlenecks are in CUDA kernels, not Python overhead: "The key bottlenecks are not in tokenization, but running the actual CUDA kernels. Python actually has very little overhead. (See VLLM, which is primarily in Python)." This suggests the case for C++ lies mainly in kernel optimization rather than in the orchestration layer.
- A question is raised about Rust's role: "It looks like TikToken is written in Rust (...) are the gains here actually from porting to C++?"
- The sentiment "I'm relieved to see that its not written in rust" is expressed, suggesting a potential preference against Rust for this project.
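As an illustration of "call it as a module," here is a minimal ctypes sketch of the pattern the thread describes: Python orchestrates while a native library does the heavy lifting. The library name libfasttok.so and its count_tokens function are assumptions made for illustration, not an existing artifact.

```python
# Sketch of Python-orchestrates / native-code-computes, assuming a hypothetical
# compiled C/C++ shared library "libfasttok.so" that exports:
#     size_t count_tokens(const char *utf8_text);
import ctypes

lib = ctypes.CDLL("./libfasttok.so")            # hypothetical shared library
lib.count_tokens.argtypes = [ctypes.c_char_p]
lib.count_tokens.restype = ctypes.c_size_t

def count_tokens(text: str) -> int:
    # Python handles strings, files, and glue; the hot loop runs in native code.
    return lib.count_tokens(text.encode("utf-8"))
```

This is the same division of labor the thread attributes to numpy, scikit-learn, pandas, and torch: a thin Python surface over C, Cython, or CUDA.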
Tokenization Specifics and Algorithms
The performance improvements of the new tokenizer are attributed to specific algorithmic changes, particularly in how special tokens are handled, leading to a discussion about tokenization quality and underlying algorithms.
- The core of the performance gain is explained: "simplifying the algorithm to forego regex matching special tokens at all."
- A user queries potential quality impact: "Does that mean there could be cases with less quality in terms of tokenization?" The original author confirms expected identical output: "The output should be identical, assuming no bugs."
- The method used by tiktoken (regex matching of special tokens) is contrasted with the new approach (simple string matching against caller-defined special tokens): "The Tiktoken implementation takes a collection of all special tokens upon initialization and compiles them into a regex by joining them with | ... So this yields a huge regexp of special tokens that shouldn't be considered at all. TokenDagger does not do this. Instead, simple string matching is used." (A sketch of both strategies appears after this list.)
- A question is posed about tokenizers for code: "Is there a tokenizer someone can recommend for code?"
- Interest in WebAssembly (WASM) bindings is expressed.
- A comparison to another tokenizer crate is requested: "How does this compare to the BPE crate [1]? Its main selling point is support for incrementally re-tokenising text, but it's also faster than tiktoken." The author plans to benchmark against it.
- The availability of local tokenizers for different LLMs, like Gemini, is discussed, with the observation that Gemini uses SentencePiece and shares vocabulary with Gemma.
- It's noted that "Underlying them is a core algorithm like SentencePiece or Byte-pair encoding (BPE). Tiktoken and TokenDagger are BPE implementations." (A toy BPE sketch also appears after this list.)
- The value of building model-specific quirks into the library for easier integration and minor performance gains is considered.
- A user asks about potential differences in tokenization compared to OpenAI's tokenizer, to which the author clarifies that both use BPE and should produce identical results.
- A comparison with Huggingface's tokenizers is suggested, noting that the tiktoken benchmark might be outdated; an anecdotal observation is that Huggingface's tokenizers are often faster.
- The overall importance of tokenization performance is questioned, with some believing it's a "ridiculously small part of the overall computation" dominated by matrix multiplications, while others see value in optimizing it, especially for large-scale data processing.
- The dominance of GPU kernels in overall computation speed is reiterated: "Tokenization is typically done on CPU and is rarely (if ever) a bottleneck for training or inference. GPU kernels typically dominate in terms of wall clock time..."
- The complexity and deployment challenges of highly optimized tokenizers like SentencePiece and Tiktoken are mentioned as a significant undertaking.
- One user makes a strong claim about the potential impact, arguing that a faster tokenizer "obliterates the SOTA from the money is no object status quo."
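To ground the special-token discussion, here is a small sketch contrasting the two strategies as they are described in the thread: one large alternation regex versus plain substring search over only the caller-supplied special tokens. This is illustrative code, not the actual tiktoken or TokenDagger implementation.

```python
# Illustrative comparison of the two special-token strategies described above.
import re

special_tokens = ["<|endoftext|>", "<|fim_prefix|>", "<|fim_suffix|>"]

# Strategy 1 (described for tiktoken): compile all special tokens into one regex
# by joining them with "|".
special_re = re.compile("|".join(re.escape(t) for t in special_tokens))

def find_specials_regex(text):
    return [(m.start(), m.group()) for m in special_re.finditer(text)]

# Strategy 2 (described for TokenDagger): simple string matching against only the
# special tokens the caller actually allows.
def find_specials_plain(text, allowed):
    hits = []
    for tok in allowed:
        start = 0
        while (i := text.find(tok, start)) != -1:
            hits.append((i, tok))
            start = i + len(tok)
    return sorted(hits)

sample = "foo<|endoftext|>bar"
print(find_specials_regex(sample))
print(find_specials_plain(sample, {"<|endoftext|>"}))
```

Both approaches flag the same spans, which is consistent with the author's claim that the output is identical; the difference is only in how the matching work is performed.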
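Since the thread notes that tiktoken and TokenDagger are both BPE implementations, a toy byte-pair-encoding training loop may help readers unfamiliar with the algorithm. This is a teaching sketch over characters; real implementations operate on bytes with pretrained merge ranks.

```python
# Toy BPE: repeatedly merge the most frequent adjacent pair of symbols.
# Not tiktoken's or TokenDagger's code; purely illustrative.
from collections import Counter

def bpe_train(text: str, num_merges: int = 10):
    tokens = list(text)                       # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)          # apply the learned merge
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

print(bpe_train("low lower lowest", num_merges=5))
```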
Learning and Resources
The discussion lightly touches on learning resources for understanding LLM internals and performance optimization.
- A user asks about resources for self-teaching LLM internals.
- The author shares a list of resources including Modal's GPU glossary, Karpathy's LLM overview, 3b1b's videos on transformers, and a CUDA optimization log, also mentioning the use of ChatGPT for learning.