This morning Facebook’s AI Research (FAIR) lab released an update to fastText, its super-speedy open-source text classification library. When it was initially released, fastText shipped with pre-trained word vectors for 90 languages, but today it’s getting a boost to 294 languages. The release also brings enhancements to reduce model size and ultimately memory demand.
Text classifiers like fastText make it easy for developers to ship tools that depend on underlying language analysis. Flagging clickbait headlines or filtering spam both require an underlying model that can interpret and categorize language.
From the start, fastText was designed to run on a wide variety of hardware. Unfortunately, in its original state, it still required a few gigabytes of memory to run. This isn’t a problem if you’re working in a state-of-the-art lab, but it’s a deal-breaker if you’re trying to make things work on mobile.
By collaborating with the team behind another Facebook open-source project, Facebook AI Similarity Search (FAISS), the company was able to reduce the memory requirement to just a few hundred kilobytes. FAISS addresses some of the inherent bottlenecks that developers face when dealing with huge amounts of data.
A massive corpus of information is often best represented in a multi-dimensional vector space. For Facebook and many other companies, it is critical to compare these vectors efficiently, whether they are matching content against user preferences or against other content. The FAISS team’s approach ended up playing a big role in reducing the memory demands of fastText.
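To make the idea of comparing vectors concrete, here is a minimal Python sketch that ranks a few pieces of content against a user-preference vector using cosine similarity. The vectors, item names, and the `rank_by_similarity` helper are invented for illustration, and this brute-force comparison is not how FAISS itself works; FAISS exists precisely because this approach gets slow on huge collections.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction, so treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, items):
    """Sort (name, score) pairs from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
user_pref = [0.9, 0.1, 0.0]
content = {
    "sports_story": [0.8, 0.2, 0.1],
    "cooking_video": [0.0, 0.3, 0.9],
    "match_recap": [1.0, 0.0, 0.0],
}

ranking = rank_by_similarity(user_pref, content)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Every query here touches every stored vector, which is fine for three items and unworkable for billions. That gap is the kind of bottleneck a similarity search library is meant to close.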
“A few key ingredients, namely feature pruning, quantization, hashing, and re-training, allow us to produce text classification models with tiny size, often less than 100kB when trained on several popular datasets, without noticeably sacrificing accuracy or speed,” said the Facebook authors of a December 2016 paper entitled “fastText.zip: Compressing Text Classification Models.”
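For a rough sense of why quantization shrinks a model, the Python sketch below stores each weight as a one-byte code instead of a full floating-point number. It is a simplified stand-in that assumes plain 8-bit scalar quantization, not the scheme the paper’s authors use, and the weights and function names are made up for the example.

```python
def quantize(values, levels=256):
    """Map floats onto integer codes in [0, levels - 1] over their min-max range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: a single code represents every entry.
        return [0] * len(values), lo, 0.0
    step = (hi - lo) / (levels - 1)
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Recover approximate floats from the integer codes."""
    return [lo + c * step for c in codes]


# Hypothetical model weights, for illustration only.
weights = [-0.75, -0.1, 0.0, 0.33, 0.9]

codes, lo, step = quantize(weights)
packed = bytes(codes)  # one byte per weight instead of a 4- or 8-byte float
restored = dequantize(codes, lo, step)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"stored {len(packed)} bytes, worst-case error {max_error:.5f}")
```

The trade-off is visible in `max_error`: each weight can drift by up to half a quantization step, which is why the harder problem is keeping accuracy, not shrinking the file.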
The authors went on to hypothesize that additional model size reduction might be possible in the future. The challenge isn’t so much shrinking the models as it is maintaining accuracy. In the meantime, engineers can access the updated library on GitHub and begin tinkering today.