Popular datasets, model architectures, and common text transformations for natural language processing

artificial intelligence, machine learning, deep learning, natural language processing, text, classification


TorchLanguage is the equivalent of TorchVision for Natural Language Processing. It gives you access to text transformations (tokens, indices, n-grams, etc.) and datasets.

Join our community to create datasets and deep-learning models! Chat with us on Gitter and join the Google Group to collaborate with us.

This repository consists of:

  • torchlanguage.datasets : Pre-built datasets for common NLP tasks
  • torchlanguage.models : Generic pretrained models for common NLP tasks
  • torchlanguage.transforms : Common transformations for text
  • torchlanguage.utils : Tools, functions and measures for NLP


Make sure you have Python 2.7 or 3.5+ and PyTorch 0.2.0 or newer. You can then install TorchLanguage using pip:

pip install TorchLanguage

Optional requirements

If you want to use the English tokenizer from spaCy, you need to install spaCy and download its English model:

pip install spacy
python -m spacy download en

Text transformation pipeline

The following transformations are available:

  • Character
  • Character2Gram
  • Character3Gram
  • Compose
  • DropOut
  • Embedding
  • FunctionWord
  • GensimModel
  • GloveVector
  • HorizontalStack
  • MaxIndex
  • PartOfSpeech
  • RandomSamples
  • RemoveCharacter
  • RemoveLines
  • RemoveRegex
  • Tag
  • ToFrequencyVector
  • ToIndex
  • Token
  • ToLength
  • ToLower
  • ToNGram
  • ToOneHot
  • ToUpper
  • Transformer
  • VerticalStack
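
To give a feel for how transformations such as ToLower, Character2Gram, ToIndex and Compose fit together, here is a minimal sketch in plain Python. The classes below are illustrative stand-ins written for this example, not TorchLanguage's own implementation, and their constructor arguments (for instance the hypothetical CharacterNGram(n)) are assumptions; check the library itself for the real signatures.

```python
# Illustrative stand-ins only: these classes are NOT TorchLanguage's code.
# They show the general idea of chaining callable text transformations.


class ToLower:
    """Lower-case the input text."""

    def __call__(self, text):
        return text.lower()


class CharacterNGram:
    """Split text into overlapping character n-grams (hypothetical helper)."""

    def __init__(self, n):
        self.n = n

    def __call__(self, text):
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]


class ToIndex:
    """Map each symbol to an integer, growing the vocabulary as it goes."""

    def __init__(self):
        self.vocab = {}

    def __call__(self, symbols):
        return [self.vocab.setdefault(s, len(self.vocab)) for s in symbols]


class Compose:
    """Apply a list of transformations in order."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, x):
        for transform in self.transforms:
            x = transform(x)
        return x


pipeline = Compose([ToLower(), CharacterNGram(2), ToIndex()])
indices = pipeline("Banana")
print(indices)  # [0, 1, 2, 1, 2] for the bigrams ba, an, na, an, na
```

Each step is a plain callable, so steps can be reordered or swapped without changing the rest of the pipeline.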


The data module provides the following:

  • Ability to download and load a corpus from a directory. Each file must be named Class_Title.txt:
dataset = torchlanguage.datasets.FileDirectory(
  • Wrapper for dataset splits (train, validation) and cross-validation:
cross_val_dataset = {'train': torchlanguage.utils.CrossValidation(dataset, k=k),
                     'test': torchlanguage.utils.CrossValidation(dataset, k=k, train=False)}
# Loop over the k folds; a separate loop variable avoids overwriting k
for fold in range(k):
   # Training split
   for data in cross_val_dataset['train']:
      inputs, label = data
   # end for
   # Test split (train=False)
   for data in cross_val_dataset['test']:
      inputs, label = data
   # end for
# end for
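
For readers unfamiliar with k-fold cross-validation, the following sketch shows the general idea in plain Python: the samples are cut into k parts, and each part serves once as the test set while the rest is used for training. The function k_fold_indices is a hypothetical helper written for this illustration; it is not part of TorchLanguage and makes no claim about how CrossValidation works internally.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    Hypothetical helper for illustration; folds are contiguous and
    the first (n_samples % k) folds get one extra sample.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for fold, (train, test) in enumerate(folds):
    print(fold, train, test)
```

Every sample lands in exactly one test fold, so averaging a score over the k folds uses all of the data for evaluation without ever testing on a sample that was in that fold's training set.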


The datasets module currently contains:

  • FileDirectory: Load a corpus from a directory
  • ReutersC50Dataset: The Reuters C50 dataset for authorship attribution
  • SFGram: A set of science-fiction magazines with five authors.
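
Since a corpus directory is expected to contain files named Class_Title.txt, it may help to see how such a name can be split into a class label and a title. The sketch below is an assumption about the convention (the class is taken to be everything before the first underscore), not FileDirectory's actual parsing code; parse_class_and_title and the example file names are hypothetical.

```python
import os


def parse_class_and_title(filename):
    """Split 'Class_Title.txt' into (class label, title).

    Assumes the class is the part before the first underscore;
    this is an illustration, not TorchLanguage's own parser.
    """
    stem, _ = os.path.splitext(os.path.basename(filename))
    label, _, title = stem.partition("_")
    return label, title


# Hypothetical file names following the Class_Title.txt pattern
examples = [
    "Asimov_Foundation.txt",
    "corpus/Dick_Ubik.txt",
    "Herbert_Dune_Messiah.txt",
]
parsed = [parse_class_and_title(name) for name in examples]
for label, title in parsed:
    print(label, title)
```

Splitting on the first underscore only keeps titles that themselves contain underscores intact.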

Others are planned or a work in progress:

  • Translation
  • Question answering

See the examples directory for examples of dataset usage.

Related Work


EchoTorch is a Python framework for easily implementing Reservoir Computing models with PyTorch.



If you find TorchLanguage useful for an academic publication, then please use the following BibTeX to cite it:

@misc{torchlanguage,
   author = {Schaetti, Nils},
   title = {TorchLanguage: Natural Language Processing with pyTorch},
   year = {2018},
   publisher = {GitHub},
   journal = {GitHub repository},
   howpublished = {\url{}},
}
