Python: How to determine the language?


29

I want to get this:

Input text: "ру́сский язы́к"
Output text: "Russian" 

Input text: "中文"
Output text: "Chinese" 

Input text: "にほんご"
Output text: "Japanese" 

Input text: "العَرَبِيَّة"
Output text: "Arabic" 

How can I do it in Python? Thanks.

2016-08-25 10:26
by Rita
What did you try - Raskayu 2016-08-25 10:27
this may help http://stackoverflow.com/questions/4545977/python-can-i-detect-unicode-string-language-cod - Sardorbek Imomaliev 2016-08-25 10:34


27

Have you had a look at langdetect?

from langdetect import detect

lang = detect("Ein, zwei, drei, vier")
print(lang)
# output: de
2016-08-25 10:38
by dheiberg
Not very accurate - it detects the language of the text 'anatomical structure' as ro (Romanian). Multiple-language output is required for such cases. polyglot performs much better - Yuriy Petrovskiy 2018-06-20 10:41
Interesting - for the same example, langdetect can return different languages - Denis Kuzin 2018-06-27 10:12


45

  1. TextBlob. Requires the NLTK package, uses Google.

    from textblob import TextBlob
    b = TextBlob("bonjour")
    b.detect_language()
    

pip install textblob

  2. Polyglot. Requires numpy and some arcane libraries; unlikely to get it to work on Windows. (For Windows, get appropriate versions of PyICU, Morfessor and PyCLD2 from here, then just pip install downloaded_wheel.whl.) Able to detect texts with mixed languages.

    from polyglot.detect import Detector
    
    mixed_text = u"""
    China (simplified Chinese: 中国; traditional Chinese: 中國),
    officially the People's Republic of China (PRC), is a sovereign state
    located in East Asia.
    """
    for language in Detector(mixed_text).languages:
        print(language)
    
    # name: English     code: en       confidence:  87.0 read bytes:  1154
    # name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
    # name: un          code: un       confidence:   0.0 read bytes:     0
    

pip install polyglot

To install the dependencies, run: sudo apt-get install python-numpy libicu-dev

  3. chardet also has a feature of detecting languages if there are character bytes in the range (127-255]:

    >>> import chardet
    >>> chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))
    {'encoding': 'windows-1251', 'confidence': 0.9637267119204621, 'language': 'Russian'}
    

pip install chardet

  4. langdetect. Requires large portions of text. It uses a non-deterministic approach under the hood, so you can get different results for the same text sample. The docs say you have to use the following code to make it deterministic:

    from langdetect import detect, DetectorFactory
    DetectorFactory.seed = 0
    detect('今一はお前さん')
    

pip install langdetect

  5. guess_language. Can detect very short samples by using a spell checker with dictionaries.

pip install guess_language-spirit

  6. langid provides both a Python module:

    import langid
    langid.classify("This is a test")
    # ('en', -54.41310358047485)
    

and a command-line tool:

    $ langid < README.md

pip install langid
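
The libraries above return ISO 639-1 codes such as 'en' or 'ru', while the question asks for full language names like "Russian". A minimal sketch of the mapping with a hand-made dict (illustrative only; the pycountry package covers all codes):

```python
# Map ISO 639-1 codes (as returned by langdetect, langid, etc.)
# to English language names. This small dict is only an example;
# extend it or use pycountry for full coverage.
LANG_NAMES = {
    "ru": "Russian",
    "zh": "Chinese",
    "ja": "Japanese",
    "ar": "Arabic",
    "en": "English",
    "de": "German",
}

def code_to_name(code):
    # Normalize variants like "zh-cn" or "zh_Hant" to the base code,
    # and fall back to the raw code for anything unmapped.
    base = code.replace("_", "-").split("-")[0].lower()
    return LANG_NAMES.get(base, code)

print(code_to_name("ru"))     # Russian
print(code_to_name("zh-cn"))  # Chinese
```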

2017-11-04 02:32
by Rabash
detectlang is way faster than Textblob - Anwarvic 2018-04-24 14:18
@Anwarvic TextBlob uses the Google API (https://github.com/sloria/TextBlob/blob/dev/textblob/translate.py#L33)! That's why it's slow - Thomas Decaux 2019-01-14 17:59
polyglot ended up being the most performant for my use case. langid came in second - jamescampbell 2019-02-23 13:19


0

You can try determining the Unicode block of the characters in the input string to identify the type of language (Cyrillic for Russian, for example), and then search for language-specific symbols in the text.
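
A minimal standard-library sketch of that idea: `unicodedata.name()` starts with the script name ("CYRILLIC", "CJK", "HIRAGANA", "ARABIC", ...), so counting scripts per character gives a rough guess. The script-to-language mapping below is a hypothetical example - it only works for languages with a distinctive script and cannot, say, tell English from German:

```python
import unicodedata
from collections import Counter

# Illustrative mapping from the first word of unicodedata.name()
# to a language guess. One script can serve many languages, so
# treat this as a heuristic, not a real detector.
SCRIPT_TO_LANG = {
    "CYRILLIC": "Russian",
    "CJK": "Chinese",
    "HIRAGANA": "Japanese",
    "KATAKANA": "Japanese",
    "ARABIC": "Arabic",
}

def guess_by_script(text):
    counts = Counter()
    for ch in text:
        if ch.isalpha():  # skips digits, spaces, combining accents
            name = unicodedata.name(ch, "")
            script = name.split()[0] if name else ""
            counts[SCRIPT_TO_LANG.get(script, "unknown")] += 1
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]

print(guess_by_script("ру́сский язы́к"))  # Russian
print(guess_by_script("中文"))           # Chinese
```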

2016-08-25 11:10
by Kerbiter