Language detection does not work as expected


1

I am using the https://code.google.com/p/language-detection Java library to detect the language of a given text. The profiles are the ones that ship with the library. However, the result is sometimes surprisingly different from what I expect. Is something wrong in my code, or should I regenerate the profiles?

I have tried with "ld.detect("en");" both commented out and uncommented. Does whitespace affect language detection?

    // Needs java.io.BufferedReader, java.io.FileReader and java.io.IOException.
    LanguageDetect ld = new LanguageDetect();
    ld.init("C:\\James\\languageTest\\profiles");
    //ld.detect("en");

    // try-with-resources closes the reader automatically, even on failure.
    try (BufferedReader br = new BufferedReader(new FileReader("C:\\James\\failcases.txt"))) {
        String textCurrentLine;
        // Detect the language of each line of the input file separately.
        while ((textCurrentLine = br.readLine()) != null) {
            System.out.println(ld.detect(textCurrentLine));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

Below is what I get for a few words:

Communication - en
Timing - tl
none - it
user - it
No - pt
Yes - fr
user - no
generated - da
Diagnostic - it
not supported - en
supported - en
Bus Speed - en
Protocol - it
2013-10-24 12:58
by James Shaji
I would not expect language detection heuristics to be particularly good if the text sample is as small as one or two words - Stephen C 2013-10-24 13:17
That's what I was assuming it to be... - James Shaji 2013-10-24 13:21


1

As the library's FAQ states:

Can langdetect handle short texts?

This library requires that a detection text has some length, almost 10-20 words over.

It may return a wrong language for very short text with 1-10 words.

You are trying it on one-word or two-word texts. This is not the use case this library is built for, so you are going to get wrong results.
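
One way to avoid acting on unreliable results is to skip detection when the input is too short. Here is a minimal sketch of that idea. The class name ShortTextGuard, the MIN_WORDS threshold of 10 (borrowed from the FAQ's "1-10 words" remark) and the stand-in detector are all assumptions for illustration, not part of the library:

```java
import java.util.Optional;
import java.util.function.Function;

public class ShortTextGuard {
    // Assumed threshold, based on the FAQ's "1-10 words" remark; tune it for your data.
    public static final int MIN_WORDS = 10;

    // Counts whitespace-separated tokens; returns 0 for null or blank input.
    public static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Calls the detector only when the text has at least MIN_WORDS words;
    // otherwise returns an empty Optional so the caller can handle the case explicitly.
    public static Optional<String> detectIfLongEnough(String text,
                                                      Function<String, String> detector) {
        if (countWords(text) < MIN_WORDS) {
            return Optional.empty();
        }
        return Optional.ofNullable(detector.apply(text));
    }

    public static void main(String[] args) {
        // Stand-in detector for illustration only; replace it with a lambda
        // that wraps your real detection call.
        Function<String, String> fakeDetector = t -> "en";

        System.out.println(detectIfLongEnough("Timing", fakeDetector));
        System.out.println(detectIfLongEnough(
                "This sentence is long enough to give the detector a fair chance",
                fakeDetector));
    }
}
```

Returning an empty Optional rather than a guessed language code makes the "too short to tell" case visible to the caller, who can then fall back to another strategy.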

For single words without context, you can try to match them against dictionaries of the languages you are targeting.
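
A rough sketch of what such a dictionary lookup could look like is below. The word lists are tiny, made-up samples and the class and method names are hypothetical; real dictionaries would be much larger and loaded from files:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class DictionaryLanguageGuess {
    // Hypothetical sample dictionaries, keyed by language code.
    // LinkedHashMap keeps the languages in insertion order.
    private static final Map<String, Set<String>> DICTIONARIES = new LinkedHashMap<>();
    static {
        DICTIONARIES.put("en", new HashSet<>(Arrays.asList(
                "yes", "no", "user", "timing", "supported")));
        DICTIONARIES.put("fr", new HashSet<>(Arrays.asList(
                "oui", "non", "utilisateur")));
        DICTIONARIES.put("it", new HashSet<>(Arrays.asList(
                "ciao", "no", "utente")));
    }

    // Returns every language whose dictionary contains the word (case-insensitive).
    // The list is empty when no dictionary knows the word.
    public static List<String> candidateLanguages(String word) {
        String key = word.trim().toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : DICTIONARIES.entrySet()) {
            if (entry.getValue().contains(key)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Timing", "No", "Protocol"}) {
            System.out.println(word + " - " + candidateLanguages(word));
        }
    }
}
```

Note that a word such as "No" appears in more than one of these sample dictionaries, so the lookup returns several candidates. A single word can be genuinely ambiguous, and a dictionary approach cannot resolve that on its own.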

2013-10-24 13:16
by Cyrille Ka
I was planning on maintaining a dictionary to detect the language, but I wanted to check whether I was using this library correctly first. Creating a dictionary would be a huge task. Any ideas on how to build one, or are there any prebuilt libraries available online? - James Shaji 2013-10-24 13:27
I don't know of such a library, sorry - Cyrille Ka 2013-10-24 13:34