How to auto-detect a file's encoding : Charset « I18N « Java

Home
Java
1.2D Graphics GUI
2.3D
3.Advanced Graphics
4.Ant
5.Apache Common
6.Chart
7.Class
8.Collections Data Structure
9.Data Type
10.Database SQL JDBC
11.Design Pattern
12.Development Class
13.EJB3
14.Email
15.Event
16.File Input Output
17.Game
18.Generics
19.GWT
20.Hibernate
21.I18N
22.J2EE
23.J2ME
24.JavaFX
25.JDK 6
26.JDK 7
27.JNDI LDAP
28.JPA
29.JSP
30.JSTL
31.Language Basics
32.Network Protocol
33.PDF RTF
34.Reflection
35.Regular Expressions
36.Scripting
37.Security
38.Servlets
39.Spring
40.Swing Components
41.Swing JFC
42.SWT JFace Eclipse
43.Threads
44.Tiny Application
45.Velocity
46.Web Services SOA
47.XML
Java » I18N » Charset 




How to auto-detect a file's encoding
   

/*
 *  Copyright 2010 Georgios Migdos <[email protected]>.
 
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 
 *       http://www.apache.org/licenses/LICENSE-2.0
 
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 *  under the License.
 */

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

/**
 *
 @author Georgios Migdos <[email protected]>
 */
public class CharsetDetector {

    public Charset detectCharset(File f, String[] charsets) {

        Charset charset = null;

        for (String charsetName : charsets) {
            charset = detectCharset(f, Charset.forName(charsetName));
            if (charset != null) {
                break;
            }
        }

        return charset;
    }

    private Charset detectCharset(File f, Charset charset) {
        try {
            BufferedInputStream input = new BufferedInputStream(new FileInputStream(f));

            CharsetDecoder decoder = charset.newDecoder();
            decoder.reset();

            byte[] buffer = new byte[512];
            boolean identified = false;
            while ((input.read(buffer!= -1&& (!identified)) {
                identified = identify(buffer, decoder);
            }

            input.close();

            if (identified) {
                return charset;
            else {
                return null;
            }

        catch (Exception e) {
            return null;
        }
    }

    private boolean identify(byte[] bytes, CharsetDecoder decoder) {
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
        catch (CharacterCodingException e) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        File f = new File("example.txt");

        String[] charsetsToBeTested = {"UTF-8""windows-1253""ISO-8859-7"};

        CharsetDetector cd = new CharsetDetector();
        Charset charset = cd.detectCharset(f, charsetsToBeTested);

        if (charset != null) {
            try {
                InputStreamReader reader = new InputStreamReader(new FileInputStream(f), charset);
                int c = 0;
                while ((c = reader.read()) != -1) {
                    System.out.print((char)c);
                }
                reader.close();
            catch (FileNotFoundException fnfe) {
                fnfe.printStackTrace();
            }catch(IOException ioe){
                ioe.printStackTrace();
            }

        }else{
            System.out.println("Unrecognized charset.");
        }
    }
}
    
  














Related examples in the same category
1.List the Charset in your system
2.Converting Between Strings (Unicode) and Other Character Set Encodings
3.Charset encoding test
4.encoder and decoder use a supplied ByteBuffer
5.Translate Charset
6.Detect non-ASCII characters in string
7.Displays Charsets and aliasesDisplays Charsets and aliases
8.Encode and DecodeEncode and Decode
9.extends CharsetDecoder to create Base64 Decoder
10.extends Charset to create Hex Charset
11.Get the default charset
12.Charset Toolkit
13.Defines common charsets supported in all Java platforms.
java2s.com  | Contact Us | Privacy Policy
Copyright 2009 - 12 Demo Source and Support. All rights reserved.
All other trademarks are property of their respective owners.