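Character Sets and Programming

Most people start learning to program with C, and I was no exception.

Once you get into systems programming or MFC, you run into C code that looks a little unfamiliar.

Functions such as wprintf and wcslen are typical examples.

To give the conclusion first: they do the same job as the functions you already know (printf, strlen), and which variant you use depends on the character type of the strings you pass in.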

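#Character codes

ASCII (American Standard Code for Information Interchange)

A standard code defined by ANSI

Defines 128 characters using 7 bits; each character is normally stored in one 8-bit byte

UNICODE

A coding scheme created to represent the scripts of every language, because ASCII alone cannot represent text outside the English-speaking world

On Windows, each character is normally stored as a 16-bit (2-byte) UTF-16 code unit; characters outside the Basic Multilingual Plane need two such units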

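#Types of character sets

SBCS (Single Byte Character Set)

Representation: 1 byte per character

Example: ASCII

MBCS (Multi Byte Character Set)

Representation: variable width. Korean: 2 bytes, English: 1 byte

WBCS (Wide Byte Character Set)

Representation: 2 bytes per character (on Windows, where wchar_t is 16 bits)

Example: Unicode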

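#Problems with MBCS

When a string contains Korean text, it is easy for the programmer to make mistakes, because the number of bytes no longer matches the number of characters.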

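#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The results below assume an MBCS execution character set such as
       CP949, where each Hangul syllable takes 2 bytes. */
    char str[] = "ABC한국";
    int size = (int)sizeof(str);
    int len = (int)strlen(str);
    int i;

    printf("Array size: %d \n", size);    // Array size: 8
    printf("String length: %d \n", len);  // String length: 7

    for (i = 0; i < 5; i++)   // 5 bytes, not 5 characters
    {
        fputc(str[i], stdout); // ABC한
    }

    fputs("\n", stdout);

    for (i = 0; i < 7; i++)   // all 7 bytes
    {
        fputc(str[i], stdout); // ABC한국
    }
    return 0;
}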

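The array size comes out as 8, as expected, because it includes the terminating null character.

You would normally expect the string length to be 5, but strlen returns 7.

The reason is that strlen counts bytes, and each Korean character takes 2 bytes: ABC = 3 bytes, 한국 = 4 bytes, 7 bytes in total.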

# Unicode-based (WBCS-based) programming

  #define      CONST             const
 
  typedef      char              CHAR;
  typedef      CHAR *            LPSTR;
  typedef      CONST CHAR *      LPCSTR;
 
  typedef      wchar_t           WCHAR;
  typedef      WCHAR *           LPWSTR;
  typedef      CONST WCHAR *     LPCWSTR;
 

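WBCS-based programming has a problem of its own.

Not every program runs on a Unicode basis yet; some still run on an ASCII (MBCS) basis.

The problem goes away if the code is written so that the same source supports both Unicode and ASCII.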

#Macros that support both MBCS and WBCS

  #ifdef   UNICODE
     typedef   WCHAR       TCHAR;
     typedef   LPWSTR      LPTSTR;
     typedef   LPCWSTR     LPCTSTR;
  #else
     typedef   CHAR        TCHAR;
     typedef   LPSTR       LPTSTR;
     typedef   LPCSTR      LPCTSTR;
  #endif
 

  #ifdef   _UNICODE
     #define  __T(x)   L  ##  x  // paste L and x into a single token
  #else
     #define  __T(x)   x
  #endif
 
  #define  _T(x)       __T(x)
  #define  _TEXT(x)    __T(x) 
 

  #ifdef   _UNICODE
      #define    _tmain          wmain
      #define    _tcslen         wcslen
      #define    _tcscat         wcscat
      #define    _tcscpy         wcscpy
      #define    _tcsncpy        wcsncpy
      #define    _tcscmp         wcscmp
      #define    _tcsncmp        wcsncmp
      #define    _tprintf        wprintf
      #define    _tscanf         wscanf
      #define    _fgetts         fgetws
      #define    _fputts         fputws
  #else
      #define    _tmain          main
      #define    _tcslen         strlen
      #define    _tcscat         strcat
      #define    _tcscpy         strcpy
      #define    _tcsncpy        strncpy
      #define    _tcscmp         strcmp
      #define    _tcsncmp        strncmp
      #define    _tprintf        printf
      #define    _tscanf         scanf
      #define    _fgetts         fgets
      #define    _fputts         fputs
  #endif
 

Conclusion: how the names expand

If the macro UNICODE (_UNICODE for the tchar.h names) is defined

TCHAR str; => WCHAR str; => wchar_t str;

_T("hello"); => __T("hello"); => L"hello"

_tmain => wmain

If the macro UNICODE (_UNICODE) is not defined

TCHAR str; => CHAR str; => char str;

_T("hello"); => __T("hello"); => "hello"

_tmain => main

Example

Reading the type name LPCWSTR

LP means a pointer (*)

C means const

W means WBCS, that is, Unicode (wchar_t)

STR means a string of characters

Put together: const wchar_t *

Expansion

LPCTSTR

When the macro UNICODE is defined

LPCTSTR => LPCWSTR => CONST WCHAR * => const wchar_t *

When the macro UNICODE is not defined

LPCTSTR => LPCSTR => CONST CHAR * => const char *

