kiwi-kiwi
  • Member since: Mar. 6, 2009
  • Offline.
Forum Stats
Member
Level 09
Programmer
Newgrounds charset Jan. 23rd, 2013 @ 02:20 PM

I was wondering why Newgrounds uses the windows-1252 charset instead of UTF-8. Is there any advantage to preferring a codepage over Unicode?

Diki
  • Member since: Jan. 31, 2004
  • Offline.
Forum Stats
Moderator
Level 13
Programmer
Response to Newgrounds charset Jan. 23rd, 2013 @ 04:36 PM

At 1/23/13 02:20 PM, kiwi-kiwi wrote: Is there any advantage to preferring a codepage over Unicode?

No.

Rawnern
  • Member since: Jan. 24, 2009
  • Offline.
Forum Stats
Member
Level 28
Movie Buff
Response to Newgrounds charset Jan. 24th, 2013 @ 07:15 AM

At 1/23/13 02:20 PM, kiwi-kiwi wrote: I was wondering why Newgrounds uses the windows-1252 charset instead of UTF-8. Is there any advantage to preferring a codepage over Unicode?

I have no idea, but sometimes not all characters are supported in a given charset. I had a big problem with this a while ago.
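For example (a rough, untested sketch, assuming PHP with the mbstring extension loaded): characters that windows-1252 simply has no slot for get dropped when you convert to it, which is usually how that kind of problem shows up.

<?php
// Characters with no mapping in windows-1252 are lost on conversion.
// mbstring substitutes '?' for unmappable characters by default.
$utf8   = "naïve café 日本語";
$cp1252 = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
echo $cp1252;   // "naïve café ???" - the accented Latin survives, the CJK text is gone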

VBAssassin
  • Member since: Jul. 22, 2008
  • Offline.
Forum Stats
Member
Level 01
Blank Slate
Response to Newgrounds charset Jan. 24th, 2013 @ 01:56 PM

Like most things, there are advantages and disadvantages to both. windows-1252 is a single-byte encoding, meaning each character uses at most a single byte of memory. UTF-8 is multibyte, so once you go outside the ASCII range (0-127) a single character starts using more space. This also makes it harder to process, since you must use Unicode-compatible functions (the default string functions in PHP, for example, are not Unicode-aware unless you override them with the mb_* functions via php.ini, but that's for another time).
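To illustrate that point about PHP's defaults (a quick sketch, assuming the mbstring extension is available): strlen() counts bytes, mb_strlen() counts characters, so they disagree as soon as a multibyte character turns up.

<?php
$s = "café";                        // stored as UTF-8: the é takes 2 bytes
echo strlen($s);                    // 5 - byte count, not character count
echo mb_strlen($s, 'UTF-8');        // 4 - character count, Unicode-aware
echo strtoupper($s);                // "CAFé" - the default function skips the é
echo mb_strtoupper($s, 'UTF-8');    // "CAFÉ" - the mb_* version handles it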

The main reason to use UTF-8 is to support different languages (like French, German, Chinese, etc.).

UTF-8 is the recommended standard because it keeps the regular ASCII English characters as single bytes, giving it good backward compatibility, while extending into multibyte sequences for characters from other languages.
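You can see that backward compatibility directly in the byte lengths (sketch, assuming a UTF-8 source file):

<?php
// In UTF-8, plain ASCII stays one byte; everything else grows.
echo strlen("A");     // 1 byte  - identical to ASCII / windows-1252
echo strlen("é");     // 2 bytes - Latin letter outside ASCII
echo strlen("€");     // 3 bytes
echo strlen("中");    // 3 bytes
echo bin2hex("é");    // "c3a9" - one lead byte plus one continuation byte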

How a site is coded on the backend often determines how easy it is to upgrade from a single-byte encoding to a multibyte one like UTF-8. Maybe that's why this site hasn't upgraded?

Personally, on my site Coder Profile I allow UTF-8 as the encoding, since that's what editors like MS Word tend to paste into the pages regardless of the defined charset. Using UTF-8 allows those characters to display as expected.

HOWEVER, since I allow source code to be pasted, I don't want the UTF-8 quotes and a few others (the 66/99-style curly quotes), because the syntax highlighter won't support them. So I map a few characters down to their plain ASCII equivalents. The whole string is then converted from UTF-8 back down to latin-1, to avoid having to retest and change code in the PHP environment for UTF-8 support; since I first started coding the site years ago, before UTF-8 was popular, this was much better to do in this case.

However, I do encode to HTML entities, and in doing so the upper range of characters in latin-1 is never used. This means I can switch to UTF-8 quite easily without messing up any of the existing data, should I decide to change to UTF-8 in the future ;)
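Roughly what that pipeline looks like (just a sketch of the idea, not the actual Coder Profile code; the function name and mapping are made up for illustration):

<?php
// Hypothetical clean-up step for pasted source code.
function clean_pasted_code(string $input): string
{
    // Map the UTF-8 "smart" quotes (the 66/99-style ones) down to plain ASCII
    // so the syntax highlighter isn't fed characters it can't handle.
    $map = [
        "\xE2\x80\x9C" => '"',   // left double quote
        "\xE2\x80\x9D" => '"',   // right double quote
        "\xE2\x80\x98" => "'",   // left single quote
        "\xE2\x80\x99" => "'",   // right single quote
    ];
    $s = strtr($input, $map);

    // Convert back down to latin-1 so the rest of the (pre-UTF-8) code base
    // never has to deal with multibyte strings.
    $s = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');

    // Encode to HTML entities; anything above ASCII is stored as an entity,
    // so the upper latin-1 range never actually appears in the saved data.
    return htmlentities($s, ENT_QUOTES, 'ISO-8859-1');
}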

kiwi-kiwi
  • Member since: Mar. 6, 2009
  • Offline.
Forum Stats
Member
Level 09
Programmer
Response to Newgrounds charset Jan. 24th, 2013 @ 03:21 PM

At 1/24/13 01:56 PM, VBAssassin wrote: Like most things, there are advantages and disadvantages to both. windows-1252 is a single-byte encoding, meaning each character uses at most a single byte of memory. UTF-8 is multibyte, so once you go outside the ASCII range a single character starts using more space.

Yup, the first 128 characters are indeed single byte to ensure backwards compatibility with ASCII. Past that, the high bit of the lead byte signals that the character is made of more than one byte: the lead byte starts with 110, 1110 or 11110 (for 2-, 3- or 4-byte sequences) and every continuation byte starts with 10. And as if that wasn't enough, you have 4 normalization forms for Unicode strings, because some characters can be written either as a single code point or as a sequence of code points; for instance é can be written as one precomposed character or as a plain e followed by a combining accent.
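You can poke at both of those from PHP (sketch; the normalization part assumes the intl extension is installed for the Normalizer class):

<?php
// Bit patterns of a 2-byte UTF-8 character: lead byte 110xxxxx, continuation 10xxxxxx.
foreach (str_split("é") as $byte) {
    echo sprintf("%08b ", ord($byte));   // 11000011 10101001
}

// The same character in two normalization forms:
$nfc = Normalizer::normalize("é", Normalizer::FORM_C);   // single precomposed code point
$nfd = Normalizer::normalize("é", Normalizer::FORM_D);   // "e" + combining acute accent
echo bin2hex($nfc);   // c3a9
echo bin2hex($nfd);   // 65cc81
var_dump($nfc === $nfd);                                              // false - different bytes
var_dump($nfc === Normalizer::normalize($nfd, Normalizer::FORM_C));  // true once normalized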

How a site is coded on the backend often determines how easy it is to upgrade from a single-byte encoding to a multibyte one like UTF-8. Maybe that's why this site hasn't upgraded?

This is most likely the answer to my question.

VBAssassin
  • Member since: Jul. 22, 2008
  • Offline.
Forum Stats
Member
Level 01
Blank Slate
Response to Newgrounds charset Jan. 24th, 2013 @ 04:22 PM

I think that proves my point about why they're likely still using a single-byte encoding ;)