bug-sed

bug#66236: Specific Korean characters break Unicode parsing


From: Kristian Järventaus
Subject: bug#66236: Specific Korean characters break Unicode parsing
Date: Wed, 27 Sep 2023 14:38:15 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.15.1

sed (GNU sed) 4.8
Packaged by Debian


Issue: I have a bunch of data that I want to clean up in the form

====
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
GET_PAGE: ForkPoolWorker-19, title='Module:munge text', hash: 86aa20ba5f2a310911fc93b32b7ef14de944b233f2894236ed236350cf467a4d
GET_PAGE: ForkPoolWorker-19, title='Module:ko-translit', hash: 3f795c903dc252d3dedad1f7100c22de324986980a475396aabcdd554b886897
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron', hash: f4dde115a55246e97c0a14ea30f6896d9759e748040b8d45ac9c60ebb073cdcb
GET_PAGE: ForkPoolWorker-19, title='Module:ko', hash: 8ebb346f32119102d15f4b464dcf178912f5ca4889ece0cbeed97ae198a6e743
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron/data', hash: bd4e173ed2d8f9140b524ba76d7c9862494d8fb798d8e756ea5229a830e815d9
GET_PAGE: ForkPoolWorker-19, title='Template:it-pr', hash: ecdb98dc9ac1387ad4f847c7bc2113fcafd016b2e7b44dc8ae806fcb83c95d62
GET_PAGE: ForkPoolWorker-19, title='traffica', hash: 40728b79d679469e655593a096dbf2780a92b584d1a79d296d3b24a1543832b5
=====

(The title field covers basically every article title on en.wiktionary.org, so the data contains a great deal of Unicode from all over the code space.)

However, certain Hangeul (Korean) characters break *something*. After running replacements on data like the above, I am always left with a number of unprocessed lines, all of them with Korean titles.

> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27], hash.*/\1, \2/'


Output:
======
ForkPoolWorker-19, Template:ko-conj/verb
ForkPoolWorker-19, Template:affix
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
ForkPoolWorker-19, Module:munge text
ForkPoolWorker-19, Module:ko-translit
======

I tried to figure out whether some odd end-of-line character or similar was stopping the regex from matching. In all the faulty examples (all with Korean titles) I found one shared byte: the one shown as M-m in `cat -v` output, i.e. 237 decimal (0xED, 'm' + 128).

=====
'허공'  'M-mM-^WM-^HM-jM-3M-5'
title='평의회'  title='M-mM-^OM-^IM-lM-^]M-^XM-mM-^ZM-^L'
title='풍년화'  title='M-mM-^RM-^MM-kM-^EM-^DM-mM-^YM-^T'
'프로'  'M-mM-^TM-^DM-kM-!M-^\'
기계화  M-jM-8M-0M-jM-3M-^DM-mM-^YM-^T
맹세하다  M-kM-'M-9M-lM-^DM-8M-mM-^UM-^XM-kM-^KM-$
애프터  M-lM-^UM- M-mM-^TM-^DM-mM-^DM-0
고해  M-jM-3M- M-mM-^UM-4
얼큰하다  M-lM-^VM-<M-mM-^AM-0M-mM-^UM-^XM-kM-^KM-$
추가하다  M-lM-6M-^TM-jM-0M-^@M-mM-^UM-^XM-kM-^KM-$
푼체  M-mM-^QM-<M-lM-2M-4
목표어  M-kM-*M-)M-mM-^QM-^\M-lM-^VM-4
=====
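
The byte values behind the `cat -v` notation can be checked directly. The following is a minimal sketch, assuming a POSIX shell, `od`, and a script saved as UTF-8; the helper name `bytes_of` is made up for illustration. It dumps the two syllables of the first example, 허공. Note that 0xED is the UTF-8 lead byte for code points U+D000 through U+D7FF, which would explain why only part of the Hangul syllable range is affected.

```shell
# Print the UTF-8 bytes of a string as one run of hex digits.
# bytes_of is a hypothetical helper, not part of any tool mentioned here.
bytes_of() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

heo=$(bytes_of '허')    # U+D5C8, shown by cat -v as M-mM-^WM-^H
gong=$(bytes_of '공')   # U+ACF5, shown by cat -v as M-jM-3M-5

echo "허 = $heo"
echo "공 = $gong"
```

The first byte of 허 is ed (237 decimal), matching the M-m in the listing, while 공 starts with ea and appears only in titles that also contain some other M-m character.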

The version of the above command with nothing after the capture group

>  sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=\(.\+\)/\1, \2/'

parses correctly, because the .\+ captures to the end of the line (so my initial suspicion about a stray end-of-line character was wrong). As far as I can tell my Unicode is fine: I have little reason to believe it is mangled, since the file contains the titles of basically every en.wiktionary.org article, not just Korean and ASCII. It seems instead that a character containing the M-m byte breaks Unicode parsing for the rest of the line, so that any specific pattern after it (like the second ["\x27]) fails to match, perhaps because the Unicode 'cursor' is out of sync or something similar.
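
One way to narrow down the out-of-sync guess is to run the same expression once in the C locale, where GNU sed matches byte by byte, and once in whatever locale the shell is using. This is only a sketch, assuming GNU sed (for `\+` and `\x27`) and a POSIX shell; the variable names are illustrative. If the line is rewritten under `LC_ALL=C` but not otherwise, that would point at multibyte handling rather than at the data.

```shell
# One input line in the reported format.
line="GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648"
expr='s/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27], hash.*/\1, \2/'

# C locale: "." matches any single byte, so no multibyte decoding is involved.
c_out=$(printf '%s\n' "$line" | LC_ALL=C sed "$expr")

# Current locale: the case under suspicion.
cur_out=$(printf '%s\n' "$line" | sed "$expr")

echo "C locale:       $c_out"
echo "current locale: $cur_out"
```

If the two results differ, running the cleanup under `LC_ALL=C` may also serve as a workaround, since nothing in this particular expression depends on character boundaries.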

I can confirm that the presence of specific characters is the cause by eliminating individual characters:

====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
GET_PAGE: ForkPoolWorker-19, title='외출다',
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
GET_PAGE: ForkPoolWorker-20, title='부도덕다',
GET_PAGE: ForkPoolWorker-20, title='부도덕하',
====
> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27],.*/\1, \2/' kor.txt > kor.test
=====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
ForkPoolWorker-19, 외출다
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
ForkPoolWorker-20, 부도덕다
=====

Every single occurrence of this issue that I found (and there were many of them, because the data is very big) had an M-m byte somewhere in the Hangeul.

I can't reproduce this on https://sed.js.org/; there the output is as expected.
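
The claim that every unprocessed line contains the M-m byte can be checked mechanically rather than by eye. Below is a rough sketch, assuming GNU grep and a POSIX shell. The three sample lines are a hypothetical stand-in for the real sed output, which would be substituted for the temporary file.

```shell
# Hypothetical sample standing in for the real sed output file.
out=$(mktemp)
printf '%s\n' \
  "GET_PAGE: ForkPoolWorker-19, title='외출하다'," \
  "ForkPoolWorker-19, 외출다" \
  "GET_PAGE: ForkPoolWorker-20, title='부도덕하'," > "$out"

ed_byte=$(printf '\355')   # the 0xED byte, shown as M-m by cat -v

# Lines that sed left unprocessed (still starting with GET_PAGE).
unprocessed=$(LC_ALL=C grep -ac '^GET_PAGE' "$out")

# Unprocessed lines that do NOT contain 0xED; expected to be zero
# if the observation holds. grep exits non-zero on a zero count.
without_ed=$(LC_ALL=C grep -a '^GET_PAGE' "$out" | LC_ALL=C grep -avc "$ed_byte" || true)

echo "unprocessed: $unprocessed, of which without 0xED: $without_ed"
rm -f "$out"
```

A non-zero second count on the real data would mean that some other byte or condition is also involved.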


--
Kristian Järventaus
Research Assistant / Tutkimusavustaja
Clausal Computing Oy
kristian.jarventaus@clausal.com




