bug#66236: Specific Korean characters break Unicode parsing

bug-sed

From:	Kristian Järventaus
Subject:	bug#66236: Specific Korean characters break Unicode parsing
Date:	Wed, 27 Sep 2023 14:38:15 +0300
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.15.1

sed (GNU sed) 4.8
Packaged by Debian


Issue: I have a bunch of data that I want to clean up in the form

====

GET_PAGE: ForkPoolWorker-19, title='외출하다', hash:3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
GET_PAGE: ForkPoolWorker-19, title='Module:munge text', hash:86aa20ba5f2a310911fc93b32b7ef14de944b233f2894236ed236350cf467a4d
GET_PAGE: ForkPoolWorker-19, title='Module:ko-translit', hash:3f795c903dc252d3dedad1f7100c22de324986980a475396aabcdd554b886897
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron', hash:f4dde115a55246e97c0a14ea30f6896d9759e748040b8d45ac9c60ebb073cdcb
GET_PAGE: ForkPoolWorker-19, title='Module:ko', hash:8ebb346f32119102d15f4b464dcf178912f5ca4889ece0cbeed97ae198a6e743
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron/data', hash:bd4e173ed2d8f9140b524ba76d7c9862494d8fb798d8e756ea5229a830e815d9
GET_PAGE: ForkPoolWorker-19, title='Template:it-pr', hash:ecdb98dc9ac1387ad4f847c7bc2113fcafd016b2e7b44dc8ae806fcb83c95d62
GET_PAGE: ForkPoolWorker-19, title='traffica', hash:40728b79d679469e655593a096dbf2780a92b584d1a79d296d3b24a1543832b5

=====

(the title field contains basically all article titles from en.wiktionary.org, so tons and tons of Unicode, from everywhere in the Unicode set)

However, certain Hangeul (Korean) characters break *something*. After doing some replacements on data that looks like the above, I am always left with a bunch of lines with Korean titles.

> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27], hash.*/\1, \2/'



Output:
======
ForkPoolWorker-19, Template:ko-conj/verb
ForkPoolWorker-19, Template:affix

GET_PAGE: ForkPoolWorker-19, title='외출하다', hash:3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648

ForkPoolWorker-19, Module:munge text
ForkPoolWorker-19, Module:ko-translit
======

I tried to figure out if there was some kind of weird end-of-line character or something that would stop the regex from processing, and in all the faulty examples (all with Korean titles) I could find one shared byte: what is M-m in `cat -v` output, 237 decimal ('m' + 128).


=====
'허공'  'M-mM-^WM-^HM-jM-3M-5'
title='평의회'  title='M-mM-^OM-^IM-lM-^]M-^XM-mM-^ZM-^L'
title='풍년화'  title='M-mM-^RM-^MM-kM-^EM-^DM-mM-^YM-^T'
'프로'  'M-mM-^TM-^DM-kM-!M-^\'
기계화  M-jM-8M-0M-jM-3M-^DM-mM-^YM-^T
맹세하다  M-kM-'M-9M-lM-^DM-8M-mM-^UM-^XM-kM-^KM-$
애프터  M-lM-^UM- M-mM-^TM-^DM-mM-^DM-0
고해  M-jM-3M- M-mM-^UM-4
얼큰하다  M-lM-^VM-<M-mM-^AM-0M-mM-^UM-^XM-kM-^KM-$
추가하다  M-lM-6M-^TM-jM-0M-^@M-mM-^UM-^XM-kM-^KM-$
푼체  M-mM-^QM-<M-lM-2M-4
목표어  M-kM-*M-)M-mM-^QM-^\M-lM-^VM-4
=====
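
The shared byte can be checked without sed. M-m in `cat -v` notation is the byte 0xED, which in valid UTF-8 only ever appears as the first byte of a three-byte sequence (code points U+D000 to U+D7FF, which includes the upper part of the Hangul syllable block). A minimal sketch, assuming a POSIX shell with `od` available; `utf8_hex` is a hypothetical helper name, not part of sed or coreutils:

```shell
# Print the UTF-8 bytes of each argument in hex, one argument per line.
# utf8_hex is a hypothetical helper, introduced only for this illustration.
utf8_hex() {
    for s in "$@"; do
        # od dumps the raw bytes; tr joins od's output lines and squeezes spaces.
        hex=$(printf '%s' "$s" | od -An -tx1 | tr -s ' \n' ' ')
        hex=${hex# }    # drop the leading space left by od
        hex=${hex% }    # drop the trailing space left by the newline
        printf '%s\n' "$hex"
    done
}

# The syllable shared by several failing titles; its first byte is ed.
utf8_hex '하'
```

For '하' this prints `ed 95 98`, matching the `M-mM-^UM-^X` sequence in the listing above.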

The version of the above command without anything after the capture block

>  sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=\(.\+\)/\1, \2/'

parses correctly, because the .\+ captures to the end of the line (so my initial suspicion was wrong). AFAICT, if my Unicode is correct (and I don't have much reason to believe it is mangled: the file contains basically the titles of every en.wiktionary.org article, so not just Korean and ASCII), it seems that the presence of a character with the M-m byte breaks Unicode parsing for the rest of the line, which causes any specific regexes (like the second ["\x27]) to fail to match because the Unicode 'cursor' is out of sync or something similar.

I can confirm that the presence of specific characters is the cause by eliminating individual characters:
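
One way to probe this hypothesis is to run the same substitution in the C locale, where GNU sed treats every byte as a single character and no multibyte decoding takes place. This is only a diagnostic sketch, not a confirmed fix; it assumes GNU sed (`\+` and `\x27` are GNU extensions), and the variable names `expr` and `line` are chosen here for illustration:

```shell
# The substitution from the report, applied to one failing sample line.
expr='s/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27], hash.*/\1, \2/'
line="GET_PAGE: ForkPoolWorker-19, title='외출하다', hash:3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648"

# Current locale: the case the report describes as failing.
printf '%s\n' "$line" | sed "$expr"

# C locale: '.' matches any single byte, so UTF-8 decoding cannot interfere.
printf '%s\n' "$line" | LC_ALL=C sed "$expr"
```

If the second command rewrites the line and the first does not, the difference lies in how the multibyte locale decodes the input rather than in the regex itself.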


====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
GET_PAGE: ForkPoolWorker-19, title='외출다',
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
GET_PAGE: ForkPoolWorker-20, title='부도덕다',
GET_PAGE: ForkPoolWorker-20, title='부도덕하',
====

> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27],.*/\1, \2/' kor.txt > kor.test

=====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
ForkPoolWorker-19, 외출다
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
ForkPoolWorker-20, 부도덕다
=====

Every single occurrence of this issue that I found (and there were many of them, because the data is very big) had an M-m byte somewhere in the Hangeul.

I can't reproduce this on https://sed.js.org/; there the output is as expected.
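
That observation can be cross-checked mechanically by counting the lines that contain the raw byte 0xED (octal 355). A minimal sketch, assuming a POSIX shell and grep; `count_ed_lines` is a hypothetical helper name:

```shell
# Count the lines of a file that contain the byte 0xED (shown as M-m by cat -v).
# LC_ALL=C makes grep compare raw bytes instead of decoded characters.
# count_ed_lines is a hypothetical helper, introduced only for this illustration.
count_ed_lines() {
    LC_ALL=C grep -c "$(printf '\355')" "$1"
}
```

Running `count_ed_lines kor.test` on the output file and comparing the result with the number of lines still starting with `GET_PAGE:` would show whether the two sets coincide.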



--
Kristian Järventaus
Research Assistant / Tutkimusavustaja
Clausal Computing Oy
kristian.jarventaus@clausal.com
