emacs-devel

Re: Can watermarking Unicode text using invisible differences sneak thro


From: Tim Cross
Subject: Re: Can watermarking Unicode text using invisible differences sneak through Emacs, or can Emacs detect it?
Date: Thu, 20 Jan 2022 17:35:23 +1100
User-agent: mu4e 1.7.6; emacs 28.0.91

Richard Stallman <rms@gnu.org> writes:

> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
> [[[ whether defending the US Constitution against all enemies,     ]]]
> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>
> Explanation to Eli: I understand that these 0-width characters have
> legitimate, useful purposes.  It is good that we support them.
>
> The issue I've raised, which was explained in the text I cited, is
> that _allegedly_  it is possible to use them maliciously, by inserting
> a sequence of them to function as a sort of watermark that users
> normally won't even see.
>
>   > You can highlight them like so:
>
>   > (set-face-background 'glyphless-char "red")
>
>   > I've had that configured ever since
>   > https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31194#40
>
>   > If you're not expecting zero-width characters in text in general,
>   > I think it's a good setting.
>
> I think I will try that, just in case someone sends me some of those.
> Thanks.
>
> Should we make this the default?  I think it is likely that most Emacs users
> will see only malicious zero-width characters, and not useful ones.
>
> Is there a way we could detect automatically when these zero-width
> characters are being used in a legit way for their intended purpose,
> and in that case, display them as zero-width for real?
>
> That way, they would work right when used properly, and ring an alarm
> (metaphorically) when used in a fishy way.
>
>   > Emacs by default displays ZWJ and ZWNJ characters (and any other
>   > zero-width characters) as thin 1-pixel spaces on GUI frames, and as
>   > simple spaces on TTY frames.  So Emacs users are likely to see these
>   > "hidden" sequences of characters on display.
>
> I wonder if we could do something clever to show when there is a
> sequence of multiple different 1-pixel characters?  For instance,
> maybe give different colors to different characters, so that a
> sequence of several shows as a funny spectrum?
>
> This could alert the user that "someone's messing with you here".
>
> There are many possible variants of the details -- I don't know what
> would be best, or what would be easy, but people could try various
> methods.

Just to add some context here which some might find useful.

At one point, I worked for an organisation which had real concerns about
sensitive information being leaked (mainly to the press) and wanted to
be able to track down the source when that happened. Essentially, this
technique was used. All electronic documents, when distributed to the
approved list of recipients, carried a unique id stamp made of zero-width
characters. When I left, the organisation was also experimenting with
adding similar 'marks' to emails sent via the organisation's email
server. So this practice is definitely occurring. It is probably more
prevalent in PDF and Word documents, but I guess it could be in plain text
email messages as well.
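
To make the idea concrete, here is a minimal sketch of how such a stamp
could work. I never saw the actual scheme, so everything here is an
assumption for illustration: the bit encoding (ZERO WIDTH SPACE for 0,
ZERO WIDTH NON-JOINER for 1), the 16-bit id, the placement after the
first word, and the hypothetical names embed_id, extract_id and
strip_marks.

```python
from typing import Optional

# Assumed encoding, for illustration only.
ZERO = "\u200b"  # ZERO WIDTH SPACE stands for bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER stands for bit 1


def embed_id(text: str, recipient_id: int, bits: int = 16) -> str:
    """Hide recipient_id in text as a run of zero-width characters."""
    if not 0 <= recipient_id < 2 ** bits:
        raise ValueError("recipient_id does not fit in the given bit width")
    # Most significant bit first.
    mark = "".join(
        ONE if (recipient_id >> i) & 1 else ZERO
        for i in reversed(range(bits))
    )
    # Put the mark after the first word (or at the end if there is no space).
    head, sep, tail = text.partition(" ")
    return head + mark + sep + tail


def extract_id(text: str) -> Optional[int]:
    """Recover the hidden id, or None if the text carries no mark."""
    marks = [ch for ch in text if ch in (ZERO, ONE)]
    if not marks:
        return None
    value = 0
    for ch in marks:
        value = (value << 1) | (1 if ch == ONE else 0)
    return value


def strip_marks(text: str) -> str:
    """Remove the two marker characters, leaving the visible text."""
    return "".join(ch for ch in text if ch not in (ZERO, ONE))
```

A copy stamped with embed_id("Quarterly results attached.", 1234)
displays the same as the original in most viewers, yet extract_id on a
leaked copy gives back the recipient's id.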

This technique (and related ones) doesn't need high technical expertise
either. We had a similar problem at a university I worked at, where
students used this technique to defeat the anti-plagiarism software the
uni used. The software used basic text matching, and students started to
defeat it both by using zero-width characters to break patterns and by
using Unicode characters with glyphs that looked like standard characters,
allowing the document to print and look correct while also breaking
pattern matching. Of course, once you are aware this is going on, you
can improve the pattern matching and add checks to detect this type of
activity. Personally, I was always amazed at the lengths people went to
to defeat the anti-plagiarism software. It always seemed it would be easier
not to plagiarise and to cite when appropriate.
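
As a rough sketch of what "improving the pattern matching" can mean:
normalise a copy of the text before comparing it. This is not what the
uni's software did (I don't know its internals). The function name is
made up and the look-alike table is a tiny illustrative sample, not a
complete list of confusable characters.

```python
import unicodedata

# A handful of Cyrillic letters that look like Latin ones.
# Illustrative sample only; a real table would be far larger.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def normalize_for_matching(text: str) -> str:
    """Return a copy of text suitable for plain substring matching."""
    out = []
    for ch in text:
        # General category Cf (format) covers the zero-width space,
        # joiner and non-joiner, word joiner and similar characters.
        if unicodedata.category(ch) == "Cf":
            continue
        out.append(HOMOGLYPHS.get(ch, ch))
    return "".join(out)


# "plagiarism" disguised with a zero-width space and a Cyrillic "a".
disguised = "pla\u200bgi\u0430rism"
print("plagiarism" in disguised)                          # False
print("plagiarism" in normalize_for_matching(disguised))  # True
```

Dropping every format character is only reasonable on the copy used
for comparison, since some scripts need the joiners to render
correctly.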

It is a big challenge to find a way to alert users to this possibly
unwanted 'tagging' while, at the same time, allowing legitimate use. For
example, in org-mode, it can sometimes be difficult to combine different
markup and other syntax, often because of a corner case which is
difficult to address with font-locking regexps. Adding a zero-width space
is sometimes sufficient to work around the ambiguity in the regexp.
The point is, anything which makes such use visually noticeable will also
make the technique less useful for addressing this issue.
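
One possible middle ground, along the lines of the "sequence of
multiple" idea quoted above, is to leave isolated zero-width characters
alone and only flag runs of them. The sketch below is a heuristic of my
own for illustration, not something Emacs does. The character set, the
threshold of two and the name suspicious_runs are all assumptions.

```python
import re

# Two or more consecutive zero-width characters: ZWSP, ZWNJ, ZWJ,
# WORD JOINER, and ZERO WIDTH NO-BREAK SPACE (BOM).
ZERO_WIDTH_RUN = re.compile("[\u200b\u200c\u200d\u2060\ufeff]{2,}")


def suspicious_runs(text):
    """Return (offset, length) for each run of 2+ zero-width characters."""
    return [(m.start(), len(m.group())) for m in ZERO_WIDTH_RUN.finditer(text)]


# A single zero-width space separating org-mode markup is not flagged.
org_text = "*bold*\u200b/italic/"
print(suspicious_runs(org_text))  # []

# A run of four zero-width characters after the first word is flagged.
marked = "Dear\u200b\u200c\u200c\u200b colleague"
print(suspicious_runs(marked))    # [(4, 4)]
```

This would still miss a mark spread out as single characters between
words, and some scripts legitimately use joiners next to each other, so
it is only a starting point.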


