
Re: Verifying Toolchain Semantics


From: William ML Leslie
Subject: Re: Verifying Toolchain Semantics
Date: Mon, 6 Oct 2014 19:51:36 +1100

On 6 October 2014 11:30, Ian Grant <address@hidden> wrote:
http://lists.gnu.org/archive/html/guile-devel/2014-10/msg00016.html

From:    William ML Leslie
Date:    Mon, 6 Oct 2014 00:57:49 +1100
On 3 October 2014 22:56, Taylan Ulrich Bayirli/Kammer <address@hidden> wrote:

> Say, for example, that I can guess you will end up viewing the
> document with one of five output programs - maybe, evince, chrome,
> tk, gecko translating xsl-fo to xul or firefox viewing html 4
> directly, and say that I only have exploits for the first three.
> Now I have a bit of a problem: I need to find a way to get your tool
> to emit the specific structure that will invoke the exploit.  Since
> it will show a document that looks correct, although underneath may
> have a different structure, I will have to account for variation in
> what it emits but there will be a limited class of algebraic
> manipulations that it can make without changing the semantics of the
> document.

It's not about this. These are all the programs that have problems.
We're not going to use these sorts of programs anymore. We're going to
use auto-generated, nearly perfect code. And after a few years it
won't be just near perfect, it'll be perfect.

Verifiable, safe renderers for these kinds of formats are something I really want to see, but if you have that, why do you need the code to have this amorphous `semantic-fixpoint` quality?  What does it give you?

> Maybe I would not find a way to /reliably/ produce a document that
> will infect your machine.  Or maybe I would.  It's hypothetical.
> But you know as well as I do that fuzzing a document generator to
> see what instructions produce exploitable output is easier than
> either writing a verifiable PDF viewer or writing a tool that
> bisimulates a PDF document with a class of layout languages.  The
> cherry on top is that all that extra code you introduce is surely
> rife with additional vulnerabilities - so maybe I don't need to
> bother with producing those back-end exploits at all.

This is wrong. If you look at the automatically-generated assembler
code from a formal spec, in the form of a functional program
generating that code:

  https://github.com/IanANGrant/red-october/blob/master/src/dynlibs/ffi/AbstractMachineWord.sml

The paragraph I wrote above talks about three things, in increasing order of labour:

0. Fuzzing #2
1. A verifiable PDF viewer (a formal model, together with some implementation of that model, generated or not)
2. A tool that bisimulates a variety of layout languages

I was trying to suggest #1 as the ideal solution, as you seem to be here.  The original straw-man was #2, intended to be dual to the semantic fixpoint you described.

 

you will see that the extra code has no extra attack vectors. It's all
machine-generated. If this were a piece of code for reading a binary
file, then it would have been generated from a grammar describing
exactly what the valid words in that language are. (Using "word" and
"language" in the technical sense of "valid string" and "set of all
valid strings".) So buffer overruns won't happen. And if all the
semantics are formalised, then off-by-one errors won't happen
either. If they do, they'll happen all over the place and you will
find out very quickly.
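
(To make the shape of that concrete: a minimal sketch, in OCaml rather
than the SML of the linked file, of a reader whose structure mirrors an
invented toy grammar, word ::= LEN byte{LEN}. Every access is
bounds-checked by construction; nothing here is from the actual
generator.)

  (* A reader that mirrors the grammar rule directly: malformed
     input yields None instead of a buffer overrun. *)
  type word = Word of string

  let read_word (buf : string) (pos : int) : (word * int) option =
    if pos >= String.length buf then None
    else
      let len = Char.code buf.[pos] in
      if pos + 1 + len > String.length buf then None  (* reject, never overrun *)
      else Some (Word (String.sub buf (pos + 1) len), pos + 1 + len)

  (* The whole buffer is a valid string of the language, or nothing. *)
  let rec read_words buf pos acc =
    if pos = String.length buf then Some (List.rev acc)
    else
      match read_word buf pos with
      | None -> None
      | Some (w, pos') -> read_words buf pos' (w :: acc)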

> Though I think your analogy is wrong.  PDF is already an abstract
> semantics, like any formal language spec.  The question is whether
> an implementation realizes precisely those semantics, or adds "oh
> and insert a backdoor" to some pieces of code in that language, "as
> an addition to the language semantics."  This augmentation to the
> semantics needn't be apparent in the source code of the
> implementation, since the language it's written in might again be
> compiled by a compiler that "augments" the semantics of the language
> it compiles, and that in turn needn't be apparent in the compiler's
> source ... until you arrive at the long since vanished initial
> source code that planted the virus.

> I hope by now you understand why I think this to be nonsense.  It's
> still a C compiler.  You know that, I know that.  Why wouldn't the
> machine be able to tell?

Because it's formally undecidable. That's Rice's Theorem. And it's
probably not detectable "in practice" by semantic methods either,
because we are going to extra lengths to obscure the effective
operational semantics by multiple re-encodings. How are you going to
recognise a compiled Fortran 90 program translating a C program into
an abstract assembler like lightning, as a C compiler? You don't even
know what the opcode numbers are: we can permute them.
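
(A toy sketch of the permutation point, in OCaml, with invented
instructions and numberings:)

  (* One abstract machine, two opcode numberings.  A recogniser keyed
     on concrete opcode values sees two unrelated byte streams; the
     semantics are identical. *)
  type insn = Load of int | Add | Store of int

  let encode numbering = function
    | Load r  -> [ numbering.(0); r ]
    | Add     -> [ numbering.(1) ]
    | Store r -> [ numbering.(2); r ]

  let prog = [ Load 1; Add; Store 2 ]

  let () =
    let bytes_a = List.concat_map (encode [| 0; 1; 2 |]) prog in
    let bytes_b = List.concat_map (encode [| 7; 3; 5 |]) prog in
    assert (bytes_a <> bytes_b)  (* same program, different surface *)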

Worst comes to worst, you feed it a small C program as input inside an abstract interpreter.  It's the victim's CPU that is being burned, so nontermination of the algorithm is a minor problem for the attacker, who needs to ensure utilisation is low enough to avoid questions.

Rice's theorem applies to the entire class of programs that people can write in a Turing-complete language, but in practice, people write programs that can be checked for denotational equivalence.  It is costly (computationally) to do this with a general-purpose programming language, but I think you'd be surprised at the wide range of existing programs that we can analyse that way. *
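
(In miniature, the sort of check meant here - normalise both terms and
compare normal forms, in the spirit of the Morte post linked below -
might look like this toy normaliser for the untyped lambda calculus.
A sketch only: it can loop on terms with no normal form.)

  (* de Bruijn-indexed terms. *)
  type term = Var of int | Lam of term | App of term * term

  let rec shift d c = function
    | Var k -> if k >= c then Var (k + d) else Var k
    | Lam t -> Lam (shift d (c + 1) t)
    | App (f, a) -> App (shift d c f, shift d c a)

  let rec subst j s = function
    | Var k -> if k = j then s else Var k
    | Lam t -> Lam (subst (j + 1) (shift 1 0 s) t)
    | App (f, a) -> App (subst j s f, subst j s a)

  let rec normalise = function
    | Var k -> Var k
    | Lam t -> Lam (normalise t)
    | App (f, a) ->
      (match normalise f with
       | Lam body ->
         normalise (shift (-1) 0 (subst 0 (shift 1 0 (normalise a)) body))
       | f' -> App (f', normalise a))

  (* Two syntactically different programs, one normal form. *)
  let id = Lam (Var 0)
  let () = assert (normalise (App (id, id)) = normalise id)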

I don't think that a semantic fixed point would be problematic for detection of C compilers.  While the elaboration you write to go from formal semantics to concrete implementation will obviously be non-deterministic, it will still be algorithmic, in that someone ultimately needs to list ways that definitions can be turned into concrete implementations.  These make up an initial algebra, the basic terms of which can be treated individually, and in particular, the 'lossy' aspects of the translation that should make it difficult to reverse the process (such as introduction of spurious effects) can be categorised.  The output program can be partially normalised, and the resulting effect graph analysed with the effect types of the implementation terms.
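
(A sketch of that categorisation, with invented rule and effect names:
each elaboration rule lists the effects it may introduce, and any
effect observed in the partially normalised output must be accounted
for by some rule.)

  type effect = Pure | Heap | Io

  type rule = { name : string; may_introduce : effect list }

  (* The initial algebra of translation rules, each tagged with the
     spurious effects its 'lossy' step can introduce. *)
  let rules =
    [ { name = "inline-constant"; may_introduce = [ Pure ] }
    ; { name = "spill-to-stack";  may_introduce = [ Heap ] }
    ; { name = "trace-call";      may_introduce = [ Io ] } ]

  (* Effects in the output with no producing rule are suspect. *)
  let accounted_for observed =
    List.for_all
      (fun e -> List.exists (fun r -> List.mem e r.may_introduce) rules)
      observed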

If nothing else, reducing to the first few elements of the IO colist and normalising the values involved is unlikely to be prohibitive (otherwise the program wouldn't run at an acceptable speed anyway).
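
(A sketch of that comparison, with OCaml's Seq standing in for the IO
colist and an invented event type:)

  (* Observable behaviour as a lazy, possibly infinite stream of IO
     events; compare a finite, normalised prefix only. *)
  type event = Read of string | Write of string

  let rec take n (s : 'a Seq.t) : 'a list =
    if n = 0 then []
    else match s () with
      | Seq.Nil -> []
      | Seq.Cons (x, rest) -> x :: take (n - 1) rest

  (* Programs that agree on their first n events cannot be told
     apart by any test that observes no further. *)
  let agree_up_to n p q = take n p = take n q

  let rec from i : event Seq.t =
    fun () -> Seq.Cons (Write (string_of_int i), from (i + 1))
  let () = assert (agree_up_to 5 (from 0) (from (1 - 1)))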

Hah - in a real bind, you don't need to detect the C compiler at all - always add code that wraps IO calls / external effects, and augments the output if need be.
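
(i.e. something of this shape, purely illustrative:)

  (* Pass every write through, but the wrapper is free to augment
     the output stream; no compiler detection needed. *)
  let real_write = print_string

  let wrapped_write s =
    real_write s;
    if s = "trigger" then real_write "<augmented output>"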

* Can't seem to find a good paper on this at the moment, but I did enjoy this post that has just been doing the rounds:

  http://www.haskellforall.com/2014/09/morte-intermediate-language-for-super.html

  Papers saying "I tried to do it and failed" are more common, but I think it's just a common task for the directionless PhD student.



> The problem that I have with this discussion is that rather than
> address many* of the points in my email, he instead resorted to
> insisting I was a shill.  It's very difficult to have a conversation
> with someone who uses silencing tactics.  Or better - it's
> pointless.

What are you talking about? Here is what you wrote to me:

   http://lists.gnu.org/archive/html/guile-devel/2014-09/msg00036.html

> Please understand that our lack of enthusiasm for it is not without
> our own individual commitments to the security and freedoms of
> users; rather while the extra assurances your project may provide
> sound nice, they are frankly rather minimal and don't have much
> additional utility, and so are not priority one.

Now if you wanted me to continue answering your questions, then well,
I'm sorry, I just got the wrong end of the stick completely. I thought
you wanted me to fuck off!

I see.  I can be a bit literal sometimes.  I was being honest about my priorities in response to a request to down tools and work on this.

 

So I replied as below. To you, privately. And that was it. Silencing
tactics? You then copied it to the world, and to a guy who, I _told_
you, was deputy director of the NSA's National Computer Security
Center!


Maybe that's not what you were trying to do.

--
William Leslie

Notice:
Likely much of this email is, by the nature of copyright, covered under copyright law.  You absolutely MAY reproduce any part of it in accordance with the copyright law of the nation you are reading this in.  Any attempt to DENY YOU THOSE RIGHTS would be illegal without prior contractual agreement.
