chicken-hackers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH][simple-md5] Enable hashing of big files


From: Peter Bex
Subject: Re: [PATCH][simple-md5] Enable hashing of big files
Date: Mon, 16 Dec 2024 10:28:12 +0100

On Mon, Dec 16, 2024 at 01:19:06AM +0100, lou wrote:
> Hi,
> 
> Here's the promised follow-up to my earlier memory-mapped-files
> patch.

Hi Lou,

Thanks for the patch!

> Enabling simple-md5 to digest big files by splitting files that
> are bigger than what the C implementation can digest into chunks
> and recuring over them.

Yeah that's definitely a missing bit from the original egg.

> Two notes:
> 
>  1. This implementation is really only useful for posix
>     environments. The alternative `mapped-pointer` implementation
>     for Windows means, as far as I understand, that the Windows
>     version is limited by physical memory.

I *really* don't remember where that code came from.  Possibly from
Felix's simple-sha1 egg.  This egg is basically a copy of that but with
sha1 replaced with md5.

On that note...

>  2. I hope I got the types right. I got a bit confused because
>     `unsigned-integer` seems to only accept values up to INT_MAX
>     not up to UINT_MAX, which wasn't clear to me from the
>     documentation here:
>     http://wiki.call-cc.org/man/5/Foreign%20type%20specifiers#integers

Yeah, I had a look at that.  The unsigned-integer type can definitely
hold UINT_MAX, but the problem is that you're passing it to pointer+
which is defined to take "integer", which maps to the "int" type in C,
resulting in a loss of precision.

Perhaps this should be changed to offset_t, but we don't have a foreign
type for that (yet).  We could of course abuse ssize_t...

Anyway, more to the point: if you look at these changes:
https://bugs.call-cc.org/changeset/43992/project

you'll see that I reverted part of your changes.  The reason is that
IMO it's better to keep using "unsigned-int", as that will always use
fixnums, which are faster to decode than arbitrary integers, (as INT_MAX
will typically be converted to a bignum).  I've used
most-positive-fixnum as the largest representable fixnum instead of
pulling in limits.h.

Not that it matters a great deal, because the loop should typically not
be executed that often, and the main procedure should not be called that
often (and it might actually be faster to use INT_MAX for large files),
but I think simpler and less fussy code is always good.

I got some compilation warnings on simple-md5 for CHICKEN 6 due to using
the u8vector / bytevector types for the "context" structure.  Instead,
I reverted that to the original (nonnull-)scheme-pointer so that the
warnings go away.  This type works for bytevectors.

@Felix: I think the manual can improve the wording here a bit.
It states:

"scheme-pointer is typically used to get a pointer to the raw byte
 content of bytevectors. But if you pass in a SRFI-4 vector, you will
 get a pointer to a bytevector object header (not the bytevectors's
 contents), which is almost certainly wrong. Instead, convert to a
 bytevector beforehand, or use a SRFI-4 specific type."

This is a bit misleading/confusing - you can pass u8vector because
that's equivalent to bytevector and directly encoded.  Only the other
srfi-4 types have wrapper objects.

In the case of this egg, using scheme-pointer is better because it is
equivalent to "void *" on the C level, which gets rid of the type
warnings you'd get if using "char *" for the "struct MD5Context *"
arguments to the MD5Update function.

> I'll take a look at simple-sha1 in a couple of days, I suspect it
> would benefit from a similar patch but I haven't checked that out
> yet. Hope this is useful to some people.

Yeah, it should be the same, as that's the "parent" egg.

Cheers,
Peter



reply via email to

[Prev in Thread] Current Thread [Next in Thread]