I am very new to Nim and pulling my hair out trying to read a large file into a buffer in chunks and hash the buffer's content, like so:

import streams, murmur, strutils

const size = 1_048_576

var
  i = open("input")
  buf: array[size, char]
  fhash: BiggestInt = 0

while i.readBuffer(buf.addr, size) > 0:
  fhash += hash(buf)

echo fhash
i.close()

This obviously doesn't work, as hash expects a string, not an array of char. Casting the buffer to a string doesn't work either, since a string has a different internal representation (length field plus terminating null character). Reading the whole file into a string is not an option for my use case, as some of the files are hundreds of MBs to GBs in size and I am trying to write a memory-efficient algorithm. There is a closed thread that discusses a similar challenge, but to be honest, the tensor stuff there reads like Chinese to me. I would appreciate a solution that is easy to understand and implement. If there is an existing library that does this sort of conversion, that's cool too.

2018-01-07 00:25:14

Here is how I would do it:

import streams, murmur, strutils

const size = 1_048_576

var
  i = open("input")
  buf = newString(size)
  fhash: BiggestInt = 0

while i.readChars(buf, 0, size) > 0:
  fhash += hash(buf)

echo fhash
i.close()
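
One detail worth watching with any chunked read is the final chunk, which is usually shorter than the buffer. Below is a minimal sketch that hashes only the bytes actually read. It is not the code above: it assumes a recent Nim where `readChars` accepts the buffer directly, it swaps the third-party murmur module for the standard library's `std/hashes`, and `hashFile` and `chunkSize` are hypothetical names chosen for illustration.

```nim
import std/hashes

proc hashFile(path: string, chunkSize: Positive = 1_048_576): Hash =
  ## Hypothetical helper: combines per-chunk hashes of the file at `path`.
  ## The result depends on `chunkSize`, so use the same value for every file.
  var f = open(path)
  defer: f.close()
  var buf = newString(chunkSize)   # reused for every chunk
  var h: Hash = 0
  while true:
    let n = f.readChars(buf)       # number of bytes actually read
    if n <= 0: break
    # Hash only buf[0 .. n-1]; bytes past n are left over from earlier chunks.
    h = h !& hash(buf, 0, n - 1)
  result = !$h                     # finalize the combined value
```

Calling `hashFile("input")` would then give one value per file while keeping memory use at a single buffer.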

Hope this helps

2018-01-07 11:42:20
Works like a charm. Exactly what I was looking for. Thanks, man!
2018-01-07 14:07:53
I find it quite confusing that strings in Nim, being byte buffers with a length field, are used in a lot of functions to handle binary data where a language like C would use an unsigned char[]. At least it is for me. If strings are indeed meant to be used interchangeably as binary buffers, could this be mentioned in the manual or somewhere similar, to avoid confusion and/or ugly code with needless casts? (For example, my first instinct when dealing with the base64 module was to move the string to a seq[uint8].)
2018-01-09 19:11:14

+= is usually a bad way of mixing hash values. MurmurHash seems to use

hash ← hash XOR k
hash ← (hash ROL r2)
hash ← hash × m + n

as the mixing step. https://en.wikipedia.org/wiki/MurmurHash
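
In Nim, that mixing step could be sketched roughly as follows. This assumes the 32-bit MurmurHash3 constants listed in the Wikipedia article (r2 = 13, m = 5, n = 0xe6546b64); `mix` is a hypothetical name, and the real algorithm also scrambles each block `k` before this step, which is left out here.

```nim
import std/bitops

const
  r2 = 13                # rotation amount (assumed 32-bit MurmurHash3 value)
  m = 5'u32              # multiplier
  n = 0xe6546b64'u32     # additive constant

proc mix(hash, k: uint32): uint32 =
  ## Folds one block value `k` into the running `hash`.
  var h = hash xor k
  h = rotateLeftBits(h, r2)
  h * m + n              # unsigned arithmetic wraps around on overflow
```

Unlike a plain += or xor, the rotate and multiply make the result depend on the order in which the blocks are fed in.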

2018-01-10 07:57:59
@Araq. Great point. I completely agree, and I don't have += in the actual program. I used it in the forum purely for brevity. To be honest, I am not happy about how I am doing it in the published version either (too many string-to-int conversions and vice versa). I will definitely check out the XOR approach. Looks very promising. Thanks!
2018-01-10 17:37:46

Changed to the following. I know it's not exactly how murmur does it, but it seems to work reliably: it produces duplicate lists identical to those of the previous version, without the extra math.

fhash = fhash xor BiggestInt(hash(buf))

Thanks again @Araq for the suggestion. If anything, it made the code a lot cleaner and easier to understand.

2018-01-10 20:29:39