I am very new to Nim and pulling my hair out trying to read a large file into a buffer in chunks and hash the buffer's content, like so:
    import streams, murmur, strutils

    const size = 1_048_576
    var
      i = open("input")
      buf: array[size, char]
      fhash: BiggestInt = 0
    while i.readBuffer(buf.addr, size) > 0:
      fhash += hash(buf)
    echo fhash
    i.close()
This obviously doesn't work, as hash expects a string, not an array of char. Casting the buffer to string doesn't work either, since string has a different internal representation (terminating null character + length field). Reading the whole file into a string is not an option for my use case: some of the files are hundreds of MBs to GBs in size, and I am trying to write a memory-efficient algorithm. There is a closed thread that talks about a similar challenge, but to be honest, I could not follow the tensor stuff there. I would appreciate a solution that is easy to understand and implement. If there is an existing library that does this sort of conversion, that's cool too.
Here is how I would do it:
    import streams, murmur, strutils

    const size = 1_048_576
    var
      i = open("input")
      buf = newString(size)
      fhash: BiggestInt = 0
    while i.readChars(buf, 0, size) > 0:
      fhash += hash(buf)
    echo fhash
    i.close()
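One thing to watch in a loop like this: the last read usually fills only part of the buffer, so hashing the whole string also hashes leftover bytes from the previous chunk. Below is a minimal sketch that hashes only the bytes actually read. It is an assumption-laden variant, not the code above: it swaps the murmur package for the standard library's `std/hashes`, and `hashFileInChunks` and `chunkSize` are names made up for illustration.

```nim
import std/hashes

const chunkSize = 1_048_576  # 1 MiB per read (illustrative value)

proc hashFileInChunks(path: string): Hash =
  ## Hypothetical helper: reads `path` chunk by chunk and combines the
  ## per-chunk hashes with the `!&` and `!$` operators from std/hashes.
  var f = open(path)
  defer: f.close()
  var buf = newString(chunkSize)
  var h: Hash = 0
  while true:
    # readBuffer returns the number of bytes actually read
    let n = f.readBuffer(addr buf[0], chunkSize)
    if n <= 0: break
    # hash only positions 0 .. n-1, so a short final chunk
    # does not pick up stale bytes from the previous read
    h = h !& hash(buf, 0, n - 1)
  result = !$h
```

Called as `echo hashFileInChunks("input")`, this keeps memory use at one buffer regardless of file size. Note that the result depends on the chunk size, so it is only comparable between files hashed with the same `chunkSize`.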
Hope this helps
+= is usually a bad way of mixing hash values. MurmurHash seems to use

    hash ← hash XOR k
    hash ← hash ROL r2
    hash ← hash × m + n

as the mixing step. https://en.wikipedia.org/wiki/MurmurHash
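For illustration, the three-line mixing step could be sketched in Nim roughly as follows. The constants are the ones the Wikipedia article lists for 32-bit MurmurHash3 (r2 = 13, m = 5, n = 0xe6546b64); the proc name `mix` is made up here. This is only the mixing step, not a full MurmurHash: the real algorithm also scrambles `k` beforehand and applies a finalizer.

```nim
import std/bitops

# Constants as listed for 32-bit MurmurHash3 on Wikipedia
const
  r2 = 13
  m = 5'u32
  n = 0xe6546b64'u32

proc mix(h, k: uint32): uint32 =
  ## One mixing step: xor, rotate left, multiply-add.
  ## uint32 arithmetic wraps around modulo 2^32.
  result = h xor k
  result = rotateLeftBits(result, r2)
  result = result * m + n
```

Folding chunk hashes with `h = mix(h, k)` makes the result depend on the order of the chunks, which a plain `+=` does not.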
Changed to the following. I know it's not exactly how murmur does it, but it seems to work reliably: it produces the same duplicate lists as the previous version, without the extra math.
    fhash = fhash xor BiggestInt(hash(buf))
Thanks again @Araq for the suggestion. If anything, it made the code a lot cleaner and easier to understand.
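One property of a plain xor fold worth knowing: it ignores chunk order, and two identical chunks cancel each other out. A small sketch, using `std/hashes` instead of the murmur package and a made-up helper name `xorFold`:

```nim
import std/hashes

proc xorFold(chunks: openArray[string]): BiggestInt =
  ## Hypothetical helper: xor together the per-chunk hashes.
  ## `result` starts at 0, the identity element for xor.
  for c in chunks:
    result = result xor BiggestInt(hash(c))
```

Here `xorFold(["ab", "cd"])` equals `xorFold(["cd", "ab"])`, and `xorFold(["ab", "ab"])` is 0. For spotting duplicate files that may be acceptable, but it is a reason the rotate and multiply steps exist in the MurmurHash mixing.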