Nim Compiler Version 0.17.3 (2017-10-25) [Linux: amd64]

Copyright (c) 2006-2017 by Andreas Rumpf

git hash: fa02ffaeba219ca3f259667d5161d30e47bb13e0

active boot switches: -d:release


Code:

import os except sleep
import posix, strutils
import threadpool

when isMainModule:
  proc main() =
    var i = 0
    while true:
      i.inc()
      echo i
      discard sleep(1)
  if fork() == 0:
    echo("Hello from the child!")
    spawn main()
    threadpool.sync()
  else:
    quit(QuitSuccess)


I was using https://github.com/OpenSystemsLab/daemonize.nim to try and write a simple daemon.

I ran into a problem where invoking threadpool's spawn from a forked process resulted in a deadlock.

The code above is a simplified version of how I am using daemonize.nim.


strace: Process 146711 attached

futex(0x65dd24, FUTEX_WAIT_PRIVATE, 1, NULL)


Is there any way around this? I couldn't find anything explicitly prohibiting threadpool from being used like this.

2018-01-02 15:06:59

I tried running your code under helgrind.

valgrind --tool=helgrind test

It reported a data race:

==24599== ----------------------------------------------------------------
==24599==
==24599== Possible data race during read of size 8 at 0x33B168 by thread #1
==24599== Locks held: none
==24599==    at 0x118490: nimSpawn3 (stdlib_threadpool.c:394)
==24599==    by 0x10948A: NimMainModule (test.c:186)
==24599==    by 0x10943E: NimMain (test.c:165)
==24599==    by 0x10918C: main (test.c:172)
==24599==
==24599== This conflicts with a previous write of size 8 by thread #5
==24599== Locks held: none
==24599==    at 0x11813A: slave_ZLEiMrITYl7xqyEhA5iC1g (stdlib_threadpool.c:281)
==24599==    by 0x1156FB: threadProcWrapDispatch_t0lnloO9aBTBU0elWgvMSlw_2 (stdlib_system.c:4330)
==24599==    by 0x115811: threadProcWrapStackFrame_t0lnloO9aBTBU0elWgvMSlw (stdlib_system.c:4454)
==24599==    by 0x115811: threadProcWrapper_2AvjU29bJvs3FXJIcnmn4Kg (stdlib_system.c:4472)
==24599==    by 0x4C32D06: mythread_wrapper (hg_intercepts.c:389)
==24599==    by 0x5555493: start_thread (pthread_create.c:333)
==24599==    by 0x5853AFE: clone (clone.S:97)
==24599==  Address 0x33b168 is 0 bytes inside data symbol "readyWorker_BT69aUhVoxO4qq9b8EinSVNA"
==24599==
==24599== ----------------------------------------------------------------

Followed by this:

==24703==
==24703== Possible data race during read of size 1 at 0x33C848 by thread #1
==24703== Locks held: none
==24703==    at 0x11849E: selectWorker_BW1ODxSJ4rN4KmfdYS8AFw (stdlib_threadpool.c:369)
==24703==    by 0x11849E: nimSpawn3 (stdlib_threadpool.c:394)
==24703==    by 0x10948A: NimMainModule (test.c:186)
==24703==    by 0x10943E: NimMain (test.c:165)
==24703==    by 0x10918C: main (test.c:172)
==24703==
==24703== This conflicts with a previous write of size 1 by thread #5
==24703== Locks held: none
==24703==    at 0x118130: slave_ZLEiMrITYl7xqyEhA5iC1g (stdlib_threadpool.c:280)
==24703==    by 0x1156FB: threadProcWrapDispatch_t0lnloO9aBTBU0elWgvMSlw_2 (stdlib_system.c:4330)
==24703==    by 0x115811: threadProcWrapStackFrame_t0lnloO9aBTBU0elWgvMSlw (stdlib_system.c:4454)
==24703==    by 0x115811: threadProcWrapper_2AvjU29bJvs3FXJIcnmn4Kg (stdlib_system.c:4472)
==24703==    by 0x4C32D06: mythread_wrapper (hg_intercepts.c:389)
==24703==    by 0x5555493: start_thread (pthread_create.c:333)
==24703==    by 0x5853AFE: clone (clone.S:97)
==24703==  Address 0x33c848 is 4648 bytes inside data symbol "workersData_R5YxoJYCt3PvKeJGhluUDQ"

2018-01-02 20:58:18

There's no guarantee, of course, that the problem valgrind is reporting is the cause of your bug.

Anyhow, the code in question looks to be this, in threadpool.nim:

proc nimSpawn3(fn: WorkerProc; data: pointer) {.compilerProc.} =
  # implementation of 'spawn' that is used by the code generator.
  while true:
    if selectWorker(readyWorker, fn, data): return

vs this:

proc slave(w: ptr Worker) {.thread.} =
  isSlave = true
  while true:
    when declared(atomicStoreN):
      atomicStoreN(addr(w.ready), true, ATOMIC_SEQ_CST)
    else:
      w.ready = true
    readyWorker = w

2018-01-02 21:06:16

Thanks for the helgrind example. I'll have to add that to my toolkit.

https://linux.die.net/man/3/fork

A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources...

import threadpool

when isMainModule:
  quit(QuitSuccess)

Running strace on this minimal program, I noticed that 12 clone syscalls were made.

In nim/lib/pure/concurrency/threadpool.nim, setup is called in the main body and creates a worker for each processor.

I think this explains why the mutex is never released: we are forking from a process that threadpool.nim has already made multi-threaded, so the child inherits the state of the locks but not the worker threads that would release them.

2018-01-03 02:46:49

Well? Anything we can do about that?

2018-01-10 07:43:25

Idea #1 (Probably not)

Place some sort of guard around fork in the posix module that checks for any active threads before forking.

After forking from a multi-threaded process, only async-signal-safe routines are valid to call in the child.

It may not be a good idea to change the expected, OS-specific behavior of fork.
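
To make the guard idea concrete, here is a rough sketch of what such a check might look like. It is not part of the posix module: activeThreads and guardedFork are hypothetical names, and counting the entries of /proc/self/task only works on Linux.

```nim
import os, posix

proc activeThreads(): int =
  ## Counts the entries in /proc/self/task, one per thread (Linux only).
  for kind, path in walkDir("/proc/self/task"):
    inc result

proc guardedFork(): Pid =
  ## Hypothetical wrapper around fork: refuses to fork while other
  ## threads (for example the threadpool workers) are running.
  let n = activeThreads()
  if n > 1:
    raise newException(OSError,
      "refusing to fork: " & $n & " threads are running")
  result = fork()
```

With threadpool imported, the worker threads already exist by the time user code runs, so a guard like this would presumably raise instead of letting the child deadlock.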


Idea #2 (Easiest)

Install a pthread_atfork handler.

The parent handler could either clear the appropriate locks and terminate the running threads or quit and warn against the dangerous behavior.

Further investigation could be made to see if similar measures are required for other operating systems.
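
A minimal sketch of the "warn" variant of this idea, assuming a POSIX system. The binding pthreadAtfork is declared locally for the example, and warnBeforeFork and forkAttempts are hypothetical names; none of this is existing threadpool.nim code.

```nim
# Local binding to the C function, written for this sketch only.
proc pthreadAtfork(prepare, parent, child: proc () {.noconv.}): cint {.
  importc: "pthread_atfork", header: "<pthread.h>".}

var forkAttempts = 0  # how many times a fork has been attempted

proc warnBeforeFork() {.noconv.} =
  ## Runs in the calling thread just before fork() takes place.
  inc forkAttempts
  stderr.writeLine "warning: fork() called while threadpool workers exist"

# Register only a prepare handler; the parent and child handlers are unused.
let registered = pthreadAtfork(warnBeforeFork, nil, nil) == 0
```

A real fix would probably also need a child handler that resets the pool's locks and worker state, which is the harder part and is not attempted here.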


Idea #3 (Most flexible?)

Add a check in threadpool.nim for a flag like --define:threadpoolExplicitSetup.

I like the default behavior of threadpool as it is currently. There is code in the main body of threadpool.nim that initializes locks and spawns threads.

When the above flag is defined, procs could be exposed that allow the user to control explicitly when the worker threads are spawned.
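
Roughly what I have in mind, as a sketch only. The flag threadpoolExplicitSetup and the proc setupPool are hypothetical; setupPool stands in for the code in threadpool.nim that initializes the locks and spawns the workers.

```nim
var poolStarted = false  # stand-in for the real pool state

proc setupPool*() =
  ## Placeholder for the lock initialization and worker spawning.
  ## Safe to call more than once.
  if poolStarted: return
  poolStarted = true

when not defined(threadpoolExplicitSetup):
  # Default: keep the current behavior and start the pool at import time.
  setupPool()
```

A daemon would then be compiled with -d:threadpoolExplicitSetup, call fork() first, and call setupPool() only in the child, so that no worker threads exist at the moment of the fork.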

2018-01-10 23:31:34