I am debugging code that assumes cudaMemcpy synchronises with non-blocking streams, which I believe is an error. In the same host thread, if a stream is created with CU_STREAM_NON_BLOCKING (cudaStreamNonBlocking in the runtime API) and a cudaMemcpy (not cudaMemcpyAsync) is then used to read data written by a kernel in that non-blocking stream, then that would […]
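
For reference, here is a minimal sketch of the pattern in question, with the explicit synchronisation I think it needs. It assumes the runtime API; the kernel `fill`, the function `copy_after_stream_sync`, and the buffer size are made up for illustration and are not from the code I am debugging. The only point is that `cudaStreamSynchronize` is called on the non-blocking stream before the plain `cudaMemcpy`, so the copy does not rely on any implicit ordering with that stream.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` into every element of `out`.
__global__ void fill(int *out, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Hypothetical helper. Returns 0 if the host sees the kernel's writes,
// 1 on a CUDA error, 2 if any copied element has an unexpected value.
int copy_after_stream_sync() {
    const int n = 1 << 20;
    const int value = 42;
    int *d_buf = nullptr;
    cudaStream_t stream;

    if (cudaMalloc(&d_buf, n * sizeof(int)) != cudaSuccess) return 1;
    if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) != cudaSuccess) {
        cudaFree(d_buf);
        return 1;
    }

    // The kernel runs in the non-blocking stream.
    fill<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n, value);
    cudaError_t err = cudaGetLastError();  // catches launch-configuration errors

    // Explicitly wait for the non-blocking stream before the blocking copy,
    // rather than assuming cudaMemcpy will wait for it.
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    std::vector<int> h_buf(n, 0);
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_buf.data(), d_buf, n * sizeof(int),
                         cudaMemcpyDeviceToHost);
    }

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    if (err != cudaSuccess) return 1;

    for (int v : h_buf) {
        if (v != value) return 2;
    }
    return 0;
}
```

Calling `copy_after_stream_sync()` from `main` and checking for a zero return is enough to exercise it. An alternative I would also consider is issuing `cudaMemcpyAsync` on the same stream as the kernel and then synchronising that stream, which keeps the copy ordered after the kernel by stream order alone.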