I am debugging code that assumes cudaMemcpy synchronises with non-blocking streams, which I believe is an error.
In the same host thread, if a stream is created with CU_STREAM_NON_BLOCKING (cudaStreamNonBlocking in the runtime API) and a cudaMemcpy (not cudaMemcpyAsync) is then used to read data written by a kernel in that non-blocking stream, is that incorrect?
My reasoning is that cudaMemcpy is issued into the default stream (the legacy default stream, unless the code is built with per-thread default streams), which a non-blocking stream by definition does not synchronise with. So is it correct to say that using cudaMemcpy with a non-blocking stream is generally bad, and that only cudaMemcpyAsync, issued into that same stream, should be used with non-blocking streams?
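To make the pattern concrete, here is a minimal sketch of what I mean. The kernel `fill` and the buffer names are hypothetical, and error checking is omitted for brevity. As far as I understand it, the `cudaStreamSynchronize` call is what makes the plain `cudaMemcpy` safe here; without it, nothing orders the copy after the kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its global index into out.
__global__ void fill(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main() {
    const int n = 1 << 20;
    int *h_out = new int[n];
    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    // Non-blocking stream: no implicit synchronisation with the legacy default stream.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    fill<<<(n + 255) / 256, 256, 0, s>>>(d_out, n);

    // Explicitly wait for the kernel before the synchronous copy.
    // Removing this line leaves the copy unordered with respect to the kernel.
    cudaStreamSynchronize(s);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h_out[n-1] = %d\n", h_out[n - 1]);

    cudaStreamDestroy(s);
    cudaFree(d_out);
    delete[] h_out;
    return 0;
}
```

The alternative I have in mind is `cudaMemcpyAsync(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost, s)` followed by `cudaStreamSynchronize(s)`, which keeps the copy in stream order with the kernel.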
So would it be generally good advice to avoid creating non-blocking streams? That is, create only blocking streams and keep most work off the default stream so that the blocking streams can still run asynchronously, and then if someone accidentally uses cudaMemcpy everything still works.
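A minimal sketch of that alternative, again with a hypothetical kernel `fill` and no error checking. It assumes the program is built with the legacy default stream (nvcc's default, i.e. not `--default-stream per-thread`); with per-thread default streams the implicit ordering shown here would not apply, as far as I can tell.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its global index into out.
__global__ void fill(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main() {
    const int n = 1 << 20;
    int *h_out = new int[n];
    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    // Blocking stream (the default for cudaStreamCreate): it synchronises
    // implicitly with the legacy default stream.
    cudaStream_t s;
    cudaStreamCreate(&s);

    fill<<<(n + 255) / 256, 256, 0, s>>>(d_out, n);

    // Assumption: legacy default stream. The synchronous copy is issued there,
    // so it waits for the earlier work in the blocking stream s.
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h_out[n-1] = %d\n", h_out[n - 1]);

    cudaStreamDestroy(s);
    cudaFree(d_out);
    delete[] h_out;
    return 0;
}
```

The trade-off, as I see it, is that any work placed on the default stream then serialises against every blocking stream, which is why most work would need to stay off it.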