You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Jingwang and Xuezhao while working on the fast forward project just found some nasty race conditions in write file! Jingwang, please correct me if I get anything wrong here or omit anything. Here’s what Jingwang just wrote me about this:
"I think this is a very critical path in PLFS, because all of the write operations will go through the code of WriteFile. After reviewing the source code of WriteFile, I think that it has more problems when multiple threads are considered, such as unprotected read on a member which could be shared between threads. It needs more detailed examination to ensure thread-safety. Xuezhao's case brings up a whole class of serious problems under multiple threads.”
We had decided that a class-level pthread_rwlock_t would be the safest approach since it is only modified when dropping files are added so, most of the time, getting the read locks will not limit any parallelism.
[We also had a more general discussion about how much thread parallelism is important to PLFS. For large HPC and for MPI jobs, typically parallelism is achieved through multi-process so if we had a single-threaded PLFS, that wouldn’t hurt performance. But systems are going to more cores and we have small performance critical jobs like archive transfer jobs, so I think we should try to care about thread-parallelism to some degree as well. Also, in a lot of the burst buffer future planning, PLFS is being moved off the compute nodes onto the burst buffer nodes where they will aggregate requests from many ranks on many compute nodes. They will likely do this as threads within a single PLFS process so that also means that we need to continue caring about thread-parallelism within PLFS.]
Thanks,
John
On Apr 2, 2014, at 9:20 AM, Bonnie, Dave dbonnie@lanl.gov wrote:
I would lean towards locking it on any access - STL maps are not thread-safe. If you search it while it is being updated it could lead to a situation where you return an invalid OpenFh.
In particular, iterators from STL maps are what were causing many of the thread-safety errors I fixed in the FUSE daemon last year.
Looks like writefile needs a class-level pthread_rwlock_t to protect against race conditions:
From a plfs-devel mailing list exchange (http://sourceforge.net/p/plfs/mailman/message/32177946/):
Jingwang and Xuezhao while working on the fast forward project just found some nasty race conditions in write file! Jingwang, please correct me if I get anything wrong here or omit anything. Here’s what Jingwang just wrote me about this:
"I think this is a very critical path in PLFS, because all of the write operations will go through the code of WriteFile. After reviewing the source code of WriteFile, I think that it has more problems when multiple threads are considered, such as unprotected read on a member which could be shared between threads. It needs more detailed examination to ensure thread-safety. Xuezhao's case brings up a whole class of serious problems under multiple threads.”
We had decided that a class-level pthread_rwlock_t would be the safest approach since it is only modified when dropping files are added so, most of the time, getting the read locks will not limit any parallelism.
[We also had a more general discussion about how much thread parallelism is important to PLFS. For large HPC and for MPI jobs, typically parallelism is achieved through multi-process so if we had a single-threaded PLFS, that wouldn’t hurt performance. But systems are going to more cores and we have small performance critical jobs like archive transfer jobs, so I think we should try to care about thread-parallelism to some degree as well. Also, in a lot of the burst buffer future planning, PLFS is being moved off the compute nodes onto the burst buffer nodes where they will aggregate requests from many ranks on many compute nodes. They will likely do this as threads within a single PLFS process so that also means that we need to continue caring about thread-parallelism within PLFS.]
Thanks,
John
On Apr 2, 2014, at 9:20 AM, Bonnie, Dave dbonnie@lanl.gov wrote:
I would lean towards locking it on any access - STL maps are not thread-safe. If you search it while it is being updated it could lead to a situation where you return an invalid OpenFh.
In particular, iterators from STL maps are what were causing many of the thread-safety errors I fixed in the FUSE daemon last year.
On Apr 2, 2014, at 9:10 AM, Chuck Cranor chuck@ece.cmu.edu
wrote:
hi-
looking at the current code: in WriteFile class we have:
in the WriteFile code when we modify fhs, we lock data_mux.
Q: is it save to traverse fhs without locking it? consider:
Container_fd::write(const char *buf, size_t size, off_t offset, pid_t pid,
ssize_t *bytes_written) {
// ...
plfs_error_t ret = PLFS_SUCCESS;
ssize_t written;
WriteFile *wf = pfd->getWritefile();
ret = wf->write(buf, size, offset, pid, &written);
// ...
}
ok, now within WriteFile::write we do:
plfs_error_t
WriteFile::write(const char *buf, size_t size, off_t offset,
pid_t pid, ssize_t *bytes_written)
{
plfs_error_t ret = PLFS_SUCCESS;
ssize_t written = 0;
that "getFh(pid)" searches fhs without locking data_mux:
struct OpenFh *WriteFile::getFh( pid_t pid ) {
map<pid_t,OpenFh>::iterator itr;
struct OpenFh *ofh = NULL;
if ( (itr = fhs.find( pid )) != fhs.end() ) {
is it safe to call fhs.find(pid) without holding data_mux?
I'm trying to map this stuff into the new Index framework and
wasn't sure about this one.
chuck
Plfs-devel mailing list
Plfs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/plfs-devel
The text was updated successfully, but these errors were encountered: