I've had a long-running, hard-to-reproduce problem where an Elasticsearch cluster sometimes gets into a state where new nodes no longer see the existing nodes. The only fix I've found is to shut down all nodes and then bring them back up.
It now looks like this happens when I run both a data node and a client-only node on each machine, and the (master) data node ends up bound to transport port 9301 instead of 9300.
This is not a firewall issue: I have ports 9200-9400 open, and I can telnet to port 9301 on the machine running the master node from any other machine in the cluster.
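
In case anyone wants to repeat the telnet check across several machines, here is a minimal Python sketch of it. The function names `port_is_open` and `check_transport_ports` are my own, and the host names in the usage note are placeholders, not my real machines. It only confirms that a TCP connection can be opened, which is the same thing telnet shows; it says nothing about whether discovery itself works.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_transport_ports(hosts, ports=(9300, 9301)):
    """Map each (host, port) pair to whether the port accepted a connection."""
    return {(host, port): port_is_open(host, port)
            for host in hosts
            for port in ports}
```

Calling `check_transport_ports(["es-node-1.example", "es-node-2.example"])` from each machine returns a dict keyed by `(host, port)`, which makes it easy to see which nodes are listening on 9301 rather than 9300.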