Deconstructing nuggets: the stability and reliability of complex question answering evaluation