The test consists of more than 100,000 questions and is considered one of the world's most authoritative machine-reading benchmarks. To score well, machine-learning models must process large amounts of text and provide precise answers to queries. Humans typically score 82.304 on the test, slightly below the 82.44 achieved by the model created by Alibaba's Institute of Data Science and Technologies.
“That means objective questions such as ‘what causes rain’ can now be answered with high accuracy by machines,” Luo Si, chief scientist for natural language processing at the Alibaba institute, said in a statement. “The technology underneath can be gradually applied to numerous applications such as customer service, museum tutorials and online responses to medical inquiries from patients, decreasing the need for human input in an unprecedented way.”

A day later, an AI model from Microsoft did even better, scoring 82.650. However, Stanford ranks Alibaba's model above Microsoft's because it achieved a higher F1 score, which isn't disclosed. The scores listed above are exact match (EM) values.
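The distinction between the two metrics matters: EM is all-or-nothing, while F1 gives partial credit for word overlap between the predicted and reference answers. A minimal sketch of how SQuAD-style EM and F1 are typically computed (simplified; the official evaluation script additionally strips punctuation and articles before comparing):

```python
from collections import Counter

def exact_match(prediction: str, truth: str) -> float:
    """Exact match (EM): 1 if the answers are identical, else 0."""
    return float(prediction.strip().lower() == truth.strip().lower())

def f1(prediction: str, truth: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared words."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

# A near-miss answer gets 0 under EM but partial credit under F1.
print(exact_match("condensation of water vapor", "water vapor"))  # 0.0
print(round(f1("condensation of water vapor", "water vapor"), 2))  # 0.67
```

This is why two models can trade places depending on which metric a leaderboard sorts by: a model whose answers are often close but not verbatim can lead on F1 while trailing on EM.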