Are there testing strategies or methodologies to evaluate deep learning algorithms?
So those of you who are working in software development will know that testing is a very big thing; there are lots of people working on it. So the question is, what about DL algorithms? How do we evaluate them? Testing DL algorithms is not the same as testing software. Software gets written by humans, so there can be all kinds of bugs here and there, and the challenge is to go and figure out those bugs, both by looking at the code and by giving inputs and checking that the outputs actually match what we expect.

In deep learning, the first thing, of course, is to benchmark the model. Say you have built a model to detect sentiment: you are given a review, and you detect whether it is a positive review or a negative one. How do you test this? Well, to test it you need a dataset which gives you different reviews along with labels for the correct answer, whether each one is positive or negative, and on this benchmark dataset you can evaluate your DL algorithm. This is essentially what we typically do with a test set. But this is also not enough, because what if your test set is not detailed enough, or what if you have missed a completely different type of review in your test set? These kinds of questions may come up.

So people have very recently been developing more careful methods in research. For example, in the domain of natural language processing (NLP), there is a recent paper from Microsoft about something called CheckList, where they say that an algorithm must go through a particular checklist of tests, and each checklist item is one experiment to be done. I'll give you an example. You might have a sentence like, "I really enjoyed eating a mango," and you know that is a positive sentence. If I change the mango to a guava, or to something like an orange, and give the same sentence, I expect the same output. But an algorithm may not give the same output, because it was trained on some complex dataset. So you might have rules where you substitute parts of the input text with different examples and check whether the model is still doing well; a small sketch of such a test is shown below. CheckList comes with a list of such rules that you can apply.

I think this is a direction where there is not yet enough clarity; it is still happening in research, but it is an important direction. You might have seen the debate that is happening online about bias in DL algorithms: these algorithms might be biased towards one community or one type of data, and so on. To be sure that there is no bias, there are a bunch of tests that you can do, and these tests must be done carefully. So this is an emerging area, and if you are interested, definitely follow the research that is happening. There are still not many job roles in industry around it, but it is definitely of research interest.
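To make the CheckList-style invariance test concrete, here is a minimal sketch in Python. The `predict_sentiment` function is a hypothetical toy stand-in for whatever sentiment model you have trained (it is not from the CheckList paper); the point is only the structure of the test: fill a template with different values and check that the prediction never changes.

```python
def predict_sentiment(text: str) -> str:
    # Hypothetical toy stand-in for a trained sentiment model.
    # Replace this with a call to your actual model's prediction function.
    return "positive" if "enjoyed" in text.lower() else "negative"


def invariance_test(template: str, slot_values: list[str]) -> bool:
    """Fill the template with each value and check the prediction never changes."""
    predictions = [predict_sentiment(template.format(x=value)) for value in slot_values]
    return len(set(predictions)) == 1


# Swapping one fruit for another should not flip the predicted sentiment.
passed = invariance_test(
    template="I really enjoyed eating a {x}.",
    slot_values=["mango", "guava", "orange"],
)
print("invariance test passed:", passed)
```

In the same spirit, you can write other checklist items as small, targeted tests (for example, adding negation and expecting the label to flip) and run the whole suite against your model, much like unit tests in ordinary software.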