This month on PLATO Panel Talks, get ready to dive deep into the world of data testing! Our host, Mike Hrycyk, chats with data testing experts Chris Wagner (Director of Data Solutions, Baker Tilly) and Deepak Ankarath (QA Manager, PLATO), who share their insights on crucial skills for data testers, specific data testing challenges like big data and ETL transformations, and strategies for ensuring data quality. Together, the panel lays out best practices for ever more complex data environments.

Episode Transcript: 

Mike Hrycyk (00:00):

Hello everyone. Welcome to another episode of PLATO Panel Talks. I’m your host, Mike Hrycyk, and today we’re going to talk about database testing and testing with databases with our panel of testing experts. Database testing has been part of PLATO since the start; it’s something that all of our testers do on some level, but we wanted to dig in a little bit, find out about specialists in the area, and help everyone understand it better. Now I’m going to turn it over to our panel of experts and let them introduce themselves. Deepak, tell us about yourself.

Deepak Ankarath (00:31):

Hey Mike. So, I have around 18 years of experience in testing, with close to 10 years of experience on the data side specifically. Here at PLATO, I am heading the practice for the Data Testing Centre of Excellence (CoE). That’s pretty much it from my side for now.

Mike Hrycyk (00:47):

Excellent. Thank you, Deepak. Alright, Chris, tell us about yourself.

Chris Wagner (00:51):

Hey Mike and Deepak, thank you so much for having me on the show. My name’s Chris Wagner, Director of Data Solutions at Baker Tilly. I also host the YouTube channel KratosBI, home of the data gods, so make sure you like, subscribe, and leave a comment below—all that good stuff!

I’ve been working in data for over two decades, and I think one of the biggest reasons for that success is that I’ve paid a lot of attention to the data quality of the solutions that I build. I start with data quality as the foundation of any build that I do, and I talk about it a lot. In fact, that’s how Deepak and I met. Deepak was my data testing guy! And so, like, oh, I’m so happy to be able to talk with Deepak, the guy who made sure my data was high enough quality to roll out to our organization.

Mike Hrycyk (01:45):

Welcome to both of you. I’m excited about this topic. Let’s dig right in. The first question, as many of our loyal listeners will be familiar with, is where we set the parameters. Database testing is a pretty wide term, so let’s define it a little bit. What are the different categories of database testing out there that we might talk about today? Let’s start with you, Chris.

Chris Wagner (02:08):

So, I categorize database testing into kind of two different categories. I look at it as system-level testing, and then I look at it as business functional testing. And you could break ’em out further into each of those cases, and I know some people will just break them out on a spectrum, but I think there are tests that we run that we care about on a technical level as part of engineering and building a solution. But then there are sometimes even more important tests, which is: what is the quality of what’s being consumed by our users when they’re looking at information? The classic example I give is, how many salespeople do you have, right? That number is a real number that matters to people. It’s a legitimate number: if you say you have a hundred salespeople, you’d better have a hundred distinct salespeople, and they’d better be the right ones. It better not be 100 Deepaks, right? You may have the right number, but you don’t have the right individuals. That’s how I separate it.

Mike Hrycyk (03:10):

Can you dig just a little bit deeper into your definition of the system testing? Because I’m not entirely certain what you mean by that.

Chris Wagner (03:17):

Sure. When I think about system testing, I’ll be testing things like throughput rates. How fast am I able to load data? What’s the quality of the data when it comes through when we’re passing data back and forth in Delta files or CSVs? Or, if we’re writing data back and forth, are we able to maintain the appropriate timings on our data and our overall data loads, or are they inconsistent? Are we seeing any significant disruptions in the size of the data as it moves along and through, right? Do we see huge spikes on Wednesdays? If that’s true, why do we see spikes on Wednesdays, right? So I think there’s a layer of metadata that comes along with data loading and data quality that a lot of people may not look at.
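
[A minimal sketch, in SQL, of the kind of load-metadata check Chris is describing; the load_audit table and its columns are hypothetical and shown only for illustration, using SQL Server-style date functions.]

```sql
-- Hypothetical example: average rows loaded per weekday over the last three
-- months, to spot recurring volume spikes ("why do we see spikes on Wednesdays?").
-- Assumes a load_audit table with one row per completed load.
SELECT
    DATENAME(WEEKDAY, load_completed_at) AS load_day,
    COUNT(*)                             AS loads,
    AVG(row_count)                       AS avg_rows_per_load,
    MAX(row_count)                       AS max_rows_in_a_load
FROM load_audit
WHERE load_completed_at >= DATEADD(MONTH, -3, GETDATE())
GROUP BY DATENAME(WEEKDAY, load_completed_at)
ORDER BY avg_rows_per_load DESC;
```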

Mike Hrycyk (04:05):

Interesting. That makes total sense. I had not really thought about that as part of data testing expertise, but it totally makes sense. A lot of what you’ve described would fall to our performance testing group, who also have to understand data, but that’s great. Deepak, same question for you. What are the different categories? Is your perception the same, or is it different?

Deepak Ankarath (04:27):

Let me take a step back, and I would say that database testing is a subcategory of data testing. With databases, we are testing whether the table is constructed properly – we have the containers, we have the data types, and everything; that is database testing. But testing data is what Chris has mentioned: making sure that the data is proper. For that, we will do many kinds of testing – whether the transformation is correct, whether the counts are correct. Security is one of the parameters. Performance is another parameter. That, from my perspective, is how data and database testing come into the picture.

Chris Wagner (05:03):

And to build on what you’re saying, Deepak, that’s a great call out. We used to work exclusively in a database world where you get certain constraints that are hard-enforced. Well, now that we’re moving into a more lakehouse environment, all of a sudden we get to a place where those constraints aren’t as hard-enforced as they have been in the past, right? Making sure that you have uniform cardinality, right? Like, hey, the 100 salespeople I mentioned before – well, they’d better each have their own unique identifying key. In a database world, that was enforced; it was just there. If you created a field that was an identity field, it maintained it all the time. In a lake, that’s not necessarily the case. You could overlay that key over and over again. You could see 100 Deepaks, so you actually have to test for the quality of your data and build that into your processes – something that you didn’t necessarily have to think about before. Now it’s becoming increasingly critical that we engineer that as part of our solution.
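
[A minimal sketch of the cardinality check Chris describes – in a lakehouse, uniqueness is not enforced for you, so the test has to be written explicitly. The salesperson table and key are hypothetical.]

```sql
-- Any row returned here is a duplicated salesperson key (the "100 Deepaks" case).
SELECT salesperson_id,
       COUNT(*) AS occurrences
FROM   salesperson
GROUP BY salesperson_id
HAVING COUNT(*) > 1;
```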

And now we’re going to get into something that I like to talk about here, and that’s inline and offline data quality tests. An inline test is something that is done as part of the engineering process and that will actually stop something from moving forward. It’s lightweight, it’s short, it doesn’t drag down our loading processes or our data engineering processes, so you can be assured that you’re maintaining a level of quality as your data is loaded in, and you maintain that and you’ve got that SLA in place. But if you need to do a deeper dive – let’s say you want to do a deep-dive petabyte compare between two files, comparing customer name and address, these big field descriptions, and that could take 15 or 16 hours to run – you don’t do that inline and slow down your load processes. You have to move that offline. Well, if in your offline process you detect that there’s an issue, then your engineering team has to have created a process that allows you to roll back to that old state or fix it and go forward. Inline and offline data quality testing, I think, is critical, and a lot of organizations aren’t addressing it.
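
[A sketch of what such an offline deep compare might look like in SQL, assuming hypothetical customer_snapshot_prev and customer_reload tables; rows returned by either query are differences to investigate, and this kind of full scan is run outside the load window.]

```sql
-- Offline compare, direction 1: rows present in the previous snapshot but
-- missing or changed in the reload.
SELECT customer_id, customer_name, address FROM customer_snapshot_prev
EXCEPT
SELECT customer_id, customer_name, address FROM customer_reload;

-- Offline compare, direction 2: rows present in the reload but not in the
-- previous snapshot.
SELECT customer_id, customer_name, address FROM customer_reload
EXCEPT
SELECT customer_id, customer_name, address FROM customer_snapshot_prev;
```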

Mike Hrycyk (07:39):

That’s great. You guys have done something a little unusual for me in these podcasts, which is that you’ve taken the answer to this question in a completely different direction than I was expecting – we’re still going to talk about the other part, but it’s really good and I love it. When I thought of the categories, I reached back to our own team. We’ve got, I don’t know, six or seven data specialist testers at PLATO, and I’ve certainly worked with others. And when we talk about it, the categories are more along the lines of specialization. So, we have some people who specialize in testing big data and the ramifications that go around that. We have some people who specialize in ETLs and data transformation and testing that, and when you talk to those people and say, can you go and do this other one, they’re like, oh, I don’t do that, but I guess I can learn. And that’s interesting to me, that they fall into those specializations, and there’s almost maybe a hierarchy there – I’ll let you guys address that. And then, the other category that I came up with is data verification testing, which, for the most part, all testers do to some level. So, any thoughts about those as categories? Are there other categories that I’m not thinking about? It’s sort of in parallel with the answer we’ve already gotten. Deepak, let’s start with you this time.

Deepak Ankarath (08:46):

No, those are the primary things we test as data testers, like data verification and ETL transformation counts. On top of that, currently we are moving everything to the cloud. Now everything is cloud, so security is a very crucial thing that needs to be addressed when migrating from on-prem to the cloud, and for the data on the cloud. So we need to consider that as well. And performance, maybe – I don’t know, Chris can answer this. As everything is in the cloud, we are not handling any servers ourselves, so the performance should be good.

Mike Hrycyk (09:21):

Would you sort of agree that if you think about a specialist and their transition through their career, was my assumption sort of correct? It starts with data verification, which a lot of testers do. Then it’s maybe transformation, and then it gets more complex when you get to big data. Would you sort of see that?

Deepak Ankarath (09:37):

Yeah. The very basic thing is to learn to write a query and run a query and, as you said, do data verification. See if what you’re seeing in the UI is the same as what’s in the database; that is the very first stepping stone. From there, you have to learn ETLs. But everywhere, it comes back to writing queries against data – as Chris would tell you, learn to write SQL; that’s enough.
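
[A trivial illustration of that first stepping stone, assuming a hypothetical customer table and an "active customers" figure shown somewhere in the UI.]

```sql
-- If the dashboard says "342 active customers", the database should agree.
SELECT COUNT(*) AS active_customers
FROM   customer
WHERE  status = 'ACTIVE';
```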

Chris Wagner (09:57):

And Deepak is a hundred percent right. The basics start with the query. So, if you don’t know how to write a query, you’ve got to start with writing a query. The maturity line that I like to see with people when it comes to bringing them onto my testing team and having them mature is going from understanding how to technically test something for technical alignment to testing for business alignment. And that’s a degree of maturity that I think comes with time, because it’s the business alignment that adds the most value, and it’s built upon the technical testing. So when you come out of IT school, or you just start as a tester, you’re going to be focusing on technical capabilities, but throughout your career, you’re going to need to evolve into someone who understands the business that you’re working with. In fact, the only people I’ve ever seen progress in their careers are the people who did take the time to learn the business side and not just the technical side. They learn the business side, build it on top of their technical skills, and their engineering then meets the business.

Mike Hrycyk (11:01):

Well, and that answer is good for just about all forms of testing and aligns with what we at PLATO have talked about lots of times: the value add that a good tester brings is the ability to take the user perspective, the capabilities, and the business needs and align them with what the developers are producing, because they’re reading those requirements with a slightly different lens. You guys have both segued very nicely into my next question. I’ve interviewed tons of people over the course of my career, with lots of different skill levels, and in every single interview I ask about their level of SQL skills. So, stepping back from the experts in data testing, what do you consider an appropriate level of query skill – maybe it’s the ability to write queries – for your average tester? Someone who’s an intermediate moving into senior, where would you say the appropriate level of database capability is?

Chris Wagner (11:59):

So that’s a great question. I think when I’m looking at someone at the adequate level, they’ve mastered the technical aspects of writing queries and sorting queries, and they’re starting to look at how they engineer their solutions to maintain a high level of data quality on an ongoing basis in a way that starts to eliminate manual testing. They should hopefully be doing that from the start – their lead should be saying, okay, here’s how we’re going to automate this stuff. But if that’s not the case, I’d be looking for them to start taking the lead on finding new mechanisms or new ways to automate and engineer a solution so that they maintain their data quality.

Deepak Ankarath (12:45):

For anybody to get into the data side, as I said previously, queries and SQL are the primary things. To get to the next level, they should have the capability to understand the process, or take a look into the code written by somebody like an architect, or designed by somebody like Chris, and see if they can improve something there. So, that kind of expertise needs to be achieved if you want to get to an expert level, but to start with, you can do the data verification stuff and learn some kind of transformation. That should be good enough.

Chris Wagner (13:16):

Mike, do you mind if I build upon what Deepak said there?

Mike Hrycyk (13:19):

Yeah, go for it.

Chris Wagner (13:20):

So, I love it when testers build upon technical designs. I love it even better when they build upon the business requirements and start to create parallel testing capabilities from whatever it is our engineers or even our architects have designed. The idea is that our testers need to test the architecture as well as whatever code has been written underneath it. So again, we need to get to the right value, that high data quality – not just that the code ran the way we’d expect. That doesn’t help us. Not just that, yeah, you followed the architectural design. That doesn’t necessarily help us. What helps us is making sure that we’re getting to the right answer regardless of the design, regardless of the code that’s in place.

Mike Hrycyk (14:13):

You’ve both managed to slightly dodge the true intent of my question, but with very useful information. My interpretation of what you’ve just said is that you’ve given me a really good path for your average tester to move up and grow in terms of data testing. I’m going to state what I think the answer to this might be and see if you guys have any comments. Just based on, I don’t know, the over a thousand interviews I’ve done, it seems like your average tester who hasn’t specialized in database testing at all will have some level of query capability – and if they don’t, well, I’m probably not going to consider them – but the average, where people seem to be, is confidence in writing a query and retrieving data, right up to familiarity with joins, but not being experts in inner and outer joins. That’s what I see a lot. Would you guys consider that a reasonable average starting place for your average tester, or do you have a different perspective? Chris?

Chris Wagner (15:09):

Are we talking about a starting tester? I would agree with where you’re at.

Mike Hrycyk (15:13):

Um, the senior end of intermediate.

Chris Wagner (15:15):

Yeah, if I’m going into the intermediate, I’d expect them to know how to do joins, inner joins, outer joins, and even subqueries as part of their testing to do their validation. So I guess I’d press a little bit harder on them, but that’s okay. The other thing I’d really want my intermediate testers to have is a good means of pushing back on engineering and business and the other people who say that they’re right when the tests show that they’re wrong. We need that type of backbone and that type of confidence in order to make that progression from a beginner tester to a more intermediate tester.

Mike Hrycyk (15:59):

Well, that’s good. I like that perspective. Deepak?

Deepak Ankarath (16:02):

I second Chris. So if you have an intermediate or senior-level tester, they should be proficient in joins. If you’re a starter, you can go with normal, regular queries, but if you’re at the next level, you need to have knowledge of joins, subqueries, temporary views and those kinds of things.
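
[One example of the join-level skill being described, with hypothetical orders and customer tables: a LEFT JOIN that surfaces orders whose customer key has no match, i.e. a referential-integrity defect.]

```sql
-- Any rows returned are orders pointing at a customer that doesn't exist.
SELECT o.order_id,
       o.customer_id
FROM   orders o
LEFT JOIN customer c
       ON c.customer_id = o.customer_id
WHERE  c.customer_id IS NULL;
```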

Mike Hrycyk (16:20):

Awesome. I’ve got a little anecdote to bring up for that. Back in the day, I don’t know, 2005-ish – so quite a while ago – I was at this little startup. I think there were seven or eight of us, and some of us were remote, which was brand new for us; there wasn’t a lot of remote technology at the time. We were in Vancouver, but we had a database expert out of Seattle, and we were writing applications to retrieve data from large data stores to print music for people who were willing to buy it at a kiosk. We had a lot of stored procedures – we were heavily dependent upon them within our database, or sorry, in our architecture. And the guy, his name was Bruce, I like Bruce, he had this little foible that when he encountered issues, he had to talk them through with someone.

When I started there, I was the test lead, I guess you would say. Bruce would often take one of the developers and spend a half hour, an hour, walking through these giant stored procedures looking for the issues, but of course, the developers always had a lot of time constraints, and if a sproc wasn’t working, that probably stopped me from being able to do my testing, so I would have a little bit more time. What I ended up doing was spending two- or three-hour sessions with Bruce, walking through those stored procedures, trying to find the issues. And I’ll tell you, on day one when I started, everything was Greek to me, but it didn’t matter. Him [Bruce] talking to me, and me responding sometimes, was good and helped, and by the time I was done there, we would still be having these discussions, but I was able to point out: is this really what you wanted to do?

I didn’t always understand everything he was doing, but I could ask questions, which is what a tester does best, and in asking those questions he says no – and then he would spot something else. And that formed my capability and knowledge around data testing more than anything else I did in those formative years. And it was really interesting and exciting and it’s good that we were a startup, so our managers didn’t know we were spending that kind of time on stuff like this. Maybe neither of us would’ve done better because of it, but I’ve never really encountered that since then. It was a unique experience for me.

Deepak Ankarath (18:19):

No, that’s something which I have done with Chris. I sat next to him and then developed a framework to test his own code.

Chris Wagner (18:27):

And you’re right, it’s one of those things you don’t often have the ability to do, but honestly, I’d never even thought about how educational it is to have someone basically riding shotgun as you go through a development process until you mentioned it. That makes a ton of sense. I will say that I use the rubber duck methodology for troubleshooting, and maybe that’s been to the detriment of the people I work with, because I just talk to my little Batman guy here and explain what I’m doing. When I encounter an issue I can’t sort out, I tell it to Batman, I explain it all to him, and then I invariably find my issue. But yeah, that’s a good call out. That’s a great educational opportunity for people.

Mike Hrycyk (19:13):

I promise you that when I started, I was every bit as helpful as your Batman, but within a month or two, I was providing real value, so it was faster. It was also my first foray into understanding paired testing and paired development. Those are very valuable and useful methods that can help produce much higher-quality code.

Okay, next question. This one sort of harkens back to our first question. I’d just like you guys to help our listeners understand a bit better what the categories are as I outlined them – we already dug into how you outlined them. What is data transformation testing? Pretend we’re talking to that intermediate who doesn’t have a lot of experience. What is it? What can they expect to look at when they’re doing it? Let’s start with you, Deepak.

Deepak Ankarath (19:56):

Data transformation – we can say it’s a part of ETL testing. Suppose in a source table we have two columns, first name and last name, and in the target table we have only one column, full name. This person has to write a query which concatenates first name and last name in the source and checks that it is the same as the full name in the target. That is something we can call a transformation: the first name and last name in the source are combined to make a full name in the target.
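
[A sketch of the query a tester might write for exactly this example, assuming hypothetical source_customer and target_customer tables joined on a shared key.]

```sql
-- Transformation check: first_name + last_name in the source should equal
-- full_name in the target. Any rows returned are transformation defects.
SELECT s.customer_id,
       s.first_name,
       s.last_name,
       t.full_name
FROM   source_customer s
JOIN   target_customer t
       ON t.customer_id = s.customer_id
WHERE  CONCAT(s.first_name, ' ', s.last_name) <> t.full_name;
```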

Mike Hrycyk (20:26):

That’s a really good illustrative example, but it gets a lot more complicated than that. You can have seven or eight different data sources, and the transformations can actually manipulate and calculate on the data. And so, the testing is making sure that you’re getting the right results. But I’m not supposed to be answering the question – let’s go to you, Chris!

Chris Wagner (20:45):

Yes. And to build upon what Deepak is saying, it’s testing and validating that any logic that’s applied to your data is generating the right results, as you would expect them for your business case. First and last name becoming full name is probably the simplest answer, but you can go to the infinitely complex, or potentially infinitely complex: okay, we’ve come up with a regression model, or data science has come in with a scoring mechanism for how to score clients on transactional history to categorize them as A, B or C, and then they get benefits accordingly. How do you validate that those statistical models are appropriately assigning those characteristics to your clients? That can become a much more difficult transformation process to validate. So you start with the basics of last name, first name, but you can get incredibly detailed.

Mike Hrycyk (21:48):

Yeah, I mean, I see it as just data validation testing on crack. You just have to understand what’s happened to the data in between. Great. Same question, but big data, and maybe we need to define even that term a little bit as we tell what the testing around it is. So, let’s start with you, Chris.

Chris Wagner (22:06):

So, the way I currently think of big data is starting to evolve – we went from just having data to big data, and I’m starting to just go back to data again – but I’ll describe the way I think of big data as any data system made up of massive numbers of files that we have to manage in order to view and look at the data. So, for example, if you’re looking at a whole bunch of Parquet files with PolyBase and you’ve got those all out in a lake, and you’re trying to understand what that data is, that’s what I think of as big data.

Mike Hrycyk (22:50):

Can you maybe give five words of what a data lake is? I don’t think everyone’s going to know.

Chris Wagner (22:55):

Sure, yeah. A data lake is just: okay, we’ve started dumping these files into this big directory, and from this chaos we’re going to make order – or not. And that’s where you come up with the idea of a data swamp, which is, well, they’ve just been dumping data in, and now it’s all these JSON files, XML files, Delta files and Parquet files, and nobody can make heads or tails of it. But a lake, in general, is just all sorts of different files put in a file management system that you can then bring forward into your data estate.

Mike Hrycyk (23:34):

But it’s not simply unplanned chaos. I mean, I think it is chaotic, but it’s stored that way for a reason. It’s because your ways of pulling the data out aren’t as defined as you would have for, say, an MS SQL database. So there are reasons for doing it – it’s not just chaos without purpose.

Chris Wagner (23:54):

– Sometimes! Sometimes I think it’s just people get lazy and they decide not to put any effort into it, and so, oh, we’ll just put it in the lake. If you’re saying, hey, I want to store all of my customer service phone calls, I’m going to put that into a lake – Hey, great, that makes perfect sense to me because you’re going to have a mixture of client feedback forms. You’re going to have audio recordings. Heck, you might even have videos or you might have pictures that go along with it. So, to me, that makes a ton of sense, but if you’re going to just start to dump all your ERP stuff into a lake because you don’t want to bother with coming up with the structure, then I think that’s lazy and that’s a problem.

Mike Hrycyk (24:34):

Now I’m just curious, so I’m going to keep asking questions. Do developers who want to use that data always pull it out into a relational database or a data cube or some other data format before they start using it, or are they using it directly?

Chris Wagner (24:48):

So, there are lots of patterns. Some will attempt to use it directly. It’s almost always a problem, because if you have chaos going in, you’re going to have chaos coming out. So I highly recommend building out a pattern. I think the classical pattern that we’ve sort of adopted across the industry is the medallion architecture – your bronze, silver, gold – and I would say there’s an API process that runs along with that, but that is the structure and the pattern I highly recommend people get used to.

Mike Hrycyk (25:21):

Alright. Deepak, anything to add or argue about in Chris’ statements?

Deepak Ankarath (25:27):

No. For me, big data, as Chris shared, is a very large amount of data. And from a tester’s perspective, we cannot sample a few records and test them like we would in other areas such as automation or functional testing. If you’re testing data, we need to test everything. So, we need to be more efficient. That’s what I would say when we are testing a large amount of data: the queries we use to test the developer’s queries should be efficient enough to traverse the entire data set. We cannot say, I will take the first hundred rows or a thousand rows of data and test those. There is nothing like that in the data testing area. Either test it completely or don’t test it.

Chris Wagner (26:02):

Well, and to build upon that, when we’re talking about a lake and big data, that’s where data quality testing is much, much more important than it was when we were talking about databases. Because in a database, I would have defined data types that I would set out – right, okay, the ID column is going to be an integer, then we’re going to have an ID name and that’s going to be a VARCHAR, and then I’m going to have an amount field and that’s going to be an integer or whatever. I’m going to go in and have all these defined characteristics. But in a lake, we’re talking about all sorts of different file structures that could come in, where the schema could change radically. Like, okay, hey, this time the file has five columns – what do you do when all of a sudden there are seven columns? Or you go from 500 columns to 450 columns – how do you handle that schema drift? What are the downstream impacts of it? And are you maintaining the proper level of data quality that your system and your platform are counting on? That’s where testing and data quality engineering really become critical to your organization.
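
[One way a tester might check for the schema drift Chris describes, assuming a hypothetical expected_schema reference table maintained by the team; the landing table name is illustrative.]

```sql
-- Compare the columns that actually landed against the expected column list.
-- Rows returned are either unexpected new columns or expected columns that
-- have gone missing.
SELECT COALESCE(a.column_name, e.column_name) AS column_name,
       CASE WHEN e.column_name IS NULL THEN 'unexpected new column'
            WHEN a.column_name IS NULL THEN 'expected column missing'
       END AS drift
FROM  (SELECT column_name
       FROM information_schema.columns
       WHERE table_name = 'sales_landing') a
FULL OUTER JOIN
      (SELECT column_name
       FROM expected_schema
       WHERE table_name = 'sales_landing') e
       ON e.column_name = a.column_name
WHERE  a.column_name IS NULL
   OR  e.column_name IS NULL;
```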

Mike Hrycyk (27:13):

So, it kind of feels like the structurelessness of a data lake might quickly become overwhelming to a tester, and they will step back and attempt to focus more on making sure the outputs of the application are right and that the data verifies that way. Is that a mistake? Is that a good tactic if it becomes swampy – I like that term you used. Or should the focus really be on the lake itself from a tester’s perspective? Let’s start with you, Chris.

Chris Wagner (27:44):

So, I definitely think that needs to happen at the lake layer. Quality should start there, and it’s the foundation for everything that comes downstream of it. If you just look at, okay, what’s my consumption layer like, you can easily miss all sorts of different issues in your foundation that could be structurally devastating to you six or twelve months down the road, or three years from now. It’s only by understanding and maintaining that base level of quality that your platform will be reliable going forward. So I would highly, highly encourage people to start their data quality at that raw layer and then move up.

And that’s where, if I circle back to inline and offline tests, we have inline tests to make sure that the right files are being loaded, that we’ve got the right record count, that we’ve got a good hash key that goes along with the data that’s being written. We do counts and amounts to make sure that we have the right numbers of items coming across and that our values match out. So, okay, the whole process runs and is good – but it’s those offline tests that become increasingly valuable for the long-term viability of your platform. They do those broader evaluations of the quality of the data coming in, right? Have you gone from camel-casing names – where it’s Chris with a capital C, h-r-i-s – to everything being uppercase all of a sudden? People don’t like that! And data quality is what’s going to help resolve that at that foundational layer – you’re going to pick it up at the lower layers faster than you do at the upper layers. It also means we’re going to have to engineer data quality into our system and not just have these manual tests that we run after the fact, because that’s a lot of work if we’re not finding ways to engineer that quality into place.
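
[A minimal sketch of an inline "counts and amounts" reconciliation of the kind Chris mentions, with hypothetical staging_sales and dw_sales tables; a load pipeline would fail the run if this query returns a row.]

```sql
-- Returns a row only when the row count or total amount written to the
-- target does not reconcile with the source extract.
SELECT s.row_count AS source_rows,
       t.row_count AS target_rows,
       s.total_amt AS source_amount,
       t.total_amt AS target_amount
FROM  (SELECT COUNT(*) AS row_count, SUM(amount) AS total_amt FROM staging_sales) s
CROSS JOIN
      (SELECT COUNT(*) AS row_count, SUM(amount) AS total_amt FROM dw_sales) t
WHERE  s.row_count <> t.row_count
   OR  s.total_amt <> t.total_amt;
```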

Mike Hrycyk (29:44):

And it ties all the way back to shift left, if you think about it. If you’re only testing at the output layer, you’re just compounding the amount of work it’s going to take to fix things later. I see you nodding, Deepak – did you have anything to add?

Deepak Ankarath (29:56):

No, I just want to add one quote, which Chris used to say to me frequently: “No data is better than bad data.”

Mike Hrycyk (30:03):

I like that quote.

Deepak Ankarath (30:05):

Chris taught me about the business rules. As a tester, what we need to do is learn the business rules and add them to our testing. Like unit testing in automation, we need to add that rule to the ETL or the transformation that the developer is building so that the rule is automatically executed along with the data load itself. There is no need for me to go and test it again; it is automatically tested each time the data is loaded, whether in production, Dev, or QA. That way, we can make sure the foundation is correct.
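
[A sketch of a business rule expressed as an automated check that runs with every load – the same "return the violating rows" pattern that SQL-based dbt tests use. The rule and the orders/salesperson tables are hypothetical: every order must have a positive amount and a known salesperson.]

```sql
-- The load is failed (or flagged) if this query returns anything.
SELECT o.order_id,
       o.amount,
       o.salesperson_id
FROM   orders o
LEFT JOIN salesperson s
       ON s.salesperson_id = o.salesperson_id
WHERE  o.amount <= 0
   OR  s.salesperson_id IS NULL;
```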

Mike Hrycyk (30:38):

Alright, so moving right along because we are running out of time, I’m going to jump to sort of a new topic. What tools do data testers need to have experience and familiarity with to be a data tester?

Chris Wagner (30:52):

Well, number one, I think the foundation is SQL, right? You have to know how to write SQL – out of the gate, you have to know that. Number two, I think it’s Excel, right? You need to be able to put these test results into a format that can be shared and that people can understand and know how to work with. And you can automate a ton of testing inside of Excel across a billion different systems. So get in and learn SQL and learn Excel, and if you really want to go pro, that’s when we start talking about things like learning how to work with tools like dbt. dbt has got an AMAZING testing layer inside of that offering that we’re starting to leverage quite a bit.

Mike Hrycyk (31:36):

Deepak?

Deepak Ankarath (31:39):

To start with: queries, Excel, and then I would say maybe a little bit of the visualization part, like Power BI, so that you can present your test results to management, higher management, or business people in a way they can easily understand. I know, Mike, you’ll be familiar with a test status report where you’re sending hundreds of test cases and, like, 70 of them are high criticality. A person needs to sit and read through Excel, but if you can put it in Power BI, they can read it on their phone: okay, five passed, three failed, one of those is major, and everything is there on my phone, I can just swipe through it. So a little bit of knowledge of visualization will be helpful – any of them, like Power BI or QUILL or anything like that. And then some knowledge of ETL is good. If you want to build any kind of automation, you need to know a little bit of ETL, but even a general knowledge of ETL will be very helpful.

Mike Hrycyk (32:31):

We are very lucky because I’m going to go into a second anecdote, from even earlier in my career than the last one. My director of engineering, who just really wanted to code, and I had done some very rudimentary automation, and at the endpoint, the testing part of the automation was that he pulled from the database and wrote it to a flat file, and then the tool I used was WinDiff. So, I could compare that to a baseline. I don’t even know if WinDiff still exists – there are lots of comparison tools out there these days – but I used WinDiff every day to do that. And if there was something different in the database, I had to go and dig into why. What changed so that there’s this difference? End of my short anecdote.

Okay. Moving along again, I think we’ve answered this already. Do the specializations of automation, performance testing, security testing – do all of those fall under the umbrella of data tester? Does a data tester have to have some familiarity and how deep does it have to go?

Chris Wagner (33:20):

Yeah, I think they all fall in, at least at a very shallow layer. You need to be able to do those tests, or the beginnings of those tests, to start in that area. Now, you may not know how to do it to begin with, but I would ask and work with testers to begin that development process so that they can learn to work in that area.

Mike Hrycyk (33:42):

Deepak?

Deepak Ankarath (33:43):

I would say they don’t fall under data testing exactly, but they are parts of data testing, maybe at a different level. We do automation of data testing, like building ETLs to test other ETLs. We do performance testing to see how much time it takes to load a hundred thousand rows of data from one system to another. And we do security testing while transferring data from on-prem to the cloud, those kinds of things. So, I would say those are part of data testing but not under data testing – they are things required as part of data testing.

Mike Hrycyk (34:16):

Great. I think you make a really good point, Deepak. When you think about the core goals of automation, it’s really to validate a build, to show that it has an acceptable level of functionality and quality so that a functional tester, when they go in to test the features, isn’t wasting their time finding other things. And when we look at data, it’s the same kind of thing. It’s the underpinning of the structure of an application. So if you can validate that your data system is put together properly and working in the way you expect, then the functional feature testing you do in the application is more meaningful. Because if you’re getting hung up on data issues, then you’re not really testing the functions themselves. So being able to get those results asynchronously before you start your functional testing is really powerful and a good thing.

So big data is just that big, lots of it. How do you deal with the traditional QA conundrum of coverage and wanting to test everything? Does it keep you up at night?

Deepak Ankarath (35:14):

That’s when we need to work efficiently, start automating, and understand the business rules. There is no point in running the whole query throughout the night; rather, we need to understand the business rule and where it is used, and implement that business rule when the data is loaded, at the very foundation level, so that your big data is tested. But we are not spending the whole night sitting and writing queries or checking whether it is correct or not.

Chris Wagner (35:38):

That circles back to my inline and offline tests. Those offline tests are what you’ll have engineered and built out to do that “Test Everything” mantra. The question really comes into how much time, energy and resources you put into testing everything, right? You have to scale your compute; you have to understand what your budgets are for managing what you’re going to be testing and when you’re going to be doing your testing. But that does mean not only are you going to have to engineer that solution, but you’re going to also have to be able to engineer at scale any fixes to that solution. So, it’s going to have to be something that’s done hand in hand with a data engineering team.

Deepak Ankarath (36:17):

Let me add a small example to what Chris has told you. When we were working together, Chris and I built an automation framework, and as I mentioned earlier, the test results were sent to our mobile phones. Chris had access to them too, and he had a habit of waking up early to work out. So he would see that the data that got loaded overnight had failed, and he’d reach out to our offshore team and tell them, and by the time we were in the office, or by the time the client started looking into it, everything was fixed and things were good. Even though the issue was in the raw data, everything was fixed – to add to the offline point, those things were handled offline.

Mike Hrycyk (36:55):

Great. Okay, last question. For your average tester who wants to go down the database or data testing path, how do you suggest they get started?

Deepak Ankarath (37:04):

Oh, if they are an average SQL tester, then they need to get to the next level: learn and understand ETLs. You don’t need to learn to build ETLs, but you do need to understand them – if you can click on and open an ETL, read it, and understand what the transformation does. That, plus a little bit of visualization – those kinds of things will be helpful.

Mike Hrycyk (37:25):

Alright, Chris?

Chris Wagner (37:26):

I’ll piggyback on what Deepak just ended with there, which is to learn how to visualize data quality so that over a very large amount of data, numbers of files, and numbers of tables, you can see where you have data quality issues and start to move to and address hotspots within your data quality, right? Knowing that data is always going to be on a spectrum of how good it is, is a maturity that a lot of organizations are starting to get to, but to be able to visualize and understand the data quality in the table, in a landscape, in a schema and an architecture, that can really help with that maturity and that process. So start to learn how you can visualize that data quality.

Mike Hrycyk (38:12):

That’s great. And I think maybe the next podcast you should collaborate with Deepak on is about how to do that, how to learn, how to visualize that big data and your overall data architecture and find hotspots. We’re out of time here so we can’t do it today, but I think that would be really interesting.

Chris Wagner (38:28):

That’s a great idea.

Mike Hrycyk (38:29):

I will listen to it. Make sure you tag us on it!

Alright. Thank you to our panel for joining us for a really great discussion about data testing. I learned things, I came up with anecdotes – I hope you all enjoyed it as well. You can find more great content and discussions on database testing and more from @PLATOTesting on LinkedIn, Instagram, and our website. You can find links to all of our social media and websites in the episode description. We’ll see you next month for more discussion on all things software testing. Until then, if there’s a specific software testing challenge you’d like us to tackle, reach out on social media or leave us a comment on our website. Links are in the description. Once more, Chris, can you plug your own podcast?

Chris Wagner (39:10):

Well, thank you, Mike, for having me. You can find me on KratosBI over on YouTube!

Mike Hrycyk (39:15):

If you’re enjoying PLATO Panel Talks we’d be incredibly grateful if you could rate and review us on your favourite podcast platform. Your support helps us to reach more testers. We’ll see you next month for more discussions on all things software testing.