Jul 18

Practical Experience: Turning knowledge into understanding and complex reasoning

Building on my previous post about knowledge, understanding and complex reasoning, I’d like to present a practical example of how taking basic knowledge and piecing information together to form understanding helps to resolve problems, and how building understanding of multiple connected areas enables complex reasoning and complex problem solving.

This example came up recently when our development environment ran out of disk space. The environment actually contains three distinct builds to allow parallel testing and remove developer bottlenecks, which means it holds three copies of our production system as a baseline onto which the developers’ code is deployed.

First step is to work out what caused the problem. It turns out one of our tables had mysteriously grown to 21 million rows and, with the relevant indexes, was now consuming nearly 15GB of disk space on its own. Our development environment had 32GB of database disk allocated across the three builds, and historically this database was less than a gigabyte, so this sudden growth was definitely not expected. Three copies at 15GB each quickly consumed the development environment’s allocated database disk, which killed MySQL as it tried to operate with no space left.

At this point I have no knowledge of why this table has 21 million rows or whether this is normal. Looking at the production and QA environments I can see the table is the same size there, so it isn’t something peculiar to development. I quickly review the table and notice that there appear to be a few duplicate entries. As I don’t understand the system (or that this is irregular), I assume that this is part of the proper functioning of the table and that there is some other discernible difference between the rows that I’m missing at a glance. Unfortunately the engineer who built the system responsible for the table is on holidays (and on the other side of the world), so he can’t triage the problem and resolve it. Guess we’re left trying to apply basic knowledge.
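As an aside, the kind of check used to find the offending table is simple to script. The sketch below is a minimal example only – the connection details and database name are placeholders, not the actual environment described here – but it shows how you can rank tables by data plus index size to spot the one that has blown out.

    import pymysql

    # Placeholder connection details; substitute the real dev database.
    conn = pymysql.connect(host="dev-db", user="dev", password="secret",
                           database="dev_build_1")
    try:
        with conn.cursor() as cur:
            # Rank tables by data + index size to find what is eating the disk.
            cur.execute("""
                SELECT table_name,
                       table_rows,
                       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
                FROM information_schema.tables
                WHERE table_schema = DATABASE()
                ORDER BY (data_length + index_length) DESC
                LIMIT 10
            """)
            for table_name, table_rows, size_gb in cur.fetchall():
                print(f"{table_name}: ~{table_rows} rows, {size_gb} GB")
    finally:
        conn.close()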

The first step is to build a workaround for the problem with the knowledge we’ve got. We know that the table is large, but we don’t yet understand why. So we reduce the dev environment down to a single build so that it only has to handle one system; since MySQL is behaving badly, we treat it equally badly. This is going to cost us some time, but we need at least one of the environments up to verify some code that is there. This is the trade-off in getting the system back to stable until we can understand the problem. It takes a number of hours, though the system is then stable. The two other environments have been manually deleted since MySQL wasn’t happy, which means we have one environment working for verification but we’re stuck unable to deploy. Our database team will need to restart the server so that MySQL lets go of some of the files it still has open, but beyond this we’re fine. So we wait a day for that to happen.

The next step is resolving the problem. Our engineer is pinged and chimes in to ask one of our BI architects to reduce the data feed from six months to six weeks, which should resolve the problem. This makes sense, as it should reduce the data being imported and thus the number of rows in the table. However, something is missing here: why did it only just cause a failure? We still don’t understand the cause of the failure, and we’re just stabbing in the dark trying to solve the problem.

The following day I sat down with the BI architect to gain some knowledge of what was happening. We looked through the ETL script that is used to build the file. At that point the script was outputting all of the available data with no restrictions at all – roughly 1.6 million rows. However, that isn’t the 21 million rows that were being imported. Probing a little further, the BI architect realises something. He looks through the settings and finds that instead of creating a new file each run, the script was in fact appending to the existing file every time. This ties back to why it has only just happened: it’s taken a while for the file to grow at roughly 1.6 million rows per run. The size of the file with all runs appended was over 8GB; after the modifications to restrict the date range it was under 700k rows and 300MB. This also explained an observation from earlier: why there were duplicate rows. Now we understand why it happened and can identify the symptom of the problem accurately.
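To make the failure mode concrete, here is a contrived sketch (not the actual ETL tool or its settings) of how a single append-versus-overwrite flag turns a stable extract into one that grows and duplicates itself on every run:

    import csv

    def export_extract(rows, path="extract.csv", append=False):
        # append=True is the "Append" checkbox in effect: "a" keeps adding to
        # the existing file, while "w" rewrites it from scratch each run.
        mode = "a" if append else "w"
        with open(path, mode, newline="") as f:
            csv.writer(f).writerows(rows)

    rows = [(i, "some data") for i in range(5)]  # stand-in for ~1.6 million rows per run

    export_extract(rows, append=True)   # run 1: 5 rows in the file
    export_extract(rows, append=True)   # run 2: 10 rows, with run 1 duplicated
    export_extract(rows, append=False)  # overwrite: back to 5 rows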

Once the file was properly generated, most of the systems gradually repaired themselves. MySQL needed some attention to get it back in order, but it was reasonably resilient to the abuse it had been given. At this point I now have a better understanding of how the rebuild of this particular table works, and knowledge of what it should look like. This example shows the progression from knowledge to understanding to the complex reasoning needed to grasp all of the moving pieces and work out the root cause of the problem: a single ticked checkbox named “Append”.

This also demonstrates how complex reasoning can be developed by gaining understanding of various systems. The hardest part of the process is taking the time to develop the knowledge and understanding needed to reason through why the systems failed. It also provides a practical insight into the continuous learning we do every day in our workplaces.
