# Processing big data from small experiments

I thought I’d share a post about a problem we faced in the most recent project I worked on. The aim of the project was to calculate certain properties of fluids being forced through a jet. This involved processing terabytes of data, most of which was redundant, as we only cared about the development of the edges.

From a very high level, we can describe the break-up process by saying that fluid leaves a jet with a cylindrical shape and becomes deformed due to disturbances in the air. These deformations eventually grow large enough for the jet to break up into droplets. To calculate the properties we required, it was important to measure the rate at which the minimum diameter of the fluid reduced, right down to the point of break-up.

## The Problem

The break-up of a jet happens at micron length scales and microsecond time scales. This meant using a (very expensive) high-speed camera, capable of capturing TIFF images at more than 24,000 fps. Each video contained around 12,000 frames and so captured many different break-ups. Unfortunately, using a high-speed camera soon left us with terabytes of TIFF files from which we needed to extract the minimum-diameter data. The project timescale was limited, so to get as many results as possible we needed a way to process this data automatically. This had to be done quickly whilst still maintaining a high level of accuracy.

## Our Solution

The videos were often too large to load entirely into memory, so they had to be handled frame by frame, discarding those that didn’t actually show break-up occurring. When break-up did occur, we began by stripping the image of data we didn’t need, to reduce the amount of work. The image was represented as a W × H × (r, g, b, a) matrix, which we were able to reduce to a single int array of length H. We did this by cycling over the pixels, using a threshold value to determine whether each pixel was part of the background or the jet. This allowed us to accumulate diameter data for the whole image in just one pass. Each value in the int array represented the diameter of the jet (in pixels) on the row with the same index in the image.
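The reduction step can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the frame is modelled as a nested list of (r, g, b, a) tuples, and the threshold value and the assumption that jet pixels are darker than the background are both placeholders.

```python
# Illustrative threshold - the real value would come from the footage.
THRESHOLD = 128

def frame_to_diameters(frame):
    """Reduce an RGBA frame (H rows of W pixels) to a list of per-row
    jet diameters, measured in pixels, in a single pass over the pixels."""
    diameters = []
    for row in frame:
        count = 0
        for (r, g, b, a) in row:
            # A pixel counts as 'jet' if its average brightness is
            # below the threshold (assuming a dark jet on a light background).
            if (r + g + b) / 3 < THRESHOLD:
                count += 1
        diameters.append(count)
    return diameters
```

Counting jet pixels per row rather than storing their positions is what collapses the W × H × 4 matrix down to a length-H int array.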

The next job was to work out a way of tracking portions of a fluid so that we could calculate how the diameter evolved over time.

This two-part problem required us to locate a region of fluid to track in the current frame and then find the same region of fluid in the next frame. Fortunately, we were able to find “Lagrangian elements”, which appear as hourglass-shaped structures (as seen in the image above). After applying a (modified) peak-detection algorithm, we could identify where these elements started and ended. This gave us a reference point at which to track the fluid. Once an element had been found, we could simply take the minimum value between its two endpoints to determine the fluid’s minimum diameter at that particular moment in time. As the flow rate of the jet was held constant in the experiments, we knew that the distance the fluid moved between frames would also be (more or less) constant. So, after finding the Lagrangian elements in the next frame, we simply had to pick the one whose start and end points lay further along than in the last frame, but not so far as to run into the element after it.
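A rough sketch of those two steps, under some simplifying assumptions: the naive local-maximum test stands in for the modified peak-detection algorithm (a real implementation would need smoothing or prominence filtering on noisy data), and `advection_px`, the assumed-constant per-frame displacement, is a name introduced here for illustration.

```python
def find_elements(diameters):
    """Return (start_row, end_row, min_diameter) for each hourglass element.

    Element endpoints are local maxima of the per-row diameter profile;
    the neck of the hourglass is the minimum between two adjacent maxima.
    """
    peaks = [i for i in range(1, len(diameters) - 1)
             if diameters[i - 1] < diameters[i] >= diameters[i + 1]]
    return [(start, end, min(diameters[start:end + 1]))
            for start, end in zip(peaks, peaks[1:])]

def match_element(prev_start, elements, advection_px):
    """Find the element advected roughly `advection_px` rows downstream.

    Because the flow rate is constant, the tracked element should start
    beyond its previous position but within about one displacement of the
    expected spot - i.e. not so far as to be the next element along.
    """
    candidates = [e for e in elements
                  if prev_start < e[0] <= prev_start + 2 * advection_px]
    if not candidates:
        return None
    return min(candidates,
               key=lambda e: abs(e[0] - (prev_start + advection_px)))
```

The neck value returned by `find_elements` is exactly the per-frame minimum diameter fed into the result matrix described next.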

From the point of detection, an element could be tracked across approximately 50 frames before it broke up – giving 50 minimum-diameter values, which were stored in a column of a ‘result’ matrix. A statistical analysis was performed on the columns of the result matrix to produce average values for the minimum diameter, measured from break-up backwards, in units of pixels and frames. Finally, as the frame rate was known and as we were able to determine the distance in pixels that the fluid moved between frames, we could rescale the results into SI units and output a graph like the one below!

The solution was RAM-light, as we only ever loaded one frame into memory at a time and calculations were only performed on a single array. The result matrix also didn’t eat up memory, as it only reached widths on the order of hundreds. This meant we could run several analyses at the same time, which was not possible when loading an entire video.
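The averaging and rescaling might look something like the following. The calibration constants are purely illustrative stand-ins (the 24,000 fps figure echoes the camera spec above; the metres-per-pixel value is invented for the example), not the experiment's actual calibration.

```python
FRAME_RATE_HZ = 24_000       # camera frame rate (frames per second)
METRES_PER_PIXEL = 1e-6      # spatial calibration - hypothetical value

def average_and_rescale(result_columns):
    """Average minimum-diameter traces aligned at break-up, then convert
    to SI units.

    `result_columns` holds one trace per tracked element, each a list of
    pixel diameters ordered from break-up backwards (index 0 = break-up).
    Returns (time_before_breakup_s, mean_diameter_m) pairs.
    """
    n = min(len(col) for col in result_columns)
    rescaled = []
    for i in range(n):  # i = number of frames before break-up
        mean_px = sum(col[i] for col in result_columns) / len(result_columns)
        rescaled.append((i / FRAME_RATE_HZ, mean_px * METRES_PER_PIXEL))
    return rescaled
```

Truncating to the shortest trace keeps every averaged point backed by all tracked elements; an alternative would be to average over whichever traces reach each index.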

In terms of processing time, a single video ended up taking around 6 minutes. This allowed us to analyse videos on the same day they were captured and complete our experimental analysis within the allotted time. Happy Days!

## Lessons learnt

• Taking time to pre-process raw data can save much more time during the processing stage.
• The benefits of ensuring an application runs with low memory requirements should not be overlooked. It could allow much shorter turnaround times if a task is repetitive and can be run on multiple inputs at the same time.
• A thorough understanding of the problem space helps dramatically when problem solving. By recognising how Lagrangian elements behaved, we were able to develop a tracking process relatively easily. Without this prior knowledge, it may have taken much longer.