We present a novel approach for predicting debug information in stripped binaries. Using machine learning, we first train probabilistic models on thousands of non-stripped binaries and then use these models to predict properties of meaningful elements in unseen stripped binaries. Our focus is on recovering symbol names, types and locations, which are critical source-level information wiped off during compilation and stripping.

Our learning approach is able to distinguish and extract key elements such as register-allocated and memory-allocated variables usually not evident in the stripped binary. To predict names and types of extracted elements, we use scalable structured prediction algorithms in probabilistic graphical models with an extensive set of features which capture key characteristics of binary code.

Based on this approach, we implemented an automated tool, called Debin, which handles ELF binaries on three of the most popular architectures: x86, x64 and ARM. Given a stripped binary, Debin outputs a binary augmented with the predicted debug information. Our experimental results indicate that Debin is practically useful: for x64, it predicts symbol names and types with 68.8% precision and 68.3% recall. We also show that Debin is helpful for the task of inspecting real-world malware – it revealed suspicious library usage and behaviors such as DNS resolver reader.


Machine Learning for Code