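Depix is an open-source tool that quickly removes the pixelization (mosaic) from screenshots of text so that the original text can be read again.

It is written in Python.

Depix's algorithm exploits the fact that a linear box filter processes every mosaic block separately. For every block, it pixelizes all blocks in the search image to check for direct matches. The matches of surrounding multi-match blocks are then compared, and this process is repeated several times until matching results are found and output.

Source code: https://github.com/beurtschipper/Depix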

---------------------------------------------------------

Depix

Depix is a tool for recovering passwords from pixelized screenshots.

This implementation works on pixelized images that were created with a linear box filter. In this article I cover background information on pixelization and similar research.
Example

image
Installation

    Install the dependencies:

pip install git+https://github.com/beurtschipper/Depix

    Run Depix:

depix \
    -p /path/to/your/input/image.png \
    -s images/searchimages/debruinseq_notepad_Windows10_closeAndSpaced.png \
    -o /path/to/your/output.png

Example usage

    Depixelize example image created with Notepad and pixelized with Greenshot. Greenshot pixelizes by averaging the gamma-encoded 0-255 values, which is Depix's default mode.

depix \
    -p images/testimages/testimage3_pixels.png \
    -s images/searchimages/debruinseq_notepad_Windows10_closeAndSpaced.png

Result: image

    Depixelize example image created with Sublime and pixelized with Gimp, where averaging is done in linear sRGB. The backgroundcolor option filters out the background color of the editor.

depix \
    -p images/testimages/sublime_screenshot_pixels_gimp.png \
    -s images/searchimages/debruin_sublime_Linux_small.png \
    --backgroundcolor 40,41,35 \
    --averagetype linear

Result: image

    (Optional) You can create a pixelized image with genpixed.

genpixed -i /path/to/image.png -o pixed_output.png

    For a detailed explanation, run depix -h and genpixed.

About
Making a Search Image

    Cut out the pixelated blocks from the screenshot as a single rectangle.
    Paste a De Bruijn sequence with expected characters in an editor with the same font settings as your input image (Same text size, similar font, same colors).
    Make a screenshot of the sequence.
    Move that screenshot into a folder like images/searchimages/.
    Run Depix with the -s flag set to the location of this screenshot.
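The sequence for step 2 can be generated programmatically. The following is a minimal sketch using the standard Lyndon-word construction of a De Bruijn sequence; the function names `de_bruijn` and `search_text` are hypothetical and are not part of Depix.

```python
def de_bruijn(alphabet, n):
    """Return a cyclic De Bruijn sequence over `alphabet` for substrings of length n."""
    k = len(alphabet)
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in sequence)


def search_text(alphabet, n=2):
    """Unroll the cyclic sequence so every n-character combination appears in a plain string."""
    cyclic = de_bruijn(alphabet, n)
    return cyclic + cyclic[:n - 1]


# Hypothetical three-character alphabet; use the characters you expect in the secret.
text = search_text("abc", 2)
print(text)
```

Pasting the printed text into the editor covers every 2-character combination of the chosen alphabet. The sketch does not add the spaced variant of the sequence.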

Algorithm

The algorithm uses the fact that the linear box filter processes every block separately. For every block it pixelizes all blocks in the search image to check for direct matches.

For most pixelized images Depix manages to find single-match results. It assumes these are correct. The matches of surrounding multi-match blocks are then checked to see whether they lie at the same geometric distance as in the pixelized image. Matches that do are also treated as correct. This process is repeated a couple of times.

After correct blocks have no more geometrical matches, it will output all correct blocks directly. For multi-match blocks, it outputs the average of all matches.
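The direct-match step can be sketched as follows. This is an illustrative simplification, not Depix's actual code: it assumes grayscale images stored as lists of rows, uses integer averages, and the names `block_average` and `find_matches` are hypothetical.

```python
def block_average(image, top, left, size):
    """Integer average of the size x size window whose top-left corner is (top, left)."""
    total = sum(image[y][x]
                for y in range(top, top + size)
                for x in range(left, left + size))
    return total // (size * size)


def find_matches(search_image, target_value, size):
    """Return (top, left) of every window in the search image that pixelizes to target_value."""
    height, width = len(search_image), len(search_image[0])
    return [(top, left)
            for top in range(height - size + 1)
            for left in range(width - size + 1)
            if block_average(search_image, top, left, size) == target_value]


# Tiny hypothetical search image (grayscale values).
search_image = [
    [0, 0, 0, 0],
    [0, 255, 255, 0],
    [0, 255, 0, 0],
]

single = find_matches(search_image, 191, 2)  # one window matches
multi = find_matches(search_image, 63, 2)    # several windows match
```

A block value with exactly one matching window corresponds to a single-match result. A value with several matching windows is a multi-match block that needs the geometric comparison described above.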
Known limitations

    The algorithm matches by integer block-boundaries. As a result, it has the underlying assumption that for all characters rendered (both in the De Bruijn sequence and the pixelated image), the text positioning is done at pixel level. However, some modern text rasterizers position text at sub-pixel accuracies.
    By default, the algorithm performs pixel averaging in the image's gamma-encoded RGB space. Images pixelated in linear RGB can only be reconstructed when linear averaging is selected with --averagetype linear.
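To see why the averaging space matters, the sketch below averages the same block once on the gamma-encoded 0-255 values and once in linear light. It uses the standard sRGB transfer function and does not reproduce Depix's code; all function names are hypothetical.

```python
def srgb_to_linear(value):
    """Convert a gamma-encoded sRGB value (0-255) to linear light (0.0-1.0)."""
    c = value / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def linear_to_srgb(linear):
    """Convert linear light (0.0-1.0) back to a gamma-encoded sRGB value (0-255)."""
    if linear <= 0.0031308:
        c = linear * 12.92
    else:
        c = 1.055 * linear ** (1 / 2.4) - 0.055
    return round(c * 255)


def average_gamma_encoded(values):
    """Average the stored 0-255 values directly."""
    return round(sum(values) / len(values))


def average_linear(values):
    """Average in linear light, then re-encode the result."""
    mean = sum(srgb_to_linear(v) for v in values) / len(values)
    return linear_to_srgb(mean)


# Hypothetical block: one black pixel and three white pixels.
block = [0, 255, 255, 255]
print(average_gamma_encoded(block), average_linear(block))
```

The two averages differ for any block that mixes dark and light pixels, so a search image pixelized with the wrong averaging type produces block values that never match.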

Future development

    Implement more filter functions

Create more averaging filters that work like some popular editors do.

    Create a new tool that utilizes HMMs

After creating this program, someone pointed me to a research document from 2016 where a group of researchers managed to create a similar tool. Their tool has better precision and works across many different fonts. While their original source code is not public, an open-source implementation exists at DepixHMM.

Edit 16 Feb '22: Dan Petro created the tool UnRedacter (write-up, source) to crack a challenge that was created as a response to Depix!

Still, anyone who is passionate about this type of depixelization is encouraged to implement their own HMM-based version and share it.

from  

https://github.com/beurtschipper/Depix

-------------------------------------------------

Recovering the pixelized (mosaicked) parts of an image

Introduction

Pixelization is used in many areas to obfuscate information in images. I've seen companies pixelize passwords in internal documents. No tools were available for recovering a password from such an image, so I created one. This article covers the algorithm and similar research on depixelization.

The tool is available on GitHub, and the image below shows one of the test results.
image
What is pixelization?

Pixelization describes the process of partially lowering the resolution of an image to censor information. The implementation of my algorithm attacks the common linear box filter. A linear box filter takes a box of pixels, and overwrites the pixels with the average value of all pixels in the box. It is simple to implement and fast, because the blocks are independent and can be processed in parallel.

The image below shows an example of the linear box filter. An image of an emoticon is divided into four blocks. The average color of a block overwrites the block’s pixels, resulting in a final, pixelized emoticon. It is impossible to directly reverse the filter, since the original information is lost.
image
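The filter described above can be sketched in a few lines. This is a minimal illustration, assuming a grayscale image stored as a list of rows whose dimensions are multiples of the block size; the function name `pixelize` is hypothetical.

```python
def pixelize(image, block_size):
    """Box filter: overwrite every block with the integer average of its pixels."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            block = [image[y][x]
                     for y in range(top, top + block_size)
                     for x in range(left, left + block_size)]
            average = sum(block) // len(block)
            for y in range(top, top + block_size):
                for x in range(left, left + block_size):
                    result[y][x] = average
    return result


# 4x4 example image, divided into four 2x2 blocks.
image = [
    [0, 0, 255, 255],
    [0, 100, 255, 255],
    [10, 20, 200, 100],
    [30, 40, 100, 200],
]
pixelized = pixelize(image, 2)
```

In this example the top-left block (0, 0, 0, 100) and the bottom-left block (10, 20, 30, 40) both average to 25. Different originals map to the same output, which is why the filter cannot be reversed directly.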
Deblurring tools, history and research

Images can be obfuscated in many ways, which is generally referred to as blurring. Pixelization with box filters can be seen as a subset of blurring techniques. Most blurring algorithms tend to spread out pixels as they try to mimic natural blurs caused by shaky cameras or focusing issues.

There exist many deblurring tools for common tasks, such as sharpening blurry photographs. Unfortunately, the pixelated passwords I'm working with are only a couple of blocks in height, so there is nothing to sharpen.

Recent developments in AI have raised fancy headlines such as "Researchers Have Created a Tool That Can Perfectly Depixelate Faces". However, the AI does no such thing. This recent PULSE algorithm is similar to Google's RAISR algorithm from 2016. The AI generates faces that result in the same image when pixelized, but the face it recovers is not the original.

Algorithms such as PULSE seem new, but they stem from a long lineage of deblurring tools. M. W. Buie wrote a tool in 1994 (!) to generate 'Plutos', blur them, and match them with observed images.

In a widely known article from 2006, D. Venkatraman explains an algorithm for recovering a pixelized credit card number. The idea is simple: generate all credit card numbers, pixelize them, and compare the result to the pixelized number.

An amazing paper (Hill, Steven & Zhou, Zhimin & Saul, Lawrence & Shacham, Hovav) from 2016 uses Hidden Markov Models (HMMs) to obtain insane accuracy in recovering pixelized text. The source code was not public, but JonasSchatz implemented it in DepixHMM in 2021. The technique is an advanced version of the technique described by Venkatraman.

In 2019, S. Sangwan explained how Photoshop can be utilized to recover faces for OSINT by sharpening an image and looking it up via Google Images. It's similar to other techniques, in that it uses Google to 'brute-force' the face in the image.

Note the similarities between the mentioned solutions. If not enough information is available to properly smooth the image back together, the technique-of-choice is to pixelize similar data and check if it matches. This is also the basis for my algorithm for recovering passwords from screenshots.
Algorithm description

Since the linear box filter is a deterministic algorithm, pixelizing the same values will always result in the same pixelated block. Pixelizing the same text - using the same locations of blocks - will result in the same block values. We can try to pixelate text to find matching patterns. And luckily, this would even work for a part of the secret value. Every block, or combination of blocks, can be considered a sub-problem.

I chose not to create a lookup table of potential fonts. The algorithm requires the same text size and color on the same background. Modern text editors also add hue, saturation and lightness, allowing for a huge number of potential font settings with which the screenshot was taken.

The solution is quite simple: take a De Bruijn sequence of expected characters, paste it in the same editor, and make a screenshot of that. That screenshot is used as a lookup image for similar blocks. For example:
image

This sequence includes all 2-character combinations of expected characters. It's important that 2-character combinations are used, because some blocks can overlap two characters.

Finding proper matches requires the exact block of the same configuration of pixels to exist in the search image. In a test image, my algorithm couldn't find a part of the 'o'. I noticed it was because in the search image, the search block also includes a part of the next letter (the 'd'), but in the original image there was a space.
image

Creating a De Bruijn sequence of letters with spaces around them obviously introduces the same problem: the algorithm wouldn't be able to find the proper blocks for consecutive letters. An image with both spaced and close letters takes longer to search but yields better results.

For most pixelized images, the tool finds single-match results for some blocks. It assumes these are correct. The matches of surrounding multi-match blocks are then checked to see whether they lie at the same geometric distance as in the pixelized image. Matches that do are also treated as correct.
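The geometric check can be illustrated as follows. This is a sketch under simplifying assumptions rather than the tool's actual code: positions are (x, y) pixel coordinates, the anchor is a block with a single match, and the name `filter_by_geometry` is hypothetical.

```python
def filter_by_geometry(anchor_position, anchor_match,
                       neighbor_position, neighbor_candidates):
    """Keep the neighbor candidates whose offset from the anchor's match in the
    search image equals the neighbor's offset from the anchor in the pixelized image."""
    expected = (neighbor_position[0] - anchor_position[0],
                neighbor_position[1] - anchor_position[1])
    return [(x, y) for (x, y) in neighbor_candidates
            if (x - anchor_match[0], y - anchor_match[1]) == expected]


# Hypothetical coordinates: the anchor block sits at (0, 0) in the pixelized
# image and has exactly one match at (40, 12) in the search image.
anchor_position = (0, 0)
anchor_match = (40, 12)

# The neighboring block lies 5 pixels to the right and has three candidates.
neighbor_position = (5, 0)
neighbor_candidates = [(45, 12), (80, 12), (45, 30)]

consistent = filter_by_geometry(anchor_position, anchor_match,
                                neighbor_position, neighbor_candidates)
```

Only the candidate at (45, 12) lies 5 pixels to the right of the anchor's match, so it is the one that would be promoted to a correct match.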

After correct blocks have no more geometrical matches, it will output all correct blocks directly. For multi-match blocks, it outputs the average of all matches. Its output is nowhere near perfect, but it performs quite well. The image below shows a test image with random characters. Most characters can be properly read.
image
Final notes

Always completely remove sensitive information from images, because obfuscation techniques can disclose recoverable parts of the original value.

If there are other tools that can recover passwords from pixelized images, I'd like to know about them. Please test them on the test images in the GitHub repository before claiming that they work. I'm also interested in other techniques, such as pattern recognition of pixelized blocks.

Edit 3 Oct '21: JonasSchatz actually implemented DepixHMM based on the mentioned 2016 research paper involving Hidden Markov Models (HMMs) to recover text!

Edit 16 Feb '22: Dan Petro created the tool UnRedacter (write-up, source) to crack a challenge that was created as a response to Depix!

The mentioned technique beautifully links to vulnerable patterns in cryptography. It's similar to hash cracking, exploiting the use of ECB, and the utilization of known-plaintext attacks. Use best practices for securing data. The assumption that a scheme can't be broken, just because the implementer doesn't know how, is a common pitfall in information security.

from 

https://www.linkedin.com/pulse/recovering-passwords-from-pixelized-screenshots-sipke-mellema