Pattern Detection
Pattern Discovery and Detection
Recently I have been trying to solve the problem of finding common word patterns in strings of text.
Most pattern tools assume you already know the pattern that you wish to locate, and then allow you to search for matches based upon that pattern (such as regular expressions). The problem I faced was that I needed to find patterns as opposed to match them.
I scoured the web for tools to perform this job to no avail. I decided to look for solutions to similar problems. A friend suggested the field of genetic research, where finding commonly occuring patterns is a relatively common task.
I found several cases where pattern discovery used a matrix to solve the problem.
First, one pattern is placed across the x axis of a matrix, and one is placed across the y axis.
| G |
|
|
|
|
|
|
| A |
|
|
|
|
|
|
| T |
|
|
|
|
|
|
| T |
|
|
|
|
|
|
| A |
|
|
|
|
|
|
| C |
|
|
|
|
|
|
| A |
|
|
|
|
|
|
| T |
|
|
|
|
|
|
| A |
|
|
|
|
|
|
|
| G | T | A | A | C | A |
Then a mark is placed on each location in the matrix where a match is found on its corresponding x and y axis.
| G | X |
|
|
|
|
|
| A |
|
| X | X |
| X |
| T |
| X |
|
|
|
|
| T |
| X |
|
|
|
|
| A |
|
| X | X |
| X |
| C |
|
|
|
| X |
|
| A |
|
| X | X |
| X |
| T |
| X |
|
|
|
|
| A |
|
| X | X |
| X |
|
| G | T | A | A | C | A |
Finally, the matrix is examined to find diagonal lines of markers running from bottom-left to top right
| G | X |
|
|
|
|
|
| A |
|
| X | X |
| X |
| T |
| X |
|
|
|
|
| T |
| X |
|
|
|
|
| A |
|
| X | X |
| X |
| C |
|
|
|
| X |
|
| A |
|
| X | X |
| X |
| T |
| X |
|
|
|
|
| A |
|
| X | X |
| X |
|
| G | T | A | A | C | A |
In this example we can see that the longest pattern common to both is ACA. We have discovered this pattern without any prior knowledge of what we were looking for.
I decided to try this match versus my original problem, finding phrases within text.
| mat |
|
|
| X |
|
|
|
| the |
|
| X |
|
| X |
|
| on |
| X |
|
|
|
|
|
| sat |
|
|
|
|
|
|
|
| cat |
|
|
|
|
|
| X |
| The |
|
| X |
|
| X |
|
|
| Sit | on | the | mat | said | the | cat |
The common patterns “the cat”, and “on the mat” were successfully detected. This technique works better for detecting text as there are more of a variety of possibilities thereby reducing 'noise'.
As a side-note, when performing the matching process using a computer, storing the matches in a matrix is quite wasteful. The matrix is populated sparsely, and in diagonal lines. If the matrix is transformed by rotating it 45 degrees clockwise, the lines become horizontal, more obvious for detection, and easier to store as coordinates/vectors rather than a bulkier 2D array.
This can be also used to find commonly occuring patterns in a single document by placing the same document on the x and y axis, and ignoring the centre coordinates (1,1) , (2,2) , (3,3) etc.



