Genealogy for Recursive Exhaustion


For Use by the Recursive Exhaustion Algorithm

Very Large Scale Data Collection

Why would one want to use genealogy for such a potentially dangerous purpose?

My own answer is that the “bad guys”, people of evil intent, will use it themselves regardless.  They do not need our knowledge or permission.  They will be able to collect information about children as well as adults.  The only hope of keeping society together is to do the same thing but do it better.  Genealogists could make an enormous contribution to this.

To explain this, I need to show you how genealogical information can be turned into numerical data suitable for use with mathematical methods.

A key principle is that expressing data as human-readable words is misleading.  To say that my grandfather was born on Impney Farm is entirely correct, but obscure.  To say Dodderhill is less accurate, but more convenient.  To say that he was born in Droitwich is just barely correct today, though Dodderhill and Droitwich were distinct at one time.  To just say Worcestershire is a terrible approximation, but better than nothing.

The solution is to translate everything into numbers and indicate precision by the number of decimal places, as people working in the physical sciences do.

My grandfather was born in a farmhouse whose location can be expressed as a latitude and longitude to eight decimal places of accuracy.  To say that he was born in Dodderhill near Droitwich is to make an error in the seventh decimal place, and to round it up by saying he was born in Droitwich itself introduces an error in about the sixth decimal place.

Just expressing locations and other data as numbers does not make any real use of math.  It is more like bookkeeping.  Very useful in itself, but not suitable for the mathematical methods of data science.


The first step of a truly mathematical approach is to encode various things we know about a person into a set of coordinates in a multidimensional vector space.  That’s easier than it sounds.

It is obvious that names can be expressed by numbers.  For example, my first name, Douglas, can be assigned the number 45 because it is the 45th most common male first name on some old US census.  That is a terrible way to do it, as I shall explain later, but it does express one property of my first name: its frequency.  You could call it the frequency or popularity dimension.

Similarly, my middle name, Pardoe, could be encoded as the number 80085, since it is the 80085th most common surname on the same old US census.  It is not found on any other name list from that census.  My last name, Wilson, could be encoded by the number 8, since it was the 8th most common surname on that old census.

As far as I know, the combination of these three names is unique.  I am the one person in the world whose name could be represented as 45, 80085, 8.  It is a unique identifier.  On the other hand, my brother’s names do not describe a unique individual.  There is another person with exactly his three names in the same city in which we were raised.

To make a unique identifier for my brother it is necessary to add three numbers representing his birthdate: year, month and day.  To represent us both in the same vector space, my own birthdate would have to be added.
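The point about uniqueness can be checked mechanically.  What follows is only a minimal sketch: the records are made-up placeholders (not real census ranks or real birthdates), and the helper name `duplicates` is my own invention for illustration.  It shows name-only vectors colliding and the collision disappearing once year, month and day are appended.

```python
from collections import Counter

# Hypothetical records: (first, middle, last, year, month, day).
# All numbers are placeholders, not real census ranks or birthdates.
records = [
    (12, 300, 8, 1950, 3, 14),
    (12, 300, 8, 1948, 11, 2),   # same three name numbers, different person
    (7, 5000, 21, 1952, 7, 30),
]

def duplicates(vectors):
    """Return the set of vectors that occur more than once."""
    counts = Counter(vectors)
    return {vector for vector, n in counts.items() if n > 1}

# Using the three name numbers alone, two people are indistinguishable.
name_only = [record[:3] for record in records]
print(duplicates(name_only))

# With the three birthdate numbers added, every vector is distinct.
print(duplicates(records))
```

If the second call ever returned a non-empty set, that would be the signal to add a further dimension, such as place of birth.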

As I said, encoding names using lists from some old census is a terrible way to do it, but it illustrates a basic method.  A few basic facts about a person such as their name and birthdate can easily be translated into a sequence of numbers, which we call a vector.  Of vital importance is the ability to invert this.  Given a sequence of numbers, it must be easy to decode them, reproducing the original information.

So the first step in mathematical genealogy is encoding a few basic, human-readable facts into a sequence of numbers.  The last step is the inverse, recovering that description from its numerical representation.  That could be just table-lookup, but there are better ways, discussed elsewhere.
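The table-lookup version of encoding and decoding can be sketched in a few lines.  This is only an illustration under stated assumptions: the tables contain nothing but the three ranks quoted above, the function names `encode`, `decode` and `invert` are my own, and the middle name is looked up in the surname list as described earlier.

```python
# Rank tables (name -> rank), holding only the ranks quoted in the text.
FIRST_NAMES = {"Douglas": 45}
SURNAMES = {"Pardoe": 80085, "Wilson": 8}

def invert(table):
    """Build the rank -> name table, refusing ambiguous ranks."""
    inverse = {rank: name for name, rank in table.items()}
    if len(inverse) != len(table):
        raise ValueError("two names share a rank, so decoding would be ambiguous")
    return inverse

FIRST_BY_RANK = invert(FIRST_NAMES)
SURNAME_BY_RANK = invert(SURNAMES)

def encode(first, middle, last):
    """Translate three names into a three-number vector."""
    return (FIRST_NAMES[first], SURNAMES[middle], SURNAMES[last])

def decode(vector):
    """Recover the three names from a three-number vector."""
    first, middle, last = vector
    return (FIRST_BY_RANK[first], SURNAME_BY_RANK[middle], SURNAME_BY_RANK[last])

vector = encode("Douglas", "Pardoe", "Wilson")
print(vector)          # (45, 80085, 8)
print(decode(vector))  # ('Douglas', 'Pardoe', 'Wilson')
```

The check inside `invert` is what makes decoding safe: if two names ever shared a rank, the round trip from names to numbers and back could not reproduce the original information.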

What happens between these steps is the key to mathematical genealogy.  Encoding and decoding are inverses, more easily done with linear algebra, but it may be necessary to use category theory and think of them as adjoint functors.  In general, sandwiching an important transformation between an operation and its inverse is one of the most powerful mathematical methods I’ve ever encountered.

The meat in the sandwich discussed here is recursive exhaustion, the most powerful data collection and correction method I know.

On my website on recursive exhaustion I use the term exhaustion as it is used in the context of computer science.  But another meaning of the term is that of exhausting a space of possibilities.  That basically means doing it for everybody.  For mathematical genealogy using recursive exhaustion the important thing is to create a mathematical model, like a sequence of numbers, for every single person we can identify, past, present and even future, though extrapolating into the future is always risky, however good the information: a pregnancy may end in miscarriage.

That meaning of exhaustion is entirely consistent with a major goal of genealogy in general.  We do not want to create a mathematical model only for existing people; we want to create one for people long dead as well.  For example, the sequence of numbers 131, 80085, 8, 1880, 10, 16 could represent my grandfather, Frederick Pardoe Wilson, born on October 16, 1880.  That is a unique descriptor for the man.  One goal of genealogy is to produce mathematical descriptions not only for my grandfather but for all of my other relatives.  I know some identifying and other information about a very few going back to before the Norman Conquest.  Obviously I’ve had too much time on my hands.

One reason for collecting this information in mathematical form is that it will be easier to merge with that of other people.  The fact that I share ancestors six generations back with a fourth cousin made it much easier to establish the basic facts.  Collaboration will be much easier for everyone when genealogical data is in mathematical form.
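To see why merging gets easier, here is a rough sketch that treats each researcher's collection as a set of six-number vectors.  The only real vector is my grandfather's, quoted above; the other two are hypothetical placeholders, and the variable names are mine.  Shared ancestors fall out of a set intersection, and the merged collection is a set union with duplicates removed automatically.

```python
# My grandfather's vector, as quoted in the text.
grandfather = (131, 80085, 8, 1880, 10, 16)

# Hypothetical placeholder vectors, for illustration only.
shared_ancestor = (52, 410, 8, 1851, 2, 9)
cousins_relative = (19, 77, 230, 1879, 6, 1)

my_collection = {grandfather, shared_ancestor}
cousins_collection = {shared_ancestor, cousins_relative}

# People who appear in both collections.
shared = my_collection & cousins_collection

# Everyone known to either researcher, each person listed once.
merged = my_collection | cousins_collection

print(sorted(shared))
print(len(merged))  # 3
```

This only works when both researchers encode the same person as exactly the same vector, which is one more reason the encoding has to be agreed on in advance.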

Though rewarding for many reasons, encouraging collaboration is not the main reason for this approach.  Methods such as recursive exhaustion can be used to extract an enormous amount of additional information about each person who has ever existed.  Much of this information would be of great value to those interested in their family history.  I will explain this in detail in various posts to be added to this site.  Meanwhile, look at the website actually called Very Large Scale Social Data Collection, with the domain name.
