## Example of Recursive Exhaustion

Though encoding names by their frequency in name lists from an old US census is a terrible way to do it, I shall use it here.  Since I do not know the middle names of many of my ancestors, I will ignore them.  Thus instead of using the unique descriptor for my full name, Douglas Pardoe Wilson, which is 45, 80085, 8, I will only use Douglas Wilson, which would be 45, 8.  Obviously there will be many Douglas Wilsons.

Instead of a unique descriptor using the three numbers, I will use two of them plus my birthdate.  Thus I am 45, 8, 1949, 8, 5.  I think that is probably a unique descriptor, involving just five numbers.  Now instead of using five numbers to describe me, I could add the sequences for my parents.  Using the same encoding, my father is 19, 8, 1920, 6, 6.  My mother was 14, 999999, 1920, 1, 1, where I have used 999999 to indicate that her maiden surname was not found in that census at all.

The result of doing this is to have a sequence of fifteen numbers to describe myself:  45, 8, 1949, 8, 5, 19, 8, 1920, 6, 6, 14, 999999, 1920, 1, 1.  This is entirely plausible.  I am in many ways like my father, in many other ways like my mother.  Though using something based on name frequency is just plain wrong, the set of fifteen numbers is a better summary of me than the original five.
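A minimal sketch of this concatenation in Python, using only the census-rank numbers quoted above.  The names `Person` and `descriptor` are invented for this illustration and do not come from any existing tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    first: int   # census rank of the first name
    last: int    # census rank of the surname (999999 = not found)
    year: int    # birth year
    month: int   # birth month
    day: int     # birth day
    father: Optional["Person"] = None
    mother: Optional["Person"] = None

    def basic(self) -> list[int]:
        """The person's own five numbers."""
        return [self.first, self.last, self.year, self.month, self.day]


def descriptor(p: Person) -> list[int]:
    """Own five numbers, followed by the father's, then the mother's."""
    out = p.basic()
    for parent in (p.father, p.mother):
        if parent is not None:
            out += parent.basic()
    return out


father = Person(19, 8, 1920, 6, 6)
mother = Person(14, 999999, 1920, 1, 1)
me = Person(45, 8, 1949, 8, 5, father, mother)

print(descriptor(me))
```

The printed list has fifteen entries: five for the person, five for the father and five for the mother.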

This can be done for my father and mother also.  My mother would be not only the five given numbers, but the fifteen, 14, 999999, 1920, 1, 1, 85, 999999, 1885, 8, 29, 997, 1891, 4, 2.  My father would be 19, 8, 1920, 6, 6, 131, 8, 1880, 10, 16, 256, 7238, 1886, 11, 2.

Having fifteen numbers for each of my parents, I would then have a 45 number descriptor, my own plus each of theirs.  If we go back one more generation, I’d get a 135 number descriptor.  Using a valid multidimensional name space instead of just name frequency, one person might have ten numbers in their own description, thirty in the one including their parents, ninety in the next and 270 in the next.  Having a legitimate 270 number vector description of a person would be very useful.  In fact a single person may have hundreds of numbers in their description and the multiplication factor may be much higher.  Instead of using just parents in multiplying the basic data for a person, birthplace could be used.  Eventually megabytes of information could be found for each person existing today and nearly as much for all ancestors.
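The arithmetic here is a simple tripling at each step: a descriptor is combined with the same-depth descriptor of each of two parents.  A short sketch, in which the function name `descriptor_length` and the default factor of 3 are assumptions made for illustration:

```python
def descriptor_length(base: int, generations: int, factor: int = 3) -> int:
    """Length of a descriptor after going back the given number of generations.

    base   -- how many numbers describe one person on their own
    factor -- growth per generation (3 = self plus two parents)
    """
    return base * factor ** generations


for base in (5, 10):
    lengths = [descriptor_length(base, g) for g in range(4)]
    print(base, lengths)
```

With a base of five this gives 5, 15, 45 and 135, and with a base of ten it gives 10, 30, 90 and 270.  A larger `factor` models the case where more than parents are folded in at each step.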

This is what Recursive Exhaustion is all about.  It is one of the most powerful tools I have.

## Name Spaces

As I mentioned on the front page of this site, one way of encoding a name is by its frequency in some canonical lists, such as a former US census.  By itself that is a terrible way of doing it.  It reflects popularity only.  My middle name Pardoe could be encoded as 80085 because it is the 80085th most common surname in some old US census.  Geographically, the name Pardoe is most associated with the village of Ombersley in Worcestershire, England.  The very next name in the frequency list at 80086 is Paraluelos, which has some entirely different origin, unknown to me.

A much better way of translating names into numbers would be to record the latitude and longitude of Ombersley, the place where it was most commonly found, plus perhaps the geographical coordinates of the place in Shropshire where it was first recorded.

Unlike the example of recording names by their frequency in a census, this produces a useful entry in a vector space of names.  It could be used to find names common to nearby places.  However, the word ‘common’ suggests that this is again ultimately based on name frequency.  Those four coordinates would nevertheless be much more useful than the previous single coordinate.
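As a rough sketch of what such a name space might look like, each name below maps to four coordinates: the latitude and longitude of the place where it was most common, then those of the place where it was first recorded.  Every coordinate here is a made-up placeholder, the names `NameA` to `NameC` are hypothetical, and plain Euclidean distance on degrees is only a crude stand-in for a real geographic measure.

```python
import math

# (lat, lon) where most common, then (lat, lon) where first recorded.
# All values are placeholders for illustration, not researched data.
NAME_SPACE = {
    "NameA": (52.3, -2.2, 52.6, -2.7),
    "NameB": (52.2, -2.1, 52.5, -2.8),
    "NameC": (56.0, -3.5, 57.1, -4.0),
}


def nearest(name: str) -> str:
    """Return the other name whose four coordinates lie closest."""
    here = NAME_SPACE[name]
    others = (n for n in NAME_SPACE if n != name)
    return min(others, key=lambda n: math.dist(here, NAME_SPACE[n]))


print(nearest("NameA"))
```

A name with two centres, as suggested for Wilson, could be given two such entries, though how to combine them is left open here as it is in the text.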

I believe it is possible to find a set of coordinates which would make an effective way of placing a name in a name space.  It will not be easy to do.  For example, my last name, ‘Wilson’, was common in Worcestershire, but also in Scotland.  How would one deal with this matter?  Possibly one could have two sets of four coordinates, one for the Worcestershire Wilsons and one for the Scottish ones.  It is not clear how these would be combined, but I believe there is some practical way of doing so.

Regardless of the technical details, I think that some way of placing names in name spaces would make it much easier to define an individual.  For example, my last two names, Pardoe Wilson, which I share with a great many of my ancestors and their descendants, together make it clear that I come from the Worcestershire Wilsons, not the Scottish ones.  On the other hand there are some Douglas Robert Wilsons.  Robert was a relatively rare name in Worcestershire in earlier centuries, but common in Scotland.  The chances are that a Douglas Robert Wilson from the 18th or 19th century was one of the Scottish Wilsons.

Expressing this in mathematical form is difficult, but I believe the powerful method of Recursive Exhaustion would make it quite possible.  I’ll explain this in a later post.

## For Use by the Recursive Exhaustion Algorithm

Why would one want to use genealogy for such a potentially dangerous purpose?

My own answer is that the “bad guys”, people of evil intent, will use it themselves regardless.  They do not need our knowledge or permission.  They will be able to collect information about children as well as adults.  The only hope of keeping society together is to do the same thing but do it better.  Genealogists can make an enormous potential contribution to this.

To explain this, I need  to show you how genealogical information can be turned into numerical data suitable for use with mathematical methods.

A key principle is that expressing data as human-readable words is misleading.  To say that my grandfather was born on Impney Farm is very correct, but obscure.  To say Dodderhill is less accurate, but more convenient.  To say that he was born in Droitwich is just barely correct today, but they were distinct at one time.  To just say Worcestershire is a terrible approximation, but better than nothing.

The solution is to translate everything into numbers and indicate precision by the number of decimal places, as people working in the physical sciences do.

My grandfather was born in a farmhouse whose location can be expressed as a latitude and longitude to eight decimal places of accuracy.  To say that he was born in Dodderhill near Droitwich is to make an error in the seventh decimal place, and to round it up by saying he was born in Droitwich itself introduces an error in about the sixth decimal place.
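One way to carry precision along with a coordinate is to store it as a decimal number rather than a binary float, so that the number of decimal places is part of the value.  The sketch below uses Python's standard `decimal` module.  The helper names `at_precision` and `decimal_places` are invented for this example, and the latitude is a placeholder, not the surveyed position of any real building.

```python
from decimal import Decimal


def at_precision(value: str, places: int) -> Decimal:
    """Round a coordinate, given as text, to `places` decimal places."""
    return Decimal(value).quantize(Decimal(10) ** -places)


def decimal_places(value: Decimal) -> int:
    """Number of decimal places the value claims to be accurate to."""
    return -value.as_tuple().exponent


# Placeholder latitude used only to show the mechanics.
farmhouse_lat = "52.27812345"

print(at_precision(farmhouse_lat, 8))      # full stated precision
print(at_precision(farmhouse_lat, 2))      # a much coarser statement of place
print(decimal_places(Decimal("52.2800")))  # trailing zeros are preserved
```

Because a `Decimal` keeps its trailing zeros, a coarse statement of place and a fine one can sit in the same column while remaining distinguishable.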

Just expressing locations and other data as numbers does not make any real use of math.  It is more like bookkeeping.  Very useful in itself, but not suitable for the mathematical methods of data science.

The first step of a truly mathematical approach is to encode various things we know about a person into a set of coordinates in a multidimensional vector space.  That’s easier than it sounds.

It  is obvious that names can be expressed by numbers.  For example, my first name, Douglas, can be assigned the number 45 because it is the 45th most common male first name on some old US census.  That is a terrible way to do it, as I shall explain later, but it does express one property of my first name — frequency.  You could call it the frequency or popularity dimension.

Similarly, my middle name, Pardoe, could be encoded as the number 80085, since it is the 80085th most common surname on the same old US census.  It is not found on any other name list from that census.  My last name, Wilson, could be encoded by the number 8, since it was the 8th most common surname on that old census.

As far as I know, the combination of these three names is unique.  I am the one person in the world whose name could be represented as 45, 80085, 8.  It is a unique identifier.  On the other hand, my brother’s names do not describe a unique individual.  There is another person with exactly his three names in the same city in which we were raised.

To make a unique identifier for my brother it is necessary to add three numbers representing his birthdate: year, month and day.  To represent us both in the same vector space, my own birthdate would have to be added.

As I said, encoding names using lists from some old census is a terrible way to do it, but it illustrates a basic method.  A few basic facts about a person such as their name and birthdate can easily be translated into a sequence  of numbers, which we call a vector.  Of vital importance is the ability to invert this.  Given a sequence of numbers, it must be easy to decode them, reproducing the original information.

So the first step in mathematical genealogy is encoding a few basic facts which a person can easily read into a sequence of numbers.  The last step is the inverse, recovering that description from its numerical representation.  That could be just table-lookup, but there are better ways, discussed elsewhere.
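The table-lookup version of encoding and decoding can be sketched as follows.  The ranks are the ones quoted in these posts, and the two small dictionaries are stand-ins for the full census lists, which are not reproduced here.  The function names `encode` and `decode` are assumptions made for illustration.

```python
# Tiny stand-ins for the census name lists, holding only ranks quoted in the text.
FIRST_NAME_RANK = {"Douglas": 45, "Frederick": 131}
SURNAME_RANK = {"Wilson": 8, "Pardoe": 80085}

# Inverse tables, so that decoding is also a plain lookup.
FIRST_NAME_BY_RANK = {rank: name for name, rank in FIRST_NAME_RANK.items()}
SURNAME_BY_RANK = {rank: name for name, rank in SURNAME_RANK.items()}


def encode(first, middle, last, year, month, day):
    """Turn three names and a birthdate into a vector of six numbers."""
    return [
        FIRST_NAME_RANK[first],
        SURNAME_RANK[middle],   # the middle name here is itself a surname
        SURNAME_RANK[last],
        year, month, day,
    ]


def decode(vector):
    """Recover the names and birthdate from the six numbers."""
    first, middle, last, year, month, day = vector
    return (
        FIRST_NAME_BY_RANK[first],
        SURNAME_BY_RANK[middle],
        SURNAME_BY_RANK[last],
        year, month, day,
    )


vec = encode("Frederick", "Pardoe", "Wilson", 1880, 10, 16)
print(vec)
print(decode(vec))
```

Decoding the encoded vector returns the original facts, which is the inversion property described above.  A placeholder such as 999999 for a name missing from the list cannot be decoded this way, which is one more weakness of the frequency-rank encoding.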

What happens between these steps is the key to mathematical genealogy.  Encoding and decoding are inverses, more easily done with linear algebra, but it may be necessary to use category theory and think of them as adjoint functors.  In general, sandwiching an important transformation between an operation and its inverse is the most powerful mathematical method I’ve ever encountered.

The meat in the sandwich discussed here is recursive exhaustion, the most powerful data collection and correction method I know.

In my website on recursive exhaustion I use the term exhaustion as it is used in the context of computer science.   But another meaning of the term is that of exhausting a space of possibilities.  That basically means doing it for everybody.  For mathematical genealogy using recursive exhaustion the important thing is to create a mathematical model like a sequence of numbers for every single person we can identify, past, present and even future — though extrapolating beyond even the best date is always risky:  she may have a miscarriage.

That meaning of exhaustion is entirely consistent with a major goal of genealogy in general.  We do not want to simply create a mathematical model for existing people, we want to create one for people long dead.  For example, the sequence of numbers 131, 80085, 8, 1880, 10, 16 could represent my grandfather, Frederick Pardoe Wilson, born on October 16, 1880.  That is a unique descriptor for the man.  A goal of genealogy includes producing mathematical descriptions for not only my grandfather but all of my other relatives.  I know some identifying and other information about a very few going back to before the Norman Conquest.  Obviously I’ve had too much time on my hands.

One reason for collecting this information in mathematical form is that it will be easier to merge with that of other people.  The fact that I share ancestors six generations back with a fourth cousin made it much easier to establish the basic facts.   Collaboration will be much much easier for everyone when genealogical data is in mathematical form.

Though rewarding for many reasons, encouraging collaboration is not the main reason for this approach.  Methods such as recursive exhaustion can be used to extract an enormous amount of additional information about each person who has ever existed.  Much of this information would be of great value to those interested in their family history.  I will explain this in detail in various posts to be added to this site.  Meanwhile look at the website actually called Very Large Scale Social Data Collection, with the RecursiveExhaustion.com domain name.

## Free but Compulsory Genealogy

The idea of anything being compulsory offends me, but some things like public education seem to depend on it.  The only thing that can ameliorate my distaste for something being compulsory is if it is free.  That is indeed the principle behind public education in the advanced countries, that it should be both free and compulsory, so that everyone gets a basic education, whether they want to or not, but it is provided at no cost by the state.

I believe that some things related to genealogy are so important that they should also be made free and compulsory.

Two things stand out in my mind.

First, each individual should provide enough information to locate him or her in what I think of as genealogical space, but you might prefer to think of as a global genealogical tree.  People who cannot provide this information should be helped to obtain it as part of what is really their education.  Knowledge of self is an essential part of the basic knowledge an educated person needs.

Secondly, individuals should be required to obtain  DNA samples.  This is not the same as a requirement to provide DNA samples to governments.  Access to the information obtained from DNA should be entirely under the control of the individual, except in certain specific situations.  That is a separate issue, discussed elsewhere.  The key thing is that everyone have a DNA profile, which can easily be compared with that of others for genealogical purposes.  Doing such comparisons should be voluntary, but making them easy is important.  Other reasons for having universal DNA analysis are discussed elsewhere.

Two things ameliorate the distasteful aspects of requiring DNA information to be collected for individual use.  First, this service will be provided free of charge, and secondly a number of powerful software tools for making use of it will be provided, also free of charge.

These proposals raise a great many issues, and will be controversial.  I am myself greatly disturbed at the idea of making any such thing compulsory.  But I believe the benefits to society greatly outweigh the risks.

## Where was I Born?

Previously I showed you how to derive a name for yourself which would be a unique descriptor.  I also showed you how to derive a unique descriptor from several birthdates, and times, including your own and at least your parents’.  Now what about birthplaces?  I don’t know if my brother and I were born in exactly the same geographic location, but the two places were probably close to one another.  Adding the birthplaces of ancestors would help distinguish people from different families, but not siblings with identical birthplaces.  My father was born in the family home at 2716 Clarke Drive, Vancouver BC, and may indeed have been born in the same bed as one of his siblings.  I don’t know.

The case of twins is especially hard, since their birthplaces were likely the same down to a matter of an inch or so.  Twins may be distinguished by birth times, though, and certainly by given names, so providing a unique descriptor for any person is not a problem.  But if names would suffice, and birth times are good enough, then why look at birthplaces at all?

The answer is in the mathematical advantages of having a lot of data, more than just enough for unique descriptors.  How to collect, massage and use that data is the subject of a later page.  The relevance for genealogy is simple.  We need to construct such descriptors for our ancestors, to the best of our ability, which we can then use in matching algorithms to provide reliable exhaustive proof of family trees.

## When was I Born?

I have written about my name as a unique descriptor, and how to make one of yours.

But names are symbolic.  What about using numbers to identify people?  They would be easier to work with when using mathematics.  An obvious candidate is birthdate.

I’m an old guy, from the middle of the last century, August 5th, 1949.  But just to maintain my self-respect, let me point out that it was in 1975, when I was just 26, that I first started to write about Social Technology, about 40 years before people started talking about “Social Apps”.

Anyway, that bit of egoism aside, that date is not a good enough descriptor.  Lots of people were born on that day.  Specifying birthplace in lat/long coords would help, but let’s just stick with times for now, as I used names on the previous page.  I think all I need to do is add the birthdates of my parents: my mother, January 1st, 1920, and my father, June 6th, 1920.  We have this little mnemonic for dates in our family:  my mother was born on the first day of the first month, my father the 6th day of the 6th month, myself, the fifth day of the eighth month, my brother, the eighth day of the fifth month.

There will be people who are not sufficiently distinguished by just three birthdates.  We have only one coincidence that I know of in my family, and only a partial one: both of my grandmothers were born on April 2nd, but in different years.  Still, in this huge world of ours, there may be a case where three birthdates are not enough.  If so, add the dates for grandparents.  That is sure to be enough.
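The three-birthdate descriptor can be written out as a flat list of numbers.  This is a small sketch using the dates given above, and the function name `date_vector` is made up for the purpose.

```python
from datetime import date


def date_vector(*dates: date) -> list[int]:
    """Flatten any number of dates into year, month, day triples."""
    out = []
    for d in dates:
        out += [d.year, d.month, d.day]
    return out


me = date(1949, 8, 5)
mother = date(1920, 1, 1)
father = date(1920, 6, 6)

print(date_vector(me, mother, father))
```

If three dates turn out not to be enough, the grandparents' dates can simply be passed as further arguments, lengthening the list by three numbers each.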

Using birthdates is perhaps the easiest unique descriptor to provide, but birthplaces are of some use too.  More about that later.

## What is My Name?

As I wrote elsewhere, I am known here as Doug Wilson, though there is another person in town with that name.  To distinguish us, I could use my full name, Douglas Pardoe Wilson, which has the advantage of being a unique descriptor.  I don’t think there is anyone else in the world with that name.  Adding my middle name was all I needed.  But the same would not be true for my brother, Alan.  His middle name is the common one of Edward, so there are many Alan Edward Wilsons in the world, including one who went to my old high school in North Vancouver.

So let me try to produce a unique descriptor for my brother.  As is common in some countries, let’s add our mother’s maiden name.  He would then be Alan Edward Cottet Wilson.  Unique.  But let’s be consistent here.  If he is going to have four names, I should as well.  So let me be Douglas Pardoe Cottet Wilson.

This may not be good enough for some people, though.  Consider a John William Smith, whose mother’s maiden name was Jones.  Those are all common names, so the combined name of John William Jones Smith may not be unique.  So what now?  We could add the maiden name of one grandmother.  That’s probably enough to get a unique descriptor for the man, but to be fair, let’s add both of them.

Instead of continuing with an artificial example, I’ll be egocentric again and look at what my own would be.  While it only takes three names to make mine unique, it did no harm to add a fourth, with the end goal of giving everyone the same number of names.  Putting the maiden name of my maternal grandmother before that of my paternal one, I would then be Douglas Pardoe Walker Tighe Cottet Wilson.  If I continued adding maiden names past absurdity, then my real middle name of Pardoe would reoccur, since it was the maiden name of my great-great-great-grandmother Cicely, of Ombersley, Worcestershire, England.
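The naming rule used in these examples is mechanical enough to write down: given names first, then maiden names with the grandmothers' ahead of the mother's, then the surname.  A small sketch, where the function name `extended_name` and its argument layout are assumptions made for illustration:

```python
def extended_name(given, surname, maiden_names):
    """Build an extended name.

    given        -- list of given names, in order
    surname      -- the family name, placed last
    maiden_names -- maiden names to insert, earliest generation first
                    (maternal grandmother, paternal grandmother, mother)
    """
    return " ".join([*given, *maiden_names, surname])


print(extended_name(["Alan", "Edward"], "Wilson", ["Cottet"]))
print(extended_name(["Douglas", "Pardoe"], "Wilson", ["Cottet"]))
print(extended_name(["Douglas", "Pardoe"], "Wilson", ["Walker", "Tighe", "Cottet"]))
```

Adding another generation only means putting more maiden names at the front of the `maiden_names` list.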

That’s carrying things a bit far.  How many names do we need to provide every human being with a unique descriptor?  Four, five, six?  And what would your name be?