Topic modelling in the archives

There seems to be a lot of topic modelling going on at the moment. And why not? Projects like Mining the Dispatch are demonstrating the possibilities. Tools like Mallet are making it easy. And generous DHers like Ted Underwood and Scott Weingart are doing a great job explaining what it is and how it works.

I’ve talked briefly about using topic modelling to explore digitised newspapers, something that the Mapping Texts project has also been investigating. But I’ve also been following with interest Chad Black’s use of algorithmic techniques, including topic modelling, to look for local variations amidst the legal system of the early modern Spanish empire.

As part of the Invisible Australians project, Kate and I are exploring the bureaucracy of the White Australia Policy. In particular, we’re interested in the interaction between policy and practice, between the highly-centralised bureaucracy and the activities of individual port officials. Like Chad, we’re interested in mapping local variations — to try and understand the bureaucracy from the point of view of an individual forced to live within its restrictions.

I recently gave a presentation about the project at Digital Humanities Australasia (post coming soon!), and in preparation I decided to try a few topic modelling experiments. They were very simple, but I was impressed by the possibilities for exploring archival systems.

The problem I started with was this. The workings of the White Australia Policy are well documented by records held by the National Archives of Australia. Some series within the archives are specifically related to the operations of the policy, such as those containing many thousands of CEDTs. But there are also general correspondence series created by the customs offices in each state, as well as the Commonwealth Department of External Affairs which administered the Immigration Restriction Act (responsibility was later taken by the Department of Home and Territories and its successors). These general correspondence series are important, because they often include details of difficult or controversial cases: those that required a policy judgment, or prompted a change in existing practices. But how do you find relevant files within series that can contain large numbers of items?

Series A1, for example, is a correspondence series created by the Department of External Affairs. It contains more than 60,000 items. Past research tells us that amongst these 60,000 files are records of important policy discussions relating to White Australia. But these files tend to be labelled with the names of the people involved, so unless you know the names in advance they can be difficult to find.

Mitchell Whitelaw’s A1 Explorer, part of the Visible Archive project, lets you explore the contents of Series A1 in an easy and engaging way. But while the A1 Explorer provides new opportunities for discovery, it doesn’t offer the fine-grained analysis we need to sift out the files we’re after. And so… topic modelling.

The process was pretty simple. While I can dip into my bag of screen-scrapers to harvest series directly from the NAA’s RecordSearch database, there was already an XML dump of A1 available from data.gov.au. So I extracted the basic file metadata from the XML and wrote the identifiers and titles out to a text file, one item per line. Following the instructions on the website I then loaded this file into Mallet:

/Applications/Mallet/bin/mallet import-file --input ./A1.txt --output A1.mallet --keep-sequence --remove-stopwords

Then it was just a matter of firing up the topic modeller:

/Applications/Mallet/bin/mallet train-topics --input ./A1.mallet --output-state ./A1.gz --output-doc-topics ./A1-topics.txt --output-topic-keys ./A1-keys.txt --num-topics 40

Again, I just followed the examples on the Mallet site.
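
For anyone wanting to reproduce the preparation step, here is a minimal sketch of turning an XML dump into the one-item-per-line file that Mallet imports. The element names (item, identifier, title) and the sample records are hypothetical stand-ins, not the structure of the real A1 dump, so they would need adjusting to match the actual file.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for the XML dump. The element names and records are
# hypothetical; check the real file and adjust them to match.
SAMPLE_XML = """
<items>
  <item><identifier>1001</identifier><title>Example  title one</title></item>
  <item><identifier>1002</identifier><title>Example title
  two</title></item>
</items>
"""


def xml_to_lines(xml_text):
    """Return one 'identifier title' string per item."""
    root = ET.fromstring(xml_text)
    lines = []
    for item in root.iter("item"):
        identifier = (item.findtext("identifier") or "").strip()
        title = item.findtext("title") or ""
        # Collapse runs of whitespace so each item stays on a single line.
        title = " ".join(title.split())
        if identifier and title:
            lines.append(f"{identifier} {title}")
    return lines


if __name__ == "__main__":
    for line in xml_to_lines(SAMPLE_XML):
        print(line)
```

Redirecting the printed lines into a file such as A1.txt gives the input for the import-file command.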

Once it was finished I opened up A1-keys.txt to browse the ‘topics’ Mallet had found. The results were intriguing. There are a large number of applications for naturalisation in A1, so it’s no surprise that ‘naturalisation’ figures prominently in a number of the topics. What was more interesting was the way Mallet had grouped the naturalisation files. For example:

naturalization christian hans hansen jensen petersen andersen nielsen larsen christensen johannes jens niels pedersen andreas johansen martin jorgensen

and

naturalisation certificate giuseppe salvatore frank la leo samios spina sorbello leonardo fisher natale patane torrisi barbagallo luka rossi ross

Based on the co-occurrence of names within the file titles, Mallet had created groupings that roughly reflected the ethnic origins of applicants. It makes sense when you think about what Mallet is doing, but I still found it pretty amazing.

Mallet also found clusters around the major activities of the department, such as the administration of the territories. But of most interest to us was:

1 0.55539 passport ah student exemption students lee wong chinese young deserter education sing wing chong readmission son hing chin wife

The Chinese names alongside words such as ‘readmission’ and ‘wife’ suggested that this topic revolved around the administration of the White Australia Policy. This was easy to test. In A1-topics.txt was a list of every file in the series and their weightings in relation to each of the topics. I wasn’t sure what was a reasonable cut-off value to use in assessing the weightings, but after a bit of trial and error I fixed on a value of 0.7. I then just extracted the identifiers of every file that had a weighting greater than 0.7 for this topic. I used the identifiers to build a simple web page that Kate and I could browse. I also included links back to RecordSearch so we could explore further.
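
The filtering step can be sketched in a few lines of Python. This is an illustration rather than the script actually used. It assumes each row of A1-topics.txt holds a document number, the file identifier, and then topic/weight pairs. The layout of that file differs between Mallet versions (some write one weight per topic in order instead), so check your own output first. The sample rows are invented.

```python
def files_for_topic(doc_topic_lines, topic, threshold=0.7):
    """Return (identifier, weight) pairs whose weight for `topic` exceeds the threshold.

    Assumes rows like "0 1001 1 0.82 7 0.05": document number, name,
    then alternating topic number and weight. This layout is an assumption.
    """
    matches = []
    for line in doc_topic_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header row
        fields = line.split()
        name, values = fields[1], fields[2:]
        # Walk the values two at a time: topic number, then its weight.
        for topic_id, weight in zip(values[0::2], values[1::2]):
            if int(topic_id) == topic and float(weight) > threshold:
                matches.append((name, float(weight)))
    return matches


# Invented rows in the assumed layout, for illustration only.
SAMPLE_ROWS = [
    "#doc name topic proportion ...",
    "0 1001 1 0.82 7 0.05",
    "1 1002 7 0.91 1 0.03",
    "2 1003 1 0.70 3 0.20",
]

if __name__ == "__main__":
    for identifier, weight in files_for_topic(SAMPLE_ROWS, topic=1):
        print(identifier, weight)
```

Because the comparison is strictly greater than the threshold, a file weighted at exactly 0.7 is left out. Raising or lowering the cut-off trades false positives against missed files.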

Browse the full list

It’s a pretty impressive result. Instead of fumbling with the uncertainties of keyword searches, we now have a list of more than 1,300 files that are clearly of relevance to Invisible Australians. There are a few false positives and there are likely to be other files that we’ll have missed altogether, but now we have a much clearer picture of the types of files that are included and how they are described.

And that was at my first attempt, simply using the default settings. I’m now starting to play around with some of Mallet’s configuration options to see what sort of difference they make. I’m also keen to try out GenSim, a topic modelling package for Python.

I’m really excited about the possibilities of these sorts of tools for analysing the contents of archival descriptive systems, something I mentioned in my Digital Humanities Australasia paper. Much more to come on this I suspect…

It’s all about the stuff: collections, interfaces, power and people

This is the full version of a paper I presented at the National Digital Forum, 30 November 2011.

In 1901, one of the first acts of the Commonwealth of Australia was to create a system of exclusion and control designed to keep the newly-formed nation ‘white’. But White Australia was always a myth. As well as the Indigenous population, there were already many thousands of people classified as ‘non-white’ living in Australia. Most were Chinese, but there were also Japanese, Indians, Syrians and Indonesians.

Here are some of them…

The real face of White Australia

The administration of what became known as the White Australia Policy created a huge volume of records, much of which is still preserved within the National Archives of Australia. These photographs are attached to certificates that non-white residents needed to get back into the country if they decided to travel overseas. There are thousands upon thousands of these certificates in the Archives. Thousands of certificates representing thousands of lives — all monitored and controlled.

But it is too easy to see these people as the powerless victims of a repressive system. There were many acts of resistance. Some argued against the need to be identified ‘just like a criminal’. Others exercised control over their representation, submitting formal studio portraits instead of mug shots.

Most commonly and most powerfully, people resisted the policy simply by going ahead and living rich and productive lives.

My partner, Kate Bagnall, is helping to rewrite Australian-Chinese history by overthrowing the stereotype of the culturally isolated Chinese man living a lonely, meagre existence surrounded by gambling and opium dens. By mining the available records, by reading against the grain of contemporary reports and by working with family historians, Kate is documenting their intimate lives — their wives, their lovers, their families and descendants — the sorts of relationships that sent a shudder through the edifice of White Australia. Power can be reclaimed in many subtle and subversive ways.

‘The real face of White Australia’ is an experiment. It uses facial detection technology to find and extract the photographs from digital copies of the original certificates made available through the National Archives of Australia’s collection database. The photographs you see here come from just one series, ST84/1. There’s no API to the collection so I reverse-engineered the web interface to create a script that would harvest the item metadata and download copies of all the digitised images. There are 2,756 files in this series. On the day I harvested the metadata, 347 of those files had been digitised, comprising 12,502 images. It took a few hours, but I just ran my script and soon I had a copy of all of this in my local database.

Then came the exciting part. Using a facial detection script I found through Google and an open source computer vision library, I started experimenting with ways of extracting the photos. After a few tweaks I had something that worked pretty well, so I pointed my aging laptop at the 12,502 images and watched anxiously as the CPU temperature rose and rose. It took a few emergency cooling measures, but the laptop survived and I had a folder containing 11,170 cropped images. About a third of these weren’t actually faces, but it was easy to manually remove the false positives, leaving 7,247 photos.
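
To give a flavour of what such a script involves, here is a rough sketch. It is not the script described above. It assumes OpenCV (the opencv-python package) as the computer vision library and uses the Haar cascade face detector that ships with it. The padding margin and detector settings are illustrative guesses.

```python
def padded_box(x, y, w, h, image_w, image_h, margin=0.2):
    """Grow a detected face box by `margin` on each side, clipped to the image."""
    dx, dy = int(w * margin), int(h * margin)
    left, top = max(x - dx, 0), max(y - dy, 0)
    right, bottom = min(x + w + dx, image_w), min(y + h + dy, image_h)
    return left, top, right, bottom


def extract_faces(image_path):
    """Return a list of cropped face regions found in one image file."""
    import cv2  # third-party, assumed here: pip install opencv-python

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    image = cv2.imread(image_path)
    if image is None:
        return []  # unreadable or missing file
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    height, width = gray.shape
    crops = []
    # Each detection is a rectangle: left edge, top edge, width, height.
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        left, top, right, bottom = padded_box(x, y, w, h, width, height)
        crops.append(image[top:bottom, left:right])
    return crops
```

A detector like this will happily report stamps, smudges and handwriting as faces, which is why a manual pass to weed out false positives is still needed.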

These photos. These people.

With my database fully primed and loaded it was just a matter of creating a simple web interface using Django for the backend and Isotope (a jQuery plugin) at the front. Both are open source projects. All together, from idea to interface, it took a bit more than a weekend to create, and most of that was waiting for the harvesting and facial detection scripts to complete. It would be silly to say it was easy, but I would say that it wasn’t hard.

What we ended up with was a new way of seeing and understanding the records — not as the remnants of bureaucratic processes, but as windows onto the lives of people. All the faces are linked to copies of the original certificates and back to the collection database of the National Archives. So this is also a finding aid. A finding aid that brings the people to the front.

According to Margaret Hedstrom the archival interface ‘is a site where power is negotiated and exercised’. Whether in a reading room or online, finding aids or collection databases are ‘neither neutral nor transparent’, but the product of ‘conscious design decisions’. We would like to think that this interface gives some power back to the people within the records. Their photographs challenge us to do something, to think something, to feel something. We cannot escape their discomfiting gaze.

But this interface represents another subtle shift in power. We could create it without any explicit assistance or involvement by the National Archives itself. Simply by putting part of the collection online, they provided us with the opportunity to develop a resource that both extends and critiques the existing collection database. Interfaces to cultural heritage collections are no longer controlled solely by cultural heritage institutions.

It’s these two aspects of the power of interfaces that I want to focus on today.

There are a growing number of examples where the records created by repressive or discriminatory regimes have, in Eric Ketelaar’s words, ‘become instruments of empowerment and liberation, salvation and freedom’. Nazi records of assets confiscated during the Holocaust have been used to inform processes of restitution and reparation. Government records have helped members of Australia’s Stolen Generations trace family members. Descendants of inmates incarcerated by American colonial authorities in what was the world’s largest leprosy colony in the Philippines, have embraced the administrative record as an affirmation of their own heritage and survival. Records can find new meanings. Power can be reclaimed.

Technology can help. Tim Hitchcock has described how something as simple as keyword searching can turn archives on their heads. Recordkeeping systems tend to reflect the structures and power relations of the organisations that create them. The ‘hierarchical and institutional nature of most archives’, Hitchcock argues, ‘contains an ideological component which is sucked in with every dust-filled breath’. But digitisation and keyword searching free us from having to follow the well-worn paths of institutional power. We can find people and follow their lives against the flow of bureaucratic convenience. We can gain a wholly new perspective on the workings of society. ‘What changes’, Hitchcock asks, ‘when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as a biographical narrative, rather than as part of an archival system?’

Projects such as Unknown no longer may help us answer that question.

Unknown no longer

It’s aiming to extract the names and biographical details of slaves from the 8 million manuscript documents held by the Virginia Historical Society. The documents include court records, receipts, wills and inventories. Here is a page from the ‘Inventory of Negroes at Berry Plain Plantation, King George County, Virginia’ for 1855, listing names, occupations and valuations.

Tim Hitchcock is one of the directors of London Lives, a project that similarly seeks to find the people in 240,000 manuscript pages documenting the lives of plebeian Londoners in the 18th century.

London Lives

More than three million names have already been extracted from the records of courts, workhouses, hospitals and other institutions. Work is continuing to link these names together, to merge these various shards of identity and piece together the experiences of London’s poorest inhabitants.

Remember me from the US Holocaust Memorial Museum is working with photographs taken by relief agencies in the aftermath of World War Two. The photographs are of displaced children who survived the Holocaust but were separated from families. What happened to them? The project is seeking public help to identify and trace the children.

Remember me

These are all projects about finding people. Finding the oppressed, the vulnerable, the displaced, the marginalized and the poor and giving them their place in history. This is what Kate and I hope to do with Invisible Australians, the broader project of which our faces experiment is part.

Invisible Australians

‘Invisible Australians’ aims to extract more than just photographs. We want to record and aggregate the biographical data contained within the records of the White Australia Policy — to extract the data and rebuild identities.

But we want to do more. We want to link these identities up with other records, with the research of family and local historians, with cemetery registers and family trees, with newspaper articles and databases we don’t even know about yet. We want to find people, families and communities.

It’s ridiculously ambitious and totally unfunded. But it is possible.

The most exciting part of online technology is the power it gives to people to pursue their passions. As with the faces, we don’t need the help of the National Archives. We need the records to be digitized, but that’s happening anyway and we can afford to be patient. Most of the tools we need already exist, and are free. In the past 12 months, for example, there have been a number of open source tools released for crowd-sourced transcription of manuscript records.

People with passions, people with dreams, people who are just annoyed and impatient, don’t have to wait for cultural institutions to create exactly what they need. They can take what’s on offer and change it.

Interfaces can be modified. It is amazingly easy to write a script that will change the way a web page looks and behaves in your browser. I was frustrated by the standard interface to digitized files in the National Archives of Australia’s Recordsearch database — so I changed it.

Before and after

Not only did I make it look a bit nicer, I added new functions. My script lets you print a whole file or a range of pages and display the entire contents of the file on a pretty cool 3d wall.

I’ve shared this script, and a few other Recordsearch enhancements. Anyone can install them with a click and use them.

Wragge Labs Emporium

Interfaces are sites of power and we can claim some of that power for ourselves. Online technologies not only free us from having to brave the physical intimidation of the reading room, they free us up to engage with the records in new ways. The archivist-on-duty would probably not be pleased if I pulled out some scissors and started snipping photos out of certificates. Or if I pulled a file apart and pasted its contents on the wall. But online we are free to experiment.

The power of cultural heritage organisations is perhaps expressed most forcefully in their ability to control the arrangement and description of their collections. ‘Every representation, every model of description, is biased’, note Verne Harris and Wendy Duff, ‘because it reflects a particular world-view and is constructed to meet specific purposes’. Archives, libraries and museums are already starting to share this power, by allowing tagging, or seeking public assistance with description through crowd sourcing projects. But most of these activities still happen within spaces created and curated by the institutions themselves. Our cathedrals of culture might be opening their doors and inviting the public to participate in their ceremonies, but that doesn’t make them bazaars. The architecture still speaks of authority.

In any case, people already have a space where they can explore and enrich collections — it’s called the internet.

It would be great to see cultural institutions doing more to watch, understand and support what people are doing with collections in their own spaces — following them as they pursue their passions, rather than thinking of ways to motivate them.

A quick example… You might have heard of Zotero, an open source project that lets you capture, annotate and organize your research materials.

Zotero

One cool thing about Zotero is that you can build and contribute little screen scrapers, called translators, that let Zotero extract structured data from any old collection database. You might not be surprised to learn that I’ve created a translator for Recordsearch. Another cool thing about Zotero is that you can share the stuff that you collect in public groups.

Invisible Australians Zotero group

Put those two cool things together and what do you have? Well to me they spell out user generated finding aids — parallel collection databases created by researchers simply pursuing their own passions.

Linked Open Data greatly increases opportunities for collection description to leak into the wider web. If objects and documents are identified with a unique URL, then anyone can make and publish statements about them in machine-readable form. These statements can then be aggregated and explored. Initiatives such as the Open Annotation Collaboration will hasten the development of these shared descriptive and interpretative layers around our cultural collections.

And of course all this descriptive and interpretative work can be harvested back to enhance existing collection databases. We could start doing it now — though I will spare you today my rant about the possibilities of mining footnotes.

As well as exploring the possibilities of user-generated content, cultural institutions are starting to open up their collection data for re-use. APIs are great (though Linked Open Data is better), and New Zealand is lucky to have an organisation like DigitalNZ which just gets it. People can and will make cool things with your stuff.

But again, we don’t have to wait for everything to be delivered in a convenient, machine-readable form. If it’s on the web anybody can scrape, harvest and experiment.

You probably all know about the National Library of Australia’s newspaper digitisation project, which is building a magnificent resource. But I wanted to do more than just find articles. I wanted to explore and analyze their content on a large scale. So I built a screen scraper to extract structured data from search results, and then used the scraper to power a series of tools. I have a harvester that lets you download an entire results set (hundreds or thousands of articles) with metadata neatly packaged for further analysis.

Harvester

Or what about a script that graphs the occurrence of search terms over time, and allows you to ask questions like ‘When did the Great War become the First World War?’

When did the Great War become the First World War?

In the end I got a bit carried away and built my own public API to the Trove newspaper database.

Unofficial Trove newspapers API

I think it’s important to note that the tools I developed were guided by the types of questions I wanted to ask. While we should welcome APIs and celebrate their possibilities, we should also remain critical. APIs are interfaces, they too embed power relations. Every API has an argument. What questions do they let us ask? What questions do they prevent us from asking?

Even as we move from the age of lumbering, slow-witted data silos into the rapidly-evolving realms of Linked Open Data, we have to constantly question the models we make of the world. Ontologies and vocabularies are culturally determined and historically specific. Yes, they too are interfaces, complete with their own distributions of power and authority. But we can revisit and change them. And we can relate our new models to our old models, capturing complex, long-term shifts in the way we think about the world. That’s incredibly exciting.

All of this hacking, harvesting, questioning, enriching and meaning-making makes me think about the possibilities of grassroots leadership. Online technologies enable people to take cultural institutions into unexpected realms. They can build their own interfaces, ask their own questions, determine their own needs — they can point the way instead of simply waiting to be served.

You might wonder what the National Library of Australia thinks of my various scrapers and harvesters. I can’t speak for them, but I can say that they’ve awarded me a fellowship to explore further the possibilities of text-mining in their newspaper database.

The idea of grassroots leadership brings me back to the title of this talk: ‘It’s all about the stuff’. It seems to me that we tend to model the interactions between cultural institutions and the public as transactions. The public are ‘clients’, ‘patrons’, ‘users’ or ‘visitors’. But the sorts of things I’ve been talking about today give us a chance to put the collections themselves squarely at the centre of our thoughts and actions. Instead of concentrating on the relationship between the institution and the public, we can focus on the relationship we both have with the collections.

It’s all about the stuff.

It’s all about the respect and responsibility we both have for our collections.

It’s all about the respect and responsibility we both have for people like this.

Every story has a beginning

Entering the web of data

[view the presentation...] [view the triples...]

Keynote delivered at the annual conference of the Australia and New Zealand Society of Indexers, 14 September 2011.


This is me.

Today, Wednesday, 14 September 2011, I’m honoured to be able to join you here in the luxurious surrounds of the Brighton Savoy Hotel for the ‘Indexing See Change’ conference. This is an event, a moment in history; we can pinpoint ourselves, this gathering, both in time and in space.

If we do that, if we move outside the moment and position ourselves on a timeline or a map, interesting things start to happen. Connections emerge.

Here we are at number 150, The Esplanade, in Brighton. A bit over a kilometre away is the stately villa, Kamesburgh. For many years Kamesburgh was also known as the Anzac Hostel — a refuge for permanently-incapacitated World War One veterans.

The Anzac Hostel opened on 5 July 1919. Here it is draped in its patriotic finery, from the collections of the Australian War Memorial. According to the caption, the Anzac Hostel was ‘a home, not an institute’.

Also amongst the War Memorial’s holdings is a wheeled bed that was used at the hostel. This particular bed was apparently occupied by one man, Albert Ward, for forty-three years.

Death notice for Alexander Kelley. Argus, 29 January 1944.

It was probably in a bed just like this that Alexander Dewar Kelley passed away on 27 January 1944. Alexander Kelley was cremated, and his remains interred amongst the roses at what is now called the Springvale Botanical Cemetery. Not far from my own grandparents.

Alexander Kelley spent close to half his life in the Anzac Hostel. Like many young men, he bravely answered his nation’s call to arms, but returned from war much changed. We can follow Alex’s war through his service record, easily-accessible through the website ‘Mapping Our Anzacs’.

Alex was a coach painter who enlisted in the AIF in January 1916. Within a year he was in France. In May 1917 he suffered a gunshot wound to the head, but was able to rejoin his unit in August. Less than a month later though, he was wounded again, this time more severely. For Alex the war was over, and he was shipped back to Australia in May 1918.

‘Mapping Our Anzacs’ includes a scrapbook feature through which visitors to the site can attach notes or photographs to a service record. Amongst the many thousands of postings is a fragment from a diary, found tucked inside the bible of Alexander Kelley’s mother. The diary entry reads simply: ‘Alex arrived from Front. Wet day. Saw him at “Caulfield”.’

Alex had survived and had returned to his family. This was a day to remember. But there was sadness too, for Alex was not the same young man who had left for the battlefields of Europe. In the diary fragment, ‘Caulfield’ is enclosed in inverted commas, indicating perhaps that the reunion took place, not in the suburb, but in the Caulfield rehabilitation hospital. Alexander Kelley was wounded in the face, hands and legs. He was left blind in both eyes and his right leg was amputated. He would live the remainder of his life a little over a kilometre away from here at the Anzac Hostel.

This is just one story. There are over 375,000 World War One service records held by the National Archives of Australia. How can we hope to understand a number like that? How can we hope to imagine the war’s impact on families, on communities?

‘Mapping Our Anzacs’ uses familiar Google maps to display the places of birth and enlistment recorded in many of those service records. But technical limitations make it impossible to display all the places at once. You can, however, take the same data and open it in Google Earth. If you then zoom in on Victoria, you see something like this.

Mapping Our Anzacs data viewed in Google Earth.

Each marker represents a place where a service person was born or enlisted. It’s impossible to read, of course, but that’s the point. There is so little blank space. As you zoom further, more markers appear, more place names resolve. It’s simple, but it’s powerful. They came from everywhere. From the smallest village to the biggest city; nowhere was untouched.

The ‘Mapping Our Anzacs’ scrapbook offers another perspective. It’s possible to extract the images posted to the scrapbook and present them on a 3D wall. Amidst an assortment of memorabilia, there are faces. Not places, or records — this is a wall of people.

Mapping Our Anzacs Scrapbook photos viewed through CoolIris

It’s worth noting too that like the markers on the maps, these faces link back to the actual service records. So they’re not just a new way of seeing the collection, they’re a new way of exploring it.

But the records don’t stand in isolation, they themselves have a context. A couple of years ago, Mitchell Whitelaw from the University of Canberra undertook a project called ‘The Visible Archive’ to investigate ways of visualising the holdings of the National Archives of Australia. Have you ever wondered what 360km worth of records looks like?

The collections of the NAA visualised by Mitchell's Series Browser.

This represents the holdings of the National Archives. Files within the archives are organised into series, and each square in this image represents a single series — there are about 60,000 of them. Naturally the size of the square gives an indication of the size of the series itself. It’s a fascinating and strangely beautiful picture.

It’s easy enough to pick out the World War One service records — Series B2455. In the interactive version of Mitchell’s series browser you can click on a box and display links between series, as well as other series created by the same government agency. Again, it’s not just a way of seeing the collection, but a means of exploring and interpreting it. As Mitchell says:

Visualisation enables us to literally show everything, to display large volumes of data in a way that reveals patterns and communicates context, but also provides access to the fine grain of individual elements.

But we can also employ such techniques to ask new kinds of questions. Can you imagine how Alexander Kelley and the other inhabitants of the Anzac Hostel must have felt in 1939? They had lost so much in the Great War, the ‘war to end all wars’, and yet within their own lifetime it was all happening again. More young men were answering the call, more lives were going to be destroyed.

There must have been a dreadful, disheartening moment when Australians realised that the Great War was not an end, but a beginning — the first in a series of devastating global conflicts. At some point the ‘Great War’ became the ‘First World War’, but when?

When did the 'Great War' become the 'First World War'?


This is one possible answer. This graph draws its data from the 50 million or so digitised newspaper articles in Trove, the National Library of Australia’s discovery service. It shows the proportion of newspaper articles that included the phrase ‘the great war’ compared to the proportion containing ‘the first world war’ (and variations thereof). The lines cross late in 1941. With German victories in Europe and Africa, the opening of the Eastern Front and the Japanese attack on Pearl Harbour, 1941 makes sense.
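
The arithmetic behind a graph like this is simple. Raw counts would mostly reflect how many articles happen to be digitised for each year, so each count is divided by that year’s total before the two phrases are compared. The following is a minimal sketch of that calculation, not the code behind the graph. The counts are made up for illustration and are not Trove figures, and the periods are abstract labels rather than real years.

```python
def proportions(matches_by_period, totals_by_period):
    """Share of each period's articles that contain a phrase."""
    return {
        period: matches_by_period.get(period, 0) / total
        for period, total in totals_by_period.items()
        if total  # skip periods with no articles to avoid dividing by zero
    }


def first_crossover(series_a, series_b):
    """First period in which series_b overtakes series_a, or None."""
    for period in sorted(set(series_a) & set(series_b)):
        if series_b[period] > series_a[period]:
            return period
    return None


# Made-up counts for four abstract periods. These are not real data.
TOTALS = {1: 1000, 2: 1200, 3: 900, 4: 1100}
PHRASE_A = {1: 60, 2: 48, 3: 27, 4: 22}
PHRASE_B = {1: 5, 2: 24, 3: 36, 4: 55}

if __name__ == "__main__":
    a = proportions(PHRASE_A, TOTALS)
    b = proportions(PHRASE_B, TOTALS)
    print(first_crossover(a, b))
```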

What is perhaps more intriguing is the dramatic peak in the occurrence of ‘the great war’ in 1939. It’s no surprise that the looming threat of a new conflict would provoke comment and comparisons, but it does make you wonder about the context of those discussions and how they might have changed as the reality of war edged closer.

To start exploring this I’ve harvested the content of the 6,600 articles from 1939 that included the phrase ‘the great war’. Using an online text analysis service called VoyeurTools I can quickly generate a picture of their contents.

This simple visualisation shows us the relative frequencies of words within the articles. It doesn’t reveal any great mysteries, but it does suggest some possibilities for further prodding. The prevalence of ‘time’ and ‘new’, for example — might these help us understand the shift in perspective from one war to the next? We can follow this up by browsing the different contexts in which the words were used.
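For readers who want to try something similar without an online service, relative word frequencies can be approximated in a few lines of Python. This is only a sketch under simple assumptions: words are runs of letters and straight apostrophes, nothing is filtered out as a stop word, and the function name is hypothetical.

```python
import re
from collections import Counter


def relative_frequencies(text, top=5):
    """Return the `top` most common words with their share of all words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return [(word, count / total) for word, count in counts.most_common(top)]
```

A real analysis would normally drop very common words such as ‘the’ and ‘of’ before counting.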

But what is it that we’re actually searching? We know that Trove includes newspapers from 1803 to 1954, but if we’re really going to analyse shifting words and ideas it’s important to have a clear picture of the sources of those words.

Something like this perhaps. This graph shows the holdings of the Trove newspaper database on 4 August 2011, organised by state. You can see, for example, that if you’re searching on a topic between the 1920s and 1940s you’re likely to get more results from Queensland than anywhere else.

So starting from our location here, today, we can make connections across time and space. We can pull back and look at the big picture, or dive in and examine the fabric of a single life. Through the web we can build and explore a rich and complex contextual network.


It’s an exciting time to be a cultural data hacker. We now have a growing range of tools and technologies available for extracting interesting data from a wide variety of sources, both structured and unstructured.

The ‘Visible Archive’ project started with well-structured data, courtesy of Peter Scott, the developer of the Series System — the descriptive framework used by many Australian archives. But we’re rarely so lucky.

Even when the data starts off in nicely-organised fields in a database there’s no guarantee that that’s how it’s going to be delivered to our web browser. In order to extract the data for my Trove graphs, for example, I had to write a little program called a ‘screen scraper’ to identify and save the important metadata elements from the raw web page itself.
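To give a flavour of what a screen scraper does, here is a minimal sketch using only Python’s standard library. It is not the scraper written for the Trove graphs. The page fragment and its meta tags are invented for the example, and real pages will need their own rules for finding the useful elements.

```python
from html.parser import HTMLParser


class MetaScraper(HTMLParser):
    """Collect name/content pairs from <meta> tags in a page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if "name" in attributes and "content" in attributes:
            self.metadata[attributes["name"]] = attributes["content"]


# A made-up page fragment; real pages will be structured differently.
page = (
    '<html><head>'
    '<meta name="title" content="Example article">'
    '<meta name="date" content="1939-09-04">'
    '</head></html>'
)

scraper = MetaScraper()
scraper.feed(page)
```

After the page has been fed through, `scraper.metadata` holds the values that were buried in the markup.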

Where there are no subject keywords we can infer them using techniques such as topic modelling. Where there are no access points we can identify people, organisations, places and events using special tools developed for named entity extraction. Where there are no common identifiers across datasets we can employ record linkage technologies to find possible connections.

We can count words, we can identify parts of speech, we can formulate a measure of the similarity of any two pieces of text. Once we have some useful data we can manipulate and enrich it. Place names can be geolocated — you simply send your place name off to a web service and get back its latitude and longitude.
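One common way to measure the similarity of two pieces of text is cosine similarity over word counts. The sketch below is a bare-bones Python version, offered as an illustration of the idea rather than a recommendation of any particular tool. The function name is hypothetical and the tokenising is deliberately crude.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts treated as bags of words (0.0 to 1.0)."""
    a = Counter(re.findall(r"\w+", text_a.lower()))
    b = Counter(re.findall(r"\w+", text_b.lower()))
    # Only words that appear in both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Texts that share no words score 0.0, and texts with identical word counts score 1.0 (give or take floating point rounding).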

Increasingly these sorts of tools are becoming accessible to anyone. For historians they offer a means of wrestling with the rapidly-growing bulk of source material that is becoming available in digital form. How do you make use of 5 million digitised books, 50 million newspaper articles or the complete archive of every public message ever sent on Twitter?

The digital historian Dan Cohen has noted:

These computational methods which allow us to find patterns, determine relationships, categorize documents, and extract information from massive corpuses, will form the basis for new tools for research in the humanities and other disciplines in the coming decade.

Dan is involved in a number of interesting projects investigating the possibilities of these techniques — often grouped together under the heading ‘text mining’. One of these projects, ‘With Criminal Intent’, is looking to see what patterns can be drawn out of the digitised proceedings of criminal trials held at the Old Bailey from 1674 to 1913. That’s 197,745 trials, in case you were wondering.

Here’s one of their visualisations showing how the length of trials varies over time. Much to the surprise of the research team, this graph suggests a dramatic shift in legal practice around 1825 — defendants started pleading guilty!

A visualisation by the With Criminal Intent project showing changing trial lengths.

Rather than falter under the growing weight of digital sources, these technologies can actually thrive. The more raw material available, the more chance there is to observe and track new patterns. As digitisation continues apace will we ever reach the point when history can simply be read from a graph?

There are some researchers at Harvard who seem to think that’s where we’re heading. Borrowing liberally from the store of scientific metaphors they have staked out the new field of ‘culturomics’. By mining massive digital resources, like Google’s scanned books, they hope to map the ‘cultural genome’ that would enable us to follow the evolution of language and culture.

But there’s something quite barren in this ambition. I prefer the vision of digital humanist Stephen Ramsay, who commented in regard to the ‘With Criminal Intent’ project:

The Old Bailey, like the Naked City, has eight million stories. Accessing those stories involves understanding trial length, numbers of instances of poisoning, and rates of bigamy. But being stories, they find their more salient expression in the weightier motifs of the human condition: justice, revenge, dishonor, loss, trial. This is what the humanities are about. This is the only reason for an historian to fire up Mathematica or for a student trained in French literature to get into Java.

Ultimately it’s the stories that nourish, anger, inspire and depress us. The closely-packed map of places recorded in World War I service records is so powerful because we know that under each marker are men, women, families, communities — each with their own story. These new technologies offer new perspectives, they raise new questions, and they challenge us with new contexts to explore and understand. But there is still space for stories and perhaps we can use them to give our stories new life and depth.


This is another World War One service record. It belongs to Charlie Allen. Charlie enlisted three times in the AIF and was discharged on medical grounds each time. It seems he had a problem with his ankle.

Charlie’s service record notes a tattoo, proclaiming his love for ‘Maud Gordon’. He married Maud in Sydney in 1917 and had two daughters soon after.

Charlie survived the war without further injury, but was not so lucky in peace. On 11 March 1938, Charlie was crushed to death between two railway cars. The accident happened at the Bunnerong Power Station, only a short distance from his home in Matraville. He was buried nearby in the Botany Cemetery.

We also know quite a bit about Charlie’s early life. Why? Because Charlie’s father was Chinese and he was therefore categorised as a ‘half-caste’, as someone who was not white, and therefore fell under the restrictions imposed by the White Australia Policy.

Charlie was born in Sydney in 1896. His mother was Frances Allen (sometime sweet shop owner and brothel keeper), his father Charlie Gum (a buyer for Wing On company). Charlie was raised by his mother, but in 1909, at the age of 13, he was taken to China by his father.

NAA: ST84/1, 1909/22/41-50

This certificate granted Charlie an exemption to the Dictation Test. Without it, he may not have been allowed back into the country.

Every time one of many thousands of non-Europeans resident in Australia sought to travel overseas and return home again they needed one of these certificates.

Charlie’s father returned to Sydney, leaving him in China. He lived with relatives in the town of Shekki (inland from Hong Kong). Charlie was naturally homesick, but had no means of getting back to Australia. He wrote to his mother in 1910:

Do try and bring me home every minute I think of you and long for a piece of bread and butter this tucker is not doing me well.

His mother wrote to the Prime Minister Billy Hughes in an attempt to enlist government help but to no avail. Charlie finally returned to Australia in 1915.

Despite this experience, Charlie visited China again in 1922 for 7 months, once again carrying papers to grant him re-entry to the country of his birth.

These fragments of Charlie’s life have been assembled by my partner, Kate Bagnall, a historian of Chinese-Australia. They are remarkable, and yet not so, because there are many thousands of stories like Charlie’s contained within the voluminous records generated by the administration of the White Australia Policy.

We’re all of course familiar with the general outlines of the White Australia Policy, and the way it underpinned conceptions of Australia as a nation in the first half of the 20th century.

But what we sometimes forget is that it was also a massive bureaucratic exercise.

Forms and certificates were printed, issued, used and filed. Regulations were modified, guidelines were distributed and administering officers were managed and advised. Individual cases were reviewed, policy was changed and new forms and certificates were printed, issued, used and filed…

Much of this system is now preserved in the National Archives.

You can get an idea of the range of material available from a case study Kate has prepared focusing on the efforts of Poon Gooey, a successful businessman in Horsham, to keep his wife and family in Australia.

If we look again at Charlie’s certificate from 1909 we can see that it contains a lot of interesting structured data:

  • name
  • place of birth
  • age
  • height
  • destination
  • date of departure
  • name of ship
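As a rough illustration of what ‘structured data’ might mean here, the fields listed above could be captured in a simple record. This is a sketch in Python, not the project’s actual data model: the class and field names are hypothetical, and only details mentioned in this post are filled in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CertificateRecord:
    """One certificate, with the fields listed above. All names are hypothetical."""
    name: str
    place_of_birth: Optional[str] = None
    age: Optional[int] = None
    height: Optional[str] = None               # kept as transcribed text
    destination: Optional[str] = None
    date_of_departure: Optional[str] = None
    ship: Optional[str] = None
    archives_reference: Optional[str] = None   # an extra field, added for this sketch


# Only details given in the text are filled in; the rest stay empty.
charlie = CertificateRecord(
    name="Charlie Allen",
    place_of_birth="Sydney",
    age=13,
    destination="China",
    archives_reference="NAA: ST84/1, 1909/22/41-50",
)
```

Leaving unknown fields empty, rather than guessing, matters when the records are later linked to other sources.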

We estimate that there are probably about 50,000 of these forms remaining in the Archives, and then there are case files and a variety of other government documents.

Wouldn’t it be great if we could extract this structured data? If we could piece together the slivers of identity that remain within the Archives and give people back their lives?

This is the dream of Invisible Australians, a project Kate and I are trying to turn into a reality. Our aim is to build systems that will enable this data to be extracted, aggregated, shared and connected — whether to a family tree, a cemetery record, or another document in another archive.

Imagine being able to navigate the network of lives, families and relationships. To follow their journeys, to share their tragedies, to celebrate their small victories against a repressive system.

Imagine being able to watch them age.


We tend to assume that new technologies require us to change, to adapt. But sometimes they can take advantage of our strengths. Mitchell Whitelaw is interested in finding out what happens when you take large cultural datasets and try to ‘show everything’. Such an approach, he suggests, takes advantage of the raw processing power of computers, while giving us space to do what we’re good at — finding patterns, making connections, crafting meanings.

The History Wall tries to create a similar sort of space. The History Wall brings together material from a range of different sources — newspaper articles from Trove, biographies from the Australian Dictionary of Biography, records from a database of NSW convicts, population statistics, collection items from the National Museum of Australia — you can pretty much plug anything in as long as it has a date attached to it.

Irish History Wall

For a particular year, the Wall retrieves a random sample from the available sources, jumbles everything up and then throws it onto the screen. As a result, no two views of the Wall are ever quite the same. This is not a traditional exhibition. There is no curator controlling the content or designing the structure. It’s ephemeral, it’s serendipitous — instead of relying on an authorial voice to smooth over the gaps and transitions, it leaves open the cracks and allows new contexts to seep in and around each item.
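The sample-and-shuffle idea can be sketched in a few lines of Python. This is an illustration of the approach described above rather than the Wall’s own code. The function name, the `per_source` limit and the shape of the data (dictionaries with ‘year’ and ‘title’ keys) are all assumptions made for the example.

```python
import random


def build_wall(sources, year, per_source=3, rng=None):
    """Pick up to `per_source` random items dated `year` from each source,
    then shuffle everything together.

    `sources` maps a source name to a list of dicts with 'year' and 'title' keys.
    """
    rng = rng or random.Random()
    wall = []
    for name, items in sources.items():
        matching = [item for item in items if item["year"] == year]
        chosen = rng.sample(matching, min(per_source, len(matching)))
        wall.extend((name, item["title"]) for item in chosen)
    rng.shuffle(wall)
    return wall
```

Because both the sampling and the ordering are random, calling the function twice for the same year will usually give a different wall.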

As the pioneering digital historian Edward Ayers noted:

even isolated and inert pieces of evidence — a list, a letter, a map, a picture — can assume new and unimagined meanings when placed in juxtaposition with other fragments.

This is not an absence of narrative, but an opportunity for narration. Edward Ayers suggests that we’re actually quite comfortable filling in blanks and untwisting timelines:

Humans, presented with pieces of information about people, put things into the form of a story. They need not be simple stories, for we know how to deal with unexplained lapses of time, flashbacks, and overlapping narratives. We know how to imagine, infer, things happening at the same time in different places. Film and television train all of us at early ages to weave strands of narrative out of intentional (if carefully constructed) confusion and to take pleasure in that weaving.

And so I can show you a death notice, or a certificate and you will take those fragments, those isolated data points and you will construct a story — you will see the person behind them, you will imagine their life. It’s what we do. We’re good at it.

Computers on the other hand will just see data.

In her ode in praise of humanities data, digital humanist Amanda French wonders whether we always need to crunch our data into abstract, pliable forms:

What I wonder is whether instead we can begin with the data, or with a datum, and simply watch for what it may tell us, even if what it tells us is simply a story.

Yes we can. And we should teach computers how to do it as well. Not because we want them to take over. Not because they can necessarily do it faster or better. But because they can help us share, preserve and connect those stories.

Let’s think again about the array of documents that Kate has assembled to piece together the story of Charles Allen. How can you share this sort of material? Typically you’d ‘write it up’. You’d capture the story behind the data and commit it to words. The documents would then become evidence — points of connection between your text and the historical record.

So in order to share the meanings of these documents we remove them from the context of the person’s life and marshal them as allies to proclaim the authenticity of our rendering. Wouldn’t it be better if we could tell the story, but maintain within our texts the direct connections between sources and subject?

What we need is a data framework that sits beneath the text, identifying people, dates and places, and defining relationships between them and our documentary sources. A framework that computers could understand and interpret, so that if they saw something they knew was a placename they could head off and look for other people associated with that place. Instead of just presenting our research we’d be creating a whole series of points of connection, discovery and aggregation.

Sounds a bit far-fetched? Well it’s not. We have it already — it’s called the Semantic Web.

The Semantic Web exposes the structures that are implicit in our web pages and our texts in ways that computers can understand. The Linked Data movement takes the basic ideas of the Semantic Web and turns them into a collaborative activity. You share vocabularies, so that other people (and computers) know when you’re talking about the same sorts of things. You share identifiers, so that other people (and computers) know that you’re talking about a specific person, place, object or whatever.
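At its simplest, this comes down to making statements of the form subject, predicate, object, where the subjects and objects are shared identifiers. The following Python sketch mimics that with plain tuples. It is not real RDF and uses no Linked Data library. Every identifier is invented (example.org is a domain reserved for examples), and the property names stand in for terms that would normally come from a shared vocabulary.

```python
# A tiny, hand-rolled illustration of subject-predicate-object statements.
# Every identifier below is made up for the example.
PERSON = "http://example.org/people/charles-allen"
DOCUMENT = "http://example.org/documents/cedt-1909"
PLACE = "http://example.org/places/sydney"

triples = [
    (PERSON, "type", "Person"),
    (PERSON, "name", "Charles Allen"),
    (PERSON, "birthPlace", PLACE),
    (DOCUMENT, "type", "Document"),
    (DOCUMENT, "about", PERSON),
]


def subjects_linked_to(target, statements):
    """Return every subject that has some statement pointing at `target`."""
    return sorted({s for s, _, o in statements if o == target})
```

Asking for everything linked to the place returns the person, and asking for everything linked to the person returns the document. That is the kind of hop from one identifier to the next that shared identifiers make possible.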

Linked Data is Storytelling 101 for computers. It doesn’t have the full richness, complexity and nuance that we invest in our narratives, but it does at least help computers to fit all the bits together in meaningful ways. And if we talk nice to them, then they can apply their newly-acquired interpretative skills to the things that they’re already good at — like searching, aggregating, or generating the sorts of big pictures that enable us to explore the contexts of our stories.

This is why we’ve always imagined Invisible Australians to be something more than an online database. We want to provide points of connection that other people can build into their own stories. But to do that we have to pay attention to things like vocabulary management and authority control, we have to construct web addresses that are not going to break every time we upgrade our software. We have to think about the sorts of things we’re talking about — not just people, but government agencies, legislation, certificates, and correspondence. How do we describe these entities and what sorts of relationships do they have?

And of course we need to expose all these structures so that we can say, these things are people, these are events, these are places and these are documents.

Or perhaps, to introduce Alexander Kelley.

Or remember Charles Allen.


You might be wondering why we don’t just leave it all to the computers themselves. Didn’t I just talk about all the exciting new tools and techniques that enable us to analyse the structures of texts? Perhaps we should just wait for the Culturomics guys to solve all the problems.

But who defines the problems?

Our postmodern sensibilities encourage a suspicion of neutrality. Labels like ‘the new museology’ or Archives 2.0 reflect an awareness that the way we describe and arrange our collections is itself culturally-determined. It’s not just a matter of what our descriptive systems show, but what they hide.

Tim Hitchcock, another member of the ‘With Criminal Intent’ team, has described how online technologies can change the way we access archives. Instead of being forced to navigate the hierarchical structures that archives impose on records, which in turn tend to reflect the workings of the institutions that created the records, we can directly find the people whose lives were regulated, influenced, shaped or controlled by the policies of those institutions.

Instead of merely hearing ‘the institutional voice… in all its stentorian splendour’, he says, we can listen in to ‘the quieter tones uttered by the individual’.

This reminds us that search boxes, along with other digital tools, themselves embody arguments. There are assumptions built into their code about what is relevant, what is significant, what is necessary.

We can build our own tools of course, and we can critique other people’s algorithms. But what if we just want to collect and share stories?

Linked Data gives us a way to present an alternative to Google’s version of the world. We can argue back against the search engines, defining our own criteria for relevance, and building our own discovery networks.

Changing the way we access resources changes the sorts of stories we can tell. Tim Hitchcock asks:

What happens when institutions and archives are ‘decentred’ in favour of the individual? What changes when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as a biographical narrative, rather than as part of an archival system?

Perhaps the invisible become visible.

Seeing the women and children

I’ve been thinking further about the possibilities of Tim’s wall of faces as a finding aid, as something to help both locate archival documents and to understand their context.

The series we used in our test (ST84/1) was one in which we knew there was a very high percentage of photographs. Each item contains ten certificates, most of which have both a front and profile portrait attached. There is a small amount of other paperwork included in some files, but not a whole lot. We therefore knew what sorts of things we were going to get back.

But what if we apply the same facial detection technology to a series in which we aren’t so sure of the photographic content? Unfortunately, Tim’s current laptop isn’t up to the task of doing all the grunt work (donations, anyone?), but here’s what I reckon might happen when we are able to move on to other series.

With series like SP42/1 and B13, which hold applications for CEDTs and similar records, I know that there are photographs in many, even most, of the personal case files. (B13 is complicated because it also contains other Customs files that don’t relate to individuals and don’t relate to the administration of the Immigration Restriction Act.) Because files might hold applications for a family, or a parent and child/ren, or an uncle and nephew, or siblings, you don’t always know from the item title exactly who the file relates to. Also, those who were Australian born did not necessarily apply for CEDTs since they could travel using their birth certificates as proof of their right to return, meaning that they don’t appear in CEDT series like ST84/1.

It was usual practice, though, to supply photographs of each person who was travelling (whether on a CEDT or not), and so by extracting those photographs, you would be able to get a better impression of who the files related to. Of course, for files that are digitised (or even not) you could go through each one individually (which I’ve done, believe me…), but think how much more fun it would be to scroll through a wall of beautiful faces!

With B13 it would also be useful because there is no separate series of CEDTs; they are mixed in with the application/case files. Facial detection could be a way of extracting the forms themselves from the larger files.

My main research interest is in families, and women, and children – and we know that women are often hidden in archives because of bureaucratic systems which gave priority to the men in their lives. Although there are many White Australia records which relate to individual women and children, they can be lost in files organised and catalogued under the names of husbands and fathers. But scroll through a wall of mostly male faces, and the women and children just leap out at you!

I’m feeling a bit impatient, really, about running SP42/1 and B13 through Tim’s facial detection script. There are so many, so very interesting possibilities.

The real face of White Australia

In October 1911, the Sydney Morning Herald published a short article under the headline, ‘An indignity: photographs and finger-prints’. The article discussed the situation of Charles Yee Wing, a wealthy and respected Sydney businessman, who had asked to be exempted from having to supply his handprint and photograph as part of the process of being issued a CEDT.

Yee Wing had travelled before and was well-known to Customs officials. In this case, the Customs Department was willing to dispense with the necessity of taking his fingerprints, but Yee Wing was still required to provide his photograph. As the Herald wrote:

Mr Wing is a merchant of some standing, held in high esteem by Europeans and Chinese alike, and it was supposed that in his case the notification would be a purely formal business, and that he would not, since everybody who has business relations with the Chinese community knows him, have to go through the process by which the officials identify on their return Chinese domicilied in Australia who have been for trips to their native land.

Yee Wing’s primary objection was that the officials insisted upon photographing him, in various positions, ‘just like a criminal’.

(This photograph of Charles Yee Wing was taken three years earlier in 1908, when he travelled to Fiji where he had business interests. It was the ‘profile’ photograph attached to his CEDT (Certificate Exempting from Dictation Test). NAA: ST84/1, 1908/301-310.)

Today our images are used to identify us in all sorts of situations—passports, drivers licences, student cards, work ID cards, building swipe cards and even online with sites like Twitter or Facebook. We have varying amounts of control over what images of ourselves are used in these contexts—I know that I have a couple of passports with photographs that I would rather had never seen the light of day, and I hope that they aren’t the only images of me that survive for future generations! But we generally accept that these representations of ourselves are necessary. And we certainly don’t think when we head to the post office for a new passport photo that we are being treated ‘like a criminal’. So why did Charles Yee Wing feel that way?

A hundred years ago, few people had formal papers which stated their identity, and the use of photographs on such identity documents was still in its infancy. It wasn’t until World War I, for example, that countries like the United States and Britain developed passports specifically designed with a space for a photograph. But over the second half of the nineteenth century, authorities had begun to use photographs for administrative purposes, particularly as technologies such as the carte de visite made photographs cheaper and more portable.

In Australia, authorities began using photographs in an ad hoc way to assist in the identification of Chinese entering Australia in the 1890s, perhaps even the 1880s, but by far the most common official use of the photograph at this time was in the photographing of criminals. In New South Wales, for instance, the keeping of gaol photograph description books commenced around 1870. Such mug shots were used by police in identifying and keeping track of criminals and, in fact, the close tie between this form of portrait photography and its criminal subjects led some to criticise its use—because it tainted the practice, and art, of photography more generally.

In 2005, the Public Record Office of Victoria (PROV), together with the Golden Dragon Museum in Bendigo, launched what became a popular travelling exhibition, Forgotten Faces: Chinese and the Law. The exhibition presented large reproductions of gaol photographs of Chinese men imprisoned in Victoria between the 1870s and 1900, accompanied by brief biographical sketches drawn mostly from court and prison records. Dr Sophie Couchman, who knows more about photographs of and by Chinese Australians than any other person alive, was critical of the exhibition for ‘deliberately pulling photographs of Chinese prisoners from the wider prison archive’, thereby presenting the Chinese in colonial Victoria as both criminals and powerless victims of government bureaucracy (Couchman 2009, p. 122). Sophie further noted that in doing this, the exhibition obscured the fact that Chinese were being treated in the same way as other residents of Victoria. In 2011, the PROV put a selection of the images from the exhibition in its wiki, encouraging user contributions and plotting the subjects’ place of residence on a Google map.

A wall of faces

As part of our Invisible Australians project, Tim Sherratt has recently been experimenting with facial detection technology to automatically extract and crop photographs from CEDTs. You can read Tim’s discussion of what he’s done over at his blog. After extracting 7,000 photographs from Sydney series ST84/1, about a seventh of which is digitised in RecordSearch, Tim built an interface to display them as an interactive wall of faces. As Tim was putting it all together, I thought of Sophie’s critique of the use of photographs of Chinese people in the Forgotten Faces exhibition and of the way the images had been assembled together in rows as a kind of rogues’ gallery. I also thought of Charles Yee Wing’s comments a hundred years ago about the indignity of having to provide his photograph for a CEDT.

Could the same kind of criticisms be levelled at our wall of faces as at Forgotten Faces? Are we representing our subjects as more than passive victims of a racist bureaucracy? Are we using their images respectfully and decently? Are their images able to be understood by our contemporary audience? And how should we acknowledge the resistance and opposition of people like Charles Yee Wing?

I have been working with the CEDTs and other associated records (the ‘White Australia records’, for want of a better term) for about 12 years. The photographs are a significant part of what keeps me coming back to them—the photographs and the details about real people that are also found in the records. One of the challenges with writing about the early Chinese community in Australia has been to break through particular stereotypes, and one of the most powerful ways of doing this is through close-grained and detailed studies of individual lives. Yet uncovering those lives can be a difficult and time-consuming enterprise, for they were mostly ‘small lives’ which left only a faint trace scattered across the archive. The White Australia records provide an illumination of those lives, and are now widely used by families to uncover important and unknown information about their forebears.

When I began my research, the CEDTs and case files were not described individually in any catalogue or database, and they were certainly not online; the only ‘finding aids’ were the original handwritten indexes. I used to trek out to the archives, order up box after box after box, and look through the files one by one. In some instances I was the first person to have looked at the records for perhaps decades—the descendants of the men and women whose lives are recorded there knew nothing of the treasures the records held. But putting stuff online and allowing it to be discovered can have really meaningful results.

Since I put my PhD online, for instance, I’ve been contacted by a number of people who cite my research as the catalyst for their own journey of discovery into their families’ Chinese pasts—leading them to the White Australia records, which the National Archives has also done a lot of work to make more accessible. As Tim and I would both argue, online technologies and new digital methods really do provide significant and meaningful possibilities in providing access to, and ways of understanding, the lives documented in the White Australia records.

So what of our wall of faces? As Tim has noted, it’s not just an exhibition, it’s a finding aid. To me, this is the key. The wall of faces is another way of seeing into the records and into the lives of the individual men and women, the Australians, who were subject to the indignities of the White Australia Policy. Each image links to a copy of the document it was taken from, which then links to the digitised file in RecordSearch, which then links to other items in the same record series, which then links to other record series created by the same government agency—rich archival context.

But through the Invisible Australians project we also want to provide different links and detail other contexts. For instance, the first experimental version of our wall of faces is based on a small set of records, from Sydney and from the first decade of the 20th century. From this sample, we can see that most of those travelling from Sydney were Chinese men, but there were also non-Chinese and women and family groups. Records from other ports and other decades would produce a different pattern of faces—such as a greater proportion of younger or older people, more women and children, or a different ethnic make-up.

This first effort is certainly not perfect, and we’re already learning from it. We made the decision to leave the images at different sizes, and to widen out the crop so that you can see more than just the person’s face. We hope that this allows for some of the individuality in the images to come through—it’s not so neat perhaps, but maybe it’s also not so prescriptive. As Sophie Couchman has noted, the photographic portraits provided to the authorities by Chinese Australians were far from standardised, and many were studio portraits in which the subjects had a great deal of say in how they were represented. As Sophie has put it, they are ‘not so mug mugshots’. And we want our wall of faces to reflect that.

And now back to Charles Yee Wing

Among the images on our wall are the two portraits of Charles Yee Wing taken before his 1908 trip to Fiji. Those from his 1911 trip, when he made his objections known to both the authorities and the press, aren’t yet digitised. I have done a bit of research into Yee Wing’s family, finding a trove of files about his and his children’s travels over several decades. I don’t think, though, that I had come across this particular CEDT—a typo in the item title means that it doesn’t come up under a keyword search for ‘Yee Wing’. But I did find it browsing through the images in our wall.


The real face of White Australia

In many of the presentations I’ve given in recent times I’ve managed to include a question raised by Tim Hitchcock in his chapter in The Virtual Representation of the Past. Tim asks:

What changes when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as a biographical narrative, rather than as part of an archival system?

The idea of turning archival systems on their head to expose the people rather than the bureaucracy is what motivates Kate Bagnall and me in our attempts to make the Invisible Australians project into a reality.

Invisible Australians aims to liberate the lives of those who suffered under the restrictions of the White Australia Policy from the rich archival holdings of the National Archives of Australia and elsewhere.

We always knew that the portrait photographs, included on a range of government documents, would provide a compelling perspective on these lives, but we weren’t quite sure how we were going to extract them. Up until last weekend, I’d assumed that we’d develop a crowdsourcing tool that contributors would use to mark up the photos.

Now I’m not so sure.

In the space of a couple of days I’ve extracted over 7,000 photographs and built an application to browse them — here is the real face of White Australia

How did I do it? Paul Hagon, at the National Library of Australia, gave a presentation last year in which he explored the possibilities of facial detection in developing access to photographic collections. The idea lodged in my brain somewhere and a few days ago I started to poke around looking to see how practical it might be for Invisible Australians.

It didn’t take long to find a Python script that used the OpenCV library to detect faces in photographs. I tried the script on a few of the NAA documents and was impressed: there were a few false positives, but the faces were being found!

So then the excitement kicked in. I modified the script so that instead of just finding the coordinates of faces it would enlarge the selected area by 50px on each side and then crop the image. This did a great job of extracting the portraits. I tweaked a few of the settings as well to try and reduce the number of false positives. Eventually, I developed a two-pass system that repeated the detection process after the image had been cropped and its contrast adjusted. This seemed to weed out a few more errors. You can find the code on GitHub.
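The padding and two-pass steps described above can be sketched roughly as follows. This is only an illustration, not the script on GitHub: the function names are invented here, and `detect`, `crop` and `enhance` stand in for whatever detector (for example an OpenCV cascade classifier), cropping routine and contrast adjustment are actually used.

```python
from typing import Callable, List, Tuple

# (x, y, width, height): the box layout assumed for the detector's output
Box = Tuple[int, int, int, int]


def expand_box(box: Box, image_size: Tuple[int, int], pad: int = 50) -> Box:
    """Grow a face box by `pad` pixels on every side, clamped to the image edges."""
    x, y, w, h = box
    img_w, img_h = image_size
    left = max(x - pad, 0)
    top = max(y - pad, 0)
    right = min(x + w + pad, img_w)
    bottom = min(y + h + pad, img_h)
    return (left, top, right - left, bottom - top)


def two_pass_boxes(
    image,
    image_size: Tuple[int, int],
    detect: Callable,   # hypothetical: image -> list of Box
    crop: Callable,     # hypothetical: (image, Box) -> cropped image
    enhance: Callable,  # hypothetical: image -> contrast-adjusted image
    pad: int = 50,
) -> List[Box]:
    """Keep only the padded boxes in which a second detection pass still finds a face."""
    kept = []
    for box in detect(image):
        padded = expand_box(box, image_size, pad)
        portrait = enhance(crop(image, padded))
        if detect(portrait):  # second pass on the cropped, adjusted portrait
            kept.append(padded)
    return kept
```

Clamping matters because many of the portraits sit close to the edge of the form, where a fixed 50px margin would otherwise run off the page image.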

Once the script was working I had to assemble the documents. I already had a basic harvester that would retrieve both the file metadata and digitised images for any series in the NAA database. Acting on Kate’s advice, I pointed it at series ST84/1 and downloaded 12,502 page images.

All I then had to do was loop the facial detection script over the images. Simple! The only problem was that my 3-year-old laptop wasn’t quite up to the task. As its CPU temperature rose and rose, I was forced to employ a special high-tech cooling system.

Keeping my laptop alive...

But after running for several hours, my faithful old laptop finally worked its way through all the documents. The result was a directory full of 11,170 cropped images.

The results

There were still quite a lot of false positives and so I simply worked my way through the files, manually deleting the errors. I ended up with 7,247 photos of people. That’s a strike rate of nearly 65% which seems pretty good. The classifier, which does the actual facial detection, was probably trained on conventional photographs rather than on the mixed-format documents I was feeding it.
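For anyone wanting to check the arithmetic, the strike rate is just the share of cropped images that survived the manual cull. The variable names below are made up for the example; the counts are the ones quoted in this post.

```python
pages_harvested = 12502   # page images downloaded from ST84/1
crops_detected = 11170    # cropped images produced by the detection script
portraits_kept = 7247     # crops left after manually deleting the errors

strike_rate = portraits_kept / crops_detected
false_positives = crops_detected - portraits_kept

print(f"{strike_rate:.1%} of crops were real portraits; {false_positives} were discarded")
# prints "64.9% of crops were real portraits; 3923 were discarded"
```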

Then it was just a matter of building a web app to display the portraits. I used Django for the backend work of managing the metadata and delivering the content, while the interface was built using a combination of Isotope, Infinite Scroll and FancyBox.

It’s important to note that the portraits provide a way of exploring the records themselves. If you click on a face you see a copy of the document from which the photo was extracted. A link is provided to examine the full context of the image in RecordSearch. This is not just an exhibition, it’s a finding aid.

What next? There are many more of these documents to be harvested and processed (and many more still yet to be digitised). I will be adding more series as I can (though I might have to wait until I can afford a new computer!). I’d also like to explore the possibilities of facial or object detection a bit more. Could I train my own classifier? Could I detect handprints, or even classify the type of form?

In the meantime, I think our experimental browser helps us to understand why the Invisible Australians project is so important — you look at their faces and you simply want to know more. Who are they? What were their lives like?

UPDATE: For more on the photos and the issues they raise, see Kate Bagnall’s posts over at the Tiger’s Mouth.

Birth certificate registers

In October 1913 Secretary of the Department of External Affairs, Atlee Hunt, sent a circular to the state Customs departments asking if they kept records of Chinese Australians who used their birth certificates as identity papers when travelling overseas.

Queensland already kept such a register, and Hunt felt that:

Such a register is very desirable to enable a check to be kept on persons claiming admission to Australia on birth certificates, as it is an easy matter for a number of copies of the same certificate to be obtained, and the experience of the past shows that in some instances several Chinese have attempted, sometimes successfully, to land on copies of the same certificate. (NAA: A1, 1913/20069)

An example of the early difficulties that both Chinese Australians and government officials had with using birth certificates as identification can be found in the case of Fred Hong See (see NAA: BP342/1, 13021/357/1903). Fred was born in Sydney in 1885 to Chinese parents who, when he was very young, took their son back to China. Fred’s father later died and, in 1903, Fred returned to live with other relatives in Sydney. When he arrived, Customs officer J.T.T. Donohoe doubted his identity and would not allow him to land. Donohoe’s suspicions were based on the fact that Fred could not speak any English and his feeling that Fred looked older than the age stated on the birth certificate he presented.

Fred was quickly sent on his way back to China, and it was only through the threat of legal action by his well-respected relatives in Sydney and their payment of a deposit of £100 that Fred was permitted to stop at Brisbane for re-examination. With evidence provided by Fred’s relatives, the Brisbane Collector of Customs, W.H. Irving, was satisfied that he was, in fact, telling the truth. After Atlee Hunt’s approval, Fred was allowed to stay.

This is the copy of Fred Hong See’s birth certificate that he presented to officials on his return to Australia in 1903. It can be found with other correspondence about the case in NAA: BP342/1, 13021/357/1903.

In the decade after the introduction of the Immigration Restriction Act, the processes for its administration continued to be refined and tightened, primarily to prevent the fraudulent entry of Chinese into Australia. Hunt’s request for the keeping of birth certificate registers came about from a concern that ‘as other channels of fraudulent entry are being blocked, the Chinese will make a determined effort to utilize birth certificates to that end.’

His Customs circular of 1913 set out the details that Customs officers should record to enable correct identification on a person’s return to Australia:

  • name
  • number of birth certificate
  • date of issue
  • date of birth
  • where born
  • date of departure from Australia
  • remarks concerning departure
  • date of return
  • by whom examined, landed or rejected
  • general remarks

The Collectors of Customs responded thus:

  • Victoria reported that it had been keeping a register from the beginning of the year (1913), but without the level of detail requested.
  • New South Wales had not been keeping records, but was now ordering a book for the purpose.
  • Western Australia had no special register, but would immediately open one.
  • South Australia said they had not had any need for a register, as there had been no cases of Chinese being admitted on birth certificates there.
  • Tasmania would begin keeping a record, but had only had four cases to date.
  • And the Northern Territory had been keeping a record of Chinese arriving on birth certificates since 1911.

It became the practice for birth certificates to be endorsed by Customs officials on a person’s departure. This usually included taking a handprint and attaching a photograph, as well as recording the details in a register. Some people also went through the formality of applying for a CEDT.

The two remaining registers

To my knowledge, only two of the birth certificate registers still exist, those for Queensland and New South Wales. The Queensland register is held in the Brisbane office of the National Archives, and a digital copy is available through RecordSearch:

The first volume, of 16 double pages, has suffered flood damage and can be difficult to read in parts. The second volume, which has 23 double pages, is much more legible. A sample page from the second volume is shown below – this is a left-hand page, with the remainder of the details about each person completed on the corresponding right-hand page.

The single register for New South Wales, held in the National Archives’ Sydney office, is more substantial than those for Queensland, demonstrating the greater amount of travel that occurred from Sydney. The register contains around 150 double pages and includes an alphabetical index at the front. The entries date from 1904 to 1962; those before 1913 were presumably copied from records elsewhere. It is also fragile and difficult to read in places, but it has recently also been digitised and made available through RecordSearch:

The page reproduced below is a left-hand page, with further details about the travels of each person available on the corresponding right-hand page.

Making use of the registers

These registers are valuable sources of information about Chinese Australian families in Queensland and New South Wales, and can provide missing pieces of information for people who did not apply for CEDTs when they travelled overseas (which many Australian-born Chinese did not).

Having them digitised is great, especially for those of us who can’t easily get to the Brisbane or Sydney reading rooms – but what would be even more useful is if the information contained in the registers was in a form that could be searched and sorted. I’m working on a bigger project relating to Chinese families in New South Wales, based around a database of information sourced from marriage and birth records up to 1918. I’m part-way into transcribing relevant details from the published BDM indexes with 1000 entries (out of an estimated 3000–4000) in the database so far!

The information found in the birth certificate registers obviously relates very strongly to this, so I have another crazy plan to also transcribe the information held in the Sydney register. It’s not going to be a quick job – and it’s one that could easily be shared since the New South Wales register is online. So, if you happen to have some spare time and don’t mind deciphering old handwriting, I’d love to hear from you!
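To give a sense of what a searchable, sortable version of a register might look like, here is a minimal sketch of one transcribed entry. The fields follow the columns set out in Hunt’s 1913 circular; the class and function names are assumptions made for the example, not part of any existing database.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RegisterEntry:
    """One transcribed row of a birth certificate register (illustrative only)."""
    name: str
    certificate_number: str
    date_of_issue: Optional[date] = None
    date_of_birth: Optional[date] = None
    where_born: str = ""
    date_of_departure: Optional[date] = None
    departure_remarks: str = ""
    date_of_return: Optional[date] = None
    examined_by: str = ""       # by whom examined, landed or rejected
    general_remarks: str = ""


def search_by_name(entries: List[RegisterEntry], text: str) -> List[RegisterEntry]:
    """Case-insensitive substring search on the name column."""
    needle = text.casefold()
    return [e for e in entries if needle in e.name.casefold()]


def sort_by_departure(entries: List[RegisterEntry]) -> List[RegisterEntry]:
    """Order entries by departure date, with undated entries at the end."""
    return sorted(
        entries,
        key=lambda e: (e.date_of_departure is None, e.date_of_departure or date.min),
    )
```

Most fields are optional because the registers are damaged or hard to read in places, so a transcriber needs to be able to leave a column blank rather than guess.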

Liberating lives: invisible Australians and biographical networks

Presented at the Life of Information Symposium, 24 September 2010.
Slides are available on Slideshare.

Charlie Allen's palm print
This palm print belongs to a 12-year-old boy called Charlie Allen.

Charlie was born in Sydney in 1896.

His mother was Frances Allen (sometime sweet shop owner and brothel keeper), his father Charlie Gum (a buyer for Wing On company).

Charlie was raised by his mother, but in 1909, at the age of 13, he was taken to China by his father.

His father returned to Sydney, leaving Charlie in China. He lived with relatives in the town of Shekki (inland from Hong Kong) for 6 years.

Charlie was homesick, but had no means of getting back to Australia. His mother attempted to enlist government help but to no avail. Charlie finally returned in 1915.

The following year he enlisted in the First AIF (well, actually he enlisted three times, and was discharged as medically unfit each time).

Charlie married in Sydney in 1917 and had two daughters soon after. He returned to China in 1922 for 7 months.

Charlie Allen died in 1938 as the result of an industrial accident. He was 41.

How do we know all this about Charlie Allen?

We know this because there are fragments of Charlie’s life scattered throughout the holdings of the National Archives of Australia.

The CEDT from 1909 when he left Australia with his father:

Charles Allen 1909 - CEDT front

NAA: ST84/1, 1909/22/41-50


A letter from his mother to Prime Minister Billy Hughes, seeking help to return Charlie to Australia:
Letter to Billy Hughes from Charlie's mother.

NAA: A1, 1911/13854


His WWI service record:
Charles Allen's WWI attestation form

NAA: B2455, ALLEN C A


An identity form relating to his trip to China in 1922:

NAA: SP42/1, C1922/4449


But of course Charlie is not alone in the archives.

Because Charlie’s father was Chinese, Charlie was categorised as a ‘half-caste’, as someone who was not white, and so fell under the restrictions imposed by the White Australia Policy.

The certificate from 1909 granted Charlie an exemption to the Dictation Test. Without it, he may not have been allowed back into the country.

Every time one of many thousands of non-Europeans resident in Australia sought to travel overseas and return home again they needed one of these certificates.

We’re all of course familiar with the general outlines of the White Australia Policy, and the way it underpinned conceptions of Australia as a nation in the first half of the 20th century.

But what we sometimes forget is that it was also a massive bureaucratic exercise.

Forms and certificates were printed, issued, used and filed. Regulations were modified, guidelines were distributed and administering officers were managed and advised. Individual cases were reviewed, policy was changed and new forms and certificates were printed, issued, used and filed…

For example, between 1901 and 1911, 400 circulars were issued to port officers about immigration restriction. The confidential manual on immigration restriction grew from one page in 1902 to more than 200 in 1912.

Much of this system is now preserved in the National Archives.

For the years between 1902 and 1948 there remain:

  • More than 50,000 CEDTs
  • 90 shelf metres of records
  • 15,000 case files

And within those many thousands of files are the scattered fragments of lives such as Charlie’s — lives that were controlled, monitored and documented in a vain attempt to make Australia ‘white’.

We’ve already seen today some wonderful examples of how these fragments, these slivers of existence, can be found, extracted, aggregated and displayed. But I think it’s worth considering for a moment what happens when we do this.

The historian Tim Hitchcock, behind projects such as the Old Bailey Online and London Lives, has reflected on the impact of digitisation on our access to archives. Archives, he notes, tend to reflect the assumptions and practices of the institutions that created them.

But by providing new ways into these records systems, technology can undermine the power relations that persist within their structures.

‘What changes’, asks Tim Hitchcock, ‘when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as a biographical narrative, rather than as part of an archival system?’

I don’t know, but I think we should find out, don’t you?

**********

I hope you’ve all collected a mini card. These themselves provide a little glimpse at the real face of White Australia and I’d invite you all to head over to the National Archives website, do battle with the monster that is RecordSearch, and look up the file references that are on each card.

The cards are part of a project that Kate Bagnall and I are trying to develop — Invisible Australians.

I should note too that the cards, and most of the examples I’m showing you here today, are the product of Kate’s long and detailed research into Chinese-Australian families. In modern project management parlance, Kate is the domain expert, while I am merely the technical resource.

If we look again at one of the CEDTs, we can see that there’s a lot of useful structured data:

  • name
  • place of birth
  • age
  • height
  • destination
  • date of departure
  • name of ship

Invisible Australians has the modest aim of extracting this data from the 50,000+ forms in the National Archives. But of course that’s just the start, because each person might have used a number of certificates — so then it’s a matter of matching these identities.
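As a rough illustration of the matching problem, the sketch below groups certificates that might belong to the same person. It assumes a deliberately naive rule (same name after tidying up case and spacing, and an implied birth year within a couple of years); the class, the functions and the tolerance are all invented for the example, and real matching would need far more care with variant spellings and unreliable ages.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Certificate:
    """A few of the structured fields found on a CEDT (illustrative subset)."""
    name: str
    place_of_birth: str
    age: int
    departure_year: int
    ship: str

    @property
    def birth_year(self) -> int:
        # Only an estimate: ages on the forms are approximate
        return self.departure_year - self.age


def normalise(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.casefold().split())


def group_candidates(certs: List[Certificate], tolerance: int = 2) -> List[List[Certificate]]:
    """Group certificates sharing a normalised name and a similar implied birth year."""
    by_name: Dict[str, List[Certificate]] = defaultdict(list)
    for cert in certs:
        by_name[normalise(cert.name)].append(cert)

    groups: List[List[Certificate]] = []
    for same_name in by_name.values():
        ordered = sorted(same_name, key=lambda c: c.birth_year)
        current = [ordered[0]]
        for cert in ordered[1:]:
            if cert.birth_year - current[-1].birth_year <= tolerance:
                current.append(cert)
            else:
                groups.append(current)
                current = [cert]
        groups.append(current)
    return groups
```

Each group is only a list of candidates for a human to confirm or reject.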

Invisible Australians

http://invisibleaustralians.org

And then there are a range of other related forms, not to mention case files, alien registration documents, naturalisation applications…

Obviously we can’t do it alone. We’ll be creating a crowdsourcing tool to extract and link the data.

It’s ridiculously ambitious, totally unfunded and is likely to take over our lives.

Is it worth it?

Imagine being able to navigate the network of lives, families and relationships. To follow their journeys, to share their tragedies, to celebrate their small victories against a repressive system.

Imagine being able to watch them age.

Pauline Ah Hee and Shadee Khan

Is it worth it? We think so.

**********

For Tim Hitchcock technology opens up the possibility of writing a new history from below, exploring how the poor, the marginalised and the powerless navigated the institutions of the modern state. But it’s not just about search engines and databases. He talks about making ‘best use of the technology of emotions and representation — how you use words and pictures and a story to impact, not just on what people think, but what they see in their mind’s eye’.

In this project, the photos matter. I hope the irony in our project title is obvious.

Some of the faces of Invisible Australia

This is the real face of White Australia.

The photos remind us that the project is not just about shifting data around — these are lives, these are people.

But this brings its own challenge, for if we are seeking to liberate these lives from the fragmentation and obscurity of bureaucratic systems then we should be asking what are we liberating them into?

A database?

This is not just an exercise in data creation and management. We also have to think carefully and creatively about issues of representation, access and discovery.

We have to give these lives back their freedom to associate, to have relationships, to make connections.

We need to embed these lives in a variety of contexts and combinations. To make room for serendipity, celebration, sadness, and yes, even play.

We need to bring these lives into a rich and ongoing conversation with the world.

But how?

**********

I’ve been working on a little experiment for the National Museum of Australia called The History Wall. What the History Wall does is quite simple: it pulls together data on the fly from a variety of sources including People Australia, the Australian Dictionary of Biography, the National Library’s newspapers project, historical population data from the Bureau of Statistics, photos from the Flickr accounts of the Powerhouse Museum and the National Archives, and the collection database of the National Museum itself. It chooses randomly from all this stuff, throws the results up into the air and then displays them however they happen to fall. No two views are ever quite the same.

The History Wall

http://defining.net.au/wall/

It’s something more than a timeline. To me it’s more like a celebration of context and serendipity. There’s a richness to it, a sense of discovery and fun, but there’s also fragility — next time you look it might be gone.

It’s a bit like history itself.

It’s a bit like the world.

How do we create spaces for our data to merge and mingle? How do we encourage the development of new contexts and connections?

I think the first thing we have to do is stop thinking about databases and dictionaries, registers and encyclopaedias. Don’t get me wrong, I’m not being critical of the wonderful projects we’ve seen today. I just think we can use all this work better if we stop thinking about individual resources and start developing on a web scale, on a global scale.

Yes, we have the technology. Time today has spared you from a detailed discourse on the Semantic Web, but I do want to focus on one aspect.

You may have heard of Linked Data. It’s a set of guidelines to help you publish your data to the Semantic Web. There are only four basic principles and I’m only going to talk about one of them. It’s one of those deceptively simple things. You look at it and think, ‘yeah, ok’, but before too long it’s starting to turn your brain inside out.

Use URLs to identify things in the real world.

Yeah, ok…

You know what URLs are, web addresses, the things you type in your browser’s location field.

And hopefully you know what things in the real world are: people, places, objects, events, ideas…

Now you may have detected a problem here, because no matter how many times you click the refresh button, your web browser is not going to be able to use such a URL to magically deliver you the real world thing.

Well, unless you’re on eBay.

Fortunately, the Linked Data guidelines provide for a bit of technical trickery that allows your browser to retrieve not the real world thing, but some information about that thing, perhaps in the form of a web page.

Why bother?

Names are powerful.

We share and use names to talk about things. Computers are the same. If we use URLs to identify things in the real world, then computers can start talking about them.

We can define and explore real-world relationships in an online environment. We can create rich, meaningful linkages across databases, across disciplines, across the world.

We can start building and thinking on a web scale.

**********

Thanks to the People Australia project, I can confidently claim that this is me:

http://nla.gov.au/nla.party-479364#foaf:Person

I keep meaning to get it on a t-shirt.

The most exciting thing about People Australia is not the EAC records or the aggregation of resources — it’s the identifiers, because they enable us to say things about people anywhere on the web that computers can understand and relate back to a specific real world entity — a person.
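One common form of the ‘technical trickery’ mentioned earlier is the hash URI, and the identifier above looks like an example of it. A browser never sends the part after the `#` to the server, so the full URL can name the person while the part before the `#` is the address of a document about them. The short sketch below just splits the identifier to show the two halves; it makes no request and assumes nothing about how People Australia actually serves its records.

```python
from urllib.parse import urldefrag

# The identifier quoted above: it names a person, not a web page
person_uri = "http://nla.gov.au/nla.party-479364#foaf:Person"

# urldefrag separates what a browser would request from the fragment it keeps to itself
document_url, fragment = urldefrag(person_uri)

print(document_url)  # http://nla.gov.au/nla.party-479364
print(fragment)      # foaf:Person
```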

You can start doing it now with Wragge’s Identity Browser.

Wragge's Identity Browser

http://wraggelabs.com/identities/

This is a little tool I built using the People Australia API. It makes it easy to find identifiers for people and organisations, and it supplies you with some code that you can drop into a blog post or web page that will tell a computer that a name relates to a thing called a ‘person’, that this person’s name has a certain standard form, and that this person can be uniquely identified by People Australia.

Even if you don’t publish a website or a blog, you can use People Australia identifiers to build semantic linkages. Wragge’s Identity Browser also creates machine tags for you. Machine tags are like normal tags but with built in semantics. When coupled with identifiers they enable you to do some pretty powerful things.

You could for example use machine tags in Flickr to tell computers that a certain photo depicts a person uniquely identified by People Australia. In fact, people have been doing just that.
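Flickr machine tags take the form namespace:predicate=value. The sketch below builds and takes apart a tag of that shape. The namespace `example` and predicate `person` are placeholders chosen for illustration; they are not necessarily the ones Wragge’s Identity Browser generates.

```python
from typing import Tuple


def make_machine_tag(namespace: str, predicate: str, value: str) -> str:
    """Assemble a machine tag in the namespace:predicate=value form."""
    return f"{namespace}:{predicate}={value}"


def parse_machine_tag(tag: str) -> Tuple[str, str, str]:
    """Split a machine tag into (namespace, predicate, value)."""
    # Split on the first '=' only, so values that are URLs survive intact
    head, _, value = tag.partition("=")
    namespace, _, predicate = head.partition(":")
    if not (namespace and predicate and value):
        raise ValueError(f"not a machine tag: {tag!r}")
    return namespace, predicate, value


# Hypothetical namespace and predicate, with the identifier quoted earlier as the value
tag = make_machine_tag("example", "person", "http://nla.gov.au/nla.party-479364")
```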

Flickr Machine Tag Challenge

http://wraggelabs.com/fmtc/

The Flickr Machine Tag Challenge is a sort of scoreboard that I built to encourage people to start adding People Australia enriched machine tags to photos. More than 1200 tags have been added to over 1000 photos. Feel free to join in!

The point is that the technologies already exist to enable us to build web scale biographical resources. Not dictionaries or databases as we know them, but networks capable of constant expansion, elaboration, and cooperation.

What we need are more tools to make it simple, recipes to make it obvious, examples and applications to make it popular, and leadership to make it all seem possible.

**********

Of course most of the lives we hope to liberate through Invisible Australians will not be represented in People Australia.

Not yet.

But Invisible Australians will offer a point of aggregation and disambiguation that will enable our people to find their way from the bureaucratic recesses of the White Australia Policy to a place on the national stage.

And we will encourage others to do likewise. Basil can’t do all the work. The centralised system has to be fed through centres of aggregation and collaboration.

Similarly, there are many great resources already out there relating to Chinese-Australians. There are hordes of family and local historians compiling and publishing biographical data. We want to identify people in these resources and link to them.

We want to publish up to People Australia and link down to a single headstone in a lonely country cemetery.

But to do this we need to help people make their resources linkable. To help them create persistent, re-usable URLs, and expose their data in standard formats. To create Linked Data, even if they have no particular interest in the Semantic Web.

Invisible Australians

http://invisibleaustralians.org/

Invisible Australians is not just about extracting data from archives. It’s also about working with others to build capacities and demonstrate possibilities.

It’s ridiculously ambitious, totally unfunded and is likely to take over our lives.

Is it worth it?

We think so.

Form 21(i): Certificate of Domicile, 1902

This is the first in a series of five posts that looks at the different iterations of Form 21 over the first decade of the 20th century. Form 21 is better known as a Certificate of Domicile or Certificate Exempting from Dictation Test (CEDT), but there is something reassuringly bureaucratic in it having a number. There is something practical in it too, because there were a bevy of other forms as well (32, 22, 19, 9 etc), including the confusion-causing Certificate of Exemption (Form 2, which was a temporary entry permit rather than a re-entry permit).

I have located what I’m fairly confident are the first examples of each variation of Form 21 between 1902, when the Immigration Restriction Act came into effect, and 1908. After that things settled down a bit and the form remained more or less the same over the following decades. My examples are taken from New South Wales.

You can see these examples and others in my Invisible Australians library in Zotero.

Certificate of Domicile for Ah Shooey

The first Certificate of Domicile issued in New South Wales would have been numbered 02/1 – ‘02’ being the year 1902 and ‘1’ being the certificate number. There is a volume of certificates from 1902 in NAA: SP11/6, Box 3 (more about this in an earlier post), and my guess is that the first Certificate of Domicile is probably to be found there. Unfortunately it’s not digitised and I’m not in Sydney, so we’ll have to leave confirmation of that ’til a later time.

The first Certificate of Domicile that I can include here is, therefore, from a year later. It was the first Certificate of Domicile issued in New South Wales in 1903 (no. 03/1) and is the first certificate to be found in series NAA: ST84/1, ‘Certificates of Domicile and Certificates of Exemption from Dictation Test, chronological series’. (Here’s a link to the record item it is held in: NAA: ST84/1, 1903/1-10 – the whole item is digitised.)
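The numbering scheme on these early certificates (two-digit year, slash, running number) is simple enough to split apart in code. The sketch below assumes every two-digit year falls in the 1900s, which fits the examples here but has not been checked against later certificates; the function name is made up for the example.

```python
import re
from typing import Tuple


def parse_certificate_number(number: str, century: int = 1900) -> Tuple[int, int]:
    """Split a number such as '03/1' into (year of issue, certificate number)."""
    match = re.fullmatch(r"(\d{2})/(\d+)", number.strip())
    if match is None:
        raise ValueError(f"unrecognised certificate number: {number!r}")
    # Assumption: the two-digit year belongs to the given century
    return century + int(match.group(1)), int(match.group(2))


print(parse_certificate_number("03/1"))  # (1903, 1)
```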

The certificate was issued in the name of Ah Shooey, a 47-year-old Chinese man from Canton, who was departing Sydney for China on the Kasuga Maru on 1 January 1903. The certificate notes that Ah Shooey has one son, who is accompanying him. This is presumably 22-year-old labourer Louey Back Keong, whose certificate is no. 03/2.

Two copies of the form were completed; the one pictured above includes the word ‘Duplicate’ handwritten in red on the front. This copy was kept on file in Sydney, while the other copy (also found in NAA: ST84/1, 1903/1-10) would have been given to Ah Shooey to use during his travels, before being collected and filed on his return. Details of Ah Shooey’s arrival were also marked on the used certificate (‘Landed Empire 27/05/05’).

Ah Shooey’s form records the following information:

Duplicate

No. 03/1

COMMONWEALTH OF AUSTRALIA
Immigration Restriction Act 1901 and Regulations.

CERTIFICATE OF DOMICILE

I, Nicholas Lockyer Collector of Customs at the port of Sydney New South Wales in the said Commonwealth, hereby certify that Ah Shooey, hereinafter described, has satisfied me that he is domiciled in the Commonwealth, and is leaving the Commonwealth temporarily.

[Signature of Nicholas Lockyer] Collector of Customs
Date 31st December 1902

DESCRIPTION

Nationality Chinese
Birthplace Canton
Age 47 years
Complexion
Height 5ft 5 1/2 inch in Boots
Hair Turning grey
Build Stout
Eyes Brown
Particular marks Nail on little finger left hand missing. Top of third finger on right hand off from first joint.

(For impression of hand, see back of this document.)

Family One son
Where resident Accompanying
Date of arrival in Australia Year 1877
Place of residence in Australia Deniliquin
Occupation Storekeeper
Property Value £400 Deniliquin

Date of departure 1st January 1903
Destination China
Ship Kasuga Maru

References in Australia (names and addresses) Police Magistrate Deniliquin. A Fordham Deniliquin. C Hitchin Jerilderie.

Form No. 21.

On the reverse, the form includes the words ‘Impression of Left Hand’ and Ah Shooey’s handprint.

Reverse of Certificate of Domicile for Ah Shooey, 1903. NAA: ST84/1, 1903/1-10

Collecting CEDT applications and certificates

The administration of the Immigration Restriction Act was overseen by the Department of External Affairs, but the day-to-day work was undertaken by the state-based Collector of Customs/Department of Customs & Excise.

The Collectors of Customs had been responsible for administering colonial immigration restriction laws, and each had their own systems in place when the new federal legislation was implemented from 1902. Atlee Hunt, Secretary of the Department of External Affairs for the first two decades of the 20th century, set about ensuring that officials in each state implemented federal policy consistently, issuing a book of published guidelines as well as dozens of circulars that kept Customs officials up-to-date on decisions made by the Department.

The chap pictured below is WH Barkley, who was the New South Wales Collector of Customs between 1914 and 1933. His signature can be seen on hundreds of CEDTs issued in Sydney during that period.

Anyway, the different recordkeeping systems used by the state Collectors of Customs mean that each state/territory now has a different set of records of CEDT applications and certificates.

To me, the system in Sydney seems pretty nicely organised – basically there is one series with correspondence files containing the applications (Form 22), another series that holds copies of the CEDTs that were issued in Sydney (Form 21), another that has the duplicate CEDTs (and other papers including Form 32s) of people arriving back into Sydney. (Okay, it’s really more complicated than that, but let’s not confuse things too much.)

Things are also very tidily done in Darwin (although on a much smaller scale), with all the paper work filed in the one file – the application (Form 22), the CEDT (Form 21), the return authorisation form (Form 32) as well as any other correspondence.

This post is an attempt to document what CEDT applications and certificates exist for each state, what series they are in, and whether they’re available online through RecordSearch. My list also includes registers of applications, as well as records that were created under colonial legislation.

NOTE: Although I’ve done a lot of research using the Sydney records in the flesh, most of what I know about records in the other states is based on what can be found in RecordSearch and in the National Archives’ guide to Chinese records. There will, therefore, be gaps! Any contributions of local knowledge gratefully accepted (especially Tasmania and South Australia).

NOTE TOO: These are the ‘main’ series with CEDT applications and certificates. There are other odd series that also include CEDT stuff that I haven’t included.

New South Wales

Applications: SP11/26

Series number: SP11/26
Series name: Applications for Certificates of Domicile
Dates: 1902
Contents: Applications for certificates of domicile. Included are references, statutory declarations, submissions, and the Minister’s decision.
Location: Sydney
Shelf metres: 0.18 m

Number of items listed in RecordSearch: 27 (100% of series)
Number of items digitised in RecordSearch: 0
Item title: Includes person’s name, so can be searched by name.
Item title example: William Ah Bow, application for a certificate of domicile [7 pages and 4 photographs]

Link to series description in RecordSearch: NAA: SP11/26
Link to example item in RecordSearch: NAA: SP11/26, A1

Applications: SP42/1

Series number: SP42/1
Series name: Correspondence of the Collector of Customs relating to Immigration Restriction and Passports
Dates: c.1898–1948
Contents: Correspondence files, varying in size from a few to dozens of pages, mostly concerning one person or family group. Because this series stretches over several decades, the contents vary a bit. Most later files include Form 22.
Location: Sydney
Shelf metres: 119.79 m

Number of items listed in RecordSearch: 6531 (% of series unknown, but probably a significant proportion)
Number of items digitised in RecordSearch: 722 (as of 29 July 2010)
Item title: Generally includes personal name of subject/s, so can be searched by name.
Item title example: Ah Sun [includes 2 photographs showing front and side views] [box 106]

Link to series description in RecordSearch: NAA: SP42/1
Link to example item in RecordSearch: NAA: SP42/1, C1917/4159

Certificates: SP115/10

Series number: SP115/10
Series name: Certificates Exempting from the provisions of ‘The Influx of Chinese Restriction Act 1881’
Dates: 1884–88
Contents: Includes about 450 exemption certificates issued under the NSW 1881 Act and 2 certificates and documents relating to the 1861 Act. The certificates include scant information about the applicants themselves, giving their name, date of issue of the certificate and period of exemption. There may be handwritten annotations on the front and back, some in Chinese, which provide more personal information such as occupation, age and height.
Location: Sydney
Shelf metres: 0.72 m

Number of items listed in RecordSearch: 1 (Whole series item)
Number of items digitised in RecordSearch: 0
Item title: 1 item only. Certificates are not listed as individual items.
Item title example: Certificates Exempting from the provisions of ‘The Influx of Chinese Restriction Act 1881’

Link to series description in RecordSearch: NAA: SP115/10
Link to example item in RecordSearch: NAA: SP115/10, WHOLE SERIES

Certificates: ST84/1

Series number: ST84/1
Series name: Certificates of Domicile and Certificates of Exemption from Dictation Test, chronological series
Dates: c.1903–53
Contents: Certificates of Domicile and CEDTs (Form 21). Each item includes a bundle with the certificates of about 10 people. There may be used duplicate copies of the certificates and other material including Form 32.
Location: Sydney
Shelf metres: 49.14 m

Number of items listed in RecordSearch: 2754 (probably 100% of series)
Number of items digitised in RecordSearch: 344 (as of 29 July 2010)
Item title: Includes the names of certificate holders, so can be searched by name.
Item title example: Jong Say, Wong Kwong, Lee You Wing, Foo Gun, Mar Kum, Gock Buck, Ah Get, Jeong Keong, Percy Zuinn and Ah Yum [Certificate Exempting from Dictation Test - includes left hand impression and photographs] [box 122]

Link to series description in RecordSearch: NAA: ST84/1
Link to example item in RecordSearch: NAA: ST84/1, 1908/11/31-40

Used certificates: SP115/1

Series number: SP115/1
Series name: Folders containing Certificates of Exemption and related papers for passengers arriving in Australia by ship, chronological series
Dates: c.1911–43
Contents: CEDTs (Form 21) and other identity documents (such as birth certificates) of people arriving into Sydney, as well as completed Form 32s which document why they were exempted from the Immigration Restriction Act. Each item contains the documents of multiple people.
Location: Sydney
Shelf metres: 24.84 m

Number of items listed in RecordSearch: 1433 (probably about 80% of series – items from 1911–14 are not listed in RecordSearch)
Number of items digitised in RecordSearch: 6 (it seems that for most of these the whole item has not been copied) (as of 29 July 2010)
Item title: Gives the name of the ship and the date of its arrival. Does not include people’s names.
Item title example: EASTERN 20/12/1922 [part 3] [Certificates of Exemption for passengers; includes photographs] [2.5cm]

Link to series description in RecordSearch: NAA: SP115/1
Link to example item in RecordSearch: NAA: SP115/1, BOX 18

Used certificates: SP11/6

Series number: SP11/6
Series name: Certificates of Exemption from Dictation Test (Forms 32 and 21)
Dates: 1902–46
Contents: Documents held in this series are, for the most part, similar to those held in SP115/1. The files contain copies of Form 32 and CEDTs (Form 21) or other identity documents of Chinese arriving into Sydney from overseas.
Location: Sydney
Shelf metres: 1.62 m

Number of items listed in RecordSearch: 100 (100% of series)
Number of items digitised in RecordSearch: 0
Item title: Gives the name of the ship and the date of its arrival. Does not include people’s names.
Item title example: Certificate Exempting From Dictation Test Immigration Act 1901-1925: Chinese passengers per SS Tango Maru Sydney 11/10/26 [Box 2]

Link to series description in RecordSearch: NAA: SP11/6
Link to example item in RecordSearch: NAA: SP11/6, NN

Register of applications: SP726/1

NOTE: This series does not contain application forms and certificates like the others listed. It is included here, however, as it provides a full record of the CEDTs issued in Sydney.

Series number: SP726/1
Series name: Register of Applications for Certificate of Exemption Dictation Tests
Dates: 1902–59
Contents: 6 volumes listing details of people who applied for CEDTs in Sydney. Registers list details such as name, certificate and file numbers and dates of travel. The registers have a name index at the front.
Location: Sydney
Shelf metres: 0.9 m

Number of items listed in RecordSearch: 6 (100% of series)
Number of items digitised in RecordSearch: 0
Item title: Description of register and date range
Item title example: Register of names relating to exemption from Dictation Tests (1902-1910)

Link to series description in RecordSearch: NAA: SP726/1
Link to example item in RecordSearch: NAA: SP726/1, BOOK 1

Victoria

Applications & certificates: B13

Series number: B13
Series name: General and classified correspondence, annual single number series
Dates: From 1902
Contents: Correspondence files of the Department of Customs & Excise/Department of Trade & Customs, concerning a range of Customs matters including immigration restriction. Because of culling, most files before the 1930s relate to immigration restriction. Files can include applications, supporting correspondence, photographs and certificates.
Location: Melbourne
Shelf metres: 104.08 m

Number of items listed in RecordSearch: 20,120 (100% of series)
Number of items digitised in RecordSearch: 131
Item title: Case files include person’s name, so can be searched by name.
Item title example: Ah Lipp – application for Certificate of Exemption from Dictation Test

Link to series description in RecordSearch: NAA: B13
Link to example item in RecordSearch: NAA: B13, 1908/4495

Register of applications: B6003

NOTE: This series does not contain application forms and certificates like the others listed. It is included here, however, as it provides a record of the CEDTs issued in Melbourne.

Series number: B6003
Series name: Registers of Certificates Exempting from the Dictation Test (Departures), Melbourne
Dates: 1904–59
Contents: Three volumes of registers recording details of people departing Melbourne on CEDTs, noting the following details: Vic. no., CEDT Book no., C&E file no., date of issue, name, age, nationality, occupation, address, period of residence in the Commonwealth, departure – date and vessel and port, return – date and vessel and port, examined by, remarks. The registers date 1904–14, 1915–33 and 1934–59.
Location: Melbourne
Shelf metres: 0.72 m

Number of items listed in RecordSearch: 3 (100% of series)
Number of items digitised in RecordSearch: 0
Item title: All 3 items have the same item title
Item title example: Register of Certificates Exempting from the Dictation Test (Departures), Melbourne

Link to series description in RecordSearch: NAA: B6003
Link to example item in RecordSearch: NAA: B6003, NN

Queensland

Certificates: J2481

Series number: J2481
Series name: Proclamations under The Chinese Immigration Restriction Act 1888 & related correspondence, annual single number series
Dates: 1897–1902
Contents: Proclamations issued during the years 1897–1902 exempting persons named from the provisions of the Chinese Immigration Restriction Act 1888 for a period of two years from the date of departure from Australia. They are in a standard form with photographs and personal details.
Location: Brisbane
Shelf metres: 1.8 m

Number of items listed in RecordSearch: 858 (100% of series)
Number of items digitised in RecordSearch: 858
Item title: Includes person’s name, so can be searched by name.
Item title example: Foo Lang

Link to series description in RecordSearch: NAA: J2481
Link to example item in RecordSearch: NAA: J2481, 1899/298

Certificates: J2482

Series number: J2482
Series name: Certificates of Domicile issued under The Immigration Restriction Act 1901 and Regulations, annual single number series
Dates: 1902–06
Contents: Certificates of Domicile (Form 21)
Location: Brisbane
Shelf metres: 1.8 m

Number of items listed in RecordSearch: 799 (100% of series)
Number of items digitised in RecordSearch: 798
Item title: Includes person’s name, place of residence and birthplace, so can be searched by name
Item title example: Ah Tong of Redlynch near Cairns, Qld – birthplace: Canton, China – departed Cairns, Queensland on the Changsha 27 July 1904

Link to series description in RecordSearch: NAA: J2482
Link to example item in RecordSearch: NAA: J2482, 1903/99

Certificates: J2483

Series number: J2483
Series name: Certificates Exempting from Dictation Test [CEDT] issued under “The Immigration Restriction Acts 1901-1905” and Regulations (and amending legislation), two number series
Dates: 1908–56
Contents: CEDTs (Form 21) and Form 32s. Each item contains one certificate (and duplicate) and one Form 32.
Location: Brisbane
Shelf metres: 30.6 m

Number of items listed in RecordSearch: 14,429 (100% of series)
Number of items digitised in RecordSearch: 203
Item title: Includes person’s name, nationality and birthplace, so can be searched by name.
Item title example: Certificate Exempting from Dictation Test (CEDT) – Name: Margaret Chun Tie [also known as Margaret Choy Larn] – Nationality: Chinese [Australian born] – Birthplace: Croydon

Link to series description in RecordSearch: NAA: J2483
Link to example item in RecordSearch: NAA: J2483, 18/9

Applications: J3115

Series number: J3115
Series name: Alien Immigration files relating to applications for Certificate of Domicile, Certificates of Exemption from the Chinese Immigration Restriction Act 1888 and Certificates of Exemption from the Dictation Test that includes photographs, birth certificates and other historical documents, imposed single number series
Dates: 1899–1928
Contents: Applications for Certificates of Domicile and some for CEDTs, as well as applications under earlier colonial legislation, so the contents of the files are not consistent.
Location: Brisbane
Shelf metres: 2.17 m

Number of items listed in RecordSearch: 161 (100% of series)
Number of items digitised in RecordSearch: 62
Item title: Includes person’s name and where they live, so can be searched by name.
Item title example: Certificate of Domicile for Young Chin, a storekeeper from Cairns – includes photographs

Link to series description in RecordSearch: NAA: J3115
Link to example item in RecordSearch: NAA: J3115, 25

Registers: BP343/15

Series number: BP343/15
Series name: Registers of aliens departing from the Port of Townsville who were granted a certificate exempting from dictation test [CEDT]
Dates: 1916–55
Contents: Details of aliens leaving the Commonwealth via the Port of Townsville for a temporary period who had been granted a CEDT. The vast majority of records contain a name, description, nationality, birthplace, right handprint, place and date fee paid, warrant number, date of departure and name of ship, date of return and name of ship, and number of CEDT. Most also contain 2 photographs, showing full face and profile.
Location: Brisbane
Shelf metres: 5.22 m

Number of items listed in RecordSearch: 2566 (100% of series)
Number of items digitised in RecordSearch: 17
Item title: Includes person’s name, place of residence, nationality and birthplace, so can be searched by name.
Item title example: Name: Willie Mar (of Richmond) – Nationality: Chinese – Birthplace: Canton – Certificate of Exemption from the Dictation Test (CEDT) number: 336A/87

Link to series description in RecordSearch: NAA: BP343/15
Link to example item in RecordSearch: NAA: BP343/15, 13/872

Western Australia

Applications: PP4/2

Series number: PP4/2
Series name: Applications for CEDTs with supporting documents, annual single number series
Dates: c.1915–41
Contents: Applications for CEDTs, accompanied by references, photographs of the applicant, and reports by the police and customs officials regarding the character etc of the applicant. Includes Form 22s.
Location: Perth
Shelf metres: 5.22 m

Number of items listed in RecordSearch: 611 (100% of series)
Number of items digitised in RecordSearch: 3
Item title: Includes the name of the person and their ethnicity (Japanese, Chinese etc), so can be searched by name.
Item title example: Quong Leong SET [Chinese] [Application for certificate of exemption from dictation test]

Link to series description in RecordSearch: NAA: PP4/2
Link to example item in RecordSearch: NAA: PP4/2, 1931/94

Applications: PP6/1

NOTE: This series has one of the best series descriptions that I have ever seen in RecordSearch.

Series number: PP6/1
Series name: Correspondence files [subject and client], annual single number series with ‘H’ infix
Dates: 1926–50
Contents: Immigration correspondence files, including those concerning applications for CEDTs. The series also documents other immigration functions such as temporary admissions and naturalisation. Only a small proportion of files in the series concern Chinese, Japanese etc.
Location: Perth
Shelf metres: 36.54 m

Number of items listed in RecordSearch: 6005 (100% of series)
Number of items digitised in RecordSearch: 58
Item title: Includes the name of the applicant and what the file was about, so can be searched by name.
Item title example: Yick YOU [Application for Certificate of Exemption of Dictation Test]

Link to series description in RecordSearch: NAA: PP6/1
Link to example item in RecordSearch: NAA: PP6/1, 1927/H/325

Certificates: K1145

Series number: K1145
Series name: Certificates of Exemption from Dictation Test, annual certificate number order
Dates: c.1901–45
Contents: Contains CEDTs (Form 21) arranged in certificate number order commencing at one (1) each year.
Location: Perth
Shelf metres: 6.84 m

Number of items listed in RecordSearch: 4787 (100% of series)
Number of items digitised in RecordSearch: 24
Item title: Includes person’s name and ethnicity, so can be searched by name.
Item title example: Ah Kett [Chinese]

Link to series description in RecordSearch: NAA: K1145
Link to example item in RecordSearch: NAA: K1145, 1918/137

Northern Territory

Applications and certificates: E752

Series number: E752
Series name: Certificate Exempting from Dictation Test
Dates: 1905–41 (most date from 1915 and after)
Contents: Applications for CEDTs (and one Certificate of Domicile), CEDTs and correspondence. The series includes Form 21s (CEDTs) and Form 32s, which were completed on return to Australia.
Location: Darwin
Shelf metres: 4.5 m

Number of items listed in RecordSearch: 720 (100% of series)
Number of items digitised in RecordSearch: 715
Item title: Includes the name of the applicant, so can be searched by name.
Item title example: [Certificate of Exemption from Dictation Test - Fong Yan]

Link to series description in RecordSearch: NAA: E752
Link to example item in RecordSearch: NAA: E752, 1917/11

South Australia

Register of applications: D2860

Series number: D2860
Series name: Immigration Restriction Act exemption certificate register
Dates: 1902–57
Contents: A register and alphabetical index of CEDTs and related matters. Includes a chronological record of departures from various Australian ports of holders of CEDTs showing date of issue, certificate number, person to whom issued (full name), date of departure, ship (and port if other than Adelaide), certifying officer, correspondence reference number, and number of previous certificate (if any). There are corresponding details for the certificate holder’s return to Australia as follows: date, ship, certifying officer, remarks. The volume is divided into other sections including birth certificates, applications for CEDTs refused, lapsed applications for CEDTs and CEDTs issued in other states to applicants departing from Port Adelaide.
Location: Sydney (a copy is held in Adelaide)
Shelf metres: 0.81 m

Number of items listed in RecordSearch: 1 (100% of series)
Number of items digitised in RecordSearch: 0
Item title: 1 item only
Item title example: Immigration Restriction Act exemption certificate register

Link to series description in RecordSearch: NAA: D2860
Link to example item in RecordSearch: NAA: D2860, WHOLE SERIES

Book butts: D5036

Series number: D5036
Series name: Certificates exempting from dictation test (CEDT) book butts (forms 21)
Dates: 1902–59
Contents: 2 volumes. Comprises book butts of CEDTs (Form 21). The butts include provision for certificate number, name (sometimes showing address, when and where born, occupation and other details), nationality, date of issue to Sub-Collector (of Customs), date of issue to holder and payment of fee. In some cases where certificates have not been issued, the record is cancelled and 2 copies of the certificate remain attached to the butt.
Location: Sydney
Shelf metres: 0.9 m

Number of items listed in RecordSearch: 1 (100% of series)
Number of items digitised in RecordSearch: 0
Item title: 1 item only
Item title example: Certificates exempting from dictation test (CEDT) book butts (form 21)

Link to series description in RecordSearch: NAA: D5036
Link to example item in RecordSearch: NAA: D5036, WHOLE SERIES

Applications: D596

Series number: D596
Series name: Correspondence files, annual single number series
Dates: c.1902–1930s
Contents: Correspondence files of the Collector of Customs, including a small number (fewer than 100) concerning applications for CEDTs.
Location: Adelaide
Shelf metres: 53.91 m

Number of items listed in RecordSearch: 11,390 (100% of series)
Number of items digitised in RecordSearch: 63
Item title: Relevant file titles include the person’s name, so can be searched by name; they also generally include the words ‘exemption’ or ‘certificate’
Item title example: Abdul KHALICK – Certificate of Exemption from Dictation Test

Link to series description in RecordSearch: NAA: D596
Link to example item in RecordSearch: NAA: D596, 1919/4386

Tasmania

NOTE: The 2 series listed here appear to hold the only remaining Customs records in Hobart relating to the issuing of CEDTs. The Department of External Affairs series A1 held in Canberra contains material relating to Tasmanian Chinese, and it is possible that Melbourne records do too.

Book butts: P526

NOTE: From the series description in RecordSearch, it would seem that this series contains book butts of CEDTs (Form 21) issued in Hobart. There appear to be no remaining copies of the certificates themselves. I’m happy to be corrected on this if someone knows better.

Series number: P526
Series name: Immigration permit butts (form 21) issued to foreign nationals at Launceston and Burnie outports
Dates: 1908–18
Contents: Volumes containing butts of immigration permits issued to foreign nationals wanting to enter Launceston and Burnie outports. The butts include information on the person’s name, nationality and date of issue. They were issued in Hobart.
Location: Hobart
Shelf metres: 0.06 m

Number of items listed in RecordSearch: 2 (100% of series)
Number of items digitised in RecordSearch: 0
Item title: The 2 items have the same title
Item title example: Australian Customs Service, Tasmania – butts of immigration permit certificates issued

Link to series description in RecordSearch: NAA: P526
Link to example item in RecordSearch: NAA: P526, CUST47

Applications: P437

Series number: P437
Series name: Correspondence Files, Annual Single Number Series
Dates: From 1909
Contents: Correspondence files of the Collector of Customs, including files on immigration matters such as applications for CEDTs. Most items in the series do not relate to CEDT applications, however.
Location: Hobart
Shelf metres: 94.68 m

Number of items listed in RecordSearch: 4959 (100% of series)
Number of items digitised in RecordSearch: 1
Item title: Relevant items have person’s name in the title, so can be searched by name. It appears item titles are taken directly from the files’ original titles as they are not consistent.
Item title example: Gi Hung – Statutory Declaration re Immigration Restriction Acts. – visit China 36 months.

Link to series description in RecordSearch: NAA: P437
Link to example item in RecordSearch: NAA: P437, 1911/291
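
Looking back over the whole list, the listed and digitised counts can be turned into rough percentages to show how much of each series is actually online. What follows is only a minimal sketch in Python. The names SERIES_COUNTS, percent_digitised and rank_by_coverage are my own, made up for illustration, and the figures are simply copied from the entries above for a handful of series that are fully listed in RecordSearch. They will date quickly as more items are digitised.

```python
# Listed and digitised item counts, copied from the series entries in this post.
# Format: series number -> (items listed, items digitised).
SERIES_COUNTS = {
    "J2481": (858, 858),
    "J2482": (799, 798),
    "E752": (720, 715),
    "B13": (20120, 131),
    "J2483": (14429, 203),
    "K1145": (4787, 24),
}


def percent_digitised(listed, digitised):
    """Return digitised items as a percentage of listed items (one decimal place)."""
    if listed == 0:
        return 0.0
    return round(100 * digitised / listed, 1)


def rank_by_coverage(counts):
    """Return (series, percentage) pairs, most fully digitised series first."""
    rows = [
        (series, percent_digitised(listed, digitised))
        for series, (listed, digitised) in counts.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for series, pct in rank_by_coverage(SERIES_COUNTS):
    print(f"{series}: {pct}% of listed items digitised")
```

The percentages are of items listed in RecordSearch, not of the physical series, so this comparison only makes sense where the listing covers the whole series.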