Topic modelling in the archives

There seems to be a lot of topic modelling going on at the moment. And why not? Projects like Mining the Dispatch are demonstrating the possibilities. Tools like Mallet are making it easy. And generous DHers like Ted Underwood and Scott Weingart are doing a great job explaining what it is and how it works.

I’ve talked briefly about using topic modelling to explore digitised newspapers, something that the Mapping Texts project has also been investigating. But I’ve also been following with interest Chad Black’s use of algorithmic techniques, including topic modelling, to look for local variations amidst the legal system of the early modern Spanish empire.

As part of the Invisible Australians project, Kate and I are exploring the bureaucracy of the White Australia Policy. In particular, we’re interested in the interaction between policy and practice, between the highly-centralised bureaucracy and the activities of individual port officials. Like Chad, we’re interested in mapping local variations — to try and understand the bureaucracy from the point of view of an individual forced to live within its restrictions.

I recently gave a presentation about the project at Digital Humanities Australasia (post coming soon!), and in preparation I decided to try a few topic modelling experiments. They were very simple, but I was impressed by the possibilities for exploring archival systems.

The problem I started with was this. The workings of the White Australia Policy are well documented by records held by the National Archives of Australia. Some series within the archives are specifically related to the operations of the policy — such as those containing many thousands of CEDTs. But there are also general correspondence series created by the customs offices in each state, as well as the Commonwealth Department of External Affairs which administered the Immigration Restriction Act (responsibility was later taken by the Department of Home and Territories and its successors). These general correspondence series are important, because they often include details of difficult or controversial cases — those that required a policy judgment, or prompted a change in existing practices. But how do you find relevant files within series that can contain large numbers of items?

Series A1, for example, is a correspondence series created by the Department of External Affairs. It contains more than 60,000 items. Past research tells us that amongst these 60,000 files are records of important policy discussions relating to White Australia. But these files tend to be labelled with the names of the people involved, so unless you know the names in advance they can be difficult to find.

Mitchell Whitelaw’s A1 Explorer, part of the Visible Archive project, lets you explore the contents of Series A1 in an easy and engaging way. But while the A1 Explorer provides new opportunities for discovery, it doesn’t offer the fine-grained analysis we need to sift out the files we’re after. And so… topic modelling.

The process was pretty simple. While I can dip into my bag of screen-scrapers to harvest series directly from the NAA’s RecordSearch database, there was already an XML dump of A1 available. So I extracted the basic file metadata from the XML and wrote the identifiers and titles out to a text file, one item per line. Following the instructions on the Mallet website, I then loaded this file into Mallet:

/Applications/Mallet/bin/mallet import-file --input ./A1.txt --output A1.mallet --keep-sequence --remove-stopwords
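For the input file itself, the extraction step might look something like the Python sketch below. The post doesn't describe the structure of the XML dump, so the element names (`item`, `identifier`, `title`) and the sample records are placeholders made up for illustration, not the real schema or real files.

```python
import xml.etree.ElementTree as ET


def items_to_lines(xml_text, item_tag="item", id_tag="identifier", title_tag="title"):
    """Return one 'identifier<TAB>title' line per item found in the XML.

    The tag names are hypothetical; adjust them to match the actual dump.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for item in root.iter(item_tag):
        identifier = (item.findtext(id_tag) or "").strip()
        # Collapse internal whitespace so each item stays on a single line.
        title = " ".join((item.findtext(title_tag) or "").split())
        if identifier and title:
            lines.append(f"{identifier}\t{title}")
    return lines


# Made-up sample records, only to show the shape of the output.
sample = """<series id="A1">
  <item><identifier>0000/1</identifier><title>Example title one</title></item>
  <item><identifier>0000/2</identifier><title>Example
      title two</title></item>
</series>"""

for line in items_to_lines(sample):
    print(line)
```

Writing the returned lines to `A1.txt` would give the one-item-per-line file that the import command reads.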

Then it was just a matter of firing up the topic modeller:

/Applications/Mallet/bin/mallet train-topics --input ./A1.mallet --output-state ./A1.gz --output-doc-topics ./A1-topics.txt --output-topic-keys ./A1-keys.txt --num-topics 40

Again, I just followed the examples on the Mallet site.

Once it was finished I opened up A1-keys.txt to browse the ‘topics’ Mallet had found. The results were intriguing. There are a large number of applications for naturalisation in A1, so it’s no surprise that ‘naturalisation’ figures prominently in a number of the topics. What was more interesting was the way Mallet had grouped the naturalisation files. For example:

naturalization christian hans hansen jensen petersen andersen nielsen larsen christensen johannes jens niels pedersen andreas johansen martin jorgensen


naturalisation certificate giuseppe salvatore frank la leo samios spina sorbello leonardo fisher natale patane torrisi barbagallo luka rossi ross

Based on the co-occurrence of names within the file titles, Mallet had created groupings that roughly reflected the ethnic origins of applicants. It makes sense when you think about what Mallet is doing, but I still found it pretty amazing.

Mallet also found clusters around the major activities of the department, such as the administration of the territories. But of most interest to us was:

1 0.55539 passport ah student exemption students lee wong chinese young deserter education sing wing chong readmission son hing chin wife

The Chinese names alongside words such as ‘readmission’ and ‘wife’ suggested that this topic revolved around the administration of the White Australia Policy. This was easy to test. In A1-topics.txt was a list of every file in the series and their weightings in relation to each of the topics. I wasn’t sure what was a reasonable cut-off value to use in assessing the weightings, but after a bit of trial and error I fixed on a value of 0.7. I then just extracted the identifiers of every file that had a weighting greater than 0.7 for this topic. I used the identifiers to build a simple web page that Kate and I could browse. I also included links back to RecordSearch so we could explore further.
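Filtering the weightings could be sketched along these lines in Python. It assumes each line of A1-topics.txt holds a document number, the document name, and then alternating topic/weight pairs, which is my reading of the output from Mallet versions of that time; other versions may write one column per topic instead. The sample lines are invented.

```python
def files_above_threshold(lines, topic, threshold=0.7):
    """Return (name, weight) for documents whose weight for `topic` exceeds `threshold`.

    Assumed line format: doc-number, doc-name, then topic/weight pairs.
    """
    matches = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header comment
        fields = line.split()
        name, rest = fields[1], fields[2:]
        # Pair up alternating topic ids and weights.
        for topic_id, weight in zip(rest[0::2], rest[1::2]):
            if int(topic_id) == topic and float(weight) > threshold:
                matches.append((name, float(weight)))
    return matches


# Invented sample in the assumed format.
sample = [
    "#doc name topic proportion ...",
    "0 0000/1 1 0.82 5 0.10",
    "1 0000/2 3 0.90 1 0.05",
]

for name, weight in files_above_threshold(sample, topic=1):
    print(name, weight)
```

With a real file, `open("A1-topics.txt")` can be passed in place of `sample`, and the returned identifiers used to build the list of links.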

Browse the full list

It’s a pretty impressive result. Instead of fumbling with the uncertainties of keyword searches, we now have a list of more than 1,300 files that are clearly of relevance to Invisible Australians. There are a few false positives and there are likely to be other files that we’ll have missed altogether, but now we have a much clearer picture of the types of files that are included and how they are described.

And that was at my first attempt, simply using the default settings. I’m now starting to play around with some of Mallet’s configuration options to see what sort of difference they make. I’m also keen to try out GenSim, a topic modelling package for Python.

I’m really excited about the possibilities of these sorts of tools for analysing the contents of archival descriptive systems, something I mentioned in my Digital Humanities Australasia paper. Much more to come on this I suspect…

It’s all about the stuff: collections, interfaces, power and people

This is the full version of a paper I presented at the National Digital Forum, 30 November 2011.

In 1901, one of the first acts of the Commonwealth of Australia was to create a system of exclusion and control designed to keep the newly-formed nation ‘white’. But White Australia was always a myth. As well as the Indigenous population, there were already many thousands of people classified as ‘non-white’ living in Australia — most were Chinese, but there were also Japanese, Indians, Syrians and Indonesians.

Here are some of them…

The real face of White Australia

The administration of what became known as the White Australia Policy created a huge volume of records, much of which is still preserved within the National Archives of Australia. These photographs are attached to certificates that non-white residents needed to get back into the country if they decided to travel overseas. There are thousands upon thousands of these certificates in the Archives. Thousands of certificates representing thousands of lives — all monitored and controlled.

But it is too easy to see these people as the powerless victims of a repressive system. There were many acts of resistance. Some argued against the need to be identified ‘just like a criminal’. Others exercised control over their representation, submitting formal studio portraits instead of mug shots.

Most commonly and most powerfully, people resisted the policy simply by going ahead and living rich and productive lives.

My partner, Kate Bagnall, is helping to rewrite Australian-Chinese history by overthrowing the stereotype of the culturally isolated Chinese man living a lonely, meagre existence surrounded by gambling and opium dens. By mining the available records, by reading against the grain of contemporary reports and by working with family historians, Kate is documenting their intimate lives — their wives, their lovers, their families and descendants — the sorts of relationships that sent a shudder through the edifice of White Australia. Power can be reclaimed in many subtle and subversive ways.

‘The real face of White Australia’ is an experiment. It uses facial detection technology to find and extract the photographs from digital copies of the original certificates made available through the National Archives of Australia’s collection database. The photographs you see here come from just one series, ST84/1. There’s no API to the collection so I reverse-engineered the web interface to create a script that would harvest the item metadata and download copies of all the digitised images. There are 2,756 files in this series. On the day I harvested the metadata, 347 of those files had been digitised, comprising 12,502 images. It took a few hours, but I just ran my script and soon I had a copy of all of this in my local database.

Then came the exciting part. Using a facial detection script I found through Google and an open source computer vision library, I started experimenting with ways of extracting the photos. After a few tweaks I had something that worked pretty well, so I pointed my aging laptop at the 12,502 images and watched anxiously as the CPU temperature rose and rose. It took a few emergency cooling measures, but the laptop survived and I had a folder containing 11,170 cropped images. About a third of these weren’t actually faces, but it was easy to manually remove the false positives, leaving 7,247 photos.
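A rough idea of what such an extraction step can look like, as a Python sketch. The talk doesn't name the script or the library, so OpenCV, its bundled Haar cascade, the padding value and the detection parameters here are all my assumptions rather than the settings actually used.

```python
def padded_box(x, y, w, h, img_w, img_h, pad=0.2):
    """Expand a detection box by `pad` on each side, clamped to the image edges."""
    dx, dy = int(w * pad), int(h * pad)
    left, top = max(0, x - dx), max(0, y - dy)
    right, bottom = min(img_w, x + w + dx), min(img_h, y + h + dy)
    return left, top, right, bottom


def extract_faces(image_path, out_prefix):
    """Detect faces in one image and save each crop; returns the number found."""
    import cv2  # assumed library; imported here so padded_box works without it

    image = cv2.imread(image_path)
    if image is None:
        return 0  # unreadable or missing file
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    # These parameters are illustrative guesses and would need tuning.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    img_h, img_w = gray.shape
    for i, (x, y, w, h) in enumerate(faces):
        left, top, right, bottom = padded_box(x, y, w, h, img_w, img_h)
        cv2.imwrite(f"{out_prefix}-{i}.jpg", image[top:bottom, left:right])
    return len(faces)
```

A detector like this will return boxes that aren't faces at all, which is consistent with the manual weeding of false positives described above.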

These photos. These people.

With my database fully primed and loaded it was just a matter of creating a simple web interface using Django for the backend and Isotope (a jQuery plugin) at the front. Both are open source projects. All together, from idea to interface, it took a bit more than a weekend to create, and most of that was waiting for the harvesting and facial detection scripts to complete. It would be silly to say it was easy, but I would say that it wasn’t hard.

What we ended up with was a new way of seeing and understanding the records — not as the remnants of bureaucratic processes, but as windows onto the lives of people. All the faces are linked to copies of the original certificates and back to the collection database of the National Archives. So this is also a finding aid. A finding aid that brings the people to the front.

According to Margaret Hedstrom the archival interface ‘is a site where power is negotiated and exercised’. Whether in a reading room or online, finding aids or collection databases are ‘neither neutral nor transparent’, but the product of ‘conscious design decisions’. We would like to think that this interface gives some power back to the people within the records. Their photographs challenge us to do something, to think something, to feel something. We cannot escape their discomfiting gaze.

But this interface represents another subtle shift in power. We could create it without any explicit assistance or involvement by the National Archives itself. Simply by putting part of the collection online, they provided us with the opportunity to develop a resource that both extends and critiques the existing collection database. Interfaces to cultural heritage collections are no longer controlled solely by cultural heritage institutions.

It’s these two aspects of the power of interfaces that I want to focus on today.

There are a growing number of examples where the records created by repressive or discriminatory regimes have, in Eric Ketelaar’s words, ‘become instruments of empowerment and liberation, salvation and freedom’. Nazi records of assets confiscated during the Holocaust have been used to inform processes of restitution and reparation. Government records have helped members of Australia’s Stolen Generations trace family members. Descendants of inmates incarcerated by American colonial authorities in what was the world’s largest leprosy colony in the Philippines, have embraced the administrative record as an affirmation of their own heritage and survival. Records can find new meanings. Power can be reclaimed.

Technology can help. Tim Hitchcock has described how something as simple as keyword searching can turn archives on their heads. Recordkeeping systems tend to reflect the structures and power relations of the organisations that create them. The ‘hierarchical and institutional nature of most archives’, Hitchcock argues, ‘contains an ideological component which is sucked in with every dust-filled breath’. But digitisation and keyword searching free us from having to follow the well-worn paths of institutional power. We can find people and follow their lives against the flow of bureaucratic convenience. We can gain a wholly new perspective on the workings of society. ‘What changes’, Hitchcock asks, ‘when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as a biographical narrative, rather than as part of an archival system?’

Projects such as Unknown no longer may help us answer that question.

Unknown no longer

It’s aiming to extract the names and biographical details of slaves from the 8 million manuscript documents held by the Virginia Historical Society. The documents include court records, receipts, wills and inventories. Here is a page from the ‘Inventory of Negroes at Berry Plain Plantation, King George County, Virginia’ for 1855, listing names, occupations and valuations.

Tim Hitchcock is one of the directors of London Lives, a project that similarly seeks to find the people in 240,000 manuscript pages documenting the lives of plebeian Londoners in the 18th century.

London Lives

More than three million names have already been extracted from the records of courts, workhouses, hospitals and other institutions. Work is continuing to link these names together, to merge these various shards of identity and piece together the experiences of London’s poorest inhabitants.

Remember me from the US Holocaust Memorial Museum is working with photographs taken by relief agencies in the aftermath of World War Two. The photographs are of displaced children who survived the Holocaust but were separated from families. What happened to them? The project is seeking public help to identify and trace the children.

Remember me

These are all projects about finding people. Finding the oppressed, the vulnerable, the displaced, the marginalized and the poor and giving them their place in history. This is what Kate and I hope to do with Invisible Australians, the broader project of which our faces experiment is part.

Invisible Australians

‘Invisible Australians’ aims to extract more than just photographs. We want to record and aggregate the biographical data contained within the records of the White Australia Policy — to extract the data and rebuild identities.

But we want to do more. We want to link these identities up with other records, with the research of family and local historians, with cemetery registers and family trees, with newspaper articles and databases we don’t even know about yet. We want to find people, families and communities.

It’s ridiculously ambitious and totally unfunded. But it is possible.

The most exciting part of online technology is the power it gives to people to pursue their passions. As with the faces, we don’t need the help of the National Archives. We need the records to be digitized, but that’s happening anyway and we can afford to be patient. Most of the tools we need already exist, and are free. In the past 12 months, for example, there have been a number of open source tools released for crowd-sourced transcription of manuscript records.

People with passions, people with dreams, people who are just annoyed and impatient, don’t have to wait for cultural institutions to create exactly what they need. They can take what’s on offer and change it.

Interfaces can be modified. It is amazingly easy to write a script that will change the way a web page looks and behaves in your browser. I was frustrated by the standard interface to digitized files in the National Archives of Australia’s RecordSearch database — so I changed it.

Before and after

Not only did I make it look a bit nicer, I added new functions. My script lets you print a whole file or a range of pages and display the entire contents of the file on a pretty cool 3d wall.

I’ve shared this script, and a few other RecordSearch enhancements. Anyone can install them with a click and use them.

Wragge Labs Emporium

Interfaces are sites of power and we can claim some of that power for ourselves. Online technologies not only free us from having to brave the physical intimidation of the reading room, they free us up to engage with the records in new ways. The archivist-on-duty would probably not be pleased if I pulled out some scissors and started snipping photos out of certificates. Or if I pulled a file apart and pasted its contents on the wall. But online we are free to experiment.

The power of cultural heritage organisations is perhaps expressed most forcefully in their ability to control the arrangement and description of their collections. ‘Every representation, every model of description, is biased’, note Verne Harris and Wendy Duff, ‘because it reflects a particular world-view and is constructed to meet specific purposes’. Archives, libraries and museums are already starting to share this power, by allowing tagging, or seeking public assistance with description through crowd sourcing projects. But most of these activities still happen within spaces created and curated by the institutions themselves. Our cathedrals of culture might be opening their doors and inviting the public to participate in their ceremonies, but that doesn’t make them bazaars. The architecture still speaks of authority.

In any case, people already have a space where they can explore and enrich collections — it’s called the internet.

It would be great to see cultural institutions doing more to watch, understand and support what people are doing with collections in their own spaces — following them as they pursue their passions, rather than thinking of ways to motivate them.

A quick example… You might have heard of Zotero, it’s an open source project that lets you capture, annotate and organize your research materials.


One cool thing about Zotero is that you can build and contribute little screen scrapers, called translators, that let Zotero extract structured data from any old collection database. You might not be surprised to learn that I’ve created a translator for RecordSearch. Another cool thing about Zotero is that you can share the stuff that you collect in public groups.

Invisible Australians Zotero group

Put those two cool things together and what do you have? Well to me they spell out user generated finding aids — parallel collection databases created by researchers simply pursuing their own passions.

Linked Open Data greatly increases opportunities for collection description to leak into the wider web. If objects and documents are identified with a unique URL, then anyone can make and publish statements about them in machine-readable form. These statements can then be aggregated and explored. Initiatives such as the Open Annotation Collaboration will hasten the development of these shared descriptive and interpretative layers around our cultural collections.

And of course all this descriptive and interpretative work can be harvested back to enhance existing collection databases. We could start doing it now — though I will spare you today my rant about the possibilities of mining footnotes.

As well as exploring the possibilities of user-generated content, cultural institutions are starting to open up their collection data for re-use. APIs are great (though Linked Open Data is better), and New Zealand is lucky to have an organisation like DigitalNZ which just gets it. People can and will make cool things with your stuff.

But again, we don’t have to wait for everything to be delivered in a convenient, machine-readable form. If it’s on the web anybody can scrape, harvest and experiment.

You probably all know about the National Library of Australia’s newspaper digitisation project — it’s building a magnificent resource. But I wanted to do more than just find articles. I wanted to explore and analyze their content on a large scale. So I built a screen scraper to extract structured data from search results, and then used the scraper to power a series of tools. I have a harvester that lets you download an entire results set — hundreds or thousands of articles — with metadata neatly packaged for further analysis.


Or what about a script that graphs the occurrence of search terms over time, and allows you to ask questions like ‘When did the Great War become the First World War?’

When did the Great War become the First World War?
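The counting behind a graph like that can be sketched in a few lines of Python. This is not the script from the talk: it assumes a list of already-harvested articles, each a dictionary with hypothetical `date` and `text` keys, and the sample records are invented.

```python
from collections import Counter


def hits_per_year(articles, phrase):
    """Count articles per year whose text contains `phrase` (case-insensitive)."""
    phrase = phrase.lower()
    counts = Counter()
    for article in articles:
        if phrase in article["text"].lower():
            # Dates are assumed to be ISO strings such as '1916-04-25'.
            counts[int(article["date"][:4])] += 1
    return dict(sorted(counts.items()))


# Invented records, only to show the shape of the data.
articles = [
    {"date": "1916-04-25", "text": "News of the Great War"},
    {"date": "1916-07-01", "text": "The great war continues"},
    {"date": "1941-09-03", "text": "Veterans of the First World War"},
]

for phrase in ("great war", "first world war"):
    print(phrase, hits_per_year(articles, phrase))
```

Plotting the two series against each other is then a job for whatever charting tool is to hand. Raw counts would also need to be set against the total number of articles published each year before reading much into the trend.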

In the end I got a bit carried away and built my own public API to the Trove newspaper database.

Unofficial Trove newspapers API

I think it’s important to note that the tools I developed were guided by the types of questions I wanted to ask. While we should welcome APIs and celebrate their possibilities, we should also remain critical. APIs are interfaces, they too embed power relations. Every API has an argument. What questions do they let us ask? What questions do they prevent us from asking?

Even as we move from the age of lumbering, slow-witted data silos into the rapidly-evolving realms of Linked Open Data, we have to constantly question the models we make of the world. Ontologies and vocabularies are culturally determined and historically specific. Yes, they too are interfaces, complete with their own distributions of power and authority. But we can revisit and change them. And we can relate our new models to our old models, capturing complex, long-term shifts in the way we think about the world. That’s incredibly exciting.

All of this hacking, harvesting, questioning, enriching and meaning-making makes me think about the possibilities of grassroots leadership. Online technologies enable people to take cultural institutions into unexpected realms. They can build their own interfaces, ask their own questions, determine their own needs — they can point the way instead of simply waiting to be served.

You might wonder what the National Library of Australia thinks of my various scrapers and harvesters. I can’t speak for them, but I can say that they’ve awarded me a fellowship to explore further the possibilities of text-mining in their newspaper database.

The idea of grassroots leadership brings me back to the title of this talk — ‘It’s all about the stuff’. It seems to me that we tend to model the interactions between cultural institutions and the public as transactions. The public are ‘clients’, ‘patrons’, ‘users’ or ‘visitors’. But the sorts of things I’ve been talking about today give us a chance to put the collections themselves squarely at the centre of our thoughts and actions. Instead of concentrating on the relationship between the institution and the public, we can focus on the relationship we both have with the collections.

It’s all about the stuff.

It’s all about the respect and responsibility we both have for our collections.

It’s all about the respect and responsibility we both have for people like this.



Liberating lives: invisible Australians and biographical networks

Presented at the Life of Information Symposium, 24 September 2010.
Slides are available on Slideshare.

Charlie Allen's palm print
This palm print belongs to a 12-year-old boy called Charlie Allen.

Charlie was born in Sydney in 1896.

His mother was Frances Allen (sometime sweet shop owner and brothel keeper), his father Charlie Gum (a buyer for Wing On company).

Charlie was raised by his mother, but in 1909, at the age of 13, he was taken to China by his father.

His father returned to Sydney, leaving Charlie in China. He lived with relatives in the town of Shekki (inland from Hong Kong) for 6 years.

Charlie was homesick, but had no means of getting back to Australia. His mother attempted to enlist government help but to no avail. Charlie finally returned in 1915.

The following year he enlisted in the First AIF (well, actually he enlisted three times, and was discharged as medically unfit each time).

Charlie married in Sydney in 1917 and had two daughters soon after. He returned to China in 1922 for 7 months.

Charlie Allen died in 1938 as the result of an industrial accident. He was 41.

How do we know all this about Charlie Allen?

We know this because there are fragments of Charlie’s life scattered throughout the holdings of the National Archives of Australia.

The CEDT from 1909 when he left Australia with his father:

Charles Allen 1909 - CEDT front

NAA: ST84/1, 1909/22/41-50

A letter from his mother to Prime Minister Billy Hughes, seeking help to return Charlie to Australia:
Letter to Billy Hughes from Charlie's mother.

NAA: A1, 1911/13854

His WWI service record:
Charles Allen's WWI attestation form


An identity form relating to his trip to China in 1922:

NAA: SP42/1, C1922/4449

But of course Charlie is not alone in the archives.

Because Charlie’s father was Chinese, Charlie was categorised as a ‘half-caste’, as someone who was not white, and so fell under the restrictions imposed by the White Australia Policy.

The certificate from 1909 granted Charlie an exemption to the Dictation Test. Without it, he may not have been allowed back into the country.

Every time one of many thousands of non-Europeans resident in Australia sought to travel overseas and return home again they needed one of these certificates.

We’re all of course familiar with the general outlines of the White Australia Policy, and the way it underpinned conceptions of Australia as a nation in the first half of the 20th century.

But what we sometimes forget is that it was also a massive bureaucratic exercise.

Forms and certificates were printed, issued, used and filed. Regulations were modified, guidelines were distributed and administering officers were managed and advised. Individual cases were reviewed, policy was changed and new forms and certificates were printed, issued, used and filed…

For example, between 1901 and 1911, 400 circulars were issued to port officers about immigration restriction. The confidential manual on immigration restriction grew from one page in 1902 to more than 200 in 1912.

Much of this system is now preserved in the National Archives.

For the years between 1902 and 1948 there remain:

  • More than 50,000 CEDTs
  • 90 shelf metres of records
  • 15,000 case files

And within those many thousands of files are the scattered fragments of lives such as Charlie’s — lives that were controlled, monitored and documented in a vain attempt to make Australia ‘white’.

We’ve already seen today some wonderful examples of how these fragments, these slivers of existence, can be found, extracted, aggregated and displayed. But I think it’s worth considering for a moment what happens when we do this.

The historian Tim Hitchcock, behind projects such as the Old Bailey Online and London Lives, has reflected on the impact of digitisation on our access to archives. Archives, he notes, tend to reflect the assumptions and practices of the institutions that created them.

But by providing new ways into these records systems, technology can undermine the power relations that persist within their structures.

‘What changes’, asks Tim Hitchcock, ‘when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as a biographical narrative, rather than as part of an archival system?’

I don’t know, but I think we should find out, don’t you?


I hope you’ve all collected a mini card. These themselves provide a little glimpse at the real face of White Australia and I’d invite you all to head over to the National Archives website, do battle with the monster that is RecordSearch, and look up the file references that are on each card.

The cards are part of a project that Kate Bagnall and I are trying to develop — Invisible Australians.

I should note too that the cards, and most of the examples I’m showing you here today are the product of Kate’s long and detailed research into Chinese-Australian families. In modern project management parlance, Kate is the domain expert, while I am merely the technical resource.

If we look again at one of the CEDTs, we can see that there’s a lot of useful structured data:

  • name
  • place of birth
  • age
  • height
  • destination
  • date of departure
  • name of ship

Invisible Australians has the modest aim of extracting this data from the 50,000+ forms in the National Archives. But of course that’s just the start, because each person might have used a number of certificates — so then it’s a matter of matching these identities.
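To make the shape of that data a little more concrete, here is a Python sketch of a record built from the fields listed above, plus a deliberately naive way of grouping certificates that might belong to the same person. The matching rule (same normalised name, same estimated birth year) is my own illustration, not the project's method, and the sample values are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Certificate:
    reference: str          # archival reference for the certificate
    name: str
    place_of_birth: str
    age: int                # age as recorded on the form
    height: str
    destination: str
    date_of_departure: str  # assumed ISO format, e.g. '1909-05-01'
    ship: str


def candidate_key(cert):
    """Naive matching key: normalised name plus estimated year of birth."""
    name = " ".join(cert.name.lower().split())
    birth_year = int(cert.date_of_departure[:4]) - cert.age
    return name, birth_year


def group_candidates(certs):
    """Group certificate references that share a matching key."""
    groups = defaultdict(list)
    for cert in certs:
        groups[candidate_key(cert)].append(cert.reference)
    return dict(groups)


# Invented sample values, only to exercise the grouping.
certs = [
    Certificate("REF-1", "Example  Person", "Sydney", 13, "4ft 8in", "China", "1909-05-01", "SS Example"),
    Certificate("REF-2", "example person", "Sydney", 26, "5ft 6in", "China", "1922-03-10", "SS Sample"),
    Certificate("REF-3", "Another Person", "Melbourne", 30, "5ft 4in", "India", "1910-01-15", "SS Example"),
]

print(group_candidates(certs))
```

Real records would defeat a rule this strict almost immediately: names were transliterated in several ways, and recorded ages are often out by a year or more. That is part of why the linking is a job for people as well as scripts.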

Invisible Australians

And then there are a range of other related forms, not to mention case files, alien registration documents, naturalisation applications…

Obviously we can’t do it alone. We’ll be creating a crowdsourcing tool to extract and link the data.

It’s ridiculously ambitious, totally unfunded and is likely to take over our lives.

Is it worth it?

Imagine being able to navigate the network of lives, families and relationships. To follow their journeys, to share their tragedies, to celebrate their small victories against a repressive system.

Imagine being able to watch them age.

Pauline Ah Hee and Shadee Khan

Is it worth it? We think so.


For Tim Hitchcock technology opens up the possibility of writing a new history from below, exploring how the poor, the marginalised and the powerless navigated the institutions of the modern state. But it’s not just about search engines and databases. He talks about making ‘best use of the technology of emotions and representation — how you use words and pictures and a story to impact, not just on what people think, but what they see in their mind’s eye’.

In this project, the photos matter. I hope the irony in our project title is obvious.

Some of the faces of Invisible Australia

This is the real face of White Australia.

The photos remind us that the project is not just about shifting data around — these are lives, these are people.

But this brings its own challenge, for if we are seeking to liberate these lives from the fragmentation and obscurity of bureaucratic systems then we should be asking what are we liberating them into?

A database?

This is not just an exercise in data creation and management. We also have to think carefully and creatively about issues of representation, access and discovery.

We have to give these lives back their freedom to associate, to have relationships, to make connections.

We need to embed these lives in a variety of contexts and combinations. To make room for serendipity, celebration, sadness, and yes, even play.

We need to bring these lives into a rich and ongoing conversation with the world.

But how?


I’ve been working on a little experiment for the National Museum of Australia called The History Wall. What the History Wall does is quite simple: it pulls together data on the fly from a variety of sources including People Australia, the Australian Dictionary of Biography, the National Library’s newspapers project, historical population data from the Bureau of Statistics, photos from the Flickr accounts of the Powerhouse Museum and the National Archives, and the collection database of the National Museum itself. It chooses randomly from all this stuff, throws the results up into the air and then displays them however they happen to fall. No two views are ever quite the same.

The History Wall

It’s something more than a timeline. To me it’s more like a celebration of context and serendipity. There’s a richness to it, a sense of discovery and fun, but there’s also fragility — next time you look it might be gone.

It’s a bit like history itself.

It’s a bit like the world.

How do we create spaces for our data to merge and mingle? How do we encourage the development of new contexts and connections?

I think the first thing we have to do is stop thinking about databases and dictionaries, registers and encyclopaedias. Don’t get me wrong, I’m not being critical of the wonderful projects we’ve seen today. I just think we can use all this work better if we stop thinking about individual resources and start developing on a web scale, on a global scale.

Yes, we have the technology. Time today has spared you from a detailed discourse on the Semantic Web, but I do want to focus on one aspect.

You may have heard of Linked Data. It’s a set of guidelines to help you publish your data to the Semantic Web. There are only four basic principles and I’m only going to talk about one of them. It’s one of those deceptively simple things. You look at it and think, ‘yeah, ok’, but before too long it’s starting to turn your brain inside out.

Use URLs to identify things in the real world.

Yeah, ok…

You know what URLs are: web addresses, the things you type in your browser’s location field.

And hopefully you know what things in the real world are: people, places, objects, events, ideas…

Now you may have detected a problem here, because no matter how many times you click the refresh button, your web browser is not going to be able to use such a URL to magically deliver you the real world thing.

Well, unless you’re on eBay.

Fortunately, the Linked Data guidelines provide for a bit of technical trickery that allows your browser to retrieve not the real world thing, but some information about that thing, perhaps in the form of a web page.
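One common form of this trickery is a ‘303 See Other’ redirect: the URL that names the thing never returns a page itself, but points the browser to a document about the thing. The following is only a minimal sketch of that pattern. The `/id/`, `/page/` and `/data/` paths and the `resolve` function are invented for illustration, and the check on the Accept header is far cruder than real content negotiation.

```python
# Hypothetical URL layout (an assumption for this sketch):
#   /id/<key>    names the real-world thing
#   /page/<key>  is a web page about it
#   /data/<key>  is machine-readable data about it

def resolve(path, accept="text/html"):
    """Return (status, headers) for a request, following the 303 pattern."""
    prefix = "/id/"
    if not path.startswith(prefix):
        return 404, {}
    key = path[len(prefix):]
    # A person cannot be sent over HTTP, so redirect to a description instead.
    if "application/rdf+xml" in accept:
        target = "/data/" + key
    else:
        target = "/page/" + key
    return 303, {"Location": target, "Vary": "Accept"}


status, headers = resolve("/id/example-person")
print(status, headers["Location"])
```

A browser asking for HTML is sent to the page, while a program asking for RDF is sent to the data. Both started from the same URL, the one that identifies the thing.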

Why bother?

Names are powerful.

We share and use names to talk about things. Computers are the same. If we use URLs to identify things in the real world, then computers can start talking about them.

We can define and explore real-world relationships in an online environment. We can create rich, meaningful linkages across databases, across disciplines, across the world.

We can start building and thinking on a web scale.


Thanks to the People Australia project, I can confidently claim that this is me:

I keep meaning to get it on a t-shirt.

The most exciting thing about People Australia is not the EAC records or the aggregation of resources — it’s the identifiers, because they enable us to say things about people anywhere on the web that computers can understand and relate back to a specific real world entity — a person.

You can start doing it now with Wragge’s Identity Browser.

Wragge's Identity Browser

This is a little tool I built using the People Australia API. It makes it easy to find identifiers for people and organisations, and it supplies you with some code that you can drop into a blog post or web page that will tell a computer that a name relates to a thing called a ‘person’, that this person’s name has a certain standard form, and that this person can be uniquely identified by People Australia.
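To give a rough idea of what such a snippet can look like, here is a sketch that builds an HTML fragment using RDFa attributes and the FOAF vocabulary. This is an assumption about one way to do it, not the Identity Browser’s actual output. The `person_markup` function, the example name and the `example.org` identifier are all invented.

```python
from html import escape


def person_markup(display_name, standard_name, identifier_url):
    """Build an HTML fragment with RDFa attributes describing a person."""
    return (
        '<span xmlns:foaf="http://xmlns.com/foaf/0.1/" '
        'about="{uri}" typeof="foaf:Person">'
        '<span property="foaf:name" content="{std}">{shown}</span>'
        '</span>'
    ).format(
        uri=escape(identifier_url, quote=True),   # the identifier for the person
        std=escape(standard_name, quote=True),    # the standard form of the name
        shown=escape(display_name),               # the text a human reader sees
    )


# Invented values, for illustration only.
print(person_markup("an example person", "Person, Example",
                    "http://example.org/party/12345"))
```

The three claims in the paragraph above map onto the three attributes: `typeof` says this is a person, `content` carries the standard form of the name, and `about` carries the identifier.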

Even if you don’t publish a website or a blog, you can use People Australia identifiers to build semantic linkages. Wragge’s Identity Browser also creates machine tags for you. Machine tags are like normal tags but with built-in semantics. When coupled with identifiers they enable you to do some pretty powerful things.

You could, for example, use machine tags in Flickr to tell computers that a certain photo depicts a person uniquely identified by People Australia. In fact, people have been doing just that.
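A machine tag has three parts, written as `namespace:predicate=value`. The sketch below builds and parses tags of that shape. The `example:person` namespace and predicate are placeholders rather than the tags the Identity Browser generates, and the pattern is deliberately strict for the purposes of illustration.

```python
import re

# namespace:predicate=value, with a strict pattern chosen for this sketch.
MACHINE_TAG = re.compile(
    r"^([a-z][a-z0-9_]*):([a-z][a-z0-9_]*)=(.+)$", re.IGNORECASE
)


def make_machine_tag(namespace, predicate, value):
    """Join the three parts into a single machine tag string."""
    return "{}:{}={}".format(namespace, predicate, value)


def parse_machine_tag(tag):
    """Return (namespace, predicate, value), or None for an ordinary tag."""
    match = MACHINE_TAG.match(tag)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


tag = make_machine_tag("example", "person", "12345")  # invented identifier
print(tag)
print(parse_machine_tag(tag))
print(parse_machine_tag("sunset"))
```

An ordinary tag like ‘sunset’ has no structure for a program to work with, whereas the machine tag tells it which vocabulary is in play, what kind of statement is being made, and which identifier is meant.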

Flickr Machine Tag Challenge

The Flickr Machine Tag Challenge is a sort of scoreboard that I built to encourage people to start adding People Australia enriched machine tags to photos. More than 1200 tags have been added to over 1000 photos. Feel free to join in!

The point is that the technologies already exist to enable us to build web scale biographical resources. Not dictionaries or databases as we know them, but networks capable of constant expansion, elaboration, and cooperation.

What we need are more tools to make it simple, recipes to make it obvious, examples and applications to make it popular, and leadership to make it all seem possible.


Of course most of the lives we hope to liberate through Invisible Australians will not be represented in People Australia.

Not yet.

But Invisible Australians will offer a point of aggregation and disambiguation that will enable our people to find their way from the bureaucratic recesses of the White Australia Policy to a place on the national stage.

And we will encourage others to do likewise. Basil can’t do all the work. The centralised system has to be fed through centres of aggregation and collaboration.

Similarly, there are many great resources already out there relating to Chinese-Australians. There are hordes of family and local historians compiling and publishing biographical data. We want to identify people in these resources and link to them.

We want to publish up to People Australia and link down to a single headstone in a lonely country cemetery.

But to do this we need to help people make their resources linkable. To help them create persistent, re-usable URLs, and expose their data in standard formats. To create Linked Data, even if they have no particular interest in the Semantic Web.

Invisible Australians

Invisible Australians is not just about extracting data from archives. It’s also about working with others to build capacities and demonstrate possibilities.

It’s ridiculously ambitious, totally unfunded and is likely to take over our lives.

Is it worth it?

We think so.

Hacking a research project

Amongst the holdings of the National Archives of Australia are some of the most visually arresting documents you’ll see — thousands and thousands of forms from the early decades of the twentieth century, each with a portrait photograph and palm print, each documenting the movements of a non-white resident. Along with many other certificates, regulations, correspondence and case files, these forms are part of the massive bureaucratic legacy of the White Australia Policy.

These certificates allowed non-white Australians travelling overseas to re-enter the country. NAA: ST84/1, 1906/21-30

But these are more than just interesting looking pieces of paper, they are snapshots of people’s lives. The forms capture data about an individual’s place of birth, physical characteristics and more. Over time a person might have submitted several of these forms, so by bringing them together we could trace their history, we could map their journeys — we could even watch them age.

The system which sought to render non-whites invisible has captured and preserved the outlines of their lives. By extracting and linking this data we could build a picture of another Australia, an Australia in which non-white residents lived, loved, struggled and succeeded, despite the impositions of a repressive regime.

I talked about these records at the AAHC conference last year, inspired in part by Tim Hitchcock’s chapter in the Virtual Representation of the Past. Tim Hitchcock argues that technology can allow us to restructure archives, looking beyond institutional hierarchies to the lives of individuals contained within:

What changes when we examine the world through the collected fragments of knowledge that we can recover about a single person, reorganised as a biographical narrative, rather than as part of an archival system?

I don’t know, but I’d like to find out.

During my AAHC talk, Dave Lester suggested that the extraction of data from these forms might make a good crowdsourcing project. It’s a great idea. As you can see, the data is generally well-structured and legible, so it should be possible to construct a simple series of forms that would allow volunteers to transcribe the data. The next stage would be to try and match identities across forms. That’s more complicated, but projects such as Tim Hitchcock’s London Lives show how users can construct identities by connecting a range of historical documents.
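As a very rough sketch of what the matching stage might start from, the code below groups transcribed forms that share a normalised name and birthplace, then orders each group by year. Everything here is an assumption: the `FormRecord` fields, the matching rule and the sample records are invented, and real matching would need to cope with variant spellings and would need human review.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FormRecord:
    item_ref: str    # reference for the form (invented values below)
    name: str
    birthplace: str
    year: int


def _normalise(text):
    # Ignore case, punctuation and spacing when comparing transcriptions.
    return "".join(ch for ch in text.lower() if ch.isalnum())


def match_key(record):
    """A crude key: forms with the same key are candidate matches only."""
    return _normalise(record.name), _normalise(record.birthplace)


def group_candidates(records):
    """Group records by key and sort each group into a timeline."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return {k: sorted(v, key=lambda r: r.year) for k, v in groups.items()}


# Invented sample data, not transcriptions of real forms.
records = [
    FormRecord("example/2", "person  one", "placeville", 1912),
    FormRecord("example/1", "Person One", "Placeville", 1906),
    FormRecord("example/3", "Person Two", "Otherton", 1908),
]
for key, forms in group_candidates(records).items():
    print(key, [f.year for f in forms])
```

Grouping like this only proposes candidates. Deciding whether two forms really describe the same person is the part that needs people, which is where the London Lives approach comes in.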

Then there are connections to resources outside of the archives — photographs, local histories, newspapers, genealogies, cemetery registers and more. By keeping our system open and extensible, and by working with others to help them expose their information in standard ways, it should be possible to develop the framework for an evolving mesh of biographical data.

So, how do we get started? This is the point when you usually have to start thinking about money — how can I fund this? In Australia that generally means a journey into the arcane world of the Australian Research Council. The ARC suffers from all the problems of a peer-reviewed system, but added to this is a rather antiquated notion of what research is.

In the rules covering each of the main schemes it’s clearly stated that the ‘compilation of data’ and the ‘development of research aids or tools’ are not supported. I spend part of my life working for the Australian National Data Service, an organisation that seeks to highlight how the sharing and reuse of data can open up new research possibilities. The ARC, however, seems to think that data has little value beyond its original research context.

Of course you can still mount a case for such activities. Applicants for a ‘Discovery’ grant can argue that data creation is integral to their project and provide details of the ‘specific research questions to be addressed’. But what if you don’t yet know what the questions are? Part of the point of a project such as this is to try and find out what questions we are able to ask. Until we start to compile, link and explore the data, the ‘specific research questions’ will be little more than convenient fictions, dreamt up to satisfy the prodding of peer reviewers.

Tom Scheinfeldt wrote a fantastic blog post recently, responding to concerns about the failure of many digital humanities projects to make arguments or answer questions. Drawing examples from the history of science, Tom argues:

we need to make room for both kinds of digital humanities, the kind that seeks to make arguments and answer questions now and the kind that builds tools and resources with questions in mind, but only in the back of its mind and only for later. We need time to experiment and even… time to play.

The ARC does not fund play.

You might imagine that the ARC’s infrastructure funding scheme would offer more hope for a project such as this. And yes, there are many worthy projects involving databases and online tools that have been supported in this way (and I have benefited from some of them!). But it seems that in the minds of research funders infrastructure is always BIG. Grants start at $150,000, and applications are expected to involve multiple institutional partners. Projects have to be scaled up to fit the ARC’s definition of infrastructure, often resulting in complex, lumbering, long-term projects whose products are out of date by the time of their release.

There is no room in our current infrastructure models for agile, innovative, user-focused digital toolmakers seeking small amounts to experiment with apps, prototypes, datasets or visualisations. I often look with envy upon the US National Endowment for the Humanities Digital Humanities Start-Up Grants.

In any case, neither I nor my partner in this endeavour, Kate Bagnall (@baibi), are currently in academic positions, so our chances of gaining any sort of research funding are next to none. We have the expertise — Kate has spent many years researching Australian-Chinese families and knows the records back-to-front, while I just can’t help playing with biographical data — but is that enough? How can you mount an ongoing research project without institutional support, research funding and the various badges and signifiers of academic authority?

I don’t know that either, but I have some ideas.

Ah Yin Pak Chong

Mrs Ah Yin Pak Chong. NAA: ST84/1, 1907/321-330

I didn’t manage to get a contribution together for Dan Cohen and Tom Scheinfeldt’s crowdsourced-in-a-week book, Hacking the Academy, but watching the process from afar I did begin to wonder about how we might hack the way we build and run major research projects. This is what I have in mind:

  • To strip down the large, lumbering beasts and design projects that are modular and opportunistic — able to grow quickly when resources allow, to bolt on related projects, to absorb existing tools.
  • To follow the data freely across technological and institutional boundaries, developing open networks that invite participation and use.
  • To develop a floating pool of collaborators, both inside and outside of academia, who are able to come and go, contributing whatever and whenever they can.
  • To make everything public, accessible and standards-compliant, so that even if the project stalls it could be picked up and developed by someone else.

Most of all I just want to be able to do it. I don’t want to second-guess the ARC. I don’t want to spend months negotiating with potential partners or begging for an institutional home. I want to build, experiment and play. I want to make a start.

So that’s what we’re going to do.

We have a topic, plenty of raw materials, some basic principles and the beginnings of a plan. We even have a name — Invisible Australians: Living under the White Australia Policy.

As the project develops, I’ll be blogging here about some of the technical stuff, while Kate will be exploring the content over at the tiger’s mouth. I hope to have a prototype of the transcription tool ready to demo at THATCamp Canberra, while Kate is already at work putting together guides on using the records and developing an Omeka site that follows a number of Chinese-Australian families through the archives.

Can we hack together a major research project? Let’s find out.