Open Web search: Making a European Search Index

Professor Djoerd Hiemstra

In our quest to answer “How to make a European Search Engine”, we found out that the hardest part is building a search index: crawling the entire web to know where to find what. I had the opportunity to talk to Djoerd Hiemstra, a professor of Data Science at Radboud University in Nijmegen, the Netherlands. Together with colleagues from several European universities, he has been working on building a European search index for the project openwebsearch.eu.

Djoerd describes a search index this way:

“So a search index allows a search engine to very quickly provide answers to your questions. And it basically consists of two things, a whole bunch of metadata, like things you want to show to the user, like a title of a page and the URL, obviously, because you want to go there. Maybe a little snippet of information about the page or a nice picture. And then there’s a data structure which we call the inverted index, which is very much like the index in the back of a book. So for every word, we list the pages on the web that contain the word.”
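To make the two parts Djoerd mentions concrete, here is a minimal sketch in Python of a toy search index: a metadata table (title, URL, snippet) plus an inverted index that maps each word to the pages containing it. The function names `build_index` and `search`, the example URLs, and the whitespace tokenization are assumptions made for illustration; this is not how the Open Web Search project implements its index.

```python
from collections import defaultdict

# Hypothetical example pages: URL -> (title, text). Not real project data.
PAGES = {
    "https://example.org/a": ("Open search", "open web search for europe"),
    "https://example.org/b": ("Crawling", "how to crawl the web"),
}


def build_index(pages):
    """Build the two parts of a toy search index.

    Returns (metadata, inverted), where metadata holds what is shown to the
    user and inverted maps each word to the set of URLs containing it.
    """
    metadata = {}
    inverted = defaultdict(set)
    for url, (title, text) in pages.items():
        # Metadata: the things a result page would display.
        metadata[url] = {"title": title, "snippet": text[:60]}
        # Inverted index: like the index in the back of a book.
        for word in text.lower().split():
            inverted[word].add(url)
    return metadata, inverted


def search(query, metadata, inverted):
    """Return (url, title) pairs for pages that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    # Intersect the URL sets of all query words.
    hits = set.intersection(*(inverted.get(w, set()) for w in words))
    return [(url, metadata[url]["title"]) for url in sorted(hits)]


if __name__ == "__main__":
    metadata, inverted = build_index(PAGES)
    print(search("web search", metadata, inverted))
```

The lookup never scans page text at query time: it only reads the precomputed word lists, which is why an index can answer quickly. A real index would also need ranking, better tokenization, and compact on-disk storage, none of which is attempted here.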

How to crawl the web:

“So in the Open Web Search project, we try to collaboratively crawl the web, so there’s different sites that crawl part of the web and it’s all coordinated by a fourth site that tries to keep track of the horizon. Where are we on the Web? What did we see and what did we not see? And how fast you can crawl the web, basically, is determined by the number of machines you have.”
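The coordination idea described here can be sketched as a small simulation: one coordinator keeps track of which URLs have been seen and which are still waiting, and several crawling sites ask it for work and report the links they find. Everything in this sketch is an assumption for illustration: the `Coordinator` class, the `simulate` function, the tiny hard-coded link graph, and the choice of three crawling sites. It does no network access and does not reflect the project's actual software.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> outgoing links.
LINKS = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}


class Coordinator:
    """Tracks what has been seen and what is still waiting to be crawled."""

    def __init__(self, seeds):
        self.seen = set()
        self.frontier = deque()
        self.report(seeds)

    def next_batch(self, size):
        """Hand out up to `size` URLs that no site has fetched yet."""
        batch = []
        while self.frontier and len(batch) < size:
            batch.append(self.frontier.popleft())
        return batch

    def report(self, discovered):
        """Record newly discovered URLs, skipping ones already seen."""
        for url in discovered:
            if url not in self.seen:
                self.seen.add(url)
                self.frontier.append(url)


def simulate(links, seeds, sites=3, batch_size=1):
    """Let `sites` crawlers take turns until the frontier is empty."""
    coord = Coordinator(seeds)
    fetched = {site: [] for site in range(sites)}
    while coord.frontier:
        for site in range(sites):
            for url in coord.next_batch(batch_size):
                fetched[site].append(url)  # "fetch" the page
                coord.report(links.get(url, []))  # report its links
    return coord, fetched


if __name__ == "__main__":
    coord, fetched = simulate(LINKS, ["a"])
    print(fetched)
```

Because only the coordinator decides what counts as new, no page is handed out twice, and adding more crawling sites means more pages fetched per round, which matches the point that crawl speed depends on the number of machines.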

On funding and the future of the Open Web Search project:

“So this project is funded in the European Horizon project. So it’s funded for three years. And the three years are now almost done. So we built a very large index that people can now use. It has about 10 billion pages. And now the question is: Will someone take this up and actually build this new Google? Because that’s out of our current budgets.”
