Demo and workshop papers accepted at The Web Conference (formerly WWW) 2019🗓 2019-03-14 ✍ Gezim Sejdiu
We are very pleased to announce that our group got a demo paper accepted for presentation at the 2019 edition of The Web Conference (formerly the WWW conference), which will be held on May 13-17, 2019, in San Francisco, US. The Web Conference 2019 will offer many opportunities to present and discuss the latest advances in academia and industry, across research tracks, workshops, tutorials, posters, demos, the developers' track, the W3C and industry tracks, the PhD symposium, challenges, and more. Here is the pre-print of the accepted paper with its abstract:
- Querying Data Lakes using Spark and Presto by Mohamed Najib Mami, Damien Graux, Hajira Jabeen, Simon Scerri, and Sören Auer.
Abstract: Squerall is a tool that allows the querying of heterogeneous, large-scale data sources by leveraging state-of-the-art Big Data processing engines: Spark and Presto. Queries are posed on-demand against a Data Lake, i.e., directly on the original data sources without requiring prior data transformation. We showcase Squerall's ability to query five different data sources, including inter alia the popular Cassandra and MongoDB. In particular, we demonstrate how it can jointly query heterogeneous data sources, and how interested developers can easily extend it to support additional data sources. Graphical user interfaces (GUIs) are offered to support users in (1) building intra-source queries, and (2) creating required input files.

Furthermore, we are pleased to announce that we got a workshop paper accepted at the 5th Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW 2019), which will be co-located with The Web Conference 2019. MEPDaW aims at addressing challenges and issues of managing Knowledge Graph evolution and preservation by providing a forum for researchers and practitioners to discuss, exchange and disseminate their ideas and work, to network, and to cross-fertilise new ideas. Here is the pre-print of the accepted paper with its abstract:
- Summarizing Entity Temporal Evolution in Knowledge Graphs by Mayesha Tasnim, Diego Collarana, Damien Graux, Fabrizio Orlandi, and Maria-Esther Vidal.
Abstract: Knowledge graphs are dynamic in nature; new facts about an entity are added or removed over time. Therefore, multiple versions of the same knowledge graph exist, each of which represents a snapshot of the knowledge graph at some point in time. Entities within the knowledge graph undergo evolution as new facts are added or removed. Automatically generating a summary out of different versions of a knowledge graph is a long-studied problem. However, most existing approaches are limited to pair-wise version comparison, which makes it difficult to capture the complete evolution across several versions of the same graph. To overcome this limitation, we envision an approach to create a summary graph capturing the temporal evolution of entities across different versions of a knowledge graph. The entity summary graphs may then be used for documentation generation, profiling or visualization purposes. First, we take different temporal versions of a knowledge graph and convert them into RDF molecules. Secondly, we perform Formal Concept Analysis on these molecules to generate summary information. Finally, we apply a summary fusion policy in order to generate a compact summary graph which captures the evolution of entities.

Acknowledgment: This research was supported by the German Ministry of Education and Research (BMBF) in the context of the project MLwin (Maschinelles Lernen mit Wissensgraphen, grant no. 01IS18050F). Looking forward to seeing you at The Web Conference 2019.
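For readers curious what such a temporal summary looks like in practice, here is a minimal plain-Python sketch. It is not the paper's FCA-based method; it only groups triples by subject into per-entity "molecules" and diffs consecutive versions. All entity and property names are invented for illustration.

```python
from collections import defaultdict

def molecules(triples):
    """Group triples by subject into RDF 'molecules' (one subgraph per entity)."""
    m = defaultdict(set)
    for s, p, o in triples:
        m[s].add((p, o))
    return m

def evolution_summary(versions):
    """Report per-entity additions/removals between consecutive KG versions."""
    summary = defaultdict(list)
    for t, (old, new) in enumerate(zip(versions, versions[1:]), start=1):
        m_old, m_new = molecules(old), molecules(new)
        for entity in set(m_old) | set(m_new):
            added = m_new[entity] - m_old[entity]
            removed = m_old[entity] - m_new[entity]
            if added or removed:
                summary[entity].append((t, sorted(added), sorted(removed)))
    return dict(summary)

# Two toy versions of the same knowledge graph
v1 = [(":Bonn", ":population", "313000")]
v2 = [(":Bonn", ":population", "327000"), (":Bonn", ":mayor", ":Sridharan")]
print(evolution_summary([v1, v2]))
```

A fusion policy as in the paper would then compact these per-step diffs into a single summary graph.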
Paper accepted at Knowledge-Based Systems Journal🗓 2019-03-12 ✍ Gezim Sejdiu
We are very pleased to announce that our group got a paper accepted at the Knowledge-Based Systems Journal. Knowledge-Based Systems is an international, interdisciplinary and applications-oriented journal. It focuses on systems that use knowledge-based (KB) techniques to support human decision-making, learning, and action; emphasizes the practical significance and the development and usage of such KB systems; and covers their implementation: the design process, models and methods, software tools, decision-support mechanisms, user interactions, organizational issues, knowledge acquisition and representation, and system architectures. Here is the accepted paper with its abstract:
- New label noise injection methods for the evaluation of noise filters by Luís Paulo F. Garcia, Jens Lehmann, André C.P.L.F. de Carvalho, and Ana C. Lorena.
Abstract: Noise is often present in real datasets used for training Machine Learning classifiers. Its disruptive effects on the learning process may include: increasing the complexity of the induced models, a higher processing time and a reduced predictive power in the classification of new examples. Therefore, treating noisy data in a preprocessing step is crucial for improving data quality and reducing its harmful effects on the learning process. There are various filters using different concepts for identifying noisy examples in a dataset. Their ability in noise preprocessing is usually assessed by the identification of artificial noise injected into one or more datasets. This is done to overcome the limitation that only a domain expert can guarantee whether a real example is indeed noisy. The most frequently used label noise injection method is the noise-at-random method, in which a percentage of the training examples have their labels randomly exchanged. This is carried out regardless of the characteristics and example-space positions of the selected examples. This paper proposes two novel methods to inject label noise in classification datasets. These methods, based on complexity measures, can produce more challenging and realistic noisy datasets by disturbing the labels of critical examples situated close to the decision borders, and can improve the evaluation of noise filters. An extensive experimental evaluation of different noise filters is performed using public datasets with imputed label noise, and the influence of the noise injection methods is compared in both the data preprocessing and classification steps.
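As a concrete illustration of the noise-at-random baseline that the paper compares against, here is a minimal sketch: a fraction of the labels, chosen uniformly at random, is exchanged for a different class. The toy binary dataset is invented for the example; the paper's own methods additionally target examples near decision borders.

```python
import random

def noise_at_random(labels, rate, classes, seed=0):
    """Flip the labels of a given fraction of examples, chosen uniformly at
    random, to a different class (the 'noise at random' baseline)."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_noisy = int(round(rate * len(labels)))
    for i in rng.sample(range(len(labels)), n_noisy):
        # Pick any class other than the original, so the label really changes
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy

labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
noisy = noise_at_random(labels, rate=0.2, classes=[0, 1])
print(sum(a != b for a, b in zip(labels, noisy)))  # 2 labels flipped
```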
Paper accepted at EDBT 2019🗓 2019-02-25 ✍ Gezim Sejdiu
We are very pleased to announce that our group got a paper accepted for presentation at the 2019 edition of the EDBT conference, which will be held on March 26-29, 2019, in Lisbon, Portugal. The International Conference on Extending Database Technology (EDBT) is a leading international forum for database researchers, practitioners, developers, and users to discuss cutting-edge ideas, and to exchange techniques, tools, and experiences related to data management. Here is the pre-print of the accepted paper with its abstract:
- Big POI Data Integration with Linked Data Technologies by Spiros Athanasiou, Giorgos Giannopoulos, Damien Graux, Nikos Karagiannakis, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Kostas Patroumpas, Mohamed Ahmed Sherif, and Dimitrios Skoutas.
Abstract: Point of Interest (POI) data constitutes the cornerstone in many modern applications. From navigation to social networks, tourism, and logistics, we use POI data to search, communicate, decide and plan our actions. POIs are semantically diverse and spatio-temporally evolving entities, having geographical, temporal, and thematic relations. Currently, integrating POI datasets to increase their coverage, timeliness, accuracy and value is a resource-intensive and mostly manual process, with no specialized software available to address the specific challenges of this task. In this paper, we present an integrated toolkit for transforming, linking, fusing and enriching POI data, and extracting additional value from them. In particular, we demonstrate how Linked Data technologies can address the limitations, gaps and challenges of the current landscape in Big POI data integration. We have built a prototype application that enables users to define, manage and execute scalable POI data integration workflows built on top of state-of-the-art software for geospatial Linked Data. This application abstracts and hides away the underlying complexity, automates quality-assured integration, scales efficiently for world-scale integration tasks, and lowers the entry barrier for end-users. Validated against real-world POI datasets in several application domains, our system has shown great potential to address the requirements and needs of cross-sector, cross-border and cross-lingual integration of Big POI data.

Acknowledgment: This work was partially funded by the EU H2020 project SLIPO (#731581). Looking forward to seeing you at the EDBT 2019 conference.
Paper accepted at Oxford Bioinformatics Journal🗓 2019-02-12 ✍ Gezim Sejdiu
We are very pleased to announce that our group got a paper accepted at the Oxford Bioinformatics Journal. Bioinformatics, published by Oxford University Press, is a bi-weekly peer-reviewed scientific journal that focuses on genome bioinformatics and computational biology. The journal is a leader in its field and publishes scientific papers that are relevant to academic and industrial researchers. Here is the pre-print of the accepted paper with its abstract:
- BioKEEN: A library for learning and evaluating biological knowledge graph embeddings by Mehdi Ali, Charles Tapley Hoyt, Daniel Domingo-Fernandez, Jens Lehmann, and Hajira Jabeen.
Abstract: Knowledge graph embeddings (KGEs) have received significant attention in several domains due to their ability to predict links and create dense representations for graphs' nodes and edges. However, the software ecosystem for their application to bioinformatics remains limited and inaccessible for users without expertise in programming and machine learning. Therefore, we developed BioKEEN (Biological KnowlEdge EmbeddiNgs) and PyKEEN (Python KnowlEdge EmbeddiNgs) to facilitate their easy use through an interactive command line interface. Finally, we present a case study in which we used a novel biological pathway mapping resource to predict links that represent pathway crosstalks and hierarchies. Availability: BioKEEN and PyKEEN are open source Python packages publicly available under the MIT License at https://github.com/SmartDataAnalytics/BioKEEN and https://github.com/SmartDataAnalytics/PyKEEN as well as through PyPI.

Acknowledgement: We thank our partners from the Bio2Vec, MLwin, and SimpleML projects for their assistance. This research was supported by the Bio2Vec project (http://bio2vec.net/, CRG6 grant 3454) with funding from King Abdullah University of Science and Technology (KAUST).
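To give a flavor of what a knowledge graph embedding model computes, here is a minimal, self-contained sketch of the TransE scoring function, one of the model families such libraries implement. This is not BioKEEN's or PyKEEN's actual API, and the tiny two-dimensional embeddings and biological entity names are invented for illustration.

```python
import math

def transe_score(h, r, t):
    """TransE plausibility score: negative L2 distance ||h + r - t||.
    Higher (closer to zero) means the triple is considered more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Hypothetical trained embeddings (in practice learned from the KG)
emb = {
    "vincristine": [0.9, 0.1], "cell_cycle": [1.0, 1.1],
    "caffeine":    [0.0, 0.8], "modulates":  [0.1, 1.0],
}

# Link prediction: rank candidate tails for (vincristine, modulates, ?)
for tail in ("cell_cycle", "caffeine"):
    print(tail, round(transe_score(emb["vincristine"], emb["modulates"], emb[tail]), 3))
```

A library like PyKEEN wraps training, evaluation and this kind of scoring behind a uniform interface for many such models.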
Papers accepted at AAAI / ComplexQA & RecNLP Workshops🗓 2019-01-23 ✍ Gezim Sejdiu
We are very pleased to announce that our group had two papers accepted for presentation at the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) workshops (ComplexQA 2019 and RecNLP 2019), which will be held January 27 – February 1, 2019 at the Hilton Hawaiian Village, Honolulu, Hawaii, USA. The purpose of the Association for the Advancement of Artificial Intelligence (AAAI) conference series is to promote research in artificial intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers in AI and its affiliated disciplines. The Reasoning for Complex Question Answering (ComplexQA) workshop is a new series of workshops on reasoning for complex question answering (QA). QA has become a crucial application problem in evaluating the progress of AI systems in the realm of natural language processing and understanding, and in measuring the progress of machine intelligence in general. The computational linguistics communities (ACL, NAACL, EMNLP et al.) have devoted significant attention to the general problem of machine reading and question answering, as evidenced by the emergence of strong technical contributions and challenge datasets such as SQuAD. However, most of these advances have focused on “shallow” QA tasks that can be tackled very effectively by existing retrieval-based techniques. Instead of measuring the comprehension and understanding of the QA systems in question, these tasks merely test the capability of a technique to “attend” or focus attention on specific words and pieces of text. The main aim of this workshop is to bring together experts from the computational linguistics (CL) and AI communities to: (1) catalyze progress on the complex QA problem, and create a vibrant test-bed of problems for various AI sub-fields; and (2) present a generalized task that can act as a harbinger of progress in AI.
Recommender Systems Meet Natural Language Processing (RecNLP) is an interdisciplinary workshop covering the intersection between Recommender Systems (RecSys) and Natural Language Processing (NLP). The primary goal of RecNLP is to identify common ideas and techniques that are being developed in both disciplines, and to further explore the synergy between the two and to bring together researchers from both domains to encourage and facilitate future collaborations. Here is the pre-print of the accepted papers with their abstract:
- Translating Natural Language to SQL using Pointer-Generator Networks and How Decoding Order Matters by Denis Lukovnikov, Nilesh Chakraborty, Jens Lehmann and Asja Fischer
Abstract: Translating natural language to SQL queries for table-based question answering is a challenging problem and has received significant attention from the research community. In this work, we extend a pointer-generator network and investigate how query decoding order matters in semantic parsing for SQL. Even though our model is a straightforward extension of a general-purpose pointer-generator, it outperforms early work for WikiSQL and remains competitive to concurrently introduced, more complex models. Moreover, we provide a deeper investigation of the potential “order-matters” problem due to having multiple correct decoding paths, and investigate the use of REINFORCE as well as a non-deterministic oracle in this context.
- Metaresearch Recommendations using Knowledge Graph Embeddings by Veronika Henk, Sahar Vahdati, Mojtaba Nayyeri, Mehdi Ali, Hamed Shariat Yazdi and Jens Lehmann
Abstract: Discovering relevant research collaborations is crucial for performing extraordinary research and promoting the careers of scholars. Therefore, building recommender systems capable of suggesting relevant collaboration opportunities is of great interest. Most of the existing approaches for collaboration and co-author recommendation focus on semantic similarities using bibliographic metadata, such as publication counts, and on citation network analysis. These approaches neglect relevant and important metadata information, such as author affiliations and conferences attended, affecting the quality of the recommendations. To overcome these drawbacks, we formulate the task of scholarly recommendation as a link prediction task based on knowledge graph embeddings. A knowledge graph containing scholarly metadata is created and enriched with textual descriptions. We tested the quality of the recommendations based on the TransE, TransH and DistMult models, which consider only triples in the knowledge graph, and DKRL, which in addition incorporates natural language descriptions of entities during training.
Looking forward to seeing you at AAAI-19.
New Year at SDA - Looking back at 2018🗓 2019-01-03 ✍ Prof. Dr. Jens Lehmann
2019 has just started and we want to take a moment to look back at a very busy and successful year 2018, full of new members, inspirational discussions, exciting conferences, accepted research papers, new software releases and a lot of highlights we had throughout the year.
Below is a short summary of the main cornerstones for 2018:
An interesting future for AI and knowledge graphs
Artificial intelligence/machine learning and semantic technologies/knowledge graphs are central topics for SDA. Throughout the year, we have been able to accomplish a range of interesting research achievements. One particularly active area was question answering and dialogue systems (with and without knowledge graphs). We acquired new projects worth more than a million euros this year and were able to transfer our expertise to industry via successful projects at Fraunhofer. External interest in our results has been remarkably high. Furthermore, we extended our already established position in scalable distributed querying, inference, and analysis of large RDF datasets. Amid the race for ever-improving achievements in AI, which has gone far beyond what many could have imagined 10 years ago, our researchers were able to deliver important contributions and continued to shape different sub-areas of the growing AI research landscape.
We had 41 papers accepted at well-known venues, e.g., the AAAI 2019 workshops, ISWC 2018, ESWC 2018, the Nature Scientific Data Journal, the Journal of Web Semantics, the Semantic Web Journal, WWW 2018 workshops, EMNLP 2018 workshops, ECML 2018 workshops, CoNLL 2018, SIGMOD 2018 workshops, SIGIR 2018, ICLR 2018, EKAW 2018, SEMANTiCS 2018, ICWE 2018, ICSC 2018, TPDL 2018, JURIX 2018 and more. We estimate that SDA members accumulated over 2,500 citations this year (based on Google Scholar profiles).
SANSA - an open-source data flow processing engine for performing distributed computation over large-scale RDF datasets - had two successful releases during 2018 (SANSA 0.5 and SANSA 0.4).
From the funded projects, we were happy to launch the first major release of the Big Data Ocean platform - a platform for Exploiting Oceans of Data for Maritime Applications.
There were several other releases:
- SML-Bench - A Structured Machine Learning benchmark framework 0.2 has been released.
- WebVOWL - A web-based visualization for ontologies had several releases in 2018. A major new feature is the integration of the WebVOWL Editor - a device-independent visual ontology modeling tool.
- AskNowQA - A suite of natural language interaction technologies that behave intelligently through domain knowledge. Version 0.1 has been released.
- Move to the brand new Computer Science Campus: After many delays, we finally moved into our new campus where we have modern rooms and equipment.
- A Best Demo Award at ISWC 2018
- Two PhD defenses: Mikhail Galkin and Lavdim Halilaj both successfully defended their PhD theses. Congratulations to them again! Four more theses have been submitted, with defenses scheduled for January and February.
- Many invited speakers (Prof. Dr. John Domingue, Prof. Dr. Khalid Saeed, Dr. Anastasia Dimou, Svitlana Vakulenko and Dr. Katherine Thornton).
- We did an off-site meeting together with the EIS department of Fraunhofer IAIS, at their place.
Likewise, SDA deeply values team-bonding activities. We often try to introduce fun activities that involve teamwork and team building. At our X-mas party, we enjoyed a very international and lovely dinner together while exchanging Secret Santa gifts and playing some ad-hoc games.
Long-term team building through deeper discussions, genuine connections and healthy communication helps us to connect within the group!
Many thanks to all who have accompanied and supported us along the way! From all of us at SDA, we wish you a wonderful new year!
Jens Lehmann on behalf of The SDA Research Team
Dr. Katherine Thornton visits SDA🗓 2018-12-20 ✍ Gezim Sejdiu
Dr. Katherine Thornton from Yale University Library, New Haven, Connecticut, US visited the SDA group on November 28, 2018.
Katherine Thornton is an information scientist at the Yale University Library working on creating metadata as linked open data. Katherine earned a PhD in Information Science from the University of Washington in 2016 and works on the Scaling Emulation as a Service Infrastructure (EaaSI) project describing the software and configured environments in Wikidata. Katherine has been a volunteer contributor to the Wikidata project since 2012.
Dr. Thornton was invited to give talks on “Sharing RDF data models and validating RDF graphs with ShEx“ and “Documenting and preserving programming languages and software in Wikidata” at the SWIB conference (Semantic Web in Libraries). SWIB is an annual conference, held for the 10th time in 2018, focusing on Linked Open Data (LOD) in libraries and related organizations. It is well established as an event where IT staff, developers, librarians, and researchers from all over the world meet, mingle and learn from each other. The topics of talks and workshops at SWIB revolve around opening data, linking data and creating tools and software for LOD production scenarios. These areas of focus are supplemented by presentations of research projects in applied sciences, industry applications, and LOD activities in other areas.
At the bi-weekly “SDA colloquium presentations”, she gave a talk on “Wikidata for Digital Preservation” and described the workflow of creating metadata for resources in the domain of computing using the Wikidata platform, and how these URIs are reused in metadata to describe pre-configured emulated computing environments in which users can interact with legacy software. She introduced this project in the context of current work at Yale University Library to provide Emulation as a Service. Afterwards, she discussed her data curation work in Wikidata as well as the Wikidata for Digital Preservation portal, WikiDP, a streamlined interface for the digital preservation community to interact with Wikidata, available online at http://wikidp.org.
The goal of Dr. Thornton’s visit was to exchange experience and ideas on digital preservation using RDF technologies. In addition to presenting various use-cases where these technologies have been applied, Dr. Thornton shared with our group future research problems and challenges related to this research area. During the meeting, SDA core research topics and main research projects were presented and we investigated suitable topics for future collaborations with Dr. Thornton and her research group.
SANSA 0.5 (Scalable Semantic Analytics Stack) Released🗓 2018-12-13 ✍ Prof. Dr. Jens Lehmann
We are happy to announce SANSA 0.5 – the fifth release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.
- Website: http://sansa-stack.net
- GitHub: https://github.com/SANSA-Stack
- Download: http://sansa-stack.net/downloads-usage/
- ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases
Key features of this release include:
- Reading and writing RDF files in N-Triples, Turtle, RDF/XML and N-Quads format
- Reading OWL files in various standard formats
- Query heterogeneous sources (Data Lake) using SPARQL – CSV, Parquet, MongoDB, Cassandra, JDBC (MySQL, SQL Server, etc.) are supported
- Support for multiple data partitioning techniques
- SPARQL querying via Sparqlify and Ontop
- Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
- RDFS, RDFS Simple and OWL-Horst forward chaining inference
- RDF graph clustering with different algorithms
- Terminological decision trees (experimental)
- Knowledge graph embedding approaches: TransE (beta), DistMult (beta)
Noteworthy changes in this release:
- A data lake concept for querying heterogeneous data sources has been integrated into SANSA
- New clustering algorithms have been added and the interface for clustering has been unified
- Ontop RDB2RDF engine support has been added
- RDF data quality assessment methods have been substantially improved
- Dataset statistics calculation has been substantially improved
- Improved unit test coverage
Deployment and getting started:
- There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
- The SANSA jar files are in Maven Central, i.e., in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
- Example code is available for various tasks.
- We provide interactive notebooks for running and testing code via Docker.
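As a toy illustration of the kind of RDF processing SANSA distributes over Spark or Flink, the following plain-Python sketch parses a couple of N-Triples lines and computes simple dataset statistics. This is only a single-machine stand-in, not SANSA's API; SANSA performs the reading, querying and statistics steps at scale over a cluster.

```python
import re

# Simplified N-Triples line: subject, predicate, object, terminating dot
NT_LINE = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$')

def parse_ntriples(lines):
    """Minimal N-Triples reader: yields (subject, predicate, object) strings."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        m = NT_LINE.match(line)
        if m:
            yield m.groups()

data = """\
<http://example.org/Bonn> <http://example.org/population> "327000" .
<http://example.org/Bonn> <http://example.org/country> <http://example.org/Germany> .
"""
triples = list(parse_ntriples(data.splitlines()))
predicates = {p for _, p, _ in triples}
print(len(triples), len(predicates))  # 2 2
```

In SANSA the same idea runs over an RDD/DataFrame of triples, so counting triples or distinct predicates becomes a distributed aggregation instead of a Python loop.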
Paper accepted at Nature Scientific Data Journal🗓 2018-12-05 ✍ Gezim Sejdiu
We are very pleased to announce that our group got a paper accepted at the Nature Journal on Scientific Data.
Nature is a weekly international journal publishing the finest peer-reviewed research in all fields of science and technology on the basis of its originality, importance, interdisciplinary interest, timeliness, accessibility, elegance and surprising conclusions. Nature also provides rapid, authoritative, insightful and arresting news and interpretation of topical and coming trends affecting science, scientists and the wider public. Scientific Data is a peer-reviewed, open-access journal for descriptions of scientifically valuable datasets, and research that advances the sharing and reuse of scientific data. It covers a broad range of research disciplines, including descriptions of big or small datasets, from major consortiums to single research groups. Scientific Data primarily publishes Data Descriptors, a new type of publication that focuses on helping others reuse data, and crediting those who share. Here is the pre-print of the accepted paper with its abstract:
- “A linked open data representation of patents registered in the US from 2005-2017” by Mofeed Hassan, Amrapali Zaveri, Jens Lehmann
Abstract: Patents are widely used to protect intellectual property and as a measure of innovation output. Each year, the USPTO grants over 150,000 patents to individuals and companies all over the world. In fact, there were more than 280,000 patent grants issued in the US in 2015. However, accessing, searching and analyzing those patents is often still cumbersome and inefficient. To overcome those problems, Google indexes patents and converts them to Extensible Markup Language (XML) files using Optical Character Recognition (OCR) techniques. In this article, we take this idea one step further and provide semantically rich, machine-readable patents using the Linked Data principles. We have converted the data spanning 12 years, i.e., 2005-2017, from XML to Resource Description Framework (RDF) format, conforming to the Linked Data principles, and made them publicly available for re-use. This data can be integrated with other data sources in order to further simplify use cases such as trend analysis, structured patent search & exploration and societal progress measurements. We describe the conversion, publishing and interlinking process along with several use cases for the USPTO Linked Patent data.
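To make the XML-to-RDF conversion step more tangible, here is a minimal sketch using only the Python standard library. The patent XML schema and the vocabulary URIs are hypothetical stand-ins for illustration, not the actual USPTO or Google Patents format used in the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified patent record (not the real USPTO XML schema)
PATENT_XML = """
<patents>
  <patent number="US9000001" year="2015">
    <title>Example widget</title>
    <assignee>Acme Corp</assignee>
  </patent>
</patents>
"""

def patent_to_triples(xml_text, base="http://example.org/patent/"):
    """Convert simplified patent XML records into N-Triples lines."""
    triples = []
    for p in ET.fromstring(xml_text).iter("patent"):
        s = f"<{base}{p.get('number')}>"
        triples.append(f'{s} <http://purl.org/dc/terms/title> "{p.findtext("title")}" .')
        triples.append(f'{s} <{base}vocab#year> "{p.get("year")}" .')
        triples.append(f'{s} <{base}vocab#assignee> "{p.findtext("assignee")}" .')
    return triples

for t in patent_to_triples(PATENT_XML):
    print(t)
```

The real pipeline additionally maps fields to established vocabularies and interlinks the resulting resources with other datasets.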
Prof. Dr. John Domingue visits SDA🗓 2018-11-22 ✍ Gezim Sejdiu
John Domingue is a full Professor at the Open University and Director of the Knowledge Media Institute in Milton Keynes, focusing on research in the Semantic Web, Linked Data, Services, Blockchain, and Education. He also serves as the President of STI International, a semantics focused networking organization which runs the ESWC conference series.
His current work focuses on how a combination of blockchain and Linked Data technologies can be used to process personal data in a decentralized trusted manner and how this can be applied in the educational domain (see http://blockchain.open.ac.uk/). This work is funded by a number of projects. The Institute of Coding is a £20M UK initiative which aims to increase the graduate computing skills base in the UK. As leader of the first of its five project themes, John Domingue is focusing on the use of blockchain micro-accreditation to support the seamless transition of learners between UK universities and UK industry. From January 2019, he will play a leading role in the EU-funded QualiChain project, which aims to revolutionize public education and its relationship to the labor market and policy-making by disrupting the way accredited educational titles and other qualifications are archived, managed, shared and verified, taking advantage of blockchain, semantics, data analytics and gamification technologies.
From January 2015 to January 2018 he served as the Project Coordinator for the European Data Science Academy, which aimed to address the skills gap in data science across Europe. The project was a success, leading to a number of outcomes, including a combined data science skills and courses portal enabling learners to find jobs across Europe which match their qualifications.
Prof. Domingue was invited to give a talk “Towards the Decentralisation of Personal Data through Blockchains and Linked Data“ at the Computer Science Colloquium at the University of Bonn co-organized by SDA.
At the bi-weekly “SDA colloquium presentations” he presented KMi and the main research topics of the institute. The goal of Prof. Domingue’s visit was to exchange experience and ideas on decentralized applications using blockchains technologies in combination with Linked Data. In addition to presenting various use-cases where blockchains and linked data technologies have helped communities to get useful insights, Prof. Dr. Domingue shared with our group future research problems and challenges related to this research area. During the meeting, SDA core research topics and main research projects were presented and we investigated suitable topics for future collaborations with Prof. Domingue and his research group.
Papers accepted at JURIX 2018🗓 2018-11-19 ✍ Gezim Sejdiu
We are very pleased to announce that our group got one paper accepted for presentation at the 31st International Conference on Legal Knowledge and Information Systems (JURIX 2018), which will be held on December 12-14, 2018, in Groningen, the Netherlands.
JURIX organizes yearly conferences on the topic of Legal Knowledge and Information Systems. The proceedings of the conferences are published in the Frontiers of Artificial Intelligence and Applications series of IOS Press.
The JURIX conference attracts a wide variety of participants, coming from government, academia, and business. It is accompanied by workshops on topics such as eGovernment, legal ontologies, legal XML, alternative dispute resolution (ADR), argumentation, and deontic logic.
Here is the accepted paper with its abstract:
- “A Question Answering System on Regulatory Documents” by Diego Collarana, Timm Heuss, Jens Lehmann, Ioanna Lytra, Gaurav Maheshwari, Rostislav Nedelchev, Thorsten Schmidt, and Priyansh Trivedi.
Abstract: In this work, we outline an approach for question answering over regulatory documents. In contrast to traditional means of accessing information in the domain, the proposed system attempts to deliver an accurate and precise answer to user queries. This is accomplished by a two-step approach which first selects relevant paragraphs given a question, and then compares the selected paragraph with the user query to predict a span in the paragraph as the answer. We employ neural network-based solutions for each step and compare them with existing and alternate baselines. We perform our evaluations with a gold-standard benchmark comprising over 600 questions on the MaRisk regulatory document. In our experiments, we observe that our proposed system outperforms the other baselines.
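The first step of such a two-step pipeline, selecting relevant paragraphs for a question, can be sketched with a simple TF-IDF overlap ranker. The paper uses neural networks for this step, so this is only a baseline-style illustration, and the regulatory-sounding example text is invented.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank_paragraphs(question, paragraphs):
    """Score each paragraph by TF-IDF weighted overlap with the question
    and return (score, paragraph) pairs sorted by descending score."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n = len(paragraphs)

    def idf(term):
        df = sum(1 for d in docs if term in d)
        return math.log((n + 1) / (df + 1)) + 1  # smoothed inverse document frequency

    q_terms = tokenize(question)
    scores = [sum(d[t] * idf(t) for t in q_terms) for d in docs]
    return sorted(zip(scores, paragraphs), reverse=True)

paragraphs = [
    "Institutions must document their risk management processes.",
    "The board meets quarterly to review strategy.",
]
best = rank_paragraphs("What are the risk management requirements?", paragraphs)[0][1]
print(best)
```

The second step, span prediction, would then run only on the top-ranked paragraph, which keeps the expensive reading model off most of the document.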
This research was partially supported by an EU H2020 grant provided for the WDAqua project (GA no. 642795).
Looking forward to seeing you at JURIX 2018.
Prof. Dr. Khalid Saeed visits SDA🗓 2018-11-08 ✍ Gezim Sejdiu
Prof. Dr. Khalid Saeed (ResearchGate) from Bialystok University of Technology, Bialystok, Poland visited the SDA group on October 24, 2018.
Khalid Saeed is a full Professor of Computer Science in the Faculty of Computer Science at Bialystok University of Technology and Faculty of Mathematics and Information Science at Warsaw University of Technology, Poland. He was with AGH Krakow in 2008-2014.
Khalid Saeed received the BSc Degree in Electrical and Electronics Engineering from Baghdad University in 1976, and the MSc and PhD Degrees from the Wroclaw University of Technology in Poland in 1978 and 1981, respectively. He was nominated by the President of Poland for the title of Professor in 2014. He received his DSc Degree (Habilitation) in Computer Science from the Polish Academy of Sciences in Warsaw in 2007. He has published more than 200 publications, including 23 edited books and 8 textbooks and reference books. He has supervised more than 110 MSc and 12 PhD theses and received more than 20 academic awards. His areas of interest are image analysis and processing, biometrics, and computer information systems.
Prof. Jens Lehmann invited the speaker to the bi-weekly “SDA colloquium presentations”, which was attended by 20-30 researchers and students from SDA. The goal of his visit was to exchange experience and ideas on biometrics applications in daily life, including face recognition, fingerprints, privacy and many more. Apart from presenting various use cases where biometrics has helped scientists gain useful insights from image analysis and the processing of raw data, Prof. Dr. Saeed shared with our group future research problems and challenges in this area and gave a talk on “Biometrics in everyday life”.
As part of a national BMBF-funded project, Prof. Saeed (BUT) is currently cooperating with Fraunhofer IAIS in the field of cognitive engineering. As an outcome of this visit, we expect to strengthen our research collaboration networks with WUT and BUT, mainly on combining semantic knowledge with ubiquitous computing and its applications, emotion detection, and Kansei engineering.
The talk also served as continued networking within the EU H2020 LAMBDA project (Learning, Applying, Multiplying Big Data Analytics). As part of this event, Dr. Valentina Janev from the Institute “Mihajlo Pupin” (PUPIN) attended the SDA meeting to investigate further networking with potential partners from Poland. Among other points, co-organizing upcoming conferences and writing joint research papers were discussed.
SDA at ISWC 2018 and a Best Demo Award🗓 2018-11-05 ✍ Gezim Sejdiu
The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises and in the context of public institutions. We are very pleased to announce that we got 3 papers accepted at ISWC 2018 for presentation at the main conference. Additionally, we also had 5 poster/demo papers accepted. Furthermore, we are very happy to announce that we won the Best Demo Award for the WebVOWL Editor: “WebVOWL Editor: Device-Independent Visual Ontology Modeling” by Vitalis Wiens, Steffen Lohmann, and Sören Auer.
Here are some further pointers in case you want to know more about the WebVOWL Editor:
- GitHub: https://github.com/VisualDataWeb/WebVOWL/tree/vowl_editor
- Demo: https://www.youtube.com/watch?v=XWXhpEr9LPY
- “EARL: Joint Entity and Relation Linking for Question Answering over Knowledge Graphs” by Mohnish Dubey, Debayan Banerjee, Debanjan Chaudhuri and Jens Lehmann
Mohnish Dubey presented EARL, a system for joint entity and relation linking for DBpedia question answering on LC-QuAD, built on Elasticsearch, fastText embeddings and an LSTM. It proposes a two-fold approach, using a GTSP solver and a connection-density classifier (based on three features) for adaptive re-ranking.
@MohnishDubey is presenting "EARL: Joint Entity and Relation Linking for Question Answering over Knowledge Graphs" for the Research Track at #iswc2018 https://t.co/TCaWRGGqf9 pic.twitter.com/Csc8aOLdjZ — SDA Research (@SDA_Research) October 10, 2018
GitHub: https://github.com/AskNowQA/EARL
Slides: https://www.slideshare.net/MohnishDubey/earl-joint-entity-and-relation-linking-for-question-answering-over-knowledge-graphs
Demo: https://earldemo.sda.tech/
- “DistLODStats: Distributed Computation of RDF Dataset Statistics” by Gezim Sejdiu, Ivan Ermilov, Jens Lehmann and Mohamed Nadjib Mami
Gezim Sejdiu presented DistLODStats, a novel software component for distributed in-memory computation of RDF dataset statistics implemented using the Spark framework. The tool is maintained and has an active community due to its integration into the larger SANSA framework.
GitHub: https://github.com/SANSA-Stack/SANSA-RDF
Slides: https://www.slideshare.net/GezimSejdiu/distlodstats-distributed-computation-of-rdf-dataset-statistics-iswc-2018-talk
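To give a flavour of what such statistical criteria look like, here is a minimal pure-Python sketch of the pattern DistLODStats applies, namely a filter rule over triples followed by an aggregation. The toy triples and the two example criteria are invented for illustration; the real implementation runs these steps as distributed Spark transformations over the full dataset.

```python
from collections import Counter

# Toy RDF triples (subject, predicate, object); illustrative only.
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", '"Bob"'),
]

# Each criterion is a filter over triples plus an aggregation;
# plain Python stands in here for Spark transformations.
def distinct_subjects(ts):
    return len({s for s, _, _ in ts})

def property_usage(ts):
    return Counter(p for _, p, _ in ts)

print(distinct_subjects(triples))          # 2 (ex:alice, ex:bob)
print(property_usage(triples)["rdf:type"]) # 2
```

Each of the 32 criteria in the paper follows this same filter-then-aggregate shape, which is what makes the computation straightforward to distribute across a cluster.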
- “Synthesizing Knowledge Graphs from web sources with the MINTE+ framework” by Diego Collarana, Mikhail Galkin, Christoph Lange, Simon Scerri, Sören Auer and Maria-Esther Vidal
Diego Collarana presented the synthesizing KG from different web sources using MINTE+, an RDF Molecule-Based Integration Framework, in three domain-specific applications.
@collarad is presenting "Synthesizing #Knowledge #Graphs from #web sources with the MINTE+ framework" for the In-Use Track at #iswc2018 https://t.co/Dl3Jddmgeu pic.twitter.com/fMmapEcqeK — SDA Research (@SDA_Research) October 10, 2018
GitHub: https://github.com/RDF-Molecules/MINTE
Slides: https://docs.google.com/presentation/d/1tV1tEuIMJoOhaTvlsgndi4YTk5ZoIuBfl9Bi3XbVr0c/edit?usp=sharing
Demo: https://youtu.be/6bNP21XSu6s
- Visualization and Interaction for Ontologies and Linked Data (VOILA 2018)
Steffen Lohmann co-organized the International Workshop on Visualization and Interaction for Ontologies and Linked Data (VOILA 2018) for the third time at ISWC. Overall, more than 40 researchers and practitioners took part in this full-day event featuring talks, discussions, and tool demonstrations, including an interactive demo session. The workshop proceedings are published as CEUR-WS vol. 2187.
Paper accepted at the Journal of Web Semantics🗓 2018-10-17 ✍ Gezim Sejdiu
We are very pleased to announce that our group got a paper accepted at the Journal of Web Semantics, in the special issue on Managing the Evolution and Preservation of the Data Web (MEPDaW). The Journal of Web Semantics is an interdisciplinary journal based on research and applications of various subject areas that contribute to the development of a knowledge-intensive and intelligent service Web. These areas include knowledge technologies, ontologies, agents, databases and the semantic grid; disciplines such as information retrieval, language technology, human-computer interaction, and knowledge discovery are of major relevance as well. All aspects of Semantic Web development are covered. The publication of large-scale experiments and their analysis is also encouraged to clearly illustrate scenarios and methods that introduce semantics into existing Web interfaces, contents, and services. The journal emphasizes the publication of papers that combine theories, methods, and experiments from different subject areas in order to deliver innovative semantic methods and applications. Here is the pre-print of the accepted paper with its abstract:
- “TISCO: Temporal Scoping of Facts” by Anisa Rula, Matteo Palmonari, Simone Rubinacci, Axel-Cyrille Ngonga Ngomo, Jens Lehmann, Andrea Maurino and Diego Esteves
Abstract: Some facts in the Web of Data are only valid within a certain time interval. However, most of the knowledge bases available on the Web of Data do not provide temporal information explicitly. Hence, the relationship between facts and time intervals is often lost. A few solutions have been proposed in this field; most of them concentrate on extracting facts with time intervals rather than on mapping facts to time intervals. This paper studies the problem of determining the temporal scopes of facts, that is, deciding the time intervals in which a fact is valid. We propose a generic approach which addresses this problem by curating temporal information of facts in knowledge bases. Our proposed framework, Temporal Information Scoping (TISCO), exploits evidence collected from the Web of Data and the Web. The evidence is combined within a three-step approach which comprises matching, selection and merging. This is the first work employing matching methods that consider either a single fact or a group of facts at a time. We evaluate our approach against a corpus of facts as input and different parameter settings for the underlying algorithms. Our results suggest that we can detect temporal information for facts from DBpedia with an f-measure of up to 80%.
Acknowledgment: This research has been supported in part by research grant number 17A209 from the University of Milano-Bicocca and by a scholarship from the University of Bonn.
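As a toy illustration of the merging idea mentioned in the abstract, the following sketch merges overlapping evidence intervals and returns the best-supported one. The data and the merging rule are invented for illustration; this is not the actual TISCO algorithm, which combines matching, selection and merging over real web evidence.

```python
# Hedged sketch of the interval-merging intuition behind temporal
# scoping: each piece of evidence suggests a time interval for a
# fact; overlapping intervals are merged and the interval backed
# by the most evidence wins.
def merge_evidence(intervals):
    intervals = sorted(intervals)
    merged = []  # list of [start, end, support]
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
            merged[-1][2] += 1
        else:
            merged.append([start, end, 1])
    return max(merged, key=lambda m: m[2])

# Toy evidence for a fact like "X plays for team Y" (years)
evidence = [(2004, 2008), (2005, 2009), (2006, 2008), (2015, 2016)]
start, end, support = merge_evidence(evidence)
print((start, end, support))  # (2004, 2009, 3)
```

The isolated interval (2015, 2016) is kept apart, while the three overlapping observations are merged into one well-supported scope.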
Papers accepted at EMNLP 2018 / FEVER & W-NUT Workshops🗓 2018-10-04 ✍ Gezim Sejdiu
We are very pleased to announce that our group got 3 workshop papers accepted for presentation at the EMNLP 2018 conference, which will be held on 1 November 2018 in Brussels, Belgium.
FEVER: The First Workshop on Fact Extraction and Verification. With billions of individual pages on the web providing information on almost every conceivable topic, we should have the ability to collect facts that answer almost every conceivable question. However, only a small fraction of this information is contained in structured sources (Wikidata, Freebase, etc.), so we are limited by our ability to transform free-form text into structured knowledge. There is, however, another problem that has become the focus of a lot of recent research and media coverage: false information coming from unreliable sources. In an effort to jointly address both problems, a workshop promoting research in joint Fact Extraction and VERification (FEVER) has been proposed.
W-NUT: The 4th Workshop on Noisy User-generated Text focuses on Natural Language Processing applied to noisy user-generated text, such as that found in social media, online reviews, crowdsourced data, web forums, clinical records and language learner essays.
Here are the accepted papers with their abstracts:
- Belittling the Source: Trustworthiness Indicators to Obfuscate Fake News on the Web by Diego Esteves, Aniketh Janardhan Reddy, Piyush Chawla and Jens Lehmann.
Abstract: With the growth of the internet, the amount of fake news online has been proliferating every year. The consequences of such phenomena are manifold, ranging from lousy decision-making processes to bullying and violence episodes. Therefore, fact-checking algorithms have become a valuable asset. To this aim, an important step in detecting fake news is to have access to a credibility score for a given information source. However, most of the widely used Web indicators have either been shut down to the public (e.g., Google PageRank) or are not free to use (Alexa Rank). Further, existing databases are short, manually curated lists of online sources, which do not scale. Finally, most of the research on the topic is theoretical or explores confidential data in a restricted simulation environment. In this paper we explore current research, highlight the challenges and propose solutions to tackle the problem of classifying websites on a credibility scale. The proposed model automatically extracts source reputation cues and computes a credibility factor, providing valuable insights which can help in belittling dubious and confirming trustful unknown websites. Experimental results outperform the state of the art in the 2-class and 5-class settings.
Abstract: Named Entity Recognition (NER) is an important subtask of information extraction that seeks to locate and recognise named entities. Despite recent achievements, we still face limitations in correctly detecting and classifying entities, prominently in short and noisy text, such as Twitter. An important negative aspect in most of NER approaches is the high dependency on hand-crafted features and domain-specific knowledge, necessary to achieve state-of-the-art results. Thus, devising models to deal with such linguistically complex contexts is still challenging. In this paper, we propose a novel multi-level architecture that does not rely on any specific linguistic resource or encoded rule. Unlike traditional approaches, we use features extracted from images and text to classify named entities. Experimental tests against state-of-the-art NER for Twitter on the Ritter dataset present competitive results (0.59 F-measure), indicating that this approach may lead towards better NER models.
- DeFactoNLP: Fact Verification using Entity Recognition, TFIDF Vector Comparison and Decomposable Attention by Aniketh Janardhan Reddy and Gil Rocha and Diego Esteves.
Abstract: In this paper, we describe DeFactoNLP, the system we designed for the FEVER 2018 Shared Task. The aim of this task was to conceive a system that can not only automatically assess the veracity of a claim but also retrieve evidence supporting this assessment from Wikipedia. In our approach, the Wikipedia documents whose Term Frequency-Inverse Document Frequency (TFIDF) vectors are most similar to the vector of the claim and those documents whose names are similar to those of the named entities (NEs) mentioned in the claim are identified as the documents which might contain evidence. The sentences in these documents are then supplied to a textual entailment recognition module. This module calculates the probability of each sentence supporting the claim, contradicting the claim or not providing any relevant information to assess the veracity of the claim. Various features computed using these probabilities are finally used by a Random Forest classifier to determine the overall truthfulness of the claim. The sentences which support this classification are returned as evidence. Our approach achieved a 0.4277 evidence F1-score, a 0.5136 label accuracy and a 0.3833 FEVER score.
Acknowledgment: This research was partially supported by an EU H2020 grant provided for the WDAqua project (GA no. 642795) and by the DAAD under the “International promovieren in Deutschland für alle” (IPID4all) project.
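The TFIDF document-selection step can be illustrated with a small self-contained sketch: score each document by the cosine similarity between its TFIDF vector and the claim's vector, and keep the most similar one. The toy corpus, the claim, and the scoring code below are invented for illustration and are not DeFactoNLP's actual implementation.

```python
import math
from collections import Counter

# Toy document collection and claim; illustrative only.
docs = {
    "Earth": "earth is the third planet from the sun",
    "Mars":  "mars is the fourth planet from the sun",
    "Cat":   "a cat is a small domesticated animal",
}
claim = "earth is the third planet"

def tfidf(text, idf):
    # term frequency weighted by inverse document frequency
    tf = Counter(text.split())
    return {w: c * idf.get(w, 0.0) for w, c in tf.items()}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Inverse document frequency computed over the toy corpus.
vocab = {w for t in docs.values() for w in t.split()}
idf = {w: math.log(len(docs) / sum(w in t.split() for t in docs.values()))
       for w in vocab}

vecs = {name: tfidf(text, idf) for name, text in docs.items()}
claim_vec = tfidf(claim, idf)
best = max(vecs, key=lambda n: cosine(vecs[n], claim_vec))
print(best)  # Earth
```

In the full system, the sentences of the top-ranked documents would then be passed on to the textual entailment module described in the abstract.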
Looking forward to seeing you at EMNLP/FEVER 2018.
Papers accepted at EKAW 2018🗓 2018-09-21 ✍ Gezim Sejdiu
We are very pleased to announce that our group got 2 papers accepted for presentation at The 21st International Conference on Knowledge Engineering and Knowledge Management (EKAW 2018), which will be held on 12-16 November 2018 in Nancy, France.
EKAW 2018 is concerned with all aspects of eliciting, acquiring, modeling and managing knowledge, and with the construction of knowledge-intensive systems and services for the Semantic Web, knowledge management, e-business, natural language processing, intelligent information integration, and so on. The special theme of EKAW 2018 is “Knowledge and AI”: the conference calls for papers that describe algorithms, tools, methodologies, and applications exploiting the interplay between knowledge and Artificial Intelligence techniques, with a special emphasis on knowledge discovery. Accordingly, EKAW 2018 puts a special emphasis on the importance of Knowledge Engineering and Knowledge Management with the help of AI as well as for AI.
Here is the list of accepted papers with their abstracts:
- “Divided we stand out! Forging Cohorts fOr Numeric Outlier Detection in large scale knowledge graphs (CONOD)” by Hajira Jabeen, Rajjat Dadwal, Gezim Sejdiu, and Jens Lehmann.
Abstract: With the recent advances in data integration and the concept of data lakes, massive pools of heterogeneous data are being curated as Knowledge Graphs (KGs). In addition to data collection, it is of utmost importance to gain meaningful insights from this composite data. However, given the graph-like representation, the multimodal nature, and the large size of the data, most of the traditional analytic approaches are no longer directly applicable. A traditional approach could collect all values of a particular attribute, e.g. height, and try to perform anomaly detection for this attribute. However, it is conceptually inaccurate to compare one attribute across different kinds of entities, e.g. the height of buildings against the height of animals. Therefore, there is a strong need to develop fundamentally new approaches for outlier detection in KGs. In this paper, we present a scalable approach, dubbed CONOD, that can deal with multimodal data and performs adaptive outlier detection against the cohorts of classes they represent, where a cohort is a set of classes that are similar based on a set of selected properties. We have tested the scalability of CONOD on KGs of different sizes, assessed the outliers using different inspection methods and achieved promising results.
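As a toy illustration of cohort-wise outlier detection, the sketch below groups numeric property values by entity class and flags outliers within each cohort using a robust modified z-score. The data, the grouping, and the scoring rule are invented for illustration; this is not the CONOD implementation, which runs at scale and forms cohorts from sets of similar classes rather than single classes.

```python
from statistics import median

heights = {  # entity -> (class, height in metres); toy data
    "BurjKhalifa": ("Building", 828.0),
    "EmpireState": ("Building", 443.0),
    "Shard":       ("Building", 310.0),
    "EiffelTower": ("Building", 300.0),
    "Gherkin":     ("Building", 180.0),
    "Giraffe":     ("Animal", 5.5),
    "Elephant":    ("Animal", 3.2),
    "Horse":       ("Animal", 2.5),
    "Mouse":       ("Animal", 0.05),
    "ElephantCm":  ("Animal", 320.0),  # unit error: centimetres, not metres
}

def cohort_outliers(data, threshold=3.5):
    # group values by cohort, then flag per-cohort outliers with a
    # modified z-score based on the median absolute deviation (MAD)
    cohorts = {}
    for entity, (cls, value) in data.items():
        cohorts.setdefault(cls, []).append((entity, value))
    flagged = []
    for members in cohorts.values():
        values = [v for _, v in members]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        for entity, v in members:
            if mad and 0.6745 * abs(v - med) / mad > threshold:
                flagged.append(entity)
    return flagged

print(cohort_outliers(heights))  # ['ElephantCm']
```

Because buildings and animals are scored in separate cohorts, the 828 m Burj Khalifa is unremarkable among buildings, while the mis-entered 320 "metre" elephant is immediately flagged among animals.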
Looking forward to seeing you at EKAW 2018.
Paper accepted at CoNLL 2018🗓 2018-09-17 ✍ Gezim Sejdiu
We are very pleased to announce that our group got one paper accepted for presentation at the SIGNLL Conference on Computational Natural Language Learning (CoNLL 2018). CoNLL is a top-tier conference organized yearly by SIGNLL (ACL's Special Interest Group on Natural Language Learning). This year, CoNLL is colocated with EMNLP 2018 and will be held on October 31 – November 1, 2018 in Brussels, Belgium.
The aim of the CoNLL conference is to bring together researchers and practitioners from both academia and industry in the areas of deep learning, natural language processing, and machine learning. It is among the top 10 conferences in natural language processing and computational linguistics.
Here is the accepted paper with its abstract:
- “Improving Response Selection in Multi-turn Dialogue Systems by Incorporating Domain Knowledge” by Debanjan Chaudhuri, Agustinus Kristiadi, Jens Lehmann and Asja Fischer.
Abstract: Building systems that can communicate with humans is a core problem in Artificial Intelligence. This work proposes a novel neural network architecture for response selection in an end-to-end multi-turn conversational dialogue setting. The architecture applies context-level attention and incorporates additional external knowledge provided by descriptions of domain-specific words. It uses a bi-directional Gated Recurrent Unit (GRU) for encoding context and responses and learns to attend over the context words given the latent response representation and vice versa. In addition, it incorporates external domain-specific information using another GRU for encoding the domain keyword descriptions. This allows better representation of domain-specific keywords in responses and hence improves the overall performance. Experimental results show that our model outperforms all other state-of-the-art methods for response selection in multi-turn conversations.
This research was supported by the KDDS project at Fraunhofer.
Looking forward to seeing you at CoNLL 2018.
AskNow 0.1 Released🗓 2018-09-12 ✍ Prof. Dr. Jens Lehmann
Dear all,
The Smart Data Analytics group is happy to announce AskNow 0.1 – the initial release of Question Answering Components and Tools over RDF Knowledge Graphs.
The following components with corresponding features are currently supported by AskNow:
- EARL 0.1 EARL performs entity linking and relation linking as a joint task. It uses machine learning in order to exploit the Connection Density between nodes in the knowledge graph. It relies on three base features and re-ranking steps in order to predict entities and relations.
ISWC 2018: https://arxiv.org/pdf/1801.03825.pdf
- SQG 0.1: This is a SPARQL query generator with a modular architecture. SQG enables easy integration with other components for the construction of a fully functional QA pipeline. Currently, entity-relation, compound, count, and boolean questions are supported.
ESWC 2018: http://jens-lehmann.org/files/2018/eswc_qa_query_generation.pdf
- AskNow UI 0.1: The UI works as a platform for users to pose their questions to the AskNow QA system. It displays the answers depending on whether the answer is an entity, a list of entities, a boolean or a literal. For entities, it shows the abstracts from DBpedia.
- SemanticParsingQA 0.1: The semantic-parsing-based question answering system, built on the integration of EARL, SQG and the AskNow UI.
View this announcement on Twitter: https://twitter.com/AskNowQA/status/1040205350853599233
The AskNow Development Team
Workshop Papers accepted at ICML/FAIM 2018🗓 2018-09-03 ✍ Gezim Sejdiu
We are very pleased to announce that our group got 2 workshop papers accepted for presentation at the NAMPI workshop at the Federated Artificial Intelligence Meeting (FAIM), co-organized with ICML, IJCAI/ECAI and AAMAS. The workshop took place in Stockholm, Sweden on 15 July 2018.
The aim of the NAMPI workshop was to bring together researchers and practitioners from both academia and industry in the areas of deep learning, program synthesis, probabilistic programming, programming languages, inductive programming and reinforcement learning, to exchange ideas on the future of program induction with a special focus on neural network models and abstract machines. Through this workshop, the organizers sought to identify common challenges, exchange ideas and lessons learned among the different fields, and establish standard evaluation benchmarks for approaches that learn with abstraction and/or reason with induced programs.
Here are the accepted papers with their abstracts:
- Neural Machine Translation for Query Construction and Composition by Tommaso Soru, Edgard Marx, André Valdestilhas, Diego Esteves, Diego Moussallem and Gustavo Publio.
Abstract: Research on question answering over knowledge bases has recently seen an increasing use of deep architectures. In this extended abstract, we study the application of the neural machine translation paradigm for question parsing. We employ a sequence-to-sequence model to learn graph patterns in the SPARQL graph query language and their compositions. Instead of inducing the programs through question-answer pairs, we adopt a semi-supervised approach, where alignments between questions and queries are built through templates. We argue that the coverage of language utterances can be expanded using recent notable works in natural language generation.
- ML-Schema: Exposing the Semantics of Machine Learning with Schemas and Ontologies by Gustavo Correa Publio, Diego Esteves, Agnieszka Ławrynowicz, Panče Panov, Larisa Soldatova, Tommaso Soru, Joaquin Vanschoren and Hamid Zafar.
Abstract: The ML-Schema, proposed by the W3C Machine Learning Schema Community Group, is a top-level ontology that provides a set of classes, properties, and restrictions for representing and interchanging information on machine learning algorithms, datasets, and experiments. It can be easily extended and specialized and it is also mapped to other, more domain-specific ontologies developed in the area of machine learning and data mining. In this paper we overview existing state-of-the-art machine learning interchange formats and present the first release of ML-Schema, a canonical format resulting from more than seven years of experience among different research institutions. We argue that exposing the semantics of machine learning algorithms, models, and experiments through a canonical format may pave the way to better interpretability and to realistically achieving full interoperability of experiments regardless of platform or adopted workflow solution.
Acknowledgment: This work was partially supported by NEAR AI.
Demo and Poster Papers accepted at ISWC 2018🗓 2018-08-28 ✍ Gezim Sejdiu
We are very pleased to announce that our group got 4 demo/poster papers accepted for presentation at ISWC 2018: The 17th International Semantic Web Conference, which will be held on October 8-12, 2018 in Monterey, California, USA. The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises and in the context of public institutions. Here is the list of the accepted papers with their abstracts:
- “STATisfy Me: What are my Stats?” by Gezim Sejdiu, Ivan Ermilov, Mohamed Nadjib Mami and Jens Lehmann
Abstract: The increasing adoption of the Linked Data format, RDF, over the last two decades has brought new opportunities. It has also raised new challenges though, especially when it comes to managing and processing large amounts of RDF data. In particular, assessing the internal structure of a data set is important, since it enables users to understand the data better. One prominent way of assessment is computing statistics about the instances and schema of a data set. However, computing statistics of large RDF data is computationally expensive. To overcome this challenging situation, we previously built DistLODStats, a framework for parallel calculation of 32 statistical criteria over large RDF datasets, based on Apache Spark. Running DistLODStats is, thus, done via submitting jobs to a Spark cluster. Often, this process is done manually, either by connecting to the cluster machine or via a dedicated resource manager. This approach is inconvenient as it requires acquiring new software skills as well as the direct interaction of users with the cluster. In order to make the use of DistLODStats easier, we propose in this paper an approach for triggering RDF statistics calculation remotely, simply using HTTP requests. DistLODStats is built as a plugin into the larger SANSA Framework and makes use of Apache Livy, a novel lightweight solution for interacting with a Spark cluster via a REST interface.
- “Joint Entity and Relation Linking using EARL” by Debayan Banerjee, Mohnish Dubey, Debanjan Chaudhuri and Jens Lehmann
Abstract: In order to answer natural language questions over knowledge graphs, most processing pipelines involve entity and relation linking. Traditionally, entity linking and relation linking have been performed either as dependent sequential tasks or independent parallel tasks. In this demo paper, we present EARL, which performs entity linking and relation linking as a joint single task. The system determines the best semantic connection between all keywords of the question by referring to the knowledge graph. This is achieved by exploiting the connection density between entity candidates and relation candidates. EARL uses bloom filters for faster retrieval of connection density and an extended label vocabulary for higher recall to improve the overall accuracy.
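The connection-density intuition can be sketched in a few lines: prefer the candidate (entity, relation) combination that is actually connected in the knowledge graph. The tiny graph, the candidate lists, and the binary scoring below are all invented for illustration; real EARL aggregates richer density features and re-ranks candidates with a classifier (with bloom filters speeding up the connectivity lookups).

```python
from itertools import product

# Toy KG: (entity, relation) pairs that co-occur; illustrative only.
edges = {
    ("dbr:Barack_Obama", "dbo:spouse"),
    ("dbr:Barack_Obama", "dbo:birthPlace"),
    ("dbr:Michelle_Obama", "dbo:spouse"),
    ("dbr:Barack_Obama_Sr.", "dbo:birthPlace"),
}

# Candidates produced by entity/relation linking for a question
# like "Who is the spouse of Barack Obama?"
entity_candidates = ["dbr:Barack_Obama", "dbr:Barack_Obama_Sr."]
relation_candidates = ["dbo:spouse", "dbo:child"]

def best_pair(entities, relations):
    # score each combination by whether it is connected in the KG;
    # a stand-in for EARL's connection-density features
    scored = [((e, r), int((e, r) in edges))
              for e, r in product(entities, relations)]
    return max(scored, key=lambda x: x[1])[0]

print(best_pair(entity_candidates, relation_candidates))
```

Treating the two linking tasks jointly is what lets the graph disambiguate both at once: "dbo:spouse" rules out the father, and "dbr:Barack_Obama" rules out "dbo:child".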
- “Generating SPARQL Query Containment Benchmarks using the SQCFramework” by Muhammad Saleem, Qaiser Mehmood, Claus Stadler, Jens Lehmann and Axel-Cyrille Ngonga Ngomo
Abstract: In this demo paper, we present the interface of the SQCFramework, a SPARQL query containment benchmark generation framework. SQCFramework is able to generate customized SPARQL containment benchmarks from real SPARQL query logs. To this end, the framework makes use of different clustering techniques. It is flexible enough to generate benchmarks of varying sizes and complexities according to user-defined criteria on important SPARQL features for query containment benchmarking. We evaluate the usability of the interface by using the standard system usability scale questionnaire. Our overall usability score of 82.33 suggests that the online interface is consistent, easy to use, and the various functions of the system are well integrated.
- “Synthesizing a Knowledge Graph of Data Scientist Job Offers with MINTE+” by Mikhail Galkin, Diego Collarana, Mayesha Tasnim and Maria-Esther Vidal
Abstract: Data Scientist is one of the most sought-after jobs of this decade. In order to analyze the job market in this domain, interested institutions have to integrate numerous job advertisements coming from heterogeneous Web sources, e.g., job portals, company websites, and professional community platforms such as StackOverflow and GitHub. In this demo, we show the application of the RDF Molecule-Based Integration Framework MINTE+ in the domain-specific application of job market analysis. The use of RDF molecules for knowledge representation is a core element of the framework and gives MINTE+ enough flexibility to integrate job advertisements from different web resources and countries. Attendees will observe how exploration and analysis of the data science job market in Europe can be facilitated by synthesizing, at query time, a consolidated knowledge graph of job advertisements. The demo is available at: https://github.com/RDF-Molecules/MINTE/blob/master/README.md#live-demo
Acknowledgment: This work has received funding from the EU Horizon 2020 projects BigDataEurope (GA no. 644564), QROWD (GA no. 723088), the Marie Skłodowska-Curie action WDAqua (GA no. 642795), HOBBIT (GA no. 688227) and SlideWiki (grant no. 688095), and the German Ministry of Education and Research (BMBF) in the context of the projects LiDaKrA (Linked-Data-basierte Kriminalanalyse, grant no. 13N13627) and InDaSpacePlus (grant no. 01IS17031).
Looking forward to seeing you at ISWC 2018.
Paper and Poster Papers accepted at SEMANTICS 2018🗓 2018-08-20 ✍ Gezim Sejdiu
We are very pleased to announce that our group got two papers and two poster papers accepted for presentation at the SEMANTiCS 2018 conference, which will take place in Vienna, Austria on 10-13 September 2018.
SEMANTiCS is an established knowledge hub where technology professionals, industry experts, researchers and decision makers can learn about new technologies, innovations and enterprise implementations in the fields of Linked Data and Semantic AI. Since 2005, the conference series has focused on semantic technologies, which today, together with other methodologies such as NLP and machine learning, form the core of intelligent systems. The conference highlights the benefits of standards-based approaches.
Here is the list of accepted papers with their abstracts:
- “Profiting from Kitties on Ethereum: Leveraging Blockchain RDF with SANSA” by Damien Graux, Gezim Sejdiu, Hajira Jabeen, Jens Lehmann, Danning Sui, Dominik Muhs and Johannes Pfeffer (Poster & Demo Track)
Abstract: In this poster, we show attendees how the recent state-of-the-art Semantic Web tool SANSA can be used to tackle blockchain-specific challenges. In particular, the poster focuses on the use case of CryptoKitties: a popular Ethereum-based online game where users are able to trade virtual kitty pets in a secure way.
- “SPIRIT: A Semantic Transparency and Compliance Stack” by Patrick Westphal, Javier Fernández, Sabrina Kirrane and Jens Lehmann (Poster & Demo Track)
Abstract: The European General Data Protection Regulation (GDPR) sets new precedents for the processing of personal data. In this paper, we propose an architecture that provides an automated means to enable transparency with respect to personal data processing and sharing transactions and compliance checking with respect to data subject usage policies and GDPR legislative obligations.
- “SemSur: A Core Ontology for the Semantic Representation of Research Findings” by Said Fathalla, Sahar Vahdati, Sören Auer and Christoph Lange (Research & Innovation)
Abstract: The way research is communicated using text publications has not changed much over the past decades. We have the vision that ultimately researchers will work on a common structured knowledge base comprising comprehensive semantic and machine-comprehensible descriptions of their research, thus making research contributions more transparent and comparable. We present the SemSur ontology for semantically capturing the information commonly found in survey and review articles. SemSur is able to represent scientific results and to publish them in a comprehensive knowledge graph, which provides an efficient overview of a research field, and to compare research findings with related works in a structured way, saving researchers a significant amount of time and effort. The new release of SemSur covers more domains, defines better alignment with external ontologies and rules for eliciting implicit knowledge. We discuss possible applications and present an evaluation of our approach with the retrospective, exemplary semantification of a survey. We demonstrate the utility of the SemSur ontology to answer queries about the different research contributions covered by the survey. SemSur is currently used and maintained at OpenResearch.org.
- “Cross-Lingual Ontology Enrichment Based on Multi-Agent Architecture” by Mohamed Ali, Said Fathalla, Shimaa Ibrahim, Mohamed Kholief, Yasser Hassan (Research & Innovation)
Abstract: The proliferation of ontologies and multilingual data available on the Web has motivated many researchers to contribute to multilingual and cross-lingual ontology enrichment. Cross-lingual ontology enrichment greatly facilitates ontology learning from multilingual text/ontologies in order to support collaborative ontology engineering processes. This article proposes a cross-lingual ontology enrichment (CLOE) approach based on a multi-agent architecture in order to enrich ontologies from a multilingual text or ontology. This has several advantages: 1) an ontology is used to enrich another one, written in a different natural language, and 2) several ontologies could be enriched at the same time using a single chunk of text (Simultaneous Ontology Enrichment). A prototype for the proposed approach has been implemented in order to enrich several ontologies using English, Arabic and German text. Evaluation results are promising, showing that CLOE performs well in comparison with four state-of-the-art approaches.
Furthermore, we are pleased to announce that we also got a talk accepted in the Industry Track. Here is the accepted talk with its abstract:
- “Using the SANSA Stack on a 38 Billion Triple Ethereum Blockchain Dataset”
Abstract: SANSA is the first open-source project that allows out-of-the-box horizontally scalable analytics for large knowledge graphs. The talk will cover the main features of SANSA, introducing its different layers, namely RDF, Query, Inference and Machine Learning. The talk also covers a large-scale Ethereum blockchain use case at Alethio, a spinoff company of ConsenSys. Alethio is building an analytics dashboard that strives to provide transparency over what’s happening on the Ethereum p2p network, the transaction pool and the blockchain in order to provide “blockchain archaeology”. Their 6 billion triple dataset contains large-scale blockchain transaction data modelled as RDF according to the structure of the Ethereum ontology. Alethio chose to work with SANSA after experimenting with other existing engines. Specifically, the initial goal of Alethio was to load a 2TB EthOn dataset containing more than 6 billion triples and then perform several analytic queries on it with up to three inner joins. SANSA has successfully provided a platform that allows running these queries. Speaker: Hajira Jabeen
Acknowledgment: This work has received funding from the EU Horizon 2020 projects BigDataOcean (GA no. 732310) and QROWD (GA no. 723088), the Marie Skłodowska-Curie action WDAqua (GA no. 642795), and SPECIAL (GA no. 731601).
Looking forward to seeing you at SEMANTiCS 2018.
Short Paper accepted at ECML/PKDD 2018🗓 2018-07-23 ✍ Gezim Sejdiu
We are very pleased to announce that our group got a short paper accepted for presentation at ECML/PKDD 2018 (Nectar Track): The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases will take place in the Croke Park Conference Centre, Dublin, Ireland, during 10-14 September 2018. This event is the premier European machine learning and data mining conference and builds upon over 16 years of successful events and conferences held across Europe. Ireland is delighted to host the conference and to bring participants together at Croke Park, one of the iconic sporting venues, which also provides a world-class conference facility. Here is the accepted paper with its abstract:
- “Deep Query Ranking for Question Answering over Knowledge Bases” by Hamid Zafar, Giulio Napolitano, and Jens Lehmann (Nectar track)
Abstract: We study question answering systems over knowledge graphs which map an input natural language question into candidate formal queries. Often, a ranking mechanism is used to discern the queries with higher similarity to the given question. Considering the intrinsic complexity of natural language, finding the most accurate formal counterpart is a challenging task. In our recent paper, we leveraged Tree-LSTM to exploit the syntactical structure of the input question as well as the candidate formal queries to compute the similarities. An empirical study shows that taking the structural information of the input question and candidate query into account enhances the performance, when compared to the baseline system.
Acknowledgment: This research was supported by EU H2020 grants for the projects HOBBIT (GA no. 688227) and WDAqua (GA no. 642795) as well as by German Federal Ministry of Education and Research (BMBF) funding for the project SOLIDE (no. 13N14456).
Looking forward to seeing you at ECML/PKDD 2018.
SOLIDE at the BMBF Innovation Forum “Civil Security” 2018🗓 2018-07-13 ✍ Gezim Sejdiu
SDA, as part of the SOLIDE project, participated at the invitation of the Federal Ministry of Education and Research (BMBF) in the BMBF Innovation Forum “Civil Security” 2018, which took place on 19 and 20 June 2018. The two-day conference on the framework program “Research for Civil Security” was held in the Café Moskau conference center in Berlin.
SOLIDE, as one of the projects funded by the BMBF, was presented during the event in the context of the session “Mission Support – Better Situation Management through Intelligent Information Acquisition”.
The SOLIDE project aims to examine a new approach for efficient access to operational data using the command mission management software TecBos Command. The focus here is on the fact that information can be accessed in a natural language dialogue. For this purpose, we do research into subject-specific algorithms for filtering relevant knowledge as well as suitable data integration procedures to make the available data usable and retrievable via dialogues.
SOLIDE is a joint project of PRO DV (Dortmund), Aristech GmbH (Heidelberg) together with the research group Smart Data Analytics (SDA) of the University of Bonn and the Data Science Chair (DICE) of the University of Paderborn.
SDA contributes to the project by providing a cutting-edge dialogue system that offers information support in emergency situations.
SANSA Collaboration with Alethio🗓 2018-07-13 ✍ Gezim Sejdiu
The SANSA team is excited to announce our collaboration with Alethio (a ConsenSys formation). SANSA is the major distributed, open-source solution for RDF querying, reasoning and machine learning. Alethio is building an Ethereum analytics platform that strives to provide transparency over what’s happening on the Ethereum p2p network, the transaction pool and the blockchain and provide “blockchain archeology”. Their 5 billion triple dataset contains large-scale blockchain transaction data modelled as RDF according to the structure of the Ethereum ontology. EthOn - The Ethereum Ontology - is a formalization of concepts/entities and relations of the Ethereum ecosystem represented in RDF and OWL format. It describes all Ethereum terms including blocks, transactions, contracts, nonces etc. as well as their relationships. Its main goal is to serve as a data model and learning resource for understanding Ethereum. Alethio is interested in using SANSA as a scalable processing engine for their large-scale batch and stream processing tasks, such as querying the data in real time via SPARQL and performing related analytics on a wide range of subjects (e.g. asset turnover for sets of accounts, attack pattern detection or Opcode usage statistics). At the same time, SANSA is interested in further industrial pilot applications for testing the scalability on larger datasets, maturing its code base and gaining experience on running the stack on production clusters. Specifically, the initial goal of Alethio was to load a 2TB EthOn dataset containing more than 5 billion triples and then perform several analytic queries on it with up to three inner joins. The queries are used to characterize movement between groups of Ethereum accounts (e.g. exchanges or investors in ICOs) and aggregate their in and out value flow over the history of the Ethereum blockchain.
The experiments were successfully run by Alethio on a cluster with up to 100 worker nodes and 400 cores that have a total of over 3TB of memory available. “I am excited to see that SANSA works and scales well to our data. Now, we want to experiment with more complex queries and tune the Spark parameters to gain the optimal performance for our dataset” said Johannes Pfeffer, co-founder of Alethio. “I am glad that Alethio managed to run their workload and to see how well our methods scale to a 5 billion triple dataset”, added Gezim Sejdiu, PhD student at the Smart Data Analytics Group and SANSA core developer. Parts of the SANSA team, including its leader Prof. Jens Lehmann as well as Dr. Hajira Jabeen, Dr. Damien Graux and Gezim Sejdiu, will now continue the collaboration together with the data science team of Alethio after those successful experiments. Beyond the above initial tests, we are jointly discussing possibilities for efficient stream processing in SANSA, further tuning of aggregate queries as well as suitable Apache Spark parameters for efficient processing of the data. In the future, we want to join hands to optimize the performance of loading the data (e.g. reducing the disk footprint of datasets using compression techniques allowing then more efficient SPARQL evaluation), handling the streaming data, querying, and analytics in real time. The SANSA team is happily looking forward to further interesting scientific research as well as industrial adaptation.
Core model of the fork history of the Ethereum Blockchain modeled in EthOn
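To give a feel for the kind of aggregate queries mentioned above, here is a minimal Python sketch of aggregating in- and out-value flow per account group. All addresses, group labels and values below are invented for illustration; Alethio's actual analytics run as SPARQL queries over SANSA on Spark against EthOn-modelled data.

```python
from collections import defaultdict

# Invented value transfers between accounts (illustration only, not EthOn data).
transfers = [
    {"from": "0xa1", "to": "0xb1", "value": 10},
    {"from": "0xb1", "to": "0xa2", "value": 4},
    {"from": "0xa2", "to": "0xc1", "value": 7},
]

# Assumed grouping of accounts, e.g. exchanges vs. ICO investors.
groups = {"0xa1": "exchange", "0xa2": "exchange",
          "0xb1": "investor", "0xc1": "investor"}

def group_flow(transfers, groups):
    """Aggregate the in- and out-value flow per account group."""
    flow = defaultdict(lambda: {"in": 0, "out": 0})
    for t in transfers:
        flow[groups[t["from"]]]["out"] += t["value"]
        flow[groups[t["to"]]]["in"] += t["value"]
    return dict(flow)
```

In the real deployment, the same aggregation corresponds to a SPARQL GROUP BY over billions of triples, distributed across the Spark cluster.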
A BETTER project for exploiting Big Data in Earth Observation🗓 2018-07-05 ✍ Gezim Sejdiu
The SANSA Stack is one of the earmarked big data analytics components to be employed in the BETTER data pipelines.
Big-data Earth observation Technology and Tools Enhancing Research and Development (BETTER) is an EU H2020 research and innovation project running from November 2017 to the end of October 2020.
The project’s main objective is to implement Big Data solutions (denominated Data Pipelines) based on the usage of large volumes of heterogeneous Earth Observation datasets. This should help address key Societal Challenges, so that users can focus on extracting and analysing the potential knowledge within the data and not on processing the data itself.
To achieve that, BETTER is improving the way Big Data service developers interact with end-users. After defining the challenges, the promoters validate the pipeline requirements and co-design the solution with a dedicated development team in a workshop. During the implementation, promoters can continuously test and validate the pipelines. Later, the implemented pipelines will be used by the public in the scope of Hackathons, enabling the use of specific solutions in other areas and the collection of additional user feedback. www.ec-better.eu
SANSA 0.4 (Semantic Analytics Stack) Released🗓 2018-06-26 ✍ Prof. Dr. Jens Lehmann
We are happy to announce SANSA 0.4 - the fourth release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.
- Website: http://sansa-stack.net
- GitHub: https://github.com/SANSA-Stack
- Download: http://sansa-stack.net/downloads-usage/
- ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases
The following features are currently supported by SANSA:
- Reading and writing RDF files in N-Triples, Turtle, RDF/XML and N-Quads format
- Reading OWL files in various standard formats
- Support for multiple data partitioning techniques
- SPARQL querying via Sparqlify
- Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
- RDFS, RDFS Simple, OWL-Horst, EL (experimental) forward chaining inference
- Automatic inference plan creation (experimental)
- RDF graph clustering with different algorithms
- Terminological decision trees (experimental)
- Anomaly detection (beta)
- Knowledge graph embedding approaches: TransE (beta), DistMult (beta)
Noteworthy changes and updates since the previous release include:
- Parser performance has been improved significantly, e.g. DBpedia 2016-10 can be loaded in <100 seconds on a 7-node cluster
- Support for a wider range of data partitioning strategies
- A better unified API across data representations (RDD, DataFrame, DataSet, Graph) for triple operations
- Improved unit test coverage
- Improved distributed statistics calculation (see ISWC paper)
- Initial scalability tests on 6 billion triple Ethereum blockchain data on a 100 node cluster
- New SPARQL-to-GraphX rewriter aiming at providing better performance for queries exploiting graph locality
- Numeric outlier detection tested on DBpedia (en)
- Improved clustering tested on 20 GB RDF data sets
- There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
- The SANSA jar files are in Maven Central, i.e., in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
- Example code is available for various tasks.
- We provide interactive notebooks for running and testing code via Docker.
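As a taste of what the inference layer's forward chaining does conceptually, here is a tiny, SANSA-independent Python sketch covering just two RDFS entailment rules (subclass transitivity and type propagation). SANSA performs this kind of fixed-point computation distributed over Spark or Flink; the triples below are made up.

```python
def rdfs_forward_chain(triples):
    """Naive fixed-point forward chaining for two RDFS rules:
    rdfs11: (A subClassOf B), (B subClassOf C) => (A subClassOf C)
    rdfs9:  (x type A), (A subClassOf B)       => (x type B)
    """
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            if p != "rdfs:subClassOf":
                continue
            for s2, p2, o2 in inferred:
                if p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdfs:subClassOf", o2))   # rdfs11
                if p2 == "rdf:type" and o2 == s:
                    new.add((s2, "rdf:type", o))          # rdfs9
        if new <= inferred:  # fixed point reached
            return inferred
        inferred |= new

facts = {(":alice", "rdf:type", ":Student"),
         (":Student", "rdfs:subClassOf", ":Person"),
         (":Person", "rdfs:subClassOf", ":Agent")}
closure = rdfs_forward_chain(facts)
```

The distributed implementation avoids this quadratic nested loop by joining partitioned triple sets per rule, but the fixed-point iteration idea is the same.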
Papers accepted at ISWC 2018🗓 2018-06-21 ✍ Gezim Sejdiu
We are very pleased to announce that our group got 3 papers accepted for presentation at ISWC 2018: The 17th International Semantic Web Conference, which will be held on October 8-12, 2018 in Monterey, California, USA. The International Semantic Web Conference (ISWC) is the premier international forum where Semantic Web / Linked Data researchers, practitioners, and industry specialists come together to discuss, advance, and shape the future of semantic technologies on the web, within enterprises and in the context of public institutions. Here is the list of the accepted papers with their abstracts:
- “EARL: Joint Entity and Relation Linking for Question Answering over Knowledge Graphs” by Mohnish Dubey, Debayan Banerjee, Debanjan Chaudhuri and Jens Lehmann (Research Track)
Abstract: Many question answering systems over knowledge graphs rely on entity and relation linking components in order to connect the natural language input to the underlying knowledge graph. Traditionally, entity linking and relation linking have been performed either as dependent, sequential tasks or as independent, parallel tasks. In this paper, we propose a framework called EARL, which performs entity linking and relation linking as a joint task. EARL implements two different solution strategies for which we provide a comparative analysis in this paper: The first strategy is a formalization of the joint entity and relation linking tasks as an instance of the Generalised Travelling Salesman Problem (GTSP). In order to be computationally feasible, we employ approximate GTSP solvers. The second strategy uses machine learning in order to exploit the connection density between nodes in the knowledge graph. It relies on three base features and re-ranking steps in order to predict entities and relations. We compare the strategies and evaluate them on a dataset with 5000 questions. Both strategies significantly outperform the current state-of-the-art approaches for entity and relation linking.
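To make the joint-linking formulation concrete, here is a tiny brute-force Python sketch: one candidate list per mention, and we pick exactly one candidate per list so that the summed pairwise "distance" in the knowledge graph is minimal. The candidates and distances below are invented; EARL itself uses approximate GTSP solvers (or the learned connection-density features), since brute force does not scale.

```python
from itertools import product

# Invented candidate clusters, one per mention in the question.
clusters = {
    "Taikonaut": ["dbr:Astronaut", "dbr:Taikonaut_(film)"],
    "country":   ["dbo:country", "dbo:nationality"],
}
# Invented pairwise knowledge-graph distances between candidates.
dist = {
    ("dbr:Astronaut", "dbo:country"): 3,
    ("dbr:Astronaut", "dbo:nationality"): 1,
    ("dbr:Taikonaut_(film)", "dbo:country"): 5,
    ("dbr:Taikonaut_(film)", "dbo:nationality"): 6,
}

def joint_link(clusters, dist):
    """Pick one candidate per cluster minimising the summed pairwise distance."""
    best, best_cost = None, float("inf")
    for combo in product(*clusters.values()):
        cost = sum(dist.get((a, b), dist.get((b, a), 0))
                   for i, a in enumerate(combo) for b in combo[i + 1:])
        if cost < best_cost:
            best, best_cost = combo, cost
    return dict(zip(clusters, best)), best_cost
```

Picking jointly rather than per-mention is exactly what lets mutually coherent candidates reinforce each other.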
- “DistLODStats: Distributed Computation of RDF Dataset Statistics” by Gezim Sejdiu, Ivan Ermilov, Jens Lehmann and Mohamed Nadjib Mami (Resources Track)
Abstract: Over the last years, the Semantic Web has been growing steadily. Today, we count more than 10,000 datasets made available online following Semantic Web standards. Nevertheless, many applications, such as data integration, search, and interlinking, may not take full advantage of the data without having a priori statistical information about its internal structure and coverage. In fact, there are already a number of tools which offer such statistics, providing basic information about RDF datasets and vocabularies. However, those usually show severe deficiencies in terms of performance once the dataset size grows beyond the capabilities of a single machine. In this paper, we introduce a software library for statistical calculations of large RDF datasets, which scales out to clusters of machines. More specifically, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark. The preliminary results show that our distributed approach improves upon a previous centralized approach we compare against and provides approximately linear horizontal scale-up. The criteria are extensible beyond the 32 default ones; the library is integrated into the larger SANSA framework and employed in at least four major usage scenarios beyond the SANSA community.
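For intuition, a handful of the simpler criteria can be sketched in a few lines of plain, single-machine Python (the criterion names here are ours, not the library's); DistLODStats expresses the same aggregations as transformations over Spark RDDs so they scale out horizontally.

```python
from collections import Counter

def dataset_stats(triples):
    """Compute a few illustrative statistical criteria over an in-memory
    list of (subject, predicate, object) triples."""
    return {
        "triples": len(triples),
        "distinct_subjects": len({s for s, _, _ in triples}),
        "distinct_predicates": len({p for _, p, _ in triples}),
        "class_usage": Counter(o for _, p, o in triples if p == "rdf:type"),
    }

example = [
    (":alice", "rdf:type", "foaf:Person"),
    (":alice", "foaf:knows", ":bob"),
    (":bob", "rdf:type", "foaf:Person"),
]
```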
- “Synthesizing Knowledge Graphs from web sources with the MINTE+ framework” by Diego Collarana, Mikhail Galkin, Christoph Lange, Simon Scerri, Sören Auer and Maria-Esther Vidal (In-Use Track)
Abstract: Institutions from different domains require the integration of data coming from heterogeneous Web sources. Typical use cases include Knowledge Search, Knowledge Building, and Knowledge Completion. We report on the implementation of the RDF Molecule-Based Integration Framework MINTE+ in three domain-specific applications: Law Enforcement, Job Market Analysis, and Manufacturing. The use of RDF molecules as data representation and a core element in the framework gives MINTE+ enough flexibility to synthesize knowledge graphs in different domains. We first describe the challenges in each domain-specific application, then the implementation and configuration of the framework to solve the particular problems of each domain. We show how the parameters defined in the framework allow tuning the integration process with the best values according to each domain. Finally, we present the main results, and the lessons learned from each application.
Acknowledgment: This work has received funding from the EU Horizon 2020 projects BigDataEurope (GA no. 644564), QROWD (GA no. 723088), HOBBIT (GA no. 688227) and SlideWiki (GA no. 688095), the Marie Skłodowska-Curie action WDAqua (GA no. 642795), and the German Ministry of Education and Research (BMBF) in the context of the projects LiDaKrA (Linked-Data-basierte Kriminalanalyse, grant no. 13N13627) and InDaSpacePlus (grant no. 01IS17031).
Looking forward to seeing you at ISWC 2018.
Paper accepted at GRADES 2018 workshop at SIGMOD / PODS🗓 2018-05-15 ✍ Gezim Sejdiu
We are very pleased to announce that our group got one paper accepted for presentation at the GRADES workshop at SIGMOD/PODS 2018: the ACM International Conference on Management of Data, which will be held in Houston, TX, USA, on June 10-15, 2018. The annual ACM SIGMOD/PODS Conference is a leading international forum for database researchers, practitioners, developers, and users to explore cutting-edge ideas and results and to exchange techniques, tools, and experiences. The conference includes a fascinating technical program with research and industrial talks, tutorials, demos, and focused workshops. It also hosts a poster session to learn about innovative technology, an industrial exhibition to meet companies and publishers, and a careers-in-industry panel with representatives from leading companies. The focus of the GRADES 2018 workshop is the application areas, usage scenarios and open challenges in managing large-scale graph-shaped data. The workshop is a forum for exchanging ideas and methods for mining, querying and learning with real-world network data, developing new common understandings of the problems at hand, sharing of data sets and benchmarks where applicable, and leveraging existing knowledge from different disciplines. Additionally, by considering specific techniques (e.g., algorithms, data/index structures) in the context of the systems that implement them, rather than describing them in isolation, GRADES-NDA aims to present technical contributions inside graph, RDF and other data management systems on graphs of a large size. Here is the accepted paper with its abstract:
- “Two for One -- Querying Property Graphs using SPARQL via GREMLINATOR” by Harsh Thakkar, Dharmen Punjani, Jens Lehmann and Sören Auer
Abstract: In the past decade knowledge graphs have become very popular and frequently rely on the Resource Description Framework (RDF) or Property Graphs (PG) as their data models. However, the query languages for these two data models – SPARQL for RDF and the PG traversal language Gremlin – lack basic interoperability. In this demonstration paper, we present Gremlinator, the first translator from SPARQL – the W3C standardized language for RDF – to Gremlin – a popular property graph traversal language. Gremlinator translates SPARQL queries to Gremlin path traversals for executing graph pattern matching queries over graph databases. This allows a user, who is well versed in SPARQL, to access and query a wide variety of graph databases, avoiding the steep learning curve of adapting to a new Graph Query Language (GQL). Gremlin is a graph computing system-agnostic traversal language (covering both OLTP graph databases and OLAP graph processors), making it a desirable choice for supporting interoperability for querying graph databases. Gremlinator is planned to be released as an Apache TinkerPop plugin in upcoming releases.
Acknowledgment: This work has received funding from the EU H2020 R&I programme for the Marie Skłodowska-Curie action WDAqua (GA No 642795).
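To illustrate the flavour of such a translation (and only the flavour: Gremlinator covers full SPARQL 1.0 pattern matching, which is far more involved), here is a toy Python sketch that maps a linear SPARQL basic graph pattern onto a Gremlin-style path traversal string.

```python
def bgp_to_gremlin(patterns):
    """Translate a *linear* basic graph pattern, e.g.
        ?a knows ?b . ?b created ?c
    into a Gremlin-style traversal string. Each triple pattern whose
    subject continues the previous object becomes an out() step."""
    prev = patterns[0][0]
    steps = ["g.V()"]
    for subj, pred, obj in patterns:
        if subj != prev:
            raise ValueError("this sketch only supports linear chains")
        steps.append(f"out('{pred}')")
        prev = obj
    return ".".join(steps)
```

For instance, `bgp_to_gremlin([("?a", "knows", "?b"), ("?b", "created", "?c")])` produces a `g.V().out(...)` path traversal; real SPARQL also requires handling filters, optional patterns, joins on arbitrary variables, and so on.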
Looking forward to seeing you at GRADES 2018.
Demo Paper accepted at SIGIR 2018🗓 2018-05-04 ✍ Gezim Sejdiu
We are very pleased to announce that our group got one paper accepted for presentation at the demo session of SIGIR 2018: The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, which will be held in Ann Arbor, Michigan, USA, on July 8-12, 2018. The annual SIGIR conference is the major international forum for the presentation of new research results, and the demonstration of new systems and techniques, in the broad field of information retrieval (IR). The 41st ACM SIGIR conference welcomes contributions related to any aspect of information retrieval and access, including theories and foundations, algorithms and applications, and evaluation and analysis. The conference and program chairs invite those working in areas related to IR to submit high-impact original papers for review. Here is the accepted paper with its abstract:
- “Dynamic Composition of Question Answering Pipelines with Frankenstein” by Kuldeep Singh, Ioanna Lytra, Arun Sethupat Radhakrishna, Akhilesh Vyas and Maria Esther Vidal.
Abstract: Question answering (QA) systems provide user-friendly interfaces for retrieving answers from structured and unstructured data to natural language questions. Several QA systems, as well as related components, have been contributed by the industry and research community in recent years. However, most of these efforts have been performed independently from each other and with different focuses, and their synergies in the scope of QA have not been addressed adequately. Frankenstein is a novel framework for developing QA systems over knowledge bases by integrating existing state-of-the-art QA components performing different tasks. It incorporates several reusable QA components and employs machine-learning techniques to predict the best-performing components and QA pipelines for a given question, to generate static and dynamic executable QA pipelines. In this demo, attendees will be able to view the different functionalities of Frankenstein for performing independent QA component execution, QA component prediction given an input question, as well as the static and dynamic composition of different QA pipelines.
Acknowledgment: This work has received funding from the EU H2020 R&I programme for the Marie Skłodowska-Curie action WDAqua (GA No 642795).
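The dynamic composition idea can be boiled down to a few lines: given predicted per-question performance scores for each component of each QA task, pick the best component per task. The component names below are examples of published QA tools and the scores are entirely made up; Frankenstein obtains such scores from learned performance predictors.

```python
# Made-up predicted F-scores of components per QA task for one input question.
predictions = {
    "NED": {"DBpediaSpotlight": 0.62, "TagMe": 0.71},   # entity disambiguation
    "RL":  {"RelMatch": 0.45, "ReMatch": 0.58},         # relation linking
    "QB":  {"SQG": 0.80, "NLIWOD-QB": 0.66},            # query building
}

def compose_pipeline(predictions):
    """Greedy dynamic composition: per task, take the component with the
    highest predicted performance for this particular question."""
    return [max(components, key=components.get)
            for components in predictions.values()]
```

A static pipeline would fix one component per task for all questions; the dynamic variant re-evaluates this choice per input question.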
Looking forward to seeing you at SIGIR 2018.
Papers accepted at ICWE 2018🗓 2018-04-23 ✍ Gezim Sejdiu
We are very pleased to announce that our group got 2 papers accepted for presentation at ICWE 2018: The 18th International Conference on Web Engineering, which will be held in Cáceres, Spain, on June 5-8, 2018. ICWE is the prime yearly international conference on the different aspects of designing, building, maintaining and using Web applications. The theme for 2018 – the 18th edition of the event – is Enhancing the Web with Advanced Engineering. The conference will cover the different aspects of Web Engineering, including the design, creation, maintenance, and usage of Web applications. ICWE 2018 is endorsed by the International Society for Web Engineering (ISWE) and belongs to the ICWE conference series owned by ISWE. Here are the accepted papers with their abstracts:
- “Efficiently Pinpointing SPARQL Query Containments” by Claus Stadler, Muhammad Saleem, Axel-Cyrille Ngonga Ngomo, and Jens Lehmann.
Abstract: Query containment is a fundamental problem in database research, which is relevant for many tasks such as query optimisation, view maintenance and query rewriting. For example, recent SPARQL engines built on Big Data frameworks that precompute solutions to frequently requested query patterns are conceptually an application of query containment. We present an approach for solving the query containment problem for SPARQL queries – the W3C standard query language for RDF datasets. Solving the query containment problem can be reduced to the problem of deciding whether a subgraph isomorphism exists between the normalized algebra expressions of two queries. Several state-of-the-art methods are limited to matching two queries only, as well as only giving a boolean answer to whether a containment relation holds. In contrast, our approach is fit for view selection use cases, and thus capable of efficiently enumerating all containment mappings among a set of queries. Furthermore, it provides the information about how two queries’ algebra expression trees correspond under containment mappings. All of our source code and experimental results are openly available.
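What a "containment mapping" means can be shown with a deliberately naive Python sketch for conjunctive basic graph patterns (projection and all other SPARQL features are ignored): q_sub ⊆ q_super holds when the super-query's triple patterns can be homomorphically mapped onto the sub-query's patterns. The paper's approach works on normalized algebra expressions and is far more efficient and general than this brute force.

```python
from itertools import product

def is_var(term):
    return term.startswith("?")

def containment_mappings(q_sub, q_super):
    """Enumerate mappings witnessing q_sub being contained in q_super:
    assignments of the super-query's variables such that each of its triple
    patterns matches some pattern of q_sub (constants must match exactly)."""
    mappings = []
    for targets in product(q_sub, repeat=len(q_super)):
        mapping, ok = {}, True
        for pattern, target in zip(q_super, targets):
            for a, b in zip(pattern, target):
                if is_var(a):
                    if mapping.setdefault(a, b) != b:
                        ok = False
                        break
                elif a != b:
                    ok = False
                    break
            if not ok:
                break
        if ok:
            mappings.append(mapping)
    return mappings
```

Enumerating all such mappings (rather than returning a single boolean) is what makes view-selection use cases possible, which mirrors the motivation in the abstract.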
- “OpenBudgets.eu: A Platform for Semantically Representing and Analyzing Open Fiscal Data” by Fathoni A. Musyaffa, Lavdim Halilaj, Yakun Li, Fabrizio Orlandi, Hajira Jabeen, Sören Auer, and Maria-Esther Vidal.
Abstract: Budget and spending data are among the most published Open Data datasets on the Web and continuously increasing in terms of volume over time. These datasets tend to be published in large tabular files – without predefined standards – and require complex domain and technical expertise to be used in real-world scenarios. Therefore, the potential benefits of having these datasets open and publicly available are hindered by their complexity and heterogeneity. Linked Data principles can facilitate integration, analysis and usage of these datasets. In this paper, we present OpenBudgets.eu (OBEU), a Linked Data-based platform supporting the entire open data life-cycle of budget and spending datasets: from data creation to publishing and exploration. The platform is based on a set of requirements specifically collected by experts in the budget and spending data domain. It follows a micro-services architecture that easily integrates many different software modules and tools for analysis, visualization and transformation of data. Data is represented according to a logical model for open fiscal data, which is translated into both RDF and tabular data formats. We demonstrate the validity of the implemented OBEU platform with real application scenarios and report on a user study conducted to confirm its usability.
Looking forward to seeing you at ICWE 2018.
Invited talk by Dr. Anastasia Dimou🗓 2018-04-19 ✍ Gezim Sejdiu
On Wednesday, 21st of March, Anastasia Dimou from the Internet Technology & Data Science Lab visited SDA and gave a talk entitled “High Quality Linked Data Generation from Heterogeneous Data”.
Anastasia Dimou is a Post-Doc Researcher at the Internet Technology & Data Science Lab at Ghent University, Belgium. Anastasia joined the IDLab research group in February 2013. Her research expertise lies in the area of the Semantic Web, Linked Data Generation and Publication, Data Quality and Integration, Knowledge Representation and Management. She has broad experience with Semantic Wikis and Classification. As part of her research, she investigated a uniform language for describing the mapping rules for generating high-quality Linked Data from multiple heterogeneous data formats and access interfaces, and she also conducted research on Linked Data generation and publishing workflows. Her research activities led to the development of the RML tool chain (RMLProcessor, RMLEditor, RMLValidator, and RMLWorkbench). Anastasia has been involved in different national and international research projects and publications.
Prof. Jens Lehmann invited the speaker to the bi-weekly “SDA colloquium presentations”. The goal of her visit was to exchange experience and ideas on RML tools specialized for data quality and on-the-fly mapping, including heterogeneous dataset mapping into LOD. Apart from presenting various use cases where RML tools were used, she introduced a declarative RML serialization which models the mapping rules using the well-known YAML language. Anastasia shared with our group future research problems and challenges related to this research area.
In her talk, she introduced a full workflow, the RML tool chain, which models the components of an RML mapping lifecycle. She discussed its application to the structure of heterogeneous data sources. Anastasia Dimou noted that adding support for data quality assessment during the mapping allows users to efficiently explore a structured search space: it not only helps prevent future violations within the known domain being mapped, but also helps to discover new knowledge worth mapping from the existing knowledge base.
During the visit, SDA core research topics and main research projects were presented in a (successful!) attempt to find an intersection on the future collaborations with Anastasia and her research group.
As an outcome of this visit, we expect to strengthen our research collaboration networks with the Internet Technology & Data Science Lab at UGent, mainly on combining semantic knowledge for exploratory and mapping tools and applying those techniques to very large-scale KGs using our distributed analytics framework SANSA and DBpedia.
Papers and a tutorial accepted at ESWC 2018🗓 2018-04-11 ✍ Gezim Sejdiu
We are very pleased to announce that our group got 3 papers accepted for presentation at ESWC 2018: The 15th edition of the Extended Semantic Web Conference, which will be held on June 3-7, 2018 in Heraklion, Crete, Greece. ESWC is a major venue for discussing the latest scientific results and technology innovations around semantic technologies. Building on its past success, ESWC is seeking to broaden its focus to span other relevant research areas in which Web semantics plays an important role. ESWC 2018 will present the latest results in research, technologies, and applications in its field. Besides the technical program organized over twelve tracks, the conference will feature a workshop and tutorial program, a dedicated track on Semantic Web challenges, system descriptions and demos, a posters exhibition and a doctoral symposium. Here are the accepted papers with their abstracts:
- “Formal Query Generation for Question Answering over Knowledge Bases” by Hamid Zafar, Giulio Napolitano and Jens Lehmann.
Abstract: Question answering (QA) systems often consist of several components such as Named Entity Disambiguation (NED), Relation Extraction (RE), and Query Generation (QG). In this paper, we focus on the QG process of a QA pipeline on a large-scale Knowledge Base (KB), with noisy annotations and complex sentence structures. We therefore propose SQG, a SPARQL Query Generator with a modular architecture, enabling easy integration with other components for the construction of a fully functional QA pipeline. SQG can be used on large open-domain KBs and handle noisy inputs by discovering a minimal subgraph based on the uncertain inputs that it receives from the NED and RE components. This ability allows SQG to consider a set of candidate entities/relations, as opposed to the most probable ones, which leads to a significant boost in the performance of the QG component. The captured subgraph covers multiple candidate walks, which correspond to SPARQL queries. To enhance the accuracy, we present a ranking model based on Tree-LSTM that takes into account the syntactical structure of the question and the tree representation of the candidate queries to find the one representing the correct intention behind the question.
- “Frankenstein: a Platform Enabling Reuse of Question Answering Components” (Resource Track) by Kuldeep Singh, Andreas Both, Arun Sethupat, and Saeedeh Shekarpour.
Abstract: Recent remarkable efforts of the question answering (QA) community have yielded core components accomplishing QA tasks. However, implementing a QA system is still costly. Aiming to provide an efficient way for the collaborative development of QA systems, the Frankenstein framework was developed to allow the dynamic composition of question answering pipelines based on the input question. In this paper, we provide a full range of reusable components as independent modules of Frankenstein, populating the ecosystem and leading to the option of creating many different components and QA systems. Just by using the components described here, 380 different QA systems can be created, offering the QA community many new insights. Additionally, we provide resources which support the performance analysis of QA tasks, QA components, and complete QA systems. Hence, Frankenstein is dedicated to improving the efficiency of the research process w.r.t. QA.
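The pipeline-composition idea can be sketched in a few lines. This is a schematic reading of the abstract, not Frankenstein's API: the component registry, the per-task scores, and the greedy best-per-task selection are all illustrative assumptions.

```python
# Minimal sketch of dynamic QA pipeline composition in the spirit of
# Frankenstein: a registry of interchangeable components per QA task,
# from which one pipeline is composed. Names and scores are made up.

TASKS = ["NED", "RE", "QG"]

# Hypothetical per-task components with benchmark scores for some question type.
REGISTRY = {
    "NED": [("DBpediaSpotlight", 0.71), ("TagMe", 0.65)],
    "RE":  [("RelMatch", 0.58), ("ReMatch", 0.62)],
    "QG":  [("SQG", 0.74)],
}

def compose_pipeline(registry, tasks):
    """Pick the best-scoring component per task to form one QA pipeline."""
    return [max(registry[task], key=lambda c: c[1])[0] for task in tasks]

pipeline = compose_pipeline(REGISTRY, TASKS)
print(" -> ".join(pipeline))  # DBpediaSpotlight -> ReMatch -> SQG
```

Because every task slot accepts any registered component, the number of distinct pipelines grows multiplicatively with the registry, which is how a modest set of modules yields hundreds of possible QA systems.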
- “Using Ontology-based Data Summarization to Develop Semantics-aware Recommender Systems” by Tommaso Di Noia, Corrado Magarelli, Andrea Maurino, Matteo Palmonari, Anisa Rula.
Abstract: In the current information-centric era, recommender systems are gaining momentum as tools able to assist users in daily decision-making tasks. They may exploit users’ past behavior combined with side/contextual information to suggest new items or pieces of knowledge they might be interested in. Within the recommendation process, Linked Data (LD) have been already proposed as a valuable source of information to enhance the predictive power of recommender systems not only in terms of accuracy but also of diversity and novelty of results. In this direction, one of the main open issues in using LD to feed a recommendation engine is related to feature selection: how to select only the most relevant subset of the original LD dataset, thus avoiding both useless processing of data and the so-called “curse of dimensionality” problem. In this paper we show how ontology-based (linked) data summarization can drive the selection of properties/features useful to a recommender system. In particular, we compare a fully automated feature selection method based on ontology-based data summaries with more classical ones, and we evaluate the performance of these methods in terms of accuracy and aggregate diversity of a recommender system exploiting the top-k selected features. We set up an experimental testbed relying on datasets related to different knowledge domains. Results show the feasibility of a feature selection process driven by ontology-based data summaries for LD-enabled recommender systems.
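The top-k selection step can be sketched as follows. This is a simplification under stated assumptions: the property names and usage counts are invented, and a plain frequency ranking stands in for the ontology-based summaries the paper actually uses.

```python
# Sketch of summary-driven feature (property) selection for an LD-enabled
# recommender: keep only the k most relevant properties instead of feeding
# the whole LD dataset to the engine. Counts below are invented.
from collections import Counter

# A data summary could report, per property, how many item descriptions use it.
property_counts = Counter({
    "dbo:director": 950,
    "dbo:starring": 920,
    "dbo:genre": 880,
    "dbo:budget": 120,
    "dbo:wikiPageID": 1000,  # frequent but uninformative; a real summary
})                           # would also use ontology-level relevance

def top_k_features(counts, k, exclude=()):
    """Keep the k most frequent properties, skipping excluded ones."""
    return [p for p, _ in counts.most_common() if p not in exclude][:k]

features = top_k_features(property_counts, 3, exclude={"dbo:wikiPageID"})
print(features)  # ['dbo:director', 'dbo:starring', 'dbo:genre']
```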
- “How to build a Question Answering system overnight” (Tutorial)
Authors: Andreas Both, Denis Lukovnikov, Gaurav Maheshwari, Ioanna Lytra, Jens Lehmann, Kuldeep Singh, Mohnish Dubey, Priyansh Trivedi
Abstract: With this tutorial, we aim to provide the participants with an overview of the field of Question Answering, insights into commonly faced problems, and its recent trends and developments. At the end of the tutorial, the audience will have hands-on experience of developing two working QA systems: one based on rule-based semantic parsing, and another based on deep learning. In doing so, we hope to provide a suitable entry point for people new to this field and ease their process of making informed decisions while creating their own QA systems. Website: http://qatutorial.sda.tech/
Looking forward to seeing you at ESWC 2018.
Paper accepted at Semantic Web Journal🗓 2018-04-09 ✍ Gezim Sejdiu
We are very pleased to announce that our group got a paper accepted in the Semantic Web Journal's 2017 special issue on Benchmarking Linked Data. The journal Semantic Web – Interoperability, Usability, Applicability (published and printed by IOS Press, ISSN: 1570-0844), Semantic Web journal for short, brings together researchers from various fields who share the vision of and need for more effective and meaningful ways to share information across agents and services on the future internet and elsewhere. As such, Semantic Web technologies shall support the seamless integration of data, the on-the-fly composition and interoperation of Web services, and more intuitive search engines. The semantics – or meaning – of information, however, cannot be defined without a context, which makes personalization, trust, and provenance core topics for Semantic Web research. New retrieval paradigms, user interfaces, and visualization techniques have to unleash the power of the Semantic Web and at the same time hide its complexity from the user. Based on this vision, the journal welcomes contributions ranging from theoretical and foundational research over methods and tools to descriptions of concrete ontologies and applications in all areas. Here is the accepted paper with its abstract:
- “SML-Bench -- A Benchmarking Framework for Structured Machine Learning” by Patrick Westphal, Lorenz Bühmann, Simon Bin, Hajira Jabeen, and Jens Lehmann.
Abstract: The availability of structured data has increased significantly over the past decade and several approaches to learn from structured data have been proposed. These logic-based, inductive learning methods are often conceptually similar, which would allow a comparison among them even if they stem from different research communities. However, so far no efforts were made to define an environment for running learning tasks on a variety of tools, covering multiple knowledge representation languages. With SML-Bench, we propose a benchmarking framework to run inductive learning tools from the ILP and semantic web communities on a selection of learning problems. In this paper, we present the foundations of SML-Bench, discuss the systematic selection of benchmarking datasets and learning problems, and showcase an actual benchmark run on the currently supported tools.
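The cross-product structure of such a benchmark (every tool run on every learning problem) can be sketched as follows. The tool and problem names are examples from the ILP literature, and the runner itself is a hypothetical placeholder, not SML-Bench's actual interface.

```python
# Schematic benchmark loop in the spirit of SML-Bench: run every configured
# learning tool on every learning problem and collect one result per pair.

def run_tool(tool: str, problem: str) -> dict:
    """Placeholder for invoking a learning tool on a learning problem;
    a real runner would execute the tool and measure its predictions."""
    return {"tool": tool, "problem": problem, "accuracy": None}

def run_benchmark(tools, problems):
    return [run_tool(t, p) for t in tools for p in problems]

results = run_benchmark(
    tools=["DL-Learner", "Aleph"],           # semantic-web / ILP tools
    problems=["carcinogenesis", "mutagenesis"],
)
print(len(results))  # 4 runs: 2 tools x 2 problems
```

The value of the framework lies precisely in this grid: because every tool sees the same problems under the same harness, results become comparable across knowledge representation languages and research communities.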
Paper accepted at ICLR 2018🗓 2018-02-03 ✍ Gezim Sejdiu
We are very pleased to announce that our group, in collaboration with Fraunhofer IAIS, got a paper accepted for poster presentation at ICLR 2018, the Sixth International Conference on Learning Representations, which will be held on April 30 - May 3, 2018, at the Vancouver Convention Center, Vancouver, Canada. The sixth edition of ICLR will offer many opportunities to present and discuss the latest advances in machine learning methods and deep learning. It takes a broad view of the field, including topics such as feature learning, metric learning, compositional modeling, structured prediction, reinforcement learning, and issues regarding large-scale learning and non-convex optimization. The range of domains to which these techniques apply is also very broad, from vision to speech recognition, text understanding, gaming, music, etc. Here is the accepted paper with its abstract:
- “On the regularization of Wasserstein GANs” by Henning Petzka, Asja Fischer, Denis Lukovnikov
Abstract: Since their invention, generative adversarial networks (GANs) have become a popular approach for learning to model a distribution of real (unlabeled) data. Convergence problems during training are overcome by Wasserstein GANs which minimize the distance between the model and the empirical distribution in terms of a different metric, but thereby introduce a Lipschitz constraint into the optimization problem. A simple way to enforce the Lipschitz constraint on the class of functions, which can be modeled by the neural network, is weight clipping. Augmenting the loss by a regularization term that penalizes the deviation of the gradient norm of the critic (as a function of the network's input) from one, was proposed as an alternative that improves training. We present theoretical arguments why using a weaker regularization term enforcing the Lipschitz constraint is preferable. These arguments are supported by experimental results on several data sets.
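The two regularization terms contrasted in the abstract are easy to compare numerically. The sketch below evaluates the standard gradient penalty, (||grad|| - 1)^2, against the weaker one-sided variant, max(0, ||grad|| - 1)^2, on a few example gradient norms (the norms are made-up values, not taken from a trained critic).

```python
# The original gradient penalty pushes the critic's gradient norm toward
# exactly 1, while the weaker one-sided term only penalizes norms above 1,
# i.e. it merely enforces the Lipschitz bound.

def two_sided_penalty(grad_norm: float) -> float:
    return (grad_norm - 1.0) ** 2

def one_sided_penalty(grad_norm: float) -> float:
    return max(0.0, grad_norm - 1.0) ** 2

for g in [0.5, 1.0, 1.5]:
    print(g, two_sided_penalty(g), one_sided_penalty(g))
# gradient norms below 1 are penalized only by the two-sided term
```

The difference shows up at norms below one: a 1-Lipschitz critic is allowed to have gradient norm smaller than one, so the one-sided penalty does not punish it there, which is the intuition behind preferring the weaker constraint.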
Looking forward to seeing you at ICLR 2018.
Invited talk by Svitlana Vakulenko🗓 2018-02-02 ✍ Gezim Sejdiu
On Wednesday, the 31st of January, Svitlana Vakulenko from the Institute for Information Business visited SDA and gave a talk entitled “Semantic Coherence for Conversational Browsing of a Knowledge Graph”.
Svitlana Vakulenko is a researcher at the Institute for Information Business at WU Wien and a PhD student in the Computer Science Department at TU Wien. Her research expertise lies in the area of machine learning for natural language processing. She has been involved in several international research projects and is currently working on the CommuniData FFG project (communidata.at), which aims to enhance the usability of Open Data and its accessibility for non-expert users in local communities. She is also involved in other projects focusing on Question Answering over tabular data, Open Data conversational search, and exploratory search.
Prof. Jens Lehmann invited the speaker to the bi-weekly “SDA colloquium presentations”. The goal of her visit was to exchange experience and ideas on semantic search and dialogue system techniques specialized for question answering, including conversational search and exploratory search. In her talk, she introduced the task of conversational browsing, which goes beyond Question Answering: apart from presenting various use cases where semantic exploration of tabular and open data has been applied, she presented a framework that models the components of a conversational browsing system and discussed its application to the structure of a Knowledge Graph (KG). Svitlana also shared future research problems and challenges in this area and showed that semantic coherence can provide more insight and more meaningful results in the conversational browsing scenario.
She mentioned that adding support for conversational browsing functionality will allow users to efficiently explore a structured search space, enabling future conversational search systems not only to answer a range of questions but also to help discover questions worth asking. During the visit, SDA's core research topics and main research projects were presented in an attempt to find an intersection for future collaborations with Svitlana and her research group. As an outcome of this visit, we expect to strengthen our research collaboration network with the Institute for Information Business at WU Wien, mainly on combining semantic knowledge for exploratory and conversational search and applying those techniques to very large-scale KGs using our distributed analytics framework SANSA.