JensLehmann – Smart Data Analytics

SANSA 0.7.1 (Semantic Analytics Stack) Released

2020-01-172020-05-08JensLehmann

We are happy to announce SANSA 0.7.1 – the seventh release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

Website: http://sansa-stack.net
GitHub: https://github.com/SANSA-Stack
Download: http://sansa-stack.net/downloads-usage/
ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases

You can find usage guidelines and examples at http://sansa-stack.net/user-guide.

The following features are currently supported by SANSA:

Reading and writing RDF files in N-Triples, Turtle, RDF/XML, N-Quad, TRIX format
Reading OWL files in various standard formats
Query heterogeneous sources (Data Lake) using SPARQL – CSV, Parquet, MongoDB, Cassandra, JDBC (MySQL, SQL Server, etc.) are supported
Support for multiple data partitioning techniques
SPARQL querying via Sparqlify and Ontop and Tensors
Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
RDFS, RDFS Simple and OWL-Horst forward chaining inference
RDF graph clustering with different algorithms
Terminological decision trees (experimental)
Knowledge graph embedding approaches: TransE (beta), DistMult (beta)

Noteworthy changes or updates since the previous release are:

TRIX support
A new query engine over compressed RDF data
OWL/XML Support

Deployment and getting started:

There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
The SANSA jar files are in Maven Central i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
Example code is available for various tasks.
We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects Big Data Ocean, SLIPO, QROWD, BETTER, BOOST, MLwin, PLATOON and Simple-ML. Also check out our recent articles in which we describe how to use SANSA for tensor based querying, scalable RDB2RDF query execution, quality assessment and semantic partitioning.

Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.

Greetings from the SANSA Development Team

New Year at SDA – Looking back at 2018

2019-01-032019-01-03JensLehmann

2019 has just started and we want to take a moment to look back at a very busy and successful year 2018, full of new members, inspirational discussions, exciting conferences, accepted research papers, new software releases and a lot of highlights we had throughout the year.

Below is a short summary of the main cornerstones for 2018:

An interesting future for AI and knowledge graphs

Artificial intelligence/machine learning and semantic technologies/knowledge graphs are central topics for SDA. Throughout the year, we have been able to accomplish a range of interesting research achievements. One particularly active area was question answering and dialogue systems (with and without knowledge graphs). We acquired new projects for more than a million Euro this year and were able to transfer our expertise to industry via successful projects at Fraunhofer. External interest in our results has been remarkably high. Furthermore, we extended our already established position in scalable distributed querying, inference, and analysis of large RDF datasets. Among the race for ever-improving achievements in AI, which has gone far beyond what many could have imagined 10 years ago, our researchers were able to deliver important contributions and continued to shape different sub-areas of the growing AI research landscape.

Papers accepted

We had 41 papers accepted at well-known conferences (i.e., the AAAI 2019 workshops, ISWC 2018, ESWC 2018, Nature Scientific Data Journal, Journal of Web Semantics, Semantic Web Journal, WWW 2018 workshops, EMNLP 2018 workshops, ECML 2018 workshops, CoNLL 2018, SIGMOD 2018 workshops, SIGIR 2018, ICLR 2018, EKAW 2018, SEMANTiCS 2018, ICWE 2018, ICSC 2018, TPDL 2018, JURIX 2018 and more. We estimate that SDA members had approximately 2500+ citations per year (based on Google Scholar profiles).

Software releases

SANSA – An open source data flow processing engine for performing distributed computation over large-scale RDF datasets had 2 successfully released during 2018 (SANSA 0.5 and SANSA 0.4).

From the funded projects we were happy to launch the first major release of the Big Data Ocean platform – a platform for Exploiting Ocean’s of Data for Maritime Applications.

There were several other releases:

SML-Bench – A Structured Machine Learning benchmark framework 0.2 has been released.
WebVOWL – A web-based visualization for ontologies had several releases in 2018. AS a major new feature characterizing WebVOWL is the integration of the WebVOWL Editor – a Device-Independent Visual Ontology Modeling.
AskNowQA – A Suite of Natural Language interaction technologies that behave intelligently through domain knowledge. The 0.1 version has been released.

Highlights

Move to the brand new Computer Science Campus: After many delays, we finally moved into our new campus where we have modern rooms and equipment.
A Best Demo Award at ISWC 2018
Two PhD defenses: Mikhail Galkin and Lavdim Halilaj both successfully defended their PhD thesis. Congratulations to them again! Four more theses have been submitted, with defenses scheduled for January and February.
Many invited speakers (Prof. Dr. John Domingue, Prof. Dr. Khalid Saeed, Dr. Anastasia Dimou, Svitlana Vakulenko and Dr. Katherine Thornton).
We did an off-site meeting together with the EIS department of Fraunhofer IAIS, at their place.

Likewise, SDA deeply values team bonding activities. Often we try to introduce fun activities that involve teamwork and teambuilding. At our X-mas party, we enjoyed a very international and lovely dinner together while exchanging a `Secret Santa` gifts and played some ad-hoc games.

Long-term team building through deeper discussions, genuine connections and healthy communication helps us to connect within the group!

Many thanks to all who have accompanied and supported us on this way! So from all of us at SDA, we wish you a wonderful new year!

Jens Lehmann on behalf of The SDA Research Team

SANSA 0.5 (Scalable Semantic Analytics Stack) Released

2018-12-132018-12-13JensLehmann

We are happy to announce SANSA 0.5 – the fifth release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

Website: http://sansa-stack.net
GitHub: https://github.com/SANSA-Stack
Download: http://sansa-stack.net/downloads-usage/
ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases

You can find the FAQ and usage examples at http://sansa-stack.net/faq/.

The following features are currently supported by SANSA:

Reading and writing RDF files in N-Triples, Turtle, RDF/XML, N-Quad format
Reading OWL files in various standard formats
Query heterogeneous sources (Data Lake) using SPARQL – CSV, Parquet, MongoDB, Cassandra, JDBC (MySQL, SQL Server, etc.) are supported
Support for multiple data partitioning techniques
SPARQL querying via Sparqlify and Ontop
Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
RDFS, RDFS Simple and OWL-Horst forward chaining inference
RDF graph clustering with different algorithms
Terminological decision trees (experimental)
Knowledge graph embedding approaches: TransE (beta), DistMult (beta)

Noteworthy changes or updates since the previous release are:

A data lake concept for querying heterogeneous data sources has been integrated into SANSA
New clustering algorithms have been added and the interface for clustering has been unified
Ontop RDB2RDF engine support has been added
RDF data quality assessment methods have been substantially improved
Dataset statistics calculation has been substantially improved
Improved unit test coverage

Deployment and getting started:

There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
The SANSA jar files are in Maven Central i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
Example code is available for various tasks.
We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects HOBBIT, Big Data Ocean, SLIPO, QROWD, BETTER, BOOST, MLwin and Simple-ML.

Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.

Greetings from the SANSA Development Team

AskNow 0.1 Released

2018-09-122018-09-13JensLehmann

Dear all,

the Smart Data Analytics group is happy to announce AskNow 0.1 – the initial release of Question Answering Components and Tools over RDF Knowledge Graphs.

Website: http://asknow.sda.tech/
Demo: http://asknowdemo.sda.tech
GitHub: https://github.com/AskNowQA

The following components with corresponding features are currently supported by AskNow:

EARL 0.1 EARL performs entity linking and relation linking as a joint task. It uses machine learning in order to exploit the Connection Density between nodes in the knowledge graph. It relies on three base features and re-ranking steps in order to predict entities and relations.
ISWC 2018: https://arxiv.org/pdf/1801.03825.pdf
Demo: https://earldemo.sda.tech/
Github: https://github.com/AskNowQA/EARL

SQG 0.1: This is a SPARQL Query Generator with modular architecture. SQG enables easy integration with other components for the construction of a fully functional QA pipeline. Currently entity relation, compound, count, and boolean questions are supported.
ESWC 2018: http://jens-lehmann.org/files/2018/eswc_qa_query_generation.pdf
Github: https://github.com/AskNowQA/SQG

AskNow UI 0.1: The UI interface works as a platform for users to pose their questions to the AskNow QA system. The UI displays the answers based on whether the answer is an entity or a list of entities, boolean or literal. For entities it shows the abstracts from DBpedia.
Github: https://github.com/AskNowQA/AskNowUI

SemanticParsingQA 0.1: The Semantic Parsing-based Question Answering system is built on the integration of EARL, SQG and AskNowUI.
Demo: http://asknowdemo.sda.tech/
Github: https://github.com/AskNowQA/SemanticParsingQA

We want to thank everyone who helped to create this release, in particular the projects HOBBIT, SOLIDE, WDAqua, BigDataEurope.

View this announcement on Twitter: https://twitter.com/AskNowQA/status/1040205350853599233

Kind regards,
The AskNow Development Team
(http://asknow.sda.tech/people/)

SANSA 0.4 (Semantic Analytics Stack) Released

2018-06-262018-06-26JensLehmann

We are happy to announce SANSA 0.4 – the fourth release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

Website: http://sansa-stack.net
GitHub: https://github.com/SANSA-Stack
Download: http://sansa-stack.net/downloads-usage/
ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases

You can find the FAQ and usage examples at http://sansa-stack.net/faq/.

The following features are currently supported by SANSA:

Reading and writing RDF files in N-Triples, Turtle, RDF/XML, N-Quad format
Reading OWL files in various standard formats
Support for multiple data partitioning techniques
SPARQL querying via Sparqlify
Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
RDFS, RDFS Simple, OWL-Horst, EL (experimental) forward chaining inference
Automatic inference plan creation (experimental)
RDF graph clustering with different algorithms
Terminological decision trees (experimental)
Anomaly detection (beta)
Knowledge graph embedding approaches: TransE (beta), DistMult (beta)

Noteworthy changes or updates since the previous release are:

Parser performance has been improved significantly e.g. DBpedia 2016-10 can be loaded in <100 seconds on a 7 node cluster
Support for a wider range of data partitioning strategies
A better unified API across data representations (RDD, DataFrame, DataSet, Graph) for triple operations
Improved unit test coverage
Improved distributed statistics calculation (see ISWC paper)
Initial scalability tests on 6 billion triple Ethereum blockchain data on a 100 node cluster
New SPARQL-to-GraphX rewriter aiming at providing better performance for queries exploiting graph locality
Numeric outlier detection tested on DBpedia (en)
Improved clustering tested on 20 GB RDF data sets

Deployment and getting started:

There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
The SANSA jar files are in Maven Central i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
Example code is available for various tasks.
We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects Big Data Europe, HOBBIT, SAKE, Big Data Ocean, SLIPO, QROWD, BETTER, BOOST and SPECIAL.

Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.

Greetings from the SANSA Development Team

SANSA Collaboration with Alethio

2018-02-202018-02-20JensLehmann

The SANSA team is excited to announce our collaboration with Alethio (a ConsenSys formation). SANSA is the major distributed, open source solution for RDF querying, reasoning and machine learning. Alethio is building an Ethereum analytics platform that strives to provide transparency over what’s happening on the Ethereum p2p network, the transaction pool and the blockchain and provide “blockchain archeology”. Their 5 billion triple data set contains large scale blockchain transaction data modelled as RDF according to the structure of the Ethereum ontology. EthOn – The Ethereum Ontology – is a formalization of concepts/entities and relations of the Ethereum ecosystem represented in RDF and OWL format. It describes all Ethereum terms including blocks, transactions, contracts, nonces etc. as well as their relationships. Its main goal is to serve as a data model and learning resource for understanding Ethereum.

Alethio is interested in using SANSA as a scalable processing engine for their large-scale batch and stream processing tasks, such as querying the data in real time via SPARQL and performing related analytics on a wide range of subjects (e.g. asset turnover for sets of accounts, attack pattern detection or Opcode usage statistics). At the same time, SANSA is interested in further industrial pilot applications for testing the scalability on larger datasets, mature its code base and gain experience on running the stack on production clusters. Specifically, the initial goal of Alethio was to load a 2TB EthOn dataset containing more than 5 billion triples and then performing several analytic queries on it with up to three inner joins. The queries are used to characterize movement between groups of ethereum accounts (e.g. exchanges or investors in ICOs) and aggregate their in and out value flow over the history of the Ethereum blockchain. The experiments were successfully run by Alethio on a cluster with up to 100 worker nodes and 400 cores that have a total of over 3TB of memory available.

“I am excited to see that SANSA works and scales well to our data. Now, we want to experiment with more complex queries and tune the Spark parameters to gain the optimal performance for our dataset” said Johannes Pfeffer, co-founder of Alethio. “I am glad that Alethio managed to run their workload and to see how well our methods scale to a 5 billion triple dataset”, added Gezim Sejdiu, PhD student at the Smart Data Analytics Group and SANSA core developer.

Parts of the SANSA team, including its leader Prof. Jens Lehmann as well as Dr. Hajira Jabeen, Dr. Damien Graux and Gezim Sejdiu, will now continue the collaboration together with the data science team of Alethio after those successful experiments. Beyond the above initial tests, we are jointly discussing possibilities for efficient stream processing in SANSA, further tuning of aggregate queries as well as suitable Apache Spark parameters for efficient processing of the data. In the future, we want to join hands to optimize the performance of loading the data (e.g. reducing the disk footprint of datasets using compression techniques allowing then more efficient SPARQL evaluation), handling the streaming data, querying, and analytics in real time.

The SANSA team is happily looking forward to further interesting scientific research as well as industrial adaptation.

Core model of the fork history of the Ethereum Blockchain modeled in EthOn

Christmas Time at SDA – Time to look back at 2017

2017-12-222018-01-09JensLehmann

We are looking back at a busy and successful year 2017 full of new members, inspirational discussions, exciting conferences, a lot of accepted papers and awards as well as new software releases.

Below is a short summary of the main cornerstones for 2017:

The growth of the group in 2017

SDA is a new group, but not new in the field :). As a group, it was founded by Prof. Dr. Jens Lehmann at the beginning of 2016. The group has members at the University of Bonn with associated researchers at the Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS) and the Institute for Applied Computer Science Leipzig. Within 2017, the group has grown from 20 members to around 55 members (1 Professor, 1 Akademischer Rat / Assistant Professor, 11 PostDocs, 31 PhD Students,11 master students) as you can see on our team page.

An interesting future for AI and knowledge graphs

Artificial intelligence / machine learning and semantic technologies / knowledge graphs are central topics for SDA. Throughout the year, we have been able to achieve a range of interesting research achievements. This ranges from internationally leading results in question answering over knowledge graphs, to scalable distributed querying, inference and analysis of large RDF datasets as well as new perspectives on industrial data spaces and data integration. Among the race for ever improving achievements in AI, which go far beyond what many could have imagined 10 years ago, our researchers were able to deliver important contributions and continue to shape different sub areas of the growing AI research landscape.

Papers accepted

We had 46 papers accepted at well-known conferences (i.e The Web Conference 2018, WWW 2017, AAAI 2017, ISWC 2017, ESWC 2017, DEXA 2017, SEMANTiCS 2017, K-CAP 2017, WI 2017, KESW 2017, IEEE BigData 2017, NIPS 2017, TPDL 2017, ICSC 2018, ICEGOV 2018 and more. We estimate our articles to be cited around 3000+ times per year (based on Google Scholar profiles).

Awards

We received several awards in 2017 – click on the posts to find out more:

Software releases

SANSA – An open source data flow processing engine for performing distributed computation over large-scale RDF datasets had 2 successfully released during 2017 (SANSA 0.3 and SANSA 0.2).

From the funded projects we were happy to launch the final release of the Big Data Europe platform – an open source Big Data Processing Platform allowing users to install numerous big data processing tools and frameworks and create working data flow applications.

There were several other releases:

SML-Bench – A Structured Machine Learning benchmark framework 0.2 has been released.
WebVOWL – A Web-based Visualization of Ontologies had several releases in 2017.
OpenResearch.org – A Crowd-Sourcing platform for collaborative management of scholarly metadata reached coverage of more than 5K computer science conferences in 2017.

Furthermore, SDA deeply values team bonding activities. 🙂 Often we try to introduce fun activities that involve teamwork and teambuilding. At our X-mas party, we enjoyed a very international and lovely dinner together, we played a `Secret Santa` and Pantomime game.

Long-term team building through deeper discussions, genuine connections and healthy communication helps us to connect within the group!

Many thanks to all of you who have accompanied and supported us on this way!

Jens Lehmann on behalf of The SDA Research Team

SANSA 0.3 (Semantic Analytics Stack) Released

2017-12-152017-12-15JensLehmann

We are happy to announce SANSA 0.3 – the third release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

Website: http://sansa-stack.net
GitHub: https://github.com/SANSA-Stack
Download: http://sansa-stack.net/downloads-usage/
ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases

You can find the FAQ and usage examples at http://sansa-stack.net/faq/.

The following features are currently supported by SANSA:

Reading and writing RDF files in N-Triples, Turtle, RDF/XML, N-Quad format
Reading OWL files in various standard formats
Support for multiple data partitioning techniques
SPARQL querying via Sparqlify (with some known limitations until the next Spark 2.3.* release)
SPARQL querying via conversion to Gremlin path traversals (experimental)
RDFS, RDFS Simple, OWL-Horst (all in beta status), EL (experimental) forward chaining inference
Automatic inference plan creation (experimental)
RDF graph clustering with different algorithms
Rule mining from RDF graphs based AMIE+
Terminological decision trees (experimental)
Anomaly detection (beta)
Distributed knowledge graph embedding approaches: TransE (beta), DistMult (beta), several further algorithms planned

Deployment and getting started:

There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
The SANSA jar files are in Maven Central i.e. in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
There is example code for various tasks available.
We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects Big Data Europe, HOBBIT, SAKE, Big Data Ocean, SLIPO, QROWD and BETTER.

Greetings from the SANSA Development Team

SANSA 0.1 (Semantic Analytics Stack) Released

2016-12-092016-12-12JensLehmann

Dear all,

The Smart Data Analytics group is very happy to announce SANSA 0.1 – the initial release of the Scalable Semantic Analytics Stack. SANSA combines distributed computing and semantic technologies in order to allow powerful machine learning, inference and querying capabilities for large knowledge graphs.

Website: http://sansa-stack.net
GitHub: https://github.com/SANSA-Stack
Download: http://sansa-stack.net/downloads-usage/
ChangeLog: https://github.com/SANSA-Stack/SANSA-Stack/releases

You can find the FAQ and usage examples at http://sansa-stack.net/faq/.

The following features are currently supported by SANSA:

Support for reading and writing RDF files in N-Triples format
Support for reading OWL files in various standard formats
Querying and partitioning based on Sparqlify
Support for RDFS/RDFS Simple/OWL-Horst forward chaining inference
Initial RDF graph clustering support
Initial support for rule mining from RDF graphs

We want to thank everyone who helped to create this release, in particular, the projects Big Data Europe, HOBBIT and SAKE.

Kind regards,

The SANSA Development Team

DL-Learner 1.3 (Supervised Structured Machine Learning Framework) Released

2016-10-112016-10-20JensLehmann

Dear all,

the Smart Data Analytics group is happy to announce DL-Learner 1.3.

DL-Learner is a framework containing algorithms for supervised machine learning in RDF and OWL. DL-Learner can use various RDF and OWL serialization formats as well as SPARQL endpoints as input, can connect to most popular OWL reasoners and is easily and flexibly configurable. It extends concepts of Inductive Logic Programming and Relational Learning to the Semantic Web in order to allow powerful data analysis.

Website: http://dl-learner.org
GitHub page: https://github.com/AKSW/DL-Learner
Download: https://github.com/AKSW/DL-Learner/releases
ChangeLog: http://dl-learner.org/development/changelog/

DL-Learner is used for data analysis tasks within other tools such as ORE and RDFUnit. Technically, it uses refinement operator based, pattern-based and evolutionary techniques for learning on structured data. For a practical example, see http://dl-learner.org/community/carcinogenesis/. It also offers a plugin for Protégé, which can give suggestions for axioms to add.

In the current release, we added a large number of new algorithms and features. For instance, DL-Learner supports terminological decision tree learning, it integrates the LEAP and EDGE systems as well as the BUNDLE probabilistic OWL reasoner. We migrated the system to Java 8, Jena 3, OWL API 4.2 and Spring 4.3. We want to point to some related efforts here:

A new DL-Learner overview article is available at the Journal of Web Semantics (pre-print PDF).
We started a benchmarking framework for supervised machine learning from structured data (not restricted to RDF/OWL).
An article about the SPARQL reasoning component is now available (published at ECAI).
An article about terminological decision tree learning is available (published at EKAW).

We want to thank everyone who helped to create this release, in particular we want to thank Giuseppe Cota who visited the core developer team and significantly improved DL-Learner. We also acknowledge support by the recently SAKE project, in which DL-Learner will be applied to event analysis in manufacturing use cases, as well as Big Data Europe and HOBBIT projects.

Kind regards,

Lorenz Bühmann, Jens Lehmann, Patrick Westphal and Simon Bin