Google Summer of Code 2018 Accepted projects
- 1 Adding Greek language on NLP library Spacy.io
- 2 Extraction of Responsibilities per unit in public sector organizations from the Government Gazette
- 3 Epoptes
- 4 Government Gazette text mining, cross linking, and codification
- 5 Libreoffice customization and creation of legal Templates for LibreOffice
- 6 Software components and IP management
- 7 WSO2 Identity Server Userstore using Web Services to get claims
- 8 Python PenTest Library (PyPen)
Adding Greek language on NLP library Spacy.io
Spacy is an open-source Python library for advanced Natural Language Processing. It is a powerful, modern tool for applying NLP to real-world problems. Among other functionality, it provides named entity recognition, deep learning integration and part-of-speech tagging, and includes built-in visualizers for syntax and NER. Spacy supports more than 25 languages, but not Greek. Adding Greek will make it much easier to apply NLP to Greek text and will enable tasks such as named entity recognition and part-of-speech tagging.
The procedure is well specified at https://spacy.io/usage/adding-languages: custom language data (stop words, tokenizer exceptions, punctuation rules etc.) needs to be added and tested.
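To make the kind of language data mentioned above more concrete, here is a minimal, library-independent sketch in Python. It does not use Spacy's actual API; the names `STOP_WORDS`, `TOKENIZER_EXCEPTIONS`, `tokenize` and `content_words`, as well as the tiny word lists, are assumptions made purely for illustration.

```python
import re

# Hypothetical, deliberately tiny samples; real language data would be far larger.
STOP_WORDS = {"και", "το", "η", "ο", "να", "με", "για"}

# Abbreviations that should stay a single token although they contain periods.
TOKENIZER_EXCEPTIONS = {"κ.λπ.", "π.χ.", "δηλ."}

# Punctuation rule: a token is either a run of word characters
# or a single non-space, non-word character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping listed abbreviations intact.

    Limitation of this sketch: an abbreviation directly followed by other
    punctuation (e.g. "π.χ.,") is not recognized as an exception.
    """
    tokens = []
    for chunk in text.split():
        if chunk.lower() in TOKENIZER_EXCEPTIONS:
            tokens.append(chunk)
        else:
            tokens.extend(_TOKEN_RE.findall(chunk))
    return tokens


def content_words(text):
    """Return tokens that are neither stop words nor punctuation."""
    return [
        t for t in tokenize(text)
        if t.lower() not in STOP_WORDS and t[0].isalnum()
    ]
```

For example, `tokenize("Η Αθήνα, π.χ. το κέντρο.")` keeps `π.χ.` as one token while splitting the comma and the final period into separate tokens.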
The vocabulary, syntax, entities and word vectors for the Greek language will be produced with Spacy/gensim after the language data has been successfully added.
The Greek language model will then be added to Spacy.io for use as a supported language model.
As a real-world scenario for testing the language model, an analysis of a large number of issues of the Official Greek Government Gazette (FEK-ΦΕΚ) is proposed, in order to extract entities and categorize these documents.
Strong knowledge of the Greek language, fluency in Python and knowledge of regular expressions are necessary for this.
Mentors: Markos Gogoulos, Panos Louridas
Extraction of Responsibilities per unit in public sector organizations from the Government Gazette
The objective of this project is to extend existing Government Gazette (GG) text mining code with Named Entity Recognition features. These will allow the identification of Government Directorates and Divisions, together with the responsibilities assigned to them and the types of services they are required to provide according to their legal framework as published at http://www.et.gr/, and the extraction of this information with related metadata (decision number, date of the GG issue). The aim is to link the management units (Directorates, Divisions and Sections) with their assigned roles and services and to codify this information, which is hidden in the raw text of the GG issues. For this, the PDFs must be downloaded, converted into text and cleaned. Then, syntax-based heuristics and/or machine learning techniques must be applied to identify specific Named Entity types with references to the responsibilities and services assigned per unit, and the links between the two must be extracted. Metadata concerning the GG issue and the decision and/or law number will also be attached to each extracted association. The resulting associations will be exported in a machine-usable, structured format (e.g. as RDF triples).
- A module for manually annotating related entities and responsibilities-services assignment sections in raw text
- A NER module, with trained models for detecting Government Directorates and Divisions in raw text
- A module that associates entities with responsibilities and extracts related metadata from the GG issue
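The association step described above could look roughly like the following heuristic sketch. It is only an illustration under strong assumptions: the sample text is a made-up English stand-in for cleaned GG text (real issues are in Greek), the sentence pattern is far simpler than real legal language, and the predicate names `hasResponsibility` and `definedInDecision` are hypothetical.

```python
import re

# Made-up English stand-in for cleaned GG text; real issues are in Greek.
SAMPLE = (
    "Decision 1234/2018. "
    "The Directorate of Finance is responsible for budget planning. "
    "The Division of Procurement is responsible for public tenders."
)

# Heuristic: "<unit type> of <Name> is responsible for <task>."
UNIT_RE = re.compile(
    r"The (?P<unit>(?:Directorate|Division|Section) of [A-Z]\w+)"
    r" is responsible for (?P<task>[^.]+)\."
)
# Metadata: the decision number in the form number/year.
DECISION_RE = re.compile(r"Decision (?P<number>\d+/\d{4})")


def extract_triples(text):
    """Return (subject, predicate, object) tuples found in the text."""
    decision = DECISION_RE.search(text)
    source = decision.group("number") if decision else None
    triples = []
    for match in UNIT_RE.finditer(text):
        unit = match.group("unit")
        triples.append((unit, "hasResponsibility", match.group("task").strip()))
        if source:
            # Attach the decision number as metadata for this unit.
            triples.append((unit, "definedInDecision", source))
    return triples
```

Tuples of this shape map naturally onto RDF triples; a trained NER model would replace the regular expression once annotated data is available.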
Python, Java, Machine Learning
Epoptes
Epoptes (Επόπτης, a Greek word for overseer) is an open source computer lab management and monitoring tool. It supports screen broadcasting and monitoring, remote command execution, message sending, and imposing restrictions on clients such as screen locking or sound muting, among other features. It can be installed in Ubuntu-, Debian- and openSUSE-based labs that may contain any combination of the following: LTSP servers, thin and fat clients, non-LTSP servers, standalone workstations, NX or XDMCP clients etc.
Related GitHub repositories
Rewrite Epoptes with Python 3 support
Gtk3 with GObject Introspection instead of pygtk2
Improvements in the code structure (break existing code into Python modules/packages)
Mentors: Fotis Tsiamis, Avgoustos Tsinakos
Government Gazette text mining, cross linking, and codification
The objective of this project is to extend existing Government Gazette text mining code to cross-link legal texts and detect the ministers that sign them. For this, the PDFs need to be downloaded and converted into text. Then, heuristic rules must be applied to detect references to other legal texts, which will be converted into hypertext form. Similar techniques will be used to detect the competent ministers. Two possible extensions are proposed: first, detecting amendments incorporated within another law; second, implementing a prototype for editing a law in its codified form (e.g. on GitHub) and automatically generating from the changes the text to be legislated (the differences from the original law).
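As an illustration of the reference detection described above, the following sketch turns citations of the form "ν. 4009/2011" into hyperlinks with a single regular expression. The citation pattern covers only this one common form, and the URL scheme under `BASE_URL` is a placeholder assumption, not an existing service.

```python
import re
from html import escape

# Matches citations such as "ν. 4009/2011" or "Ν. 4412/2016" (law number/year).
LAW_REF_RE = re.compile(r"\b[νΝ]\.\s?(?P<number>\d{1,4})/(?P<year>\d{4})\b")

# Hypothetical URL scheme, used only for illustration.
BASE_URL = "https://example.org/laws"


def link_references(text):
    """Return HTML in which every detected law citation is a hyperlink."""

    def to_anchor(match):
        url = f"{BASE_URL}/{match.group('year')}/{match.group('number')}"
        return f'<a href="{url}">{match.group(0)}</a>'

    # Escape the plain text first so the result is safe to embed in HTML.
    return LAW_REF_RE.sub(to_anchor, escape(text))
```

Real Gazette text uses many more citation forms (presidential decrees, ministerial decisions, article and paragraph references), so a production version would need a family of such rules.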
Related GitHub repositories
Detection of references to other laws; detection of competent ministers; codified legislation prototype
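For the codified legislation prototype, the differences between the original law and its edited, codified form could be computed along the following lines. This is a sketch that uses Python's standard `difflib` module as a stand-in; the function name `amendment_diff` is hypothetical, and turning such a diff into properly worded amending text would require additional, domain-specific rules.

```python
import difflib


def amendment_diff(original, codified):
    """Return a unified diff between two versions of a law's text."""
    return "\n".join(
        difflib.unified_diff(
            original.splitlines(),
            codified.splitlines(),
            fromfile="original",
            tofile="codified",
            lineterm="",
        )
    )
```

Lines starting with `-` in the result were removed or replaced, and lines starting with `+` are the new wording; identical inputs yield an empty string.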
Libreoffice customization and creation of legal Templates for LibreOffice
This project covers LibreOffice customization to achieve a familiar look and familiar menus for users migrating from MS Office 2013, and the creation of specific templates for the Greek legal system. The customization and templates should follow the development guidelines at https://wiki.documentfoundation.org/Development/GetInvolved.
- Development of specific menu customizations using the LibreOffice Software Development Kit 6.0 in various LibreOffice modules (e.g. https://api.libreoffice.org/docs/idl/ref/namespacecom_1_1sun_1_1star_1_1ui.html)
- Design and development of templates and LibreOffice applications that request and fill in specific information in the templates through APIs for the Greek legal system
The customization and templates should be accompanied by detailed documentation and instructions for developers and end users.
- LibreOffice Software Development Kit 6.0
Software components and IP management
More details in the separate page Clio.
A web-based system to manage data on software components and their relations.
Nowadays, every piece of software includes and uses many other software components, each of which comes with its own license.
The goal of this project is to build a simple web system for (manually) entering and maintaining this information.
This is a brand-new project; some analysis has been done but no code is available yet.
A complete web-based system to manage the above-mentioned data.
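Since no code exists yet, the following is only one possible sketch of the underlying data model: components, their licenses, and the relations between them. All class, field and method names (`Component`, `Registry`, `licenses_for`) are assumptions for illustration, not part of any existing design.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    version: str
    license: str  # e.g. an SPDX identifier such as "MIT" or "Apache-2.0"
    depends_on: list = field(default_factory=list)  # names of other components


class Registry:
    """In-memory store of components, keyed by name."""

    def __init__(self):
        self._components = {}

    def add(self, component):
        self._components[component.name] = component

    def licenses_for(self, name):
        """Return the licenses of `name` and of everything it depends on."""
        seen, stack, licenses = set(), [name], set()
        while stack:
            current = stack.pop()
            # Skip components already visited (cycles) or not registered.
            if current in seen or current not in self._components:
                continue
            seen.add(current)
            component = self._components[current]
            licenses.add(component.license)
            stack.extend(component.depends_on)
        return licenses
```

A web front end would sit on top of such a model and persist it in a database; the traversal in `licenses_for` shows why the relations between components matter as much as the components themselves.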
Web (any technology welcome)
Mentors: Alexios Zavras, Georgia Kapitsaki
WSO2 Identity Server Userstore using Web Services to get claims
WSO2 Identity Server provides secure identity management for enterprise web applications, services, and APIs by managing identity and entitlements of the users securely and efficiently. The Identity Server enables enterprise architects and developers to reduce identity provisioning time, guarantee secure online interactions, and deliver a reduced single sign-on environment. WSO2 Identity Server is fully open source and is released under Apache Software License Version 2.0.
The aim of this project is to create a new type of userstore in which credentials are separated from attributes, and in which attributes (claims) can be configured from the web UI to be retrieved from a SOAP or REST web service. The end-user should be able to:
- configure credentials for LDAP or JDBC
- configure web service authentication
- configure claims to consume the above web service
A new userstore in which the end-user can configure user claims through a web services client, using the existing web interface. The appropriate changes to the source code should be submitted to the upstream branch of the latest version (5.4.0).
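The separation of credentials from claims can be pictured with the following conceptual sketch. It is written in Python for brevity and is not WSO2 Identity Server code or its API: the class `WebServiceClaimUserStore`, its methods and the claim names are all hypothetical, and the two callables stand in for the LDAP/JDBC credential check and the SOAP/REST claim service.

```python
class WebServiceClaimUserStore:
    """Conceptual model: credentials and claims come from different backends."""

    def __init__(self, verify_credentials, fetch_claims, claim_mapping):
        # verify_credentials(username, password) -> bool, e.g. backed by LDAP or JDBC
        self._verify = verify_credentials
        # fetch_claims(username) -> dict of raw attributes from a web service
        self._fetch = fetch_claims
        # Maps claim names used by the identity server to web service attribute names
        self._mapping = claim_mapping

    def authenticate(self, username, password):
        return self._verify(username, password)

    def get_claims(self, username):
        raw = self._fetch(username)
        # Expose only the configured claims that the service actually returned.
        return {
            claim: raw[attr]
            for claim, attr in self._mapping.items()
            if attr in raw
        }
```

The point of the design is that the claim mapping and the web service endpoint are configuration, so they can be changed from the web UI without touching the credential store.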
Related GitHub repository
- Java JSP
- OSGI Framework
- A modern development framework for interactive web content
Mentors: Panagiotis Kranidiotis, Stamelos Ioannis
Python PenTest Library (PyPen)
A collection of tools supporting penetration testers
Development of a Python library for penetration testers. The library will include a set of tools for performing the basic tasks involved in attacking a remote host. It will include reconnaissance tools, such as modules that collect data about a specific target either from the web or from user input. Other tools will create custom dictionaries for username and password attacks. Supported attack techniques will include DoS, brute-force and inclusion attacks. The library will also include various statistical functions for extracting additional information from a captured host.
Related GitHub repositories
Development of an independent Python library that will also integrate existing, well-established tools such as CUPP (already in Kali Linux) to assist in penetration testing.
A. User Reconnaissance & Information gathering
A.1/ PyFBSniff: Facebook scraper
A.2/ PyGenUser: Username list creation
A.3/ PyDic: Dictionary creation
Future extensions will include tools similar to PyFBSniff for other social media such as Twitter and Google+.
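As a rough idea of what username list creation (PyGenUser) might involve, the sketch below derives candidate usernames from a person's first and last name. The function name `username_candidates` and the chosen patterns are assumptions for illustration and do not reflect the actual PyPen code.

```python
def username_candidates(first, last):
    """Return common username patterns built from a first and last name."""
    first, last = first.strip().lower(), last.strip().lower()
    if not first or not last:
        return []
    patterns = [
        first + last,        # mariapapadopoulou
        first + "." + last,  # maria.papadopoulou
        first[0] + last,     # mpapadopoulou
        first + last[0],     # mariap
        last + first,        # papadopouloumaria
        last + "." + first,  # papadopoulou.maria
    ]
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(patterns))
```

A dictionary creation tool such as PyDic could combine lists like this with other collected data, for example dates or keywords.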
B. Target System Reconnaissance & Information gathering
A collection of supportive tools gathering and presenting information about the Operating System and its processes.
B.1/ PyPScanner: Port Scanner
B.2/ PyPidStat: Process statistics creation
B.3/ PySocketStat: Socket statistics creation
B.4/ PyPipeStat: Pipe statistics creation
B.5/ PyFileStat: File statistics creation
C. Attack PenTest tools
C.1/ PyDoS: DoS attack by flooding
C.2/ PyBruftp: Brute-force attack on an FTP server
C.3/ PyRansom: Ransomware script
The library will be expandable in order to incorporate more tools in the future.