DBoss is a project aimed at creating and maintaining a comprehensive (but lightweight) database for historical statistics on open source software projects hosted in "forges" (hosting services such as Sourceforge.net). On top of this database, we are also developing a dedicated user interface, with options for automatic generation of custom reports.
Our goals and motivations
We currently work with data from FLOSSmole and SRDA (Notre Dame). These databases already contain much of the information we want, but they lack the speed, reliability and flexibility needed for detailed statistical studies. Our goal is thus to make this information (and its derived meanings) more reliable and readily accessible to the OSS researcher. Additionally, we intend to add some more information gathered by our own means.
Data sources currently in use
FLOSSmole: huge database containing data from several forges (in separate tables); their data comes mostly from web crawlers.
SRDA (Notre Dame): Sourceforge-only, all data donated directly by Sourceforge; data is served only through basic SQL queries on a page in their wiki.
With both databases, we cannot be completely sure of the exact path the data has taken, and its exact meaning is sometimes obscure. Inconsistencies of various kinds are also noticeable.
Database
After months of planning, discussion and testing, we have created a very lightweight database, intended for in-memory use as well as direct SQL querying. In its current version, this database is still under test, but we expect every in-memory query to finish within 30 seconds, and most to finish sooner, on an ordinary personal computer.
Our database is centered on monthly numbers (such as "downloads" and "web views") and multi-valued properties (such as "programming languages" and "operating systems"). All its tables are designed to accommodate data from any forge.
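To make this layout more concrete, here is a minimal sketch of how monthly numbers and multi-valued properties could be stored in an in-memory SQLite database. The table names, column names and sample rows are all assumptions made for illustration; they are not the actual DBoss schema, and SQLite itself is only one possible engine.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative,
# not taken from the real DBoss database.
SCHEMA = """
CREATE TABLE project (
    id    INTEGER PRIMARY KEY,
    forge TEXT NOT NULL,          -- any forge, e.g. 'sourceforge'
    name  TEXT NOT NULL,
    UNIQUE (forge, name)
);
CREATE TABLE monthly_number (
    project_id INTEGER NOT NULL REFERENCES project(id),
    metric     TEXT NOT NULL,     -- e.g. 'downloads', 'web views'
    month      TEXT NOT NULL,     -- 'YYYY-MM'
    value      INTEGER NOT NULL,
    PRIMARY KEY (project_id, metric, month)
);
CREATE TABLE property (
    project_id INTEGER NOT NULL REFERENCES project(id),
    kind       TEXT NOT NULL,     -- e.g. 'programming language'
    value      TEXT NOT NULL,     -- one row per value, so a kind can repeat
    PRIMARY KEY (project_id, kind, value)
);
"""


def build_demo_db():
    """Create an in-memory database holding a few made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO project (id, forge, name) VALUES (1, 'sourceforge', 'exampleproj')"
    )
    conn.executemany(
        "INSERT INTO monthly_number VALUES (?, ?, ?, ?)",
        [
            (1, "downloads", "2010-01", 120),
            (1, "downloads", "2010-02", 80),
            (1, "web views", "2010-01", 300),
        ],
    )
    conn.executemany(
        "INSERT INTO property VALUES (?, ?, ?)",
        [
            (1, "programming language", "Python"),
            (1, "programming language", "C"),
            (1, "operating system", "Linux"),
        ],
    )
    conn.commit()
    return conn


def total_for_metric(conn, project_id, metric):
    """Sum a monthly metric over all months (0 if there are no rows)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(value), 0) FROM monthly_number "
        "WHERE project_id = ? AND metric = ?",
        (project_id, metric),
    ).fetchone()
    return row[0]


def property_values(conn, project_id, kind):
    """Return all values of a multi-valued property, sorted alphabetically."""
    rows = conn.execute(
        "SELECT value FROM property WHERE project_id = ? AND kind = ? ORDER BY value",
        (project_id, kind),
    ).fetchall()
    return [r[0] for r in rows]
```

Keeping the forge as a plain column on the project table, rather than using one set of tables per forge, is one way to let the same tables hold data from any forge.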
We are currently in the process of migrating and merging data from FLOSSmole and SRDA into our database. The routines currently in use for this migration will be integrated into the software framework as an "automatic update" feature that will check for new data from these sources. Any redundant data from different sources will be compared and then merged.
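The compare-then-merge step could look roughly like the sketch below. The merge policy shown (keep a value when all sources agree, otherwise set the record aside for review) is an assumption, as are the function name `merge_monthly` and the record layout; the actual DBoss routines may resolve disagreements differently.

```python
def merge_monthly(sources):
    """Compare redundant records from several sources, then merge them.

    `sources` maps a source name (e.g. 'flossmole') to a dict that maps
    a (project, metric, month) key to a numeric value.

    Returns a pair (merged, conflicts):
      merged    -- key -> value, for keys where every source that has the
                   key reports the same value
      conflicts -- key -> {source name: value}, for keys where the sources
                   disagree (hypothetical policy: leave these for review)
    """
    # Group the values reported for each key by the source that reported them.
    seen = {}
    for source_name, records in sources.items():
        for key, value in records.items():
            seen.setdefault(key, {})[source_name] = value

    merged = {}
    conflicts = {}
    for key, by_source in seen.items():
        distinct = set(by_source.values())
        if len(distinct) == 1:
            merged[key] = distinct.pop()
        else:
            conflicts[key] = by_source
    return merged, conflicts
```

For example, if two sources both report 120 downloads for a given project and month, that record is merged as 120; if one reports 80 and the other 95, the record lands in `conflicts` with both values attached.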
After we complete this migration and the corresponding automatic update functionality, we plan to keep web crawlers of our own regularly adding new data. Preliminary experiments with crawling routines have already shown great potential for the automated collection of a wealth of detail in a relatively simple fashion.
User interface
Our stand-alone user interface (currently in pre-alpha status) is based on Python and Qt, and has a built-in command line with direct access to all its features. The same internal interface used by this command line can also be used for automation through user scripts written in Python.
In the future, we also plan to develop a web interface, built upon the same internal framework and using Django.
Tracker
Our development tracker is located at http://ccsl.ime.usp.br/redmine/projects/dboss and our source code is hosted at http://github.com/mbonci/dboss. Our software is not yet ready for end users, but developers are invited to participate.