Bibliometrics for software engineering conferences

ICSE, International Conference on Software Engineering
ICSM, IEEE International Conference on Software Maintenance
ASE, IEEE/ACM International Conference on Automated Software Engineering
FSE, ACM SIGSOFT Symposium on the Foundations of Software Engineering
FASE, International Conference on Fundamental Approaches to Software Engineering
MSR, Working Conference on Mining Software Repositories
WCRE, Working Conference on Reverse Engineering
CSMR, European Conference on Software Maintenance and Reengineering
GPCE, Generative Programming and Component Engineering
ICPC, IEEE International Conference on Program Comprehension
SCAM, International Working Conference on Source Code Analysis & Manipulation

The data is stored in a MySQL database (see the SQL dump) with the following schema:

Alternatively, the database can be recreated (and hence easily extended) from CSV files with the included scripts, which use Python and the SQLAlchemy Object Relational Mapper (more details below).

Data provenance

Papers and authors: the DBLP data dump. Papers which were part of the main (research) track have been (manually) marked as such in the main_track column. The conference impact factor (the impact column from the conferences table) is the SHINE h-index for the period 2000-2012.
Number of submissions: Tao Xie's software engineering conference statistics; foreword to proceedings.
Composition of programme committee: conference websites. Only programme committee members for the main tracks have been included. Names were disambiguated to align the spelling used on the different websites with that found in DBLP.

In some cases the DBLP data also contains the session title(s) for a given paper. For example, for papers published at ICSE 2012, a session title (such as Technical Research, originally encoded as an HTML h2 header and recorded in the session_h2 column) and a session subtitle (such as Fault Handling, originally encoded as an HTML h3 header and recorded in the session_h3 column) are available. When available, such titles can be used to filter papers automatically for a given bibliometric analysis.
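Filtering on the session columns can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the MySQL database, the table name papers and the title column are hypothetical, main_track is assumed to hold 1 or 0, and the rows are made up. Only the column names main_track, session_h2 and session_h3 come from the description above.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL database (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical table layout; only main_track, session_h2 and
# session_h3 are column names taken from the readme.
conn.execute(
    "CREATE TABLE papers ("
    "title TEXT, main_track INTEGER, session_h2 TEXT, session_h3 TEXT)"
)

# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO papers VALUES (?, ?, ?, ?)",
    [
        ("Paper A", 1, "Technical Research", "Fault Handling"),
        ("Paper B", 1, "Technical Research", "Some Other Session"),
        ("Paper C", 0, "Some Other Track", None),
    ],
)

# Keep only papers whose session title (h2) is "Technical Research".
technical = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM papers WHERE session_h2 = ? ORDER BY title",
        ("Technical Research",),
    )
]

# Narrow further by session subtitle (h3).
fault_handling = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM papers WHERE session_h3 = ? ORDER BY title",
        ("Fault Handling",),
    )
]
```

Papers without session information have empty session columns, so a filter like this silently excludes them; the manually curated main_track column is likely the safer choice when it is available.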

Using the database

Directly

Most simply, you can import the SQL dump into your favourite database management system (tested on MySQL) and start querying.

Via Python

Alternatively, you can look at how the database was created using MySQL, Python and SQLAlchemy, and use the same mechanisms for querying. This also lets you easily extend the database or update its schema.

Dependencies and installation instructions

If you take this path, make sure you have Python and a MySQL server installed before attempting anything. Thanks a lot to Leon Moonen for spelling out the exact steps (tested on his OS X 10.8.5 machine with Python 2.7.2).

Install Unidecode: easy_install Unidecode
Install SQLAlchemy: easy_install SQLAlchemy
Make sure that the MySQL bin directory is in your PATH (or the next step will fail on mysql_config)
Make sure that the MySQL lib directory is in your dynamic library path (or the next step will fail on loading the library)
Install MySQL-Python: easy_install mysql-python
Tweak populateDB.py for your particular MySQL user and password (the script assumes user root with empty password)
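SQLAlchemy connects to a database through a URL of the form dialect+driver://user:password@host/database. How populateDB.py stores its credentials is not shown here, so the following is only a sketch of how such a URL could be built for the MySQL-Python driver. The helper name mysql_url and the database name bibliometrics are hypothetical; the defaults mirror the root user with empty password mentioned above.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2


def mysql_url(user="root", password="", host="localhost",
              database="bibliometrics"):
    """Build a SQLAlchemy URL for the MySQL-Python (mysqldb) driver.

    The database name is a placeholder; use the name of your own schema.
    User and password are percent-escaped so that characters such as
    '@' or '/' do not break the URL.
    """
    return "mysql+mysqldb://{0}:{1}@{2}/{3}".format(
        quote_plus(user), quote_plus(password), host, database
    )


default_url = mysql_url()
escaped_url = mysql_url(password="p@ss")
```

The resulting string is what would be passed to SQLAlchemy's create_engine.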

Python scripts

initDB.py: declares the database schema using Python classes (will be automatically mapped to tables by SQLAlchemy).
populateDB.py: reads data about the papers and programme committees for each conference and loads it into the database.
metrics.py: defines a metrics model and how to compute the metrics. To account for the different ages of the conferences, we use sliding window metrics. For example,
- author turnover RNA(c,y,k): fraction of authors at conference c in year y that have not been authors between y-k and y-1.
- programme committee turnover RNC(c,y,k): fraction of PC members of c in year y that have not served on the PC between y-k and y-1.
- inbreeding ratio RAC(c,y,k): fraction of papers published at c in year y co-authored by PC members from y-k to y.
For a complete list of metrics check this list, or see our Science of Computer Programming article.
queryDB.py: queries the database, computes the metrics defined in the metrics model, and outputs the results to CSV files. As an example of visualising these results, we include the visualisation.r R script, which plots RAC(c,y,0), the fraction of papers each year co-authored by PC members from that year.
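The sliding-window metrics described for metrics.py can be illustrated on toy data. The following is a minimal sketch, not the code from metrics.py: the function names, the dictionary-based data layout, and the people and papers are all made up, and each dictionary is assumed to hold the data of a single conference c.

```python
from __future__ import division  # true division on Python 2 as well


def turnover(members_by_year, year, k):
    """RNA / RNC style metric: fraction of the members in `year`
    that do not appear in any of the years year-k .. year-1.

    Returns None when there are no members in `year`.
    """
    current = members_by_year.get(year, set())
    if not current:
        return None
    previous = set()
    for past in range(year - k, year):
        previous |= members_by_year.get(past, set())
    return len(current - previous) / len(current)


def inbreeding_ratio(papers_by_year, pc_by_year, year, k):
    """RAC style metric: fraction of the papers in `year` with at least
    one author who was a PC member in any of the years year-k .. year.

    Returns None when there are no papers in `year`.
    """
    papers = papers_by_year.get(year, [])
    if not papers:
        return None
    pc = set()
    for y in range(year - k, year + 1):
        pc |= pc_by_year.get(y, set())
    hits = sum(1 for authors in papers if pc & set(authors))
    return hits / len(papers)


# Made-up data for one conference.
authors_by_year = {
    2010: {"ann", "bob"},
    2011: {"bob", "cat"},
    2012: {"bob", "cat", "dan", "eve"},
}
pc_by_year = {2011: {"bob"}, 2012: {"cat"}}
papers_by_year = {
    2012: [["bob", "dan"], ["cat", "eve"], ["dan", "eve"], ["dan"]],
}

rna_2012_k2 = turnover(authors_by_year, 2012, 2)
rac_2012_k0 = inbreeding_ratio(papers_by_year, pc_by_year, 2012, 0)
rac_2012_k1 = inbreeding_ratio(papers_by_year, pc_by_year, 2012, 1)
```

On this toy data, two of the four 2012 authors are new with respect to 2010 and 2011, so the author turnover for k=2 is 0.5. One of the four 2012 papers has an author on the 2012 PC, giving 0.25 for k=0, and widening the window to k=1 adds the 2011 PC and raises the ratio to 0.5. The same turnover function serves for the PC turnover when given PC membership per year.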

Licenses

The database is made available under the Open Database License
Any rights in individual contents of the database (i.e., the data) are licensed under the Database Contents License
The tooling (e.g., Python scripts and R scripts) is licensed under the GNU Lesser General Public License version 3

Citation information

If you find the dataset or tooling useful in your research, please consider citing the following paper:

Bogdan Vasilescu, Alexander Serebrenik, and Tom Mens, "Mining software engineering conference data", in MSR '13: Proceedings of the 10th Working Conference on Mining Software Repositories, May 18–19, 2013, San Francisco, California, USA; pages 373–376, ACM.

Additionally, if you're interested in a "health assessment" of software engineering conferences, consider reading our Science of Computer Programming paper:

Bogdan Vasilescu, Alexander Serebrenik, Tom Mens, Mark van den Brand, and Ekaterina Pek, "How healthy are software engineering conferences?", Science of Computer Programming 89, Part C, (2014), 251–272.
