This collection of Python scripts allows the creation and preparation of a dataset composed of Java and Kotlin applications.
Currently, only F-Droid is provided as an index source for projects. Still, given a list of GitHub URLs, you can use these scripts to clone and prepare any dataset of Java and Kotlin applications (not limited to Android projects).
The available Makefile helps to automate various tasks.
The scripts are written in Python. To install all required libraries (defined in the requirements.txt file), run the command:
make setup
Some scripts use GitHub's APIs/libraries to analyse and clone repositories. You may therefore need to specify your GitHub access token in an .env file, for example:
GITHUB-ACCESS-TOKEN={your-access-key}
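As a rough illustration of how a script might pick up the token, the sketch below parses plain `KEY=VALUE` lines using only the standard library. The helper names `read_env_file` and `get_github_token` are hypothetical and not taken from this repository; the actual scripts may load the file differently.

```python
from pathlib import Path


def read_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around '=' and optional surrounding quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_github_token(path=".env"):
    """Return the token, or None when the file or the key is missing."""
    if not Path(path).is_file():
        return None
    return read_env_file(path).get("GITHUB-ACCESS-TOKEN") or None
```

Returning `None` instead of raising lets a caller decide whether to fall back to unauthenticated requests, which GitHub rate-limits more strictly.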
In the config.env file, you can specify whether to analyse/clone Java and/or Kotlin projects:
JAVA-PROJECTS-ANALYSIS = <boolean>
KOTLIN-PROJECTS-ANALYSIS = <boolean>
If those variables are not present in the config.env file, the scripts accept both languages.
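The default described above could be expressed as follows. This is only a sketch: `parse_bool` and `accepted_languages` are hypothetical names, and the set of strings treated as true is an assumption rather than the scripts' documented behaviour.

```python
def parse_bool(value, default=True):
    """Interpret a config string as a boolean; a missing value yields the default."""
    if value is None:
        return default
    return value.strip().strip('"').lower() in {"true", "1", "yes"}


def accepted_languages(config):
    """Return the languages enabled in a dict of config values."""
    languages = []
    if parse_bool(config.get("JAVA-PROJECTS-ANALYSIS")):
        languages.append("Java")
    if parse_bool(config.get("KOTLIN-PROJECTS-ANALYSIS")):
        languages.append("Kotlin")
    return languages
```

With an empty config both languages are returned; setting one variable to `False` leaves only the other.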
Since scripts that analyse and/or clone GitHub projects may take a while to run, you can process the projects in chunks by setting the pagination variables, offset and limit, in the config.env file, for example:
FILTER-PAGINATION-OFFSET={insert-number}
FILTER-PAGINATION-LIMIT={insert-number}
If those variables are not present in the config.env file, the script will analyse/fetch all projects.
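One way to picture the offset/limit behaviour is as a slice over the project list. The function below is a hypothetical sketch, not the repository's code; it assumes the values arrive as strings (or are absent) and that a missing offset means "start at 0" while a missing limit means "no upper bound".

```python
def paginate(projects, offset=None, limit=None):
    """Return the chunk of `projects` selected by offset and limit.

    Both arguments may be None (variable not set) or numeric strings.
    """
    start = int(offset) if offset is not None else 0
    end = start + int(limit) if limit is not None else None
    return projects[start:end]
```

For example, running with offset 0 and limit 50, then offset 50 and limit 50, would cover the first hundred projects in two separate runs.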
Many scripts create output files. In general, the location and names for those files are specified in the filePaths.env file.
To clean all the output files and cloned projects produced by the scripts, run the command:
make clean-output
To run the complete flow of filtering and cloning projects from the F-Droid index, run the command:
make get-fdroid-dataset
| Script | Description | Input | Output |
|---|---|---|---|
| getFDroidIndex.py | Downloads the F-Droid index file and saves each project's URL. | | 0-fdroid-index.xml; 0-f-droid-projects.txt |
| removeNonGithubURL.py | Removes non-GitHub URLs. | 0-f-droid-projects.txt | 1-github-projects.txt |
| removeDuplicatedURL.py | Removes duplicated URLs. | 1-github-projects.txt | 2-non-duplicated-projects.txt |
| filterProjectsByStats.py | Fetches info and stats for each project and filters them. | 2-non-duplicated-projects.txt | 3-projects-stats.csv; 3-filtered-projects.txt |
| createVersionsFiles.py | Augments list with URLs for first and last version of each project. | 3-filtered-projects.txt | 4-projects-versions.txt |
| cloneProjects.py | Clones all listed repositories. | 4-projects-versions.txt | 5-projects/* |
| cleanProjects.py | Removes from each project unwanted files such as tests and assets. | 5-projects/* | 5-projects/* |
| prepareDataset.py | Zips each cloned repository and creates the final folder structure. | 5-projects/* | 6-dataset/* |
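To make the first two filtering steps in the table concrete, here is a minimal sketch of keeping only GitHub URLs and dropping duplicates. It is not the code of removeNonGithubURL.py or removeDuplicatedURL.py; the function names and the normalization rules (ignoring case, a trailing slash, and a `.git` suffix) are assumptions chosen for illustration.

```python
from urllib.parse import urlparse


def is_github_url(url):
    """True when the URL's host is github.com (scheme required)."""
    host = (urlparse(url.strip()).hostname or "").lower()
    return host in {"github.com", "www.github.com"}


def normalize(url):
    """Build a comparison key: no trailing slash, no '.git', lower case."""
    url = url.strip().rstrip("/")
    if url.endswith(".git"):
        url = url[:-4]
    return url.lower()


def unique_github_urls(urls):
    """Keep the first occurrence of each GitHub repository URL, in order."""
    seen = set()
    result = []
    for url in urls:
        if not is_github_url(url):
            continue
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            result.append(url.strip())
    return result
```

Preserving input order keeps the numbered output files stable between runs, which matters when later steps are paginated by offset.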
- ".env"
GITHUB-ACCESS-TOKEN="xxx"
- ".filePaths.env"
F-DROID-INDEX-FILE = "output/0-fdroid-index.xml"
F-DROID-PROJECTS-LIST-FILE = "output/0-f-droid-projects.txt"
GITHUB-PROJECTS-LIST-FILE = "output/1-github-projects.txt"
NON-DUPLICATED-PROJECTS-LIST-FILE = "output/2-non-duplicated-projects.txt"
FILTERED-PROJECTS-LIST-FILE = "output/3-filtered-projects.txt"
PROJECTS-STATS-FILE = "output/3-projects-stats.csv"
PROJECTS-VERSIONS-FILE = "output/4-projects-versions.txt"
CLONED-PROJECTS-FOLDER = "output/5-projects"
DATASET-PROJECTS-FOLDER = "output/6-dataset"
DATASET-PROJECTS-LIST = "output/6-dataset.txt"
- "config.env"
# Filter
JAVA-PROJECTS-ANALYSIS = True
KOTLIN-PROJECTS-ANALYSIS = True
#FILTER-PAGINATION-OFFSET=0
#FILTER-PAGINATION-LIMIT=50
# Versions file
#VERSIONS-FILE-OFFSET=0
#VERSIONS-FILE-LIMIT=10
# Clone
#CLONE-FILE-OFFSET=0
#CLONE-FILE-LIMIT=10