Project Directory

The ProjectDirectory object represents a collection of git repositories (perhaps in a single directory). It can be created in 3 main ways:

  • By specifying the directory in which the repositories live locally
  • By explicitly passing a list of local directories for each git repository
  • By explicitly passing remote git repositories to be cloned locally in temporary directories.

Once constructed, all three behave the same. As long as every repository is specified explicitly, you can even mix remote and local repositories in a single list.

Using each method:

Directory Of Repositories

To create a ProjectDirectory object from a directory that contains multiple repositories simply use:

from gitpandas import ProjectDirectory
pd = ProjectDirectory(working_dir='/path/to/dir/', ignore_repos=None, verbose=True)

Here ignore_repos can be a list of directories to explicitly ignore. This method uses os.walk to search the passed directory for any .git directories, so a repository is included even if it sits many levels below the passed working directory. To check which repositories are included in your object:

print(pd.repo_name())
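The directory search described above can be sketched with the standard library alone. This is a minimal illustration, not the library's actual implementation: the function name find_git_repos is hypothetical, and matching ignored entries by directory name is an assumption.

```python
import os


def find_git_repos(root, ignore=None):
    """Return sorted paths of directories under root that contain a .git directory.

    Hypothetical helper: ignored entries are matched by directory name,
    which is an assumption and may differ from what gitpandas does.
    """
    ignore = set(ignore or [])
    repos = []
    for dirpath, dirnames, _filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in ignore]
        if '.git' in dirnames:
            repos.append(dirpath)
            # There is no need to search inside the .git directory itself.
            dirnames.remove('.git')
    return sorted(repos)
```

Because os.walk visits every level, a repository nested several directories deep is found just like one sitting directly under the root.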

Explicit Local

Explicit local directories are passed by using a list rather than a string for working_dir:

from gitpandas import ProjectDirectory
pd = ProjectDirectory(working_dir=['/path/to/repo1/', '/path/to/repo2/'], ignore_repos=None, verbose=True)

In this case, the subdirectories of the directories passed are not searched, so every directory passed must have a .git directory in it.

Explicit Remote

Explicit remote repositories are passed the same way, using a list of repository URLs rather than a string for working_dir:

from gitpandas import ProjectDirectory
pd = ProjectDirectory(working_dir=['git://github.com/user/repo.git'], ignore_repos=None, verbose=True)

As mentioned, you can mix explicit remote and explicit local repositories. The remote repos will be cloned into temporary directories and treated as local ones under the hood. Because of this, creating ProjectDirectory objects with many explicit remote repositories can be relatively slow, especially for large repos.

Detailed API Documentation

class gitpandas.project.GitHubProfile(username, ignore_forks=False, ignore_repos=None, verbose=False)[source]

An extension of the ProjectDirectory object that is based off of a single github.com user’s public profile.

class gitpandas.project.ProjectDirectory(working_dir=None, ignore_repos=None, verbose=True)[source]

An object that refers to a directory full of git repositories, for bulk analysis. It contains a collection of git-pandas repository objects, created by os.walk-ing a directory to find all child .git subdirectories.

Parameters:
  • working_dir – (optional, default=None), the working directory to search for repositories in, None for cwd, or an explicit list of directories containing git repositories
  • ignore_repos – (optional, default=None), a list of directories to ignore when searching for git repos.
  • verbose – (default=True), if True, will print out verbose logging to terminal
Returns:

blame(committer=True, by='repository', ignore_globs=None, include_globs=None)[source]

Returns the blame from the current HEAD of the repositories as a DataFrame. The DataFrame is grouped by committer name, so it will be the sum of all contributions to all repositories by each committer. As with the commit history method, the ignore_globs and include_globs parameters can be passed to exclude certain files or to focus on certain ones. The DataFrame will have the columns:

  • committer
  • loc
Parameters:
  • committer – (optional, default=True) true if committer should be reported, false if author
  • by – (optional, default=repository) whether to group by repository or by file
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame

branches()[source]

Returns a data frame of all branches in origin. The DataFrame will have the columns:

  • repository
  • local
  • branch
Returns:

DataFrame

bus_factor(ignore_globs=None, include_globs=None, by='projectd')[source]

An experimental heuristic for the truck factor of a repository, calculated from the current distribution of blame in the repository’s primary branch. The factor is the smallest number of contributors whose contributions make up at least 50% of the codebase’s LOC.
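The heuristic can be illustrated on a plain mapping of contributor to lines of code. This is a sketch only: the function name bus_factor_from_blame and the greedy largest-first ordering are assumptions, not the library's code.

```python
def bus_factor_from_blame(loc_by_contributor, threshold=0.5):
    """Smallest number of contributors covering at least `threshold` of all LOC.

    Hypothetical helper: contributors are taken largest-first, which yields
    the minimum count needed to reach the threshold.
    """
    total = sum(loc_by_contributor.values())
    if total <= 0:
        return 0
    covered = 0
    ranked = sorted(loc_by_contributor.values(), reverse=True)
    for count, loc in enumerate(ranked, start=1):
        covered += loc
        if covered >= threshold * total:
            return count
    return len(ranked)
```

For example, with blame totals of 40, 35 and 25 lines, the largest contributor alone covers only 40% of the code, so two contributors are needed to reach 50%.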

Parameters:
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

commit_history(branch, limit=None, days=None, ignore_globs=None, include_globs=None)[source]

Returns a pandas DataFrame containing all of the commits for a given branch. The results from all repositories are appended to each other, resulting in one large data frame of size <limit>. If a limit is provided, it is divided by the number of repositories in the project directory to find out how many commits to pull from each project. Future implementations will use date ordering across all projects to get the true most recent N commits across the project.

Included in that DataFrame will be the columns:

  • repository
  • date (index)
  • author
  • committer
  • message
  • lines
  • insertions
  • deletions
  • net
Parameters:
  • branch – the branch to return commits for
  • limit – (optional, default=None) a maximum number of commits to return, None for no limit
  • days – (optional, default=None) number of days to return if limit is None
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame
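The way a limit is shared between repositories can be sketched as follows. The helper name per_repo_limit is hypothetical, and the use of integer (floor) division is an assumption; the description above only states that the limit is divided by the number of repositories.

```python
def per_repo_limit(limit, n_repos):
    """Split an overall commit limit evenly across repositories.

    Hypothetical helper. Floor division is an assumption about rounding.
    """
    if limit is None:
        return None  # no limit: every repository returns its full history
    if n_repos <= 0:
        raise ValueError('n_repos must be positive')
    return limit // n_repos
```

Under this assumption a limit of 10 across 3 repositories pulls 3 commits from each, so the combined frame can hold fewer rows than the limit, and the rows are not necessarily the most recent commits across the whole project.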

coverage()[source]

Will return a DataFrame with coverage information (if available) for each repo in the project.

If there is a .coverage file available, this will attempt to form a DataFrame with that information in it, which will contain the columns:

  • repository
  • filename
  • lines_covered
  • total_lines
  • coverage

If it can’t be found or parsed, an empty DataFrame of that form will be returned.

Returns:

DataFrame

cumulative_blame(branch='master', by='committer', limit=None, skip=None, num_datapoints=None, committer=True, ignore_globs=None, include_globs=None)[source]

Returns a time series of cumulative blame for a collection of projects. The goal is to return a DataFrame for a collection of projects with the LOC attached to an entity at each point in time. The returned DataFrame can take 3 forms (switched with the by parameter, default ‘committer’):

  • committer: one column per committer
  • project: one column per project
  • raw: one column per committer per project
Parameters:
  • branch – (optional, default ‘master’) the branch to work in
  • limit – (optional, default None), the maximum number of revisions to return, None for no limit
  • skip – (optional, default None), the number of revisions to skip. Ex: skip=2 returns every other revision, None for no skipping.
  • num_datapoints – (optional, default=None) if limit and skip are none, and this isn’t, then num_datapoints evenly spaced revs will be used
  • committer – (optional, default=True) true if committer should be reported, false if author
  • by – (optional, default=’committer’) whether to arrange the output by committer, by project, or raw
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame

file_change_history(branch='master', limit=None, days=None, ignore_globs=None, include_globs=None)[source]

Returns a DataFrame of all file changes (via the commit history) for the specified branch. This is similar to the commit history DataFrame, but is one row per file edit rather than one row per commit (which may encapsulate many file changes). Included in the DataFrame will be the columns:

  • repository
  • date (index)
  • author
  • committer
  • message
  • filename
  • insertions
  • deletions
Parameters:
  • branch – the branch to return commits for
  • limit – (optional, default=None) a maximum number of commits to return, None for no limit
  • days – (optional, default=None) number of days to return if limit is None
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame

file_change_rates(branch='master', limit=None, coverage=False, days=None, ignore_globs=None, include_globs=None)[source]

This function will return a DataFrame containing some basic aggregations of the file change history data, and optionally test coverage data from a coverage.py .coverage file. The aim here is to identify files in the project which have abnormal edit rates, or a high rate of changes without growth in the file’s size. If a file has a high change rate and poor test coverage, then it is a great candidate for writing more tests.

Parameters:
  • branch – (optional, default=master) the branch to return commits for
  • limit – (optional, default=None) a maximum number of commits to return, None for no limit
  • coverage – (optional, default=False) a bool for whether or not to attempt to join in coverage data.
  • days – (optional, default=None) number of days to return if limit is None
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame

file_detail(rev='HEAD', committer=True, ignore_globs=None, include_globs=None)[source]

Returns a table of all current files in the repos, with some high level information about each file (total LOC, file owner, extension, most recent edit date, etc.).

Parameters:
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
  • committer – (optional, default=True) true if committer should be reported, false if author
Returns:

has_coverage()[source]

Returns a DataFrame of repo names and whether or not they have a .coverage file that can be parsed

Returns:

DataFrame

hours_estimate(branch='master', grouping_window=0.5, single_commit_hours=0.5, limit=None, days=None, committer=True, by=None, ignore_globs=None, include_globs=None)[source]

Inspired by: https://github.com/kimmobrunfeldt/git-hours/blob/8aaeee237cb9d9028e7a2592a25ad8468b1f45e4/index.js#L114-L143

Iterates through the commit history of each repo to estimate the time commitment of each author or committer over the period selected by limit, days, globs, etc.

Parameters:
  • branch – the branch to return commits for
  • limit – (optional, default=None) a maximum number of commits to return, None for no limit
  • grouping_window – (optional, default=0.5 hours) the threshold for how close two commits need to be to consider them part of one coding session
  • single_commit_hours – (optional, default 0.5 hours) the time range to associate with one single commit
  • days – (optional, default=None) number of days to return, if limit is None
  • committer – (optional, default=True) whether to use committer vs. author
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame
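The session-based estimate can be sketched for a single contributor's commit timestamps. This follows the general git-hours idea using the grouping_window and single_commit_hours parameters described above; the function name estimate_hours and the exact boundary handling (a strict less-than comparison) are assumptions rather than the library's code.

```python
from datetime import datetime


def estimate_hours(timestamps, grouping_window=0.5, single_commit_hours=0.5):
    """Estimate hours worked from one contributor's commit datetimes.

    Hypothetical helper. Gaps shorter than grouping_window (in hours) count
    as time spent in one session; any longer gap starts a new session, which
    is credited with single_commit_hours.
    """
    if not timestamps:
        return 0.0
    ordered = sorted(timestamps)
    total = single_commit_hours  # credit for the very first commit
    for earlier, later in zip(ordered, ordered[1:]):
        gap = (later - earlier).total_seconds() / 3600.0
        if gap < grouping_window:
            total += gap  # same coding session
        else:
            total += single_commit_hours  # new session begins
    return total


commits = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 15),
    datetime(2024, 1, 1, 10, 30),
    datetime(2024, 1, 1, 14, 0),
]
hours = estimate_hours(commits)
```

In this example the first three commits form one session of 0.5 hours, and the first commit and the late commit each add the single-commit allowance of 0.5 hours.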

is_bare()[source]

Returns a dataframe of repo names and whether or not they are bare.

Returns:

DataFrame

punchcard(branch='master', limit=None, days=None, by=None, normalize=None, ignore_globs=None, include_globs=None)[source]

Returns a pandas DataFrame containing all of the data for a punchcard. The DataFrame will have the columns:

  • day_of_week
  • hour_of_day
  • author / committer
  • lines
  • insertions
  • deletions
  • net
Parameters:
  • branch – the branch to return commits for
  • limit – (optional, default=None) a maximum number of commits to return, None for no limit
  • days – (optional, default=None) number of days to return, if limit is None
  • by – (optional, default=None) agg by options, None for no aggregation (just a high level punchcard), or ‘committer’, ‘author’, ‘repository’
  • normalize – (optional, default=None) if an integer, returns the data normalized to max value of that (for plotting)
  • ignore_globs – (optional, default=None) a list of globs to ignore, default none excludes nothing
  • include_globs – (optional, default=None) a list of globs to include, default of None includes everything.
Returns:

DataFrame
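The aggregation behind a punchcard can be sketched without pandas. The helper below is hypothetical: it takes (datetime, lines) pairs, uses Python's Monday=0 weekday convention (an assumption about day_of_week), and scales values so the peak equals normalize when that argument is given.

```python
from collections import defaultdict
from datetime import datetime


def punchcard(commits, normalize=None):
    """Sum changed lines per (day_of_week, hour_of_day) bucket.

    Hypothetical helper. commits is an iterable of (datetime, lines) pairs;
    day_of_week follows datetime.weekday(), where Monday is 0.
    """
    buckets = defaultdict(int)
    for when, lines in commits:
        buckets[(when.weekday(), when.hour)] += lines
    result = dict(buckets)
    if normalize is not None and result:
        peak = max(result.values())
        if peak > 0:
            # Scale every bucket so the largest one equals `normalize`.
            result = {key: value / peak * normalize for key, value in result.items()}
    return result


sample = [
    (datetime(2024, 1, 1, 9, 30), 10),   # a Monday
    (datetime(2024, 1, 1, 9, 45), 5),
    (datetime(2024, 1, 2, 14, 0), 30),   # a Tuesday
]
card = punchcard(sample)
scaled = punchcard(sample, normalize=1)
```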

repo_information()[source]

Returns a DataFrame with the properties of all repositories in the project directory. The returned DataFrame will have the columns:

  • local_directory
  • branches
  • bare
  • remotes
  • description
  • references
  • heads
  • submodules
  • tags
  • active_branch
Returns:

DataFrame

repo_name()[source]

Returns a DataFrame of the repo names present in this project directory

Returns:

DataFrame

revs(branch='master', limit=None, skip=None, num_datapoints=None)[source]

Returns a dataframe of all revision tags and their timestamps for each project. It will have the columns:

  • date
  • repository
  • rev
Parameters:
  • branch – (optional, default ‘master’) the branch to work in
  • limit – (optional, default None), the maximum number of revisions to return, None for no limit
  • skip – (optional, default None), the number of revisions to skip. Ex: skip=2 returns every other revision, None for no skipping.
  • num_datapoints – (optional, default=None) if limit and skip are none, and this isn’t, then num_datapoints evenly spaced revs will be used
Returns:

DataFrame
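How limit, skip and num_datapoints interact can be sketched on a plain list of revisions. The helper name select_revs is hypothetical, and applying limit before skip is an assumption; the rule that num_datapoints only applies when limit and skip are both None comes from the parameter descriptions above.

```python
def select_revs(revs, limit=None, skip=None, num_datapoints=None):
    """Pick a subset of revisions from an ordered list.

    Hypothetical helper. num_datapoints is only used when limit and skip are
    both None; otherwise limit is applied first, then skip (an assumption).
    """
    revs = list(revs)
    if limit is None and skip is None and num_datapoints is not None:
        if num_datapoints <= 0:
            return []
        if num_datapoints >= len(revs):
            return revs
        # Evenly spaced indices across the whole history.
        step = len(revs) / num_datapoints
        return [revs[int(i * step)] for i in range(num_datapoints)]
    if limit is not None:
        revs = revs[:limit]
    if skip:
        revs = revs[::skip]  # skip=2 keeps every other revision
    return revs
```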

tags()[source]

Returns a data frame of all tags in origin. The DataFrame will have the columns:

  • repository
  • tag
Returns:

DataFrame