GenomeHouse Documentation

Complete guide to the modular bioinformatics toolkit

Latest Version: 1.2.0

This documentation covers all features available in the latest release.

Installation

Install from PyPI

pip install genomehouse

Install from Source

git clone https://github.com/GenomeHouse/GenomeHouse-1.1.git
cd GenomeHouse-1.1
pip install -e .

Requirements

Python Requirements

  • Python ≥ 3.8
  • NumPy ≥ 1.19.0
  • Pandas ≥ 1.3.0

Optional Dependencies

  • Matplotlib (visualization)
  • Scikit-learn (ML features)
  • Seaborn (advanced plots)

Quick Start

# Import GenomeHouse modules
from genomehouse import sequence_tools, genomic_parsers

# Basic sequence analysis
seq = "ATGCGTACGGCTA"
gc_content = sequence_tools.gc_content(seq)
print(f"GC Content: {gc_content:.1f}%")

# Parse FASTA file
for header, sequence in genomic_parsers.parse_fasta("example.fasta"):
    print(header, len(sequence))

sequence_tools Module

gc_content(sequence)

Calculate the GC content percentage of a DNA sequence.

Parameters

  • sequence (str): DNA sequence string

Returns

float: GC content as a percentage (0 to 100)

# Example usage
sequence = "ATGCGCGCTA"
gc_percent = sequence_tools.gc_content(sequence)
# Returns: 60.0
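
To make the calculation concrete, here is a minimal pure-Python sketch of how a GC percentage can be computed. It is not GenomeHouse's implementation; the name gc_content_sketch and the handling of empty input are assumptions made for illustration.

```python
def gc_content_sketch(sequence: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence.

    Hypothetical helper for illustration, not part of GenomeHouse.
    """
    if not sequence:
        return 0.0  # assumption: an empty sequence is reported as 0.0
    seq = sequence.upper()  # count lowercase bases as well
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


print(gc_content_sketch("ATGCGCGCTA"))  # 60.0
```

Six of the ten bases in the example are G or C, which gives the 60.0 shown above.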

reverse_complement(sequence)

Generate the reverse complement of a DNA sequence.

# Example usage
sequence = "ATGCGT"
rev_comp = sequence_tools.reverse_complement(sequence)
# Returns: "ACGCAT"
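
The operation itself can be sketched in a few lines of standard-library Python: complement each base, then reverse the string. The function name below is hypothetical, and the library's handling of ambiguous bases such as N may differ from this sketch, which leaves them unchanged.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(sequence: str) -> str:
    """Complement every base, then reverse the result.

    Hypothetical helper; characters outside ACGT are left unchanged.
    """
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch("ATGCGT"))  # ACGCAT
```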

find_motifs(sequence, motif)

Find all occurrences of a motif in a sequence and return their start positions (0-based, as the example shows).

# Example usage
sequence = "ATGCGCGCTA"
motifs = sequence_tools.find_motifs(sequence, "GC")
# Returns: [2, 4, 6]
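
The positions in the example are 0-based start offsets. A minimal sketch of such a search, assuming plain substring matching and that overlapping matches are reported (the library's actual matching rules are not documented here), could look like this. The function name is hypothetical.

```python
def find_motifs_sketch(sequence: str, motif: str) -> list:
    """Return 0-based start positions of every occurrence of motif.

    Hypothetical helper; overlapping matches are included.
    """
    positions = []
    if not motif:
        return positions  # an empty motif has no meaningful positions
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping hits are found.
        start = sequence.find(motif, start + 1)
    return positions


print(find_motifs_sketch("ATGCGCGCTA", "GC"))  # [2, 4, 6]
print(find_motifs_sketch("AAAA", "AA"))        # [0, 1, 2]
```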

translate(sequence, frame=0)

Translate a DNA sequence to an amino acid sequence. The optional frame argument selects the reading frame (default 0).

# Example usage
sequence = "ATGAAATAG"
protein = sequence_tools.translate(sequence)
# Returns: "MK*"
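
For readers who want to see what translation involves, the sketch below builds the standard genetic code table and reads the sequence three bases at a time. It is an illustration under assumptions, not the library's code: the name translate_sketch is hypothetical, frame is treated as a 0-based offset, translation continues past stop codons, and unknown codons become "X".

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G.
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate_sketch(sequence: str, frame: int = 0) -> str:
    """Translate DNA to protein, starting at offset frame (assumed 0-based).

    Hypothetical helper: stop codons appear as '*', unknown codons as 'X',
    and a trailing partial codon is ignored.
    """
    seq = sequence.upper()
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


print(translate_sketch("ATGAAATAG"))  # MK*
```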

genomic_parsers Module

parse_fasta(filename)

Parse FASTA files and yield header-sequence pairs.

# Example usage
for header, sequence in genomic_parsers.parse_fasta("sequences.fasta"):
    print(f">{header}")
    print(f"Length: {len(sequence)}")
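
The header-sequence pairing can be illustrated with a small generator that reads one line at a time. This is a sketch of the general technique, not GenomeHouse's parser; parse_fasta_sketch and the demo file are hypothetical, and the header is assumed to be the full text after ">".

```python
import os
import tempfile


def parse_fasta_sketch(filename):
    """Yield (header, sequence) pairs from a FASTA file, one record at a time.

    Hypothetical helper; sequence lines of a record are joined together.
    """
    header, chunks = None, []
    with open(filename) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # emit the last record


# Write a tiny demo file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fasta")
    with open(path, "w") as out:
        out.write(">seq1 demo\nATGC\nGGTA\n>seq2\nTTAA\n")
    records = list(parse_fasta_sketch(path))

print(records)  # [('seq1 demo', 'ATGCGGTA'), ('seq2', 'TTAA')]
```

Because the function is a generator, only the record currently being assembled is held in memory.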

parse_fastq(filename)

Parse FASTQ files with quality scores.

# Example usage
for record in genomic_parsers.parse_fastq("reads.fastq"):
    print(f"ID: {record.id}")
    print(f"Quality: {record.quality_score}")
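
As background on the format, a FASTQ record is commonly stored as four lines: an "@" header, the sequence, a "+" separator, and a quality string with one character per base. The sketch below parses that layout and decodes qualities assuming the Phred+33 encoding. All names here (FastqRecord, parse_fastq_sketch, mean_phred) are hypothetical, and the fields of the library's own record objects may differ.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    id: str
    sequence: str
    quality: str  # raw quality string, one character per base


def parse_fastq_sketch(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (hypothetical helper)."""
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of input (or a blank line)
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score, assuming Phred+33 encoding by default."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n!5\n")
for record in parse_fastq_sketch(demo):
    print(record.id, record.sequence, mean_phred(record.quality))
# read1 ACGT 40.0
# read2 GG 10.0
```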

parse_vcf(filename)

Parse VCF (Variant Call Format) files.

# Example usage
for variant in genomic_parsers.parse_vcf("variants.vcf"):
    print(f"{variant.chrom}:{variant.pos}")
    print(f"{variant.ref} -> {variant.alt}")
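
For orientation, VCF data lines are tab-separated and begin with the fixed columns CHROM, POS, ID, REF and ALT, while lines starting with "#" carry metadata and the column header. The sketch below reads just those fields. It is an illustration, not the library's parser: Variant and parse_vcf_sketch are hypothetical names, and ALT is kept as the raw string even when it lists several alleles.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str  # raw ALT field; may contain comma-separated alleles


def parse_vcf_sketch(handle: TextIO) -> Iterator[Variant]:
    """Yield one Variant per data line (hypothetical helper)."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, metadata and the column header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        yield Variant(chrom, int(pos), ref, alt)


demo = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\n"
)
for variant in parse_vcf_sketch(demo):
    print(f"{variant.chrom}:{variant.pos} {variant.ref} -> {variant.alt}")
# chr1:12345 A -> G
```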

ml_tools Module

extract_features(sequences, method='kmer')

Extract numerical features from biological sequences.

# Example usage
from genomehouse import ml_tools

sequences = ["ATGCGT", "GCATGC", "TGCATG"]
features = ml_tools.extract_features(sequences, method='kmer')
# Returns: numpy array of k-mer frequencies
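
To show what k-mer frequencies mean in practice, here is a standard-library sketch that counts every overlapping k-mer in each sequence and normalizes the counts to frequencies. It is not the library's implementation: kmer_features_sketch, the default k=2, and the column ordering are assumptions, and it returns plain lists rather than a NumPy array.

```python
from itertools import product


def kmer_features_sketch(sequences, k=2, alphabet="ACGT"):
    """Return (kmers, rows): the k-mer column labels and one frequency row
    per sequence. Hypothetical helper for illustration.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    rows = []
    for seq in sequences:
        seq = seq.upper()
        counts = [0] * len(kmers)
        total = 0
        # Slide a window of width k across the sequence.
        for i in range(len(seq) - k + 1):
            j = index.get(seq[i:i + k])
            if j is not None:  # ignore k-mers with non-ACGT characters
                counts[j] += 1
                total += 1
        rows.append([c / total if total else 0.0 for c in counts])
    return kmers, rows


kmers, rows = kmer_features_sketch(["ATGCGT", "GCATGC", "TGCATG"], k=2)
print(len(kmers), len(rows))        # 16 3
print(rows[0][kmers.index("AT")])   # 0.2
```

Each row has 4**k columns, so the feature width grows quickly with k.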

SequenceClassifier()

Machine learning classifier for biological sequences.

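# Example usage
# features / labels: training data (e.g. from extract_features) and class labels
# new_features, test_features, test_labels: held-out data prepared the same way
classifier = ml_tools.SequenceClassifier()
classifier.fit(features, labels)
predictions = classifier.predict(new_features)
accuracy = classifier.score(test_features, test_labels)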

CLI Reference

GenomeHouse provides a command-line interface for common bioinformatics tasks.

genomehouse-cli --help

parse-fasta

Parse and analyze FASTA files from command line.

genomehouse-cli parse-fasta data/sample.fasta
genomehouse-cli parse-fasta data/sample.fasta --stats
genomehouse-cli parse-fasta data/sample.fasta --output results.txt

gc-content

Calculate GC content for sequences.

genomehouse-cli gc-content ATGCGTAC
GC Content: 50.0%

convert

Convert between different file formats.

genomehouse-cli convert reads.fastq output.fasta
genomehouse-cli convert --format fasta reads.fastq

Tutorials

Basic Sequence Analysis

# Complete sequence analysis workflow
from genomehouse import sequence_tools

# Define your sequence
dna_sequence = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"

# Basic analysis
gc_content = sequence_tools.gc_content(dna_sequence)
length = len(dna_sequence)
reverse_comp = sequence_tools.reverse_complement(dna_sequence)

# Translation
protein = sequence_tools.translate(dna_sequence)

# Find motifs
acc_motifs = sequence_tools.find_motifs(dna_sequence, "ACC")

# Print results
print(f"Sequence length: {length} bp")
print(f"GC content: {gc_content:.1f}%")
print(f"Protein: {protein}")
print(f"ACC motifs at positions: {acc_motifs}")

Troubleshooting

Common Issues

ImportError / ModuleNotFoundError: No module named 'genomehouse'

Make sure GenomeHouse is installed in the Python environment you are actually running: pip install genomehouse

FileNotFoundError when parsing files

Check that the file path is correct and the file exists. Use absolute paths if needed.
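
One way to apply this advice is to resolve the path and check it before handing it to a parser, so the error message shows the absolute location that was tried. The helper below is a hypothetical sketch using only the standard library; it is not part of GenomeHouse.

```python
from pathlib import Path


def resolve_input(path_str: str) -> Path:
    """Return the absolute path, or raise FileNotFoundError with a clear message.

    Hypothetical helper for illustration.
    """
    path = Path(path_str).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    return path


try:
    resolve_input("no_such_file.fasta")
except FileNotFoundError as exc:
    message = str(exc)
    print(message)  # shows the absolute path that was checked
```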

Memory issues with large files

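GenomeHouse uses streaming parsers, so records are read one at a time rather than loaded all at once. For very large files, avoid collecting every record into a list; process records in chunks or use the CLI tools.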
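
Chunked processing of a streaming parser can be sketched with itertools.islice, which pulls a fixed number of records from any iterator at a time. The in_chunks helper is hypothetical, and the generator of fake records below merely stands in for an iterator such as the one returned by parse_fasta.

```python
from itertools import islice


def in_chunks(records, size):
    """Yield lists of up to size items from any iterable (hypothetical helper)."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return  # iterator exhausted
        yield chunk


# Stand-in for a streaming parser: seven (header, sequence) pairs.
fake_records = ((f"seq{i}", "ACGT" * i) for i in range(1, 8))

chunk_sizes = [len(chunk) for chunk in in_chunks(fake_records, 3)]
print(chunk_sizes)  # [3, 3, 1]
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size rather than the file size.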

Contributing

We welcome contributions to GenomeHouse! Here's how you can help:

Development Setup

git clone https://github.com/GenomeHouse/GenomeHouse-1.1.git
cd GenomeHouse-1.1
pip install -e ".[dev]"
pytest tests/

Contribution Guidelines

  • Follow PEP 8 style guidelines
  • Add tests for new features
  • Update documentation
  • Submit pull requests