Complete guide to the modular bioinformatics toolkit
Latest Version: 1.2.0
This documentation covers all features available in the latest release.
pip install genomehouse
git clone https://github.com/GenomeHouse/GenomeHouse-1.1.git
cd GenomeHouse-1.1
pip install -e .
# Import GenomeHouse modules
from genomehouse import sequence_tools, genomic_parsers

# Basic sequence analysis
seq = "ATGCGTACGGCTA"
gc_content = sequence_tools.gc_content(seq)
print(f"GC Content: {gc_content}%")

# Parse FASTA file
for header, sequence in genomic_parsers.parse_fasta("example.fasta"):
    print(header, len(sequence))
Calculate the GC content percentage of a DNA sequence.
Parameters: sequence
Returns: float (GC content as a percentage)
# Example usage
sequence = "ATGCGCGCTA"
gc_percent = sequence_tools.gc_content(sequence)
# Returns: 60.0
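To make the calculation concrete, here is a minimal pure-Python sketch of a GC percentage. It is not GenomeHouse's implementation; the handling of empty input and lowercase bases is an assumption made for illustration.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    if not sequence:
        # Assumption: an empty sequence yields 0.0 instead of raising.
        return 0.0
    seq = sequence.upper()  # assumption: lowercase bases are accepted
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


print(gc_content("ATGCGCGCTA"))  # 60.0 (6 of 10 bases are G or C)
```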
Generate the reverse complement of a DNA sequence.
sequence = "ATGCGT"
rev_comp = sequence_tools.reverse_complement(sequence)
# Returns: "ACGCAT"
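For readers who want to see what the operation does, the following standalone sketch complements each base and then reverses the string. It is an illustration only and makes no claim about how GenomeHouse implements it; leaving non-ACGT characters unchanged is an assumption.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the result."""
    # Characters outside ACGT/acgt pass through unchanged (assumption).
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGCGT"))  # ACGCAT
```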
Find all occurrences of a motif pattern in a sequence.
sequence = "ATGCGCGCTA"
motifs = sequence_tools.find_motifs(sequence, "GC")
# Returns: [2, 4, 6]
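The result [2, 4, 6] is consistent with 0-based start positions of "GC" in "ATGCGCGCTA". A minimal sketch of such a search, assuming overlapping matches are reported (the library's behavior for overlaps is not documented here), could look like this:

```python
def find_motifs(sequence: str, motif: str) -> list[int]:
    """Return 0-based start positions of every occurrence of motif."""
    positions: list[int] = []
    if not motif:
        return positions  # assumption: an empty motif matches nothing
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping matches are found too.
        start = sequence.find(motif, start + 1)
    return positions


print(find_motifs("ATGCGCGCTA", "GC"))  # [2, 4, 6]
```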
Translate a DNA sequence to an amino acid sequence.
sequence = "ATGAAATAG"
protein = sequence_tools.translate(sequence)
# Returns: "MK*"
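As an illustration of what translation involves, the sketch below builds the standard genetic code table and reads the sequence codon by codon. The choices to keep stop codons as "*", to continue past them, to drop a trailing partial codon, and to emit "X" for unrecognized codons are assumptions, not documented GenomeHouse behavior.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str) -> str:
    """Translate DNA to protein, reading frame 0, one codon at a time."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # "X" marks an unknown codon
        for i in range(0, usable, 3)
    )


print(translate("ATGAAATAG"))  # MK*
```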
Parse FASTA files and yield header-sequence pairs.
for header, sequence in genomic_parsers.parse_fasta("sequences.fasta"):
    print(f">{header}")
    print(f"Length: {len(sequence)}")
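The idea of yielding header-sequence pairs one record at a time can be sketched with a generator. The function name `parse_fasta_handle` is hypothetical, and it reads from an open text handle rather than a path so the example runs without a file on disk.

```python
import io
from typing import Iterator, TextIO, Tuple


def parse_fasta_handle(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, holding only one record in memory."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">seq1 test\nATGC\nGGTA\n>seq2\nTTAA\n")
records = list(parse_fasta_handle(example))
for header, sequence in records:
    print(f">{header}", f"Length: {len(sequence)}")
```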
Parse FASTQ files with quality scores.
for record in genomic_parsers.parse_fastq("reads.fastq"):
    print(f"ID: {record.id}")
    print(f"Quality: {record.quality_score}")
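The exact meaning of `quality_score` is not specified above. One plausible reading is a mean Phred score, so the sketch below parses four-line FASTQ records and computes that value assuming Phred+33 encoding. `FastqRecord`, `parse_fastq_handle`, and `mean_phred` are hypothetical names, not part of GenomeHouse.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    id: str
    sequence: str
    quality: str  # raw quality string, one character per base


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (Phred+33 by assumption)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def parse_fastq_handle(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with unwrapped four-line records."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        yield FastqRecord(header[1:], sequence, quality)


example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n!5\n")
reads = list(parse_fastq_handle(example))
for read in reads:
    print(f"ID: {read.id}", f"Quality: {mean_phred(read.quality):.1f}")
```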
Parse VCF (Variant Call Format) files.
for variant in genomic_parsers.parse_vcf("variants.vcf"):
    print(f"{variant.chrom}:{variant.pos}")
    print(f"{variant.ref} -> {variant.alt}")
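To show where fields such as `chrom`, `pos`, `ref`, and `alt` come from, here is a small sketch that reads the first five tab-separated VCF columns (CHROM, POS, ID, REF, ALT) and skips header lines. The `Variant` tuple and `parse_vcf_handle` are hypothetical stand-ins, not the library's own types.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_handle(handle: TextIO) -> Iterator[Variant]:
    """Yield one Variant per data line of a VCF stream."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, ## metadata, and the #CHROM header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            raise ValueError(f"malformed VCF line: {line!r}")
        chrom, pos, _variant_id, ref, alt = fields[:5]
        yield Variant(chrom, int(pos), ref, alt)


example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\n"
    "chr2\t678\trs1\tCT\tC\t99\tPASS\t.\n"
)
variants = list(parse_vcf_handle(example))
for variant in variants:
    print(f"{variant.chrom}:{variant.pos}", f"{variant.ref} -> {variant.alt}")
```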
Extract numerical features from biological sequences.
sequences = ["ATGCGT", "GCATGC", "TGCATG"]
features = ml_tools.extract_features(sequences, method='kmer')
# Returns: numpy array of k-mer frequencies
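As a rough picture of what k-mer frequency features are, the sketch below counts every k-mer over the ACGT alphabet and normalizes by the number of k-mers seen. It returns plain Python lists rather than a numpy array, and the choice of k, the feature ordering, and the normalization are all assumptions for illustration.

```python
from itertools import product


def kmer_frequencies(sequence: str, k: int = 2) -> list[float]:
    """Return relative frequencies of all 4**k k-mers, in ACGT order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing other letters are skipped
            counts[kmer] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


sequences = ["GCATGC", "TGCATG"]
features = [kmer_frequencies(seq, k=2) for seq in sequences]
print(len(features), len(features[0]))  # 2 16
```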
Machine learning classifier for biological sequences.
classifier = ml_tools.SequenceClassifier()
classifier.fit(features, labels)
predictions = classifier.predict(new_features)
accuracy = classifier.score(test_features, test_labels)
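The algorithm behind `SequenceClassifier` is not described here. To illustrate the fit, predict, and score pattern used above, the following stand-in implements a simple nearest-centroid classifier; the class name and the algorithm are assumptions chosen only for this example.

```python
from collections import defaultdict
from math import dist


class NearestCentroidSketch:
    """Toy classifier: assign each sample to the closest class centroid."""

    def fit(self, features, labels):
        grouped = defaultdict(list)
        for row, label in zip(features, labels):
            grouped[label].append(row)
        # Centroid = column-wise mean of the rows belonging to each label.
        self.centroids_ = {
            label: [sum(column) / len(rows) for column in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, features):
        return [
            min(self.centroids_, key=lambda label: dist(row, self.centroids_[label]))
            for row in features
        ]

    def score(self, features, labels):
        """Fraction of samples whose predicted label matches the true one."""
        predictions = self.predict(features)
        return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


train_x = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
train_y = ["low", "low", "high", "high"]
model = NearestCentroidSketch().fit(train_x, train_y)
predictions = model.predict([[0.2, 0.2], [0.8, 0.8]])
accuracy = model.score(train_x, train_y)
print(predictions, accuracy)  # ['low', 'high'] 1.0
```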
GenomeHouse provides a command-line interface for common bioinformatics tasks.
genomehouse-cli --help
Parse and analyze FASTA files from command line.
genomehouse-cli parse-fasta data/sample.fasta
genomehouse-cli parse-fasta data/sample.fasta --stats
genomehouse-cli parse-fasta data/sample.fasta --output results.txt
Calculate GC content for sequences.
genomehouse-cli gc-content ATGCGTAC
GC Content: 50.0%
Convert between different file formats.
genomehouse-cli convert reads.fastq output.fasta
genomehouse-cli convert --format fasta reads.fastq
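For context, a FASTQ-to-FASTA conversion keeps each read's identifier and sequence and drops the quality lines. The sketch below shows that transformation in Python on in-memory handles; `fastq_to_fasta` is a hypothetical helper and says nothing about how `genomehouse-cli convert` works internally. It assumes unwrapped four-line FASTQ records.

```python
import io
from typing import TextIO


def fastq_to_fasta(fastq_handle: TextIO, fasta_handle: TextIO) -> int:
    """Write each FASTQ record as a FASTA record; return the record count."""
    converted = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # skip blank lines
        sequence = fastq_handle.readline().strip()
        fastq_handle.readline()  # '+' separator line, discarded
        fastq_handle.readline()  # quality line, discarded
        # Replace the leading "@" with ">" and keep the rest of the header.
        fasta_handle.write(">" + header.strip()[1:] + "\n" + sequence + "\n")
        converted += 1
    return converted


source = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n!5\n")
target = io.StringIO()
count = fastq_to_fasta(source, target)
print(count)
print(target.getvalue(), end="")
```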
# Complete sequence analysis workflow
from genomehouse import sequence_tools

# Define your sequence
dna_sequence = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"

# Basic analysis
gc_content = sequence_tools.gc_content(dna_sequence)
length = len(dna_sequence)
reverse_comp = sequence_tools.reverse_complement(dna_sequence)

# Translation
protein = sequence_tools.translate(dna_sequence)

# Find motifs
acc_motifs = sequence_tools.find_motifs(dna_sequence, "ACC")

# Print results
print(f"Sequence length: {length} bp")
print(f"GC content: {gc_content:.1f}%")
print(f"Protein: {protein}")
print(f"ACC motifs at positions: {acc_motifs}")
Make sure GenomeHouse is installed: pip install genomehouse
Check that the file path is correct and the file exists. Use absolute paths if needed.
GenomeHouse uses streaming parsers. For very large files, process in chunks or use the CLI tools.
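One way to process a streaming parser's output in chunks is to batch the iterator with `itertools.islice`, so that only one batch is held in memory at a time. The helper `in_chunks` is hypothetical; in practice its input could be any record iterator, such as the pairs yielded by `parse_fasta`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_chunks(records: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive lists of up to chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # iterator exhausted
        yield chunk


# Stand-in for a streaming parser: a generator of (header, sequence) pairs.
fake_records = ((f"seq{i}", "ACGT" * i) for i in range(1, 6))
chunk_totals = [
    sum(len(sequence) for _header, sequence in chunk)
    for chunk in in_chunks(fake_records, 2)
]
print(chunk_totals)  # [12, 28, 20]
```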
We welcome contributions to GenomeHouse! Here's how you can help:
git clone https://github.com/GenomeHouse/GenomeHouse-1.1.git
cd GenomeHouse-1.1
pip install -e ".[dev]"
pytest tests/