Python notes

3 May 2024

These are my notes for Python, where I save clever small scripts and various other information about Python. I'll update them continuously.

Basics
- Packages
APIs
Files
Multithreading
Natural language processing
OS
PDF

Basics

Packages

Installing packages

pip install instaloader

Upgrading packages

Use the same command as for installing, with the --upgrade flag added.

pip install --upgrade instaloader

Requirements.txt

Requirements.txt is a text file that is usually located in the root of a project. The file lists the pip packages, and their versions, that are required to run the project. Each package has its own line. The == operator pins an exact version, while >= would set a minimum version. A line looks like this:

python-dotenv==0.19.2

To install all the packages for the project you can run the following command:

pip install -r requirements.txt

This will install the required packages and ensure that you are able to run the project.

Finding your packages for the requirements.txt

Running pip list will list all installed pip packages and display what version you have. Grep'ing this list helps you quickly find your version numbers. E.g.:

pip list | grep dotenv

The above command will display something like this:

python-dotenv            0.19.2
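If you would rather look up versions from inside Python, the standard library can do roughly the same thing. This is a minimal sketch assuming Python 3.8 or newer; requirement_line is a helper name I made up for illustration, and the package names are the ones used elsewhere in these notes.

```python
from importlib.metadata import PackageNotFoundError, version


def requirement_line(package_name):
    """Return 'name==version' for an installed package, or None if it is missing."""
    try:
        return f'{package_name}=={version(package_name)}'
    except PackageNotFoundError:
        return None


# Package names taken from the examples in these notes
for name in ['python-dotenv', 'instaloader']:
    print(requirement_line(name) or f'{name} is not installed')
```

The printed lines are in the same format as requirements.txt, so they can be pasted straight into the file.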

Using .env-files for storing credentials

I use the Python package dotenv to load my environment variables. It's used like this:

import dotenv
dotenv.load_dotenv()

The above loads variables saved in your .env-file, which contains credentials in the following format:

DB_PASSWORD=qwerty

You can then use the variables in your script like this:

import os
password = os.environ.get('DB_PASSWORD')
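Note that os.environ.get returns None when a variable is missing, which can surface much later as a confusing error. A small sketch of one way to fail early; require_env is a hypothetical helper and DB_HOST is a made-up variable name, neither comes from dotenv itself.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f'Missing environment variable: {name}')
    return value


# For optional settings a default can be passed to os.environ.get instead;
# DB_HOST is a made-up variable name for illustration
host = os.environ.get('DB_HOST', 'localhost')
```

Calling require_env('DB_PASSWORD') right after load_dotenv() would then stop the script immediately if the .env-file is missing or incomplete.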

APIs

Several well-known APIs have Python packages that make them easier to work with. Below I have outlined some of the packages I use for certain APIs.

Instagram

For Instagram I use Instaloader. I have written about creating your own self-hosted Instagram, where I detail the use of Instaloader.

Reddit

For Reddit I use Python Reddit API Wrapper (PRAW).

Twitter

For Twitter I use Tweepy.

Celery

A task queue that works with a message broker such as RabbitMQ.

DALLE-2 clone

DALLE-2 was released behind a paywall by an extremely well-funded company. However, a group of independent researchers released their own model (Stable Diffusion), which you can use for free in just a few lines of code.

This is based on this tweet by Mark Tenenholtz:

from torch import autocast
from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

# this will substitute the default PNDM scheduler for K-LMS
lms = LMSDiscreteScheduler(
    beta_start = 0.00085,
    beta_end = 0.012,
    beta_schedule = 'scaled_linear'
)

pipe = StableDiffusionPipeline.from_pretrained(
    'CompVis/stable-diffusion-v1-4',
    scheduler = lms,
    use_auth_token = True
).to('cuda')

prompt = 'a photo of an astronaut riding a horse on mars'
with autocast('cuda'):
    # early diffusers releases return a dict here; newer ones use pipe(prompt).images[0]
    image = pipe(prompt)['sample'][0]

image.save('astronaut_rides_horse.png')

Files

File handling is extremely useful in Python. I have even written a separate post about organizing my media files with Python.

Converting a file from one encoding to another in Python

I had a project where I would receive txt-files from a Windows machine. The files were encoded with windows-1252 when I received them. In the beginning I would manually convert them to utf-8, but this quickly grew tedious, so I created the following script. It converts windows-1252 encoded files to utf-8.

import glob
import magic

# Create the libmagic encoding detector once and reuse it for every file
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()

for file in glob.glob('*.txt'):
    with open(file, 'rb') as source:
        blob = source.read()
    encoding = m.buffer(blob)

    # Only rewrite files that were detected as iso-8859-1
    if encoding == 'iso-8859-1':
        with open(file, 'wb') as target:
            target.write(blob.decode(encoding).encode('utf-8'))
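If you already know the source encoding, the detection step can be skipped and the conversion done with the standard library alone. This is a sketch under that assumption, not the script I actually used; convert_to_utf8 and its source_encoding parameter are names made up for illustration.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding='cp1252'):
    """Rewrite a text file as UTF-8, assuming it is currently in source_encoding.

    Returns True if the file was rewritten, False if it was already valid UTF-8.
    """
    path = Path(path)
    raw = path.read_bytes()
    try:
        raw.decode('utf-8')
        return False  # already valid UTF-8, leave the file alone
    except UnicodeDecodeError:
        pass
    text = raw.decode(source_encoding)
    path.write_bytes(text.encode('utf-8'))
    return True


# Example usage: convert every txt-file in the current directory
# for txt_file in Path('.').glob('*.txt'):
#     convert_to_utf8(txt_file)
```

The UTF-8 check makes the function safe to run twice on the same file. A few byte values are undefined in cp1252, so decoding can still raise UnicodeDecodeError for files that are in some other encoding.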

Multithreading in Python

It's quite easy to start utilizing multithreading or multiprocessing in Python. You simply need to define a function and a list of inputs to pass to it. Here's a simple example:

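from multiprocessing import Pool

def myFunction(x):
    print(x * x)

myList = range(1, 11)

# The guard stops worker processes from re-running the pool setup on import
if __name__ == '__main__':
    with Pool(processes = 4) as p:
        p.map(myFunction, myList)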

The script uses 4 worker processes, but you can define how many you want to use. On Ubuntu you can grab the number of CPU cores available with the nproc command. So to utilize the maximum available cores in your Python scripts, you can grab the number like this:

import subprocess
numCPU = int(subprocess.check_output(['nproc']))
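Calling nproc only works where that command exists. A portable alternative is os.cpu_count() from the standard library. The sketch below uses it to size the pool and also collects the return values instead of printing inside the workers; default_worker_count and square are names made up for illustration.

```python
import os
from multiprocessing import Pool


def default_worker_count():
    """Number of logical CPUs, falling back to 1 if it cannot be determined."""
    return os.cpu_count() or 1


def square(x):
    return x * x


if __name__ == '__main__':
    with Pool(processes=default_worker_count()) as p:
        # map returns the results as a list, in the same order as the input
        print(p.map(square, range(1, 11)))
```

Keep in mind that os.cpu_count() reports the logical CPUs in the machine, which can be higher than what nproc shows when the process is restricted to fewer cores.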

Natural Language Processing

English to IPA

eng-to-ipa is a Python package that uses the CMU Pronouncing Dictionary, similar to the Pronouncing package described below. The package can be used to convert English words into the International Phonetic Alphabet (IPA). It's used like so:

import eng_to_ipa as ipa
ipa.convert('The quick brown fox jumped over the lazy dog.')
# 'ðə kwɪk braʊn fɑks ʤəmpt ˈoʊvər ðə ˈleɪzi dɔg.'

English words

The Python package english-words-py contains four sets of English words.

english_words_set
A set of English words containing both upper- and lower-case letters; with punctuation.
english_words_lower_set
A set of English words containing lower-case letters; with punctuation.
english_words_alpha_set
A set of English words containing both upper- and lower-case letters; with no punctuation.
english_words_lower_alpha_set
A set of English words containing lower-case letters; with no punctuation.

It's used like so:

from english_words import english_words_set
'ghost' in english_words_set
# True

NLTK

Edit distance

The edit distance between words can be found like this:

import nltk
nltk.edit_distance('humpty', 'dumpty') # 1
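By default nltk.edit_distance computes the Levenshtein distance: the smallest number of single-character insertions, deletions and substitutions needed to turn one string into the other. To make that concrete, here is a plain Python sketch of the same idea. It is my own illustration, not NLTK's implementation, and levenshtein is a made-up function name.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free when the characters match)
            ))
        previous = current
    return previous[-1]


print(levenshtein('kitten', 'sitting'))  # 3
```

Each differing letter costs one edit, so two words that differ in two positions end up with a distance of 2.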

Wordnet

WordNet is a lexical database for the English language. It holds English words along with their definitions, examples, synonyms, antonyms and so on.

from nltk.corpus import wordnet

Pronouncing

Pronouncing is a Python package that provides a simple interface for the CMU Pronouncing Dictionary. It's a helpful tool for finding the syllables in words, words that rhyme, words that sound similar and so on.

import pronouncing
pronouncing.rhymes('climbing')
# ['diming', 'liming', 'priming', 'rhyming', 'timing']

Wordhoard

Wordhoard is a Python package for finding antonyms, synonyms, hypernyms, hyponyms and homophones.

Nginx

Following packages can be used to analyze data from Nginx logs:

OS

OS-specific notes and notes relating to the os module.

How can I create a directory?

import os
if not os.path.exists('my_folder'):
    os.makedirs('my_folder')
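The check-then-create pattern above can fail if something else creates the folder between the two calls. A sketch of an alternative using pathlib from the standard library; ensure_dir is a helper name made up for illustration.

```python
from pathlib import Path


def ensure_dir(path):
    """Create the directory (and any missing parents) if it does not exist yet."""
    directory = Path(path)
    # exist_ok=True makes this a no-op when the directory is already there
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

Calling ensure_dir('my_folder') twice is harmless. The os module offers the same behaviour through os.makedirs('my_folder', exist_ok=True).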

PDF

Splitting an x-page PDF into x one-page PDFs

I don't really know how this situation occurred, but we had a 95-page PDF that we needed as 95 individual one-page PDFs. The solution is based on this StackOverflow answer.

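# PdfFileReader/PdfFileWriter are the names used before PyPDF2 3.0,
# which renamed them to PdfReader/PdfWriter
from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open('myfile.pdf', 'rb'))

for i in range(inputpdf.numPages):
    # Write each page to its own single-page PDF
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open('myfile-%s.pdf' % i, 'wb') as outputStream:
        output.write(outputStream)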

Weasyprint

Weasyprint is a free and open source package that enables you to generate beautiful PDFs from HTML.
