
Ryan Stefan's Micro Blog

Installing from Github with pipenv - Fix pip on Linux

Jun 22, 2019

Obviously not every package on GitHub is going to be available via pip, but downloading and installing manually clutters up your project directory, which kind of defeats the purpose of using pipenv in the first place. However, installing a package from its git URI with pipenv is possible, just like it is with pip. Here's what you type:

pipenv install -e git+git://github.com/user/project.git#egg=<project>

Pretty simple, right? Here's an example of one that I've used recently, just in case:

pipenv install -e git+git://github.com/miso-belica/sumy.git#egg=sumy 

That's the command to install this package: https://github.com/miso-belica/sumy

 

If you get a 'pipenv: command not found' error, use this to fix it:

sudo -H pip install -U pipenv

For Scrapy with Python 3, you'll need:

sudo apt-get install python3 python-dev python3-dev \
     build-essential libssl-dev libffi-dev \
     libxml2-dev libxslt1-dev zlib1g-dev \
     python-pip

With Python 2, you'll need:

sudo apt-get install python-dev  \
     build-essential libssl-dev libffi-dev \
     libxml2-dev libxslt1-dev zlib1g-dev \
     python-pip

Quick Collection of Python CSV Read/Write Techniques

Jun 14, 2019

Import CSV as Dict

  • Creates an ordered dict (the first row becomes the keys)
  • You can increase the field size limit for huge files
  • DictReader already consumes the header row; next() skips an extra row if you need to
import csv

# Dict reader creates an ordered dict (first row will be headers)
with open('./data/file.csv', newline='') as file:
    # Huge csv files might give you a size limit error
    csv.field_size_limit(100000000)
    results = csv.DictReader(file, delimiter=';', quotechar='*', quoting=csv.QUOTE_ALL)

    # next() skips a row (DictReader has already consumed the header for its keys)
    next(results)
    for row in results:
        # prints each item in the column with header 'key'
        print(row['key'])

Import CSV with No Header (nested lists)

  • newline='' prevents blank lines
  • csv.reader uses indexes [0], [1] 
# newline='' prevents blank lines
with open('./data/file.csv', newline='') as file:
    results = csv.reader(file, delimiter=':', quoting=csv.QUOTE_NONE)
    for row in results:
        # csv reader uses indexes
        print(row[0])

Writing and Creating Headers

  • Create a csv.writer object
  • Create header manually before loop
  • Nested lists are better than tuples inside lists
  • writer.writerow and writer.writerows 
# Creates a csv writer object
with open('./data/file.csv', 'w', newline='') as file:
    writer = csv.writer(
        file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    # Write the header first if you would like
    writer.writerow(['title', 'price', 'shipping'])

    ''' Tuples inside a list (lists inside a list are usually better though).
    If you're using tuples and they are variable size, note that a single tuple
    will convert to string type in a loop, so indexing it [0] won't work. '''
    products = [['slinky', '$5', 'Free'],
                ['pogo', '$12', '$6'],
                ['Yoyo', '$7', '$2']]

    # Write each row normally
    for item in products:
        writer.writerow(map(str, item))

    # Writes all items into a single row
    writer.writerow(sum(products, []))

    # Writes all 3 rows
    writer.writerows(products)

Using DictWriter for Headers

  • fieldnames indicates header to object
  • writer.writeheader() writes those fields
# DictWriter fieldnames will add the headers for you when you call writeheader()
with open("./data/file.csv", "w", newline='') as file:
    writer = csv.DictWriter(
        file, fieldnames=['title', 'price', 'shipping'],
        quoting=csv.QUOTE_NONNUMERIC)

    writer.writeheader()
    # DictWriter expects dicts keyed by the fieldnames
    writer.writerows([{'title': 'slinky', 'price': '$5', 'shipping': 'Free'},
                      {'title': 'pogo', 'price': '$12', 'shipping': '$6'},
                      {'title': 'Yoyo', 'price': '$7', 'shipping': '$2'}])

Bonus - Flatten any List

  • Function will flatten any level of nested lists
  • Check for tuple as well to catch tuples too (see the sketch after the code)
# -- Bonus (Off Topic) --
# You can flatten any list with type checking and recursion
l = [1, 2, [3, 4, [5, 6]], 7, 8, [9, [10]]]
output = []

def flatten_list(l):
    for i in l:
        if type(i) == list:
            flatten_list(i)
        else:
            output.append(i)

flatten_list(l)
print(output)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
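
If you also want to catch tuples (like the note above mentions), isinstance() with a tuple of types is a bit safer than comparing type() directly. A small sketch (new helper name, just for illustration):

def flatten_any(seq, output=None):
    # Handles lists and tuples at any nesting depth
    if output is None:
        output = []
    for i in seq:
        if isinstance(i, (list, tuple)):
            flatten_any(i, output)
        else:
            output.append(i)
    return output

print(flatten_any([1, (2, 3), [4, [5, (6,)]]]))  # [1, 2, 3, 4, 5, 6]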

Python Dict Key Recursive Attempt Function

Jun 13, 2019

So I'm dealing with this huge database where each item has a bunch of levels and some have different keys than others. So I made a script that takes lists of candidate key paths and tries them in order; as soon as one works, the loop breaks. It's pretty simple and I'm sure I could make it cleaner, but it works well enough and I don't expect any four-level items any time soon. Anyway, here's the code:

def extract_features(items):
    ''' items is a list of dicts to run the keys on '''
    for item in items:
        results = {}

        def key_attempter(name, key_lists):
            nonlocal item
            nonlocal results
            for key in key_lists:
                if len(key) == 1:
                    try:
                        results[name] = item[key[0]]
                        break
                    except (KeyError, TypeError):
                        pass
                elif len(key) == 2:
                    try:
                        results[name] = item[key[0]][key[1]]
                        break
                    except (KeyError, TypeError):
                        pass
                elif len(key) == 3:
                    try:
                        results[name] = item[key[0]][key[1]][key[2]]
                        break
                    except (KeyError, TypeError):
                        pass

        feat_lists = {
            'price': [
                ['hdpData', 'homeInfo', 'price'],
                ['price'],
                ['hdpData', 'priceForHDP'],
                ['priceLabel'],
                ['hdpData', 'homeInfo', 'zestimate'],
                ['hdpData', 'festimate']
            ],
            'bed': [
                ['beds'],
                ['hdpData', 'homeInfo', 'bedrooms']
            ],
            'bath': [
                ['baths'],
                ['hdpData', 'homeInfo', 'bathrooms']
            ]}

        for k in feat_lists.keys():
            key_attempter(k, feat_lists[k])

        yield results
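
A cleaner version I could switch to later would walk each candidate key path in a loop instead of branching on its length, so any depth works. Just a sketch (hypothetical helper, not what's running now):

def key_attempter(item, name, key_lists, results):
    # Try each candidate key path until one resolves all the way down
    for path in key_lists:
        try:
            value = item
            for key in path:
                value = value[key]
            results[name] = value
            break
        except (KeyError, IndexError, TypeError):
            continue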

Multithreading and Run Once Decorator

Jun 11, 2019

I've been coding again and just remembered how well this website works for keeping track of cool tricks I learn. Sometimes it's really hard to find simple and generic examples of things to help teach the fundamentals. I needed to write to a file without opening the text document 1000 times and I finally found a really clean example that helped me understand the pieces.

Edit: ThreadPool is a lot easier, and you can thread inside a loop:

from multiprocessing.pool import ThreadPool as Pool

threads = 100

p = Pool(threads)
p.map(function, list)
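
A tiny end-to-end version of that, with a made-up worker function just to show the shape:

from multiprocessing.pool import ThreadPool as Pool

def square(num):
    # stand-in for whatever I/O-bound work each thread does
    return num * num

p = Pool(10)
results = p.map(square, range(100))
p.close()
p.join()
print(results[:5])  # [0, 1, 4, 9, 16]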

More complicated version:

import threading

lock = threading.Lock()

def thread_test(num):
    phrase = "I am number " + str(num)
    with lock:
        print(phrase)
        f.write(phrase + "\n")

threads = []
f = open("text.txt", 'w')
for i in range(100):
    t = threading.Thread(target=thread_test, args=(i,))
    threads.append(t)
    t.start()

# Wait for every worker to finish before closing the file
for t in threads:
    t.join()
f.close()

Close something on Scrapy spider close without using a pipeline:

from scrapy import signals
from scrapy.spiders import CrawlSpider
from scrapy.xlib.pydispatch import dispatcher

class MySpider(CrawlSpider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # second param is the instance of the spider about to be closed
        pass
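
scrapy.xlib.pydispatch is deprecated in newer Scrapy versions; the documented way to hook signals now is through from_crawler, roughly like this:

from scrapy import signals
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'my_spider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # close files, dump state, etc. (spider is the instance being closed)
        print(f'{spider.name} closed')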

Instead of using an if-time or if-count check to activate something, I found a decorator that makes sure the function only runs once:

def run_once(f):
    def wrapper(*args, **kwargs):
        if not wrapper.has_run:
            wrapper.has_run = True
            return f(*args, **kwargs)
    wrapper.has_run = False
    return wrapper


@run_once
def my_function(foo, bar):
    return foo+bar
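
Quick sanity check of how it behaves: the second call is a no-op and just returns None:

print(my_function(2, 3))  # 5
print(my_function(2, 3))  # None, because wrapper.has_run is already True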

You can also resize the terminal inside the code:

import sys
sys.stdout.write("\x1b[8;{rows};{cols}t".format(rows=46, cols=54))

I got stuck for a while trying to get my repository to let me log in without creating an SSH key (super annoying, imo). It turned out I had added the SSH URL as the origin URL and needed to reset it to the HTTP(S) one:

change origin url
git remote set-url origin <url-with-your-username>

Combine MP3 files on Linux:

ls *.mp3
sudo apt-get install mp3wrap
mp3wrap output.mp3 *.mp3

Regex is always better than splitting a bunch of times and making the code messy. Plus it's a lot easier to pick up the code later on and figure out what's going on. So I decided to take my regex to the next level and start labeling groups (I'm even going to give it its very own tag :3):

pat = r'(?<=\,\"searchResults\"\:\{)(?P<list_results>.*)(?=\,\"resultsHash\"\:)'

# re.search instead of re.match: the lookbehind needs text before the match,
# and named groups use the (?P<name>...) syntax in Python
m = re.search(pat, url)
if m:
    self.domain = m.group('list_results')
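
For reference, a smaller standalone example of a named group (made-up pattern and string, just to show the (?P<name>...) syntax):

import re

m = re.search(r'\((?P<area>\d{3})\)\s*(?P<number>\d{3}-\d{4})', 'Call (555) 123-4567 today')
if m:
    print(m.group('area'))    # 555
    print(m.group('number'))  # 123-4567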

Scraping Domains in Order with Scrapy and Time Meta

Feb 11, 2019

I wanted to come up with a way to scrape the domains that have had the most time to cool off first, so I just stored a time.time() stamp per domain (in the meta, or after the request) and grabbed the smallest number (the oldest).
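
The selection logic on its own looks roughly like this (a minimal sketch; links and domain_count stand in for the spider attributes below, and the dict has to be seeded or guarded before calling min() on it):

import time

links = ['https://www.test.com/a', 'https://www.different.org/b']
# Last-request timestamp per domain; 0 means "never requested"
domain_count = {'www.test.com': 0, 'www.different.org': 0}

# The domain with the smallest timestamp has had the longest time to cool off
coolest = min(domain_count, key=domain_count.get)
url = next(x for x in links if coolest in x)
domain_count[coolest] = time.time()
print(url)

Here's the full spider: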

class PageSpider(Spider):
    name = 'page_spider'
    allowed_urls = ['https://www.amazon.com', 'https://www.ebay.com', 'https://www.etsy.com']
    custom_settings = {
        'ITEM_PIPELINES': {
            'pipelines.MainPipeline': 90,
        },
        'CONCURRENT_REQUESTS': 200,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 200,
        'ROBOTSTXT_OBEY': False,
        'CONCURRENT_ITEMS': 800,
        'REACTOR_THREADPOOL_MAXSIZE': 1600,
        # Hides printing item dicts
        'LOG_LEVEL': 'INFO',
        'RETRY_ENABLED': False,
        'REDIRECT_MAX_TIMES': 1,
        # Stops loading page after 5mb
        'DOWNLOAD_MAXSIZE': 5592405,
        # Grabs xpath before site finish loading
        'DOWNLOAD_FAIL_ON_DATALOSS': False

    }

    def __init__(self):
        self.links = ['www.test.com', 'www.different.org', 'www.pogostickaddict.net']
        self.domain_count = {}

    def start_requests(self):
        while self.links:
            start_time = time.time()
            if self.domain_count:
                # Pick the domain that has had the longest cool-off (smallest timestamp)
                coolest = min(self.domain_count, key=self.domain_count.get)
                url = next(x for x in self.links if coolest in x)
            else:
                # Nothing scraped yet, so just take the next link
                url = self.links[0]
            request = scrapy.Request(url, callback=self.parse, dont_filter=True,
                                     meta={'time': time.time()})

            request.meta['start_time'] = start_time
            request.meta['url'] = url
            yield request

    def parse(self, response):
        domain = response.url.split('//')[-1].split('/')[0]
        self.domain_count[domain] = time.time()

        pageloader = PageItemLoader(PageItem(), response=response)

        pageloader.add_xpath('search_results', '//div[1]/text()')
        self.links.remove(response.meta['url'])

        yield pageloader.load_item()

Postgres Refresh & Filling Tables Incrementally

Feb 3, 2019

I got tired of trying to write to files because of the write limitations and switched everything over to Postgres. Now I remember how well it works, but also how many problems can arise. The cool thing about my recent script is that it solves a lot of those issues in one go. Since I will be completing rows in the DB in multiple increments, I had to check whether a row exists and then figure out which part to update. In other words, I had to SELECT, UPDATE, and INSERT in a few different ways. Here's the code:

import re
import datetime

import psycopg2
import json
with open('./data/database.json') as f:
    DATABASE = json.load(f)


class DBTest:
    def __init__(self, keyword, results):
        self.con = psycopg2.connect(**DATABASE)
        self.cur = self.con.cursor()
        
        self.mkeyword = keyword
        self.results = results
        self.pg_2 = 'https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3AXbox+One+Controller+Stand&page=2&keywords=Xbox+One+Controller+Stand'

    def updater(self, domain):
        if domain == 'www.ebay.com':
            self.cur.execute("UPDATE keyword_pages SET ebay_results='" + self.results + "' WHERE keyword='" + self.mkeyword + "'")
            self.con.commit()
        elif domain == 'www.etsy.com':
            self.cur.execute("UPDATE keyword_pages SET etsy_results='" + self.results + "' WHERE keyword='" + self.mkeyword + "'")
            self.con.commit()
        elif domain == 'www.amazon.com':
            self.cur.execute("UPDATE keyword_pages SET amazon_results='" + self.results +
                        "', amazon_pg2='" + self.pg_2 + "' WHERE keyword='" + self.mkeyword + "'")
            self.con.commit()

    def test(self):
        self.cur.execute("""SELECT * FROM keyword_pages WHERE NOT complete AND amazon_results
         != 'blank' AND ebay_results != 'blank' AND etsy_results != 'blank'""")
        rows = self.cur.fetchall()
        for row in rows:
            print(row[0])

        self.cur.execute("select exists(select keyword from keyword_pages where keyword='" + self.mkeyword + "')")
        exists = self.cur.fetchone()[0]

        if exists:
            self.updater('www.etsy.com')
        else:
            columns = "keyword, amazon_results, amazon_pg2, ebay_results, etsy_results, complete"
            values = "'pogo stick', 'blank', 'blank', '14', 'blank', 'f'"
            self.cur.execute('INSERT INTO keyword_pages (' + columns + ') VALUES (' + values + ')')
            self.con.commit()

        self.con.close()

class LinkGen:
    def __init__(self):
        self.link_pieces = []
        self.links = []
        self.keywords = {
            'extra black coffee': [['www.amazon.com', '4', '/jumprope/s?ie=UTF8&page=2&rh=i%3Aaps%2Ck%3Ajumprope'], ['www.ebay.com', '5'], ['www.etsy.com', '7']],
            'decaf coffee': [['www.amazon.com', '5', 'https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Ablack+coffee&page=2&keywords=black+coffee&ie=UTF8&qid=1549211788'],
            ['www.ebay.com', '3'], ['www.etsy.com', '9']],
        }

    # How Amazon identifies if a link is internal/new or external/old (very simple actually)
    def qid(self):
        return round((datetime.datetime.today() - datetime.datetime(1970, 1, 1)).total_seconds())


    def amazon_gen(self, search_term, page_total, page_2):
        self.link_pieces = ['https://www.amazon.com/s/ref=sr_pg_', '?rh=', '&page=', '&keywords=', '&ie=UTF8&qid=']
        rh = re.search('rh=([^&|$]*)', str(page_2), re.IGNORECASE).group(1)
        print(rh)
        all_links = []
        for page in range(1, int(page_total) + 1):
            all_links.append(
                f'{self.link_pieces[0]}{page}{self.link_pieces[1]}{rh}{self.link_pieces[2]}{page}{self.link_pieces[3]}{"+".join(search_term.split(" "))}{self.link_pieces[4]}')
        return all_links


    def link_gen(self, domain, search_term, page_total):
        if domain == 'www.ebay.com':
            self.link_pieces = ['https://www.ebay.com/sch/i.html?_nkw=', '&rt=nc&LH_BIN=1&_pgn=']
        elif domain == 'www.etsy.com':
            self.link_pieces = ['https://www.etsy.com/search?q=', '&page=']
        all_links = []
        for page in range(1, int(page_total) + 1):
            all_links.append(f'{self.link_pieces[0]}{"+".join(search_term.split(" "))}{self.link_pieces[1]}{page}')

        return all_links


    def test(self):
        for keyword in self.keywords.keys():
            for results in self.keywords[keyword]:
                if results[0] == 'www.amazon.com':
                    self.links.append(self.amazon_gen(keyword, results[1], results[2]))
                else:
                    self.links.append(self.link_gen(results[0], keyword, results[1]))
            print(self.links)



if __name__ == "__main__":
    links = LinkGen()
    db = DBTest('pogo stick', '15')
    db.test()
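
Side note to self: psycopg2 can pass values as query parameters instead of gluing strings together, and Postgres has INSERT ... ON CONFLICT for the exists-then-update dance. A rough sketch against the same keyword_pages table (an alternative, not what the script above does):

import json
import psycopg2

with open('./data/database.json') as f:
    DATABASE = json.load(f)

con = psycopg2.connect(**DATABASE)
cur = con.cursor()

# Parameterized update -- psycopg2 handles quoting and escaping
cur.execute("UPDATE keyword_pages SET etsy_results = %s WHERE keyword = %s",
            ('14', 'pogo stick'))

# Upsert: insert the row, or update it if the keyword already exists
cur.execute(
    """INSERT INTO keyword_pages
           (keyword, amazon_results, amazon_pg2, ebay_results, etsy_results, complete)
       VALUES (%s, %s, %s, %s, %s, %s)
       ON CONFLICT (keyword)
       DO UPDATE SET etsy_results = EXCLUDED.etsy_results""",
    ('pogo stick', 'blank', 'blank', '14', 'blank', False))

con.commit()
con.close()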

Since I had to dig through many other projects' code bases (not to mention Google) to figure a lot of this out, I figured I should put what I collected here so I can find it later.

-- psql
sudo apt install postgresql
sudo service postgresql start
sudo su - postgres
createuser --superuser ryan
psql # <- command line tool for making queries
\password ryan
\q # <- exit psql to create new users/dbs or import/export dbs (psql is for SQL)
createdb ryan # or whatever
# exit, and now you can run psql in your own console with your username

# start automatically
sudo systemctl enable postgresql
# run database commands
psql -d <database>

alter user ryan with encrypted password <password>;

sudo -i -u ryan

# export
pg_dump -U ryan ebay_keywords > database-dec-18.txt --data-only

# importable export
pg_dump -U ryan ebay_keywords > database-dec-18.pgsql

# import
psql reviewmill_scraped < database-dec-18.pgsql

CREATE TABLE keyword_pages (
    keyword VARCHAR(255) NOT NULL PRIMARY KEY,
    amazon_results VARCHAR(16),
    amazon_pg2 VARCHAR(255),
    ebay_results VARCHAR(16),
    etsy_results VARCHAR(16),
    complete BOOLEAN NOT NULL
);

ALTER TABLE keyword_pages ALTER COLUMN etsy_results TYPE VARCHAR(16);

INSERT INTO keyword_pages (keyword, amazon_results, amazon_pg2, ebay_results, etsy_results, complete)
VALUES ('extra strong coffee', 12, 'https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Ablack+coffee&page=2&keywords=black+coffee&ie=UTF8&qid=1549211788', 12, 4, 'f');

CREATE TABLE reviews (
    review_id VARCHAR(30) PRIMARY KEY,
    asin VARCHAR(20) NOT NULL
);

ALTER TABLE reviews ADD CONSTRAINT asin FOREIGN KEY (asin) REFERENCES products (asin) MATCH FULL;

# Extra stuff
ALTER TABLE reviews ALTER COLUMN asin TYPE varchar(30);
ALTER TABLE reviews ADD COLUMN review_helpful INTEGER;

Windows has been making this increasingly difficult, but I think I've avoided the worst of my connection issues. First Windows decided to automatically detect my proxies, then I found out that my ethernet card driver had some power-saving crap turned on, and I've been having random permission issues between WSL, pipenv, and postgres.

ipconfig /release
ipconfig /renew

I haven't tried this yet, but if my internet starts acting up again it's going to be the first thing I try; since restarting Windows seems to fix it, I think this should as well.

Handling Public Proxies with Scrapy Quickly - Remove Bad on N Failures

Jan 31, 2019

You can catch 404 and connection errors by using errback= inside of the scrapy.Request object. From there I just add the failed proxy inside of the request meta to a list of failed proxies inside of the ProxyEngine class. If a proxy is seen inside of the failed list N times it can be removed with the ProxyEngine.remove_bad() class function. I also discovered that passing the download_timeout inside of the request meta works a lot better than inside of the Spider's global settings. Now the spider doesn't hang on slow or broken proxies and will be much much faster. 

Next I plan to refactor the ProxyEngine data to serialize attempts so that I can catch proxies that have been banned by one domain but not others. Also, I need to feed bad_proxies back into the request generator after they've been benched for N time (a rough sketch of that idea is below the ProxyEngine class) and save all of the proxy data to a database. Here's the code:

Proxy Engine

class ProxyEngine:
    def __init__(self, limit=3):
        self.proxy_list = []
        self.bad_proxies = []
        self.good_proxies = []
        self.failed_proxies = []
        self.limit = limit

    def get_new(self, file='./proxies.txt'):
        with open(file, 'r') as f:
            new_proxies = [f'https://{line.strip()}' for line in f]
        for proxy in new_proxies:
            if proxy not in self.proxy_list and proxy not in self.bad_proxies:
                self.proxy_list.append(proxy)

    def remove_bad(self):
        for proxy in self.proxy_list:
            if self.failed_proxies.count(proxy) >= self.limit:
                self.bad_proxies.append(proxy)
        # Rebuild the list instead of removing items while iterating over it
        self.proxy_list = [x for x in self.proxy_list if x not in self.bad_proxies]
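
The "feed bad proxies back in after N time" idea from above could look something like this; just a sketch (hypothetical class, not wired into ProxyEngine yet):

import time

class ProxyCooldown:
    """Sketch: bench a proxy with a timestamp, revive it after cooldown seconds."""

    def __init__(self, cooldown=600):
        self.cooldown = cooldown
        self.benched = {}  # proxy -> time it was benched

    def bench(self, proxy):
        self.benched[proxy] = time.time()

    def revive(self):
        now = time.time()
        revived = [p for p, t in self.benched.items() if now - t >= self.cooldown]
        for proxy in revived:
            del self.benched[proxy]
        return revived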

Proxy Spider

import scrapy
from scrapy import Spider
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError, TCPTimedOutError


class ProxyTest(Spider):
    name = 'proxy_test'
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.ProxyPipeline': 400
        },
        'CONCURRENT_REQUESTS_PER_IP': 2,
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.prox = ProxyEngine(limit=20)

    def start_requests(self):

        self.prox.get_new()
        for proxy in self.prox.proxy_list:
            request = scrapy.Request("https://dashwood.net/post/python-3-new-string-formatting/456ft",
                                     callback=self.get_title, errback=self.get_error, dont_filter=True)
            request.meta['proxy'] = proxy
            request.meta['dont_retry'] = True
            request.meta['download_timeout'] = 5
            yield request

    def get_title(self, response):
        print(response.status)
        print('*' * 15)

    def get_error(self, failure):
        if failure.check(HttpError):
            response = failure.value.response
            print("HttpError occurred", response.status)
            print('*' * 15)

        elif failure.check(DNSLookupError):
            request = failure.request
            print("DNSLookupError occurred on", request.url)
            print('*' * 15)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.prox.failed_proxies.append(request.meta["proxy"])
            print("TimeoutError occurred", request.meta)
            print('*' * 15)

        else:
            request = failure.request
            print("Other Error", request.meta)
            print(f'Proxy: {request.meta["proxy"]}')
            self.prox.failed_proxies.append(request.meta["proxy"])
            print('Failed:', self.prox.failed_proxies)
            print('*' * 15)

Stand Alone Scrapy in Action + New Dev Tricks

Jan 17, 2019

Python 3 New String Formatting

For all of my code I've been using the .format() method and explicitly naming the variables (in case I change the order down the line) because it just seemed like the cleanest way:

print('this is a string {variable}'.format(variable=variable))

But there is a much simpler/cleaner method called f-strings. They are shorter, easier to read, and just plain more efficient. I'll be refactoring any code I come across using the older formatting styles just out of principle. Using this format with the scraping project I worked on this week was a significant convenience and I looooove it. Here's how it works (so simple):

test = 'this'

number = 254

print(f"{test} is a test for number {number}.")
this is a test for number 254.

Stand-alone Scrapy Script with Item Loading and Processing

Jan 7, 2019

I've been meaning to do this for a while now, but honestly it's been really difficult to find reference material to copy off of. Fortunately, today I found some really good repositories with almost exactly what I was looking for. Then, after I got it working, I combed the Scrapy docs very slowly, made sure that I understood all of the item loader functions, and added simple examples / documentation on most of the features.

One Stand-alone Scrapy Script to Rule Them All

Basically what I wanted was a minimal, clean Scrapy script that I could use in other projects without being tied down to the scrapy-cli project crap. I actually feel like I have full control of my script and have been taking great care to organize it correctly. Also, using item loaders / processors is really cool and should open the door to solving issues really cleanly.

Note I added a few interesting features to showcase some of the functionality of item loaders.

#! /usr/local/bin/python3
# -*- coding: utf-8 -*-

from scrapy.crawler import CrawlerProcess
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join
from scrapy import Spider, Item, Field
from scrapy.settings import Settings

# Originally built off of:
# https://gist.github.com/alecxe/fc1527d6d9492b59c610
def extract_tag(self, values):
    # Custom function for an Item Loader processor
    for value in values:
        yield value[5:-1]


class DefaultAwareItem(Item):
    # Converts field "default" meta into a default value fallback
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Use python's built-in setdefault() function on all items
        for field_name, field_metadata in self.fields.items():
            if not field_metadata.get('default'):
                self.setdefault(field_name, 'No default set')
            else:
                self.setdefault(field_name, field_metadata.get('default'))


# Item fields
class CustomItem(DefaultAwareItem):
    '''
    Input / output processors can also be declared in the field meta, e.g.

    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    '''
    title = Field(default="No Title")
    link = Field(default="No Links")
    desc = Field()
    tag = Field(default="No Tags")


class CustomItemLoader(ItemLoader):
    '''
    Item Loader declaration — input and output processors, functions
    https://doc.scrapy.org/en/latest/topics/loaders.html#module-scrapy.loader.processors

    Processors (any functions applied to items here):
        Identity() - leaves values as is
        TakeFirst() - takes the first non-null value
        Join() - basically equivalent to u' '.join
        Compose() - applies a list of functions one at a time (accepts loader_context)
        MapCompose() - applies a list of functions to a list of objects (accepts loader_context);
            the first function is applied to all objects, then the altered objects go to
            the next function, etc.
    https://doc.scrapy.org/en/latest/topics/loaders.html#declaring-input-and-output-processors

    _in processors are applied to extractions as soon as they are received
    _out processors are applied to the collected data once loader.load_item() is yielded
    single items are always converted to iterables
    custom processor functions must receive self and values
    '''
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()
    desc_out = Join()
    tag_in = extract_tag  # function assigned as class variable
    tag_out = Join(', ')


# Define a pipeline
class WriterPipeline(object):

    def __init__(self):
        self.file = open('items.txt', 'w')

    def process_item(self, item, spider):
        self.file.write(item['title'] + '\n')
        self.file.write(item['link'] + '\n')
        self.file.write(item['desc'] + '\n')
        self.file.write(item['tag'] + '\n\n')
        return item


# Define a spider
class CustomSpider(Spider):
    name = 'single_spider'
    allowed_domains = ['dashwood.net']
    start_urls = ['https://dashwood.net/']

    def parse(self, response):
        for sel in response.xpath('//article'):
            loader = CustomItemLoader(
                CustomItem(), selector=sel, response=response)
            loader.add_xpath('title', './/h2/a/text()')
            loader.add_xpath('link', './/a/@href')
            loader.add_xpath('desc', './/p/text()')
            loader.add_xpath('tag', './/a[@class="tag"]//@href')
            yield loader.load_item()


# Declare some settings / pipelines
settings = Settings({
    # pipelines start with the project/module name, so replace it with __main__
    'ITEM_PIPELINES': {
        '__main__.WriterPipeline': 100,
    },
    'DEFAULT_REQUEST_HEADERS': {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate, sdch',
        'accept-language': 'en-US,en;q=0.8',
        'upgrade-insecure-requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3'
    },
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
        'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    }
})

process = CrawlerProcess(settings)

# you can run 30 of these at once if you want, e.g.
# process.crawl(CustomSpider)
# process.crawl(CustomSpider) etc.. * 30
process.crawl(CustomSpider)
process.start()

Managing URLs with Python

Dec 23, 2018
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.request.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)


print(urllib.parse.quote_plus("+kite+/"))

%2Bkite%2B%2F


Note that it's different in Python 2: urllib.quote_plus() and urllib.urlencode().

Lexsum in Action

Nov 9, 2018

I finally got around to working on my Amazon project again. 

Misc Notes

# Change postgres data directory

File path:
/etc/postgresql/10/main/postgresql.conf

File System Headache

I decided to clean up my hard drives, but I forgot how much of a headache it was trying to get an NTFS drive to work with transmission-daemon. Whatever, I'll just save to my ext4 partition for now and fix it later.

*Update

I bricked my OS install and had to go through a 3-hour nightmare trying to fix it. I eventually discovered that the culprit was a label for my old partition's mount point in the fstab file. Solution:

sudo nano /etc/fstab

# comment out old label

ctrl + o to save
ctrl + x to exit

reboot

My computer still doesn't restart properly because I broke something in the boot order trying to fix it. Not a big deal; I just enter my username/password in the terminal and then type startx.

LexSum Progress

Had to slice to 50 for each rating to save time, but I can probably make it longer for launch. At first I was thinking there would be 60 million entities to process, but actually it's more like 900k x 5 (one per rating), and as long as I don't lexsum 1000+ reviews per rating it should finish in a few days. I reallllly need to add a timer function asap. I can just time 1000 or so products, multiply that by 900k (or whatever the total number of products in my database is), and I should have a pretty good idea how long it will take.

if len(titles) > 50:
    titlejoin = ' '.join(lex_sum(' '.join(titles[:50]), sum_count))
    textjoin = ' '.join(lex_sum(' '.join(comments[:50]), sum_count))
else:
    titlejoin = ' '.join(lex_sum(' '.join(titles), sum_count))
    textjoin = ' '.join(lex_sum(' '.join(comments), sum_count))

I'm thinking I can clean these lines up now that I'm staring at it. Maybe something like:

titlejoin = ' '.join(
    lex_sum(' '.join(titles[:min(len(titles), 50)]), sum_count))
textjoin = ' '.join(
    lex_sum(' '.join(comments[:min(len(comments), 50)]), sum_count))
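
Actually, slicing already stops at the end of the list on its own, so the min() (and the original if/else) probably isn't needed at all:

titles = ['only', 'two']
print(titles[:50])  # ['only', 'two'] -- slicing past the end just returns what's there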

My estimated-time-remaining function appends the time elapsed every ten iterations to a list, averages the last 500 (or fewer) entries of that list, then multiplies that average by the total remaining iterations and displays the result in a human-readable format:

import functools
import time

avg_sec = 0
times = []
start = time.time()

# (the rest runs inside the main product loop, where count and limit are tracked)

# Display time remaining
if avg_sec:
    seconds_left = ((limit - count) / 10) * avg_sec
    m, s = divmod(seconds_left, 60)
    h, m = divmod(m, 60)
    print('Estimated Time Left: {}h {}m {}s'.format(
        round(h), round(m), round(s)))

if(not count % 10):
    end = time.time()
    time_block = end - start
    start = end
    times.append(time_block)
    avg_sec = functools.reduce(
        lambda x, y: x + y, times[-min(len(times), 500):]) / len(times[-min(len(times), 500):])
    print('Average time per 10:', round(avg_sec, 2), 'seconds')

Another thought I had is that this save_df module I coded (it's at like 400 lines of code already x_x) is actually a crucial part of my ultimate code base. I'm pretty happy that I spent so much time writing it into proper functions.

Fixed Slow Database Queries - Indexing to the Rescue!

Nov 1, 2018

So I ran my summarizer yesterday and it took literally all day to run only 200 products through the lex sum function. So I went through my code and added a timer for each major step in the process like so:

start = time.time()
asin_list = get_asins(limit)
end = time.time()
print('Get ASINs: ', end - start)

Turns out it was taking over 60 seconds per query. I did the math, and at the rate it was going it would take almost two years to complete every product in my database. So I started looking around at different ways to organize large databases. Turns out databases are a lot more complicated than I believed; it felt like looking for a PHP solution back in high school when I didn't know enough to know what to look for. Finally I stumbled upon a feature called indexing. First I added the indexing code inside of my script, which had no effect, even though it seemed like it had worked properly. Still, I was not going to give up that easily, so I opened up Postgres directly in the terminal and poked around to see if the index had actually been applied. Turns out it was not applied at all. Here is the code I used to index the asin column in the reviews table:

# Remote connect
psql -U ryan -h 162.196.142.159 -p 5432 databasename
# Display table Indexes
SELECT * FROM pg_indexes WHERE tablename = 'reviews';

# Create Index
CREATE INDEX asin_index ON reviews (asin);
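
Doing the same thing from inside a script with psycopg2 would look roughly like this; a sketch using the connection details from the psql line above (and a missing commit() might well be why my earlier in-script attempt silently did nothing):

import psycopg2

# Same connection details as the psql command above (password handling omitted)
con = psycopg2.connect(dbname='databasename', user='ryan',
                       host='162.196.142.159', port=5432)
cur = con.cursor()

# IF NOT EXISTS keeps this safe to run more than once (Postgres 9.5+)
cur.execute("CREATE INDEX IF NOT EXISTS asin_index ON reviews (asin)")
con.commit()  # without a commit the index never actually gets created
con.close()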

Eureka! It worked: the script that took all day to run yesterday now ran in about a minute flat! That is the biggest performance difference I've ever experienced, and I can't wait to see where else indexing will help my databases.

Other than that, Erin showed me a bunch of stuff in Illustrator and Photoshop.

  • ctrl+click with select tool enables auto-select
  • ctrl+d — deselect
  • ctrl+shift+i — invert selection
  • ctrl+j — duplicate layer
  • ctrl+alt+j — duplicate and name layer