dashwood.net - Ryan Stefan's Micro Blog

Postgres Password Reset and Allowing Connections

Feb 10 2019

No password on the postgres user? The fix: temporarily set local auth to trust in pg_hba.conf, restart, and then set a password from psql.

pg_hba.conf

local  all   all   trust

# then in psql:
ALTER USER postgres with password 'newpassword';

Then you can add your user account as a superuser:

ALTER ROLE ryan with SUPERUSER;

# then restart the server
sudo /etc/init.d/postgresql restart

You'll probably need to change your pg_hba.conf file back to something like this:

local  all        all                                          trust
host   all        all    127.0.0.1      255.255.255.255        trust
host   booktown   all    192.168.1.3    255.255.255.255        ident map=sales
host   all        all    192.168.1.4    255.255.255.255        ident map=audit
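
Afterwards, a quick sanity check that both changes took (whether you actually get a password prompt depends on which auth method you leave in pg_hba.conf):

psql -U postgres
\du  # ryan should now show the Superuser attribute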

Postgres Refresh & Filling Tables Incrementally

Feb 03 2019

I got tired of trying to write to files because of the write limitations and switched everything over to Postgres. Now I remember how well it works, but also how many problems can arise. The cool thing about my recent script is that it solves a lot of issues in one go. Since I'll be completing rows in the DB over multiple increments, I have to check whether a row exists and then decide which part to update. In other words, I had to SELECT, UPDATE, and INSERT in a few different ways. Here's the code:

import re
import datetime
import json

import psycopg2

# database connection settings loaded from a local config file
with open('./data/database.json') as f:
    DATABASE = json.load(f)


class DBTest:
    def __init__(self, keyword, results):
        self.con = psycopg2.connect(**DATABASE)
        self.cur = self.con.cursor()
        
        self.mkeyword = keyword
        self.results = results
        self.pg_2 = 'https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3AXbox+One+Controller+Stand&page=2&keywords=Xbox+One+Controller+Stand'

    def updater(self, domain):
        if domain == 'www.ebay.com':
            self.cur.execute("UPDATE keyword_pages SET ebay_results='" + self.results + "' WHERE keyword='" + self.mkeyword + "'")
            self.con.commit()
        elif domain == 'www.etsy.com':
            self.cur.execute("UPDATE keyword_pages SET etsy_results='" + self.results + "' WHERE keyword='" + self.mkeyword + "'")
            self.con.commit()
        elif domain == 'www.amazon.com':
            self.cur.execute("UPDATE keyword_pages SET amazon_results='" + self.results +
                        "', amazon_pg2='" + self.pg_2 + "' WHERE keyword='" + self.mkeyword + "'")
            self.con.commit()

    def test(self):
        self.cur.execute("""SELECT * FROM keyword_pages WHERE NOT complete AND amazon_results
         != 'blank' AND ebay_results != 'blank' AND etsy_results != 'blank'""")
        rows = self.cur.fetchall()
        for row in rows:
            print(row[0])

        self.cur.execute("select exists(select keyword from keyword_pages where keyword='" + self.mkeyword + "')")
        exists = self.cur.fetchone()[0]

        if exists:
            self.updater('www.etsy.com')
        else:
            columns = "keyword, amazon_results, amazon_pg2, ebay_results, etsy_results, complete"
            values = "'pogo stick', 'blank', 'blank', '14', 'blank', 'f'"
            self.cur.execute('INSERT INTO keyword_pages (' + columns + ') VALUES (' + values + ')')
            self.con.commit()

        self.con.close()

class LinkGen:
    def __init__(self):
        self.link_pieces = []
        self.links = []
        self.keywords = {
            'extra black coffee': [['www.amazon.com', '4', '/jumprope/s?ie=UTF8&page=2&rh=i%3Aaps%2Ck%3Ajumprope'], ['www.ebay.com', '5'], ['www.etsy.com', '7']],
            'decaf coffee': [['www.amazon.com', '5', 'https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Ablack+coffee&page=2&keywords=black+coffee&ie=UTF8&qid=1549211788'],
            ['www.ebay.com', '3'], ['www.etsy.com', '9']],
        }

    # How Amazon identifies if a link is internal/new or external/old (very simple actually)
    def qid(self):
        return round((datetime.datetime.today() - datetime.datetime(1970, 1, 1)).total_seconds())


    def amazon_gen(self, search_term, page_total, page_2):
        self.link_pieces = ['https://www.amazon.com/s/ref=sr_pg_', '?rh=', '&page=', '&keywords=', '&ie=UTF8&qid=']
        rh = re.search('rh=([^&|$]*)', str(page_2), re.IGNORECASE).group(1)
        print(rh)
        all_links = []
        for page in range(1, int(page_total) + 1):
            all_links.append(
                f'{self.link_pieces[0]}{page}{self.link_pieces[1]}{rh}{self.link_pieces[2]}{page}{self.link_pieces[3]}{"+".join(search_term.split(" "))}{self.link_pieces[4]}')
        return all_links


    def link_gen(self, domain, search_term, page_total):
        if domain == 'www.ebay.com':
            self.link_pieces = ['https://www.ebay.com/sch/i.html?_nkw=', '&rt=nc&LH_BIN=1&_pgn=']
        elif domain == 'www.etsy.com':
            self.link_pieces = ['https://www.etsy.com/search?q=', '&page=']
        all_links = []
        for page in range(1, int(page_total) + 1):
            all_links.append(f'{self.link_pieces[0]}{"+".join(search_term.split(" "))}{self.link_pieces[1]}{page}')

        return all_links


    def test(self):
        for keyword in self.keywords.keys():
            for results in self.keywords[keyword]:
                if results[0] == 'www.amazon.com':
                    self.links.append(self.amazon_gen(keyword, results[1], results[2]))
                else:
                    self.links.append(self.link_gen(results[0], keyword, results[1]))
            print(self.links)



if __name__ == "__main__":
    links = LinkGen()
    db = DBTest('pogo stick', '15')
    db.test()
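
Looking back at test(), the exists check plus the separate UPDATE/INSERT branches could probably be collapsed into a single parameterized upsert, since keyword is the primary key and Postgres 9.5+ supports INSERT ... ON CONFLICT. A rough sketch (the helper name is mine, it only handles one results column at a time, and passing parameters also avoids building SQL by string concatenation):

def upsert_results(cur, con, keyword, column, results):
    """Insert the keyword row if it's new, otherwise update a single results column."""
    # column gets interpolated into the SQL, so whitelist it first;
    # identifiers can't be passed as query parameters
    assert column in ('amazon_results', 'ebay_results', 'etsy_results')
    cur.execute(
        f"""
        INSERT INTO keyword_pages (keyword, {column}, complete)
        VALUES (%s, %s, FALSE)
        ON CONFLICT (keyword) DO UPDATE SET {column} = EXCLUDED.{column}
        """,
        (keyword, results),
    )
    con.commit()

Something like upsert_results(self.cur, self.con, self.mkeyword, 'etsy_results', self.results) would then cover both branches of test() for Etsy.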

Since I had to dig through many other projects' code bases to figure a lot of this out (not to mention Google), I figured I should put what I collected here so I can find it later.

-- psql
****
sudo apt install postgresql
sudo service postgresql start
sudo su - postgres
createuser --superuser ryan
psql # <- command line tool for making queries
\password ryan
\q # <- exit psql to create new users/dbs or import/export db's (psql is for sql)
createdb ryan # or whatever
# exit, and now you can run psql in your own console with your username
***
# start automatically
sudo systemctl enable postgresql
# do database commands
psql -d <database>

alter user ryan with encrypted password '<password>';

sudo -i -u ryan

# export
pg_dump -U ryan ebay_keywords > database-dec-18.txt --data-only

# importable export
pg_dump -U ryan ebay_keywords > database-dec-18.pgsql

# import
psql reviewmill_scraped < database-dec-18.pgsql

CREATE TABLE keyword_pages (
    keyword VARCHAR(255) NOT NULL PRIMARY KEY,
    amazon_results VARCHAR(16),
    amazon_pg2 VARCHAR(255),
    ebay_results VARCHAR(16),
    etsy_results VARCHAR(16),
    complete BOOLEAN NOT NULL
);

ALTER TABLE keyword_pages ALTER COLUMN etsy_results TYPE VARCHAR(16);

INSERT INTO keyword_pages (keyword, amazon_results, amazon_pg2, ebay_results, etsy_results, complete)
VALUES ('extra strong coffee', 12, 'https://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Ablack+coffee&page=2&keywords=black+coffee&ie=UTF8&qid=1549211788', 12, 4, 'f');

CREATE TABLE reviews (
    review_id VARCHAR(30) PRIMARY KEY,
    asin VARCHAR(20) NOT NULL
);

ALTER TABLE reviews ADD CONSTRAINT asin FOREIGN KEY (asin) REFERENCES products (asin) MATCH FULL;

# Extra stuff
ALTER TABLE reviews ALTER COLUMN asin TYPE varchar(30);
ALTER TABLE reviews ADD COLUMN review_helpful INTEGER;

Windows has been making this increasingly difficult, but I think I've avoided the worst of my connection issues. First Windows decided to automatically detect my proxies, then I found out that my ethernet card driver had some power-saving crap turned on, and I've been having random permission issues between WSL, pipenv, and Postgres.

ipconfig /release
ipconfig /renew

I haven't tried this yet, but if my internet starts acting up again it's going to be the first thing I try; considering that restarting Windows seems to fix it, I think this should as well.

Lexsum in Action

Nov 09 2018

I finally got around to working on my Amazon project again. 

Misc Notes

# Change postgres data directory

File path:
/etc/postgresql/10/main/postgresql.conf
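
The actual setting to change in that file is data_directory. After stopping the server and copying the existing cluster to the new location, it's something like this (the path below is just a placeholder, not my real one):

# /etc/postgresql/10/main/postgresql.conf
data_directory = '/mnt/bigdisk/postgresql/10/main'  # placeholder path

# restart so the change takes effect
sudo systemctl restart postgresql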

File System Headache

I decided to clean up my hard drives, but I forgot how much of a headache it is trying to get an NTFS drive to work with transmission-daemon. Whatever, I'll just save to my ext4 partition for now and fix it later.

*Update

I bricked my OS install and went down a three-hour nightmare trying to fix it. I eventually discovered that the culprit was a label for my old partition's mount point in the fstab file. Solution:

sudo nano /etc/fstab

# comment out old label

ctrl + o to save
ctrl + x to exit

reboot
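
For reference, the kind of fstab entry that was breaking boot looked something like this (the label and mount point here are made up):

# LABEL=old-storage  /mnt/old-storage  ntfs  defaults  0  0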

My computer still doesn't restart properly because I broke something in the boot order while trying to fix it. Not a big deal; I just enter my username/password in the terminal and then type startx.

LexSum Progress

Had to slice to 50 reviews for each rating to save time, but I can probably make it longer for launch. At first I thought there would be 60 million entities to process, but it's actually more like 900k x 5 (one per rating), and as long as I don't lexsum 1000+ reviews per rating it should finish in a few days. I really need to add a timer function asap. I can just time 1000 or so products, multiply that by 900k (or whatever the total number of products in my database is), and I should have a pretty good idea how long it will take.

if len(titles) > 50:
    titlejoin = ' '.join(lex_sum(' '.join(titles[:50]), sum_count))
    textjoin = ' '.join(lex_sum(' '.join(comments[:50]), sum_count))
else:
    titlejoin = ' '.join(lex_sum(' '.join(titles), sum_count))
    textjoin = ' '.join(lex_sum(' '.join(comments), sum_count))

I'm thinking I can clean these lines up now that I'm staring at it. Maybe something like:

titlejoin = ' '.join(
    lex_sum(' '.join(titles[:min(len(titles), 50)]), sum_count))
textjoin = ' '.join(
    lex_sum(' '.join(comments[:min(len(comments), 50)]), sum_count))
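
Actually, since Python slices already clamp to the length of the sequence, the min() (and the original if/else) isn't needed at all:

titlejoin = ' '.join(lex_sum(' '.join(titles[:50]), sum_count))
textjoin = ' '.join(lex_sum(' '.join(comments[:50]), sum_count))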

My estimated-time-remaining function adds the time elapsed every ten iterations to a list, takes the last 500 (or fewer) entries of that list and averages them, then multiplies that average by the remaining iterations and displays the result in a human-readable format:

import time
import functools

avg_sec = 0
times = []
start = time.time()

# inside the main product loop (count = current iteration, limit = total iterations):

# Display time remaining
if avg_sec:
    seconds_left = ((limit - count) / 10) * avg_sec
    m, s = divmod(seconds_left, 60)
    h, m = divmod(m, 60)
    print('Estimated Time Left: {}h {}m {}s'.format(
        round(h), round(m), round(s)))

if not count % 10:
    end = time.time()
    time_block = end - start
    start = end
    times.append(time_block)
    # average over the last 500 samples (or fewer, early on)
    last_chunk = times[-min(len(times), 500):]
    avg_sec = functools.reduce(lambda x, y: x + y, last_chunk) / len(last_chunk)
    print('Average time per 10:', round(avg_sec, 2), 'seconds')
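
Related cleanup note: the reduce() is really just computing a mean, and slicing with a negative start clamps too, so the averaging line could shrink to something like:

from statistics import mean

avg_sec = mean(times[-500:])  # times[-500:] is the whole list while len(times) < 500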

Another thought I had is that this save_df module I coded (it's at like 400 lines of code already x_x) is actually a crucial part of my ultimate code base. I'm pretty happy that I spent so much time writing it into proper functions.

Fixed Slow Database Queries - Indexing to the Rescue!

Nov 01 2018

Yesterday I ran my summarizer and it took literally all day to run only 200 products through the lex sum function, so I went through my code and added a timer for each major step in the process, like so:

start = time.time()
asin_list = get_asins(limit)
end = time.time()
print('Get ASINs: ', end - start)

Turns out it was taking over 60 seconds per query. I did the math, and at the rate it was going it would take almost two years to complete every product in my database. So I started looking around at different ways to organize large databases. Turns out databases are a lot more complicated than I believed; it felt like looking for a PHP solution back in high school, when I didn't know enough to know what to look for. Finally I stumbled upon a feature called indexing. First I added the indexing code inside of my script, which had no effect, even though it seemed like it had worked properly. Still, I was not going to give up that easily, so I opened up Postgres directly in the terminal and poked around to see if the index had been applied properly. Turns out it was not applied at all. Here is the code I used to index the asin column in the reviews table:

# Remote connect
psql -U ryan -h 162.196.142.159 -p 5432 databasename

# Display table indexes
SELECT * FROM pg_indexes WHERE tablename = 'reviews';

# Create index
CREATE INDEX asin_index ON reviews (asin);
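
A quick way to double-check that Postgres is actually using the new index is to look at the query plan and make sure it shows an Index Scan instead of a Seq Scan (the ASIN value below is just a made-up example):

# Verify the index is used
EXPLAIN ANALYZE SELECT * FROM reviews WHERE asin = 'B00EXAMPLE1';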

Eureka! It worked: the script that took all day to run yesterday now finished in about a minute flat! That is the biggest performance difference I've ever experienced, and I can't wait to see where else indexing will help my databases.

Other than that, Erin showed me a bunch of stuff in Illustrator and Photoshop.

  • ctrl+click with select tool enables auto-select
  • ctrl+d — deselect
  • ctrl+shift+i — invert selection
  • ctrl+j — duplicate layer
  • ctrl+alt+j — duplicate and name layer