dashwood.net -

Ryan Stefan's Micro Blog

Lexsum in Action

Nov 092018

I finally got around to working on my Amazon project again. 

Misc Notes

# Change postgres data directory

File path:

File System Headache

I decided to clean up my hard drives, but I forgot how much of a headache it was trying to get an NTFS drive to work with transmission-daemon. Whatever I'll just save to my EX4 partition for now and fix it later. 


I bricked my OS install and had to go down a 3 hour nightmare trying to fix it. I eventually discovered that it was a label from my old partition mount point in the fstab file. Solution:

sudo nano /etc/fstab

# comment out old label

ctrl + o to save
ctrl + x to exit


My computer still doesn't restart properly because I broke something in the boot order trying to fix it. Not a big deal I just enter my username/password in the terminal then type startx.

LexSum Progress

Had to slice to 50 for each rating to save time, but I can probably make it longer for launch. At first I was thinking there would be 60 million entities to process, but actually its more like 900k x 5 (for each rating) and as long as I don't lexsum 1000+ reviews for ratings it should finish in a few days. I reallllly need to add a timer function asap. I can just time 1000 or so products and multiply that by 900k or whatever the total number of products in my database is and I should have a pretty good idea how long it will take.

if len(titles) > 50:
    titlejoin = ' '.join(lex_sum(' '.join(titles[:50]), sum_count))
    textjoin = ' '.join(lex_sum(' '.join(comments[:50]), sum_count))
    titlejoin = ' '.join(lex_sum(' '.join(titles), sum_count))
    textjoin = ' '.join(lex_sum(' '.join(comments), sum_count))

I'm thinking I can clean these lines up now that I'm staring at it. Maybe something like:

titlejoin = ' '.join(
    lex_sum(' '.join(titles[:min(len(titles), 50)]), sum_count))
textjoin = ' '.join(
    lex_sum(' '.join(comments[:min(len(titles), 50)]), sum_count))

My estimated time remaining function adds time elapsed ever ten iterations to a list, takes the last 500 or less of that list and averages them, and finally multiplies that average by the total remaining iterations and displays it in a human readable format:

avg_sec = 0
times = []
start = time.time()

# Display time remaining
if avg_sec:
    seconds_left = ((limit - count) / 10) * avg_sec
    m, s = divmod(seconds_left, 60)
    h, m = divmod(m, 60)
    print('Estimated Time Left: {}h {}m {}s'.format(
        round(h), round(m), round(s)))

if(not count % 10):
    end = time.time()
    time_block = end - start
    start = end
    avg_sec = functools.reduce(
        lambda x, y: x + y, times[-min(len(times), 500):]) / len(times[-min(len(times), 500):])
    print('Average time per 10:', round(avg_sec, 2), 'seconds')

Another thought I had is that this save_df module I coded (it's at like 400 lines of code already x_x) is actually a crucial part of my ultimate code base. I'm pretty happy that I spent so much time writing it into proper functions.

Fixed Slow Database Queries - Indexing to the Rescue!

Nov 012018

So I ran my summarizer yesterday and it took literally all day to run only 200 products through the lex sum function. So I went through my code and added a timer for each major step in the process like so:

start = time.time()
asin_list = get_asins(limit)
end = time.time()
print('Get ASINs: ', end - start)

 Turns out it was taking over 60 seconds per query . I did the math and at the rate it was going, it would take almost two years to complete every product in my database. So I started looking around at different ways to group large databases. Turns out databases are a lot more complicated than I believed. It felt like looking for a PHP solution back in high school when I didn't know enough to know what to look for. Finally I stumbled upon a feature called Indexing. First I added the indexing code inside of my script, which had no effect, but it seemed like it had worked properly. Still though I was not going to give up that easy and I decided to open up postgres directly in the terminal and poke around to see if the indexing was applied properly. Turns out that it was not applied at all. Here is the code I used to index the asin table in reviews:

# Remote Connect 
postgres psql -U ryan -h -p 5432 databasename
# Display table Indexes
SELECT * FROM pg_indexes WHERE tablename = 'reviews';

# Create Index
CREATE INDEX asin_index ON reviews (asin);

Ureka! It worked, now the script that took all day to run yesterday ran in about a minute flat! That is the biggest difference in performance time I've ever experienced and I cant wait to see where else indexing will help my databases.

Other than that, Erin showed me a bunch of stuff in illustrator and Phototshop.

  • ctrl+click with select tool enables auto-select
  • ctrl+d — deselect
  • ctrl+shift+i — invert selection
  • ctrl+j — duplicate layer
  • ctrl+alt+j — duplicate and name layer