Show HN: Small ML-assisted CLI to look up packages in common repositories

  • I made this little tool for myself for when I'm reading and working with code repos and want to learn what a dependency does. Typically I would copy the package name and then google it or go to the relevant package repository to look it up there.

    I wanted to simplify this process by making a common interface to look up packages from many common package repositories without having to care too much about the format of the input or remember CLI parameters.

    The basic idea is that you copy-paste a snippet of text from the console or from GitHub in the browser, or pipe it to stdin, into "Alpakr", and out comes a summary plus a link to the package in the package repository.

        $ echo '
          "flow-bin": "^0.123.0",
          "husky": "^3.1.0",' | alpakr
    
    
    outputs:

        flow-bin -> Binary wrapper for Flow - A static type checker for JavaScript
        https://npmjs.com/package/flow-bin
    
        husky -> Modern native Git hooks made easy
        https://npmjs.com/package/husky
    
    
    The ML part determines what kind of text was pasted/piped: pip, composer, npm, RubyGems, or Cargo (toml + compile stream). In the above example it would output 'npm'.
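
    Just to illustrate the idea (this is a made-up sketch, not the actual alpakr model or its training data), a classifier like that can be wired up with scikit-learn in a few lines; character n-grams fit the "flow of symbols" idea described further down:

        # Hypothetical sketch; class labels and training snippets are invented.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import SGDClassifier
        from sklearn.pipeline import make_pipeline

        snippets = [
            '"husky": "^3.1.0",',           # npm (package.json)
            'requests==2.28.0',             # pip (requirements.txt)
            'regex = "1.0.0"',              # cargo (Cargo.toml)
            '"monolog/monolog": "^2.0",',   # composer (composer.json)
            "gem 'rails', '~> 7.0'",        # rubygems (Gemfile)
        ]
        labels = ['npm', 'pip', 'cargo', 'composer', 'gem']

        # Character n-grams capture punctuation patterns, not just words.
        model = make_pipeline(
            TfidfVectorizer(analyzer='char', ngram_range=(1, 3)),
            SGDClassifier(loss='log_loss'),
        )
        model.fit(snippets, labels)
        model.predict(['"flow-bin": "^0.123.0",'])  # ideally -> ['npm']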

    Once the packager is determined it is (mostly) trivial to extract package names and do the actual lookup in the packager's repository, e.g. 'cargo' packages are looked up on crates.io, 'composer' on packagist.org, 'npm' on npmjs.com, and so on.
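
    As a rough illustration (again not the actual alpakr code; these are the registries' public JSON endpoints, which may differ from what alpakr calls), the mapping can be a simple lookup table:

        import json
        import urllib.request

        # Hypothetical packager -> registry API mapping.
        REGISTRY_API = {
            'npm':      'https://registry.npmjs.org/{name}',
            'cargo':    'https://crates.io/api/v1/crates/{name}',
            'pip':      'https://pypi.org/pypi/{name}/json',
            'composer': 'https://repo.packagist.org/p2/{name}.json',  # name is vendor/package
            'gem':      'https://rubygems.org/api/v1/gems/{name}.json',
        }

        def lookup(packager: str, name: str) -> dict:
            """Fetch raw package metadata from the registry for the detected packager."""
            url = REGISTRY_API[packager].format(name=name)
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)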

    All package details are transformed to a common format and stored/cached in DynamoDB for future lookups.
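
    A sketch of that cache layer with boto3, continuing the lookup() example above (table name and key layout are invented for illustration):

        import boto3

        table = boto3.resource('dynamodb').Table('alpakr-packages')  # hypothetical table

        def cached_lookup(packager: str, name: str) -> dict:
            """Cache-aside: return the stored item, or fetch, store, and return it."""
            key = {'pk': f'{packager}#{name}'}
            cached = table.get_item(Key=key).get('Item')
            if cached:
                return cached
            raw = lookup(packager, name)
            # Transforming each registry's response to the common format is elided.
            item = {**key, 'raw': json.dumps(raw)}
            table.put_item(Item=item)
            return item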

    It has been a fun project to work on, covering subjects I wanted to learn and improve on: ML with scikit-learn, building images to run in AWS Lambda, running Rust in AWS Lambda (built on a Mac M1), warming Lambdas, multi-region serverless, and more.

    For me, working out the ML preprocessing function was a fun task. If you think about it, it is more the flow of symbols than the actual names/words that matters when you are only looking at a snippet of the full document. Also, not all documents are structured: e.g. I wanted it to support the stream of "Compiling regex v1.0.0" lines you see when building a Rust binary, so alpakr supports copying text of that format as well.
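
    To give a feel for that (my illustration here, not the actual preprocessing alpakr uses): collapse numbers and names into placeholder tokens so only the shape of the symbols remains, and both structured files and compile streams reduce to short symbol patterns:

        import re

        def symbol_flow(snippet: str) -> str:
            """Collapse numbers and names so only the 'shape' of the symbols remains."""
            s = re.sub(r'\d+(\.\d+)*', '0', snippet)            # versions/numbers -> 0
            return re.sub(r'[A-Za-z_][A-Za-z0-9_-]*', 'w', s)   # names -> w

        symbol_flow('"flow-bin": "^0.123.0",')   # -> '"w": "^0",'
        symbol_flow('Compiling regex v1.0.0')    # -> 'w w w'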

    What was not so fun was retrieving the training data; I would love to hear of a better approach/source. I searched for package files on GitHub and used the API to fetch the files, which took a long time.
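
    The fetching part looked something like this (a simplified illustration, not my actual script; the search step, auth, rate limits, and pagination are omitted, and the repo/path are just examples):

        import base64
        import json
        import urllib.error
        import urllib.request

        def fetch_manifest(repo: str, path: str = 'package.json') -> str | None:
            """Fetch one manifest file via GitHub's contents API (unauthenticated)."""
            url = f'https://api.github.com/repos/{repo}/contents/{path}'
            try:
                with urllib.request.urlopen(url) as resp:
                    data = json.load(resp)
                return base64.b64decode(data['content']).decode('utf-8')
            except urllib.error.HTTPError:
                return None  # repo has no such file (or we hit the rate limit)

        fetch_manifest('facebook/react')  # one training sample for the 'npm' class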

    I also enjoyed trying to make a small, efficient model. I ended up with a 150 kB model that had 98.5% accuracy against test data. Ironically, the framework dependencies needed to run it are just short of 300 MB. But I'm guessing a small model also loads faster, so I decided to keep it small.

    In addition to installing the tool with npm, you can install it with `cargo install alpakr-cli`.

    See either the npm package or the crate for examples of using the CLI.

    I was thinking about adding a `--hint` flag to override the ML part, but I kind of like how simple it is without it. What do you think?

    PS: I've also made a mobile app so you can read code on your phone, then flip to the app, which copies the code from the clipboard (opt-out possible) and looks it up automatically.