agnos.is Forums

Searching through massive files

    [email protected]
    wrote last edited by
    #1

    I have acquired several very large files. Specifically, CSVs of 100+ GB.

    I want to search for text in these files faster than manually running grep.

    To do this, I need to index the files, right? Would something like Aleph be good for this? It seems like the right tool...

    https://github.com/alephdata/aleph

    Any other tools for doing this?

      [email protected]
      wrote last edited by
      #2

      No idea about Aleph. I've used Solr for that (solr.apache.org), thus my username, but maybe that is considered old school by now.

        [email protected]
        wrote last edited by [email protected]
        #3

        Are you looking for specific values in some field in this table, or substrings in that field?

        If specific values, I'd probably import the CSV file into a database with a column indexed on the value you care about.
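As a rough illustration of the import-and-index idea, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for whichever database you pick. The table name `records`, the helper names, and the sample columns are all hypothetical, and every column is stored as TEXT for simplicity.

```python
import csv
import io
import sqlite3


def load_csv(conn, rows, table="records"):
    """Stream rows from a csv.reader into a new SQLite table (all TEXT columns)."""
    header = next(rows)
    cols = ", ".join(f'"{name}" TEXT' for name in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    # executemany pulls rows from the reader one at a time,
    # so the whole file is never held in memory.
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rows)
    conn.commit()


def add_index(conn, column, table="records"):
    """Index the one column you expect to look values up in."""
    conn.execute(f'CREATE INDEX "idx_{table}_{column}" ON "{table}" ("{column}")')
    conn.commit()


def lookup(conn, column, value, table="records"):
    """Exact-match lookup; this is the query the index speeds up."""
    query = f'SELECT * FROM "{table}" WHERE "{column}" = ?'
    return conn.execute(query, (value,)).fetchall()


# Tiny in-memory stand-in for a 100+ GB file (made-up sample data).
sample = io.StringIO("id,email,city\n1,a@example.com,Oslo\n2,b@example.com,Lima\n")
conn = sqlite3.connect(":memory:")
load_csv(conn, csv.reader(sample))
add_index(conn, "email")
rows = lookup(conn, "email", "b@example.com")
```

For a real file you would pass `csv.reader(open(path, newline=""))` and connect to an on-disk database instead of `:memory:`. Creating the index after the bulk insert, as done here, is usually cheaper than maintaining it during the load. Note that an ordinary index like this helps with exact values and prefixes, not arbitrary substrings.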

          [email protected]
          wrote last edited by [email protected]
          #4

          Really depends on what data it is and whether you want to search it regularly or just as a one-time thing.

          You could load them into an RDBMS (MySQL/Postgres) and have it handle the indexing, or use Python tools to process the files. Something like Elasticsearch could work too.

          If it's just a one-time thing, grep is probably fine though.

          Aleph could work as well, but I have no experience with it.

          I guess it depends on how much time you want to invest in setting something up versus how much time you'd lose waiting for grep to finish. If you only need to search a certain column, you can create an index with just that column using awk, search that index file, then extract the full line from the source file based on that result. At that point, though, you're basically creating a new database engine.
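The single-column index idea can be sketched in Python instead of awk. This is only an illustration under some assumptions: the helper names and sample data are made up, no quoted field contains an embedded newline, and the index is kept in a dict, which for a 100+ GB file you would more likely write out to a file on disk.

```python
import csv
import os
import tempfile
from collections import defaultdict


def build_offset_index(path, column):
    """Map each value of one column to the byte offsets of the rows holding it."""
    index = defaultdict(list)
    with open(path, "rb") as f:
        header = next(csv.reader([f.readline().decode("utf-8")]))
        col = header.index(column)
        while True:
            offset = f.tell()  # byte position where this row starts
            line = f.readline()
            if not line:
                break
            fields = next(csv.reader([line.decode("utf-8")]))
            if len(fields) > col:  # skip blank or short rows
                index[fields[col]].append(offset)
    return index


def fetch_rows(path, offsets):
    """Jump straight to each recorded offset and read the full source line."""
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            yield f.readline().decode("utf-8").rstrip("\r\n")


# Small temporary file standing in for the real CSV.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("id,email,city\n1,a@example.com,Oslo\n2,b@example.com,Lima\n")
    demo_path = tmp.name

index = build_offset_index(demo_path, "email")
matches = list(fetch_rows(demo_path, index.get("b@example.com", [])))
os.remove(demo_path)
```

Building the index still reads the whole file once, so it only pays off if you search more than once. After that, each lookup is a dictionary access plus a seek rather than a full scan.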

            [email protected]
            wrote last edited by
            #5

            Elasticsearch should work too

              [email protected]
              wrote last edited by
              #6

              Done this with massive log files. Used Perl and regex. That's basically what the language was built for.

              But with CSVs? I'd throw them in a DB with an index.

                [email protected]
                wrote last edited by
                #7

                Many (most?) databases these days support some sort of full text search.
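As one example of built-in full text search, here is a minimal sketch using SQLite's FTS5 extension through Python's sqlite3 module. It assumes the SQLite library behind your Python was compiled with FTS5, which is common but not guaranteed. The table name `docs` and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An FTS5 virtual table builds an inverted index over the text it stores.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(line)")

# Made-up sample rows; in practice these would be streamed from the CSV.
conn.executemany(
    "INSERT INTO docs (line) VALUES (?)",
    [
        ("1,alice,disk error on node7",),
        ("2,bob,connection timeout to db",),
        ("3,carol,login ok",),
    ],
)

# MATCH consults the index instead of scanning every row.
hits = conn.execute(
    "SELECT rowid, line FROM docs WHERE docs MATCH ? ORDER BY rowid",
    ("timeout",),
).fetchall()
```

With the default tokenizer, this kind of search matches whole words (tokens), so it answers "which rows mention timeout" quickly but is not a drop-in replacement for grepping arbitrary substrings.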

                  [email protected]
                  wrote last edited by
                  #8

                  I've used Java Scanner objects to do this extremely efficiently with minimal memory required, even with multiple parallel searches. Indexing is only necessary if you want to search for information many times and don't know exactly what the search will be. For one-time searches, it's not going to be useful. Grep, honestly, is going to be faster and more efficient for most one-time searches.

                  The initial indexing or searching of the files will be bottlenecked by the speed of the disk the files are on, no matter what you do. Indexing only helps because it lets you move future searches to faster memory.

                  So it greatly depends on what you need to search for and how often. The tradeoff is memory usage, and it only pays off for repeated searches of the data you chose to index from the files in the first pass.
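The low-memory streaming approach can be sketched like this. It is written in Python rather than with Java Scanner objects, so it illustrates the same idea, not the poster's actual code. The `scan` helper and the sample data are hypothetical, and "multiple searches" here means several patterns checked in a single pass over the file, not separate threads.

```python
import io
import re


def scan(lines, patterns):
    """Yield (line_number, pattern, line) for every match, one line at a time."""
    compiled = [(p, re.compile(p)) for p in patterns]
    # Iterating a file object reads one line at a time, so memory use
    # stays flat no matter how large the file is.
    for number, line in enumerate(lines, start=1):
        for source, regex in compiled:
            if regex.search(line):
                yield number, source, line.rstrip("\n")


# Made-up sample standing in for a huge CSV opened with open(path).
sample = io.StringIO("id,name\n1,alice\n2,bob\n3,alicia\n")
found = list(scan(sample, [r"ali", r"^2,"]))
```

Because every pattern is tested while the line is already in memory, adding more searches costs CPU time but no extra disk reads, which matters when the disk is the bottleneck.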
